Storing large files in Hadoop HDFS?
I need to store a large file of about 10 TB on HDFS. What I need to understand is how HDFS will store such a file. Let's say the replication factor for the cluster is 3, and I have a 10-node cluster with more than 10 TB of disk space on each node, i.e. the total capacity of the cluster is more than 100 TB.
Does HDFS simply pick three nodes at random and store the whole file on each of them? Is it really as simple as that? Please confirm.
Or does HDFS split the file, say into ten 1 TB splits, and then store each split on three randomly selected nodes? If splitting is what happens, is there a configuration setting that controls it?
And if HDFS does split binary or text files, how does it split them? Simply by byte offset?
Solution
Yes, HDFS splits the file into blocks; the default block size is 128 MB, configurable via the dfs.blocksize property. The split is purely by byte offset, so it makes no difference whether the file is text or binary. Each block is replicated to 3 nodes chosen by the NameNode's block placement policy (effectively random on a single-rack cluster). A 10 TB file therefore becomes roughly 82,000 blocks, and with replication factor 3 you end up with about 30 TB of raw data spread fairly evenly across your 10 nodes.
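If you want to verify this yourself, here is a minimal sketch using the standard org.apache.hadoop.fs Java API. It prints the block size, the replication factor, and the DataNodes holding each block's replicas; the path /data/bigfile.bin is just a placeholder for your own file.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockReport {
        public static void main(String[] args) throws Exception {
            // Picks up core-site.xml / hdfs-site.xml from the classpath.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Path file = new Path(args.length > 0 ? args[0] : "/data/bigfile.bin");

            FileStatus status = fs.getFileStatus(file);
            System.out.println("File length : " + status.getLen() + " bytes");
            System.out.println("Block size  : " + status.getBlockSize() + " bytes");
            System.out.println("Replication : " + status.getReplication());

            // One BlockLocation per block: its byte range plus the DataNodes
            // holding its replicas.
            BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation b : blocks) {
                System.out.printf("offset=%d length=%d hosts=%s%n",
                        b.getOffset(), b.getLength(), String.join(",", b.getHosts()));
            }
            fs.close();
        }
    }

From the command line, hdfs fsck /data/bigfile.bin -files -blocks -locations gives you roughly the same block and placement information without writing any code.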