HDFS was based on a paper Google published about their Google File System.
It runs on top of the existing file systems on each node in a Hadoop cluster.
It is a Java-based file system.
It provides scalable and reliable data storage.
It is highly fault-tolerant and is designed to be deployed on low-cost hardware.
It provides high throughput access to application data.
It is suitable for applications that have large data sets.
It was designed to span large clusters of commodity servers.
Hadoop Data in Blocks
Hadoop uses blocks to store a file or parts of a file.
Blocks are large. The default was 64 megabytes in Hadoop 1.x and is 128 megabytes in Hadoop 2.x and later, and many clusters run with block sizes of 128 megabytes or larger.
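The block size and default replication factor are set in hdfs-site.xml. The fragment below is a minimal sketch using the Hadoop 2.x+ property names (dfs.blocksize, dfs.replication); the values shown are illustrative.

```xml
<!-- hdfs-site.xml: block size and default replication factor -->
<configuration>
  <property>
    <name>dfs.blocksize</name>
    <value>134217728</value> <!-- 128 MB, in bytes -->
  </property>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>
```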
A Hadoop block is a file on the underlying filesystem. Since the underlying filesystem stores files as blocks, one Hadoop block may consist of many blocks in the underlying file system.
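The splitting described above can be sketched as follows. This is an illustration of the arithmetic, not Hadoop code; the 128 MB block size is the Hadoop 2.x default.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the Hadoop 2.x default


def split_into_blocks(file_size: int, block_size: int = BLOCK_SIZE) -> list[int]:
    """Return the size of each block a file of `file_size` bytes occupies.

    The last block only takes as much space as it needs: a 1 MB file
    stored with a 128 MB block size uses 1 MB, not 128 MB.
    """
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        sizes.append(remainder)
    return sizes


# A 300 MB file becomes two full 128 MB blocks plus one 44 MB block.
sizes = split_into_blocks(300 * 1024 * 1024)
```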
Advantages of Blocks
They are fixed in size, which makes it easy to calculate how many fit on a disk.
A large file can be stored easily: it is split into blocks, which are spread over multiple nodes.
HDFS blocks also don’t waste space: a file smaller than the block size occupies only as much space on the underlying filesystem as it actually needs.
Blocks fit well with replication, which allows HDFS to be fault tolerant and available on commodity hardware.
Each block is replicated to multiple nodes. For example, block 1 is stored on node 1, node 2 and node 4.
This allows for node failure without data loss. If node 1 crashes, node 2 and node 4 still run and have block 1’s data.
The replication factor can be set in Hadoop’s configuration, or even per individual file.
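The fault-tolerance argument above can be made concrete with a toy sketch. Node and block names are illustrative, mirroring the example of block 1 stored on nodes 1, 2 and 4; this models the idea, not Hadoop's actual placement logic.

```python
# Which nodes hold a replica of each block (toy placement table).
placements = {
    "block1": {"node1", "node2", "node4"},
    "block2": {"node2", "node3", "node4"},
}


def readable_blocks(placements: dict[str, set[str]], live_nodes: set[str]) -> set[str]:
    """A block stays readable as long as at least one replica sits on a live node."""
    return {block for block, nodes in placements.items() if nodes & live_nodes}


# If node1 crashes, block1 is still served by node2 and node4.
live = {"node2", "node3", "node4"}
still_readable = readable_blocks(placements, live)
```

In a real cluster the default replication factor lives in hdfs-site.xml (dfs.replication) and can be changed for an individual file with the `hdfs dfs -setrep` shell command.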
HDFS Main Nodes – The Name Node and The Data Nodes
Other Nodes – Secondary Name Node, Checkpoint Node and Backup Node.
There is only one Name Node in the cluster. It stores the metadata for every file in the filesystem.
It is also responsible for the filesystem namespace.
It should also have as much RAM as possible because it keeps the entire filesystem metadata in memory.
It should run on reliable enterprise-grade hardware, because losing the Name Node makes the filesystem unavailable.
A typical HDFS cluster has many Data Nodes.
They store the blocks of data.
Each Data Node also reports to the Name Node periodically with the list of blocks it stores.
The Data Nodes are designed to run on commodity hardware and replication is provided at the software layer.
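The periodic block reports can be sketched as follows: each Data Node tells the Name Node which blocks it holds, and the Name Node inverts that into a block-to-locations map it can use to answer reads. This is a simplified model with made-up names, not the actual HDFS protocol.

```python
from collections import defaultdict


def build_block_map(reports: dict[str, list[str]]) -> dict[str, set[str]]:
    """Invert Data Node block reports into a block -> locations map.

    `reports` maps a Data Node name to the list of block IDs it stores.
    """
    block_map: dict[str, set[str]] = defaultdict(set)
    for datanode, blocks in reports.items():
        for block in blocks:
            block_map[block].add(datanode)
    return dict(block_map)


# Toy block reports from three Data Nodes.
reports = {
    "datanode1": ["blk_1", "blk_2"],
    "datanode2": ["blk_1"],
    "datanode3": ["blk_2"],
}
block_map = build_block_map(reports)
```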