HDFS – Hadoop Distributed File System

HDFS was based on a paper Google published about their Google File System.

It runs on top of the existing file systems on each node in a Hadoop cluster.

It is a Java-based file system.

It provides scalable and reliable data storage.

It is highly fault-tolerant and is designed to be deployed on low-cost hardware.

It provides high throughput access to application data.

It is suitable for applications that have large data sets.

It was designed to span large clusters of commodity servers.

Hadoop data in Blocks

Hadoop uses blocks to store a file or parts of a file.

Blocks are large. The default was 64 megabytes in Hadoop 1.x and is 128 megabytes from Hadoop 2.x onward, and many systems run with block sizes of 128 megabytes or larger.

A Hadoop block is a file on the underlying filesystem. Since the underlying filesystem stores files as blocks, one Hadoop block may consist of many blocks in the underlying file system.
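To make the block arithmetic concrete, here is a minimal sketch of how a file divides into HDFS-sized blocks, and how many underlying filesystem blocks one HDFS block spans. The function names are made up for illustration and are not part of Hadoop, and the 4 KB local filesystem block size is an assumption (a common value, but it varies by filesystem).

```python
MB = 1024 * 1024


def split_into_blocks(file_size, block_size=128 * MB):
    """Return the sizes (in bytes) of the blocks a file would be split into.

    Hypothetical helper for illustration; not an HDFS API.
    """
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        # The last block only holds the leftover bytes.
        sizes.append(remainder)
    return sizes


def underlying_blocks(hdfs_block_bytes, fs_block_size=4096):
    """Number of local filesystem blocks needed to hold one HDFS block.

    fs_block_size=4096 is an assumed value for the local filesystem.
    """
    return -(-hdfs_block_bytes // fs_block_size)  # integer ceiling division


if __name__ == "__main__":
    sizes = split_into_blocks(300 * MB)
    print([s // MB for s in sizes])      # block sizes in MB
    print(underlying_blocks(128 * MB))   # local blocks per full HDFS block
```

With these assumptions, a 300 MB file becomes two full 128 MB blocks plus one 44 MB block, and each full block covers 32768 local filesystem blocks.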

Advantages of Blocks

They are fixed in size, so it is easy to calculate how many can fit on a disk.

A large file can be stored easily by splitting it into blocks and spreading those blocks over multiple nodes.

HDFS blocks also don’t waste space: a file smaller than the block size only takes up as much underlying storage as it actually needs.

Blocks fit well with replication, which allows HDFS to be fault tolerant and available on commodity hardware.

Each block is replicated to multiple nodes. For example, block 1 is stored on node 1, node 2 and node 4.

This allows for node failure without data loss. If node 1 crashes, node 2 and node 4 are still running and have block 1’s data.

The replication factor can be set in Hadoop’s configuration (the dfs.replication property), or even for each individual file.
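The replication example above can be sketched in a few lines. This is only a toy model: the placement dictionary and the function names are invented for illustration, and real HDFS chooses replica locations itself rather than reading them from a table like this.

```python
def surviving_replicas(placement, failed_nodes):
    """Return, for each block, the set of nodes that still hold a copy."""
    failed = set(failed_nodes)
    return {block: set(holders) - failed for block, holders in placement.items()}


def lost_blocks(placement, failed_nodes):
    """Return the blocks that have no surviving replica at all."""
    remaining = surviving_replicas(placement, failed_nodes)
    return sorted(block for block, holders in remaining.items() if not holders)


if __name__ == "__main__":
    # Hypothetical placement mirroring the example in the text:
    # block 1 lives on node 1, node 2 and node 4.
    placement = {"block1": {"node1", "node2", "node4"}}

    print(surviving_replicas(placement, ["node1"]))
    print(lost_blocks(placement, ["node1"]))
    print(lost_blocks(placement, ["node1", "node2", "node4"]))
```

Losing node 1 still leaves two copies of block 1. Data is lost only if every node holding a replica fails at the same time.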

HDFS Main Nodes – The Name Node and the Data Nodes

Other Nodes – Secondary Name Node, Checkpoint Node and Backup Node.

Name Node

There is only one Name Node in the cluster. It stores the metadata for the files, such as which blocks make up each file and which Data Nodes hold them.

It is also responsible for the filesystem namespace.

It should also have as much RAM as possible because it keeps the entire filesystem metadata in memory.
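A rough back-of-the-envelope estimate shows why RAM matters here. The sketch below assumes a commonly quoted rule of thumb of roughly 150 bytes of Name Node memory per file, directory or block. That figure is an assumption for illustration, not a measured value, and the function name is hypothetical.

```python
# Assumed rule of thumb: ~150 bytes of heap per namespace object
# (file, directory or block). Treat this as an approximation only.
BYTES_PER_OBJECT = 150


def estimate_namenode_memory(num_files, blocks_per_file, num_dirs=0):
    """Estimate Name Node memory (in bytes) needed to hold the metadata."""
    num_blocks = num_files * blocks_per_file
    objects = num_files + num_dirs + num_blocks
    return objects * BYTES_PER_OBJECT


if __name__ == "__main__":
    needed = estimate_namenode_memory(num_files=10_000_000, blocks_per_file=1)
    print(f"{needed / 1e9:.1f} GB")
```

Under that assumption, ten million single-block files already need about 3 GB of memory for metadata alone, which is also why many small files are harder on the Name Node than a few large ones.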

It should run on the best enterprise hardware available for maximum reliability, since the whole cluster depends on it.

Data Node

A typical HDFS cluster has many Data Nodes.

They store the blocks of data.

Each Data Node also reports to the Name Node periodically with the list of blocks it stores.

The Data Nodes are designed to run on commodity hardware and replication is provided at the software layer.
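The periodic block reports can be pictured with a small simulation. This is a simplified sketch, not the real Name Node logic: the class and method names are invented for illustration, and it only shows how block lists sent by Data Nodes could be turned into a map of block locations and a list of under-replicated blocks.

```python
from collections import defaultdict


class NameNodeSketch:
    """Toy stand-in for the Name Node's view of block locations."""

    def __init__(self, replication=3):
        self.replication = replication
        self.reports = {}  # data node name -> set of block ids it last reported

    def receive_block_report(self, node, block_ids):
        """Record the latest list of blocks a Data Node says it stores."""
        self.reports[node] = set(block_ids)

    def block_locations(self):
        """Build a block -> set-of-nodes map from the latest reports."""
        locations = defaultdict(set)
        for node, blocks in self.reports.items():
            for block in blocks:
                locations[block].add(node)
        return dict(locations)

    def under_replicated(self):
        """Blocks with fewer reported copies than the replication factor."""
        return sorted(
            block
            for block, nodes in self.block_locations().items()
            if len(nodes) < self.replication
        )


if __name__ == "__main__":
    nn = NameNodeSketch(replication=2)
    nn.receive_block_report("dn1", ["b1", "b2"])
    nn.receive_block_report("dn2", ["b1"])
    print(nn.block_locations())
    print(nn.under_replicated())
```

Here block b2 is reported by only one Data Node, so it shows up as under-replicated and would be a candidate for copying to another node.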

Vikas Jindal
