Hadoop – Questions & Answers

Introduction: Here I am collecting questions and answers, some gathered from the internet and some formed from my study materials.


Q 1: What is Hadoop?

Ans: Hadoop is the most popular platform for big data analysis. It is a free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment. The Hadoop ecosystem is huge and includes many supporting frameworks and tools for running and managing it effectively. Hadoop is an Apache project sponsored by the Apache Software Foundation.

Q 2: What is HDFS?

HDFS was based on a paper Google published about their Google File System.

It runs on top of the existing file systems on each node in a Hadoop cluster.

(More in the HDFS section below.)

Q 3: What is MapReduce?


Q 4: What is a JobTracker in Hadoop? How many instances of JobTracker run on a Hadoop Cluster?


Q 5: What is a Task Tracker in Hadoop? How many instances of TaskTracker run on a Hadoop Cluster?

Q 6: What is HIVE?

Q 7: What is PIG?

Pig is a high-level language (its scripting dialect is called Pig Latin).

It lets analysts concentrate on analysis by hiding the complexity of MapReduce programming.

Pig converts its high-level operators into MapReduce code.

(More in the PIG section below.)

Q 8: What is HBase?

Q 9: What is replication factor in HDFS?

Q 10: What is Master-Worker Pattern?

Q 11: In HDFS, why does the system reconstruct block location information on every start-up?

Q 12: What is POSIX (Portable Operating System Interface)?

Q 13: What is data locality optimization?

Q 14: What is the meaning of streaming data access pattern?

What Is Hadoop?

The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing.

The Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.

It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.

Rather than relying on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, delivering a highly available service on top of a cluster of computers, each of which may be prone to failure.

The project includes the following core modules, along with related Apache projects:

  • Hadoop Common: The common utilities that support the other Hadoop modules.
  • Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data.
  • Hadoop YARN: A framework for job scheduling and cluster resource management.
  • Hadoop MapReduce: A YARN-based system for parallel processing of large data sets (a minimal word-count sketch follows this list).
  • Ambari: A web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters, which includes support for Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig and Sqoop. Ambari also provides a dashboard for viewing cluster health (such as heatmaps) and the ability to view MapReduce, Pig and Hive applications visually, along with features to diagnose their performance characteristics in a user-friendly manner.
  • Avro: A data serialization system.
  • Cassandra: A scalable multi-master database with no single points of failure.
  • Chukwa: A data collection system for managing large distributed systems.
  • HBase: A scalable, distributed database that supports structured data storage for large tables.
  • Hive: A data warehouse infrastructure that provides data summarization and ad hoc querying.
  • Mahout: A scalable machine learning and data mining library.
  • Pig: A high-level data-flow language and execution framework for parallel computation.
  • Spark: A fast and general compute engine for Hadoop data. Spark provides a simple and expressive programming model that supports a wide range of applications, including ETL, machine learning, stream processing, and graph computation.
  • Tez: A generalized data-flow programming framework, built on Hadoop YARN, which provides a powerful and flexible engine to execute an arbitrary DAG of tasks to process data for both batch and interactive use-cases. Tez is being adopted by Hive, Pig and other frameworks in the Hadoop ecosystem, and also by other commercial software (e.g. ETL tools), to replace Hadoop MapReduce as the underlying execution engine.
  • ZooKeeper: A high-performance coordination service for distributed applications.
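To make the MapReduce module concrete, here is a minimal sketch of the classic word-count job in Java, written against the standard Hadoop MapReduce API (the class name is illustrative; the input and output HDFS paths are passed on the command line):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every word in the input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the counts emitted for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // combiner pre-aggregates map output locally
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory (must not exist yet)
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}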

Code Performance: if-if-if vs. if-else-if vs. switch-case

When we have to select one option among a number of options, we can use any of the following control statements:

First Way (the Worst Way): every condition is evaluated, even after a match has already been found.

string strMsg = "";

if (color == "black") {
    strMsg = "Color is black.";
}
if (color == "blue") {
    strMsg = "Color is blue.";
}
if (color == "red") {
    strMsg = "Color is red.";
}

Second Way (Better, but Still Not the Best): evaluation stops at the first matching condition, but the conditions are still tested one by one.

string strMsg = "";

if (color == "black") {
    strMsg = "Color is black.";
}
else if (color == "blue") {
    strMsg = "Color is blue.";
}
else if (color == "red") {
    strMsg = "Color is red.";
}

Third Way (the Best Way): the compiler can optimize a switch, for example into a jump table or hash lookup, instead of testing each condition in turn.

string strMsg = "";

switch (color)
{
    case "black":
        strMsg = "Color is black.";
        break;

    case "blue":
        strMsg = "Color is blue.";
        break;

    case "red":
        strMsg = "Color is red.";
        break;

    default:
        strMsg = "No Color.";
        break;
}

Of the three control statements above, switch-case is the best choice.


Hadoop – PIG

Pig is a high-level language (its scripting dialect is called Pig Latin).

It lets analysts concentrate on analysis by hiding the complexity of MapReduce programming.

Pig converts its high-level operators into MapReduce code.

So, with the help of Pig, a developer without knowledge of Java or MapReduce programming can use operators such as SORT and FILTER.

User-defined functions (UDFs) are another very strong feature of Pig. (For example, in my first program I put a title in front of the first name based on gender and age.) UDFs can be written in Java.
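As a sketch of what such a UDF can look like, here is a hypothetical Java eval function that prepends a title to a first name based on gender (the class name and the title rules are invented for illustration; EvalFunc is Pig's documented base class for eval UDFs):

import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Hypothetical UDF: expects a tuple of (firstName:chararray, gender:chararray)
// and returns the name with a title prepended.
public class AddTitle extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() < 2 || input.get(0) == null) {
            return null; // Pig treats a null result as "no value" for this row
        }
        String firstName = (String) input.get(0);
        String gender = (String) input.get(1);
        String title = "F".equalsIgnoreCase(gender) ? "Ms. " : "Mr. ";
        return title + firstName;
    }
}

In a Pig script, the compiled jar would then be registered with the REGISTER command and the function invoked inside a FOREACH … GENERATE statement.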

Pig can run in interactive mode or script mode. (In script mode, we write the commands in a file and then execute it.)

In interactive mode, the Grunt shell is invoked.

Pig gives one more facility: a developer can run Pig against the local file system in local mode, which is useful for testing on small amounts of data. (Local mode is selected when starting Pig, e.g. with pig -x local.)

When Pig runs on a Hadoop cluster (MapReduce mode), Pig statements are converted into MapReduce code.


HDFS – Hadoop Distributed File System

HDFS was based on a paper Google published about their Google File System.

It runs on top of the existing file systems on each node in a Hadoop cluster.

It is a Java-based file system.

It provides scalable and reliable data storage.

It is highly fault-tolerant and is designed to be deployed on low-cost hardware.

It provides high throughput access to application data.

It is suitable for applications that have large data sets.

It was designed to span large clusters of commodity servers.

Hadoop Data in Blocks

Hadoop uses blocks to store a file or parts of a file.

Blocks are large. They default to 64 megabytes each and most systems run with block sizes of 128 megabytes or larger.

A Hadoop block is a file on the underlying filesystem. Since the underlying filesystem stores files as blocks, one Hadoop block may consist of many blocks in the underlying file system.
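The block layout of a file, including which nodes hold each block's replicas, can be inspected through the Hadoop FileSystem API. Here is a minimal sketch, assuming a configured Hadoop client on the classpath (the HDFS path is illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlocks {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/data/sample.log"); // illustrative HDFS path
        FileStatus status = fs.getFileStatus(file);

        // One BlockLocation per HDFS block, listing the hosts that store its replicas.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    block.getOffset(), block.getLength(),
                    String.join(",", block.getHosts()));
        }
        fs.close();
    }
}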

Advantages of Blocks

They are fixed in size, which makes it easy to calculate how many fit on a disk.

A large file can be stored easily by splitting it into blocks and spreading them over multiple nodes.

HDFS blocks also don’t waste space: a file smaller than the block size uses only as much underlying storage as it needs, not the full block.

Blocks fit well with replication, which allows HDFS to be fault tolerant and available on commodity hardware.

Each block is replicated to multiple nodes. For example, block 1 is stored on node 1, node 2 and node 4.

This allows for node failure without data loss. If node 1 crashes, node 2 and node 4 still run and have block 1’s data.

The replication factor (three by default) can be set in Hadoop’s configuration, or even per individual file.
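As a sketch, both approaches look like this in Java (dfs.replication is the standard configuration property; the path is illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplication {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Cluster-wide default, normally set once in hdfs-site.xml.
        conf.setInt("dfs.replication", 3);

        FileSystem fs = FileSystem.get(conf);
        // Per-file override: keep five replicas of this particular (illustrative) file.
        fs.setReplication(new Path("/data/important.log"), (short) 5);
        fs.close();
    }
}

The same per-file change can also be made from the command line with hdfs dfs -setrep.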

HDFS Main Nodes – The Name Node and The Data Nodes

Other Nodes – Secondary Name Node, Checkpoint Node and Backup Node.

There is only one Name Node in the cluster. It stores the metadata for every file.

It is also responsible for the filesystem namespace.

It should also have as much RAM as possible because it keeps the entire filesystem metadata in memory.

It should run on the best enterprise hardware, for maximum reliability.

Data Node

A typical HDFS cluster has many Data Nodes.

They store the blocks of data.

Each Data Node also reports to the Name Node periodically with the list of blocks it stores.

The Data Nodes are designed to run on commodity hardware and replication is provided at the software layer.



Hadoop – Some Terms

Node – A node is simply a computer; nodes that contain data are typically non-enterprise, commodity hardware.

Rack – A rack is a collection of 30+ nodes. They are physically stored close together and are all connected to the same network switch. Network bandwidth between any two nodes in the same rack is greater than the bandwidth between two nodes on different racks.

Cluster – A Hadoop cluster is a collection of racks.


Vikas Jindal