Big Data – Introduction

We are living in the age of data. Data storage is now very cheap, and we have many easy ways to capture data.

People upload videos, take pictures on their smartphones, text friends, update their Facebook status, leave comments around the web, click on ads, and so forth.

Machines are also becoming more intelligent and are capable of generating and keeping more and more data.

Google, Yahoo, Amazon and Microsoft all face the challenge of exponential data growth.

Existing tools are becoming inadequate for processing this big data. They cannot go through terabytes and petabytes of data to figure out which websites are popular, which products are in demand and what kinds of ads appeal to people.

Hadoop (MapReduce + Hadoop Distributed File System) is an answer to this problem.

Hadoop is not Big Data itself. Hadoop is an open source software framework for storing and analyzing large data sets.

Google published its MapReduce paper in 2004.

Doug Cutting started the development of Hadoop and named it after his son’s toy elephant.

Hadoop is built on the ideas behind Google’s MapReduce and Google File System technologies.

Hadoop is optimized to handle massive quantities of data, which can be structured, unstructured or semi-structured, using commodity hardware, that is, relatively inexpensive computers.

This massively parallel processing delivers high throughput. However, it is a batch operation over massive quantities of data, so the response time is not immediate.
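
To make the MapReduce idea more concrete, here is a minimal sketch of a word count written in plain Python. It does not use Hadoop or any Hadoop API; the function names (map_phase, shuffle, reduce_phase, word_count) and the sample documents are made up for illustration only. It simply imitates the three steps of the model (map, group by key, reduce) inside a single process, whereas a real cluster would run the map and reduce calls in parallel on many machines.

```python
from collections import defaultdict


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    # On a cluster the map calls would run in parallel on different machines;
    # in this sketch they run one after another in a single process.
    pairs = (pair for doc in documents for pair in map_phase(doc))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical input: each string stands in for one file or record.
    docs = ["big data is big", "hadoop processes big data"]
    print(word_count(docs))
```

Because each document is mapped independently and each key is reduced independently, the work can be split across many machines. That is also why jobs that cannot be split this way gain little from this model.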

Yahoo was the first company to fund the development of Hadoop.

Hadoop is not well suited to transaction processing, because transactions need random access to individual records and Hadoop is designed for sequential batch reads.

It is not a good fit when the work cannot be parallelized.

It is not a good fit for low-latency data access.

It is not good at processing lots of small files.

And it is not good at intensive calculations over small amounts of data.

It is NOT a replacement for a relational database system.

Hadoop is also not suitable for Online Analytical Processing (OLAP) or Decision Support System (DSS) workloads, where structured data, such as that in a relational database, is accessed sequentially to generate reports that provide business intelligence.

Open source projects related to Hadoop:

Eclipse is a popular IDE donated by IBM to the open source community.

Lucene is a text search engine library written in Java.

HBase is the Hadoop database.

Hive provides data warehousing tools to extract, transform and load data, and query this data stored in Hadoop files.

Pig is a platform for analyzing large data sets. It includes a high-level language for expressing data analysis programs.

ZooKeeper is a centralized configuration service and naming registry for large distributed systems.

Avro is a data serialization system.

UIMA is an architecture for the development, discovery, composition and deployment of analytics for unstructured data.

Vikas Jindal

