Hadoop Job command

Command to interact with MapReduce jobs.

hadoop job  [-submit <job-file>] | [-status <job-id>] | [-counter <job-id> <group-name> <counter-name>] | [-kill <job-id>] | [-events <job-id> <from-event-#> <#-of-events>] | [-history [all] <jobOutputDir>] | [-list [all]] | [-kill-task <task-id>] | [-fail-task <task-id>] | [-set-priority <job-id> <priority>]

COMMAND_OPTION : Description
-submit <job-file> : Submits the job.
-status <job-id> : Prints the map and reduce completion percentage and all job counters.
-counter <job-id> <group-name> <counter-name> : Prints the counter value.
-kill <job-id> : Kills the job.
-events <job-id> <from-event-#> <#-of-events> : Prints the events' details received by the jobtracker for the given range.
-history [all] <jobOutputDir> : Prints job details, plus failed and killed tip details. More details about the job, such as successful tasks and the task attempts made for each task, can be viewed by specifying the [all] option.
-list [all] : -list displays only jobs which are yet to complete; -list all displays all jobs.
-kill-task <task-id> : Kills the task. Killed tasks are NOT counted against failed attempts.
-fail-task <task-id> : Fails the task. Failed tasks are counted against failed attempts.
-set-priority <job-id> <priority> : Changes the priority of the job. Allowed priority values are VERY_HIGH, HIGH, NORMAL, LOW, VERY_LOW.
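For example, a typical session might list the incomplete jobs, check the status of one, and then kill it (the job id below is illustrative):

hadoop job -list
hadoop job -status job_201310191043_0001
hadoop job -kill job_201310191043_0001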

Source: https://hadoop.apache.org/docs/r1.2.1/commands_manual.html


HDFS fsck command

HDFS is the primary distributed storage used by Hadoop applications. An HDFS cluster primarily consists of a NameNode, which manages the file system metadata, and DataNodes, which store the actual data.

HDFS supports the fsck command to check for various inconsistencies. It is designed to report problems with various files, for example, missing blocks for a file or under-replicated blocks. Unlike a traditional fsck utility for native file systems, this command does not correct the errors it detects. Normally the NameNode automatically corrects most of the recoverable failures. By default, fsck ignores open files but provides an option to include all files during reporting. The HDFS fsck command is not a Hadoop shell command. It can be run as bin/hadoop fsck.

Runs an HDFS filesystem checking utility.

hadoop fsck <path> [-move | -delete | -openforwrite] [-files [-blocks [-locations | -racks]]]

COMMAND_OPTION : Description
<path> : Start checking from this path.
-move : Move corrupted files to /lost+found.
-delete : Delete corrupted files.
-openforwrite : Print out files opened for write.
-files : Print out files being checked.
-blocks : Print out the block report.
-locations : Print out locations for every block.
-racks : Print out network topology for data-node locations.
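For example, to check the entire filesystem and print the files, blocks and block locations as it goes (the path / is illustrative; any HDFS path works):

hadoop fsck / -files -blocks -locations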

Source: https://hadoop.apache.org/docs/r1.2.1/commands_manual.html

How to remove duplicate rows from Hive table?

Scenario

We have a Hive table with duplicate rows and want to remove these duplicate rows from the table.

Approach

Steps:

1) Create a new table from the old table (with the same structure).
2) Copy the distinct rows into the new table from the existing table:

select col1, col2, col3, col4, max(<duplicate column>) as <duplicate column> from <table name> group by col1, col2, col3, col4;

3) Delete the old table.

4) Rename the new table to the old one.

This is the newer approach; a consolidated sketch follows.
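A minimal HiveQL sketch of the four steps, assuming an existing table t with grouping columns col1..col4 and a duplicated column dup_col (all names are illustrative):

create table t_dedup like t;                -- step 1: same structure as the old table

insert overwrite table t_dedup              -- step 2: keep one row per distinct (col1..col4)
select col1, col2, col3, col4, max(dup_col) as dup_col
from t
group by col1, col2, col3, col4;

drop table t;                               -- step 3: delete the old table
alter table t_dedup rename to t;            -- step 4: rename the new table to the old name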

Another approach

We can also follow an older approach that databases use when deleting rows; a set-based HiveQL equivalent is sketched after the steps below.

Steps:

  1. Create a new temp table from the old table (with the same structure).
  2. Create a new lookup table (with id and flag columns).
  3. Copy the ids of the duplicate rows into the lookup table.
  4. Copy rows from the old table to the new table; for each row:
    1. Select the row from the old table and check for its id in the lookup table.
    2. If the id exists and its flag is not set, set the flag and copy the row into the new table.
    3. Otherwise, skip the row.
  5. Delete or drop the old table.
  6. Delete or drop the lookup table.
  7. Rename the new table to the old table.
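Hive has no per-row update with which to maintain such a flag, but the same first-row-wins bookkeeping can be expressed set-wise with the row_number() window function (available since Hive 0.11); a sketch, using the same illustrative names as above:

create table t_dedup like t;

insert overwrite table t_dedup
select col1, col2, col3, col4, dup_col
from (
    select col1, col2, col3, col4, dup_col,
           row_number() over (partition by col1, col2, col3, col4
                              order by dup_col desc) as rn   -- rn = 1 plays the role of the flag
    from t
) ranked
where rn = 1;

drop table t;
alter table t_dedup rename to t;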

 

Vikas Jindal

Hive – Useful Commands

Hive is a data warehousing infrastructure based on Apache Hadoop. Hadoop provides massive scale out and fault tolerance capabilities for data storage and processing on commodity hardware.

Q. Which directory is created after creating a Hive table?

Creating a managed table creates a directory for the table under the Hive warehouse directory (by default /user/hive/warehouse); the exact path appears in the Location field of the describe output.

To see the primary info of a Hive table, use describe table_name;

To see more detailed information about the table, use describe extended table_name;

To see all of the details laid out in a clean, readable manner, use describe formatted table_name;

Q. How to generate the create statement for an existing hive table?

show create table <table_name>;

Q. How to see query plan in Hive?

explain select * from <table_name>;

Q. How to run Query from command line?

hive -e 'select a.col from tab1 a'

Q. How to dump data out from a query into a file using silent mode?

hive -S -e 'select a.col from tab1 a' > a.txt

Q How to list all databases?

Show databases;

Q How to list all tables?

Show tables;

Q How to list all partitions in table?

Show partitions <table_name>;

Q How to delete partition?

ALTER TABLE <table_name> DROP PARTITION (<partition_spec>);

Example

ALTER TABLE db1.person DROP PARTITION (year=2016, month=01);

Q How to rename a table?

ALTER TABLE <old_table_name> RENAME TO <new_table_name>;

Q How to check locks on a table?

Show locks <table_name>;

Q How to lock on a table?

Lock table <table_name> shared;

Use exclusive in place of shared for an exclusive lock.

Q How to unlock on a table?

Unlock table <table_name>;

Q How to check concurrency in hive?

set hive.support.concurrency;
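The same set syntax with a value assigns the property; for example, to turn concurrency support on for the current session (the lock commands above require it, along with a configured lock manager):

set hive.support.concurrency=true;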

Q How to check indexes in hive table?

SHOW INDEX ON <table_name>;

SHOW CREATE TABLE – shows the CREATE TABLE statement that creates a given table, or the CREATE VIEW statement that creates a given view.

SHOW CREATE TABLE [db_name.]<table_name|view_name>;

Q How to copy a hive table?

create table <db_name>.<new_table_name> as select * from <db_name>.<table_name>;
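Example (database and table names are illustrative):

create table db1.person_copy as select * from db1.person;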

Hadoop – Questions & Answers

Introduction: I am writing here questions and answers that I got from the internet, and some that I formed from my study materials.


Q 1: What is Hadoop?

Ans: Hadoop is the most popular platform for big data analysis. Hadoop is a free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment. The Hadoop ecosystem is huge and involves many supporting frameworks and tools to effectively run and manage it. Hadoop is part of the Apache project sponsored by the Apache Software Foundation.

Q 2: What is HDFS?

HDFS was based on a paper Google published about their Google File System.

It runs on top of the existing file systems on each node in a Hadoop cluster.


Q 3: What is MapReduce?


Q 4: What is a JobTracker in Hadoop? How many instances of JobTracker run on a Hadoop Cluster?


Q 5: What is a Task Tracker in Hadoop? How many instances of TaskTracker run on a Hadoop Cluster?

Q 6: What is HIVE?

Q 7: What is PIG?

Pig, or Pig Latin, is a language.

It helps analysts concentrate on analytic work by removing MapReduce programming complexity.

Pig is a high-level language, and it converts its operators into MapReduce code.


Q 8: What is HBase?

Q 9: What is replication factor in HDFS?

Q 10: What is Master-Worker Pattern?

Q 11: In HDFS, why does the system reconstruct block location information every time on startup?

Q 12: What is POSIX (Portable Operating System Interface)?

Q 13: What is data locality optimization?

Q 14: What is the meaning of streaming data access pattern?

Big Data – Introduction

This is the era of data; in other words, we are living in the age of data. Data storage is now very cheap, and we have many easy techniques for capturing data.

People upload videos, take pictures on their smartphones, text friends, update their Facebook status, leave comments around the web, click on ads, and so forth.

Machines are also intelligent and capable of generating and keeping more and more data.

Google, Yahoo, Amazon and Microsoft are all facing the challenges of the exponential growth of data.

Existing tools are becoming inadequate to process this big data. These tools are not able to go through terabytes and petabytes of data to figure out which websites are popular, what products are in demand, and what kinds of ads are appealing to people.

Hadoop (MapReduce + Hadoop Distributed File System) is the answer to this problem.

Hadoop is not Big Data itself; Hadoop is open-source analytics software.

Google published its MapReduce paper in 2004.

Doug Cutting started development of Hadoop System and named it after his son’s toy elephant.

Hadoop uses Google’s MapReduce and Google File System technologies as its foundation.

It is optimized to handle massive quantities of data which could be structured, unstructured or semi-structured, using commodity hardware, that is, relatively inexpensive computers.

This massive parallel processing is done with great performance. However, it is a batch operation handling massive quantities of data, so the response time is not immediate.

Yahoo was the first company to start funding Hadoop.

Hadoop is not good for processing transactions, because transaction workloads require random access.

It is not good when the work cannot be parallelized.

It is not good for low latency data access.

It is not good for processing lots of small files.

And it is not good for intensive calculations with little data.

It is NOT a replacement for a relational database system.

Hadoop is not suitable for OnLine Analytical Processing or Decision Support System workloads, where structured data is accessed sequentially, as in a relational database, to generate reports that provide business intelligence.

Open source projects related with Hadoop:

Eclipse is a popular IDE donated by IBM to the open source community.

Lucene is a text search engine library written in Java.

HBase is the Hadoop database.

Hive provides data warehousing tools to extract, transform and load data, and query this data stored in Hadoop files.

Pig is a platform for analyzing large data sets. It is a high level language for expressing data analysis.

ZooKeeper is a centralized configuration service and naming registry for large distributed systems.

Avro is a data serialization system.

UIMA is an architecture for the development, discovery, composition and deployment of components for the analysis of unstructured data.

Vikas Jindal