HDFS fsck command

HDFS is the primary distributed storage used by Hadoop applications. An HDFS cluster primarily consists of a NameNode that manages the file system metadata and DataNodes that store the actual data.

HDFS supports the fsck command to check for various inconsistencies. It is designed for reporting problems with various files, for example, missing blocks for a file or under-replicated blocks. Unlike a traditional fsck utility for native file systems, this command does not correct the errors it detects. Normally the NameNode automatically corrects most of the recoverable failures. By default fsck ignores open files but provides an option to select all files during reporting. The HDFS fsck command is not a Hadoop shell command. It can be run as 'bin/hadoop fsck'.

Runs an HDFS file system checking utility.

hadoop fsck <path> [-move | -delete | -openforwrite] [-files [-blocks [-locations | -racks]]]

COMMAND_OPTION    Description
<path>            Start checking from this path.
-move             Move corrupted files to /lost+found.
-delete           Delete corrupted files.
-openforwrite     Print out files opened for write.
-files            Print out files being checked.
-blocks           Print out block report.
-locations        Print out locations for every block.
-racks            Print out network topology for data-node locations.
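
As a rough illustration of how the options nest, the sketch below assembles the argument list for the command in Python. The function name build_fsck_command and its parameters are hypothetical helpers, not part of Hadoop; only the flags listed above are used, and the command is built but never executed here.

```python
def build_fsck_command(path, action=None, files=False, blocks=False,
                       detail=None, hadoop_bin="hadoop"):
    """Build the argv list for `hadoop fsck` (hypothetical helper).

    action: None, "move", "delete" or "openforwrite" (mutually exclusive).
    detail: None, "locations" or "racks" (only valid together with blocks).
    """
    if action not in (None, "move", "delete", "openforwrite"):
        raise ValueError("unknown action: %r" % (action,))
    if detail not in (None, "locations", "racks"):
        raise ValueError("unknown detail: %r" % (detail,))
    # The usage line nests the options: -blocks needs -files,
    # and -locations / -racks need -blocks.
    if blocks and not files:
        raise ValueError("-blocks requires -files")
    if detail and not blocks:
        raise ValueError("-%s requires -blocks" % detail)

    argv = [hadoop_bin, "fsck", path]
    if action:
        argv.append("-" + action)
    if files:
        argv.append("-files")
    if blocks:
        argv.append("-blocks")
    if detail:
        argv.append("-" + detail)
    return argv


fsck_argv = build_fsck_command("/user/data", files=True, blocks=True,
                               detail="locations")
print(" ".join(fsck_argv))
# On a machine with Hadoop installed, the list could be passed to
# subprocess.run(fsck_argv) to actually run the check.
```

Building a list instead of a single string avoids shell quoting problems if the path contains spaces.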

Source – https://hadoop.apache.org/docs/r1.2.1/commands_manual.html

How to remove duplicate rows from Hive table?

Scenario

We have a Hive table that contains duplicate rows and want to remove those duplicate rows from it.

Approach

Steps:

1) Create a new table from the old table (with the same structure).
2) Copy the distinct rows from the existing table into the new table:

select col1,col2,col3,col4,max(<duplicate column>) as <name of duplicate column> from <table name> group by col1,col2,col3,col4;

3) Delete the old table.
4) Rename the new table to the old name.

This is the newer approach.
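
The four steps can be tried out end to end with a small sketch. It uses Python's built-in sqlite3 module as a stand-in for Hive, so it only demonstrates the pattern (create, copy with group by, drop, rename); the table person, its columns, and the helper dedupe_by_group are made up for illustration.

```python
import sqlite3


def dedupe_by_group(conn, table, key_cols, dup_col):
    """Rebuild `table` with one row per key, keeping max(dup_col).

    Hypothetical helper. Table and column names are interpolated
    directly, so they must come from trusted code, not user input.
    """
    keys = ", ".join(key_cols)
    tmp = table + "_dedup"
    cur = conn.cursor()
    # Steps 1 and 2: create the new table and fill it with one row per key.
    cur.execute(
        f"CREATE TABLE {tmp} AS "
        f"SELECT {keys}, MAX({dup_col}) AS {dup_col} "
        f"FROM {table} GROUP BY {keys}"
    )
    # Step 3: delete the old table.
    cur.execute(f"DROP TABLE {table}")
    # Step 4: rename the new table to the old name.
    cur.execute(f"ALTER TABLE {tmp} RENAME TO {table}")
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (name TEXT, city TEXT, updated INTEGER)")
conn.executemany(
    "INSERT INTO person VALUES (?, ?, ?)",
    [("ann", "pune", 1), ("ann", "pune", 3), ("bob", "delhi", 2)],
)
conn.commit()

dedupe_by_group(conn, "person", ["name", "city"], "updated")
rows = conn.execute(
    "SELECT name, city, updated FROM person ORDER BY name"
).fetchall()
print(rows)
```

In Hive the same statements would be run on the real table, so it is worth checking the row count of the new table before dropping the old one.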

Another approach

We can also follow an older approach, one that databases use when deleting rows.

Steps:

  1. Create a new temp table from the old table (with the same structure).
  2. Create a new lookup table (with id and flag columns).
  3. Copy the ids of the duplicate rows into the lookup table.
  4. Copy rows from the old table to the new table. For each row:
    1. Select the row from the old table and check whether its id exists in the lookup table.
    2. If it does not exist, the row has no duplicates, so copy it to the new table.
    3. If it exists and the flag is not set, set the flag and copy the row to the new table.
    4. Otherwise (the flag is already set), skip the row.
  5. Delete or drop the old table.
  6. Delete or drop the lookup table.
  7. Rename the new table to the old table's name.
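
The copy loop in step 4 is easier to follow as code. The sketch below models the tables as plain Python lists and a dict, so it is only an illustration of the logic, not something that runs against Hive. It assumes each row is an (id, payload) tuple and that rows whose id is absent from the lookup table are copied unchanged; the name dedupe_with_lookup is hypothetical.

```python
from collections import Counter


def dedupe_with_lookup(old_rows):
    """Return the rows of `old_rows` with duplicates by id removed.

    old_rows is a list of (id, payload) tuples standing in for the
    old table. The first occurrence of each id is kept.
    """
    counts = Counter(row_id for row_id, _ in old_rows)
    # Lookup table: one entry per duplicated id, flag starts unset.
    lookup = {row_id: False for row_id, n in counts.items() if n > 1}

    new_rows = []  # stands in for the new temp table
    for row in old_rows:
        row_id = row[0]
        if row_id not in lookup:
            # Assumption: an id with no duplicates is always copied.
            new_rows.append(row)
        elif not lookup[row_id]:
            # First time this duplicated id is seen: set flag and copy.
            lookup[row_id] = True
            new_rows.append(row)
        # Otherwise the flag is already set, so the row is skipped.
    return new_rows


old_table = [(1, "a"), (2, "b"), (1, "a"), (3, "c"), (2, "b")]
new_table = dedupe_with_lookup(old_table)
print(new_table)
```

The flag is what guarantees that exactly one copy of each duplicated row survives.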

 

Vikas Jindal

 

 

Hive – Useful Commands

Hive is a data warehousing infrastructure based on Apache Hadoop. Hadoop provides massive scale out and fault tolerance capabilities for data storage and processing on commodity hardware.

Q. Which directory is created after creating Hive Table?

To see the basic information about a Hive table, use describe table_name;

To see more detailed information about the table, use describe extended table_name;

To see all of the details in a cleanly formatted layout, use describe formatted table_name; The Location field in its output shows the directory created for the table.

Q. How to generate the create statement for an existing Hive table?

show create table <table_name>;

Q. How to see the query plan in Hive?

explain select * from <table_name>;

Q. How to run a query from the command line?

hive -e 'select a.col from tab1 a'

Q. How to dump data out from a query into a file using silent mode?

hive -S -e 'select a.col from tab1 a' > a.txt
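
When the same call is made from a script, it can help to build the arguments as a list so the query does not need manual quoting. This is a small sketch under that assumption; build_hive_command is a hypothetical helper, and only the -S and -e options shown above are used. Nothing is executed here.

```python
import shlex


def build_hive_command(query, silent=False, hive_bin="hive"):
    """Build the argv list for running one query with the Hive CLI."""
    argv = [hive_bin]
    if silent:
        argv.append("-S")  # silent mode, as in the command above
    argv += ["-e", query]
    return argv


hive_argv = build_hive_command("select a.col from tab1 a", silent=True)
# Shell-quoted form, useful for logging or pasting into a terminal.
hive_line = " ".join(shlex.quote(arg) for arg in hive_argv)
print(hive_line)

# To write the result to a file on a machine with Hive installed:
#   import subprocess
#   with open("a.txt", "w") as out:
#       subprocess.run(hive_argv, stdout=out, check=True)
```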

Q. How to list all databases?

Show databases;

Q. How to list all tables?

Show tables;

Q. How to list all partitions in a table?

Show partitions <table_name>;

Q. How to delete a partition?

ALTER TABLE <tablename> DROP PARTITION (<partitionname>);

Example

ALTER TABLE db1.person DROP PARTITION (year=2016, month=01);
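
If many partitions need to be dropped, the statement can be generated rather than typed. The following is a minimal sketch, assuming the partition is described as a Python dict of column names to values; drop_partition_sql is a hypothetical helper that only formats text and does not talk to Hive.

```python
def drop_partition_sql(table, partition_spec):
    """Format an ALTER TABLE ... DROP PARTITION statement.

    partition_spec maps partition column names to values, for
    example {"year": 2016, "month": 1}. String values are wrapped
    in single quotes; values containing a quote are rejected to
    keep this sketch simple.
    """
    if not partition_spec:
        raise ValueError("partition_spec must not be empty")

    def literal(value):
        if isinstance(value, str):
            if "'" in value:
                raise ValueError("quotes in values are not supported")
            return "'" + value + "'"
        return str(value)

    spec = ", ".join(
        "%s=%s" % (col, literal(val)) for col, val in partition_spec.items()
    )
    return "ALTER TABLE %s DROP PARTITION (%s);" % (table, spec)


statement = drop_partition_sql("db1.person", {"year": 2016, "month": 1})
print(statement)
```

The generated text can then be passed to the Hive command line with the -e option.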

Q. How to rename a table?

ALTER TABLE <table_name> RENAME TO <new_table_name>;

Q. How to check locks on a table?

Show locks <table_name>;

Q. How to lock a table?

Lock table <table_name> <shared|exclusive>;

Q. How to unlock a table?

Unlock table <table_name>;

Q. How to check concurrency in Hive?

set hive.support.concurrency;

Q. How to check indexes on a Hive table?

SHOW INDEX ON <table_name>;

SHOW CREATE TABLE – shows the CREATE TABLE statement that creates a given table, or the CREATE VIEW statement that creates a given view.

SHOW CREATE TABLE ([db_name.]<table_name|view_name>);

Q. How to copy a Hive table?

create table <db_name>.<new_table_name> as select * from <db_name>.<tablename>;

Scala – Scalable Language

Scala is a hybrid functional programming language.

Scala was created by Martin Odersky and first released in 2003.

In Scala, everything is an object. It is a pure object-oriented language.

Scala is also a functional language. Every function is a value and every value is an object. It provides facility to define anonymous functions. It allows nested functions.

Scala is a statically typed language, but thanks to type inference there is often no need to provide type information explicitly.

Scala is compiled into Java bytecode, which is executed by the Java Virtual Machine.

In Scala, a developer can use all Java classes.

Scala – First Program

Vikas Jindal

Hadoop – Questions & Answers

Introduction: Here I am collecting questions and answers, some that I found on the internet and some that I wrote from my study materials.

Q 1: What is Hadoop?

Ans: Hadoop is the most popular platform for big data analysis. Hadoop is a free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment. The Hadoop ecosystem is huge and involves many supporting frameworks and tools to effectively run and manage it. Hadoop is part of the Apache project sponsored by the Apache Software Foundation.

Q 2: What is HDFS?

HDFS was based on a paper Google published about their Google File System.

It runs on top of the existing file systems on each node in a Hadoop cluster.

More…

Q 3: What is MapReduce?

MapReduce

Q 4: What is a JobTracker in Hadoop? How many instances of JobTracker run on a Hadoop Cluster?

HDFS

Q 5: What is a Task Tracker in Hadoop? How many instances of TaskTracker run on a Hadoop Cluster?

Q 6: What is HIVE?

Q 7: What is PIG?

Pig is a platform whose language is called Pig Latin.

It helps analysts concentrate on analytic work by removing the complexity of MapReduce programming.

Pig Latin is a high-level language, and Pig converts its operators into MapReduce code.

More…

Q 8: What is HBase?

Q 9: What is replication factor in HDFS?

Q 10: What is Master-Worker Pattern?

Q 11: In HDFS, why does the system reconstruct block location information every time on startup?

Q 12: What is POSIX (Portable Operating System Interface)?

Q 13: What is data locality optimization?

Q 14: What is the meaning of streaming data access pattern?

What Is Hadoop?

The Hadoop (Apache) project develops open-source software for reliable, scalable, distributed computing.

The Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.

It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.

Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.

The project includes these modules:

  • Hadoop Common: The common utilities that support the other Hadoop modules.
  • Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data.
  • Hadoop YARN: A framework for job scheduling and cluster resource management.
  • Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.

Other Hadoop-related projects at Apache include:

  • Ambari: A web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters which includes support for Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig and Sqoop. Ambari also provides a dashboard for viewing cluster health such as heatmaps and the ability to view MapReduce, Pig and Hive applications visually, along with features to diagnose their performance characteristics in a user-friendly manner.
  • Avro: A data serialization system.
  • Cassandra: A scalable multi-master database with no single points of failure.
  • Chukwa: A data collection system for managing large distributed systems.
  • HBase: A scalable, distributed database that supports structured data storage for large tables.
  • Hive: A data warehouse infrastructure that provides data summarization and ad hoc querying.
  • Mahout: A scalable machine learning and data mining library.
  • Pig: A high-level data-flow language and execution framework for parallel computation.
  • Spark: A fast and general compute engine for Hadoop data. Spark provides a simple and expressive programming model that supports a wide range of applications, including ETL, machine learning, stream processing, and graph computation.
  • Tez: A generalized data-flow programming framework, built on Hadoop YARN, which provides a powerful and flexible engine to execute an arbitrary DAG of tasks to process data for both batch and interactive use-cases. Tez is being adopted by Hive, Pig and other frameworks in the Hadoop ecosystem, and also by other commercial software (e.g. ETL tools), to replace Hadoop MapReduce as the underlying execution engine.
  • ZooKeeper: A high-performance coordination service for distributed applications.