How to remove duplicate rows from Hive table?


Have table with duplicate rows in hive table and Want to remove these duplicate rows from hive table.



1) Create a new table from old table (with same structure).
2) Copy distinct rows in new table from existing table.

select col1,col2,col3,col4,max(<duplicate column>) as <name of duplicate column> from <table name> group by col1,col2,col3,col4;
3) Delete old table.

4) Rename new table to old one.

This is a new approach.

Another approach

We can follow an old approach also that databases use while deleting rows.


  1. Create a new temp table from old table (with same structure).
  2. Create a new lookup table (with id and flag columns).
  3. Copy ids of duplicate rows.
  4. Start copy rows from old to new table while copying take following steps
    1. Select row from old table and check existence in lookup table.
    2. If exists and flag is not set then
    3. set flag and copy in new table
    4. else skip
  5. Delete or drop old table.
  6. Delete or drop lookup table.
  7. Rename new table to old table.


Vikas Jindal




Hive – Useful Commands

Hive is a data warehousing infrastructure based on Apache Hadoop. Hadoop provides massive scale out and fault tolerance capabilities for data storage and processing on commodity hardware.

Q. Which directory is created after creating Hive Table?

To see table primary info of Hive table, use describe table_name;

To see more detailed information about the table, use describe extended table_name;

To see code in a clean manner use describe formatted table_name; command to see all information. also describe all details in a clean manner.

Q. How to generate the create statement for an existing hive table?

show create table

Q. How to see query plan in Hive?

explain select * from

Q. How to run Query from command line?

hive -e ‘select a.col from’

Q. How to dump data out from a query into a file using silent mode?

hive -S -e ‘select a.col from tab1 a’ > a.txt

Q How to list all databases?

Show databases;

Q How to list all tables?

Show tables;

Q How to list all partitions in table?

Show Partitions;

Q How to delete partition?

ALTER TABLE <tablename> Drop PARTITION (<partitionname>);


ALTER TABLE db1.person Drop PARTITION (year=2016, month=01);

Q How to rename a table?


Q How to check locks on a table?

Show locks;

Q How to lock on a table?

Lock table;

Q How to unlock on a table?

Unlock table;

Q How to check concurrency in hive?


Q How to check indexes in hive table?


SHOW CREATE TABLE – shows the CREATE TABLE statement that creates a given table, or the CREATE VIEW statement that creates a given view.

SHOW CREATE TABLE ([db_name.]<table_name|view_name>);

Q How to copy a hive table?

create table . as select * from .<tablename>;

Hadoop – Questions & Answers

Introduction: I am writing here questions and answers those i got from internet and some i formed from my study materials.


Q 1: What is Hadoop?

Ans: Hadoop is the most popular platform for big data analysis. Hadoop is a free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment. The Hadoop ecosystem is huge and involves many supporting frameworks and tools to effectively run and manage it. Hadoop is part of the Apache project sponsored by the Apache Software Foundation.

Q 2: What is HDFS?

HDFS was based on a paper Google published about their Google File System.

It runs on top of the existing file systems on each node in a Hadoop cluster.


Q 3: What is MapReduce?


Q 4: What is a JobTracker in Hadoop? How many instances of JobTracker run on a Hadoop Cluster?


Q 5: What is a Task Tracker in Hadoop? How many instances of TaskTracker run on a Hadoop Cluster?

Q 6: What is HIVE?

Q 7: What is PIG?

Pig or Pig Latin is a language.

It helps analyst to concentrate on analytic work by removing map-reduce programming complexity.

PIG is high-level language and it converts its operators into MapReduce code.


Q 8: What is HBase?

Q 9: What is replication factor in HDFS?

Q 10: What is Master-Worker Pattern?

Q 11: In HDFS, Why does system reconstruct block location information every time on start up?

Q 12: What is POSIX (Portable Operating System Interface)?

Q 13: What is data locality optimization?

Q 14: What is the meaning of streaming data access pattern?

Lucene – An Introduction

Lucene is an open source java based search library.

Lucene is very popular and fast search library used in java based application to add document search capability to any kind of application in a very simple and efficient way.

Lucene  is scalable and high-performance library used to index and search virtually any kind of text.

Lucene library provides the core operations which are required by any search application.

Lucene provides a simple-to-use API, which will provide powerful indexing and searching capability.



Components of Lucene

1) Index, 2) Document, 3) Field


Index: Lucene index is the data store that holds all the indexed documents; queries are also executed against the index to fetch the documents.

Document: A document is the default building block for a Lucene index.

Documents can be compared to records in a table. Each document holds a number of fields upon which queries can be executed.

Field: Each Lucene document comprises of one or more fields; it is not necessary that all the fields are indexed, fields can also be stored without indexing.

Lucene search works based on the index, so it is necessary to have the index updated with the latest content to get the best search results.

Vikas Jindal

Big Data – Introduction

This is the era of data or in other words we are living in the age of data. Now Data storages are very cheap and we have lot of easy techniques to capture the data.

People upload videos, take pictures on their smart phones, text friends, update their facebook status, leave comments around the web, clicks on ads and so forth.

Machines are also intelligent and capable generating and keeping more and more data.

Google, Yahoo, Amazon and Microsoft all are facing challenges of the exponential growth of data.

Existing tools are becoming inadequate to process this big data.These tools are not able to go through TeraBytes and PetaBytes of data to figure out which websites are popular, what products are in demand and what kind of ads are appealing to people.

Hadoop (Mapreduce + Hadoop Distributed File System) is the answer of this problem.

Hadoop is not a Big Data. Hadoop is open source analytic software.

Google published Mapreduce in 1990.

Doug Cutting started development of Hadoop System and named it after his son’s toy elephant.

Hadoop uses Google’s MapReduce and Google File System technologies as its foundation.

It is optimized to handle massive quantities of data which could be structured, unstructured or semi-structured, using commodity hardware, that is, relatively inexpensive computers.

This massive parallel processing is done with great performance. However, it is a batch operation handling massive quantities of data, so the response time is not immediate.

Yahoo is the first company which started funding for Hadoop.

Hadoop is not good to process transactions because it is random access.

It is not good when the work cannot be parallelized.

It is not good for low latency data access.

Not good for processing lots of small files.

And not good for intensive calculations with little data.

It is NOT a replacement for a relational database system.

Hadoop is not suitable for OnLine Analytical Processing or Decision Support System workloads where data are sequentially accessed on structured data like a relational database, to generate reports that provide business intelligence.

Open source projects related with Hadoop:

Eclipse is a popular IDE donated by IBM to the open source community.

Lucene is a text search engine library written in Java.

Hbase is the Hadoop database.

Hive provides data warehousing tools to extract, transform and load data, and query this data stored in Hadoop files.

Pig is a platform for analyzing large data sets. It is a high level language for expressing data analysis.

Zoo Keeper is a centralized configuration service and naming registry for large distributed systems.

Avro is a data serialization system.

UIMA is the architecture for the development, discovery, composition and deployment for the analysis of unstructured data.

Vikas Jindal