How to remove duplicate rows from Hive table?

Scenario

Have table with duplicate rows in hive table and Want to remove these duplicate rows from hive table.

Approach

Steps:

1) Create a new table from old table (with same structure).
2) Copy distinct rows in new table from existing table.

select col1,col2,col3,col4,max(<duplicate column>) as <name of duplicate column> from <table name> group by col1,col2,col3,col4;
3) Delete old table.

4) Rename new table to old one.

This is a new approach.

Another approach

We can follow an old approach also that databases use while deleting rows.

Steps:

  1. Create a new temp table from old table (with same structure).
  2. Create a new lookup table (with id and flag columns).
  3. Copy ids of duplicate rows.
  4. Start copy rows from old to new table while copying take following steps
    1. Select row from old table and check existence in lookup table.
    2. If exists and flag is not set then
    3. set flag and copy in new table
    4. else skip
  5. Delete or drop old table.
  6. Delete or drop lookup table.
  7. Rename new table to old table.

 

Vikas Jindal

 

 

Hive – Useful Commands

Hive is a data warehousing infrastructure based on Apache Hadoop. Hadoop provides massive scale out and fault tolerance capabilities for data storage and processing on commodity hardware.

Q. Which directory is created after creating Hive Table?

To see table primary info of Hive table, use describe table_name;

To see more detailed information about the table, use describe extended table_name;

To see code in a clean manner use describe formatted table_name; command to see all information. also describe all details in a clean manner.

Q. How to generate the create statement for an existing hive table?

show create table

Q. How to see query plan in Hive?

explain select * from

Q. How to run Query from command line?

hive -e ‘select a.col from’

Q. How to dump data out from a query into a file using silent mode?

hive -S -e ‘select a.col from tab1 a’ > a.txt

Q How to list all databases?

Show databases;

Q How to list all tables?

Show tables;

Q How to list all partitions in table?

Show Partitions;

Q How to delete partition?

ALTER TABLE <tablename> Drop PARTITION (<partitionname>);

Example

ALTER TABLE db1.person Drop PARTITION (year=2016, month=01);

Q How to rename a table?

ALTER TABLE RENAME TO ;

Q How to check locks on a table?

Show locks;

Q How to lock on a table?

Lock table;

Q How to unlock on a table?

Unlock table;

Q How to check concurrency in hive?

set hive.support.concurrency;

Q How to check indexes in hive table?

SHOW INDEX ON;

SHOW CREATE TABLE – shows the CREATE TABLE statement that creates a given table, or the CREATE VIEW statement that creates a given view.

SHOW CREATE TABLE ([db_name.]<table_name|view_name>);

Q How to copy a hive table?

create table . as select * from .<tablename>;