How to remove duplicate rows from Hive table?

Scenario

Have table with duplicate rows in hive table and Want to remove these duplicate rows from hive table.

Approach

Steps:

1) Create a new table from old table (with same structure).
2) Copy distinct rows in new table from existing table.

select col1,col2,col3,col4,max(<duplicate column>) as <name of duplicate column> from <table name> group by col1,col2,col3,col4;
3) Delete old table.

4) Rename new table to old one.

This is a new approach.

Another approach

We can follow an old approach also that databases use while deleting rows.

Steps:

  1. Create a new temp table from old table (with same structure).
  2. Create a new lookup table (with id and flag columns).
  3. Copy ids of duplicate rows.
  4. Start copy rows from old to new table while copying take following steps
    1. Select row from old table and check existence in lookup table.
    2. If exists and flag is not set then
    3. set flag and copy in new table
    4. else skip
  5. Delete or drop old table.
  6. Delete or drop lookup table.
  7. Rename new table to old table.

 

Vikas Jindal