How to remove duplicate rows from Hive table?

Scenario

Have table with duplicate rows in hive table and Want to remove these duplicate rows from hive table.

Approach

Steps:

1) Create a new table from old table (with same structure).
2) Copy distinct rows in new table from existing table.

select col1,col2,col3,col4,max(<duplicate column>) as <name of duplicate column> from <table name> group by col1,col2,col3,col4;
3) Delete old table.

4) Rename new table to old one.

This is a new approach.

Another approach

We can follow an old approach also that databases use while deleting rows.

Steps:

  1. Create a new temp table from old table (with same structure).
  2. Create a new lookup table (with id and flag columns).
  3. Copy ids of duplicate rows.
  4. Start copy rows from old to new table while copying take following steps
    1. Select row from old table and check existence in lookup table.
    2. If exists and flag is not set then
    3. set flag and copy in new table
    4. else skip
  5. Delete or drop old table.
  6. Delete or drop lookup table.
  7. Rename new table to old table.

 

Vikas Jindal

 

 

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s