Have table with duplicate rows in hive table and Want to remove these duplicate rows from hive table.
1) Create a new table from old table (with same structure).
2) Copy distinct rows in new table from existing table.
select col1,col2,col3,col4,max(<duplicate column>) as <name of duplicate column> from <table name> group by col1,col2,col3,col4;
3) Delete old table.
4) Rename new table to old one.
This is a new approach.
We can follow an old approach also that databases use while deleting rows.
- Create a new temp table from old table (with same structure).
- Create a new lookup table (with id and flag columns).
- Copy ids of duplicate rows.
- Start copy rows from old to new table while copying take following steps
- Select row from old table and check existence in lookup table.
- If exists and flag is not set then
- set flag and copy in new table
- else skip
- Delete or drop old table.
- Delete or drop lookup table.
- Rename new table to old table.