How To Remove Duplicate Records In SQL?

How To Remove Duplicate Records In SQL?

5 mins read756 Views Comment
Updated on Jan 30, 2023 13:30 IST

We often come across situations in ORACLE PL/SQL and SQL, wherein, although indexes, clustered, non-clustered indexes, primary, unique, and referential integrity constraints are enforced upon, as a precaution to enhance consistency and improve performance, integrity, and concurrency, there are numerous situations like Data import, ETL, Data migration from one DB to another, wherein, removal of duplicate rows is mandatory.

2022_07_Introduction-to-Data-Analysis-Using-SQL.jpg

In this article, we will explore different ways to remove duplicate records in SQL, which can create confusion and provide an inaccurate output of the data required. Here are a few simple methods, which I have explained to be thoroughly familiar and familiarize yourself with the selection and deletion of duplicate records in SQL.

Table of Contents (TOC)

A simple explanation how to remove duplicate record in SQL

Here is an illustration below: 

Employees table(Emp1, with Name1 and Age1 being the columns here): 

This is just a simple illustration for your convenience and easy understanding. 

  • Anitha 45  
  • Harris 30  
  • Lilly 40  
  • Harris 30 
  • Anitha 45 

Hence, in this above illustration, you could identify certain names, Anitha and Harris, which are typically repeated and so as the age. So, How can we quickly remove these? 

You can also explore: FOREIGN KEY IN SQL 

Retrieving duplicate records is the first step: 

Here with enclosed is the query below: 

select Name1,Age1,count(*) from Emp1 group by Name1,Age1 
having(count(*) >1); 

This would render the results (All duplicates) 

Employees(Emp1) Table: 

  • Anitha 45 
  • Harris 30 
  • Harris 30 
  • Anitha 45 

With the sole purpose (Ref: using rowid here) to select only unique/distinct records from the table, for example, if a table has 3 or more records with the same name and age, we could use rowid and select the query with a not in operator to be greater than the inner query which checks the min(rowid).

So, the minimal number of distinct/unique rows are selected and the other Max rowid duplicate rows are deleted. There ought to be a primary key on each table or a candidate key for a combination of columns that uniquely identifies each row on the table. 

You can also explore: SQL RIGHT JOIN – Example and Syntax

If there is no primary key on the table, you typically tend to use ROWID instead of the primary key, since it uniquely identifies each row in a table. 

Y, as a variable here, maybe a primary key or a ROWID.  MIN(y) uniquely identifies/refers to a row/rows in a group. MAX(y) is another function to subsequently identify rows in a group.

If the intention is to have groups of rows, and the intent is to keep exactly one row in each group and delete all the others, then we use MIN(y) (or MAX (y)) to approximately pick which row to keep, and then delete all the others. 

Delete from Emp1 
Where sal 
not in(select min(sal) from Emp1 group by sal order by Name1); 
Delete from Emp1 where sal is not in(select max(sal) from Emp1 group by sal order by Name1); 

So as mentioned above, min and Max coupled with the NOT IN is used Along with the delete statement, to delete relevant minimum and maximum records in the table Emp1. 

An example of how ideally is a Rowid. A query would be: 

SELECT rowid,Name1 FROM Emp1 WHERE Name1="Lilly” ;  

A user can access a row quickly and easily using its row ID. 

Recommended online courses

Best-suited Data Analytics courses for you

Learn Data Analytics with these high-rated online courses

Free
29 hours
1.18 L
12 months
Free
6 hours
97 K
4 months
40 K
160 hours
2.25 L
3 years
Free
1 hours
– / –
35 hours

Remove duplicate records using Row Id

Similarly, To eliminate duplicates rows in a table using row id: 

DELETE FROM table_name A  
WHERE ROWID > (SELECT min(ROWID) FROM table_name B WHERE A.col = B.col); 
delete from Emp1 a 
where rowid > (select min(rowid) from Emp1 b 
where b. Name1=a. Name1 
and b. Age1=a. Age1); 

So, here goes the output: 

  • Name1 Age1 
  • Anitha 45 
  • Harris 30 
  • Lilly 40  

Remove duplicate records using CTE

Another quick deletion of duplicates is by using CTE (Common Table Expressions): 

WITH CTE AS( SELECT Name1,Age1 
R = ROW_NUMBER() OVER (PARTITION BY Name1,Age1 ORDER BY Name1) 
FROM Emp1) 
DELETE FROM CTE WHERE R > 1 

Use a Common Table Expression (CTE) in SQL to group duplicate rows using OVER PARTITION BY, and then generate a unique number for each row of the group using Row_number(). Post this, delete all the duplicates in the groups, and check it for the count of duplicate records greater than 1 (That is, same records greater than 1 count) using the row_number function in this scenario. This would delete all DUPLICATE RECORDS.  

You can also explore: How to use GROUP BY in SQL?

Remove duplicate roecords using analytical functions

You can delete the duplicate rows using this query: 

DELETE FROM EMP1 
WHERE ROWID IN 
(SELECT ID1 FROM 
(SELECT ROWID ID1,ROW_NUMBER() OVER (PARTITION BY Name1,Age1 ORDER BY ROWID) R2 FROM EMP1)  
WHERE R2>1); 

Just for reference purpose: 

The only difference between RANK and DENSE_RANK FUNCTIONS is that when the same number/grade/salary/score is repeated twice or multiple times, RANK will skip the next rank value and would assign the next (After the skipped subsequent) value to the number wherein DENSE_RANK would assign the next chronological value. In the case of row_number, it would be the same rank value. 

For instance, if score/grade 80 is repeated twice for A and B students, then the RANK function for A is 2 as in this scenario, already a student C has scored 100 and the rank is 1 for C. So in this case, B would be assigned 4 by the RANK function(3 skipped) whereas DENSE_RANK would assign 3 which would be the next chronological value to B. 

In the case of Row_number(), the rank for A and B would be the same 2 only. 

Student Score RANK DENSE_RANK 

A 80 2 2 
B 80 4 3 
C 100 1 1 

The above is just a piece of reference for differences among Rank, dense_rank, and row_number. 

Here you can use both RANK() and DENSE_RANK() since both will give unique records when ordered by rowid. 

Remove duplicate records using DENSE_RANK and RANK Functions

DENSE_RANK and RANK functions to remove duplicate records as follows: 

DELETE FROM Emp1  
WHERE rowid IN 
( SELECT id1 from 
    ( SELECT rowid AS id1, 
        RANK() OVER (PARTITION BY Name1,age1 ORDER BY rowid) AS rr 
        FROM Emp1 
  ) 
    WHERE rr > 1); 
DELETE FROM Emp1  
WHERE rowid IN 
( SELECT id1 from 
    ( SELECT rowid AS id1, 
        DENSE_RANK() OVER (PARTITION BY Name1,age1 ORDER BY row ID) AS r1 
        FROM Emp1 
  ) 
    WHERE r1 > 1); 

Similarly, for row_number() in the above mentioned example can be replaced with row_number() as follows: 

DELETE FROM Emp1  
WHERE row ID IN 
( SELECT id1 from 
    ( SELECT row ID AS id1, 
        ROW_NUMBER() OVER (PARTITION BY Name1,age1 ORDER BY row ID) AS r1 
        FROM Emp1 
  ) 
    WHERE r1 > 1); 

These are the various methods to select and delete duplicate records in SQL. 

About the Author
This is a collection of insightful articles from domain experts in the fields of Cloud Computing, DevOps, AWS, Data Science, Machine Learning, AI, and Natural Language Processing. The range of topics caters to upski Read Full Bio