Friday, July 30, 2010

SQL query for identifying Duplicates


Identifying duplicates is quite a cumbersome task when it comes to large databases. There are many ways in which one can identify duplicates present in a table. One of the popular ways of identifying duplicates is by grouping the data and counting the number of records for that group. To explain this, consider the table fact table below.



Employee table

Employee name
Company
Salary
Designation
Adam
ABC Corp
10000
Senior tester
Bill
IJK Corp
20000
Test Engg
Adam
PQR Corp
30000
Developer
Dave
TUV Corp
40000
Manager
Adam
XYZ Corp
50000
Lead


Here we can notice that Employee Name Adam is repeated several times. In order to identify the duplicate records, the following query can be helpful
SELECT E.[Employee Name], Company, Salary, Designation FROM EmployeeTable E INNER JOIN
(SELECT [Employee Name],
COUNT([Employee Name]) AS [Number of Occurances] FROM EmployeeTable GROUP BY [Employee Name] HAVING COUNT([Employee Name]) > 1)
AS A ON A.[Employee Name] = E.[Employee Name]


There are few Ranking functions that are made available in SQL Server 2005 onwards which makes identifying Duplicates more easier, consider the query below:

SELECT [Employee Name], Company, Salary, Designation, ROW_NUMBER() OVER (PARTITION BY [Employee Name] ORDER BY [Employee Name]) AS [Row Id] FROM EmployeeTable
In the above query, the derived column [Row Id], will have the rank allotted to each row and if there are more than one occurrences of a row, the ROW_NUMBER() function will increment the rank of the row, by which one can identify the number of times the row is repeated.

No comments:

Post a Comment


All Rights Reserved @ Raju Das