Identifying duplicates is quite a cumbersome task when it comes to large databases. There are many ways in which one can identify duplicates present in a table. One of the popular ways of identifying duplicates is by grouping the data and counting the number of records for that group. To explain this, consider the table fact table below.
Employee table
Employee name | Company | Salary | Designation |
Adam | ABC Corp | 10000 | Senior tester |
Bill | IJK Corp | 20000 | Test Engg |
Adam | PQR Corp | 30000 | Developer |
Dave | TUV Corp | 40000 | Manager |
Adam | XYZ Corp | 50000 | Lead |
Here we can notice that Employee Name Adam is repeated several times. In order to identify the duplicate records, the following query can be helpful
SELECT E.[Employee Name], Company, Salary, Designation FROM EmployeeTable E INNER JOIN
(SELECT [Employee Name],
COUNT([Employee Name]) AS [Number of Occurances] FROM EmployeeTable GROUP BY [Employee Name] HAVING COUNT([Employee Name]) > 1)
AS A ON A.[Employee Name] = E.[Employee Name]
There are few Ranking functions that are made available in SQL Server 2005 onwards which makes identifying Duplicates more easier, consider the query below:
SELECT [Employee Name], Company, Salary, Designation, ROW_NUMBER() OVER (PARTITION BY [Employee Name] ORDER BY [Employee Name]) AS [Row Id] FROM EmployeeTable
In the above query, the derived column [Row Id], will have the rank allotted to each row and if there are more than one occurrences of a row, the ROW_NUMBER() function will increment the rank of the row, by which one can identify the number of times the row is repeated.
No comments:
Post a Comment