BI Testing and DW/SQL knowledge: SQL query for identifying Duplicates

Identifying duplicates is quite a cumbersome task when it comes to large databases. There are many ways in which one can identify duplicates present in a table. One of the popular ways of identifying duplicates is by grouping the data and counting the number of records for that group. To explain this, consider the table fact table below.

Employee table

Employee name	Company	Salary	Designation
Adam	ABC Corp	10000	Senior tester
Bill	IJK Corp	20000	Test Engg
Adam	PQR Corp	30000	Developer
Dave	TUV Corp	40000	Manager
Adam	XYZ Corp	50000	Lead

Here we can notice that Employee Name Adam is repeated several times. In order to identify the duplicate records, the following query can be helpful
SELECT E.[Employee Name], Company, Salary, Designation FROM EmployeeTable E INNER JOIN
(SELECT [Employee Name],
COUNT([Employee Name]) AS [Number of Occurances] FROM EmployeeTable GROUP BY [Employee Name] HAVING COUNT([Employee Name]) > 1)
AS A ON A.[Employee Name] = E.[Employee Name]

There are few Ranking functions that are made available in SQL Server 2005 onwards which makes identifying Duplicates more easier, consider the query below:
SELECT [Employee Name], Company, Salary, Designation, ROW_NUMBER() OVER (PARTITION BY [Employee Name] ORDER BY [Employee Name]) AS [Row Id] FROM EmployeeTable

In the above query, the derived column [Row Id], will have the rank allotted to each row and if there are more than one occurrences of a row, the ROW_NUMBER() function will increment the rank of the row, by which one can identify the number of times the row is repeated.

BI Testing and DW/SQL knowledge

Friday, July 30, 2010

SQL query for identifying Duplicates

No comments:

Post a Comment

Easy Search Bar

Labels

Briefing abt myself

Followers