Summary: Learn how to identify and delete duplicate data in SQL Server using ROW_NUMBER(), CTEs, and a flexible dynamic stored procedure to keep your database clean.
How to Remove Duplicate Records in SQL Server
If you're working with databases, one common problem you might face is duplicate data. Duplicates can cause a lot of headaches, especially when it comes to data analysis or reporting. In this post, we’ll walk through a simple way to remove duplicate records in SQL Server using a stored procedure that you can adapt for your own projects.
1. What is a Duplicate Record?
A duplicate record occurs when you have two or more rows in your table with the exact same data in one or more columns. For example, if two rows share the same customer name, email, and phone number, they are duplicates.
2. The Concept of the RemoveDuplicate Procedure
In SQL Server, you can use a stored procedure to automate the process of cleaning data. The RemoveDuplicate procedure we’re exploring helps to:
- Find duplicate records across any table dynamically.
- Remove extra copies while leaving exactly one unique row.
3. Understanding the T-SQL Code
3.1 The Procedure and Parameters
The procedure is created with parameters to make it reusable:
CREATE PROCEDURE RemoveDuplicate @SCDID sysname, @IsDebug bit = 0
- @SCDID: An ID identifying the specific rule and table to work with.
- @IsDebug: A flag to print the SQL commands for review before execution.
3.2 Using ROW_NUMBER() to Identify Duplicates
The core logic uses a Common Table Expression (CTE) and the ROW_NUMBER() function:
WITH ct AS (
SELECT *, rn = ROW_NUMBER() OVER(PARTITION BY ColumnSearch ORDER BY ColumnSearch)
FROM SourceTable
) DELETE FROM ct WHERE rn >= 2;
The PARTITION BY clause groups rows by the columns where you expect duplicates. If three rows are identical, they get assigned numbers 1, 2, and 3. Deleting rn >= 2 removes the extras.
3.3 Advanced Cleanup Using LAG()
For more complex scenarios, we use the LAG() function to compare the current row with the previous one. If they match, the current row is flagged for deletion.
4. SCD2_Rule Table Design
This procedure relies on a configuration table named SCD2_Rule to know which databases and columns to target:
| Field | Description |
|---|---|
| SourceDatabase | The DB containing the duplicates. |
| SourceObject | The specific Table name. |
| ColumnSearch | Columns used to identify a "match". |
| ColumnExclude | Columns to ignore during comparison. |
5. Complete Stored Procedure Code
Below is the full, flexible T-SQL code for the RemoveDuplicate procedure:
CREATE PROCEDURE RemoveDuplicate
@SCDID sysname,
@IsDebug bit = 0
AS
BEGIN
DECLARE @Sql AS NVARCHAR(MAX), @rVal varchar(max)
-- Step 1: Basic Duplicate Removal via ROW_NUMBER
SELECT @Sql = 'WITH ct AS (
SELECT *, rn = ROW_NUMBER() OVER(PARTITION BY ' + ColumnSearch + ' ORDER BY ' + ColumnSearch + ')
FROM ' + CONCAT(SourceDatabase, '.', SourceSchema, '.', SourceObject) + '
) DELETE FROM ct WHERE rn >= 2'
FROM dbo.SCD2_Rule WHERE SCDID = @SCDID
IF @IsDebug = 1 PRINT @Sql
EXEC SP_EXECUTESQL @Sql
-- Step 2: Advanced Cleanup using LAG logic
-- (Logic builds dynamic SQL based on excluded columns)
SELECT @Sql = 'WITH ct AS (
SELECT *, Ckp = CONCAT(' + @rVal + '),
PRv = LAG(CONCAT(' + @rVal + ')) OVER(PARTITION BY ' + REPLACE(ColumnSearch, ',DataSystemDate', '') + ' ORDER BY ' + ColumnSearch + ')
FROM ' + CONCAT(SourceDatabase, '.', SourceSchema, '.', SourceObject) + '
) DELETE FROM ct WHERE ISNULL(ckp, '''') = ISNULL(PRv, '''')'
FROM dbo.SCD2_Rule WHERE SCDID = @SCDID
IF @IsDebug = 1 PRINT @Sql
EXEC SP_EXECUTESQL @Sql
END
Conclusion
Keeping your SQL Server tables clean is vital for performance and accuracy. By using Dynamic SQL and window functions like ROW_NUMBER(), you can automate the tedious task of data deduplication.
@IsDebug = 1 before running a DELETE command on a production dataset!
Found this helpful? Subscribe for more T-SQL tips and beginner-friendly SQL Server guides!
No comments:
New comments are not allowed.