Sunday, September 29, 2024

How to Remove Duplicate Records in SQL Server Using T-SQL

Summary: Learn how to identify and delete duplicate data in SQL Server using ROW_NUMBER(), CTEs, and a flexible dynamic stored procedure to keep your database clean.

How to Remove Duplicate Records in SQL Server

If you're working with databases, one common problem you might face is duplicate data. Duplicates can cause a lot of headaches, especially when it comes to data analysis or reporting. In this post, we’ll walk through a simple way to remove duplicate records in SQL Server using a stored procedure that you can adapt for your own projects.

1. What is a Duplicate Record?

A duplicate record occurs when you have two or more rows in your table with the exact same data in one or more columns. For example, if two rows share the same customer name, email, and phone number, they are duplicates.

2. The Concept of the RemoveDuplicate Procedure

In SQL Server, you can use a stored procedure to automate the process of cleaning data. The RemoveDuplicate procedure we’re exploring helps to:

  • Find duplicate records across any table dynamically.
  • Remove extra copies while leaving exactly one unique row.

3. Understanding the T-SQL Code

3.1 The Procedure and Parameters

The procedure is created with parameters to make it reusable:

CREATE PROCEDURE RemoveDuplicate @SCDID sysname, @IsDebug bit = 0
  • @SCDID: An ID identifying the specific rule and table to work with.
  • @IsDebug: A flag to print the SQL commands for review before execution.

3.2 Using ROW_NUMBER() to Identify Duplicates

The core logic uses a Common Table Expression (CTE) and the ROW_NUMBER() function:

WITH ct AS (
    SELECT *, rn = ROW_NUMBER() OVER(PARTITION BY ColumnSearch ORDER BY ColumnSearch)
    FROM SourceTable
) DELETE FROM ct WHERE rn >= 2;

The PARTITION BY clause groups rows by the columns where you expect duplicates. If three rows are identical, they get assigned numbers 1, 2, and 3. Deleting rn >= 2 removes the extras.

3.3 Advanced Cleanup Using LAG()

For more complex scenarios, we use the LAG() function to compare the current row with the previous one. If they match, the current row is flagged for deletion.

4. SCD2_Rule Table Design

This procedure relies on a configuration table named SCD2_Rule to know which databases and columns to target:

Field Description
SourceDatabaseThe DB containing the duplicates.
SourceObjectThe specific Table name.
ColumnSearchColumns used to identify a "match".
ColumnExcludeColumns to ignore during comparison.

5. Complete Stored Procedure Code

Below is the full, flexible T-SQL code for the RemoveDuplicate procedure:


CREATE PROCEDURE RemoveDuplicate
    @SCDID sysname,
    @IsDebug bit = 0
AS
BEGIN
    DECLARE @Sql AS NVARCHAR(MAX), @rVal varchar(max)

    -- Step 1: Basic Duplicate Removal via ROW_NUMBER
    SELECT @Sql = 'WITH ct AS (
        SELECT *, rn = ROW_NUMBER() OVER(PARTITION BY ' + ColumnSearch + ' ORDER BY ' + ColumnSearch + ')
        FROM ' + CONCAT(SourceDatabase, '.', SourceSchema, '.', SourceObject) + '
    ) DELETE FROM ct WHERE rn >= 2'
    FROM dbo.SCD2_Rule WHERE SCDID = @SCDID

    IF @IsDebug = 1 PRINT @Sql
    EXEC SP_EXECUTESQL @Sql

    -- Step 2: Advanced Cleanup using LAG logic
    -- (Logic builds dynamic SQL based on excluded columns)
    SELECT @Sql = 'WITH ct AS (
        SELECT *, Ckp = CONCAT(' + @rVal + '),
            PRv = LAG(CONCAT(' + @rVal + ')) OVER(PARTITION BY ' + REPLACE(ColumnSearch, ',DataSystemDate', '') + ' ORDER BY ' + ColumnSearch + ')
        FROM ' + CONCAT(SourceDatabase, '.', SourceSchema, '.', SourceObject) + '
    ) DELETE FROM ct WHERE ISNULL(ckp, '''') = ISNULL(PRv, '''')'
    FROM dbo.SCD2_Rule WHERE SCDID = @SCDID

    IF @IsDebug = 1 PRINT @Sql
    EXEC SP_EXECUTESQL @Sql
END
            

Conclusion

Keeping your SQL Server tables clean is vital for performance and accuracy. By using Dynamic SQL and window functions like ROW_NUMBER(), you can automate the tedious task of data deduplication.

Pro Tip: Always back up your data or test the procedure using @IsDebug = 1 before running a DELETE command on a production dataset!

Found this helpful? Subscribe for more T-SQL tips and beginner-friendly SQL Server guides!

Popular Posts