Sunday, September 29, 2024

How to Remove Duplicate Records in SQL Server

How to Remove Duplicate Records in SQL Server

If you're working with databases, one common problem you might face is duplicate data. Duplicates can cause a lot of headaches, especially when it comes to data analysis or reporting. But don’t worry! In this post, we’ll walk through a simple way to remove duplicate records in SQL Server using a stored procedure that you can use in your own projects.

We’ll break down the SQL code step-by-step to make sure you understand how it works. By the end of this tutorial, you’ll be able to use this technique to keep your data clean and organized.

Step 1: What is a Duplicate Record?

A duplicate record is when you have two or more rows in your table with the exact same data in one or more columns. For example, if you have a list of customers and two customers have the same name, email, and phone number, those would be considered duplicates.

Step 2: The Concept of the RemoveDuplicate Procedure

In SQL Server, you can use a stored procedure to automate the process of removing duplicates. A stored procedure is like a saved set of SQL commands that can be executed later. The RemoveDuplicate procedure that we’ll look at helps to:

  • Find duplicate records in a table.
  • Remove the extra copies, leaving only one unique row.

Step 3: Understanding the Code

Let’s break down the SQL code that removes duplicates. Don’t worry if it seems complex at first. We’ll explain it in a way that’s easy to follow.

Step 3.1: The Procedure and Parameters

The procedure is created with the following parameters:

CREATE PROCEDURE RemoveDuplicate @SCDID sysname, @IsDebug bit = 0
  • @SCDID: This is an ID that identifies the specific rule you want to apply for removing duplicates. It helps the procedure know which table to work with.
  • @IsDebug: This is a flag that lets you print out the SQL commands for debugging. If you're just starting out, you can use this to see the queries that will be executed.

Step 3.2: Using ROW_NUMBER() to Identify Duplicates

The first part of the code uses a Common Table Expression (CTE) and the ROW_NUMBER() function to label rows:

WITH ct AS (
    SELECT *, rn = ROW_NUMBER() OVER(PARTITION BY ColumnSearch ORDER BY ColumnSearch)
    FROM SourceDatabase.SourceSchema.SourceObject
)

The ROW_NUMBER() function assigns a unique number to each row, starting from 1. The key part is PARTITION BY, which groups rows based on a specific column. For example, if you're looking for duplicates in the "Email" column, all rows with the same email will be grouped together.

After the row numbers are assigned, we can delete the duplicates.

Step 3.3: Removing Duplicates

Now that the rows are numbered, we can delete the duplicates by selecting rows with rn >= 2 (anything that’s not the first occurrence):

DELETE FROM ct WHERE rn >= 2;

This means that if there are multiple rows with the same data in the ColumnSearch, all except the first one will be deleted.

Step 3.4: Handling Column Exclusions

Sometimes, you don’t want to compare all columns for duplicates. In that case, we use the ColumnExclude parameter to exclude certain columns from the comparison. The code dynamically builds a list of column names, excluding any columns specified in ColumnExclude:

SELECT @rOut = STRING_AGG(c.name, ',')
    FROM SourceDatabase.sys.COLUMNS c
    WHERE c.name NOT IN (@ColumnExclude)

Step 3.5: Final Cleanup Using LAG()

In the final step, we use the LAG() function, which allows us to compare each row with the previous row in a group. If the current row has the same data as the previous one, it’s considered a duplicate and will be removed:

WITH ct AS (
    SELECT *, Ckp = CONCAT(@rval),
           PRv = LAG(CONCAT(@rval)) OVER(PARTITION BY ColumnSearch ORDER BY ColumnSearch)
    FROM SourceDatabase.SourceSchema.SourceObject
)
DELETE FROM ct WHERE ISNULL(ckp, '') = ISNULL(PRv, '');

Step 4: Executing the Procedure

Once you understand how the procedure works, you can execute it to remove duplicates. Simply call the procedure with the SCDID (rule ID) and @IsDebug parameter (set to 1 for debugging):

EXEC RemoveDuplicate @SCDID = 'yourRuleID', @IsDebug = 1;

If @IsDebug is set to 1, the procedure will print out the SQL commands it is going to execute, so you can check them before they run.

Step 5: Original Table Design for SCD2_Rule

Here’s the original design of the SCD2_Rule table used in the stored procedure. This table contains information about your source and target tables and the columns used for comparison:

SCDID SourceDatabase SourceSchema SourceObject TargetDatabase TargetSchema TargetObject ColumnSearch ColumnExclude
sysname sysname sysname sysname sysname sysname sysname nvarchar(-1) nvarchar(-1)

Step 6: Complete SQL Stored Procedure Code

Here’s the full SQL code for the RemoveDuplicate procedure, including everything we’ve discussed so far:


CREATE PROCEDURE RemoveDuplicate
    @SCDID sysname,
    @IsDebug bit = 0
AS
BEGIN
    DECLARE @Sql AS NVARCHAR(MAX), @rVal varchar(max)

    -- Fetch the SQL query template for the given SCDID
    SELECT @Sql = 'WITH ct AS (
        SELECT *, rn = ROW_NUMBER() OVER(PARTITION BY ' + ColumnSearch + ' ORDER BY ' + ColumnSearch + ')
        FROM ' + CONCAT(SourceDatabase, '.', SourceSchema, '.', SourceObject) + '
    ) DELETE FROM ct WHERE rn >= 2'
    FROM dbo.SCD2_Rule WHERE SCDID = @SCDID

    -- Debugging: Print the dynamic SQL if IsDebug is set to 1
    IF @IsDebug = 1
        PRINT @Sql

    -- Execute the dynamic SQL to remove duplicates
    EXEC SP_EXECUTESQL @Sql

    -- Other duplicate handling: Exclude specified columns
    SELECT @Sql = 'SELECT @rOut = STRING_AGG(c.name, '','')
        FROM ' + SourceDatabase + '.sys.COLUMNS c
        JOIN ' + SourceDatabase + '.sys.tables t ON c.object_id = t.object_id
        WHERE t.name = ''' + SourceObject + ''' 
        AND c.name NOT IN (' + b.a + ') AND c.name <> ''DataValidTo'''
    FROM dbo.SCD2_Rule a
    OUTER APPLY (
        SELECT STRING_AGG(''' + value + ''', ',') a
        FROM STRING_SPLIT(a.ColumnSearch + ',' + a.ColumnExclude, ',') b1
    ) b
    WHERE SCDID = @SCDID

    -- Debugging: Print the dynamic SQL for column exclusions
    IF @IsDebug = 1
        PRINT CONCAT('@Sql : ', @Sql)

    -- Execute the query for excluding columns
    EXEC sp_executesql @Sql, N'@rOut varchar(max) OUTPUT', @rOut=@rVal OUTPUT;

    -- Debugging: Print the result of excluded columns
    IF @IsDebug = 1
        PRINT CONCAT('@rVal : ', @rVal)

    -- Final cleanup: Remove records with identical values in key columns
    SELECT @Sql = 'WITH ct AS (
        SELECT *, Ckp = CONCAT(' + @rVal + '),
            PRv = LAG(CONCAT(' + @rVal + ')) OVER(PARTITION BY ' + REPLACE(ColumnSearch, ',DataSystemDate', '') + ' ORDER BY ' + ColumnSearch + ')
        FROM ' + CONCAT(SourceDatabase, '.', SourceSchema, '.', SourceObject) + '
    ) DELETE FROM ct WHERE ISNULL(ckp, '''') = ISNULL(PRv, '''')'
    FROM dbo.SCD2_Rule WHERE SCDID = @SCDID

    -- Debugging: Print the final SQL for cleanup
    IF @IsDebug = 1
        PRINT @Sql

    -- Execute the final cleanup query
    EXEC SP_EXECUTESQL @Sql
END
        

Step 7: Conclusion

You’ve just learned how to remove duplicate records from your SQL Server tables using a stored procedure. The RemoveDuplicate procedure is powerful because it allows you to automate the process of cleaning up your data.

Key Takeaways:

  • ROW_NUMBER() is used to assign a unique number to each row based on certain columns.
  • LAG() compares rows to detect duplicates.
  • Dynamic SQL is used to make the procedure flexible for different tables and columns.
  • Debugging is made easy with the @IsDebug flag to see the SQL commands before they run.

Now you can keep your SQL Server tables clean by removing duplicate data with ease. If you have any questions or need further clarification, feel free to ask in the comments!

Bonus Tip:

If you're working with large datasets, be sure to test the procedure on a smaller subset of your data before running it on your production database. Always back up your data first!

If you found this tutorial helpful, be sure to share it with your fellow SQL enthusiasts or subscribe for more beginner-friendly guides!

No comments:

Popular Posts