Sunday, September 29, 2024

How to Remove Duplicate Records in SQL Server Using T-SQL

Summary: Learn how to identify and delete duplicate data in SQL Server using ROW_NUMBER(), CTEs, and a flexible dynamic stored procedure to keep your database clean.

How to Remove Duplicate Records in SQL Server

If you're working with databases, one common problem you might face is duplicate data. Duplicates can cause a lot of headaches, especially when it comes to data analysis or reporting. In this post, we’ll walk through a simple way to remove duplicate records in SQL Server using a stored procedure that you can adapt for your own projects.

1. What is a Duplicate Record?

A duplicate record occurs when you have two or more rows in your table with the exact same data in one or more columns. For example, if two rows share the same customer name, email, and phone number, they are duplicates.

2. The Concept of the RemoveDuplicate Procedure

In SQL Server, you can use a stored procedure to automate the process of cleaning data. The RemoveDuplicate procedure we’re exploring helps to:

  • Find duplicate records across any table dynamically.
  • Remove extra copies while leaving exactly one unique row.

3. Understanding the T-SQL Code

3.1 The Procedure and Parameters

The procedure is created with parameters to make it reusable:

CREATE PROCEDURE RemoveDuplicate @SCDID sysname, @IsDebug bit = 0
  • @SCDID: An ID identifying the specific rule and table to work with.
  • @IsDebug: A flag to print the SQL commands for review before execution.

3.2 Using ROW_NUMBER() to Identify Duplicates

The core logic uses a Common Table Expression (CTE) and the ROW_NUMBER() function:

WITH ct AS (
    SELECT *, rn = ROW_NUMBER() OVER(PARTITION BY ColumnSearch ORDER BY ColumnSearch)
    FROM SourceTable
) DELETE FROM ct WHERE rn >= 2;

The PARTITION BY clause groups rows by the columns where you expect duplicates. If three rows are identical, they get assigned numbers 1, 2, and 3. Deleting rn >= 2 removes the extras.

3.3 Advanced Cleanup Using LAG()

For more complex scenarios, we use the LAG() function to compare the current row with the previous one. If they match, the current row is flagged for deletion.

4. SCD2_Rule Table Design

This procedure relies on a configuration table named SCD2_Rule to know which databases and columns to target:

Field Description
SourceDatabaseThe DB containing the duplicates.
SourceObjectThe specific Table name.
ColumnSearchColumns used to identify a "match".
ColumnExcludeColumns to ignore during comparison.

5. Complete Stored Procedure Code

Below is the full, flexible T-SQL code for the RemoveDuplicate procedure:


CREATE PROCEDURE RemoveDuplicate
    @SCDID sysname,
    @IsDebug bit = 0
AS
BEGIN
    DECLARE @Sql AS NVARCHAR(MAX), @rVal varchar(max)

    -- Step 1: Basic Duplicate Removal via ROW_NUMBER
    SELECT @Sql = 'WITH ct AS (
        SELECT *, rn = ROW_NUMBER() OVER(PARTITION BY ' + ColumnSearch + ' ORDER BY ' + ColumnSearch + ')
        FROM ' + CONCAT(SourceDatabase, '.', SourceSchema, '.', SourceObject) + '
    ) DELETE FROM ct WHERE rn >= 2'
    FROM dbo.SCD2_Rule WHERE SCDID = @SCDID

    IF @IsDebug = 1 PRINT @Sql
    EXEC SP_EXECUTESQL @Sql

    -- Step 2: Advanced Cleanup using LAG logic
    -- (Logic builds dynamic SQL based on excluded columns)
    SELECT @Sql = 'WITH ct AS (
        SELECT *, Ckp = CONCAT(' + @rVal + '),
            PRv = LAG(CONCAT(' + @rVal + ')) OVER(PARTITION BY ' + REPLACE(ColumnSearch, ',DataSystemDate', '') + ' ORDER BY ' + ColumnSearch + ')
        FROM ' + CONCAT(SourceDatabase, '.', SourceSchema, '.', SourceObject) + '
    ) DELETE FROM ct WHERE ISNULL(ckp, '''') = ISNULL(PRv, '''')'
    FROM dbo.SCD2_Rule WHERE SCDID = @SCDID

    IF @IsDebug = 1 PRINT @Sql
    EXEC SP_EXECUTESQL @Sql
END
            

Conclusion

Keeping your SQL Server tables clean is vital for performance and accuracy. By using Dynamic SQL and window functions like ROW_NUMBER(), you can automate the tedious task of data deduplication.

Pro Tip: Always back up your data or test the procedure using @IsDebug = 1 before running a DELETE command on a production dataset!

Found this helpful? Subscribe for more T-SQL tips and beginner-friendly SQL Server guides!

Thursday, June 20, 2024

How to Identify Inlineable Scalar UDFs in SQL Server 2019 and later

Summary: Use sys.sql_modules to identify which Scalar User-Defined Functions (UDFs) are eligible for inlining to boost query performance in SQL Server 2019.

Check if Your SQL Scalar Functions are Inlineable

When you're working with SQL Server 2019 (and later), understanding the properties of your functions can be critical for performance optimization. Scalar UDF Inlining is a game-changing feature that automatically transforms scalar functions into relational expressions.

Why Inlining Matters

This query provides insight into whether a scalar UDF can be inlined. If a function is inlineable, SQL Server essentially expands the function's logic directly into the calling query. This avoids the massive performance overhead of row-by-row function calls and context switching.


-- Identify scalar functions and their inlining status
SELECT 
    OBJECT_NAME(m.object_id) AS [Function Name],
    m.is_inlineable,
    m.inline_type,
    efs.total_worker_time,
    efs.execution_count
FROM sys.sql_modules AS m WITH (NOLOCK)
LEFT OUTER JOIN sys.dm_exec_function_stats AS efs WITH (NOLOCK) 
    ON m.object_id = efs.object_id
WHERE efs.type_desc = N'SQL_SCALAR_FUNCTION'
   OR m.object_id IN (SELECT object_id FROM sys.objects WHERE type = 'FN')
OPTION (RECOMPILE);
        

Understanding the Results

  • is_inlineable: A value of 1 means the function meets the requirements for inlining.
  • inline_type: Indicates whether inlining is currently turned on (1) or off (0) for that specific module.

If your function shows is_inlineable = 0, it might be due to the use of certain non-deterministic functions (like GETDATE()) or specific T-SQL constructs that prevent the optimizer from inlining the code.


For more detailed information on requirements and system views, refer to the official Microsoft documentation:

Want to speed up your queries? Identifying non-inlineable functions is the first step toward significant performance gains in SQL Server 2019!