SQL delete duplicate rows

Deleting duplicate rows in a database table is a common task in SQL, and it’s important for maintaining data integrity. Duplicate rows can occur for various reasons, such as data entry errors or system glitches. Here’s a guide on how to delete duplicate rows in SQL:

Identifying Duplicate Rows

Before deleting duplicate rows, you need to identify them. You can use the GROUP BY clause along with the HAVING clause to find rows where certain columns have duplicate values. For example, if you have a table called your_table and you want to find duplicate rows based on the values in columns col1 and col2, you can use the following query:

SELECT col1, col2, COUNT(*)
FROM your_table
GROUP BY col1, col2
HAVING COUNT(*) > 1;

This query returns each combination of col1 and col2 that occurs more than once in the table, together with the number of rows that share it.
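The grouped query reports only the duplicated values, not the individual rows. To inspect the rows themselves before deleting anything, one option is a window count, sketched below. The alias dup_count and the derived-table name counted are made up for illustration, and the query assumes a database with window function support (for example PostgreSQL, SQL Server, or MySQL 8.0 and later).

```sql
-- List every row that belongs to a duplicated (col1, col2) group.
SELECT *
FROM (
    SELECT
        t.*,
        -- Number of rows sharing this row's (col1, col2) values.
        COUNT(*) OVER (PARTITION BY col1, col2) AS dup_count
    FROM your_table AS t
) AS counted
WHERE dup_count > 1
ORDER BY col1, col2;
```

Reviewing this output first makes it easier to decide which copy in each group should survive.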

Deleting Duplicate Rows

Once you’ve identified the duplicate rows, you can use the DELETE statement to remove them. The most direct form deletes every row whose combination of col1 and col2 appears more than once:

DELETE FROM your_table
WHERE (col1, col2) IN (
    SELECT col1, col2
    FROM your_table
    GROUP BY col1, col2
    HAVING COUNT(*) > 1
);

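This query deletes all rows where the combination of values in col1 and col2 is duplicated, including the first occurrence, so no copy of a duplicated row is left behind. Use it only when that is what you want. It is also not portable as written: SQL Server does not accept the row-value form (col1, col2) IN (...), and MySQL rejects a DELETE whose subquery reads directly from the table being deleted from.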
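To keep one row from each group instead, a common pattern is to keep the row with the smallest key and delete the rest. The sketch below assumes that your_table has a unique, non-null id column. That column is not part of the example above, and the names keep_id and keepers are likewise illustrative.

```sql
-- Assumption: your_table has a unique, non-null column named id.
DELETE FROM your_table
WHERE id NOT IN (
    SELECT keep_id
    FROM (
        -- One surviving row per (col1, col2) group: the lowest id.
        SELECT MIN(id) AS keep_id
        FROM your_table
        GROUP BY col1, col2
    ) AS keepers
);
```

The extra derived table (keepers) is there so that MySQL materializes the subquery instead of rejecting a read from the table being modified. Other systems should accept it with or without that wrapper.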

Using Row Number to Keep One Instance

Another approach is to use the ROW_NUMBER() window function to assign a unique number to each row within a partition of the duplicate values. You can then delete all rows with a row number greater than 1. Here’s an example:

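WITH CTE AS (
    SELECT
        col1,
        col2,
        -- Number the rows within each (col1, col2) group, starting at 1.
        ROW_NUMBER() OVER (PARTITION BY col1, col2 ORDER BY col1) AS RowNum
    FROM your_table
)
-- Keep the row numbered 1 in each group and delete the others.
DELETE FROM CTE WHERE RowNum > 1;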

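This query uses a Common Table Expression (CTE) and the ROW_NUMBER function to number the rows within each partition defined by col1 and col2. The DELETE statement then removes rows with a row number greater than 1, leaving one row per group. Because col1 has the same value throughout a partition, ORDER BY col1 does not determine which row is numbered 1; order by a timestamp or key column if you need to keep a specific row. Deleting through a CTE like this is SQL Server syntax; PostgreSQL and MySQL do not allow a CTE as the target of DELETE.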
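On systems that do not allow deleting through a CTE, the same ROW_NUMBER idea can be expressed as a subquery that selects the keys of the surplus rows. This is a sketch under the assumption that your_table has a unique id column, which the original example does not show. The names row_num and ranked are illustrative.

```sql
-- Assumption: your_table has a unique column named id.
DELETE FROM your_table
WHERE id IN (
    SELECT id
    FROM (
        SELECT
            id,
            -- Ordering by id makes the lowest id in each group row 1.
            ROW_NUMBER() OVER (PARTITION BY col1, col2 ORDER BY id) AS row_num
        FROM your_table
    ) AS ranked
    WHERE row_num > 1
);
```

Ordering by id makes the choice of surviving row deterministic. Swapping in a different ORDER BY column, such as a last-modified timestamp in descending order, would keep the most recent row instead.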

Caution

Before executing DELETE statements, it’s crucial to have a backup of your data or to work in a test environment to avoid accidental data loss. Always be certain that you are deleting the correct duplicate rows.
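One way to follow this advice is to copy the table first and run the delete inside a transaction, checking the result before making it permanent. The sketch below uses PostgreSQL/MySQL-style syntax and assumes a unique id column. The backup table name your_table_backup is made up for illustration.

```sql
-- Keep a copy of the data before changing it.
-- (SQL Server would use SELECT * INTO your_table_backup FROM your_table.)
CREATE TABLE your_table_backup AS
SELECT * FROM your_table;

-- Run the delete inside a transaction so it can be undone.
-- (SQL Server spells this BEGIN TRANSACTION.)
START TRANSACTION;

-- Assumption: your_table has a unique, non-null column named id.
DELETE FROM your_table
WHERE id NOT IN (
    SELECT keep_id
    FROM (
        SELECT MIN(id) AS keep_id
        FROM your_table
        GROUP BY col1, col2
    ) AS keepers
);

-- Verify: this should now return no rows.
SELECT col1, col2, COUNT(*)
FROM your_table
GROUP BY col1, col2
HAVING COUNT(*) > 1;

-- Make the change permanent, or run ROLLBACK instead to undo it.
COMMIT;
```

In MySQL the rollback only works for tables on a transactional storage engine such as InnoDB.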

Keep in mind that the syntax varies between database systems (e.g., MySQL, PostgreSQL, SQL Server). Not every system supports row-value comparisons with IN, window functions, or deleting through a CTE, so it’s a good practice to consult the documentation for your specific database.