Tuesday, January 11, 2011

Dedup on huge table

Hi friends
At this time I would like to share with you my experience to delete duplicates in very large table(800 million of rows).
A very general method is to use ROW_NUMBER function to PARTITION ON desired columns and then filter out only unique data.

WITH cte
AS
(
SELECT ,ROW_NUMBER() OVER (PARTITION BY ORDER BY ) rn
FROM tbl
) SELECT * FROM cte WHERE rn=1

As you imagine on huge table it will take too long. In order to optimize that query I used batch processing (divided that transaction into small chunks)


DECLARE @x INT
SET @x = 1
WHILE @x < 44,000,000 -- Set appropriately
BEGIN

;WITH cte
AS
(
SELECT ,ROW_NUMBER() OVER (PARTITION BY ORDER BY ) rn
FROM tbl WHERE ID BETWEEN @x AND @x + 10000
)SELECT * FROM cte WHERE rn=1

SET @x = @x + 10000
END
Ok ,it worked much faster as we have a clustered index on ID column such as SQL Server uses it to get the data based on defined range.However, we have another problem that data we are getting back is not actually unique.You see that for specific range ,say (from 1 to 10000) I get the data based on required partition and filter out for rn=1, BUT it is possible that the same row will occur in the next chunk (from 10001 to 20000) and we will also get it back because SQL Server does not recognize it as duplicate we have already got from the first chunk.

More reliable solution is checking on entire table and not to base on ranges.
SELECT ,COUNT(*) rn FROM
tbl GROUP BY
HAVING COUNT(*)>1

As you can see it could take long time ,so I also tried to create a Dedupt table with a key IGNORE_DUP_KEY option (thanks to Hugo) but insert into the table was pretty slow as well. Peter Larsson a fellow MVP has suggested the below technique that worked pretty well

CREATE TABLEe #unique (id primary key clustered)
INSERT INTO #temp (id) SELECT MIN(ID) AS ID
FROM tbl
GROUP BY Col1, col2, col3... (here you decide the uniqueness)

--And then insert into batches for

SET @id = 0
WHILE @id < (800 million or more)
BEGIN
SELECT t1.ID, t1.Col
FROM dbo.Table1 AS T1
INNER JOIN #unique as u ONu.id = t1.id
AND u.id BETWEEN @id AND @id +99999

SET @id += 100000
END

It would be great it put back here testing results.