This time I would like to share my experience of deleting duplicates from a very large table (800 million rows).
A very common method is to use the ROW_NUMBER function with PARTITION BY on the desired columns and then filter out only the unique rows:
;WITH cte AS (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY col1, col2, col3 ORDER BY ID) AS rn -- col1..col3: the columns that define uniqueness
    FROM tbl
) SELECT * FROM cte WHERE rn = 1
As you can imagine, on a huge table this takes far too long. To optimize the query I used batch processing, dividing the transaction into small chunks:
DECLARE @x INT
SET @x = 1
WHILE @x < 44000000 -- Set appropriately
BEGIN
    ;WITH cte AS (
        SELECT *, ROW_NUMBER() OVER (PARTITION BY col1, col2, col3 ORDER BY ID) AS rn
        FROM tbl WHERE ID BETWEEN @x AND @x + 10000
    ) SELECT * FROM cte WHERE rn = 1
    SET @x = @x + 10000
END
OK, this ran much faster: we have a clustered index on the ID column, so SQL Server uses it to fetch the rows for each defined range. However, there is another problem: the data we get back is not actually unique. For a given range, say ID 1 to 10000, we get the rows partitioned as required and filtered to rn = 1, BUT the same logical row can occur again in the next chunk (ID 10001 to 20000) and will also be returned, because SQL Server has no way of recognizing it as a duplicate of a row we already got from the first chunk.
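To see the problem concretely, imagine two rows with identical key values whose IDs fall into different chunks (the values below are purely illustrative):

```sql
-- Two duplicate rows that land in different ID chunks
INSERT INTO tbl (ID, col1, col2, col3) VALUES (9999,  'A', 'B', 'C');
INSERT INTO tbl (ID, col1, col2, col3) VALUES (10005, 'A', 'B', 'C');
-- Chunk 1 (ID 1..10000) sees only the first row, assigns it rn = 1, returns it.
-- Chunk 2 (ID 10001..20000) sees only the second row, also assigns it rn = 1,
-- and returns it too - so the combined result still contains both copies.
```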
A more reliable solution is to check the entire table rather than work range by range, for example by grouping on the key columns to find the duplicates:

SELECT col1, col2, col3, COUNT(*) AS cnt FROM
tbl GROUP BY col1, col2, col3 HAVING COUNT(*) > 1
As you can see, this can take a long time, so I also tried creating a Dedup table with a key using the IGNORE_DUP_KEY option (thanks to Hugo), but the insert into that table was pretty slow as well. Peter Larsson, a fellow MVP, suggested the technique below, which worked pretty well.
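For reference, the IGNORE_DUP_KEY variant mentioned above looks something like this (the Dedup table name, column names, and types are mine, not from the original post):

```sql
CREATE TABLE dbo.Dedup
(
    col1 INT NOT NULL,
    col2 INT NOT NULL,
    col3 INT NOT NULL,
    CONSTRAINT PK_Dedup PRIMARY KEY CLUSTERED (col1, col2, col3)
        WITH (IGNORE_DUP_KEY = ON)
);
-- Rows that would violate the primary key raise a warning instead of an error
-- and are silently skipped, so only one copy of each key survives.
INSERT INTO dbo.Dedup (col1, col2, col3)
SELECT col1, col2, col3 FROM tbl;
```

Peter Larsson's batch technique, mentioned above, is shown next.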
CREATE TABLE #unique (id INT PRIMARY KEY CLUSTERED)
INSERT INTO #unique (id) SELECT MIN(ID) AS ID FROM tbl
GROUP BY Col1, col2, col3... (here you decide the uniqueness)
-- And then insert in batches:
DECLARE @id INT
SET @id = 0
WHILE @id < 800000000 -- 800 million, or more
BEGIN
    SELECT t1.ID, t1.Col -- insert these rows into your destination table
    FROM dbo.Table1 AS t1
    INNER JOIN #unique AS u ON u.id = t1.ID
        AND u.id BETWEEN @id AND @id + 99999
    SET @id += 100000
END
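Once #unique holds the surviving IDs, the duplicates themselves can be removed in the same batched fashion by deleting every row whose ID is not in #unique. This is my own sketch of that final step, not part of the original suggestion, and it assumes ID is the clustered key of dbo.Table1:

```sql
DECLARE @id INT
SET @id = 0
WHILE @id < 800000000 -- adjust to your row count
BEGIN
    DELETE t
    FROM dbo.Table1 AS t
    WHERE t.ID BETWEEN @id AND @id + 99999
      AND NOT EXISTS (SELECT 1 FROM #unique AS u WHERE u.id = t.ID)
    SET @id += 100000
END
```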
It would be great if you posted your test results back here.