Thursday, February 22, 2007

Dealing with duplicates

A this time I'd like to show some techniques to deal with duplicates
CREATE TABLE TableWithDups
(
row_id int identity(1,1),
col1 int,
col2 int
)
INSERT INTO TableWithDups (col1 ,col2 ) VALUES (1,10)
INSERT INTO TableWithDups (col1 ,col2 ) VALUES (1,10)
INSERT INTO TableWithDups (col1 ,col2 ) VALUES (4,2)
INSERT INTO TableWithDups (col1 ,col2 ) VALUES (3,3)
INSERT INTO TableWithDups (col1 ,col2 ) VALUES (3,3)
INSERT INTO TableWithDups (col1 ,col2 ) VALUES (9,20)

PRINT 'Duplicates in Table'
SELECT * FROM TableWithDups
WHERE EXISTS
(SELECT * FROM TableWithDups TWD
WHERE TWD.row_id <> TableWithDups.row_id
AND TWD.col1 = TableWithDups.col1
AND TWD.col2 = TableWithDups.col2)

PRINT 'Duplicates to be Deleted'
SELECT * FROM TableWithDups
WHERE EXISTS
(SELECT * FROM TableWithDups TWD
WHERE TWD.row_id < TableWithDups.row_id
AND TWD.col1 = TableWithDups.col1
AND TWD.col2 = TableWithDups.col2 )


PRINT 'Executig Deletion of Duplicates '
DELETE FROM TableWithDups
WHERE EXISTS
(SELECT * FROM TableWithDups TWD
WHERE TWD.row_id < TableWithDups.row_id
AND TWD.col1 = TableWithDups.col1
AND TWD.col2 = TableWithDups.col2 )

SELECT * FROM TableWithDups

If you are using SQL Server 2005 take a look at the below script
WITH Duplicates
AS
( SELECT *, ROW_NUMBER()OVER(PARTITION BY col1 , col2 ORDER BY col1 , col2 ) AS cnt FROM TableWithDups)
DELETE FROM Duplicates WHERE cnt> 1;

Recently my good friend Doron Farber wrote great article about how efficiently remove duplicates on huge tables based on real world example.
http://www.dfarber.com/computer-consulting-blog/2011/12/26/remove-duplicate-records-in-sql.aspx

8 comments:

JustinH said...

Hi Uri,

Surely the true test is when there is no identifier on the table (i.e. a badly designed table)?

The SQL Server 2005 method works for this but in SQL 2000 the best I've managed to come up with is using a temp table and SELECT DISTINCT - which gets a lot more complicated if the duplicate rows have foreign key references :-/

Any thoughts?

--CELKO-- permission to use my posts on websites that make a profit from advertising is denied without my written consent said...

By definition this is not a table at all -- it has no key! What this usually models is a commodity, we need a column for the quantity, which can be an absolute scale (count) or a ratio scale (units of measure).
CREATE TABLE Inventory
(item_name VARCHAR(25) NOT NULL PRIMARY KEY,
item_qty INTEGER NOT NULL
CHECK (item_qty > 0));

INSERT INTO Inventory VALUES ('No. 5 Machine Bolts', 123);
INSERT INTO Inventory VALUES ('No. 5 Hex Nuts', 127);
etc.

This sort of design comes from mimicking a paper form or other physical data source directly in SQL. It should never be allowed inthe first place.

aiya said...
This comment has been removed by a blog administrator.
Adim said...

These articles are fantastic; the information you show us is interesting for everybody and really good written. It’s just great!!

Agriculture Logo

Admin said...

Hi, I like your blog very much! It is found me so interesting and informative... Thanks very much for sharing this amazing information over here…Ready-Made Logo

Felix Smith said...

This is an excellent site, I like your blog and your writing style, thanks for sharing your knowledge, keep up the good work. Cheap Essay Writing

Felix Smith said...

I really like this website. This is a great resources, provide you with your free distribution. It gives in depth information. Thank you for this valuable information. Thesis Paper Writing

Felix Smith said...

Hi, I like your blog very much! It is found me so interesting and informative. Thanks very much for sharing this amazing information over here. Essay Writing Services