Wednesday, February 28, 2007

Compare tables

I have no doubt that you've seen many ways/or third products to compare tables. I'd like to show a couple of examples that the first one I learned from Itzik Ben-Gan and the second one I used BINARY_CHECKSUM function.
1)
IF(SELECT CHECKSUM_AGG(checksum(*)) FROM t1)
<> (select CHECKSUM_AGG(checksum(*)) FROM t2)
Print'different'
ELSE
Print'probably the same'

/*
CHECKSUM_AGG
Returns the checksum of the values in a group. Null values are ignored.
*/

2)
SELECT a.Col, a.CheckSum
From (Select colid as "Col", BINARY_CHECKSUM(*) as "CheckSum"
FROM dbo.TableA ) a
Inner Join (
Select colid as "Col", BINARY_CHECKSUM(*) as "CheckSum"
FROM dbo.TableB) b
On a.Col= b.Col
Where a.CheckSum <> b.CheckSum

Actually , the second example may or may not return you inaccurate value, so please test it before using.

8 comments:

dmarkle said...

Uri: I have

dmarkle said...

Uri:

I have used this method a couple of times. You probably should warn people that this method will NOT return accurate results 100% of the time, though. Here's a counterexample:

create table #tableA (colid int primary key, foo int, foo2 int)
create table #tableB (colid int primary key, foo int, foo2 int)
insert #tablea select 1, 0, 16
insert #tableb select 1, 1, 0

SELECT a.Col, a.CheckSum
From (Select colid as "Col", BINARY_CHECKSUM(*) as "CheckSum"
FROM #TableA ) a
Inner Join (
Select colid as "Col", BINARY_CHECKSUM(*) as "CheckSum"
FROM #TableB) b
On a.Col= b.Col
Where a.CheckSum <> b.CheckSum


--If you want to find out which rows might collide using this method, run this function. You'll see a definite pattern ;-)

declare @m int
declare @n int
set @n = 0
while @n < 500 begin
set @m = 0
while @m < 500 begin
set @m = @m+1
if (select binary_checksum(*) from (select 1 as colid, 0 as foo, @m as foo2) as x)=
(select binary_checksum(*) from (select 1 as colid, @n as foo, 0 as foo2) as z) begin
print ' foo2 = ' + cast(@m as varchar(max))+ ' foo = ' + cast(@n as varchar(max))
end
end
set @n = @n + 1
end

Uri Dimant said...

Yep, thanks David.I'm aware of the issue and will be warning people soo.Personally I have not met with inaccurate values in real life,so thanks for pointing that out.

dmarkle said...

:) Just for the benefit of the readers, here... When you run the last script, you can see pretty clearly how BINARY_CHECKSUM() works. Here's the series of numbers up to 500 where BINARY_CHECKSUM() returns identical values:

foo2 = 16 foo = 1
foo2 = 32 foo = 2
foo2 = 48 foo = 3
foo2 = 64 foo = 4
foo2 = 80 foo = 5
foo2 = 96 foo = 6
foo2 = 112 foo = 7
foo2 = 128 foo = 8
foo2 = 144 foo = 9
foo2 = 160 foo = 10
foo2 = 176 foo = 11
foo2 = 192 foo = 12
foo2 = 208 foo = 13
foo2 = 224 foo = 14
foo2 = 240 foo = 15
foo2 = 256 foo = 16
foo2 = 272 foo = 17
foo2 = 288 foo = 18
foo2 = 304 foo = 19
foo2 = 320 foo = 20
foo2 = 336 foo = 21
foo2 = 352 foo = 22
foo2 = 368 foo = 23
foo2 = 384 foo = 24
foo2 = 400 foo = 25
foo2 = 416 foo = 26
foo2 = 432 foo = 27
foo2 = 448 foo = 28
foo2 = 464 foo = 29
foo2 = 480 foo = 30
foo2 = 496 foo = 31

Uri Dimant said...

create table #tableA (colid int primary key, foo int, foo2 int)
create table #tableB (colid int primary key, foo int, foo2 int)
insert #tablea select 2, 0, 32
insert #tableb select 2, 1, 0

This will return differences.

I got the error running your script
Server: Msg 170, Level 15, State 1, Line 10
Line 10: Incorrect syntax near 'max'.

dmarkle said...

It uses the VARCHAR(MAX) datatype, so you have to be running it on SQL Server 2005...

insert #tablea select 1, 0, 32
insert #tableb select 1, 2, 0

Does not return differences. (that's case #2 in my table of values)

Unknown said...

There is a problem with both CHECKSUM and BINARY_CHECKSUM...

I usually use BINARY_CHECKSUM as CHECKSUM is not case sensitive (try SELECT CHECKSUM('AAAA', 'BBBB'), CHECKSUM('aaaa', 'bbbb') ) while and BINARY_CHECKSUM is, but in both cases as the function is simply a hash of the data.

I have actually seen a case where completely different rows were being flagged as matching with real data while doing updates on a very big dimension (>10 columns of multiple types)!

The way that we got around it is to pass the columns into the function in 2 different orders. As column order is taken into account when calculating the hash value (i.e. BINARY_CHECKSUM(colA, colB) <> BINARY_CHECKSUM(colB, colA)) the chances of incorrect matches is greatly reduced.

Of course you will take a small hit in performance :(

aiya said...
This comment has been removed by a blog administrator.