MERGE in SQL Server 2008

MERGE!  You’ve arrived just in time to help me with my ETL operations.

Introduced in SQL Server 2008, MERGE lets one update one table by inserting, updating, or deleting fields as appropriate based on a join with another table.  I recognized quickly that the command was clearly what I needed in order to synchronize the dimension tables of our data warehouse between our development, staging, and production servers.  We recently converted our data warehouse to use foreign keys between our fact and dimension tables — yes, that’s right, there were no foreign keys in our data warehouse — and the old method of deleting and replacing the entire dimension table when updating a dimension on our staging or production server was simply not appropriate anymore.  I needed “upsert” capability, a way to insert and update only the things that changed between the dimension tables.

Enter the MERGE command.  Here it is in action, as I use it to update a dimension (proprietary information removed and/or changed):

merge dimHospital as target
 using development_db.dbo.dimHospital src
 on src.hospital_key = target.hospital_key

 when not matched by target then
 insert (
 hospital_key,
 full_name,
 addr,
 city,
 state_abbr,
 zip_cde
 )
 values (
 src.hospital_key,
 src.full_name,
 src.addr,
 src.city,
 src.state_abbr,
 src.zip_cde
 ) 

 when matched and (
 target.full_name <> src.full_name
 or target.addr <> src.addr
 or target.city <> src.city
 or target.state_abbr <> src.state_abbr
 or target.zip_cde <> src.zip_cde
 ) then
 update set
 full_name = src.full_name
 ,city = src.city
 ,addr = src.addr
 ,state_abbr = src.state_abbr
 ,zip_cde = src.zip_cde

 output 

 inserted.hospital_key,
 inserted.full_name,
 inserted.addr,
 inserted.city,
 inserted.state_abbr,
 inserted.zip_cde,

 deleted.hospital_key,
 deleted.full_name,
 deleted.addr,
 deleted.city,
 deleted.state_abbr,
 deleted.zip_cde

 $action, @merge_dt into dimHospital_Merge_History
 ;

Note the semicolon at the end there, that is actually a necessary part of the command.

So here’s what’s going on.  I’m telling SQL Server to modify my destination table (aliased as “target”) based on some source version of that table, comparing them based on some join condition (here using the primary key, hospital_key):

merge dimHospital as target
 using development_db.dbo.dimHospital src
 on src.hospital_key = target.hospital_key

If rows exist in the source table that do not exist in the destination table, we insert those rows into the destination table:

when not matched by target then
 insert (
 hospital_key,
 full_name,
 addr,
 city,
 state_abbr,
 zip_cde
 )
 values (
 src.hospital_key,
 src.full_name,
 src.addr,
 src.city,
 src.state_abbr,
 src.zip_cde
 )

Where both the source and destination tables rows exist, and there is at least one difference in the fields I check, I update the destination table values from the source table:

 when matched and (
	target.full_name <> src.full_name
	or target.addr <> src.addr
	or target.city <> src.city
	or target.state_abbr <> src.state_abbr
	or target.zip_cde <> src.zip_cde
	) then
  update set
   	 full_name = src.full_name
	,city = src.city
	,addr = src.addr
	,state_abbr = src.state_abbr
	,zip_cde = src.zip_cde

Just what I needed to do.  One can also delete rows in the destination table which do not match your source table, but that isn’t shown here because, this being a dimension table, I never delete rows.

But Wait, There’s More!

And now for the best part, the OUTPUT clause.

output 

 inserted.hospital_key,
 inserted.full_name,
 inserted.addr,
 inserted.city,
 inserted.state_abbr,
 inserted.zip_cde,

 deleted.hospital_key,
 deleted.full_name,
 deleted.addr,
 deleted.city,
 deleted.state_abbr,
 deleted.zip_cde

 $action, @merge_dt into dimHospital_Merge_History

As you can see, I am storing the values of the destination table prior to any change (the deleted table entries), and after any change (the inserted table entries), along with the merge action and the merge date into a custom “merge history” table.

That table was created at the top of my dimension update script via the following SQL:

if not exists(select *
 from information_schema.tables
 where table_name = 'dimHospital_Merge_History')
 begin
 create table dimHospital_Merge_History (

 inserted_hospital_key int null,
 inserted_full_name varchar(60) null,
 inserted_addr varchar(40) null,
 inserted_city varchar(30) null,
 inserted_state_cde char(2) null,
 inserted_zip_cde char(5) null,

 deleted_hospital_key int null,
 deleted_full_name varchar(60) null,
 deleted_addr varchar(40) null,
 deleted_city varchar(30) null,
 deleted_state_cde char(2) null,
 deleted_zip_cde char(5) null,

 merge_action varchar(20) not null, merge_dt datetime not null default getdate()
 )

AND, I made sure that it was also initially populated with the existing values of the destination table prior to the very first merge operation (by putting it inside the if not exists clause shown started above):

insert dimHospital_Merge_History (

 inserted_hospital_key,
 inserted_full_name,
 inserted_addr,
 inserted_city,
 inserted_state_cde,
 inserted_zip_cde,

 merge_action, merge_dt

 )

 select     

 hospital_key,
 full_name,
 addr,
 city,
 state_cde,
 zip_cde,
 'BASELINE', @merge_dt

 from dimHospital

Using this information I should be able to recreate the dimension table at any point in its history, and know exactly what changed along the way.  I haven’t done it yet, but it is kind of neat to think that it can be done. I do know that the question of what changed at what time in our dimension tables comes up often from our analyst, and the history obtained from the OUTPUT clause should  prove useful in the future.

This entry was posted in T-SQL. Bookmark the permalink.