Have you used
WHEN NOT MATCHED BY Source THEN DELETE
much in your work with MERGE?
I’ve been using SQL Server’s MERGE statement since a little before my previous post on it, but I never used the clause above much at all.
Recently I had an opportunity to use it, and I found some unanticipated behavior. And since our architect said “why is it doing that?” when I showed him what I found, I don’t feel too bad about posting this here.
The scenario: my goal was to merge in three sets of data from three different hospitals into one fact table, including deleting records in the fact table that were not found in the source files, since each of the source files was a “complete refresh” of that hospital’s data. The delete part seemed like a good application of the WHEN NOT MATCHED BY Source, right?
Indeed it was, but I discovered quickly that there were some specific things to be careful about when using this clause, as written above at least:
- If your source is empty, you’ll delete all the data in your fact table. This is a real bummer when this happens. <sad trombone>
- If your source is one of multiple sources going to the destination fact table, and you use the clause above as is, your delete will delete all the other data in the fact table. This is less of a bummer than deleting all your data, but still worthy of a <sad trombone>.
I tripped over the first item when I initially tested my changes and had put the source files in the wrong directory. In our system, there is an initial truncate of the staging table, a load to it from the source files — which inserted zero rows, since it couldn’t find the source files — and then the merge. The merge, containing the clause
WHEN NOT MATCHED BY Source THEN DELETE
happily deleted every record that existed in the fact table (the merge target) that it did not have in the empty staging table (the merge source) — that is, all of them.
Fortunately, this is easily remedied by capturing the size of your source and either avoiding the merge if it is empty, or augmenting the WHEN NOT MATCHED BY Source clause. I chose the latter, since it fit into our standard merge stored procedures more cleanly:
WHEN NOT MATCHED BY Source AND @SourceRowCount > 0 THEN DELETE
where @SourceRowCount was populated previously by simply doing a COUNT(*) on the staging table.
I put that in and tested it, since I was sure I’d make that mistake with the sources again (it’s good to know yourself), and all was good. Then I put the three source files — one for each hospital — into the correct location, ran it again and bam! I found I only had the last hospital’s data in the fact table when I was done.
Hmm. Being used to using MERGE solely for upserts, it took a while to grok, but it did make sense eventually — just like the empty merge source situation, it will delete all records that exist in the target but aren’t found in the source.
So my truncate fact table, load hospital A, load hospital B, load hospital C process ended up leaving only hospital C’s data in the fact table, since hospitals A and B’s records were not in hospital C’s file, of course.
How to remedy this? I had to specify that it should only delete records corresponding to the hospital it was currently working on. Unfortunately, there is no way to append a WHERE clause on the DELETE operation specifically, but augmenting the WHEN NOT MATCHED BY Source clause achieves the same goal. I knew what hospital I was merging in, and so putting that into the clause:
WHEN NOT MATCHED BY Source AND @SourceRowCount > 0 AND Target.hospital = @currentHospital THEN DELETE
seemed the way to go. You can see that this addition effectively restricts the DELETE to operating on only those records in the fact table corresponding to the hospital you specify by @currentHospital, similar to putting a WHERE clause on the DELETE.
And indeed, it did work. I hope by putting this up here that I’ve saved you some head-scratching if you’re running into the same questions about using DELETE with MERGE.