Business Intelligence is all about using data to construct meaningful insights that can be consumed by the business. As IT practitioners, we encounter situations that test our ingenuity on a daily basis, and must dig into our toolbox of techniques to overcome any technical challenges.
The Use Case
If you work on the ETL (Extraction/Transformation/Load) side of Business Intelligence, you’ll recognize the use case as a common one: detecting changes in data to enable retrieval and loading into an environment. This is the key to allowing incremental loads. When dealing with small datasets, the issue is not so critical. But when you scale, the data to be loaded must be more targeted in order to keep your system from glowing red with activity. When you have several millions of rows of data to evaluate, it’s a different scenario than when you’re looking at a few thousand.
The Specific Challenge/Goal
The specific challenge/goal is straightforward: get a MAX change date from multiple columns—usually across multiple tables—and get it all done in a performance-friendly manner. My personal scenario involved nine tables in a query (order header, order detail, and seven reference tables with 95 million rows) to check.
- Include accurate CreateDate or ModifiedDate columns in the source to be able to interrogate
- If you don’t have a usable change date, generate one using something like Hashbytes to compare between the stored data and the previous incoming data. Otherwise, you must forgo performing an incremental load.
Even with usable CreateDate or ModifiedDate columns, as the number of columns increases, development (along with the ability to maintain accurate code) becomes harder as the checks grow almost exponentially.
Just going from two to three columns increases the required checks and complexity.
IF column1 >= Column 2 THEN column 1 IF column2 > Column 1 THEN column 2
IF column1 >= Column 2 AND column 1 >= Column 3 THEN column 1 IF column2 >= Column 3 AND column 2 >= Column 1 THEN column 2 IF column3 >= Column 2 AND column 3 >= Column 1 THEN column 3
… and it keeps on growing with each added column!
Again, my scenario included nine tables in a query (order header, order detail, and seven reference tables) with more than 95 million rows that had potentially changed information I needed to evaluate to determine if it had to be supplied to my data warehouse load. Losing a transaction is not an option.
To let the cat out of the bag: I used a VALUES clause in a column sub-select to let me quickly get the max change date from nine columns across nine tables without adversely affecting performance or creating a coding nightmare. But before we take a closer look, let’s consider some options for performing our check.
SELECT case WHEN column1 > Column 2 AND column 1 > Column 3 then column1 WHEN column2 > Column 3 AND column 2 > Column 1 then column2 WHEN column3 > Column 2 AND column 3 > Column 1 then column3 END FROM table1, table2, table3
User Defined Function (UDF)
SELECT dbo.GetMaxChangeDate(column1, column2, column3) FROM table1 LEFT JOIN table2 ON … LEFT JOIN table3 ON …
SELECT (SELECT MAX([Date]) FROM (SELECT column1 AS [Date] FROM table1 where table1.PK_ID = drvTable.PK_ID UNION ALL SELECT column2 FROM table2 where table2.PK_ID = drvTable.PK_ID UNION ALL SELECT column3 FROM table3 where table3.PK_ID = drvTable.PK_ID ) D ) AS MaxChangeDate FROM table1 as drvTable
Sub-Select with a VALUES Clause
SELECT ( SELECT MAX(ChangeDate) FROM (VALUES (column1) ,(column2) ,(column3) ) AS value(ChangeDate) ) AS MaxChangeDate FROM table1 LEFT JOIN table2 ON … LEFT JOIN table3 ON …
Here’s a simple, working SQL example as a take-away (use the AdventureWorks DB).
,( SELECT MAX(ChangeDate)
FROM (VALUES (od.[ModifiedDate])
) AS value(ChangeDate)
) AS MaxChangeDate
FROM [Sales].[SalesOrderDetail] od
INNER JOIN [Production].[Product] p
on p.ProductID = od.ProductID
The VALUES clause in a sub-select as part of the column select is a straightforward construct. This usage is a little-known, but powerful, documented TSQL trick. The VALUES clause allows you to materialize the check-in memory to let you perform a streamlined sub-select with the column values resident for the row being processed. Though my developer background would have me leaning towards using the UDF to provide a common code module to support the logic, poor performance on SQL Server negates it. When executed as part of the column list in a SELECT statement, SQL Server must instantiate a copy of the UDF for every row processed. Performance degrades as the number of rows increases, substantial even when the number is in the thousands. The VALUES clause processes in memory and the effect on performance is negligible. Plus, the great thing is that you can use any aggregate function in conjunction with the construct.
This is just one solution (albeit a critical one) to a problem that is itself just one piece of the overall picture. As IT practitioners, we live and die by our experience and the techniques and methods in our toolbox. So, what is in yours? If you have any questions around SQL Server best practices, the simplest and quickest way to solve them is to schedule a consult with Anexinet’s experienced team. To learn more about our Business Intelligence services, click here.
Business Intelligence Architect