How to Leverage an Under-Utilized SQL Technique to Find Max Date Across Columns

Business Intelligence is all about using data to construct meaningful insights that can be consumed by the business. As IT practitioners, we encounter situations that test our ingenuity on a daily basis, and must dig into our toolbox of techniques to overcome any technical challenges.

The Use Case

If you work on the ETL (Extraction/Transformation/Load) side of Business Intelligence, you’ll recognize the use case as a common one: detecting changes in data to enable retrieval and loading into an environment. This is the key to allowing incremental loads. When dealing with small datasets, the issue is not so critical. But when you scale, the data to be loaded must be more targeted in order to keep your system from glowing red with activity. When you have several millions of rows of data to evaluate, it’s a different scenario than when you’re looking at a few thousand.

The Specific Challenge/Goal

The specific challenge/goal is straightforward: get a MAX change date from multiple columns—usually across multiple tables—and get it all done in a performance-friendly manner. My personal scenario involved nine tables in a query (order header, order detail, and seven reference tables with 95 million rows) to check.

Prerequisite/Considerations

  • Include accurate CreateDate or ModifiedDate columns in the source to be able to interrogate
  • If you don’t have a usable change date, generate one using something like Hashbytes to compare between the stored data and the previous incoming data. Otherwise, you must forgo performing an incremental load.

Even with usable CreateDate or ModifiedDate columns, as the number of columns increases, development (along with the ability to maintain accurate code) becomes harder as the checks grow almost exponentially.

Example:
Just going from two to three columns increases the required checks and complexity.

2 Columns

IF column1 >= Column 2                 THEN column 1
IF column2 >  Column 1                 THEN column 2

3 Columns

IF column1 >= Column 2 AND column 1 >= Column 3         THEN column 1
IF column2 >= Column 3 AND column 2 >= Column 1         THEN column 2
IF column3 >= Column 2 AND column 3 >= Column 1         THEN column 3

… and it keeps on growing with each added column!

Again, my scenario included nine tables in a query (order header, order detail, and seven reference tables) with more than 95 million rows that had potentially changed information I needed to evaluate to determine if it had to be supplied to my data warehouse load. Losing a transaction is not an option.

Options

To let the cat out of the bag: I used a VALUES clause in a column sub-select to let me quickly get the max change date from nine columns across nine tables without adversely affecting performance or creating a coding nightmare. But before we take a closer look, let’s consider some options for performing our check.

In-Line Code

SELECT case
       WHEN column1 > Column 2 AND column 1 > Column 3  then column1
       WHEN column2 > Column 3 AND column 2 > Column 1  then column2
       WHEN column3 > Column 2 AND column 3 > Column 1  then column3                     
END
FROM table1, table2, table3

User Defined Function (UDF)

SELECT dbo.GetMaxChangeDate(column1, column2, column3)
FROM table1
LEFT JOIN table2  ON …
LEFT JOIN table3  ON …

Sub-Select

SELECT (SELECT MAX([Date])
       FROM    (SELECT column1 AS [Date] FROM table1  where table1.PK_ID = drvTable.PK_ID
               UNION ALL
               SELECT column2 FROM table2  where table2.PK_ID = drvTable.PK_ID
               UNION ALL
               SELECT column3 FROM table3  where table3.PK_ID = drvTable.PK_ID
               )
               D
       )
       AS MaxChangeDate
FROM table1 as drvTable

Sub-Select with a VALUES Clause

SELECT (
       SELECT  MAX(ChangeDate)
             FROM (VALUES (column1)
                 ,(column2)
                 ,(column3)
               ) AS value(ChangeDate)
       ) AS MaxChangeDate
FROM table1
LEFT JOIN table2  ON …
LEFT JOIN table3  ON …

Here’s a simple, working SQL example as a take-away (use the AdventureWorks DB).

SELECT od.[SalesOrderID]
,od.[SalesOrderDetailID]
,od.[CarrierTrackingNumber]
,od.[OrderQty]
,od.[ProductID]
,od.[SpecialOfferID]
,od.[UnitPrice]
,od.[UnitPriceDiscount]
,od.[LineTotal]
,od.[rowguid]
,od.[ModifiedDate]
,p.ModifiedDate
,(  SELECT  MAX(ChangeDate)
                 FROM (VALUES (od.[ModifiedDate])
,(p.ModifiedDate )
) AS value(ChangeDate)
) AS MaxChangeDate
FROM [Sales].[SalesOrderDetail] od
INNER JOIN [Production].[Product] p
                on p.ProductID = od.ProductID

Conclusion

The VALUES clause in a sub-select as part of the column select is a straightforward construct. This usage is a little-known, but powerful, documented TSQL trick. The VALUES clause allows you to materialize the check-in memory to let you perform a streamlined sub-select with the column values resident for the row being processed. Though my developer background would have me leaning towards using the UDF to provide a common code module to support the logic, poor performance on SQL Server negates it.  When executed as part of the column list in a SELECT statement, SQL Server must instantiate a copy of the UDF for every row processed. Performance degrades as the number of rows increases, substantial even when the number is in the thousands. The VALUES clause processes in memory and the effect on performance is negligible. Plus, the great thing is that you can use any aggregate function in conjunction with the construct.

This is just one solution (albeit a critical one) to a problem that is itself just one piece of the overall picture. As IT practitioners, we live and die by our experience and the techniques and methods in our toolbox. So, what is in yours? If you have any questions around SQL Server best practices, the simplest and quickest way to solve them is to schedule a consult with Anexinet’s experienced team. To learn more about our Business Intelligence services, click here.

Share on

Facebook sharing Linkedin sharing button Twitter sharing button

Ready to get started?

Enter your information to keep the conversation going.
Location image
4 Sentry Parkway East, Suite 300, Blue Bell PA, 19422

Email Image

info@anexinet.com

Phono Image610 239 8100


Location Image4 Sentry Parkway East, Suite 300, Blue Bell PA, 19422
Phono Image610 239 8100