When I go into a client, I usually see one of two opposite extremes of error handling. Either the client has no exception handling at all, or every process is wrapped in exception handling that carefully catches errors, logs them away for safekeeping, and quietly continues. The latter is, surprisingly, the worse of the two.
On the one hand, the ETL is at the whim of whatever is thrown its way – connection issues, bad data, or other anomalies. On the other, no one ever knows that exceptions are happening, and the users of the system go on their merry way, not realizing that their data warehouse is dry-rotting like a curtain, until one day they give a tug to let in the sun and it falls to pieces. How easily trust is lost when there are apparent gaps in the data from lost rows.
Let me give a specific example. I was working on a package that used a script component to parse a flat file, and the component caught all exceptions and sent them to an error output. The error output carried a vague hint of what the error was and which column was affected, but it had no trace information about the source file, the affected row, or the offending value. I should also note here that a script component should never be used to replace the equivalent functionality in an out-of-the-box component (in this case, the flat file source). This is an important aspect of exception handling: keep it simple, but not too simple.
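To illustrate what a useful error output looks like, here is a minimal sketch in Python (the real component was SSIS, so this is an analogy, not the original code). Each error record carries the full trace context the package above was missing: source file, row number, column, and offending value. The column name and parsing logic are my inventions.

```python
import csv

def parse_file(path):
    """Parse a delimited file, routing bad rows to an error list that
    carries full trace context: file, row number, column, and value."""
    good, errors = [], []
    with open(path, newline="") as f:
        for row_num, row in enumerate(csv.reader(f), start=1):
            try:
                # The one conversion we actually expect to fail on bad input.
                amount = float(row[2])
            except (ValueError, IndexError) as exc:
                errors.append({
                    "source_file": path,
                    "row_number": row_num,
                    "column": "amount",                      # hypothetical column
                    "value": row[2] if len(row) > 2 else None,
                    "error": str(exc),
                })
                continue
            good.append((row[0], row[1], amount))
    return good, errors
```

With a record like this in hand, "row 2 of extract_20240101.csv had 'oops' in the amount column" is a five-minute investigation instead of a guessing game.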
So why was this catch-all methodology being employed? I found that the source files would sometimes contain invalid values from user input. In fact, only three fields would ever go wrong, and there were only two cases in which the data should have been rejected – rejected and reported back to the business, mind you.
Ayende Rahien sums up a very simple approach to exception handling in this post, http://ayende.com/blog/3952/kobe-an-example-of-exception-handling-done-wrong, in which he states:
Exception handling should appear in exactly two places:
· When an error is expected (making a web request call, for example) and there is some meaningful behavior to be done in the case of a failure (such as retrying after some delay)
· On a system boundary, in which case you need to make a decision about how you are going to expose the error to the outside world.
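The first case – an expected error with some meaningful recovery behavior – can be sketched as a small retry helper. This is Python for illustration (in SSIS the equivalent would live in a script task or the package's retry logic), and the `operation` callable is a stand-in for whatever flaky call you are protecting:

```python
import time

def with_retries(operation, attempts=3, delay=1.0):
    """Retry an operation that is *expected* to fail occasionally
    (a flaky FTP or web connection, say), re-raising after the final
    attempt so a real outage still surfaces instead of being swallowed."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except ConnectionError:          # catch only the expected failure
            if attempt == attempts:
                raise
            time.sleep(delay)
```

Note that only the specific, anticipated exception type is caught; anything unexpected still blows up, which is exactly what you want.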
If we apply this same methodology to my example problem, we would want exception handling for our three bad-data cases, and that's it. Since this file is produced from another database, as long as we match the source data types, there is no need to protect the system boundary. In other cases, we might find that the source system is poorly designed and uses a 255-length string for every column; that would be a good case to enable exception handling for conversion errors around what the values actually should be.
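A targeted version of the validation might look like the following sketch. The field names are invented for illustration, but the shape matches the scenario above: three fields that go wrong, one of which can be silently corrected, and only two cases that actually reject the row.

```python
from datetime import datetime

def validate_row(row):
    """Validate only the fields known to go wrong (hypothetical names).
    Returns (cleaned_row, rejection_reason); reason is None when the
    row should load. Only two conditions actually reject the row."""
    cleaned = dict(row)

    # Field 1: sometimes arrives padded with whitespace - correct it, never reject.
    cleaned["customer_code"] = cleaned["customer_code"].strip()

    # Field 2: must be a parseable date - reject and report back to the business.
    try:
        datetime.strptime(cleaned["order_date"], "%Y-%m-%d")
    except ValueError:
        return None, f"bad order_date: {cleaned['order_date']!r}"

    # Field 3: must be numeric - reject and report back to the business.
    try:
        cleaned["amount"] = float(cleaned["amount"])
    except ValueError:
        return None, f"non-numeric amount: {cleaned['amount']!r}"

    return cleaned, None
```

Everything else in the row flows through untouched; a surprise in any other column fails loudly instead of being quietly filed away.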
So what do we do with the exceptions anyway?
In his book, Data Quality: The Accuracy Dimension, Jack Olson offers four possible approaches for dealing with data quality problems, but I think they apply just as well to any exception handling in an ETL:
- Ignore the error – log it and report it, but keep the original value in the case of a data quality problem
- Try to correct it – e.g., put in retries for a problematic connection, like every FTP connection; they always seem to be spotty
- Reject the row – I have seen some approaches to quarantining records so they could be massaged by business and reloaded, but I think it is far better to correct problems in the source system. That way, you will never see it again should you need to reload the data.
- Stop the ETL – sometimes all you can do is rely on manual intervention to fix a problem and that is when this approach makes sense.
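Three of the four responses above (retrying is handled at the point of the call, as shown earlier) can be sketched as a small dispatcher. The function and severity names are mine, purely to illustrate the shape:

```python
import logging

log = logging.getLogger("etl")

class EtlHalt(Exception):
    """Raised to stop the whole ETL when only manual intervention will do."""

def handle_bad_row(row, error, response):
    """Illustrative dispatcher for three of Olson's four responses:
    'ignore' logs and keeps the original value, 'reject' drops the row
    for reporting back to the business, anything else halts the run."""
    if response == "ignore":
        log.warning("keeping original value for row %s: %s", row, error)
        return row               # load as-is, but leave an audit trail
    if response == "reject":
        log.error("rejecting row %s: %s", row, error)
        return None              # caller reports it back to the business
    raise EtlHalt(f"manual intervention required: {error}")
```

The important part is not the dispatcher itself but that every path makes noise: nothing is caught and then forgotten.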
Combining these two approaches, we can create ETLs that are very effective at handling exceptions – and at letting us know when they happen.