As a critical element of nearly all modern applications, JSON has also become the new data interchange format for our own pipelines. Database providers have heeded the call to integrate JSON functionality into their offerings. For instance, Redshift, Amazon’s Data Warehouse, is quite fast in spite of its bare-bones JSON functionality (though in its documentation, Amazon recommends “…using JSON sparingly”). We typically store raw JSON in staging tables as data progresses through our pipeline, from applications to warehouses.
But data professionals need to ask whether they should use the JSON access functions the database provides or pivot the data out of the JSON document and into columns. Evaluating the usage pattern is the key. For example, if the usage amounts to extracting one extra column for a single record at the end of a drill-down, modern databases like Redshift are a perfect fit, as they can accomplish this task very quickly.
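A minimal sketch of that drill-down pattern, using Python's sqlite3 and its built-in JSON functions as a stand-in for Redshift; in Redshift the equivalent call is json_extract_path_text(), and the table, column, and key names here are hypothetical:

```python
import sqlite3

# In-memory stand-in for a staging table holding raw JSON per customer.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, company TEXT, doc TEXT)"
)
conn.execute(
    "INSERT INTO customers VALUES (?, ?, ?)",
    (42, "Acme", '{"plan": "enterprise", "seats": 150}'),
)

# Drill-down: pull one extra field out of the JSON for a single record.
# Redshift:  json_extract_path_text(doc, 'plan')
# SQLite:    json_extract(doc, '$.plan')
row = conn.execute(
    "SELECT json_extract(doc, '$.plan') FROM customers WHERE customer_id = ?",
    (42,),
).fetchone()
print(row[0])  # enterprise
```

Because the lookup touches a single row by primary key, parsing the JSON at query time costs almost nothing, which is the case where in-database JSON access shines.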
The performance trade-off appears when we report on all of our customers and display a JSON element in the structured report. A data professional then needs to weigh how frequently the data is used this way to determine whether the overhead of extracting the JSON at query time outweighs the cost of pivoting the data up front.
A simple data evaluation reveals that performance lags, and the gap widens, as more data is poured in. As a simple test, I used a three-node AWS Redshift cluster of dc2.large instances. The data held over 90k customers, each with a unique customer ID, a company affiliation, and a JSON field. The JSON document in each record is a single key deep, and the dataset contained over 1.3 million possible data fields. Before starting, I pivoted all the keys from the customer table into a separate table keyed by the customer ID. I then picked three key names from the JSON documents to report on, skipping keys that contained empty values, which produced a report of just under 160 thousand rows. For the tests, I disabled the results cache in Redshift. The straight-join method took 12.5 seconds; the json_extract_path_text() method finished in 15.35 seconds. Tables were not given a sortkey and used the default EVEN diststyle.
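The pivot step described above can be sketched as follows; this is a toy version in Python, again with sqlite3 standing in for Redshift and made-up table and key names. It flattens each customer's JSON document into a key/value table so reports can use a straight join instead of parsing JSON per row:

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, doc TEXT);
    CREATE TABLE customer_attrs (customer_id INTEGER, key TEXT, value TEXT);
""")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [
        (1, '{"plan": "basic", "region": "us-east-1"}'),
        (2, '{"plan": "enterprise", "seats": "150"}'),
    ],
)

# Pivot: one row per (customer, key), skipping empty values, mirroring
# the pre-processing step described in the text.
for customer_id, doc in conn.execute(
    "SELECT customer_id, doc FROM customers"
).fetchall():
    for key, value in json.loads(doc).items():
        if value not in (None, ""):
            conn.execute(
                "INSERT INTO customer_attrs VALUES (?, ?, ?)",
                (customer_id, key, str(value)),
            )

# Report via a straight join -- no JSON parsing at query time.
rows = conn.execute(
    """
    SELECT c.customer_id, a.value
    FROM customers c
    JOIN customer_attrs a ON a.customer_id = c.customer_id
    WHERE a.key = 'plan'
    ORDER BY c.customer_id
    """
).fetchall()
print(rows)  # [(1, 'basic'), (2, 'enterprise')]
```

The pivot is paid once as data lands in the warehouse; every subsequent report amortizes that cost, which is why the join beat per-row extraction in the benchmark above.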
While JSON parsing is fast, when building a large list of results and every second counts, having the data stored in a table you can join to will always provide faster results.