With the introduction of release 7 of HP Vertica, database and business intelligence developers who have traditionally shied away from incorporating semi and unstructured data within their projects now have new tools at there disposal to incorporate these sources of rich and meaningful information. Both semi-structured and unstructured data travel throughout, and are stored within an organization’s technology assets. Whether they be in the form of call detail records, web server logs, RF/machine sensor logs, or social media feeds, the information contained within them have the potential to provide significant insight into customer and resource behavior. However, they oftentimes do not seem to be a good match for loading into the schema-based structures of traditional relational database management systems. Enter HP Vertica Flex Tables.
Flex Tables speed the process of providing structure to semi and unstructured data, and enable organizations to glean insight from sources that have been viewed, to a large degree, as unapproachable. Flex Tables do not require schema or column definitions in advance of loading, have full Unicode support, as well as full support for standard SQL queries. The tables contain full support for both delimited and JSON data allowing you explore the data before it is materialized. Additionally, Flex Tables provide the ability to place multiple formats into a single table providing the ability to handle any change in the data’s structure over time.
Creating, loading and querying Flex Tables is a very simple process which is illustrated in the steps below. At the time of this writing, these steps assume that you at least have a working copy of HP Vertica 7.0.1 Community Edition available to you. Let’s get started!
Obtaining data to work with
I love JSON formatted data. It’s markedly lighter than XML, and seems to be the preferred tool for client-side data storage and transport of most of today’s front-end developers. Couple with the fact that the API’s of the top social media platforms make their data accessible in this format, and it makes good sense to get familiar with it. Flex Tables make obtaining information from JSON a breeze if you are comfortable with standard SQL syntax.
First you will need to obtain some JSON data to work with. I found an excellent tool for generating random JSON data created by one of my favorite front-end developers, Vazha Omanashvili, called JSON Generator. I needed to modify the template to increase the number of unique records that are generated, to reformat a field named “balance” that stores currency values in a numeric (or floating point) data type, and to format postal codes in US format.
Clicking the “generate” button on the interface produces a data set that looks like this single record below:
Copy the generated output into a text editor and save it to a file with the .json extension. I copied the output to the /tmp/data directory on the Vertica server.
Creating and Loading data into HP Vertica Flex Tables
Connect to your Vertica instance using credentials of a user that has permission to create database objects, and issue the following vSQL command to create a Flex Table:
create flex table customers();
If the above vSQL is processed successfully, a “CREATE TABLE” response will be sent back to the command line.
Issuing a dt meta-command lists all relations that have been created in your Vertica instance.
Notice that Vertica creates two objects when you create a flex table–the flex table itself public.customers, and a table that is used to store meta-data (column names and their data types) about the flex table public.customers_keys.
We now will need to populate public.customers with the data from the JSON file that was saved to /tmp/data. We do this by copying the data from the file system into the flex table.
copy customers from ‘/tmp/data/customers.json’ parser fjsonparser();
This loads the contents of the file directly into the flex table in key value pairs. There are two native columns present with the flex table __identity__ and __raw__. The data themselves are stored in the __raw__ column. Querying this column will bring back the data as they are stored.
select __raw__ from public.customers;
By using the built in maptostring() frunction on the __raw__ column, you can see the contents JSON format. The query below is formatted to return a single block of JSON text by using the limit 1 filter
select maptostring(_raw_) from customers limit 1;
Now that we have data loaded into the customers table, it’s time for us to query the data and pull out some of its useful contents, but we may not know be sure of what attributes are available to include within our query. The customers_keys table includes column definitions for the data that’s contained customers table. If you were this table now, there would not be any definitions present. Execute the compute_flextable_keys() function passing in the public.customers table as a parameter to fill the customer_keys.
select compute_flextable_keys(‘customers’);compute flextable keysPlease see public.customers_keys for updated keys(1 row)
Once the keys have been computed, you may query the customer_keys table for the attribute definitions. The definitions include the attribute names (key_name) ,
a “best guess” of the attributes’ data types (data_type_guess), and a frequency column that displays a count of the frequency of non-null values.
I am interested in the “balance” attribute and notice right away that Vertica guessed an incorrect data type for this attribute. It assumed that the values contained in the column are of variable character type. However as previously noted, I know the column stores currency values in a numeric (or floating point) data type. I can update the attribute’s data_type_guess to a more appropriate field using standard SQL syntax.
update public.customers_keys set data_type guess = ‘float’ where key_name = ‘balance’;
Once this query is processed, I can verify that the change was successful by selecting from the customer_keys table once again. Notice that the "balance" attribute now has a floating point data type.
Once the data types have been appropriately set, we are almost ready to query the data. When we initially created the customers table, Vertica created the customers_keys table. Behind the scenes, Vertica also created a view named customers_view which, when populated with data, will enable a developer to query using standard SQL syntax. You populate the view by executing the build_flextable_view() function, and passing the customer table in as the function's parameter.
select build_flextable_view (‘customers’);
We are now ready to query the view. Below are a few samples that demonstrate the use of standard SQL against a Vertica database:
select company from customers_view group by company having count(*) > 1;
select company, sum(balance) from customers_view group by company having count(*) > 1;
select name, company, balance from customers_view where company in ( select company from customers_view group by company having count(*) > 1)order baby company;
As you can see, HP Vertica Flex Tables make incorporating semi and unstructured data within your projects more approachable than ever before. Contact Anexinet today to find out how we can help you incorporate Big Data technologies into your enterprise data projects.