Written by Jim Thompson, Lead Senior Technical Architect
Freddie Mac, like many companies today, is awash in large amounts of data that come in a variety of formats. Solving the problem of how data scientists can quickly access this data started our journey to a Hadoop platform.
Gartner describes dark data as "the information assets that organizations collect, process and store while conducting their regular business activity, but generally fail to use the data for other purposes (for example, analytics, business relationships and direct monetizing)." Some factors that make using the data difficult are the volume of data, its format and limitations in storage technologies.
One case of dark data starts with several of our key applications using our SOA infrastructure to access a business rules server. The request and response data exchanged with the business rules server are captured as XML documents. These documents are complex and highly nested, some going 14 to 15 levels deep, and a single document can range from tens of kilobytes to hundreds of megabytes.
Initially, we stored these XML documents in a relational database column. Accessing attributes in the XML required a data scientist to augment their SQL commands with operands specific to XML. Their SQL would look something like this:
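The original snippet did not survive editing, so the following is an illustrative sketch only. It assumes an Oracle-style database, and the table and column names (rules_response_log, response_xml) are hypothetical:

```sql
-- Hypothetical table and column names, for illustration only
SELECT d.doc_id,
       x.loan_amount,
       x.loan_status
FROM   rules_response_log d,
       XMLTABLE('/Response/Decision/Loan'
                PASSING d.response_xml
                COLUMNS loan_amount NUMBER       PATH 'Amount',
                        loan_status VARCHAR2(20) PATH 'Status') x
WHERE  XMLEXISTS('/Response[Header/AppId="1234"]'
                 PASSING d.response_xml);
```

Note that even this simple lookup forces XPath expressions and XML-specific operands into what would otherwise be a plain relational query.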
This required the data scientist to know the structure of every XML document and to learn XPath syntax, which often made the SQL commands verbose and complicated. Executing such a command to find the relevant XML documents could take hours. Once the relevant documents were extracted from the database, a SAS or R program had to be written to pull out the attributes needed for analysis.
One way we attempted to reduce this pain was to pre-shred commonly referenced attributes out of the XML into relational database tables, which meant building nightly ETL jobs to extract that subset of attributes. While this helped some users, especially those doing operational reporting, it did little for data scientists, because the ETL jobs captured only a small subset of the overall XML attributes. A data scientist doing original research was back to square one, writing complicated SQL to extract the attributes they needed.
Over time, as more XML data was collected, the relational database grew to several terabytes, which made the data harder to manage and further lengthened extraction times, to the point where some business questions went unanswered because the effort was too great. In other words, the XML documents were becoming dark data.
To solve this problem, we started looking for alternatives for storing the XML. This led us to investigate the Hadoop ecosystem. After running a proof-of-concept and tests for integrating with our existing security and monitoring infrastructure, we implemented Hadoop at the beginning of 2016.
For our first use case, a couple of data scientists extracted information from the XML to see how it could be used for fraud detection. This initial use case proved successful. One benefit we saw with Hadoop and its ecosystem of tools, e.g., HIVE and Spark, was how quickly data scientists could become productive.
Prior to Hadoop, the data scientist would have had to make a business case to start an IT project to extract the required data into a relational database. Based on historical project data, that approach would have been considerably more expensive than self-service by the data scientists.
With Hadoop in place, the data scientists experimented with four different versions of their data structures before settling on the final version. Each iteration took about a day. Using Python, HIVE and the PySpark interface to Spark, they could extract and analyze data in minutes instead of hours, which allowed them to iterate quickly and try different approaches.
Although parsing the XML was much quicker in Hadoop, specialized skills were still needed to extract the required attributes. To spare data scientists from having to learn those skills, we created the Data Engineer role to support them. This role took on the work of parsing the XML and creating HIVE tables.
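The shredding step the Data Engineer performs can be sketched in plain Python. This is a minimal illustration with hypothetical element names; real documents nest far deeper, and a production version would run under PySpark across the full document set:

```python
import xml.etree.ElementTree as ET

# Hypothetical request document; real documents nest 14 to 15 levels deep
DOC = """
<Request id="R-100">
  <Borrower>
    <Name>Jane Doe</Name>
    <Loan><Amount>250000</Amount><Term>360</Term></Loan>
  </Borrower>
</Request>
"""

def flatten(xml_text):
    """Extract selected attributes from one XML document into a flat
    row (a dict) suitable for loading into a HIVE table."""
    root = ET.fromstring(xml_text)
    return {
        "request_id": root.get("id"),
        "borrower": root.findtext("Borrower/Name"),
        "amount": int(root.findtext("Borrower/Loan/Amount")),
        "term": int(root.findtext("Borrower/Loan/Term")),
    }

row = flatten(DOC)
print(row)
# → {'request_id': 'R-100', 'borrower': 'Jane Doe', 'amount': 250000, 'term': 360}
```

Once every document is reduced to flat rows like this, the rows can be written out and registered as a HIVE table, hiding the XML structure from downstream users.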
Once in a HIVE table, the data looked like a relational database table. This allowed data scientists to use HiveQL, a SQL variant, reducing the learning curve for working with the data. From this initial use case, we now have about a dozen projects actively using Hadoop, with more wanting to come on board.
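To show the contrast, the same kind of lookup against a flattened HIVE table becomes an ordinary query; the table and column names here are hypothetical:

```sql
-- HiveQL against the flattened table: no XPath, no XML operands
SELECT request_id, borrower, amount
FROM   rules_requests
WHERE  amount > 200000;
```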