Written by Lakshmi Purushothaman, Director, Single Family Credit Analytics
As soon as a new product was launched, my business group needed daily access to standardized, organized data to assess the impact of the new offering on the loan origination process and to support risk monitoring, quality control sampling, and fraud analytics. Internal and external data sets were available in semi-structured form (JSON, XML) in the Hadoop Distributed File System (HDFS). Our user community was diverse: reporting, advanced analytics and data science. Overall, we had to organize and present data on multiple platforms simultaneously so that users could easily access the new data sets and join them with legacy data.
Our initial approach was the traditional path, which could take months after launch to organize the data for advanced analytics. Legacy technologies also made it hard to deal with file types like XML, JSON and PDF. In addition, the steps in the traditional path are done in silos: in separate tools, technologies and platforms.
We tried running algorithms against the raw files in the hope that insights would pop up, but we were not able to make much progress with this approach because the data landscape was totally new. There were plenty of unknowns: data relationships were not well understood, data quality was uncertain, and statistical relationships had been checked for only a small set of variables.
We came to the realization that we had to change our “data munging” approach drastically. A commonly cited estimate is that data scientists spend 50% to 80% of their time mired in the mundane labor of collecting and preparing unruly data before it can be used to gather insights. With the new data landscape, the traditional approach would have taken months to design the data model and develop the data application.
Beyond saving time, our approach also needed to be:
- Extensible: supporting different types of source files;
- Reusable: supporting multiple interfaces for the same type of file;
- Low cost: development costs must be much lower than current costs; and
- Quick to market: reports ready at launch.
What would success look like?
Success was automating the design for data munging and integration in a scalable way, while reducing the time to implement a data application by 50% to 70%, thereby allowing data scientists and business analysts to easily access data.
What did we need to do to clean, standardize and integrate this new data?
- Gain understanding of the data landscape;
- Identify the data relationships, types, etc.;
- Design the data model;
- Add the transformation rules;
- Load the data; and
- Add data movement and data quality controls.
We decided to leverage Big Data tools and technologies (PySpark, Sqoop, Hive, Oozie) for all these steps, including the data model design. We created a framework that runs PySpark algorithms to mine and profile the data and gather information about the data relationships and types in the XML. All these data points are stored in a user-friendly configuration file. A data subject matter expert designs the model by reviewing the data mining outputs, and business subject matter experts can add transformation rules to the same configuration file. PySpark code then loads the data based on the information in the configuration file.
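The profiling step described above can be illustrated with a minimal sketch. It is not the framework's code: it uses Python's standard `xml.etree.ElementTree` in place of PySpark so that it runs on its own, and the sample XML, the element names, and the `infer_type`, `profile`, and `to_config` helpers are all hypothetical. It walks an XML document, records the value types seen at each element path and whether an element repeats under a single parent (a hint that it belongs in a child table), and writes the result as a JSON draft that a data expert could review.

```python
import json
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical sample; real source files would be read from HDFS.
SAMPLE_XML = """
<loans>
  <loan id="A1">
    <amount>250000</amount>
    <borrower><name>Pat</name></borrower>
    <borrower><name>Sam</name></borrower>
  </loan>
  <loan id="B2">
    <amount>180000.50</amount>
    <borrower><name>Lee</name></borrower>
  </loan>
</loans>
"""

# Ordered from narrowest to widest type.
TYPE_ORDER = ["int", "float", "string"]


def infer_type(text):
    """Guess the narrowest type that fits a text value."""
    for caster, name in ((int, "int"), (float, "float")):
        try:
            caster(text)
            return name
        except ValueError:
            continue
    return "string"


def profile(root):
    """Record, per element path, the observed types and whether it repeats."""
    stats = defaultdict(lambda: {"types": set(), "repeats": False})

    def walk(node, path):
        entry = stats[path]  # make sure every path is listed
        text = (node.text or "").strip()
        if text:
            entry["types"].add(infer_type(text))
        for key, value in node.attrib.items():
            stats[f"{path}/@{key}"]["types"].add(infer_type(value))
        counts = defaultdict(int)
        for child in node:
            counts[child.tag] += 1
            walk(child, f"{path}/{child.tag}")
        # A tag seen more than once under one parent is a one-to-many relationship.
        for tag, n in counts.items():
            if n > 1:
                stats[f"{path}/{tag}"]["repeats"] = True

    walk(root, root.tag)
    return stats


def to_config(stats):
    """Turn profiling results into a reviewable draft configuration."""
    config = {}
    for path, info in sorted(stats.items()):
        types = info["types"]
        config[path] = {
            # Widest observed type wins; paths without values are containers.
            "type": max(types, key=TYPE_ORDER.index) if types else "struct",
            "repeats": info["repeats"],
        }
    return config


config = to_config(profile(ET.fromstring(SAMPLE_XML.strip())))
print(json.dumps(config, indent=2))
```

In this sketch, `loans/loan/borrower` is flagged as repeating, which is the kind of signal a data expert might use when deciding how to split the model into tables.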
In a traditional approach, data analysis is done in a silo, in a separate tool or platform. Then requirements are written and the model is designed. ETL logic lives in a separate platform such as Informatica, where the code contains all the logic, while data movement and quality controls are handled in yet another tool.
In our solution, everything is in one place. In the configuration files, business, data and technical subject matter experts all have one interface to work with.
PySpark code is decoupled from the data handling logic and business transformations. Some of the machine learning and analytics packages used are Spark MLlib, Scikit-learn, Pandas, SciPy, NumPy and Matplotlib.
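To make the decoupling concrete, here is a minimal sketch of a config-driven load, again in plain Python rather than PySpark so that it is self-contained. Everything in it is an assumption made for illustration: the configuration layout, the column names, the rule names `upper` and `ltv_bucket`, and the `apply_config` helper. The point is only that the generic code never changes; adding a column or a business rule means editing the configuration.

```python
import json

# Hypothetical configuration of the kind data and business experts would edit.
CONFIG = json.loads("""
{
  "target_table": "loan_summary",
  "columns": [
    {"source": "amount", "target": "loan_amount", "type": "float"},
    {"source": "state", "target": "property_state", "type": "string", "rule": "upper"},
    {"source": "ltv", "target": "ltv_bucket", "type": "float", "rule": "ltv_bucket"}
  ]
}
""")

# Generic casts and a registry of named transformation rules (illustrative only).
CASTS = {"int": int, "float": float, "string": str}
RULES = {
    "upper": lambda value: value.upper(),
    "ltv_bucket": lambda value: "high" if value > 80 else "standard",
}


def apply_config(record, config):
    """Build one target row from a source record using only the configuration."""
    row = {}
    for column in config["columns"]:
        raw = record.get(column["source"])
        if raw is None:
            # Missing source values are carried through as nulls.
            row[column["target"]] = None
            continue
        value = CASTS[column["type"]](raw)
        rule = column.get("rule")
        row[column["target"]] = RULES[rule](value) if rule else value
    return row


records = [
    {"amount": "250000", "state": "va", "ltv": "85.0"},
    {"amount": "180000.50", "state": "md"},
]
rows = [apply_config(record, CONFIG) for record in records]
for row in rows:
    print(row)
```

In a Spark setting the same configuration could drive column expressions instead of per-record Python functions; that mapping is left out here to keep the sketch short.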
With machine learning, we plan to enhance the framework with additional components that reduce the overall time to create the data table, which is the longest step in building data applications for analytics as well as models. The framework already contains multiple modules, such as data discovery and profiling.
Currently, with our framework, we have successfully reduced our data processing time from months to weeks and delivered business data applications at launch, for example LAS Automated Collateral Evaluation (ACE), with clean, standardized, organized data in Hive tables on the Hortonworks platform and in our MPP platform for large volumes of data. Business analysts and data scientists are using the data simultaneously.