It has been said that over 70% of a data warehouse budget is spent on data extraction and data cleansing. While it is believable that the cost of data extraction is high, given the expense of legacy data extraction and migration, why is the cost of data cleansing included in this famous 70%? Most likely, it is because in the data warehouse environment, the data extraction process is intricately interwoven with the data quality process.
How is this so? The data extraction, transformation, and loading (ETL) process revolves around collecting data from some set of sources, transforming the data through some number of processes, and inserting the transformed data into the target data warehouse or data mart. The fact that the data is not streamed directly out of the source into the target indicates a single important point: the data is not fit for use in the target system. And fitness for use is the basic concept in the world of data quality!
The ETL process is contingent on the use of transformation rules, which are, in effect, data quality rules. ETL tools automate the extraction of data from a number of data sources, provide a means for defining functions to transform data, and then generate scripts for inserting data into the target. Because we are focusing not on the issues of extraction or loading but on data quality, our project will address three aspects of data quality:
1) The definition and execution of data quality rules for the purposes of data transformation
2) The determination of the set of reference data domains and domain mappings
3) Data clustering for the purposes of network analysis and data cleansing
Our project involves extracting information from a collection of data files collected from the SEC EDGAR database and creating a data mart with that information. The input files are semi-structured, essentially in an SGML (i.e., tagged data) format, where each file represents a document that was filed with the SEC. Here is an example of one of the data files:
<CONFORMED-NAME>ALLEN TEST INC NEW NAME 1
<STREET1>888 SCOTTIE PIPPEN ROAD
<STREET1>333 THE BEST PLAYER STREET
<FORMER-CONFORMED-NAME>ALLEN TEST INC NEW NAME 2
<FORMER-CONFORMED-NAME>ALLEN TEST INC NEW NAME 3
<FORMER-CONFORMED-NAME>ALLEN TESTING CO /ISM
Tags are denoted using open and close angle brackets enclosing a tag name (<tag-name>). Opening tags consist of just the angle brackets and the tag name, while closing tags have the tag name preceded by a slash (/). Note that some of the data elements are bounded by both an opening tag and a closing tag, while others (typically those whose values fit on one line) are indicated by the opening tag alone.
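The single-line tagged fields described above can be pulled out with a simple line-oriented parser. The sketch below is one possible approach, not a required design; it handles only the opening-tag-plus-value case and accumulates repeated tags (such as multiple FORMER-CONFORMED-NAME lines) into lists.

```python
import re

# Matches a line of the form <TAG-NAME>value, as in the sample data above.
# Closing tags (</TAG>) do not match because '/' is not in the tag character class.
TAG_LINE = re.compile(r"^<([A-Z0-9-]+)>(.*)$")

def parse_document(lines):
    """Collect tag -> [values] from one-line tagged fields.

    Tags may repeat in a document, so every value is appended to a list.
    """
    record = {}
    for line in lines:
        match = TAG_LINE.match(line.strip())
        if match:
            tag, value = match.group(1), match.group(2).strip()
            record.setdefault(tag, []).append(value)
    return record

sample = [
    "<CONFORMED-NAME>ALLEN TEST INC NEW NAME 1",
    "<STREET1>888 SCOTTIE PIPPEN ROAD",
    "<FORMER-CONFORMED-NAME>ALLEN TEST INC NEW NAME 2",
    "<FORMER-CONFORMED-NAME>ALLEN TESTING CO /ISM",
]
parsed = parse_document(sample)
```

A fuller parser would also track opening/closing tag pairs for multi-line elements, but the same accumulate-into-lists idea applies.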
The resulting data mart will provide cross-reference information associated with all filed documents based on the accession number. This means that while there is a unique record in a reference table for each document, there may be multiple documents associated with certain entities referenced in the data set.
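To make the cross-reference idea concrete, here is a minimal in-memory sketch: documents are keyed by accession number, while an entity (keyed here by a hypothetical CIK-style identifier; the field names and accession-number values are illustrative, not taken from the assignment data) can be linked to many documents.

```python
# Hypothetical cross-reference structure: one unique record per document,
# keyed by accession number; many documents may point at the same entity.
documents = {
    "0000000001-24-000001": {"cik": "0001234567", "name": "ALLEN TEST INC"},
    "0000000001-24-000002": {"cik": "0001234567", "name": "ALLEN TEST INC"},
    "0000000002-24-000001": {"cik": "0007654321", "name": "OTHER CO"},
}

def documents_for_entity(cik):
    """Return the accession numbers of all documents filed by one entity."""
    return [acc for acc, doc in documents.items() if doc["cik"] == cik]

linked = documents_for_entity("0001234567")
```

In the actual data mart this relationship would live in relational tables, with the accession number as the document key and a foreign key to the entity reference table.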
In order to execute this project, you will be required to build the following components:
1) A target data model for the data mart.
2) A data parser that extracts the data elements from each of the records and maintains the affiliation with the document identifier.
3) The determination of data domains and any recognized mappings between data domains.
4) The definition of rules that govern the expected data representation in the data mart.
5) A set of methods that either validate data on entry into the data mart or identify erroneous data within the data mart based on the rules defined in the previous step.
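Components 4 and 5 above can be combined in a simple pattern: express each data quality rule as a named predicate over a parsed record, then report which rules each record violates. The sketch below is one way to structure this, with illustrative field names and rules that are assumptions rather than the assignment's actual rule set.

```python
# Each rule is a (name, predicate) pair; a predicate returns True when the
# record conforms. The rules and field names below are hypothetical examples.
RULES = [
    ("conformed-name-present",
     lambda rec: bool(rec.get("CONFORMED-NAME"))),
    ("state-is-two-letter-code",
     lambda rec: all(len(s) == 2 and s.isalpha()
                     for s in rec.get("STATE", []))),
]

def validate(record):
    """Return the names of the rules that the record violates."""
    return [name for name, check in RULES if not check(record)]

bad = {"STATE": ["MARYLAND"]}   # missing name, state spelled out
violations = validate(bad)
```

Keeping the rules in a declarative list like this makes the final report easy to produce: for each record, emit the record identifier alongside its violation list, and flag which violations have an obvious automatic correction.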
In addition to the above items, you must choose at least one of these three additional components:
A) Network Analysis: report on the connectivity between different companies that can be inferred from the data mart.
B) Clustering: create clusters based on company name or address.
C) Approximate Search: provide a means for searching for similar data records within the data set.
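Options B and C both hinge on measuring similarity between values such as company names. One possible starting point, shown below as a sketch rather than a prescribed method, is to normalize the names and score them with the standard library's `difflib.SequenceMatcher`; the 0.85 threshold is an arbitrary assumption you would tune against the data.

```python
from difflib import SequenceMatcher

def normalize(name):
    """Uppercase and collapse whitespace so trivial variations compare equal."""
    return " ".join(name.upper().split())

def similarity(a, b):
    """Similarity in [0, 1] between two normalized names (stdlib ratio)."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

def find_similar(query, names, threshold=0.85):
    """Return the candidate names scoring at or above the threshold."""
    return [n for n in names if similarity(query, n) >= threshold]

matches = find_similar("Allen Test Inc", ["ALLEN TEST INC.", "ZZZ CORP"])
```

The same similarity function can seed clustering (group names whose pairwise similarity exceeds the threshold) or approximate search (rank candidate records by score); more robust approaches would add token reordering or edit-distance variants.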
To make sure that this is truly a data quality project, we will be inserting erroneous data into the mix. This means that we expect you to find some errors, and you will be graded in part on your ability to identify them.
Our goal is not to force you to learn any particular platform, but rather to allow freedom of expression. You may use any database and any programming language.
The expected output of this project is a report on the records that violated the data quality rules, which rules were violated, and whether those violations could be automatically corrected. A complete project includes documentation of all of the steps above, the code used to implement the project, and the final report.
Your grade will be determined by these factors:
- Completeness – how much of the project was implemented
- Correctness – the degree to which your results match the expected results
- Cohesiveness – how well the components fit together
- Presentation – how well and completely the project is documented