Data Quality Project

More Details

The goal of our project is to be able to explore large data sets, identify data domains, mappings, and other kinds of rules, and implement a system for doing the following:

-         import data into a data mart

-         validate new data records

You have been provided with “seed” data to explore, as well as a pointer to the SEC web site where the metadata is described. You have had some time to look at the data, and we have discussed some details in class. The next section details the deliverables for a successful project

Schema Components

You have had some time to think about this data, but here are some suggestions:

1)      Try to turn as much data into normalized tables as possible.

2)      The central focus of this system will be entities (e.g., companies) and the transactions they perform (i.e., document filings)

3)      Each entity will have a lot of data about it, some of which is time-related. Any information that is relatively static, maintain in the entity table, but have all auxiliary data off in separate tables (e.g., telephone numbers, addresses, former names

4)      Each filing has an accession number, but that number is only unique to the filing. Use that as a primary key, but not for the entities table.

5)      There are more than just companies that are entities. Note that some entities are people, some are money funds, etc. This is important knowledge and should be noted.

Project Deliverables

1)      Data Schema – This is a database layout or schema, which describes the sets of data that will be in your database. This schema will combine the following kinds of tables:

a.      Metadata: This incorporates all data domains, mappings, definitions, as well as representable data quality rules.

b.      Reference data: This includes information that is, for the most part, static, and is referred to by the transaction data. This includes the representations of entity information, including auxiliary data such as addresses, telephone numbers, former names, etc.

c.      Transaction data: The filing of a set of documents is the transactions that we care about. Each of the files that is in the data set represents a single transaction.

2)      Data Transformation and Loading application – This is an application that parses one of the data files, breaks it down into its components, performs any lookups in the existing database, and creates any update records. This is made of these components:

a.      Transaction parser: This reads a file, makes use of the tags embedded in the text to separate out individual components, and create a view of the data if it were to be inserted into the database

b.      Database lookups – This looks up the data in the database to see if it is already there, checks to see whether what is in the current transaction represents any kind of update to the database

3)      Data validation – You will no doubt, find some kinds of data quality rules (null rules, completeness rules, etc.) You must integrate a validation system, either loading rules from the database into a rules engine, or even if you have to hard code it (just document it well) into the loading application.

4)      One of the three extras; see the original document for details, or email me.


You must be able to demonstrate your database system. If you are using a known database system, that is fine. If you do not have access to a database, create your own using flat data files loaded into hash tables for easy lookup.


You must be able to demonstrate to me that the 4 items above are complete. To do this, you must deliver:

-         A schema

-         Evidence that the data has been loaded

-         The code used to load the data

-         A list of data quality rules being tested

-         The code used for validation

Additionally: you must be able to demonstrate the answers to a set of queries about all kinds of information based on entity. I will expect to be able to get the answer to these queries:

-         Give list of all documents by entity

-         How many times is entity referred to in other documents

-         What are former company names for any company

-         Give all addresses associated with any company

-         What companies have filed which kinds of forms?

You must either demonstrate this (either schedule time during office hours, or call for an appointment), or be able to generate a report file that answers all of these queries.