Special Topics in Computer Science: Data Quality

Summer 2000

Instructor: David Loshin
Office hours: Wednesday, 5:00pm-6:00pm, room 401

Time: 6:00-8:20pm Wednesday
Location: Room 102, CIWW

If you intend to take this class, please send an e-mail with your name to loshin@cs.nyu.edu.


The class email list web page is http://www.cs.nyu.edu/mailman/listinfo/g22_3033_001_su00

Office Hours are 5:00-6:00 PM Wednesday in room 401.

Class-related e-mail and homework: loshin@cs.nyu.edu
Urgent matters: loshin@knowledge-integrity.com

Over the past 30 years, advances in data collection and database technology have led to massive legacy databases controlled by legacy software. The implicit programming paradigm embeds both business policies and data validation policies in application code. Yet most legacy applications are maintained by second- and third-generation engineers, and it is rare to find staff members with first-hand experience in either the design or the implementation of the original system. As a result, organizations maintain significant ongoing investments in the daily operation and maintenance of the information processing plant while mostly ignoring the tremendous potential of the intellectual capital captured within their data assets.


In this course we will investigate the notion of data quality and how it fits into the operational and strategic data environment. We will explore ways to formally characterize what is usually considered a hazy prospect at best, and we will study a framework for formally describing a set of data quality rules that can be used to qualify a data set.
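As a rough illustration of what "a set of data quality rules that can be used to qualify a data set" might look like in practice, the sketch below treats each rule as a named predicate over a single record. The rule names, the record fields (`name`, `zip`), and the helper functions are hypothetical and chosen only for illustration; they are not the framework studied in the course.

```python
# Hypothetical rule set: each rule is a name plus a predicate over one record.
# The fields "name" and "zip" are assumptions made for this example.
RULES = {
    "zip_is_five_digits": lambda r: r.get("zip", "").isdigit() and len(r.get("zip", "")) == 5,
    "name_not_empty": lambda r: bool(r.get("name", "").strip()),
}


def violations(record, rules=RULES):
    """Return the names of the rules that the record fails, in rule order."""
    return [name for name, check in rules.items() if not check(record)]


def conformance(records, rules=RULES):
    """Return the fraction of records that satisfy every rule."""
    if not records:
        return 1.0
    clean = sum(1 for r in records if not violations(r, rules))
    return clean / len(records)
```

For example, `violations({"name": "Ann", "zip": "10012"})` returns an empty list, while a record with a blank name and a four-digit ZIP code fails both rules. A conformance score of this kind is one simple way to put a number on how well a data set meets its rules.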


In addition, we will look at algorithms for standardizing, cleansing, and merging data from multiple sources. We will look at the computational complexity of these algorithms as well as heuristics for improving their runtime performance.
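To make the runtime concern concrete, here is a minimal sketch of standardizing names and then finding duplicate candidates. Comparing every record with every other record takes a number of comparisons quadratic in the number of records; a common heuristic, assumed here for illustration, is "blocking", which compares only records that share a cheap key. The function names and the choice of the first letter as the blocking key are hypothetical, and the exact-match test stands in for whatever similarity measure a real merge would use.

```python
from collections import defaultdict
from itertools import combinations


def standardize(name):
    """Lowercase the name, replace punctuation with spaces, collapse whitespace."""
    kept = "".join(c if c.isalnum() or c.isspace() else " " for c in name.lower())
    return " ".join(kept.split())


def candidate_pairs(names):
    """Yield index pairs whose standardized names share a blocking key.

    The key here is the first character of the standardized name (an assumption
    made for this sketch); only records within the same block are paired.
    """
    blocks = defaultdict(list)
    for i, n in enumerate(names):
        s = standardize(n)
        if s:
            blocks[s[0]].append(i)
    for ids in blocks.values():
        yield from combinations(ids, 2)


def duplicates(names):
    """Return index pairs whose standardized forms are identical."""
    return [
        (i, j)
        for i, j in candidate_pairs(names)
        if standardize(names[i]) == standardize(names[j])
    ]
```

With the input `["Loshin, David", "loshin david", "Smith, J."]`, only the first two entries land in the same block, so a single comparison is made instead of three. The trade-off is that a poorly chosen blocking key can separate records that are true duplicates.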

Midterm Samples


Check this page regularly for important information.

Course syllabus


Second Project Handout

Download first data set for project

NEW! Project Deliverables to HAND IN

Download second data set for project

The second data set has been salted with some errors! Be aware!


© 2000, David Loshin