Special Topics in Computer Science: Data Quality

Summer 2001

Instructor: David Loshin
Office hours: Wednesday, 5:00pm-6:00pm

Time: 6:00-8:20pm Wednesday
Location: room 10? (We will get this for you), CIWW

If you intend to take this class, please send an e-mail with your name to loshin@cs.nyu.edu.


The class email list web page is http://www.cs.nyu.edu/mailman/listinfo/g22_3033_001_su01

Office Hours are 5:00-6:00 PM Wednesday in room 401.

Class-related e-mail and homework: loshin@cs.nyu.edu
Urgent matters: loshin@knowledge-integrity.com
Mail all homework to our TA, Jack Gold: jg597@CIMS.nyu.edu

Over the past 30 years, advances in data collection and database technology have led to massive legacy databases controlled by legacy software. In these systems, both business policies and data validation policies are embedded implicitly in application code. Yet most legacy applications are maintained by second- and third-generation engineers, and it is rare to find any staff members with first-hand experience in either the design or implementation of the original system. As a result, organizations maintain significant ongoing investments in daily operations and maintenance of the information processing plant, while mostly ignoring the tremendous potential of the intellectual capital that is captured within the data assets.


In this course we will investigate the notion of data quality and how it fits into the operational and strategic data environment. We will explore ways to formally characterize what is mostly considered a hazy concept at best, and we will study a framework for formally describing a set of data quality rules that can be used to qualify a data set.


In addition, we will look at algorithms for standardizing, cleansing, and merging data from multiple sources. We will look at the computational complexity of these algorithms as well as heuristics for improving their runtime performance.


Check this page regularly for important information.

Course syllabus



© 2001, David Loshin