Instructor: David Loshin
Office hours: Wednesday, 5:00pm-6:00pm
Time: 6:00-8:20pm Wednesday
Location: Room 102, CIWW
If you intend to take this class, please send e-mail with your name to the address firstname.lastname@example.org.
The class email list web page is http://www.cs.nyu.edu/mailman/listinfo/g22_3033_001_su00
Office hours are 5:00-6:00 PM Wednesday in room 401.
Over the past 30 years, advances in data collection and database technology have led to massive legacy databases controlled by legacy software. The implicit programming paradigm embeds both business policies and data validation policies in application code. Yet most legacy applications are maintained by second- and third-generation engineers, and it is rare to find any staff members with first-hand experience in either the design or the implementation of the original system. As a result, organizations maintain significant ongoing investments in the daily operations and maintenance of the information processing plant, while mostly ignoring the tremendous potential of the intellectual capital that is captured within the data assets.
In this course we will investigate the notion of data quality and how it fits into the operational and strategic data environment. We will explore ways to formally characterize what is usually considered a hazy concept at best, and we will study a framework for formally describing a set of data quality rules that can be used to qualify a data set.
In addition, we will look at algorithms for standardizing, cleansing, and merging data from multiple sources. We will look at the computational complexity of these algorithms as well as heuristics for improving their runtime performance.
Check regularly for important information.