New York University, Spring 2003
Motivation and Goals:
We live in the Age of Information. The importance of collecting data that reflect your business or scientific activities to achieve competitive advantage is now widely recognized. Powerful systems for collecting data and managing them in large databases are in place in all large and mid-range companies. However, the bottleneck in turning these data into success is the difficulty of extracting knowledge about the system from the collected data.
These are all questions that can probably be answered if the information hidden among megabytes of data in your database can be found explicitly and utilized. Modeling the investigated system and discovering relations that connect variables in a database are the subject of data mining.
Modern data mining systems learn on their own from the history of the investigated system, formulating and testing hypotheses about the rules that the system obeys. When concise and valuable knowledge about the system of interest is discovered, it can and should be incorporated into a decision support system that helps managers make wise and informed business decisions.
The course will introduce the concepts and techniques of data mining and data warehousing, including their principles, architecture, design, implementation, and applications.
You must be enrolled to attend the lectures.
Introductory courses in databases and fundamental algorithms. Knowledge or experience in data warehousing is a plus.
1) Data Mining: Concepts and Techniques, J. Han and M. Kamber, Morgan Kaufmann, 2000.
2) Microsoft SQL Server 2000 Analysis Services Step by Step, Reed Jacobson, 2000.
Freely available data mining software such as DBMiner 2.0 (dbminer.com)
Three homeworks (40%), class participation (5%), midterm (15%), course project (40%) (tentative). The project report is due on the last day of class. There will be no final exam.
Collaboration on the problem sets is allowed. You may work together with one or two partners and sign your names to a single submitted homework. You will receive the grade that the homework merits. There is no penalty for working on problem sets in teams of up to three (more than three is not allowed).
The course project is an opportunity for student groups to investigate a data mining problem that interests them. The course project should apply data mining techniques to real-world problems. Data and software for these projects can be obtained from various internet sites, or developed by students.
A presentation of each project is required in addition to a written report.
Sample project ideas include, but are not restricted to, the following:
An example is data from the 1998 KDD Cup data mining contest (http://kdd.ics.uci.edu/databases/kddcup98/kddcup98.html). The overall task is to develop a predictive model that selects optimally which individuals should be sent a donation request. The details of the task are described in the contest instructions.
You should implement and test at least two different methods for solving this problem. You do not need to use complex classification and regression algorithms for this task. A combination of naive Bayesian learning, linear regression, and bagging would be fine, for example. You may implement your own software or use or modify software that you obtain elsewhere (recommended).
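As a rough illustration of how simple such a combination can be, the sketch below bags a small Gaussian naive Bayes classifier and compares it with a single model. Everything in it is an assumption made for illustration: the function names are hypothetical, and the two-feature synthetic data only stands in for the contest data, which would first need loading and preprocessing.

```python
import math
import random
from collections import Counter


def fit_gaussian_nb(rows, labels):
    """Estimate a class prior and per-feature (mean, variance) for each class."""
    model = {}
    for cls in set(labels):
        members = [r for r, y in zip(rows, labels) if y == cls]
        stats = []
        for col in zip(*members):
            mean = sum(col) / len(col)
            # Small constant keeps the variance strictly positive.
            var = sum((v - mean) ** 2 for v in col) / len(col) + 1e-6
            stats.append((mean, var))
        model[cls] = (len(members) / len(rows), stats)
    return model


def predict_nb(model, row):
    """Return the class with the highest log posterior for one row."""
    def log_posterior(cls):
        prior, stats = model[cls]
        score = math.log(prior)
        for v, (mean, var) in zip(row, stats):
            score += -0.5 * math.log(2 * math.pi * var) - (v - mean) ** 2 / (2 * var)
        return score
    return max(model, key=log_posterior)


def fit_bagged(rows, labels, n_models=15, rng=None):
    """Train one naive Bayes model per bootstrap sample (drawn with replacement)."""
    rng = rng or random.Random(0)
    n = len(rows)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit_gaussian_nb([rows[i] for i in idx],
                                      [labels[i] for i in idx]))
    return models


def predict_bagged(models, row):
    """Combine the individual predictions by majority vote."""
    votes = Counter(predict_nb(m, row) for m in models)
    return votes.most_common(1)[0][0]


def make_synthetic(n, rng):
    """Synthetic stand-in data: label 1 (about 30% of rows) is shifted by 2.0."""
    rows, labels = [], []
    for _ in range(n):
        y = int(rng.random() < 0.3)
        center = 2.0 if y else 0.0
        rows.append([rng.gauss(center, 1.0), rng.gauss(center, 1.0)])
        labels.append(y)
    return rows, labels


def accuracy(predict, rows, labels):
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(rows)


rng = random.Random(42)
train_x, train_y = make_synthetic(300, rng)
test_x, test_y = make_synthetic(200, rng)

single = fit_gaussian_nb(train_x, train_y)
bagged = fit_bagged(train_x, train_y, rng=rng)

acc_single = accuracy(lambda r: predict_nb(single, r), test_x, test_y)
acc_bagged = accuracy(lambda r: predict_bagged(bagged, r), test_x, test_y)
print(f"single model accuracy: {acc_single:.3f}")
print(f"bagged model accuracy: {acc_bagged:.3f}")
```

Naive Bayes is a fairly stable learner, so bagging may change its accuracy only slightly. Comparing the two numbers on held-out data is the kind of test the project asks for.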
Another example is predicting the prices of initial public offerings: determine how much you should pay for an initial public offering on the first day of the offering. See http://www.cs.utsa.edu/~kwek/cs4793/ipoDescription.txt for a further description.
A final example is weather data. There is an abundance of weather data online. In particular, the National Climatic Data Center (