Data Mining

New York University, Spring 2003

Motivation and Goals:

We live in the Age of Information. It is now widely recognized that collecting data that reflect your business or scientific activities is important for achieving competitive advantage. Powerful systems for collecting data and managing it in large databases are in place in all large and mid-range companies. However, the bottleneck in turning this data into success is the difficulty of extracting knowledge about the system from the collected data.

- Which goods should be promoted to this customer?
- What is the probability that a certain customer will respond to a planned promotion?
- Can one predict the most profitable securities to buy/sell during the next trading session?
- Will this customer default on a loan or pay back on schedule?
- Which medical diagnosis should be assigned to this patient?
- How large will the peak loads of a telephone or energy network be?
- Why does the manufacturing facility suddenly start to produce defective goods?

These are all questions that can probably be answered if the information hidden among megabytes of data in your database can be made explicit and put to use. Modeling the investigated system and discovering the relations that connect variables in a database are the subject of data mining.

Modern data mining systems learn on their own from the previous history of the investigated system, formulating and testing hypotheses about the rules that the system obeys. When concise and valuable knowledge about the system of interest is discovered, it can and should be incorporated into a decision support system that helps managers make wise and informed business decisions.

Overview

The course will introduce the concepts and techniques of data mining and data warehousing, including their principles, architecture, design, implementation, and applications.

TOPICS:

- Introduction
- Data warehousing and OLAP technology for data mining
- Data preprocessing
- Descriptive data mining: characterization and comparison
- Association analysis
- Classification and prediction
- Cluster analysis
- Mining complex types of data
- Applications and trends in data mining

Mechanics

You must be enrolled to attend the lectures.

Prerequisites

Introductory courses in databases and fundamental algorithms. Knowledge or experience in data warehousing is a plus.

Textbooks

1) Data Mining: Concepts and Techniques, J. Han and M. Kamber, Morgan Kaufmann, 2000

2) Microsoft SQL Server 2000 Analysis Services Step by Step, Reed Jacobson, 2000

References

- The Data Warehouse Toolkit: Practical Techniques for Building Dimensional Data Warehouses by Ralph Kimball
- Data Warehouse Project Management by Sid Adelman
- Data Warehouse: From Architecture to Implementation by Barry Devlin
- Predictive Data Mining by S.M. Weiss and N. Indurkhya
- Seven Methods for Transforming Corporate Data Into Business Intelligence by Vasant Dhar
- Data Mining Techniques: For Marketing, Sales, and Customer Support by Michael Berry

Software

Freely available data mining software, such as DBMiner 2.0 (dbminer.com).

Requirements

Three homeworks (40%), class participation (5%), midterm (15%), course project (40%) (tentative). The project report is due on the last day of class. There will be no final exam.

Collaboration on the problem sets is allowed. You may work with one or two partners and sign your names to a single submitted homework. Each of you will receive the grade that the homework merits. There is no penalty for working on problem sets in teams of up to three (more than three is not allowed).

Course Project

The course project is an opportunity for student groups to investigate a data mining problem that interests them. The course project should apply data mining techniques to real-world problems. Data and software for these projects can be obtained from various internet sites, or developed by students.

A presentation of each project is required in addition to a written report.

Sample project ideas include, but are not restricted to, the following:

- Compare approaches to a particular problem on criteria such as accuracy, memory utilization, and performance. Implement several alternative approaches and rigorously compare them on data sets with distinct properties. You can also create artificial databases to test the limits of each approach. Possible comparisons include characterization methods, feature selection methods, clustering methods, and parallel data mining approaches.
- If you're interested in working as a data analyst, you are encouraged to study real-world problems and needs for data mining. Use whatever means you can find to discover interesting patterns. Suggested topics: customer segmentation, predictive models for customer retention, customer churn in telecom, mining of web logs, and fraud detection.
- A survey or research paper that could lead to tutorial material or a journal paper. Breadth, comprehensibility, and technical merit are the major considerations. You should choose this type of project only if you are familiar with the subfield you wish to survey; otherwise, you are advised against it. Suggested topics include: web usage mining, data mining and e-business, mining unstructured and semi-structured data on the WWW, text mining, spatial data mining, multimedia data mining, content-based image indexing and retrieval, and data mining applications in finance.

One example is the data from the 1998 KDD Cup data mining contest (http://kdd.ics.uci.edu/databases/kddcup98/kddcup98.html). The overall task is to develop a predictive model that optimally selects which individuals should be sent a donation request. The details of the task are described in the contest instructions.

You should implement and test at least two different methods for solving this problem. You do not need to use complex classification and regression algorithms for this task. A combination of naive Bayesian learning, linear regression, and bagging would be fine, for example. You may implement your own software or use or modify software that you obtain elsewhere (recommended).
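To make the suggested combination more concrete, here is a minimal sketch of bagging applied to a naive Bayesian learner. It is only an illustration under stated assumptions: the Gaussian form of naive Bayes, the function names (`fit_gaussian_nb`, `fit_bagged`, and so on), the number of bootstrap models, and the toy grid data are all hypothetical choices, and nothing here uses the KDD Cup files or their format.

```python
import math
import random
from collections import Counter, defaultdict


def fit_gaussian_nb(rows, labels):
    """Estimate a class prior and per-feature mean/variance for each class."""
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    model = {}
    for label, members in by_class.items():
        stats = []
        for column in zip(*members):
            mean = sum(column) / len(column)
            var = sum((x - mean) ** 2 for x in column) / len(column)
            stats.append((mean, var + 1e-6))  # small floor avoids zero variance
        model[label] = (len(members) / len(rows), stats)
    return model


def predict_gaussian_nb(model, row):
    """Return the class with the highest log posterior for one row."""
    def log_posterior(label):
        prior, stats = model[label]
        score = math.log(prior)
        for x, (mean, var) in zip(row, stats):
            score += -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)
        return score
    return max(model, key=log_posterior)


def fit_bagged(rows, labels, n_models=25, seed=0):
    """Train one naive Bayes model per bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    n = len(rows)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit_gaussian_nb([rows[i] for i in idx],
                                      [labels[i] for i in idx]))
    return models


def predict_bagged(models, row):
    """Combine the individual predictions by majority vote."""
    votes = Counter(predict_gaussian_nb(m, row) for m in models)
    return votes.most_common(1)[0][0]


def toy_data():
    """Two well-separated clusters on a grid; hypothetical stand-in data."""
    offsets = [(i * 0.1, j * 0.1) for i in range(5) for j in range(5)]
    rows = [[x, y] for x, y in offsets] + [[5 + x, 5 + y] for x, y in offsets]
    labels = [0] * len(offsets) + [1] * len(offsets)
    return rows, labels


rows, labels = toy_data()
models = fit_bagged(rows, labels)
print(predict_bagged(models, [0.2, 0.2]), predict_bagged(models, [5.2, 5.2]))
```

On real contest data you would also need to handle categorical and missing attributes and to evaluate on held-out records, which this sketch leaves out.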

Another example is predicting prices of initial public offerings. Determine how much you should pay for an initial public offering on the first day of offering. See http://www.cs.utsa.edu/~kwek/cs4793/ipoDescription.txt for a further description.

A final example is weather data. There is an abundance of weather data online. In particular, the National Climatic Data Center (http://www.ncdc.noaa.gov/) has some free datasets online. One data mining task you might try is to predict the weather in a given time period from previous time periods. Another is to use clustering to partition a region into different climates.
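As one possible starting point for the prediction task, the sketch below fits a least-squares line that predicts each value in a series from the value one period earlier. The function names and the short series are made up for illustration and are not taken from any NCDC dataset; a real project would likely use more lags and more variables.

```python
def fit_lag1(series):
    """Least-squares fit of series[t+1] = slope * series[t] + intercept.

    Assumes the series has at least three values and is not constant.
    """
    xs, ys = series[:-1], series[1:]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict_next(series, slope, intercept):
    """Predict the next period from the most recent observation."""
    return slope * series[-1] + intercept


# Hypothetical series in which each value is half the previous one plus 10.
temperatures = [0.0, 10.0, 15.0, 17.5, 18.75, 19.375]
slope, intercept = fit_lag1(temperatures)
forecast = predict_next(temperatures, slope, intercept)
print(slope, intercept, forecast)
```

Because the toy series follows the lag-1 rule exactly, the fit recovers that rule; real weather data will not, and the quality of the forecast should be checked on periods held out from fitting.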