Project 3: Clustering

Assigned: Oct. 25
Due: Nov. 22 (Note: This is 4 weeks.)

Input: A file with a list of URL's.
Output: A division of the web pages into clusters, based on topic.

Implement the k-means clustering algorithm for web pages, based on the content of the pages.

The input is a set of web pages and a value of K, the target number of categories. The output is a division of the pages into categories, with a label on each of the categories. The label of category C is some word or words that appears often in the pages in C and rarely in the other pages.

The remaining details of the implementation are up to you.

Experiment 1

Let K, M, and N be integers. Let us say that a query Q is "Google-ambiguous (K,M,N)" if it has the following properties:

Find 3 separate queries Q that are Google-ambiguous (2,30,8). Run the two clustering algorithms on the set S(Q,30).

Experiment 2

Find three other interesting sets of pages to partition, each of size at least 30. These can have more instances of the kinds of sets in experiment 1 or you can assemble these sets in some other way. The clustering problem on these sets should not be extremely easy. Run the algorithm on each of these three sets.

I would be particularly interested in this experiment in sets of pages that all deal with the same subject but are nonetheless clearly divided into two or more categories (e.g. pro-Kerry vs. pro-Bush descriptions of the election; reviews of some kind of merchandise vs. pages selling the merchandise; etc.). But I'm not requiring this because they may well be hard to find.

To submit

Implementation

To make the grading easier, please make sure that your code can be run on the department machines spunky and slinky. If this imposes an intolerable hardship, send me email.