## Project 3: Clustering

Assigned: Oct. 25
Due: Nov. 22 (Note: This is 4 weeks.)

Input: A file with a list of URLs.
Output: A division of the web pages into clusters, based on topic.

Implement the k-means clustering algorithm for web pages, based on the content of the pages.

The input is a set of web pages and a value of K, the target number of categories. The output is a division of the pages into categories, with a label on each of the categories. The label of category C is some word or words that appear often in the pages in C and rarely in the other pages.
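As one illustration of the labeling rule above (not a required method), a cluster's label can be chosen by scoring each word by how often it appears inside the cluster relative to the other pages. The function name and the frequency-contrast score below are my own choices; a real implementation might use TF-IDF instead.

```python
from collections import Counter

def label_for_cluster(cluster_docs, other_docs):
    """Pick the word most frequent inside the cluster and rarest outside it.

    Each document is a list of lowercase word tokens. The score is a
    simple frequency contrast: count inside / (1 + count outside), where
    the add-one smoothing avoids division by zero for cluster-only words.
    """
    inside = Counter(w for doc in cluster_docs for w in doc)
    outside = Counter(w for doc in other_docs for w in doc)
    return max(inside, key=lambda w: inside[w] / (1 + outside[w]))

# Toy example: "jaguar" appears everywhere, so it is a poor label;
# "car" is frequent in the first cluster and absent outside it.
docs = [["jaguar", "car", "engine"], ["jaguar", "speed", "car"],
        ["jaguar", "cat", "jungle"]]
print(label_for_cluster(docs[:2], docs[2:]))  # car
```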

The remaining details of the implementation are up to you.
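For concreteness, here is one minimal sketch of k-means over bag-of-words vectors with cosine similarity. The tokenization, the first-k centroid seeding, and the convergence test are all design choices of this sketch, not requirements of the assignment.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two sparse word-count dicts."""
    dot = sum(c * v.get(w, 0.0) for w, c in u.items())
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def kmeans(docs, k, iters=20):
    """Partition bag-of-words dicts into k clusters; returns one cluster id per doc.

    For simplicity the first k documents seed the centroids; real k-means
    would use random restarts or k-means++ style initialization.
    """
    centroids = [dict(d) for d in docs[:k]]
    assign = [0] * len(docs)
    for _ in range(iters):
        # Assignment step: each page joins the most similar centroid.
        new = [max(range(k), key=lambda c: cosine(d, centroids[c])) for d in docs]
        if new == assign:
            break  # converged: no page changed cluster
        assign = new
        # Update step: each centroid becomes the mean vector of its members.
        for c in range(k):
            members = [d for d, a in zip(docs, assign) if a == c]
            if members:
                total = Counter()
                for d in members:
                    total.update(d)
                centroids[c] = {w: n / len(members) for w, n in total.items()}
    return assign

# Toy pages: two about fruit, two about Linux (interleaved so the
# first-k seeding picks one page from each topic).
pages = [Counter("apple fruit pie".split()), Counter("linux kernel code".split()),
         Counter("fruit apple tart".split()), Counter("kernel linux patch".split())]
print(kmeans(pages, 2))  # [0, 1, 0, 1]
```

In practice you would build the word-count vectors from the downloaded page text after HTML stripping and stopword removal.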

### Experiment 1

Let K, M, and N be integers. Let us say that a query Q is "Google-ambiguous (K,M,N)" if it has the following properties:
• Q has at least K different meanings Q1 ... QK.
• Let S(Q,M) be the set of the first M pages returned by Google for query Q. Let SI be the subset of S(Q,M) in which Q has clearly been interpreted with meaning QI. Then each SI must have size at least N.
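Once you have hand-labeled each of the M result pages with the meaning it uses, the (K,M,N) condition can be checked mechanically. A small sketch, where the function name is my own and `None` marks pages whose meaning is unclear:

```python
from collections import Counter

def is_google_ambiguous(meaning_labels, k, n):
    """Check the (K, M, N) condition on M hand-labeled result pages.

    meaning_labels: one meaning label per returned page (None = unclear).
    Returns True if at least k distinct meanings each cover at least n pages.
    """
    counts = Counter(m for m in meaning_labels if m is not None)
    return sum(1 for c in counts.values() if c >= n) >= k

# Hypothetical counts for the query "jaguar", hand-labeled car vs. animal:
labels = ["car"] * 14 + ["animal"] * 10 + [None] * 6
print(is_google_ambiguous(labels, k=2, n=8))  # True: both meanings cover >= 8 pages
```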

Find three separate queries Q that are Google-ambiguous (2,30,8). Run your clustering algorithm on each set S(Q,30).

### Experiment 2

Find three other interesting sets of pages to partition, each of size at least 30. These can be further instances of the kinds of sets in Experiment 1, or you can assemble them in some other way. The clustering problem on these sets should not be extremely easy. Run the algorithm on each of these three sets.

For this experiment I would be particularly interested in sets of pages that all deal with the same subject but are nonetheless clearly divided into two or more categories (e.g. pro-Kerry vs. pro-Bush descriptions of the election, or reviews of some kind of merchandise vs. pages selling that merchandise). I am not requiring this, though, because such sets may well be hard to find.

### To submit

• Your source code.
• A report on the 6 experiments:
  • Describe your implementation of the algorithm.
  • What were the input sets of pages?
  • What results did the algorithm return?
  • Discuss the results. To what extent did the algorithm find meaningful categories? Which pages were misclassified? Which clusters were incoherent? Which were overly split? Which labels were unsuggestive?
  • Any other interesting issues that came up.

### Implementation

To make the grading easier, please make sure that your code can be run on the department machines spunky and slinky. If this imposes an intolerable hardship, send me an email.