## Project 3: Clustering

Assigned: Oct. 25
Due: Nov. 22 (Note: This is 4 weeks.)

Input: A file with a list of URL's.
Output: A division of the web pages into clusters, based on topic.

Implement the k-means clustering algorithm for web pages, based on the content of the pages.

The input is a set of web pages and a value of K, the target number of categories. The output is a division of the pages into categories, with a label on each of the categories. The label of category C is some word or words that appears often in the pages in C and rarely in the other pages.

The remaining details of the implementation are up to you.

### Experiment 1

Let K, M, and N be integers. Let us say that a query Q is "Google-ambiguous (K,M,N)" if it has the following properties:
• Q has at least K different meanings Q1 ... QK.
• Let S(Q,M) be the set of the first M pages returned by Google for query Q. Let SI be the subset of S where clearly Q has been "interpreted" with meaning QI. Then each SI has size at least N.

Find 3 separate queries Q that are Google-ambiguous (2,30,8). Run the two clustering algorithms on the set S(Q,30).

### Experiment 2

Find three other interesting sets of pages to partition, each of size at least 30. These can have more instances of the kinds of sets in experiment 1 or you can assemble these sets in some other way. The clustering problem on these sets should not be extremely easy. Run the algorithm on each of these three sets.

I would be particularly interested in this experiment in sets of pages that all deal with the same subject but are nonetheless clearly divided into two or more categories (e.g. pro-Kerry vs. pro-Bush descriptions of the election; reviews of some kind of merchandise vs. pages selling the merchandise; etc.). But I'm not requiring this because they may well be hard to find.