Project 3: Clustering
Assigned: Oct. 25
Due: Nov. 22 (Note: This is 4 weeks.)
Input: A file with a list of URL's.
Output: A division of the web pages into clusters, based on topic.
Implement the k-means clustering algorithm for web pages, based on the
content of the pages.
The input is a set of web pages and a value of K, the target number of
The output is a division of the pages
into categories, with a label on each of the categories. The label of
is some word or words that appears often in the pages in C and rarely
in the other pages.
The remaining details of the implementation are up to you.
Let K, M, and N be integers.
Let us say that a query Q is "Google-ambiguous (K,M,N)" if it has the following
- Q has at least K different meanings Q1 ... QK.
- Let S(Q,M) be the set of the first M pages returned by Google for query Q.
Let SI be the subset of S where clearly Q has been "interpreted"
with meaning QI. Then each SI has size at least N.
Find 3 separate queries Q that are Google-ambiguous (2,30,8).
Run the two clustering algorithms on the set S(Q,30).
Find three other interesting sets of pages to partition, each of size
at least 30. These can have more instances of the kinds of sets in experiment
1 or you can assemble these sets in some other way. The clustering
problem on these sets should not be extremely easy. Run the algorithm
on each of these three sets.
I would be particularly interested in this experiment in sets of pages
that all deal with the same subject but are nonetheless clearly
divided into two or more categories (e.g. pro-Kerry vs. pro-Bush
descriptions of the election; reviews of some kind of merchandise vs.
pages selling the merchandise; etc.). But I'm not requiring this because
they may well be hard to find.
- Your source code.
- A report on the 6 experiments.
- Describe your implementation of the algorithms.
- What were the input sets of pages?
- What results did the algorithms return?
- Discuss the results. To what extent did the algorithms find meaningful
What pages were misclassified? What clusters were incoherent? What
clusters were overly split? What labels are unsuggestive?
- Any other interesting issues that came up.
To make the grading easier, please make sure that your code can be run
on the department machines spunky and slinky. If this imposes an
intolerable hardship, send me email.