Programming Assignment 4

Assigned: Apr. 11
Due: May 2

Overview

In this assignment, you are to take a set of snippets produced by a search engine, and to cluster them by subject.

Input format

A sample input page, corresponding to the first nineteen results (excluding a news article) returned by Google for the query "Jaguar" (10/31/07) is here.

In general, the input will be an HTML page with the following format:

First line: "< TITLE > Query:" search query "< /TITLE >"
Second line: "< H2 > Query:" search query "< /H2>"

Then a sequence of snippets. Each snippet has the following form:

< p >
< A href=" URL "> Title of web page < /A >
Snippet text.
You may assume that there are no HTML tags in the snippet text.

Output format

The output page for the above "Jaguar" example is here.

In general, the output will have the following format:

First line: "< TITLE > Clusters for" search query "< /TITLE >"
Second line: "< H2 > Clusters for" search query "< /H2>"

Then a sequence of clusters. Each cluster has the following form:
First line: "< H3 >" Title of cluster "< /H3 >" followed by a sequence of snippets in the same format as the input.

If you want to generate output in a sexier format, that's probably OK, but check with me.

Procedure for computing clusters

You may use any procedure you wish to carry out the assignment; however, I recommend the following. If you decide something else, it would be a good idea to check with me first by email. There is no gold standard for clustering, but your code does have to do a fairly reasonable job on the test examples we will use to test it.

My suggestion: To compute the clusters, construct the following undirected graph: Every vertex is a snippet. Two vertices joined by an edge if the corresponding snippets have a contentive word in common. (How to choose contentive words is described below.) The edge from U to V is labelled with all the contentive words that the snippets share. You should use the words both in the snippet title and in the snippet text, but not in the URL. (However, if the snippet text or title contains a URL, you may consider that as part of the text/title.)

Once the graph is complete, compute the connected components of the graph. Each connected component is one cluster. The title of the cluster is the set of all the words on any label in an edge in the cluster. If a cluster is a single snippet, then the title of the cluster is just the name of the web page.

Regularization of text and extraction of contentive words

The contentive words in a snippet are found as follows:

The list of stop words

This algorithm strongly tends to overlump, so the list of stop words here is much longer than in many applications. It should actually be much longer than it is.

The words "Wikipedia", "free", and "encyclopedia" are marked as stop words here because the title of any Wikipedia web page contains these three words, so if these are not excluded, any two snippets from Wikipedia will be joined together.

Extensions

If you want to make the algorithm more sophisticated or to add on additional features for extra credit, feel free. However, I strongly advise you to get the basic system working before adding bells and whistles. I will not sympathize if you tell me that you were unable to complete the assignment because you were too ambitious in how you set about it.

Submitting the assignment

Please email the grader (a) the source code; (b) a README file stating how to run the code; (c) a brief, high-level description of any significant changes or extensions that you've made to the basic form of the assignment.