Programming Assignment 3
Assigned: Nov. 1
Due: Nov. 29
In this assignment, you are to take a set of snippets produced by a search
engine, and to cluster them by subject.
A sample input page, corresponding to the first nineteen results (excluding
a news article) returned
by Google for the query "Jaguar" (10/31/07) is
In general, the input will be an HTML page with the following format:
First line: "< TITLE > Query:" search query "< /TITLE >"
Second line: "< H2 > Query:" search query "< /H2>"
Then a sequence of snippets. Each snippet has the following form:
< p >
You may assume that there are no HTML tags in the snippet text.
< A href=" URL "> Title of web page < /A >
The output page for the above "Jaguar" example is
In general, the output will have the following format:
First line: "< TITLE > Clusters for" search query "< /TITLE >"
Second line: "< H2 > Clusters for" search query "< /H2>"
Then a sequence of clusters. Each cluster has the following form:
First line: "< H3 >" Title of cluster "< /H3 >"
followed by a sequence of snippets in the same format as the input.
Procedure for computing clusters
To compute the clusters, construct the following undirected
graph: Every vertex is a snippet. Two vertices joined by an edge if
the corresponding snippets have a contentive word in common. (How to choose
contentive words is described below.) The edge from U to V is labelled
with all the contentive words that the snippets share. You should use
the words both in the snippet title and in the snippet text, but not in
the URL. (However, if the snippet text or title contains a URL, you may
consider that as part of the text/title.)
Once the graph is complete, compute the connected components of the graph.
Each connected component is one cluster. The title of the cluster is
the set of all the words on any label in an edge in the cluster. If a cluster
consists of a single snippet, then the title of the cluster is just the
name of the web page.
Regularization of text and extraction of contentive words
The contentive words in a snippet are found as follows:
- A word is a sequence of alphabetical characters a-z,A-Z. Assume that
any punctuation breaks a word, and do not worry about non-standard characters.
- Convert all words to lower case.
- Discard all stop words, all words of fewer than three characters, and
the words in the query itself.
A list of stop words is given here.
- For any word more than 3 letters long ending in "s", discard the
"s". (Obviously this is a very poor stemming algorithm, and
will cause a lot of non-words to end up in cluster titles, but I think
it's (slightly) better than nothing.)
The list of stop words
This algorithm strongly tends to overlump, so the list of stop words here
is much longer than in many applications. It should actually be much
longer than it is.
The words "Wikipedia", "free", and "encyclopedia" are marked as stop words
here because the title of any Wikipedia web page contains these three
words, so if these are not excluded, any two snippets from Wikipedia will
be joined together.
Submitting the assignment
Please email your assignment to me not the grader (firstname.lastname@example.org).
Include the source code and a README file stating how to run the code.