Programming Assignment 3
Assigned: Mar. 30
Due: Apr. 27
In this assignment, you are to take a set of snippets produced by a search
engine, and to cluster them by subject.
A sample input page, corresponding to the first nineteen results (excluding
a news article) returned
by Google for the query "Jaguar" (10/31/07) is
In general, the input will be an HTML page with the following format:
First line: "< TITLE > Query:" search query "< /TITLE >"
Second line: "< H2 > Query:" search query "< /H2>"
Then a sequence of snippets. Each snippet has the following form:
< p >
You may assume that there are no HTML tags in the snippet text.
< A href=" URL "> Title of web page < /A >
Bonus brownie points
If you feel like writing a separate transducer to convert actual Google output
to this format, and getting this working and mailed to me at least a week
before the due date, so that I can post it, I will consider it a favor and
will give some extra brownie points for that.
The output page for the above "Jaguar" example is
In general, the output will have the following format:
First line: "< TITLE > Clusters for" search query "< /TITLE >"
Second line: "< H2 > Clusters for" search query "< /H2>"
Then a sequence of clusters. Each cluster has the following form:
First line: "< H3 >" Title of cluster "< /H3 >"
followed by a sequence of snippets in the same format as the input.
If you want to generate output in a sexier format, that's probably OK, but
check with me.
Procedure for computing clusters
You may use any procedure you wish to carry out the assignment; however, I
recommend the following. If you decide something else, it would be a good
idea to check with me first by email. There is no gold standard for
clustering, but your code does have to do a fairly reasonable job on the
test examples we will use to test it.
To compute the clusters, construct the following undirected
graph: Every vertex is a snippet. Two vertices joined by an edge if
the corresponding snippets have a contentive word in common. (How to choose
contentive words is described below.) The edge from U to V is labelled
with all the contentive words that the snippets share. You should use
the words both in the snippet title and in the snippet text, but not in
the URL. (However, if the snippet text or title contains a URL, you may
consider that as part of the text/title.)
Once the graph is complete, compute the connected components of the graph.
Each connected component is one cluster. The title of the cluster is
the set of all the words on any label in an edge in the cluster. If a cluster
consists of a single snippet, then the title of the cluster is just the
name of the web page.
Regularization of text and extraction of contentive words
The contentive words in a snippet are found as follows:
- A word is a sequence of alphabetical characters a-z,A-Z. Assume that
any punctuation breaks a word, and do not worry about non-standard characters.
- Convert all words to lower case.
- Discard all stop words, all words of fewer than three characters, and
the words in the query itself.
A list of stop words is given here.
- For any word more than 3 letters long ending in "s", construct
the singular by discarding the "s". (Obviously this is a very
poor stemming algorithm, but I think it's (slightly) better than nothing.)
Alternatively you may use the Porter Stemming Algorithm, which is standard,
though not necessarily optimal. Code for the Porter Stemming Algorithm
in a variety of languages can be found
In either case, the title of the clumps should contain one of the original
forms of the word, not the output of the stemming (which is often not a word).
The list of stop words
This algorithm strongly tends to overlump, so the list of stop words here
is much longer than in many applications. It should actually be much
longer than it is.
The words "Wikipedia", "free", and "encyclopedia" are marked as stop words
here because the title of any Wikipedia web page contains these three
words, so if these are not excluded, any two snippets from Wikipedia will
be joined together.
If you want to make the algorithm more sophisticated or to add on additional
features for extra credit, feel free. If you have done anything that you
consider really interesting, please submit the assignment to me in addition
to the grader. However, I strongly advise you to get the basic system working
before adding bells and whistles. I will not sympathize if you tell me that
you were unable to complete the assignment because you were too ambitious in
how you set about it.
Submitting the assignment
Please email the grader
(a) the source code; (b) a README file stating how to run the code;
(c) a brief, high-level description
of any significant changes or extensions that you've
made to the basic form of the assignment.