Programming Assignment 3

Assigned: Nov. 1
Due: Nov. 29

Overview

In this assignment, you are to take a set of snippets produced by a search engine, and to cluster them by subject.

Input format

A sample input page, corresponding to the first nineteen results (excluding a news article) returned by Google for the query "Jaguar" (10/31/07) is here.

In general, the input will be an HTML page with the following format:

First line: "< TITLE > Query:" search query "< /TITLE >"
Second line: "< H2 > Query:" search query "< /H2>"

Then a sequence of snippets. Each snippet has the following form:

< p >
< A href=" URL "> Title of web page < /A >
Snippet text.
You may assume that there are no HTML tags in the snippet text.

Output format

The output page for the above "Jaguar" example is here.

In general, the output will have the following format:

First line: "< TITLE > Clusters for" search query "< /TITLE >"
Second line: "< H2 > Clusters for" search query "< /H2>"

Then a sequence of clusters. Each cluster has the following form:
First line: "< H3 >" Title of cluster "< /H3 >" followed by a sequence of snippets in the same format as the input.

Procedure for computing clusters

To compute the clusters, construct the following undirected graph: Every vertex is a snippet. Two vertices joined by an edge if the corresponding snippets have a contentive word in common. (How to choose contentive words is described below.) The edge from U to V is labelled with all the contentive words that the snippets share. You should use the words both in the snippet title and in the snippet text, but not in the URL. (However, if the snippet text or title contains a URL, you may consider that as part of the text/title.)

Once the graph is complete, compute the connected components of the graph. Each connected component is one cluster. The title of the cluster is the set of all the words on any label in an edge in the cluster. If a cluster consists of a single snippet, then the title of the cluster is just the name of the web page.

Regularization of text and extraction of contentive words

The contentive words in a snippet are found as follows:

The list of stop words

This algorithm strongly tends to overlump, so the list of stop words here is much longer than in many applications. It should actually be much longer than it is.

The words "Wikipedia", "free", and "encyclopedia" are marked as stop words here because the title of any Wikipedia web page contains these three words, so if these are not excluded, any two snippets from Wikipedia will be joined together.

Submitting the assignment

Please email your assignment to me not the grader (davise@cs.nyu.edu). Include the source code and a README file stating how to run the code.