Project 1: Web Crawler

Assigned: Sept. 5
Due: Sept. 26

In this assignment, you will build a specialized web crawler that implements a specific crawling strategy.

As a starting point, I have written a minimal Web Crawler in Java. You can also look at the code described in Programming Spiders, Bots, and Aggregators in Java by Jeff Heaton, chapter 8. (Note: This is accessible online for free through an NYU account. You can also buy it in hard copy, with a CD-ROM.) If you feel ambitious, you could try working with the SPHINX package, which is a lot larger and more complex.

Your code MUST run on the department's Sun system. So, before submitting it, you should upload it to your Sun account and make sure it compiles and runs there. If the TA has to go back and forth with you to get fixes made, there will be a 5-point deduction out of 100. Be sure to leave enough time for this; the need to make your code compatible is not a valid reason to request an extension on the due date. You may write your program in any language supported on the Sun system.

CRAWLER COURTESY: VERY IMPORTANT

Crawlers are potentially dangerous. Therefore:

System Load

Parsing

You can assume that things are simple, and that links in HTML files have the form
<a href="URL">
with no white space or weird characters in the middle of the URL. You don't have to worry about the case where this appears inside some other HTML construct (such as a comment). See the sample code (when it's written).
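Under this simplifying assumption, link extraction can be done with a single regular expression. The class and method names below are my own sketch, not part of the provided starter code:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Sketch of link extraction under the simplifying assumption above:
    // every link has the form <a href="URL"> with no whitespace in the URL.
    public class LinkExtractor {
        // Case-insensitive match for <a href="...">; the capture group is the URL.
        private static final Pattern LINK =
            Pattern.compile("<a\\s+href=\"([^\"\\s]+)\"", Pattern.CASE_INSENSITIVE);

        public static List<String> extractLinks(String html) {
            List<String> urls = new ArrayList<>();
            Matcher m = LINK.matcher(html);
            while (m.find()) {
                urls.add(m.group(1));
            }
            return urls;
        }
    }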

Error checking

You can assume that there are no errors in the input to your program. However, you do want your code to be robust under errors in the Web pages you're crawling. If an error is encountered, feel free, if necessary, simply to skip the page where it occurs. The SPHINX code is more than adequate in this regard.
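One simple way to realize "skip the page on error" is to wrap the fetch in a try/catch and return nothing on failure. This is only a sketch (it assumes a reasonably recent Java for InputStream.readAllBytes), and the names are my own:

    import java.io.IOException;
    import java.io.InputStream;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;

    // Sketch: if anything goes wrong while fetching or reading a page,
    // skip it and move on to the next URL in the queue.
    public class SafeFetcher {
        public static String fetchOrNull(String url) {
            try (InputStream in = new URL(url).openStream()) {
                return new String(in.readAllBytes(), StandardCharsets.UTF_8);
            } catch (IOException e) {
                // Malformed URL, network failure, unreadable content: skip this page.
                return null;
            }
        }
    }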

Efficiency

Don't be ridiculously inefficient; e.g. in projects 2 and 3 you want to use a hash-table for the list of words, not an unsorted linear list. But there's no need to deliver turbo-charged algorithms or implementations. Don't worry about memory constraints; if your program runs out of space and dies on encountering a large file, that's OK. You do not have to use multiple threads; sequential downloading is OK.
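For instance, the list of words can simply be a Java HashSet, which gives expected constant-time membership tests (a sketch, not a requirement):

    import java.util.HashSet;
    import java.util.Set;

    public class WordListDemo {
        public static void main(String[] args) {
            // The list L of words seen so far, kept as a hash set:
            // contains() is expected O(1) rather than a linear scan.
            Set<String> seen = new HashSet<>();
            seen.add("crawler");
            System.out.println(seen.contains("crawler")); // true
            System.out.println(seen.contains("spider"));  // false
        }
    }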

Deliverables

You should email your deliverables to your TA. You need not send a sample output file.

Projects

I list a number of possible projects below. If you want to do one of these, that's fine. (Just ONE, not ALL.) If you have an idea for a different project, you should email me (davise@cs.nyu.edu) a description for my approval by September 12. I suggest that you work on a project that uses some interesting criterion for ordering or restricting the crawl; other projects are OK, but will have to be fairly imaginative to be approved.

Suggested projects

In all of the following, you should download only HTML files. The output of all the following should be a "search results" page, such as you get from a search engine. At minimum, it should be an HTML file with a link for every page you have downloaded. If the page has a title, then the anchor of the link should be the title; otherwise, the anchor of the link should be the URL. If you want to do something more interesting, by all means do so.
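As a rough illustration of that minimum, the results page might be generated along the following lines. The helper below is my own sketch (it does no HTML escaping), not a prescribed format:

    import java.io.IOException;
    import java.io.PrintWriter;
    import java.util.Map;

    // Sketch: write a minimal "search results" page with one link per downloaded
    // page, using the page title as the anchor when one exists, else the URL.
    public class ResultsPage {
        public static void write(Map<String, String> urlToTitle, String outFile)
                throws IOException {
            try (PrintWriter out = new PrintWriter(outFile)) {
                out.println("<html><head><title>Search Results</title></head><body>");
                for (Map.Entry<String, String> e : urlToTitle.entrySet()) {
                    String url = e.getKey();
                    String title = e.getValue();  // may be null if the page had no title
                    String anchor = (title == null || title.isEmpty()) ? url : title;
                    out.println("<p><a href=\"" + url + "\">" + anchor + "</a></p>");
                }
                out.println("</body></html>");
            }
        }
    }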

Project 1: Subject-specific crawl

Input: Starting URL, list of words, max number of pages to download.
Crawl: Any of the following variants (Do ONE of these):
Output: Search results page.

Project 2: Subject-focussed crawl

Input: Starting URL, max number of pages to download.
Crawl: Maintain a list L of the words found in the pages seen so far. For each page P, compute the fraction of the words in P that occur in L. Give priority to the page with the highest fraction.
Output: Search results page and a list of the 20 most common words in these pages.

A word in the text is three or more consecutive alphabetical characters. Exclude stop words. Do case-insensitive matching. If you want to use a different criterion for words, that's OK, but be sure to describe your criterion in the README file.
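Under this default criterion, the word extraction and the fraction used to order the crawl might look roughly like this. This is only a sketch: the stop-word list is a placeholder, and the fraction is computed over distinct words, which you may reasonably vary. The same fraction, with the opposite priority, is used in Project 3 below.

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Sketch for Projects 2 and 3: extract words (3+ alphabetic characters,
    // lower-cased, stop words dropped) and compute the fraction of a page's
    // words that already occur in the running list L.
    public class WordFraction {
        private static final Pattern WORD = Pattern.compile("[a-zA-Z]{3,}");
        // Placeholder stop-word list; a real one would be much longer.
        private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "and", "for", "with"));

        public static Set<String> words(String text) {
            Set<String> result = new HashSet<>();
            Matcher m = WORD.matcher(text);
            while (m.find()) {
                String w = m.group().toLowerCase();
                if (!STOP_WORDS.contains(w)) result.add(w);
            }
            return result;
        }

        // Fraction of the page's distinct words that already occur in L.
        // Project 2 gives priority to the highest fraction; Project 3 to the lowest.
        public static double fraction(Set<String> pageWords, Set<String> L) {
            if (pageWords.isEmpty()) return 0.0;
            int inL = 0;
            for (String w : pageWords) {
                if (L.contains(w)) inL++;
            }
            return (double) inL / pageWords.size();
        }
    }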

Project 3: Subject-dispersal crawl

Input: Starting URL, max number of pages to download.
Crawl: Maintain a list L of the words found in the pages seen so far. For each page P, compute the fraction of the words in P that occur in L. Give priority to the page with the lowest fraction.
Output: Search results page.

Project 4: Link-based focussed crawl

Input: Starting URL, max number of pages to download.
Crawl: Give priority to the page with the maximum number of in-links from pages already seen. In case of ties, prefer the page with the maximum number of out-links to pages already seen.
Output: Search results page, in order of descending number of in-links within the set.
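A simple way to keep track of these counts is a map from URL to the number of in-links seen so far; the class below is my own sketch, not a required design:

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Sketch for Project 4: every time a crawled page links to a URL, bump that
    // URL's in-link count; the frontier is then ordered by this count.
    public class InLinkCounts {
        private final Map<String, Integer> inLinks = new HashMap<>();

        public void recordLinks(List<String> outLinksOfCrawledPage) {
            for (String url : outLinksOfCrawledPage) {
                inLinks.merge(url, 1, Integer::sum);
            }
        }

        // Number of in-links from pages already seen; 0 if never linked to.
        public int count(String url) {
            return inLinks.getOrDefault(url, 0);
        }
    }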

Project 5: Link-based dispersal crawl

Input: Starting URL, max number of pages to download.
Crawl: Crawl only from pages with exactly one in-link from pages already seen. Compute (number of out-links to pages not already seen) divided by (total number of out-links). Give priority to the page that maximizes this fraction.
Output: Search results page.
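The dispersal fraction for a candidate page might be computed along these lines (again just a sketch, with names of my own choosing):

    import java.util.List;
    import java.util.Set;

    // Sketch for Project 5: the score of a candidate page is the fraction of its
    // out-links that point to pages not already seen; crawl the highest score first.
    public class DispersalScore {
        public static double score(List<String> outLinks, Set<String> seen) {
            if (outLinks.isEmpty()) return 0.0;
            int unseen = 0;
            for (String url : outLinks) {
                if (!seen.contains(url)) unseen++;
            }
            return (double) unseen / outLinks.size();
        }
    }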