Project 1: Subject-Specific Web Crawler

NOTE: CHANGES MADE

I have changed the criteria below that you should use in choosing links to follow, and the form of the input.

Assigned: Sept. 13
Due: Oct. 4

The object of this project is to write a crawler that tries to download pages dealing with a particular subject.

Specification

Inputs:

Output: An HTML page with

CHANGED: In choosing to download a link L from file F, your program should consider

How you want to combine these is up to you. Among links that score equally according to the above criteria, the crawler should choose the link closest to the starting page. (For example, if no links satisfy any of the above criteria, then the crawler should just do a breadth-first search.)
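One way to get this tie-breaking behavior is to keep the links waiting to be downloaded in a priority queue. The following is only a sketch under assumptions: the class name `Frontier`, the `Candidate` record of a link, and the single numeric `score` are all hypothetical, and how the score is computed from the criteria is left to you. Links are ordered by score (highest first), then by distance from the starting page (smallest first), then by insertion order.

```java
import java.util.Comparator;
import java.util.PriorityQueue;

/** Hypothetical frontier of links waiting to be downloaded. */
final class Frontier {
    /** One link, with the values used to rank it. */
    static final class Candidate {
        final String url;
        final double score; // assumed combined relevance score; higher is better
        final int depth;    // number of links followed from the starting page
        final long seq;     // insertion order, used only to break remaining ties

        Candidate(String url, double score, int depth, long seq) {
            this.url = url;
            this.score = score;
            this.depth = depth;
            this.seq = seq;
        }
    }

    // Highest score first; among equal scores, smallest depth first;
    // among equal depths, first inserted first.
    private final PriorityQueue<Candidate> queue = new PriorityQueue<>(
            Comparator.comparingDouble((Candidate c) -> c.score).reversed()
                    .thenComparingInt(c -> c.depth)
                    .thenComparingLong(c -> c.seq));
    private long nextSeq = 0;

    void add(String url, double score, int depth) {
        queue.add(new Candidate(url, score, depth, nextSeq++));
    }

    /** Returns the URL to download next, or null if the frontier is empty. */
    String next() {
        Candidate c = queue.poll();
        return c == null ? null : c.url;
    }

    public static void main(String[] args) {
        Frontier f = new Frontier();
        f.add("http://example.com/deep", 0.0, 2);
        f.add("http://example.com/relevant", 1.5, 3);
        f.add("http://example.com/shallow", 0.0, 1);
        String url;
        while ((url = f.next()) != null) {
            System.out.println(url);
        }
    }
}
```

Running `main` prints the `relevant` link first, then `shallow`, then `deep`. If every link has the same score, the order depends only on depth and insertion order, which is exactly a breadth-first search.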

Extra credit (NEW)

For extra credit, you may implement any of the following features:

Deliverables

Email to the TA Zhongshan Zhang (zhongsha@cs) and to me (davise@cs.nyu.edu):

You should choose your experiments so that a simple breadth-first search will do badly, but a well-designed crawler could do well. For instance, an experiment that started from a hub page with fifty links to pages, all of which were relevant to the subject, would not be a good experiment: too easy. An experiment that chose a subject that is discussed in only one page on the Web would not be a good experiment: too hard.

Some examples

I have here some examples of subjects and links. Your experiments must include at least one example that is not on this list, and that is not being done by other students.

I find that, on the whole, good subjects for this kind of experiment tend to be subjects that attract a lot of interest from amateurs.

Electronic resources

In general, you may use any suitable electronic resources that you find on the Web. As mentioned above, these must be cited in your report. You use any of these at your own risk; neither the TA nor I will help you with problems you have with any of these, except the code that I've provided myself.

If you should happen to find on the Web something that fits this assignment exactly, let me know.

Crawlers

There are all kinds of code for crawlers on the Web, which you may use. As a starting point, I have written a minimal Web Crawler in Java. You can also look at the code described in Programming Spiders, Bots, and Aggregators in Java by Jeff Heaton, chapter 8. (Note: This is accessible online for free through an NYU account. You can also buy it in hard copy, with a CD-ROM.) If you feel ambitious, you could try working with the SPHINX package, which is a lot larger and more complex.

Natural Language tools

Your program may use natural language tools such as online thesauruses. I have here a list of stop words, which may be useful.
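As an illustration of how a stop-word list might be used, the sketch below reduces a piece of text to its content words before any matching against the subject. It is an assumption-laden example, not part of the provided code: the class name `SubjectTerms` is hypothetical, the seven stop words are placeholders for the full list, and the tokenizer handles only unaccented English letters. It requires Java 9 or later for `Set.of`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper that strips stop words from text. */
final class SubjectTerms {
    // Placeholder stop words; a real run would load the full list from a file.
    static final Set<String> STOP_WORDS =
            Set.of("a", "an", "and", "in", "of", "the", "to");

    /** Lowercases the text, splits on anything that is not a-z, and drops stop words. */
    static List<String> contentWords(String text) {
        List<String> words = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                words.add(token);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        // Prints [history, kites, japan]
        System.out.println(contentWords("The history of kites in Japan"));
    }
}
```

The same function can be applied both to the subject description and to the text being scored, so that common words such as "the" and "of" do not count as matches.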

CRAWLER COURTESY: VERY IMPORTANT

Crawlers are potentially dangerous. Therefore:

Robustness and Efficiency

You need not deal with the kind of robustness and efficiency issues that we discussed in class:

Group and Variant Projects

I'm open to suggestions. You can do a group project, but it will have to be proportionally bigger than the project described here. If you have an idea for a crawler project that you think would be more worthwhile or fun for you than this one, feel free to propose it.

Grading

A program that achieves the functionality specified here but is poorly coded and uncommented, with poorly chosen experiments and an inadequate report, will get 70/100. To get 100/100, the program must achieve the functionality and be well coded and well commented, with well-chosen experiments and a good report.

Late Policy

"On time" means at class time on the due date. Programs submitted late will get a penalty of 2 points out of 100 per day late, up to a maximum of 20 points.
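The penalty rule works out to min(20, 2 × days late). A small sketch of the arithmetic follows; the class name `LatePolicy` is hypothetical, and the code assumes lateness is given as a whole number of days, since how a partial day is counted is not stated here.

```java
/** Hypothetical illustration of the late-penalty arithmetic. */
final class LatePolicy {
    static final int POINTS_PER_DAY = 2; // points deducted per day late
    static final int MAX_PENALTY = 20;   // cap on the total deduction

    /** Penalty in points (out of 100) for a submission daysLate whole days late. */
    static int penalty(int daysLate) {
        if (daysLate <= 0) {
            return 0; // on time
        }
        return Math.min(MAX_PENALTY, POINTS_PER_DAY * daysLate);
    }

    public static void main(String[] args) {
        System.out.println(penalty(3));  // 3 days late: 6 points
        System.out.println(penalty(14)); // 14 days late: capped at 20 points
    }
}
```

The cap is reached at 10 days late; after that the penalty does not grow.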