You are required to complete two course projects, of which at least
one must be implementational.
Any implementational or experimental projects may be done in a group
of two or three people. Of course, the more collaborators on a project,
the more I shall expect. Two individuals may not collaborate on
two different projects.
Due dates: The first project is due Oct. 26. The second is
due Nov. 30. A letter grade will be deducted for each week that the
project is late. (E.g. if the project is worth an A- but is a week
late, it will get a B-.)
At least two weeks before the due date of each project, send me a
brief description for my approval. This should include at least:
- The name of the person/people doing the project.
- A summary of what you plan to do.
- A list of any existing software/data sets you plan to use.
- A fairly precise description of the deliverables, particularly
I can think of three categories of projects: implementational, experimental,
and theoretical (meaning, writing a paper). Feel free to propose projects
that do not fall within any of these categories, or that combine more than
Format: Source code should be emailed to me. Write-ups may be
handed in hard-copy or emailed in one of the following formats:
(1) ASCII text; (2) PostScript; (3) HTML; (4) a URL for a web page
in one of the above formats. No other formats will be accepted.
In particular, do not send me .DOC files, Word files, or anything MIME
The deliverables for each category of project are characteristically
as follows. (Of course, this will vary somewhat depending on the
content of the project.)
Implementational projects: These may be done in any language
that I can compile and run on the Sun network.
- Commented source code -- send by email.
- A sample of the test data set -- send by email
- A write-up of what the code does and how it works. Include a
description of any interesting issues raised or problems encountered.
- Instructions for how to run the code.
- Test results. These may be fairly informal.
Theoretical project: The deliverable is a paper of not less than
4000 words. It should show substantial reading in the literature
and serious original thought.
Experimental projects: Take an existing piece of software and
test it on some data set. You should use a data set of substantial size,
probably not less than fifty documents.
- A statement of the object of the experiment.
- Specification of the software and data sets used.
- A detailed evaluation of test results. This should include
most, if not all, of the following:
- Measure of success rate, of CPU time, and of clock time. If the
data falls in different categories, these should be broken down by
- An analysis of the statistical significance of these results.
- Qualitative analysis of test results. This will vary widely, depending
on the particular experiment, but includes issues such as: How useful is
the output? What features in the data gave trouble to the software?
Where, if anywhere, did the software do better than you expected?
How usable was the software? How well designed is the interface?
Is this a practical tool, and, if so, for what purposes?
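As one illustration of the statistical-significance item above, here is a minimal sketch of an exact two-sided sign test, assuming you have run two systems on the same documents and counted how often each one did better (ties dropped). The function name and the choice of test are assumptions made for illustration only; the test you use should fit your own experimental design.

```python
from math import comb


def sign_test_p_value(wins_a: int, wins_b: int) -> float:
    """Two-sided exact sign test on paired comparisons (ties excluded).

    wins_a and wins_b are hypothetical counts: the number of test
    documents on which system A (resp. system B) gave the better result.
    """
    n = wins_a + wins_b
    if n == 0:
        # No decisive comparisons, so there is no evidence either way.
        return 1.0
    k = min(wins_a, wins_b)
    # Probability, under a fair coin, that the losing side wins k or fewer times.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    # Double the tail for a two-sided test, capped at 1.
    return min(1.0, 2 * tail)


if __name__ == "__main__":
    print(sign_test_p_value(9, 1))
```

A small p-value suggests that the difference between the two systems is unlikely to be due to chance alone; with only a handful of test documents, even a lopsided count may not be significant.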
You are encouraged to come up with your own ideas; those listed below
are just suggestions. If you come up with any ideas for projects that
you don't want to do yourself, but you think someone else might like,
please email them, and I'll add them to the list.
- Home page parser for professors: Given a home page in HTML, extract
name, position, university, department, degrees, age, whatever else.
- Conference announcement parser: Given a home page for a conference in
HTML, extract name, location, date, chair.
- Implement a vector-based similarity measure for text documents.
- Implement a cluster analysis over some collection of documents.
- Implement a similarity measure for some class of multimedia
documents, such as images or audio.
- Implement a query engine for some highly marked-up document. For
example, in principle the following questions could be answered
from the DEI version of Hamlet: Which scenes contain both Hamlet and
Gertrude? How many times does Claudius speak immediately after Hamlet?
What is Hamlet's favorite non-stop word? Define and implement a query
language of this kind.
Alternatively, write a program to convert such a document into database
form, and use a standard database query language to answer such questions.
- Implement a parallel/distributed retrieval system.
- Implement a data compression algorithm. Test the effectiveness
of your algorithm over various kinds of data.
- Eric Stedfeld, who works for the NYU Libraries, has suggested
a number of projects which would be of actual value to the libraries.
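To give a sense of scale for the vector-based similarity suggestion in the list above, here is a minimal sketch of cosine similarity over raw term-frequency vectors. The whitespace tokenization, the lack of stop-word removal or term weighting, and the function names are all simplifying assumptions; an actual project would be expected to go well beyond this.

```python
import math
from collections import Counter


def term_vector(text: str) -> Counter:
    """Map a document to raw term counts (lowercased, split on whitespace)."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two term-count vectors."""
    # Only terms present in both documents contribute to the dot product.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        # An empty document is treated as similar to nothing.
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    d1 = term_vector("to be or not to be")
    d2 = term_vector("to be is to do")
    print(cosine_similarity(d1, d2))
```

The same pairwise measure could serve as the distance function for the cluster analysis suggestion.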
- Using existing IR tools, test the impact on retrieval of
such text manipulations as stop-word removal, stemming, synonyms, etc.
- Test the effectiveness of an image compression method on various
kinds of images: e.g. page images of print pages, page images of
handwriting, naturalistic images, graphic designs, etc. If the
compression method is lossy, characterize the loss.
- Study the effectiveness of some existing multimedia retrieval
- Test the quality of Web search engines. Note: the hard part of this
is to design the test in a way that you can argue that the results are
- Test the accuracy of OCR programs on various types of print. What
happens if the OCR program is supplemented with a spelling checker?
What is the impact of the errors in digital documents created by OCR
on IR programs attempting to retrieve the document?
- Define a measurement of the accuracy of a machine translation system.
Measure the quality of the Alta Vista machine translation system.
- Design a mark-up DTD (document type definition) for some category of
documents. How would you
evaluate such a DTD? That is, if someone else presented an alternative DTD
for the same category, what kinds of arguments or tests would you use
to determine which one was better?
- Discuss the pros and cons of electronic vs. print scientific journals.
How should an electronic journal be organized, administered, edited, paid for?
- Discuss the problem of retrieval from a software library. That is,
the user wishes to locate a subroutine (function, method, class, module ...)
that will be useful for a specified purpose. How can software be categorized?
- What will the World Wide Web be like 20 years from now? What are the major
technical obstacles to be overcome to achieve this?