Project 2: Indexing and Querying
Assigned: Sept. 26
Due: Oct. 17
In this assignment, you will build two programs: an indexer and
a query answerer.
The indexer takes as command-line input the name of a directory of files.
You may assume that all the files here can be treated as plain text documents,
or that they are all HTML files.
It outputs an inverted file. The inverted file indexes under each word W:
1) The inverse document frequency = log((total number of documents in directory)
/ (number of documents containing W)).
2) For each document D containing W, the name of D; the value
YDW (the measure of the relevance of document D to word
W --- see Lecture Notes 2 for formula);
and a list of all the positions (in terms of word count) of W in
Also generate supplementally an index which records the title for each
file (or whatever other summary information you want to present on the
As in project 1, you can take a "word" to be a sequence of at least 3
alphabetical characters, normalized to lower case, not on the list of
stop words . "Position" is in terms of these
"words". E.g. in this file "proj2.html" the first several words in order
"title", "project", "indexing", "querying", "title"
"project", "indexing", "querying", "assigned", "sept", "due", "oct",
"two", "programs", "indexer", ...
The query answerer should take from input a query in the format specified
below. It should consult with the inverted file --- NOT with the original
text files --- and generate its answer in the form of a "results" page,
as in project 1. The query language is specified as follows: A term
is either a word, as defined above, or an alphabetic prefix followed by
a wildcard "*". A literal is either a term or
"term1 WITHIN K1, K2 term2" where K1, K2 are integers (positive,
negative, or zero), K1 < = K2. A query is either a literal or
a structure built out of literals with AND, OR, or WITHOUT grouped by
parentheses. The query Q1 WITHOUT Q2 is the set of all pages that satisfy
Q1 but do not satisfy Q2.
- 1. year
- 2. year*
- 3. seven WITHIN 0,3 year*
- 4. seven AND war
- 5. seven OR war
- 6. seven WITHIN -2,4 war WITHOUT cucumber
- 7. (seven OR (rather WITHIN -2,4 living)) WITHOUT
((cucumber OR watercress)
The order of the pages returned should be in decreasing order of the
vector similarity measure between the document D and the vector
of words that appear positively (i.e. not in the right-hand side
of a WITHOUT operator)
E.g. queries 4, 5, 6 above all return different
sets of pages, but the order is the same; the similarity measure between
the vector < seven, war > and the document vector.
If you have one WITHOUT operator imbedded in another, then the terms inside
the internal operator are still not considered as positive, for the
purposes of the vector sum. For example, in the query
"Canada WITHOUT (Newfoundland WITHOUT Labrador)" the query vector
is "Canada" not "Canada Labrador".
You may either write the query-answering code to answer a single query,
or to answer a series of queries until the user exits.
As in project 1, your code must run on the department's Sun system.
You may use any pre-existing general database system or code that you
find there or can get to run there, but do not use pre-existing
IR (information retrieval) code. As in project 1, I am not much concerned
It is important that query answering should be implemented quite efficiently.
In particular, if the query involves only uncommon words (words that only
appear a few times in a small number of documents) then the time for
be very short and essentially independent, both of the overall size of
the lexicon and of the total number of documents in the directory. Queries
on very common words will necessarily take longer, but should still be
implemented as efficiently as possible.
The index-building program should run in time linear in the total size
of the documents.
You may assume that the inverted file fits comfortably in main memory.
You need not, therefore, worry about optimizing disk I/O in interacting
with the inverted file.
You should email to your TA:
You need not send a sample output file.
- Your source code.
- A README file that states what your project does, and how to run it.
The following simplified version will be worth 85% of the grade: Omit
the operators OR and WITHOUT from the language. In this version,
a query is thus either a literal or a series of literals connected
by AND. A query can therefore be represented as a list of literals,
rather than requiring a tree as in the full assignment, and the
language no longer needs parentheses. This should substantially simplify
the construction of the query parser, and somewhat simplify the construction
of the retrieval engine.
You can, if you want to, submit the partial assignment on time for 85%,
and then resubmit the full assignment a week late for 90% (taking the
late penalty into account.)