Project 2: Indexing and Querying

Assigned: Sept. 26
Due: Oct. 17

In this assignment, you will build two programs: an indexer and a query answerer.

Indexer

The indexer takes as command-line input the name of a directory of files. You may assume that all the files here can be treated as plain text documents, or that they are all HTML files. It outputs an inverted file. The inverted file indexes under each word W:

Also generate a supplementary index that records the title of each file (or whatever other summary information you want to present on the results page).

As in project 1, you can take a "word" to be a sequence of at least 3 alphabetical characters, normalized to lower case, that is not on the list of stop words. "Position" is counted in terms of these "words". For example, in this file "proj2.html" the first several words, in order, would be "title", "project", "indexing", "querying", "title", "project", "indexing", "querying", "assigned", "sept", "due", "oct", "assignment", "build", "two", "programs", "indexer", ...
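The word definition above can be sketched in Python as follows. This is only an illustration, not a required design: the stop list shown is a hypothetical stand-in for the real one, "alphabetical" is assumed to mean ASCII letters, positions are assumed to start at 0, and the names tokenize and build_index are made up for this sketch.

```python
import re
from collections import defaultdict

# Hypothetical stop list for illustration only; substitute the real list.
STOP_WORDS = {"and", "the", "this", "you", "will", "for"}

def tokenize(text, stop_words=STOP_WORDS):
    """Return the "words" of text in order: runs of letters (assumed ASCII),
    lowercased, at least 3 characters long, and not on the stop list."""
    runs = re.findall(r"[A-Za-z]+", text)
    return [w for w in (r.lower() for r in runs)
            if len(w) >= 3 and w not in stop_words]

def build_index(docs):
    """Map each word to {file name: [positions]}.
    Positions count kept words only, starting at 0 (an assumption)."""
    index = defaultdict(dict)
    for name, text in docs.items():
        for pos, word in enumerate(tokenize(text)):
            index[word].setdefault(name, []).append(pos)
    return index

words = tokenize("<title>Project 2: Indexing and Querying</title>")
# words == ["title", "project", "indexing", "querying", "title"]
```

Note that markup is not stripped here, so the tag name "title" is picked up as a word twice, which agrees with the example word sequence given above.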

Query answering

The query answerer should take from input a query in the format specified below. It should consult the inverted file (NOT the original text files) and generate its answer in the form of a "results" page, as in project 1.

The query language is specified as follows:

A term is either a word, as defined above, or an alphabetic prefix followed by a wildcard "*".

A literal is either a term or "term1 WITHIN K1, K2 term2", where K1 and K2 are integers (positive, negative, or zero) with K1 <= K2.

A query is either a literal or a structure built out of literals with AND, OR, or WITHOUT, grouped by parentheses. The query Q1 WITHOUT Q2 is the set of all pages that satisfy Q1 but do not satisfy Q2.
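One possible way to evaluate literals against an inverted file is sketched below in Python. It makes several assumptions that are not part of the specification: the index is assumed to be already loaded as a dictionary of the form word -> {file name: [positions]}; "term1 WITHIN K1, K2 term2" is read as "some occurrence of term2 lies between K1 and K2 word positions after some occurrence of term1" (negative values meaning before); and all function names are hypothetical. Parsing of the query string is not shown.

```python
def expand_term(index, term):
    """Return the indexed words matched by a term:
    the word itself, or every word starting with the prefix before '*'."""
    if term.endswith("*"):
        prefix = term[:-1]
        return [w for w in index if w.startswith(prefix)]
    return [term] if term in index else []

def docs_for_term(index, term):
    """Set of files containing at least one word matched by the term."""
    docs = set()
    for word in expand_term(index, term):
        docs.update(index[word])
    return docs

def positions(index, term, doc):
    """All positions in doc of words matched by the term."""
    found = []
    for word in expand_term(index, term):
        found.extend(index[word].get(doc, []))
    return found

def within(index, term1, k1, k2, term2):
    """Files where k1 <= pos(term2) - pos(term1) <= k2 for some pair of
    occurrences. This reading of WITHIN is an assumption of the sketch."""
    result = set()
    for doc in docs_for_term(index, term1) & docs_for_term(index, term2):
        p1s = positions(index, term1, doc)
        p2s = positions(index, term2, doc)
        if any(k1 <= p2 - p1 <= k2 for p1 in p1s for p2 in p2s):
            result.add(doc)
    return result

# Tiny hand-made index, for illustration only.
sample = {
    "index":   {"a.html": [0], "b.html": [4]},
    "indexer": {"a.html": [3]},
    "query":   {"a.html": [1], "b.html": [0]},
}

near = within(sample, "index", 1, 2, "query")          # {"a.html"}
wild = docs_for_term(sample, "ind*")                   # {"a.html", "b.html"}
# Q1 WITHOUT Q2 is a set difference; AND and OR map to & and | likewise.
without = docs_for_term(sample, "query") - docs_for_term(sample, "indexer")
```

Because every literal evaluates to a set of files, a parenthesized query can be evaluated recursively by combining the sets of its sub-queries with intersection, union, and difference.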

Example queries: