Web Search Engines: Lecture 3. Indexing and Query Engines

Required Reading

MR&S chaps 4 and 5, secs 20.3-20.4
Inverted files for text search engines by Justin Zobel and Alistair Moffat, ACM Computing Surveys, vol. 38, no. 2, 2006.
Web Search for a Planet: The Google Cluster Architecture by Luiz André Barroso et al., IEEE Micro, 2003, vol. 23, pp. 22-28.

Recommended reading

Chakrabarti, chap. 3 through sec. 3.2.2 (p. 57)
Challenges on Distributed Web Retrieval by R. Baeza-Yates et al., Data Engineering, 2007.
The Data Center as a Computer: An Introduction to the Design of Warehouse-Scale Machines, by Luiz André Barroso and Urs Hölzle, Morgan and Claypool, 2009.

What information is indexed?

File structure for inverted index

Hash table: Fast add, retrieve. Does not support search by word prefix with wildcard.

B-tree, indexed by word. Logarithmic add, retrieve. Supports search by word prefix.

Trie. Not usable on disk. Useful in-memory data structure while compiling index.
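
For concreteness, a minimal sketch of the last option: an in-memory trie mapping terms to doc-ID lists, which (unlike a hash table) supports lookup by word prefix. The class and method names here are invented for the illustration, not taken from any real engine.

    class TrieNode:
        def __init__(self):
            self.children = {}    # character -> TrieNode
            self.postings = None  # doc-ID list, set only where a term ends

    class TermTrie:
        def __init__(self):
            self.root = TrieNode()

        def add(self, term, doc_id):
            node = self.root
            for ch in term:
                node = node.children.setdefault(ch, TrieNode())
            if node.postings is None:
                node.postings = []
            node.postings.append(doc_id)

        def prefix_search(self, prefix):
            # Walk down to the prefix node, then collect every term below it.
            node = self.root
            for ch in prefix:
                if ch not in node.children:
                    return []
                node = node.children[ch]
            results, stack = [], [(prefix, node)]
            while stack:
                term, n = stack.pop()
                if n.postings is not None:
                    results.append((term, n.postings))
                stack.extend((term + ch, child) for ch, child in n.children.items())
            return results

    # After add("index", 3) and add("indexing", 7), prefix_search("index")
    # returns both terms with their doc lists; a plain hash table would need
    # a scan of the whole vocabulary to answer the same prefix query.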

Order of docs in the document list for word W.

Order by DOC ID: Fast merging (see the intersection sketch after this list).
Order by relevance score: Fast retrieval for single-word query.
Supplement with a hash table on (W,D) pairs.
Google: Two lists: high-relevance and low-relevance, each sorted by Doc. ID.
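
The intersection sketch referred to above: with both posting lists sorted by doc ID, a two-word AND query is a single linear pass with two pointers and no random access within the lists.

    def intersect(postings_a, postings_b):
        # Both lists are sorted by doc ID; advance the pointer with the smaller ID.
        i, j, result = 0, 0, []
        while i < len(postings_a) and j < len(postings_b):
            if postings_a[i] == postings_b[j]:
                result.append(postings_a[i])
                i += 1
                j += 1
            elif postings_a[i] < postings_b[j]:
                i += 1
            else:
                j += 1
        return result

    # intersect([2, 5, 9, 14], [5, 9, 30])  ->  [5, 9]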

Distributed index

A. Divide by term. Each document generates word-list entries for several different servers, depending on which server owns each of its terms. Queries with one word, or with all words on the same server, are in luck. For a query involving multiple words on multiple servers, the word lists must be shipped to a single machine and the merging performed there.

B. Divide by document. Each document goes to one index builder. The query is sent to all servers, and the answers are merged in decreasing order of score. This is what is done in practice.
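
A rough sketch of option B, assuming each (hypothetical) shard object can answer the whole query over its own documents and return its top k results as (score, doc ID) pairs in decreasing order of score; the front end only has to merge.

    import heapq
    import itertools

    def answer_query(query, shards, k=10):
        # Fan the whole query out to every document-partitioned index server.
        # Each shard.top_k(query, k) is assumed to return (score, doc_id) pairs
        # already sorted by decreasing score.
        per_shard = [shard.top_k(query, k) for shard in shards]
        # Merge the per-shard streams into a single global top-k list.
        merged = heapq.merge(*per_shard, key=lambda pair: -pair[0])
        return list(itertools.islice(merged, k))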

Compression techniques

These are important (a) to save disk space (b) to reduce disk traffic (c) to improve cache hits.

Inverted file

Gap encoding. If there are N documents and the word appears in P of them, then an absolute document index takes log N bits, but the average gap size is an integer of about N/P and hence requires log N - log P bits. So, e.g., if N = 10 billion and P = 100,000 (a pretty rare word, occurring in only 1/100,000 of the docs), then each entry in the listing of docs associated with the word is reduced from 33 bits to about 17 bits. The more common the word, the higher the degree of compression.

Also, if the distribution is non-uniform, then the savings are greater. E.g., suppose that the word appears in documents with ID numbers 1-50,000 and 9,999,950,001-10,000,000,000. Then gap encoding has about 100,000 gaps of size 1, requiring 1 bit apiece, plus 1 gap of about 10 billion, requiring 33 bits, for an average of very slightly more than 1 bit per entry. (Rough approximation, but right idea.)
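
A small sketch of the arithmetic, with helper names of my own choosing: turn a sorted doc-ID list into gaps (and back), and redo the bit estimates from the example above.

    import math

    def to_gaps(doc_ids):
        # doc_ids sorted ascending: keep the first ID, then successive differences.
        return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

    def from_gaps(gaps):
        # Cumulative sums recover the original doc IDs.
        ids, total = [], 0
        for g in gaps:
            total += g
            ids.append(total)
        return ids

    N, P = 10_000_000_000, 100_000
    print(math.log2(N))      # ~33 bits for an absolute doc ID
    print(math.log2(N / P))  # ~17 bits for an average gap of N/P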

Gap encoding of positions works in the same way.

Note: it takes a little work to come up with a coding scheme that achieves (close to) this compression but still can encode large gaps when needed. See references.
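
For instance, Elias gamma coding (one of the bit-level codes covered in the references) writes each gap as a unary length prefix followed by the gap in binary, so a gap g costs roughly 2 log g bits yet arbitrarily large gaps remain encodable. It is not the best possible code for a given word frequency (Golomb coding comes closer), but it shows the flavor. A toy sketch using strings of '0'/'1' characters in place of real bit-packing:

    def gamma_encode(gap):
        # Elias gamma: (length - 1) zeros, then the gap in binary (which starts with 1).
        assert gap >= 1
        binary = bin(gap)[2:]                      # e.g. 5 -> '101'
        return '0' * (len(binary) - 1) + binary    # 5 -> '00101'

    def gamma_decode(bits):
        # Decode a concatenation of gamma codes back into the list of gaps.
        gaps, i = [], 0
        while i < len(bits):
            zeros = 0
            while bits[i] == '0':
                zeros += 1
                i += 1
            gaps.append(int(bits[i:i + zeros + 1], 2))
            i += zeros + 1
        return gaps

    # ''.join(gamma_encode(g) for g in [1, 5, 1000000]) decodes back to
    # [1, 5, 1000000]: the small gaps take 1 and 5 bits, the huge gap 39 bits.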

Note: These bit-saving techniques cost you at decompression time, because the data is split up at the bit level, so you have to do a lot of fiddly bit manipulation. (Also at indexing time, of course, but that doesn't matter as much.) However, it is always worth trading off a good bit of CPU time to save disk accesses.

Further compression techniques and more detail in Zobel and Moffat and in Witten et al., Managing Gigabytes.

Link database

See MR&S section 20.4. Here compression matters less, because the link database is hardly involved in query answering (only for in-link queries, which are rare). It is mostly used for PageRank, which is computed offline.

Exploit two regularities

So you do the following:
1. Sort the URL's lexicographically and assign DocID's
2. Use gap encoding. Because of (A) and (C), the gaps tend to be small.
3. Record the outlink list from page P as a modification of the list for a page Q whose doc ID is among the 7 preceding P's.

Using these you can reduce the file to an average of 3 bits per link (as opposed to 33 bits per link for the full DocID of 10 billion docs).
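
A rough sketch of steps 2 and 3, with invented helper names and a crude cost measure (counting stored integers rather than bits): each page's sorted outlink list is either gap-encoded directly or stored as a diff against the list of a page among the 7 preceding doc IDs, whichever is smaller.

    def to_gaps(sorted_ids):
        # First ID, then successive differences (an empty list stays empty).
        return sorted_ids[:1] + [b - a for a, b in zip(sorted_ids, sorted_ids[1:])]

    def cost(enc):
        # Crude size estimate: how many integers the encoding stores.
        return len(enc["gaps"]) if "gaps" in enc else 1 + len(enc["add"]) + len(enc["del"])

    def encode_outlinks(doc_id, outlinks, recent):
        # recent: doc_id -> outlink set for pages already encoded.
        targets = sorted(set(outlinks))
        best = {"gaps": to_gaps(targets)}              # direct gap encoding
        for ref_id in range(doc_id - 7, doc_id):       # reference within previous 7
            if ref_id not in recent:
                continue
            ref = recent[ref_id]
            candidate = {"ref": ref_id,
                         "add": sorted(set(targets) - ref),
                         "del": sorted(ref - set(targets))}
            if cost(candidate) < cost(best):
                best = candidate
        recent[doc_id] = set(targets)
        return best

In a real system the added/deleted lists and the gaps would themselves be bit-encoded, and the reference relation has to be bounded so that decoding a list never chases a long chain of references.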

Modified from Barroso et al. 2003.

Steps in query answering: (Partly from Barroso, partly conjectural).

Note: One can deal with query expansion (by related words) either at query time, by expanding the query, or at indexing time, by indexing docs under each related word. I find it hard to believe you would do the latter, since it costs hugely in space (unless you use shared structures -- not easy with disk-based data structures) with little savings of time at query time. However, Barroso and Holzle 2007 do mention it as an option.

Query answering on each index server

Single term: For each highly relevant doc, combine relevance score with query-independent doc score. Sort.
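
A sketch of the single-term case with a made-up combination rule (a simple weighted sum; the actual formula is not public): each posting carries a term relevance score, which gets combined with a query-independent score for the document, and the docs are sorted.

    def single_term_results(postings, doc_score, weight=0.5):
        # postings:   list of (doc_id, term_relevance) for the query term
        # doc_score:  doc_id -> query-independent score (e.g. link-based)
        scored = [(weight * rel + (1 - weight) * doc_score.get(d, 0.0), d)
                  for d, rel in postings]
        scored.sort(reverse=True)          # decreasing overall score
        return [d for _, d in scored]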

Multiple terms T1, T2, ..., Tk: For each doc D, do a modified intersection of the positional lists to find proximal groups and Boolean combinations. If there is a NOT constraint in the query, exclude any doc that violates the constraint. Compute scores for the proximal groups, combine them with the relevance scores for the individual words and the query-independent score for D to get an overall score for D on Q. Sort.
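
A sketch of one ingredient of this, the modified intersection of positional lists within a single document; the window size and the notion of a "group" here are placeholders of mine.

    def proximal_groups(position_lists, window=8):
        # position_lists: one sorted list of word positions per query term, all
        # for the same document. Report spans that contain one occurrence of
        # every term and fit inside the window.
        pointers = [0] * len(position_lists)
        groups = []
        while all(p < len(lst) for p, lst in zip(pointers, position_lists)):
            current = [lst[p] for p, lst in zip(pointers, position_lists)]
            lo, hi = min(current), max(current)
            if hi - lo < window:
                groups.append((lo, hi))
            pointers[current.index(lo)] += 1   # advance the leftmost term
        return groups

    # proximal_groups([[3, 47], [5, 90]])  ->  [(3, 5)]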

Exact phrase: Either proceed as with multiple terms --- better for rare combinations of common words --- or find the most unusual word in the phrase, locate all its occurrences, and check the neighborhood for the phrase --- better for a rare word surrounded by stop words.
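
A sketch of the second strategy against a hypothetical positional-index interface (doc_freq, postings, and positions are invented accessor names): anchor on the rarest word and check that the other words of the phrase sit at the right offsets.

    def phrase_docs(phrase_terms, index):
        # index.doc_freq(term)          -> number of docs containing term
        # index.postings(term)          -> iterable of (doc_id, sorted position list)
        # index.positions(term, doc_id) -> set of positions of term in doc_id
        anchor_i = min(range(len(phrase_terms)),
                       key=lambda i: index.doc_freq(phrase_terms[i]))
        anchor = phrase_terms[anchor_i]
        matches = []
        for doc_id, anchor_positions in index.postings(anchor):
            for pos in anchor_positions:
                start = pos - anchor_i
                if all((start + i) in index.positions(term, doc_id)
                       for i, term in enumerate(phrase_terms) if i != anchor_i):
                    matches.append(doc_id)
                    break
        return matches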

Exact phrase of stop words: The only way I can see to do this is to index every exact phrase of stop words of more than a certain degree of rarity, up to a certain length. Of course queries of this kind are very rare, and the answers probably don't have to be very accurate. (Certainly the order is essentially meaningless.)