Web Search Engines: Lecture 3. Indexing and Query Engines
MR&S chaps 4 and 5, secs 20.3-20.4
Inverted files for text search engines by Justin Zobel and
Alistair Moffat, ACM Computing Surveys, vol. 38, no. 2, 2006,
Web Search for a Planet: The Google Cluster Architecture
by Luiz André Barroso et al., IEEE Micro, 2003, vol. 23, pp. 22-28.
Chakrabarti, chap. 3 through sec. 3.2.2 (p. 57)
Challenges on Distributed Web Retrieval by R. Baeza-Yates et al.
Data Engineering, 2007.
The Data Center as a Computer: An Introduction to the Design of
Warehouse-Scale Machines, by Luiz André Barroso and
Urs Hölzle, Morgan and Claypool, 2009.
What information is indexed?
Index (Inverted file)
For each term, record the documents containing it. For each document, record:
relevance measure, list of occurrences, use in anchor text. Occurrences
may be identified by byte (for indexing into cached copy); by word
(for word-based proximity); and/or by structure key (for structural queries).
Document table
For each document, record: Document ID, URL, outlinks, inlinks, query-independent
measure, metadata: language, format, date, frequency of update, etc.
Document cache, for snippet generation and perhaps exact-phrase search.
For each word: related words with scores of degree of relatedness.
The complete results page is cached for the K most popular queries, for some
fairly large value of K.
(How this interacts with personalization, I don't know.)
Google Instant is presumably also run off this cache. Each prefix links
to associated results page.
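A minimal sketch of such a results-page cache (all names hypothetical): an LRU cache keyed on the normalized query string. A real system would presumably evict by popularity rather than recency, and would have to account for personalization, as noted above.

```python
from collections import OrderedDict

class ResultsPageCache:
    """LRU cache mapping a normalized query string to a rendered results
    page. Capacity k bounds the number of cached queries."""

    def __init__(self, k):
        self.k = k
        self.pages = OrderedDict()

    def get(self, query):
        key = query.strip().lower()          # unequivocal regularization
        page = self.pages.get(key)
        if page is not None:
            self.pages.move_to_end(key)      # mark as recently used
        return page

    def put(self, query, page):
        key = query.strip().lower()
        self.pages[key] = page
        self.pages.move_to_end(key)
        if len(self.pages) > self.k:
            self.pages.popitem(last=False)   # evict least recently used

cache = ResultsPageCache(k=2)
cache.put("Jaguar", "<page for jaguar>")
assert cache.get("  jaguar ") == "<page for jaguar>"
```

Note that normalization happens before the cache lookup, matching the query-answering steps below: two queries differing only in case or whitespace hit the same entry.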
File structure for inverted index
Hash table: Fast add, retrieve. Does not support search by
word prefix with wildcard.
B-tree, indexed by word. Logarithmic add, retrieve. Supports search
by word prefix.
Trie. Not usable on disk. Useful as an in-memory data structure while
building the index.
Order of docs for word W in document list.
Order by DOC ID: Fast merging.
Order by relevance score: Fast retrieval for single-word query.
Supplement by hash-table for W,D pair.
Google: Two lists: high-relevance and low-relevance, each sorted
by Doc. ID.
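The "fast merging" that doc-ID order buys is a linear-time intersection of two sorted postings lists; a minimal sketch:

```python
def intersect(p1, p2):
    """Intersect two postings lists sorted by doc ID.
    Runs in time linear in the total length of the two lists."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1      # advance the list with the smaller doc ID
        else:
            j += 1
    return answer

assert intersect([1, 3, 5, 8], [2, 3, 8, 9]) == [3, 8]
```

With relevance-score order instead, this walk is impossible, which is why a supplementary hash table on (W, D) pairs is needed there.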
Dividing the index among servers
A. Divide by term. Each document generates a word list for each term it
contains; each list goes to the server responsible for that term.
Queries with 1 word, or with all words on the same server,
are in luck. For a query involving multiple words on multiple servers,
the word lists must be retrieved to a single machine and the merge performed
there.
B. Divide by document. Each document goes to one index builder.
The query is sent to all servers, and the answers are
merged in decreasing order of score. This is what is done in practice.
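A sketch of the merge step in scheme B, assuming each index server returns its hits as (score, docid) pairs already sorted in decreasing score (the two-server data and all names are invented):

```python
import heapq

def merge_server_results(per_server):
    """Merge per-server result lists, each already sorted in decreasing
    score, into one globally sorted list. reverse=True tells heapq.merge
    that its inputs (and output) run largest-first."""
    return list(heapq.merge(*per_server, key=lambda p: p[0], reverse=True))

s1 = [(0.9, "d3"), (0.4, "d7")]
s2 = [(0.8, "d1"), (0.5, "d2")]
assert merge_server_results([s1, s2]) == [
    (0.9, "d3"), (0.8, "d1"), (0.5, "d2"), (0.4, "d7")]
```

In practice each server would send only its top few hundred hits, since only some depth of results pages is ever assembled.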
Compression techniques are important (a) to save disk space;
(b) to reduce disk traffic; (c) to improve cache hits.
Gap encoding. If there are N documents and the word appears in P of
them, then the absolute document index has log N bits, but the
average gap size is an integer of about size N/P and hence requires
log N - log P bits. So, e.g. if N = 10 billion (say) and P = 100,000
(a pretty rare word, occurring in only 1/100,000 docs) then the listing
of docs associated with the word is reduced from 33 bits to 17 bits.
The more common the word, the higher the degree of compression. Also, if
the distribution is non-uniform, then the savings is greater. E.g.
suppose that the word appears in documents with ID numbers 1-50,000 and
9,999,950,001-10,000,000,000. Then gap encoding has 99,999 gaps of size
1, requiring 1 bit apiece, plus 1 gap of nearly 10 billion, requiring 33 bits, for
an average of very slightly more than 1 bit per entry. (Rough approximation,
but right idea.)
Gap encoding of position works in same way.
Note: it takes a little work to come up with a coding scheme that achieves
(close to) this compression but still can encode large gaps when needed.
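One standard scheme of this kind (not necessarily what any particular engine uses) is the variable-byte code: it packs at the byte level rather than the bit level, so it gives up some compression, but it encodes arbitrarily large gaps and is cheap to decode. A sketch:

```python
def to_gaps(docids):
    """Convert an ascending doc-ID list to gaps (first entry from 0)."""
    return [d - p for p, d in zip([0] + docids, docids)]

def vbyte_encode(gaps):
    """Variable-byte code: 7 data bits per byte, high-order bytes first,
    top bit set on the final byte of each number. A gap under 128 costs
    1 byte; a gap of 10 billion (34 bits) costs 5 bytes."""
    out = bytearray()
    for g in gaps:
        chunk = []
        while True:
            chunk.append(g & 0x7F)   # low-order 7 bits
            g >>= 7
            if g == 0:
                break
        chunk[0] |= 0x80             # mark the last (low-order) byte
        out.extend(reversed(chunk))  # emit high-order bytes first
    return bytes(out)

def vbyte_decode(data):
    gaps, g = [], 0
    for b in data:
        if b & 0x80:                 # final byte of this gap
            gaps.append((g << 7) | (b & 0x7F))
            g = 0
        else:
            g = (g << 7) | b
    return gaps

docids = [5, 6, 131, 10_000_000_000]
gaps = to_gaps(docids)               # [5, 1, 125, 9_999_999_869]
assert vbyte_decode(vbyte_encode(gaps)) == gaps
```

Bit-level codes such as Golomb or Elias gamma get closer to the log(N/P)-bit ideal, at the cost of exactly the fiddly bit manipulation discussed below.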
Note: These bit-saving techniques cost you at decompression time, because the
data is split up at the bit level, so you have to do a lot of fiddly bit
manipulation. (Also at indexing time, of course, but that doesn't matter
as much.) However, it is always worth trading off
a good bit of CPU time to save disk accesses.
Further compression techniques and more detail in Zobel and Moffat and in
Witten et al., Managing Gigabytes.
Compressing the link structure
See MR&S section 20.4.
Here compression matters less, because the link index is hardly involved in
query answering (only for in-link queries, which are rare). It is mostly used
for PageRank, which is computed offline.
Exploit three regularities:
- A. The outlinks from page P all tend to go to the same site. Likewise the
inlinks.
- B. If page P is close to Q in terms of the web site, then P and Q tend
to have similar outlink/inlink lists.
- C. Closeness in web site can be gotten by sorting URL's lexicographically.
So you do the following:
1. Sort the URL's lexicographically and assign DocID's
2. Use gap encoding. Because of (A) and (C), the gaps tend to be small.
3. Record the outlink list from page P as a modification of
the list for some page Q whose doc ID is among the 7 previous.
Using these you can reduce the file to an average of 3 bits per link (as
opposed to 33 bits per link for the full DocID of 10 billion docs).
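A toy illustration of steps 1 and 2 (the URLs and links are invented; step 3, encoding a list as a modification of a nearby doc's list, is omitted):

```python
# Assign doc IDs in lexicographic URL order, then gap-encode each outlink
# list. Because links mostly stay within a site, and same-site URLs get
# adjacent doc IDs, the gaps come out small.
urls = ["a.com/x", "a.com/y", "a.com/z", "b.org/p", "b.org/q"]
doc_id = {u: i for i, u in enumerate(sorted(urls))}

def gap_encode(ids):
    """First entry absolute, each later entry as a gap from its predecessor."""
    ids = sorted(ids)
    return [ids[0]] + [b - a for a, b in zip(ids, ids[1:])]

outlinks = {"a.com/x": ["a.com/y", "a.com/z", "b.org/p"]}
encoded = {u: gap_encode([doc_id[v] for v in links])
           for u, links in outlinks.items()}
assert encoded["a.com/x"] == [1, 1, 1]   # small gaps, few bits each
```

A real encoder (e.g. the Boldi-Vigna WebGraph format) would then bit-code these small numbers, which is where the roughly 3-bits-per-link figure comes from.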
Steps in query answering (modified from Barroso et al. 2003; partly conjectural):
Note: One can deal with query expansion (by related words) either at query
time, by expanding the query, or at indexing time, by indexing docs under
each related word. I find it hard to believe you would do the latter, since
it costs hugely in space (unless you use shared structures -- not easy
with disk-based data structures) with little savings of time at query time.
However, Barroso and Hölzle 2007 do mention it as an option.
- DNS picks out a cluster geographically close to user.
- A Google Web server machine (GWS) at the cluster is chosen by a load balancer.
- Carry out any unequivocal regularizations (e.g. upper to lower case).
- Check results page cache. If found, then return cached page.
- Collect related words, corrected or alternate spellings.
- Query is sent to index servers. Each index holds inverted index
for a random subset of documents.
- Index server returns list of docid's,
with relevance scores, sorted in decreasing order.
- GWS merges lists, gets overall list of docs sorted by score.
- Full text of actual documents is divided among file servers.
GWS sends each file server the query plus the list of docid's held on
that server.
- Each file server returns a document snippet for that query.
- GWS assembles summaries, advertisements from the ad server, and spelling
correction suggestions, for some depth of results pages (not just the first
ten results).
- Return to user and save in results page cache.
Query answering on each index server
Single term: For each highly relevant doc, combine relevance score with
query-independent doc score. Sort.
Multiple terms T1, T2, ... Tk: For each doc D, do a modified intersection
of the positional lists to find proximal groups and evaluate Boolean
combinations. If there is a NOT constraint in the query, exclude any doc
that violates it. Compute scores for the proximal groups, combine these with
the relevance scores for the individual words and with the query-independent
score for D to find the overall score for D on Q. Sort.
Exact phrase: Either proceed as with multiple terms --- better for
rare combinations of common words --- or find the most unusual
word in the phrase, locate all its occurrences, and check the neighborhood
for the phrase --- better for a rare word surrounded by stop words.
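The positional-list check underlying both strategies can be sketched as follows: given the sorted position lists of the phrase terms within one candidate document, the phrase occurs wherever the positions line up consecutively.

```python
def phrase_match(positions):
    """positions[i] = sorted word offsets of the i-th phrase term in one
    document. The phrase starts at offset p iff term i appears at p + i
    for every i. Returns the starting offsets of all occurrences."""
    first = positions[0]
    later = [set(ps) for ps in positions[1:]]   # O(1) membership tests
    return [p for p in first
            if all(p + i + 1 in s for i, s in enumerate(later))]

# "to be or not to be": "to" at [0, 4], "be" at [1, 5], "or" at [2]
assert phrase_match([[0, 4], [1, 5]]) == [0, 4]   # "to be"
assert phrase_match([[1, 5], [2], [3]]) == [1]    # "be or not"
```

Relaxing the `p + i + 1` condition to a window (positions within distance k, in any order) gives the proximal-group test used for ordinary multi-term queries.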
Exact phrase of stop words: The only way I can see to do this is to
index every exact phrase of stop words of more than a certain degree of
rarity, up to a certain length. Of course queries of this kind are
very rare, and answers probably don't have to be very accurate. (Certainly
the order is essentially meaningless.)