Web Search Engines: Lecture 2

Required Reading

Chakrabarti, chap. 3 through sec. 3.2.2 (p. 57)

Additional Reading

Managing Gigabytes by Ian Witten, Alistair Moffat, and Timothy Bell. Extensive discussion of achieving space and time efficiency in storing and retrieving large text collections, including inverted files.

What is a Word?

I have not found a useful systematic discussion of this.

Three issues:

1. What is a word? For our purposes, it is (ideally) any key that a user may wish to use as a component of search, and that can in fact be detected in the document and used as an index.

2. Delimiting. Certainly, white space (blank, newline, tab) is a delimiter. The tricky question is punctuation.

3. Regularization. Suppose that string G can be regularized to word F != G. There are at least four possible policies:

There are also non-symmetric strategies, in which one wants to allow a standard form in the query to match a non-standard form in the document but not vice versa, or, less likely, the reverse. We omit these. In general, the regularization rules for queries need not be the same as for documents for any of the following reasons:

Also, the same issues arise in doing comparisons between documents. This, of course, is a symmetric situation. Choices may be different from the query case because (a) there is no opportunity for user feedback; (b) the existence of a lot of data makes any one word comparatively less important.

Specific issues that arise in English text (other languages have their own problems).

A danger of strong regularization is that it tends to lead to false matches that are not merely wrong but entirely baffling to the user. Users generally find such results much more upsetting than mere errors, and if many occur, user confidence in the system quickly degrades. (Also note that these are not specially penalized by standard measures of quality, such as precision.) E.g. suppose the query is "portability" and the document contains "important". A strong stemming algorithm might (etymologically correctly) reduce both to "port" and match them.

What information is indexed?

Inverted file

For each word W, the documents in which it occurs. For each word W and document D, some of the following:
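As an illustration, here is a minimal in-memory sketch (in Python, with made-up toy documents) of an inverted file recording, for each word, the documents it occurs in and its positions within each document:

```python
from collections import defaultdict

def build_inverted_file(docs):
    """docs: doc_id -> list of words (already tokenized and regularized).
    Returns word -> {doc_id: [positions of word in that doc]}."""
    index = defaultdict(dict)
    for doc_id, words in docs.items():
        for pos, w in enumerate(words):
            index[w].setdefault(doc_id, []).append(pos)
    return index

docs = {1: ["web", "search", "engines"],
        2: ["search", "the", "web", "with", "search", "engines"]}
idx = build_inverted_file(docs)
# idx["search"] -> {1: [1], 2: [0, 4]}
```

A production index would, of course, store this structure compressed on disk rather than as Python dictionaries.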

Lexicon: Information about words

Number of documents containing word. Total number of occurrences of word. Possibly related words.

Document index

For each document D: URL; document ID; the list of outlinks from D; the list of inlinks to D; page rank of D (query independent measure of quality); language; format; date; frequency of update; etc.

Word to link index

For each word W, list of links in which W is used as an anchor. (May be part of inverted file).

Index format

Compression techniques

Use positional differences (gaps) rather than absolute positions for W in D; this saves significant space if W occurs often in a long document D. Use Huffman or similar encoding for words, URLs, contexts, etc. Give small doc IDs to docs with many different words and large doc IDs to docs with few different words, so the IDs that appear most often in the index are the cheapest to encode.
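For instance, gap (delta) encoding of a position list can be sketched as follows; the small gap values are what a Huffman-style variable-length code then compresses well:

```python
def to_gaps(positions):
    """Delta-encode a sorted list of absolute positions."""
    return [positions[0]] + [b - a for a, b in zip(positions, positions[1:])]

def from_gaps(gaps):
    """Decode by accumulating a running total."""
    out, total = [], 0
    for g in gaps:
        total += g
        out.append(total)
    return out

positions = [4, 10, 11, 200, 205]
gaps = to_gaps(positions)        # [4, 6, 1, 189, 5] -- mostly small numbers
assert from_gaps(gaps) == positions
```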

File structure for lexicon

Hash table. Fast add, retrieve. Does not support search by word prefix with wildcard.

B-tree, indexed by word. Logarithmic add, retrieve. Supports search by word prefix.

Trie. Not usable on disk. Useful in-memory data structure while compiling index.
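As an in-memory analogue of the B-tree's ordered index, binary search over a sorted lexicon supports the same prefix range scan (a sketch; the word list is made up):

```python
import bisect

def prefix_search(sorted_words, prefix):
    """All words starting with prefix, via two binary searches on a sorted
    lexicon. A disk-based B-tree supports the same range scan; a hash table
    cannot, since it scatters adjacent words."""
    lo = bisect.bisect_left(sorted_words, prefix)
    hi = bisect.bisect_left(sorted_words, prefix + "\uffff")
    return sorted_words[lo:hi]

lexicon = sorted(["port", "portable", "portability", "important", "imported"])
# prefix_search(lexicon, "port") -> ['port', 'portability', 'portable']
```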

Order of docs for word W in document list.

Order by DOC ID: Fast merging.
Order by relevance score: Fast retrieval for single-word query.
Supplement by hash-table for W,D pair.
Google: two lists, high-relevance and low-relevance, each sorted by doc ID.

Distributed system

A. Divide by document. Each document goes to one index builder. A query is sent to all servers, and the answers are merged in decreasing order of relevance.

B. Divide by word. Each document generates a set of word lists, one per server. Queries with one word, or whose words all reside on the same server, are in luck. For a query involving multiple words on multiple servers, the word lists must be retrieved to a single machine and the merging performed there.
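Under strategy A, combining the per-server answers is a standard k-way merge; a sketch, assuming each server returns (relevance, doc ID) pairs already sorted by decreasing relevance:

```python
import heapq

def merge_server_answers(per_server_results):
    """Merge per-server answer lists, each sorted by decreasing relevance,
    into one globally sorted list. heapq.merge wants ascending order, so we
    merge on negated relevance and negate back."""
    negated = [[(-r, d) for r, d in res] for res in per_server_results]
    return [(-nr, d) for nr, d in heapq.merge(*negated)]

answers = merge_server_answers([
    [(0.9, "d1"), (0.4, "d3")],   # server 1's ranked answers (made up)
    [(0.7, "d2"), (0.1, "d4")],   # server 2's
])
# answers: [(0.9, 'd1'), (0.7, 'd2'), (0.4, 'd3'), (0.1, 'd4')]
```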

Query answering

In "Ask Jeeves" the answers for the K most popular queries are cached and hand-edited for relevance for some substantial value of K. I would bet, though I have not seen, that the other search engines do this too, to some extent.

(NY Times, 9/9/2002, p. C6) During the month of July 2002 Yahoo.com had 90.1 million visitors (most popular website); Google.com had 39.6 million visitors (5th most popular, after MSN.com Microsoft.com, and AOL.com.) 90 million visitors per month = about 30 visitors per second.

Retrieval algorithms


W1 AND W2:

Let WA be the word with fewer documents, WB the other.
for each document D in list of WA {
  if D found in list of WB {
     D.relevance := combine(score(D), score(WA,D), score(WB,D));
     add D to list L
} }
return L sorted by decreasing relevance.
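A runnable sketch of the same intersection, with hypothetical stand-in scoring functions (a plain sum replaces the unknown "combine"):

```python
def and_query(doc_list, doc_score, word_score, wa, wb):
    """doc_list: word -> set of doc IDs. doc_score(d) and word_score(w, d)
    are hypothetical stand-ins for the query-independent and per-word
    scores; summing them is a stand-in for the real "combine"."""
    if len(doc_list[wb]) < len(doc_list[wa]):
        wa, wb = wb, wa            # loop over the shorter document list
    hits = []
    for d in doc_list[wa]:
        if d in doc_list[wb]:      # fast membership test via the hashed set
            rel = doc_score(d) + word_score(wa, d) + word_score(wb, d)
            hits.append((rel, d))
    return sorted(hits, reverse=True)

doc_list = {"web": {1, 2, 3}, "engine": {2, 3, 4, 5}}
results = and_query(doc_list, lambda d: d, lambda w, d: 10, "web", "engine")
# results: [(23, 3), (22, 2)]
```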


W1 AND NOT W2:

for each document D in list of W1 {
  if D not found in list of W2 {
     D.relevance := combine(score(D), score(W1,D));
     add D to list L
} }
return L sorted by decreasing relevance.

W1 OR W2; similar.

Proximity search: W1 close to W2

Let WA be the word with fewer documents, WB the other.
for each document D in list of WA {
  if D found in list of WB {
     find the closest pair of an occurrence OA of WA in D
                and an occurrence OB of WB in D;
     D.relevance := combine(score(D), score(WA,D), score(WB,D),
                            dist(OA,OB));
     add D to list L
} }
return L sorted by decreasing relevance.
Finding common documents in the lists of WA and WB, or finding close occurrences OA of WA and OB of WB in D, depends on the data structure. If occurrences are kept in sorted lists of roughly equal length, use a merge strategy. If the lists are of very different lengths, loop through the shorter list and use binary (or interpolation) search on the other. If occurrences are kept in a list plus a hashtable, loop through the shorter list and check the hashtable of the other.
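The merge strategy for finding the closest pair of occurrences in two sorted lists can be sketched as:

```python
def closest_pair(occ_a, occ_b):
    """Minimum |oa - ob| over sorted occurrence lists occ_a and occ_b,
    found by a linear merge: at each step, advance the pointer behind the
    smaller value, since advancing the other can only widen the gap."""
    i = j = 0
    best = None
    while i < len(occ_a) and j < len(occ_b):
        d = abs(occ_a[i] - occ_b[j])
        if best is None or d < best:
            best = d
        if occ_a[i] < occ_b[j]:
            i += 1
        else:
            j += 1
    return best

# closest_pair([3, 40, 90], [25, 38, 100]) -> 2   (positions 40 and 38)
```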

The details of the "combine" function are a closely guarded trade secret.

Relevance measures

Computing the relevance of a document to a query has four parts: As far as I can tell, no one is saying anything about how they do (4).

Relevance of a document to a word.

The standard measure here, from IR (information retrieval) theory, is known as the TF/IDF (term frequency / inverse document frequency) measure. It is computed as follows. Let I be a word and D be a document. We define the following quantities:

freqDI = the number of occurrences of I in D.
M = the total number of documents.
MI = the number of documents that contain word I.

Then the significance of word I in document D is given by
XDI = freqDI log(M/MI)

(Note: The formula that I gave in lecture divided the above by the quantity maxW freqDW, but since that quantity is independent of I, it scales all the scores for D by the same constant factor and can be omitted.)
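A quick numeric illustration of the formula (the counts are made up): a moderately frequent word that is rare across the collection outscores a more frequent word that appears almost everywhere.

```python
import math

def tfidf(freq_di, m, m_i):
    """X_DI = freqDI * log(M / MI), as in the formula above."""
    return freq_di * math.log(m / m_i)

# 3 occurrences, but in only 10 of 1000 documents:
rare = tfidf(3, 1000, 10)      # 3 * log(100)
# 5 occurrences, but in 500 of 1000 documents:
common = tfidf(5, 1000, 500)   # 5 * log(2)
assert rare > common
# A word in every document scores 0: log(M/M) = 0.
```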

Features of the above formula.

Combining word significance: Vector Method

There is a very widely used method for combining the above measures of word significance into a measure either of the similarity of two documents or of the relevance of a document to a query. This is called the vector model and is due to Salton.

Let W be the total number of index terms. Consider a W-dimensional geometric (Euclidean) space in which each term is a different dimension. We consider a document D to correspond to a vector whose component in the dimension corresponding to word I is the value XDI. (Of course, most words I do not appear in D, so the components in those directions will be 0.) Symbolically, let Dv be the vector associated with D and let Iu be the unit vector associated with word I. Then

Dv = sumI XDI Iu

Measurement Rule 1: The similarity of documents D and E is measured by the cosine of the angle between Dv and Ev.

Measurement Rule 2: The relevance of document D to query Q is measured as the similarity of D to Q where Q is viewed as a (very short) document, and similarity is measured as in Measurement Rule 1.

The reason that we use the angle between Dv and Ev rather than the distance between them is this: Consider a case where E has the same distribution of words as D, but is twice as long. Intuitively, the subject matter of the two documents is presumably very similar. Now, the distance between the two vectors will be large, but the angle between them will be 0. In general, using angle rather than distance normalizes for the length of the documents, and considers only the relative frequency of each word in the document.
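Measurement Rule 1 on sparse vectors can be sketched as follows; the example repeats the doubled-document case above, where the cosine is 1 (angle 0) even though the Euclidean distance is large:

```python
import math

def cosine(dv, ev):
    """Cosine of the angle between two sparse vectors (dicts word -> weight):
    dot product divided by the product of the lengths."""
    dot = sum(w * ev.get(t, 0.0) for t, w in dv.items())
    norm = math.sqrt(sum(w * w for w in dv.values())) * \
           math.sqrt(sum(w * w for w in ev.values()))
    return dot / norm if norm else 0.0

d = {"web": 2.0, "search": 1.0}
e = {"web": 4.0, "search": 2.0}   # same distribution, twice as long
# cosine(d, e) is 1.0, although the Euclidean distance between them is large
```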

Geometrical facts


Putting all this together, we get the following algorithm for basic vector-based retrieval:


{ In a first pass through all the documents, 
     compute freqDW and MW
         for every document D and word W;
     record IDFW := log(M/MW) in the lexicon;

  In a second pass, compute for every D and W {
       XDW := freqDW * IDFW;
       |D| := sqrt(sumI in D XDI * XDI);
       YDW := XDW / |D|;
       record YDW in an inverted file
              (i.e. indexed primarily by W, secondarily by D)
  }
}
The value YDW is the coordinate in the W dimension of the vector Dv normalized to unit length.

Retrieval for query Q

{ DVAL := empty; \\ DVAL is a mapping document -> number.
  for each word W in Q {
    for each document D on the doc-list of W {
      if (D in DVAL) then V := DVAL[D] else V := 0;
      DVAL[D] := V + YDW * IDFW;
  } }
  \\ Note: the vector Qv associated with Q is just IDFW
  \\ for each word W in Q. The sum above computes the dot product of this with
  \\ the unit vector associated with the document. There is no need to divide
  \\ through by the length of Qv since that length is independent of the
  \\ document.
  return (documents in DVAL sorted by decreasing value)
}
Note: The method can easily be adapted to weighted queries, in which the user assigns weights to the words in the query: just use those weights for Q rather than IDFW.
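The two-pass indexing and the query loop above can be put together as one runnable Python sketch (the helper names are made up; the pseudocode above is the reference):

```python
import math
from collections import Counter, defaultdict

def build_index(docs):
    """docs: doc_id -> list of words. Two passes, as above: first compute
    freqDW and MW and record IDFW in the lexicon; then compute and record
    the normalized weights YDW in an inverted file (word -> doc -> YDW)."""
    m = len(docs)
    tf = {d: Counter(words) for d, words in docs.items()}   # freqDW
    df = Counter()                                          # MW
    for counts in tf.values():
        df.update(counts.keys())
    idf = {w: math.log(m / mw) for w, mw in df.items()}     # the lexicon
    inv = defaultdict(dict)
    for d, counts in tf.items():
        x = {w: c * idf[w] for w, c in counts.items()}      # XDW
        length = math.sqrt(sum(v * v for v in x.values()))  # |D|
        for w, v in x.items():
            inv[w][d] = v / length if length else 0.0       # YDW
    return inv, idf

def retrieve(inv, idf, query_words):
    """Accumulate YDW * IDFW per document; sort by decreasing score."""
    dval = defaultdict(float)
    for w in query_words:
        for d, y in inv.get(w, {}).items():
            dval[d] += y * idf[w]
    return sorted(dval.items(), key=lambda p: -p[1])
```

Note that a word appearing in every document gets IDF = log(1) = 0 and so contributes nothing to any score, which is exactly the behavior the TF/IDF formula intends.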