Web Search Engines: Lecture 2

Required Reading

Chakrabarti, chap. 3 through sec. 3.2.2 (p. 57)
Web Search for a Planet: The Google Cluster Architecture by Luiz Andre Barroso et al., IEEE Micro, 2003, vol. 23, pp. 22-28.

Correction on Google stemming

Google Help: Basics of Search states
Google now uses stemming technology. Thus, when appropriate, it will search not only for your search terms, but also for words that are similar to some or all of those terms. If you search for pet lemur dietary needs, Google will also search for pet lemur diet needs, and other related variations of your terms. Any variants of your terms that were searched for will be highlighted in the snippet of text accompanying each result.
And indeed the fourth page returned under a search on "pet ferret dietary" contains the word "diet" and not "dietary" (though it's hard to be sure that there's no inlink with an anchor containing "dietary"). So my best guess is that Google is using a lexicon for stemming, rather than a stemming algorithm, and that "deregularizations" just didn't make it into the lexicon.

Stopwords

Common words: Generally ignored in query, sometimes not indexed. Oddly, there doesn't seem to be a standard list. This is a list I copied some years ago. Here is another list which the authors believe to be the list of Google stopwords.
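
A minimal sketch of dropping stopwords from a query before lookup. The word list here is a made-up illustration, not any engine's actual list, and the function name is hypothetical.

```python
# Hypothetical stopword list, for illustration only.
STOPWORDS = {"a", "an", "and", "of", "the", "to", "in"}

def strip_stopwords(query):
    """Lowercase the query, split on whitespace, and drop stopwords."""
    return [w for w in query.lower().split() if w not in STOPWORDS]

print(strip_stopwords("The history of the Web"))
```

A real system would also need a rule for queries made up entirely of stopwords; this sketch simply returns an empty list for them.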

Indexing Continued

File structure for inverted index

Hash table: Fast add, retrieve. Does not support search by word prefix with wildcard.

B-tree, indexed by word. Logarithmic add, retrieve. Supports search by word prefix.

Trie. Not usable on disk. Useful in-memory data structure while compiling index.
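
A small in-memory sketch of the trade-off above, under simplifying assumptions. A Python dict stands in for the hash table (fast exact lookup, no prefix search), and a sorted word list with binary search stands in for the ordered access a B-tree gives (logarithmic lookup, prefix search by scanning forward). The documents and function names are invented for illustration.

```python
import bisect

def build_index(docs):
    """Map each word to a sorted list of IDs of the documents containing it."""
    index = {}
    for doc_id, text in docs.items():
        for word in set(text.lower().split()):
            index.setdefault(word, []).append(doc_id)
    for postings in index.values():
        postings.sort()
    return index

def prefix_lookup(sorted_words, prefix):
    """Return every word that starts with prefix, using binary search
    to find the first candidate and then scanning forward."""
    i = bisect.bisect_left(sorted_words, prefix)
    matches = []
    while i < len(sorted_words) and sorted_words[i].startswith(prefix):
        matches.append(sorted_words[i])
        i += 1
    return matches

# Toy collection (assumed data).
docs = {1: "pet lemur diet", 2: "pet ferret dietary needs", 3: "lemur habitat"}
index = build_index(docs)        # hash table: exact-match lookup only
sorted_words = sorted(index)     # ordered view: supports prefix search

print(index["pet"])
print(prefix_lookup(sorted_words, "diet"))
```

The dict answers index["pet"] directly but has no way to enumerate words beginning with "diet" short of scanning every key; the sorted list finds them in one binary search plus a short scan.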

Order of docs for word W in document list.

Order by DOC ID: Fast merging.
Order by relevance score: Fast retrieval for single-word query.
Supplement by hash-table for W,D pair.
Google: Two lists: high-relevance and low-relevance, each sorted by Doc. ID.
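
Why ordering by doc ID gives fast merging: two document lists sorted by doc ID can be intersected in a single linear pass, as in this sketch. The lists are made-up examples.

```python
def intersect(a, b):
    """Intersect two doc-ID lists that are each sorted in increasing order.
    Runs in time proportional to len(a) + len(b)."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1      # a[i] cannot appear later in b, so skip it
        else:
            j += 1      # b[j] cannot appear later in a, so skip it
    return out

print(intersect([1, 3, 5, 8, 13], [2, 3, 8, 9, 13, 21]))
```

With lists ordered by relevance score instead, no such single pass is possible, which is one reason to supplement them with a hash table on the (W, D) pair.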

Parallel retrieval

A. Divide by word. Each document generates a set of word lists for each server. Queries with one word, or with all words on the same server, are in luck. For a query involving multiple words on multiple servers, the word lists must be retrieved to a single machine and merged there.

B. Divide by document. Each document goes to one index builder. A query is sent to all servers, and the answers are merged in decreasing order of relevance.
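
A sketch of scheme B under simplifying assumptions: each shard is a dict from word to {doc ID: score}, every shard returns its hits best-first, and the front end merges the already-sorted streams. The shard layout, scores, and function names are invented for illustration.

```python
import heapq
from itertools import islice

def search_shard(shard, word):
    """Return (score, doc_id) hits from one shard, highest score first."""
    hits = [(score, doc_id) for doc_id, score in shard.get(word, {}).items()]
    return sorted(hits, reverse=True)

def search_all(shards, word, k):
    """Send the query to every shard and merge the sorted answers,
    keeping only the k best."""
    per_shard = [search_shard(s, word) for s in shards]
    merged = heapq.merge(*per_shard, reverse=True)  # inputs are sorted descending
    return list(islice(merged, k))

# Two toy shards; each document lives on exactly one of them (assumed data).
shards = [
    {"lemur": {1: 0.9, 4: 0.2}},
    {"lemur": {2: 0.7}, "ferret": {3: 0.5}},
]

print(search_all(shards, "lemur", 2))
```

Because each shard's answers arrive already sorted, the merge never has to look past the top few entries of each list to produce the top k overall.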

Hardware issues in Google Query Servers

(As of 2003). Reliability from replication.
Multiple clusters distributed worldwide.
A few thousand machines per cluster.

Copied from Barroso et al. 2003.

Steps in query answering:

"On average, a single query in Google reads hundreds of megabytes of data and consumes tens of billions of CPU cycles. Supporting a peak request stream of thousands of queries per second requires an infrastructure comparable in size to that of the largest supercomputer installations. Combining more than 15,000 commodity-class PC's with fault-tolerant software creates a solution that is more cost-effective than a comparable system build out of a smaller number of high-end servers."

A rack: 40 to 80 custom-made servers. CPUs range from 533 MHz Celeron to 1.4 GHz Pentium III. Each server has one or more 80 GB disk drives. Servers last two or three years, then are outdated by faster machines. A comparable array of standard machines costs $7,700 per month.

Cooling: Power density = 400 Watts / square foot.
Commercial data centers generally can handle 70-150 Watts per square foot.

Relevance Measures

Digression on Vectors and Dot Product

Spelling Correction

I am told that Google's spelling correction is done entirely on the basis of user logs; that is, it is observed how users correct their own spelling in consecutive queries. (Wilson Hsieh, personal communication, 10/07)
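
One way the idea could work, as a sketch only. Nothing here is known to be Google's actual method: the session format, the edit-distance test for deciding that a second query is a respelling of the first, and the threshold of 2 are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

def edit_distance(a, b):
    """Levenshtein distance between strings a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute
        prev = cur
    return prev[-1]

def learn_corrections(sessions, max_dist=2):
    """Count how often users follow one query with a near-identical one."""
    rewrites = defaultdict(Counter)
    for queries in sessions:
        for first, second in zip(queries, queries[1:]):
            if first != second and edit_distance(first, second) <= max_dist:
                rewrites[first][second] += 1
    return rewrites

def suggest(rewrites, query):
    """Return the most frequent observed rewrite of query, or None."""
    if query not in rewrites:
        return None
    return rewrites[query].most_common(1)[0][0]

# Toy query log: each inner list is one user's consecutive queries (assumed data).
sessions = [
    ["lemer diet", "lemur diet"],
    ["lemer diet", "lemur diet", "lemur habitat"],
    ["lemer diet", "lemon diet"],
]
rewrites = learn_corrections(sessions)
print(suggest(rewrites, "lemer diet"))
```

The edit-distance filter is what separates a respelling ("lemer diet" to "lemur diet") from an ordinary change of topic ("lemur diet" to "lemur habitat"), which is not counted.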

Query-dependent ranking from user data

Another aspect of ranking (query dependent) is gathered from user behavior. If users querying on Q generally decide to click on page P, then that ups the relevance of P to Q. (Wilson Hsieh, personal communication, 10/07)
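
A sketch of how click data could feed into ranking. The log format, the use of click share as the signal, and the blending weight are all assumptions for illustration, not a description of any real engine.

```python
from collections import Counter, defaultdict

def click_shares(click_log):
    """click_log: iterable of (query, clicked_page) pairs.
    Returns {query: {page: fraction of that query's clicks}}."""
    counts = defaultdict(Counter)
    for query, page in click_log:
        counts[query][page] += 1
    shares = {}
    for query, pages in counts.items():
        total = sum(pages.values())
        shares[query] = {p: c / total for p, c in pages.items()}
    return shares

def rerank(results, shares_for_query, weight=0.5):
    """results: list of (page, base_score). Adds weight * click share
    to each base score and sorts best-first."""
    boosted = [(page, score + weight * shares_for_query.get(page, 0.0))
               for page, score in results]
    return sorted(boosted, key=lambda pair: pair[1], reverse=True)

# Toy log: users querying "lemur" mostly click page B (assumed data).
log = [("lemur", "A"), ("lemur", "B"), ("lemur", "B"), ("lemur", "B")]
shares = click_shares(log)
ranked = rerank([("A", 0.6), ("B", 0.5)], shares["lemur"])
print([page for page, _ in ranked])
```

Here page B starts with the lower base score but overtakes A once its larger share of clicks is added in.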