Web Search Engines: Lecture 1

Required Reading

Suggested Further Reading

Structure of a Search Engine

Distinguishing Features

Spider

Major full-scale spiders

Google
Yahoo
Teoma (powers Ask.com)

Scale

Actual statistics on any of the major spiders (size of index, freshness, etc.) seem to be completely unavailable for any date later than 2005. Usage statistics (numbers of queries) are available.

Parallelism in downloading.

Courtesy toward server

Robot Exclusion Standard

Not overloading server: Put a (small) maximum on the number of requests issued to any server concurrently, or per second.

Elegant thread solution: Each thread has its own URL queue. URLs are assigned to a queue based on domain name (i.e. if there are K threads, then hash domain(U) to a value I between 0 and K-1 and give U to thread I). Any domain receives requests from only one thread, and thus only one at a time.
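A minimal sketch of this assignment in Python (the number of threads and the hash function are arbitrary choices, not taken from any particular crawler):

from urllib.parse import urlparse
from queue import Queue
import hashlib

NUM_THREADS = 8                               # K threads, each with its own URL queue
queues = [Queue() for _ in range(NUM_THREADS)]

def assign(url):
    domain = urlparse(url).hostname or ""
    # Hash the domain to a value I between 0 and K-1; every URL from a given
    # domain therefore lands on the same thread's queue.
    i = int(hashlib.md5(domain.encode()).hexdigest(), 16) % NUM_THREADS
    queues[i].put(url)

assign("http://example.com/a.html")
assign("http://example.com/b.html")           # same domain, so same queue as a.html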

Domain Name Resolution

Major bottleneck. Standard software (Java, UNIX) allows only one external DNS request at a time. The result is that 87% of elapsed time is spent waiting for DNS. A cache of domain names is not hugely effective. Custom software allows multiple concurrent requests; this reduces the figure to 25%.
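A sketch of the multiple-request approach in Python, with a small thread pool and a cache of resolved names (the hostnames are placeholders, and whether lookups truly proceed in parallel depends on the platform's resolver library):

import socket
from concurrent.futures import ThreadPoolExecutor
from functools import lru_cache

@lru_cache(maxsize=100000)                    # cache of domain names already resolved
def resolve(hostname):
    # Returns (canonical name, alias list, list of IP addresses).
    return socket.gethostbyname_ex(hostname)

hosts = ["www.example.com", "www.example.org", "www.example.net"]
with ThreadPoolExecutor(max_workers=20) as pool:
    for host, (canonical, aliases, addrs) in zip(hosts, pool.map(resolve, hosts)):
        print(host, "->", canonical, addrs)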

Errors in downloading

Server non-existent; server non-responsive (timeout); URL non-existent or not permitted; etc.

Robust parser

Standard parsers (e.g. generated by YACC) simply break under the junk that is out there on the Web: "typos in HTML tags, kilobytes of zeros in the middle of a tag, non-ASCII characters, HTML tags nested hundreds deep ..." (Brin and Page).

Repeated URL

Normalize URL: e.g. "http://blah/xxx" = "http://blah/xxx/" = "http://blah:80/xxx/" = "http://blah/xxx/index.html" = "xxx" in a document whose URL is "http://blah", etc.
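A sketch of a few of these normalizations using Python's standard urllib (the exact rule set varies from crawler to crawler):

from urllib.parse import urlsplit, urlunsplit, urljoin

def normalize(url, base=None):
    if base:                                   # resolve relative URLs such as "xxx"
        url = urljoin(base, url)
    scheme, netloc, path, query, frag = urlsplit(url)
    netloc = netloc.lower()
    if scheme == "http" and netloc.endswith(":80"):
        netloc = netloc[:-3]                   # drop the default port
    if path.endswith("/index.html"):
        path = path[:-len("index.html")]       # ".../index.html" -> ".../"
    if path == "":
        path = "/"                             # "http://blah" -> "http://blah/"
    return urlunsplit((scheme, netloc, path, query, ""))   # drop any fragment

# Both calls below print the same normalized string, http://blah/xxx/
print(normalize("http://BLAH:80/xxx/index.html"))
print(normalize("xxx/", base="http://blah/"))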

Canonicalize domain name. Request the canonical name and addresses from DNS. Use either the canonical name or the lowest IP address (for caching, not for the actual request).

Cache of URL's already seen. MERCATOR: Store fingerprint of URL. Full table on disk, large buffer in memory. High degree of locality, buffering very effective.

Actually, save a two-part fingerprint: hostname and full URL. Store by hostname/URL. Thus, when you look up a URL on disk, other URLs from the same host are nearby and get loaded into memory along with it. This helps because URLs on the same host tend to be cited together.
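A sketch of the two-part fingerprint (the hash function and bit widths are illustrative, not Mercator's actual choices; the on-disk table is replaced by an in-memory set to keep the sketch short):

import hashlib

def fingerprint(s, bits):
    return int(hashlib.sha1(s.encode()).hexdigest(), 16) & ((1 << bits) - 1)

def url_key(url):
    host = url.split("/")[2]                   # crude hostname extraction, sketch only
    # Host fingerprint first, so that keys sort (and hence sit on disk) grouped
    # by host; URLs from the same host end up next to each other.
    return (fingerprint(host, 24), fingerprint(url, 40))

seen = set()

def url_seen(url):
    key = url_key(url)
    if key in seen:
        return True
    seen.add(key)
    return False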

Repeated documents

(Mirror sites, aliases, copies, etc.) Cache 64-bit signature of document. If already in cache, do not process again.

Nearly repeated documents

We will discuss in a later lecture.

Immense or infinite pages

Put an upper bound on the size of the page.

Immense or infinite sites

(Crawler traps). No general way to detect automatically. Report unusually large number of web pages at site to human operator, who can check for the trap.

MERCATOR

(Compaq: precursor to Scooter, which was the crawler for AltaVista.)

Parallelism via multiple threads. Each thread has its own FIFO URL queue. Each server is assigned to a single thread, determined by canonical host name. Thus, there can be only one ongoing download at a time from any one server.

Size of crawl frontier: 100's of millions of URLs. Each thread buffers 600 URLs in memory; rest of queue on disk.
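A sketch of a FIFO queue that keeps only its head in memory and spills the rest to disk (buffer size, file format, and crash recovery are all simplified; the real Mercator structures are considerably more careful):

import collections, tempfile

class DiskBackedQueue:
    def __init__(self, mem_limit=600):
        self.mem = collections.deque()            # in-memory head of the queue
        self.mem_limit = mem_limit
        self.disk = tempfile.TemporaryFile("w+")  # overflow, one URL per line
        self.read_pos = 0                         # where the next disk read starts
        self.on_disk = 0

    def put(self, url):
        # Once anything has spilled to disk, later URLs must also go to disk
        # to preserve FIFO order.
        if self.on_disk > 0 or len(self.mem) >= self.mem_limit:
            self.disk.seek(0, 2)                  # append at the end of the file
            self.disk.write(url + "\n")
            self.on_disk += 1
        else:
            self.mem.append(url)

    def get(self):
        if not self.mem and self.on_disk > 0:     # refill the memory buffer from disk
            self.disk.seek(self.read_pos)
            while self.on_disk > 0 and len(self.mem) < self.mem_limit:
                self.mem.append(self.disk.readline().rstrip("\n"))
                self.on_disk -= 1
            self.read_pos = self.disk.tell()
        return self.mem.popleft()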

Timeout after 1 minute.

Content-seen test. Compute 64 bit fingerprint of content. Save in small hash table in memory; large sorted list on disk. New documents get added to hash table; table merged with sorted list when full. (8.5% of documents downloaded are duplicates. Session ID's in URL generate potentially infinitely many "duplicates".)
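A sketch of the content-seen test, with an in-memory table that is merged into a sorted list on disk when it fills (fingerprint width, file format, and sizes are illustrative; a real implementation would binary-search the disk file rather than scan it):

import hashlib

class ContentSeen:
    def __init__(self, path="fingerprints.txt", mem_limit=100000):
        self.path = path
        self.mem = set()                        # small in-memory hash table
        self.mem_limit = mem_limit
        open(path, "a").close()                 # make sure the disk list exists

    def fingerprint(self, content):             # content is the page as bytes
        return hashlib.sha1(content).hexdigest()[:16]    # 64-bit fingerprint, in hex

    def seen(self, content):
        fp = self.fingerprint(content)
        if fp in self.mem or self._on_disk(fp):
            return True
        self.mem.add(fp)
        if len(self.mem) >= self.mem_limit:
            self._merge()
        return False

    def _on_disk(self, fp):
        with open(self.path) as f:              # linear scan keeps the sketch short
            return any(line.strip() == fp for line in f)

    def _merge(self):                           # merge memory table into sorted disk list
        with open(self.path) as f:
            merged = sorted(self.mem.union(line.strip() for line in f))
        with open(self.path, "w") as f:
            f.write("\n".join(merged) + "\n")
        self.mem.clear()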

Checkpointing at regular intervals.

Adaptability through good, modular, object-oriented programming.

Refresh method

Method 1: Redo complete crawl, starting from list of URL's, rebuild index from scratch.

Method 2: Incremental

For each URL U in the existing list:
  Check existence / recent modification with the server.
  If U no longer exists, then remove it from all indices.
  If U has been modified, then {
    remove all obsolete index entries
      (e.g. all words in the old version but not the new;
       backlinks from all links in the old version but not the new; etc.);
    add all necessary new index entries
    }
Additionally needed: a table, indexed by URL, of all the index information for the page.

If the document is unchanged but the URL changes, then change the URL associated with the Doc. ID. (Indices actually use fixed-length Doc. IDs rather than variable-length URLs.)

Refresh strategies

(Arasu et al., 99) Two measures to optimize: (1) freshness, the fraction of indexed pages whose indexed copy is up to date; (2) age, the average time since the indexed copy became out of date. Which measure is more appropriate depends on the value of outdated pages. If outdated pages are useless, go with (1); if they lose value in proportion to their staleness, go with (2). For either, it is debatable how to count pages that no longer exist and new, unindexed pages.

Another measure is to weight the above by the frequency of user queries.

Uniform Strategy: Refresh all pages at same rate.

Proportional Strategy: Record how often each page changes and refresh it in proportion to its frequency of change. Uniform works better than proportional.

Optimal Strategy (to maximize the percentage of current pages, given standard modelling assumptions): refresh frequency as a function of change frequency is a downward parabola. Refresh most frequently the pages with a middle change rate. Pages that change rarely should be refreshed rarely. Pages that change too frequently should not be refreshed at all (they are not worth indexing).

Run a separate news crawler over known news sources.

Crawling Strategies

Give crawling priority to "important" pages. The criteria of "importance" are the same as for relevance ranking; we will discuss them then.

Indexing

What is a Word?

I have not found a useful systematic discussion of this.

Four issues:

1. What is a word? For our purposes, it is (ideally) any key that a user may wish to use as a component of search, and that can in fact be detected in the document, or associated with the document, and used as an index.

2. Delimiting. Certainly, white space (blank, newline, tab) is a delimiter. The tricky question is punctuation.

3. Maximum length. Amid the junk on the Web, there must be any number of HTML files with very long strings of characters with no delimiter. I presume that any string longer than some maximum is just skipped.

4. Regularization. Suppose that string G can be regularized to word F != G. There are at least four possible policies:

There are also non-symmetric strategies, in which one wants to allow a standard form in the query to match a non-standard form in the document but not vice versa, or, less likely, the reverse. We omit these. In general, the regularization rules for queries need not be the same as for documents for any of the following reasons:

Also, the same issues arise in doing comparisons between documents. This, of course, is a symmetric situation. Choices may differ from the query case because (a) there is no opportunity for user feedback; (b) the existence of a lot of data makes any one word comparatively less important.

Specific issues that arise in English text (other languages have their own problems).

A danger of strong regularizations is that they tend to lead to false matches that are not merely wrong but entirely baffling to the user. Users generally find such results much more upsetting than mere errors, and if many occur, user confidence in the system quickly degrades. (Also note that these are not particularly noted in standard measures of quality, such as precision.) E.g. Suppose query is "portability" and document contains "important". A strong stemming algorithm might (etymologically correctly) reduce both to "port" and match them.
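A toy illustration of this failure mode, using a deliberately aggressive (and entirely hypothetical) prefix- and suffix-stripping rule; real stemmers such as Porter's are more conservative, but the same kind of collision can occur:

SUFFIXES = ["ability", "ation", "ance", "ant", "ive", "s"]
PREFIXES = ["im", "de", "re"]

def aggressive_stem(word):
    w = word.lower()
    for p in PREFIXES:
        if w.startswith(p):
            w = w[len(p):]
            break
    for s in SUFFIXES:
        if w.endswith(s):
            w = w[:-len(s)]
            break
    return w

print(aggressive_stem("portability"))    # -> "port"
print(aggressive_stem("important"))      # -> "port": a baffling match for the user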

It seems, experimentally, that Google regularizes terms in the text but not in the query. On the one hand, the query "deregularizations" returns only two pages, both of which have the word in the plural, and it does not return any of the many pages that have the word in the singular. On the other hand, the query "Origin Roman historical commemoration" matches pages which contain the words "Origins" (plural), "Roman", "historical", and "commemoration" but do not contain the word "origin" in the singular. The theory presumably is that since you specified "deregularizations" you specifically want the plural, but if you ask for the singular, you may well be interested in the plural as well. Also, a query can match the form in an anchor rather than the content of the page. On the other hand, the query "zeugmas" returns as answer number 4 the page zeugma etc., which does not contain the word in the plural, but whose cached copy claims that there is a link that contains the plural form. (The "link" search doesn't find the page with this link; either link search is unreliable, or the page no longer exists.) (These queries were carried out in August and September 2007.)

Note: Google will index a page P under a term T that appears only in the anchor of a link to P and not in any form in P itself. I ran a test: the page nquarminos does not contain this anchor term but is now indexed under that term. Incidentally, the page was placed on the web on 8/24/07 and was indexed some time between 9/4 and 9/8.

What information is indexed?

Inverted file

For each word W, the documents in which it occurs. For each word W and document D, some of the following:

Lexicon: Information about words

Number of documents containing word. Total number of occurrences of word. Possibly related words.
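A minimal sketch of an inverted file together with the derived lexicon statistics, built from a couple of toy documents (a real index records much more per occurrence and uses compressed on-disk formats):

from collections import defaultdict

docs = {
    1: "web search engines crawl the web",
    2: "search engines build an inverted index",
}

inverted = defaultdict(dict)           # word -> {doc ID -> list of positions}
for doc_id, text in docs.items():
    for pos, word in enumerate(text.split()):
        inverted[word].setdefault(doc_id, []).append(pos)

# Lexicon: per-word statistics derived from the inverted file.
lexicon = {w: {"doc_count": len(postings),
               "total_occurrences": sum(len(p) for p in postings.values())}
           for w, postings in inverted.items()}

print(inverted["web"])      # {1: [0, 5]}
print(lexicon["search"])    # {'doc_count': 2, 'total_occurrences': 2}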

Document index

For each document D: URL; document ID; the list of outlinks from D; the list of inlinks to D; page rank of D (query independent measure of quality); language; format; date; frequency of update; etc.

Word to link index

For each word W, list of links in which W is used as an anchor. (May be part of inverted file).

Index format

Compression techniques

Use positional differences rather than absolute positions for W in D; this saves significant space if W occurs often in a long document D. Huffman or similar encoding for words, URLs, contexts, etc. Give small doc IDs to docs with many different words and large doc IDs to docs with few different words.
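A sketch of the first idea: store the gaps between successive positions and encode each gap with a variable-byte code, so that small gaps take fewer bytes (a simple stand-in for the Huffman-style codes mentioned above):

def vbyte_encode(n):
    # 7 bits of payload per byte; the high bit marks the last byte of a number.
    out = []
    while True:
        out.append(n & 0x7F)
        n >>= 7
        if n == 0:
            break
    out[-1] |= 0x80
    return bytes(out)

def encode_positions(positions):
    prev, encoded = 0, b""
    for p in positions:
        encoded += vbyte_encode(p - prev)    # gap rather than absolute position
        prev = p
    return encoded

positions = [3, 10, 12, 1000, 1002]          # occurrences of W in a long document D
print(len(encode_positions(positions)), "bytes")   # 6 bytes vs. 20 for four-byte ints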