Web Search Engines: Lecture 1

Required Reading

Suggested Further Reading

Structure of a Search Engine

Distinguishing Features

Spider

Major full-scale spiders

Google (4.3 billion web pages indexed + news groups + images)
Yahoo
Teoma (powers AskJeeves)

Scale

4 billion web pages / 1 month per crawl = 4 billion pages / roughly 2.6 million seconds ≈ 1,500 web pages downloaded per second.

Parallelism in downloading.

Courtesy toward server

Robot Exclusion Standard

Not overloading the server: Put a (small) maximum on the number of requests issued to any one server concurrently, or per second.

Elegant thread solution: Each thread has its own URL queue. URLs are assigned to a given queue based on domain name (i.e., if there are K threads, then hash domain(U) to a value I between 0 and K-1, and give U to thread I). Any domain gets requests from only one thread, and thus only one at a time.
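A minimal sketch of this assignment scheme, with an assumed thread count K and plain standard-library queues (the names are illustrative):

import hashlib
import queue
from urllib.parse import urlparse

K = 8                                               # assumed number of crawler threads
url_queues = [queue.Queue() for _ in range(K)]      # one FIFO queue per thread

def assign(url):
    """Route a URL to the queue of the thread responsible for its domain."""
    domain = urlparse(url).netloc.lower()
    # Hash the domain, not the full URL, so every URL from a given host
    # lands on the same thread; that thread issues only one request at a time.
    i = int(hashlib.md5(domain.encode()).hexdigest(), 16) % K
    url_queues[i].put(url)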

Domain Name Resolution

Major bottleneck. Standard software (Java, UNIX) allows only one external DNS request at a time. The result is that 87% of elapsed time is spent waiting for DNS. A cache of domain names is not hugely effective. Custom software that allows multiple concurrent requests reduces this to 25%.
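One way to get many DNS lookups in flight at once, sketched with the Python standard library (the pool size and the simple cache are assumptions, not a description of any particular crawler's software):

import socket
from concurrent.futures import ThreadPoolExecutor

dns_cache = {}                       # hostname -> IP address

def resolve(host):
    if host not in dns_cache:
        dns_cache[host] = socket.gethostbyname(host)    # blocking lookup
    return dns_cache[host]

# Issue many lookups concurrently instead of one at a time.
hosts = ["example.com", "example.org", "example.net"]
with ThreadPoolExecutor(max_workers=50) as pool:
    addresses = dict(zip(hosts, pool.map(resolve, hosts)))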

Errors in downloading

Server non-existent, server non-responsive (timeout), URL non-existent or not permitted, etc.

Robust parser

Standard parsers (e.g. generated by YACC) simply break under the junk that is out there on the Web: "typos in HTML tags, kilobytes of zeros in the middle of a tag, non-ASCII characters, HTML tags nested hundreds deep ..." (Brin and Page).
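For illustration, Python's html.parser module is deliberately tolerant of malformed markup; a sketch of link extraction that simply shrugs off the junk:

from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect href attributes and silently ignore whatever junk surrounds them."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

extractor = LinkExtractor()
extractor.feed("<p><<lots of junk>> <a href='/xxx'>ok</a> <b>unclosed <a href=/yyy>also ok")
print(extractor.links)      # expect ['/xxx', '/yyy']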

Repeated URL

Normalize the URL (e.g. "http://blah/xxx" = "http://blah/xxx/" = "http://blah:80/xxx/" = "http://blah/xxx/index.html" = "xxx" in a document with URL "http://blah", etc.)

Canonicalize the domain name. Request the canonical name and addresses from DNS. Use either the canonical name or the lowest IP address (for caching, not for the actual request).
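A sketch of both steps, with illustrative normalization rules matching the examples above (a real crawler's rule set would be longer):

import ipaddress
import socket
from urllib.parse import urlsplit, urlunsplit

def normalize(url):
    """Map trivially different forms of the same URL to one string (illustrative rules)."""
    scheme, netloc, path, query, _fragment = urlsplit(url)
    netloc = netloc.lower()
    if scheme == "http" and netloc.endswith(":80"):
        netloc = netloc[:-3]                        # drop the default port
    if path.endswith("/index.html"):
        path = path[: -len("index.html")]           # .../index.html -> .../
    if not path.endswith("/"):
        path += "/"                                 # treat /xxx and /xxx/ as the same
    return urlunsplit((scheme, netloc, path, query, ""))   # discard any fragment

def canonical_host(host):
    """Canonical name and a deterministic 'lowest' address, for caching only."""
    name, _aliases, addrs = socket.gethostbyname_ex(host)
    return name.lower(), min(addrs, key=ipaddress.ip_address)

# All three of these normalize to "http://blah/xxx/":
for u in ["http://blah/xxx", "http://BLAH:80/xxx/", "http://blah/xxx/index.html"]:
    print(normalize(u))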

Cache of URLs already seen. MERCATOR: Store a fingerprint of each URL. The full table is on disk, with a large buffer in memory. There is a high degree of locality, so buffering is very effective.

Actually, save a two-part fingerprint: hostname and full URL. Store by hostname/URL. Thus, when you look up a URL on disk, other URLs from the same host are nearby and get loaded into memory. This helps because URLs on the same host tend to be cited together.
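A toy in-memory version of this URL-seen test with the two-part (host, URL) fingerprint; the on-disk layout and buffering that MERCATOR actually uses are only hinted at in the comments:

import hashlib
from urllib.parse import urlsplit

def fingerprint(text, bits=64):
    """Fixed-size fingerprint via a truncated hash (the hash choice is illustrative)."""
    return int(hashlib.sha1(text.encode()).hexdigest(), 16) & ((1 << bits) - 1)

seen = set()    # on disk this would be sorted by (host part, URL part), with a
                # memory buffer, so lookups for URLs on one host cluster together

def already_seen(url):
    host = urlsplit(url).netloc.lower()
    key = (fingerprint(host), fingerprint(url))     # host part first => host locality
    if key in seen:
        return True
    seen.add(key)
    return False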

Repeated documents:

(Mirror sites, aliases, copies, etc.) Cache a 64-bit signature of each document. If it is already in the cache, do not process it again.

Nearly repeated documents

We will discuss in a later lecture.

Immense or infinite pages

Put an upper bound on the size of the page.
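A sketch of capping page size during download with the standard library (the 1 MB cap is an arbitrary assumption):

from urllib.request import urlopen

MAX_PAGE_BYTES = 1_000_000           # assumed cap on page size

def fetch_truncated(url, timeout=60):
    with urlopen(url, timeout=timeout) as response:
        # read() with a size argument stops after that many bytes, so an
        # immense or endless page cannot exhaust memory or disk.
        return response.read(MAX_PAGE_BYTES)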

Immense or infinite sites

(Crawler traps.) There is no general way to detect these automatically. Report an unusually large number of web pages at a site to a human operator, who can check for a trap.

MERCATOR

(Compaq: precursor to Scooter, which was the crawler for AltaVista.)

Parallelism via multiple threads. Each thread has its own FIFO URL queue. Each server is assigned to a single thread, determined by canonical host name. Thus, there can be only one ongoing download at a time from any one server.

Size of the crawl frontier: hundreds of millions of URLs. Each thread buffers 600 URLs in memory; the rest of the queue is on disk.
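A simplified sketch of such a disk-backed FIFO queue (the 600-URL buffer size comes from the notes above; the file format and refill logic are assumptions):

from collections import deque

class DiskBackedQueue:
    """FIFO URL queue: a small in-memory buffer, the rest spilled to a file on disk.
    (A simplified sketch of the idea, not MERCATOR's actual implementation.)"""

    def __init__(self, path, mem_limit=600):
        self.buffer = deque()       # head of the queue, in memory
        self.path = path            # tail of the queue, one URL per line
        self.mem_limit = mem_limit
        self.on_disk = 0

    def put(self, url):
        if self.on_disk == 0 and len(self.buffer) < self.mem_limit:
            self.buffer.append(url)
        else:
            with open(self.path, "a") as f:         # spill once the buffer is full
                f.write(url + "\n")
            self.on_disk += 1

    def get(self):
        if not self.buffer and self.on_disk:
            self._refill()
        return self.buffer.popleft() if self.buffer else None

    def _refill(self):
        # Move the front of the on-disk tail into the in-memory buffer.
        with open(self.path) as f:
            urls = f.read().splitlines()
        self.buffer.extend(urls[: self.mem_limit])
        rest = urls[self.mem_limit:]
        with open(self.path, "w") as f:
            f.write("".join(u + "\n" for u in rest))
        self.on_disk = len(rest)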

Timeout after 1 minute.

Content-seen test. Compute a 64-bit fingerprint of the content. Save it in a small hash table in memory and a large sorted list on disk. New documents are added to the hash table; the table is merged with the sorted list when full. (8.5% of documents downloaded are duplicates. Session IDs in URLs can generate potentially infinitely many "duplicates".)
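A toy version of the content-seen test: an in-memory set that is periodically merged into a sorted list on disk (the table size and file handling are assumptions; a full implementation would also search the disk list on lookup):

import hashlib

MEM_LIMIT = 100_000                  # assumed size of the in-memory table
mem_table = set()                    # recently seen content fingerprints
DISK_FILE = "fingerprints.sorted"    # one fingerprint per line, kept sorted

def doc_fingerprint(content: bytes) -> int:
    return int.from_bytes(hashlib.sha1(content).digest()[:8], "big")   # 64 bits

def merge_to_disk():
    """Merge the in-memory table into the sorted list on disk, then empty it."""
    try:
        with open(DISK_FILE) as f:
            on_disk = {int(line) for line in f}
    except FileNotFoundError:
        on_disk = set()
    with open(DISK_FILE, "w") as f:
        f.writelines(f"{fp}\n" for fp in sorted(on_disk | mem_table))
    mem_table.clear()

def content_seen(content: bytes) -> bool:
    fp = doc_fingerprint(content)
    if fp in mem_table:
        return True
    # (a full implementation would also binary-search the sorted list on disk here)
    mem_table.add(fp)
    if len(mem_table) >= MEM_LIMIT:
        merge_to_disk()
    return False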

Checkpointing at regular intervals.

Adaptability through good, modular, object-oriented programming.

Refresh method

Method 1: Redo the complete crawl, starting from the list of URLs, and rebuild the index from scratch.

Method 2: Incremental

For each URL in the existing list:
    Check existence / recent modification with the server.
    If the page no longer exists:
        remove it from all indices.
    If the page has been modified:
        remove all obsolete index entries
            (e.g. all words in the old version but not the new;
             backlinks from all links in the old but not the new; etc.);
        add all necessary new index entries.
Additionally needed: a table, indexed by URL, of all the index information for that page.

If the document is unchanged but the URL changes, then change the URL associated with the Doc ID. (Indices actually use fixed-length Doc IDs rather than variable-length URLs.)
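The existence/modification check in the incremental method can use HTTP conditional requests; a minimal sketch with the standard library (status handling is simplified):

from urllib.request import Request, urlopen
from urllib.error import HTTPError

def check(url, last_modified):
    """Return 'gone', 'unchanged', or the page's new content."""
    req = Request(url, headers={"If-Modified-Since": last_modified})
    try:
        with urlopen(req, timeout=60) as resp:
            return resp.read()          # 200: the page changed; reindex it
    except HTTPError as e:
        if e.code == 304:
            return "unchanged"          # not modified since the last crawl
        if e.code in (404, 410):
            return "gone"               # remove it from all indices
        raise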

Refresh strategies

(Arasu et al., 99.) There are two measures to optimize; for either, it is debatable how you should count pages that no longer exist and new, not-yet-indexed pages.

Uniform Strategy: Refresh all pages at the same rate.

Proportional Strategy: Record how often a page changes, and refresh it in proportion to its frequency of change. Uniform works better than proportional.

Optimal Strategy (to maximize the percentage of current pages, given standard modeling assumptions): refresh frequency as a function of change frequency is a downward parabola. Refresh most frequently the pages with a middle rate of change. Pages that change rarely should be refreshed rarely. Pages that change too frequently should not be refreshed at all (they are not worth bothering to index).

Run a separate news crawler over known news sources.

Crawling Strategies

Give priority to crawling "important" pages. The criteria of "importance" are the same as in relevance ranking; we will discuss them then.
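A frontier that serves "important" URLs first can be as simple as a priority queue; a sketch in which the importance score is whatever criterion gets plugged in later:

import heapq
import itertools

class PriorityFrontier:
    """Pop URLs in order of decreasing importance score."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()   # tie-breaker: stable insertion order

    def add(self, url, importance):
        # heapq is a min-heap, so negate the score to pop the most important first
        heapq.heappush(self._heap, (-importance, next(self._counter), url))

    def pop(self):
        return heapq.heappop(self._heap)[2] if self._heap else None

frontier = PriorityFrontier()
frontier.add("http://example.com/", importance=0.9)
frontier.add("http://example.org/obscure", importance=0.1)
print(frontier.pop())       # http://example.com/ comes out first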