Web Search Engines: Lecture 1

Required Reading

Structure of a Search Engine

Distinguishing Features

Spider

Major full-scale spiders

Spider          Search engines                Size (self-reported)

FAST            Lycos, AllTheWeb              2.1 billion pages (6/02)
Google          Google, Yahoo, Netscape       3 billion docs, 2 billion web pages (12/01)
Inktomi         HotBot, AOL, MSN              500 million (3/01)
Scooter         AltaVista                     550 million (3/01)

Issues:

MERCATOR (Compaq; later used as AltaVista's crawler, replacing Scooter)

Date: 1999 (Heydon & Najork).

Parallelism via multiple threads. Each thread has its own FIFO URL queue. Each server is assigned to a single thread, determined by its canonical host name; thus there is at most one ongoing download from any one server at a time.
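A minimal sketch of this host-to-thread assignment, assuming a hash of the canonical host name picks the queue (the hash function and thread count are illustrative, not Mercator's actual code, which was written in Java):

    import hashlib
    from urllib.parse import urlsplit

    NUM_THREADS = 8   # illustrative; Mercator's thread count was configurable

    def canonical_host(url: str) -> str:
        # Lower-case host name; port and scheme are ignored for assignment.
        return (urlsplit(url).hostname or "").lower()

    def thread_for_url(url: str) -> int:
        # Every URL from one server hashes to the same thread, so that
        # thread's FIFO queue is the only one downloading from that server.
        digest = hashlib.md5(canonical_host(url).encode()).digest()
        return int.from_bytes(digest[:4], "big") % NUM_THREADS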

Size of crawl frontier: hundreds of millions of URLs. Each thread buffers 600 URLs in memory; the rest of its queue is kept on disk.
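A rough sketch of such a per-thread frontier, with a 600-URL in-memory buffer and overflow appended to a disk file (the class and file layout are assumptions for illustration; refilling the buffer from disk is omitted):

    from collections import deque

    class FrontierQueue:
        """FIFO of URLs: up to 600 held in memory, the rest spilled to disk."""

        def __init__(self, path: str, buffer_size: int = 600):
            self.buffer = deque()
            self.buffer_size = buffer_size
            self.overflow = open(path, "a+", encoding="utf-8")

        def push(self, url: str) -> None:
            if len(self.buffer) < self.buffer_size:
                self.buffer.append(url)
            else:
                self.overflow.write(url + "\n")   # keep the tail of the queue on disk

        def pop(self):
            return self.buffer.popleft() if self.buffer else None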

Timeout after 1 minute.

Content-seen test. Compute a 64-bit fingerprint of the content. Fingerprints are kept in a small hash table in memory and a large sorted list on disk. New documents are added to the hash table; the table is merged with the sorted list when full. (8.5% of documents downloaded are duplicates. Session IDs in URLs can generate a potentially unbounded number of "duplicates".)
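A self-contained sketch of the content-seen test, assuming a SHA-1-based stand-in for the 64-bit fingerprint and an in-memory list standing in for the sorted list that Mercator keeps on disk:

    import bisect
    import hashlib

    seen_buffer = set()        # small in-memory hash table of fingerprints
    sorted_fps = []            # stands in for the large sorted list kept on disk
    BUFFER_LIMIT = 1_000_000   # illustrative merge threshold

    def fingerprint64(content: bytes) -> int:
        # 64-bit fingerprint of the document text (stand-in for Mercator's own).
        return int.from_bytes(hashlib.sha1(content).digest()[:8], "big")

    def content_seen(content: bytes) -> bool:
        fp = fingerprint64(content)
        i = bisect.bisect_left(sorted_fps, fp)
        if fp in seen_buffer or (i < len(sorted_fps) and sorted_fps[i] == fp):
            return True                        # duplicate content (about 8.5% of downloads)
        seen_buffer.add(fp)
        if len(seen_buffer) >= BUFFER_LIMIT:
            sorted_fps.extend(seen_buffer)     # merge the full buffer into the sorted list
            sorted_fps.sort()
            seen_buffer.clear()
        return False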

Domain Name Resolution: The standard interfaces (Java, UNIX) allow only one outstanding external DNS request at a time. The result is that 87% of elapsed time is spent waiting for DNS. A cache of domain names is not hugely effective. Custom software that allows multiple concurrent requests reduces this to 25%.
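One way to keep several DNS lookups in flight at once from Python, as a sketch (Mercator's fix was custom resolver code, not a thread pool over the standard library):

    import socket
    from concurrent.futures import ThreadPoolExecutor

    def resolve(host: str):
        try:
            return host, socket.gethostbyname(host)
        except socket.gaierror:
            return host, None

    # Many lookups concurrently instead of one blocking request at a time.
    hosts = ["example.com", "example.org", "example.net"]   # illustrative
    with ThreadPoolExecutor(max_workers=50) as pool:
        for host, addr in pool.map(resolve, hosts):
            print(host, addr)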

URL-seen test. Store a fingerprint of each URL: the full table on disk, with a large buffer in memory. Accesses show a high degree of locality, so buffering is very effective. See the paper.

Actually, a two-part fingerprint is saved: hostname and full URL, stored sorted by hostname and then URL. Thus, when a URL is looked up on disk, other URLs from the same host are nearby and are loaded into memory along with it. This helps because URLs on the same host tend to be cited together.
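A sketch of the two-part fingerprint, assuming a generic 64-bit hash; sorting by (host fingerprint, URL fingerprint) is what clusters URLs from the same host together on disk:

    import hashlib
    from urllib.parse import urlsplit

    def _h64(text: str) -> int:
        return int.from_bytes(hashlib.sha1(text.encode()).digest()[:8], "big")

    def url_fingerprint(url: str):
        # Part 1: hostname fingerprint; part 2: full-URL fingerprint.
        host = (urlsplit(url).hostname or "").lower()
        return (_h64(host), _h64(url))

    urls = ["http://example.com/a", "http://example.com/b", "http://example.org/x"]
    # Stored in this order, a disk lookup for one URL pulls its host's
    # neighbours into memory as well.
    for fp in sorted(url_fingerprint(u) for u in urls):
        print(fp)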

Checkpointing at regular intervals.
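A minimal illustration of periodic checkpointing, under the assumption that the frontier and the URL-seen set are the state worth saving (the interval and file format are not Mercator's):

    import pickle

    CHECKPOINT_EVERY = 100_000   # pages between checkpoints (illustrative)

    def checkpoint(frontier, seen_urls, path="crawler.ckpt"):
        # Persist enough state to resume the crawl after a crash.
        with open(path, "wb") as f:
            pickle.dump({"frontier": list(frontier), "seen": set(seen_urls)}, f)

    def maybe_checkpoint(pages_fetched, frontier, seen_urls):
        if pages_fetched and pages_fetched % CHECKPOINT_EVERY == 0:
            checkpoint(frontier, seen_urls)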

Adaptability through good, modular, object-oriented programming.
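The sort of modularity meant here can be sketched with an abstract component interface. Mercator itself was written in Java, so the Python below only illustrates the design style, and the class names are invented:

    from abc import ABC, abstractmethod
    from urllib.request import urlopen

    class ProtocolModule(ABC):
        """Pluggable downloader; one subclass per URL scheme."""
        @abstractmethod
        def fetch(self, url: str) -> bytes: ...

    class HttpModule(ProtocolModule):
        def fetch(self, url: str) -> bytes:
            with urlopen(url, timeout=60) as resp:   # the 1-minute timeout from above
                return resp.read()

    # The crawl loop dispatches on scheme, so new protocols or processing
    # modules can be added without touching the core code.
    modules = {"http": HttpModule(), "https": HttpModule()}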

Google (Original research version)

Distributed system: 3 or 4 crawler processes and 1 URL server, implemented in Python. Crawlers use asynchronous I/O, with about 300 connections open at once.
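A rough modern analogue of that asynchronous design, capping open connections at about 300; asyncio and aiohttp are stand-ins here, since the original crawler predates both:

    import asyncio
    import aiohttp

    MAX_CONNECTIONS = 300   # roughly the concurrency mentioned above

    async def fetch(session, sem, url):
        async with sem:                      # limit simultaneously open connections
            async with session.get(url) as resp:
                return url, await resp.read()

    async def crawl(urls):
        sem = asyncio.Semaphore(MAX_CONNECTIONS)
        async with aiohttp.ClientSession() as session:
            return await asyncio.gather(*(fetch(session, sem, u) for u in urls))

    # asyncio.run(crawl(["http://example.com/"]))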

Refresh method

Method 1: Redo the complete crawl, starting from the list of URLs, and rebuild the index from scratch.

Method 2: Incremental

for each URL in the existing list:
    check existence / recent modification with the server
    if the page no longer exists:
        remove it from all indices
    elif the page has been modified:
        remove all obsolete index entries
            (e.g., words in the old version but not the new;
             backlinks for links in the old version but not the new; etc.)
        add all necessary new index entries
Additionally needed: a table, indexed by URL, giving all the index information for the page.

If the document is unchanged but its URL changes, then change the URL associated with the Doc ID. (The indices actually use fixed-length Doc IDs rather than variable-length URLs.)
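A sketch of the bookkeeping this implies, with invented table names: the word and link indices hold only fixed-length Doc IDs, and a separate mapping records the current URL of each Doc ID, so a URL change touches one table rather than every index.

    docid_to_url = {}    # Doc ID -> current URL
    url_to_docid = {}    # current URL -> Doc ID
    next_docid = 0

    def assign_docid(url: str) -> int:
        global next_docid
        docid, next_docid = next_docid, next_docid + 1
        docid_to_url[docid] = url
        url_to_docid[url] = docid
        return docid

    def move_document(docid: int, new_url: str) -> None:
        # Same content, new URL: update the mapping only; the indices,
        # which are keyed by Doc ID, are untouched.
        del url_to_docid[docid_to_url[docid]]
        docid_to_url[docid] = new_url
        url_to_docid[new_url] = docid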

Refresh strategies

(Arasu et al., 99)

Uniform Strategy: Refresh all pages at same rate.

Proportional Strategy: Record how often a page changes and refresh it in proportion to its frequency of change. Uniform works better than proportional.

Optimal Strategy (to maximize the percentage of current pages, given standard modelling assumptions): refresh frequency is a downward parabola as a function of change frequency. Refresh most frequently the pages with a middle change rate. Pages that change rarely should be refreshed rarely. Pages that change too frequently should not be refreshed at all (not worth bothering to index them).
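The shape of that policy can be illustrated with a toy weighting function; the numbers and the formula below are not from the paper, only a qualitative sketch of the downward parabola with a cutoff for pages that change too often:

    def refresh_weight(change_rate: float, cutoff: float = 5.0) -> float:
        # Zero for pages that never change, zero again beyond the cutoff,
        # and largest for pages with a middle change rate (changes/day, say).
        if change_rate <= 0 or change_rate >= cutoff:
            return 0.0
        peak = (cutoff / 2) ** 2
        return change_rate * (cutoff - change_rate) / peak

    for rate in (0.1, 1.0, 2.5, 4.0, 6.0):
        print(rate, round(refresh_weight(rate), 2))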

Use a separate news crawler for known news sources.

Crawling Strategies

Give priority to crawling from "important" pages. The criteria of "importance" are the same as for relevance ranking; we will discuss them then.
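A minimal sketch of an importance-ordered frontier, with the importance score left abstract since its definition is deferred to the relevance-ranking discussion:

    import heapq

    frontier = []   # max-heap via negated importance scores

    def add_url(url: str, importance: float) -> None:
        heapq.heappush(frontier, (-importance, url))

    def next_url():
        # Pop the currently most "important" page to crawl next.
        return heapq.heappop(frontier)[1] if frontier else None

    add_url("http://example.com/", 0.9)   # illustrative scores
    add_url("http://example.org/", 0.2)
    print(next_url())                     # -> http://example.com/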