How Web Search Engines Work

CSplash

April 20, 2013

Ernie Davis

The Structure of a Web Search Engine

A web search engine consists of three essentially separate programs and three data pools.

Outline

Other issues that come up in constructing a results page that we are not going to talk about:

Crawler

Suppose you wanted to collect all the web pages you could find. What you might do is start at some starting page, click on all the links, get all the linked pages, save them, click on all their links, get more pages, and so on.

That's basically what a crawler does, except that a crawler doesn't run a browser and doesn't exactly click on links. It cuts out the middle man (the browser). What is actually going on when you are using a browser, like Chrome or Explorer or Firefox, and you click on a link is this: The browser is looking at a file with information about the page. The text that you see as the link, called the anchor, is marked in the file with the web address (URL) of the page that it links to. (You can see what is actually in the file of the page you are looking at by clicking "View" and "Page Source" or something similar, depending on the browser.) The URL consists of the name of the server S and the name of the page P. When you click on the link, the browser sends a request to the server asking for the page. The server sends back the page, and then the browser displays it.

For instance, here is a link to My home page. What this file actually contains there is

<A href="http://www.cs.nyu.edu/faculty/davise">My home page</A>.
(This particular format is for HTML. Other kinds of documents use other formats.)
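
To make this concrete, here is a small Python sketch (an illustration, not any search engine's actual code) of what a crawler does instead of clicking: it requests the page named in the example link above directly from its server and reads the URL's out of the anchor tags in the HTML it gets back.

    # Fetch a page and pull the URL's out of its anchor (<A href=...>) tags --
    # the crawler's substitute for "clicking on a link". Illustrative only.
    from html.parser import HTMLParser
    from urllib.request import urlopen

    class LinkExtractor(HTMLParser):
        """Collects the href attribute of every anchor tag in an HTML document."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":                       # html.parser lower-cases tag names
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    # Request the page named in the example link above, just as a browser would.
    with urlopen("http://www.cs.nyu.edu/faculty/davise") as response:
        html = response.read().decode("utf-8", errors="replace")

    extractor = LinkExtractor()
    extractor.feed(html)
    print(extractor.links)    # every URL this page links to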

Getting back to crawlers. The crawler starts with a queue of some URL's it knows about. (On the first crawl, perhaps www.yahoo.com and www.wikipedia.org; on later crawls, all the URL's it found the first time.) It then carries out the following loop:

Loop:
    Take a URL off the queue.
    Request this page from the server that hosts it. (One request at a time to a server.)
    When the page is received:
       1. Make sure it is not a duplicate of a page you already know.
       2. Save it in the page cache to be indexed.
       3. Extract all the new URL's in the page, and add them to the queue.
Until the queue is empty and no requests are outstanding.

Example:

At the start the engine knows about page www1/A.
Queue: [www1/A]
   Ask server www1 for page A.
Wait.
   Receive www1/A.
   Extract from A URL's www1/B, www2/E, www2/F.
Queue: [www1/B, www2/E, www2/F]
   Get URL www1/B from queue. Ask server www1 for page B.
Queue: [www2/E, www2/F]
   Get URL www2/E from queue. Ask server www2 for page E.
Queue: [www2/F]
   Get URL www2/F from queue. Delay request to www2 until previous request for
      E has been satisfied.
   Receive E from www2. 
   Extract from E URL's www2/F (old), www3/G.
Queue: [www2/F, www3/G]
   Get URL www2/F from queue. Ask server www2 for F.
Queue: [www3/G] 
   Get URL www3/G from queue. Ask server www3 for G.
Queue: []
   Receive F from server www2.
   Receive B from server www1.
   Extract from F URL's www1/B (old), www3/G (old).
Queue: []
   Extract from B URL's www2/C. Ask server www2 for page C.
   Receive G from server www3. 
   Extract URL's www2/F (old), www3/H from G.
Queue: [www3/H]
   Get URL www3/H from queue. Ask server www3 for H.
   Receive C from www2. Extract URL www3/D from C.
Queue: [www3/D]
   Get URL www3/D from queue. Delay request to www3 until previous request for
      H has been satisfied.
   Receive H from server www3. Extract URL's www3/D (old) and www3/G (old). Ask server www3 for D.
   Receive D from server www3. Extract URL www3/H (old).
End: Queue is empty, and no outstanding requests.
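
Written out in (much simplified) Python, the crawler loop might look like the sketch below. It is only an illustration: real crawlers run on thousands of machines, obey robots.txt, pace their requests, and detect duplicate pages far more cleverly. Here "one request at a time to a server" is approximated by fetching pages one at a time, and links are pulled out with a regular expression rather than a full HTML parser.

    # A toy, single-threaded version of the crawler loop above. Not a real
    # crawler: no robots.txt, no politeness delays, only crude duplicate detection.
    import re
    from collections import deque
    from urllib.parse import urljoin
    from urllib.request import urlopen

    def crawl(start_urls, max_pages=100):
        queue = deque(start_urls)               # URL's waiting to be fetched
        seen = set(start_urls)                  # URL's already queued or fetched
        cache = {}                              # page cache: URL -> HTML text

        while queue and len(cache) < max_pages:
            url = queue.popleft()               # take a URL off the queue
            try:
                with urlopen(url) as response:  # request the page from its server
                    html = response.read().decode("utf-8", errors="replace")
            except (OSError, ValueError):
                continue                        # skip pages that cannot be fetched
            if html in cache.values():          # 1. skip duplicates of known pages
                continue
            cache[url] = html                   # 2. save it to be indexed
            for link in re.findall(r'href="([^"]+)"', html, flags=re.IGNORECASE):
                absolute = urljoin(url, link)   # resolve relative links
                if absolute not in seen:        # 3. add only new URL's to the queue
                    seen.add(absolute)
                    queue.append(absolute)
        return cache

    # For example: pages = crawl(["http://www.wikipedia.org"], max_pages=10)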

When someone running a browser clicks on a link, what they are doing is sending a request to the server that hosts the page that the link points to, asking it to send them the page. So conceptually what the crawler is doing is clicking on all the links in the start page; getting all the pages it links to; clicking on all their links; and so on.

Industrial-sized crawlers actually have thousands of computers doing this simultaneously. Pretty much the only interaction between them that is necessary is that two machines should not ask for the same URL, and they should not make requests of the same web server simultaneously. It is not even necessary that the machine that extracts the URL's be the same as the one that made the request.
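
One standard way to get that coordination almost for free (a sketch of the idea, not necessarily what any real engine does) is to assign URL's to machines by hashing the name of the server: every URL on a given server then lands on the same machine, so that one machine is responsible both for never fetching a URL twice and for never making two simultaneous requests to that server.

    # Partition the crawl by hashing the server name. All URL's hosted on the
    # same server go to the same machine, so the "same URL twice" and "same
    # server simultaneously" problems become local to a single machine.
    # NUM_MACHINES is an arbitrary illustrative value.
    import hashlib
    from urllib.parse import urlparse

    NUM_MACHINES = 1000

    def machine_for(url):
        host = urlparse(url).netloc             # e.g. "www.cs.nyu.edu"
        digest = hashlib.md5(host.encode("utf-8")).digest()
        return int.from_bytes(digest, "big") % NUM_MACHINES

    print(machine_for("http://www.cs.nyu.edu/faculty/davise"))   # a machine number 0..999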

How do you rank pages?

The query engine gets a query from the user and produces a list of pages in ranked order of quality. How does the engine determine the value of a page for a query?

The details for commercial search engines are a very closely guarded secret, like the recipe for Coca-Cola. (Unlike the recipe for Coca-Cola, the answer changes constantly.) But the general outlines are known. It is presumably some combination of eight factors in three categories.

I'm only going to talk about (1), (3), and (4).

Link Analysis -- PageRank

If many pages point to page P, then P is presumably a good page. However, some links are more important than others:

Page Rank Algorithm

[The creators of Google, Lawrence Page and Sergey Brin, published the PageRank algorithm in their first paper on Google, when they were still doctoral students. Since then the algorithm has become very famous, and a large research literature has been devoted to studying it and its properties. However, the experts I have asked are of the opinion that the ranking algorithm currently used by Google bears only a remote relation to the original PageRank. But here it is, for what it's worth.]

Suppose that each page P has an importance I(P) computed as follows: First, every page has an inherent importance E (a constant) just because it is a web page. Second, if page P has importance I(P), then P contributes an indirect importance F*I(P) that is shared among the pages that P points to (F is another constant). That is, let O(P) be the number of outlinks from P. Then if there is a link from P to Q, P contributes F*I(P)/O(P) "units of importance" to Q. (What happens if P has no outlinks, so that O(P) = 0? This actually turns out to create trouble for our model; there are a number of tricks to deal with it, which we will not discuss here.) We therefore have the following equation: for every page Q, if Q has in-links from P1 ... Pm,

I(Q) = E + F * (I(P1)/O(P1) + ... + I(Pm)/O(Pm))
This looks circular, but it is just a set of linear equations in the quantities I(P). Let N be the total number of pages on the Web (or in our index). We have N equations (one for each value of Q on the left) in N unknowns (the values I(P) on the right), so that, at least, looks promising.

We now make the following observation. Suppose we write down all the above equations for all the different pages Q on the Web. Now we add up all the left-hand sides and all the right-hand sides. On the left we have the sum of I(Q) over all Q on the Web; call this sum S. On the right we have N occurrences of E and, for every page P, O(P) occurrences of F*I(P)/O(P). Therefore, over all the equations, we have for every page P a total of F*I(P), and these add up to F*S. (Note the importance here of our assumption that O(P) > 0.) Therefore we have S = NE + FS, so F = 1 - NE/S. Since the quantities E, F, N, S are all positive, it follows that F < 1 and E < S/N.
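
The same argument in symbols: summing the equation for I(Q) over all N pages Q, each page P shows up on the right-hand side once for each of its O(P) out-links, so

    \[
      S \;=\; \sum_{Q} I(Q)
        \;=\; \sum_{Q} \Bigl( E + F \sum_{P \rightarrow Q} \frac{I(P)}{O(P)} \Bigr)
        \;=\; NE + F \sum_{P} O(P) \cdot \frac{I(P)}{O(P)}
        \;=\; NE + FS .
    \]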

For example, suppose we have pages U, V, W, X, Y, Z linked as shown in the figure: U, V, and W each link to both X and Y; X and Y each link to Z; and Z links to V.

Note that X and Y each have three in-links but two of these are from the "unimportant" pages U and W. Z has two in-links and V has one, but these are from "important" pages. Let E = 0.05 and let F=0.7. We get the following equations (for simplicity, I use the page name rather than I(page name)):

    U = 0.05
    V = 0.05 + 0.7 * Z
    W = 0.05 
    X = 0.05 + 0.7*(U/2+V/2+W/2)
    Y = 0.05 + 0.7*(U/2+V/2+W/2)
    Z = 0.05 + 0.7*(X+Y)
The solution to these is
U = 0.05,
V = 0.256,
W = 0.05,
X = 0.175,
Y = 0.175,
Z = 0.295
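
As a check, the system can be solved mechanically. The short NumPy sketch below builds the same six equations from the link structure implied by them (U, V, and W each link to X and Y; X and Y each link to Z; Z links to V) and solves for the importances.

    # Verify the example: solve the six PageRank equations above with NumPy.
    import numpy as np

    E, F = 0.05, 0.7
    pages = ["U", "V", "W", "X", "Y", "Z"]

    # Out-links of each page; a page's contribution is divided evenly among them.
    outlinks = {
        "U": ["X", "Y"],
        "V": ["X", "Y"],
        "W": ["X", "Y"],
        "X": ["Z"],
        "Y": ["Z"],
        "Z": ["V"],
    }

    n = len(pages)
    idx = {p: i for i, p in enumerate(pages)}

    # The equations say x = E*1 + F*M*x, where M[q, p] = 1/O(p) if p links to q,
    # so we solve (Identity - F*M) x = E*1.
    M = np.zeros((n, n))
    for p, outs in outlinks.items():
        for q in outs:
            M[idx[q], idx[p]] = 1.0 / len(outs)

    importance = np.linalg.solve(np.eye(n) - F * M, E * np.ones(n))
    for p in pages:
        print(p, round(importance[idx[p]], 3))
    # Prints U 0.05, V 0.256, W 0.05, X 0.175, Y 0.175, Z 0.295 (up to rounding)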

Relevance measure

Generally speaking, to compute how relevant page P is to query Q, the query engine matches words in the query to words that are either in P or in links on other pages that point to P.

There are two main issues here:

I have very little information about the details of how any specific search engine solves these two problems, but we can discuss the issues involved.

The score for page P for a particular word W in a query depends on:

The score for page P for query Q involves the scores for the separate words in Q, but also whether Q appears as an exact phrase in P, or whether the words in Q appear close together in P.
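
Since the real scoring formulas are secret, the sketch below only illustrates one classic textbook idea, TF-IDF-style weighting: a word counts for more if it occurs often in P and is rare across the collection, and the per-word scores for a query are simply added. The tiny "collection" here is made up.

    # A toy per-word relevance score in the TF-IDF style. Real engines combine
    # many more signals (phrases, proximity, anchor text, etc.); this is only
    # a sketch of the general idea.
    import math

    def word_score(word, page_words, all_pages):
        """Score one page for one query word: frequency of the word in the page,
        weighted by how rare the word is across the whole collection."""
        tf = page_words.count(word) / max(len(page_words), 1)
        pages_with_word = sum(1 for p in all_pages if word in p)
        idf = math.log(len(all_pages) / (1 + pages_with_word))
        return tf * idf

    def query_score(query_words, page_words, all_pages):
        """Combine the per-word scores for a whole query by simple addition."""
        return sum(word_score(w, page_words, all_pages) for w in query_words)

    # A made-up collection of three "pages", each just a list of words.
    collection = [
        ["web", "search", "engines", "crawl", "the", "web"],
        ["recipes", "for", "apple", "pie"],
        ["search", "and", "rescue", "operations"],
    ]
    print(query_score(["web", "search"], collection[0], collection))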

Imperfect matches can be of various kinds:

The Indexes

The purpose of the indexes is to enable the search engine, when a query is received, to quickly narrow down the pages it has to consider to a small number.

Inverted file

For each word W, the inverted file lists the documents in which W occurs, either in the text or in an in-link pointing to the document. For each word W and document D, it records some of the following:

Lexicon: Information about words

For each word: number of documents containing the word; total number of occurrences of the word; language of the word; possibly related words.

Document index

For each document D: URL; document ID; the list of outlinks from D; the list of inlinks to D; page rank of D; language(s); format; date; frequency of update; etc.
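
Here is a toy illustration (not any engine's actual layout) of how these three structures fit together for a made-up two-document collection: the inverted file maps each word to the documents containing it, the lexicon keeps per-word statistics, and the document index keeps per-document information.

    # Build a tiny inverted file, lexicon, and document index from two documents.
    from collections import defaultdict

    docs = {  # made-up documents: doc ID -> URL and text
        1: {"url": "http://example.com/a", "text": "web search engines crawl the web"},
        2: {"url": "http://example.com/b", "text": "recipes for apple pie"},
    }

    inverted_file = defaultdict(dict)   # word -> {doc ID -> number of occurrences}
    lexicon = {}                        # word -> statistics about the word
    doc_index = {}                      # doc ID -> information about the document

    for doc_id, doc in docs.items():
        words = doc["text"].split()
        doc_index[doc_id] = {"url": doc["url"], "length": len(words)}
        for word in words:
            inverted_file[word][doc_id] = inverted_file[word].get(doc_id, 0) + 1

    for word, postings in inverted_file.items():
        lexicon[word] = {
            "num_docs": len(postings),                   # documents containing the word
            "total_occurrences": sum(postings.values()),
        }

    print(inverted_file["web"])    # {1: 2}
    print(lexicon["web"])          # {'num_docs': 1, 'total_occurrences': 2}
    print(doc_index[1])            # {'url': 'http://example.com/a', 'length': 6}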

What is a word?

This is a trickier problem than one might suppose. For the purposes of a search engine, a word is anything that a user might want to search on. There are two issues. Here are some of the issues that arise in English text. (Other languages have other issues, some of them harder.)