How Web Search Engines Work
CSplash
April 20, 2013
Ernie Davis
The Structure of a Web Search Engine
A web search engine consists of three essentially separate programs and three data pools.
- The Crawler collects all the pages it can find on the web, and
puts them into a web page cache.
- The Indexer uses the pages in the cache to create an index of the web.
- The Query Engine takes a user query; consults the index, the web page cache, and a log of previous queries; and produces a results page.
Outline
- 1. How does the crawler find web pages?
- 2. How does the query engine rank pages?
Other issues that come up in constructing a results page that we are not going
to talk about:
- What does the index look like?
- Implementation issues: How do you do all this in a second or so per query?
- How does the search engine extract snippets?
- How does the search engine categorize "Similar pages" and so on?
- How does the search engine do spelling correction?
- How does the search engine estimate the number of matching pages? (Answer: Google does this very badly, in fact not even consistently. "Number of pages returned by Google" should never be taken as a serious measure of anything. Other search engines such as Bing seem to be somewhat better.)
- How does the search engine choose the "sponsored links" (i.e. paid
advertisements)? Google carries out an auction with every query among sponsors
who have bid on that query.
Crawler
Suppose you wanted to collect all the web pages you can find. What you might do is start at some page, click on all the links, get all the linked pages, save them, click on all their links, get more pages, and so on.
That's basically what a crawler does, except that a crawler doesn't run a
browser and doesn't exactly click on links. It cuts out the middle man
(the browser). What actually is going on when you are using a browser, like
Chrome or Internet Explorer or Firefox, and you click on a link is this: The browser is displaying a file with information about the page. The text that you see as the link, called the anchor, is marked in the file with the web
address (URL) of the page that it links to. (You can see what is actually
in the file of the page you are looking at by clicking "View" and "Page
Source" or something similar, depending on the browser.)
The URL consists of the name of the server S and the name of the page P. When you click on the
link, the browser sends a request to the server asking for the page. The
server sends back the page, and then the browser displays it.
For instance, here is a link to My home page.
What the file actually contains at that point is
<A href="http://www.cs.nyu.edu/faculty/davise">My home page</A>.
(This particular format is for HTML. Other kinds of documents use other
formats.)
Getting back to crawlers.
The crawler starts with a queue of some URL's it knows about. (On the first crawl, perhaps www.yahoo.com and www.wikipedia.org; on later crawls, all the URL's it found the first time.) It then carries out the following loop:
Loop:
Take a URL off the queue;
Request this page from the server that hosts it. (One request at a time to a server.)
When the page is received
1. Make sure it is not a duplicate of a page you already know.
2. Save it in the page cache to be indexed.
3. Extract all the new URL's in the page, and add them to the queue.
until the queue is empty and no requests are outstanding.
Example:
At the start the engine knows about page www1/A.
Queue: [www1/A]
Ask server www1 for page A.
Wait.
Receive www1/A.
Extract from A URL's www1/B, www2/E, www2/F.
Queue: [www1/B, www2/E, www2/F]
Get URL www1/B from queue. Ask server www1 for page B.
Queue: [www2/E, www2/F]
Get URL www2/E from queue. Ask server www2 for page E.
Queue: [www2/F]
Get URL www2/F from queue. Delay request to www2 until previous request for
E has been satisfied.
Receive E from www2.
Extract from E URL's www2/F (old), www3/G.
Queue: [www2/F, www3/G]
Get URL www2/F from queue. Ask server www2 for F.
Queue: [www3/G]
Get URL www3/G from queue. Ask server www3 for G.
Queue: []
Receive F from server www2.
Receive B from server www1.
Extract from F URL's www1/B (old), www3/G (old).
Queue: []
Extract from B URL www2/C. Ask server www2 for page C.
Receive G from server www3.
Extract URL's www2/F (old), www3/H from G.
Queue: [www3/H]
Get URL www3/H from queue. Ask server www3 for H.
Receive C from www2. Extract URL www3/D from C.
Queue: [www3/D]
Get URL www3/D from queue. Ask server www3 for D.
Receive H from server www3. Extract URL's www3/D (old) and www3/G (old).
Receive D from server www3. Extract URL www3/H (old).
End: Queue is empty, and no outstanding requests.
When someone running a browser clicks on a link, what they are doing is sending a request to the server that hosts the page that the link points to, asking it to send them the page. So conceptually what the crawler is doing is clicking on all the links in the start page; getting all the pages it links to; clicking on all their links; and so on.
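Here is a minimal sketch of the crawler loop in Python. The in-memory WEB dictionary just reproduces the link structure of the example above and stands in for the real servers; an actual crawler would issue HTTP requests, obey politeness rules (one request at a time per server, delays between requests), and check for duplicate page content, none of which is shown here.

from collections import deque

# A stand-in for the web: page URL -> list of URLs that page links to.
# The structure is the one used in the example trace above.
WEB = {
    "www1/A": ["www1/B", "www2/E", "www2/F"],
    "www1/B": ["www2/C"],
    "www2/C": ["www3/D"],
    "www3/D": ["www3/H"],
    "www2/E": ["www2/F", "www3/G"],
    "www2/F": ["www1/B", "www3/G"],
    "www3/G": ["www2/F", "www3/H"],
    "www3/H": ["www3/D", "www3/G"],
}

def crawl(start_urls):
    queue = deque(start_urls)   # URLs waiting to be fetched
    seen = set(start_urls)      # URLs already queued or fetched
    cache = {}                  # the page cache, to be handed to the indexer
    while queue:
        url = queue.popleft()            # take a URL off the queue
        links = WEB.get(url, [])         # stands in for requesting the page from its server
        cache[url] = links               # save the page in the cache
        for link in links:               # extract the URLs in the page
            if link not in seen:         # only new URLs go on the queue
                seen.add(link)
                queue.append(link)
    return cache

pages = crawl(["www1/A"])
print(sorted(pages))   # all eight pages of the example are found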
Industrial-sized crawlers actually have thousands of computers doing this simultaneously. Pretty much the only coordination needed between them is that two machines should not ask for the same URL, and they should not make requests of the same web server at the same time. It is not even necessary that the machine that extracts the URL be the same as the one that makes the request.
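One simple (purely illustrative) way to get both properties is to assign each URL to a machine by hashing its server name, as in the sketch below: every page on a given server then belongs to a single machine, which can locally enforce the one-request-at-a-time rule and discard duplicate URLs. The fleet size and the server/page URL format are assumptions carried over from the toy example above.

import hashlib

NUM_MACHINES = 1000   # made-up fleet size

def host_of(url):
    # In the examples above a URL has the form "server/page", e.g. "www2/F".
    return url.split("/", 1)[0]

def machine_for(url):
    # Hash the server name, not the full URL, so that all pages on one
    # server are handled by the same machine.
    digest = hashlib.md5(host_of(url).encode()).hexdigest()
    return int(digest, 16) % NUM_MACHINES

# Any machine that extracts a URL can compute its owner and forward it there.
print(machine_for("www2/E") == machine_for("www2/F"))   # True: same server, same machine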
How do you rank pages?
The query engine gets a query from the user and produces a list of pages
in ranked order of quality. How does the engine determine the value of
a page for a query?
The details for commercial search engines are a very closely guarded secret,
like the recipe for Coca-Cola. (Unlike the recipe for Coca-Cola, the answer
changes constantly.) But the general outlines are known. It is presumably some combination of eight factors, in three categories.
- Query independent measure of the importance of the page.
- 1. Based on the links to the page (PageRank).
- 2. (Possibly) Based on other considerations; e.g. Wikipedia or .edu
pages are considered high quality.
- Query specific measure of how relevant the page is to the query.
- 3. Based on the content of the page.
- 4. Based on links to the page.
- 5. Based on query logs. Most queries that the search engine receives have been received many times before. The query log records which pages users chose to click on, and to some extent even how long they spent looking at a page (judged by whether they came back soon to the results page).
- 6. Personalized search. The search engine uses its information about you,
the user --- the queries you've made, the links you've clicked, perhaps other
information? --- to select results that it thinks will particularly interest
you.
- 7. Specific answers to known queries. E.g. The first result returned for
"convert grams to ounces" is the answer.
- 8. Topic dispersion: If a search term has many meanings, then the important ones should all be represented among the top results, regardless of the inherent importance of the pages. E.g. among the top results for "CMU" are pages for "Central Methodist University" and "Central Michigan University", even though, by raw importance alone, almost all of the top-ranked pages would presumably be associated with Carnegie Mellon University.
I'm only going to talk about (1), (3), and (4).
Link Analysis -- PageRank
If many pages point to page P, then P is presumably a good page.
However, some links are more important than others:
- Links from a different domain count more than links from the same domain.
- The nature of the anchor (e.g. font, boldface, position?) may indicate a judgment of relative importance.
- A link from an important page is more important than a link from an
unimportant page. This is circular, but the circularity can be resolved:
The PageRank Algorithm
[The creators of Google, Lawrence Page and Sergey Brin
published the PageRank algorithm in their first paper on Google,
when they were still doctoral students. Since then the algorithm has become
very famous, and a large research literature
has been devoted to studying it and its properties. However the experts I have
asked are of the opinion that the ranking algorithm currently used by Google
bears only a remote relation to the original PageRank. But here it is, for
what it's worth.]
Suppose that each page P has an importance I(P) computed as follows: First,
every page has an inherent importance E (a constant) just because it is a web page.
Second, if page P has importance I(P) then P contributes an indirect
importance F*I(P) that is shared among the pages that P points to. (F is another
constant). That is, let O(P) be the number of outlinks from P. Then if there is a link from P to Q, P contributes F*I(P)/O(P) "units of importance" to Q.
(What happens if P has no outlinks, so that O(P) = 0? This actually turns out to create trouble for our model; there are a number of tricks to deal with it, which we will not discuss here.)
We therefore have the following equation: for every page Q, if Q has in-links
from P1 ... Pm,
I(Q) = E + F * (I(P1)/O(P1) + ... + I(Pm)/O(Pm))
This looks circular, but it is just a set of linear equations in the quantities
I(P). Let N be the total number of pages on the Web (or in our index.)
We have N equations (one for each value of Q on the left) in N unknowns
(the values I(P) on the right), so at least that looks promising.
We now make the following observation.
Suppose we write down all the above
equations for all the different pages Q on the Web. Now we add up all the left hand sides
and all right hand sides. On the left we have the sum of I(Q) over all Q on the web;
call this sum S.
On the right we have N occurrences of E and, for every page P, O(P) occurrences of F*I(P)/O(P). Therefore, over all the equations, we have for every page P a total of F*I(P), and these add up to F*S. (Note the importance here of our assumption that O(P) > 0.) Therefore, we have
S = N*E + F*S, so F = 1 - N*E/S.
Since the quantities E, F, N, S are all positive, it follows that
F < 1 and E < S/N.
For example, suppose we have pages U, V, W, X, Y, Z linked as shown in the figure: U, V, and W each link to X and Y; X and Y each link to Z; and Z links to V.
Note that X and Y each have three in-links but two of these are from the
"unimportant" pages U and W. Z has two in-links and V has one, but these
are from "important" pages. Let E = 0.05 and let F=0.7.
We get the following equations (for simplicity,
I use the page name rather than I(page name)):
U = 0.05
V = 0.05 + 0.7 * Z
W = 0.05
X = 0.05 + 0.7*(U/2+V/2+W/2)
Y = 0.05 + 0.7*(U/2+V/2+W/2)
Z = 0.05 + 0.7*(X+Y)
The solution to these is
U = 0.05,
V = 0.256,
W = 0.05,
X = 0.175,
Y = 0.175,
Z = 0.295
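One way to check these numbers, and to see how such values can be computed when N is in the billions, is to start every page at importance E and keep re-applying the equations until the values stop changing. The sketch below does this for the six-page example; the link structure is the one read off from the equations above, and the iteration count is just an arbitrary large number.

# Each page maps to the list of pages it links to (inferred from the equations above).
links = {"U": ["X", "Y"], "V": ["X", "Y"], "W": ["X", "Y"],
         "X": ["Z"], "Y": ["Z"], "Z": ["V"]}
E, F = 0.05, 0.7

I = {p: E for p in links}                   # initial guess
for _ in range(100):                        # repeat until the values settle
    new_I = {p: E for p in links}
    for p, outlinks in links.items():
        share = F * I[p] / len(outlinks)    # F*I(P)/O(P) goes to each page that P links to
        for q in outlinks:
            new_I[q] += share
    I = new_I

print({p: round(v, 3) for p, v in sorted(I.items())})
# {'U': 0.05, 'V': 0.256, 'W': 0.05, 'X': 0.175, 'Y': 0.175, 'Z': 0.295}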
Relevance measure
Generally speaking,
to compute how relevant page P is to query Q, the query engine matches words
in the query to words that are either in P or in links on other pages
that point to P.
There are two main issues here:
- How is the score computed?
- What is considered a match?
I have very little information about the details of how any specific search engine solves these two problems, but we can discuss the issues involved.
The score for page P for a particular word W in a query depends on:
- The number of occurrences of W in P and the length of P. Does W appear once in a page of 1000 words, ten times in a page of 1000 words, or ten times in a page of 10,000 words?
- The rarity of W. If the query is "penguin takes a bath", "penguin" is
important, "bath" is less important, "takes" and "a" are unimportant or
completely ignored.
- The way W is used in P --- e.g. in title, in large font, bold face,
etc.
- If it's an imperfect match, then the quality of the match.
The score for page P for query Q involves the scores for the
separate words in Q, but also whether Q appears as an exact phrase in P, or
whether the words in Q appear close together in P.
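None of the actual formulas are public, but the flavor of such a word-based score can be conveyed by a toy TF-IDF-style calculation: frequent occurrence in a short page raises the score, and rare words count for more than common ones. Everything below (the documents, the weighting, the logarithm) is purely illustrative, not anyone's real ranking function.

import math

# A made-up document collection, already reduced to lists of words.
docs = {
    "doc1": "penguin colonies penguin chicks take a bath in icy water".split(),
    "doc2": "how to take a bath".split(),
    "doc3": "london is a city in england".split(),
}

def score(query_words, doc_words, all_docs):
    n = len(all_docs)
    total = 0.0
    for w in query_words:
        tf = doc_words.count(w) / len(doc_words)          # occurrences, normalized by page length
        df = sum(1 for d in all_docs.values() if w in d)  # how many documents contain w
        if df == 0:
            continue                                      # unknown word: no contribution
        idf = math.log(n / df)                            # rare words get a higher weight
        total += tf * idf
    return total

query = "penguin takes a bath".lower().split()
for name, words in docs.items():
    print(name, round(score(query, words, docs), 3))
# doc1 scores highest ("penguin" is rare and occurs twice), doc2 next ("bath"),
# doc3 scores 0: "a" occurs in every document, so it carries no weight.

Note that "takes" matches nothing here because the documents contain "take"; handling such imperfect matches is the subject of the list that follows.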
Imperfect matches can be of various kinds:
- Misspellings.
- Alternative spellings e.g. "colour" for "color"
- Compound words e.g. "house boat" => "houseboat"
- Words with a common root. E.g. "decompression", "compression",
"compressed" etc. But you don't want to take this too far; it is probably
not helpful to match against "press" though, linguistically, that is actually
the root.
- Closely related words. E.g. if you do a Google search on "Holland population" (3/17/11), many of the pages returned contain the word "Netherlands" but not "Holland", and the snippets have "Netherlands" in boldface, indicating a match.
The Indexes
The purpose of the indexes is so that, when a query is received, the search
engine can quickly narrow the pages it has to consider to a small number.
Inverted file
For each word W, the list of documents in which W occurs, either in the text or in the anchor of an inlink. For each word W and document D, some of the following:
- The relevance score of D to W.
- The number of occurrences of W in D.
- The list of occurrences of W in D. For each occurrence, the position and the context (e.g. boldface, font size, title, meta-tag, etc.). Position can be word count or byte count or both.
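A minimal in-memory version of such an inverted file, assuming the documents have already been reduced to lists of words, might look like the sketch below. It records, for each word and document, the word-count position of every occurrence; the relevance scores and the context information (boldface, title, etc.) are omitted.

from collections import defaultdict

def build_inverted_index(docs):
    # index[word][doc_id] = list of word-count positions where word occurs
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, words in docs.items():
        for position, word in enumerate(words):
            index[word][doc_id].append(position)
    return index

docs = {
    "D1": ["the", "penguin", "takes", "a", "bath"],
    "D2": ["penguin", "colonies", "in", "antarctica"],
}
index = build_inverted_index(docs)
print(dict(index["penguin"]))   # {'D1': [1], 'D2': [0]}
print(len(index["penguin"]))    # 2 documents contain "penguin" (a lexicon-style count)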
Lexicon: Information about words
Number of documents containing word. Total number of occurrences of word.
Language of word. Possibly related words.
Document index
For each document D: URL; document ID; the list of outlinks from D;
the list of inlinks to D; page rank of D;
language(s); format; date; frequency of update; etc.
What is a word?
This is a trickier problem than one might suppose. For the purposes of a
search engine, a word is anything that a user might want to search on.
There are two issues:
- Where does a word start and end?
- When are two instances of a word the same word?
Some of the issues that arise in English text are listed below. (Other languages have other issues, some of them harder.)
- Upper vs. lower case. Google regularizes to lower case.
- Multiple character codes.
- Internal punctuation:
- Hyphens. Generally the word is split in two.
- Apostrophes, for possessives, contractions, and other uses.
- Diacritical marks.
- Abbreviations: U.S.A. vs. USA.
- Numbers: integers, floating point, dates, monetary amounts.
- Compounds with letters, digits, and punctuation, e.g. "Java1.1".
- Junk to be excluded. For example, there are lots of text files that for whatever reason contain very long strings of alphabetic characters with no punctuation. Presumably the search engine puts some upper bound on the length of string it is willing to consider a word.
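As a concrete illustration, here is a toy tokenizer that makes one particular set of choices on a few of these issues: it lower-cases, splits hyphenated words in two, drops possessive 's, keeps letter-and-digit compounds like "Java1.1" and dotted abbreviations like "U.S.A.", and throws away absurdly long strings. Every choice here, including the 40-character cutoff, is an assumption made for the example, not a description of what any real engine does.

import re

MAX_WORD_LEN = 40   # made-up cutoff for junk strings

# A token is a run of letters/digits, possibly with internal dots (java1.1, u.s.a).
TOKEN = re.compile(r"[A-Za-z0-9]+(?:\.[A-Za-z0-9]+)*")

def tokenize(text):
    text = text.lower()                 # regularize to lower case
    text = text.replace("-", " ")       # split hyphenated words in two
    text = re.sub(r"'s\b", "", text)    # drop possessive 's
    words = TOKEN.findall(text)
    return [w for w in words if len(w) <= MAX_WORD_LEN]

print(tokenize("Google's web-search engine handles Java1.1 and U.S.A. dates like 4/20/2013."))
# ['google', 'web', 'search', 'engine', 'handles', 'java1.1', 'and', 'u.s.a',
#  'dates', 'like', '4', '20', '2013']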