Web Search for a Planet: The Google Cluster Architecture, by Luiz Andre Barroso et al., IEEE Micro, 2003, pp. 22-28.

The PageRank Citation Ranking: Bringing Order to the Web, by Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd.

Authoritative Sources in a Hyperlinked Environment, by Jon M. Kleinberg.

Finding Authorities and Hubs from Link Structures on the World Wide Web, by Allan Borodin, Gareth Roberts, Jeffrey Rosenthal, and Panayiotis Tsaparas.

Multiple clusters distributed worldwide.

A few thousand machines per cluster.

Copied from Barroso et al. 2003.

Steps in query answering:

- DNS picks out a cluster geographically close to the user.
- A Google Web server machine (GWS) at the cluster is chosen based on load balancing.
- The query is sent to index servers. Each index server holds an inverted index for a random subset of the documents.
- Each index server returns a list of docids, with relevance scores, sorted in decreasing order of score.
- The GWS merges the lists, getting an overall list sorted by relevance score.
- The full text of the actual documents is divided among file servers. The GWS sends each file server the query plus the list of docids held on that server.
- Each file server returns a document summary for that query (an excerpt containing the query words, with the query words highlighted).
- The GWS assembles the summaries, advertisements from an ad server, and spelling-correction suggestions, and returns the result to the user.
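The merge step the GWS performs can be sketched in a few lines of Python. This is only a toy illustration: the shard contents, docids, and scores below are made up, and each "index server" is just a pre-sorted list of (docid, score) pairs for its subset of documents.

```python
import heapq

# Hypothetical per-shard results: each index server returns (docid, score)
# pairs for its random subset of documents, sorted by decreasing score.
shard_results = [
    [("d12", 0.91), ("d07", 0.55), ("d31", 0.10)],   # index server 1
    [("d44", 0.87), ("d02", 0.60)],                  # index server 2
    [("d19", 0.72), ("d25", 0.70), ("d88", 0.05)],   # index server 3
]

def merge_shard_results(shard_results, k=5):
    """Merge per-shard lists (each sorted by decreasing score) into one
    overall top-k list of docids, as the GWS does."""
    merged = heapq.merge(*shard_results, key=lambda pair: pair[1], reverse=True)
    return [docid for docid, _ in list(merged)[:k]]

print(merge_shard_results(shard_results))
# ['d12', 'd44', 'd19', 'd25', 'd02']
```

Because each shard's list is already sorted, a heap-based merge does the combination in one pass rather than re-sorting everything.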

A rack: 40 to 80 custom-made servers. CPUs range from 533 MHz Celeron to 1.4 GHz Pentium III. Each server has one or more 80 GByte disk drives. Servers last two or three years; then they are outdated by faster machines. A comparable array of standard machines costs $7,700 per month.

Cooling: Power density = 400 Watts / square foot.

Commercial data centers generally can handle 70-150 Watts per square foot.

However, some links are more important than others:

- 1. Links from a different domain count more than links from the same domain.
- 2. The nature of the anchor (e.g. font, boldface, position) may indicate a judgment of relative importance.
- 3. A link from an important page is more important than a link from an unimportant page. This is circular, but the circularity can be resolved:

(What happens if P has no outlinks, so that O_{P} = 0? This actually turns out to create trouble for our model. For the time being, we will assume that every page has at least one outlink, and we will return to this problem below.)

We therefore have the following equation: for every page Q, if Q has in-links from P1 ... Pm,

(1) I(Q) = E + F * (I(P1)/O_{P1} + ... + I(Pm)/O_{Pm})

This looks circular, but it is just a set of linear equations in the quantities I(P). Let N be the total number of pages on the Web (or in our index). We have N equations (one for each value of Q on the left) in N unknowns (the values I(P) on the right), so this at least looks promising.

We now make the following observation.
Suppose we write down all the above
equations for all the different pages Q on the Web. Now we add up all the left hand sides
and all right hand sides. On the left we have the sum of I(Q) over all Q on the web;
call this sum S.
On the right we have N occurrences of E and for every page P, O_{P}
occurrences of F*I(P) / O_{P}. Therefore, over all the equations,
we have for every page P a total of F*I(P), and these add up to F*S.
(Note the importance here of our assumption that O_{P} > 0).
Therefore, we have

(2) S = NE + FS, so F = 1 - NE/S.

Since the quantities E, F, N, S are all positive, it follows that F < 1 and E < S/N.

For example, suppose we have pages U,V,W,X,Y,Z with the following links: U, V, and W each link to X and Y; X and Y each link to Z; and Z links to V.

Note that X and Y each have three in-links but two of these are from the "unimportant" pages U and W. Z has two in-links and V has one, but these are from "important" pages. Let E = 0.05 and let F=0.7. We get the following equations (for simplicity, I use the page name rather than I(page name)):

U = 0.05
V = 0.05 + 0.7*Z
W = 0.05
X = 0.05 + 0.7*(U/2 + V/2 + W/2)
Y = 0.05 + 0.7*(U/2 + V/2 + W/2)
Z = 0.05 + 0.7*(X + Y)

The solution to these is

U = 0.05 V = 0.256 W = 0.05 X = 0.175 Y = 0.175 Z = 0.295
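As a check, this linear system can be solved directly. A small numpy sketch, with the pages in the order U,V,W,X,Y,Z and the link matrix written out from the example:

```python
import numpy as np

# Pages U,V,W,X,Y,Z; links as in the example:
# U,V,W each link to X and Y; X and Y each link to Z; Z links to V.
E, F = 0.05, 0.7
# B[q][p] = 1/O_p if p links to q, else 0
B = np.array([
    [0,   0,   0,   0, 0, 0],   # U: no in-links
    [0,   0,   0,   0, 0, 1],   # V <- Z (O_Z = 1)
    [0,   0,   0,   0, 0, 0],   # W: no in-links
    [0.5, 0.5, 0.5, 0, 0, 0],   # X <- U,V,W (each with 2 out-links)
    [0.5, 0.5, 0.5, 0, 0, 0],   # Y <- U,V,W
    [0,   0,   0,   1, 1, 0],   # Z <- X,Y (O_X = O_Y = 1)
])
# Equation (1) in matrix form: I = E*1 + F*B*I, i.e. (Id - F*B) I = E*1
I = np.linalg.solve(np.eye(6) - F * B, E * np.ones(6))
print({p: round(float(v), 3) for p, v in zip("UVWXYZ", I)})
# {'U': 0.05, 'V': 0.256, 'W': 0.05, 'X': 0.175, 'Y': 0.175, 'Z': 0.295}
```

Note that the solution sums to 1, as equation (2) predicts: S = NE/(1-F) = 6*0.05/0.3 = 1.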

Imagine a random browser who, at each step, currently at page P, does the following:

1. Flip a weighted coin that comes up heads with probability e and tails with probability (1-e).
2. If the coin was heads, he picks a page at random in the Web and goes there.
3. If the coin was tails, he picks an outlink from P at random and follows it. (Again, we assume that every page has at least one outlink.)

The browser does this for eons (posit that the web stays constant all that time) and we keep track of where he has been. At the end of this we state that the importance of each page is the fraction of the time that he has spent on that page while browsing. It is easy to show that "importance", thus defined, satisfies the equations above, where E=e/N and F=(1-e). Or, equivalently, we can ask "What is the probability that after an eon of browsing, the browser is on page P," and that probability is equal to the "importance".
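This random-browsing process can be simulated directly on the six-page example. A sketch (the seed, step count, and starting page are arbitrary choices); with e = 0.3 and N = 6 we get E = e/N = 0.05 and F = 1-e = 0.7, matching the example, so the visit frequencies should come out near U = 0.05, V = 0.256, W = 0.05, X = 0.175, Y = 0.175, Z = 0.295:

```python
import random

random.seed(0)
e = 0.3
# Outlinks of the example: U,V,W -> X,Y; X,Y -> Z; Z -> V.
out = {"U": ["X", "Y"], "V": ["X", "Y"], "W": ["X", "Y"],
       "X": ["Z"], "Y": ["Z"], "Z": ["V"]}
pages = list(out)
visits = {p: 0 for p in pages}
p = "U"
steps = 200_000
for _ in range(steps):
    if random.random() < e:
        p = random.choice(pages)      # heads: jump to a random page
    else:
        p = random.choice(out[p])     # tails: follow a random outlink
    visits[p] += 1
freq = {q: visits[q] / steps for q in pages}
print({q: round(f, 3) for q, f in freq.items()})
```

An eon of 200,000 steps is enough here for the frequencies to agree with the computed importance to a couple of decimal places.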

We will prove half of the probabilistic formulation; namely, we will
prove that * if * the probability distribution converges, then
it must converge to the "importance" as defined in equation (1). We omit
the proof that it does converge; see any text on Markov processes.

Markov process: States and transitions. The event that at time T+1 you will be in state S depends only on state at time T, not on anything else. (States are Web pages.)

AT(P,T) = event that you are in state P at time T.

A[Q,P] = Prob(AT(Q,T+1) | AT(P,T)). An NxN matrix.
L_{T}(P) = Prob(AT(P,T)). A vector of length N.

E.g. in the example above, A is the 6x6 matrix (Q is row, P is column).

0.05 0.05 0.05 0.05 0.05 0.05
0.05 0.05 0.05 0.05 0.05 0.75
0.05 0.05 0.05 0.05 0.05 0.05
0.40 0.40 0.40 0.05 0.05 0.05
0.40 0.40 0.40 0.05 0.05 0.05
0.05 0.05 0.05 0.75 0.75 0.05

Note that the sum of every column is 1.0, because from each state P you go to exactly 1 state Q.

Now,

(3) L_{T+1}(Q) = Prob(AT(Q,T+1)) =
sum_{P} Prob(AT(Q,T+1) | AT(P,T)) Prob(AT(P,T)) =
sum_{P} A(Q,P) L_{T}(P).

In matrix notation we have

(4) L_{T+1} = A * L_{T}
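Equation (4) can be checked numerically: starting from any distribution and repeatedly multiplying by A drives L to the steady state. A small numpy sketch using the 6x6 matrix from the example (100 iterations is an arbitrary but ample choice, since the error shrinks by a factor of at least 1-e = 0.7 each step):

```python
import numpy as np

# The 6x6 transition matrix A from the example (rows Q, columns P).
A = np.array([
    [0.05, 0.05, 0.05, 0.05, 0.05, 0.05],
    [0.05, 0.05, 0.05, 0.05, 0.05, 0.75],
    [0.05, 0.05, 0.05, 0.05, 0.05, 0.05],
    [0.40, 0.40, 0.40, 0.05, 0.05, 0.05],
    [0.40, 0.40, 0.40, 0.05, 0.05, 0.05],
    [0.05, 0.05, 0.05, 0.75, 0.75, 0.05],
])
L = np.ones(6) / 6            # L_0: uniform distribution over U..Z
for _ in range(100):
    L = A @ L                 # equation (4): L_{T+1} = A * L_T
print([round(float(x), 3) for x in L])
# [0.05, 0.256, 0.05, 0.175, 0.175, 0.295]
```

The result matches the importance values computed for the example earlier.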

If we have attained a steady state, then L_{T+1} = L_{T}, so we have

(5) L = A L.

or equivalently

(6) 0 = (A-I)L

where I is the unit matrix (1's on the diagonals, 0's elsewhere). In the
terminology of linear algebra, we say that L is an eigenvector of A
with eigenvalue 1.0.
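Equivalently, L can be obtained from a general-purpose eigensolver by picking out the eigenvector for eigenvalue 1.0 and rescaling it to a probability distribution. A numpy sketch, again with the example's matrix:

```python
import numpy as np

# The 6x6 transition matrix A from the example (rows Q, columns P).
A = np.array([
    [0.05, 0.05, 0.05, 0.05, 0.05, 0.05],
    [0.05, 0.05, 0.05, 0.05, 0.05, 0.75],
    [0.05, 0.05, 0.05, 0.05, 0.05, 0.05],
    [0.40, 0.40, 0.40, 0.05, 0.05, 0.05],
    [0.40, 0.40, 0.40, 0.05, 0.05, 0.05],
    [0.05, 0.05, 0.05, 0.75, 0.75, 0.05],
])
w, v = np.linalg.eig(A)
k = int(np.argmin(np.abs(w - 1.0)))   # index of the eigenvalue closest to 1.0
L = np.real(v[:, k])
L = L / L.sum()                       # rescale to a probability distribution
print([round(float(x), 3) for x in L])
```

The rescaling step matters: an eigensolver returns the eigenvector with arbitrary scale (and sign); dividing by the sum recovers the unique probability distribution.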

Now, note that A = (e/N)U + (1-e)B where U is the matrix of all 1's and B is the matrix

B[Q,P] = 1/O_{P} if P links to Q; 0 if P does not link to Q.

(E.g. in our example e=0.3, so e/N = 0.05.) B =

0   0   0   0   0   0
0   0   0   0   0   1
0   0   0   0   0   0
0.5 0.5 0.5 0   0   0
0.5 0.5 0.5 0   0   0
0   0   0   1.0 1.0 0

Also, if L is any probability distribution, then the sum of the elements of L is 1, so (e/N)UL = (e/N)U', where U' is the vector of all 1's. So equation (5), L = AL, is equivalent to

(7) L = [(e/N)U + (1-e)B]L = (e/N)U' + (1-e)BL.

But this is just the same thing as equation (1) written in matrix notation.
So L, the steady state probability distribution, is the same thing as
I, the importance.

Another way of visualizing this is using a "hydraulic" model. Think of each of the web pages as a "reservoir" of "importance", and of there being a fixed rate of flow from each page to every other page. Then the page rank is the steady state, in which each page has a constant level of importance. (This model makes it easy to see why it is important that every page have an outlink; if there is not enough outflow from a given page, then depending on how you set up the equations, either "importance" simply accumulates at dead-end pages, ultimately leaving all the importance there, or "importance" evaporates out of the dead-end pages, gradually sucking the entire system dry. In either case, there is no useful steady state.)

Theorem (which I will not prove): If 0 < e <= 1, equation (1) has a unique solution.

Intuitive understanding of relative importance in our examples: In the steady state, if there are no jumps, then the browser goes through a cycle V, [X or Y], Z, V, [X or Y], Z ... , so it goes through V and Z about twice as often as through X and Y. (If we had chosen a smaller value of e, this would be more exact.)

Algorithm 1:

for all pages P do I(P) := 1/N;   /* start with a uniform distribution */
repeat {
    for all pages P do {
        J(P) := E;
        for all pages Q such that Q links to P do
            J(P) += F * I(Q)/O_{Q};
    }
    if the difference between J and I is small then return(J);
    I := J;
}

Convergence can be accelerated by extrapolating the direction of change.
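A minimal Python sketch of Algorithm 1, run on the six-page example (the graph, tolerance, and variable names are choices for illustration; the inner update is exactly equation (1), iterated to a fixed point):

```python
# Outlinks of the example: U,V,W -> X,Y; X,Y -> Z; Z -> V.
out = {"U": ["X", "Y"], "V": ["X", "Y"], "W": ["X", "Y"],
       "X": ["Z"], "Y": ["Z"], "Z": ["V"]}
E, F = 0.05, 0.7

def pagerank(out, E, F, tol=1e-10):
    I = {p: 1.0 / len(out) for p in out}        # uniform start
    while True:
        J = {p: E for p in out}
        for q in out:                           # for each link q -> p,
            for p in out[q]:                    # q contributes F*I(q)/O_q to p
                J[p] += F * I[q] / len(out[q])
        if max(abs(J[p] - I[p]) for p in out) < tol:
            return J
        I = J

I = pagerank(out, E, F)
print({p: round(v, 3) for p, v in I.items()})
# {'U': 0.05, 'V': 0.256, 'W': 0.05, 'X': 0.175, 'Y': 0.175, 'Z': 0.295}
```

Since F < 1, each pass shrinks the error by at least a factor of F, so the loop is guaranteed to terminate.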

Algorithm 2:

for all pages P do I(P) := 1/N;   /* start with a uniform distribution */
repeat {
    for all pages P do {
        J(P) := 0;
        for all pages Q such that Q links to P do
            J(P) += I(Q)/O_{Q};
    }
    D := (sum of absolute values of elements of I) - (sum of absolute values of elements of J);
    for all pages P do J(P) += D*E;
    if the difference between J and I is small then return(J);
    I := J;
}

Page et al. report that for N=322 million, this converges to reasonable tolerance in 52 iterations.

For e close to 1, the distribution is nearly uniform, and ranking by PageRank amounts to ranking by number of in-links.

At e=0, things go kerflooey. There still exists a stable distribution, but there may be more than one, and the system may not converge to it. The above algorithms may not converge.

At e close to 0, the system is unstable (i.e. small changes to structure make large change to solution.) The above algorithms converge only slowly.

Experiments of Page et al. used e=0.15.

Another solution in principle would be to say that, if there are no outlinks, you jump randomly (i.e. with 1/N probability to each known page.) Algorithm 1 can be adapted to this, at the cost of a little more complexity. I don't know about algorithm 2.

** Trick: ** Create k pages P1 ... Pk, each linking to page P. Then the importance of P gets an increment of k*e*(1-e)/N (actually a little more), in addition to whatever it gets from outside. In general, quite aside from malicious spamming, this measure favors web sites that consist of many interacting pages rather than one big page.

Solution: Put a maximum on the number of links allowed to P from the same site (or on their total weight).

** Countertrick: ** Buy two domains and put P1 ... Pk in one domain and
P in the other.

** Anti-countertrick: ** For any domain D put a maximum on the number
(or total weight) of links from D to page P.

One has to presume that it will not pay for the spammer to get *lots*
of domains. There's not much that could be done about that.

Observation:
The Web does not actually consist of important pages pointing to other important pages. Rather, the good pages in any domain tend to be either *authorities* with good content, or *hubs* with good links. A hub has a lot of out-links to authorities; an authority has a lot of in-links from hubs. Authorities tend not to point to authorities, either because they are in competition, or because that's not what they're concerned with. We can perhaps improve importance evaluation by taking this structure into account.

Also, we can try to use this to solve the problem of "non-self-descriptive" important Web pages. For example, one would want the query "Japanese auto manufacturer" to return the home pages for the major Japanese auto manufacturers, but in fact only Isuzu is in the top 30 returned by Google (9/19/02), because its home page happens to contain those words. However, if we can find good hubs with those words, they will probably tend to point to these home pages.

Algorithm:

(This is actually the rational reconstruction of the model due to
Ng et al., which they show to be more stable than Kleinberg's original
proposal. There are bunches of alternative formulations in the literature.)

C := top K pages for query, as returned by search engine (K is a fixed parameter).
O := follow all outlinks from C.
I := follow at most D inlinks from C. (D is another fixed parameter.)
/* There is an inherent limitation on the number of outlinks from any page, but there are pages with vast numbers of inlinks. */
PAGES := C union O union I;
M := the following Markov model on PAGES: {
    TURN := TRUE;
    P := arbitrary starting point;
    repeat {
        with probability e {
            P := random choice in PAGES;
            TURN := randomly either TRUE or FALSE;
        }
        with probability (1-e) {
            if TURN then {
                choose a random outlink from P to Q; P := Q;
                mark this as an "authority" visit to P;
            } else {
                choose a random inlink into P from Q; P := Q;
                mark this as a "hub" visit to P;
            }
            TURN := not TURN;
        }
    }
}
Find the stable distribution for this Markov model as in the previous section.

Thus, in this Markov model, when you are not jumping randomly, you alternate between moving across an outlink (presumably from a hub to an authority) and moving across an inlink (presumably from an authority to a hub). You thus have two vectors: L(P), the probability that a random hub visit is to P, and M(P), the probability that a random authority visit is to P. Let A be the transition matrix when TURN = TRUE and let B be the transition matrix when TURN = FALSE; it is easily shown that B is the transpose of A. Then

M = AL and L=BM so M = (AB)M

and you solve this equation in the same way as the equations above.
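A numpy sketch of this two-step random-surfer computation on a tiny made-up graph (everything here is an assumption for illustration: the graph has pages 0, 1, 4 acting as hubs and 2, 3 as authorities, e = 0.2, and pages with no edges in a given direction are treated as pure random jumps, a choice the notes above do not pin down):

```python
import numpy as np

# Hypothetical graph: hubs 0 and 1 link to authorities 2 and 3; hub 4 links to 3.
links = {0: [2, 3], 1: [2, 3], 4: [3], 2: [], 3: []}
N = len(links)
e = 0.2

def step_matrix(adj):
    """Column-stochastic one-step matrix: from page p, jump anywhere with
    probability e, else follow a random edge of adj out of p; a page with
    no edges always jumps (an assumption for this sketch)."""
    A = np.full((N, N), e / N)
    for p, qs in adj.items():
        if qs:
            for q in qs:
                A[q, p] += (1 - e) / len(qs)
        else:
            A[:, p] += (1 - e) / N
    return A

fwd = step_matrix(links)                                  # outlink step: hub -> authority
rev = {q: [p for p in links if q in links[p]] for q in links}
bwd = step_matrix(rev)                                    # inlink step: authority -> hub

M = np.ones(N) / N                                        # authority-visit distribution
for _ in range(200):
    M = fwd @ (bwd @ M)                                   # fixed point of M = (A B) M
print([round(float(x), 3) for x in M])
```

As expected, the authority score of page 3 (three in-links from hubs) comes out above page 2 (two in-links), and both come out above the hub pages; the hub vector L would be obtained symmetrically as L = bwd @ M.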

Seems to work in many cases. Example from Kleinberg (1999):

Query: "search engines"

Top from HITS: Home pages for Yahoo, Excite, Magellan, Lycos, AltaVista.

Google (9/02) has only one of the major search engine home pages (Google
itself) in the top 30.

Problem: Topic drift. The algorithm drifts toward tightly connected regions of the Web, which may not be exactly what was asked about. Example: a search on "jaguar" converged to a collection of sites about Cincinnati. Explanation: a large number of articles in the Cincinnati Enquirer about the Jacksonville Jaguars football team all link to the same Cincinnati Enquirer service pages. Various partial solutions have been studied, none wholly successful. (See Borodin et al. for discussion and examples.)

- Clustering: Search results can be divided into groups based on link relations. (Will discuss later.)
- Spider optimization: You can improve the quality of a (partial)
crawl by keeping a running estimate of PageRank and using that
to prioritize the URL queue. (See
Efficient Crawling through URL ordering by Junghoo Cho, Hector
Garcia-Molina, and Lawrence Page.)

- Web mining
- Web "anthropology"