Required reading

Chakrabarti, secs. 7.1 and 7.2
Web Search for a Planet: The Google Cluster Architecture by Luiz Andre Barroso et al., IEEE Micro, 2003, pp. 22-28.
The PageRank Citation Ranking: Bringing Order to the Web by Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd.

Further reading

Stable Algorithms for Link Analysis by Andrew Ng, Alice Zheng, and Michael Jordan
Authoritative Sources in a Hyperlinked Environment Jon M. Kleinberg
Finding Authorities and Hubs from Link Structures on the World Wide Web by Allan Borodin, Gareth Roberts, Jeffrey Rosenthal, and Panayiotis Tsaparas

Lecture 3

Jury Duty

I will be serving jury duty starting Monday, October 4. I am hopeful that this will not interfere with this class, or will interfere at most slightly, but there is no way to be sure. If I am obliged to be late or to miss a class, I will let you know as soon as I can.

Abby Anecdote

Hardware issues in Google Query Servers

Reliability from replication.
Multiple clusters distributed worldwide.
A few thousand machines per cluster.

Copied from Barroso et al. 2003.

Steps in query answering:

"On average, a single query in Google reads hundreds of megabytes of data and consumes tens of billions of CPU cycles. Supporting a peak request stream of thousands of queries per second requires an infrastructure comparable in size to that of the largest supercomputer installations. Combining more than 15,000 commodity-class PC's with fault-tolerant software creates a solution that is more cost-effective than a comparable system build out of a smaller number of high-end servers."

A rack: 40 to 80 custom-made servers. CPUs range from 533 MHz Celerons to 1.4 GHz Pentium IIIs. Each server has one or more 80-GByte disk drives. Servers last two or three years; then they are outdated by faster machines. A comparable array of standard machines costs $7,700 per month.

Cooling: Power density = 400 Watts / square foot.
Commercial data centers generally can handle 70-150 Watts per square foot.

Link Analysis for Ranking

In-link count

If many pages point to page P, then P is presumably a good page.

However, some links are more important than others:

Page Rank Algorithm

Suppose that each page P has an importance I(P) computed as follows: First, every page has an inherent importance E (a constant) just because it is a web page. Second, if page P has importance I(P), then P contributes an indirect importance F*I(P) that is shared among the pages that P points to (F is another constant). That is, let OP be the number of outlinks from P. Then if there is a link from P to Q, P contributes F*I(P)/OP "units of importance" to Q.

(What happens if P has no outlinks, so that OP = 0? This actually turns out to create trouble for our model. For the time being, we will assume that every page has at least one outlink, and we will return to this problem below.)

We therefore have the following equation: for every page Q, if Q has in-links from P1 ... Pm,

(1)     I(Q) = E + F * (I(P1)/OP1 + ... + I(Pm)/OPm)
This looks circular, but it is just a set of linear equations in the quantities I(P). Let N be the total number of pages on the Web (or in our index). We have N equations (one for each value of Q on the left) in N unknowns (the values I(P) on the right), so at least that looks promising.

We now make the following observation. Suppose we write down all the above equations for all the different pages Q on the Web. Now we add up all the left hand sides and all right hand sides. On the left we have the sum of I(Q) over all Q on the web; call this sum S. On the right we have N occurrences of E and for every page P, OP occurrences of F*I(P) / OP. Therefore, over all the equations, we have for every page P a total of F*I(P), and these add up to F*S. (Note the importance here of our assumption that OP > 0). Therefore, we have

(2) S = NE + FS    so F = 1 - NE/S.
Since the quantities E,F,N,S are all positive, it follows that F < 1, E < S/N.

For example, suppose we have pages U, V, W, X, Y, Z with the following links: U, V, and W each link to X and Y; X and Y each link to Z; and Z links to V.

Note that X and Y each have three in-links but two of these are from the "unimportant" pages U and W. Z has two in-links and V has one, but these are from "important" pages. Let E = 0.05 and let F=0.7. We get the following equations (for simplicity, I use the page name rather than I(page name)):

    U = 0.05
    V = 0.05 + 0.7 * Z
    W = 0.05 
    X = 0.05 + 0.7*(U/2+V/2+W/2)
    Y = 0.05 + 0.7*(U/2+V/2+W/2)
    Z = 0.05 + 0.7*(X+Y)
The solution to these is
   U = 0.05
   V = 0.256
   W = 0.05
   X = 0.175
   Y = 0.175
   Z = 0.295
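These values can be checked by solving the linear system directly. Below is a minimal sketch in Python with numpy (not part of the notes); the matrix B of link contributions is the same one written out in the Markov-model section below, with pages ordered U, V, W, X, Y, Z.

# Equation (1) in matrix form is I = E*1 + F*B*I, i.e. (Id - F*B) I = E*1,
# where B[q][p] = 1/OP if page p links to page q, and 0 otherwise.
import numpy as np

pages = ["U", "V", "W", "X", "Y", "Z"]
E, F = 0.05, 0.7
B = np.array([
    [0,   0,   0,   0, 0, 0],   # in-links of U: none
    [0,   0,   0,   0, 0, 1],   # in-links of V: from Z (1 outlink)
    [0,   0,   0,   0, 0, 0],   # in-links of W: none
    [0.5, 0.5, 0.5, 0, 0, 0],   # in-links of X: from U, V, W (2 outlinks each)
    [0.5, 0.5, 0.5, 0, 0, 0],   # in-links of Y: from U, V, W
    [0,   0,   0,   1, 1, 0],   # in-links of Z: from X and Y (1 outlink each)
])
I = np.linalg.solve(np.eye(6) - F * B, E * np.ones(6))
for name, value in zip(pages, I):
    print(name, round(value, 3))   # U 0.05, V 0.256, W 0.05, X 0.175, Y 0.175, Z 0.295

# Sanity check of equation (2): S = N*E + F*S.
S = I.sum()
print(round(S, 3), round(6 * E + F * S, 3))   # both about 1.0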

Markov model

A useful way to think about the above model is as follows: Imagine someone who is browsing the web for a very long time. Each time he reads a page P, he decides where to go next using the following procedure:
1. Flip a weighted coin that comes up heads with probability e and tails
with probability (1-e).

2. If the coin came up heads, he picks a page at random on the Web and goes there.

3. If the coin came up tails, he picks an outlink from P at random and follows 
it.  (Again, we assume that every page has at least one outlink.)

The browser does this for eons (posit that the web stays constant all that time) and we keep track of where he has been. At the end of this we state that the importance of each page is the fraction of the time that he has spent on that page while browsing. It is easy to show that "importance", thus defined, satisfies the equations above, where E=e/N and F=(1-e). Or, equivalently, we can ask "What is the probability that after an eon of browsing, the browser is on page P," and that probability is equal to the "importance".
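As a concrete check, here is a small simulation of this browsing process on the six-page example above (a sketch, not part of the notes; e = 0.3, so that E = e/N = 0.05 and F = 1-e = 0.7):

# Simulate the random browser and report the fraction of time spent on each page.
import random
from collections import Counter

links = {"U": ["X", "Y"], "V": ["X", "Y"], "W": ["X", "Y"],
         "X": ["Z"], "Y": ["Z"], "Z": ["V"]}
pages = list(links)
e, steps = 0.3, 1_000_000

visits = Counter()
page = random.choice(pages)
for _ in range(steps):
    visits[page] += 1
    if random.random() < e:                 # heads: jump to a random page
        page = random.choice(pages)
    else:                                   # tails: follow a random outlink
        page = random.choice(links[page])

for p in sorted(pages):
    print(p, visits[p] / steps)             # close to the importances computed above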

Proof

We will prove half of the probabilistic formulation; namely, we will prove that if the probability distribution converges, then it must converge to the "importance" as defined in equation (1). We omit the proof that it does converge; see any text on Markov processes.

Markov process: states and transitions. The state you are in at time T+1 depends only on the state you are in at time T, not on anything earlier. (Here the states are Web pages.)

AT(P,T) = the event that you are in state P at time T.
A[Q,P] = Prob(AT(Q,T+1) | AT(P,T)): an NxN matrix.
LT(P) = Prob(AT(P,T)): a vector of length N.

E.g., in the example above, A is the following 6x6 matrix (Q indexes the row, P the column):

          0.05    0.05   0.05   0.05   0.05   0.05
          0.05    0.05   0.05   0.05   0.05   0.75
          0.05    0.05   0.05   0.05   0.05   0.05
          0.40    0.40   0.40   0.05   0.05   0.05
          0.40    0.40   0.40   0.05   0.05   0.05
          0.05    0.05   0.05   0.75   0.75   0.05
Note that the sum of every column is 1.0, because from each state P you go to exactly 1 state Q.

Now,
(3) LT+1(Q) = Prob(AT(Q,T+1)) = sumP Prob(AT(Q,T+1) | AT(P,T)) Prob(AT(P,T)) = sumP A[Q,P] LT(P).

In matrix notation we have
(4) LT+1 = A * LT

If we have attained a steady state, then LT+1 = LT, so we have
(5) L = A L.
or equivalently
(6) 0 = (A-I)L
where I is the identity matrix (1's on the diagonal, 0's elsewhere). In the terminology of linear algebra, we say that L is an eigenvector of A with eigenvalue 1.0.
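For the example, this can be checked numerically (a sketch using numpy, not part of the notes; A is the 6x6 matrix written out above):

import numpy as np

A = np.array([
    [0.05, 0.05, 0.05, 0.05, 0.05, 0.05],
    [0.05, 0.05, 0.05, 0.05, 0.05, 0.75],
    [0.05, 0.05, 0.05, 0.05, 0.05, 0.05],
    [0.40, 0.40, 0.40, 0.05, 0.05, 0.05],
    [0.40, 0.40, 0.40, 0.05, 0.05, 0.05],
    [0.05, 0.05, 0.05, 0.75, 0.75, 0.05],
])
vals, vecs = np.linalg.eig(A)
k = np.argmax(vals.real)          # the eigenvalue 1 (the largest for a stochastic matrix)
L = vecs[:, k].real
L = L / L.sum()                   # scale the eigenvector so its entries sum to 1
print(round(vals[k].real, 6))     # 1.0
print(np.round(L, 3))             # [0.05, 0.256, 0.05, 0.175, 0.175, 0.295]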

Now, note that A = (e/N)U + (1-e)B where U is the matrix of all 1's and B is the matrix

B[Q,P] = 1/OP   if P links to Q
       = 0      if P does not link to Q
E.g., in our example e = 0.3, so e/N = 0.05, and B is the matrix
    0     0     0     0     0     0
    0     0     0     0     0     1
    0     0     0     0     0     0
    0.5   0.5   0.5   0     0     0
    0.5   0.5   0.5   0     0     0
    0     0     0     1.0   1.0   0

Also, if L is any probability distribution, then the sum of the elements of L is 1, so (e/N)UL = (e/N)U', where U' is the unit vector (all 1's). So equation (5), L = AL, is equivalent to
(7) L = [(e/N)U + (1-e)B]L = (e/N)U' + (1-e)BL.
But this is just equation (1) written in matrix notation, with E = e/N and F = 1-e. So L, the steady-state probability distribution, is the same thing as I, the importance.

Another way of visualizing this is using a "hydraulic" model. Think of each of the web pages as "reservoirs" of "importance" and of there being a fixed rate of flow from each page to every other page. Then, the page rank is the steady state, in which each page has a constant level of importance. (This model makes it easy to see why it is important that every page has an outlink; if there is not enough outflow from a given page, then depending on how you set up the equations, either "importance" simply accumulates at dead-end pages, ultimately leaving all the importance there, or "importance" evaporates out of the dead-end pages, gradually sucking the entire system dry. In either case, there is no useful steady state.)

Theorem (which I will not prove): If 0 < e <= 1, equation (1) has a unique solution.

Intuitive understanding of the relative importances in our example: In the steady state, if there are no random jumps, then the browser goes through the cycle V, [X or Y], Z, V, [X or Y], Z, ..., so it goes through V and Z about twice as often as through either X or Y individually. (If we had chosen a smaller value of e, this would be more nearly exact.)

Computing importance

Equation (1) can be used iteratively to converge on the steady state for the importance I(P):
Algorithm 1:
for all pages P do I(P) := 1/N; /* start with a uniform distribution */
repeat {
   for all pages P do {
      J(P) := E;
      for all pages Q such that Q links to P do 
        J(P) += F*I(Q)/OQ;
     }
   if the difference between J and I is small then return(J);
   I := J;
}
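Here is a direct Python rendering of Algorithm 1, run on the six-page example (a sketch, not part of the notes; it uses a fixed number of iterations instead of an explicit convergence test):

links = {"U": ["X", "Y"], "V": ["X", "Y"], "W": ["X", "Y"],
         "X": ["Z"], "Y": ["Z"], "Z": ["V"]}
E, F, N = 0.05, 0.7, len(links)

inlinks = {p: [q for q in links if p in links[q]] for p in links}   # who links to p
outdeg = {q: len(links[q]) for q in links}                          # OQ

I = {p: 1.0 / N for p in links}       # start with a uniform distribution
for _ in range(100):                  # plenty of iterations, since F < 1
    I = {p: E + F * sum(I[q] / outdeg[q] for q in inlinks[p]) for p in links}
print({p: round(I[p], 3) for p in sorted(I)})
# {'U': 0.05, 'V': 0.256, 'W': 0.05, 'X': 0.175, 'Y': 0.175, 'Z': 0.295}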
Convergence can be accelerated by computing, at each iteration, the total importance that has leaked out of the system and redistributing it explicitly (this is close to the formulation used by Page et al.):
Algorithm 2:
for all pages P do I(P) := 1/N; /* start with a uniform distribution */
repeat {
   for all pages P do {
      J(P) := 0;
      for all pages Q such that Q links to P do 
        J(P) += F*I(Q)/OQ;
     }
   D := (sum of the elements of I) - (sum of the elements of J);
        /* the importance lost in this pass to the factor F and to any
           dangling pages; at the fixed point D/N = E */
   for all pages P do J(P) += D/N;
   if the difference between J and I is small then return(J);
   I := J;
}
Page et al. report that on a database of 322 million links, this converges to a reasonable tolerance in 52 iterations.

Dependence on e (the probability of a random jump)

For e=1, just a uniform distribution (uniformly random transition at each step).

For e close to 1, the distribution is nearly uniform, and the ordering given by Page Rank is essentially just the ordering by number of in-links.

At e=0, things go kerflooey. A stable distribution still exists, but there may be more than one, and the above algorithms may not converge to any of them.

At e close to 0, the system is unstable (i.e. small changes to the link structure make large changes to the solution), and the above algorithms converge only slowly.

Experiments of Page et al. used e=0.15.

Pages with no outlinks

The simplest solution is to say that every page has a self-loop (an outlink to itself). Page et al. use a more complicated system, in which you (1) first prune all pages with no outlinks, then prune all pages that had outlinks only to the pages you just pruned, and keep on doing this until every remaining page has an outlink; (2) compute importance over this pruned set; (3) reinstate the pages you pruned and let importance propagate forward to them without changing the values of the importances calculated in (2).

Another solution in principle would be to say that, if there are no outlinks, you jump randomly (i.e. with probability 1/N to each known page). Algorithm 1 can be adapted to this at the cost of a little more complexity, as in the sketch below; I don't know about Algorithm 2.
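Here is a minimal sketch of that adaptation (not part of the notes): the page "D" with no outlinks is a hypothetical addition to the earlier example, and its importance is spread uniformly over all N pages, as if it linked to everything.

links = {"U": ["X", "Y"], "V": ["X", "Y"], "W": ["X", "Y"],
         "X": ["Z"], "Y": ["Z"], "Z": ["V"], "D": []}
N = len(links)
e = 0.3
E, F = e / N, 1 - e

inlinks = {p: [q for q in links if p in links[q]] for p in links}
outdeg = {q: len(links[q]) for q in links}

I = {p: 1.0 / N for p in links}
for _ in range(100):
    dangling = sum(I[q] for q in links if outdeg[q] == 0)   # importance held by dead ends
    I = {p: E
            + F * sum(I[q] / outdeg[q] for q in inlinks[p])
            + F * dangling / N                              # their share, spread uniformly
         for p in links}
print({p: round(I[p], 3) for p in sorted(I)})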

Non-uniform "source of importance"

It is not necessary for the "inherent" importance of all pages P to be the same value E; one can have any distribution E(P), representing some other evaluation of inherent importance. For example, one could have E(P) ranked higher for pages on a .edu site, or for the Yahoo home page, or for your own page, etc. The algorithms above work exactly the same; just change E to E(P).

Non-uniform outlinks

Or there may be reason to think that some outlinks are better than others (e.g. because of font or font size, or because links to a different domain are more important than links within a domain). You can assign the weight W(Q,P) of the link from Q to P however you want; the only constraints are that the weights are non-negative and that the weights of the outlinks out of Q add up to 1. Then just replace 1/OQ in the algorithm by W(Q,P), as in the sketch below.
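A minimal sketch of the weighted version (not part of the notes; the raw link scores are made-up numbers standing in for whatever evaluation of the links is used):

raw = {                       # hypothetical raw scores, e.g. based on font size or domain
    "U": {"X": 2.0, "Y": 1.0},
    "V": {"X": 1.0, "Y": 1.0},
    "W": {"X": 1.0, "Y": 3.0},
    "X": {"Z": 1.0},
    "Y": {"Z": 1.0},
    "Z": {"V": 1.0},
}
# Normalize so the weights of the outlinks out of each page Q add up to 1.
W = {q: {p: w / sum(ws.values()) for p, w in ws.items()} for q, ws in raw.items()}

E, F, N = 0.05, 0.7, len(W)
I = {p: 1.0 / N for p in W}
for _ in range(100):
    # Same iteration as before, with W[Q][P] in place of 1/OQ.
    I = {p: E + F * sum(W[q].get(p, 0.0) * I[q] for q in W) for p in W}
print({p: round(v, 3) for p, v in sorted(I.items())})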

Tricking the Page Rank algorithm

Suppose you organize your web site as follows: besides your target page P, you create k pages P1 ... Pk, each of which links to P.

Then the importance of P gets an increment of k*e*(1-e)/N (actually a little more), in addition to whatever it gets from outside: each Pi has inherent importance at least E = e/N, and passes F = 1-e times its importance along its single outlink to P. In general, quite aside from malicious spamming, this measure favors web sites that consist of many interacting pages rather than one big page.

Solution: Put a maximum on the number of links allowed to P from the same site (or on their total weight).

Countertrick: Buy two domains and put P1 ... Pk in one domain and P in the other.

Anti-countertrick: For any domain D put a maximum on the number (or total weight) of links from D to page P.

One has to presume that it will not pay for the spammer to get lots of domains. There's not much that could be done about that.

HITS model (Kleinberg)

Observation: The Web does not actually consist of important pages pointing to other important pages. Rather, the good pages in any domain tend to be either authorities with good content, or hubs with good links. A hub has a lot of out-links to authorities; an authority has a lot of in-links from hubs. Authorities tend not to point to other authorities, either because they are in competition, or because that's not what they're concerned with. We can perhaps improve importance evaluation by taking this structure into account.

Also, we can try to use this to solve the problem of "non-self-descriptive" important Web pages. For example, one would want the query "Japanese auto manufacturer" to return the home pages of the major Japanese auto manufacturers, but in fact only Isuzu is in the top 30 returned by Google (9/19/02), because its home page happens to contain those words. However, if we can find good hubs for those words, they probably point to these home pages.

Algorithm:
(This is actually the rational reconstruction of the model due to Ng et al., which they show to be more stable than Kleinberg's original proposal. There are bunches of alternative formulations in the literature.)

C := top K pages for query, as returned by search engine 
        (K is a fixed parameter.)
O := follow all outlinks from C.
I := follow at most D inlinks from C. (D is another fixed parameter.)
     /* There is an inherent limitation on the number of outlinks from
         any page, but there are pages with vast numbers of inlinks. */
PAGES := C union O union I;
M := the following Markov model on PAGES: {
   TURN := TRUE;
   P := arbitrary starting point.
   repeat
     { with probability e, 
       { P := random choice in PAGES;
         TURN := randomly either TRUE or FALSE;
        }
      with probability (1-e)
       { if TURN then { choose a random outlink from P to Q;
                        P := Q;
                        mark this as an "authority" visit to P;
                      }
         else { choose a random inlink into P from Q;
                P := Q;
                mark this as a "hub" visit to P;
              }
         TURN := not TURN;
       }
     }
   }
Find the stable distribution for this Markov model as in the previous
section.
Thus, in this Markov model, when you are not jumping randomly, you alternate between moving across an outlink (presumably from a hub to an authority) and moving across an inlink (presumably from an authority to a hub). You thus have two vectors: L(P), the probability that a random hub visit is to P, and M(P), the probability that a random authority visit is to P. Let A be the transition matrix when TURN = TRUE and let B be the transition matrix when TURN = FALSE; it is easily shown that B is the transpose of A. Then
M = AL and L = BM, so M = (AB)M,
and you solve this equation in the same way as the equations above.
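A minimal sketch of this computation on a tiny made-up link set (not part of the notes; the four-page graph, the variable names, and e = 0.3 are all assumptions made for illustration, and every page in the graph has at least one inlink and one outlink so both matrices are well defined):

import numpy as np

links = {0: [1, 2], 1: [2], 2: [0, 3], 3: [0]}    # hypothetical pages and outlinks
n, e = len(links), 0.3

# A[q, p]: probability of an "authority" step from p to q (follow a random outlink of p).
# B[q, p]: probability of a "hub" step from p to q (follow a random inlink of p).
# Both are mixed with a random jump of probability e.
A = np.full((n, n), e / n)
B = np.full((n, n), e / n)
for p, outs in links.items():
    for q in outs:
        A[q, p] += (1 - e) / len(outs)
inlinks = {p: [q for q in links if p in links[q]] for p in links}
for p, ins in inlinks.items():
    for q in ins:
        B[q, p] += (1 - e) / len(ins)

def stationary(T, iters=200):
    # Power iteration for the stable distribution of the column-stochastic matrix T.
    v = np.full(n, 1.0 / n)
    for _ in range(iters):
        v = T @ v
    return v

authority = stationary(A @ B)    # M = (AB)M: where the "authority" visits land
hub = stationary(B @ A)          # L = (BA)L: where the "hub" visits land
print(np.round(authority, 3))
print(np.round(hub, 3))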

Seems to work in many cases. Example from Kleinberg (1999):

Query: "search engines"
Top from HITS: Home pages for Yahoo, Excite, Magellan, Lycos, AltaVista.
Google (9/02) has only one of the major search engine home pages (Google itself) in the top 30.

Problem: Topic drift. The algorithm drifts toward tightly connected regions of the Web, which may not be exactly what was asked about. Example: A search on "jaguar" converged to a collection of sites about Cincinnati. Explanation: A large number of articles in the Cincinnati Enquirer about the Jacksonville Jaguars football team all link to the same Cincinnati Enquirer service pages. Various partial solutions have been studied, none wholly successful. (See Borodin et al. for discussion and examples.)

Other Uses of Link Analysis