However, some links are more important than others:
(What happens if P has no outlinks, so that OP = 0? This actually turns out to create trouble for our model. For the time being, we will assume that every page has at least one outlink, and we will return to this problem below.)
We therefore have the following equation: for every page Q, if Q has in-links from P1 ... Pm,
(1) I(Q) = E + F * (I(P1)/OP1 + ... + I(Pm)/OPm)
This looks circular, but it is just a set of linear equations in the quantities I(P). Let N be the total number of pages on the Web (or in our index). We have N equations (one for each value of Q on the left) in N unknowns (the values I(P) on the right), so that, at least, looks promising.
We now make the following observation. Suppose we write down all the above equations for all the different pages Q on the Web. Now we add up all the left hand sides and all the right hand sides. On the left we have the sum of I(Q) over all Q on the Web; call this sum S. On the right we have N occurrences of E and, for every page P, OP occurrences of F*I(P)/OP. Therefore, over all the equations, we have for every page P a total of F*I(P), and these add up to F*S. (Note the importance here of our assumption that OP > 0.) Therefore, we have
(2) S = NE + FS, so F = 1 - NE/S.
Since the quantities E, F, N, S are all positive, it follows that F < 1 and E < S/N.
For example, suppose we have pages U, V, W, X, Y, Z with the following links: U, V, and W each link to both X and Y; X and Y each link to Z; and Z links to V.
Note that X and Y each have three in-links but two of these are from the "unimportant" pages U and W. Z has two in-links and V has one, but these are from "important" pages. Let E = 0.05 and let F=0.7. We get the following equations (for simplicity, I use the page name rather than I(page name)):
U = 0.05
V = 0.05 + 0.7 * Z
W = 0.05
X = 0.05 + 0.7*(U/2+V/2+W/2)
Y = 0.05 + 0.7*(U/2+V/2+W/2)
Z = 0.05 + 0.7*(X+Y)
The solution to these is
U = 0.05 V = 0.256 W = 0.05 X = 0.175 Y = 0.175 Z = 0.295
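The six equations can also be solved mechanically. The following is a minimal sketch in plain Python, assuming the link structure read off the equations above; the names LINKS and solve_importance are mine, chosen for illustration. It sets up the linear system from equation (1) and solves it by Gauss-Jordan elimination.

```python
# Link structure read off the equations above: page -> pages it links to.
LINKS = {
    "U": ["X", "Y"], "V": ["X", "Y"], "W": ["X", "Y"],
    "X": ["Z"], "Y": ["Z"], "Z": ["V"],
}


def solve_importance(links, E, F):
    """Solve I(Q) = E + F * (sum of I(P)/O_P over in-links P) by Gauss-Jordan elimination."""
    pages = sorted(links)
    n = len(pages)
    idx = {p: i for i, p in enumerate(pages)}
    # Augmented matrix for (Id - F*B) I = E, where B[Q][P] = 1/O_P if P links to Q.
    M = [[0.0] * (n + 1) for _ in range(n)]
    for i in range(n):
        M[i][i] = 1.0
        M[i][n] = E
    for p, outs in links.items():
        for q in outs:
            M[idx[q]][idx[p]] -= F / len(outs)
    for col in range(n):
        # Partial pivoting: bring the largest remaining entry in this column up.
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                for c in range(col, n + 1):
                    M[r][c] -= factor * M[col][c]
    # The matrix is now diagonal, so each unknown can be read off directly.
    return {p: M[idx[p]][n] / M[idx[p]][idx[p]] for p in pages}


importance = solve_importance(LINKS, E=0.05, F=0.7)
for page in sorted(importance):
    print(page, round(importance[page], 3))
```

Summing the computed values is a quick check on equation (2): with N = 6, E = 0.05 and F = 0.7, the sum S should come out as NE/(1-F) = 1.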
At each step, a browser who is currently on page P does the following:
1. Flip a weighted coin that comes up heads with probability e and tails with probability (1-e).
2. If the coin comes up heads, he picks a page at random in the Web and goes there.
3. If the coin comes up tails, he picks an outlink from P at random and follows it. (Again, we assume that every page has at least one outlink.)
The browser does this for eons (posit that the web stays constant all that time) and we keep track of where he has been. At the end of this we state that the importance of each page is the fraction of the time that he has spent on that page while browsing. It is easy to show that "importance", thus defined, satisfies the equations above, where E=e/N and F=(1-e). Or, equivalently, we can ask "What is the probability that after an eon of browsing, the browser is on page P," and that probability is equal to the "importance".
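The random browser is easy to simulate. The sketch below assumes the six-page example from earlier with e = 0.3; the function name simulate_browser, the step count, and the fixed seed are my own choices. It counts the fraction of steps spent on each page, which should come out close to the importance values computed from the equations, up to sampling noise.

```python
import random

# Six-page example: page -> pages it links to.
LINKS = {
    "U": ["X", "Y"], "V": ["X", "Y"], "W": ["X", "Y"],
    "X": ["Z"], "Y": ["Z"], "Z": ["V"],
}


def simulate_browser(links, e, steps, seed=0):
    """Return the fraction of steps the random browser spends on each page."""
    rng = random.Random(seed)
    pages = sorted(links)
    visits = dict.fromkeys(pages, 0)
    page = rng.choice(pages)
    for _ in range(steps):
        if rng.random() < e:
            page = rng.choice(pages)        # heads: jump to a random page
        else:
            page = rng.choice(links[page])  # tails: follow a random outlink
        visits[page] += 1
    return {p: visits[p] / steps for p in pages}


freq = simulate_browser(LINKS, e=0.3, steps=200_000)
for page in sorted(freq):
    print(page, round(freq[page], 3))
```

A finite simulation stands in for the "eons" of browsing, so the fractions only approximate the true steady state.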
We will prove half of the probabilistic formulation; namely, we will prove that if the probability distribution converges, then it must converge to the "importance" as defined in equation (1). We omit the proof that it does converge; see any text on Markov processes.
Markov process: States and transitions. The event that at time T+1 you will be in state S depends only on state at time T, not on anything else. (States are Web pages.)
AT(P,T) = event that you are in state P at time T.
A[Q,P] = Prob(AT(Q,T+1) | AT(P,T)). NxN matrix.
LT(P) = Prob(AT(P,T)). Vector of length N.
E.g. in the example above, A is the 6x6 matrix (Q is row, P is column).
0.05 0.05 0.05 0.05 0.05 0.05
0.05 0.05 0.05 0.05 0.05 0.75
0.05 0.05 0.05 0.05 0.05 0.05
0.40 0.40 0.40 0.05 0.05 0.05
0.40 0.40 0.40 0.05 0.05 0.05
0.05 0.05 0.05 0.75 0.75 0.05
Note that the sum of every column is 1.0, because from each state P you
go to exactly 1 state Q.
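As a check on the matrix, here is a small sketch that builds A from the link structure under the random-browser rules (a jump with probability e lands on each page with probability e/N; otherwise a random outlink is followed) and confirms that every column sums to 1. The names PAGES, LINKS, and transition_matrix are illustrative, not part of the original notes.

```python
PAGES = ["U", "V", "W", "X", "Y", "Z"]
LINKS = {
    "U": ["X", "Y"], "V": ["X", "Y"], "W": ["X", "Y"],
    "X": ["Z"], "Y": ["Z"], "Z": ["V"],
}


def transition_matrix(pages, links, e):
    """A[q][p] = probability of moving to page q given that we are on page p."""
    n = len(pages)
    idx = {p: i for i, p in enumerate(pages)}
    A = [[e / n] * n for _ in range(n)]  # random-jump part
    for p, outs in links.items():
        for q in outs:
            A[idx[q]][idx[p]] += (1 - e) / len(outs)  # link-following part
    return A


A = transition_matrix(PAGES, LINKS, e=0.3)
column_sums = [sum(A[q][p] for q in range(len(PAGES))) for p in range(len(PAGES))]
for row in A:
    print(" ".join(f"{x:.2f}" for x in row))
print(column_sums)
```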
Now,
(3) LT+1(Q) = Prob(AT(Q,T+1)) =
sumP Prob(AT(Q,T+1) | AT(P,T)) Prob(AT(P,T)) =
sumP A(Q,P) LT(P).
In matrix notation we have
(4) LT+1 = A * LT
If we have attained a steady state, then LT+1 = LT,
so we have
(5) L = A L.
or equivalently
(6) 0 = (A-I)L
where I is the unit matrix (1's on the diagonals, 0's elsewhere). In the
terminology of linear algebra, we say that L is an eigenvector of A
with eigenvalue 1.0.
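One way to see the eigenvector concretely is to apply equation (4) repeatedly until the vector stops changing. The sketch below does this for the 6x6 example matrix, starting from the uniform distribution; the function names step and steady_state and the tolerance are my own assumptions.

```python
# The 6x6 example matrix A (rows and columns ordered U, V, W, X, Y, Z).
A = [
    [0.05, 0.05, 0.05, 0.05, 0.05, 0.05],
    [0.05, 0.05, 0.05, 0.05, 0.05, 0.75],
    [0.05, 0.05, 0.05, 0.05, 0.05, 0.05],
    [0.40, 0.40, 0.40, 0.05, 0.05, 0.05],
    [0.40, 0.40, 0.40, 0.05, 0.05, 0.05],
    [0.05, 0.05, 0.05, 0.75, 0.75, 0.05],
]


def step(A, L):
    """One application of equation (4): returns A * L."""
    return [sum(a * x for a, x in zip(row, L)) for row in A]


def steady_state(A, tol=1e-12, max_iter=10_000):
    """Iterate L := A * L from the uniform distribution until it stops changing."""
    n = len(A)
    L = [1.0 / n] * n
    for _ in range(max_iter):
        nxt = step(A, L)
        if max(abs(x - y) for x, y in zip(nxt, L)) < tol:
            return nxt
        L = nxt
    raise RuntimeError("did not converge")


L = steady_state(A)
print([round(x, 3) for x in L])
```

The vector returned satisfies L = A L to within the tolerance, which is the eigenvalue-1 condition of equation (5).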
Now, note that A = (e/N)U + (1-e)B where U is the matrix of all 1's and B is the matrix
B[Q,P] = 1/OP if P links to Q
0 if P does not link to Q
E.g., in our example e = 0.3, so e/N = 0.05, and B =
0 0 0 0 0 0
0 0 0 0 0 1
0 0 0 0 0 0
0.5 0.5 0.5 0 0 0
0.5 0.5 0.5 0 0 0
0 0 0 1.0 1.0 0
Also, if L is any probability distribution, then the sum of the elements
of L is 1, so (e/N)UL = (e/N)U' where U' is the vector of all 1's.
So equation (5) L = AL is equivalent to
(7) L = [(e/N)U + (1-e)B]L = (e/N)U' + (1-e)BL.
But this is just the same thing as equation (1) written in matrix notation.
So L, the steady state probability distribution, is the same thing as
I, the importance.
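The step from equation (5) to equation (7) relies on L summing to 1. The following sketch checks the identity numerically for the example, using an arbitrary probability vector that I made up for the test; matvec is a hypothetical helper name.

```python
e, N = 0.3, 6

# B[Q][P] = 1/O_P if P links to Q, else 0 (rows and columns ordered U, V, W, X, Y, Z).
B = [
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 0],
    [0.5, 0.5, 0.5, 0, 0, 0],
    [0.5, 0.5, 0.5, 0, 0, 0],
    [0, 0, 0, 1, 1, 0],
]
# A = (e/N)U + (1-e)B, entry by entry.
A = [[e / N + (1 - e) * b for b in row] for row in B]


def matvec(M, v):
    """Matrix-vector product M * v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


L = [0.10, 0.20, 0.05, 0.25, 0.15, 0.25]  # arbitrary test distribution, sums to 1
left = matvec(A, L)                                  # A L
right = [e / N + (1 - e) * x for x in matvec(B, L)]  # (e/N)U' + (1-e) B L
gap = max(abs(a - b) for a, b in zip(left, right))
print(gap)
```

If L did not sum to 1, the term (e/N)UL would be scaled by the sum of L and the two sides would no longer agree.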
Another way of visualizing this is using a "hydraulic" model. Think of each of the web pages as "reservoirs" of "importance" and of there being a fixed rate of flow from each page to every other page. Then, the page rank is the steady state, in which each page has a constant level of importance. (This model makes it easy to see why it is important that every page has an outlink; if there is not enough outflow from a given page, then depending on how you set up the equations, either "importance" simply accumulates at dead end pages, ultimately leaving all the importance there, or "importance" evaporates out of the dead end pages, gradually sucking the entire system dry. In either case, there is no useful steady state.)
Theorem (which I will not prove): If 0 < e <= 1, equation (1) has a unique solution.
Intuitive understanding of relative importance in our example: In the steady state, if there are no jumps, then the browser goes through a cycle V, [X or Y], Z, V, [X or Y], Z ... , so it goes through each of V and Z about twice as often as through either X or Y. (If we had chosen a smaller value of e, this would be more exact.)
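The effect of e on these ratios can be checked numerically. The sketch below iterates equation (1) with E = e/N and F = 1-e for a few values of e and prints the ratios V/X and Z/X; the function name importance and the fixed number of sweeps are my own choices, picked to be comfortably large for this tiny graph.

```python
LINKS = {
    "U": ["X", "Y"], "V": ["X", "Y"], "W": ["X", "Y"],
    "X": ["Z"], "Y": ["Z"], "Z": ["V"],
}


def importance(links, e, sweeps=20_000):
    """Iterate equation (1) with E = e/N and F = 1 - e for a fixed number of sweeps."""
    pages = sorted(links)
    E, F = e / len(pages), 1.0 - e
    I = dict.fromkeys(pages, 1.0 / len(pages))
    for _ in range(sweeps):
        J = dict.fromkeys(pages, E)
        for p, outs in links.items():
            for q in outs:
                J[q] += F * I[p] / len(outs)
        I = J
    return I


for e in (0.3, 0.1, 0.01):
    I = importance(LINKS, e)
    print(e, round(I["V"] / I["X"], 3), round(I["Z"] / I["X"], 3))
```

As e shrinks, both ratios move toward 2, in line with the cycle argument.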
Algorithm 1:
for all pages P do I(P) := 1/N;   /* start with a uniform distribution */
repeat {
    for all pages P do {
        J(P) := E;
        for all pages Q such that Q links to P do
            J(P) += F * I(Q)/OQ;   /* the factor F comes from equation (1) */
    }
    if the difference between J and I is small then return(J);
    I := J;
}
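For concreteness, here is a minimal Python sketch of Algorithm 1, using E = e/N and F = 1-e as in the probabilistic formulation. The function name page_importance, the tolerance, and the iteration cap are assumptions of mine; "difference" is measured as the sum of absolute differences, which is one reasonable choice among several.

```python
def page_importance(links, e, tol=1e-10, max_iter=1000):
    """Iterate equation (1) from a uniform start; return (importance, iterations used)."""
    pages = sorted(links)
    n = len(pages)
    E, F = e / n, 1.0 - e
    # Invert the link table so that we can loop over the pages Q that link to P.
    inlinks = {p: [] for p in pages}
    for q, outs in links.items():
        if not outs:
            raise ValueError(f"page {q!r} has no outlinks")
        for p in outs:
            inlinks[p].append(q)
    I = {p: 1.0 / n for p in pages}
    for iteration in range(1, max_iter + 1):
        J = {p: E + F * sum(I[q] / len(links[q]) for q in inlinks[p]) for p in pages}
        diff = sum(abs(J[p] - I[p]) for p in pages)
        I = J
        if diff < tol:
            return I, iteration
    raise RuntimeError("did not converge")


LINKS = {
    "U": ["X", "Y"], "V": ["X", "Y"], "W": ["X", "Y"],
    "X": ["Z"], "Y": ["Z"], "Z": ["V"],
}
ranks, iterations = page_importance(LINKS, e=0.3)
print({p: round(v, 3) for p, v in ranks.items()}, iterations)
```

The sketch assumes every link target also appears as a key of the table, and it refuses pages with no outlinks, matching the standing assumption in these notes.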
Convergence can be accelerated by extrapolating the direction of change:
Algorithm 2:
for all pages P do I(P) := 1/N;   /* start with a uniform distribution */
repeat {
    for all pages P do {
        J(P) := 0;
        for all pages Q such that Q links to P do
            J(P) += I(Q)/OQ;
    }
    D := (sum of absolute values of elements of I) -
         (sum of absolute values of elements of J);   /* importance lost in this step */
    for all pages P do J(P) += D*E;
    if the difference between J and I is small then return(J);
    I := J;
}
Page et al. report that for N=322 million, this converges to reasonable
tolerance in 52 iterations.
For e close to 1, the distribution is nearly uniform. The ordering by Page Rank is then essentially just the ordering by number of in-links.
At e=0, things go kerflooey. There still exists a stable distribution, but there may be more than one, and the system may not converge to it. The above algorithms may not converge.
At e close to 0, the system is unstable (i.e., small changes to the structure make large changes to the solution). The above algorithms converge only slowly.
Experiments of Page et al. used e=0.15.
Another solution in principle would be to say that, if there are no outlinks, you jump randomly (i.e. with 1/N probability to each known page.) Algorithm 1 can be adapted to this, at the cost of a little more complexity. I don't know about algorithm 2.
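A possible adaptation of Algorithm 1 along these lines is sketched below. It treats a page with no outlinks as if it jumped to each of the N pages with probability 1/N, so the importance sitting on dead ends is spread evenly in each round. The function name importance_with_dead_ends and the three-page test graph are made up for illustration.

```python
def importance_with_dead_ends(links, e, tol=1e-12, max_iter=10_000):
    """Algorithm 1 variant: a page with no outlinks jumps uniformly at random."""
    pages = sorted(links)
    n = len(pages)
    E, F = e / n, 1.0 - e
    I = dict.fromkeys(pages, 1.0 / n)
    for _ in range(max_iter):
        # Importance sitting on dead-end pages is spread evenly over all pages.
        dead_mass = sum(I[p] for p in pages if not links[p])
        J = dict.fromkeys(pages, E + F * dead_mass / n)
        for p, outs in links.items():
            for q in outs:
                J[q] += F * I[p] / len(outs)
        if sum(abs(J[p] - I[p]) for p in pages) < tol:
            return J
        I = J
    raise RuntimeError("did not converge")


# Hypothetical three-page graph in which C is a dead end.
GRAPH = {"A": ["B", "C"], "B": ["C"], "C": []}
rank = importance_with_dead_ends(GRAPH, e=0.3)
print({p: round(v, 3) for p, v in rank.items()})
```

With this treatment the total importance stays at 1, so nothing accumulates at or evaporates from the dead end.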
Solution: Put a maximum on the number of links allowed to P from the same site (or on their total weight).
Countertrick: Buy two domains and put P1 ... Pk in one domain and P in the other.
Anti-countertrick: For any domain D put a maximum on the number (or total weight) of links from D to page P.
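A rough sketch of the per-domain cap, applied as a filter on the link list before importance is computed. It assumes links are given as (source URL, target URL) pairs and, as a simplification, takes the host name of the source URL as its "domain"; the function name cap_links_per_domain and the example URLs are invented for illustration.

```python
from collections import defaultdict
from urllib.parse import urlparse


def cap_links_per_domain(links, max_per_domain):
    """Keep at most max_per_domain links from each source domain to each target page."""
    kept = []
    counts = defaultdict(int)
    for source, target in links:
        key = (urlparse(source).hostname, target)
        if counts[key] < max_per_domain:
            counts[key] += 1
            kept.append((source, target))
    return kept


# Hypothetical link list: three links from one domain, one from another.
LINKS = [
    ("http://spam.example/p1", "http://target.example/"),
    ("http://spam.example/p2", "http://target.example/"),
    ("http://spam.example/p3", "http://target.example/"),
    ("http://other.example/a", "http://target.example/"),
]
kept = cap_links_per_domain(LINKS, max_per_domain=2)
print(kept)
```

A cap on total weight rather than on count would follow the same pattern, with the counter replaced by a running sum.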
One has to presume that it will not pay for the spammer to get lots of domains. There's not much that could be done about that.
Observation: The Web is not actually important pages pointing to other important pages. Rather, the good pages in any domain tend to be either authorities with good content, or hubs with good links. A hub has a lot of out-links to authorities; an authority has a lot of in-links from hubs. Authorities tend not to point to authorities, either because they are in competition, or because that's not what they're concerned with. We can perhaps improve importance evaluation by taking this structure into account.
Also, we can try to use this to solve the problem of "non-self-descriptive" important Web pages. For example, one would want the query "Japanese auto manufacturer" to return the home pages for the major Japanese auto manufacturers, but in fact, only Isuzu is in the top 30 returned by Google (9/19/02) because its home page happens to contain those words. However, if we can find good hubs with those words, they probably tend to point to these home pages.
Algorithm:
(This is actually the rational reconstruction of the model due to
Ng et al., which they show to be more stable than Kleinberg's original
proposal. There are bunches of alternative formulations in the literature.)
C := top K pages for query, as returned by search engine
     (K is a fixed parameter.)
O := follow all outlinks from C.
I := follow at most D inlinks from C. (D is another fixed parameter.)
     /* There is an inherent limitation on the number of outlinks from
        any page, but there are pages with vast numbers of inlinks. */
PAGES := C union O union I;
M := the following Markov model on PAGES: {
    TURN := TRUE;
    P := arbitrary starting point;
    repeat {
        with probability e {
            P := random choice in PAGES;
            TURN := randomly either TRUE or FALSE;
        }
        with probability (1-e) {
            if TURN then {
                choose a random outlink from P to Q;
                P := Q;
                mark this as an "authority" visit to P;
            }
            else {
                choose a random inlink into P from Q;
                P := Q;
                mark this as a "hub" visit to P;
            }
            TURN := not TURN;
        }
    }
}
Find the stable distribution for this Markov model as in previous
section.
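The Markov model above can be approximated by simulating the walk directly and counting "hub" and "authority" visits. The sketch below is one such simulation on a small made-up graph. The pseudocode does not say what to do when the current page has no outlink (or inlink) inside PAGES; as an assumption of mine, the walk makes a random jump in that case. The function name hub_authority_walk and the page names are hypothetical.

```python
import random


def hub_authority_walk(links, e, steps, seed=0):
    """Simulate the alternating walk; return (hub, authority) visit fractions."""
    rng = random.Random(seed)
    pages = sorted(set(links) | {q for outs in links.values() for q in outs})
    outlinks = {p: list(links.get(p, [])) for p in pages}
    inlinks = {p: [] for p in pages}
    for p, outs in outlinks.items():
        for q in outs:
            inlinks[q].append(p)
    hub = dict.fromkeys(pages, 0)
    auth = dict.fromkeys(pages, 0)
    page, turn = rng.choice(pages), True
    for _ in range(steps):
        choices = outlinks[page] if turn else inlinks[page]
        if rng.random() < e or not choices:
            # Random jump; also used (my assumption) when no link is available.
            page = rng.choice(pages)
            turn = rng.random() < 0.5
            continue
        page = rng.choice(choices)
        (auth if turn else hub)[page] += 1  # outlink -> authority visit, inlink -> hub visit
        turn = not turn
    hub_total, auth_total = sum(hub.values()), sum(auth.values())
    return ({p: c / hub_total for p, c in hub.items()},
            {p: c / auth_total for p, c in auth.items()})


# Hypothetical base set: three hub-like pages pointing at two authority-like pages.
GRAPH = {"hub1": ["auth1", "auth2"], "hub2": ["auth1", "auth2"], "hub3": ["auth1"]}
hub, auth = hub_authority_walk(GRAPH, e=0.15, steps=100_000)
print(max(auth, key=auth.get), round(hub["hub1"], 3), round(hub["hub3"], 3))
```

In this toy graph the page with the most in-links from hubs collects the most authority visits, and the hub with only one outlink collects fewer hub visits than the other two.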
Thus, in this Markov model, when you are not jumping randomly, you are
alternately moving across an outlink (presumably from a hub to an
authority) and moving across an inlink (presumably from an authority to
a hub). You thus have two vectors: L(P), the probability that a random
hub visit is P, and M(P), the probability that a random authority visit
is P. Let A be the transition matrix when TURN = TRUE and let B be
the transition matrix when TURN = FALSE; it is easily shown that, apart
from normalization, B is the transpose of A.
Seems to work in many cases. Example from Kleinberg (1999):
Query: "search engines"
Top from HITS: Home pages for Yahoo, Excite, Magellan, Lycos, AltaVista.
Google (9/02) has only one of the major search engine home pages (Google
itself) in the top 30.
Problem: Topic drift. The algorithm drifts toward tightly connected regions of the Web, which may not be exactly what was asked about. Example: A search on "jaguar" converged to a collection of sites about Cincinnati. Explanation: There are a large number of articles in the Cincinnati Enquirer about the Jacksonville Jaguars football team, all of which link to the same Cincinnati Enquirer service pages. Various partial solutions have been studied, none wholly successful. (See Borodin et al. for discussion and examples.)