## Lecture 11: Graph Structure of the Web

Graph Structure in the Web, Andrei Broder et al., 2000.

Self-Similarity in the Web, Stephen Dill et al., 2001.

Trawling the web for emerging cyber-communities, Ravi Kumar et al., 1999.

### Graph structure in the Web

(See paper). Main points:

#### Structure (AltaVista, May 1999)

• Total: 203.5 million pages
• WCC: 186.7 million
• SCC: 56.5 million
• IN: 43.3 million
• OUT: 43.1 million
• TENDRILS: 43.7 million
• DISC: 16.8 million
• Links: 1466 million = 7.2 per page.
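A quick arithmetic check of these figures, as a minimal sketch. It assumes that the WCC figure refers to the giant weakly connected component made up of SCC, IN, OUT and TENDRILS, with DISC being everything outside it; the variable names are just for illustration.

```python
# Component sizes in millions of pages, copied from the list above.
components = {"SCC": 56.5, "IN": 43.3, "OUT": 43.1, "TENDRILS": 43.7, "DISC": 16.8}
total_pages = 203.5   # million pages
total_links = 1466.0  # million links

# Assumption: the WCC figure is the giant component SCC + IN + OUT + TENDRILS.
giant_wcc = sum(size for name, size in components.items() if name != "DISC")
all_pages = sum(components.values())
links_per_page = total_links / total_pages

print(f"SCC + IN + OUT + TENDRILS = {giant_wcc:.1f} million (listed WCC: 186.7)")
print(f"all five components       = {all_pages:.1f} million (listed total: 203.5)")
print(f"links per page            = {links_per_page:.1f}")
```

The sums agree with the listed totals up to rounding of the individual entries.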

There are about 10 million WCCs (weakly connected components), almost all of size 1 (an isolated page with no inlinks or outlinks).

Two interesting points:

1. The seed set for AltaVista must contain at least 10 million pages (actually, presumably, many more). Creating a huge seed set is almost as important as crawling.
2. How on earth does AltaVista find URLs for 10 million isolated pages?

• Server logs
• Mentioned in text or other non-HTML indices.
• Initial part of path names
• ???
Still mysterious.
I hope to hear from Kumar next week.

#### Other interesting measurements

• The diameter of the SCC is at least 28.
• The depth of IN is about 475? The depth of OUT is about 430? Very few pages are anywhere near these depths.
• The probability that there is a directed path from page U to page V is about 24%.
• The probability that there is an undirected path from page U to page V is about 82%.
• If there is a path from U to V, the average directed distance is about 16.15; the average undirected distance is about 6.83.
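Measurements of this kind can be made by breadth-first search. The following is a minimal sketch on a toy graph, not the procedure used in the paper: the function names and the five-page graph are made up for illustration, and on a real crawl one would run BFS from a sample of start pages rather than from every page.

```python
from collections import deque

def bfs_distances(adj, start):
    """Return {node: hop count} for every node reachable from start."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def path_stats(nodes, edges, directed=True):
    """Fraction of ordered pairs (U, V) joined by a path, and mean distance over joined pairs."""
    adj = {n: set() for n in nodes}
    for u, v in edges:
        adj[u].add(v)
        if not directed:
            adj[v].add(u)                  # ignore link direction
    joined, total_dist = 0, 0
    for u in nodes:                        # toy graph only: BFS from every page
        dist = bfs_distances(adj, u)
        joined += len(dist) - 1            # exclude U itself
        total_dist += sum(dist.values())   # dist[U] = 0 adds nothing
    pairs = len(nodes) * (len(nodes) - 1)
    mean = total_dist / joined if joined else float("nan")
    return joined / pairs, mean

# Toy bow-tie: a 3-cycle (SCC), one IN page "d", one OUT page "e".
nodes = ["a", "b", "c", "d", "e"]
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("d", "a"), ("c", "e")]
directed_frac, directed_mean = path_stats(nodes, edges, directed=True)
undirected_frac, undirected_mean = path_stats(nodes, edges, directed=False)
print(directed_frac, directed_mean, undirected_frac, undirected_mean)
```

Even on this tiny graph the pattern is the same as on the Web: fewer pairs are joined by a directed path than by an undirected one, and directed paths are longer on average.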

The connectivity of the WCC is not dependent on the existence of pages of high in-degree. If all pages of indegree ≥ k are removed, the WCC still has size W:

```
k (indegree cutoff)   1000   100    10     5     4     3
W (million pages)      177   165   105    59    41    15
```
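The experiment behind this table can be sketched with a union-find structure: drop every page of indegree at least k, then measure the largest weakly connected component of what is left. This is an illustrative sketch on a toy graph; the function name and the graph are assumptions, not taken from the paper.

```python
from collections import Counter

def largest_wcc_after_removal(nodes, edges, k):
    """Size of the largest weakly connected component once pages of indegree >= k are removed."""
    indegree = Counter(v for _, v in edges)
    kept = {n for n in nodes if indegree[n] < k}
    parent = {n: n for n in kept}          # union-find forest over surviving pages

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, v in edges:
        if u in kept and v in kept:        # a link survives only if both ends do
            parent[find(u)] = find(v)
    sizes = Counter(find(n) for n in kept)
    return max(sizes.values(), default=0)

# Toy graph: five pages in a chain, all of which also link to one popular page.
nodes = ["hub", "p1", "p2", "p3", "p4", "p5"]
edges = [(p, "hub") for p in nodes[1:]]
edges += [("p1", "p2"), ("p2", "p3"), ("p3", "p4"), ("p4", "p5")]
for k in (10, 3, 1):
    print(k, largest_wcc_after_removal(nodes, edges, k))
```

In the toy graph, removing the popular page (k = 3) leaves the chain connected, which is the same qualitative effect as in the table.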

### Inverse power distribution on the Web

The following Web quantities follow an inverse power distribution:

• Number of pages with K inlinks vs K.
The exponent is about -2.1, very stable across many significant subsets of the Web, and the same whether one considers all inlinks or only remote inlinks.
• Number of pages with K outlinks vs K.
• Number of weakly connected components of size K vs K.
• Number of strongly connected components of size K vs K.
Why is this surprising?

### Digression on inverse power distributions

For further reading, see:

Stochastic Models for the Web Graph, Ravi Kumar et al., FOCS 2000.

A Brief History of Generative Models for Power Law and Lognormal Distributions, Michael Mitzenmacher, 2001.

References on Zipf's Law, Wentian Li. Large bibliography.

Zipf's Law, Benford's Law. Readable discussion. Benford's Law is the even stranger observation that the first digit D of numbers that come up in practice is distributed with frequencies $\log_{10}(1 + 1/D)$.

X is a random variable.
X.f is a property of X whose value is a positive integer.
In an inverse power distribution, $\mathrm{Prob}(X.f = i) = C / i^A$ for $A > 1$.
($C$ is a constant equal to 1 over the sum from $k=1$ to infinity of $1/k^A$, i.e. $C = 1 / \sum_{k=1}^{\infty} k^{-A}$.)

If $A \le 1$, then the above sum diverges, so one has to impose an upper bound M on the value of X.f.

If a collection O of N items is chosen according to the above distribution, the expected number of items Q such that Q.f = i is $N \cdot C / i^A$. Conversely, O fits the inverse power distribution if the above condition holds over O.
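As a numerical illustration of the definitions above, here is a small sketch that computes the constant C and the expected counts. The function names and the cutoff `terms` (used to approximate the infinite sum) are assumptions made for this example.

```python
def power_law_constant(A, M=None, terms=200_000):
    """Normalizing constant C = 1 / sum_{k=1}^{M} k^(-A).

    For A > 1 and M=None the infinite sum is approximated by its first `terms`
    terms (an arbitrary cutoff for this sketch); for A <= 1 a bound M is required.
    """
    if M is None:
        if A <= 1:
            raise ValueError("the sum diverges for A <= 1; pass an upper bound M")
        M = terms
    return 1.0 / sum(k ** -A for k in range(1, M + 1))

def expected_counts(N, A, values, M=None):
    """Expected number of items with X.f = i, for each i in `values`: N * C / i^A."""
    C = power_law_constant(A, M)
    return {i: N * C / i ** A for i in values}

print(power_law_constant(2.0))                 # close to 6 / pi^2
print(expected_counts(1000, 2.0, [1, 2, 10]))
print(power_law_constant(1.0, M=3))            # 1 / (1 + 1/2 + 1/3) = 6/11
```

With A = 2, doubling i divides the expected count by 4, whatever N is.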

#### Zipf distribution

Let O be a bag of items of size N over a space W. For w in W, let count(w,O) be the number of times w occurs in O. Let $w_1, w_2, \ldots$ be the items in W sorted by decreasing value of count(w,O) (break ties arbitrarily). We define the rank of $w_i$ in O to be i. That is, the most common item has rank 1; the second most common has rank 2; etc. O follows the Zipf distribution if
$\mathrm{count}(w_i, O) = N C / \mathrm{rank}(w_i, O)^A$ for some $A > 1$.

As before, if $A \le 1$, then W has to be finite, of size L.
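The rank construction in this definition is easy to carry out in code. A minimal sketch, with a made-up function name and a toy bag of words:

```python
from collections import Counter

def rank_table(bag):
    """List (rank, item, count) in order of decreasing count; ties are broken arbitrarily."""
    ordered = Counter(bag).most_common()
    return [(rank, item, count) for rank, (item, count) in enumerate(ordered, start=1)]

bag = "the cat and the dog and the bird".split()
for rank, item, count in rank_table(bag):
    # For a Zipf distribution with A = 1, count * rank would be roughly constant.
    print(rank, item, count, count * rank)
```

On a real corpus one would plot count against rank on log-log axes rather than print the table.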

Note: The power distribution is often called the "power law" and the Zipf distribution is generally called "Zipf's law", but the only "law" is that a lot of things follow this distribution. They are also called "long tail" or "fat tail" distributions because the probability of extreme values decreases much more slowly than with other distributions such as the exponential or the normal.

#### Things that follow the Zipf distribution

• Frequency of words in a corpus of text. This is one of the main reasons that learning from corpora is so hard. Having seen a training corpus of N words, the probability that the next word you see will be new to you goes up only like log(N).
• Population of cities.
• Incomes
• Distribution of species among genera
• Access statistics for web pages.
• Number of times users at a single site access particular pages.
In many applications, the exponent "A" is quite close to 1, which is the "standard" Zipf distribution. As indicated above, there is in principle a large qualitative difference between an exponent that is greater than 1, which can accommodate an infinite range of values and an exponent less than or equal to 1, where the tail must ultimately depart from the power law.

If A ≤ 3 in an inverse power distribution, then the variance (and the standard deviation) is infinite; if A ≤ 2, even the mean is infinite. Conversely, if you're doing statistical analysis of something and you come up with a strangely large standard deviation, you might consider whether you're looking at an inverse power distribution.
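The role of the exponent can be checked by writing out the moments of the distribution $\mathrm{Prob}(X.f = i) = C / i^A$ defined earlier. This is a standard convergence argument, added here as a sketch:

$$
E\big[(X.f)^m\big] \;=\; \sum_{i=1}^{\infty} i^m \cdot \frac{C}{i^A} \;=\; C \sum_{i=1}^{\infty} \frac{1}{i^{A-m}},
$$

which converges exactly when $A - m > 1$. So the mean ($m = 1$) is finite only for $A > 2$, and the second moment ($m = 2$), hence the variance, is finite only for $A > 3$. With an exponent around 2.1, as for Web indegree, the mean is finite but the variance is not.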

To find the optimal inverse power distribution matching a data set, plot the data set on a log-log graph. If $y = C/x^A$ then $\log(y) = \log C - A \log(x)$. So the log-log graph of the distribution is a straight line with slope $-A$ and y-intercept $\log C$. You can find the best fit by doing any standard linear regression (e.g. least squares).
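The fitting recipe above can be sketched in a few lines using ordinary least squares on the logs. The function name and the synthetic data are assumptions for this example; real data would be noisy and would usually need the extremes trimmed.

```python
import math

def fit_power_law(xs, ys):
    """Least-squares line through (log x, log y); returns (A, C) for y = C / x^A."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x, mean_y = sum(lx) / n, sum(ly) / n
    sxx = sum((a - mean_x) ** 2 for a in lx)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return -slope, math.exp(intercept)   # slope = -A, intercept = log C

# Synthetic, noise-free data generated from y = 1000 / x^2.1.
xs = list(range(1, 51))
ys = [1000 / x ** 2.1 for x in xs]
A, C = fit_power_law(xs, ys)
print(A, C)
```

On exact power-law data the fit recovers A and C up to floating-point error.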

#### Connection between the inverse power and the Zipf distribution.

If X.f follows the inverse power distribution, then count(X.f) follows the Zipf distribution. Argument: count(X.f=i) is (characteristically) a decreasing function of i; hence, the rank of the value X.f=i is just i, so it's the same distribution. In fact, the fit to the Zipf distribution of count as a function of rank is often better than the fit to the power law distribution, because
• by definition, count is a non-increasing function of rank;
• the better fit tends to hold even beyond that effect.
(If you plot count against X.f, you can turn this into a plot of count against rank by (a) reordering the points on the x-axis so as to go in decreasing order by count; (b) packing the points to the left to eliminate gaps.)

#### Two directions for inverse power distribution

The inverse power law can apply over a collection of sets in either of two directions:
• 1. The size of the Kth largest set is about $C/K^A$.
• 2. Let Q(K) be the collection of sets of size K. Then |Q(K)| is about $C/K^A$.
E.g. with (1) we would plot the Kth largest indegree vs. K; with (2) we would plot the number of nodes of indegree K against K.

The two directions are actually the same rule (under reasonable smoothness and monotonicity assumptions); if a distribution satisfies (1) with exponent A then it satisfies (2) with exponent 1+1/A, and vice versa. However, with approximate data (i.e. any actual data)

• the quality of fit may not be the same;
• the best fit exponents to the two graphs may not exactly satisfy the above relation
• power laws, like other distributions, often break down at one or both extremes.
Word frequency is an example of the first direction (count of the Kth most common word vs. K); inlink count is an example of the second (rank 1 = the most common number of inlinks, rather than the page with the most inlinks).
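The relation between the two exponents can be derived in a couple of lines, treating K and the set size as continuous variables (the smoothness assumption mentioned above). The symbols $s$ (set size) and $n(s)$ (number of sets of size about $s$) are introduced only for this sketch.

$$
s(K) = \frac{C}{K^A}
\;\Longrightarrow\;
K(s) = \left(\frac{C}{s}\right)^{1/A},
\qquad
n(s) = \left|\frac{dK}{ds}\right| = \frac{C^{1/A}}{A}\, s^{-(1 + 1/A)} .
$$

Here $K(s)$ is the number of sets of size at least $s$, so its derivative approximates the number of sets of size about $s$. That is direction (2) with exponent $1 + 1/A$.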

#### Why do you get power law distributions?

Stochastic model. Suppose that at every time step, a new page and a new link are created. With probability P, the link points to a page chosen uniformly at random; with probability 1-P, the link points to a page chosen at random with weight proportional to its number of inlinks. Then for large values of I, the number of nodes with I inlinks follows a power-law distribution with exponent (2-P)/(1-P). (Herbert Simon)
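This model is easy to simulate. The sketch below is one possible reading of the description, under assumptions that the notes leave open: the graph starts with a single page, and each new link points to a page that existed before the current step. The function name and parameters are made up for illustration.

```python
import random
from collections import Counter

def simulate_simon_model(steps, p, seed=0):
    """Return the list of inlink counts after `steps` steps of the model."""
    rng = random.Random(seed)
    inlinks = [0]   # assumption: one page exists at the start
    targets = []    # one entry per link: the page it points to
    for _ in range(steps):
        if not targets or rng.random() < p:
            target = rng.randrange(len(inlinks))   # uniform over existing pages
        else:
            target = rng.choice(targets)           # probability proportional to inlinks
        inlinks[target] += 1
        targets.append(target)
        inlinks.append(0)                          # the new page created at this step
    return inlinks

inlinks = simulate_simon_model(steps=20000, p=0.2, seed=1)
histogram = Counter(inlinks)
for k in (0, 1, 2, 4, 8, 16, 32):
    print(k, histogram[k])
print("largest indegree:", max(inlinks))
```

Picking a random entry of `targets` is what implements the weighting by inlinks: a page with I inlinks appears I times in that list. Plotting `histogram` on log-log axes gives a rough visual check of the power-law tail.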

By contrast, here are two different stochastic models that lead to very different distributions. Let N be the number of pages = 203 million and L be the number of links = 1.46 billion.

Method 1

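```python
import random

def method1(n_pages, n_links):
    """Place n_links links, each between two uniformly random pages."""
    links = []
    for _ in range(n_links):
        u = random.randrange(n_pages)  # choose a random page U
        v = random.randrange(n_pages)  # choose a random page V
        links.append((u, v))           # place a link from U to V
    return links
```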
Number of inlinks follows a binomial distribution. The probability that there exists any page with 100 inlinks is about $10^{-62}$. (Note: these and the next figure were done on the back of an envelope and could easily be off by a factor of 100 or so, but they give the right idea.) The probability that there exists any page with 200 inlinks is about $10^{-192}$.

Method 2

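```python
import random

def method2(n_pages, heads_weight=7.2):
    """For each page V, keep adding inlinks until a biased coin comes up tails."""
    p_heads = heads_weight / (heads_weight + 1)  # coin weighted 7.2 for heads, 1 for tails
    links = []
    for v in range(n_pages):
        while random.random() < p_heads:   # on tails, exit the inner loop
            u = random.randrange(n_pages)  # choose a random page U
            links.append((u, v))           # place a link from U to V
    return links
```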
Total number of inlinks will be very nearly L, to within about 0.0025%. Number of inlinks follows an exponential distribution. Expected number of pages with 100 inlinks = about 200. Probability that there exists any page with 200 inlinks = about $2 \times 10^{-4}$.

Contrast this with the actual values, as predicted by the power law: about 30,000 pages (eyeballed from a small graph) with 100 inlinks and about 8000 pages with 200 inlinks. The point is that "ordinary" random processes don't give power laws, or anything close to them.

Of course this is not a plausible model of the Web, so there's a small industry in constructing more plausible stochastic models of the Web. No one (as far as I have found) has yet constructed a model which is known to give all the structural features of the Web.

Multiplicative model. If you have a large collection of identically distributed independent random variables, and you multiply them, then you are just adding the logs, so by the central limit theorem the sum of the logs approximately follows a normal curve. The product therefore follows what's called a log-normal distribution, which looks a lot like a power law in the middle range.
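A small simulation illustrates the multiplicative model. This is a sketch under arbitrary assumptions: the factors are drawn uniformly from [0.5, 1.5], and the numbers of factors and samples are picked only to keep the run short.

```python
import math
import random
import statistics

def log_products(n_factors, n_samples, seed=0):
    """Logs of products of n_factors i.i.d. factors drawn uniformly from [0.5, 1.5]."""
    rng = random.Random(seed)
    logs = []
    for _ in range(n_samples):
        product = 1.0
        for _ in range(n_factors):
            product *= rng.uniform(0.5, 1.5)
        logs.append(math.log(product))
    return logs

logs = log_products(n_factors=50, n_samples=2000)
mu = statistics.mean(logs)
sigma = statistics.pstdev(logs)
within_one_sigma = sum(abs(x - mu) <= sigma for x in logs) / len(logs)
print(f"mean log = {mu:.2f}, std = {sigma:.2f}, within one std = {within_one_sigma:.2f}")
```

If the logs are approximately normal, the fraction within one standard deviation of the mean should come out near 0.68.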

Information theoretic model (Mandelbrot). Can show that a power-law distribution of word frequencies maximizes information over cost.

#### Significance of these Web measurements

Well, that's a little harder. We've seen the implications for crawlers: they need an immense seed set and techniques for finding URLs without links. Also:
• Support for efficient algorithms, or an explanation of why algorithms such as PageRank run as rapidly as they do. Problems that would be intractable on a random graph (e.g. finding small cliques) may be tractable due to structure.
• Improved characterization for browsing and ranking.
• Data compression. Whenever you understand structure, you can use that to compress.
• Data mining.

### Self-Similarity

Made measurements over various subsets of the Web. Generally, the results were that many structural properties of the Web apply to significant subsets as well.

#### Subsets

• Keyword sets: All pages that include specified keyword(s).
  • Baseball
  • Golf
  • Math
  • MP3
  • Restaurant
  • Baseball Yankees
  • Golf Tiger Woods
  • Math Geometry
  • MP3 Napster
  • Restaurant Sushi
• IBM intranet
• 100 web sites
• Geographic location: Web sites that have references to geographic locations between Denver on the east, Nikolski, Alaska on the west, Vancouver on the north, and Brownsville, Texas on the south.
• 7 random collections of websites: STREAM1 ... STREAM7, each with about 6 million pages.
Some others, but these are the most interesting.

#### Some Results

Different measurements for different subgraphs: Why?

STREAM1 ... 7:
Remarkable consistency as regards:

• Expansion factor (between 2.01 and 2.06).
• Indegree exponent (between 2.06 and 2.13)
• Outdegree exponent (between 2.12 and 2.32)
• SCC exponent (between 2.11 and 2.16)
• WCC exponent (between 2.25 and 2.32)
• WCC/nodes (between 0.69 and 0.72)
• SCC/WCC (between 0.22 and 0.24)
• IN/WCC (between 0.19 and 0.23)
• OUT/WCC (between 0.23 and 0.24)
• $K_{5,7}$ factor, i.e. the size of the set divided by the number of nodes that are in a bipartite $K_{5,7}$ graph (between 43.5 and 50.1)

Single keyword sets: Sizes between 336,500 (BASEBALL) and 831,700 (MATH).

• Arcs/node: between 4.55 (MATH) and 14.54 (MP3)
• Indegree exp: between 2.06 and 2.33 except MATH at 2.85
• SCC exp: between 2.06 and 2.66
• WCC exp: between 2.18 and 2.73
• WCC/nodes: between 0.013 (RESTAURANT) and 0.10 (BASEBALL)
• SCC/WCC: between 0.12 and 0.31
• $K_{5,7}$ factor: between 44.48 (BASEBALL) and 148.7 (MATH)

Double keyword sets: Sizes between 7400 and 44,000. These show great variation, e.g. arcs/node varies from 1.98 (MATH GEOMETRY) to 13.33 (BASEBALL YANKEES).

### Bipartite graphs (hubs and authorities) in the Web

Trawling the web for emerging cyber-communities, Ravi Kumar et al., 1999.

Look for small complete bipartite graphs on the Web: $K_{i,j}$, for i = 3..6, j = 3, 5, 7, 9. Infer cyber-communities.

Technical issues: Exclude aliases, copies, near-copies, nepotistic cores (hubs from same website). There are lots of copies of Yahoo pages.

Algorithmics:

1. Prune all potential hubs with out-degree < i and all authorities with indegree < j. Iterate. (As we have seen, this eliminates most of the pages on the Web.)
2. "Inclusion/exclusion": To check whether a hub H of out-degree exactly i is in a $K_{i,j}$, you (a) collect all the i authorities it points to; (b) compute the intersection of all their sets of tails of inlinks. If this intersection has q ≥ j elements, then H is in a $K_{i,q}$ graph, which can be output and then pruned. If q < j, then prune H and propagate. This step finds most of the bipartite graphs.
3. The remaining graph is now small enough for a search for the remaining graphs to be tractable.
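The first two steps can be sketched as follows. This is a simplified illustration, not the authors' implementation: it follows the convention used in these notes (i authorities, j hubs), treats the hub and authority roles of a page separately, makes a single pass for step 2 without the "prune and propagate" loop, and omits step 3. All function names are made up.

```python
from collections import Counter, defaultdict

def prune_by_degree(edges, i, j):
    """Step 1: repeatedly drop hubs with out-degree < i and authorities with in-degree < j."""
    edges = set(edges)
    while True:
        out_deg = Counter(h for h, _ in edges)
        in_deg = Counter(a for _, a in edges)
        kept = {(h, a) for h, a in edges if out_deg[h] >= i and in_deg[a] >= j}
        if kept == edges:
            return edges
        edges = kept

def cores_from_tight_hubs(edges, i, j):
    """Step 2, single pass: examine each hub whose out-degree is exactly i."""
    authorities_of = defaultdict(set)   # hub -> authorities it points to
    hubs_of = defaultdict(set)          # authority -> tails of its inlinks
    for h, a in edges:
        authorities_of[h].add(a)
        hubs_of[a].add(h)
    cores = set()
    for h, auths in authorities_of.items():
        if len(auths) == i:
            common_hubs = set.intersection(*(hubs_of[a] for a in auths))
            if len(common_hubs) >= j:   # H lies in a K_{i,q} with q = len(common_hubs)
                cores.add((frozenset(common_hubs), frozenset(auths)))
    return cores

# Toy example: a K_{3,3} plus two stray links that the degree pruning removes.
edges = [(h, a) for h in ("h1", "h2", "h3") for a in ("a1", "a2", "a3")]
edges += [("x", "a1"), ("h1", "y")]
pruned = prune_by_degree(edges, i=3, j=3)
print(len(pruned), cores_from_tight_hubs(pruned, i=3, j=3))
```

Storing each core as a pair of frozensets means the same core found from different hubs is recorded only once.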

Results: About 200,000 bipartite graphs; almost all are coherent in terms of topic. Many are unknown to Yahoo. Hubs in a community last longer than average Web pages.