Lecture 9: Web Structure and Evolution

Stanford WebBase Components and Applications by Junghoo Cho et al. Interesting paper on crawling and on Stanford WebBase

Required Reading

Chakrabarti, sec. 7.6

Graph Structure in the Web Andrei Broder et al. 2000
The Indexable Web is more than 11.5 Billion Pages A. Gulli and A. Signorini

Suggested Reading

Pierre Baldi, Paolo Frasconi, and Padhraic Smyth, Modeling the Internet and the Web, chap. 3.

Self-Similarity in the Web Stephen Dill et al. 2001.

Trawling the web for emerging cyber-communities Ravi Kumar et al. 1999.

Graph structure in the Web

(See paper). Main points:

Structure: (AltaVista, May 1999).

Total: 203.5 million pages
WCC (the large weakly connected component; that is, there is a path from any page in WCC to any other following outlinks and inlinks): 186.7 million
SCC (the large strongly connected component; that is, there is a path from any page in SCC to any other following outlinks): 56.5 million
IN (there is a path following outlinks from any page in IN to SCC): 43.3 million
OUT (there is a path following outlinks from SCC to any page in OUT): 43.1 million
TENDRILS (= WCC - (SCC union IN union OUT)): 43.7 million
DISC (pages not in WCC): 16.8 million
Links: 1466 million = 7.2 per page.
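
As a quick sanity check on these figures, the four bow-tie regions partition the WCC, and WCC plus the disconnected pages gives the total; a small sketch using the numbers above:

# Sanity check on the bow-tie sizes above (all values in millions).
scc, in_, out, tendrils = 56.5, 43.3, 43.1, 43.7
wcc, disc, total = 186.7, 16.8, 203.5
print(scc + in_ + out + tendrils)   # ~186.6, i.e. WCC up to rounding
print(wcc + disc)                   # 203.5, the total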

10 million WCCs (weakly connected components), almost all of size 1 (isolated pages with no inlinks or outlinks).

Two interesting points:
1. The seed set for AltaVista must be at least 10 million (actually, presumably, much larger). Creating a huge seed set is almost as important as crawling.
2. How on earth does AltaVista find URLs for 10 million isolated pages?

Still mysterious.

Other interesting measurements

The diameter of the SCC is at least 28.
The depth of IN is about 475(?); the depth of OUT is about 430(?). Very few pages are anywhere near these depths.
The probability that there is a directed path from page U to page V is about 24%.
The probability that there is a undirected path from page U to page V is about 82%.
If there is a path from U to V, the average directed distance is about 16.15; the average undirected distance is 6.83.

The connectivity of the WCC is not dependent on the existence of pages of high in-degree.

If all pages of indegree >= k are removed, the WCC still has size W:

k              1000    100     10      5      4      3
W (millions)    177    165    105     59     41     15
Serrano et al. report that the relative sizes of IN, OUT, and SCC depend strongly on the particular crawl, especially IN, as one might expect.

The Size of the Indexable Web

The Indexable Web is more than 11.5 Billion Pages A. Gulli and A. Signorini (2005)

(Comparable studies gave figures of about 200 million pages in 1997 and 800 million in 1999. Accessibility of information on the web, Steve Lawrence and C. Lee Giles, Nature 400, 1999, pp. 107-109.)

Comparative size: Let A and B be the set of pages indexed by two different search engines. Let P(B|A) be the probability that a randomly chosen page in A is also indexed by B. Then P(B|A) is approximately |A&B|/|A| and P(A|B) is approximately |A&B|/|B|, so |A|/|B| = P(A|B) / P(B|A).

Moreover, in 2005 the search engines still published a figure for |A|. If you know |A| and |B| and you assume that A and B are independent samples of the space W of indexable web pages, then P(A|B) = P(A|W) = |A|/|W|, so |W|=|A|/P(A|B).
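
A minimal sketch of this arithmetic; the overlap counts and the published |A| below are illustrative placeholders, not the paper's data:

# Estimate relative and absolute index sizes from overlap sampling.
sampled_from_A = 10_000     # pages sampled uniformly from engine A's index (placeholder)
also_in_B      = 6_000      # of those, how many engine B also indexes (placeholder)
sampled_from_B = 10_000     # pages sampled uniformly from engine B's index (placeholder)
also_in_A      = 7_000      # of those, how many engine A also indexes (placeholder)

p_B_given_A = also_in_B / sampled_from_A          # estimate of P(B|A)
p_A_given_B = also_in_A / sampled_from_B          # estimate of P(A|B)
relative_size = p_A_given_B / p_B_given_A         # |A| / |B|

published_size_A = 8e9                            # |A| as published by engine A (placeholder)
indexable_web = published_size_A / p_A_given_B    # |W| = |A| / P(A|B), assuming independence
print(relative_size, indexable_web)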

So there are two issues: randomly sampling a page indexed by a search engine, and checking whether a given page is indexed by another search engine.

Random sampling: Extract 2,000,000 terms in 75 languages from DMOZ. Sort by frequency of occurrence; divide into blocks of 20 terms; extract one query term per block for each of the four search engines; thus, 440,000 one-term queries, about 110,000 queries each for Google, MSN, Yahoo, and Ask/Teoma. Pick a random page from the first 100 pages returned.

Checking: Check directly on normalized URL.
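
A sketch of the kind of URL normalization one might apply before that check; the exact rules are not specified here, so these particular ones are assumptions:

from urllib.parse import urlsplit

def normalize(url):
    # Illustrative normalization: lowercase scheme and host, drop default port
    # and fragment, strip a trailing slash.  Not necessarily the paper's rules.
    parts = urlsplit(url.strip())
    host = parts.hostname or ""
    port = "" if parts.port in (None, 80, 443) else ":%d" % parts.port
    path = parts.path.rstrip("/") or "/"
    query = "?" + parts.query if parts.query else ""
    return "%s://%s%s%s%s" % (parts.scheme.lower(), host, port, path, query)

# Two spellings of the same page compare equal after normalization.
assert normalize("HTTP://Example.COM:80/index/") == normalize("http://example.com/index")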

BTW, this huge data set is completely unnecessary. The experiment would have given essentially the same results with 5,000 queries per search engine, or even fewer, as one would expect. (Using a larger data set does not usually solve the problem of systematic bias.)

P(Google|X) = 68.2% (X is any other search engine)
P(Yahoo|X) = 59.1%
P(MSN|X) = 49.2%
P(Ask/Teoma|X) = 43.5%.

Assuming that the search engines' indexes are independent random samples of the indexable web, these figures are also the fractions of the indexable web that each engine indexes.

Note: the overlap here has gone way up since 1998, when Lawrence and Giles estimated that no search engine indexed more than 16% of the web. So either the coverage has improved, or the correlation between different search engines has increased. In any case, the value of doing searches in multiple different general search engines (either manually or using a meta-search engine) has gone down. If a page is indexed by any of these search engines, there is a better than 2/3 chance that it is indexed by Google.

Inverse power distribution on the Web

f(X) = C/(X+D)^A

The following Web quantities follow an inverse power distribution:

Digression on inverse power distributions

Further reading:

Baldi et al. sec. 1.7.

A Brief History of Generative Models for Power Law and Lognormal Distributions Michael Mitzenmacher, 2001.

References on Zipf's Law Wentian Li. Large bibliography.

Zipf's Law, Benford's Law Readable discussion. Benford's Law is the even stranger observation that the first digit D of numbers that come up in practice is distributed with frequency log10(1 + 1/D).

For an argument that true inverse power distributions are rarer than claimed see "Strong, Weak, and False Inverse Power Laws," by Richard Perline, Statistical Science, vol. 20, 2005, pp. 68-99.

X is a random variable.
X.f is a property of X whose value is a positive integer.
In an inverse power distribution, Prob(X.f = i) = C / i^A for A > 1.
(C is a constant equal to 1 over the sum from k=1 to infinity of 1/k^A.)

If A <= 1, then the above sum diverges, so you have to impose an upper bound M on the value.

If a collection O of N items is chosen according to the above distribution, the expected number of items Q such that Q.f = i is N*C/i^A. Conversely, O fits the inverse power distribution if the above condition holds over O.
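
A small sketch of these definitions (the exponent A and collection size N below are arbitrary example values):

# Normalizing constant and expected counts for Prob(X.f = i) = C / i^A, A > 1.
A = 2.1
C = 1.0 / sum(1.0 / k**A for k in range(1, 1_000_000))    # truncated sum approximating zeta(A)

N = 1_000_000                                             # size of the collection O
expected = {i: N * C / i**A for i in range(1, 11)}        # expected number of items with X.f = i
print(expected[1], expected[10])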

Zipf distribution

Let O be a bag of items of size N over a space W. For w in W, let count(w,O) be the number of times w occurs in O. Let w1, w2, ... be the items in W sorted by decreasing value of count(w,O) (breaking ties arbitrarily). We define the rank of wi in O to be i. That is, the most common item has rank 1, the second most common has rank 2, etc. O follows the Zipf distribution if
count(wi,O) = N*C/rank(wi,O)^A for some A > 1.

As before, if A <= 1, then W has to be finite, of size L.

Note: The power distribution is often called the "power law" and the Zipf distribution generally called "Zipf's law", but the only "law" is that a lot of things follow this distribution. They are also called "long tail" or "fat tail" distributions because the probability of extreme values decreases much more slowly than with other distributions such as the exponential or the normal.

Two features of the inverse power law

First: The rich have a large percentage of the money.
Second: There are a large number of instances with 1 unit.

For example: Suppose there are 1,000,000,000 web pages, an average of 7 links per page, and the distribution of inlinks goes like 1/(N+4)^2 (4 is the offset). Then:
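Here is a sketch of the kind of consequence one can compute from such a model (exponent 2, offset 4, a billion pages); the printed numbers are whatever the model implies, not figures quoted from elsewhere:

# Inlink model: Prob(page has i inlinks) = C / (i + 4)^2, over a billion pages.
N_PAGES = 1_000_000_000
OFFSET, EXPONENT = 4, 2

# Normalize over i = 0, 1, 2, ... (sum truncated far out in the tail).
C = 1.0 / sum(1.0 / (i + OFFSET)**EXPONENT for i in range(0, 1_000_000))

def expected_pages_with(i):
    # Expected number of pages with exactly i inlinks.
    return N_PAGES * C / (i + OFFSET)**EXPONENT

print(expected_pages_with(1))        # a huge number of pages with a single inlink
print(expected_pages_with(10_000))   # yet still a few enormously popular pages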

The higher the exponent A and the lower the offset, the more the wealth is concentrated in the head, and the shorter the tail.

By contrast, here are two different stochastic models that lead to very different distributions.

Method 1

import random

N = 1_000_000_000        # pages (shrink N to actually run this; a billion is memory-heavy)
L = 7 * N                # total number of links
inlinks = [0] * N        # inlink count per page
for _ in range(L):
    u = random.randrange(N)   # choose a random page U (only the target matters below)
    v = random.randrange(N)   # choose a random page V
    inlinks[v] += 1           # place a link from U to V
Number of inlinks follows a binomial distribution. Then there is probably no page with more than 30 inlinks, and the expected number of pages with no inlinks is about 900,000. The probability that there exists a page with 100 inlinks is about 10^{-63}. The probability that there exists a page with 280,000,000 inlinks is about 10^{- 2 billion}.

Method 2

import random

N = 1_000_000_000        # pages (shrink N to actually run this)
inlinks = [0] * N        # inlink count per page
for v in range(N):                   # for each page V
    while random.random() < 7/8:     # coin weighted 7 heads : 1 tails; repeat until tails
        u = random.randrange(N)      # choose a random page U
        inlinks[v] += 1              # place a link from U to V
Total number of links will be very nearly 7 billion, to within about 0.0002%. Number of inlinks follows an exponential distribution. Expected number of pages with 100 inlinks = about 28. Probability that there exists any page with 280,000,000 inlinks is about 10^{- 18 million}

Another feature of inverse power law: The continued likelihood of finding new objects.

Suppose that you have a vocabulary of 1,000,000 words, and the frequency of the Kth most frequent word is 1/(14.39*K). The following table shows, for various corpus sizes, the number of distinct words in the corpus, the success rate, i.e. the probability that the next word you see is one you have seen already, and the failure rate. (This calculation is not exact, but I'm pretty sure it is quite close.)

Corpus size   Distinct words   Success rate   Failure rate
     10,000            5,700           0.64           0.36
     20,000           13,000           0.68           0.32
     50,000           23,000           0.73           0.27
    100,000           41,000           0.77           0.23
    200,000           73,000           0.81           0.19
    500,000          150,000           0.86           0.14
  1,000,000          260,000           0.90           0.10

Thus, over this range, doubling the corpus size increases the success rate by about 4 percentage points. With larger vocabularies, the effect is even starker.

(Formula: The harmonic function H(n) = sum_{k=1}^{n} 1/k = (approximately) ln(n) + Euler's constant 0.5772. Let V be the size of the vocabulary. The Kth most common word occurs with frequency 1/(K*H(V)). Assume, for simplicity, that an N-word corpus consists of the W most frequent words, appearing according to the specified frequencies. All of the words of rank at most W0 = N/H(V) appear at least once; these account for N0 = N*H(W0)/H(V) word occurrences in the corpus. The remainder of the corpus is filled in with less frequent words that each occur exactly once; there are N-N0 of these. Thus, the total number of distinct words in the corpus is W = W0 + (N-N0). The probability that the next word has already been seen is H(W)/H(V).)
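
A short sketch of this calculation, which reproduces the table above to within rounding:

import math

def harmonic(n):
    # H(n) = 1 + 1/2 + ... + 1/n, approximated by ln(n) + Euler's constant
    return math.log(n) + 0.5772

V = 1_000_000                     # vocabulary size
for N in (10_000, 20_000, 50_000, 100_000, 200_000, 500_000, 1_000_000):
    W0 = N / harmonic(V)                  # ranks that appear at least once
    N0 = N * harmonic(W0) / harmonic(V)   # corpus positions they account for
    W  = W0 + (N - N0)                    # distinct words in the corpus
    success = harmonic(W) / harmonic(V)   # P(next word already seen)
    print(f"{N:>9}  {W:>9.0f}  {success:.2f}  {1 - success:.2f}")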

This is one of the major reasons that promising prototypes of intelligent systems often cannot be extended to high-quality systems; it is just much more work to get a 90% success rate than a 70% success rate. (Another major reason is combinatorial explosion.)

Things that follow the Zipf distribution

In many applications, the exponent "A" is quite close to 1, which is the "standard" Zipf distribution. As indicated above, there is in principle a large qualitative difference between an exponent that is greater than 1, which can accommodate an infinite range of values and an exponent less than or equal to 1, where the tail must ultimately be truncated.

If A <= 2 in an inverse power distribution, then the mean is infinite. If A <= 3, then the variance and the standard deviation are infinite. Conversely, if you're doing statistical analysis of something and you find that as you take more and more samples your estimate of the mean or the standard deviation keeps getting bigger and bigger, you might consider whether you're looking at an inverse power distribution.
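
A quick illustration of that last point; the exponent 1.5 here is an arbitrary choice of a distribution with infinite mean:

import random

# Samples with density proportional to x^(-1.5) for x >= 1 (A = 1.5 <= 2),
# so the theoretical mean is infinite: the running sample mean never settles.
def draw():
    return random.random() ** -2     # inverse-CDF sample: P(X > x) = x^(-1/2)

total, count = 0.0, 0
for checkpoint in (10**3, 10**4, 10**5, 10**6):
    while count < checkpoint:
        total += draw()
        count += 1
    print(count, total / count)      # the estimated "mean" keeps growing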

To find the optimal inverse power distribution matching a data set, plot the data set on a log-log graph. If y = C/(x+D)^A then log(y) = log C - A log(x+D). So if you graph log(y) against log(x+D) you get a straight line with slope -A and y-intercept log C. The value of D of course affects the graph only for small values of x, so find A and C by fitting a line over the large values of x using any standard linear regression (e.g. least squares); then, with that fixed value of A and C, do a search over D (e.g. binary search) to find the value that best fits the small values of x.
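
A minimal sketch of this procedure using NumPy; the cutoff between "large" and "small" x, and the use of a simple grid over D rather than binary search, are choices made here for simplicity:

import numpy as np

def fit_inverse_power(x, y, large_x_cutoff=100.0):
    # Fit y ~ C / (x + D)^A by the two-stage procedure described above.
    x, y = np.asarray(x, float), np.asarray(y, float)

    # Stage 1: least-squares line on the log-log plot over large x,
    # where the offset D hardly matters (take D = 0 there).
    big = x >= large_x_cutoff
    slope, intercept = np.polyfit(np.log(x[big]), np.log(y[big]), 1)
    A, C = -slope, np.exp(intercept)

    # Stage 2: with A and C fixed, pick the offset D that best fits small x.
    small = ~big
    candidates = np.linspace(0.0, 50.0, 501)
    errors = [np.sum((np.log(y[small]) - (np.log(C) - A * np.log(x[small] + D)))**2)
              for D in candidates]
    D = candidates[int(np.argmin(errors))]
    return C, D, A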

Connection between the inverse power and the Zipf distribution.

If X.f follows the inverse power distribution, then count as a function of rank follows the Zipf distribution. Argument: count(X.f=i) is (characteristically) a decreasing function of i; hence, the rank of the value X.f=i is just i, so it is the same distribution. In fact, the fit of count as a function of rank to the Zipf distribution is often better than the fit to the power-law distribution. (If you plot count against X.f, you can turn this into a plot of count against rank by (a) reordering the points on the x-axis so as to go in decreasing order of count and (b) packing the points to the left to eliminate gaps.)

Two directions for inverse power distribution

The inverse power law can apply over a collection of items in either of two directions: (1) the Kth largest value as a function of the rank K (the Zipf direction), or (2) the number of items with a given value as a function of that value (the power-law direction). E.g. with (1) we would plot the Kth largest indegree vs. K; with (2) we would plot the number of nodes of indegree K against K.

The two directions are actually the same rule (under reasonable smoothness and monotonicity assumptions); if a distribution satisfies (1) with exponent A then it satisfies (2) with exponent B = 1 + 1/A. However, with approximate data (i.e. any actual data), the two fits can yield noticeably different exponents.
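
A short derivation of that relation, under the stated smoothness and monotonicity assumptions:

\[
y(K) = c\,K^{-A} \;\Longrightarrow\; \#\{i : y_i \ge y\} = K = (c/y)^{1/A}
\;\Longrightarrow\; \Bigl|\tfrac{dK}{dy}\Bigr| \;\propto\; y^{-(1+1/A)} ,
\]

so the number of items with a given value y falls off like y^{-(1+1/A)}, i.e. direction (2) holds with exponent B = 1 + 1/A.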

Why do you get power law distributions?

Stochastic model. Suppose that at every time step, a new page and a new link are created. With probability P, the link points to a page chosen uniformly at random; with probability 1-P, the link points to a page chosen at random weighted by its number of inlinks. Then for large values of I, the number of nodes with I inlinks follows a power-law distribution with exponent (2-P)/(1-P) (Herbert Simon). Since the observed exponent is 2.1, P = 0.09.
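
A sketch simulating this process; the number of steps is arbitrary, and the simulation only illustrates the shape of the resulting distribution:

import random

def simulate(steps=200_000, P=0.09):
    # Each time step adds one new page and one new link.  With probability P the
    # link's target is uniform over existing pages; otherwise the target is chosen
    # with probability proportional to its current number of inlinks.
    inlinks = [0]      # inlink count per page; one seed page to start
    targets = []       # one entry per existing link, so a uniform draw from this
                       # list is a draw weighted by inlink count
    for _ in range(steps):
        inlinks.append(0)                        # the new page
        if random.random() < P or not targets:
            t = random.randrange(len(inlinks))   # uniform choice
        else:
            t = random.choice(targets)           # preferential choice
        inlinks[t] += 1
        targets.append(t)
    return inlinks

counts = simulate()
for i in (1, 2, 4, 8, 16, 32, 64):
    print(i, sum(1 for c in counts if c == i))   # tail should look like a power law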

Of course this is not a plausible model of the Web (only one outlink per page), so there is a small cottage industry in constructing more plausible stochastic models of the Web. See Baldi et al. chap. 3 for a very extensive discussion.

Multiplicative model. If you have a large collection of identically distributed, independent, positive random variables and you multiply them, then you are just adding their logs; by the central limit theorem the sum of the logs is approximately normal, so the product follows what is called a log-normal distribution, which looks a lot like a power law in the middle range.
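
A tiny sketch of that argument; the factor distribution here is an arbitrary positive one:

import math, random

# Multiply many i.i.d. positive factors; the log of the product is a sum of
# i.i.d. terms, hence approximately normal, so the product is roughly log-normal.
def product_sample(n_factors=100):
    x = 1.0
    for _ in range(n_factors):
        x *= random.uniform(0.5, 2.0)
    return x

logs = [math.log(product_sample()) for _ in range(10_000)]
mean = sum(logs) / len(logs)
var = sum((v - mean) ** 2 for v in logs) / len(logs)
print(mean, var)   # the logs cluster around a normal with this mean and variance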

Information theoretic model (Mandelbrot). Can show that a power-law distribution of word frequencies maximizes information over cost.

Significance of these Web measurements

One implication for crawlers: you can reach about 50% of the web (SCC + OUT) from a single seed in SCC; getting all of the web requires an immense number of seeds.

Self-Similarity

Dill et al. made measurements over various subsets of the Web. Generally, the results were that many structural properties of the Web as a whole apply to significant subsets as well.

Subsets

Some others, but these are the most interesting.

Some Results

Different measurements for different subgraphs: Why?

STREAM1 ... 7:
Remarkable consistency as regards:

Single keyword sets: Sizes between 336,500 (baseball) and 831,700 (math).

Double keyword sets: Sizes between 7,400 and 44,000, and these show great variation; e.g. arcs/node vary from 1.98 (Math Geometry) to 13.33 (Baseball Yankees).

Failure of Self-Similarity

You might think that IN and OUT would each have a structure similar to the overall web, with one large SCC in the middle. However, this is not the case, according to Donato et al., Mining the inner structure of the Web graph. They report that in the Stanford WebBase crawl, though OUT has 53M pages, the largest SCC in OUT has only 9,349. Some other crawls have relatively larger SCCs (e.g. an SCC of 19,000 out of 11M for a crawl of Italian web sites), but none is anything like the relative size of the large SCC in the overall crawl. The largest WCC in OUT is around 30% of OUT, as contrasted with the WCC of the overall graph, which is 90% of the size of the graph. 90% of all nodes in OUT are within five links of the large SCC.

Bipartite graphs (hubs and authorities) in the Web

Trawling the web for emerging cyber-communities Ravi Kumar et al. 1999.

Look for small complete bipartite graphs on the Web: K_{i,j} for i = 3..6, j = 3, 5, 7, 9. Infer cyber-communities.

Technical issues: Exclude aliases, copies, near-copies, nepotistic cores (hubs from same website). There are lots of copies of Yahoo pages.

Algorithmics:
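
As an illustrative sketch only (a brute-force search on a toy graph, not the paper's pruning-based procedure), here is one way to find K_{3,3} cores, i.e. three hub pages that all link to the same three authority pages:

from itertools import combinations

def k33_cores(outlinks):
    # outlinks maps each page to the set of pages it links to.
    cores = []
    for hubs in combinations(outlinks, 3):
        common = set.intersection(*(set(outlinks[h]) for h in hubs))
        common -= set(hubs)                      # keep the two sides disjoint
        for auths in combinations(sorted(common), 3):
            cores.append((hubs, auths))
    return cores

# Toy example: pages a, b, c all link to x, y, z (plus some noise links).
graph = {"a": {"x", "y", "z"}, "b": {"x", "y", "z", "w"},
         "c": {"x", "y", "z"}, "d": {"x", "w"}}
print(k33_cores(graph))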

Results: About 200,000 bipartite graphs; almost all are coherent in terms of topic. Many are unknown to Yahoo. Hubs in a community last longer than average Web pages.

The Dynamic Web

Change in the Web

Anecdotal remark: A significant fraction of the links in my class notes for this course from five years ago are now dead. These are, of course, links to published scientific papers, which ought to be extremely stable. (This is mostly migration, of course; most of these papers are available somewhere on the web and can be found using Google.)

Types of change:

Accessibility of information on the web, Steve Lawrence and C. Lee Giles, Nature, July 8, 1999.

December 1997: Estimate at least 320 million pages on the Web.

Feb. 1999: Estimate of at least 800 million pages on the web. Test random IP addresses; attempt to exhaustively crawl each web site found; extrapolate. 2.8 million servers at an average of 289 pages per site = 800 million pages. An average of 18.7 KBytes per page (7.3 KBytes of textual content per page) = 15 TBytes (6 TBytes of textual content). An average of 62.8 images per server at a mean image size of 15.2 KBytes = 180 million images, 3 TBytes.
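
The extrapolation arithmetic, as a quick check:

servers = 2.8e6
pages = servers * 289                  # ~8.1e8, i.e. "at least 800 million pages"
total_bytes = pages * 18.7e3           # ~1.5e13 bytes, about 15 TBytes of HTML
text_bytes  = pages * 7.3e3            # ~5.9e12 bytes, about 6 TBytes of text
images = servers * 62.8                # ~1.8e8 images
image_bytes = images * 15.2e3          # ~2.7e12 bytes, about 3 TBytes of images
print(pages, total_bytes, text_bytes, images, image_bytes)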

The Evolution of the Web and Implications for an Incremental Crawler Junghoo Cho and Hector Garcia-Molina.

Method: Once a day, for four months, crawl 3000 pages at each of 270 "popular" sites (those that gave permission, out of 400 asked), using a breadth-first crawler from the root page, with a fixed seed set of sites. Thus, this is not a stable set either of content or of URLs.

Results:

A large scale study of the evolution of Web pages
Dennis Fetterly et al. Software -- Practice and Experience , vol. 34, 2004, pp. 213-237.

Method: Fix a set of 151 million HTML Web pages + 62 million non-HTML pages. Download once a week for 11 weeks. (Thus, strictly a URL-based perspective.)

Note that this method cannot detect the creation of a Web page during the experimental period, but it reliably detects deletion (unless the server fails to deliver the page for some other reason). By contrast, the method of Cho and Garcia-Molina can detect appearance, but both its reports of appearance and of disappearance are subject to many false positives, since a page may simply have come into or gone out of the 3000-page crawl window.

(Lots of technical detail on how you manage such a huge dataset effectively.)

Results:

Avg. length of an HTML page = 16KB; the distribution "looks like" a log-normal curve. 66.3% of documents have length between 4KB and 32KB. .com pages are a little longer, .edu pages a little shorter; the gap is smaller if you measure length in words, suggesting that the difference is that .edu pages have less HTML markup.

85% of URLs are downloadable over the entire 11-week experiment (.edu does better than .com and .net). An increasing number become unavailable due to robots.txt exclusions, probably a result of the experiment itself.

Web pages in .cn (China), .com, and .net expire sooner than average.

Similarity measure: Let U and V be two pages. Let S(U) and S(V) be the sets of all 5-word shingles in the text of U and V. Then similarity is measured as (I think) |S(U) intersect S(V)| / |S(U) union S(V)|. For efficiency, this is calculated using the random-function fingerprinting method discussed earlier.
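
A direct (non-fingerprinted) sketch of that measure:

def shingles(text, k=5):
    # Set of k-word shingles (consecutive word windows) in a text.
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def similarity(u_text, v_text):
    # Jaccard similarity of the 5-word shingle sets of two pages.
    su, sv = shingles(u_text), shingles(v_text)
    if not su and not sv:
        return 1.0
    return len(su & sv) / len(su | sv)

print(similarity("the quick brown fox jumps over the lazy dog",
                 "the quick brown fox jumps over a lazy dog"))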

Considering all pairs of successive downloads of the same page (1 week apart): 65.2% are identical. 9.2% differ only in HTML elements. 3% have similarity less than 1/3. 0.8% have 0 similarity.

Among pages that changed only in markup: 62% are changes to an attribute. 48% are changes to an attribute that follows ? or ;. Most of these are just changes to a session ID.

Observing link evolution of this type may help a crawler in spotting session identifiers. If a crawler were able to recognize embedded session identifiers and remove them, then it could avoid recrawling the same content multiple times. In our experience, such unwarranted recrawling accounts for a non-negligible fraction of total crawl activity.
A smaller fraction are advertisements, chosen by embedding some identifier in the query portion of a URL. etc.
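
A sketch of the kind of heuristic a crawler might apply; the parameter names treated as session identifiers below are assumptions for illustration, not a list taken from Fetterly et al.:

from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Query parameters assumed (for illustration) to carry session identifiers.
SESSION_PARAMS = {"sessionid", "session_id", "sid", "phpsessid", "jsessionid"}

def strip_session_ids(url):
    # Drop query parameters that look like session identifiers, so two crawls
    # of the same content map to the same canonical URL.
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if k.lower() not in SESSION_PARAMS]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))

print(strip_session_ids("http://example.com/page?item=3&sessionid=ABC123"))
# -> http://example.com/page?item=3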

Pages in .com change more frequently than in .gov or .edu.

Fastest changes were in .de (Germany): 27% of pages underwent a large or complete change every week, compared with 3% for the Web as a whole. Why?

Of the first half dozen pages we examined, all but one contained disjoint, but perfectly grammatical phrases of an adult nature together with a redirection to an adult Web site. It soon became clear that the phrases were automatically generated on the fly, for the purpose of "stuffing" search engines such as Google with topical keywords surrounded by sensible-looking context, in order to draw visitors to the adult Web site. Upon further investigation, we discovered that our data set contains 1.03 million URLs drawn from 116,654 hosts (4745 of them outside the .de domain) which all resolved to a single IP address. This machine provided over 15% of the .de URLs in our data set!

Point: (1) to circumvent the politeness policy of a search engine, which avoids downloading too many pages from one server at once; (2) to trick Google, by making links between these pages look non-nepotistic.

Large documents tend to change much more frequently than small ones.