Chakrabarti, sec. 7.6
Graph Structure in the Web, Andrei Broder et al., 2000.
The Indexable Web is more than 11.5 Billion Pages, A. Gulli and A. Signorini.
Pierre Baldi, Paolo Frasconi, and Padhraic Smyth, Modeling the Internet and the Web, chap. 3.
Self-Similarity in the Web, Stephen Dill et al., 2001.
Trawling the web for emerging cyber-communities, Ravi Kumar et al., 1999.
10 million WCCs (weakly connected components), almost all of size 1 (an isolated page with no inlinks or outlinks).
Two interesting points:
1. The seed set for AltaVista must be at least 10 million (actually, presumably, much larger). Creating a huge seed set is almost as important as crawling.
2. How on earth does AltaVista find URLs for 10 million isolated pages?
The connectivity of the WCC is not dependent on the existence of pages of high in-degree.
If all pages of indegree >= k are removed, the WCC still has size W:

k | 1000 | 100 | 10 | 5 | 4 | 3 |
---|---|---|---|---|---|---|
W (million) | 177 | 165 | 105 | 59 | 41 | 15 |

Serrano et al. report that the relative sizes of IN, OUT, and SCC depend strongly on the particular crawl, especially IN, as one might expect.
(Comparable studies gave figures of 200 million for 1997 and 800 million for 1999. Accessibility of information on the web, Steve Lawrence and C. Lee Giles, Nature 400, 1999, pp. 107-109.)
Comparative size: Let A and B be the sets of pages indexed by two different search engines. Let P(B|A) be the probability that a randomly chosen page in A is also indexed by B. Then P(B|A) is approximately |A ∩ B|/|A| and P(A|B) is approximately |A ∩ B|/|B|, so |A|/|B| = P(A|B) / P(B|A).
Moreover, in 2005 the search engines still published a figure for |A|. If you know |A| and |B| and you assume that A and B are independent samples of the space W of indexable web pages, then P(A|B) = P(A|W) = |A|/|W|, so |W|=|A|/P(A|B).
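The arithmetic above can be sketched directly; the index size and overlap probabilities below are hypothetical values chosen for illustration, not figures from the experiment.

```python
def relative_size(p_a_given_b, p_b_given_a):
    # |A|/|B| = P(A|B) / P(B|A)
    return p_a_given_b / p_b_given_a

def indexable_web_size(size_a, p_a_given_b):
    # If A is an independent random sample of the indexable web W,
    # then P(A|B) = P(A|W) = |A|/|W|, so |W| = |A| / P(A|B).
    return size_a / p_a_given_b

# Hypothetical numbers: engine A publishes |A| = 8 billion; a random
# page of B is in A with probability 0.5, and vice versa 0.25.
print(relative_size(0.5, 0.25))      # |A| is twice |B|: prints 2.0
print(indexable_web_size(8e9, 0.5))  # |W| estimate: 16 billion
```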
So there are two issues: randomly sampling a page indexed by a search engine, and checking whether a given page is indexed by another search engine.
Random sampling. Extract 2,000,000 terms in 75 languages from DMOZ. Sort by occurrence; divide into blocks of 20 terms; extract one query term per block for each of four search engines; thus, 440,000 one-term queries: 110,000 queries each for Google, MSN, Yahoo, and Ask/Teoma. Pick a random page from the first 100 pages returned.
Checking: Check directly on normalized URL.
BTW, this huge data set is completely unnecessary. The experiment would have given essentially the same results with 5000 queries per search engine or even fewer, as one would expect. (Using a larger data set does not usually solve the problem of systematic bias.)
P(Google|X) = 68.2% (X is any other search engine)
P(Yahoo|X) = 59.1%
P(MSN|X) = 49.2%
P(Ask/Teoma|X) = 43.5%.
Assuming that the search engines index random samples of the web, these figures are also the fractions of the indexable web that each engine indexes.
Note: the overlap here has gone way up since 1998, when Lawrence and Giles estimated that no search engine indexed more than 16% of the web. So either the coverage has improved, or the correlation between different search engines has increased. In any case, the value of doing searches in multiple different general search engines (either manually or using a meta-search engine) has gone down. If a page is indexed by any of these search engines, there is a better than 2/3 chance that it is indexed by Google.
The following Web quantities follow an inverse power distribution:
Baldi et al. sec. 1.7.
A Brief History of Generative Models for Power Law and Lognormal Distributions Michael Mitzenmacher, 2001.
References on Zipf's Law Wentian Li. Large bibliography.
Zipf's Law, Benford's Law. Readable discussion. Benford's Law is the even stranger observation that the first digit D of numbers that come up in practice is distributed with frequency log_{10}(1+1/D).
For an argument that true inverse power distributions are rarer than claimed see "Strong, Weak, and False Inverse Power Laws," by Richard Perline, Statistical Science, vol. 20, 2005, pp. 68-99.
X is a random variable.
X.f is a property of X whose value is a positive integer.
In an inverse power distribution,
Prob(X.f = i) = C / i^{A} for A > 1.
(C is a constant equal to 1 over the sum from k=1 to infinity of 1/k^{A}.)
If A <= 1, then the above sum diverges, so one has to impose an upper bound M on the values.
If a collection O of N items is chosen according to the above distribution, the expected number of items Q such that Q.f = i is N*C/i^{A}. Conversely, O fits the inverse power distribution if the above condition holds over O.
As before, if A <= 1, then the set of possible values has to be finite, of some size L.
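A quick numerical check of this normalization; the cutoff M here is an implementation convenience (for A > 1 it merely approximates the convergent infinite sum, while for A <= 1 it is the genuinely required upper bound):

```python
def power_law_pmf(A, M=100_000):
    # C = 1 / (sum_{k=1..M} 1/k^A).  For A > 1 this approximates the
    # convergent infinite sum; for A <= 1 the sum diverges, so the
    # upper bound M is required.
    Z = sum(1 / k ** A for k in range(1, M + 1))
    C = 1 / Z
    return lambda i: C / i ** A

pmf = power_law_pmf(2.0)
# For A = 2 the infinite sum is pi^2/6 = 1.6449..., so C is about 0.608.
print(pmf(1))
```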
Note: The power distribution is often called the "power law," and the Zipf distribution is generally called "Zipf's law," but the only "law" is that a lot of things follow this distribution. They are also called "long tail" or "fat tail" distributions, because the probability of extreme values decreases much more slowly than with other distributions such as the exponential or the normal.
For example: Suppose there are 1,000,000,000 web pages, an average of 7 links per page, and the distribution of inlinks goes like 1/(N+4)^{2} (4 is the offset).
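The expected counts under this example distribution can be computed directly. The sketch below normalizes over a finite range (a cutoff chosen for convenience), so the figures are illustrative:

```python
N_PAGES = 1_000_000_000

def expected_pages(n, offset=4, a=2.0, cutoff=1_000_000):
    # Prob(a page has n inlinks) is proportional to 1/(n + offset)^a;
    # normalize over n = 0 .. cutoff-1 (for a = 2 the tail beyond
    # the cutoff is negligible).
    Z = sum(1 / (k + offset) ** a for k in range(cutoff))
    return N_PAGES * (1 / (n + offset) ** a) / Z

print(round(expected_pages(0)))    # pages with no inlinks: about 220 million
print(round(expected_pages(100)))  # pages with 100 inlinks: about 326,000
```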
By contrast, consider two different stochastic models, which lead to very different distributions.
Method 1
for I := 1 to L {
    choose a random page U;
    choose a random page V;
    place a link from U to V;
}

Here L = 7 billion, the total number of links. The number of inlinks follows a binomial distribution. Then there is probably no page with more than 30 inlinks, and the expected number of pages with no inlinks is about 900,000. The probability that there exists a page with 100 inlinks is about 10^{-63}. The probability that there exists a page with 280,000,000 inlinks is about 10^{-2 billion}.
Method 2
for each page V do {
    loop {
        flip a coin with weight 7 for heads, 1 for tails;
        if tails, exit the loop;
        choose a random page U;
        place a link from U to V;
    }
}

The total number of links will be very nearly 7 billion, to within about 0.0002%. The number of inlinks follows a geometric distribution, with an exponentially decreasing tail. The expected number of pages with 100 inlinks is about 28. The probability that there exists any page with 280,000,000 inlinks is about 10^{-18 million}.
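The contrast between the two models shows up even in a scaled-down simulation (100,000 pages rather than a billion; the scale and the seed are my choices):

```python
import random

random.seed(0)

N = 100_000   # pages (scaled down from 10^9)
L = 7 * N     # total links: 7 per page on average

# Method 1: each link chooses its destination uniformly at random,
# so a page's inlink count is binomial (approximately Poisson(7)).
inlinks1 = [0] * N
for _ in range(L):
    inlinks1[random.randrange(N)] += 1

# Method 2: each page keeps receiving links while a coin weighted
# 7:1 for heads keeps coming up heads: a geometric count, mean 7.
inlinks2 = [0] * N
for v in range(N):
    while random.random() < 7 / 8:
        inlinks2[v] += 1

# The binomial tail is short; the geometric tail is much longer.
print(max(inlinks1), max(inlinks2))
```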
Corpus size | Number of distinct words | Success rate | Error rate |
---|---|---|---|
10,000 | 5700 | 0.64 | 0.36 |
20,000 | 13,000 | 0.68 | 0.32 |
50,000 | 23,000 | 0.73 | 0.27 |
100,000 | 41,000 | 0.77 | 0.23 |
200,000 | 73,000 | 0.81 | 0.19 |
500,000 | 150,000 | 0.86 | 0.14 |
1,000,000 | 260,000 | 0.90 | 0.10 |
Thus, over this range, doubling the corpus size increases the success rate by about 4 percentage points. With larger vocabularies, the effect is even starker.
(Formula: The harmonic function H(n) = sum_{k=1}^{n} 1/k = (approximately) ln(n) + Euler's constant 0.5772. Let V be the size of the vocabulary. The Kth most common word occurs with frequency 1/(K*H(V)). Assume, for simplicity, that an N-word corpus is filled with the W most frequent words, appearing according to the specified frequencies. All of the words of rank less than W0 = N/H(V) appear at least once; these account for N0 = N*H(W0)/H(V) words in the corpus. The remainder of the corpus is filled in with less frequent words that occur exactly once; there are N-N0 of these. Thus, the total number of different words in the corpus is W = W0+(N-N0). The probability that the next word will have already been seen is H(W)/H(V).)
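The formula reproduces the table above. The vocabulary size V = 1,000,000 below is my assumption, chosen because it matches the tabulated figures:

```python
import math

EULER = 0.5772156649

def H(n):
    # Harmonic number, approximated as ln(n) + Euler's constant.
    return math.log(n) + EULER

def corpus_stats(N, V=1_000_000):
    W0 = N / H(V)                  # ranks expected to appear at least once
    N0 = N * H(W0) / H(V)          # tokens those words account for
    W = W0 + (N - N0)              # distinct words in the corpus
    return round(W), H(W) / H(V)   # vocabulary seen, success rate

print(corpus_stats(10_000))     # about (5700, 0.64), matching the table
print(corpus_stats(1_000_000))  # about (255000, 0.90)
```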
This is one of the major reasons that promising prototypes in intelligent systems often cannot be extended to high-quality systems; it is just much more work to get a 90% success rate than a 70% success rate. (Another major reason is combinatorial explosion.)
If A <= 2 in an inverse power distribution, then the mean is infinite. If A <= 3, then the variance and the standard deviation are infinite. Conversely, if you're doing statistical analysis of something and you find that as you take more and more samples your estimate of the mean or the standard deviation keeps getting bigger and bigger, you might consider whether you're looking at an inverse power distribution.
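A small illustration of that diagnostic, using the continuous inverse power distribution with A = 2 (density proportional to 1/x^2 on x >= 1, sampled as 1/U for uniform U):

```python
import random

random.seed(1)

# X = 1/U has density 1/x^2 on [1, infinity): an inverse power
# distribution with A = 2, so its mean is infinite.
total, checkpoints = 0.0, {}
for i in range(1, 1_000_001):
    total += 1.0 / random.random()
    if i in (100, 10_000, 1_000_000):
        checkpoints[i] = total / i

# With an infinite mean, the running average drifts upward
# (roughly like ln n) instead of converging to a limit.
print(checkpoints)
```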
To find the optimal inverse power distribution matching a data set, plot the data set on a log-log graph. If y = C/(x+D)^{A} then log(y) = log C - A log(x+D). So if you graph log(y) against log(x+D), you get a straight line with slope -A and y-intercept log C. The value of D of course affects the graph only for small values of x, so find A and C by fitting the line over large values of x using any standard linear regression (e.g. least squares); then, with those fixed values of A and C, do a binary search to find the value of D that best fits the small values of x.
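A sketch of that procedure. The split point between "large" and "small" x and the search range for D are arbitrary choices here, and I use a grid search over D rather than a true binary search:

```python
import math

def linfit(xs, ys):
    # Ordinary least-squares regression: returns (slope, intercept).
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

def fit_power_law(data, split=50):
    # Step 1: on large x, D hardly matters, so regress log y on log x.
    big = [(x, y) for x, y in data if x >= split]
    slope, intercept = linfit([math.log(x) for x, _ in big],
                              [math.log(y) for _, y in big])
    A, C = -slope, math.exp(intercept)
    # Step 2: with A and C fixed, search D to minimize the squared
    # log-error on the small-x points.
    small = [(x, y) for x, y in data if x < split]
    err = lambda D: sum((math.log(C / (x + D) ** A) - math.log(y)) ** 2
                        for x, y in small)
    D = min((d / 10 for d in range(0, 101)), key=err)
    return A, C, D

# Synthetic data drawn from y = 1000 / (x + 4)^2.5:
data = [(x, 1000 / (x + 4) ** 2.5) for x in range(1, 200)]
A, C, D = fit_power_law(data)
```

Note that regressing against log(x) rather than log(x+D) biases A slightly low, which is exactly why the small-x correction via D is a separate step.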
The two directions are actually the same rule (under reasonable smoothness and monotonicity assumptions); if a distribution satisfies (1) with exponent A then it satisfies (2) with exponent B=1+1/A. However, with approximate data (i.e. any actual data)
Of course this is not a plausible model of the Web (only one outlink per page), so there's a small cottage industry in constructing more plausible stochastic models of the Web. See Baldi et al. chap. 3 for a very extensive discussion.
Multiplicative model. If you have a large collection of identically distributed independent random variables, and you multiply them, then you are just adding the logs, so the sum of the logs follows a normal curve, so you get what's called a log-normal distribution, which looks a lot like a power law in the middle range.
Information theoretic model (Mandelbrot). Can show that a power-law distribution of word frequencies maximizes information over cost.
STREAM1 ... 7:
Remarkable consistency as regards:
Single keyword sets: Sizes between 336,500 (baseball) and 831.7 (math).
Double keyword sets: Sizes between 7400 and 44,000 show great variation (e.g., arcs/node vary from 1.98 (Math Geometry) to 13.33 (BASEBALL YANKEES)).
Trawling the web for emerging cyber-communities Ravi Kumar et al. 1999.
Look for small complete bipartite graphs on Web: K_{i,j}, for i=3..6, j=3,5,7,9. Infer cyber-communities.
Technical issues: Exclude aliases, copies, near-copies, nepotistic cores (hubs from same website). There are lots of copies of Yahoo pages.
Algorithmics:
Results: About 200,000 bipartite graphs; almost all are coherent in terms of topics. Many are unknown to Yahoo. Hubs in a community last longer than average Web pages.
Types of change:
December 1997: Estimate at least 320 million pages on the Web.
Feb. 1999: Estimate at least 800 million pages on the web. Test random IP addresses; attempt to exhaustively crawl each web site found; extrapolate. 2.8 million servers. Avg. of 289 pages per site = 800 million pages. Avg. of 18.7 KBytes per page / 7.3 KBytes of textual content per page = 15 TBytes / 6 TBytes of textual content. Avg. of 62.8 images per server, mean image size of 15.2 KBytes = 180 million images, 3 TBytes.
The Evolution of the Web and Implications for an Incremental Crawler Junghoo Cho and Hector Garcia-Molina.
Method: Once a day, for four months, crawl 3000 pages at each of 270 "popular" sites (those that gave permission, out of 400 asked) using a breadth-first crawler from the root page; a fixed seed set. Thus, not a stable set either of content or of URL's.
Results:
Method: Fix a set of 151 million HTML Web pages + 62 million non-HTML pages. Download once a week for 11 weeks. (Thus, strictly a URL-based perspective.)
Note that this method cannot detect the creation of a Web page during the experimental period, but reliably detects its deletion (except if the server fails to deliver it for some other reason.) By contrast the method of Cho and Garcia-Molina can detect appearance, but both its reports of appearance and disappearance are subject to many false positives, as the page may simply have come into or gone out of the 3000 page crawler window.
(Lots of technical detail on how you manage such a huge dataset effectively.)
Results:
Avg. length of an HTML page = 16KB. The distribution "looks like" a log-normal curve. 66.3% of documents have length between 4 and 32KB. .com pages are a little longer, .edu pages a little shorter; the figures are closer if you measure length in words, suggesting that the difference is that .edu pages have less HTML markup.
85% of URL's are downloadable over the entire 11-week experiment (.edu better than .com, .net). An increasing number become unavailable due to robots.txt exclusions, probably a result of the experiment itself.
Web pages in .cn (China), .com, and .net expire sooner than average.
Similarity measure: Let U and V be two pages. Let S(U) and S(V) be the set of all 5-word shingles in text of U and V. Then similarity is measured as (I think) |S(U) intersect S(V)| / |S(U) union S(V)|. For efficiency, this is calculated using the random function fingerprinting method discussed earlier.
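The measure is straightforward to state exactly; this computes the plain Jaccard similarity on 5-word shingles, which the fingerprinting method in the paper only approximates for efficiency:

```python
def shingles(text, k=5):
    # All k-word shingles (contiguous word sequences) in the text.
    words = text.split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def similarity(u, v, k=5):
    # Jaccard similarity: |S(U) intersect S(V)| / |S(U) union S(V)|.
    su, sv = shingles(u, k), shingles(v, k)
    if not su and not sv:
        return 1.0
    return len(su & sv) / len(su | sv)
```

For instance, changing the last word of a six-word page changes one of its two shingles, giving similarity 1/3.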
Considering all pairs of successive downloads of the same page (1 week apart): 65.2% are identical. 9.2% differ only in HTML elements. 3% have similarity less than 1/3. 0.8% have 0 similarity.
Among pages that changed only in markup: 62% are changes to an attribute. 48% are changes to an attribute that follows ? or ;. Most of these are just changes to a session ID.
"Observing link evolution of this type may help a crawler in spotting session identifiers. If a crawler were able to recognize embedded session identifiers and remove them, then it could avoid recrawling the same content multiple times. In our experience, such unwarranted recrawling accounts for a non-negligible fraction of total crawl activity."

A smaller fraction are advertisements, chosen by embedding some identifier in the query portion of a URL, etc.
Pages in .com change more frequently than in .gov or .edu.
The fastest changes were in .de (Germany): 27% of pages underwent a large or complete change every week, compared with 3% for the Web as a whole. Why?
Of the first half dozen pages we examined, all but one contained disjoint, but perfectly grammatical phrases of an adult nature together with a redirection to an adult Web site. It soon became clear that the phrases were automatically generated on the fly, for the purpose of "stuffing" search engines such as Google with topical keywords surrounded by sensible-looking context, in order to draw visitors to the adult Web site. Upon further investigation, we discovered that our data set contains 1.03 million URLs drawn from 116,654 hosts (4745 of them outside the .de domain) which all resolved to a single IP address. This machine provided over 15% of the .de URL's in our data set!
Point: (1) to circumvent the politeness policy of search engines of not downloading too many pages from one server at once; (2) to trick Google, by making links between these pages look non-nepotistic.
Large documents tend to change much more frequently than small ones.