Lecture 14: Dynamic Web

Temporal Web Repositories

Retrieve the past state of the Web. Issues:

Internet Archive

Historical archive, ideally of the entire state of the Internet; in reality, of some substantial number of significant sites. Intended as a permanent record. All formats of files.

"Way Back Machine" finds contemporary included files and hyperlinks.

Storage: Vast quantities of data. Large arrays of disk plus mag tape.

Permanence:

IProxy

Proxy-side archive. Web page archived when proxy requests it from server.
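As a rough sketch of the idea (not the actual IProxy implementation; the handler and the on-disk layout here are assumptions), a proxy can save a timestamped copy of every page it fetches on a client's behalf before passing it along:

```python
# Minimal sketch of proxy-side archiving (not the actual IProxy code):
# every page the proxy fetches for a client is also saved with a timestamp.
import hashlib
import time
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer
from pathlib import Path

ARCHIVE_DIR = Path("archive")   # hypothetical on-disk layout

class ArchivingProxy(BaseHTTPRequestHandler):
    def do_GET(self):
        url = self.path  # in a proxy request, the full URL arrives in the request line
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                body = resp.read()
        except Exception:
            self.send_error(502, "upstream fetch failed")
            return
        # Archive a copy, keyed by URL hash and fetch time.
        ARCHIVE_DIR.mkdir(exist_ok=True)
        key = hashlib.sha1(url.encode()).hexdigest()
        stamp = time.strftime("%Y%m%dT%H%M%S")
        (ARCHIVE_DIR / f"{key}-{stamp}.html").write_bytes(body)
        # Return the live copy to the client.
        self.send_response(200)
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("localhost", 8080), ArchivingProxy).serve_forever()
```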

Author-requested archiving

Managing versions of web documents in a transaction-time web server

Curtis Dyreson, Hui-Ling Lin, and Yingxia Wang, WWW-2004.

TTApache -- Server side archiving

Temporal model:
Sample queries:
Vacuuming strategies:
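TTApache itself extends the Apache server, but the underlying temporal model can be illustrated with a generic sketch: each version of a document carries a transaction-time interval [start, end), an "as of" query returns the version whose interval contains the requested time, and vacuuming discards versions superseded before some cutoff. The class and method names below are mine, not the paper's.

```python
# Generic sketch of a transaction-time version store (not TTApache's actual API).
# Each version is valid over a half-open transaction-time interval [start, end).
import time

NOW = float("inf")  # open-ended interval for the current version

class VersionedDocument:
    def __init__(self):
        self.versions = []  # list of (start, end, content)

    def update(self, content, at=None):
        t = at if at is not None else time.time()
        if self.versions:
            start, _, old = self.versions[-1]
            self.versions[-1] = (start, t, old)   # close the current version
        self.versions.append((t, NOW, content))

    def as_of(self, t):
        """Return the content that was current at transaction time t."""
        for start, end, content in self.versions:
            if start <= t < end:
                return content
        return None  # document did not exist (or was vacuumed) at time t

    def vacuum(self, cutoff):
        """Discard versions that were already superseded before `cutoff`."""
        self.versions = [v for v in self.versions if v[1] > cutoff]

# Example: two versions, then query the state as of a past time.
doc = VersionedDocument()
doc.update("v1", at=100)
doc.update("v2", at=200)
print(doc.as_of(150))   # -> "v1"
doc.vacuum(cutoff=200)  # drops the version that was superseded at time 200
print(doc.as_of(150))   # -> None after vacuuming
```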

Change in the Web

Anecdotal remark: A significant fraction of the links in my class notes for this course from two years ago are now dead. These are, of course, links to published scientific papers, which ought to be extremely stable. (This is mostly migration, of course; most of these papers are still available somewhere on the web and can be found using Google.)

Types of change:

Accessibility of information on the web, Steve Lawrence and C. Lee Giles, Nature, July 8, 1999.

December 1997: Estimate at least 320 million pages on the Web.

Feb. 1999: Estimate at least 800 million pages on the web. Method: test random IP addresses; attempt to exhaustively crawl each web site found; extrapolate. 2.8 million servers. Avg. of 289 pages per site = 800 million pages. Avg. of 18.7 KBytes per page / 7.3 KBytes of textual content per page = 15 TBytes / 6 TBytes of textual content. Avg. of 62.8 images per server, mean image size of 15.2 KBytes = 180 million images, 3 TBytes.
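The extrapolation is straightforward arithmetic; the snippet below recomputes the reported totals from the per-server averages quoted above (rounding differs slightly from the published figures).

```python
# Reproducing the Lawrence & Giles (1999) extrapolation from per-server averages.
servers = 2.8e6                    # estimated number of web servers
pages_per_site = 289               # average pages per site
page_kb, text_kb = 18.7, 7.3       # avg KB per page: total / textual content only
images_per_server, image_kb = 62.8, 15.2

pages = servers * pages_per_site
print(f"pages:      {pages / 1e6:.0f} million")        # ~809 million
print(f"total size: {pages * page_kb / 1e9:.1f} TB")   # ~15 TB
print(f"text only:  {pages * text_kb / 1e9:.1f} TB")   # ~6 TB

images = servers * images_per_server
print(f"images:     {images / 1e6:.0f} million, "
      f"{images * image_kb / 1e9:.1f} TB")             # ~176 million, ~2.7 TB
```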

The Evolution of the Web and Implications for an Incremental Crawler, Junghoo Cho and Hector Garcia-Molina.

Method: Once a day, for four months, crawl 3000 pages at each of 270 "popular" sites (those that gave permission, out of 400 asked), using a breadth-first crawler from the root page; i.e., a fixed seed set. Thus the pages crawled are not a stable set, either of content or of URLs.
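As an illustration of the method (not the authors' actual crawler), a breadth-first crawl from a site's root page with a fixed page budget might look like this:

```python
# Sketch of a bounded breadth-first site crawl: start at the root page and
# stop after `limit` pages on the same host. URL handling is simplified.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
import urllib.request

class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def bfs_crawl(root, limit=3000):
    host = urlparse(root).netloc
    seen, queue, pages = {root}, deque([root]), {}
    while queue and len(pages) < limit:
        url = queue.popleft()
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                html = resp.read().decode("utf-8", errors="replace")
        except Exception:
            continue                      # dead link or fetch error: skip
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)
            if urlparse(absolute).netloc == host and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return pages

# pages = bfs_crawl("http://example.com/", limit=3000)
```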

Results:

A large scale study of the evolution of Web pages
Dennis Fetterly et al., Software -- Practice and Experience, vol. 34, 2004, pp. 213-237.

Method: Fix a set of 151 million HTML Web pages + 62 million non-HTML pages. Download once a week for 11 weeks. (Thus, strictly a URL-based perspective.)

Note that this method cannot detect the creation of a Web page during the experimental period, but it reliably detects deletion (unless the server fails to deliver the page for some other reason). By contrast, the method of Cho and Garcia-Molina can detect appearance, but both its reports of appearance and of disappearance are subject to many false positives, since a page may simply have come into, or gone out of, the 3000-page crawl window.

(Lots of technical detail on how you manage such a huge dataset effectively.)

Results:

Avg. length of an HTML page = 16 KB. The distribution "looks like" a log-normal curve. 66.3% of documents have length between 4 and 32 KB. .com pages are a little longer, .edu pages a little shorter; the lengths are closer if you measure in words, suggesting that the difference is that .edu pages have less HTML markup.

85% of URLs are downloadable over the entire 11-week experiment (.edu better than .com and .net). An increasing number become unavailable due to robots.txt exclusions, probably a result of the experiment itself.

Web pages in .cn (China), .com, and .net expire sooner than average.

Similarity measure: Let U and V be two pages, and let S(U) and S(V) be the sets of all 5-word shingles in the text of U and V. Then similarity is measured (I think) as |S(U) intersect S(V)| / |S(U) union S(V)|. For efficiency, this is calculated using the random-function fingerprinting method discussed earlier.
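In code, this is just the Jaccard coefficient over 5-word shingles; the sketch below computes it exactly, whereas the paper approximates it by fingerprinting a sample of shingles.

```python
# Jaccard similarity over 5-word shingles, computed exactly.
# (Fetterly et al. approximate this with fingerprinting / shingle sampling.)
def shingles(text, k=5):
    words = text.split()
    return {tuple(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}

def similarity(u_text, v_text, k=5):
    su, sv = shingles(u_text, k), shingles(v_text, k)
    if not su and not sv:
        return 1.0
    return len(su & sv) / len(su | sv)

a = "the quick brown fox jumps over the lazy dog"
b = "the quick brown fox jumps over the lazy cat"
print(similarity(a, b))   # 4 of 6 distinct shingles shared -> 0.666...
```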

Considering all pairs of successive downloads of the same page (1 week apart): 65.2% are identical. 9.2% differ only in HTML elements. 3% have similarity less than 1/3. 0.8% have 0 similarity.

Among pages that changed only in markup: 62% are changes to an attribute. 48% are changes to an attribute that follows ? or ;. Most of these are just changes to a session ID.

From the paper: "Observing link evolution of this type may help a crawler in spotting session identifiers. If a crawler were able to recognize embedded session identifiers and remove them, then it could avoid recrawling the same content multiple times. In our experience, such unwarranted recrawling accounts for a non-negligible fraction of total crawl activity."
A smaller fraction are advertisements, chosen by embedding some identifier in the query portion of a URL, etc.
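A crawler acting on this observation might normalize URLs by stripping query parameters that look like session identifiers before deciding whether to fetch; the parameter-name list in this sketch is a guess for illustration, not taken from the paper.

```python
# Sketch: normalize a URL by dropping query parameters that look like
# session identifiers. The parameter-name list is hypothetical.
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

SESSION_PARAMS = {"sessionid", "sid", "phpsessid", "jsessionid", "s"}

def strip_session_ids(url):
    scheme, netloc, path, query, frag = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(query, keep_blank_values=True)
            if k.lower() not in SESSION_PARAMS]
    return urlunsplit((scheme, netloc, path, urlencode(kept), frag))

print(strip_session_ids("http://example.com/page?id=42&sid=a8f3c2"))
# -> http://example.com/page?id=42
```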

Pages in .com change more frequently than in .gov or .edu.

Fastest changes were in .de (Germany): 27% of pages underwent a large or complete change every week, compared with 3% for the Web as a whole. Why?

From the paper: "Of the first half dozen pages we examined, all but one contained disjoint but perfectly grammatical phrases of an adult nature, together with a redirection to an adult Web site. It soon became clear that the phrases were automatically generated on the fly, for the purpose of 'stuffing' search engines such as Google with topical keywords surrounded by sensible-looking context, in order to draw visitors to the adult Web site. Upon further investigation, we discovered that our data set contains 1.03 million URLs drawn from 116,654 hosts (4745 of them outside the .de domain) which all resolved to a single IP address. This machine provided over 15% of the .de URLs in our data set!"

The point: (1) to circumvent the politeness policy of search engines, which avoid downloading too many pages from one server at once; (2) to trick Google, by making the links between these pages look non-nepotistic.

Large documents tend to change much more frequently than small ones.