Lecture 14: Dynamic Web
Temporal Web Repositories
Retrieve past state of web.
Historical archive, ideally of the entire state of the internet; in reality,
of some substantial number of significant sites. Intended as a permanent
record. All formats of files.
- Only static pages. (Dynamic pages depend on arbitrary state that is not preserved in the archive.)
- Included files must be archived as well. The state of a "page" at
one time is the state of the main file plus the states of the included files.
- Hyperlinks: If page P at date D points to Q, and you click on it,
do you want the version of Q at date D or the current version of Q?
- Date indeterminacy. Collection must be observational;
that is, new versions are collected from time to time, not every time the
page is modified. Often, the creation date of a version cannot be determined;
one knows only the earliest date it was collected. All you can ever know
about its ending is the last day it was collected. Thus, between the last
day on which version V1 was collected and the earliest establishable date
for the next version, one can only assume that V1 was the active
version. Versions that exist only between collections are, of course, missed entirely (see the sketch below).
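A minimal sketch of this bookkeeping (hypothetical data structures, not taken
from any of the systems discussed below): each version is known over the interval
from its first to its last collection, and can only be assumed in the gap before
the next version is first seen.

from datetime import date

def version_intervals(observations):
    # observations: (collection_date, content) pairs. Versions are runs of
    # identical content; each version is *known* between its first and last
    # observation, and only *assumed* in the gap before the next version.
    intervals = []
    for when, content in sorted(observations):
        if intervals and intervals[-1][0] == content:
            # Same content as last time: extend the known interval.
            intervals[-1] = (content, intervals[-1][1], when)
        else:
            # A new version, first observed on this date.
            intervals.append((content, when, when))
    return intervals

obs = [(date(2003, 9, 1), "v1"), (date(2003, 9, 8), "v1"), (date(2003, 9, 22), "v2")]
print(version_intervals(obs))
# v1 is known from 1 Sep to 8 Sep; v2 is known only on 22 Sep. In between,
# v1 can only be assumed, and a version existing only in that gap is missed.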
- Significant change: Is it worth recording small changes?
Tradeoff of time vs. space. Options for what to store per version:
- Change from previous version. (With this option, intermediate states cannot
be eliminated later, since each delta depends on the one before.)
- Change from original version.
- Complete state of each version.
Mixed strategy: checkpointing at regular intervals or whenever the accumulated
change is large. (A toy sketch of the tradeoff follows this list.)
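A toy illustration of the tradeoff (a hypothetical whole-line delta format, not
any system's actual one): with deltas against the previous version, rebuilding a
version requires replaying every intermediate delta, so middle states cannot be
vacuumed; with deltas against the original, any version needs only two files.

def diff(old, new):
    # Record the lines of `new` that differ from `old`, keyed by line number.
    # (A real archive would use a proper diff; this only shows the dependencies.)
    delta = {i: line for i, line in enumerate(new) if i >= len(old) or old[i] != line}
    delta["len"] = len(new)
    return delta

def patch(old, delta):
    return [delta.get(i, old[i] if i < len(old) else "") for i in range(delta["len"])]

v1 = ["headline A", "story 1"]
v2 = ["headline B", "story 1"]
v3 = ["headline B", "story 2"]

# Deltas from the previous version: v3 needs v1 plus *every* intermediate delta.
d12, d23 = diff(v1, v2), diff(v2, v3)
assert patch(patch(v1, d12), d23) == v3    # d12 (the middle state) cannot be discarded

# Deltas from the original version: v3 needs only v1 and a single delta.
d13 = diff(v1, v3)
assert patch(v1, d13) == v3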
- Query language
"Way Back Machine" finds contemporary included files and hyperlinks.
Storage: Vast quantities of data. Large arrays of disk plus mag tape.
- Against accident.
- Against physical deterioration. Migrate tapes every year.
- Against impermanent software infrastructure. Vast quantities of
data from the 1960s, 70s, and even 80s are now forever lost, because reading
them required software that no longer exists anywhere.
Proxy-side archive. Web page archived when proxy requests it from server.
Curtis Dyreson, Hui-Ling Lin, and Yingxia Wang, WWW-2004.
TTApache -- Server side archiving
- Current version archived when requested from server.
- HTTP-compatible queries. Dates etc. are appended to the URL as arguments,
so queries are backward compatible with other servers (which just return the current version).
- Link rewriting. User may choose to follow any date of link.
- Assumed version. User may choose either to accept or reject
a version that is assumed to have existed at a given date.
- Support for a range of vacuuming strategies.
- Efficient. Slows down server only slightly.
- Distinguish pages whose state is known on a given date from those
whose state is assumed, as above.
- A page version is the state of the main page file and all
the included files. If any included file changes, it is a new version of
the page. The state of the page at a given date is known, rather than
assumed, only if the state of all the associated files is known.
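A sketch of that rule (hypothetical data layout; TTApache's internal tables are not
shown here): the state of a page at date D counts as known only if the main file and
every included file has some version whose known interval covers D.

from datetime import date

# For each file, the intervals over which each of its versions is *known*.
known_versions = {
    "sports.html": [(date(2003, 9, 1), date(2003, 9, 20))],
    "banner.gif":  [(date(2003, 9, 1), date(2003, 9, 10))],
    "scores.css":  [(date(2003, 9, 5), date(2003, 9, 30))],
}

def page_state(main_file, included, when):
    # "known" only if every component file has a version known to cover `when`.
    for f in [main_file] + included:
        if not any(first <= when <= last for first, last in known_versions[f]):
            return "assumed"
    return "known"

print(page_state("sports.html", ["banner.gif", "scores.css"], date(2003, 9, 8)))   # known
print(page_state("sports.html", ["banner.gif", "scores.css"], date(2003, 9, 15)))  # assumed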
- sports.html or sports.html?now -- Current version.
- sports.html?pre -- Previous version. Clicking on links will give
current version of linked pages.
- sports.html?pre,timeOf -- Previous version of sports.html
Clicking on links will give the linked pages
in the state that they were in at the time of that version (? at the
earliest time that version is known, the last time that version is known,
or the last time that version is assumed? Presumably the first.)
- sports.html?pre,pre -- Previous version of sports.html.
Clicking on link gives linked page in state previous to the current time.
- sports.html?26-Sep-2003. Return the state of sports.html as of
that date if it is known. If it is only assumed, return 404 error.
(Can change this default.)
- sports.html?assumed.26-Sep-2003. Return the state of the page whether
it is known or assumed.
- sports.html?26-Sep-2003.next. Return the next state of the page after the specified date.
- sports.html?history(1-Jan-2003, 31-Dec-2003). Return history of
page between dates. Result is a list of links to known and assumed pages.
- sports.html?vacuum(t-window,begin,begin+365). Vacuum states
of sports.html for the first year after its creation.
- sports/?vacuum(v-window,1,2). Vacuum every odd-numbered version
of every file in sports. (Discrepancy in interpretation of arguments.)
(A parsing sketch follows this list of examples.)
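A rough sketch of how a server might interpret the query part of such URLs (the
argument names follow the examples above; the parsing code itself is hypothetical,
not taken from TTApache, and handles only a few of the forms listed):

from datetime import datetime

def parse_temporal_query(query):
    # Returns a dict describing the requested version and link behavior for
    # queries such as "", "now", "pre,timeOf", "26-Sep-2003", "assumed.26-Sep-2003".
    req = {"version": "now", "links": "now", "accept_assumed": False}
    if not query or query == "now":
        return req
    parts = query.split(",")
    head = parts[0]
    if head == "pre":
        req["version"] = "previous"
    elif head.startswith("assumed."):
        req["accept_assumed"] = True
        req["version"] = datetime.strptime(head[len("assumed."):], "%d-%b-%Y").date()
    else:
        req["version"] = datetime.strptime(head, "%d-%b-%Y").date()
    if len(parts) > 1:
        # "timeOf": linked pages as of the requested version's time;
        # "pre": linked pages in the state previous to the current time.
        req["links"] = parts[1]
    return req

print(parse_temporal_query("pre,timeOf"))
print(parse_temporal_query("26-Sep-2003"))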
- Automatic vs. user-specified. The automatic strategy can be set separately for different documents or directories.
- Periodic sieve: Keep every kth version.
- Version-window sieve: Keep only 5 most recent versions.
- Time-window sieve: Keep only past year.
- Percent difference sieve: Keep new version only if it is at least
K% different from previous version.
- If the version specified in a query has been vacuumed, redirect to the
previous version / next version / return an error message.
(A sketch of these sieve strategies follows.)
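A minimal sketch of these sieves as filters over a stored version list (hypothetical
helper functions; the real TTApache vacuuming operates on its internal archive):

def periodic_sieve(versions, k):
    # Keep every k-th version, plus the most recent one.
    return [v for i, v in enumerate(versions) if i % k == 0 or i == len(versions) - 1]

def version_window_sieve(versions, n=5):
    # Keep only the n most recent versions.
    return versions[-n:]

def time_window_sieve(versions, cutoff):
    # Keep only versions last observed on or after the cutoff date.
    return [v for v in versions if v["last_seen"] >= cutoff]

def percent_difference_sieve(versions, threshold, difference):
    # Keep a new version only if it differs from the last kept version by at
    # least `threshold`, per the supplied difference measure (0.0 to 1.0).
    kept = versions[:1]
    for v in versions[1:]:
        if difference(kept[-1]["content"], v["content"]) >= threshold:
            kept.append(v)
    return kept

versions = [{"content": "v%d" % i, "last_seen": i} for i in range(10)]
print([v["content"] for v in periodic_sieve(versions, 3)])   # ['v0', 'v3', 'v6', 'v9']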
Change in the Web
Anecdotal remark: A significant fraction of the links from my class notes
for this course from two years ago are now dead. These are, of course,
links to published scientific papers, which ought to be extremely stable.
(This is mostly migration, of course; most of these papers are available
somewhere on the web and can be found using Google.)
Types of change:
- Growth: (a) in number of sites; (b) in number of pages at each site;
(c) in size of pages; (d) in linkages.
- URL-centered viewpoint:
- Change of content.
- Replacement by forward pointer.
- Content-centered viewpoint:
- Migration: With or without forward pointer.
- Link structure.
Accessibility of information on the web,
Steve Lawrence and C. Lee Giles, Nature, July 8, 1999.
December 1997: Estimate of at least 320 million pages on the Web.
Feb. 1999: Estimate of at least 800 million pages on the Web.
Method: Test random IP addresses; attempt to exhaustively crawl each web
site found; extrapolate. 2.8 million servers. Avg. of 289 pages per site =
800 million pages. Avg. of 18.7 KBytes per page / 7.3 KBytes of
textual content per page = 15 TBytes / 6 TBytes of textual content.
Avg. of 62.8 images per server, mean image size of 15.2 KBytes per image
= 180 million images, 3 TBytes of images.
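The extrapolation is straightforward multiplication; a quick check of the arithmetic,
using the figures quoted above:

servers = 2.8e6               # estimated publicly indexable web servers
pages_per_site = 289
kb_per_page, kb_text = 18.7, 7.3
images_per_server, kb_per_image = 62.8, 15.2

pages = servers * pages_per_site              # ~809 million pages
total_tb = pages * kb_per_page / 1e9          # ~15 TBytes of HTML
text_tb = pages * kb_text / 1e9               # ~6 TBytes of textual content
images = servers * images_per_server          # ~176 million images
image_tb = images * kb_per_image / 1e9        # ~2.7 TBytes of images

print(pages, total_tb, text_tb, images, image_tb)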
The Evolution of
the Web and Implications for an Incremental Crawler
Junghoo Cho and Hector Garcia-Molina.
Method: Once a day, for four months,
crawl 3000 pages at each of 270 "popular" sites
(that gave permission out of 400 asked) using a breadth-first crawler
from the root page.
Note: the crawl each day starts from a fixed seed set. Thus, not a stable set either of content or of URLs.
- 20% of pages had an average change interval of less than 1 day.
30% did not change over the 4 months of the experiment.
- Highly dependent on domain. In .com, 40% of pages changed every day,
and 10% lasted more than 4 months. In .edu and .gov, 1 or 2 percent changed
every day, and 50% lasted more than 4 months.
- 50% of the web changes in 50 days; 11 days for .com.
A large scale study of the evolution of Web pages
Dennis Fetterly et al. Software -- Practice and Experience,
vol. 34, 2004, pp. 213-237.
Method: Fix a set of 151 million HTML Web pages + 62 million non-HTML
pages. Download once a week for 11 weeks. (Thus, strictly a URL-based study.)
Note that this method cannot detect the creation of a Web page
during the experimental period, but reliably detects its deletion (unless
the server fails to deliver the page for some other reason). By contrast,
the method of Cho and Garcia-Molina can detect appearance, but both
its reports of appearance and disappearance are subject to many false
positives, as the page may simply have come into or gone out of the
3000 page crawler window.
(Lots of technical detail on how you manage such a huge dataset effectively.)
Avg. length of an HTML page = 16 KB. "Looks like" a log-normal curve.
66.3% of documents have length between 4 and 32 KB. .com pages are a little
longer, .edu pages a little shorter; the gap is smaller if length is measured in
words, suggesting that the difference is that .edu pages have less HTML markup.
85% of URLs are downloadable over the entire 11-week experiment
(.edu better than .com, .net). An increasing number become unavailable
due to robots.txt exclusions, probably a result of the experiment itself.
Web pages in .cn (China), .com, and .net expire sooner than average.
Similarity measure: Let U and V be two pages. Let S(U) and S(V) be the
set of all 5-word shingles in text of U and V. Then similarity is
measured as (I think) |S(U) intersect S(V)| / |S(U) union S(V)|.
For efficiency, this is calculated using the random function fingerprinting
method discussed earlier.
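A direct (non-fingerprinted) version of that similarity measure, for concreteness;
the fingerprinting is only an efficient way of approximating this same Jaccard coefficient:

def shingles(text, k=5):
    # The set of all k-word shingles (contiguous word sequences) in the text.
    words = text.split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def similarity(u_text, v_text, k=5):
    # Jaccard coefficient |S(U) intersect S(V)| / |S(U) union S(V)|.
    su, sv = shingles(u_text, k), shingles(v_text, k)
    if not su and not sv:
        return 1.0    # treat two empty documents as identical
    return len(su & sv) / len(su | sv)

print(similarity("the quick brown fox jumps over the lazy dog",
                 "the quick brown fox jumps over the sleepy dog"))   # 3/7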
Considering all pairs of successive downloads of the same page (1 week apart):
65.2% are identical. 9.2% differ only in HTML elements. 3% have similarity
less than 1/3. 0.8% have 0 similarity.
Among pages that changed only in markup: 62% are changes to an attribute.
48% are changes to an attribute that follows ? or ;. Most of these
are just changes to a session ID.
Observing link evolution of this type may help a crawler in spotting session
identifiers. If a crawler were able to recognize embedded session identifiers
and remove them, then it could avoid recrawling the same content
multiple times. In our experience, such unwarranted recrawling accounts
for a non-negligible fraction of total crawl activity.
A smaller fraction are advertisements, chosen by embedding some
identifier in the query portion of a URL, etc.
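A crude sketch of the kind of URL normalization being suggested (the parameter names
treated as session identifiers here are hypothetical guesses, not a list from the paper):

from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

SESSION_PARAMS = {"sessionid", "sid", "phpsessid", "jsessionid"}   # assumed names

def normalize(url):
    # Drop query parameters that look like session IDs, so repeated crawls of
    # the same content map to the same canonical URL.
    scheme, netloc, path, query, frag = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(query, keep_blank_values=True)
            if k.lower() not in SESSION_PARAMS]
    return urlunsplit((scheme, netloc, path, urlencode(kept), frag))

print(normalize("http://example.com/sports.html?sessionid=ab12cd&page=2"))
# http://example.com/sports.html?page=2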
Pages in .com change more frequently than in .gov or .edu.
Fastest changes were in .de (Germany): 27% of pages underwent a large or
complete change every week, compared with 3% for the Web as a whole. Why?
Of the first half dozen pages we examined, all but one contained
disjoint, but perfectly grammatical phrases of an adult nature
together with a redirection to an adult Web site. It soon became clear that
the phrases were automatically generated on the fly, for the purpose
of ``stuffing'' search engines such as Google with topical keywords surrounded
by sensible-looking context, in order to draw visitors to the adult
Web site. Upon further investigation, we discovered that our data set
contains 1.03 million URLs drawn from 116,654 hosts (4745 of them outside
the .de domain) which all resolved to a single IP address. This machine
provided over 15% of the .de URLs in our data set!
Point: (1) To circumvent the politeness policy of search engines, which avoid
downloading too many pages from a single server at once.
(2) To trick Google, by making the links between these pages look non-nepotistic
(links between different hosts carry more weight than links within a host).
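A small sketch of how such a single-IP host farm could be spotted (hypothetical;
resolving the hosts today would of course not reproduce the 2003 data):

import socket
from collections import defaultdict

def hosts_by_ip(hosts):
    # Group host names by the IP address they resolve to; a huge cluster of
    # hosts on one IP, as described above, is a red flag for spam.
    groups = defaultdict(list)
    for h in hosts:
        try:
            groups[socket.gethostbyname(h)].append(h)
        except socket.gaierror:
            pass    # host no longer resolves
    return groups

# e.g. flag any IP serving more than 1000 distinct crawled host names:
# suspicious = {ip: hs for ip, hs in hosts_by_ip(crawled_hosts).items() if len(hs) > 1000}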
Large documents tend to change much more frequently than small ones.