Lecture 10: Archiving / Usage Mining
Retrieve past state of web.
- Only static pages. (Dynamic pages depend on arbitrary state that is
not captured by the archive.)
- Included files must be archived as well. The state of a "page" at
one time is the state of the main file plus the states of the included
files.
- Hyperlinks: If page P at date D points to Q, and you click on it,
do you want the version of Q at date D or the current version of Q?
- Date indeterminacy. Collection must be observational;
that is, new versions are collected from time to time, not every time
the page is modified. Often the creation date can be determined; if
not, one knows only the earliest date the version was collected. All
you can ever know about its ending is the last day it was collected.
Thus, between the last day version V1 was collected and the earliest
establishable date for the next version, one can only assume that V1
was the active version. Versions that exist only between collections
are, of course, missed entirely.
- Significant change: Is it worth recording small changes?
Tradeoff time vs. space. Options for what to store:
- Change from previous version. (Here the middle states can't be
eliminated, since reconstructing later versions requires them.)
- Change from original version.
- Complete state of each version.
Mixed strategy: checkpointing at regular intervals or whenever the
accumulated change is large.
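A sketch of the mixed strategy, with an invented checkpoint interval
and difflib standing in for a real delta encoder:

```python
import difflib

CHECKPOINT_EVERY = 5   # illustrative checkpoint interval

def store(version_num, lines, prev_lines, archive):
    """Full snapshot at checkpoints, delta-from-previous otherwise."""
    if prev_lines is None or version_num % CHECKPOINT_EVERY == 0:
        archive[version_num] = ("full", lines)
    else:
        archive[version_num] = ("delta",
                                list(difflib.ndiff(prev_lines, lines)))

def reconstruct(version_num, archive):
    """Walk back to the nearest checkpoint, then replay forward. The
    walk is why, under a pure delta-from-previous scheme, middle
    states can never be vacuumed: later versions depend on them."""
    base = version_num
    while archive[base][0] != "full":
        base -= 1
    lines = archive[base][1]
    for v in range(base + 1, version_num + 1):
        # difflib's ndiff deltas happen to be self-contained; a real
        # system would apply a compact patch against `lines` here
        lines = list(difflib.restore(archive[v][1], 2))
    return lines
```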
- Query language
Library of Congress: Digital Preservation
Historical archive, ideally of the entire state of the Internet, in reality
of some substantial number of significant sites. Intended as a permanent
record. All formats of files.
Library of Congress: Web Capture
"Way Back Machine" finds contemporary included files and hyperlinks.
Storage: Vast quantities of data. Large arrays of disk plus mag tape.
- Against accident.
- Against physical deterioration. Migrate tapes every year.
- Against impermanent software infrastructure. Vast quantities of
data from the '60s, '70s, and even '80s are now forever lost, because
reading them required software that no longer exists anywhere.
Proxy-side archive. Web page archived when proxy requests it from server.
TTApache -- Server-side archiving
Managing Versions of Web Documents in a Transaction-time Web Server
Curtis Dyreson, Hui-Ling Lin, and Yingxia Wang, WWW-2004.
- Current version archived when requested from server.
- HTTP-compatible queries. Dates etc. are appended to the URL as arguments,
so queries are backward compatible with other servers (which just return
the current version).
- Link rewriting. User may choose to follow any date of link.
- Assumed version. User may choose either to accept or reject
a version that is assumed to have existed at a given date.
- Support for range of vacuuming strategies.
- Efficient. Slows down server only slightly.
- Distinguish pages whose state is known on a given date from those
whose state is assumed, as above.
- A page version is the state of the main page file and all
the included files. If any included file changes, it is a new version of
the page. The state of the page at a given date is known, rather than
assumed, only if the state of all the associated files is known.
- sports.html or sports.html?now -- Current version.
- sports.html?pre -- Previous version. Clicking on links will give
current version of linked pages.
- sports.html?pre,timeOf -- Previous version of sports.html.
Clicking on links will give the linked pages
in the state that they were in at the time of that version (? at the
earliest time that version is known, the last time that version is known,
or the last time that version is assumed? Presumably the first.)
- sports.html?pre,pre -- Previous version of sports.html.
Clicking on link gives linked page in state previous to the current time.
- sports.html?26-Sep-2003. Return the state of sports.html as of
that date if it is known. If it is only assumed, return 404 error.
(Can change this default.)
- sports.html?assumed.26-Sep-2003. Return the state of the page whether
it is known or assumed.
- sports.html?26-Sep-2003.next. Return the next state of the page after
the specified date.
- sports.html?history(1-Jan-2003, 31-Dec-2003). Return history of
page between dates. Result is a list of links to known and assumed pages.
- sports.html?vacuum(t-window,begin,begin+365). Vacuum states
of sports.html for the first year after its creation.
- sports/?vacuum(v-window,1,2). Vacuum every odd-numbered version
of every title in sports. (Discrepancy in interpretation of arguments.)
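A partial sketch of decoding the query suffix; the grammar is inferred
from the examples above, and the helper names are mine:

```python
from datetime import datetime

def parse_date(s):
    return datetime.strptime(s.strip(), "%d-%b-%Y").date()

def parse_suffix(suffix):
    """Decode a few of the query forms listed above."""
    if suffix in ("", "now"):
        return ("current",)
    if suffix == "pre":
        return ("previous", "links-current")
    if suffix == "pre,timeOf":
        return ("previous", "links-contemporary")
    if suffix.startswith("history(") and suffix.endswith(")"):
        lo, hi = suffix[len("history("):-1].split(",")
        return ("history", parse_date(lo), parse_date(hi))
    if suffix.startswith("assumed."):
        return ("at-date", parse_date(suffix[len("assumed."):]),
                "allow-assumed")
    if suffix.endswith(".next"):
        return ("next-after", parse_date(suffix[:-len(".next")]))
    # default: known states only; an assumed state yields a 404
    return ("at-date", parse_date(suffix), "known-only")

assert parse_suffix("26-Sep-2003.next") == \
    ("next-after", parse_date("26-Sep-2003"))
```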
- Automatic vs. user-specified. The automatic strategy can be set
differently for different parts of the site.
- Periodic sieve: Keep every kth version.
- Version-window sieve: Keep only 5 most recent versions.
- Time-window sieve: Keep only past year.
- Percent difference sieve: Keep new version only if it is at least
K% different from previous version.
- If version specified in query has been vacuumed, redirect to
previous version / next version / return error message.
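Sketches of the four sieves; the version records, thresholds, and
difference measure are invented for illustration:

```python
import difflib
from datetime import date

def periodic_sieve(versions, k=3):
    """Keep every kth version."""
    return [v for i, v in enumerate(versions) if i % k == 0]

def version_window_sieve(versions, n=5):
    """Keep only the n most recent versions."""
    return versions[-n:]

def time_window_sieve(versions, days=365, today=None):
    """Keep only versions last collected within the window."""
    today = today or date.today()
    return [v for v in versions
            if (today - v["last_seen"]).days <= days]

def percent_difference_sieve(versions, k=10):
    """Keep a new version only if it is at least k% different from
    the previous kept version (crude character-level measure)."""
    kept = versions[:1]
    for v in versions[1:]:
        sim = difflib.SequenceMatcher(
            None, kept[-1]["text"], v["text"]).ratio()
        if (1 - sim) * 100 >= k:
            kept.append(v)
    return kept
```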
Reading: Liu, Web Data Mining, chap. 12, "Web Usage Mining" by Bamshad
Mobasher.
Mining interesting knowledge from weblogs: a survey
Federico M. Facca and Pier Luca Lanzi
Data Preparation for Mining World Wide Web Browsing Patterns
Cooley, Mobasher, and Srivastava
Server logs: the most common data source. Each entry records:
- IP address of user.
- Userid (not always available)
- Method/URL/Protocol e.g. "GET A.html HTTP/1.0"
- Status (200, 404, etc.)
- Size (Bytes)
- Browser used.
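For concreteness, a sketch parsing one line in the common "combined"
log format. (The standard format also carries a timestamp and a
referrer field, which the sessionizing and path-reconstruction steps
below depend on.)

```python
import re

# Fields: ip, identd, userid, [time], "request", status, size,
# and optionally "referrer" and "browser" (combined format).
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ (?P<userid>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+)'
    r'(?: "(?P<referrer>[^"]*)" "(?P<browser>[^"]*)")?')

line = ('157.55.39.1 - - [26/Sep/2003:10:42:07 -0500] '
        '"GET /A.html HTTP/1.0" 200 2326 '
        '"http://example.com/index.html" "Mozilla/4.0"')
m = LOG_RE.match(line)
print(m.group("ip"), m.group("request"), m.group("status"))
```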
Difficulties and limitations
- Only data for the one server.
- Data cleaning. Eliminate requests from robots. Well-behaved robots
either declare themselves as such, or start out by requesting "robots.txt".
Badly behaved robots can sometimes be detected by various patterns:
- Temporal patterns. E.g. a large number of pages downloaded every night.
- Searching patterns. Breadth-first search, or patterns that resemble
those of known robots.
- Cached pages. Can be solved using cache busting. E.g. SurfAid uses a
Web Bug: "each time the Web page is loaded, Web Bug sends a request to
the server asking for a 1x1 pixel image; the request is generated with
parameters identifying the Web page containing the script and a random
numeric parameter; the overall request cannot be cached [either] by the
proxy [or] by the browser, but it is logged by the Web server so as to
solve caching problems."
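A minimal sketch of the Web Bug request just described; the logging
server name is hypothetical:

```python
import random

def web_bug_url(page_id):
    """URL of the 1x1 pixel image: identifies the containing page and
    adds a random parameter so neither proxy nor browser caches it."""
    return ("http://logger.example.com/bug.gif"
            f"?page={page_id}&nocache={random.randrange(10**9)}")

# Embedded in each delivered page as a 1x1 <img> tag:
print(web_bug_url("sports.html"))
```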
- User identification:
Can solve using cookies, or (in large part) using URL rewriting: whenever
the server delivers an HTML page, any internal URLs are rewritten to
include the session ID. Thus, if the user clicks on them, the session
can be tracked.
- Single IP address, several users: Users communicate through proxy.
- Multiple IP addresses, one session: Some ISP's do this for privacy.
- Multiple IP addresses, one user over time: User uses more than one account.
- Boundaries for a session: 30-minute timeout.
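A sketch of the 30-minute-timeout rule for one identified user; the
request encoding is an assumption:

```python
from datetime import timedelta

TIMEOUT = timedelta(minutes=30)

def sessionize(requests):
    """requests: time-sorted list of (timestamp, url) for one user."""
    sessions, current = [], []
    for t, url in requests:
        if current and t - current[-1][0] > TIMEOUT:
            sessions.append(current)   # gap too long: close the session
            current = []
        current.append((t, url))
    if current:
        sessions.append(current)
    return sessions
```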
- For popular web sites, server logs are immense. Must often be reduced
before data mining can be carried out.
Reconstructing the search path: Pages previously looked at are cached
on the client side and not recorded in the server log. Hence, if the
user requests page A and has previously requested a page C with a link
to A, infer that the user has returned to C and followed the link from
C to A.
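A sketch of this inference, assuming the site's link graph is
available as a dict:

```python
def complete_path(session_urls, links):
    """links: dict mapping url -> set of urls it links to."""
    path = []
    for url in session_urls:
        if path and url not in links.get(path[-1], set()):
            # scan backwards for the page the user returned to
            for earlier in reversed(path):
                if url in links.get(earlier, set()):
                    path.append(earlier)   # inferred cached backtrack
                    break
        path.append(url)
    return path
```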
Facca and Lanzi state: "Users' behavior can also be tracked down on the server
side by means of TCP/IP packet sniffers. Even in this case the identification
of users' sessions is still an issue, but the use of packet sniffers provides
some advantages. In fact (i) data are collected in real time; (ii) [data]
coming from different Web servers can be easily merged together into a
unique log [I presume the point is that different servers generate logs
of different formats, whereas TCP/IP is standard]; (iii) the use of special
buttons (e.g. the stop button) can be detected so as to collect
information usually unavailable in log files. Notwithstanding the many
advantages, packet sniffers are rarely used in practice. Packet sniffers
raise scalability issues on Web servers with high traffic, moreover they
cannot access encrypted packets ..."
Integrating E-Commerce and Data Mining: Architecture and Challenges
S. Ansari et al.
Advantages: Record [description of] dynamically generated content.
Sessionize, identify users, using cookies. Save information missing
from server logs: Stop button, local time of user, speed of user's connection.
Disadvantage: Have to either write or modify web server code.
Proxy server logs
Advantage: many servers, many users (though generally not a representative
collection of users).
Disadvantage: Low quality information.
Client-side logs
Induce client to use a "bugged" browser. Get all the information you want
(though the analysis of content is generally easier at the server side.)
Better yet, you can bug some of their other programs as well and get
even more information. E.g. if you monitor the browser, email, and
text editing programs, you can see how often the user is stuffing information
from the browser into email and text files.
- Only certain users. Generally only certain types of users (e.g.
university personnel), which may not be representative.
- Just because a browser is looking at a page all night doesn't mean
a user is looking at the page.
Any usage collection runs into privacy issues; the more complete the
data, the more serious the issue.
Information to be extracted
- Frequency of access (by page/item/etc.)
--- by demographic category of user
- Average view time
- Average path length
- Invalid URLs
- Unexpected/unauthorized entry points
- Correlation of pages / association rules
- Sequence rules.
- Cluster of pages
- Collaborative filtering pages/users.
- Cluster transactions.
Search patterns and paths.
Statistical analysis, classification techniques, clustering,
Markov models, sequential patterns.
Examples from the 1996 Olympics Web site (Cooley et al.):
Indoor volleyball => Handball (Confidence: 45%)
Badminton, Diving => Table Tennis (Confidence: 59.7%)
Atlanta home page followed by Sneakpeek main page (Support: 9.81%)
Sports main page followed by Schedules main page (Support: 0.42%)
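How such support and confidence numbers are computed, on invented
session data:

```python
def support(sessions, items):
    """Fraction of sessions containing all the given items."""
    return sum(items <= s for s in sessions) / len(sessions)

def confidence(sessions, lhs, rhs):
    """Of the sessions containing lhs, the fraction also containing rhs."""
    return support(sessions, lhs | rhs) / support(sessions, lhs)

sessions = [{"volleyball", "handball"}, {"volleyball"},
            {"badminton", "diving", "tabletennis"}, {"handball"}]
print(confidence(sessions, {"volleyball"}, {"handball"}))  # 0.5
```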
Relate web activities to user profile.
E.g. 30% of users who ordered music are 18-25 and live on the West Coast.
Inferring User Objectives
Reorganizing web sites based on user access patterns
Y. Fu, M. Shih, M. Creado, and C. Ju.
Plan recognition. Distinguish target/content pages from index
pages. Similar to authority vs. hub, except that within a hierarchical site
a content page may have only one inlink, unlike an authority. Assume that
the user's ultimate objective in a session is to collect the content pages
that he accesses. Criteria for distinguishing index from content pages:
- Number of outlinks. In particular, leaf pages and pages not in HTML
are always content pages.
- Time spent browsing (though this has to be statistical since some of the
time this is coffee breaks.)
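The two criteria as a crude classifier; thresholds and field names are
invented:

```python
def looks_like_content(page, avg_view_time_sec):
    """page: dict with n_outlinks and is_html flags (assumed schema)."""
    if page["n_outlinks"] == 0 or not page["is_html"]:
        return True    # leaf or non-HTML page: always content
    # Statistical criterion: long average dwell time suggests content,
    # though any single long time may just be a coffee break.
    return avg_view_time_sec > 60 and page["n_outlinks"] < 10
```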
The literature is noticeably weaker here.
On the other hand, there are dozens of commercial software systems to do this
kind of analysis; so people must be buying them; so they must be using them
somehow. (See Facca and Lanzi; also
Web mining for web personalization
M. Eirinaki and M. Vazirgiannis)
WebWatcher: A Tour Guide for the World Wide Web
T. Joachims, D. Freitag, and T. Mitchell (1997)
Browser. User specifies an "interest" and starts browsing;
WebWatcher highlights links it considers of particular interest.
Learns function LinkQuality = Prob(Link | Page, Interest)
Learning from previous tours:
Annotate each link with interest of users who followed it, plus anchor.
Find links whose annotation best matches interest of user.
(Qy: Why annotate links rather than pages? Perhaps to achieve directionality)
Learning from hypertext structure
Value of page is TFIDF match from interest to page.
Value of path P1, P2, ..., Pk is the discounted sum:
Value(P1, P2, ..., Pk) = value(P1) + D*value(P2) + D^2*value(P3) + ...
where D < 1.
Value of link is the value of best path starting at target of link.
Dynamic programming algorithm to compute this.
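A sketch of the dynamic program, iterating to a fixed point since the
site graph may contain cycles; D and all names are illustrative:

```python
D = 0.5   # discount factor, D < 1

def best_path_values(pages, links, value, iters=50):
    """pages: ids; links: page -> list of target pages;
    value: page -> TFIDF match of the interest against the page.
    Returns V with V[p] = value of the best path starting at p;
    the quality of a link is then V[target of the link]."""
    V = {p: value[p] for p in pages}
    for _ in range(iters):
        V = {p: value[p] + D * max((V[q] for q in links.get(p, [])),
                                   default=0.0)
             for p in pages}
    return V
```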
For links on new pages: distance-weighted 3 nearest-neighbor approximator.
That is: We are on page P and deciding between links L1, L2 ... Lk.
Distance between link L1 on P and Lx on Px is
dist(L1,Lx) = TFIDF(anchor(L1),anchor(Lx)) + 2*TFIDF(text(P),text(Px)).
Let Lx, Ly, Lz be the closest links to L1.
The quality of L1 for interest I is
qual(L1,I) = TFIDF(Lx,I)/dist(L1,Lx) (presumably summed likewise over
Ly and Lz).
Recommend links of highest quality.
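A sketch of the nearest-neighbor step, under the reading that the
three contributions are summed; tfidf_sim and dist are assumed
helpers, with an epsilon guarding against zero distances:

```python
def link_quality(link, interest, training_links, tfidf_sim, dist,
                 eps=1e-6):
    """Distance-weighted 3-NN: each of the three closest annotated
    training links contributes its match to the interest, weighted
    inversely by its distance from the new link."""
    neighbors = sorted(training_links,
                       key=lambda lx: dist(link, lx))[:3]
    return sum(tfidf_sim(lx["annotation"], interest)
               / (dist(link, lx) + eps)
               for lx in neighbors)
```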
Evaluation: Accuracy = percentage of time user followed a recommended link.
Achieved accuracy of 48.9%, as compared to 31.3% for random recommendations.
Automatic Personalization Based on Web Usage Mining
Bamshad Mobasher, Robert Cooley, Jaideep Srivastava.
Recommended links on department web site, some other web sites.
Caching and pre-loading
Hard to believe that you get much leverage on this problem, but many
people have worked on it.
Web site design
Adaptive Web Sites: Automatically Synthesizing Web Pages
Mike Perkowitz, Oren Etzioni
- Cluster pages at a site by co-occurrence in sessions
- For each cluster, create an index page
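A sketch of the co-occurrence step (the paper's actual clustering is
more elaborate; the threshold is invented):

```python
from collections import Counter
from itertools import combinations

def cooccurring_pairs(sessions, threshold=0.1):
    """Page pairs appearing together in at least `threshold` of all
    sessions; these seed the clusters that get index pages."""
    pairs = Counter()
    for s in sessions:
        pairs.update(combinations(sorted(set(s)), 2))
    n = len(sessions)
    return [p for p, c in pairs.items() if c / n >= threshold]
```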
Mining Web Logs to Improve Website Organization
Ramakrishnan Srikant and Yinghui Yang
Insight: If a user searches down a hierarchy to index page B, backtracks, and
then ends up at target page T, and T is the first target page looked at
in the search, then it seems likely that the user expected to find T under B,
and therefore one can suggest that it might be good to add a link from B to T.
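A sketch of the heuristic; the session encoding (a repeated page id
marks a backtrack) and the is_target predicate are assumptions:

```python
def suggest_link(session, is_target):
    """session: ordered page ids; a repeated id marks a backtrack.
    If the first target page T is reached after a backtrack, the page
    the user backtracked *from* is B, and B -> T is the candidate."""
    seen = set()
    backtrack_from = None
    prev = None
    for page in session:
        if is_target(page):
            return (backtrack_from, page) if backtrack_from else None
        if page in seen and prev is not None and backtrack_from is None:
            backtrack_from = prev      # first revisit: prev is B
        seen.add(page)
        prev = page
    return None
```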
Test case: Wharton School of Business web site. The web site has 240 leaf
pages. Based on 6 days' worth of server logs with 15,000 visitors and
3,000,000 records (200 records per visitor seems like a large number, but
of course that includes embedded image files and other junk), the program
suggested new links for 25 pages.
Visitors expect to find the answer to "Why choose Wharton?" under the
"Student to Student Program's Question and Answer Session" directory
instead of under "Student to Student Program's General Description".
Visitors expect to find "MBA Student Profiles" under "Student" instead of
Visitors expect to find "Calendar" under "Programs" instead of "WhartonNow".
Visitors expect to find "Concentrations" and "Curriculum" under "Students"
instead of "Programs" (less convincing).
The program also made 20 other suggestions.
I couldn't find any good papers on this. But I'd guess the main
applications are:
Customer Profiling: You can find out what kind of customers are
buying what kind of items if you can get demographic information (which,
apparently, you often can get just by asking --- one writer was shocked at
how readily online shoppers provided personal information that the company
had no business asking.)
Advertisement placement (my own guess). The "referrer" field in
the server log tells you what advertisements are attracting what kind of
customers.