Lecture 10: Archiving / Usage Mining

Web Archives

Retrieve past states of the web. Issues:

Library of Congress: Digital Preservation
Library of Congress: Web Capture

Internet Archive

Historical archive, ideally of the entire state of the internet, in reality of a substantial number of significant sites. Intended as a permanent record. All formats of files.

"Way Back Machine" finds contemporary included files and hyperlinks.

Storage: Vast quantities of data. Large disk arrays plus magnetic tape.

Permanence:

IProxy

Proxy-side archive. Web page archived when proxy requests it from server.

Author-requested archiving

Managing versions of web documents in a transaction-time web server

Curtis Dyreson, Hui-Ling Lin, and Yingxia Wang, WWW-2004.

TTApache -- Server side archiving

Topics: temporal model, sample queries, vacuuming strategies.

Usage Mining

Reading: Liu, Web Data Mining, chap. 12, "Web Usage Mining" by Bamshad Mobasher.

Mining interesting knowledge from weblogs: a survey Federico M. Facca and Pier Luca Lanzi

Data Preparation for Mining World Wide Web Browsing Patterns Cooley, Mobasher, and Srivastava

Data sources

Server Log

Most common source.

Difficulties and limitations

Reconstructing the search path: Pages previously viewed are cached on the client side and so do not appear in the server log. Hence, if the user requests page A, and previously requested page C which contains a link to A, infer that the user returned to C (from the cache) and followed the link to A.
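
A minimal sketch of this path-completion heuristic, assuming the log has already been sessionized into (page, referrer) pairs and that the site's link structure is available; the function and variable names are illustrative, not from the readings.

# Path completion (illustrative sketch). If a request's referrer is not the
# previous page in the session, assume the user backtracked through cached
# pages (which never hit the server log) before following the link.
def complete_path(session, links_from):
    """session    : list of (page, referrer) pairs in request order
       links_from : links_from[p] = set of pages that page p links to
       Returns the inferred navigation path, including cached revisits."""
    path = []
    for page, referrer in session:
        if path and referrer and path[-1] != referrer:
            back = []
            for prev in reversed(path[:-1]):
                back.append(prev)
                if prev == referrer or page in links_from.get(prev, set()):
                    break
            path.extend(back)        # revisits served from the client cache
        path.append(page)
    return path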

Packet sniffing

Facca and Lanzi state: "Users' behavior can also be tracked down on the server side by means of TCP/IP packet sniffers. Even in this case the identification of users' sessions is still an issue, but the use of packet sniffers provides some advantages. In fact (i) data are collected in real time; [??] (ii) information coming from different Web servers can be easily merged together into a unique log [I presume the point is that different servers generated logs of different formats, whereas TCP/IP is standard] (iii) the use of special buttons (e.g. the stop button) can be detected so as to collect information usually unavailable in log files. Notwithstanding the many advantages, packet sniffers are rarely used in practice. Packet sniffers raise scalability issues on Web servers with high traffic, moreover they cannot access encrypted packets ..."

Server Application

Integrating E-Commerce and Data Mining: Architecture and Challenges S. Ansari et al.

Advantages: Record [a description of] dynamically generated content. Sessionize and identify users using cookies. Save information missing from server logs: stop button, local time of user, speed of user's connection.

Disadvantage: Have to either write or modify web server code.
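
As a rough illustration of the sessionization advantage above, here is a minimal sketch of grouping requests into sessions by cookie with an inactivity timeout; the record format and the 30-minute threshold are assumptions, not taken from the Ansari et al. paper.

# Cookie-based sessionization (illustrative sketch; assumed record format).
from datetime import timedelta

SESSION_TIMEOUT = timedelta(minutes=30)     # assumed inactivity threshold

def sessionize(records):
    """records: iterable of dicts with 'cookie_id', 'time' (datetime), 'url',
       sorted by time. Starts a new session for a cookie after SESSION_TIMEOUT
       of inactivity."""
    sessions = {}      # cookie_id -> list of sessions (each a list of records)
    last_seen = {}     # cookie_id -> time of that cookie's previous request
    for rec in records:
        cid, t = rec['cookie_id'], rec['time']
        if cid not in sessions or t - last_seen[cid] > SESSION_TIMEOUT:
            sessions.setdefault(cid, []).append([])    # open a new session
        sessions[cid][-1].append(rec)
        last_seen[cid] = t
    return sessions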

Proxy server logs

Advantage: many servers, many users (though generally not a representative collection of users).

Disadvantage: Low quality information.

Client level

Induce the client to use a "bugged" browser. Get all the information you want (though analysis of content is generally easier at the server side).

Better yet, you can bug some of their other programs as well and get even more information. E.g. if you monitor the browser, email, and text editing programs, you can see how often the user is stuffing information from the browser into email and text files.

Limitations

Any usage collection runs into privacy issues; the more complete the data, the more serious the issue.

Information to be extracted

Statistical measures

Associations

Search patterns and paths.

Mining techniques

Statistical analysis, classification techniques, clustering, Markov models, sequential patterns.

Association rules:
Examples from 1996 Olympics Web site: (Cooley et al.)
Indoor volleyball => Handball (Confidence: 45%)
Badminton, Diving => Table Tennis (Confidence: 59.7%)

Sequential patterns:
Atlanta home page followed by Sneakpeek main page (Support: 9.81%)
Sports main page followed by Schedules main page (Support: 0.42%)
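
A small sketch of how support and confidence figures like those above would be computed once the log has been sessionized, assuming each session is reduced to the set of pages (or page categories) visited; names are illustrative.

# Support and confidence of a rule A => B over sessionized data (illustrative).
def support_and_confidence(sessions, antecedent, consequent):
    """sessions   : list of sets of items (pages visited per session)
       antecedent : set of items on the left-hand side of the rule
       consequent : set of items on the right-hand side
       Returns (support, confidence) as fractions."""
    n = len(sessions)
    both = sum(1 for s in sessions if antecedent <= s and consequent <= s)
    ante = sum(1 for s in sessions if antecedent <= s)
    support = both / n if n else 0.0
    confidence = both / ante if ante else 0.0
    return support, confidence

# e.g. support_and_confidence(sessions, {"Badminton", "Diving"}, {"Table Tennis"})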

Relate web activities to user profile.
E.g. 30% of users who ordered music are 18-25 and live on the West Coast.

Inferring User Objectives

Reorganizing web sites based on user access patterns Y. Fu, M. Shih, M. Creado, and C. Ju.

Plan recognition. Distinguish target/content pages from index/navigation pages. Similar to authority vs. hub, except that within a hierarchical site a content page may have only one inlink, unlike an authority. Assume that the user's ultimate objective in a session is to collect the content pages that he accesses. Criteria for distinguishing index from content pages:

Applications

The literature is noticeably weaker here.

On the other hand, there are dozens of commercial software systems to do this kind of analysis; so people must be buying them; so they must be using them somehow. (See Facca and Lanzi; also Web mining for web personalization M. Eirinaki and M. Vazirgiannis)

Personalization/recommenders

WebWatcher

WebWatcher: A Tour Guide for the World Wide Web T. Joachims, D. Freitag, and T. Mitchell (1997)

Browser. User specifies an "interest" and starts browsing;
WebWatcher highlights links it considers of particular interest.

Learns function LinkQuality = Prob(Link | Page, Interest)

Learning from previous tours:
Annotate each link with interest of users who followed it, plus anchor.
Find links whose annotation best matches interest of user.
(Qy: Why annotate links rather than pages? Perhaps to achieve directionality)

Learning from hypertext structure
Value of a page is the TFIDF match from the interest to the page.
Value of path P1, P2, ... is the discounted sum:
Value(P1, P2, ..., Pk) = value(P1) + D*value(P2) + D^2*value(P3) + ...
where D < 1. Value of a link is the value of the best path starting at the target of the link.
Dynamic programming algorithm to compute this.
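
A minimal sketch of that dynamic-programming step, assuming the page-to-interest TFIDF scores and the link graph are given as dictionaries; the value-iteration scheme and all names are illustrative, not the paper's implementation.

# Discounted path values: Value(P1,P2,...) = value(P1) + D*value(P2) + D^2*value(P3) + ...
def link_values(value, links, D=0.5, iterations=30):
    """value[p] : TFIDF match between the user's interest and page p
       links[p] : list of pages that p links to
       Returns the value of each link, i.e. the value of the best path
       starting at the link's target."""
    best = dict(value)                        # best path value starting at p
    for _ in range(iterations):               # converges since D < 1
        for p, out in links.items():
            succ = [best.get(q, 0.0) for q in out]
            best[p] = value.get(p, 0.0) + D * max(succ, default=0.0)
    return {(p, q): best.get(q, 0.0) for p, out in links.items() for q in out}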

For links on new pages: distance-weighted 3 nearest-neighbor approximator.
That is: We are on page P and deciding between links L1, L2 ... Lk.
Distance between link L1 on P and Lx on Px is
dist(L1,Lx) = TFIDF(anchor(L1),anchor(Lx)) + 2*TFIDF(text(P),text(Px)).
Let Lx, Ly, Lz be closest links to L1.
The quality of L1 for interest I is
qual(L1,I) = TFIDF(Lx,I)/dist(L1,Lx) + TFIDF(Ly,I)/dist(L1,Ly) + TFIDF(Lz,I)/dist(L1,Lz)
Recommend links of highest quality.

Evaluation: Accuracy = percentage of time user followed a recommended link.
Achieved accuracy of 48.9% as compared to 31.3% for random recommendations.

Automatic Personalization Based on Web Usage Mining Bamshad Mobasher, Robert Cooley, Jaideep Srivastava. Recommended links on a department web site and some other web sites.

Caching and pre-loading

Hard to believe that you get much leverage on this problem, but many people have tried.

Web site design

Adaptive Web Sites: Automatically Synthesizing Web Pages Mike Perkowitz, Oren Etzioni

Mining Web Logs to Improve Website Organization Ramakrishnan Srikant and Yinghui Yang

Insight: If a user searches down a hierarchy to index page B, backtracks, and then ends up at target page T, and T is the first target page looked at in the search, then it seems likely that the user expected to find T under B, and therefore one can suggest that it might be good to add a link from B to T.
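
A rough sketch of this heuristic, assuming sessions are lists of page IDs in visit order and that target (content) pages can already be distinguished from index pages; all names and the exact backtrack test are illustrative simplifications of the paper.

# Count (B, T) pairs where the user backtracked from page B and the first
# target page then reached was T; frequent pairs suggest adding a link B -> T.
from collections import Counter

def expected_link_counts(sessions, is_target):
    """sessions : lists of page IDs in visit order
       is_target: predicate marking target/content pages"""
    counts = Counter()
    for pages in sessions:
        seen = set()
        backtracked_from = None
        for i, p in enumerate(pages):
            if is_target(p):
                if backtracked_from is not None:
                    counts[(backtracked_from, p)] += 1
                break                           # only the first target page counts
            if p in seen:
                backtracked_from = pages[i - 1]   # user returned to an earlier page
            seen.add(p)
    return counts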

Test case: Wharton School of Business web site. The web site has 240 leaf pages. Based on 6 days' worth of server logs with 15,000 visitors and 3,000,000 records (200 records per visitor seems like a large number, but of course that includes embedded image files and other junk), the program suggested new links for 25 pages. Some examples:

Visitors expect to find the answer to "Why choose Wharton?" under the "Student to Student Program's Question and Answer Session" directory instead of under "Student to Student Program's General Description".

Visitors expect to find "MBA Student Profiles" under "Student" instead of "MBA Admission".

Visitors expect to find "Calendar" under "Programs" instead of "WhartonNow".

Visitors expect to find "Concentrations" and "Curriculum" under "Students" instead of "Programs" (less convincing).

The program also made 20 other suggestions.

E-commerce

I couldn't find any good papers on this. But I'd guess the main applications are:

Customer Profiling: You can find out what kind of customers are buying what kind of items if you can get demographic information (which, apparently, you often can get just by asking --- one writer was shocked at how readily online shoppers provided personal information that the company had no business asking.)

Advertisement placement (my own guess). The "referer" field in the server log tells you what advertisements are attracting what kind of business.