Lecture 9: Web Mining

Survey Papers

Web Mining Research: A Survey Raymond Kosala, Hendrik Blockeel

Web Mining: Information and Pattern Discovery on the World Wide Web R. Cooley, B. Mobasher, and J. Srivastava

Data mining for hypertext: A tutorial survey Soumen Chakrabarti

Research Issues in Web Data Mining Sanjay Madria et al.

Web Usage Mining: Discovery and Applications of Usage Patterns from Web Data J. Srivastava et al.

Data Categories

Steps of data mining



Learning to Extract Symbolic Knowledge from the World Wide Web Mark Craven et al. (AAAI-98)
Relational Learning with Statistical Predicate Invention Mark Craven and Sean Slattery

Given: Domain ontology. Training set of Web pages classified in ontology.
Learns: To classify new pages, add new individuals to ontology.


Classes: Persons, Students, Professors, Courses, Departments, Universities.
Individuals: Mark Jones, Sheila Quinlan, Operating Systems I.
Relations: Teaches(P,C); Advisor(P,S); ...


Training set: 8000 Web pages for individuals, 1400 Web page pairs for relations, from four other universities.
Test set: 2722 Web pages at CMU.
Precision: 73%. Didn't measure recall.


Methods for recognizing class instances

Naive Bayes based on bag of words, combining text, title, and anchor.
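Prob(C | W1 ... Wk) = Prob(W1 ... Wk | C) Prob(C) / Z (Bayes' Law)
= Prob(W1 | C) ... Prob(Wk | C) Prob(C) / Z (assuming the words are independent given C),
where Z = Prob(W1 ... Wk) is the same for every class C.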
Choose C that maximizes.
Smooth for words that don't appear with class.
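
A minimal sketch of the classifier above, assuming add-one (Laplace) smoothing; the notes do not say which smoothing the authors used. The function names and the four toy documents are made up for illustration. Z is dropped because it is the same for every class, and log probabilities are summed to avoid underflow.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(docs):
    """docs: list of (list_of_words, class_label) pairs."""
    class_counts = Counter()            # number of documents per class
    word_counts = defaultdict(Counter)  # word frequencies per class
    vocab = set()
    for words, label in docs:
        class_counts[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    return class_counts, word_counts, vocab

def classify(model, words):
    """Return the class C maximizing Prob(W1|C) ... Prob(Wk|C) Prob(C)."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)            # log Prob(C)
        total_words = sum(word_counts[label].values())
        for w in words:
            # Add-one smoothing (an assumption): unseen words get a small
            # nonzero probability instead of zeroing out the product.
            count = word_counts[label][w]
            score += math.log((count + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy training set, not data from the paper.
toy_docs = [
    ("my home page am student".split(), "student"),
    ("me my page here university".split(), "student"),
    ("course homework assignments due class".split(), "course"),
    ("course hours assignment homework".split(), "course"),
]
model = train_naive_bayes(toy_docs)
prediction = classify(model, "homework due my class".split())
```

Here "my" pulls toward Student, but "homework", "due", and "class" outweigh it, so the page is labelled Course.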

Most significant words:
Student: my, page, home, am, university, computer, science, me, at, here
Faculty: DDDD (digit), of, and, professor, computer, research, science, university, DDD, systems
Course: course, DD:DD, homework, will, D, assignments, class, hours, assignment, due.
Research project: group, project, research, of, laboratory, systems, and, our, system, projects.
Department: department, science, computer, faculty, information, undergraduate, graduate, staff, server, courses
(Note: stopwords are significant in this setting.)

Hyperlinks or title can give better information than text, depending on the category of page.
E.g. At 100% recall, "Department" predictions from text have only 9% accuracy; prediction from hyperlinks has 57% accuracy; better at every level of recall.
(Thresholding rather than choosing the best category.)
Part of the reason: There are only 4 department home pages, but many links to department home pages, hence much more data.
The title/heading curves for both "Faculty" and "Research Project" are better than full text at low levels of recall (high precision). (I find the result for "Faculty" surprising, since in most of the examples I checked the title/heading was just [name] or [name]'s home page.)
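
The recall levels above come from thresholding a classifier's score rather than picking the single best category. A small sketch of how one point on such a precision/recall curve could be computed; the function name and the scores are invented for illustration.

```python
def precision_recall(scored, threshold):
    """scored: list of (score, is_positive) pairs.
    A page is predicted to be in the class when score >= threshold."""
    predicted = [is_pos for score, is_pos in scored if score >= threshold]
    true_pos = sum(predicted)
    total_pos = sum(is_pos for _, is_pos in scored)
    # Convention (an assumption): precision is 1.0 when nothing is predicted.
    precision = true_pos / len(predicted) if predicted else 1.0
    recall = true_pos / total_pos if total_pos else 0.0
    return precision, recall

# Hypothetical classifier scores for six pages; True marks a real class member.
scored = [(0.9, True), (0.8, False), (0.7, True),
          (0.4, False), (0.3, True), (0.1, False)]

strict = precision_recall(scored, 0.85)  # few predictions: high precision, low recall
loose = precision_recall(scored, 0.30)   # many predictions: full recall, lower precision
```

Lowering the threshold moves along the curve toward 100% recall, which is the operating point quoted for "Department" above.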

First-Order Text Classification

Learn Prolog-style rules to characterize pages.

Base primitives:

Sample rules learned
faculty(A) :- has_professor(A), has_ph(A), link_to(B,A), has_faculti(B).

course(A) :- has_instructor(A), not(has_good(A)), link_to(A,B),
             not(link_to(B,_)), has_assign(B).

student(A) :- not(has_data(A)), not(has_comment(A)), link_to(B,A), 
              has_jame(B), has_paul(B), not(has_mail(B)).
High precision, low recall.

FOIL algorithm (sketch)

RULES := empty;
repeat {
  RULE := "Category :- ."  /* Empty body: everything satisfies category */
  loop {
    CONDITION := condition that, when added to r.h.s., most effectively
                 excludes negative examples while keeping positive examples;
    RULE1 := add CONDITION to r.h.s. of RULE;
    if (score(RULE1) > score(RULE))
      then RULE := RULE1
      else exitloop
  }
  add RULE to RULES
}
until (all positive examples are covered)
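
The loop structure of the sketch can be tried out in a much simplified propositional form. This is not FOIL itself: there are no variables or link_to literals, conditions are only "page contains word w", and score is plain precision rather than FOIL's information gain. All names and the toy pages are hypothetical.

```python
def precision(rule, positives, negatives):
    """Fraction of the examples matched by `rule` that are positive."""
    pos = sum(1 for ex in positives if rule <= ex)
    neg = sum(1 for ex in negatives if rule <= ex)
    return pos / (pos + neg) if pos + neg else 0.0

def learn_rules(positives, negatives):
    """Each example is a set of words; each rule is a frozenset of required words."""
    rules = []
    uncovered = list(positives)
    features = set().union(*positives, *negatives)
    while uncovered:                      # until all positives are covered
        rule = frozenset()                # empty body: everything satisfies it
        while True:
            candidates = [rule | {f} for f in sorted(features - rule)]
            # Only consider conditions that keep at least one uncovered positive.
            candidates = [c for c in candidates
                          if any(c <= ex for ex in uncovered)]
            if not candidates:
                break
            best = max(candidates,
                       key=lambda c: (precision(c, uncovered, negatives),
                                      sum(c <= ex for ex in uncovered)))
            if precision(best, uncovered, negatives) > precision(rule, uncovered, negatives):
                rule = best
            else:
                break                     # exitloop: no condition improves the score
        if not rule:
            break                         # nothing beats the empty rule; give up
        rules.append(rule)
        uncovered = [ex for ex in uncovered if not rule <= ex]
    return rules

# Hypothetical bags of words for faculty (positive) and other (negative) pages.
positives = [{"professor", "research", "university"},
             {"professor", "ph", "research"},
             {"faculty", "research", "systems"}]
negatives = [{"my", "home", "university"},
             {"course", "homework", "research"}]
rules = learn_rules(positives, negatives)
```

On this toy data the first rule is "has professor" (covers two positives, no negatives), and a second rule "has faculty" picks up the remaining positive.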

Multi-page sites

Task: Group multi-page sites together, identify main page.
Solution: Based on URL.
[path]/index.html or [path]/home.html or [path]/homepage.html or [path]/cs???.html or [path]/[name]/[name].html
is the main page for all pages of form [path]/file or [path]/[subpath]/file.

Substantially improves precision, can lower maximal recall.
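
A rough sketch of the URL heuristic using regular expressions. Two interpretations here are assumptions: each "?" in cs???.html is read as any single character, and for [path]/[name]/[name].html the site is taken to be everything under [path]/[name]/. The function names and example URLs are made up.

```python
import re

# [path]/index.html, [path]/home.html, [path]/homepage.html, [path]/cs???.html
FIXED_NAME = re.compile(r"^(?P<root>.*)/(?:index|home|homepage|cs[^/.]{3})\.html$")
# [path]/[name]/[name].html
REPEATED_NAME = re.compile(r"^(?P<root>.*/(?P<name>[^/]+))/(?P=name)\.html$")

def site_root(url):
    """Return the directory prefix a main page stands for, or None."""
    for pattern in (FIXED_NAME, REPEATED_NAME):
        m = pattern.match(url)
        if m:
            return m.group("root") + "/"
    return None

def group_by_main_page(urls):
    """Map each main page to the other pages found under its directory."""
    roots = {site_root(u): u for u in urls if site_root(u)}
    groups = {main: [] for main in roots.values()}
    for u in urls:
        matching = [r for r in roots if u.startswith(r) and u != roots[r]]
        if matching:
            # Prefer the most specific (longest) enclosing site.
            groups[roots[max(matching, key=len)]].append(u)
    return groups

# Hypothetical URLs for illustration only.
urls = [
    "http://u.example/~jones/index.html",
    "http://u.example/~jones/pubs/paper1.html",
    "http://u.example/~jones/cv.html",
    "http://u.example/courses/os/os.html",
    "http://u.example/courses/os/hw1.html",
    "http://u.example/other/page.html",
]
groups = group_by_main_page(urls)
```

Pages such as other/page.html that fall under no recognized main page stay ungrouped.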

Learning relations

Learn rules for relations. Variant of FOIL.

Base relations

Sample rules:
instructors_of(A,B) :- course(A), person(B), link_to(_,B,A).
members_of_project(A,B) :- research_project(A), person(B), link_to(C,A,D),
       link_to(E,D,B), neighborhood_word_people(E).
department_of(A,B) :- person(A), department(B), link_to(C,D,A), link_to(E,F,D),
    link_to(G,B,F), neighborhood_word_graduate(E)
/* Path from department to F to D to person, link in F to D is near word
   "graduate", presumably D is a directory of graduate students. */
Members_of_project is more than 70% accurate at recall = 46%.
Instructors_of is 80% accurate at recall = 66%.
Department_of is 97% accurate at recall = 84%.

Extracting Text Fields

e.g. name, address, course_name, etc.

Learn rules: Another variant of FOIL. Powerful but complicated pattern matching predicates.

Combining Content, Anchor, and Link Information

Improving A Page Classifier with Anchor Extraction and Link Analysis William W. Cohen


Trying to learn a categorization I.C. For each instance I you have two attributes I.A and I.B, each of which is separately predictive of category C, whose errors are independent. You have a small corpus of labelled instances L, and a large corpus of unlabelled instances U.

One-step co-training algorithm:

1. Run a learning algorithm over L to derive a rule f(I.A) that predicts
     I.C from I.A.
2. Use f(I.A) to label all the instances in U. 
3. Run a learning algorithm over L union U to derive a rule g(I.B) that
   predicts f(I.A) from I.B
4. Use g(I.B) as your final categorization.
f(I.A) is an approximation to I.C. Moreover, it is an approximation whose errors are random relative to I.B. That is, the relation between B and f is the same as the relation between B and C plus random noise. Therefore, if U is large enough, the learning algorithm in step 3 should learn to ignore this random noise, so learning to predict f from I.B becomes the same as learning C over the large set U.

(The learning algorithm in 3 can be the same or different from the learning algorithm in 1.)
(Co-training is usually run iteratively; once you have found g, you recompute f over U, and so on.)
(Actual implementation here used a more complicated variant of this, to deal with the fact that U was not actually very large.)
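
The four steps of the one-step algorithm can be sketched as follows. The learner here is a deliberately trivial stand-in (a lookup table from attribute value to majority label), not the learner used in the paper; the function names and the toy instances are hypothetical.

```python
from collections import Counter, defaultdict

def learn_lookup(pairs):
    """Stand-in learner: map each attribute value to its most common label."""
    votes = defaultdict(Counter)
    for value, label in pairs:
        votes[value][label] += 1
    overall = Counter(label for _, label in pairs).most_common(1)[0][0]
    table = {v: c.most_common(1)[0][0] for v, c in votes.items()}
    # Unseen attribute values fall back to the overall majority label.
    return lambda value: table.get(value, overall)

def one_step_cotrain(labelled, unlabelled):
    """labelled: list of (a, b, c) triples; unlabelled: list of (a, b) pairs."""
    f = learn_lookup([(a, c) for a, b, c in labelled])           # step 1
    pseudo = [(a, b, f(a)) for a, b in unlabelled]               # step 2
    g = learn_lookup([(b, c) for a, b, c in labelled + pseudo])  # step 3
    return g                                                     # step 4

# Toy instances: a = a word cue on the page, b = kind of hub pointing to it,
# c = is the page an executive biography?
L = [("bio", "exec_list", True), ("news", "press_list", False)]
U = [("bio", "team_list"), ("bio", "team_list"),
     ("news", "press_list"), ("news", "archive_list")]
g = one_step_cotrain(L, U)
```

The value "team_list" never appears in L, yet g labels it as a biography, because f carried the label from the words across to the hub structure.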


To identify "executive biographies" in a collection of web pages from company web sites.

I.C == Is I an executive biography or not?
I.A == Words in I
I.B == Internal structure of hub pages pointing to I.
|L| = about 88
|U| = about 790


C can be predicted just from A with accuracy of 91.6%.
Co-training raises accuracy to 96.4% (that is, it cuts the error rate from 8.4% to 3.6%, more than half).

Usage Mining

Data Preparation for Mining World Wide Web Browsing Patterns Cooley, Mobasher, and Srivastava

Data sources: Server level, client level, [proxy level].

Server level


Difficulties and limitations

Can solve using cookies

Client level

Induce client to use "bugged" browser. Get all the information you want.


Any usage collection runs into privacy issues; the more complete the data, the more serious the issue.

Pattern analysis

Statistical analysis

Association rules: Correlations among pages visited in a session.

Clusters of users who view similar sets of pages.
Clusters of pages that are viewed together.

Association rules:
Examples from 1996 Olympics Web site:
Indoor volleyball => Handball (Confidence: 45%)
Badminton, Diving => Table Tennis (Confidence: 59.7%)
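
For reference, the confidence of a rule X => Y is the fraction of sessions containing X that also contain Y, and support is the fraction of all sessions containing the given pages. A small sketch, with invented sessions rather than the Olympics data:

```python
def support(sessions, pages):
    """Fraction of sessions that contain every page in `pages`."""
    pages = set(pages)
    return sum(1 for s in sessions if pages <= s) / len(sessions)

def confidence(sessions, antecedent, consequent):
    """Of the sessions containing `antecedent`, the fraction also containing `consequent`."""
    both = support(sessions, set(antecedent) | set(consequent))
    return both / support(sessions, antecedent)

# Hypothetical sessions: each is the set of pages visited.
sessions = [
    {"volleyball", "handball"},
    {"volleyball"},
    {"volleyball", "handball", "diving"},
    {"diving"},
    {"volleyball", "badminton"},
]
conf_rule = confidence(sessions, {"volleyball"}, {"handball"})
supp_rule = support(sessions, {"volleyball", "handball"})
```

Four of the five sessions visit volleyball and two of those also visit handball, so the rule volleyball => handball has confidence 50% and support 40%.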

Sequential patterns:
Atlanta home page followed by Sneakpeek main page (Support: 9.81%)
Sports main page followed by Schedules main page (Support: 0.42%)

Relate web activities to user profile.
E.g. 30% of users who ordered music are 18-25 and live on the West Coast.


WebWatcher: A Tour Guide for the World Wide Web T. Joachims, D. Freitag, and T. Mitchell (1997)

Browser. User specifies "interest", starts browsing
WebWatcher highlights links it considers of particular interest.

Learns function LinkQuality = Prob(Link | Page, Interest)

Learning from previous tours:
Annotate each link with interest of users who followed it, plus anchor.
Find links whose annotation best matches interest of user.
(Qy: Why annotate links rather than pages? Perhaps to achieve directionality)

Learning from hypertext structure
Value of page is TFIDF match from interest to page.
Value of path P1, P2, ... is discounted sum:
Value(P1,P2 ...Pk) = value(P1) + D*value(P2) + D^2*value(P3) + ...
where D < 1. Value of link is the value of best path starting at target of link.
Dynamic programming algorithm to compute this.
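
One way the dynamic programming could look, as a sketch: compute the best path value for paths of at most a fixed number of pages, extending by one page per pass. The fixed horizon, the assumption that page values are non-negative, and all names and numbers are mine, not from the paper.

```python
def best_path_values(page_value, links, discount=0.5, horizon=10):
    """best[p] = max over paths P1 = p, P2, ... of at most `horizon` pages of
    value(P1) + D*value(P2) + D^2*value(P3) + ...
    Assumes page values are non-negative, so extending a path never hurts."""
    best = dict(page_value)                      # paths consisting of one page
    for _ in range(horizon - 1):
        best = {
            p: page_value[p] + discount * max(
                (best[q] for q in links.get(p, [])), default=0.0)
            for p in page_value
        }
    return best

# Hypothetical match scores of each page against the user's interest.
page_value = {"A": 0.0, "B": 0.2, "C": 0.0, "D": 0.8}
links = {"A": ["B", "C"], "C": ["D"]}            # A links to B and C; C links to D

best = best_path_values(page_value, links)
# Value of each link on page A = best path value starting at the link's target.
link_value = {target: best[target] for target in links["A"]}
```

The link to C scores higher than the link to B even though C itself does not match the interest, because C leads on to the well-matching page D.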

For links on new pages: distance-weighted 3 nearest-neighbor approximator.
That is: We are on page P and deciding between links L1, L2 ... Lk.
Distance between link L1 on P and Lx on Px is
dist(L1,Lx) = TFIDF(anchor(L1),anchor(Lx)) + 2*TFIDF(text(P),text(Px)).
Let Lx, Ly, Lz be closest links to L1.
The quality of L1 for interest I is
qual(L1,I) = TFIDF(Lx,I)/dist(L1,Lx) + TFIDF(Ly,I)/dist(L1,Ly) + TFIDF(Lz,I)/dist(L1,Lz)
Recommend links of highest quality.

Evaluation: Accuracy = percentage of time user followed a recommended link.
Achieved accuracy of 48.9% as compared to 31.3% for random recommendations.