Lecture 12: Usage Mining / Focussed Crawling

Clusty: A search engine with clustering.

Usage Mining

Client level

Induce client to use "bugged" browser. Get all the information you want.

Limitations

Any usage collection runs into privacy issues; the more complete the data, the more serious the issue.

Pattern analysis

Statistical analysis

Association rules: Correlations among pages visited in a session.

Clustering:
Clusters of users who view similar sets of pages.
Clusters of pages that are viewed together.

Association rules:
Examples from 1996 Olympics Web site:
Indoor volleyball => Handball (Confidence: 45%)
Badminton, Diving => Table Tennis (Confidence: 59.7%)
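
A minimal sketch of how such rules can be mined from a session log (the thresholds and example page names are illustrative, not from the Olympics study):

from itertools import combinations

def mine_rules(sessions, min_support=0.01, min_confidence=0.4):
    """Mine pairwise association rules A => B from a list of sessions,
    where each session is the set of pages viewed in one visit."""
    n = len(sessions)
    counts = {}          # frozenset of pages -> no. of sessions containing it
    for s in sessions:
        for page in s:
            counts[frozenset([page])] = counts.get(frozenset([page]), 0) + 1
        for a, b in combinations(sorted(s), 2):
            counts[frozenset([a, b])] = counts.get(frozenset([a, b]), 0) + 1
    rules = []
    for pair, c in counts.items():
        if len(pair) == 2 and c / n >= min_support:
            x, y = tuple(pair)
            for lhs, rhs in ((x, y), (y, x)):
                conf = c / counts[frozenset([lhs])]  # conf(A=>B) = supp(A,B)/supp(A)
                if conf >= min_confidence:
                    rules.append((lhs, rhs, conf))
    return rules

# e.g. mine_rules([{"volleyball", "handball"}, {"volleyball"}, {"handball"}])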

Sequential patterns:
Atlanta home page followed by Sneakpeek main page (Support: 9.81%)
Sports main page followed by Schedules main page (Support: 0.42%)

Relate web activities to user profile.
E.g. 30% of users who ordered music are 18-25 and live on the West Coast.

WebWatcher

WebWatcher: A Tour Guide for the World Wide Web, T. Joachims, D. Freitag, and T. Mitchell (1997)

Browser: user specifies an "interest", starts browsing.
WebWatcher highlights links it considers of particular interest.

Learns the function LinkQuality = Prob(user follows Link | Page, Interest)

Learning from previous tours:
Annotate each link with interest of users who followed it, plus anchor.
Find links whose annotation best matches interest of user.
(Qy: Why annotate links rather than pages? Perhaps to achieve directionality)
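
A sketch of the tour-based recommendation, assuming each link's annotation (the interests of past users who followed it, plus its anchor words) is kept as a bag of words, with cosine similarity standing in for the TFIDF match:

import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two bags of words (Counters)."""
    dot = sum(u[w] * v[w] for w in u)
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def recommend(link_annotations, interest, k=3):
    """link_annotations: {url: Counter of annotation words}.
    Return the k links whose annotations best match the user's interest."""
    interest = Counter(interest)
    ranked = sorted(link_annotations,
                    key=lambda u: cosine(link_annotations[u], interest),
                    reverse=True)
    return ranked[:k]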

Learning from hypertext structure
Value of page is TFIDF match from interest to page.
Value of path P1, P2, ..., Pk is the discounted sum:
Value(P1, P2, ..., Pk) = value(P1) + D*value(P2) + D^2*value(P3) + ... + D^(k-1)*value(Pk)
where D < 1. Value of a link is the value of the best path starting at the target of the link.
Dynamic programming algorithm to compute this.
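
A minimal sketch of that computation, assuming the training link graph and the per-page TFIDF values are given; simple value iteration stands in for the dynamic program (it converges since D < 1):

def link_values(graph, page_value, D=0.5, iters=50):
    """graph: {page: [child pages]}; page_value: {page: TFIDF match of interest to page}.
    best[p] converges to the value of the best discounted path starting at p:
        best(p) = value(p) + D * max over children c of best(c)."""
    best = dict(page_value)
    for _ in range(iters):
        for p, children in graph.items():
            follow = max((best.get(c, 0.0) for c in children), default=0.0)
            best[p] = page_value.get(p, 0.0) + D * follow
    # The value of a link is the best-path value at its target.
    return {(p, c): best.get(c, 0.0) for p, cs in graph.items() for c in cs}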

For links on new pages: distance-weighted 3 nearest-neighbor approximator.
That is: We are on page P and deciding between links L1, L2 ... Lk.
Distance between link L1 on P and Lx on Px is
dist(L1,Lx) = TFIDF(anchor(L1),anchor(Lx)) + 2*TFIDF(text(P),text(Px)).
Let Lx, Ly, Lz be closest links to L1.
The quality of L1 for interest I is
qual(L1,I) = TFIDF(Lx,I)/dist(L1,Lx) + TFIDF(Ly,I)/dist(L1,Ly) + TFIDF(Lz,I)/dist(L1,Lz)
Recommend links of highest quality.
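
A sketch that takes the formulas above at face value; tfidf_sim is assumed to be a TFIDF match score, and each link object is assumed to carry its anchor words and the text of the page it sits on:

def knn_link_quality(link, interest, seen_links, tfidf_sim, k=3):
    """Distance-weighted 3-NN estimate of link quality, per the formulas above:
    dist(L,Lx) = tfidf_sim(anchor(L),anchor(Lx)) + 2*tfidf_sim(text(P),text(Px))
    qual(L,I)  = sum over the k nearest Lx of tfidf_sim(Lx,I) / dist(L,Lx)"""
    def dist(lx):
        return (tfidf_sim(link.anchor, lx.anchor)
                + 2 * tfidf_sim(link.page_text, lx.page_text))
    nearest = sorted(seen_links, key=dist)[:k]     # smallest dist first
    return sum(tfidf_sim(lx.anchor, interest) / dist(lx)
               for lx in nearest if dist(lx) > 0)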

Evaluation: Accuracy = percentage of time user followed a recommended link.
Achieved accuracy of 48.9%, as compared to 31.3% for random recommendations.

Focussed Crawling

Why Focussed Crawling?

Aspects of a focussed crawler

Seed set

Collect pages from a search engine, either general or specialized.

Search algorithms

Frontier = Links under consideration that have not been downloaded.
How to choose next link from frontier.

(Note: ideally the search algorithm is separable from the heuristic function, but not always.)

Breadth-first search

Baseline.

Best-first search (BFS)

Maintain entire frontier; choose best element.

Limited memory BFS

Keep only top N elements in frontier; choose best element.

BFSk search (beam search)

Maintain entire frontier. Find K top elements, download all K. Iterate.
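
All of the above fit one loop; a sketch, where score (the heuristic function) and fetch (the downloader, returning a page's text and out-links) are assumed to be supplied:

import heapq

def best_first_crawl(seeds, score, fetch, max_pages=1000, frontier_cap=None, beam=1):
    """frontier_cap=None -> best-first search; frontier_cap=N -> limited-memory BFS;
    beam=K -> BFSk / beam search (download the K best per iteration)."""
    frontier = [(-score(u), u) for u in seeds]     # max-heap via negated scores
    heapq.heapify(frontier)
    visited, results = set(), []
    while frontier and len(results) < max_pages:
        batch = [heapq.heappop(frontier) for _ in range(min(beam, len(frontier)))]
        for _, url in batch:
            if url in visited:
                continue
            visited.add(url)
            page, links = fetch(url)
            results.append((url, page))
            for link in links:
                if link not in visited:
                    heapq.heappush(frontier, (-score(link), link))
        if frontier_cap is not None and len(frontier) > frontier_cap:
            frontier = heapq.nsmallest(frontier_cap, frontier)   # keep the N best
            heapq.heapify(frontier)
    return results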

Parallel versions of the above

Each process starts from different set of seeds, keeps its own queue.
Check to avoid overlap.

Fish-search algorithm

Metaphor: School of fish. If food (relevant info) is found, reproduce and continue looking. If no food (no relevant info) or polluted water (poor bandwidth), die. Search only to fixed depth from seed.
fishSearch(in: seed-pages, width, max-depth, size, query)
   return: set of relevant pages

{ queue := seed-pages; relevant := {};
  repeat {
    P := pop(queue);
    R := value(P,query);                 /* relevance: 0 or 1 */
    if (R == 1) add P to relevant;
    if (depth(P) < max-depth) {
      if (R == 0) {
        for the first width children C of P do
          C.potential := 0.5;
        for the rest of the children C of P do
          C.potential := 0;
      } else {
        for the first 1.5*width children C of P do
          C.potential := 1.0;
        for the rest of the children C of P do
          C.potential := 0;
      }
      for each child C of P do {
        C.score := (C is new) ? C.potential : max(C.score, C.potential);
        insert C in queue according to C.score;
      }
    }
  } until (queue is empty or number of processed nodes > size);
  return relevant;
}

Shark-search algorithm

The shark-search algorithm -- An application: tailored Web site mapping, Michael Hersovici et al. (1998)

Modifies fish-search in two ways: relevance scores become continuous rather than binary, and anchor text is exploited. In detail, for a child C of page P:

1. if relevance(P) > 0
      then C.inherited_score := d*sim(P,query)
      else C.inherited_score := d*P.inherited_score;
2. a := sim(anchor,query);
3. if (a > 0)
      then C.anchor_score := b*a + (1-b)
      else C.anchor_score := b*a + (1-b)*sim(anchor_neighborhood(C,P),query);
4. C.potential_score := g*C.inherited_score + (1-g)*C.anchor_score
d, b, and g are tunable parameters. So is "anchor_neighborhood".
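
A sketch of the combined scoring, assuming the parent page carries its text and its own inherited score, and that sim returns similarities in [0,1]; the default parameter values are illustrative only:

def potential_score(child_anchor, anchor_neighborhood, parent, query, sim,
                    d=0.5, b=0.8, g=0.7):
    """Shark-search potential score for a child link, per steps 1-4 above."""
    # 1. Inherited score: decayed relevance of P, or decayed inheritance if P is irrelevant.
    r = sim(parent.text, query)
    inherited = d * r if r > 0 else d * parent.inherited_score
    # 2-3. Anchor score: the anchor alone if it matches, else mix in its neighborhood.
    a = sim(child_anchor, query)
    if a > 0:
        anchor_score = b * a + (1 - b)
    else:
        anchor_score = b * a + (1 - b) * sim(anchor_neighborhood, query)
    # 4. Combine.
    return g * inherited + (1 - g) * anchor_score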

Info-spiders

Collection of agents (spiders).
Agent state = vector of keywords
At each iteration, a randomly chosen agent acts: it estimates the value of the out-links of its current page from its keyword vector, follows one, and gains or loses energy according to the relevance of the resulting page; agents with enough energy reproduce, and agents that run out of energy die.
Note: An agent learns clues from the anchor text, which it passes on to its offspring. The population of agents learns clues from the page text.

Intelligent browser = human-driven crawler

Human in the loop. Automatically crawl a little ahead, suggest pages of possible interest.

Heuristic function

Link L from P to C. (P has been seen, C has not). Query Q.
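
One simple instantiation (a sketch only; the weighting and the choice of similarity function are assumptions): score the link by how well Q matches the anchor text and the text of P, since C's own text is unavailable until C is downloaded.

def link_score(query, page_text, anchor_text, sim, w_anchor=0.7):
    """Heuristic value of a link from seen page P to unseen page C.
    sim is any text-similarity measure, e.g. TFIDF cosine; w_anchor is a guess."""
    return w_anchor * sim(query, anchor_text) + (1 - w_anchor) * sim(query, page_text)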

Tunable parameters

The theory of tuning parameters is pretty much the same whatever the application. Depending on how you count, there are however-many hundreds of statistical / machine-learning / interpolation techniques for doing this. It very often turns out that the values of the parameters hardly matter, within a pretty large range.

Learning

The most principled study I have found for focussed crawling is Using Reinforcement Learning to Spider the Web Efficiently, Jason Rennie and Andrew McCallum (1999).

Training set = some connected region of Web.

If link L leads in K steps to page P, then the discounted value of P for L is D^K * relevance(P,Q), for D < 1.

Overall value of link L = sum of discounted values of pages reached from L for optimal policy. Calculable exactly over training set by dynamic programming algorithm.

Learn: function from words in text, anchor to value of link.
Method: Naive Bayes.
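
A sketch of the learning step in this spirit: discretize the discounted values into bins and train a naive Bayes classifier from a link's words (anchor plus nearby text) to a bin. The bin boundaries and smoothing are illustrative assumptions:

import math
from collections import Counter, defaultdict

def train_link_classifier(examples, bins=(0.1, 0.5)):
    """examples: list of (words, value) pairs, where value is the discounted
    value computed by the dynamic program. Returns a scoring function that
    maps a new link's words to its most probable value-bin."""
    def bin_of(v):
        return sum(v > b for b in bins)            # bin index 0, 1, or 2
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, value in examples:
        c = bin_of(value)
        class_counts[c] += 1
        word_counts[c].update(words)
        vocab.update(words)
    def score(words):
        best, best_lp = None, -math.inf
        for c in class_counts:
            total = sum(word_counts[c].values())
            lp = math.log(class_counts[c])
            for w in words:                        # Laplace-smoothed likelihoods
                lp += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
            if lp > best_lp:
                best, best_lp = c, lp
        return best
    return score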

Terminating search

Evaluation

Different from ordinary retrieval:

Case Studies

A General Evaluation Framework for Topical Crawlers

Padmini Srinivasan, Filippo Menczer, and Gautam Pant

Find set of target pages from DMOZ. Follow back-links K steps. Try to find target pages.

Obviously not a good measure of absolute quality, but perhaps an interesting comparative test.
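
A sketch of the testbed construction, assuming a backlinks(url) helper (e.g. a query to a link server; hypothetical):

def make_testbed(targets, backlinks, K):
    """Walk back-links K steps from the DMOZ target pages; the pages reached
    become the crawler's seeds, and the crawler is scored on how many of the
    original targets it re-finds."""
    seeds = set(targets)
    for _ in range(K):
        seeds = {p for u in seeds for p in backlinks(u)}
    return seeds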

Intelligent Crawling on the World Wide Web with Arbitrary Predicates

Aggarwal, Al-Garawi and Yu, 2001

Information used: Content, Sibling, URL, anchor.
Learning from successes during crawl.
Probabilistic model of learning.

N = number of pages crawled. Harvest rate = fraction of the pages crawled that satisfy the predicate.

Seed = www.yahoo.com. Predicate = "TRAVEL related sites containing keyword Paris."
Harvest rate increases from 0.1 at start to 0.2 at N = 2500.

Seed = www.ebay.com. Predicate = "AUTOMOTIVE related sites containing keywords Toyota SUV."
Harvest rate starts at 0.17, rises to peak of 0.42 at N=400, declines to 0.35 at N = 2000.

Seed = www.amazon.com. Predicate = category "SPORTS". Harvest rate starts at 0.2, rises to 0.5 at N=2500.

Seed = www.ucla.com. Predicate = category "SPORTS". Harvest rate starts at 0.05, rises to 0.85 at N=2000.
Very similar results with seed = salon.com, usatoday.com and yahoo.com/ARTS

Composite information source better than any single info source.

Web Crawling Agents for Retrieving Biomedical Information

P. Srinivasan et al. 2002

Task: given a gene, collect information about associated diseases.

Experimental method:
Use Web to collect all information you can about gene.
Find set of keywords from this data source, other than disease.
Find subset of keywords for query that retrieves at least 5 pages as seed set.
Test a number of web crawling algorithms, using keyword match as heuristic function.

Evaluation:
1. Collect relevant target pages from MedlinePlus. Therefore, exclude MedlinePlus from seed set.
2. Hand construct information page describing relation of gene to disease.
Measure similarity of retrieved page to this information page.

Results: Quite poor. After crawling 1000 pages, the average similarity was between 0.0145 and 0.017, and was declining if anything. The recall of target pages was between 0.04 and 0.06.

Problems:

Bottom Line

Recall

Obviously, if someone has stuck a relevant page in some odd corner of the web, then a focussed crawler is less likely to find it than Google. Where the focussed crawler wins, if at all, has got to be on pages reasonably near an obvious seed page that Google has failed to index. As we've seen, this can often happen with large web sites with a database interface. Alternatively, it can win on pages that a sophisticated, topic-specific algorithm recognizes as relevant but Google's algorithm does not.