Mining the Web to Create Minority Language Corpora

Rayid Ghani, Rosie Jones, and Dunja Mladenic.

This is relevance feedback rather than web mining proper.

Task: To create a corpus of Web documents in a minority language, e.g. Slovenian.
This is difficult because Slovenian is very similar to Czech, Slovak and Croatian, and there are certainly more Czech and Croatian documents on the Web.

{ Initialize: Either sets of relevant and non-relevant documents,
     or sets of words in Slovenian vs. stop words in English.
  repeat {  Generate and submit query to Altavista;
            retrieve most relevant unseen doc;
            Language filter: Mark doc as Slovenian or not;
            update frequencies and scores;
         }
}

Query: Conjunction of K positive and K negative terms.
Term selection criteria: (switch "relevant" and "irrelevant" for negative terms)

Language filter: These exist for many languages, based on N-gram models. 90-95% accurate.
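As a rough illustration of what an N-gram language filter does, here is a minimal sketch. It assumes character trigrams with add-one smoothing and a tiny hand-made training set; the function names and the sample sentences are hypothetical, and this is not the filter used in the paper.

```python
import math
from collections import Counter

def char_ngrams(text, n=3):
    """Overlapping character n-grams of a lowercased, space-padded text."""
    padded = " " + " ".join(text.lower().split()) + " "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def train_profile(samples, n=3):
    """Count n-grams over the training samples of one language."""
    counts = Counter()
    for sample in samples:
        counts.update(char_ngrams(sample, n))
    return counts

def log_likelihood(text, profile, n=3):
    """Add-one smoothed log-probability of the text's n-grams under a profile."""
    total = sum(profile.values())
    vocab = len(profile) + 1  # one extra slot for unseen n-grams
    return sum(math.log((profile[g] + 1) / (total + vocab))
               for g in char_ngrams(text, n))

def classify(text, profiles, n=3):
    """Pick the language whose profile makes the text most likely."""
    return max(profiles, key=lambda lang: log_likelihood(text, profiles[lang], n))

# Toy training data (an assumption; a real filter needs a sizeable corpus per language).
profiles = {
    "en": train_profile(["the quick brown fox jumps over the lazy dog",
                         "this is the english language"]),
    "sl": train_profile(["to je slovenski jezik", "dober dan kako ste danes"]),
}
```

Note that the profiles have to be trained on text already known to be in each language.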

Note: Since the irrelevant documents found are those that are retrieved by query, they will mostly be in closely related languages, and the negative query words will be chosen to distinguish the target language from its near relations.
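The query construction can be sketched as follows, using the odds ratio as the term selection criterion. This is a minimal sketch under several assumptions: documents are represented as sets of words, probabilities are smoothed document frequencies, and the query uses "+term" and "-term" to require and exclude words. The function names are hypothetical.

```python
import math

def odds_ratio(term, relevant_docs, irrelevant_docs):
    """Log odds ratio of a term, from smoothed document frequencies."""
    def p(docs):
        hits = sum(1 for d in docs if term in d)
        return (hits + 0.5) / (len(docs) + 1.0)  # keeps p strictly inside (0, 1)
    p_rel, p_irr = p(relevant_docs), p(irrelevant_docs)
    return math.log(p_rel * (1 - p_irr) / ((1 - p_rel) * p_irr))

def build_query(relevant_docs, irrelevant_docs, k=3):
    """Conjunction of the k best positive and the k best negative terms.

    Switching "relevant" and "irrelevant" negates the log odds ratio, so the
    best negative terms are simply the lowest-ranked terms.
    Assumes the vocabulary holds at least 2*k distinct terms.
    """
    vocab = set().union(*relevant_docs, *irrelevant_docs)
    ranked = sorted(vocab,
                    key=lambda t: (-odds_ratio(t, relevant_docs, irrelevant_docs), t))
    positive, negative = ranked[:k], ranked[-k:]
    return " ".join(["+" + t for t in positive] + ["-" + t for t in negative])

# Toy example: two "Slovenian" documents and two near-relation documents.
relevant = [{"in", "je", "ki"}, {"je", "za", "in"}]
irrelevant = [{"je", "i", "u"}, {"i", "se", "u"}]
query = build_query(relevant, irrelevant, k=1)
```

Because the irrelevant set is filled by documents that earlier queries retrieved, the negative terms end up being words typical of the closely related languages.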

Best results for Odds ratio with K=3.

Language                                  Slovenian  Croatian  Czech  Tagalog
# of relevant docs out of 1000 docs          835        677     743     664
# of relevant docs in 1000 queries          1409         93     680     236

Does better than other techniques, such as looking for pages with a .si domain.

The whole thing is a bit of a cheat, since it relies critically on the language filter, which was certainly built from a corpus of documents in the language.

Focussed Crawling

Why Focussed Crawling?

Aspects of a focussed crawler

Seed set

Collect pages from search engine: general or specialized.

Search algorithms

Frontier = links under consideration that have not been downloaded.
The search algorithm determines how to choose the next link from the frontier.

(Note: ideally the search algorithm is separable from the heuristic function, but not always.)

Breadth-first search


Best-first search (BFS)

Maintain entire frontier; choose best element.

Limited memory BFS

Keep only top N elements in frontier; choose best element.

BFSk search (beam search)

Maintain entire frontier. Find K top elements, download all K. Iterate.
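The three frontier policies above differ only in how links are kept and removed, which a small sketch can make concrete. The class and method names are hypothetical, and the scores are assumed to come from some heuristic function supplied elsewhere. A dictionary with linear scans is used for clarity rather than speed.

```python
import heapq

class Frontier:
    """Unvisited links with heuristic scores (higher score = more promising)."""

    def __init__(self, max_size=None):
        self.max_size = max_size  # None keeps the entire frontier
        self.scores = {}          # url -> best score seen so far

    def add(self, url, score):
        self.scores[url] = max(score, self.scores.get(url, score))
        # Limited-memory variant: retain only the top max_size links.
        if self.max_size is not None and len(self.scores) > self.max_size:
            top = heapq.nlargest(self.max_size, self.scores.items(),
                                 key=lambda kv: kv[1])
            self.scores = dict(top)

    def pop_best(self):
        """Best-first search: remove and return the single best link."""
        url = max(self.scores, key=self.scores.get)
        del self.scores[url]
        return url

    def pop_top_k(self, k):
        """BFSk: remove and return the k best links, best first."""
        top = heapq.nlargest(k, self.scores, key=self.scores.get)
        for url in top:
            del self.scores[url]
        return top

    def __len__(self):
        return len(self.scores)
```

A crawler would call pop_best (or pop_top_k), download the returned pages, score their out-links, and add those links back. pop_best raises ValueError on an empty frontier, so the caller should check len first.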

Parallel versions of the above

Each process starts from different set of seeds, keeps its own queue.
Check to avoid overlap.

Fish-search algorithm

Metaphor: School of fish. If food (relevant info) is found, reproduce and continue looking. If no food (no relevant info) or polluted water (poor bandwidth), die. Search only to fixed depth from seed.
fishSearch(in : seed_pages, width, max_depth, size, query)
   return: set of pages:

{ queue := seed_pages;
  repeat { P := pop(queue);
           if (depth(P) < max_depth) {
             R := value(P,query) /* 0 or 1 */
             if (R == 0) {
                for the first width children C of P do
                     C.potential := 0.5
                for the rest of the children C of P do
                     C.potential := 0 }
             else {
                for the first 1.5*width children C of P do
                     C.potential := 1.0
                for the rest of the children C of P do
                     C.potential := 0 }
             for each child C of P do {
                  C.score := (C is new) ? C.potential : max(C.score, C.potential)
                  insert C in queue according to C.score }
           }
         }
  until (queue is empty or number of processed nodes > size);
  return (set of relevant pages found)
}
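A runnable version of the pseudocode may help. This sketch runs over an in-memory link graph rather than the live Web, and it reads the second group of assignments (potential 1.0 for the first 1.5*width children) as the branch for a relevant page. The names links and is_relevant are hypothetical stand-ins for page fetching and for value(P,query). The pseudocode does not say when a page is recorded as relevant, so the sketch checks every page it pops.

```python
def fish_search(seeds, links, is_relevant, width, max_depth, size):
    """Fish-search over a dict graph; returns relevant pages in visit order."""
    queue = [(1.0, s) for s in seeds]   # (score, page), kept sorted best first
    depth = {s: 0 for s in seeds}
    score = {s: 1.0 for s in seeds}
    processed, relevant = set(), []
    while queue and len(processed) < size:
        _, page = queue.pop(0)
        if page in processed:
            continue
        processed.add(page)
        hit = is_relevant(page)
        if hit:
            relevant.append(page)
        if depth[page] >= max_depth:
            continue                    # too deep: do not expand this page
        cutoff = int(1.5 * width) if hit else width
        for i, child in enumerate(links.get(page, [])):
            if child in processed:
                continue
            potential = (1.0 if hit else 0.5) if i < cutoff else 0.0
            # A new child gets its potential; a known child keeps the larger score.
            score[child] = max(score.get(child, 0.0), potential)
            depth.setdefault(child, depth[page] + 1)
            queue = [(s, p) for s, p in queue if p != child]
            queue.append((score[child], child))
        queue.sort(key=lambda sp: -sp[0])  # stable sort keeps insertion order on ties
    return relevant

# Toy graph (hypothetical page names).
toy_links = {"seed": ["a", "b", "c"], "a": ["a1"], "b": [], "c": ["c1"]}
toy_relevant = {"a", "a1", "c1"}
found = fish_search(["seed"], toy_links, lambda p: p in toy_relevant,
                    width=1, max_depth=2, size=10)
```

In the toy run, the child of the relevant page "a" jumps ahead of the zero-potential siblings "b" and "c", which is the "reproduce where there is food" behaviour of the metaphor.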

Shark-search algorithm

The shark-search algorithm -- An application: tailored Web site mapping, Michael Hersovici et al.

Modifies fish-search in two ways:

1. if  relevance(P) > 0
      then C.inherited_score := d*sim(P,query)
      else C.inherited_score := d*P.inherited_score;
2. a := sim(anchor,query)
3. if (a > 0)
     then C.anchor_score := b*a + (1-b)
     else C.anchor_score := b*a + (1-b)*sim(anchor_neighborhood(C,P),query)
4. C.potential_score := g*C.inherited_score + (1-g)*C.anchor_score

d, b, and g are tunable parameters. So is the definition of "anchor_neighborhood".
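The score computation can be written as one small function. This sketch assumes that the similarity values are already computed and passed in as numbers, that "relevance(P) > 0" means sim(P,query) > 0, and that the second branch of step 3 applies when the anchor similarity is zero. The default values of d, b and g are arbitrary placeholders, not recommended settings.

```python
def shark_potential(parent_sim, parent_inherited, anchor_sim, context_sim,
                    d=0.5, b=0.8, g=0.5):
    """Potential score of a child link C of page P.

    parent_sim       -- sim(P, query)
    parent_inherited -- P.inherited_score
    anchor_sim       -- sim(anchor text of the link, query)
    context_sim      -- sim(anchor_neighborhood(C, P), query)
    """
    # Step 1: a relevant parent passes on its own similarity, decayed by d;
    # an irrelevant parent passes on what it inherited, decayed again.
    if parent_sim > 0:
        inherited = d * parent_sim
    else:
        inherited = d * parent_inherited
    # Steps 2-3: if the anchor text matches, the neighborhood term is taken as 1;
    # otherwise fall back to the text surrounding the anchor.
    if anchor_sim > 0:
        anchor_score = b * anchor_sim + (1 - b)
    else:
        anchor_score = b * anchor_sim + (1 - b) * context_sim
    # Step 4: blend the two sources of evidence.
    return g * inherited + (1 - g) * anchor_score
```

For example, with d = 0.5, b = 0.8, g = 0.5, a parent with similarity 0.4 and an anchor with similarity 0.5 give an inherited score of 0.2, an anchor score of 0.6, and a potential of 0.4.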


Collection of agents (spiders).
Agent state = vector of keywords.
At each iteration, a randomly chosen agent:

Note: An agent learns clues from anchor, which it passes on to its offspring. The population of agents learns clues from text.

Intelligent browser = human-driven crawler

Human in the loop. Automatically crawl a little ahead, suggest pages of possible interest.

Heuristic function

Link L from P to C. (P has been seen, C has not). Query Q.

Tunable parameters

The theory of tuning parameters is pretty much the same whatever the application. There are however-many hundreds of statistical, machine learning, and interpolation techniques for doing this. It very often turns out that the values of the parameters hardly matter, within a pretty large range.


Most learning applications are characterized by:

The most principled study I have found for a focussed crawler is "Using Reinforcement Learning to Spider the Web Efficiently", Jason Rennie and Andrew McCallum, 1999.

Training set = some connected region of Web.

If link L leads in K steps to page P, then the discounted value of P for L is D^K * relevance(P,Q) for D < 1.

Overall value of link L = sum of discounted values of pages reached from L for optimal policy. Calculable exactly over training set by dynamic programming algorithm.
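To make the discounting concrete, here is a deliberately simplified sketch. It is not the optimal-policy dynamic program: it just sums D^K * relevance(P) over every page P reachable from the link, with K taken as the shortest number of link steps (the link's own target counts as K = 1). That is optimistic, since a real crawler fetches one page per step. The graph, the relevance table and the function name are hypothetical.

```python
from collections import deque

def link_value(target, links, relevance, discount=0.5):
    """Sum of discount**K * relevance(P) over pages reachable from a link.

    target    -- the page the link points to (reached in K = 1 step)
    links     -- dict: page -> list of pages it links to (the training region)
    relevance -- dict: page -> relevance to the query (missing pages count as 0)
    """
    steps = {target: 1}
    queue = deque([target])
    total = 0.0
    while queue:                       # breadth-first, so steps[] is the shortest K
        page = queue.popleft()
        total += discount ** steps[page] * relevance.get(page, 0.0)
        for nxt in links.get(page, []):
            if nxt not in steps:
                steps[nxt] = steps[page] + 1
                queue.append(nxt)
    return total

# Toy region: the link leads to c; relevant pages sit 2 and 3 steps away.
toy_links = {"c": ["p1", "p2"], "p1": ["p3"]}
toy_relevance = {"p1": 1.0, "p3": 1.0}
value = link_value("c", toy_links, toy_relevance, discount=0.5)
```

Here the value is 0.5^2 + 0.5^3 = 0.375: nearby relevant pages count for more than distant ones, which is what lets the crawler prefer links with an immediate payoff.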

Learn: function from words in text, anchor to value of link.
Method: Naive Bayes.
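One way to picture the learning step is as text classification: the words around a link are the features and a discretized link value is the class. The sketch below is a plain multinomial Naive Bayes with add-one smoothing. The "high"/"low" bins, the class name and the training examples are assumptions made for illustration only.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Multinomial Naive Bayes over bags of words, with add-one smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of examples
        self.vocab = set()

    def train(self, examples):
        for words, label in examples:
            self.doc_counts[label] += 1
            self.word_counts[label].update(words)
            self.vocab.update(words)

    def predict(self, words):
        n_docs = sum(self.doc_counts.values())

        def log_posterior(label):
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            lp = math.log(self.doc_counts[label] / n_docs)
            for w in words:
                if w in self.vocab:          # ignore words never seen in training
                    lp += math.log((counts[w] + 1) / denom)
            return lp

        return max(self.doc_counts, key=log_posterior)

# Hypothetical training data: words near a link, and the bin of the link's value.
model = NaiveBayes()
model.train([
    (["research", "publications", "papers"], "high"),
    (["technical", "report", "papers"], "high"),
    (["sports", "football", "scores"], "low"),
    (["campus", "parking", "map"], "low"),
])
```

At crawl time, the predicted bin of each frontier link would serve as its heuristic score.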

Terminating search


Different from ordinary retrieval:

Case Studies

A General Evaluation Framework for Topical Crawlers

Padmini Srinivasan, Filippo Menczer, and Gautam Pant

Find set of target pages from DMOZ. Follow back-links K steps. Try to find target pages.
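The seed construction can be sketched on a known link graph. This assumes the graph is available as a dictionary of forward links, so back-links are obtained by inverting it, and it returns the pages whose shortest back-link distance from a target is exactly K. The function name and the toy graph are hypothetical.

```python
def backlink_seeds(targets, links, k):
    """Pages exactly k back-link steps away from the target pages."""
    # Invert the forward-link graph to get back-links.
    backlinks = {}
    for src, outs in links.items():
        for dst in outs:
            backlinks.setdefault(dst, set()).add(src)
    seen = set(targets)
    layer = set(targets)
    for _ in range(k):
        # Step one level back, skipping pages already reached at a shorter distance.
        layer = {src for page in layer for src in backlinks.get(page, ())} - seen
        seen |= layer
    return layer

# Toy graph: d -> a -> b -> t and d -> c -> t, with t as the only target.
toy_links = {"a": ["b"], "b": ["t"], "c": ["t"], "d": ["a", "c"]}
seeds = backlink_seeds({"t"}, toy_links, k=2)
```

A crawler started from these seeds is then scored on how many of the targets it rediscovers; larger K makes the task harder.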

Obviously not a good measure of absolute quality, but perhaps an interesting comparative test.

Intelligent Crawling on the World Wide Web with Arbitrary Predicates

Aggarwal, Al-Garawi and Yu, 2001

Information used: Content, Sibling, URL, anchor.
Learning from successes during crawl.
Probabilistic model of learning.

N = number of pages crawled.

Seed = [URL lost]. Predicate = "TRAVEL related sites containing keyword Paris".
Harvest rate increases from 0.1 at start to 0.2 at N = 2500.

Seed = [URL lost]. Predicate = "AUTOMOTIVE related sites containing keywords Toyota SUV".
Harvest rate starts at 0.17, rises to peak of 0.42 at N = 400, declines to 0.35 at N = 2000.

Seed = [URL lost]. Predicate = category "SPORTS". Harvest rate starts at 0.2, rises to 0.5 at N = 2500.

Seed = [URL lost]. Predicate = category "SPORTS". Harvest rate starts at 0.05, rises to 0.85 at N = 2000.
Very similar results with other seeds [URLs lost].

Composite information source better than any single info source.
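The harvest rates quoted above can be read as the fraction of crawled pages that satisfy the predicate. That reading is an assumption here, as is the choice to report it over fixed-size windows of the crawl order. The function names are hypothetical.

```python
def harvest_rate(crawled, satisfies_predicate):
    """Fraction of crawled pages that satisfy the predicate."""
    if not crawled:
        return 0.0
    hits = sum(1 for page in crawled if satisfies_predicate(page))
    return hits / len(crawled)

def harvest_curve(crawled, satisfies_predicate, window):
    """Harvest rate over successive windows of the crawl, in crawl order."""
    return [harvest_rate(crawled[i:i + window], satisfies_predicate)
            for i in range(0, len(crawled), window)]

# Toy crawl order: names starting with "r" stand for pages satisfying the predicate.
pages = ["r1", "n1", "n2", "n3", "r2", "r3"]
curve = harvest_curve(pages, lambda p: p.startswith("r"), window=2)
```

A rising curve, as in the results above, indicates that the crawler is learning during the crawl which links are worth following.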

Web Crawling Agents for Retrieving Biomedical Information

P. Srinivasan et al. 2002

Task: Given a gene, to collect information about associated diseases.

Experimental method:
Use Web to collect all information you can about gene.
Find set of keywords from this data source, other than disease.
Find subset of keywords for query that retrieves at least 5 pages as seed set.
Test a number of web crawling algorithms, using keyword match as heuristic function.

1. Collect relevant target pages from MedlinePlus. Therefore, exclude MedlinePlus from seed set.
2. Hand construct information page describing relation of gene to disease.
Measure similarity of retrieved page to this information page.

Results: Quite poor. After crawling 1000 pages, the average similarity was between 0.0145 and 0.017, and was declining if anything. The recall of target pages was between 0.04 and 0.06.
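The two evaluation measures can be sketched as follows. The notes do not say which similarity measure was used; this sketch assumes cosine similarity between raw term-frequency vectors, with whitespace tokenization. The function names and example strings are hypothetical.

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the term-frequency vectors of two texts."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def average_similarity(crawled_texts, description):
    """Mean similarity of the crawled pages to the hand-built description page."""
    if not crawled_texts:
        return 0.0
    return sum(cosine_similarity(t, description) for t in crawled_texts) / len(crawled_texts)

def target_recall(crawled_urls, target_urls):
    """Fraction of the target pages that the crawl actually reached."""
    targets = set(target_urls)
    if not targets:
        return 0.0
    return len(targets & set(crawled_urls)) / len(targets)
```

On this scale, an average similarity of about 0.015 means the crawled pages share almost no vocabulary with the description page, and a recall of 0.05 means roughly one target page in twenty was found.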


Bottom Line


I have yet to find a paper that claims to have better recall than Google, or to have found any pages not indexed by Google, or, indeed, to have accomplished anything that could not be more easily accomplished in some other way. Few do the comparison.

Obviously, if someone has stuck a relevant page in some odd corner of the web, then a focussed crawler is less likely to find it than Google. Where the focussed crawler wins, if at all, has got to be pages reasonably near an obvious seed page, which Google either by chance fails to index (which has got to be reasonably likely) or which a sophisticated, topic-specific algorithm can recognize to be relevant but Google's algorithm does not recognize as relevant.


Unevaluated, as far as I can tell.

Other reading

Evaluating Topic-Driven Web Crawlers, Filippo Menczer et al., 2001

Topic-Driven Crawlers: Machine Learning Issues, Filippo Menczer, Gautam Pant, Padmini Srinivasan

Building Topic-Specific Collections with Intelligent Agents, Igor Nekrestyanov et al.

WebSail: From On-line Learning to Web Search