Lecture 8: The Invisible Web, Specialized Search Engines, and Metasearch

Required reading

Structured Databases on the Web: Observations and Implications, Kevin Chang et al.
Indexing the invisible web: a survey by Yanbo Ru and Ellis Horowitz, 2005.

Recommended Reading

Crawling the Hidden Web (Sriram Raghavan, Hector Garcia-Molina)
The Deep Web: Surfacing Hidden Value by Michael K. Bergman, Journal of Electronic Publishing

Invisible Web

Indexable Web: Pages that can be found by crawling
Invisible Web (Hidden web, deep web): Information that can be read with a web browser, but will not be found by a standard crawler for various reasons.

Specialized Search Engines

The advantages of a specialized search engine derive from restricting its domain, as the following examples illustrate:

Froogle

Each merchant sends its database (and thereafter updates) in a uniform format to Froogle. Froogle collates these feeds and provides a uniform search engine over them.

Similarly for engines covering restricted categories of merchandise (cars, real estate, books, etc.). Restricting the categories enables domain-specific query attributes, drop-down value lists, appropriate presentation and interaction, and greater precision.
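As a toy illustration of such a uniform feed, here is one hypothetical product record; all field names are invented, since the actual feed format is not described in these notes.

```python
# A hypothetical uniform merchant-feed record of the kind Froogle collated.
# Field names are invented for illustration.
product = {
    "merchant": "ExampleBooks",
    "category": "books",
    "title": "Information Retrieval",
    "author": "C. J. van Rijsbergen",
    "price_usd": 24.95,
    "url": "http://examplebooks.example.com/item/123",
}
```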

Citeseer

(Now superseded by Google Scholar, but it has the advantage that its methods have been published.) Collects CS research papers; structures them by citation. 700,000 documents.

Collection

Uses a search engine with keywords like "publications" or "papers" to find starting points for the crawl. Also uses known online journals and proceedings.

Single document processing

Converts PostScript and PDF to plain text; translates to various formats.
Extracts index terms.
Extracts fields: author, title, date, publication, abstract.
Identifies the bibliography and its references.
Identifies citations in the text.
Locates author home pages.
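A minimal sketch of the field-extraction step, using toy regular-expression heuristics (the patterns and the function name are illustrative, not CiteSeer's actual rules):

```python
import re

# Illustrative header-field extraction from the plain text of a paper.
# Real systems like CiteSeer use far more robust, learned rules.
def extract_header_fields(text: str) -> dict:
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    fields = {"title": lines[0] if lines else None}
    # Crude author heuristic: an early line of comma-separated capitalized names.
    name = r"[A-Z][\w.-]*(?:\s+[A-Z][\w.-]*)+"
    for ln in lines[1:5]:
        if re.fullmatch(rf"{name}(?:\s*,\s*{name})*", ln):
            fields["authors"] = [a.strip() for a in ln.split(",")]
            break
    m = re.search(r"\b(?:19|20)\d\d\b", text[:2000])
    fields["date"] = m.group(0) if m else None
    m = re.search(r"(?is)\babstract\b[:.\s]*(.{40,1500}?)"
                  r"(?:\n\s*\n|\b1\s+introduction\b)", text)
    fields["abstract"] = m.group(1).strip() if m else None
    return fields
```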

Cross-document processing

Identify each reference with the document it refers to. Note: wide variance in the form of references, plus typos and errors.
Identify common references to external documents (i.e. not online).
Compute textual similarity between documents.
Compute co-citation similarity between documents.
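Since reference strings for the same paper vary widely in form, matching them is essentially fuzzy record matching. A minimal sketch using string normalization and a similarity threshold (both the normalization and the threshold are illustrative):

```python
import re
from difflib import SequenceMatcher

def normalize(ref: str) -> str:
    """Lowercase and strip punctuation noise from a reference string."""
    ref = re.sub(r"[^a-z0-9 ]+", " ", ref.lower())
    return re.sub(r"\s+", " ", ref).strip()

def same_paper(ref_a: str, ref_b: str, threshold: float = 0.8) -> bool:
    """Heuristic: two references denote the same paper if their normalized
    forms are highly similar. Real citation matchers also compare extracted
    fields (authors, year, title) separately."""
    return SequenceMatcher(None, normalize(ref_a), normalize(ref_b)).ratio() >= threshold
```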

Query answer

Query results:
Quotation from the document, with a snippet.
Ordered by decreasing number of citations.

Document summary: title, date, authors, abstract, citations to the paper, similar documents at the sentence level, bibliography, similar documents based on text, related documents from co-citation, and a histogram by year of the number of citing articles.

Meta-search

Discussed below (see "Metasearch engines").

Categorization of Deep Web

Structured Databases on the Web: Observations and Implications, Kevin Chang et al., 2004

The figures are three years old and no longer accurate, but they are suggestive, and the methods and questions are still relevant.

Randomly sampled dataset: generated 1 million random IP addresses out of the 2.2-billion-address IP space and issued an HTTP request to each. 2256 sites responded to the HTTP request; estimate 4.4 million web sites in total.

On a subset of 100,000 pages, searched to depth 10 looking for deep-Web interface pages. Found 129 interface pages at 24 sites; 91% of these are at depth at most 3 from the root.

Over the entire sample of 1,000,000 sites, looked to depth 3 for database interface pages.
                      Sample    Total estimate
Deep Web sites           126           307,000
Web databases            190           450,000
  - structured            43           102,000
  - unstructured         147           348,000
Query interfaces         406         1,258,000
E.g., barnesandnoble.com is a deep Web site. It has several databases: books, music, etc. Each of these has two interfaces: a simple search and an advanced search. (The simple search interface is repeated on several pages; these are not counted separately.) The databases on barnesandnoble.com are structured: relational databases with attributes "title", "author", "price", etc. The database on cnn.com is unstructured: news articles.

Domain-specific studies

In many domains, database pages are reachable by hyperlinks via a browsing structure (e.g. 80% of book databases and movie databases), and are therefore at least in principle attainable by a standard crawler. In other domains far fewer are (e.g. only 3% for airfares). In fact, however, 75% of crawlable deep-Web pages are not indexed in Google, and for another 24% the copy cached in Google is out of date; only 5% are indexed and fresh in Google.

Query schemas have between 1 and 18 attributes, with an average of 6.

The frequency of attribute names follows a power-law distribution (which we will discuss in a future lecture). The top 20 attributes account for 38% of occurrences. (The top 5 are: title, keyword, price, make, and artist.) 61% of attributes occur in only one source. Attributes are strongly clustered by domain.

Complexity of most "advanced" search:

The Deep Web: Surfacing Hidden Value by Michael K. Bergman, Journal of Electronic Publishing

Note: Many of the techniques used are proprietary and not adequately described. The whole paper was an advertisement for BrightPlanet.com (later CompletePlanet.com; no longer functional, though still on the web) and should therefore be taken with a grain of salt. Nonetheless, it contains a lot of good information.

Definition of deep web site: A site that dynamically generates pages based on a search query.

Number of deep web sites:
Compile ten lists of deep Web sites. Total: 43,348.
Automatically filter for search sites. Total: 17,579 search sites.
Hand-examine 700 random sites and filter on further criteria; 13.6% are excluded. Apply this correction throughout.

Estimating total number of deep web sites.
Technique: Overlap analysis.

Suppose A and B are randomly chosen subsets of O.
Let AB = A intersect B.
The expected size of AB is |A||B|/|O|.
Therefore, if you know |A|, |B|, and |AB|, you can estimate |O| = |A||B|/|AB|.
In this case A and B are listings of deep Web sites.
Problem: the independence assumption is far from valid; the listings tend to be positively correlated.
Then |AB| > |A||B|/|O|, so |O| > |A||B|/|AB|; that is, the estimate is an underestimate of |O|.
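A direct implementation of this estimator (in effect, capture-recapture estimation), with the caveat above that positive correlation makes it a lower bound:

```python
def overlap_estimate(size_a: int, size_b: int, size_ab: int) -> float:
    """Estimate |O| = |A||B|/|AB| from two randomly chosen subsets A, B of O.
    If A and B are positively correlated, this underestimates |O|."""
    return size_a * size_b / size_ab

# Invented illustration: two listings of 10,000 and 8,000 sites sharing
# 2,000 entries suggest at least 40,000 sites overall.
print(overlap_estimate(10_000, 8_000, 2_000))  # 40000.0
```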

The values obtained range from 43,000 to 96,000. Bergman estimates that the latter is itself an underestimate by a factor of 2; hence roughly 200,000 deep Web sites. 94% are publicly available.

An enormous invisible Web site, though not a very interesting one:
NOAA (National Oceanic and Atmospheric Administration) National Climatic Data Center Doppler Weather Radar Data: 1600 TBytes (as of 2006)
NOAA projects that they will be storing 140 PetaBytes by 2020.

Indexing the Invisible Web

Indexing the invisible web: a survey by Yanbo Ru and Ellis Horowitz, 2005.

Locating invisible web sites

1. Search (in Google etc.) for "X search" or "X database", where X is often a supercategory of what you're actually looking for.
2. Use a search engine for search engines, e.g. Search Engine Guide. Ru and Horowitz have a list of 24 others, not all of which are still extant.

Classifying pages as invisible web sites

Pre-query: J. Cope, N. Craswell, and D. Hawking, "Automated discovery of search interfaces on the web". Train a classifier (a C4.5 decision tree) on features of deep-Web query pages. Precision = 87%, recall = 85%.
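A minimal sketch of the pre-query idea, with hand-invented form features and scikit-learn's decision tree standing in for C4.5 (the features and training examples below are illustrative, not the paper's):

```python
from sklearn.tree import DecisionTreeClassifier

# Each form is described by simple features of its HTML, e.g.:
# [text fields, select menus, 'search' in button text, password field, hidden fields]
X_train = [
    [1, 0, 1, 0, 2],   # a site-search box          -> search interface
    [2, 3, 1, 0, 1],   # a product query interface  -> search interface
    [2, 0, 0, 1, 0],   # a login form               -> not a search interface
    [5, 1, 0, 0, 0],   # a registration form        -> not a search interface
]
y_train = [1, 1, 0, 0]  # 1 = deep-Web search interface

clf = DecisionTreeClassifier().fit(X_train, y_train)
print(clf.predict([[1, 2, 1, 0, 0]]))  # classify a new form
```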

Post-query: Crawling for Domain-Specific Hidden Web Resources, Andre Bergholz and Boris Chidlovskii.

Step 1: Domain-specific crawl to collect candidate query forms.
Start with 10,000 categories from the top 5 levels of the Google directory hierarchy.
For each category, collect 200 pages and 20 relevant keywords.
From each page, do a breadth-first crawl of HTML pages within the same site: maximum distance = 8, max number of pages = 1000.

Step 2: For each crawled page, check whether this is a queryable form with text input using features of the HTML. If so:

Step 3: Run query prober. Fill in non-text values (buttons, drop-down menus, etc.) randomly. For text fields, try two experiments:
A. Fill in values with domain keywords.
B. Fill in values with nonsense string.

If the page is a true hidden-Web query form, then you expect to see results for the domain-keyword queries (A) but an error page or empty result for the nonsense string (B).

Results: Collected 4800 hidden web query pages.
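A minimal sketch of the prober's decision rule, assuming a hypothetical `submit` callback that fills the non-text inputs randomly and returns the response page; the size-comparison test is a stand-in for the paper's actual analysis:

```python
import random
import string

def looks_like_hidden_web_form(submit, domain_keywords, trials=3):
    """submit(text_value) -> response page (a string).
    A real query form should return result pages for domain keywords (A)
    but an error or empty-result page for a nonsense string (B)."""
    nonsense = "".join(random.choices(string.ascii_lowercase, k=20))
    keyword_pages = [submit(kw) for kw in random.sample(domain_keywords, trials)]
    nonsense_page = submit(nonsense)
    # Stand-in test: keyword responses should be substantially larger than
    # the nonsense response.
    return min(len(p) for p in keyword_pages) > 1.5 * len(nonsense_page)
```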

Categorizing the subject matter of invisible web sites

Pre-query: Use features of the query page.
Post-query: Fill out query page and use features of answers.
However, the papers cited in Ru and Horowitz on this are not actually very interesting.

Indexing invisible web sites content offline

Crawling the Hidden Web (Sriram Raghavan, Hector Garcia-Molina)

HiWE: Hidden Web Exposer

Analyze form into form elements.
Form element = (type, label, domain).
Type = selection list, text box, text area, checkbox, or radio button.
Label = the label on the element, e.g. "company name", "state".
Domain = a set of values, e.g. { IBM, Microsoft, Intel, ... } or { Alabama, Alaska, ... }.
Need to parse the physical layout of the form and associate labels with nearby form elements. Metatags would help, except that no one ever uses them. HiWE does this with LITE (Layout-based Information Extraction), as sketched below.
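A minimal sketch of the form model and a LITE-style label association, assuming (x, y) layout coordinates for elements and candidate labels (the names and the distance heuristic are illustrative):

```python
from dataclasses import dataclass, field

@dataclass
class FormElement:
    type: str        # "selection list", "text box", "checkbox", ...
    label: str       # e.g. "company name", "state"
    domain: list = field(default_factory=list)  # e.g. ["Alabama", "Alaska", ...]

def nearest_label(element_pos, labels):
    """labels: list of ((x, y), text) candidates from the rendered layout.
    Attach the visually nearest label to the element (Manhattan distance)."""
    ex, ey = element_pos
    return min(labels, key=lambda l: abs(l[0][0] - ex) + abs(l[0][1] - ey))[1]
```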

The task-specific database holds known values relevant to the query. These are (a) given by the user; (b) built in (e.g. date fields, state fields); (c) extracted using a wrapper from a data source, either task-specific (Semiconductor Research Corporation) or general (Yahoo); or (d) extracted from forms themselves (labels and values).

The matcher fills out the form. It finds the standard label name closest to each form label, and then either enumerates the values given in the form or enumerates the values associated with that label in the database.

The response analyzer saves responses (excluding failures) and adjusts the weights relating values to labels. (If a query fails, the weight of each of the values used is reduced.)
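A minimal sketch of that bookkeeping; the exact update rule is invented, the point being only that a failure lowers the weight of every value used:

```python
# weights[label][value] ~ confidence that `value` is a good filler for `label`.
weights = {"state": {"Alabama": 1.0, "Alaska": 1.0}}

def record_outcome(label: str, value: str, success: bool, penalty: float = 0.5):
    """After a submission, reward values that produced answers and
    penalize values used in failed queries."""
    if success:
        weights[label][value] += 1.0
    else:
        weights[label][value] *= penalty
```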

Evaluation metrics
Metric 1: number of successful submissions (an answer was obtained) / total number of submissions.
The problem is that this penalizes HiWE even when the query was meaningful and appropriate but the database just doesn't happen to have any information about it; that is, HiWE is expected to be clairvoyant.
Metric 2: number of semantically correct submissions (meaningful queries in the form) / total number of submissions. A fairer measure, but it requires manual evaluation.

Results

The success rate increases with the number of elements per form. Over all forms with at least 2 elements, 78.9% of queries are correct on metric 1; on forms with at least 5 elements, 90% are correct on metric 1.

On the Automatic Extraction of Data from the Hidden Web Stephen W. Liddle, Sai Ho Yau, and David W. Embley

Does not try to use text elements; uses only elements with finitely many values (buttons, drop-down menus, etc.). Assumes that if you leave a text element blank it will match all or many possibilities; not always true, of course.

Main result: For most such web sites, it is unnecessary to try all combinations of the finite values. There is a systematic way (stratified sampling) to generate a sequence of combinations, together with a halting condition, such that if some significantly smaller number of queries would retrieve all the information, then following this strategy is likely to retrieve all of it with far fewer than all possible combinations.
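A sketch of the enumerate-with-halting idea; the saturation-based stopping rule below is a simple stand-in for the paper's stratified-sampling scheme:

```python
from itertools import product

def harvest(fields, run_query, patience=20):
    """fields: {name: [finite values]}; run_query(assignment) -> set of record ids.
    Enumerate value combinations, stopping early once `patience` consecutive
    queries contribute no new records (the results have saturated)."""
    names = list(fields)
    seen, stale = set(), 0
    for combo in product(*(fields[n] for n in names)):
        new = run_query(dict(zip(names, combo))) - seen
        seen |= new
        stale = 0 if new else stale + 1
        if stale >= patience:
            break
    return seen
```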

Automatic Information Discovery from the "Invisible Web" King-Ip Lin and Hui Chen.

A short article with few details, not very convincing, but an interesting architecture.

Other issues

Metasearch engines

Building Efficient and Effective Metasearch Engines, Weiyi Meng, Clement Yu, and King-Lup Liu, ACM Computing Surveys, vol. 34, no. 1, March 2002, pp. 48-89. (Not a very good paper, but it does cover the major points and it has a good bibliography.)

Overview

A metasearch engine sends a user query to a collection of search engines (either general or specialized), combines the answers, and presents them to the user. This applies mostly to search engines in the sense of programs that return a ranked list of web pages with snippets.

Select appropriate search engine

Rough representative approach

The content of the database is characterized by a small number of keywords culled from the home page etc.

Statistical representative approach

Collect statistics about frequency of words in documents in the collection. Meng et al. go to town on this.

Learning based approach

Offline: present a collection of queries to each search engine. Evaluate quality of search engine response.

Query time: Compare user query to previous queries and choose the search engines that scored highest on the related queries.

Options:
Queries: generated by the experimenter or culled from user logs.
Query-specific evaluation: manual; or the number of pages returned; or download sample pages and calculate relevance; or click-through data.
Query-independent evaluation: average of the query-specific evaluations over the queries; or excessive delay.
Comparison: apply a learning technique to the prediction of evaluation from query, as sketched below. Note that if queries are generated by known category, as in HiWE, then the categories can be used to guide the learner (e.g. learn one classifier per category).
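A minimal sketch of the query-time step, using TF-IDF cosine similarity to find the most similar logged query (all queries and engine scores below are invented):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Logged queries and per-query engine quality scores (all invented).
past_queries = ["cheap flights boston", "used car prices", "flight schedules paris"]
engine_scores = [{"A": 0.9, "B": 0.2}, {"A": 0.1, "B": 0.8}, {"A": 0.7, "B": 0.3}]

def pick_engines(user_query, k=2):
    vec = TfidfVectorizer().fit(past_queries + [user_query])
    sims = cosine_similarity(vec.transform([user_query]),
                             vec.transform(past_queries))[0]
    best = int(sims.argmax())          # most similar previous query
    ranked = sorted(engine_scores[best].items(), key=lambda kv: -kv[1])
    return [engine for engine, score in ranked[:k]]

print(pick_engines("flights to boston"))  # -> ['A', 'B']
```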

Identifying the same page

If you have the same or equivalent URL then it's easy.

Otherwise, identifying duplicate pages from snippets provided by two different snippet extractors has got to be tough to do with any degree of accuracy. I didn't find much information on this.

Merging rankings

See Liu Web Data Mining p. 227.

Combining separate systems of rankings to get an overall ranking is a well-known problem in "voting theory". There is provably no method for doing this that does not sometimes give rise to some kind of anomaly (Arrow's theorem etc.). Figure-skating judging went through a whole sequence of algorithms before moving to a system of cardinal scores a couple of years ago. The situation here is complicated by the fact that you have null values, which count only somewhat as a vote against the page.

One method: "Reciprocal ranking". Let RP,E be the rank of page P returned by engine E. For each page, compute the sum of 1/RP,E, where this reciprocal is considered 0 if E does not return P. Order the pages by decreasing value of reciprocal sum.

Clustering

From downloaded pages or from snippets.

A personalized search engine based on web-snippet hierarchical clustering Pablo Ferragina and Antonio Gulli, Software: Practice and Experience 2007.

Two offline knowledge bases:
1. A collection of anchor texts on inlinks within a 50 million page collection.
2. A modified TF-IDF measure for terms, based on Dmoz. Terms are ranked with respect to Dmoz categories; more specific categories are ranked higher than more general ones. Terms are considered similar if they have high ranks in the same categories.

Online:

Example output (from figure 8): a hierarchy of labeled clusters for the query "allergy".

The "personalization" here is pretty slight. If I'm understanding this, it's just a user-interface feature. The user can mark certain clusters as of interest, and have the rest disappear off the screen.