Lecture 9: The Invisible Web and Specialized Search Engines

Required reading

Structured Databases on the Web: Observations and Implications, Kevin Chang et al.
Crawling the Hidden Web (Sriram Raghavan, Hector Garcia-Molina)

Recommended Reading

The Deep Web: Surfacing Hidden Value by Michael K. Bergman, Journal of Electronic Publishing

Invisible Web

Invisible Web: Information that can be read with a web browser, but will not be found by a standard crawler for various reasons.

Specialized Search Engines

The advantages of a specialized search engine derive from its restricted domain, as the examples below illustrate.

Froogle

Each merchant sends its database (and thereafter updates) in a uniform format to Froogle. Froogle collates these and provides a uniform search engine.

Similarly for engines for restricted categories of merchandise. (Cars, real estate, books, etc.) Restricting the categories enables domain-specific query attributes, drop-down value lists, appropriate presentation and interaction, greater precision.

Citeseer

Collects CS research papers; structures by citation. 700,000 documents.

Collection

Uses a search engine with keywords like "publications" or "papers" as a starting point for the crawl. Also uses known online journals and proceedings.

Single document processing

Converts PostScript and PDF to plain text; translates to various formats.
Extracts index terms.
Extracts fields: Author, Title, Date, Pub, Abstract (see the sketch after this list).
Identifies bibliography and references.
Identifies citations in text.
Locates author home pages.
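A minimal sketch of the field-extraction step, using simple position and regular-expression heuristics. The patterns and field names below are illustrative assumptions, not CiteSeer's actual extraction rules.

    import re

    # A minimal sketch of header-field extraction from the plain text of a paper.
    # The field names and patterns are illustrative assumptions, not CiteSeer's
    # actual extraction rules.
    def extract_header_fields(text):
        lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
        fields = {
            "title": lines[0] if lines else "",               # heuristic: first non-blank line
            "authors": lines[1] if len(lines) > 1 else "",    # then an author line
        }
        # A four-digit year near the top of the document serves as the date guess.
        year = re.search(r"\b(19|20)\d{2}\b", "\n".join(lines[:10]))
        fields["date"] = year.group(0) if year else ""
        # Abstract: text between "Abstract" and the first "Introduction" heading.
        m = re.search(r"Abstract\s*(.+?)(?:\n\s*1\.?\s+Introduction|\Z)", text, re.S | re.I)
        fields["abstract"] = m.group(1).strip() if m else ""
        return fields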

Cross-document processing

Matches references with documents in the collection. Note: wide variance in the form of references, plus typos and errors.
Identifies common references to external documents (i.e., documents not online).
Computes textual similarity between documents.
Computes co-citation similarity between documents (see the sketch below).
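A minimal sketch of the co-citation step, using Jaccard overlap of the citing sets. The similarity measure and the data are assumptions, not CiteSeer's published formula.

    # A sketch of co-citation similarity: two papers are similar to the extent
    # that later papers tend to cite both. Jaccard overlap of the citing sets is
    # one reasonable choice; the data below is purely illustrative.
    def cocitation_similarity(citers_a, citers_b):
        if not citers_a or not citers_b:
            return 0.0
        return len(citers_a & citers_b) / len(citers_a | citers_b)

    cited_by = {
        "paper_x": {"p1", "p2", "p3", "p5"},
        "paper_y": {"p2", "p3", "p4"},
    }
    print(cocitation_similarity(cited_by["paper_x"], cited_by["paper_y"]))  # 0.4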

Query answer

Query results
Each result shows a quotation (snippet) from the document.
Results are ordered by decreasing number of citations.

Document summary: title, date, authors, abstract, citations to the paper, similar documents at the sentence level, bibliography, similar documents based on text, related documents from co-citation, and a histogram by year of the number of citing articles.

Finding specialized search engines

1. Search (in Google, etc.) under "X search" or "X database", where X is often a supercategory of what you're actually looking for.
2. Search engines for search engines: e.g., CompletePlanet (lists 70,000 databases) and www.profusion.com.

3. A list of 18 search engines for search engines: Guides to Specialized Search Engines (with Descriptions).
(Perhaps out of date? It doesn't include CompletePlanet.) It also lists thousands of specialized search engines.

Meta-search

Will be discussed in a later lecture.

Categorization of Web

Structured Databases on the Web: Observations and Implications, Kevin Chang et al.
Randomly sampled dataset: generated 1 million random IP addresses out of the 2.2 billion IP address space and issued an HTTP request to each. 2,256 sites responded to the HTTP request; from this, estimate 4.4 million web sites in total.

On a subset of 100,000 pages, searched to depth 10 looking for deep Web interface pages. Found 129 interface pages at 24 sites; 91% of these are at depth at most 3 from the root.

Over the entire sample of 1,000,000 sites, looked to depth 3 for database interface pages.
Sampling results:
                       Sample    Total estimate
Deep Web sites           126         307,000
Web databases            190         450,000
  - structured            43         102,000
  - unstructured         147         348,000
Query interfaces         406       1,258,000
E.g., barnesandnoble.com is a Deep Web site. It has several databases: books, music, etc. Each of these has two interfaces: a simple search and an advanced search. (The simple search interface is repeated on several pages; these are not counted separately.) The databases on barnesandnoble.com are structured: a relational database with attributes "title", "author", "price", etc. The database on cnn.com is unstructured: news articles.

Coverage of existing directories

By far the largest is completeplanet.com, with 70,000 databases = 15.6% of the estimated total. The rest are much smaller.

Subject distribution

See paper.

Domain-specific studies

In many domains, database pages are reachable by hyperlinks via a browsing structure (e.g., 80% of book databases and movie databases), and are therefore at least in principle attainable by a standard crawler. In other domains far fewer are (e.g., airfares, only 3%). In fact, however, 75% of crawlable deep web pages are not indexed in Google, and for another 24% the copy cached in Google is out of date. Only 5% are indexed and fresh in Google.

Query schemas have between 1 and 18 attributes, with an average of 6.

The frequency of attribute names follows a Zipf distribution. The top 20 attributes account for 38% of occurrences. (The top 5 are: title, keyword, price, make, and artist.) 61% of attributes occur in only one source. Attributes are strongly clustered by domain.
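A sketch of how such frequency statistics might be tabulated from a harvested list of attribute names. The list is hypothetical, and raw occurrences stand in for the paper's per-source counts.

    from collections import Counter

    # A sketch of the attribute-frequency statistics, computed over a hypothetical
    # list of attribute names harvested from query interfaces.
    attribute_occurrences = ["title", "keyword", "price", "title", "make",
                             "artist", "title", "isbn", "keyword", "author"]
    counts = Counter(attribute_occurrences)
    total = sum(counts.values())
    ranked = counts.most_common()
    top20_share = sum(c for _, c in ranked[:20]) / total
    singleton_share = sum(1 for c in counts.values() if c == 1) / len(counts)
    print(f"top-20 attributes cover {top20_share:.0%} of occurrences; "
          f"{singleton_share:.0%} of attribute names occur only once")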

Complexity of most "advanced" search:

The Deep Web: Surfacing Hidden Value by Michael K. Bergman, Journal of Electronic Publishing

Note: Many of the techniques used are proprietary and not adequately described. The whole paper must be considered as an advertisement for CompletePlanet.com (formerly BrightPlanet.com) and therefore should be taken with a grain of salt. Nonetheless, it contains a lot of good information.

Definition of deep web site: A site that dynamically generates pages based on a search query.

Number of deep web sites:
Compile ten lists of deep web sites. Total: 43,348.
Automatically filter for search sites. Total: 17,579 search sites.
Hand-examine 700 random sites and filter on further criteria; 13.6% are excluded. Apply this correction throughout.
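The correction arithmetic, applied to the figures above; the resulting intermediate count is my own calculation, not a figure quoted from the paper.

    # Applying the stated 13.6% exclusion rate to the 17,579 automatically
    # filtered search sites; the resulting count is my own arithmetic, not a
    # figure quoted from the paper.
    search_sites = 17_579
    exclusion_rate = 0.136
    qualifying_sites = search_sites * (1 - exclusion_rate)
    print(round(qualifying_sites))  # roughly 15,188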

Estimating total number of deep web sites.
Technique: Overlap analysis.

Suppose A and B are randomly chosen subsets of O.
Let AB=A intersect B.
Expected size of AB = |A||B|/|O|.
Therefore if you know |A|, |B|, |AB|, you can estimate |O|=|A||B|/|AB|.
In this case, A and B are listings of deep Web sites.
Problem: the independence assumption is far from valid; the listings tend to be positively correlated.
Then |AB| > |A||B|/|O|, so |O| > |A||B|/|AB|; i.e., you get an underestimate of |O|.
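A minimal sketch of the overlap estimator; the listing sizes in the example are illustrative, not figures from the paper.

    # Overlap ("capture-recapture") estimate of |O| from two listings A and B.
    # Assumes A and B are independent random samples of O; positive correlation
    # between the listings makes this an underestimate of |O|.
    def estimate_population(size_a, size_b, size_overlap):
        if size_overlap == 0:
            raise ValueError("no overlap: the estimate is unbounded")
        return size_a * size_b / size_overlap

    # Hypothetical listing sizes, for illustration only:
    print(estimate_population(size_a=10_000, size_b=12_000, size_overlap=1_500))  # 80000.0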

Estimates range from 43,000 to 96,000. The latter is estimated to be low by a factor of 2; hence about 200,000. 94% are publicly available.

Total size of Invisible Web
There are two truly enormous Web sites:
The National Climatic Data Center (NOAA) is 366 TBytes.
NASA EOSDIS is 219 TBytes.
The sixty largest deep Web sites known (to the CompletePlanet folks) total 750 TBytes. (These figures are three years old; they are still listed on the CompletePlanet site, which is not encouraging.)

There may be others. An analysis of 700 randomly selected sites turned up 3 that weren't on the original list; if this proportion applies to the full set of 200,000, there would be about 840 more "really big" sites. This seems unlikely.

Crawling the Hidden Web

(Sriram Raghavan, Hector Garcia-Molina)

HiWE: Hidden Web Exposer

Analyze form into form elements.
Form element = type, label, domain
Type = selection list, text box, text area, checkbox, or radio button.
Label = label on the element, e.g. "company name", "state", etc.
Domain = set of values. E.g. { IBM, Microsoft, Intel ...} or { Alabama, Alaska ...}
Need to parse the physical layout of the form and associate labels with nearby form elements. Metatags would help, except that no one ever uses them. This is done with LITE (Layout-based Information Extraction).
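A minimal sketch of this form representation; the class and field names, and the example values, are my own rendering, not HiWE's actual data structures.

    from dataclasses import dataclass, field

    # A sketch of the form representation described above; the names are my own
    # rendering, not HiWE's actual data structures.
    @dataclass
    class FormElement:
        elem_type: str              # "select", "text", "textarea", "checkbox", or "radio"
        label: str                  # label associated by layout analysis, e.g. "state"
        domain: list = field(default_factory=list)  # finite value set, if the form gives one

    form = [
        FormElement("select", "state", ["Alabama", "Alaska", "Arizona"]),
        FormElement("text", "company name"),        # free-text box: no fixed domain
    ]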

A task-specific database holds known values relevant to the query. These are (a) given by the user; (b) built in (e.g., date fields, state field); (c) extracted using a wrapper from a data source, either task-specific (Semiconductor Research Corporation) or general (Yahoo); (d) extracted from the form itself (labels and values).

The matcher fills out the form: it finds the standard label name closest to the form label, then either enumerates the values given in the form or enumerates the values associated with that label in the database.
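A sketch of the matcher, continuing the FormElement sketch above. The task-specific value table is hypothetical, and difflib stands in for HiWE's own label-matching heuristics.

    import difflib

    # A sketch of the matcher: map each form label to the closest known label in
    # the task-specific value table, then enumerate values either from the form's
    # own domain or from the table.
    value_table = {
        "state": ["California", "New York", "Texas"],
        "company name": ["IBM", "Microsoft", "Intel"],
    }

    def candidate_values(element):
        if element.domain:                    # the form itself supplies a finite domain
            return element.domain
        match = difflib.get_close_matches(element.label, list(value_table), n=1)
        return value_table[match[0]] if match else []

    for element in form:
        for value in candidate_values(element):
            print(f"submit {element.label!r} = {value!r}")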

The response analyzer saves responses (excluding failures) and adjusts the weight relating each value to its label. (If a value leads to a failure, its weight is reduced.)
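A sketch of the weight bookkeeping; the update rule and constants are assumptions, not taken from the paper.

    # Each (label, value) pair carries a weight that rises when a submission
    # returns results and falls on failure.
    weights = {}

    def record_response(label, value, success):
        key = (label, value)
        step = 0.1 if success else -0.1
        weights[key] = min(1.0, max(0.0, weights.get(key, 0.5) + step))

    record_response("state", "Texas", success=True)
    record_response("state", "Texas", success=False)
    print(weights[("state", "Texas")])  # back to 0.5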

Evaluation metrics
Metric 1: number of successful submissions (an answer is obtained) / total number of submissions.
Metric 2: number of semantically correct submissions (meaningful query in the form) / total number of submissions. Requires manual evaluation.
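A sketch of the two metrics as simple ratios; metric 2's numerator requires a manual judgment of which submissions were semantically meaningful, so it is passed in as a precomputed count. The example numbers are hypothetical.

    def submission_efficiency(successful, total):
        """Metric 1: fraction of submissions that returned an answer."""
        return successful / total

    def semantic_correctness(semantically_correct, total):
        """Metric 2: fraction of submissions that were meaningful queries."""
        return semantically_correct / total

    print(submission_efficiency(430, 500))   # 0.86
    print(semantic_correctness(410, 500))    # 0.82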