Lecture 7: The Invisible Web and Specialized Search Engines

Required reading

"The Deep Web: Surfacing Hidden Value" by Michael K. Bergman, Journal of Electronic Publishing.
"Accessibility of information on the Web" by Steve Lawrence and C. Lee Giles, Nature, vol. 400, July 8, 1999, pp. 107-109. (class handout)

Invisible Web: Information that can be read with a web browser, but will not be found by a standard crawler, for various reasons.

Specialized Search Engines

The advantages of a specialized search engine derive from its restricted domain. Two examples:

bookfinder.com

Used-book sites send updates (or perhaps their entire databases?) to the search engine; the engine collates the listings, orders them by price, and presents a uniform interface.
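A minimal sketch of the collation step, assuming each site supplies (title, price, seller) records; the feed format is hypothetical:

    from collections import defaultdict

    def collate(listings):
        """Group used-book listings by title; order each group by price.

        `listings` is an iterable of (title, price, seller) tuples, as
        might arrive in a site's periodic update feed (format assumed)."""
        by_title = defaultdict(list)
        for title, price, seller in listings:
            by_title[title].append((price, seller))
        # Present each title's offers cheapest-first.
        return {t: sorted(offers) for t, offers in by_title.items()}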

Citeseer

Collects CS research papers; structures by citation.

Collection

Uses a general search engine with keywords like "publications" or "papers" as a starting point for the crawl. Also uses known online journals and proceedings.

Single document processing

Converts PostScript and PDF to plain text; translates to various formats.
Extracts index terms.
Extracts fields: author, title, date, publication, abstract.
Identifies the bibliography and its references.
Identifies citations in the text (see the sketch below).
Locates author home pages.
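A minimal sketch of the citation-spotting step, using regular expressions for two common citation styles; the patterns are illustrative, not CiteSeer's actual heuristics:

    import re

    # Numbered citations ("[12]", "[3, 7]") and author-year
    # citations ("(Smith, 1998)", "(Smith and Jones 1997)").
    NUMBERED = re.compile(r"\[(\d+(?:\s*,\s*\d+)*)\]")
    AUTHOR_YEAR = re.compile(
        r"\(([A-Z][A-Za-z]+(?:\s+(?:and|&)\s+[A-Z][A-Za-z]+)?),?\s+(19|20)\d{2}\)")

    def find_citations(text):
        """Return apparent in-text citations (numbered hits first,
        then author-year hits)."""
        hits = [m.group(0) for m in NUMBERED.finditer(text)]
        hits += [m.group(0) for m in AUTHOR_YEAR.finditer(text)]
        return hits

    print(find_citations("As shown in [3, 7] and (Smith, 1998), ..."))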

Cross-document processing

Matches each reference to the document it cites. Note: wide variance in the form of references, plus typos and errors.
Identifies common references to external documents (i.e., documents not online).
Computes textual similarity between documents.
Computes co-citation similarity between documents (see the sketch below).
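A minimal sketch of co-citation similarity, here taken as the Jaccard overlap of the sets of citing papers; the exact measure used by CiteSeer may differ:

    def cocitation_similarity(citers_a, citers_b):
        """Similarity of two papers based on who cites them.

        `citers_a`, `citers_b`: sets of IDs of papers citing a and b.
        Papers frequently cited together are presumed to be related."""
        if not citers_a or not citers_b:
            return 0.0
        return len(citers_a & citers_b) / len(citers_a | citers_b)

    print(cocitation_similarity({"p1", "p2", "p3"}, {"p2", "p3", "p4"}))  # 0.5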

Query answer

Query results: a snippet quoted from each matching document, ordered by decreasing number of citations.

Document summary: title, date, authors, abstract; citations to the paper; similar documents at the sentence level; bibliography; similar documents based on text; related documents from co-citation; histogram by year of the number of citing articles.

Finding specialized search engines

1. Search (in Google etc.) under "X search" or (less successfully) "X database", where X is often a supercategory of what you're actually looking for.
2. Search engines for search engines: e.g., www.invisibleweb.com and BrightPlanet.com.
3. Lists of search engines for search engines: e.g., "Guides to Specialized Search Engines (with Descriptions)", a list of 31 of them.

Automatic search engine selection

SavvySearch, by Daniel Dreilinger and Adele Howe. Indexes search engines by search term. Sends each query to a set of search engines. Adds points to an engine when it finds a page that the user clicks on; deducts points when it finds no pages.
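A minimal sketch of the scoring idea; the point values are hypothetical, and the real SavvySearch weighting scheme is more involved:

    from collections import defaultdict

    # score[term][engine]: learned suitability of an engine for a term.
    score = defaultdict(lambda: defaultdict(float))

    def record_feedback(term, engine, clicked, returned_results):
        """Update an engine's score for a term from user feedback."""
        if clicked:
            score[term][engine] += 1.0   # user visited a result: reward
        if not returned_results:
            score[term][engine] -= 0.5   # engine found nothing: penalize

    def pick_engines(query_terms, engines, k=3):
        """Choose the k engines with the best summed scores for a query."""
        totals = {e: sum(score[t][e] for t in query_terms) for e in engines}
        return sorted(engines, key=lambda e: totals[e], reverse=True)[:k]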

Experiment: January 1996

20 search engines
46,568 word stems
211,887 incoming queries
1,091,630 outgoing queries
437,243 visit events
154,962 no result events

Crawling the Hidden Web

(Sriram Raghavan, Hector Garcia-Molina)

Task-specific, human-assisted.
Two parts: (1) resource discovery, not dealt with here; (2) content extraction.

Sample tasks
"Articles relating to semiconductor industry in the last ten years" "Information about movies by Oscar-winning directors in the last 30 years"

HiWE: Hidden Web Exposer

Analyze form into form elements.
Form element = type, label, domain
Type = selection list, text box, text area, checkbox, or radio button.
Label = label on element, e.g. "company name", "state", etc.
Domain = set of values. E.g. { IBM, Microsoft, Intel ...} or { Alabama, Alaska ...}
Need to parse the physical layout of the form and associate labels with nearby form elements; this is what LITE (Layout-based Information Extraction) does. Metatags would help, except that no one ever uses them.
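A sketch of the resulting form representation; the names are hypothetical, and real layout analysis works on rendered page geometry rather than toy coordinates:

    from dataclasses import dataclass, field

    @dataclass
    class FormElement:
        type: str     # "text box", "selection list", "checkbox", ...
        label: str    # nearest visible label, e.g. "company name"
        domain: list = field(default_factory=list)  # e.g. ["IBM", "Intel"]

    def nearest_label(element_pos, labels):
        """Associate a form element with the closest label on the page.

        `labels` maps label text to (x, y) layout positions; this stands
        in for LITE's layout-based matching."""
        ex, ey = element_pos
        return min(labels,
                   key=lambda t: (labels[t][0] - ex)**2 + (labels[t][1] - ey)**2)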

The task-specific database holds known values relevant to the query. These are (a) given by the user; (b) built in (e.g., date fields, state fields); (c) extracted using a wrapper from a data source, either task-specific (e.g., the Semiconductor Research Corporation) or general (Yahoo); (d) extracted from forms (labels and values).

The matcher fills out the form. It finds the standard label name closest to the form label, then either enumerates the values given in the form or enumerates the values associated with the label in the database.

The response analyzer saves responses (excluding failures) and adjusts the weights relating values to labels: if a value leads to a failure, its weight is reduced.
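A minimal sketch of the matcher and response analyzer, reusing the FormElement sketch above; the weight table and penalty factor are assumptions, not HiWE's actual scheme:

    # task_db: normalized label -> {value: weight}, e.g.
    # {"company": {"IBM": 1.0, "Intel": 1.0}, "state": {"Alabama": 1.0}}

    def fill_form(form_elements, task_db):
        """Pick one value per element to submit (simplified: the real
        crawler enumerates all combinations of candidate values)."""
        assignment = {}
        for el in form_elements:
            if el.domain:                 # form enumerates its own values
                assignment[el.label] = el.domain[0]
            elif el.label in task_db:     # use values stored for the label
                candidates = task_db[el.label]
                assignment[el.label] = max(candidates, key=candidates.get)
        return assignment

    def analyze_response(success, assignment, task_db):
        """Downweight label/value pairs that led to a failed submission."""
        if not success:
            for label, value in assignment.items():
                if value in task_db.get(label, {}):
                    task_db[label][value] *= 0.5   # penalty factor assumed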

Evaluation metrics:
Metric 1: number of successful submissions (an answer was obtained) / total number of submissions.
Metric 2: number of semantically correct submissions (the form was filled with a meaningful query) / total number of submissions. Requires manual evaluation.

Categorization of Web

"Accessibility of information on the Web" by Steve Lawrence and C. Lee Giles. (Ironically enough, not accessible on the Web).

Characterizes "publically indexable web" (i.e. Web data findable with simple path URL) as of February 1999.

Number of web servers: 2.8 million.
Experimental technique: Generate random IP addresses out of the full IP space (256^4 addresses), find the fraction occupied by web servers, and exclude bogus addresses and repetitions (82.5% of the addresses).

Average number of indexable web pages per server: 289.
Experimental technique: Crawl all pages on first 2500 servers.
Mean size of page: 18.7 KBytes (median: 3.9 KBytes).
Mean textual content (after removing HTML and white space): 7.3 KBytes (median: 0.98 KBytes).
62 images per server, averaging 15.2 KBytes per image.

Totals for the Web (multiplying out):
800 million pages, 15 TBytes, 6 TBytes of text, 180 million images, 3 TBytes of image data. These may be underestimates, because of a small number of very large servers (e.g., GeoCities has 34 million pages).
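The multiplication, spelled out; figures are from the study, and rounding accounts for the small discrepancies:

    servers = 2.8e6
    pages = servers * 289          # ~8.1e8, i.e. ~800 million pages
    raw = pages * 18.7e3           # ~1.5e13 bytes, i.e. ~15 TBytes
    text = pages * 7.3e3           # ~5.9e12 bytes, i.e. ~6 TBytes
    images = servers * 62          # ~1.7e8, i.e. ~180 million images
    image_data = images * 15.2e3   # ~2.6e12 bytes, i.e. ~3 TBytes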

Coverage of web search engines: number of pages indexed.
Technique: Northern Light and AltaVista report their index sizes directly (checked with the query "NOT qqqhdfwxiwngol", which matches every indexed page).
Estimates for the other search engines: submit 1050 queries, compare the numbers of responses, and count the returned documents that actually contain all the query words. Compare these counts to Northern Light and AltaVista, and extrapolate the size of each index.
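A sketch of the extrapolation, assuming verified result counts per query; the paper's exact averaging may differ:

    def estimate_index_size(counts_engine, counts_reference, reference_size):
        """Extrapolate an engine's index size from relative result counts.

        counts_engine[i], counts_reference[i]: verified result counts for
        query i on the engine under test and on a reference engine (e.g.
        AltaVista) whose index size reference_size is reported directly."""
        ratio = sum(counts_engine) / sum(counts_reference)
        return ratio * reference_size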

Combined coverage: estimated to be 42% of entire web.

Delay in indexing. Technique: submit queries every day, collect new documents, and compare against the date when each page was added or modified. Average delay: 186 days (ranging from 141 for Northern Light to 235 for Yahoo and 240 for Snap).

Categories of servers:
Commercial: 83%
Scientific/educational: 6%
Pornography: 1.5%
Government: 1%
Health: 2.8%
Personal: 2.2%
Community: 1.5%
Religion: 0.8%
Societies: 2%
(Numbers eyeballed from bar chart; aware that they don't add up.)

Categorization of Invisible Web

The Deep Web: Surfacing Hidden Value by Michael K. Bergman, Journal of Electronic Publishing

Note: Many of the techniques used are proprietary and not adequately described. The whole paper must be considered an advertisement for BrightPlanet.com, and therefore should be taken with a grain of salt. Nonetheless, it is by far the deepest and most extensive analysis that I have seen.

Definition of deep web site: A site that dynamically generates pages based on a search query.

Number of deep web sites:
Compile ten lists of deep web sites. Total: 43,348.
Automatically filter for search sites. Total: 17,579 search sites.
Hand-examine 700 random sites and filter on further criteria; 13.6% are excluded. Apply this correction throughout.

Estimating total number of deep web sites.
Technique: Overlap analysis.

Suppose A and B are randomly chosen subsets of O.
Let AB=A intersect B.
Expected size of AB = |A||B|/|O|.
Therefore if you know |A|, |B|, |AB|, you can estimate |O|=|A||B|/|AB|.
In this case A and B are listings of deep Web sites.
Problem: The independence assumption is far from valid; the listings tend to be positively correlated.
Then |AB| > |A||B|/|O|, so |O| > |A||B|/|AB|; the formula yields an underestimate of |O|.
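The estimate in code; the list sizes here are hypothetical:

    def overlap_estimate(size_a, size_b, size_overlap):
        """Estimate |O| from two samples via |O| ~ |A||B|/|AB|.

        With positively correlated samples this is an underestimate."""
        return size_a * size_b / size_overlap

    # Hypothetical: listings of 10,000 and 12,000 sites sharing 2,400.
    print(overlap_estimate(10_000, 12_000, 2_400))  # 50000.0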

Values range from 43,000 to 96,000. Bergman estimates that the latter is itself an underestimate by a factor of 2; hence roughly 200,000. 94% are publicly available.

Total size of Invisible Web
There are two truly enormous Web sites:
The National Climatic Data Center (NOAA) is 366 TBytes.
NASA EOSDIS is 219 TBytes.
Sixty largest known (to the BrightPlanet folks) deep Web sites total 750 TBytes.
There may be others: an analysis of 700 randomly selected sites turned up 3 that weren't on the original list; if this proportion holds across the set of 200,000, there would be about 840 more "really big" sites. This seems unlikely.

Quality. There is some evidence that quality, as measured by the vector model, is higher for deep sites than for standard Web sites. Not a very convincing argument.