Lecture 8: Invisible Web; Tables; Sentiment Analysis

Required reading

Structured Data on the Web, M. J. Cafarella, A. Halevy, and J. Madhavan, CACM 54:2, Feb. 2011.

Google's Deep-Web Crawl, J. Madhavan et al., Proc. VLDB Endowment 1:2, August 2008.
This is the central, current article on the topic, but unfortunately very poorly written. Do what you can with it.

WebTables: Exploring the Power of Tables on the Web, Michael Cafarella et al.

Sentiment Analysis and Subjectivity, Bing Liu, in Handbook of Natural Language Processing 2nd edn. ed. N. Indurkhya and F. J. Damerau.

More Reading

Mining, Indexing, and Searching for Textual Chemical Molecule Information on the Web, Bingjun Sun, Prasenjit Mitra, and C. Lee Giles, WWW 2008.
Harnessing the Deep Web: Present and Future J. Madhavan et al., CIDR (Conference on Innovative Data Systems Research), 2009.
Understanding deep web search interfaces: A survey. Ritu Khare, Yuan An, and Il-Yeol Song, ACM SIGMOD Record 39:1, March 2010.
Uncovering the Relational Web, Michael Cafarella et al., WebDB 2008.
Opinion Mining and Sentiment Analysis by Bo Pang and Lillian Lee, Foundations and Trends in Information Retrieval 2:1-2, 2008. (essentially a book on the subject). Excellent but long.

Older but still interesting

Structured Databases on the Web: Observations and Implications, Kevin Chang et al.
Indexing the invisible web: a survey, by Yanbo Ru and Ellis Horowitz, 2005.
The Deep Web: Surfacing Hidden Value, by Michael K. Bergman, Journal of Electronic Publishing.
Crawling the Hidden Web, Sriram Raghavan and Hector Garcia-Molina, VLDB 2001.

Invisible Web

Indexable Web: Pages that can be found by crawling.
Invisible Web (hidden Web, deep Web): Information that can be accessed only or primarily by filling out a form; it is visible to a user with a browser, but will not be found by a standard crawler, for various reasons.

An enormous invisible Web site, though not a very interesting one:
NOAA (National Oceanic and Atmospheric Administration) National Climatic Data Center Doppler Weather Radar Data: 1600 TBytes (as of 2006)
In 2007 NOAA projected that it would be storing 140 petabytes by 2020.
I can't find a current (2011) count.

Surfacing the Invisible Web

Madhavan et al., "Google's Deep-Web Crawl"

Basic idea: Execute forms; get pages; add them to the Google index just as if they were ordinary static web pages. "Surfaced pages contribute results to more than 1,000 Web-search queries per second on Google.com". Since Google receives about 34,000 queries per second, roughly 1 query in 34 gets a surfaced result; assuming that "contribute results" means one item in a results page of 10, that means that about 1 out of every 340 results is from the surfaced web.

Emphasis on getting some data from a lot of sites rather than exhaustively surfacing a single site.

Template: When you fill out a form, some of the input fields are binding (that is, they are given a specified value) and the rest are free (the "Any" option, blank text, a wildcard). Some inputs are presentational (e.g. sort order); these are unimportant. A template is a choice of binding fields. The dimension of the template is the number of binding fields.
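A minimal way to picture this (the field names are hypothetical, not from the paper):

    # A hypothetical search form with four inputs; "sort_by" is presentational.
    form_inputs = ["make", "model", "zipcode", "sort_by"]

    # A template is a choice of binding inputs; the rest are left free ("Any"/blank).
    template = frozenset({"make", "zipcode"})

    dimension = len(template)   # 2: this is a 2-dimensional template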

Informativeness: A signature for a results page is a way of deciding whether two pages are essentially the same: ignore HTML formatting, ads, and the order of results, and delete the query words from the page (they are often just echoed back). Submissions that return HTML errors or failed-search messages are ignored. The absolute informativeness of a template T is (the number of distinct signatures returned by queries generated from T) / (the number of form submissions on T). If the informativeness is greater than 0.2, then T is informative.

The incremental informativeness is the number of new distinct signatures divided by the number of queries.
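A sketch of how the signature and informativeness computations might be implemented; the HTML stripping and error-page detection here are crude stand-ins for details the paper leaves open:

    import re

    def signature(html, query_words):
        """Reduce a result page to a content signature: strip tags, drop the
        echoed query words, and ignore the order of the remaining tokens."""
        text = re.sub(r"<[^>]+>", " ", html).lower()     # very crude HTML stripping
        query = {w.lower() for w in query_words}
        return frozenset(t for t in re.findall(r"\w+", text) if t not in query)

    def informativeness(results):
        """results: list of (query_words, html, is_error) tuples, one per form
        submission generated from a template T. Error pages and failed-search
        pages are ignored, as in the notes above."""
        valid = [(q, h) for (q, h, is_error) in results if not is_error]
        if not valid:
            return 0.0
        distinct = {signature(h, q) for (q, h) in valid}
        return len(distinct) / len(valid)

    def is_informative(results, threshold=0.2):
        return informativeness(results) > threshold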

Generating templates (a code sketch follows this outline):
Try all 1-dimensional templates.
For each, issue a lot of queries.
See which templates are informative.
Extend the informative 1-D templates to 2-D templates.
Iterate. Stop after 3-D templates.
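A sketch of that search over templates, with query issuing and the informativeness test abstracted into a callback (for instance, built on the functions sketched above):

    def generate_templates(binding_candidates, is_informative, max_dim=3):
        """binding_candidates: input names eligible to be binding (presentational
        inputs already excluded). is_informative(template) is assumed to issue a
        batch of queries for the template and report whether it is informative."""
        informative = set()
        frontier = [frozenset({f}) for f in binding_candidates]   # all 1-D templates
        for dim in range(1, max_dim + 1):
            survivors = [t for t in frontier if is_informative(t)]
            informative.update(survivors)
            if dim == max_dim:
                break
            # Extend each informative template by one additional binding input.
            frontier = list({t | {f} for t in survivors
                             for f in binding_candidates if f not in t})
        return informative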

Generating queries. Here things become murky.

Experimentation and evaluation: I am skipping the experimentation except for Table 4, in which they compare the number of records that the system retrieved against the number of items in the database, for a dozen databases where this number was published or determinable. In all but one case, they were able to retrieve most of the records. What's strange are the two databases with 27 records.

WebTables

Work in progress: prototype systems have been developed, but they are not available to the end user.

Objective: Get useful information out of the vast number of relational tables on the Web.

There are 14.1 billion HTML tables in English documents in the Google index (2008).
About 154 million of these are actually relational tables (the rest are used for page layout, formatting, etc.).

Offline: extract the relational tables from the crawled pages and compute the ACSDb (Attribute Correlation Statistics Database), which records how often each schema, each attribute, and each pair of attributes occurs.
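A sketch of the kind of counting this involves, assuming the relational tables and their schemas (rows of attribute names) have already been extracted; the representation is my own, not the paper's:

    from collections import Counter
    from itertools import combinations

    def build_acsdb(schemas):
        """schemas: iterable of attribute-name tuples, one per extracted table.
        Returns frequencies of whole schemas, single attributes, and attribute
        pairs; these counts feed the probabilities used by the applications below."""
        schema_freq = Counter(frozenset(a.lower() for a in s) for s in schemas)
        attr_freq, pair_freq = Counter(), Counter()
        for schema, n in schema_freq.items():
            for a in schema:
                attr_freq[a] += n
            for a, b in combinations(sorted(schema), 2):
                pair_freq[(a, b)] += n
        return schema_freq, attr_freq, pair_freq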

Applications

Table Search Engine. Give high rank to tables that match the query in the top row and in the leftmost column. Give a query-independent high rank to tables whose attributes have high "schema coherency", based on the ACSDb.
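A toy scoring function that mirrors just the two signals mentioned above; the paper's actual ranking combines more features, and the weighting here is arbitrary:

    def table_score(query_terms, table, coherency):
        """table: list of rows (lists of cell strings), with the header in row 0.
        coherency: a query-independent, ACSDb-based schema-coherency score.
        Reward query matches in the header row and in the leftmost column."""
        q = {t.lower() for t in query_terms}
        header = {c.lower() for c in table[0]}
        left_column = {row[0].lower() for row in table}
        hits = len(q & header) + len(q & left_column)
        return hits + coherency        # the relative weighting is a free parameter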

Schema Auto-Complete. A web designer enters some of the attributes of a table; auto-complete suggests more, based on high conditional probability.
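A sketch of a greedy auto-completion using ACSDb-style counts like those above; approximating the conditional probability by the weakest pairwise link is my own simplification, and the threshold is a placeholder:

    def autocomplete(seed_attrs, attr_freq, pair_freq, min_prob=0.05, max_new=5):
        """Greedily suggest attributes that have high probability conditioned on
        the attributes the designer has already entered."""
        schema = {a.lower() for a in seed_attrs}
        suggestions = []
        for _ in range(max_new):
            best, best_p = None, min_prob
            for cand in attr_freq:
                if cand in schema:
                    continue
                # p(cand | schema), approximated by the weakest pairwise estimate.
                p = min(pair_freq[tuple(sorted((cand, a)))] / max(attr_freq[a], 1)
                        for a in schema)
                if p > best_p:
                    best, best_p = cand, p
            if best is None:
                break
            schema.add(best)
            suggestions.append(best)
        return suggestions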

Attribute synonymy. Attribute A is a synonym for attribute B if A and B essentially never occur together in the same schema but tend to occur in the same contexts, i.e. with the same other attributes; both conditions can be estimated from the ACSDb.
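One way to turn this into a score; the exact formula and the smoothing constant here are my own reconstruction, not taken verbatim from the paper:

    def syn_score(a, b, attr_freq, pair_freq, eps=0.01):
        """High when a and b almost never co-occur in a schema but tend to
        co-occur with the same context attributes z."""
        def pair(x, y):
            return pair_freq[tuple(sorted((x, y)))]
        total = sum(attr_freq.values()) or 1
        co_occurrence = pair(a, b) / total                      # want this near 0
        shared_context = sum((pair(a, z) / max(attr_freq[a], 1)) *
                             (pair(b, z) / max(attr_freq[b], 1))
                             for z in attr_freq if z not in (a, b))
        return shared_context / (eps + co_occurrence)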

Metasearch

There used to be many (about a dozen) general search engines, with surprisingly little overlap in their results. So for a thorough search you wanted to consult them all. A metasearch engine automatically sent your query to all of them and then collated the results. It was unpopular with the search engines, because it stripped out the ads. Issues:

Sentiment Analysis

Lots of opinions about lots of things on the web.

Level of analysis

Sub-tasks

Two nice features of document-level classification:
1) There is an immense body of free labelled data, in the form of reviews that have both a text and a number of stars.
2) Unlike most natural-language tasks, the output is just a number between 1 and 5.

Supervised classification learning
Classification problem: There is a universe of instances and a finite set of categories (or a numerical measure). Given a new instance, predict which category it belongs to, or what the value of the measure is.
Supervised learning: You have available a corpus of instances that are labelled correctly, or mostly correctly.
Learning task: Learn a classifier from the labelled instances; then classify a new instance by applying the classifier to it.

I'm not going to teach techniques for supervised classification learning. We will simply assume that these techniques exist and work better or worse depending on the problem. For a little bit about this, take the AI class G22.2560; for a lot about it, take "Machine Learning" G22.2565.

What I will discuss are the features that are used for classification, and techniques that go beyond supervised classification.

Depth of the Natural Language Analysis

Doing this task with perfect accuracy would require a perfect natural-language understander, which is not available. The question is what you can actually accomplish.

I. Some words (mostly adjectives; to a lesser extent nouns, verbs, and adverbs) are favorable ("good", "amazing", "beautiful") and some are unfavorable ("bad", "dreadful", "junk").

So for a very crude analysis, use a classifier that classifies just on the basis of these words (e.g. by a weighted sum), learned from the labelled corpus. "Applying machine learning techniques based on unigram [single word] models achieve over 80% in accuracy" (Pang and Lee p. 19) (for the task of distinguishing positive from negative documents).
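A minimal sketch of such a unigram classifier, here using scikit-learn; the two training reviews are placeholders for a real labelled corpus of reviews and star ratings:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    # Placeholder labelled corpus; in practice, thousands of reviews with ratings.
    train_texts = ["an amazing, beautiful film", "dreadful acting and a bad script"]
    train_labels = ["pos", "neg"]

    classifier = make_pipeline(CountVectorizer(), MultinomialNB())  # unigram features
    classifier.fit(train_texts, train_labels)

    print(classifier.predict(["a good movie with beautiful photography"]))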

II.A. Learn new words via synonyms/antonyms in WordNet. Learn words in a foreign language from a dictionary.

II.B. Learn new words via connectives: "U and V" suggests that U and V probably have the same polarity; "U but V" suggests that they probably have opposite polarity.
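A sketch of how the and/but heuristic can grow a polarity lexicon from a few seed words; extracting the coordinated word pairs from text (which in practice needs a part-of-speech tagger) is assumed to have been done already:

    def propagate_polarity(seeds, and_pairs, but_pairs, iterations=5):
        """seeds: dict word -> +1/-1. and_pairs: word pairs seen joined by "and"
        (probably same polarity). but_pairs: pairs joined by "but" (probably
        opposite polarity). Labels spread outward from the seeds."""
        polarity = dict(seeds)
        for _ in range(iterations):
            for u, v in and_pairs:
                if u in polarity and v not in polarity:
                    polarity[v] = polarity[u]
                elif v in polarity and u not in polarity:
                    polarity[u] = polarity[v]
            for u, v in but_pairs:
                if u in polarity and v not in polarity:
                    polarity[v] = -polarity[u]
                elif v in polarity and u not in polarity:
                    polarity[u] = -polarity[v]
        return polarity

    lexicon = propagate_polarity({"good": +1, "bad": -1},
                                 and_pairs=[("good", "reliable")],
                                 but_pairs=[("reliable", "overpriced")])
    # lexicon: {"good": 1, "bad": -1, "reliable": 1, "overpriced": -1}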

III. Simple phrasal context can change the force of a word. Most notably, "not W" generally has the reverse polarity from W. But you have to be careful of phrases like "not only W but also U", in which W keeps its original polarity. Note the contrast with topic relevance. Quite a few other words also carry a negative force, e.g. "avoids" in "It avoids all cliches and predictability found in Hollywood movies" (example from Pang and Lee, p. 36).
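One common, crude treatment of negation in this literature is to tag every word between a negation word and the next punctuation mark, so that "good" and "NOT_good" become different features; a sketch:

    import re

    NEGATION_WORDS = {"not", "no", "never"}

    def tag_negation(text):
        """Prefix NOT_ to each token between a negation word and the next
        punctuation mark, e.g. "not good" becomes "not NOT_good"."""
        out, negating = [], False
        for token in re.findall(r"[\w']+|[.,!?;]", text.lower()):
            if token in {".", ",", "!", "?", ";"}:
                negating = False
                out.append(token)
            elif token in NEGATION_WORDS or token.endswith("n't"):
                negating = True
                out.append(token)
            else:
                out.append("NOT_" + token if negating else token)
        return " ".join(out)

    print(tag_negation("The plot was not good, but the acting was great."))
    # the plot was not NOT_good , but the acting was great .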

IV. Feature analysis.

V. Discourse Analysis. "In newsgroups devoted to three distinct controversial topics (abortion, gun control, and immigration) Agrawal et al. observe that The relationship between two individuals in the `responded to' network is more likely to be antagonistic --- overall 74% of the responses examined were found to be antagonistic, whereas only 7% were found to be reinforcing." (Pang and Lee p. 48).

Cocitation (two blogs citing the same source) tends to indicate agreement.

Mining Newsgroups Using Networks Arising from Social Behavior. Rakesh Agrawal et al.

Observation: In a newsgroup, if X quotes Y, it is much more often in order to disagree than to agree (74% of quoting responses disagree, 7% agree, 19% are off topic).

Therefore: Construct the graph whose nodes are the posters, with an arc from X to Y if X quotes Y. Divide the set of posters into two disjoint classes, F (for) and A (against), so as to maximize the number of arcs that go from one class to the other.

This problem is known as "MAX-CUT" and is NP-complete, but there are approximation algorithms.
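For concreteness, here is a sketch of a simple local-search approximation for the cut; this is a generic heuristic, not the EV algorithm whose results appear in the table below:

    def local_search_max_cut(posters, arcs, max_passes=100):
        """Greedy local search for MAX-CUT: start from an arbitrary split of the
        posters into two classes and repeatedly move any poster whose move
        increases the number of quote arcs crossing the cut."""
        side = {p: i % 2 for i, p in enumerate(posters)}      # arbitrary initial split
        neighbors = {p: [] for p in posters}
        for x, y in arcs:                                     # arc (x, y): x quoted y
            neighbors[x].append(y)
            neighbors[y].append(x)
        for _ in range(max_passes):
            improved = False
            for p in posters:
                same = sum(1 for q in neighbors[p] if side[q] == side[p])
                across = len(neighbors[p]) - same
                if same > across:                             # flipping p gains crossing arcs
                    side[p] = 1 - side[p]
                    improved = True
            if not improved:
                break
        F = {p for p in posters if side[p] == 0}              # "for"
        A = {p for p in posters if side[p] == 1}              # "against"
        return F, A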

Experiment: Three newsgroups: abortion (2525 authors), gun control (2632 authors), and immigration (1648 authors).

Two versions of the algorithm were tested: the pure version, and a "constrained" version in which 50 random authors were manually labelled and the maximal cut respecting that labelling was returned.

For comparison, text-based SVM and Naive Bayes classifiers were also tried: both are essentially useless here.

Algorithm             Abortion   Gun Control   Immigration
Majority assignment   57%        72%           54%
SVM                   55%        42%           55%
Naive Bayes           50%        72%           54%
EV                    73%        78%           50%
Constrained EV        73%        84%           88%

VI. Discriminating opinions from other text. Direct statements of subjectivity ("In my opinion") or a strong presence of evaluative words are suggestive but not decisive in either direction. Odd fact: "Hapax legomena or words that appear a single time in a given corpus have been found to be high-precision indicators of subjectivity" (Pang and Lee p. 33).

VII. Opinion search engine. Differs from the usual objective of a search engine in that

VIII. Spam, both positive and negative. Very hard to detect. Liu proposes as sufficient tests (a) cases where the same review appears under multiple userids and (b) cases where the same review is applied to different items. But this gives a very limited corpus, which may not be representative. Liu observes that spam seems to gather more "this review was helpful" metacomments than non-spam, but of course these metacomments may also be spam.