Google's Deep Web Crawl, Madhavan et al., Proc. VLDB Endowment, 1:2 (2008).
This is the central, current article on the topic, but unfortunately very poorly written. Do what you can with it.
WebTables: Exploring the Power of Tables on the Web, Michael Cafarella et al.
Sentiment Analysis and Subjectivity, Bing Liu, in Handbook of Natural Language Processing 2nd edn. ed. N. Indurkhya and F. J. Damerau.
An enormous invisible Web site, though not a very interesting one:
NOAA (National Oceanic and Atmospheric Administration) National Climatic Data Center Doppler Weather Radar Data: 1600 TBytes (as of 2006)
In 2007, NOAA projected that it would be storing 140 petabytes by 2020.
I can't find a current (2011) count.
Basic idea: Execute forms. Get pages. Add them to the Google index just as if they were ordinary static web pages. "Surfaced pages contribute results to more than 1,000 Web-search queries per second on Google.com". Since Google receives about 34,000 queries per second, then assuming that "contribute results" means "one item in a results page of 10", that means that about 1 out of every 340 results is from the surfaced web.
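The back-of-the-envelope arithmetic, spelled out (the 34,000 queries/second and 10 results/page figures are the assumptions stated above):

```python
# Back-of-the-envelope: what fraction of Google results come from surfaced pages?
surfaced_queries_per_sec = 1_000   # queries whose results include a surfaced page
total_queries_per_sec = 34_000     # assumed total Google query rate
results_per_page = 10              # assume "contributes results" = 1 of 10 results

# Fraction of queries touched, then fraction of individual results
fraction_of_queries = surfaced_queries_per_sec / total_queries_per_sec   # ~1/34
fraction_of_results = fraction_of_queries / results_per_page             # ~1/340
print(round(1 / fraction_of_results))  # → 340
```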
Emphasis on getting some data from a lot of sites rather than exhaustively surfacing a single site.
Template: When you fill out a form, some of the input fields are binding (that is, have a specified value) and the rest are free ("Any" option, blank text, wildcard). Some inputs are presentational (e.g. order); these are unimportant. A template is a choice of binding fields. The dimension of the template is the number of binding fields.
Informativeness: A signature for a results page is a way of deciding whether two pages are essentially the same: ignore HTML formatting, ads, and the order of results, and delete the query words from the page (they are just echoed back). Queries that return HTML errors or failed-search messages are ignored. The absolute informativeness of template T is (the number of distinct signatures returned by queries generated from T) / (the number of form submissions on T). If the informativeness is greater than 0.2, then T is considered informative.
The incremental informativeness is the number of new distinct signatures divided by the number of queries.
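A minimal sketch of the signature and (absolute) informativeness computations just described; the normalization steps --- stripping tags, ignoring order --- are crude stand-ins for whatever the real system does:

```python
import re

def signature(page_html, query_words):
    """Reduce a result page to a content signature: strip HTML tags,
    drop the query words (which are just echoed back), and ignore
    the order of what remains."""
    text = re.sub(r'<[^>]+>', ' ', page_html).lower()
    queries = {q.lower() for q in query_words}
    words = [w for w in re.findall(r'\w+', text) if w not in queries]
    return frozenset(words)   # order-insensitive

def informativeness(template_results):
    """template_results: list of (page_html, query_words) pairs from
    submissions of one template.  Returns distinct signatures / submissions.
    The template is judged informative if this exceeds 0.2."""
    sigs = {signature(page, qw) for page, qw in template_results}
    return len(sigs) / len(template_results)
```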
Try all 1-dimensional templates.
For each, issue a lot of queries.
See which templates are informative.
Extend the informative 1-D templates to 2-D templates.
Iterate. Stop after 3-D templates.
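The loop above can be sketched as follows; `evaluate` stands in for the whole issue-queries-and-count-signatures step, and the 0.2 threshold is the one given earlier:

```python
from itertools import combinations

def find_informative_templates(fields, evaluate, threshold=0.2, max_dim=3):
    """fields: the form's input fields.
    evaluate(template): issues sample queries with those fields bound
    and returns the template's informativeness.
    Returns all informative templates of dimension <= max_dim."""
    informative = []
    # Dimension 1: try every single-field template.
    frontier = [frozenset([f]) for f in fields]
    for dim in range(1, max_dim + 1):
        survivors = [t for t in frontier if evaluate(t) > threshold]
        informative.extend(survivors)
        # Extend each informative template by one more binding field.
        frontier = {t | {f} for t in survivors for f in fields if f not in t}
    return informative
```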
Generating queries. Here things become murky.
Experimentation and evaluation: I am skipping the experimentation except for Table 4, in which they compare the number of records that the system retrieved against the number of items in the database, for a dozen databases where this number was published or determinable. In all but one case, they were able to retrieve most of the records. What's strange are the two databases with 27 records.
Objective: Get useful information out of vast number of relational tables on Web.
14.1 billion HTML tables in English documents in the Google index (2008).
About 154 million of these are actually relational tables (the rest are used for formatting etc.)
Table Search Engine. Give high rank to tables that match the query in the top row and in the leftmost column. Give a query-independent high rank to tables with high "schema coherency" of the attributes, based on the ACSDb (attribute correlation statistics database).
Schema Auto-Complete. A web designer enters some of the attributes of a table. Auto-complete suggests some more, based on high conditional probability.
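A toy version of auto-complete, assuming the ACSDb has been boiled down to raw schema counts (the real system works over millions of extracted schemas and uses more careful probability estimates):

```python
from collections import Counter
from itertools import combinations

def build_acsdb(schemas):
    """schemas: list of attribute sets.  Count single attributes and pairs."""
    single, pair = Counter(), Counter()
    for schema in schemas:
        for a in schema:
            single[a] += 1
        for a, b in combinations(sorted(schema), 2):
            pair[(a, b)] += 1
    return single, pair

def suggest(partial, single, pair, k=3):
    """Rank candidate attributes by the highest conditional probability
    p(candidate | some attribute already in the partial schema)."""
    def p(cand, given):
        key = tuple(sorted((cand, given)))
        return pair[key] / single[given] if single[given] else 0.0
    candidates = {a for a in single if a not in partial}
    scored = {a: max(p(a, g) for g in partial) for a in candidates}
    return sorted(scored, key=scored.get, reverse=True)[:k]
```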
Attribute synonymy. Attribute A is a synonym for attribute B if, roughly, A and B never appear together in the same schema but they co-occur with similar sets of other attributes, as measured over the ACSDb.
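The test can be sketched as a heuristic over raw schema counts (this illustrates the idea --- never co-occurring but sharing context --- and is not the paper's exact scoring function):

```python
from collections import Counter
from itertools import combinations

def synonym_score(a, b, schemas):
    """schemas: list of attribute sets from the corpus.
    Score is 0 if a and b ever co-occur in one schema; otherwise it is
    the number of third attributes that co-occur with both (shared context)."""
    pair = Counter()
    for s in schemas:
        for x, y in combinations(sorted(s), 2):
            pair[(x, y)] += 1
    def co(x, y):
        return pair[tuple(sorted((x, y)))]
    if co(a, b) > 0:
        return 0            # they co-occur, so they are not synonyms
    vocab = set().union(*schemas) - {a, b}
    return sum(1 for c in vocab if co(a, c) > 0 and co(b, c) > 0)
```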
Level of analysis
Two nice features of document-level classification:
1) There is an immense body of free labelled data, in the form of reviews that have both a text and a number of stars.
2) Unlike most natural language tasks, the output is a number between 1 and 5.
Supervised classification learning
Classification problem: A universe of instances; a finite set of categories, or a numerical measure. Given a new instance, predict which category it belongs to, or what the value of the measure is.
Supervised learning. You have available a corpus of instances that are labelled correctly, or mostly correctly.
Learning task: Learn a classifier from the labelled instances. Then do classification by applying the classifier to the new instance.
I'm not going to teach techniques for supervised classification learning. We will simply assume that these techniques exist and work better or worse depending on the problem. For a little bit about this, take the AI class G22.2560; for a lot about it, take "Machine Learning" G22.2565.
What I will discuss are the features that are used for classification, and techniques that go beyond supervised classification.
I. Some words (mostly adjectives; to a lesser extent nouns, verbs, and adverbs) are favorable ("good", "amazing", "beautiful") and some are unfavorable ("bad", "dreadful", "junk").
So for a very crude analysis, use a classifier that classifies just based on these words (e.g. by a weighted sum), learned from the labelled corpus. "Applying machine learning techniques based on unigram [single word] models achieve over 80% in accuracy" (Pang and Lee p. 19) (for the task of distinguishing positive from negative documents).
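A minimal unigram classifier of this kind, here in the Naive Bayes style rather than a weighted sum; the toy corpus stands in for a real labelled review corpus:

```python
import math
from collections import Counter

def train(docs):
    """docs: list of (text, label) pairs.  Returns per-label unigram
    counts, label priors, and the vocabulary."""
    counts, priors = {}, Counter()
    for text, label in docs:
        priors[label] += 1
        counts.setdefault(label, Counter()).update(text.lower().split())
    vocab = {w for c in counts.values() for w in c}
    return counts, priors, vocab

def classify(text, counts, priors, vocab):
    """Multinomial Naive Bayes over unigrams, in log space."""
    n = sum(priors.values())
    best, best_lp = None, float('-inf')
    for label, cnt in counts.items():
        lp = math.log(priors[label] / n)
        total = sum(cnt.values())
        for w in text.lower().split():
            # Laplace smoothing so unseen words don't zero out the score
            lp += math.log((cnt[w] + 1) / (total + len(vocab)))
        if lp > best_lp:
            best, best_lp = label, lp
    return best
```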
II.A. Learn new words via synonyms/antonyms in WordNet. Learn words in a foreign language from a bilingual dictionary.
II.B. Learn new words via connectives. "U and V" -- probably same polarity. "U but V" --- probably opposite polarity.
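The connective heuristic can be turned into a simple polarity-propagation sketch; the seed word and the word pairs here are made up for illustration:

```python
def propagate(seeds, and_pairs, but_pairs, rounds=5):
    """seeds: {word: +1 or -1}.  'u and v' => probably same polarity,
    'u but v' => probably opposite polarity.  Propagate labels along
    these edges for a fixed number of rounds."""
    polarity = dict(seeds)
    edges = [(u, v, +1) for u, v in and_pairs] + \
            [(u, v, -1) for u, v in but_pairs]
    for _ in range(rounds):
        for u, v, sign in edges:
            if u in polarity and v not in polarity:
                polarity[v] = sign * polarity[u]
            elif v in polarity and u not in polarity:
                polarity[u] = sign * polarity[v]
    return polarity

pol = propagate({'good': +1},
                and_pairs=[('good', 'reliable'), ('reliable', 'sturdy')],
                but_pairs=[('sturdy', 'ugly')])
# 'reliable' and 'sturdy' come out positive, 'ugly' negative
```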
III. Simple phrasal context can change the force of a word. Most notably, "not W" generally has the reverse polarity from W. But you have to be careful of phrases like "not only W but also U", in which W keeps its original polarity. Note the contrast with topic relevance. Quite a few other words have a negative force, e.g. "It avoids all cliches and predictability found in Hollywood movies" (example from Pang and Lee, p. 36).
IV. Feature analysis.
V. Discourse Analysis. "In newsgroups devoted to three distinct controversial topics (abortion, gun control, and immigration) Agrawal et al. observe that The relationship between two individuals in the `responded to' network is more likely to be antagonistic --- overall 74% of the responses examined were found to be antagonistic, whereas only 7% were found to be reinforcing." (Pang and Lee p. 48).
Cocitation --- two blogs that cite the same source --- tends to indicate agreement.
Mining Newsgroups Using Networks Arising from Social Behavior. Rakesh Agrawal et al.
Observation: In a newsgroup, when X quotes Y, it is much more often to disagree than to agree (74% disagree, 7% agree, 19% off topic).
Therefore: Construct the graph whose nodes are posters and where there is an arc from X to Y if X quotes Y. Divide the set of posters into two disjoint classes, F (for) and A (against), in such a way that the maximum number of arcs go from one class to the other.
This problem is known as "MAX-CUT" and is NP-complete, but there are approximation algorithms.
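One such approximation is plain local search, sketched here on the quote graph (a generic heuristic guaranteeing at least half the arcs cut, not necessarily the algorithm Agrawal et al. use):

```python
def max_cut_local_search(nodes, arcs):
    """Greedy local search for MAX-CUT: start with all nodes in F,
    move any node whose move increases the number of cut arcs,
    and repeat until no single move helps.  A local optimum of this
    kind always cuts at least half of the arcs."""
    side = {n: 0 for n in nodes}        # 0 = F (for), 1 = A (against)
    def cut_size():
        return sum(side[x] != side[y] for x, y in arcs)
    improved = True
    while improved:
        improved = False
        for n in nodes:
            before = cut_size()
            side[n] ^= 1                 # tentatively flip n
            if cut_size() <= before:
                side[n] ^= 1             # flip back: no improvement
            else:
                improved = True
    return side, cut_size()
```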
Experiment: Three newsgroups: abortion (2525 authors), gun control (2632 authors), and immigration (1648 authors).
Two versions of the algorithm: the pure version, and a "constrained" version in which 50 random authors were manually labelled and the maximum cut respecting that labelling was returned.
For comparison, text-based SVM and Naive Bayes classifiers were tried: both are useless on this task.
VI. Discriminating opinions from other text. Direct statements of subjectivity ("In my opinion") or strong presence of evaluative words is suggestive but not decisive in either direction. Odd fact: "Hapax legomena or words that appear a single time in a given corpus have been found to be high-precision indicators of subjectivity" (Pang and Lee p. 33).
VII. Opinion search engine. Differs from the usual objective of a search engine in that the user typically wants opinions about the query topic, not just documents relevant to it, and the answer may have to be aggregated over many documents rather than returned as a ranked list.
VIII. Spam, both positive and negative. Very hard to detect. Liu proposes as sufficient tests (a) cases where the same review appears under multiple userids and (b) cases where the same review is applied to different items. But this gives a very limited corpus, which may not be representative. Liu observes that spam seems to gather more "this review was helpful" metacomments than non-spam, but of course these metacomments may also be spam.