"Unsupervised" Content Mining

Opinion Mining

Bing Liu, Web Data Mining, chap. 11: Find and summarize evaluative text.

Tasks

Sentiment classification

Thumbs Up or Thumbs Down? Semantic Orientation Applied to Unsupervised Classification of Reviews Peter Turney

Accuracy: ranges from 84% for automobile reviews to 66% for movie reviews.

Some misclassified reviews:

Movie: The Matrix
Actual rating: 5 stars
Sample phrase: "more evil"
SO of phrase: -4.384
Average SO: -0.219
Context: "The slow methodical way he spoke. I loved it! It made him seem more arrogant and even more evil."

Movie: Pearl Harbor
Actual rating: 5 stars
Sample phrase: "sick feeling"
SO of phrase: -8.308
Average SO: -0.378
Context: "During this period I have a sick feeling, knowing what was coming, knowing what was part of our history."

Movie: The Matrix
Actual rating: 5 stars
Sample phrase: "very talented"
SO of phrase: 1.992
Average SO: 0.177
Context: "Well as usual Keanu Reeves is nothing special, but surprisingly the very talented Laurence Fishburne is not so good either. I was surprised."

Alternative approach: Train from a labelled corpus using a classification algorithm.

Product Feature Evaluation

Extracting Product Features and Opinions from Reviews Ana-Maria Popescu and Oren Etzioni

Input product class C, reviews R
Output set of [feature, ranked opinion list] tuples

Product: Scanner

Explicit features               Example
Properties                      Scanner size
Parts                           Cover
Features of Parts               Battery life
Related Concepts                Scanner image
Features of Related Concepts    Scanner image size

Extract Explicit Features

Word Semantic Orientation (not the best phrase).
Three levels of orientation:

Absolute orientation of a word by itself.

Orientation of a (word, feature) pair.

Orientation of a (word, feature, sentence) triple.

Minqing Hu and Bing Liu, Mining and Summarizing Customer Reviews, point out that online reviews are often obligingly divided into "Pros" and "Cons", which gives much more direct evidence for a lot of this. Also, reviews often come with a number of stars awarded. If Popescu and Etzioni use any of this, they don't say so.

Evaluation
Identifying opinion phrases: precision 79%, recall 76%.
Assigning polarity: precision 86%, recall 89%.

Mining the Web for Synonyms: PMI-IR versus LSA on TOEFL Peter Turney

TOEFL synonym test problem:

Which of the following is a synonym for "adipose":

Measure the score that X (a choice) is a synonym for P (the problem word) as follows.

Score 1: Score1(X,P) = hits(P AND X) / hits(X)
TOEFL accuracy = 62.5%; ESL accuracy = 48%. Note: the average TOEFL accuracy of a "Non-English US College Applicant" (presumably an applicant to a US college whose native language is not English) is 64.5%.

Score 2: Take proximity into account.
Score2(X,P) = hits(P NEAR X) / hits(X)
TOEFL accuracy = 72.5%; ESL accuracy = 62%.

Score 3: Exclude antonyms. These tend to appear in the form "P but not X" or something similar.

Score3(X,P) =

HITS((P NEAR X) AND NOT ((P OR X) NEAR "not"))
----------------------------------------------
HITS(X AND NOT (X NEAR "not"))

where HITS() is the number of hits in AltaVista.
TOEFL accuracy = 73.75%; ESL accuracy = 66%.

Score 4 (ESL): The ESL test asks for synonyms for a word as used in a particular context, e.g.

Every year in the early spring, farmers tap maple syrup from their trees.

Identify the context word c as the word in the sentence, other than the problem word, the choices, and stop words, that maximizes
HITS(P AND c)/HITS(c) (= Score1(c,P)).

Then set Score4(X) =

HITS((P NEAR X) AND c AND NOT ((P OR X) NEAR "not"))
----------------------------------------------------
HITS(X AND NOT (X NEAR "not"))

ESL accuracy = 74%.
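A sketch of the four scores as defined above, again with a hypothetical hits() function standing in for the search-engine queries (AND, OR, NOT, NEAR):

def hits(query):
    # Hypothetical: number of documents matching the Boolean/NEAR query.
    raise NotImplementedError

def score1(x, p):
    return hits(f'{p} AND {x}') / hits(x)

def score2(x, p):
    return hits(f'{p} NEAR {x}') / hits(x)

def score3(x, p):
    num = hits(f'({p} NEAR {x}) AND NOT (({p} OR {x}) NEAR "not")')
    den = hits(f'{x} AND NOT ({x} NEAR "not")')
    return num / den

def score4(x, p, c):
    # c: the context word that maximizes score1(c, p).
    num = hits(f'({p} NEAR {x}) AND {c} AND NOT (({p} OR {x}) NEAR "not")')
    den = hits(f'{x} AND NOT ({x} NEAR "not")')
    return num / den

def best_choice(p, choices, score=score3):
    # Pick the choice with the highest score as the synonym of p.
    return max(choices, key=lambda x: score(x, p))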

Why bother, given the existence of WordNet?

Preemptive Information Extraction

Yusuke Shinyama, NYU Ph.D. thesis.
Preemptive information extraction using unrestricted relation discovery Yusuke Shinyama and Satoshi Sekine

Experiments

Collected 1.1 million articles. Extracted 35,400 events, all significant; 176,985 entities, of which 52,468 were significant; 4,400,000 local feature types, of which 500,000 were significant; 49,000 mappings; and 2,000 clusters of size greater than 4, containing in total about 6,000 events.

Open Information Extraction

Open Information Extraction from the Web Michele Banko et al.

TextRunner Search

Overall: Extract binary relations between entities from web pages -- relation R holds between entities X and Y. No input domain or lexical knowledge (of open classes).

Specifically: From 9,000,000 Web pages, extract 60.5 million tuples. Exclude

Leaves 11.3 million tuples. Of these, estimate (by sampling) that 9.3 million have a well-formed relation. Of these, 7.8 million have well-formed entities. Of these, 6.8 million are relations between "abstract" entities, of which 80% are correct, and 1 million are between concrete entities, of which 88% are correct. (Discussion of evaluation below.)

Implementation

Step 1: Start with a high-powered syntactic parser, and rules to identify significant relations, run over a small training corpus (the parser is too expensive to run over the whole Web). Find pairs of "base noun phrases" E_i, E_j in the parse tree that are related by a relation R_i,j.

Step 2: Use the set of relations output by step 1 as training data for a Naive Bayes classifier. The classification computed by the classifier is the 4-ary predicate "E_i and E_j are related by R_i,j in sentence S". The features are things like the presence of part-of-speech tag sequences in the relation R_i,j, the number of tokens in R_i,j, the number of stopwords in R_i,j, whether or not an object E is found to be a proper noun, etc.

Step 3: Over the entire corpus, run a part-of-speech tagger and a "lightweight noun phrase chunker", plus a regularizer (e.g. standardize tense). Apply the Naive Bayes classifier to each sentence and extract all pertinent relations.

Step 4: Merge identical relations.
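A rough sketch of steps 2-3: a hand-rolled Naive Bayes classifier over simple binary features of a candidate tuple (E_i, R_i,j, E_j). The feature set and classifier details here are illustrative guesses in the spirit of the list in step 2, not the paper's actual implementation; the tagging and chunking that produce the candidates are omitted.

import math
from collections import defaultdict

STOPWORDS = {"the", "a", "an", "of", "to", "in", "and", "that", "is"}

def tuple_features(e_i, rel_tokens, rel_tags, e_j):
    # Binary features of a candidate (E_i, R_i,j, E_j); illustrative only.
    return {
        "rel_short": len(rel_tokens) <= 4,
        "rel_has_verb": any(t.startswith("VB") for t in rel_tags),
        "rel_has_stopword": any(w.lower() in STOPWORDS for w in rel_tokens),
        "e_i_capitalized": e_i[:1].isupper(),
        "e_j_capitalized": e_j[:1].isupper(),
    }

class NaiveBayes:
    # Label True = "E_i and E_j are related by R_i,j in this sentence".
    def __init__(self):
        self.label_counts = {True: 1, False: 1}                  # Laplace pseudocounts
        self.feat_counts = {True: defaultdict(lambda: [1, 1]),
                            False: defaultdict(lambda: [1, 1])}  # [count(v=0), count(v=1)]

    def train(self, examples):
        # examples: iterable of (feature dict, label) pairs from step 2.
        for feats, label in examples:
            self.label_counts[label] += 1
            for f, v in feats.items():
                self.feat_counts[label][f][int(v)] += 1

    def classify(self, feats):
        total = sum(self.label_counts.values())
        def logprob(label):
            lp = math.log(self.label_counts[label] / total)
            for f, v in feats.items():
                c = self.feat_counts[label][f]
                lp += math.log(c[int(v)] / (c[0] + c[1]))
            return lp
        return logprob(True) > logprob(False)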

Major problem: Correctly finding the boundary of NPs. Titles of books and movies are particularly difficult.

URES: an unsupervised web relation extraction system Benjamin Rosenfeld and Ronen Feldman

Task: Given a target relation, collect instances of the relation with high precision.

Input: Seed set of instances, set of relevant keywords. Specification of whether relations are symmetric (merger) or anti-symmetric (acquisition).

Step 1. Widen keyword set using WordNet synonyms (e.g. "purchase"). Download pages containing keywords. Extract sentences containing keywords.
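One way to do the keyword widening, sketched with NLTK's WordNet interface (the paper does not spell out the exact expansion procedure):

from nltk.corpus import wordnet as wn

def expand_keywords(keywords):
    # Add the WordNet synonyms (lemma names of each synset) of every keyword.
    expanded = set(keywords)
    for kw in keywords:
        for synset in wn.synsets(kw):
            expanded.update(lemma.name().replace("_", " ") for lemma in synset.lemmas())
    return expanded

keywords = expand_keywords({"acquisition", "merger"})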

Step 2. Find sentences matching instances. Construct positive instances of good patterns by variabilizing the arguments. E.g. given the seed instance Acquisition(Oracle,PeopleSoft) and the text

"The Antitrust division of the US Department of Justice evaluated the likely competitive effects of Oracle's proposed acquisition of PeopleSoft"

extract the pattern

"The Antitrust division of the US Department of Justice evaluated the likely competitive effects of X's proposed acquisition of Y"

Construct negative instances of bad patterns by variabilizing other entities in the sentence.

E.g. "The X of the Y evaluated ..."

Optionally use named-entity recognizer to do this only to entities of the correct types (company names)
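A minimal sketch of the instance construction in step 2, using the example above; a real system would locate the entity strings more carefully (e.g. with the named-entity recognizer mentioned just above):

def variabilize(sentence, arg1, arg2):
    # Replace the two argument strings with the slots X and Y.
    return sentence.replace(arg1, "X").replace(arg2, "Y")

SENTENCE = ("The Antitrust division of the US Department of Justice evaluated "
            "the likely competitive effects of Oracle's proposed acquisition "
            "of PeopleSoft")

# Positive instance, from the seed Acquisition(Oracle, PeopleSoft):
positive = variabilize(SENTENCE, "Oracle", "PeopleSoft")
# "... evaluated the likely competitive effects of X's proposed acquisition of Y"

# Negative instance: variabilize some other pair of entities in the sentence.
negative = variabilize(SENTENCE, "Antitrust division", "US Department of Justice")
# "The X of the Y evaluated the likely competitive effects of Oracle's ..."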

Step 3. Generalize the patterns. For each pair of variabilized sentences, find the best pattern that matches both.

Pattern: a sequence of tokens (words), skips (*), limited skips (*?), and slots. A limited skip may not match terms of the same type as the slot.

E.g. Given the two sentences
"Toward this end, X in July acquired Y."
and
"Earlier this year X acquired Y from Raytheon"
generate the pattern
*? this *? X acquired Y *
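A sketch of step 3 under one simple reading: align the two variabilized token sequences with a longest-common-subsequence, keep the shared tokens, and turn the unmatched stretches into skips ("*", or the limited skip "*?" when the gap is short on both sides). The paper's actual skip/limited-skip rules are more involved, so the output can differ in detail from the example above.

def lcs(a, b):
    # Longest common subsequence of two token lists (standard DP).
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            dp[i][j] = dp[i + 1][j + 1] + 1 if a[i] == b[j] else max(dp[i + 1][j], dp[i][j + 1])
    out = []
    i = j = 0
    while i < m and j < n:
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return out

def generalize(s1, s2, short_gap=2):
    # Generalize two variabilized sentences into a pattern of tokens and skips.
    t1, t2 = s1.split(), s2.split()
    pattern = []
    i = j = 0
    for tok in lcs(t1, t2):
        g1 = g2 = 0
        while t1[i] != tok:
            i += 1
            g1 += 1
        while t2[j] != tok:
            j += 1
            g2 += 1
        if g1 or g2:
            pattern.append("*?" if max(g1, g2) <= short_gap else "*")
        pattern.append(tok)
        i += 1
        j += 1
    if i < len(t1) or j < len(t2):
        pattern.append("*")
    return " ".join(pattern)

print(generalize("Toward this end , X in July acquired Y .",
                 "Earlier this year X acquired Y from Raytheon"))
# -> "*? this *? X *? acquired Y *"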

Step 4. Post-processing and filtering.

Step 5: Remove sentences that do not match any of the patterns, replacing slots by skips.

Step 6: Extract instances by matching sentences to patterns. Problem: Finding entities in sentences. Tested with a couple of different kinds of parsers.

Results: See paper.

Autonomously Semantifying Wikipedia Fei Wu and Daniel Weld.

Task: Construct infoboxes for Wikipedia articles. Example: Abbeville County

Challenges:

Schema collection and refinement: Collect all infoboxes with the exact same template name. Collect the attributes used in at least 15% of these infoboxes.
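A minimal sketch of the attribute-refinement rule, assuming the infoboxes sharing a template have already been collected as attribute-to-value dictionaries:

from collections import Counter

def refine_schema(infoboxes, threshold=0.15):
    # Keep the attributes used in at least `threshold` of the collected infoboxes.
    counts = Counter(attr for box in infoboxes for attr in box)
    cutoff = threshold * len(infoboxes)
    return sorted(attr for attr, c in counts.items() if c >= cutoff)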

Sentence matching: Match infobox attribute to sentence in text:

Document classifier: Classify Wikipedia articles as belonging to the correct class. This is a standard document classification problem. Current technique: precision 98.5%, recall 68.8%. Future experiments with more sophisticated classifiers.

Sentence classifier: Maximum entropy model.

Extractor: Extract the value of the attribute from the sentence. This is a sequential data-labelling problem, handled with conditional random fields.

Features: "First token", "In first half", "In second half", "Start with capital", "Start with capital, end with period", "Single capital", "All caps, end with period." "Contains digit", "2 digit", "4 digit", "contains dollar sign", "contains underscore", "contains percentage sign", "stop word", "numeric", "number type (e.g. "1,234"), "part of speech", "token itself", "NP chunking tag" "character normalization (upper, lower, digit, other)" "part of anchor text", "beginning of anchor text", "previous tokens (window size 5)" "following tokens (window size 5)", "previous/next anchored token".

Multiple values: Multiple values for a single attribute can occur either in error or because the attribute is set-valued. Check whether in known instances the attribute is single-valued. If so, choose best value; if not, return all values.

Results for attribute extraction
              People            KYLIN
              Prec.    Rec.     Prec.    Rec.
County        97.6     65.9     97.3     95.9
Airline       92.3     86.7     87.2     63.7
Actor         94.2     70.1     88.0     68.2
University    97.2     90.5     73.9     60.5