Lecture 12: Information Extraction

General idea: To leverage the redundancy of the web against the difficulty of natural language interpretation. Somebody somewhere will have stated the fact you want in a form that your program can recognize.

NLP Tools


The more sophisticated forms of analysis obviously give richer information. The downside of using them is that they tend to be (a) computationally costly; (b) ambiguous; (c) fragile under low-quality text; (d) hard to integrate; (e) hard to probabilize; (f) unavailable for minority languages, and even for specialized subject matter.


KnowItAll (Etzioni et al.)

Web-scale Information Extraction in KnowItAll (preliminary results), Oren Etzioni et al., WWW 2004.

Task: To collect as many instances as possible of various categories (cities, states, countries, actors, and films).

General bootstrapping algorithm:

{ EXAMPLES := seed set of examples of the kind of thing you want to collect;
  repeat
    { EXAMPLEPAGES := retrieve pages containing the examples in EXAMPLES;
      PATTERNS := patterns of text surrounding the examples in EXAMPLES
                    in EXAMPLEPAGES;
      PATTERNPAGES := retrieve pages containing patterns in PATTERNS;
      EXAMPLES := EXAMPLES union examples extracted from PATTERNPAGES
                    matching PATTERNS }
  until (some stopping condition: e.g. enough iterations, enough examples,
         some measure of accuracy too low, etc.) }
Danger: Semantic drift. Once some irrelevant example has been introduced, it builds on itself; this is a positive feedback system.
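The bootstrapping loop above can be sketched in Python roughly as follows. Here search_pages, extract_contexts, and match_pattern are hypothetical stand-ins for the search-engine interface and the pattern matcher, not KnowItAll's actual code.

```python
def bootstrap(seeds, search_pages, extract_contexts, match_pattern,
              max_iterations=5, max_examples=10000):
    """Generic bootstrapping: alternate between harvesting patterns from
    pages containing known examples, and harvesting new examples from
    pages matching known patterns."""
    examples = set(seeds)
    patterns = set()
    for _ in range(max_iterations):
        # EXAMPLEPAGES: pages containing current examples
        for page in search_pages(examples):
            patterns |= extract_contexts(page, examples)
        # PATTERNPAGES: pages containing current patterns
        for page in search_pages(patterns):
            for pat in patterns:
                examples |= match_pattern(pat, page)
        # one possible stopping condition: enough examples collected
        if len(examples) >= max_examples:
            break
    return examples, patterns
```

Note that nothing in the loop removes a bad example once admitted, which is exactly why semantic drift compounds.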

Domain-independent extractor and assessor rules.

Extractor rule:

Predicate: Class1
Pattern: NP1 "such as" NPList2
Constraints: head(NP1) = plural(label(Class1));
             properNoun(head(each(NPList2)))
Bindings: Class1(head(each(NPList2)))
E.g. for the class "City" the pattern is "cities such as NPList2", and "cities such as" can be used as a search string. The pattern would match "cities such as Providence, Pawtucket, and Cranston" and would label each of Providence, Pawtucket, and Cranston as a city.
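As an illustration (not KnowItAll's actual implementation), the extractor rule for City can be approximated with a regular expression, using capitalized words as a crude stand-in for the proper-noun constraint:

```python
import re

# One "proper noun": one or more capitalized words in a row (a crude heuristic).
PROPER = r"[A-Z][a-z]+(?: [A-Z][a-z]+)*"

# "cities such as NP, NP, and NP" -- capture the whole NP list.
PATTERN = re.compile(
    r"cities such as (" + PROPER +
    r"(?:, " + PROPER + r")*(?:,? and " + PROPER + r")?)"
)

def extract_cities(text):
    cities = []
    for match in PATTERN.finditer(text):
        # Split the captured NP list on commas and "and".
        for np in re.split(r",\s*(?:and\s+)?|\s+and\s+", match.group(1)):
            if np:
                cities.append(np)
    return cities
```

Running it on the example in the text yields Providence, Pawtucket, and Cranston as instances.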

Subclass extractors: Look for instances of a subclass rather than the superclass. E.g. it is easier to find people described as "physicists", "biologists", "chemists" etc. rather than "scientists."

List extractor rules.

Assessor: Collection of high-precision, searchable patterns, e.g. "[INSTANCE] is a [CATEGORY]" ("Kalamazoo is a city.") There will not be very many of these on the Web, but if there are a few, that is sufficient evidence.


Learning synonyms for category names: E.g. learn "town" as a synonym for "city"; "nation" as a synonym for "country" etc.
Method: Run the extractor rules in the opposite direction. E.g. Look for patterns of the form "[CLASS](pl.) such as [INST1], [INST2] ..." where some of the instances are known to be cities.

Learning patterns

Some of the best patterns learned:
the cities of [CITY]
headquartered in [CITY]
for the city of [CITY]
in the movie [FILM]
[FILM] the movie starring
movie review of [FILM]
and physicist [SCIENTIST]
physicist [SCIENTIST]
[SCIENTIST], a British scientist

Learning Subclasses

Subclass patterns:
[SUPER] such as [SUB]
such [SUPER] as [SUB]
[SUB] and other [SUPER]
[SUPER] especially [SUB]
[SUB1] and [SUB2] (e.g. "physicists and chemists")

Learning list-pattern extractors

Looks for repeated substructures within an HTML subtree with many instances of the category in a particular place.
E.g. the pattern "<tr> <td> CITY" will detect CITY as the first element of a row in a table. (Allows wildcards in the arguments to an HTML tag.) Predict that the leaves of the subtree are all instances.
Hugely effective strategy; increases overall number retrieved by a factor of 7, and increases "extraction rate" (number retrieved per query) by a factor of 40.
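A rough sketch of the idea, using Python's standard HTML parser. Grouping text leaves by their tag path is a simplification of finding repeated substructure within a subtree, but it captures the "a few known instances in the same structural slot license all the leaves" prediction:

```python
from html.parser import HTMLParser
from collections import defaultdict

class ListExtractor(HTMLParser):
    """Collect the text leaves under each tag path; if several known
    instances appear under the same path, predict that all leaves
    there are instances."""

    def __init__(self, known):
        super().__init__()
        self.known = known
        self.path = []                    # stack of open tags
        self.leaves = defaultdict(list)   # tag path -> text leaves

    def handle_starttag(self, tag, attrs):
        self.path.append(tag)

    def handle_endtag(self, tag):
        if tag in self.path:
            # pop back to the matching open tag
            while self.path and self.path.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.leaves[tuple(self.path)].append(text)

    def predict(self, min_known=2):
        predicted = set()
        for path, leaves in self.leaves.items():
            if sum(leaf in self.known for leaf in leaves) >= min_known:
                predicted.update(leaves)
        return predicted
```

Given a table row whose cells include two known cities, every cell in that structural position is predicted to be a city, which is the source of the large gain in extraction rate.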

Results: Found 151,016 cities, of which 78,157 were correct: precision = 0.52. At precision = 0.8, found 33,000; at precision = 0.9, found 20,000.


Targeted Categories and Relations

Language-Independent Set Expansion of Named Entities Using the Web, Richard Wang and William Cohen, IEEE ICDM 2007.

SEAL (Set Expander for Any Language). Task: Given a few names from a category, find many names from that category. Do this in a way that is independent of both the content language and the markup system (e.g. don't assume HTML).

General method: Look for lists, tables, etc. containing the names. Extract other names from these.
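A greatly simplified sketch of the language- and markup-independent wrapper idea: from the seed occurrences in one document, take the longest common left and right character contexts, then extract everything they bracket. SEAL's actual system is considerably more elaborate (it works over many documents and ranks the candidates), so this is only an illustration.

```python
def common_prefix(strings):
    # Longest common prefix of a non-empty list of strings.
    p = strings[0]
    for s in strings[1:]:
        while not s.startswith(p):
            p = p[:-1]
    return p

def common_suffix(strings):
    # Longest common suffix of a non-empty list of strings.
    p = strings[0]
    for s in strings[1:]:
        while not s.endswith(p):
            p = p[1:]
    return p

def learn_wrapper(doc, seeds, context=20):
    """Learn a character-level wrapper: the longest character strings
    that bracket every seed occurrence in this document."""
    lefts, rights = [], []
    for seed in seeds:
        i = doc.find(seed)
        if i < 0:
            continue
        lefts.append(doc[max(0, i - context):i])
        rights.append(doc[i + len(seed):i + len(seed) + context])
    return common_suffix(lefts), common_prefix(rights)

def expand(doc, left, right):
    """Extract every string bracketed by the learned contexts."""
    found, pos = [], 0
    while True:
        i = doc.find(left, pos)
        if i < 0:
            break
        start = i + len(left)
        j = doc.find(right, start)
        if j < 0:
            break
        found.append(doc[start:j])
        pos = j
    return found
```

Because the wrapper is purely character-level, it works identically on HTML lists, wiki markup, or plain-text tables in any language.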

Success: With 36 categories such as "Classic Disney Movies", "constellations", "countries", "Major League Baseball teams", "Japanese emperor names", etc., with docs in English, Chinese, and Japanese, retrieves with about 94% precision. (All three languages for many categories; only one language for certain categories such as "Japanese emperors".)



Character-level analysis of semi-structured documents for set expansion, Richard Wang and William Cohen, EMNLP 2009.

Expands the above to binary relations, using wrappers with left, right and middle contexts.

Sample relations: Governor vs. US state; Mayor vs. city in Taiwan; US federal agency acronym vs. full name.


Coupled Semi-Supervised Learning for Information Extraction, Carlson et al., WSDM-10.

Input: Initial "ontology" with

Basic bootstrapping algorithm:
Loop forever
  Use patterns to extract new instances
  Use instances to extract new patterns
  Filter and rank
  Add new elements to set (promote)

Coupled Pattern Learner (CPL): Extracts patterns and instances similarly to KnowItAll.

Filter using mutual exclusion and type-checking: An instance is rejected for a category C unless the number of times it co-occurs with a promoted pattern is at least 3 times the number of times it appears with any pattern for a category inconsistent with C. Patterns are filtered comparably.

Instances are ranked by the number of patterns for C they occur with.
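One possible reading of this filter in code (taking "any pattern for an inconsistent category" as the total over those categories' promoted patterns; the paper's exact accounting may differ):

```python
def filter_and_rank(candidates, cooccur, patterns, exclusive):
    """candidates: {category: set of candidate instances}
    cooccur:    {(instance, pattern): co-occurrence count}
    patterns:   {category: set of promoted patterns}
    exclusive:  {category: set of categories inconsistent with it}
    Keep an instance for category c only if its count with c's promoted
    patterns is at least 3x its count with patterns of inconsistent
    categories; rank survivors by number of distinct patterns matched."""
    promoted = {}
    for cat, insts in candidates.items():
        kept = []
        for inst in insts:
            with_cat = sum(cooccur.get((inst, p), 0) for p in patterns[cat])
            with_bad = sum(cooccur.get((inst, p), 0)
                           for other in exclusive.get(cat, ())
                           for p in patterns[other])
            if with_cat > 0 and with_cat >= 3 * with_bad:
                n = sum(1 for p in patterns[cat] if cooccur.get((inst, p), 0))
                kept.append((n, inst))
        promoted[cat] = [i for _, i in sorted(kept, reverse=True)]
    return promoted
```

An instance like "Google" that co-occurs mostly with Company patterns is thereby blocked from being promoted as a City, even if it occasionally matches a City pattern.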

CSEAL : Coupled SEAL

MBL (Meta-Bootstrap Learner): CPL and CSEAL combined, merging the instances generated by each, and using the mutual-exclusion and type-checking constraints so that information gathered from one filters the other.


MBL promoted 207 instances of countries with an estimated precision of 93%. CSEAL promoted 130 instances, with an estimated precision of 97%. [Why it is not possible to compute the exact precision, I do not understand.] Without coupling, Country performs poorly, drifting into a more general Location category.

The categories for which the couple algorithms still have the most difficulty (e.g. ProductType, SportsEquipment, Traits, Vehicles) tend to be common nouns. ...

The coupled algorithms generally had high accuracies for relations but suffered from sparsity. SportUsesSportsEquipment performed poorly because the SportsEquipment category performed poorly, resulting in bad type checking. StateHasCapital and CompanyHeadquarteredInCity drifted to the more general relations StateContainsCity and CompanyHasOperationsInCity. ...

Our experiments included five relations for which no instances were promoted by any algorithm: CoachCoachesAthlete, AthletePlaysInStadium, CoachWonAwardTrophyTournament, SportPlaysGameInStadium, and AthleteIsTeammateOfAthlete.


Toward an Architecture for Never-Ending Language Learning, Andrew Carlson et al., AAAI-10

NELL system. Four subsystems:

Starting point: 123 categories each with 10-15 instances. 55 relations, each with 10-15 instances and 5 non-instances (obtained by permuting the arguments).

Results: After running for 67 days, NELL completed 66 iterations of execution. 242,453 beliefs promoted: 95% are instances of categories and 5% instances of relations. On a per-iteration basis, the promotion rate remained reasonably constant. Precision steadily declined, from 90% in iterations 1-22 to 71% in iterations 23-44 to 57% in iterations 45-66. Overall precision 74%.

Most predicates had precision over 90% but two did badly:

However, in looking at errors made by the system, it is clear that CPL and CMC are not perfectly uncorrelated in their errors. As an example, for the category BakedGood, CPL learns the pattern "X are enabled in" because of the believed instance "cookies". This leads CPL to extract "persistent cookies" as a candidate BakedGood. CMC outputs high probability for phrases that end in "cookies", and so "persistent cookies" is promoted as a believed instance of BakedGood.


Read The Web

Since January 2010, NELL has been learning continuously. As of 4/20/11, NELL has accumulated a knowledge base of 581,405 beliefs over 287 iterations (about one every two days). Note that "iterations per day" and "facts per iteration" are each about half what they were over the first 67 days.

The current "Recently Learned Facts" that I got off a first pass of the "Read the Web" home page are: (number is confidence)

athletics_at_the_2004_summer_olympics is an instance of the olympics 100.0
billy_holm is an Australian person 93.8
cell_phone_phone is a type of biological cell 99.6
larger_room is a kind of room 100.0
dartmouth_street is a street 100.0
rembrandt is a visual artist in the field of work 98.4
westinghouse is a company headquartered in the city pittsburgh 99.8
sonoma is a proxy for california 93.9
l_a__dodgers is a sports team that played in series 100.0
microsoft is a company that produces windows_server_2003 99.8
This strikes me as unimpressive. The beliefs about Billy Holm (according to Wikipedia, a catcher for the Chicago Cubs and the Red Sox in the 40's; native of Chicago) and about Westinghouse (actually headquartered in New York) are false.

The facts on the second pass were rather better.


Open Information Extraction

Open Information Extraction from the Web, Michele Banko et al., IJCAI, 2007.

Specifically: From 9,000,000 Web pages, extract 60.5 million tuples. Exclude

Leaves 11.3 million tuples. Of these, estimate (by sampling) that 9.3 million have a well-formed relation. Of these, 7.8 million have well-formed entities. Of these, 6.8 million are relations between "abstract" entities, of which 80% are correct, and 1 million are relations between concrete entities, of which 88% are correct. (Discussion of evaluation below.)


Step 1: Start with a high-powered syntactic parser, and rules to identify significant relations. Find pairs of "base noun phrases" E_i, E_j in the parse tree that are related by a relation R_i,j.

Step 2: Use the set of relations output by step 1 as input to a Naive Bayes classifier. The classifier computes the 4-ary predicate "E_i and E_j are related by R_i,j in sentence S". The features are things like the presence of part-of-speech tag sequences in the relation R_i,j, the number of tokens in R_i,j, the number of stopwords in R_i,j, whether or not an object E is found to be a proper noun, etc.

Step 3: Over the entire corpus, run a part of speech tagger and a "lightweight noun phrase chunker" plus a regularizer (e.g. standardize tense). Apply the Naive Bayes classifier to the sentence and extract all pertinent relations.

Step 4: Merge identical relations.

Major problem: Correctly finding the boundary of NPs. Titles of books and movies are particularly difficult.

Open Information Extraction from the Web, Oren Etzioni et al., CACM, 2008.

TextRunner Search

The Tradeoffs between Open and Traditional Relation Extraction, Michele Banko and Oren Etzioni, ACL 2008.

Considerably better results are obtained using graphical-model learning (Conditional Random Fields) than with Naive Bayes, because of the more systematic use of word order.

Theory Construction by Web Mining

Strategies for lifelong knowledge extraction from the web, Michele Banko and Oren Etzioni.

Task: ALICE creates a theory for a specified domain: Nutrition.

Buzzword: "Lifelong Knowledge Extraction"

Concept Discovery: Import classes and IS-A (subclass) relations from WordNet. Also, as in KnowItAll, find classes and IS-A relations by matching patterns in Web text, e.g. "fruit such as < y >", "buckwheat is an < x >". In this way, determine that buckwheat is a whole grain, gluten-free grain, fiber-rich food, and nutritious food, where these are newly created categories.

Generalization: Use KnowItAll to collect relations among individuals and small classes from Web text.
Generalize to larger classes.
E.g. KnowItAll collects "Oranges provide Vitamin C", "Bananas provide a source of B vitamins", "An avocado provides niacin". Using the known facts that oranges, bananas, and avocados are fruit and that Vitamin C, B vitamins, and niacin are vitamins, deduce PROVIDE(< FRUIT >, < VITAMIN >)).

(Of course, it's not clear how the quantifiers are supposed to work here. It is certainly not true that

forall(F,V) fruit(F) ^ vitamin(V) => provide(F,V).
What is probably closest to the truth is
forall(F) fruit(F) => exists(V) vitamin(V) ^ provide(F,V).
but there is no indication how ALICE would figure that out.)
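Setting the quantifier question aside, the generalization step itself can be sketched as lifting instance-level facts to class-level relations via known IS-A memberships. The support threshold of 3 here is an illustrative choice, not ALICE's actual criterion:

```python
from collections import defaultdict

def generalize(facts, isa, min_support=3):
    """facts: set of (relation, individual1, individual2) triples.
    isa: {individual: set of classes it belongs to}.
    Propose (relation, Class1, Class2) whenever at least min_support
    distinct instance-level facts support that class pair."""
    support = defaultdict(set)
    for rel, a, b in facts:
        for ca in isa.get(a, ()):
            for cb in isa.get(b, ()):
                support[(rel, ca, cb)].add((a, b))
    return {g for g, s in support.items() if len(s) >= min_support}
```

With the orange/banana/avocado facts from the text, the three instance-level PROVIDE facts jointly license PROVIDE(fruit, vitamin).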

Have to be careful to avoid over-generalization: e.g. "Provide(< FOOD >, < SUBSTANCE >)" or "Provide(< ENTITY >, < ENTITY >)".

Results: Constructed 696 new generalizations.
78% were meaningful, true, and relevant.
6% were off-topic, e.g. "Cause(< Organism >, < Disease >)".
9.5% were vacuous, e.g. "Provide(< Food >, < Substance >)".
3% were incomplete, e.g. "Provide(< Antioxidant >, < Body Part >)".
3.5% were false, e.g. "BeNot(< Fruit >, < Food >)".


Extracting Verb Relations

VERBOCEAN: Mining the Web for Fine-Grained Semantic Verb Relations, Timothy Chklovski and Patrick Pantel.

Relations to be found: (These examples were all actually extracted by the system.)

Semantic patterns: (I omit tense variations)

SEMANTIC RELATION     Surface Pattern
narrow similarity     X i.e. Y
broad similarity      X and Y
strength              X even Y
                      X and even Y
                      Y or at least X
                      not only X but Y
                      not just X but Y
enablement            Xed * by Ying the
                      Xed * by Ying or
antonymy              either X or Y
                      whether to X or Y
                      X * but Y
precedes              X * and [then/later/subsequently/eventually] Y

Accuracy: 68%.

Mining the Web for Synonyms: PMI-IR versus LSA on TOEFL, Peter Turney.

TOEFL synonym test problem:

Which of the following is a synonym for "adipose":

Measure the score that X (a choice) is a synonym for P (the problem word) as:

Score 1: Score1(X,P) = hits(P AND X) / hits(X)
TOEFL Accuracy = 62.5%. ESL Accuracy = 48%. Note: Average TOEFL Accuracy of a "Non-English US College Applicant" (presumably an applicant to a US college whose native language is not English) = 64.5%.

Score 2: Take proximity into account.
Score2(X,P) = hits(P NEAR X) / hits(X)
TOEFL Accuracy = 72.5%. ESL Accuracy = 62%

Score 3: Exclude antonyms. These tend to appear in the form "P but not X" or something similar.

Score3(x,p) = HITS((p NEAR x) AND NOT ((p OR x) NEAR "not"))
              / HITS(x AND NOT (x NEAR "not"))

where HITS() is the number of hits in AltaVista.
TOEFL Accuracy = 73.75%. ESL Accuracy = 66%
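The three scores can be sketched as follows, with hits a stand-in for the search engine's hit counter; the AltaVista-style query strings (AND, NEAR) are kept purely for illustration.

```python
def score1(hits, p, x):
    # co-occurrence score: fraction of x's pages that also mention p
    return hits(f"{p} AND {x}") / hits(x)

def score2(hits, p, x):
    # proximity-sensitive version using the NEAR operator
    return hits(f"{p} NEAR {x}") / hits(x)

def score3(hits, p, x):
    # additionally discount antonym-suggesting contexts ("p but not x")
    num = hits(f'({p} NEAR {x}) AND NOT (({p} OR {x}) NEAR "not")')
    den = hits(f'{x} AND NOT ({x} NEAR "not")')
    return num / den

def best_synonym(hits, problem, choices, score=score2):
    # answer the multiple-choice question: pick the highest-scoring choice
    return max(choices, key=lambda x: score(hits, problem, x))
```

The test-taking procedure is just best_synonym over the four TOEFL choices.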

Score 4 (ESL) The ESL test asks for synonyms for a word as used in a particular context. E.g.

Every year in the early spring, farmers tap maple syrup from their trees.
Identify the context c as the word in the sentence, other than the problem word, the choices, and stop words, that maximizes HITS(p AND c)/HITS(c) (= Score1(p,c)).

Then set

Score4(x) = HITS((p NEAR x) AND c AND NOT ((p OR x) NEAR "not"))
            / HITS(x AND NOT (x NEAR "not"))

Accuracy = 74%

Why bother, given the existence of WordNet?


Preemptive Information Extraction

Yusuke Shinyama, NYU Ph.D. thesis.
Preemptive information extraction using unrestricted relation discovery, Yusuke Shinyama and Satoshi Sekine.


Collected 1.1 million articles. Extracted 35,400 events, all significant; 176,985 entities, of which 52,468 significant; 4,400,000 local features (types), of which 500,000 significant; 49,000 mappings; 2,000 clusters of size greater than 4, containing in total about 6,000 events.

URES: an unsupervised web relation extraction system, Benjamin Rosenfeld and Ronen Feldman.

Task: Given a target relation, collect instances of the relation with high precision.

Input: Seed set of instances, set of relevant keywords. Specification of whether relations are symmetric (merger) or anti-symmetric (acquisition).

Step 1. Widen keyword set using WordNet synonyms (e.g. "purchase"). Download pages containing keywords. Extract sentences containing keywords.

Step 2. Find sentences matching instances. Construct positive instances of good patterns by variabilizing the arguments. E.g. given the seed instance Acquisition(Oracle,PeopleSoft) and the text

"The Antitrust division of the US Department of Justice evaluated the likely competitive effects of Oracle's proposed acquisition of PeopleSoft"

extract the pattern

"The Antitrust division of the US Department of Justice evaluated the likely competitive effects of X's proposed acquisition of Y"

Construct negative instances of bad patterns by misinstantiating other entities in the sentence.

E.g. Create sentences of the form "The X of the Y evaluated ..." where X and Y are random NPs.

Optionally use named-entity recognizer to do this only to entities of the correct types.

Step 3. Generalize the patterns. For each pair of variabilized sentences, find the best pattern that matches both.

Pattern: sequence of tokens (words), skips (*), limited skip (*?) and slots. Limited skip may not match terms of the same type as the slot.

E.g. Given the two sentences
"Toward this end, X in July acquired Y."
"Earlier this year X acquired Y from Raytheon"
generate the pattern
*? this *? X acquired Y *
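A simplified sketch of this pairwise generalization step, using longest-common-subsequence alignment from the standard library. It collapses every mismatched stretch into a single unlimited skip and ignores the limited-skip (*?) distinction, so it is an approximation of URES's pattern language, not a reimplementation:

```python
from difflib import SequenceMatcher

def generalize_pair(toks1, toks2):
    """Given two variabilized sentences as token lists (containing the
    literal slot tokens X and Y), keep the tokens they share and
    collapse every mismatched stretch into one skip '*'."""
    pattern, in_gap = [], False
    sm = SequenceMatcher(a=toks1, b=toks2, autojunk=False)
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag == "equal":
            pattern.extend(toks1[i1:i2])
            in_gap = False
        elif not in_gap:
            pattern.append("*")   # one skip per mismatched stretch
            in_gap = True
    return " ".join(pattern)
```

On the two example sentences it produces a skip-separated pattern retaining "this", the slots, and "acquired", close to the pattern shown in the text.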

Step 4. Post-processing and filtering.

Step 5: Remove sentences that do not match any of the patterns, replacing slots by skips.

Step 6: Extract instances by matching sentences to patterns. Problem: Finding entities in sentences. Tested with a couple of different kinds of parsers.

Results See paper.


Autonomously Semantifying Wikipedia, Fei Wu and Daniel Weld.

Task: Construct infoboxes for Wikipedia articles. Example: Abbeville County


Schema collection and refinement: Collect all infoboxes with the exact same template name. Collect the attributes used in at least 15% of these infoboxes.

Sentence matching: Match infobox attribute to sentence in text:

Document classifier: Classify Wikipedia articles as belonging to the correct class. Standard document classification problem. Current technique:

Precision: 98.5%. Recall 68.8%. Future experiments with more sophisticated classifiers.

Sentence classifier: Maximum entropy model.

Extractor: Extract value of attribute from sentence. This is a sequential data-labelling problem. Conditional random fields.

Features: "First token", "In first half", "In second half", "Start with capital", "Start with capital, end with period", "Single capital", "All caps, end with period", "Contains digit", "2 digit", "4 digit", "contains dollar sign", "contains underscore", "contains percentage sign", "stop word", "numeric", "number type" (e.g. "1,234"), "part of speech", "token itself", "NP chunking tag", "character normalization (upper, lower, digit, other)", "part of anchor text", "beginning of anchor text", "previous tokens (window size 5)", "following tokens (window size 5)", "previous/next anchored token".

Multiple values: Multiple values for a single attribute can occur either in error or because the attribute is set-valued. Check whether in known instances the attribute is single-valued. If so, choose best value; if not, return all values.
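A minimal sketch of this decision rule; the data layout (confidence-scored extractions, attribute value-lists from existing infoboxes) is assumed for illustration, not KYLIN's actual representation:

```python
def resolve_values(attribute, extractions, known_infoboxes):
    """extractions: [(value, confidence), ...] for one attribute.
    known_infoboxes: {attribute: [list of values from one existing
    infobox, ...]} -- used to decide whether the attribute is
    set-valued (any known infobox with more than one value)."""
    set_valued = any(len(vals) > 1
                     for vals in known_infoboxes.get(attribute, []))
    if set_valued:
        return [v for v, _ in extractions]            # keep them all
    if not extractions:
        return []
    return [max(extractions, key=lambda vc: vc[1])[0]]  # keep the best
```

So an attribute like "county seat" keeps only its highest-confidence extraction, while a genuinely set-valued attribute keeps every extracted value.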

Results for attribute extraction
             People          KYLIN
             Prec.   Rec.    Prec.   Rec.
County       97.6    65.9    97.3    95.9
Airline      92.3    86.7    87.2    63.7
Actor        94.2    70.1    88.0    68.2
University   97.2    90.5    73.9    60.5


Learning First-Order Horn Clauses from Web Text, Stefan Schoenmackers et al., EMNLP 2010.