Ellen Riloff. Automatically Generating Extraction Patterns from Untagged Text.
Proc. Thirteenth National Conference on Artificial Intelligence
(AAAI-96), 1996, pp. 1044-1049.
Riloff's work built on her earlier
AutoSlog system, which built extraction patterns using a supervised
learner and human review. Given a filled slot in the training
data, AutoSlog took the immediate syntactic environment of that slot in the
text -- the governing verb or noun -- and proposed it as an extraction
pattern (see paper Figure 1). (These patterns were called concept nodes.) Because this
process generated some overly general patterns, the patterns were reviewed by hand.
Riloff's new system, AutoSlog-TS, required only a
corpus in which the documents were classified for relevance (but not
otherwise annotated). The basic idea was that concept nodes which
appeared more often in relevant documents than in irrelevant documents
were likely extraction patterns. Suppose 50% of the documents in
the corpus were relevant. First, patterns (concept nodes) were
generated for all noun phrases in each document. Then for each
pattern she computed its relevance rate = (instances of pattern in
relevant documents) / (total instances of pattern). Patterns with
relevance rate < 50% were discarded; the remainder were ranked
by relevance_rate * log(frequency) (see paper Figure 3 for some
patterns). The patterns were then manually reviewed and manually
associated with particular template slots. Evaluation was over
selected slots; the problem of grouping slots into templates was not addressed.
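The AutoSlog-TS filtering and ranking step above can be sketched as follows. This is a minimal illustration of the scoring described in the paper, not its implementation; the function name and input format are made up for the example.

```python
import math
from collections import Counter

def rank_patterns(pattern_occurrences, min_rel_rate=0.5):
    """Rank candidate extraction patterns in the style of AutoSlog-TS.

    pattern_occurrences: one (pattern, in_relevant_doc) pair per instance
    of a pattern in the corpus, where in_relevant_doc is a bool.
    Patterns with relevance rate below min_rel_rate are discarded; the
    rest are ranked by relevance_rate * log(frequency).
    """
    total = Counter()
    relevant = Counter()
    for pattern, in_relevant in pattern_occurrences:
        total[pattern] += 1
        if in_relevant:
            relevant[pattern] += 1

    scored = []
    for pattern, freq in total.items():
        rel_rate = relevant[pattern] / freq
        if rel_rate < min_rel_rate:
            continue  # more common in irrelevant documents: discard
        scored.append((rel_rate * math.log(freq), pattern))
    return [pattern for score, pattern in sorted(scored, reverse=True)]
```

With `min_rel_rate=0.5` this matches the 50%-relevant corpus assumed in the text: a pattern kept by the filter is one that occurs in relevant documents more often than chance.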
Roman Yangarber; Ralph Grishman; Pasi Tapanainen; Silja Huttunen.
Automatic Acquisition of Domain Knowledge for Information Extraction.
Proc. COLING 2000.
M. Stevenson and M. Greenwood. A Semantic
Approach to IE Pattern Induction. Proc. ACL 2005.
Compares Yangarber's discovery procedure
with a procedure which expands the same set of seeds using
WordNet. Defines a "semantic similarity" metric over WordNet
first for individual words and then for subject-verb-object
patterns. At each iteration it adds to the seed set the patterns
most similar to (the centroid of) the seed patterns. Evaluates
both methods on the MUC-6 (executive succession) task, using the MUC-6
corpus and (for Yangarber's method) 6000 additional documents.
Shows a small advantage over Yangarber on the document filtering task,
and a considerably larger advantage on the sentence filtering task.
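The iterative expansion step can be sketched as below. The vectors here are placeholders standing in for whatever representation the similarity metric induces (Stevenson and Greenwood derive theirs from WordNet); the function name and vector inputs are illustrative, not from the paper.

```python
import numpy as np

def expand_seeds(seed_vecs, candidate_vecs, n_iters=3, per_iter=2):
    """Centroid-based seed expansion sketch.

    seed_vecs / candidate_vecs: dicts mapping pattern name -> vector.
    Each iteration promotes the candidates closest (by cosine
    similarity) to the centroid of the currently accepted patterns.
    """
    accepted = dict(seed_vecs)
    pool = dict(candidate_vecs)
    for _ in range(n_iters):
        if not pool:
            break
        centroid = np.mean(list(accepted.values()), axis=0)

        def cos(v):
            return float(v @ centroid /
                         (np.linalg.norm(v) * np.linalg.norm(centroid)))

        best = sorted(pool, key=lambda p: cos(pool[p]), reverse=True)[:per_iter]
        for p in best:
            accepted[p] = pool.pop(p)  # promote candidate to the seed set
    return set(accepted)
```

Because the centroid is recomputed each round, early additions steer later ones, which is why sense errors from ambiguous seed terms (noted below) can compound.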
The Yangarber approach, based on pattern
distribution across documents, runs into problems when unwanted events
co-occur frequently with seed events. The WordNet-based method
has shortcomings when a seed term has multiple senses because all
WordNet senses are expanded. Liao ["Filtered Ranking for
Bootstrapping in Event Extraction", COLING 2010] has found that these
problems can be reduced by combining
these expansion criteria; the distributional criteria tend to
favor the appropriate word sense.
Mihai Surdeanu, Jordi Turmo, and Alicia Ageno.
A Hybrid Approach for the Acquisition of Information Extraction Patterns.
Proceedings of the EACL 2006 Workshop on Adaptive Text Extraction and Mining (ATEM 2006), April 2006.
Uses a co-training strategy in which two classifiers seek to classify documents as relevant to a particular scenario. One is an extraction-pattern-based classifier similar to Yangarber's; the other is a bag-of-words classifier. In their experiments, the bag-of-words classifier converges quickly; the pattern-based classifier takes much longer. A number of pattern-ranking functions are tried, including the Riloff criterion used by Yangarber; a simpler function adapted from Collins and Singer proves to work better. In general, considerable gains are reported over Yangarber-style bootstrapping.
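The co-training loop can be sketched as follows. This is a generic toy version, not the paper's system: each document is a pair of feature sets (one per view, e.g. extraction patterns vs. bag of words), the classifier is a bare feature-count scorer, and all names are illustrative.

```python
from collections import Counter

def train_view(examples):
    """examples: list of (feature_set, label). Returns per-label counts."""
    counts = {0: Counter(), 1: Counter()}
    for feats, y in examples:
        counts[y].update(feats)
    return counts

def predict(counts, feats):
    """Return (label, confidence), confidence being the score margin."""
    scores = {y: sum(counts[y][f] for f in feats) for y in (0, 1)}
    label = max(scores, key=scores.get)
    return label, scores[label] - scores[1 - label]

def co_train(labeled, unlabeled, rounds=2, k=1):
    """Minimal co-training sketch: each round, each view's classifier
    labels its k most confident unlabeled documents and adds them to
    the shared training pool used by both views."""
    pool = list(labeled)
    u = list(unlabeled)
    for _ in range(rounds):
        for view in (0, 1):
            if not u:
                break
            counts = train_view([(doc[view], y) for doc, y in pool])
            scored = [(predict(counts, doc[view]), doc) for doc in u]
            scored.sort(key=lambda t: t[0][1], reverse=True)
            for (label, conf), doc in scored[:k]:
                pool.append((doc, label))
                u.remove(doc)
    return pool
```

The point of the two views is that a document confidently labeled by the fast-converging bag-of-words view becomes training data for the slower pattern-based view, and vice versa.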
Demo: On-Demand Information Extraction (Satoshi Sekine, COLING 2006)
Looking ahead: unsupervised methods
Yusuke Shinyama and Satoshi Sekine. Preemptive information
extraction using unrestricted relation discovery. HLT-NAACL 2006.
Percy Liang, Michael Jordan, and Dan Klein. Learning semantic
correspondences with less supervision. ACL-IJCNLP 2009.
Hoifung Poon and Pedro Domingos. Unsupervised semantic
parsing. EMNLP 2009.