G22.2591 - Advanced Natural Language Processing - Spring 2011

Lecture 10

Scenario Template / Event Extraction (cont'd)

Ellen Riloff.  Automatically Generating Extraction Patterns from Untagged Text.  Proc. Thirteenth National Conference on Artificial Intelligence (AAAI-96), 1996, pp. 1044-1049.

Riloff's work built on the earlier AutoSlog system, which created extraction patterns using a supervised learner and human review.  Given a filled slot in the training data, AutoSlog took the immediate syntactic environment of that slot in the text -- the governing verb or noun -- and proposed it as an extraction pattern (see paper Figure 1).  (These patterns were called concept nodes.)  Because this process generated some overly general patterns, the patterns were reviewed by hand.

Riloff's system, AutoSlog-TS, required a corpus in which the documents were classified for relevance (but not otherwise annotated).  The basic idea was that concept nodes which appeared more often in relevant documents than in irrelevant documents were likely extraction patterns.  Suppose 50% of the documents in the corpus were relevant.  First, patterns (concept nodes) were generated for all noun phrases in each document.  Then, for each pattern, she computed its relevance rate = (instances of pattern in relevant documents) / (total instances of pattern).  Patterns with relevance rate < 50% were discarded;  the remainder were ranked by relevance_rate * log(frequency)  (see paper Figure 3 for some patterns).  The patterns were then manually reviewed and manually associated with particular template slots.  Evaluation was over selected slots;  the problem of grouping slots into templates was not addressed.
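The filtering and ranking step can be sketched in a few lines of Python.  This is only an illustration of the formula as stated in these notes, not Riloff's implementation: the function name, the toy pattern strings, the choice of log base 2, and the reading of "frequency" as the total instance count are all assumptions.

```python
import math
from collections import Counter


def rank_patterns(docs, min_rate=0.5):
    """Rank patterns by relevance_rate * log2(frequency).

    docs: iterable of (patterns, is_relevant) pairs, where patterns is a
    list of pattern strings found in one document (repeats allowed).
    """
    total = Counter()     # instances of each pattern in all documents
    relevant = Counter()  # instances of each pattern in relevant documents
    for patterns, is_relevant in docs:
        total.update(patterns)
        if is_relevant:
            relevant.update(patterns)

    ranked = []
    for pattern, freq in total.items():
        rate = relevant[pattern] / freq
        if rate < min_rate:  # discard patterns below the threshold
            continue
        # Assumption: "frequency" is the total instance count, log base 2.
        ranked.append((pattern, rate, rate * math.log2(freq)))
    # Highest score first; ties broken alphabetically for determinism.
    ranked.sort(key=lambda t: (-t[2], t[0]))
    return ranked


# Toy corpus: made-up patterns, with a relevance flag per document.
docs = [
    (["<subj> exploded", "murder of <np>"], True),
    (["<subj> exploded", "<subj> was born"], False),
    (["murder of <np>"], True),
    (["<subj> exploded"], True),
]
ranked = rank_patterns(docs)
for pattern, rate, score in ranked:
    print(f"{score:.3f}  {rate:.2f}  {pattern}")
```

Note that under this formula a pattern seen only once gets a score of 0 (log 1 = 0), so the log term favors patterns that are both frequent and concentrated in relevant documents.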

Roman Yangarber; Ralph Grishman; Pasi Tapanainen; Silja Huttunen.  Automatic Acquisition of Domain Knowledge for Information Extraction.  Proc. COLING 2000.

Yangarber sought to extend Riloff's procedure into a bootstrapping procedure which required no corpus annotation, only a seed set of patterns.  The set of seed patterns was used to select documents from a large corpus.  Then an extension of Riloff's metric was used to select the top patterns and add them to the seed set, and the process was repeated.  Pattern relevance was graded based on the distribution of the pattern, and in turn the relevance of documents was graded based on the relevance of the patterns they contained.
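The loop can be sketched as follows.  This is a deliberately simplified sketch, not the system from the paper: it treats document relevance as binary (a document is relevant if it contains any accepted pattern) rather than graded, it counts documents rather than pattern instances, and the function name, parameters, and toy patterns are all hypothetical.

```python
import math


def bootstrap(docs, seeds, per_round=1, max_rounds=10, min_rate=0.5):
    """docs: list of sets of pattern strings; seeds: iterable of patterns."""
    accepted = set(seeds)
    all_patterns = set().union(*docs)
    for _ in range(max_rounds):
        # Binary relevance (a simplification): any accepted pattern matches.
        relevant = [d for d in docs if d & accepted]
        scored = []
        for p in all_patterns - accepted:
            total = sum(1 for d in docs if p in d)
            rel = sum(1 for d in relevant if p in d)
            rate = rel / total
            score = rate * math.log2(rel) if rel > 0 else 0.0
            if rate >= min_rate and score > 0:
                scored.append((score, p))
        if not scored:
            break  # no candidate passes the threshold
        scored.sort(key=lambda t: (-t[0], t[1]))
        accepted.update(p for _, p in scored[:per_round])
    return accepted


# Toy corpus: each document is the set of patterns it contains.
docs = [
    {"company appoint person", "person resign"},
    {"company appoint person", "person resign", "company report earnings"},
    {"person resign", "person succeed person"},
    {"person succeed person", "person resign"},
    {"company report earnings"},
    {"company report earnings", "stock rise"},
]
learned = bootstrap(docs, {"company appoint person"})
print(sorted(learned))
```

In this toy corpus the loop stops on its own because no remaining pattern passes the threshold; on a real corpus the stopping point is much less clear-cut.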

An automatic evaluation was conducted based on the document-filtering task in MUC-6.  (Stevenson and Greenwood later pointed out that this was not a very sensitive metric because more than 50% of the documents in the test set were relevant.)  The patterns were also manually associated with slots and evaluated on the slot-filling task.

slides (primarily from 2000 talk)

Roman Yangarber. (2003) Counter-Training in Discovery of Semantic Patterns.
ACL 2003.

Addresses the problem which arose with Yangarber et al. (2000) of figuring out when to stop bootstrapping ... without further control, that procedure would continue adding patterns until all patterns in the corpus are included.  This is a common problem of semi-supervised (bootstrapping) methods.  Yangarber addresses it here by creating seeds for a number of different scenarios and training classifiers for these scenarios concurrently.  Relevance of an article for one scenario is treated as negative evidence of its relevance for other scenarios.  Eventually the patterns are partitioned among the scenarios and the bootstrapping halts.
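A toy sketch of the competition idea is given below.  It is not Yangarber's algorithm: the scoring rule (each document votes +1 for a scenario if that scenario's patterns outnumber the competitors' patterns in it, and -1 if they are outnumbered) is an assumption made for illustration, as are the function name, scenario names, and patterns.

```python
def counter_train(docs, seeds, max_rounds=10):
    """docs: list of sets of patterns; seeds: dict scenario -> seed patterns."""
    learned = {s: set(p) for s, p in seeds.items()}
    all_patterns = set().union(*docs)
    for _ in range(max_rounds):
        taken = set().union(*learned.values())
        added = False
        for s in sorted(learned):
            best, best_score = None, 0
            for p in sorted(all_patterns - taken):
                score = 0
                for d in docs:
                    if p not in d:
                        continue
                    own = len(d & learned[s])
                    other = sum(len(d & learned[t]) for t in learned if t != s)
                    # +1 if this scenario dominates the document, -1 if a
                    # competitor does: evidence for one scenario counts
                    # against the others.
                    score += (own > other) - (own < other)
                if score > best_score:
                    best, best_score = p, score
            if best is not None:
                learned[s].add(best)
                taken.add(best)
                added = True
        if not added:
            break  # no scenario can claim another pattern: halt
    return learned


docs = [
    {"appoint", "resign", "say"},
    {"appoint", "resign"},
    {"acquire", "merge", "say"},
    {"acquire", "merge"},
    {"resign", "merge"},
]
seeds = {"succession": {"appoint"}, "mergers": {"acquire"}}
result = counter_train(docs, seeds)
print(result)
```

Here the generic pattern "say" appears equally with both scenarios, so its positive and negative evidence cancel, neither scenario claims it, and the loop halts once the remaining patterns have been partitioned.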

(One shortcoming of this work was the need to manually identify the competing scenarios.  We discussed earlier in the course very recent work by Tara McIntosh [EMNLP 2010] and Ang Sun [COLING 2010] to create competing sets automatically, although not for events.)

M. Stevenson and M. Greenwood. A Semantic Approach to IE Pattern Induction.
ACL 2005.

Compares Yangarber's discovery procedure with a procedure which expands the same set of seeds using WordNet.  Defines a "semantic similarity" metric over WordNet first for individual words and then for subject-verb-object patterns.  At each iteration it adds to the seed set the patterns most similar to (the centroid of) the seed patterns.  Evaluates both methods on the MUC-6 (executive succession) task, using the MUC-6 corpus and (for Yangarber's method) 6000 additional documents.  Shows a small advantage over Yangarber on the document filtering task, and a considerably larger advantage on the sentence filtering task.
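The selection step can be illustrated with a small sketch.  The similarity function here is an assumption made for illustration and is not the metric defined in the paper: word similarities come from a hand-written table standing in for WordNet-derived scores, pattern similarity is the average of slot-by-slot word similarities, and similarity to the seed set is the mean similarity to each seed.  All names and values are hypothetical.

```python
# Hypothetical word-to-word similarities standing in for WordNet-based scores.
WORD_SIM = {
    frozenset(("appoint", "name")): 0.8,
    frozenset(("resign", "quit")): 0.9,
}


def word_sim(a, b):
    """Similarity of two words: 1.0 if identical, else table lookup (default 0)."""
    if a == b:
        return 1.0
    return WORD_SIM.get(frozenset((a, b)), 0.0)


def pattern_sim(p, q):
    """Average slot-wise similarity of two (subject, verb, object) patterns."""
    return sum(word_sim(a, b) for a, b in zip(p, q)) / len(p)


def rank_candidates(seeds, candidates):
    """Rank candidates by mean similarity to the seed patterns."""
    scored = [
        (sum(pattern_sim(c, s) for s in seeds) / len(seeds), c)
        for c in candidates
    ]
    scored.sort(key=lambda t: (-t[0], t[1]))
    return scored


seeds = [("company", "appoint", "person"), ("person", "resign", "post")]
candidates = [
    ("company", "name", "person"),
    ("company", "report", "earnings"),
    ("person", "quit", "post"),
]
ranking = rank_candidates(seeds, candidates)
for score, pattern in ranking:
    print(f"{score:.3f}  {pattern}")
```

At each iteration the top-ranked candidates would be added to the seed set and the ranking recomputed.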

The Yangarber approach, based on pattern distribution across documents, runs into problems when unwanted events co-occur frequently with seed events.  The WordNet-based method has shortcomings when a seed term has multiple senses because all WordNet senses are expanded.  Liao ["Filtered Ranking for Bootstrapping in Event Extraction", COLING 2010] has found that these problems can be reduced by combining these expansion criteria;  the distributional criteria tend to favor the appropriate word sense.

Mihai Surdeanu, Jordi Turmo, and Alicia Ageno.
A Hybrid Approach for the Acquisition of Information Extraction Patterns.
Proceedings of the EACL 2006 Workshop on Adaptive Text Extraction and Mining (ATEM 2006), April 2006.

Uses a co-training strategy in which two classifiers seek to classify documents as relevant to a particular scenario. One is an extraction-pattern-based classifier similar to Yangarber's; the other is a bag-of-words classifier. In their experiments, the bag-of-words classifier converges quickly; the pattern-based classifier takes much longer. A number of pattern-ranking functions are tried, including the Riloff criterion used by Yangarber. A simpler function adapted from Collins and Singer proves to work better. In general, considerable gains are reported over Yangarber-style bootstrapping.

Demo: On-Demand Information Extraction (Satoshi Sekine, COLING 2006)

Looking ahead:  unsupervised methods

Yusuke Shinyama and Satoshi Sekine.  Preemptive information extraction using unrestricted relation discovery.  HLT-NAACL 2006.

Percy Liang, Michael Jordan, and Dan Klein.  Learning semantic correspondences with less supervision.  ACL-IJCNLP 2009.

Hoifung Poon and Pedro Domingos.  Unsupervised semantic parsing.  EMNLP 2009.