G22.2591 - Advanced Natural Language Processing - Spring 2011

Lecture 9

Unsupervised Relation Extraction

Semi-supervised relation extraction begins with a specific relation, which the user specifies either through several examples or through one phrasing of the relation, and seeks to discover alternative phrasings of the same relation.

Unsupervised relation extraction has a more ambitious goal:  to identify the primary semantic relations of a domain and the way they are expressed.  In other words, to create an ontology of the domain automatically.  See, for example, Stanley Kok and Pedro Domingos. Extracting Semantic Networks from Text via Relational Clustering. Proc. Nineteenth European Conference on Machine Learning (ECML 2008), Antwerp, Belgium, pp. 624-639.

For everyday texts (such as general news), it may be difficult to identify a closed set of word classes and relations.  The task is more feasible within limited scientific and technical domains, where word classes are more sharply defined and relations between word classes are more constrained.  The language of such a domain is termed a sublanguage (Harris, Mathematical Structures of Language, 1968).  Being able to discover these classes and relations automatically would amount to the automatic creation of domain ontologies.

Takaaki Hasegawa, Satoshi Sekine, and Ralph Grishman. Discovering Relations among Named Entities from Large Corpora. ACL 2004.
Evaluation of unsupervised methods is difficult (it is hard to do a manual analysis).
Alexander Yates and Oren Etzioni. Unsupervised Resolution of Objects and Relations on the Web. Proc. HLT-NAACL 2007.

Scenario Template / Event Extraction

The scenario template task was originally the information extraction task for the MUC evaluations. It involved identifying the participants, locations, dates, etc. of a class of events -- a naval engagement, a terrorist incident, a joint venture.  Typically, a single template included related information, such as an attack and its effects;  this led to some relatively complex templates. With the later MUCs (6 and 7), the task narrowed to single events or closely related events -- executive succession, rocket launchings. For the ACE evaluations, this became the event extraction task. Some examples of MUC and ACE templates are shown here.

In principle, event extraction and relation extraction both aim to extract a predicate and its arguments from text.  But event extraction is more complicated in several respects.  Relations are binary and, for ACE relations in particular, both arguments must appear together in a sentence.  This makes model building and model training easier for relations.  Events may have multiple arguments and modifiers (place, time), some of which are omitted in any given instance of an event.  This makes bootstrapping and distant learning more difficult;  it also makes event coreference more difficult than relation coreference.
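
To make the structural difference concrete, here is a minimal sketch in Python; the class and field names are invented for illustration and are not taken from the ACE specification.

```python
# Sketch only: invented class and field names, not ACE's actual schema.
from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class RelationMention:
    rel_type: str    # e.g. "EMP-ORG"
    arg1: str        # exactly two arguments, which (for ACE relations)
    arg2: str        # must appear together in one sentence

@dataclass
class EventMention:
    event_type: str  # e.g. "Attack"
    trigger: str     # the word anchoring the event
    # a variable number of role-labeled arguments; any given mention
    # may leave some roles unfilled
    args: Dict[str, str] = field(default_factory=dict)
    place: Optional[str] = None
    time: Optional[str] = None
```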

Supervised models

For the early MUCs, these systems were created by constructing rules manually ... either patterns which matched subtrees of a parse tree or regular expressions which matched sequences of tokens or chunks.  Subsequently, there was some effort at building supervised models.  For ACE, most sites built supervised models (about 300,000 words of training data were annotated by LDC).   Because the annotation consistency was not very good, and system performance was poorer still, ACE event extraction was evaluated only once, in 2005.

Because an event involves multiple (and varying numbers of) arguments, it is difficult to construct a single classifier for event extraction -- instead, the extraction is done in stages.  For ACE, the annotation of each event included a 'trigger' -- the word most specifically indicating the event (typically a verb or nominalization).  Extraction begins with a trigger classifier, which detects an event and identifies its type;  this will typically use as features the trigger word itself and its immediate context.  Then an argument classifier is applied to each entity mention in the sentence, to determine whether it is an argument and, if so, its role.  (David Ahn, The Stages of Event Extraction.  ACL 2006 Workshop on Annotating and Reasoning about Time and Events.)
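
A minimal sketch of this staged pipeline, assuming entity mentions have already been identified; the trigger lexicon and the positional role heuristic below are toy stand-ins for the trained classifiers a real system would use.

```python
# Toy stand-ins: a lexicon replaces the trigger classifier, and a
# positional rule replaces the argument classifier.
TRIGGER_LEXICON = {"attacked": "Conflict.Attack"}

def classify_trigger(token):
    """Stage 1: decide whether this token triggers an event and, if so,
    of what type (a real system uses the trigger word and its
    immediate context as classifier features)."""
    return TRIGGER_LEXICON.get(token.lower())

def classify_argument(mention_index, trigger_index):
    """Stage 2: assign a role to one entity mention.  Toy rule:
    mentions before the trigger are Attackers, after it Targets."""
    return "Attacker" if mention_index < trigger_index else "Target"

def extract_events(tokens, mentions):
    """mentions: list of (head token index, mention text) pairs."""
    events = []
    for i, tok in enumerate(tokens):
        etype = classify_trigger(tok)
        if etype is None:
            continue
        args = [(classify_argument(j, i), text) for j, text in mentions]
        events.append({"type": etype, "trigger": tok, "args": args})
    return events

tokens = "Rebels attacked the convoy".split()
print(extract_events(tokens, [(0, "Rebels"), (3, "the convoy")]))
# [{'type': 'Conflict.Attack', 'trigger': 'attacked',
#   'args': [('Attacker', 'Rebels'), ('Target', 'the convoy')]}]
```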

Semi-supervised models

Early attempts were made to semi-automate the pattern creation process: starting with a large annotated corpus, taking the immediate syntactic context of each slot filler and converting it to an extraction pattern. The resulting set of patterns was then reviewed by hand. This, however, still depended on a large amount of hand annotation.
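
Schematically, pattern generation from an annotated filler might look like the sketch below; a token window stands in for the syntactic (parse-tree) context a real system would use, and the sentence and spans are invented.

```python
# Sketch: replace an annotated slot filler with a slot symbol and keep
# its immediate context as a candidate extraction pattern.  A token
# window stands in for real syntactic context.
def make_pattern(tokens, filler_start, filler_end, window=2):
    left = tokens[max(0, filler_start - window):filler_start]
    right = tokens[filler_end:filler_end + window]
    return " ".join(left + ["<SLOT>"] + right)

tokens = "a bomb exploded outside the US embassy in Lima".split()
# annotated filler (the attack target): "the US embassy", tokens 4-6
print(make_pattern(tokens, 4, 7))   # exploded outside <SLOT> in Lima
```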

Ellen Riloff. Automatically Generating Extraction Patterns from Untagged Text. Proc. Thirteenth National Conference on Artificial Intelligence (AAAI-96), 1996, pp. 1044-1049.

How to select relevant patterns without tagging the whole corpus? Riloff observed that if the corpus was classified into relevant and irrelevant documents, patterns which occurred substantially more often in relevant documents were in general relevant patterns. This greatly reduced but did not eliminate the corpus annotation required.
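
A small sketch of this kind of scoring, along the lines of Riloff's relevance-rate-times-log-frequency ("RlogF") ranking; the pattern counts below are invented.

```python
# Score = (relevant count / total count) * log2(relevant count):
# patterns concentrated in relevant documents score high, patterns
# spread evenly across the corpus score low.
from math import log2

def rlogf(rel_count, total_count):
    if rel_count == 0:
        return 0.0
    return (rel_count / total_count) * log2(rel_count)

# invented counts: (occurrences in relevant docs, occurrences overall)
patterns = {"<SLOT> was kidnapped": (40, 44),    # domain-specific
            "<SLOT> said": (120, 500)}           # generic
for pat, (r, t) in patterns.items():
    print(f"{rlogf(r, t):5.2f}  {pat}")
# 4.84  <SLOT> was kidnapped
# 1.66  <SLOT> said
```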

Roman Yangarber, Ralph Grishman, Pasi Tapanainen, and Silja Huttunen. Automatic Acquisition of Domain Knowledge for Information Extraction. Proc. COLING 2000.

Yangarber entirely eliminated the corpus annotation through a bootstrapping scheme. Starting from a set of seed patterns, he retrieved some (relevant) documents containing these seeds, then used Riloff's metric to select some additional patterns, which were used to retrieve more documents, etc.
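
Schematically, the bootstrapping loop might look like the self-contained sketch below; for illustration, documents are reduced to token sets and patterns to single tokens, with a simplified relevance score in place of the Riloff metric.

```python
# Sketch of the bootstrap: seeds -> relevant docs -> new patterns -> ...
from math import log2

def bootstrap(seeds, corpus, rounds=3, per_round=1):
    accepted = set(seeds)
    for _ in range(rounds):
        # documents matching any accepted pattern count as relevant
        relevant = [d for d in corpus if accepted & d]
        # score each unseen pattern by its bias toward relevant docs
        scores = {}
        for pat in sorted(set().union(*corpus) - accepted):
            rel = sum(pat in d for d in relevant)
            if rel:
                tot = sum(pat in d for d in corpus)
                scores[pat] = (rel / tot) * log2(rel + 1)
        if not scores:
            break   # nothing new to learn
        # accept the top-scoring patterns and iterate
        best = sorted(scores, key=scores.get, reverse=True)[:per_round]
        accepted.update(best)
    return accepted

corpus = [{"kidnapped", "hostages"}, {"kidnapped", "released"},
          {"released", "hostages"}, {"elections", "votes"}]
print(sorted(bootstrap({"kidnapped"}, corpus)))
# ['hostages', 'kidnapped', 'released']
```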

Looking ahead

Roman Yangarber. Counter-Training in Discovery of Semantic Patterns. ACL 2003.

M. Stevenson and M. Greenwood. A Semantic Approach to IE Pattern Induction. ACL 2005.

Mihai Surdeanu, Jordi Turmo, and Alicia Ageno. A Hybrid Approach for the Acquisition of Information Extraction Patterns. Proc. EACL 2006 Workshop on Adaptive Text Extraction and Mining (ATEM 2006), April 2006.