G22.2591 - Advanced Natural Language Processing - Spring 2009

Lecture 12

Discuss KBP progress.

Event extraction: final comments

Summarize Yangarber 2003, Stevenson and Greenberg, Surdeanu et al. (Lecture 11 notes).

Scenario / event extraction really represents a range of tasks

Different sources of trigger similarity. There has been no real study yet of whether these sources are complementary, or of how different sources may be appropriate for different tasks.


Coreference is the task of determining whether two phrases refer to the same 'entity'.  We will consider it synonymous with anaphora resolution -- when a linguistic unit (the anaphor) refers back to a prior linguistic unit (the antecedent) in the discourse.  [However, it is possible to distinguish these phenomena ... if there are two mentions of "the FBI" in a discourse, they are definitely coreferential, but one may not consider the second mention an anaphoric reference to the first.]

Coreference is critical to most NLP tasks. Certainly it will be essential to doing well on KBP. If we find a sentence "He was born in Greenwich Village on January 9, 1914." we won't be able to do anything without resolving "he".

Defining coreference as a formal task (so that it can be evaluated) is not trivial ... there are many possible variations on the basic task. One such task definition was provided by the MUC-6 coreference task:

ACE implicitly defines a similar coreference task because entity mentions are grouped into entities. One basic difference, however, is that only entities of the ACE types are marked. This may have subtle effects on the evaluation.

There is generally good agreement between people about coreference relations, although decisions can be tricky in some cases of vague referents ("The New York police apprehended twenty criminals last week.  A police raid made the newspapers.").

Anaphora can be divided into three types:  pronouns, nominals (phrases headed by common nouns), and names.  Linguistic studies have mostly looked at pronouns;  some computational treatments have also been limited to pronouns.  Generally, system performance  is very good for names, moderate for pronouns, and poorest for nominals.  The main factors considered in resolving each type of potential anaphor are different:

In principle, we would like to use much richer criteria of semantic compatibility and coherence in selecting antecedents, but (except for very restricted domains) this is currently beyond the state of the art.

Within- and cross-document coreference

So far we have considered coreference within a single document or discourse, but the problem also arises across documents. When applied across documents, it is generally limited to entities with names. Even for entities with names, the task is not trivial. On the one hand, names transliterated from non-Roman alphabets may be spelled differently in different documents (e.g., "Muammar al-Gaddafi"). On the other hand, a common name may refer to many different people. So cross-document resolution (part of KBP) requires contextual information.


Unlike all the tasks we considered until now (part-of-speech tagging, chunking, name tagging, word sense disambiguation), coreference cannot be readily reduced to a tagging task, because it involves a relation between mentions (in effect, a gathering of mentions into co-referential clusters or equivalence classes).  This has led to complications in devising an appropriate scoring metric.
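To make the "equivalence classes" view concrete, here is a minimal sketch of gathering pairwise coreference links into clusters with a union-find structure. The function name clusters_from_links and the example mentions are illustrative assumptions, not part of any system discussed in these notes.

```python
def clusters_from_links(mentions, links):
    """Group mentions into equivalence classes given pairwise coreference links."""
    parent = {m: m for m in mentions}

    def find(m):
        # Follow parent pointers to the class representative, halving the path.
        while parent[m] != m:
            parent[m] = parent[parent[m]]
            m = parent[m]
        return m

    for a, b in links:
        # Merge the two classes that a and b currently belong to.
        parent[find(a)] = find(b)

    groups = {}
    for m in mentions:
        groups.setdefault(find(m), []).append(m)
    return list(groups.values())


# Hypothetical mentions and links, for illustration only.
mentions = ["Fred Smith", "he", "the FBI", "the agency", "his"]
links = [("Fred Smith", "he"), ("he", "his"), ("the FBI", "the agency")]
clusters = clusters_from_links(mentions, links)
print(clusters)
```

Note that the output is a partition of the mentions rather than a label per token, which is why per-token tagging accuracy does not carry over as a scoring metric.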

It's not sufficient to ask whether a pronoun refers to the correct antecedent, because there may be several possible correct antecedents.  In addition, we need to be concerned with how severely we count particular errors:  if there are 20 coreferential mentions (referring noun phrases), and a system reports the answer as two sets of 10, have we gotten 10 mentions wrong, or just 1 of 19 links?  This can have a large effect on scores.

MUC-6/7 adopted a scoring metric which essentially counts links, and asks how many coreference links would have to be added to get the correct cluster.  Consider one coreferential cluster consisting of mentions m1, ..., mN.  Suppose in the system response these mentions are divided among K clusters.  We then define recall as (N-K)/(N-1).  Total recall is obtained by summing these quantities (numerator and denominator) over all clusters in the key.  Precision is computed by reversing key and response.
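The link-counting computation can be sketched as follows, applied to the earlier example of 20 coreferential mentions reported as two sets of 10. This is a minimal sketch: the function names are assumptions, clusters are plain lists of mention identifiers, and a key mention that is absent from the response is assumed to count as its own response cluster.

```python
def muc_recall(key, response):
    """Link-based recall: sum of (N - K) over sum of (N - 1) across key clusters."""
    # Map each mention to the index of the response cluster that contains it.
    cluster_of = {m: i for i, cluster in enumerate(response) for m in cluster}
    numerator = denominator = 0
    for cluster in key:
        n = len(cluster)
        # K = number of response clusters the key cluster is split across;
        # a mention missing from the response forms a part of its own (assumption).
        parts = {cluster_of.get(m, ("unresolved", m)) for m in cluster}
        numerator += n - len(parts)
        denominator += n - 1
    return numerator / denominator if denominator else 0.0


def muc_precision(key, response):
    """Precision is recall with the roles of key and response reversed."""
    return muc_recall(response, key)


key = [list(range(20))]                             # one cluster of 20 mentions
response = [list(range(10)), list(range(10, 20))]   # reported as two sets of 10
recall = muc_recall(key, response)                  # (20 - 2) / (20 - 1) = 18/19
precision = muc_precision(key, response)            # every response link is correct
print(recall, precision)
```

Under this metric the split costs only one missing link out of 19, which is the lenient end of the range of choices raised above.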

This metric is now generally used for reporting coreference scores, but it has some shortcomings.  In particular, if there are lots of singletons, you don't get any credit for getting them right.  Baldwin et al. discuss Vilain's metric (the MUC link-counting metric described above) and their alternative B-cubed coreference metric in their MUC-7 paper.
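For comparison, here is a sketch of B-cubed as it is commonly stated: precision and recall are computed per mention from the overlap of its key and response clusters, then averaged over mentions with equal weights. The notes do not spell out the formula, so the equal weighting and the treatment of unreported mentions as singletons are assumptions here, and the function name is illustrative.

```python
def b_cubed(key, response):
    """Return (precision, recall) averaged over the mentions in the key."""
    key_of = {m: frozenset(c) for c in key for m in c}
    resp_of = {m: frozenset(c) for c in response for m in c}
    mentions = list(key_of)
    precision = recall = 0.0
    for m in mentions:
        k = key_of[m]
        # A mention the system did not report is treated as a singleton (assumption).
        r = resp_of.get(m, frozenset([m]))
        overlap = len(k & r)
        precision += overlap / len(r)
        recall += overlap / len(k)
    n = len(mentions)
    return precision / n, recall / n


# One three-mention entity plus two singletons; the system splits off a3.
key = [["a1", "a2", "a3"], ["s1"], ["s2"]]
response = [["a1", "a2"], ["a3"], ["s1"], ["s2"]]
p, r = b_cubed(key, response)
print(p, r)
```

On this example the link-based recall defined above is (3-2)/(3-1) = 0.5, with the two correct singletons contributing nothing. B-cubed recall comes out higher (11/15) because each correctly isolated singleton counts as a fully correct mention.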

Non-trainable Coreference Methods:  Major References

Jerry Hobbs.  Resolving Pronoun References.  Lingua 44:311-338.

Proposes a search order (through the parse trees of current and prior sentences) for pronouns;  widely used for pronominal anaphora resolution.

Shalom Lappin and Herbert Leass.  An Algorithm for Pronominal Anaphora Resolution.  Computational Linguistics 20(4):535-561.

Uses various weights (hand set) to select antecedents for pronouns.  Tested on corpus of several hundred pronouns ... got 86% accuracy.
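The general idea of selecting an antecedent by summing hand-set weights can be sketched as below. This is a generic weighted-factor sketch, not the published algorithm: the factor names and weight values are placeholders chosen for illustration, and the paper should be consulted for the actual factors, weights, and filtering steps.

```python
# Placeholder factor weights for illustration only (assumption, not from the paper).
WEIGHTS = {"same_sentence": 3.0, "subject": 2.0, "direct_object": 1.0}


def salience(candidate):
    """Sum the weights of all factors that apply to a candidate antecedent."""
    return sum(WEIGHTS[f] for f in candidate["factors"])


def pick_antecedent(candidates):
    """Return the candidate with the highest total salience."""
    return max(candidates, key=salience)


# Hypothetical candidates for a pronoun such as "it".
candidates = [
    {"text": "the committee", "factors": ["subject"]},
    {"text": "the report", "factors": ["same_sentence", "direct_object"]},
]
best = pick_antecedent(candidates)
print(best["text"])
```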

Renata Vieira; Massimo Poesio.  An Empirically Based System for Processing Definite Descriptions.  Computational Linguistics 26(4):539-593.

Develops detailed rules for resolving definite nominal anaphors and provides a detailed evaluation against a corpus.

Trainable Coreference Methods:

Pronoun Coreference
Niyu Ge; John Hale; Eugene Charniak.  A Statistical Approach to Anaphora Resolution.  WVLC 1998.

Does only pronoun resolution.  Reports 84% accuracy on their corpus (non-referential pronouns excluded).  Probabilistic model with features including Hobbs distance, number agreement, gender match, mention frequency, and selectional preference.  Discusses methods of learning gender/animateness characteristics of potential antecedents.

Hale's WVLC slides.

General Coreference
Wee Meng Soon; Daniel Chung Yong Lim; Hwee Tou Ng.  A Machine Learning Approach to Coreference Resolution of Noun Phrases.  Computational Linguistics 27(4):521-545 (2001).

Corpus-trained procedure for full (MUC) coreference task.  Used a set of syntactic features plus a semantic class agreement feature (using high-level WordNet classes);  12 features in all.  Trained on MUC data with decision tree procedure (classifier produces binary outcome).  At test time, links each mention to the most recent antecedent for which the classifier returns true.  Got F=63% on MUC-6 and F=60% on MUC-7.
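The test-time linking step (most recent antecedent accepted by a binary classifier) can be sketched as follows. The toy_classifier below is a hypothetical stand-in for the trained decision tree, and the mention fields are assumptions made for illustration; only the right-to-left search order is taken from the description above.

```python
def resolve(mentions, is_coreferent):
    """Link each mention to the nearest preceding mention the classifier accepts."""
    antecedent = {}
    for j in range(len(mentions)):
        # Scan candidate antecedents from most recent to earliest.
        for i in range(j - 1, -1, -1):
            if is_coreferent(mentions[i], mentions[j]):
                antecedent[j] = i
                break  # stop at the first (closest) accepted candidate
    return antecedent


def toy_classifier(candidate, anaphor):
    # Hypothetical rule standing in for the trained decision tree:
    # accept exact string matches, or a pronoun whose gender agrees.
    if candidate["text"].lower() == anaphor["text"].lower():
        return True
    return anaphor["pronoun"] and candidate["gender"] == anaphor["gender"]


mentions = [
    {"text": "Mary Jones", "gender": "f", "pronoun": False},
    {"text": "John Smith", "gender": "m", "pronoun": False},
    {"text": "she", "gender": "f", "pronoun": True},
    {"text": "John Smith", "gender": "m", "pronoun": False},
    {"text": "he", "gender": "m", "pronoun": True},
]
links = resolve(mentions, toy_classifier)
print(links)
```

Each mention receives at most one backward link, so the resulting links can be closed transitively to obtain the clusters that are scored.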