G22.2591 - Advanced Natural Language Processing - Spring 2011

Lecture 12

Coreference

Coreference is the task of determining whether two phrases refer to the same 'entity'.  We will consider it synonymous with anaphora resolution -- when a linguistic unit (the anaphor) refers back to a prior linguistic unit (the antecedent) in the discourse.  [However, it is possible to distinguish these phenomena ... if there are two mentions of "the FBI" in a discourse, they are definitely coreferential, but one may not consider the second mention an anaphoric reference to the first.]

Coreference is critical to most NLP tasks, and it will certainly be essential to doing well on KBP.  If we find the sentence "He was born in Greenwich Village on January 9, 1914," we won't be able to do anything with it without resolving "he".

Defining coreference as a formal task (so that it can be evaluated) is not trivial ... there are many possible variations on the basic task.  One such task definition was provided by the MUC-6 coreference task.

ACE implicitly defines a similar coreference task because entity mentions are grouped into entities.  One basic difference, however, is that only entities of the ACE types are marked.  This may have subtle effects on the evaluation.

There is generally good agreement between people about coreference relations, although decisions can be tricky in some cases of vague referents.  ("The New York police apprehended twenty criminals last week.  A police raid made the newspapers.").

Anaphora can be divided into three types:  pronouns, nominals (phrases headed by common nouns), and names.  Linguistic studies have mostly looked at pronouns;  some computational treatments have also been limited to pronouns.  Generally, system performance is very good for names, moderate for pronouns, and poorest for nominals.  The main factors considered in resolving each type of potential anaphor are different:  string and alias matching for names;  gender and number agreement, recency, and syntactic salience for pronouns;  head matching and semantic class compatibility for nominals.

In principle, we would like to use much richer criteria of semantic compatibility and coherence in selecting antecedents, but (except for very restricted domains) this is currently beyond the state of the art.

Within- and cross-document coreference

So far we have considered coreference within a single document or discourse, but the problem also arises across documents. When applied across documents, it is generally limited to entities with names. Even for entities with names, the task is not trivial. On the one hand, names transliterated from non-Roman alphabets may be spelled differently in different documents (e.g., "Muammar al-Gaddafi"). On the other hand, a common name may refer to many different people. So cross-document resolution (part of KBP) requires contextual information.

Evaluation

Unlike all the tasks we have considered so far (part-of-speech tagging, chunking, name tagging, word sense disambiguation), coreference cannot be readily reduced to a tagging task, because it involves a relation between mentions (in effect, a gathering of mentions into co-referential clusters or equivalence classes).  This has led to complications in devising an appropriate scoring metric.

It's not sufficient to ask whether a pronoun refers to the correct antecedent, because there may be several possible correct antecedents.  In addition, we need to be concerned with how severely we count particular errors:  if there are 20 coreferential mentions (referring noun phrases), and a system reports the answer as two sets of 10, have we gotten 10 mentions wrong, or just 1 of 19 links?  This can have a large effect on scores.

MUC-6/7 adopted a scoring metric which essentially counts links, and asks how many coreference links would have to be added to get the correct cluster.  Consider one coreferential cluster consisting of mentions m1, ..., mN.  Suppose in the system response these mentions are divided among K clusters.  We then define recall as (N-K)/(N-1).  Total recall is obtained by summing these quantities (numerator and denominator) over all clusters in the key.  Precision is computed by reversing key and response.
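To make the link counting concrete, here is a minimal sketch (in Python; not the official MUC scorer) of the computation just described.  Each of key and response is a list of clusters, and each cluster is a set of mention identifiers; mentions missing from one side are treated as singletons there.

    def muc_score(key, response):
        """MUC link-based scoring: recall for a key cluster of size N that the
        response splits into K pieces is (N-K)/(N-1); the totals are summed
        over clusters, and precision is obtained by swapping key and response."""
        def link_recall(gold, system):
            num = den = 0
            for cluster in gold:
                # count the distinct system clusters this gold cluster intersects;
                # mentions missing from the system side count as singleton pieces
                pieces = set()
                for m in cluster:
                    hit = next((i for i, c in enumerate(system) if m in c), None)
                    pieces.add(hit if hit is not None else ('singleton', m))
                num += len(cluster) - len(pieces)
                den += len(cluster) - 1
            return num / den if den else 0.0
        r = link_recall(key, response)
        p = link_recall(response, key)
        f = 2 * p * r / (p + r) if p + r else 0.0
        return r, p, f

For the example above, a key entity of 20 mentions answered as two clusters of 10 gives recall (20-2)/(20-1) = 18/19, i.e. the response is charged for just one of the 19 links:

    key = [set(range(20))]
    response = [set(range(10)), set(range(10, 20))]
    muc_score(key, response)   # recall = 18/19 ~ 0.947, precision = 1.0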

This metric is now generally used for reporting coreference scores, but it has some shortcomings.  In particular, if there are lots of singletons, you don't get any credit for getting them right.  Baldwin et al. discuss Vilain's metric and their alternative B-cubed coreference metric in their MUC-7 paper.
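As a reminder of the B-cubed idea (this is the standard definition, not a detail taken from the MUC-7 paper), each mention is scored by the overlap between the key cluster and the response cluster that contain it, so correctly isolated singletons do earn credit.  A minimal sketch:

    def b_cubed(key, response):
        """B-cubed scoring: per-mention precision and recall are based on the
        overlap between the response cluster and the key cluster containing
        that mention; the per-mention values are then averaged."""
        def cluster_of(m, clusters):
            for c in clusters:
                if m in c:
                    return c
            return {m}                      # unlisted mentions behave as singletons
        mentions = set().union(*key)        # score over the mentions in the key
        p = r = 0.0
        for m in mentions:
            k_c, r_c = cluster_of(m, key), cluster_of(m, response)
            overlap = len(k_c & r_c)
            p += overlap / len(r_c)
            r += overlap / len(k_c)
        return r / len(mentions), p / len(mentions)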

Non-trainable Coreference Methods:  Major References

Jerry Hobbs.  Resolving Pronoun References.  Lingua 44:311-338 (1978).

Proposes a search order (through the parse trees of current and prior sentences) for pronouns;  widely used for pronominal anaphora resolution.

Shalom Lappin and Herbert Leass.  An Algorithm for Pronominal Anaphora Resolution.  Computational Linguistics 20(4):535-561 (1994).

Uses various weights (hand set) to select antecedents for pronouns.  Tested on a corpus of several hundred pronouns ... got 86% accuracy.
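The flavor of the approach, in a heavily simplified sketch: each candidate antecedent accumulates hand-set salience weights, agreement-incompatible candidates are filtered out, and the highest-scoring survivor wins.  The factor names and weight values below are made up for illustration; they are not Lappin and Leass's actual factors or figures.

    # illustrative salience factors with hand-set weights (values invented for the sketch)
    WEIGHTS = {'recency': 100, 'subject': 80, 'direct_object': 50, 'head_noun': 80}

    def agrees(pronoun, cand):
        # crude number/gender filter; a real system consults morphology and lexicons
        return (pronoun['number'] == cand['number'] and
                cand['gender'] in (pronoun['gender'], 'unknown'))

    def resolve_pronoun(pronoun, candidates):
        """Return the agreement-compatible candidate with the highest summed salience."""
        scored = [(sum(WEIGHTS.get(f, 0) for f in c['factors']), c)
                  for c in candidates if agrees(pronoun, c)]
        return max(scored, key=lambda sc: sc[0])[1] if scored else None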

Renata Vieira; Massimo Poesio.  An Empirically Based System for Processing Definite Descriptions.  Computational Linguistics 26(4):539-593 (2000).

Develops detailed rules for resolving definite nominal anaphors and provides a thorough evaluation against a corpus.

Trainable Coreference Methods:

Two 'classic' papers and a recent survey
Pronoun Coreference
Niyu Ge; John Hale; Eugene Charniak.  A Statistical Approach to Anaphora Resolution.  WVLC 1998.

Does only pronoun resolution.  Reports 84% accuracy on their corpus (non-referential pronouns excluded).  Probabilistic model with features including Hobbs distance, number agreement, gender match, mention frequency, and selectional preference.  Discusses methods of learning gender/animateness characteristics of potential antecedents.
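A toy illustration of how such per-feature estimates might be combined to rank candidate antecedents (a naive product of probabilities, which simplifies away the paper's actual model; the feature names and tables are placeholders):

    def most_probable_antecedent(candidates, feature_probs):
        """Toy ranking of candidate antecedents: multiply per-feature probability
        estimates (e.g. for hobbs_distance, number, gender, frequency, selection).
        Each candidate is a dict of feature values; feature_probs maps a feature
        name to a {value: probability} table estimated from a corpus."""
        def score(cand):
            p = 1.0
            for feat, value in cand.items():
                p *= feature_probs.get(feat, {}).get(value, 1e-6)   # smoothed lookup
            return p
        return max(candidates, key=score)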

Hale's WVLC slides.

General Coreference
Wee Meng Soon; Daniel Chung Yong Lim; Hwee Tou Ng.  A Machine Learning Approach to Coreference Resolution of Noun Phrases.  Computational Linguistics 27(4):521-545 (2001).

Corpus-trained procedure for the full (MUC) coreference task.  Used a set of syntactic features plus a semantic class agreement feature (using high-level WordNet classes);  12 features in all.  Trained on MUC data with a decision tree procedure (the classifier produces a binary outcome).  At test time, links each mention to the most recent antecedent for which the classifier returns true.  Got F = 63% on MUC-6 and 60% on MUC-7.
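The decoding step lends itself to a short sketch: scan the mentions left to right and, for each one, test candidate antecedents from nearest to farthest, linking to the first that the classifier accepts.  The classifier argument below stands in for the trained decision tree over the 12 features.

    def closest_first_linking(mentions, classifier):
        """Closest-first decoding: for each mention, test candidate antecedents
        from nearest to farthest and link to the first one the binary
        classifier accepts."""
        links = {}                               # anaphor index -> antecedent index
        for j in range(1, len(mentions)):
            for i in range(j - 1, -1, -1):       # nearest candidate first
                if classifier(mentions[i], mentions[j]):
                    links[j] = i
                    break
        return links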
Survey of Coreference Models
Vincent Ng, who has published extensively on coreference resolution over the past decade, has written a recent survey of supervised methods:  Vincent Ng.  Supervised Noun Phrase Coreference Research:  The First Fifteen Years.  ACL 2010.  He discusses the three major types of trainable models for coreference:  mention-pair models, entity-mention models, and ranking models.

Looking Ahead

A variety of approaches to building coreference models

Aria Haghighi and Dan Klein, Simple coreference resolution with rich syntactic and semantic features.  EMNLP 2009. 
Is it possible to do coreference without a statistical coref model?  Haghighi and Klein learn rich lexical, syntactic, and semantic constraints and combine them to create a high-performance coref system without the need to train a coref model.  (Further improved on by Haghighi and Klein, Coreference resolution in a modular, entity-centered model, NAACL 2010.)

Vincent Ng, Unsupervised models for coreference resolution.  EMNLP 2008.

Is it possible to train a coref model without training data? (See also Hoifung Poon and Pedro Domingos, Joint unsupervised coreference resolution with Markov logic, EMNLP 2008.)

David Bean and Ellen Riloff. Unsupervised learning of contextual role knowledge for coreference resolution. Proc. HLT/NAACL 2004.

Can we make use of discourse information -- event sequences -- to resolve coref?  Bean and Riloff learn pairs of predicates which are likely to govern references to the same entity, starting with reliable coreference pairs. Applied in two narrow domains, terrorism and disasters.  (Generalized by Liao and Grishman, Large corpus-based semantic feature extraction for pronoun coreference, Proceedings of the Second International Workshop on NLP Challenges in the Information Explosion Era (at COLING 2010)).