G22.2591 - Advanced Natural Language Processing - Spring 2004

Lecture 9

(Discuss term project topics)


Coreference is the task of determining whether two phrases refer to the same 'entity'.  We will consider it synonymous with anaphora resolution -- when a linguistic unit (the anaphor) refers back to a prior linguistic unit (the antecedent) in the discourse.  [However, it is possible to distinguish these phenomena ... if there are two mentions of "the FBI" in a discourse, they are definitely coreferential, but one may not consider the second mention an anaphoric reference to the first.]

Defining coreference as a formal task (so that it can be evaluated) is not trivial ... there are many possible variations on the basic task. One such task definition was provided by the MUC-6 coreference task:
There is generally good agreement between people about coreference relations, although decisions can be tricky in some cases of vague referents.  ("The New York police apprehended twenty criminals last week.  A police raid made the newspapers.")

Anaphora can be divided into three types:  pronouns, nominals (phrases headed by common nouns), and names.  Linguistic studies have mostly looked at pronouns;  some computational treatments have also been limited to pronouns.  Generally, system performance  is very good for names, moderate for pronouns, and poorest for nominals.  The main factors considered in resolving each type of potential anaphor are different:
In principle, we would like to use much richer criteria of semantic compatibility and coherence in selecting antecedents, but (except for very restricted domains) this is currently beyond the state of the art.


Unlike all the tasks we considered until now (part-of-speech tagging, chunking, name tagging, word sense disambiguation), coreference cannot be readily reduced to a tagging task, because it involves a relation between mentions (in effect, a gathering of mentions into co-referential clusters or equivalence classes).  This has led to complications in devising an appropriate scoring metric.

It's not sufficient to ask whether a pronoun refers to the correct antecedent, because there may be several possible correct antecedents.  In addition, we need to be concerned with how severely we count particular errors:  if there are 20 coreferential mentions (referring noun phrases), and a system reports the answer as two sets of 10, have we gotten 10 mentions wrong, or just 1 of 19 links?  This can have a large effect on scores.

MUC-6/7 adopted a scoring metric which essentially counts links, and asks how many coreference links would have to be added to get the correct cluster.  Consider one coreferential cluster consisting of mentions m1, ..., mN.  Suppose in the system response these mentions are divided among K clusters.  We then define recall as (N-K)/(N-1).  Total recall is obtained by summing these quantities (numerator and denominator) over all clusters in the key.  Precision is computed by reversing key and response.
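The link-counting definition above can be sketched in a few lines of Python.  This is only an illustration of the (N-K)/(N-1) formula, not the official MUC scorer;  the function names and the representation of clusters as sets of mention ids are assumptions made for the example, and a mention missing from the response is assumed to count as its own cluster.

```python
def muc_recall(key, response):
    """Link-based recall: sum of (N - K) over sum of (N - 1) across key clusters.

    key, response: lists of sets of mention ids (hypothetical representation).
    """
    # Map each mention to the index of the response cluster containing it.
    cluster_of = {m: i for i, cluster in enumerate(response) for m in cluster}
    numerator = 0
    denominator = 0
    for cluster in key:
        n = len(cluster)
        # K = number of response clusters the key cluster is split among;
        # a mention absent from the response counts as its own cluster.
        parts = {cluster_of.get(m, ("unresolved", m)) for m in cluster}
        numerator += n - len(parts)
        denominator += n - 1
    # With only singleton clusters there are no links to score.
    return numerator / denominator if denominator else 0.0


def muc_precision(key, response):
    """Precision is recall with key and response reversed."""
    return muc_recall(response, key)


# The example from the text: 20 coreferential mentions reported as two sets of 10.
key = [set(range(20))]
response = [set(range(10)), set(range(10, 20))]
print(muc_recall(key, response))     # 18/19, about 0.947
print(muc_precision(key, response))  # 1.0
```

Under this metric the split costs only one missing link, which is why the two-sets-of-10 response still scores about 95% recall.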

This metric is now generally used for reporting coreference scores, but it has some shortcomings.  In particular, if there are lots of singletons, you don't get any credit for getting them right.  Baldwin et al. discuss this link-based metric (Vilain's metric) and their alternative B-cubed coreference metric in their MUC-7 paper.
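As a rough illustration of how a mention-based metric differs, here is a minimal sketch of B-cubed scoring as it is commonly described:  precision and recall are computed for each mention from the overlap of its key and response clusters, then averaged with equal weight per mention.  The function name, the cluster representation, and the treatment of mentions missing from the response as singletons are assumptions for this example;  see the Baldwin et al. paper for the authors' own definition and weighting options.

```python
def b_cubed(key, response):
    """Return (precision, recall), averaged over the mentions in the key.

    key, response: lists of sets of mention ids (hypothetical representation).
    """
    key_of = {m: frozenset(c) for c in key for m in c}
    resp_of = {m: frozenset(c) for c in response for m in c}
    precision_sum = 0.0
    recall_sum = 0.0
    for m, key_cluster in key_of.items():
        # Assumption: a mention absent from the response is a singleton.
        resp_cluster = resp_of.get(m, frozenset([m]))
        overlap = len(key_cluster & resp_cluster)
        precision_sum += overlap / len(resp_cluster)
        recall_sum += overlap / len(key_cluster)
    n = len(key_of)
    return precision_sum / n, recall_sum / n


# 20 coreferential mentions reported as two sets of 10.
key = [set(range(20))]
response = [set(range(10)), set(range(10, 20))]
print(b_cubed(key, response))          # precision 1.0, recall 0.5

# Correctly identified singletons now receive full credit.
print(b_cubed([{1}, {2}], [{1}, {2}]))  # precision 1.0, recall 1.0
```

Note how the same two-sets-of-10 response that loses a single link under link counting loses half its recall here, since each mention is grouped with only half of its true cluster.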

Non-trainable Coreference Methods:  Major References

Jerry Hobbs.  Resolving Pronoun References.  Lingua 44:311-338.

Proposes a search order (through the parse trees of current and prior sentences) for pronouns;  widely used for pronominal anaphora resolution.

Shalom Lappin and Herbert Leass.  An Algorithm for Pronominal Anaphora Resolution.  Computational Linguistics 20(4):535-561.

Uses various weights (hand set) to select antecedents for pronouns.  Tested on corpus of several hundred pronouns ... got 86% accuracy.

Renata Vieira; Massimo Poesio.  An Empirically Based System for Processing Definite Descriptions.  Computational Linguistics 26(4):539-593.

Develops detailed rules for resolving definite nominal anaphors and provides a detailed evaluation against a corpus.

Trainable Coreference Methods:  An Annotated Bibliography

Pronoun Coreference

Niyu Ge; John Hale; Eugene Charniak.  A Statistical Approach to Anaphora Resolution.  WVLC 1998. **

Does only pronoun resolution.  Reports 84% accuracy on their corpus (non-referential pronouns excluded).  Probabilistic model with features including Hobbs distance, number agreement, gender match, mention frequency, and selectional preference.  Discusses methods of learning gender/animateness characteristics of potential antecedents.

General Coreference

Wee Meng Soon; Daniel Chung Yong Lim; Hwee Tou Ng.  A Machine Learning Approach to Coreference Resolution of Noun Phrases.  Computational Linguistics 27(4):521-545 (2001).  **

Corpus-trained procedure for full (MUC) coreference task.  Used a set of syntactic features plus a semantic class agreement feature (using high-level WordNet classes);  12 features in all.  Trained on MUC data with decision tree procedure (classifier produces binary outcome).  On test, links each anaphor to the most recent antecedent for which the classifier returns true.  Got MUC-6, 7 F=63%, 60%.

Vincent Ng; Claire Cardie.  Improving Machine Learning Approaches to Coreference Resolution.  ACL 2002. **

Tries to improve on work of Soon et al.  Non-linguistic improvements include richer string-match features, different selection of positive training examples, and selecting best-scoring antecedent (rather than closest acceptable antecedent).  Together yield 3% improvement.  Tries very rich set of features -- net loss using decision tree.  Prunes the features, gets some gain on MUC-6 (not MUC-7).  Final MUC F= 69%, 63%.
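The difference between the two antecedent-selection strategies (closest acceptable antecedent, as in Soon et al., versus best-scoring antecedent, as in Ng and Cardie) can be sketched as follows.  This is an illustrative sketch only:  the `score` function stands in for a trained pairwise classifier, and the function names, the 0.5 threshold, and the toy scores are all assumptions made for the example, not values from either paper.

```python
def closest_first(anaphor, candidates, score, threshold=0.5):
    """Link to the most recent candidate the classifier accepts.

    candidates: preceding mentions, ordered from earliest to latest.
    """
    for cand in reversed(candidates):
        if score(cand, anaphor) > threshold:
            return cand
    return None  # no acceptable antecedent: treat as non-anaphoric


def best_first(anaphor, candidates, score, threshold=0.5):
    """Link to the highest-scoring candidate above the threshold."""
    best, best_score = None, threshold
    for cand in reversed(candidates):  # on ties, the closer candidate wins
        s = score(cand, anaphor)
        if s > best_score:
            best, best_score = cand, s
    return best


# Toy classifier scores (hypothetical values, for illustration only).
toy_scores = {"the company": 0.9, "IBM": 0.6, "the deal": 0.3}


def toy_score(candidate, anaphor):
    return toy_scores[candidate]


candidates = ["the company", "IBM", "the deal"]  # document order
print(closest_first("it", candidates, toy_score))  # IBM
print(best_first("it", candidates, toy_score))     # the company
```

The two strategies disagree whenever a nearer candidate is merely acceptable while a more distant one is scored higher, as in the toy example.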

Vincent Ng and Claire Cardie.  Identifying Anaphoric and Non-Anaphoric Noun Phrases to Improve Coreference Resolution.  COLING 2002.

Introduces a separate (corpus-trained) test to distinguish anaphoric and non-anaphoric NPs, using domain-independent criteria (in contrast to Bean and Riloff).  Finds that if this test is applied to all possible anaphors, performance gets worse, but if some 'sure' anaphor-antecedent pairs passing a string match or alias test are excluded, performance improves.  Reports MUC-6, 7 scores 66%, 64%.

Xiaofeng Yang; Guodong Zhou; Jian Su; Chew Lim Tan.  Coreference Resolution Using Competition Learning Approach.  ACL 2003.

Proposes using discriminative learner to choose between antecedents, rather than just scoring each antecedent against anaphor.  Helps for pronouns (2-3% over Ng&Cardie ACL), little help for nominals.  Perhaps features comparing antecedents are too weak.  Overall MUC-6, 7 F is 71%, 60%.

Sanda M. Harabagiu; Razvan C. Bunescu; Steven J. Maiorano.  Text and Knowledge Mining for Coreference Resolution.  NAACL 2001. **

Integrates several corpus-driven methods, although in all cases the learning appears to be performed manually, from corpus examples.  Develops rules based on MUC corpus.  Uses WordNet to determine similarity of heads for nominal anaphors;  uses corpora to weight different possible paths in WordNet.  Learns rule weighting from corpus.  Finally, uses bootstrapping to find good candidate examples / rules from unannotated text.  Scores are very high, but Ng and Cardie [ACL 2002] report that this is based on a different metric from the standard MUC metric.  (Also, Sanda M. Harabagiu; Steven J. Maiorano.  Multilingual Coreference Resolution.  ANLP/NAACL 2000.  discusses a similar approach, extended to process parallel English/Romanian texts with small improvement in reference resolution.)

Information-Extraction Coreference

J. McCarthy and W. Lehnert.  Using Decision Trees for Coreference Resolution.  Proc. Fourteenth Int'l Joint Conf. on Artificial Intelligence (1995).

Coreference for entities as identified by MUC-5 joint venture task;  uses information from extracted templates.

Andrew Kehler.  Probabilistic Coreference in Information Extraction.  WVLC 97.

Coreference for entities as identified by MUC-5 joint venture task;  uses information from extracted templates.

Claire Cardie and Kiri Wagstaff.  Noun Phrase Coreference as Clustering.  EMNLP 1999.

Builds entities by clustering mentions, proceeding backwards through document, treating entities as clusters.  Is described as unsupervised learning but really appears to be hand-tuned ... no real learning. MUC-6 F=54%.

David L. Bean; Ellen Riloff.  Corpus-Based Identification of Non-Anaphoric Noun Phrases.  ACL 1999.

Describes a variety of tests (some syntactic, some lexical and corpus-based) to identify 'existential' NPs -- those which can be understood independently of other mentions (note that this is not the same as being non-coreferential).  Suggests that this would be helpful as a filter for anaphora resolution, but this is not directly tested.

Andrew McCallum and Ben Wellner. Toward Conditional Models of Identity Uncertainty with Application to Proper Noun Coreference.  IJCAI Workshop on Information Integration on the Web, 2003.

Argues that coreference models based on binary (anaphor-antecedent) probabilities or scores are not sufficient;  we need more general models that can capture relations among all mentions in a coreferential cluster.  Introduces such a model, and demonstrates 2-3% absolute improvement in name coreference performance.