G22.2590 - Natural Language Processing - Spring 2013 Prof. Grishman

Lecture 11 Outline

April 16, 2013

Text Coherence and Coreference

Why are we interested in analyzing the structure of a discourse beyond the sentence level?

How to analyze and utilize text coherence?

Cross-document coreference

If we have a large collection of documents, we want to know which documents refer to the same individual or organization so that we can gather all the information together. But cross-document coreference is a much different problem from in-document coreference. A single document (news article, web page, email, blog post) is typically written or edited by a single author, and so will have consistent references: a single person will be referred to by the same name, and multiple mentions of the same name will refer to the same person. In contrast, across a large collection there may be many people with the same name, and variations in the spelling of the name of an individual.

As input to cross-document coreference, we assume that named-entity extraction and within-document coreference have already been performed. We will only attempt cross-document linking for those entities which include a named mention. Thus we will not try to link "the first wife of Richard Burton" (if she isn't named in the document) or "the only victim of the bombing".

Coreference [whether within-document or cross-document] is basically a clustering problem: for each document entity, we identify a set of features and then, based on the features of two document entities, decide whether to place them in the same cluster. The features can be divided into internal (name) features and external features.

Internal features

Each named document entity will have a maximal name (and possibly some shorter forms of the name). The simplest name matching requires that these be identical. This works for names originally written in the Roman alphabet, and works frequently for languages with a Romanization standard [Chinese], but rarely works for languages without a standard Romanization [Arabic]. The many spellings of the name of former Libyan leader Gaddafi are indicative. To handle such cases, we require a similarity measure on names. Simple measures are based on Levenshtein (edit) distance or the number of shared bigrams. Measures based on similarity of pronunciation (e.g., Soundex) are also used.
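The two simple measures can be sketched as follows. This is a minimal illustration, not a prescribed method: the unit edit costs, the lowercasing, and the use of the Dice coefficient to normalize the shared-bigram count are all assumptions made here.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertion, deletion and substitution."""
    prev = list(range(len(b) + 1))          # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                          # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def bigrams(s: str) -> set:
    """Set of adjacent character pairs in s."""
    return {s[i:i + 2] for i in range(len(s) - 1)}


def bigram_dice(a: str, b: str) -> float:
    """Shared character bigrams, normalized with the Dice coefficient (assumption)."""
    x, y = bigrams(a.lower()), bigrams(b.lower())
    if not x and not y:                     # both strings shorter than two characters
        return 1.0 if a.lower() == b.lower() else 0.0
    return 2 * len(x & y) / (len(x) + len(y))
```

For two spellings such as "Gaddafi" and "Qaddafi", the edit distance is 1 and five of the six bigrams in each spelling are shared, so both measures report the names as close even though an exact match fails.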

If the full name is the same in two documents, we need to know how strong the evidence is that this is the same person. This can be estimated using name perplexity: the number of different family names in a large corpus appearing with the same given name, and the number of different given names appearing with the same family name.
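The counts behind this estimate can be gathered in one pass over the names in a corpus. The sketch below assumes each name has already been split into a (given, family) pair; the toy names are hypothetical, and how the two counts are combined into a single score is left open.

```python
from collections import defaultdict


def name_ambiguity(full_names):
    """Collect, for each given name, the family names seen with it, and vice versa."""
    families_of = defaultdict(set)   # given name  -> set of family names
    givens_of = defaultdict(set)     # family name -> set of given names
    for given, family in full_names:
        families_of[given].add(family)
        givens_of[family].add(given)
    return families_of, givens_of


# Hypothetical toy corpus of (given, family) pairs.
corpus_names = [
    ("John", "Smith"), ("John", "Jones"), ("John", "Lee"),
    ("Mary", "Smith"), ("Zbigniew", "Brzezinski"),
]
families_of, givens_of = name_ambiguity(corpus_names)
```

Here "John" occurs with three family names and "Smith" with two given names, so a match on "John Smith" is weaker evidence of coreference than a match on a name whose parts each occur with only one partner.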

External features

The external features may involve the document as a whole, the immediate context of an entity's mention chain, or specific semantic relations in which the entity participates.

Whole-document features. Document metadata (author, email recipient, date), if available, may be valuable indicators that two document entities have the same referent. IR-style measures applied to the documents (for example, cosine similarity applied to tf-idf normalized term vectors) can be a good measure of whether the documents are about the same topic or event and thus likely to refer to the same person.
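A minimal sketch of the IR-style measure, assuming documents are already tokenized and using raw term frequency times log(N/df) as the tf-idf weight (one of several common weightings). The three toy documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))   # document frequency
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: c * math.log(n / df[t]) for t, c in tf.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


docs = [
    "smith won the election in ohio".split(),
    "smith lost the election recount in ohio".split(),
    "smith scored a goal in the final".split(),
]
vecs = tfidf_vectors(docs)
```

All three documents mention "smith", but that term appears everywhere and so gets weight zero. The first two documents still share "election" and "ohio", while the first and third share nothing informative, so the first pair is the more plausible coreference candidate.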

Local-context features. The IR-style measures can also be applied more specifically to the context of the entity. For example, the work by Bagga and Baldwin (1998), one of the first on cross-document coreference, generated a document extract consisting of the sentences containing mentions of the entity in question, and then measured the similarity of these extracts.

Semantic relations. If information extraction has been applied to the corpus, for example to extract employer, title, residence, date/place of birth/death, ... this can provide valuable evidence of coreference/non-coreference.

Weighting features

Given multiple features, we need some means of weighting them to provide a unified coreference measure. This can be done through supervised training, e.g., with MaxEnt, and a few small corpora (e.g., ACE, WePS) have been hand annotated. However, effective training requires a large training corpus (why?), which would be very expensive to prepare. Furthermore, appropriate weights may be heavily dependent on the corpus. As an alternative, some systems are trained using 'person-x' corpora, where a subset of names [why not all names?] are replaced by "person-x". This does not work, however, for all features [why?].


A complete pairwise comparison of entities would take quadratic time. This is not feasible for even a modest collection (say 10^6 documents with 10 named entities each). So some fast method of finding pairs with shared features is required.
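With n entities there are n(n-1)/2 pairs to compare. One simple way to avoid enumerating them all is an inverted index from features to entities, so that only entities sharing at least one feature are ever paired. The sketch below is one possible approach, not a method given in these notes; the `max_bucket` cutoff for overly common features and the toy feature sets are assumptions.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(entity_features, max_bucket=1000):
    """Return pairs of entity ids that share at least one feature.

    entity_features: {entity_id: iterable of hashable features}
    max_bucket: features shared by more entities than this are skipped,
                since they are too common to be selective (assumption).
    """
    index = defaultdict(list)                 # feature -> entity ids
    for entity_id, features in entity_features.items():
        for f in set(features):
            index[f].append(entity_id)

    pairs = set()
    for ids in index.values():
        if len(ids) > max_bucket:
            continue
        for a, b in combinations(sorted(ids), 2):
            pairs.add((a, b))
    return pairs


# Hypothetical entities with name-token and context features.
entities = {
    "doc1:e1": {"smith", "ohio"},
    "doc2:e4": {"smith", "texas"},
    "doc3:e2": {"jones", "paris"},
}
```

Only the two entities sharing "smith" are proposed as a pair; the full feature comparison is then run on the candidates rather than on every pair in the collection.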

Probabilistic Grammars

(J&M Chapter 14;  see also slides by Ray Mooney)

We face a quandary in building broad-coverage grammars.  As we encounter new constructs and extend the grammar by adding new productions, we find that these productions also provide alternative, unwanted parses for 'ordinary' sentences.  We can try to address this problem by adding feature-based constraints, but the resulting grammar becomes very hard to maintain.

One solution we already have considered is to perform only partial parsing -- to recognize a limited set of constituents, and not to insist on getting a complete syntactic analysis of every sentence.

An alternative solution is to employ probabilistic context-free grammars (PCFGs).  We associate a probability with each production in the grammar, with the requirement that the probabilities of the productions expanding a given symbol sum to 1.  We can then view the grammar as a probabilistic sentence generator, just as we did with Markov Models and HMMs.  We start with the sentence symbol, and at each step expand the leftmost non-terminal, choosing a production based on its probability.  In contrast to HMMs, the PCFG allows us to easily capture more global constraints ... for example, that each sentence has a subject and object.
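The generator view can be sketched directly. The toy grammar below is hypothetical, as are its probabilities; the code checks the sum-to-1 requirement and then repeatedly expands the leftmost non-terminal.

```python
import random

# Hypothetical toy grammar: symbol -> list of (right-hand side, probability).
GRAMMAR = {
    "S":   [(("NP", "VP"), 1.0)],
    "NP":  [(("she",), 0.4), (("dogs",), 0.3), (("Det", "N"), 0.3)],
    "VP":  [(("V", "NP"), 0.7), (("V",), 0.3)],
    "Det": [(("the",), 1.0)],
    "N":   [(("cat",), 0.5), (("book",), 0.5)],
    "V":   [(("sees",), 0.6), (("likes",), 0.4)],
}


def check_normalized(grammar, tol=1e-9):
    """True if the productions expanding each symbol have probabilities summing to 1."""
    return all(abs(sum(p for _, p in rules) - 1.0) < tol
               for rules in grammar.values())


def generate(grammar, start="S", rng=random):
    """Generate a word list by always expanding the leftmost non-terminal."""
    sentential = [start]
    while True:
        idx = next((i for i, s in enumerate(sentential) if s in grammar), None)
        if idx is None:                       # only words remain
            return sentential
        rules = grammar[sentential[idx]]
        rhs = rng.choices([r for r, _ in rules],
                          weights=[p for _, p in rules])[0]
        sentential[idx:idx + 1] = rhs         # replace the symbol by its expansion
```

Calling `generate(GRAMMAR, rng=random.Random(0))` gives a reproducible sample. This toy grammar has no recursive productions, so generation always terminates.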

When we parse with a PCFG, we seek the most probable analysis of a sentence (just as we sought the most probable state sequence for an HMM).  The probability of a parse is the product of the probabilities of all the productions which are used to expand symbols in the tree.
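As a small worked instance of this product, the sketch below represents a tree as a nested tuple (label, child, ...), where a child is either a subtree or a word. The rule probabilities are hypothetical.

```python
# Hypothetical rule probabilities, keyed by (left-hand side, right-hand side).
RULE_PROB = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("she",)): 0.4,
    ("NP", ("dogs",)): 0.3,
    ("VP", ("V", "NP")): 0.7,
    ("V", ("sees",)): 0.6,
}


def tree_prob(tree, rule_prob):
    """Product of the probabilities of all productions used in the tree."""
    label, *children = tree
    # The right-hand side is the sequence of child labels (or the word itself).
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    p = rule_prob[(label, rhs)]
    for c in children:
        if not isinstance(c, str):
            p *= tree_prob(c, rule_prob)
    return p


tree = ("S", ("NP", "she"), ("VP", ("V", "sees"), ("NP", "dogs")))
```

For this tree the product is 1.0 * 0.4 * 0.7 * 0.6 * 0.3 = 0.0504. For long sentences the product becomes very small, so implementations commonly add log probabilities instead.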

Why does a PCFG help us in managing large grammars?  The probabilities provide a way of choosing between parses, preferring the more probable ones.  If we add an additional production to handle some very rare phenomenon we had not previously encountered, we will assign a very low probability to that production, so (roughly speaking) we will generate an analysis using that production only if an analysis using 'ordinary' (more probable) productions is not available.

Of course, if we had to assign such probabilities by hand, judging which productions are more likely, the task of grammar preparation would be almost impossible.  However, if we have a 'tree bank' -- a large set of sentences with manually-assigned parses -- we can build a PCFG in a very straightforward fashion by counting the number of times each production appears in the tree bank.  The probability of a production r(n) expanding a symbol n is

P(r(n)) = count(r(n)) / count(n)

An early tree bank for this purpose was developed by the speech group at IBM Research (Yorktown Heights);  the most influential and widely-used tree bank was developed at the University of Pennsylvania by Mitch Marcus and his colleagues (the Penn TreeBank).  It has led to the development of tree banks for several dozen other languages, including Japanese, German, Czech, Chinese, and Arabic.
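The relative-frequency estimate of production probabilities can be sketched in a few lines. The nested-tuple tree format and the two-tree "tree bank" are hypothetical stand-ins for a real resource such as the Penn TreeBank.

```python
from collections import Counter


def productions(tree):
    """Yield (lhs, rhs) for every production used in a nested-tuple tree."""
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    yield (label, rhs)
    for c in children:
        if not isinstance(c, str):
            yield from productions(c)


def estimate_pcfg(treebank):
    """P(rule) = count(rule) / count(left-hand-side symbol)."""
    rule_count = Counter()
    lhs_count = Counter()
    for tree in treebank:
        for lhs, rhs in productions(tree):
            rule_count[(lhs, rhs)] += 1
            lhs_count[lhs] += 1
    return {rule: c / lhs_count[rule[0]] for rule, c in rule_count.items()}


# Hypothetical two-sentence tree bank.
treebank = [
    ("S", ("NP", "she"), ("VP", ("V", "sleeps"))),
    ("S", ("NP", "dogs"), ("VP", ("V", "see"), ("NP", "she"))),
]
pcfg = estimate_pcfg(treebank)
```

NP is expanded three times here, twice as "she", so that production gets probability 2/3. By construction, the estimates for each left-hand-side symbol sum to 1.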

Parsing with Probabilistic Grammars

Minimally modifying a parser to incorporate probabilities is relatively simple:  we retain a probability with each partial parse (with each node in the bottom-up parser, with each edge in the chart parser).  When a production is added to the partial parse, we multiply in the probability of that production.

If we are factoring the parse tree, combining all partial analyses of the same word sequence as the same symbol, then we need to retain only the most probable analysis. [Review algorithm complexity.]
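One standard way to realize this is a probabilistic CKY chart, sketched below for a grammar in Chomsky normal form. The restriction to binary and lexical rules, the table layout, and the toy grammar are assumptions of the sketch, not the parser used in the course.

```python
def viterbi_cky(words, lexical, binary):
    """Probabilistic CKY for a grammar in Chomsky normal form.

    lexical: {(A, word): prob} for rules A -> word
    binary:  {(A, B, C): prob} for rules A -> B C
    Returns best[(i, j)] = {symbol: (prob, backpointer)} for the span words[i:j].
    """
    n = len(words)
    best = {}
    for i, w in enumerate(words):
        cell = {}
        for (a, word), p in lexical.items():
            if word == w and p > cell.get(a, (0.0, None))[0]:
                cell[a] = (p, w)
        best[(i, i + 1)] = cell
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length
            cell = {}
            for k in range(i + 1, j):                 # split point
                left, right = best[(i, k)], best[(k, j)]
                for (a, b, c), p in binary.items():
                    if b in left and c in right:
                        prob = p * left[b][0] * right[c][0]
                        # Keep only the most probable analysis of this span as a.
                        if prob > cell.get(a, (0.0, None))[0]:
                            cell[a] = (prob, (k, b, c))
            best[(i, j)] = cell
    return best


# Hypothetical toy grammar.
lexical = {("NP", "she"): 0.4, ("NP", "dogs"): 0.3, ("V", "sees"): 0.6}
binary = {("S", "NP", "VP"): 1.0, ("VP", "V", "NP"): 0.7}
chart = viterbi_cky(["she", "sees", "dogs"], lexical, binary)
```

The entry `chart[(0, 3)]["S"]` holds the probability of the best parse and a backpointer from which the tree can be recovered. The loops over span length, start position, and split point give time cubic in the sentence length, multiplied by the number of rules.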

The efficiency of probabilistic parsing can be improved by doing a best-first search (extending the most probable hypothesis) or a heuristic search.

Extending Probabilistic Grammars:  Probabilistic Lexicalized CFGs

PCFGs assume that probabilities are independent of context.  This is not a very accurate assumption;  for example, pronouns occur much more often as the subject of a sentence than as the direct object.  We can make more accurate probability estimates by conditioning the probability on the parent symbol in the parse tree (for example).

In addition, PCFGs do not take the lexical items into account, even though we recognize that some lexical items are much more likely in some environments.  There are dependencies between words and constructions, which we are familiar with as subcategorization.  For example,  in the context  "I ___ that Fred has two dogs.", it is much more likely that the blank is filled by the word "know" or "believe", and not by "am" or "sold".  In addition, there are dependencies between the words of a sentence.  For example, in the context "I bought a ____ about computational linguistics.", it is much more likely that the blank is filled with "book" than "dog".

To handle such dependencies, we need to make productions dependent on the lexical items in the sentence.  But which lexical items?  We don't have enough data to estimate probabilities for all combinations of words.  To limit the lexical items used to condition the probabilities, we use the notion of head introduced as part of constituent structure and employed in feature grammars.  For each production, we designate one element of the right-hand side as the head of the construct;  for example, the verb in a verb phrase production.  We pass up the tree the word associated with the head element, so that there is a head word h(n) associated with each node n in the tree.  If we have a production A --> B C D, and C is the head element of the production, then h(A) = h(C) is the head of the phrase, and we say that h(B) and h(D) are dependents of the head of the phrase.
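Head percolation can be sketched with a small head table. The table, the rightmost-child fallback, and the example tree below are hypothetical; real head rules are considerably more detailed.

```python
# Hypothetical head table: for each parent label, the label of its head child.
HEAD_CHILD = {"S": "VP", "VP": "V", "NP": "N"}


def head_index(tree, head_child):
    """Position of the head child among the children of a non-preterminal node."""
    label, *children = tree
    wanted = head_child.get(label)
    for i, c in enumerate(children):
        if c[0] == wanted:
            return i
    return len(children) - 1        # fallback assumption: rightmost child


def head_word(tree, head_child=HEAD_CHILD):
    """h(n): the word passed up the tree from the head element."""
    label, *children = tree
    if isinstance(children[0], str):          # preterminal: (tag, word)
        return children[0]
    return head_word(children[head_index(tree, head_child)], head_child)


def dependencies(tree, head_child=HEAD_CHILD):
    """List of (head word, dependent word) pairs for every non-head child."""
    label, *children = tree
    if isinstance(children[0], str):
        return []
    h = head_index(tree, head_child)
    deps = []
    for i, c in enumerate(children):
        if i != h:
            deps.append((head_word(tree, head_child), head_word(c, head_child)))
        deps.extend(dependencies(c, head_child))
    return deps


tree = ("S",
        ("NP", ("Det", "the"), ("N", "dog")),
        ("VP", ("V", "chased"), ("NP", ("Det", "a"), ("N", "cat"))))
```

For this tree, h(S) = h(VP) = "chased", and the dependents of "chased" are "dog" and "cat", the heads of the two noun phrases.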

We then compute the probability of the tree as the product, over all nodes n, of

P(r(n) | n, h(n)) * P(h(n) | n, h(parent(n)))

where r(n) is the production used to expand node n.

The quality of a parser is measured in terms of how many of the constituents were correctly identified.  A simple PCFG, trained on the Penn Tree Bank (1 million words), may score in the 70s (percent);  the best lexicalized PCFGs get almost 90% correct when tested on the same type of text.
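One common way to make "constituents correctly identified" concrete is labeled bracketing precision and recall: a predicted constituent counts as correct if a gold constituent has the same label and spans the same words. These notes do not say which exact measure lies behind the figures above, so the sketch below is an illustration only; ignoring preterminals and the toy trees are assumptions.

```python
def constituents(tree, start=0):
    """Return (set of (label, start, end) spans, end position) for a nested-tuple tree."""
    label, *children = tree
    if isinstance(children[0], str):     # preterminal (tag, word): not scored here
        return set(), start + 1
    spans, pos = set(), start
    for c in children:
        sub, pos = constituents(c, pos)
        spans |= sub
    spans.add((label, start, pos))
    return spans, pos


def precision_recall(gold, predicted):
    """Labeled bracketing precision and recall of a predicted tree against a gold tree."""
    g, _ = constituents(gold)
    p, _ = constituents(predicted)
    correct = len(g & p)
    return correct / len(p), correct / len(g)


gold = ("S",
        ("NP", ("N", "dogs")),
        ("VP", ("V", "chase"), ("NP", ("Adj", "small"), ("N", "cats"))))
pred = ("S",
        ("NP", ("N", "dogs")),
        ("VP", ("V", "chase"), ("NP", ("Adj", "small"))),
        ("NP", ("N", "cats")))
precision, recall = precision_recall(gold, pred)
```

The predicted tree proposes five constituents, of which two (the subject NP and S) match the four gold constituents, giving precision 2/5 and recall 2/4.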