G22.2591 - Advanced Natural Language Processing - Spring 2004

Lecture 7

Discuss experiences with Asgn. 4 - MEMM
use of orthographic clues
importance of large training corpora ... need to 'memorize' common names and titles
role of Viterbi and of transition probabilities

First discussion of term project possibilities

Word Sense Disambiguation (WSD)

J&M, Sections 17.1-2;  Manning & Schutze, chap 7

What is WSD?

Success in most NLP applications depends on figuring out what a sentence means, which in turn depends on knowing what the individual words mean.  However, many words have multiple meanings.  For example, "bank" may be a financial institution or the side of a river;  "chair" may be a place to sit or the head of a department ("chairman").  Machine translation depends crucially on sense disambiguation;  even between languages with relatively similar word order, MT will be unsuccessful if WSD is not addressed.  If we are searching a text for examples of weapons, we need to know whether "shell" refers to ordnance or sea life.

One other application for which WSD was thought to be crucial was information retrieval.  It turns out, however, that once there are several keywords in the query, WSD provides relatively little gain for IR.

WSD Standards

Defining a WSD standard for a language is much more difficult than defining a standard for parts of speech, syntactic structures, or names, because so much of the specification is specific to individual words.   (J & M 595-7)

Lexicographers have long been concerned with the task of cataloging the senses of the words in a language.  A dictionary provides an inventory of such senses, typically arranged in a hierarchical fashion -- multiple entries, senses within an entry, and subsenses within a sense.  However, the goal in creating a dictionary is to explain rather than to classify;  overlapping senses are not a problem for the human user, but clearly make it more difficult to define a WSD task.  This was particularly true for older dictionaries, before the systematic use of concordances to gather and sort word usages.  Also, there are wide variations among dictionaries in how finely they divide a word into senses.

There are a few tests for identifying clearly different senses.  For some verbs, different senses will have different sets of complement structures.  (J&M 596:  He served meat. vs. He served as ambassador.).  Even in cases where two uses of the verb have the same complement, the conjoined form may be marginal or unacceptable, indicating that the uses reflect different senses ("zeugma" (J&M 597):  Continental Airlines serves breakfast.  Continental Airlines serves Newark.  *Continental Airlines serves breakfast and Newark.)  or it may create a pun ("Mr. Pickwick took his hat and his leave." -- Dickens).

WordNet is one of the most widely used sources of lexical information for English.  It is organized as a network of synsets -- synonym sets;  a word with several senses is associated with several synsets.  In addition, a substantial text corpus (SemCor) has been annotated with WordNet word senses.  Unfortunately, the division of senses in WordNet is relatively fine, so it is difficult to assign WordNet senses consistently (though this has improved in more recent releases).
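
As a concrete illustration, here is a minimal sketch (using the NLTK WordNet interface, which these notes do not assume) showing that each sense of "bank" corresponds to a different synset:

from nltk.corpus import wordnet as wn

# each noun synset for 'bank' groups synonymous lemmas and carries a gloss
for synset in wn.synsets('bank', pos=wn.NOUN):
    print(synset.name(), [lemma.name() for lemma in synset.lemmas()],
          '--', synset.definition())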

Cross-language information can also be used as a criterion for distinguishing senses, but it must be kept in mind that different languages make very different distinctions regarding word senses.

WSD Evaluations

There have been two common task evaluations for WSD, Senseval 1 and 2.  Senseval 1 (in 1998) for English involved 35 words for which senses had been developed in the Hector project.  Senseval 2 (in 2001) used WordNet senses and had tasks for tagging both running text (all words) and selected words.  Senseval 2 covered a large number of languages besides English, mostly using EuroWordNet resources.  A third Senseval is now underway.

These evaluations rely on substantial hand-tagged corpora.  Because such corpora were formerly very scarce (and still are for many languages), many experiments (starting with Gale et al. 1992) used pseudowords.  We take two words with the same part of speech and replace both by a single artificial word.  For example, 'door' and 'banana' might both be replaced in a corpus by the word 'donana'.  The WSD algorithm would then be given the task of determining whether each instance of 'donana' was originally 'door' or 'banana'.  However, Gaustad (ACL Student Session, 2001) argued that the results are not directly comparable -- real sense disambiguation may involve problems that are hard to replicate with pseudowords.
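
A minimal sketch of how such pseudoword data can be built (the function and the token-list corpus format are just illustrative):

def make_pseudoword_corpus(tokens, word1='door', word2='banana', pseudo='donana'):
    """Replace both real words with the artificial pseudoword, keeping a record
    of which word each instance originally was (the gold 'sense' for scoring)."""
    new_tokens, gold = [], []
    for tok in tokens:
        if tok.lower() in (word1, word2):
            gold.append((len(new_tokens), tok.lower()))  # (position, original word)
            new_tokens.append(pseudo)
        else:
            new_tokens.append(tok)
    return new_tokens, gold

# e.g. tokens = "he opened the door and ate a banana".split()
# the WSD system sees only 'donana'; gold holds the answers for evaluation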

Selectional Restrictions

In some applications, word sense disambiguation is part of a larger task of semantic interpretation, where the verbs are being translated into predicates and the noun phrases into arguments.  This is the case for most dialog systems and, to some degree, for event extraction systems.  In such cases, the domains are limited and the restrictions on the types of arguments a predicate can take are part of the domain specification.  These type constraints ("selectional constraints") can then be used for word sense disambiguation.  (For example, the object of "drink" must be something drinkable, so in "he drank the port" the wine sense of "port" is selected rather than the harbor sense.)
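
A toy sketch of this idea; the type hierarchy, sense inventory, and constraint table below are hand-coded stand-ins for what a real system would take from its domain specification:

ISA = {'beverage': 'substance', 'harbor': 'location'}   # child class -> parent class

def ancestors(cls):
    """cls and every class above it in the toy hierarchy."""
    while cls is not None:
        yield cls
        cls = ISA.get(cls)

# candidate senses of 'port', each tagged with a semantic class
SENSES = {'port': [('port/wine', 'beverage'), ('port/harbor', 'harbor')]}

# selectional constraint: the object of 'drink' must be a substance
OBJECT_CONSTRAINT = {'drink': 'substance'}

def disambiguate_object(verb, noun):
    required = OBJECT_CONSTRAINT.get(verb)
    senses = SENSES.get(noun, [])
    if required is None:
        return [sense for sense, cls in senses]          # no constraint: keep all senses
    return [sense for sense, cls in senses if required in ancestors(cls)]

# disambiguate_object('drink', 'port') -> ['port/wine']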

However, there are several problems with using selectional restrictions for WSD:
Learning selectional constraints typically requires a hierarchy of word-sense classes, such as WordNet.  Even with such a hierarchy, there will be trade-offs ... if a constraint is stated in terms of too broad a class, it will not be helpful;  if stated in terms of too narrow a class, it will have too many exceptions. (see Resnik)

Using Context:  Supervised Methods

Selectional constraints are a specific example of using context for WSD.  A simpler and more robust approach considers the context of a word more generally ... the words immediately before and after an instance.

Suppose we are interested in WSD for word w, and that we are given a training corpus in which the instances of w have been sense tagged.  We use the contexts of these instances to train a classifier and then use this classifier to sense tag new instances of w.  What features of the context should we use?  One of the simplest possibilities is to define a fixed-size window around each instance and treat the words within the window as a 'bag of words'.  All the training contexts for a given sense can be combined into a single bag -- so the context for that sense is represented as a set of words along with the frequency of each word in the set.
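
For concreteness, a minimal sketch of extracting such a bag-of-words context (the window size and the simple whitespace tokenization are arbitrary choices here):

from collections import Counter

def context_bag(tokens, position, window=3):
    """Bag of words (with frequencies) from a fixed-size window around
    tokens[position], excluding the target word itself."""
    left = max(0, position - window)
    right = min(len(tokens), position + window + 1)
    return Counter(t.lower() for i, t in enumerate(tokens[left:right], start=left)
                   if i != position)

# e.g. context_bag("the boat drifted toward the bank of the river".split(), 5)
#      -> Counter({'the': 2, 'drifted': 1, 'toward': 1, 'of': 1, 'river': 1})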

Given this set of words with frequencies, a Naive Bayes classifier can be quite effective (J&M 638-640):
if V is a vector representing the frequencies of the different words in context, and vi is the i-th element of this vector (the frequency of the i-th word), we select the sense s
= argmax(s) P(s|V)
= argmax(s) P(V|s) P(s)
and, assuming the vi are conditionally independent given the sense,
= argmax(s) P(s) product(i) P(vi|s)
Because the test context may include words not in the training context for a particular sense, maximum likelihood probability estimates will not be sufficient ... they have to be smoothed.
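
A minimal sketch of this classifier with add-one smoothing; the training-data format (a list of (sense, context bag) pairs) is just an assumption for illustration:

import math
from collections import Counter, defaultdict

class NaiveBayesWSD:
    """Naive Bayes sense classifier over bag-of-words contexts,
    with add-one smoothing of the per-sense word probabilities."""

    def fit(self, tagged_contexts):
        # tagged_contexts: list of (sense, bag) pairs, where bag is a
        # Counter of the words in that training context
        self.sense_counts = Counter()
        self.word_counts = defaultdict(Counter)    # sense -> word -> count
        self.vocab = set()
        for sense, bag in tagged_contexts:
            self.sense_counts[sense] += 1
            self.word_counts[sense].update(bag)
            self.vocab.update(bag)
        self.total = sum(self.sense_counts.values())
        return self

    def classify(self, bag):
        # argmax over senses of  log P(s) + sum_i freq(vi) * log P(vi|s)
        best_sense, best_score = None, float('-inf')
        vocab_size = len(self.vocab)
        for sense in self.sense_counts:
            score = math.log(self.sense_counts[sense] / self.total)
            n = sum(self.word_counts[sense].values())
            for word, freq in bag.items():
                # add-one smoothing; the extra +1 in the denominator reserves
                # a little mass for words never seen with this sense in training
                p = (self.word_counts[sense][word] + 1) / (n + vocab_size + 1)
                score += freq * math.log(p)
            if score > best_score:
                best_sense, best_score = sense, score
        return best_sense

# e.g. clf = NaiveBayesWSD().fit(training_pairs)
#      sense = clf.classify(context_bag(tokens, i))   # context_bag: earlier sketch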

Other classification learners, such as SVMs, can also be used.

Using Dictionaries (Lesk algorithm)

The problem with the supervised approach is that it requires a substantial number of sense-tagged examples to be effective.  (A corpus tagged with WordNet senses, SemCor, is now available.)  As an alternative, Lesk (Automatic Sense Disambiguation: How to tell a pine cone from an ice cream cone.  In Proc. SIGDOC '86 Conference, ACM) proposed a simple algorithm for selecting the appropriate word sense relative to a set of dictionary senses.  The Lesk algorithm uses the words in each definition:  for each sense, we count the overlap between the words in the definition of that sense and the words in all the definitions of the words in its context.  Or, we can disambiguate two neighboring words together by seeking the pair of definitions with the greatest overlap.
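
A minimal sketch of the simplified (definition-overlap) version; sense_definitions is a hypothetical dictionary mapping each word to its senses and their definition strings:

def lesk_disambiguate(word, context_words, sense_definitions):
    """Pick the sense of `word` whose definition shares the most words with
    the definitions of the surrounding context words (simplified Lesk).

    sense_definitions: dict mapping each word to {sense: definition string}.
    """
    # pool the definition words of all senses of all context words
    context_gloss = set()
    for w in context_words:
        for definition in sense_definitions.get(w, {}).values():
            context_gloss.update(definition.lower().split())

    best_sense, best_overlap = None, -1
    for sense, definition in sense_definitions.get(word, {}).items():
        overlap = len(set(definition.lower().split()) & context_gloss)
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense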

Using Parallel Corpora

Hwee Tou Ng, Bin Wang, and Yee Seng Chan.  Exploiting Parallel Texts for Word Sense Disambiguation:  An Empirical Study.  (ACL-03).

Bootstrapping approaches

Marti Hearst (1991) Noun Homograph Disambiguation using Local Context in Large Corpora

David Yarowsky (1995)  Unsupervised Word Sense Disambiguation Rivaling Supervised Methods