CSCI-GA.2590 - Natural Language Processing - Spring 2013    Prof. Grishman

Lecture 8 Outline

March 26, 2013

Term project status.

Lexical Semantics [J&M Chap 19 and 20]

Until now we have focused primarily on syntactic issues: part of speech, names, noun and verb groups, and some larger structures. For our semantics -- our meaning-bearing elements -- we have relied on words. Words are problematic as a semantic representation:  one word may have several meanings (polysemy) and several words may have the same or nearly the same meaning (synonymy).  Both of these cause problems for NLP applications, including information extraction. In this section we take a closer look at word meanings.

This will also give us an opportunity to see a wide range of approaches: hand-coded resources and methods involving supervised, semi-supervised, and unsupervised training.

Terminology [J&M 19.1, 2]

    - multiple senses of a word
    - polysemy (and homonymy for totally unrelated senses ("bank"))
    - metonymy for certain types of regular, productive polysemy ("the White House", "Washington")
    - zeugma (conjunction combining distinct senses) as test for polysemy ("serve")
    - synonymy:  when two words mean (more-or-less) the same thing
    - hyponymy:  X is the hyponym of Y if X denotes a more specific subclass  of Y
        (X is the hyponym, Y is the hypernym)

WordNet [J&M 19.3]

    - large-scale database of lexical relations
    - freely available for interactive use or download
    - organized as a graph whose nodes are synsets (synonym sets)
        - each synset consists of 1 or more word senses which are considered synonymous
    - primary relation:  hyponym / hypernym
    - very fine sense distinctions
    - sense-annotated corpus (SemCor, subset of Brown corpus)
    - similar wordnets have been developed for many other languages:  Global WordNet Association
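
As a concrete illustration, here is a minimal sketch of looking up synsets and hypernyms with NLTK's WordNet interface (this assumes the nltk package and its 'wordnet' corpus are installed; the word 'bass' is just an example):

    # Minimal sketch of browsing WordNet with NLTK
    # (assumes nltk is installed and the corpus has been fetched via nltk.download('wordnet')).
    from nltk.corpus import wordnet as wn

    # Each synset groups word senses considered synonymous.
    for synset in wn.synsets('bass'):
        print(synset.name(), synset.definition())

    # The primary relation is hyponym / hypernym: walk from the root down to one sense.
    sense = wn.synset('bass.n.01')          # one fine-grained sense of 'bass'
    for hyper in sense.hypernym_paths()[0]:
        print(hyper.name())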

Word Sense Disambiguation [J&M 20.1]

    - process of identifying the sense of a word in context
    - WSD evaluation:  either against fine-grained WordNet senses or coarser senses (e.g., main senses from a dictionary)
    - local cues (Weaver):  train a classifier using nearby words as features
        - either treat words at specific positions relative to the target word as separate features
        - or treat all words within a given window (e.g., 10 words wide) as an unordered 'bag of words'
          (a feature-extraction sketch follows this list)
        - simple demo for 'interest'
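
A minimal sketch of the two kinds of local-context features; the helper names, window sizes, and example sentence are illustrative, not from the lecture:

    # Sketch of local-context features for WSD: positional features vs. bag of words.
    def positional_features(tokens, i, k=2):
        """Words at specific positions relative to the target word, each as a separate feature."""
        feats = {}
        for offset in range(-k, k + 1):
            if offset != 0 and 0 <= i + offset < len(tokens):
                feats['word_at_%+d' % offset] = tokens[i + offset].lower()
        return feats

    def bag_of_words_features(tokens, i, window=5):
        """All words within +/- window of the target word, as an unordered bag of words."""
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        return {'bow_' + w.lower(): 1 for w in left + right}

    # Example: features for the target word 'interest'.
    sent = "the bank raised the interest rate on savings accounts".split()
    print(positional_features(sent, sent.index('interest')))
    print(bag_of_words_features(sent, sent.index('interest')))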

Simple supervised WSD algorithm:  naive Bayes [J&M 20.2.2]

        selected sense s' = argmax(sense s) P(s | F)
        where F is the set of context features (n different features)
            s' = argmax(s) P(F | s) P(s) / P(F)
               = argmax(s) P(F | s) P(s)
        If we now assume the features are conditionally independent given the sense
            P(F | s) = Π_i P(f_i | s)
            s' = argmax(s) P(s) Π_i P(f_i | s)
        Maximum likelihood estimates for P(s) and P(f_i | s) can easily be obtained by counting
            - some smoothing (e.g., add-one smoothing) is needed
        Works quite well at selecting the best sense (though not at estimating accurate probabilities)
        But needs substantial annotated training data for each word (a small implementation sketch follows)
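
A small implementation sketch of this naive Bayes classifier with add-one smoothing; the data format (pairs of a set of context words and a sense label) and the tiny example for 'bass' are assumptions for illustration:

    import math
    from collections import defaultdict

    # Naive Bayes WSD sketch with add-one smoothing.
    # Assumed input: examples = [(set_of_context_words, sense_label), ...] for one target word.
    def train(examples):
        sense_count = defaultdict(int)                      # counts for P(s)
        feat_count = defaultdict(lambda: defaultdict(int))  # counts for P(f | s)
        vocab = set()
        for feats, sense in examples:
            sense_count[sense] += 1
            for f in feats:
                feat_count[sense][f] += 1
                vocab.add(f)
        return sense_count, feat_count, vocab

    def classify(feats, sense_count, feat_count, vocab):
        total = sum(sense_count.values())
        best, best_score = None, float('-inf')
        for s, n_s in sense_count.items():
            # log P(s) + sum_i log P(f_i | s), with add-one smoothing
            score = math.log(n_s / total)
            denom = sum(feat_count[s].values()) + len(vocab)
            for f in feats:
                score += math.log((feat_count[s][f] + 1) / denom)
            if score > best_score:
                best, best_score = s, score
        return best

    examples = [({'fish', 'caught', 'lake'}, 'fish sense'),
                ({'play', 'guitar', 'music'}, 'music sense')]
    model = train(examples)
    print(classify({'caught', 'a', 'fish'}, *model))        # -> 'fish sense'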

Semi-supervised WSD algorithm [J&M 20.5]

Based on Gale and Yarowsky's "one sense per collocation" and "one sense per discourse" observations
    (generally true for coarse word senses)
Allows bootstrapping (semi-supervised learning) from a small set of sense-annotated seeds
Basic idea of bootstrapping:
    start with a small set of labeled seeds L and a large set of unlabeled examples U
    repeat:
        train classifier C on L
        apply C to U
        identify the examples with the most confident labels; remove them from U and add them (with their labels) to L
For WSD:
    identify some collocates which unambiguously indicate one sense (e.g., "fish" and "play" for "bass"); select examples with those collocates as seeds
    extend confident labels to other examples of the target word in the same document (a sketch of the bootstrapping loop follows)
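
A schematic sketch of the bootstrapping loop above (not Yarowsky's exact algorithm); the caller supplies the training and classification functions, e.g., the naive Bayes sketch earlier, and the parameter values are illustrative:

    # Generic bootstrapping loop: grow the labeled set L from seeds by repeatedly
    # self-labeling the most confidently classified examples in U.
    # Assumed caller-supplied helpers:
    #   train_fn(labeled) -> model
    #   classify_fn(model, example) -> (label, confidence)
    def bootstrap(train_fn, classify_fn, labeled, unlabeled,
                  n_rounds=10, batch=20, threshold=0.9):
        for _ in range(n_rounds):
            model = train_fn(labeled)                       # train classifier C on L
            scored = []
            for ex in unlabeled:
                label, confidence = classify_fn(model, ex)  # apply C to U
                if confidence >= threshold:
                    scored.append((confidence, ex, label))
            if not scored:
                break
            scored.sort(key=lambda t: t[0], reverse=True)   # most confident labels first
            for confidence, ex, label in scored[:batch]:
                labeled.append((ex, label))                 # add to L (with label)
                unlabeled.remove(ex)                        # remove from U
        return labeled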

Identifying similar words

Distance metrics for WordNet [J&M 20.6]

Simplest metrics just use path length in WordNet
More sophisticated metrics take account of the fact that going 'up' (to a hypernym) may represent different degrees of generalization in different cases
Resnik introduced P(c):  for each concept (synset), P(c) = probability that a word in a corpus is an instance of the concept (matches the synset c or one of its hyponyms)
Information content of a concept
    IC(c) = -log P(c)
If LCS(c1, c2) is the lowest common subsumer of c1 and c2, the Jiang-Conrath (JC) distance between c1 and c2 is
    dist_JC(c1, c2) = IC(c1) + IC(c2) - 2 × IC(LCS(c1, c2))
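
These measures are available directly in NLTK; a minimal sketch, assuming the 'wordnet' and 'wordnet_ic' corpora have been downloaded and using 'dog'/'cat' as an arbitrary example pair:

    from nltk.corpus import wordnet as wn
    from nltk.corpus import wordnet_ic

    dog = wn.synset('dog.n.01')
    cat = wn.synset('cat.n.01')

    # Simplest metric: based on path length between the two synsets.
    print(dog.path_similarity(cat))

    # Information-content-based measures, using IC counts derived from the Brown corpus.
    brown_ic = wordnet_ic.ic('ic-brown.dat')
    print(dog.res_similarity(cat, brown_ic))   # Resnik: IC of the lowest common subsumer
    print(dog.jcn_similarity(cat, brown_ic))   # Jiang-Conrath: inverse of the JC distance above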

Similarity metric from corpora [J&M 20.7]

Basic idea:  characterize words by their contexts;  words sharing more contexts are more similar
Contexts can either be defined in terms of adjacency or dependency (syntactic relations)
Given a word w and a context feature f, define pointwise mutual information PMI
    PMI(w,f) = log ( P(w,f) / (P(w) P(f)) )
Given a list of contexts (words left and right) we can compute a context vector for each word.
The similarity of two vectors v and w (representing two words) can be computed in many ways;
  a standard way is using the cosine (normalized dot product):
    sim_cosine(v, w) = Σ_i v_i × w_i / ( |v| × |w| )
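
A small sketch of building PMI-weighted context vectors from adjacency contexts and comparing two words with the cosine; the toy corpus, window size, and positive-PMI clipping are illustrative choices:

    import math
    from collections import Counter

    # Toy corpus; contexts are the words within +/- 'window' positions of each token.
    corpus = [
        "the cat chased the mouse".split(),
        "the dog chased the cat".split(),
        "the dog ate the bone".split(),
    ]
    window = 2

    word_count = Counter()
    pair_count = Counter()
    for sent in corpus:
        for i, w in enumerate(sent):
            word_count[w] += 1
            for j in range(max(0, i - window), min(len(sent), i + window + 1)):
                if j != i:
                    pair_count[(w, sent[j])] += 1
    total_words = sum(word_count.values())
    total_pairs = sum(pair_count.values())

    def pmi_vector(w):
        """Context vector for w:  PMI(w,f) = log( P(w,f) / (P(w) P(f)) ) for each context word f."""
        vec = {}
        for (x, f), n in pair_count.items():
            if x == w:
                pmi = math.log((n / total_pairs) /
                               ((word_count[w] / total_words) * (word_count[f] / total_words)))
                vec[f] = max(pmi, 0.0)       # keep only positive PMI (a common choice)
        return vec

    def cosine(v, w):
        """Normalized dot product of two sparse vectors represented as dicts."""
        dot = sum(v[f] * w.get(f, 0.0) for f in v)
        norm = math.sqrt(sum(x * x for x in v.values())) * math.sqrt(sum(x * x for x in w.values()))
        return dot / norm if norm else 0.0

    print(cosine(pmi_vector('cat'), pmi_vector('dog')))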
See the Thesaurus demo by Patrick Pantel.
By applying clustering methods we have an unsupervised way of creating semantic word classes.