CSCI-GA.2590 - Natural Language Processing - Spring 2013 Prof. Grishman

Lecture 4 Outline

February 19, 2013

(Discuss Asgn 2 -- adding pronoun constraints to the grammar)

Part-of-Speech Tagging Using Hidden Markov Models

Corpus-Based Methods

Natural language is very complex
    - we don't know how to model it fully,
        so we build simplified models which provide some approximation to natural language
How can we measure 'how good' these models are?
    - we build a corpus,
        annotate it by hand with respect to the phenomenon we are interested in,
        and then compare it with the predictions of our model
    - for example, how well the model predicts part-of-speech or syntactic structure
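As a minimal sketch of comparing model predictions with a hand-annotated corpus, the Python below computes per-token tagging accuracy. The function name and the toy tag sequences are illustrative assumptions, not part of the course materials.

```python
def tagging_accuracy(gold_tags, predicted_tags):
    """Fraction of tokens whose predicted tag matches the hand annotation."""
    if len(gold_tags) != len(predicted_tags):
        raise ValueError("gold and predicted sequences must have the same length")
    if not gold_tags:
        raise ValueError("cannot score an empty sequence")
    correct = sum(1 for g, p in zip(gold_tags, predicted_tags) if g == p)
    return correct / len(gold_tags)

# Toy data (made up): the model gets 4 of 5 tokens right.
gold = ["DT", "NN", "VBZ", "DT", "NN"]
pred = ["DT", "NN", "NNS", "DT", "NN"]
print(tagging_accuracy(gold, pred))  # 0.8
```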
To build a good corpus
    - we must define a task people can do reliably (choose a suitable POS set, for example)
    - we must provide good documentation for the task
    - we must measure human performance (through dual annotation and inter-annotator agreement)
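One way to quantify inter-annotator agreement on a dually annotated sample is sketched below. Raw agreement is the fraction of tokens on which the two annotators agree. Cohen's kappa, which corrects for chance agreement, is one common choice but is an assumption here: the outline does not say which measure is used. The data are made up.

```python
from collections import Counter

def observed_agreement(a, b):
    """Fraction of items on which the two annotators chose the same label."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty annotation lists of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """(p_o - p_e) / (1 - p_e), where p_e is the agreement expected by chance."""
    n = len(a)
    p_o = observed_agreement(a, b)
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (p_o - p_e) / (1 - p_e)

# Toy double annotation of four tokens (made up).
annotator_1 = ["NN", "NN", "VB", "NN"]
annotator_2 = ["NN", "VB", "VB", "NN"]
print(observed_agreement(annotator_1, annotator_2))  # 0.75
print(cohens_kappa(annotator_1, annotator_2))
```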
How to train the model?
    - need a goodness metric
    - train by hand, by adjusting rules and analyzing errors
    - train automatically
        * develop new rules
        * build probabilistic model (generally very hard to do by hand)

Statistical Part-of-Speech Tagging (J&M sec 5.5)

Looking at words in isolation:
Given a word w, what tag t should we assign to it?
We want to assign the tag which maximizes the number of correct assignments.
The probability of getting the assignment correct is  P ( t | w )
So we want to assign the tag = argmax(t) P ( t | w )
We can estimate P ( t | w ) as
    (number of times w is tagged as t in corpus) / (number of times w appears in corpus)
(this is the 'maximum likelihood estimator', J&M p. 88)
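The count-based estimate above can be sketched in Python as follows. The function names and the toy corpus are illustrative assumptions. A real tagger would also need a policy for words that never occur in the corpus, which this sketch only flags by returning None.

```python
from collections import Counter, defaultdict

def count_word_tags(tagged_corpus):
    """Count how often each word appears with each tag."""
    counts = defaultdict(Counter)
    for word, tag in tagged_corpus:
        counts[word][tag] += 1
    return counts

def p_tag_given_word(counts, tag, word):
    """MLE of P(t | w): count(w tagged as t) / count(w)."""
    tag_counts = counts.get(word)
    if not tag_counts:
        return 0.0  # unseen word: no estimate available
    return tag_counts[tag] / sum(tag_counts.values())

def most_likely_tag(counts, word):
    """argmax(t) P(t | w), or None for a word not seen in the corpus."""
    tag_counts = counts.get(word)
    if not tag_counts:
        return None
    return tag_counts.most_common(1)[0][0]

# Toy corpus (made up): "race" is tagged NN twice and VB once.
toy_corpus = [("the", "DT"), ("race", "NN"), ("to", "TO"),
              ("race", "VB"), ("the", "DT"), ("race", "NN")]
counts = count_word_tags(toy_corpus)
print(most_likely_tag(counts, "race"))  # NN
print(p_tag_given_word(counts, "NN", "race"))
```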

Using Bayes' rule, we can rewrite
     P ( t | w ) = P ( t ) * P ( w | t ) /  P ( w )
This will be helpful when we take context into account.

Looking at the tag of the previous word
Call the tag of the previous word 'u'
Want to find argmax(t) P ( t | u, w)
Under independence assumptions, this is argmax(t) P ( t | u ) * P ( w | t ), since

P ( t | u, w ) = P ( t ) P ( u, w | t ) / P ( u, w )
               = P ( t ) P ( u | t ) P ( w | t ) / ( P ( u ) P ( w ) )  [ independence assumption ]
               = P ( t | u ) P ( w | t ) / P ( w )
and P ( w ) does not depend on t, so it can be dropped from the argmax.
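A small sketch of choosing a tag from the previous tag u and the word w by maximizing P ( t | u ) * P ( w | t ). The probability tables are made-up toy numbers, and the function name is hypothetical. The point is only that context can override the tag a word receives most often in isolation.

```python
def best_tag_given_previous(prev_tag, word, transition, emission):
    """Return argmax(t) P(t | u) * P(w | t), or None if no tag can emit the word."""
    candidates = [t for t in emission if word in emission[t]]
    if not candidates:
        return None
    return max(
        candidates,
        key=lambda t: transition.get(prev_tag, {}).get(t, 0.0) * emission[t][word],
    )

# Toy tables (made up). In isolation "race" favors NN (higher emission
# probability), but after TO the transition probability makes VB win.
transition = {"TO": {"VB": 0.8, "NN": 0.01}, "DT": {"NN": 0.9, "VB": 0.001}}
emission = {"VB": {"race": 0.001}, "NN": {"race": 0.002}}

print(best_tag_given_previous("TO", "race", transition, emission))  # VB
print(best_tag_given_previous("DT", "race", transition, emission))  # NN
```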

Taking into account both the previous and following tags in deciding the tag of the current word.
We want to find the most likely sequence of tags T = t1, t2, ... , tn
    given the sequence of words W = w1, w2, ... wn
I.e., find argmax(T) P ( T | W ) = argmax(T) P ( T ) P ( W | T ) / P ( W )
                                                = argmax(T) P ( T ) P ( W | T )
                                                = argmax(T) product(i) P( ti | ti-1, ... t1 ) P ( wi | ti wi-1 ti-1 ... w1 t1 )
which we will approximate as
                                                   argmax(T) product(i) P ( ti | ti-1 ) P ( wi | ti )
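To make the approximation concrete, the sketch below scores one candidate tag sequence as the product over i of P ( ti | ti-1 ) P ( wi | ti ), accumulated in log space to avoid underflow. The start pseudo-tag "<s>", the function name, and the toy tables are assumptions made for illustration.

```python
import math

def sequence_log_prob(words, tags, transition, emission, start="<s>"):
    """Log of product(i) P(t_i | t_{i-1}) * P(w_i | t_i) for one tag sequence."""
    if len(words) != len(tags):
        raise ValueError("words and tags must have the same length")
    log_p = 0.0
    prev = start
    for word, tag in zip(words, tags):
        p = transition.get(prev, {}).get(tag, 0.0) * emission.get(tag, {}).get(word, 0.0)
        if p == 0.0:
            return float("-inf")  # sequence is impossible under the model
        log_p += math.log(p)
        prev = tag
    return log_p

# Toy tables (made up).
transition = {"<s>": {"DT": 0.5}, "DT": {"NN": 0.8}}
emission = {"DT": {"the": 0.6}, "NN": {"dog": 0.1}}

log_p = sequence_log_prob(["the", "dog"], ["DT", "NN"], transition, emission)
print(math.exp(log_p))  # approximately 0.5 * 0.6 * 0.8 * 0.1 = 0.024
```

Decoding then amounts to finding the tag sequence with the highest such score, without enumerating every sequence.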

Markov Model (J&M 6.1)

Finite state networks with transition probabilities:
probability of next state depends only on current state ('Markov assumption')

By associating a particular word with each state, we get a probabilistic model of word sequences ...
a probabilistic generative model of sentences.

as shown in a Markov Model of my cat and dog
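A minimal sketch of a word-level Markov model as a generative model: each state is a word, and the probability of a sentence is the product of the transition probabilities along its path. The network below is a made-up stand-in with invented probabilities, not the lecture's cat-and-dog figure.

```python
def chain_probability(states, start_probs, trans_probs):
    """Probability of a state (word) sequence under a first-order Markov model."""
    if not states:
        raise ValueError("need at least one state")
    p = start_probs.get(states[0], 0.0)
    for prev, cur in zip(states, states[1:]):
        # Markov assumption: the next state depends only on the current state.
        p *= trans_probs.get(prev, {}).get(cur, 0.0)
    return p

# Hypothetical network: every sentence starts with "my".
start_probs = {"my": 1.0}
trans_probs = {
    "my": {"cat": 0.5, "dog": 0.5},
    "cat": {"meows": 1.0},
    "dog": {"barks": 1.0},
}

print(chain_probability(["my", "cat", "meows"], start_probs, trans_probs))  # 0.5
print(chain_probability(["my", "cat", "barks"], start_probs, trans_probs))  # 0.0
```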

Hidden Markov Model (HMM) (J&M 6.2)

Suppose we associate a part-of-speech with each state in a Markov Model.
We then associate an 'emission probability' P ( w | t )
    of emitting a particular word when in a particular state.
This is a hidden Markov Model ...
    the sequence of words generated does not uniquely determine the sequence of states.

Training an HMM
Training an HMM is simple if we have a completely labeled corpus:
    we have marked the POS of each word.
We can then directly estimate both P ( ti | ti-1 ) and P ( wi | ti ) from corpus counts
    using the Maximum Likelihood Estimator.
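The counting procedure can be sketched as below. The start pseudo-tag "<s>", the function name, and the toy sentences are assumptions. There is no smoothing and no end-of-sentence state, so unseen words and unseen tag pairs get probability zero.

```python
from collections import Counter, defaultdict

def normalize(count_table):
    """Turn nested counts into conditional probabilities (MLE)."""
    return {
        given: {item: c / sum(cs.values()) for item, c in cs.items()}
        for given, cs in count_table.items()
    }

def train_hmm(tagged_sentences, start="<s>"):
    """Estimate P(t_i | t_{i-1}) and P(w_i | t_i) from a fully tagged corpus."""
    trans_counts = defaultdict(Counter)  # trans_counts[prev_tag][tag]
    emit_counts = defaultdict(Counter)   # emit_counts[tag][word]
    for sentence in tagged_sentences:
        prev = start
        for word, tag in sentence:
            trans_counts[prev][tag] += 1
            emit_counts[tag][word] += 1
            prev = tag
    return normalize(trans_counts), normalize(emit_counts)

# Toy labeled corpus (made up).
sentences = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
    [("the", "DT"), ("cat", "NN")],
]
transition, emission = train_hmm(sentences)
print(transition["DT"]["NN"])  # 1.0
print(emission["NN"]["dog"])   # 0.5
```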

Using an HMM ('decoding')
The argmax(T) given above corresponds to finding the most likely path through the model, given W.
The Viterbi algorithm solves this task quickly: its running time is linear in the number of tokens (and quadratic in the number of states). (J&M sec. 5.5.3 and 6.4)
It consists of a forward pass which computes probabilities,
and a backward pass which traces the most likely path.
In the forward pass,
    it builds a probability matrix viterbi [number of POS states+2, number of tokens + 2], and
    a back pointer matrix of the same size.
viterbi [ s, t ] = max (over all paths to [s,t]) of the probability of reaching state s at token t
    (here t is a token position, not a tag)
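The two passes can be sketched as follows. This is a simplified illustration under stated assumptions, not the J&M pseudocode or the Jet implementation: the start state is folded into a pseudo-tag "<s>", there is no explicit end state (so no "+2" rows), raw probabilities are used rather than logs, and the tables are made-up toy numbers.

```python
def viterbi(words, tags, transition, emission, start="<s>"):
    """Return (most likely tag sequence, its probability) for a bigram HMM."""
    if not words:
        return [], 1.0

    def trans(u, t):
        return transition.get(u, {}).get(t, 0.0)

    def emit(t, w):
        return emission.get(t, {}).get(w, 0.0)

    # Forward pass: score[i][t] is the probability of the best path ending
    # in tag t at token i; back[i][t] is the previous tag on that path.
    score = [{t: trans(start, t) * emit(t, words[0]) for t in tags}]
    back = [{t: None for t in tags}]
    for i in range(1, len(words)):
        score.append({})
        back.append({})
        for t in tags:
            best_prev = max(tags, key=lambda u: score[i - 1][u] * trans(u, t))
            score[i][t] = score[i - 1][best_prev] * trans(best_prev, t) * emit(t, words[i])
            back[i][t] = best_prev

    # Backward pass: follow the back pointers from the best final tag.
    last = max(tags, key=lambda t: score[-1][t])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    path.reverse()
    return path, score[-1][last]

# Toy model (made up): "race" alone favors NN, but TO is usually followed by VB.
tags = ["TO", "NN", "VB"]
transition = {"<s>": {"TO": 1.0}, "TO": {"VB": 0.8, "NN": 0.2}}
emission = {"TO": {"to": 1.0}, "NN": {"race": 0.002}, "VB": {"race": 0.001}}

best_path, best_prob = viterbi(["to", "race"], tags, transition, emission)
print(best_path)  # ['TO', 'VB']
```

For long sentences, summing log probabilities instead of multiplying raw probabilities avoids numerical underflow.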

We review a simple example of Viterbi decoding.


The POS tagger in Jet uses the general HMM model classes within Jet.
File pos_hmm.txt:  trained on 96% of the Penn Treebank corpus (960,000 words).
Weak on handling of unknown words (words not in training corpus)
Limitations of bigram tagging (limited context)
Domain dependency