G22.2590 - Natural Language Processing - Spring 2005 Prof. Grishman

Lecture 4 Outline

February 14, 2005

Part-of-Speech Tagging Using Hidden Markov Models

Corpus-Based Methods

Natural language is very complex
    - we don't know how to model it fully,
        so we build simplified models which provide some approximation to natural language
How can we measure 'how good' these models are?
    - we build a corpus,
        annotate it by hand with respect to the phenomenon we are interested in,
        and then compare it with the predictions of our model
    - for example, how well the model predicts parts of speech or syntactic structure
How to train the model?
    - need a goodness metric
    - train by hand, by adjusting rules and analyzing errors
    - train automatically
        * develop new rules
        * build probabilistic model (generally very hard to do by hand)
 

Statistical Part-of-Speech Tagging (J&M sec 8.5)

Looking at words in isolation:
Given a word w, what tag t should we assign to it?
We want to assign the tag which maximizes the number of correct assignments.
The probability of getting the assignment correct is  P ( t | w )
So we want to assign the tag = argmax(t) P ( t | w )
We can estimate P ( t | w ) as
    (number of times w is tagged as t in corpus) / (number of times w appears in corpus)
(this is the 'maximum likelihood estimator', J&M p. 200)
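
A minimal sketch of this estimator on an invented toy corpus (the flat list of (word, tag) pairs and all counts are illustrative assumptions, not Jet's data format):

    from collections import Counter

    # Hypothetical hand-tagged corpus: a flat list of (word, tag) pairs.
    tagged_corpus = [("the", "DT"), ("can", "MD"), ("can", "MD"),
                     ("can", "NN"), ("the", "DT")]

    word_count = Counter(w for w, t in tagged_corpus)
    word_tag_count = Counter(tagged_corpus)

    def p_tag_given_word(t, w):
        # MLE: (number of times w is tagged as t) / (number of times w appears)
        return word_tag_count[(w, t)] / word_count[w]

    def best_tag(w):
        # argmax(t) P ( t | w ), over tags seen with w in the corpus
        tags = {t2 for (w2, t2) in word_tag_count if w2 == w}
        return max(tags, key=lambda t: p_tag_given_word(t, w))

    print(p_tag_given_word("MD", "can"))   # 2/3
    print(best_tag("can"))                 # 'MD'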

Using Bayes' rule, we can rewrite
     P ( t | w ) = P ( t ) * P ( w | t ) /  P ( w )
This will be helpful when we take context into account.

Looking at the tag of the previous word
Call the tag of the previous word 'u'
Want to find argmax(t) P ( t | u, w )
Under independence assumptions, = argmax(t) P ( t | u ) * P ( w | t )

Derivation:
P ( t | u, w ) = P ( t ) P ( u, w | t ) / P ( u, w )
                    = P ( t ) P ( u | t ) P ( w | t ) / ( P ( u ) P ( w ) )  [ independence assumption ]
                    = P ( t | u ) P ( w | t ) / P ( w ) 
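
Since P ( w ) does not depend on t, it drops out of the argmax. A sketch of this decision rule with made-up probability tables (the tag set and all numbers below are invented; in practice both tables would be estimated from corpus counts):

    # Hypothetical estimates; in practice these come from corpus counts.
    p_tag_given_prev = {("DT", "NN"): 0.5, ("DT", "VB"): 0.1}        # P ( t | u )
    p_word_given_tag = {("NN", "can"): 0.002, ("VB", "can"): 0.004}  # P ( w | t )

    def best_tag(prev_tag, word, tagset=("NN", "VB")):
        # argmax(t) P ( t | u ) * P ( w | t ); P ( w ) is constant in t and is dropped
        return max(tagset,
                   key=lambda t: p_tag_given_prev.get((prev_tag, t), 0.0)
                                 * p_word_given_tag.get((t, word), 0.0))

    print(best_tag("DT", "can"))   # 'NN' with these toy numbers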

Taking into account both the previous and following tags in deciding the tag of the current word.
We want to find the most likely sequence of tags T = t1, t2, ... , tn
    given the sequence of words W = w1, w2, ... wn
I.e., find argmax(T) P ( T | W ) = argmax(T) P ( T ) P ( W | T ) / P ( W )
                                                = argmax(T) P ( T ) P ( W | T )
                                                = argmax(T) product(i) P ( ti | ti-1, ..., t1 ) P ( wi | ti, wi-1, ti-1, ..., w1, t1 )
which we will approximate as
                                                   argmax(T) product(i) P ( ti | ti-1 ) P ( wi | ti )
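
To make the approximation concrete, the sketch below scores a single candidate tag sequence as product(i) P ( ti | ti-1 ) P ( wi | ti ), computed in log space to avoid underflow; the probability tables and the '<s>' start tag are illustrative assumptions:

    import math

    # Toy tables; real values would be estimated from a tagged corpus.
    trans = {("<s>", "DT"): 0.4, ("DT", "NN"): 0.5, ("NN", "VBZ"): 0.3}         # P ( ti | ti-1 )
    emit = {("DT", "the"): 0.6, ("NN", "dog"): 0.001, ("VBZ", "barks"): 0.002}  # P ( wi | ti )

    def sequence_log_prob(words, tags):
        # log of product(i) P ( ti | ti-1 ) * P ( wi | ti ), with '<s>' before t1
        logp, prev = 0.0, "<s>"
        for w, t in zip(words, tags):
            logp += math.log(trans.get((prev, t), 1e-12))
            logp += math.log(emit.get((t, w), 1e-12))
            prev = t
        return logp

    print(sequence_log_prob(["the", "dog", "barks"], ["DT", "NN", "VBZ"]))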

Markov Model

Finite state networks with transition probabilities:
probability of next state depends only on current state ('Markov assumption')

By associating a particular word with each state, we get a probabilistic model of word sequences ...
a probabilistic generative model of sentences.
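
A toy generative sketch of such a chain, with one word per state (the vocabulary, transition probabilities, and the '<s>'/'</s>' boundary states are invented for illustration):

    import random

    # Toy word-level Markov chain: P ( next state | current state ); invented numbers.
    transitions = {
        "<s>":    [("the", 0.7), ("a", 0.3)],
        "the":    [("dog", 0.5), ("cat", 0.5)],
        "a":      [("dog", 0.6), ("cat", 0.4)],
        "dog":    [("barks", 0.8), ("</s>", 0.2)],
        "cat":    [("sleeps", 0.9), ("</s>", 0.1)],
        "barks":  [("</s>", 1.0)],
        "sleeps": [("</s>", 1.0)],
    }

    def generate():
        # Sample word by word; the next state depends only on the current state.
        state, words = "<s>", []
        while state != "</s>":
            options, weights = zip(*transitions[state])
            state = random.choices(options, weights=weights)[0]
            if state != "</s>":
                words.append(state)
        return " ".join(words)

    print(generate())   # e.g. "the dog barks"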

Hidden Markov Model (HMM)

Suppose we associate a part-of-speech with each state in a Markov Model.
We then associate an 'emission probability' P ( w | t )
    of emitting a particular word when in a particular state.
This is a Hidden Markov Model ...
    the sequence of words generated does not uniquely determine the sequence of states.
Training an HMM
Training an HMM is simple if we have a completely labeled corpus:
    we have marked the POS of each word.
We can then directly estimate both P ( ti | ti-1 ) and P ( wi | ti ) from corpus counts
    using the Maximum Likelihood Estimator.
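
A sketch of this counting step, assuming the labeled corpus is a list of sentences, each a list of (word, tag) pairs, with an assumed '<s>' tag before each sentence (this format is for illustration and is not Jet's):

    from collections import Counter, defaultdict

    def train_hmm(tagged_sentences):
        # Estimate P ( ti | ti-1 ) and P ( wi | ti ) by maximum likelihood from counts.
        trans_count = defaultdict(Counter)   # trans_count[prev_tag][tag]
        emit_count = defaultdict(Counter)    # emit_count[tag][word]
        for sentence in tagged_sentences:
            prev = "<s>"
            for word, tag in sentence:
                trans_count[prev][tag] += 1
                emit_count[tag][word] += 1
                prev = tag
        trans = {u: {t: c / sum(counts.values()) for t, c in counts.items()}
                 for u, counts in trans_count.items()}
        emit = {t: {w: c / sum(counts.values()) for w, c in counts.items()}
                for t, counts in emit_count.items()}
        return trans, emit

    # Tiny illustrative corpus.
    corpus = [[("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
              [("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ")]]
    trans, emit = train_hmm(corpus)
    print(trans["<s>"]["DT"], emit["NN"]["dog"])   # 1.0 0.5
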
Using an HMM ('decoding')
The argmax(T) given above corresponds to finding the most likely path through the model, given W.
The Viterbi algorithm provides a fast (linear in the number of tokens) solution to this task. (J&M p. 176)
It consists of a forward pass which computes probabilities,
and a backward pass which traces the most likely path.
In the forward pass,
    it builds a probability matrix viterbi [ number of POS states + 2, number of tokens + 2 ], and
    a back pointer matrix of the same size.
viterbi [ s, t ] = max (over all paths to [s,t]) of the probability of reaching state s at token t
(in the printed text the algorithm is given on p. 179, in the context of speech recognition;
in the new version of Chapter 8, available as an on-line update, there is a self-contained presentation of HMMs and part-of-speech tagging on pages 20-35).
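
A compact sketch of Viterbi decoding in this spirit, using log probabilities to avoid underflow; the nested-dict tables (same shape as in the training sketch above), the '<s>' start tag, the fixed tag set, and the omission of an explicit end state are simplifying assumptions, and the probability matrix is stored here as one dictionary of tag scores per token:

    import math

    def viterbi_decode(words, tags, trans, emit, start="<s>"):
        # Most likely tag sequence for `words` under a bigram HMM.
        def logp(table, a, b):
            return math.log(table.get(a, {}).get(b, 1e-12))

        # Forward pass: best log probability of reaching each tag at each token,
        # plus back pointers recording which previous tag achieved it.
        viterbi = [{t: logp(trans, start, t) + logp(emit, t, words[0]) for t in tags}]
        backptr = [{}]
        for i in range(1, len(words)):
            viterbi.append({})
            backptr.append({})
            for t in tags:
                best_prev = max(tags, key=lambda u: viterbi[i - 1][u] + logp(trans, u, t))
                backptr[i][t] = best_prev
                viterbi[i][t] = (viterbi[i - 1][best_prev] + logp(trans, best_prev, t)
                                 + logp(emit, t, words[i]))

        # Backward pass: trace the most likely path from the best final tag.
        best_last = max(tags, key=lambda t: viterbi[-1][t])
        path = [best_last]
        for i in range(len(words) - 1, 0, -1):
            path.append(backptr[i][path[-1]])
        return list(reversed(path))

    # Toy tables in the same nested-dict format as the training sketch above.
    trans = {"<s>": {"DT": 1.0}, "DT": {"NN": 1.0}, "NN": {"VBZ": 1.0}}
    emit = {"DT": {"the": 1.0}, "NN": {"dog": 0.5, "cat": 0.5},
            "VBZ": {"barks": 0.5, "sleeps": 0.5}}
    print(viterbi_decode(["the", "dog", "barks"], ["DT", "NN", "VBZ"], trans, emit))
    # ['DT', 'NN', 'VBZ']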

We review a simple example of Viterbi decoding as a PowerPoint presentation.

JET HMM POS Tagger

Uses the general HMM model classes within Jet.
File pos_hmm.txt: trained on 96% of the Penn Treebank corpus (960,000 words).
Weak on handling of unknown words (words not in training corpus)
    - determining which tags (open classes) are likely for unknown words
    - using morphology, esp. suffixes (even more important for highly inflected languages); see the sketch below
Limitations of bigram tagging (limited context)
    - problems, for example, after conjunctions
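
A minimal sketch of the suffix idea mentioned under unknown-word handling above, with invented suffix tables restricted to open-class tags (this illustrates the idea only and does not describe Jet's current behavior):

    # Invented suffix-to-tag guesses for unknown English words (open-class tags only).
    SUFFIX_TAG_GUESSES = {
        "ing": {"VBG": 0.7, "NN": 0.3},
        "ed":  {"VBD": 0.6, "VBN": 0.4},
        "ly":  {"RB": 0.9, "JJ": 0.1},
        "s":   {"NNS": 0.6, "VBZ": 0.4},
    }
    OPEN_CLASS_DEFAULT = {"NN": 0.5, "NNP": 0.3, "JJ": 0.2}

    def unknown_word_tag_guess(word):
        # Return a rough distribution over likely tags based on the longest matching suffix.
        for suffix in sorted(SUFFIX_TAG_GUESSES, key=len, reverse=True):
            if word.endswith(suffix):
                return SUFFIX_TAG_GUESSES[suffix]
        return OPEN_CLASS_DEFAULT

    print(unknown_word_tag_guess("glorping"))   # {'VBG': 0.7, 'NN': 0.3}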