### Lecture 4 Outline

February 14, 2005

## Part-of-Speech Tagging Using Hidden Markov Models

### Corpus-Based Methods

Natural language is very complex
- we don't know how to model it fully,
so we build simplified models which provide some approximation to natural language
How can we measure 'how good' these models are?
- we build a corpus,
annotate it by hand with respect to the phenomenon we are interested in,
and then compare the hand annotations with the predictions of our model
- for example, how well the model predicts part-of-speech or syntactic structure
How do we train the model?
- we need a goodness metric
- train by hand, by adjusting rules and analyzing errors
- train automatically
* develop new rules
* build probabilistic model (generally very hard to do by hand)

### Statistical Part-of-Speech Tagging (J&M sec 8.5)

##### Looking at words in isolation:
Given a word w, what tag t should we assign to it?
We want to assign the tag which maximizes the number of correct assignments.
The probability of getting the assignment correct is  P ( t | w )
So we want to assign the tag = argmax(t) P ( t | w )
We can estimate P ( t | w ) as
(number of times w is tagged as t in corpus) / (number of times w appears in corpus)
(this is the 'maximum likelihood estimator', J&M p. 200)
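As a concrete sketch, this maximum likelihood estimate can be computed directly from corpus counts. The tiny tagged corpus below is invented purely for illustration:

```python
from collections import Counter

# A tiny hand-tagged corpus of (word, tag) pairs -- invented for
# illustration; a real corpus would be far larger.
corpus = [("the", "DT"), ("can", "NN"), ("can", "MD"), ("can", "NN"),
          ("rusted", "VBD"), ("the", "DT"), ("dog", "NN")]

word_tag_counts = Counter(corpus)             # count(w tagged as t)
word_counts = Counter(w for w, _ in corpus)   # count(w)

def p_tag_given_word(t, w):
    """MLE of P(t | w): count(w tagged t) / count(w)."""
    return word_tag_counts[(w, t)] / word_counts[w]

def best_tag(w):
    """argmax over t of P(t | w), among tags seen with w in the corpus."""
    tags = {t for (w2, t) in word_tag_counts if w2 == w}
    return max(tags, key=lambda t: p_tag_given_word(t, w))
```

Here "can" is tagged NN twice and MD once, so P(NN | can) = 2/3 and the tagger picks NN.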

Using Bayes' rule, we can rewrite
P ( t | w ) = P ( t ) * P ( w | t ) /  P ( w )
This will be helpful when we take context into account.

##### Looking at the tag of the previous word
Call the tag of the previous word 'u'
Want to find argmax(t) P ( t | u, w)
Under independence assumptions, = argmax(t) P ( t | u ) * P ( w | t )

Derivation:
P ( t | u, w ) = P ( t ) P ( u, w | t ) / P ( u, w )
= P ( t ) P ( u | t ) P ( w | t ) / ( P ( u ) P ( w ) )  [ independence assumption ]
= P ( t | u ) P ( w | t ) / P ( w )
Since P ( w ) does not depend on t, it can be dropped from the argmax.
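The resulting bigram decision rule can be sketched as follows; the probability tables are invented for illustration, not taken from any real corpus:

```python
# Hypothetical probability tables for illustration; in practice these
# would be estimated from corpus counts.
p_t_given_u = {("DT", "NN"): 0.5, ("DT", "VB"): 0.1}     # P(t | previous tag u)
p_w_given_t = {("NN", "book"): 0.01, ("VB", "book"): 0.005}  # P(w | t)

def best_tag_bigram(u, w, tags=("NN", "VB")):
    """argmax over t of P(t | u) * P(w | t); P(w) is constant in t."""
    return max(tags, key=lambda t: p_t_given_u.get((u, t), 0.0)
                                   * p_w_given_t.get((t, w), 0.0))
```

With these toy numbers, "book" after a DT scores 0.5 * 0.01 = 0.005 as NN versus 0.1 * 0.005 = 0.0005 as VB, so NN wins.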

##### Taking into account both the previous and following tags in deciding the tag of the current word
We want to find the most likely sequence of tags T = t1, t2, ... , tn
given the sequence of words W = w1, w2, ... wn
I.e., find argmax(T) P ( T | W ) = argmax(T) P ( T ) P ( W | T ) / P ( W )
= argmax(T) P ( T ) P ( W | T )
= argmax(T) product(i) P ( ti | ti-1, ..., t1 ) P ( wi | ti, wi-1, ti-1, ..., w1, t1 )
which we will approximate as
argmax(T) product(i) P ( ti | ti-1 ) P ( wi | ti )

### Markov Model

Finite state networks with transition probabilities:
probability of next state depends only on current state ('Markov assumption')

By associating a particular word with each state, we get a probabilistic model of word sequences ...
a probabilistic generative model of sentences.
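As a minimal sketch of such a generative model (all states, words, and probabilities below are invented), each step samples the next state from the current state's transition distribution only:

```python
import random

# A toy Markov model over words: each state is identified with its word,
# and maps to a distribution over next states. Values are invented.
transitions = {
    "<s>": {"the": 0.7, "a": 0.3},
    "the": {"dog": 0.5, "cat": 0.5},
    "a":   {"dog": 0.5, "cat": 0.5},
    "dog": {"</s>": 1.0},
    "cat": {"</s>": 1.0},
}

def generate(seed=None):
    """Sample a word sequence; the next state depends only on the current state."""
    rng = random.Random(seed)
    state, words = "<s>", []
    while state != "</s>":
        state = rng.choices(list(transitions[state]),
                            weights=list(transitions[state].values()))[0]
        if state != "</s>":
            words.append(state)
    return words
```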

### Hidden Markov Model (HMM)

Suppose we associate a part-of-speech with each state in a Markov Model.
We then associate an 'emission probability' P ( w | t )
of emitting a particular word when in a particular state.
This is a hidden Markov Model ...
the sequence of words generated does not uniquely determine the sequence of states.
##### Training an HMM
Training an HMM is simple if we have a completely labeled corpus:
one in which the POS of each word has been marked.
We can then directly estimate both P ( ti | ti-1 ) and P ( wi | ti ) from corpus counts
using the Maximum Likelihood Estimator.
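A count-based training sketch, assuming a toy labeled corpus (all words and tags below are invented; "&lt;s&gt;" is an assumed sentence-start tag):

```python
from collections import Counter

# Toy labeled corpus: sentences of (word, tag) pairs, invented for illustration.
sentences = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
    [("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ")],
]

trans = Counter()      # count(t_{i-1}, t_i), with "<s>" as the start tag
emit = Counter()       # count(t, w)
tag_count = Counter()  # count(t)

for sent in sentences:
    prev = "<s>"
    tag_count["<s>"] += 1
    for w, t in sent:
        trans[(prev, t)] += 1
        emit[(t, w)] += 1
        tag_count[t] += 1
        prev = t

def p_trans(prev, t):
    """MLE of P(ti | ti-1): count(ti-1, ti) / count(ti-1)."""
    return trans[(prev, t)] / tag_count[prev]

def p_emit(t, w):
    """MLE of P(wi | ti): count(ti, wi) / count(ti)."""
    return emit[(t, w)] / tag_count[t]
```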
##### Using an HMM ('decoding')
The argmax(T) given above corresponds to finding the most likely path through the model, given W.
The Viterbi algorithm solves this task in time linear in the number of tokens. (J&M p. 176)
It consists of a forward pass which computes probabilities,
and a backward pass which traces the most likely path.
In the forward pass,
it builds a probability matrix viterbi [number of POS states+2, number of tokens + 2], and
a back pointer matrix of the same size.
viterbi [ s, t ] = max (over all paths to [s,t]) of the probability of reaching state s at token t
(in the printed text the algorithm is given on p. 179, in the context of speech recognition;
in the new version of Chapter 8, available as an on-line update, there is a self-contained presentation of HMMs and part-of-speech tagging on pages 20-35).
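A minimal sketch of Viterbi decoding for a bigram HMM (the p_trans and p_emit arguments are assumed to be probability functions, e.g. MLE estimates from corpus counts; for brevity the probability and backpointer tables are lists of dicts rather than the (states+2) x (tokens+2) matrices in the text):

```python
def viterbi(words, tags, p_trans, p_emit, start="<s>"):
    """Most likely tag sequence under a bigram HMM:
    a forward pass fills probability and backpointer tables,
    then a backward pass traces the most likely path."""
    # Forward pass: V[i][t] = max prob of any path ending in tag t at token i.
    V = [{t: p_trans(start, t) * p_emit(t, words[0]) for t in tags}]
    back = [{}]
    for i in range(1, len(words)):
        V.append({})
        back.append({})
        for t in tags:
            best_prev = max(tags, key=lambda u: V[i-1][u] * p_trans(u, t))
            V[i][t] = V[i-1][best_prev] * p_trans(best_prev, t) * p_emit(t, words[i])
            back[i][t] = best_prev
    # Backward pass: trace pointers from the best final state.
    last = max(tags, key=lambda t: V[-1][t])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path))
```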

We review a simple example of Viterbi decoding as a PowerPoint presentation.

### JET HMM POS Tagger

Uses the general HMM model classes within Jet.
File pos_hmm.txt:  trained on 96% of the Penn Treebank corpus (960,000 words).
Weak on handling of unknown words (words not in the training corpus)
- determining which tags (open classes) are likely for unknown words
- using morphology, esp. suffixes (even more important for more inflected languages)
Limitations of bigram tagging (limited context)
- problems, for example, after conjunctions