February 13, 2003

- we don't know how to model it fully,

so we build simplified models which provide some approximation to natural language

How can we measure 'how good' these models are?

- we build a corpus,

annotate it by hand with respect to the phenomenon we are interested in,

and then compare it with the predictions of our model

- for example, how well the model predicts part-of-speech or syntactic structure

How to train the model?

- need a goodness metric

- train by hand, by adjusting rules and analyzing errors

- train automatically

* develop new rules

* build probabilistic model (generally very hard to do by hand)

Given a word w, what tag t should we assign to it?

We want to assign the tag which maximizes the number of correct assignments.

The probability of getting the assignment correct is P ( t | w )

So we want to assign the tag = argmax(t) P ( t | w )

We can estimate P ( t | w ) as

(number of times w is tagged as t in corpus) / (number of times w appears in corpus)

(this is the 'maximum likelihood estimator', J&M p. 200)

Using Bayes' rule, we can rewrite

P ( t | w ) = P ( t ) * P ( w | t ) / P ( w )

This will be helpful when we take context into account.
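The count-based estimate of P ( t | w ) can be sketched in a few lines of Python. This is only an illustration: the toy corpus and the names word_tag_counts, p_tag_given_word and best_tag are invented here and do not come from J&M or from any tagger.

```python
from collections import Counter, defaultdict

# Toy hand-tagged corpus, invented for illustration: (word, tag) pairs.
tagged = [("the", "DT"), ("can", "NN"), ("can", "MD"), ("can", "MD"),
          ("rust", "VB"), ("the", "DT")]

# word -> Counter of the tags it was given in the corpus
word_tag_counts = defaultdict(Counter)
for word, tag in tagged:
    word_tag_counts[word][tag] += 1

def p_tag_given_word(tag, word):
    """MLE: (times w is tagged as t) / (times w appears)."""
    counts = word_tag_counts.get(word, Counter())
    total = sum(counts.values())
    return counts[tag] / total if total else 0.0

def best_tag(word):
    """argmax(t) P(t | w); None for a word never seen in the corpus."""
    counts = word_tag_counts.get(word)
    return counts.most_common(1)[0][0] if counts else None
```

For the toy corpus, best_tag("can") picks MD, since 'can' is tagged MD in two of its three occurrences. A word outside the corpus gets no tag at all, which is the unknown-word problem.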

Call the tag of the previous word 'u'

Want to find argmax(t) P ( t | u, w)

Under independence assumptions, this is argmax(t) P ( t | u ) * P ( w | t )

Derivation:

P ( t | u, w ) = P ( t ) P ( u, w | t ) / P (u, w)

= P ( t ) P ( u | t ) P ( w | t ) / ( P ( u ) P ( w ) ) [ independence assumption ]

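= P ( t | u ) P ( w | t ) / P ( w ) [ since P ( t ) P ( u | t ) / P ( u ) = P ( t | u ) ]

P ( w ) does not depend on t, so it can be dropped from the argmax.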

We want to find the most likely sequence of tags T = t_{1}, t_{2}, ... , t_{n}

given the sequence of words W = w_{1}, w_{2}, ... , w_{n}

I.e., find argmax(T) P ( T | W ) = argmax(T) P ( T ) P ( W | T ) / P ( W )

= argmax(T) P ( T ) P ( W | T )

= argmax(T) product(i) P ( t_{i} | t_{i-1}, ... , t_{1} ) P ( w_{i} | t_{i}, w_{i-1}, t_{i-1}, ... , w_{1}, t_{1} )

which we will approximate as

argmax(T) product(i) P ( t_{i} | t_{i-1} ) P ( w_{i} | t_{i} )
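To make the approximated product concrete, here is a small Python sketch that scores one candidate tag sequence. All probability values are invented for illustration, and the start pseudo-tag "<s>" (standing in for t_{0}) is an assumption of this sketch.

```python
# Hypothetical model parameters, invented for illustration.
# transition[(u, t)] = P(t | u); "<s>" is an assumed start-of-sentence tag.
transition = {("<s>", "DT"): 0.5, ("DT", "NN"): 0.6, ("DT", "MD"): 0.01}
# emission[(t, w)] = P(w | t)
emission = {("DT", "the"): 0.4, ("NN", "can"): 0.002, ("MD", "can"): 0.05}

def sequence_score(words, tags, start="<s>"):
    """product(i) P(t_i | t_{i-1}) * P(w_i | t_i) for one tag sequence."""
    score = 1.0
    prev = start
    for word, tag in zip(words, tags):
        # Unseen transitions or emissions get probability 0 (no smoothing).
        score *= transition.get((prev, tag), 0.0) * emission.get((tag, word), 0.0)
        prev = tag
    return score
```

With these made-up numbers, 'the can' scores higher as DT NN than as DT MD, because the DT-to-NN transition outweighs the higher emission probability of 'can' as MD. Finding the argmax by scoring every possible T this way is exponential in sentence length.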

Finite state networks with transition probabilities:

probability of next state depends only on current state ('Markov assumption')

By associating a particular word with each state, we get a probabilistic model of word sequences ...

a probabilistic generative model of sentences.
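A minimal Python sketch of such a word-state Markov model. The vocabulary, the transition probabilities, and the "<s>" / "</s>" start and end states are all invented for illustration.

```python
import random

# Hypothetical network: each state is a word; values are P(next | current).
transitions = {
    "<s>": {"the": 1.0},
    "the": {"dog": 0.5, "cat": 0.5},
    "dog": {"barks": 0.7, "</s>": 0.3},
    "cat": {"sleeps": 0.6, "</s>": 0.4},
    "barks": {"</s>": 1.0},
    "sleeps": {"</s>": 1.0},
}

def generate(rng, max_len=20):
    """Random walk from <s>; the next state depends only on the current one."""
    state, words = "<s>", []
    while len(words) < max_len:
        nxt = transitions[state]
        state = rng.choices(list(nxt), weights=list(nxt.values()))[0]
        if state == "</s>":
            break
        words.append(state)
    return words

def sentence_probability(words):
    """Probability the model assigns to a complete sentence."""
    p, state = 1.0, "<s>"
    for w in words + ["</s>"]:
        p *= transitions.get(state, {}).get(w, 0.0)
        state = w
    return p
```

Calling generate(random.Random(0)) produces one random sentence, and sentence_probability gives the probability the same network assigns to any word sequence (zero if it uses a transition that is not in the network).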

Suppose we associate a part-of-speech with each state in a Markov Model.

We then associate an 'emission probability' P ( w | t )

of emitting a particular word when in a particular state.

This is a hidden Markov Model ...

the sequence of words generated does not uniquely determine the sequence of states.

Training an HMM is simple if we have a completely labeled corpus:

we have marked the POS of each word.

We can then directly estimate both P ( t_{i} | t_{i-1} ) and P ( w_{i} | t_{i} ) from corpus counts

using the Maximum Likelihood Estimator.
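A sketch of this counting in Python, assuming the corpus is a list of sentences, each a list of (word, tag) pairs. The function name train_hmm, the "<s>" start tag, and the toy corpus are assumptions of this sketch, and no smoothing is applied.

```python
from collections import Counter

def train_hmm(tagged_sentences, start="<s>"):
    """Estimate P(t_i | t_{i-1}) and P(w_i | t_i) by relative frequency."""
    trans_counts = Counter()  # (previous tag, tag) -> count
    emit_counts = Counter()   # (tag, word) -> count
    prev_totals = Counter()   # tag -> times it occurs as a previous tag
    tag_totals = Counter()    # tag -> times it occurs at all
    for sentence in tagged_sentences:
        prev = start
        for word, tag in sentence:
            trans_counts[(prev, tag)] += 1
            prev_totals[prev] += 1
            emit_counts[(tag, word)] += 1
            tag_totals[tag] += 1
            prev = tag
    transition = {k: c / prev_totals[k[0]] for k, c in trans_counts.items()}
    emission = {k: c / tag_totals[k[0]] for k, c in emit_counts.items()}
    return transition, emission

# Toy labeled corpus, invented for illustration.
corpus = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VB")],
    [("the", "DT"), ("can", "NN")],
]
transition, emission = train_hmm(corpus)
```

Any transition or word/tag pair that never occurs in the corpus is simply absent from the tables, i.e. has estimated probability zero.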

The argmax(T) given above corresponds to finding the most likely path through the model, given W.

The Viterbi algorithm is a fast algorithm for this task (linear in the number of tokens). (J&M p. 176)

It consists of a forward pass which computes probabilities,

and a backward pass which traces the most likely path.

In the forward pass,

it builds a probability matrix viterbi [number of states+2, number of tokens + 2], and

a back pointer matrix of the same size.

viterbi [ s, t ] = max (over all paths to [s,t]) of the probability of reaching state s at token t

(the algorithm is given on p. 179)
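The two passes can be sketched in Python as follows. This is not the J&M pseudocode: it uses one dictionary per token instead of a [number of states+2, number of tokens+2] matrix, handles the start state with an assumed "<s>" pseudo-tag, and has no explicit end state. The model parameters are invented for illustration.

```python
def viterbi(words, tags, transition, emission, start="<s>"):
    """Most likely tag sequence for words under a bigram HMM."""
    if not words:
        return []
    # prob[i][t]: probability of the best path ending in tag t at token i
    # back[i][t]: the previous tag on that best path
    prob = [{} for _ in words]
    back = [{} for _ in words]
    # Forward pass: fill in probabilities and back pointers.
    for t in tags:
        prob[0][t] = transition.get((start, t), 0.0) * emission.get((t, words[0]), 0.0)
        back[0][t] = None
    for i in range(1, len(words)):
        for t in tags:
            best_prev = max(tags, key=lambda u: prob[i - 1][u] * transition.get((u, t), 0.0))
            prob[i][t] = (prob[i - 1][best_prev]
                          * transition.get((best_prev, t), 0.0)
                          * emission.get((t, words[i]), 0.0))
            back[i][t] = best_prev
    # Backward pass: follow back pointers from the best final tag.
    path = [max(tags, key=lambda t: prob[-1][t])]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    path.reverse()
    return path

# Hypothetical parameters, invented for illustration.
tags = ["DT", "NN", "MD"]
transition = {("<s>", "DT"): 0.5, ("DT", "NN"): 0.6, ("DT", "MD"): 0.01}
emission = {("DT", "the"): 0.4, ("NN", "can"): 0.002, ("MD", "can"): 0.05}
best_path = viterbi(["the", "can"], tags, transition, emission)
```

Each token costs (number of tags)^2 steps, so the run time is linear in the number of tokens. For long sentences the raw products underflow; summing log probabilities instead is the usual remedy.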

Uses the general HMM model classes within Jet.

File pos_hmm.txt: trained on 96% of Penn Tree Bank corpus (960,000 words).

Weak on handling of unknown words (words not in training corpus)

- determining which tags (open classes) are likely for unknown words

- using morphology, esp. suffixes (even more important for more inflected languages)

Limitations of bigram tagging (limited context)