February 19, 2013
(Discuss Asgn 2 -- adding pronoun constraints to the grammar)
Given a word w, what tag t should we assign to it?
We want to assign the tag which maximizes the probability of a correct assignment.
The probability of getting the assignment correct is P ( t | w )
So we want to assign the tag = argmax(t) P ( t | w )
We can estimate P ( t | w ) as
(number of times w is tagged as t in corpus) / (number of times w appears in corpus)
(this is the 'maximum likelihood estimator', J&M p. 88)
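
As a toy illustration, this unigram MLE tagger can be written in a few lines
of Python (the mini-corpus and all names here are invented for illustration):

    from collections import Counter, defaultdict

    # toy corpus: a list of (word, tag) pairs from a hand-tagged text
    corpus = [("the", "DT"), ("can", "NN"), ("can", "MD"),
              ("rusted", "VBD"), ("the", "DT"), ("can", "NN")]

    word_tag_counts = defaultdict(Counter)   # counts of w tagged as t
    for w, t in corpus:
        word_tag_counts[w][t] += 1

    def most_likely_tag(w):
        # argmax(t) P(t|w), estimated as count(w tagged t) / count(w);
        # the denominator is the same for every t, so comparing counts suffices
        return word_tag_counts[w].most_common(1)[0][0]

    print(most_likely_tag("can"))   # -> 'NN' (2 of its 3 occurrences)
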
Using Bayes' rule, we can rewrite
P ( t | w ) = P ( t ) * P ( w | t ) / P ( w )
This will be helpful when we take context into account.
Call the tag of the previous word 'u'
Want to find argmax(t) P ( t | u, w)
Under independence assumptions, this equals argmax(t) P ( t | u ) * P ( w | t ):
P ( t | u, w ) = P ( t ) P ( u, w | t ) / P ( u, w )
= P ( t ) P ( u | t ) P ( w | t ) / ( P ( u ) P ( w ) )   [ assuming u and w are independent, both on their own and given t ]
= P ( t | u ) P ( w | t ) / P ( w )   [ since P ( t ) P ( u | t ) / P ( u ) = P ( t | u ) by Bayes' rule ]
and P ( w ) can be dropped from the argmax because it does not depend on t.
We want to find the most likely sequence of tags T = t1, t2, ... , tn
given the sequence of words W = w1, w2, ..., wn.
I.e., find argmax(T) P ( T | W ) = argmax(T) P ( T ) P ( W | T ) / P ( W )
= argmax(T) P ( T ) P ( W | T )   [ P ( W ) does not depend on T ]
= argmax(T) product(i) P ( ti | ti-1, ..., t1 ) P ( wi | ti, wi-1, ti-1, ..., w1, t1 )   [ chain rule ]
which we will approximate as
argmax(T) product(i) P ( ti | ti-1 ) P ( wi | ti )
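
Before turning to the search over tag sequences, we can at least score one
candidate sequence under this approximation. A minimal sketch, assuming
made-up probability tables and a START boundary tag (not defined above);
log probabilities avoid underflow on long sentences:

    import math

    # illustrative bigram HMM parameters (invented numbers)
    trans = {("START", "DT"): 0.6, ("DT", "NN"): 0.5, ("NN", "VBZ"): 0.3}
    emit  = {("DT", "the"): 0.4, ("NN", "dog"): 0.01, ("VBZ", "barks"): 0.005}

    def score(tags, words):
        # log of product(i) P( ti | ti-1 ) * P( wi | ti )
        logp, prev = 0.0, "START"
        for t, w in zip(tags, words):
            logp += math.log(trans[(prev, t)]) + math.log(emit[(t, w)])
            prev = t
        return logp

    print(score(["DT", "NN", "VBZ"], ["the", "dog", "barks"]))
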
Finite state networks with transition probabilities:
probability of next state depends only on current state ('Markov assumption')
By associating a particular word with each state, we get a probabilistic model of word sequences ...
a probabilistic generative model of sentences.
as shown by a Markov Model generating sentences about my cat and dog.
Suppose we associate a part-of-speech with each state in a Markov Model.
We then associate an 'emission probability' P ( w | t )
of emitting a particular word when in a particular state.
This is a Hidden Markov Model ...
the sequence of words generated does not uniquely determine the sequence of states.
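
To make the generative story concrete, here is a toy sampler in the spirit of
the cat-and-dog example; every table and probability below is invented:

    import random

    # toy HMM: states are POS tags; START and END are assumed boundary states
    transitions = {"START": [("PRP$", 1.0)],
                   "PRP$":  [("NN", 1.0)],
                   "NN":    [("VBZ", 0.5), ("END", 0.5)],
                   "VBZ":   [("END", 1.0)]}
    emissions   = {"PRP$": [("my", 1.0)],
                   "NN":   [("cat", 0.5), ("dog", 0.5)],
                   "VBZ":  [("sleeps", 1.0)]}

    def sample(dist):
        r, acc = random.random(), 0.0
        for item, p in dist:
            acc += p
            if r < acc:
                return item
        return dist[-1][0]

    def generate():
        state, words = "START", []
        while True:
            state = sample(transitions[state])      # P( next state | state )
            if state == "END":
                return words
            words.append(sample(emissions[state]))  # emission P( w | t )

    print(generate())   # e.g. ['my', 'dog', 'sleeps']

An observer sees only the emitted words; the state (tag) sequence stays hidden.
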
Training an HMM is simple if we have a completely labeled corpus:
we have marked the POS of each word.
We can then directly estimate both P ( ti | ti-1 ) and P ( wi | ti ) from corpus counts
using the Maximum Likelihood Estimator.
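
A sketch of this count-based training (the two-sentence 'corpus' and the
assumed START boundary tag are illustrative):

    from collections import Counter

    # tagged: a list of sentences, each a list of (word, tag) pairs
    tagged = [[("my", "PRP$"), ("cat", "NN"), ("sleeps", "VBZ")],
              [("my", "PRP$"), ("dog", "NN"), ("barks", "VBZ")]]

    trans_counts, prev_counts = Counter(), Counter()
    emit_counts, tag_counts = Counter(), Counter()
    for sent in tagged:
        prev = "START"
        for w, t in sent:
            trans_counts[(prev, t)] += 1
            prev_counts[prev] += 1
            emit_counts[(t, w)] += 1
            tag_counts[t] += 1
            prev = t

    def p_trans(prev, t):   # MLE estimate of P( ti | ti-1 )
        return trans_counts[(prev, t)] / prev_counts[prev] if prev_counts[prev] else 0.0

    def p_emit(t, w):       # MLE estimate of P( wi | ti )
        return emit_counts[(t, w)] / tag_counts[t]

    print(p_trans("PRP$", "NN"), p_emit("NN", "cat"))   # -> 1.0 0.5

(In practice these estimates must be smoothed; unseen transitions and
emissions otherwise get probability zero.)
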
The argmax(T) given above corresponds to finding the most likely path through the model, given W.
The Viterbi algorithm provides a fast (linear in number of tokens) algorithm for this task. (J&M sec. 5.5.3 and 6.4)
It consists of a forward pass which computes probabilities,
and a backward pass which traces the most likely path.
In the forward pass,
it builds a probability matrix viterbi[number of POS states + 2, number of tokens + 2]
and a back-pointer matrix of the same size.
viterbi [ s, t ] = max (over all paths to [s,t]) of the probability of reaching state s at token t
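
A compact sketch of the decoder (not Jet's implementation), using
dictionaries instead of the matrices described above, log probabilities,
and the p_trans / p_emit functions from the training sketch:

    import math

    def logp(x):
        return math.log(x) if x > 0 else float("-inf")

    def viterbi(words, tags, p_trans, p_emit):
        # V[i][t]: log probability of the best path reaching tag t at token i
        # back[i][t]: the previous tag on that best path
        V = [{t: logp(p_trans("START", t)) + logp(p_emit(t, words[0]))
              for t in tags}]
        back = [{}]
        for i in range(1, len(words)):                   # forward pass
            V.append({}); back.append({})
            for t in tags:
                u = max(tags, key=lambda u: V[i-1][u] + logp(p_trans(u, t)))
                V[i][t] = V[i-1][u] + logp(p_trans(u, t)) + logp(p_emit(t, words[i]))
                back[i][t] = u
        t = max(tags, key=lambda u: V[-1][u])            # backward pass
        path = [t]
        for i in range(len(words) - 1, 0, -1):
            t = back[i][t]
            path.append(t)
        return list(reversed(path))

    tags = ["PRP$", "NN", "VBZ"]
    print(viterbi(["my", "dog", "sleeps"], tags, p_trans, p_emit))
    # -> ['PRP$', 'NN', 'VBZ']

The running time is linear in the number of tokens (and quadratic in the
number of tags).
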
We review a simple example of Viterbi decoding,
using the general HMM model classes within Jet.
File pos_hmm.txt: trained on 96% of the Penn Treebank corpus (960,000 words).
It is weak in its handling of unknown words (words not in the training corpus).
Limitations of bigram tagging (limited context)
- determining which tags (among the open classes) are likely for unknown words
- using morphology, esp. suffixes (e.g., '-ous' --> adjective), helps here (see the sketch after this list)
  (this is even more important for more highly inflected languages)
- problems due to the limited context, for example after conjunctions
- the typical performance of this and other POS taggers (96-97%) is deceptive: it is measured by training and testing on very similar corpora (typically, Wall Street Journal text)
- performance when trained and tested on corpora from different domains or genres is much lower (about 88% when trained on WSJ and tested on medical article abstracts)
- this is a serious problem because we cannot afford to tag a million words for every type of corpus we want to process
- fortunately the effect can be reduced with only a small amount of tagged data in the target domain (to be discussed when we consider MaxEnt taggers)
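
A toy version of the suffix-based remedy mentioned above; the suffix table
is invented, and a real tagger would use such features to weight the
probabilities of the open-class tags rather than decide outright:

    # toy suffix -> tag guesser for unknown words (illustrative table)
    SUFFIX_TAGS = [("ous", "JJ"), ("tion", "NN"), ("ly", "RB"),
                   ("ed", "VBD"), ("s", "NNS")]

    def guess_tag(word, default="NN"):
        for suffix, tag in SUFFIX_TAGS:
            if word.lower().endswith(suffix):
                return tag
        return default            # fall back to an open-class default

    print(guess_tag("tremendous"))   # -> 'JJ'
    print(guess_tag("blorf"))        # -> 'NN'
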