For example, if we own a random pig, and squeeze it 10 times, and 8 times it goes "oink" and 2 times it goes "ouch", and we assume it is a stateless pig, the maximum likelihood estimate (MLE) tells us to model the pig as a random emitter with P("oink") = 0.8 and P("ouch") = 0.2. Simply, P(x) = count(x) / total number of trials.
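
As a concrete (if silly) illustration, here is that count-and-divide estimate in Python, using the ten squeezes above as the data:

```python
# MLE for the stateless pig: P(x) = count(x) / total number of trials
from collections import Counter

trials = ["oink"] * 8 + ["ouch"] * 2      # the 10 observed squeezes
counts = Counter(trials)
total = sum(counts.values())

mle = {outcome: c / total for outcome, c in counts.items()}
print(mle)   # {'oink': 0.8, 'ouch': 0.2}
```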

Part of speech taggers are generally trained on annotated corpora. In terms of an HMM model, this means that we know, for each sentence, the sequence of states the HMM went through to produce this sentence. Thus both the emission and transition probabilities can be computed directly using the MLE.
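
For instance, here is a small sketch (with an invented two-sentence corpus) of how the transition and emission probabilities would be counted from annotated data:

```python
# Supervised HMM estimation from a tagged corpus; the toy corpus is invented.
# Both tables are plain MLE estimates, as described above.
from collections import defaultdict, Counter

# Each sentence is a list of (word, tag) pairs; the tags are the HMM states.
corpus = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
    [("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ")],
]

transition_counts = defaultdict(Counter)   # counts of tag -> next tag
emission_counts = defaultdict(Counter)     # counts of tag -> word

for sentence in corpus:
    tags = [tag for _, tag in sentence]
    for (word, tag) in sentence:
        emission_counts[tag][word] += 1
    for prev, nxt in zip(tags, tags[1:]):
        transition_counts[prev][nxt] += 1

# MLE: normalize each row of counts into a probability distribution.
def normalize(counter):
    total = sum(counter.values())
    return {k: v / total for k, v in counter.items()}

transition_probs = {t: normalize(c) for t, c in transition_counts.items()}
emission_probs = {t: normalize(c) for t, c in emission_counts.items()}

print(transition_probs["DT"])   # {'NN': 1.0}
print(emission_probs["NN"])     # {'dog': 0.5, 'cat': 0.5}
```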

Can we train the HMM from an unannotated corpus? The goal is the same ... to select the parameters maximizing the likelihood of the training corpus. But the procedure is not so direct -- the emission probabilities we compute depend on what we assume about the state the HMM is in for each word, which is no longer explicitly given. There is, in fact, no closed-form solution to this problem in general. We must instead use an Expectation Maximization (EM) method, which is essentially an iterative, hill-climbing method for setting the parameters.

The specific form of EM for HMMs is the Baum-Welch or Forward-Backward algorithm (J&M Appendix D; M&S p. 333; Charniak p. 63). [We follow here the notation of J&M.] We want to estimate the transition probabilities a_{ij}:

a_{ij} = count(transitions from state i to state j) / count(transitions from state i to any state)

What we'll do is assume some initial values for the a_{ij} and b_{j}(w), and then re-estimate them, replacing the counts by expected values computed under the current model:

a'_{ij} = expected value(transitions from state i to state j) / expected value(transitions from state i to any state)

To compute the expected values, we use the forward probabilities alpha and backward probabilities beta, which are computed by dynamic programming in the same way as the Viterbi algorithm (but summing over paths rather than maximizing). Assume the input is w_{1}, ..., w_{T}, and let tau_{t}(i, j) be the probability of taking the transition from state i at time t to state j at time t+1, given the input:

tau_{t}(i, j) = alpha_{i}(t) a_{ij} b_{j}(w_{t+1}) beta_{j}(t+1) / alpha_{N}(T)

then

a'_{ij} = sum_{t=1 to T-1} tau_{t}(i, j) / sum_{t=1 to T-1} sum_{j=1 to N} tau_{t}(i, j)

Updated values for the emission probabilities b_{j}(w) are obtained in a similar fashion, from the expected number of times state j emits the word w.
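
To make the bookkeeping concrete, here is a sketch of the transition re-estimation step, assuming the forward and backward probabilities have already been computed and stored as arrays alpha[t, i] and beta[t, j] (corresponding to alpha_{i}(t) and beta_{j}(t) above); the array layout and function name are ours, not from any particular toolkit:

```python
# Sketch of one Baum-Welch transition update. alpha[t, i] and beta[t, j] are
# the forward and backward probabilities (arrays of shape (T, N)); a is the
# current N x N transition matrix, b the N x V emission matrix, and words the
# input encoded as a list of T vocabulary indices. These names are illustrative.
import numpy as np

def reestimate_transitions(alpha, beta, a, b, words):
    T, N = alpha.shape
    prob_of_input = alpha[T - 1].sum()          # total probability of the input; alpha_{N}(T)

    # tau[t, i, j]: probability of the transition i -> j at time t, given the input
    tau = np.zeros((T - 1, N, N))
    for t in range(T - 1):
        for i in range(N):
            for j in range(N):
                tau[t, i, j] = (alpha[t, i] * a[i, j]
                                * b[j, words[t + 1]] * beta[t + 1, j]) / prob_of_input

    # a'_{ij} = sum_t tau_t(i,j) / sum_t sum_j tau_t(i,j)
    numerator = tau.sum(axis=0)                          # expected i -> j transitions
    denominator = numerator.sum(axis=1, keepdims=True)   # expected transitions out of i
    return numerator / denominator
```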

Like any hill-climbing algorithm, it only finds a local maximum, so it is important to have a good starting point (or try multiple starting points). Also, there is the danger of overfitting, particularly if we are trying to train too many parameters from too little data.

Doug Cutting, Julian Kupiec, Jan Pedersen, Penelope Sibun. A Practical Part-of-Speech Tagger. ANLP 1992.

Presentation by Yusuke Shinyama.

One critical advantage of Kupiec's tagger was its use of 'ambiguity classes': infrequent words are grouped together based on their sets of possible parts of speech. This greatly reduces the number of parameters to be estimated.
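
A toy sketch of the idea (the dictionary, counts, and frequency threshold below are invented for illustration):

```python
# Kupiec-style ambiguity classes: infrequent words are pooled by the *set* of
# tags the dictionary allows, so parameters are estimated per class rather
# than per word.
from collections import Counter

dictionary = {"run": {"NN", "VB"}, "walk": {"NN", "VB"},
              "record": {"NN", "VB"}, "the": {"DT"}}
word_freq = Counter({"the": 500, "run": 3, "walk": 2, "record": 1})
THRESHOLD = 10   # words seen fewer than this many times share a class

def ambiguity_class(word):
    if word_freq[word] >= THRESHOLD:
        return word                               # frequent words keep their own parameters
    return "|".join(sorted(dictionary[word]))     # e.g. "NN|VB"

print(ambiguity_class("the"))   # 'the'
print(ambiguity_class("run"))   # 'NN|VB' -- shared with 'walk' and 'record'
```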

Bernard Merialdo. Tagging English Text with a Probabilistic Model. CL 20, 2 (1994).

Brill developed an unsupervised version of his TBL POS tagger. The unsupervised version acts as a constraint tagger (to be discussed shortly), eliminating rather than changing tags. A typical transformation might say change [X,Y] to Y in context C, where X and Y are part of speech tags. Words are initially assigned all their possible parts of speech, based on a dictionary, as in other supervised methods. Transformations are scored by the formula

incontext(Y, C) - (freq(Y) / freq(X)) * incontext(X, C)

where freq(A) is the number of words in the corpus unambiguously tagged with part-of-speech A, and incontext(A, C) is the number of words unambiguously tagged with part-of-speech A in context C. In other words, we favor transformations which are validated by lots of unambiguous examples in the corpus. When some tagged data is available, Brill first applies his unsupervised tagger, and then his supervised tagger (to the output of the unsupervised tagger), so there is no problem of the unsupervised tagger making the supervised results worse.
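
In code, the score of a candidate transformation "change [X, Y] to Y in context C" might be computed as follows; the corpus statistics here are invented stand-ins:

```python
# Toy sketch of Brill's scoring formula for a candidate transformation
# "change [X, Y] to Y in context C". The counts below are illustrative.

# freq[A]: number of words in the corpus unambiguously tagged A
freq = {"NN": 1000, "VB": 400}

# incontext[(A, C)]: number of words unambiguously tagged A in context C
incontext = {("VB", "after_to"): 300, ("NN", "after_to"): 20}

def score(X, Y, C):
    """incontext(Y, C) - (freq(Y) / freq(X)) * incontext(X, C)"""
    return incontext.get((Y, C), 0) - (freq[Y] / freq[X]) * incontext.get((X, C), 0)

# How strongly does the unambiguous evidence favor VB over NN after "to"?
print(score("NN", "VB", "after_to"))   # 300 - (400/1000) * 20 = 292.0
```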

Eric Brill. Unsupervised Learning of Disambiguation Rules for Part of Speech Tagging. Third WVLC, 1995.

- -ing forms
- nonfinite -ed forms
- noun-adjective homographs

Their tagger begins with a dictionary look-up which assigns each word all of its possible parts of speech. Then a set of 'constraint rules' is applied; each rule looks for a particular pattern and eliminates a particular part of speech. These rules were all written by hand, using the annotated corpus to check their correctness. The final ENGCG (English constraint grammar) system had 3600 rules! The rules do not eliminate all ambiguity; some words (4-7%) are left with multiple tags.
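
The flavor of such a rule can be sketched as follows; the dictionary and the rule itself are invented for illustration, and ENGCG's actual rule formalism is far richer:

```python
# A toy constraint rule in the spirit of ENGCG: start from all dictionary
# tags and eliminate the tags that a hand-written pattern rules out.
dictionary = {"that": {"DT", "IN", "WDT"}, "I": {"PRP"}}

def no_determiner_before_pronoun(tags, next_tags):
    """Eliminate DT if the following word can only be a personal pronoun."""
    if next_tags == {"PRP"}:
        return tags - {"DT"}
    return tags

sentence = ["that", "I"]
tag_sets = [set(dictionary[w]) for w in sentence]
tag_sets[0] = no_determiner_before_pronoun(tag_sets[0], tag_sets[1])
print(tag_sets[0])   # DT has been eliminated; IN and WDT remain (some ambiguity is left)
```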

The initial presentation of this work was greeted with considerable discussion about whether ENGCG was really performing better than any of the other taggers that had been developed to date. To address these doubts, they wrote papers to compare ENGCG to stochastic taggers.

Pasi Tapanainen, Atro Voutilainen. Tagging accurately - Don't guess if you know. (ANLP 1994)

Christer Samuelsson, Atro Voutilainen. Comparing a Linguistic and a Stochastic Tagger (ACL 1997)

Presentation by Heng Ji.

The features provide constraints on the model. We'd like to have a probability distribution which, outside of these constraints, is as uniform as possible -- has the maximum entropy among all models which satisfy these constraints.

Suppose we have a tagging task, where we want to assign a tag t to a word w based on the 'context' h of w (the words around w, including w itself). In other words, we want to compute p(h, t). We will specify a set of K features in the form of binary-valued indicator functions f_{1}, ..., f_{K}. For example,

f_{1}(h, t) = 1 if the preceding word in h is "to" and t = "VB"
            = 0 otherwise

We will use a log-linear model, where

p(h, t) = (1/Z) product_{i=1 to K} alpha_{i}^{f_{i}(h, t)}

where alpha_{i} is a weight associated with feature f_{i} and Z is a normalizing constant. The alpha_{i} are chosen so that, for each feature, its expected value under the model,

sum_{h,t} p(h, t) f_{i}(h, t)

is equal to its expected value (i.e., its average value) over the training set of N words (whose contexts are h_{1}, ..., h_{N} and whose tags are t_{1}, ..., t_{N}):

(1/N) sum_{j=1 to N} f_{i}(h_{j}, t_{j})

The procedure for finding the alpha_{i} is an iterative one (Generalized Iterative Scaling).
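
A toy sketch of the model may help; the two features, their weights, and the tiny space of contexts and tags are all invented, and in practice the weights would be set by the iterative fitting procedure rather than by hand:

```python
# Toy log-linear model p(h, t) = (1/Z) * prod_i alpha_i^{f_i(h, t)}.
# Features, weights, contexts, and the tag set are invented for illustration.
import itertools

def f1(h, t):
    # 1 if the preceding word in the context h is "to" and the tag is VB
    return 1 if h["prev_word"] == "to" and t == "VB" else 0

def f2(h, t):
    # 1 if the current word ends in "ing" and the tag is VBG
    return 1 if h["word"].endswith("ing") and t == "VBG" else 0

features = [f1, f2]
alphas = [4.0, 6.0]                      # assumed weights, one per feature

tags = ["NN", "VB", "VBG"]
contexts = [{"word": "run", "prev_word": "to"},
            {"word": "running", "prev_word": "was"}]

def unnormalized(h, t):
    p = 1.0
    for alpha, f in zip(alphas, features):
        p *= alpha ** f(h, t)
    return p

# Z normalizes over our (tiny) space of (h, t) pairs so that the p(h, t) sum to 1.
Z = sum(unnormalized(h, t) for h, t in itertools.product(contexts, tags))
p = {(h["word"], t): unnormalized(h, t) / Z
     for h, t in itertools.product(contexts, tags)}

print(p[("run", "VB")])        # boosted by f1
print(p[("running", "VBG")])   # boosted by f2
```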

Adwait Ratnaparkhi. A Maximum Entropy Model for Part-Of-Speech Tagging (EMNLP 1996)

and will then turn to chunking and look at the 'standard' paper on training a chunker:

Lance Ramshaw and Mitch Marcus. Text Chunking using Transformation-Based Learning (WVLC 1995)