### Lecture 2

#### Computing probabilities:  forward-backward algorithm

In general, our approach to training parameterized probabilistic models is to select the parameters which maximize the likelihood of the training corpus.

For example, if we own a random pig, and squeeze it 10 times, and 8 times it goes "oink" and 2 times it goes "ouch", and we assume it is a stateless pig, the maximum likelihood estimate (MLE) tells us to model the pig as a random emitter with P("oink") = 0.8 and P("ouch") = 0.2.  Simply, P(x) = count(x) / total number of trials.
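The counting estimate above can be sketched in a few lines of Python (the pig data is the toy example from the text):

```python
# Minimal sketch: maximum likelihood estimation for a stateless emitter,
# i.e. P(x) = count(x) / total number of trials.
from collections import Counter

def mle(observations):
    """Return the MLE distribution over observed symbols."""
    counts = Counter(observations)
    total = len(observations)
    return {x: c / total for x, c in counts.items()}

squeezes = ["oink"] * 8 + ["ouch"] * 2   # 10 squeezes of the pig
probs = mle(squeezes)
# probs == {"oink": 0.8, "ouch": 0.2}
```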

Part of speech taggers are generally trained on annotated corpora.  In terms of an HMM model, this means that we know, for each sentence, the sequence of states the HMM went through to produce this sentence.  Thus both the emission and transition probabilities can be computed directly using the MLE.

Can we train the HMM from an unannotated corpus?  The goal is the same ... to select the parameters maximizing the likelihood of the training corpus.  But the procedure is not so direct -- the calculations we make of emission probabilities depend on what we assume regarding the state the HMM is in for each word, which is no longer explicitly given.  There is, in fact, in general no closed-form solution to this problem.  We must instead use an Expectation Maximization (EM) method, which is essentially an iterative, hill-climbing method for setting the parameters.

The specific form of EM for HMMs is the Baum-Welch or Forward-Backward algorithm (J&M Appendix D;  M&S p. 333;  Charniak p. 63).  [We follow here the notation of J&M.]  We want to estimate the transition probabilities aij (the probability, being in state i, of making a transition to state j) and bi(w) (the probability, being in state i, of emitting word w).  If we had a tagged corpus, the computation of aij would be straightforward:

aij = count(transition from state i to state j) / count(transition from state i to any state)

What we'll do is assume some initial values for aij and bi, use these to compute expected values, and then use the expected values to compute new values of aij, which we'll call a'ij:

a'ij = expected value(transition from state i to state j) / expected value(transition from state i to any state)

To compute the expected values, we use the forward and backward probabilities, computed by dynamic programming in the same style as the Viterbi algorithm (but summing over paths rather than maximizing).  Assume the input is w1, ... wT, and the states are numbered 1 to N.   The forward probability alphaj(t) is the probability (for a given input) of being in state j after generating the first t words of the input.  The backward probability betai(t) is the mirror image:  the probability, starting from state i at time t, of generating the words from word t+1 through the end of the input.  Let taut(i,j) be the probability of being in state i for word t and state j for word t+1:

taut(i,j) = alphai(t) aij bj(wt+1) betaj(t+1) / alphaN(T)

then

a'ij = sum(t=1 to T-1) taut(i,j) / sum(t=1 to T-1) sum(k=1 to N) taut(i,k)

Updated values for bi can be computed similarly.  Baum showed that the probability of the training corpus on the new parameters is >= the probability with the old parameters.  By iterating, we can gradually increase the training corpus probability.
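The forward pass, backward pass, and transition re-estimation above can be sketched in NumPy.  This is a minimal single-sentence illustration, not a full Baum-Welch implementation (no emission updates, no iteration, no log-space scaling); the small HMM at the bottom is invented for illustration:

```python
import numpy as np

def forward(A, B, pi, obs):
    # alpha[t, j] = P(w_1..w_t, state at time t = j)
    T, N = len(obs), A.shape[0]
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    return alpha

def backward(A, B, obs):
    # beta[t, i] = P(w_{t+1}..w_T | state at time t = i)
    T, N = len(obs), A.shape[0]
    beta = np.zeros((T, N))
    beta[T - 1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    return beta

def reestimate_transitions(A, B, pi, obs):
    """One E step plus the M-step update a'_ij from the text."""
    alpha, beta = forward(A, B, pi, obs), backward(A, B, obs)
    T, N = len(obs), A.shape[0]
    total = alpha[T - 1].sum()            # probability of the whole input
    # tau[t, i, j] = P(state i at word t, state j at word t+1 | input)
    tau = np.zeros((T - 1, N, N))
    for t in range(T - 1):
        tau[t] = alpha[t][:, None] * A * B[:, obs[t + 1]] * beta[t + 1] / total
    # a'_ij = expected i->j transitions / expected transitions out of i
    return tau.sum(axis=0) / tau.sum(axis=(0, 2))[:, None]

# Toy 2-state, 2-symbol HMM (all numbers hypothetical):
A = np.array([[0.7, 0.3], [0.4, 0.6]])   # transition probabilities
B = np.array([[0.9, 0.1], [0.2, 0.8]])   # emission probabilities
pi = np.array([0.5, 0.5])                # initial state distribution
A_new = reestimate_transitions(A, B, pi, [0, 1, 0])
```

Each row of `A_new` sums to 1, as it must for a transition matrix; iterating this update (together with the analogous emission update) is what gradually raises the corpus probability.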

Like any hill-climbing algorithm, it only finds a local maximum, so it is important to have a good starting point (or try multiple starting points).  Also, there is the danger of overfitting, particularly if we are trying to train too many parameters from too little data.

#### Using forward-backward for training a model without a tagged corpus:  the Xerox tagger

Doug Cutting; Julian Kupiec; Jan Pedersen; Penelope Sibun.  A Practical Part-of-Speech Tagger.  ANLP 1992.
Presentation by Yusuke Shinyama.

#### More unsupervised tagging

Merialdo made comparisons of unsupervised and supervised training, and found that unsupervised training was helpful if only a small amount of tagged data was available (2000 sentences or less), but was not helpful if a larger amount of tagged data was available to initialize the parameters.  There is a general correlation between maximizing the probability of the training corpus and assigning the tags which people have assigned, but they are not necessarily identical.

One critical advantage of Kupiec's approach was his use of 'ambiguity classes':  infrequent words are grouped together based on their possible parts of speech.  This greatly reduces the number of parameters to be estimated.

Bernard Merialdo. Tagging English Text with a Probabilistic Model.  CL 20, 2 (1994).

Brill developed an unsupervised version of his TBL POS tagger.  The unsupervised version acts as a constraint tagger (to be discussed shortly), eliminating rather than changing tags.  A typical transformation might say: change [X,Y] to Y in context C, where X and Y are part of speech tags.  Words are initially assigned all their possible parts of speech, based on a dictionary, as in other unsupervised methods.  Transformations are scored by the formula

incontext(Y, C) - (freq (Y) / freq (X)) * incontext(X, C)

where freq(A) is the number of words in the corpus unambiguously tagged with part-of-speech A, and incontext(A, C) is the number of words unambiguously tagged with part-of-speech A in context C.  In other words, we favor transformations which are validated by lots of unambiguous examples in the corpus.  When some tagged data is available, Brill first applies his unsupervised tagger, and then his supervised tagger (to the output of the unsupervised tagger), so there is no problem of the unsupervised tagger making the supervised results worse.
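Brill's scoring formula is simple enough to state directly in code.  A hedged sketch, with invented counts (the tags, context name, and numbers below are hypothetical, not from Brill's paper):

```python
# Score the transformation 'change [X,Y] to Y in context C'.
# freq(A): words unambiguously tagged A in the corpus.
# incontext(A, C): words unambiguously tagged A in context C.
def score(X, Y, C, freq, incontext):
    return incontext(Y, C) - (freq(Y) / freq(X)) * incontext(X, C)

# Hypothetical counts: is a noun/verb-ambiguous word after a determiner
# more likely a noun?  Unambiguous nouns appear there far more often.
freq_counts = {"NN": 1000, "VB": 500}
incontext_counts = {("NN", "after_DT"): 80, ("VB", "after_DT"): 5}

s = score("VB", "NN", "after_DT",
          lambda a: freq_counts[a],
          lambda a, c: incontext_counts[(a, c)])
# s == 70.0: a strongly positive score, so the transformation
# [VB,NN] -> NN after a determiner is well supported.
```

The freq(Y)/freq(X) factor compensates for one tag simply being more common than the other overall.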

Eric Brill.  Unsupervised Learning of Disambiguation Rules for Part of Speech Tagging.  Third WVLC, 1995.

#### Constraint grammar tagger

Constraint grammar was developed by Fred Karlsson and his group at the University of Helsinki. It used a detailed tag set which, however, avoided some of the problematic ambiguities of other tag sets, such as
• -ing forms
• nonfinite -ed forms
They claimed that they were able to achieve very high inter-annotator agreement.

Their tagger begins with a dictionary look-up which assigns each word all possible parts of speech. Then a set of 'constraint rules' operate which look for a particular pattern and eliminate a particular part of speech. These rules were all written by hand, using the annotated corpus to check the correctness of the rules. The final ENGCG (English constraint grammar) system had 3600 rules! The rules do not eliminate all ambiguity; a few (4-7% of words) are left with multiple tags.
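The eliminate-don't-guess control structure can be sketched as follows.  This is a toy illustration of the mechanism only; the rule and tag sets are invented, and real ENGCG rules are far richer:

```python
# Hedged sketch: each word carries a set of candidate tags from dictionary
# look-up; a constraint rule removes one tag when its pattern matches.
def apply_constraints(tag_sets, rules):
    """tag_sets: one set of candidate tags per word.
    Each rule is (pattern, tag_to_remove), where pattern(tag_sets, i)
    tests the context of word i."""
    for i in range(len(tag_sets)):
        for pattern, tag in rules:
            # never eliminate a word's last remaining tag
            if tag in tag_sets[i] and len(tag_sets[i]) > 1 and pattern(tag_sets, i):
                tag_sets[i].discard(tag)
    return tag_sets

# Invented rule: remove VB if the previous word can only be a determiner.
rule = (lambda ts, i: i > 0 and ts[i - 1] == {"DT"}, "VB")

tags = [{"DT"}, {"NN", "VB"}]       # e.g. "the" + a noun/verb-ambiguous word
apply_constraints(tags, [rule])
# tags[1] == {"NN"}
```

Note that, as in ENGCG, nothing forces the result down to a single tag: words whose ambiguity no rule resolves simply keep multiple tags.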

The initial presentation of this work was greeted with considerable discussion about whether ENGCG was really performing better than any of the other taggers that had been developed to date. To address these doubts, they wrote papers to compare ENGCG to stochastic taggers.

Pasi Tapanainen; Atro Voutilainen.  Tagging accurately - Don't guess if you know. (ANLP 1994)
Christer Samuelsson, Atro Voutilainen. Comparing a Linguistic and a Stochastic Tagger (ACL 1997)
Presentation by Heng Ji.

#### Maximum entropy modeling

(M&S, sec. 16.2;  see also Berger's online tutorial)  Until now, we have looked at a very simple statistical model -- the n-gram model (and its realization as an HMM) -- in which probabilities depend in a very uniform way on the preceding one or two tokens.  In practice, however, we may (through our linguistic analysis or intuition) identify a rather heterogeneous set of features which contribute in some way to a choice about the current word:  whether the current story is a sports article;  whether it's in the first sentence of the story;  whether there's a preposition anywhere in the last five words;  and so forth.  Maximum entropy modeling provides one mathematically well-founded method for combining such features in a probabilistic model.

The features provide constraints on the model.  We'd like to have a probability distribution which, outside of these constraints, is as uniform as possible -- has the maximum entropy among all models which satisfy these constraints.

Suppose we have a tagging task, where we want to assign a tag t to a word w based on the 'context' h of w (the words around w, including w itself).  In other words, we want to compute p(h,t).  We will specify a set of K features in the form of binary-valued indicator functions fi (h, t).  For example,

f1(h, t) = 1 if the preceding word in h is "to" and t = "VB"
= 0 otherwise

We will use a log-linear model, where

p(h, t) = (1/Z) product(i=1 to K) alphai^fi(h, t)

where alphai is the weight for feature i, and Z is a normalizing constant.  The goal of the learning procedure is to determine the values of the alphai's so that the expected value of each fi under the model

sum(h,t) p(h, t) fi(h, t)

is equal to its average value over the training set of N words (whose contexts are h1, ..., hN and whose tags are t1, ..., tN):

(1/N) sum(j=1 to N) fi(hj, tj)

The procedure for finding the alphai's is called generalized iterative scaling.  A number of packages are now available which implement this algorithm and improvements thereon.
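To make the log-linear form concrete, here is a minimal sketch of evaluating the model for the f1 feature above.  The weight value is invented, the tag set is reduced to two tags, and for simplicity we normalize over tags for a fixed context (giving p(t|h) rather than the joint p(h,t)):

```python
import math

# The indicator feature from the text: previous word "to" and tag VB.
def f1(h, t):
    return 1 if h["prev_word"] == "to" and t == "VB" else 0

features = [f1]
alphas = [4.0]          # hypothetical learned weight alpha_1
tags = ["VB", "NN"]     # toy tag set

def p_tag(h):
    # score(t) = product_i alpha_i ^ f_i(h, t); then normalize over tags
    scores = {t: math.prod(a ** f(h, t) for a, f in zip(alphas, features))
              for t in tags}
    Z = sum(scores.values())
    return {t: s / Z for t, s in scores.items()}

probs = p_tag({"prev_word": "to"})
# Only f1 fires, and only for VB: scores are 4.0 vs 1.0,
# so probs["VB"] == 0.8 and probs["NN"] == 0.2.
```

A feature with weight alpha > 1 raises the probability of the analyses it fires on; alpha < 1 lowers it; alpha = 1 makes the feature inert, keeping the distribution as uniform as the constraints allow.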

#### Looking ahead to next week

We will finish our discussion of POS tagging with ...

Adwait Ratnaparkhi.  A Maximum Entropy Model for Part-Of-Speech Tagging (EMNLP 1996)

and will then turn to chunking and look at the 'standard' paper on training a chunker:

Lance Ramshaw and Mitch Marcus.  Text Chunking using Transformation-Based Learning  (WVLC 1995)