Tagging NL text using the k-gram model

Introduction

A number of problems in natural language analysis can be formulated in the following terms: You are given a text, which is viewed as a string of elements. The problem is to assign one tag out of a collection of tags to each of the elements. Examples: In part-of-speech tagging, the elements are the words of the text and the tags are parts of speech. In speech understanding, the elements are units of sound and the tags are the words intended.

In the statistical, or supervised learning, approach to this problem, you are given a large training corpus of correctly tagged text. The objective, then, is to extract patterns from the training corpus that can be applied to label new text.

The Probabilistic Model

In a probabilistic model, we formulate the problem as looking for the most likely sequence of tags for the given sequence of elements. Formally, let N be the number of elements in the sentence to be tagged. Let E[I] be the random variable for the Ith element, let T[I] be the random variable for the Ith tag, and let eI be the actual value of the Ith element. The problem is to find the sequence of tag values t1 ... tN such that

Prob(T[1]=t1 ... T[N]=tN | E[1]=e1 ... E[N]=eN)

is as large as possible.

For instance, in tagging the sentence "I can fish", we want to arrive somehow at the estimate that

Prob(T[1]=pronoun, T[2]=modal, T[3]=verb | E[1]="I", E[2]="can", E[3]="fish")

is large, that

Prob(T[1]=pronoun, T[2]=verb, T[3]=noun | E[1]="I", E[2]="can", E[3]="fish")

is small and that the probability for any other sequence of three tags is tiny or zero.

Comment on notation:

The form we have used above

Prob(T[1]=t1 ... T[N]=tN | E[1]=e1 ... E[N]=eN)

is strictly correct. However, it is rather cumbersome, and would become impossibly cumbersome as we get to more complicated expressions. The tradition in the literature of applied probability, which we follow below, is to be very unrigorous, not to say sloppy, about using clearly defined notations, and to write the above expression in the form

Prob(t1 ... tN | e1 ... eN)

Strictly speaking, this is nonsense; an expression like

Prob(pronoun,modal,verb | "I","can","fish")

doesn't actually mean anything. It is the responsibility of the reader to keep track of what is actually meant (and the responsibility, not always observed, of the writer to explain clearly what his abbreviated notations actually mean.)

The Probabilistic Estimate

To repeat, we want to find the tag values t1 ... tN such that Prob(t1 ... tN | e1 ... eN) is maximal. In principle, what we're after is this: if you consider all the times that the sentence "I can fish" might be written or spoken, what fraction of the time is the intended meaning the one corresponding to [pronoun modal verb], and what fraction of the time is it the one corresponding to [pronoun verb noun]? Unfortunately, since most sentences that one encounters are being spoken (or written) for the first and only time in the history of the universe --- and, more specifically, are certainly not in the training corpus --- this definition is not a useful one. So we have to estimate this probability in terms of considerations whose weight we can actually measure. To do this, as throughout applied probability, we have to make independence assumptions that are violently and obviously false. The "justification" of these assumptions, in the final analysis, is that the estimates they yield work reasonably well in practice. We proceed as follows. We first apply Bayes' Law to derive
Prob(t1 ... tN | e1 ... eN) = Prob(t1 ... tN) * Prob(e1 ... eN| t1 ... tN) / Prob(e1 ... eN)

However, the denominator above, Prob(e1 ... eN), depends only on the sequence of elements, which is given. As this is the same for all choices of tags, we can ignore it in comparing one sequence of tags to another. (This is known as the "normalizing factor". It doesn't affect the relative sizes of the probabilities of different sequences of tags; its only effect is to ensure that the sum of the probabilities over all possible sequences of tags adds up to 1.)

So we've re-expressed the problem as finding the sequence of tags that maximizes the product
Prob(t1 ... tN) * Prob(e1 ... eN| t1 ... tN)
The first term here is the inherent likelihood of the sequence of tags, before we have seen what the elements are. The second is the likelihood that a particular sequence of tags t1 ... tN will be instantiated as the sequence of elements e1 ... eN. For instance, in the speech understanding problem, the first term is the probability that a speaker will speak a given sentence, sometimes called the "linguistic" model; the second is the probability, given the sentence, that he would pronounce it in the way that has been heard, sometimes called the "phonetic" model. (The interpretation of this product in the "part-of-speech" problem is not as natural.) We now have to estimate these two terms using independence assumptions.

We can use the rule for the probability of a conjunction to expand the second term as follows:
Prob(e1 ... eN| t1 ... tN) =
Prob(e1 | t1 ... tN) * Prob(e2 | e1, t1 ... tN) * Prob(e3 | e1, e2, t1 ... tN) * ... * Prob(eN | e1 ... eN-1, t1 ... tN)
We now make the following independence assumption:
Prob(eI | e1 ... eI-1, t1 ... tN) = Prob(eI | tI)
That is, once we have fixed the Ith tag tI, the likelihood of any given element eI is not affected by any preceding element or by any of the other tags. (Also, we assume that the value of the index I doesn't actually enter here; that is, we are interested in the probability that e.g. a verb will be "can", not the probability that a verb which is the second word in the sentence will be "can". This kind of thing is quite difficult to notate formally and precisely, which is one reason that writers don't even aim at a complete formal notation.)
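For instance, in tagging "I can fish" with the tags [pronoun modal verb], this assumption says that
Prob(E[3]="fish" | E[1]="I", E[2]="can", T[1]=pronoun, T[2]=modal, T[3]=verb) = Prob(E[3]="fish" | T[3]=verb)
that is, once the third tag is fixed as verb, the likelihood that the third element is "fish" is not affected by the earlier words or tags.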

Having made this assumption, we can rewrite the above product in the form
Prob(e1 | t1) * ... * Prob(eN | tN)
We can now apply Bayes' Law to each of the above factors:
Prob(eI | tI) = Prob(tI | eI) * Prob(eI) / Prob(tI)
These are all numbers we can estimate from the corpus.
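To make this concrete, here is a small Python sketch of how these factors might be estimated by counting. The toy corpus of (element, tag) pairs is invented purely for illustration; a real estimate would of course come from a large training corpus.

```python
from collections import Counter

# A made-up toy "training corpus" of (element, tag) pairs, purely for illustration.
corpus = [("I", "pronoun"), ("can", "modal"), ("fish", "verb"),
          ("I", "pronoun"), ("can", "verb"), ("fish", "noun")]

tag_count = Counter(tag for _, tag in corpus)        # counts of each tag
elem_count = Counter(elem for elem, _ in corpus)     # counts of each element
pair_count = Counter(corpus)                         # counts of each (element, tag) pair
total = len(corpus)

def prob_tag(tag):                    # Prob(tI), estimated as FreqC(tI)
    return tag_count[tag] / total

def prob_elem(elem):                  # Prob(eI)
    return elem_count[elem] / total

def prob_tag_given_elem(tag, elem):   # Prob(tI | eI), estimated as FreqC(tI | eI)
    return pair_count[(elem, tag)] / elem_count[elem]

def prob_elem_given_tag(elem, tag):   # Prob(eI | tI), recovered via Bayes' Law
    return prob_tag_given_elem(tag, elem) * prob_elem(elem) / prob_tag(tag)

print(prob_elem_given_tag("can", "modal"))   # Prob("can" | modal)
```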

We now turn to the first term in the original product above: Prob(t1 ... tN)
Again using the rule for conjunctions, we can write this in the form,
Prob(t1 ... tN) = Prob(t1) * Prob(t2 | t1) * Prob(t3 | t1 t2) * Prob(t4 | t1 t2 t3 ) * ... * Prob(tN | t1 ... tN-1)

We now make the second independence assumption, known as the k-gram assumption (for some small integer K), that the dependence of the Ith tag on all the previous tags in fact reduces to its dependence on the K-1 previous tags.
Prob(tI | t1 ... tI-1) = Prob(tI | tI+1-K ... tI-1)
For instance, with K=3 (the usual case) we get the trigram assumption Prob(tI | t1 ... tI-1) = Prob(tI | tI-2 tI-1)
With K=2 we get the bigram assumption Prob(tI | t1 ... tI-1) = Prob(tI | tI-1)
(Again we mean, not clearly indicated in the notation, that this is the probability of seeing a specified tag following K-1 specified tags at any position in the sentence. That is, e.g. the probability of seeing noun following determiner adjective, not the probability of seeing noun in the fifth position following determiner adjective in the third and fourth.)

In order to get a more uniform notation, we assume that every sentence of tags is preceded by K-1 occurrences of the special tag "*Start*". We can then write the above product in the form
Product_{I=1..N} Prob(tI | tI+1-K ... tI-1)

To simplify the notation, we'll assume from here on down that we are making the trigram assumption, with K=3.
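As an illustration of the counts involved, here is a minimal Python sketch that pads each training tag sequence with two occurrences of *Start* and estimates Prob(tI | tI-2 tI-1) by frequency. The training tag sequences shown are invented purely for illustration.

```python
from collections import Counter

START = "*Start*"

# Invented training tag sequences, one per sentence, purely for illustration.
training_tag_sequences = [
    ["pronoun", "modal", "verb"],
    ["pronoun", "verb", "noun"],
    ["determiner", "adjective", "noun", "verb"],
]

trigram_count = Counter()   # counts of (tI-2, tI-1, tI)
context_count = Counter()   # counts of the context pair (tI-2, tI-1)

for tags in training_tag_sequences:
    padded = [START, START] + tags          # K-1 = 2 occurrences of *Start*
    for i in range(2, len(padded)):
        trigram_count[(padded[i-2], padded[i-1], padded[i])] += 1
        context_count[(padded[i-2], padded[i-1])] += 1

def prob_trigram(t, t_minus_2, t_minus_1):
    """FreqC(tI | tI-2 tI-1): how often t follows the pair (t_minus_2, t_minus_1)."""
    if context_count[(t_minus_2, t_minus_1)] == 0:
        return 0.0
    return trigram_count[(t_minus_2, t_minus_1, t)] / context_count[(t_minus_2, t_minus_1)]

print(prob_trigram("verb", START, "pronoun"))   # Prob(verb | *Start* pronoun)
```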

Putting all this together, and dropping the normalizing factor Prob(e1 ... eN) as discussed above, we get the formula
Prob(t1 ... tN | e1 ... eN) is proportional to Product_{I=1..N} Prob(tI | tI-2 tI-1) * Prob(tI | eI) * Prob(eI) / Prob(tI)

Again, since the eI are fixed, we can ignore the term Prob(eI), which is unaffected by the choice of the tags, and just try to maximize the product
Product_{I=1..N} Prob(tI | tI-2 tI-1) * Prob(tI | eI) / Prob(tI)
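To make the maximization concrete, the following Python sketch scores every possible tag sequence for "I can fish" with this product and picks the largest. The probability tables are made-up toy numbers, and exhaustive enumeration is only workable for a tiny example, since the number of tag sequences grows exponentially with N.

```python
from itertools import product as cartesian_product

START = "*Start*"
TAGS = ["pronoun", "modal", "verb", "noun"]

# Made-up toy probability tables, purely for illustration.
prob_trigram = {(START, START, "pronoun"): 0.5,
                (START, "pronoun", "modal"): 0.3,
                (START, "pronoun", "verb"): 0.2,
                ("pronoun", "modal", "verb"): 0.6,
                ("pronoun", "verb", "noun"): 0.5}
prob_tag_given_elem = {("pronoun", "I"): 1.0,
                       ("modal", "can"): 0.6, ("verb", "can"): 0.3, ("noun", "can"): 0.1,
                       ("verb", "fish"): 0.4, ("noun", "fish"): 0.6}
prob_tag = {"pronoun": 0.2, "modal": 0.1, "verb": 0.3, "noun": 0.4}

def score(tags, elems):
    """Product over I of Prob(tI | tI-2 tI-1) * Prob(tI | eI) / Prob(tI).
    Only proportional to the probability, since the normalizing factors were dropped."""
    padded = [START, START] + list(tags)
    p = 1.0
    for i, (tag, elem) in enumerate(zip(tags, elems)):
        p *= prob_trigram.get((padded[i], padded[i + 1], tag), 0.0)
        p *= prob_tag_given_elem.get((tag, elem), 0.0)
        p /= prob_tag[tag]
    return p

elems = ["I", "can", "fish"]
best = max(cartesian_product(TAGS, repeat=len(elems)), key=lambda t: score(t, elems))
print(best, score(best, elems))   # picks (pronoun, modal, verb) under these toy numbers
```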

The Sparse Data Problem and Smoothing

To compute the above product, we need three types of probabilities: Prob(tI | tI-2 tI-1), Prob(tI | eI), and Prob(tI). This is where our training corpus comes in: We estimate each of these quantities by the frequency in the corpus:
Prob(tI | tI-2 tI-1) = FreqC(tI | tI-2 tI-1); Prob(tI | eI) = FreqC(tI | eI), and Prob(tI) = FreqC(tI)
where "FreqC" means the frequency in the training corpus C.

Unfortunately, there is a problem with the first of these in particular; namely, that some possible triples of tags tI-2 tI-1 tI may actually never occur in the corpus C. This is unlikely in the case where the tags are parts of speech, but is altogether likely in the case where the tags are words. (If the active vocabulary is 10,000 words, then in principle there could be as many as a trillion different trigrams. Now, in fact, the vast majority of these are impossible. Nonetheless, there are many more trigrams that are possible and even not terribly unlikely than will appear in, say, a billion word corpus. That is, if you take a billion word training corpus, and then look at another million words, it is very likely that the new million words will include a substantial fraction of trigrams that never appear in the training corpus.)

The solution that is used for this problem is called smoothing: we estimate the probability of the trigram as a weighted sum of the trigram frequency, the bigram frequency, and the unigram frequency.

Prob(tI | tI-2 tI-1) = w3 FreqC(tI | tI-2 tI-1) + w2 FreqC(tI | tI-1) + w1 FreqC(tI)
for some fixed weights w1, w2, w3 that add up to 1.0; e.g. w1 = 0.1, w2 = 0.2, w3 = 0.7. For instance, we estimate the probability of seeing "jello" following "burnt strawberry" as 0.7 times the frequency with which "burnt strawberry" is followed by "jello" in the corpus, plus 0.2 times the frequency with which "strawberry" is followed by "jello" in the corpus, plus 0.1 times the frequency of "jello" in the corpus. Thus, we give preference to trigrams that actually occur in the corpus, while admitting the possibility of seeing new trigrams and even new bigrams.
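A minimal Python sketch of this interpolation, assuming that frequency functions freq_trigram, freq_bigram, and freq_unigram (hypothetical names) have already been estimated from the training corpus along the lines sketched earlier:

```python
# Example weights from the text; they must add up to 1.0.
W1, W2, W3 = 0.1, 0.2, 0.7

def smoothed_trigram_prob(t, t_minus_2, t_minus_1,
                          freq_trigram, freq_bigram, freq_unigram):
    """Smoothed estimate of Prob(tI | tI-2 tI-1): a weighted sum of the trigram,
    bigram, and unigram frequencies observed in the training corpus."""
    return (W3 * freq_trigram(t, t_minus_2, t_minus_1) +
            W2 * freq_bigram(t, t_minus_1) +
            W1 * freq_unigram(t))

# e.g., when the tags are words: the smoothed probability of "jello"
# following "burnt strawberry" would be
# smoothed_trigram_prob("jello", "burnt", "strawberry", ftri, fbi, funi)
```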