Tagging NL text using the k-gram model

Introduction

A number of problems in natural language analysis can be formulated in the following terms: You are given a text, which is viewed as a string of elements. The problem is to assign one tag out of a collection of tags to each of the elements. Examples include part-of-speech tagging, where the elements are the words of the text and the tags are parts of speech, and speech understanding, where the elements are sounds and the tags are the words intended. In the statistical, or supervised learning, approach to this problem, you are given a large training corpus of correctly tagged text. The objective, then, is to extract patterns from the training corpus that can be applied to label the new text.

The Probabilistic Model

In a probabilistic model, we formulate the problem as looking for the most likely sequence of tags for the given sequence of elements. Formally, let N be the number of elements in the sentence to be tagged. Let E[I] be the random variable for the Ith element, let T[I] be the random variable for the Ith tag, let eI be the actual value of the Ith element, and let tI denote a possible value of the Ith tag. The problem is to find the sequence of tag values t1 ... tN such that

Prob(T[1]=t1 ... T[N]=tN | E[1]=e1 ... E[N]=eN)

is as large as possible.

We will abbreviate this as Prob(t1 ... tN | e1 ... eN), meaning "the probability that the tagging of the sentence is the string of tags t1 ... tN, given that the sentence is the string of words e1 ... eN".

For instance, in tagging the sentence "I can fish", we want to arrive somehow at the estimate that

Prob(<pronoun, modal, verb> | <"I", "can", "fish">)

is large, that

Prob(<pronoun, verb, noun> | <"I", "can", "fish">)

is small and that the probability for any other sequence of three tags is tiny or zero.
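
To make the goal concrete, here is a minimal sketch in Python of what "choose the most likely tag sequence" amounts to. The candidate tags and the probability values are invented purely for illustration; a real tagger would estimate them from data rather than look them up in a hand-written table.

    # Brute-force choice of the most likely tag sequence for "I can fish".
    # The candidate tags and probabilities below are invented for illustration.
    from itertools import product

    sentence = ["I", "can", "fish"]
    candidate_tags = {"I": ["pronoun"],
                      "can": ["modal", "verb", "noun"],
                      "fish": ["verb", "noun"]}

    # Hypothetical values of Prob(tags | "I can fish"); unlisted sequences get 0.
    prob = {("pronoun", "modal", "verb"): 0.93,
            ("pronoun", "verb", "noun"): 0.06}

    best = max(product(*(candidate_tags[w] for w in sentence)),
               key=lambda tags: prob.get(tags, 0.0))
    print(best)    # ('pronoun', 'modal', 'verb')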

The Probabilistic Estimate

In principle, what we're after is this: if you consider all the times that the sentence "I can fish" might be written or spoken, what fraction of the time is the intended meaning the one corresponding to [pronoun modal verb], and what fraction of the time is it the one corresponding to [pronoun verb noun]? Unfortunately, since most sentences that one encounters are being spoken (or written) for the first and only time in the history of the universe --- and, more specifically, are certainly not in the training corpus --- this definition is not a useful one. (Actually, a significant fraction of speech consists of very short, common sentences --- "Hello!", "Go away!", etc. --- but those don't get you very far.) So we have to estimate this probability in terms of considerations whose weight we can actually measure. To do this, as throughout applied probability, we have to make independence assumptions that are violently and obviously false. The "justification" of these assumptions, in the final analysis, is simply that the estimates they produce work reasonably well in practice.

We proceed as follows. We first apply Bayes' Law to derive
Prob(t1 ... tN | e1 ... eN) = Prob(t1 ... tN) * Prob(e1 ... eN | t1 ... tN) / Prob(e1 ... eN)

However, the denominator above, Prob(e1 ... eN), depends only on the sequence of elements, which is given. As this is the same for all choices of tags, we can ignore it in comparing one sequence of tags to another. (This is known as the "normalizing factor". It doesn't affect the relative sizes of the probabilities of different sequences of tags; its only effect is to ensure that the sum of the probabilities over all possible sequences of tags adds up to 1.)
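
The point is easy to check numerically. In the sketch below, with made-up numerator values, dividing every candidate's score by the same normalizing constant changes the numbers but not which tag sequence comes out on top.

    # Dropping the normalizing factor does not change which sequence wins.
    # The numerators Prob(tags) * Prob(words | tags) are invented values.
    numerators = {("pronoun", "modal", "verb"): 3.1e-7,
                  ("pronoun", "verb", "noun"):  2.0e-8}
    normalizer = sum(numerators.values())     # plays the role of Prob(e1 ... eN)

    posteriors = {tags: p / normalizer for tags, p in numerators.items()}

    # Both rankings pick the same tag sequence.
    assert max(numerators, key=numerators.get) == max(posteriors, key=posteriors.get)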

So we've re-expressed the problem as finding the sequence of tags that maximizes the product
Prob(t1 ... tN) * Prob(e1 ... eN | t1 ... tN)
The first term here is the inherent likelihood of the sequence of tags, before we have seen what the elements are. The second is the likelihood that a particular sequence of tags t1 ... tN will be instantiated as the sequence of elements e1 ... eN. For instance, in the speech understanding problem, the first term is the probability that a speaker will speak a given sentence, sometimes called the "linguistic" model; the second is the probability, given the sentence, that he would pronounce it in the way that has been heard, sometimes called the "phonetic" model. (The interpretation of this product in the "part-of-speech" problem is not as natural.) We now have to estimate these two terms using independence assumptions.

We can use the rule for the probability of a conjunction to expand the second term as follows:
Prob(e1 ... eN | t1 ... tN) =
Prob(e1 | t1 ... tN) * Prob(e2 | e1, t1 ... tN) * Prob(e3 | e1, e2, t1 ... tN) * ... * Prob(eN | e1 ... eN-1, t1 ... tN)
We now make the following independence assumption:
Prob(eI | e1 ... eI-1, t1 ... tN) = Prob(eI | tI)
That is, once we have fixed the Ith tag tI, the likelihood of any given element eI is not affected by any preceding element or by any of the other tags. (Also, we assume that the value of the index I doesn't actually enter here; that is, we are interested in the probability that, e.g., a verb will be "can", not the probability that a verb which is the second word in the sentence will be "can". This kind of thing is quite difficult to notate formally and precisely, and authors of such probabilistic calculations very rarely attempt to achieve formal precision in these formulas.)

Having made this assumption, we can rewrite the above product in the form
Prob(e1 | t1) * ... * Prob(eN | tN)
We can now apply Bayes' Law to each of the above factors:
Prob(eI | tI) = Prob(tI | eI) * Prob(eI) / Prob(tI)
These are all numbers we can estimate from the corpus.
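
As a sketch of what "estimate from the corpus" means here, the following Python fragment counts occurrences in a tiny, invented tagged corpus to estimate Prob(tI | eI) and Prob(tI); a real training corpus would contain millions of (word, tag) pairs.

    # Estimating Prob(tag | element) and Prob(tag) by counting in a tiny
    # hand-made "training corpus" of (word, tag) pairs (invented for illustration).
    from collections import Counter

    tagged_corpus = [("I", "pronoun"), ("can", "modal"), ("fish", "verb"),
                     ("they", "pronoun"), ("can", "verb"), ("the", "determiner"),
                     ("fish", "noun")]

    tag_counts = Counter(tag for _, tag in tagged_corpus)
    word_counts = Counter(word for word, _ in tagged_corpus)
    pair_counts = Counter(tagged_corpus)

    def prob_tag_given_word(tag, word):
        # FreqC(tag | word): fraction of occurrences of `word` tagged with `tag`
        return pair_counts[(word, tag)] / word_counts[word]

    def prob_tag(tag):
        # FreqC(tag): fraction of all tokens carrying `tag`
        return tag_counts[tag] / len(tagged_corpus)

    print(prob_tag_given_word("verb", "can"))   # 0.5 in this toy corpus
    print(prob_tag("pronoun"))                  # 2/7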

We now turn to the first term in the original product above: Prob(t1 ... tN)
Again using the rule for conjunctions, we can write this in the form,
Prob(t1 ... tN) = Prob(t1) * Prob(t2 | t1) * Prob(t3 | t1 t2) * Prob(t4 | t1 t2 t3 ) * ... * Prob(tN | t1 ... tN-1)

We now make the second independence assumption, known as the k-gram assumption (for some small integer k), that the dependence of the Ith tag on all the previous tags in fact reduces to its dependence on the k-1 previous tags.
Prob(tI | t1 ... tI-1) = Prob(tI | tI+1-K ... tI-1)
For instance, with K=3 (the usual case) we get the trigram assumption
Prob(tI | t1 ... tI-1) = Prob(tI | tI-2 tI-1)
With K=2 we get the bigram assumption
Prob(tI | t1 ... tI-1) = Prob(tI | tI-1)
(Again, though the notation does not clearly indicate it, we mean the probability of seeing a specified tag following K-1 specified tags at any position in the sentence; that is, e.g., the probability of seeing noun following determiner adjective, not the probability of seeing noun in the fifth position following determiner adjective in the third and fourth.)

In order to get a more uniform notation, we assume that every sequence of tags is preceded by K-1 occurrences of the special tag "*Start*". We can then write the above product in the form
∏I=1..N Prob(tI | tI+1-K ... tI-1)

To simplify the notation, we'll assume, from here on down, that we are making the trigram assumption with K=3.
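
Under the trigram assumption, the product above can be computed by a short routine that pads the tag sequence with two copies of "*Start*". In the sketch below the trigram probability is passed in as a function, since in practice it would be estimated from the training corpus; the constant value used in the example is invented.

    # Prob(t1 ... tN) under the trigram assumption, with two "*Start*" tags
    # prepended so that every tag has two predecessors.
    def sequence_prob(tags, trigram_prob):
        padded = ["*Start*", "*Start*"] + list(tags)
        p = 1.0
        for i in range(2, len(padded)):
            # Prob(tI | tI-2 tI-1)
            p *= trigram_prob(padded[i], padded[i - 2], padded[i - 1])
        return p

    # Invented example distribution: every trigram is given probability 0.1.
    print(sequence_prob(["pronoun", "modal", "verb"],
                        lambda t, t2, t1: 0.1))    # 0.1 ** 3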

Putting all this together, and continuing to ignore the normalizing factor Prob(e1 ... eN), we get that Prob(t1 ... tN | e1 ... eN) is proportional to
∏I=1..N Prob(tI | tI-2 tI-1) * Prob(tI | eI) * Prob(eI) / Prob(tI)

Again, since the eI are fixed, we can ignore the term Prob(eI), which is unaffected by the choice of the tags, and just try to maximize the product
∏I=1..N Prob(tI | tI-2 tI-1) * Prob(tI | eI) / Prob(tI)
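
A direct, if inefficient, way to use this criterion is to score every possible tag sequence and keep the best one, as in the sketch below. The three probability functions are stand-ins to be estimated from the corpus, and the exhaustive enumeration is only for clarity; practical taggers find the maximizing sequence by dynamic programming (the Viterbi algorithm) instead.

    # Score a candidate tagging by the product over I of
    #   Prob(tI | tI-2 tI-1) * Prob(tI | eI) / Prob(tI)
    # and pick the best tagging by brute-force enumeration.
    from itertools import product

    def score(tags, words, p_trigram, p_tag_given_word, p_tag):
        padded = ["*Start*", "*Start*"] + list(tags)
        s = 1.0
        for i, word in enumerate(words):
            t, t1, t2 = padded[i + 2], padded[i + 1], padded[i]
            s *= p_trigram(t, t2, t1) * p_tag_given_word(t, word) / p_tag(t)
        return s

    def best_tagging(words, tagset, p_trigram, p_tag_given_word, p_tag):
        return max(product(tagset, repeat=len(words)),
                   key=lambda tags: score(tags, words, p_trigram,
                                          p_tag_given_word, p_tag))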

The Sparse Data Problem and Smoothing

To compute the above product, we need three types of probabilities: Prob(tI | tI-2 tI-1), Prob(tI | eI), and Prob(tI). This is where our training corpus comes in: We estimate each of these quantities by the frequency in the corpus:
Prob(tI | tI-2 tI-1) = FreqC(tI | tI-2 tI-1); Prob(tI | eI) = FreqC(tI | eI), and Prob(tI) = FreqC(tI)
where "FreqC" means the frequency in the training corpus C.

Unfortunately, there is a problem with the first of these in particular; namely, that some possible triples of tags tI-2 tI-1 tI may actually never occur in the corpus C. This is unlikely in the case where the tags are parts of speech, but is altogether likely in the case where the tags are words. (If the active vocabulary is 10,000 words, then in principle there could be as many as a trillion different trigrams. Now, in fact, the vast majority of these are impossible. Nonetheless, there are many more trigrams that are possible and even not terribly unlikely than will appear in, say, a billion word corpus. That is, if you take a billion word training corpus, and then look at another million words, it is very likely that the new million words will include a substantial fraction of trigrams that never appear in the training corpus.)
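
The arithmetic behind the claim is simply:

    # With a 10,000-word active vocabulary there are 10,000**3 = 10**12 possible
    # word trigrams (a trillion), versus only about 10**9 trigram tokens in a
    # billion-word corpus.
    print(10_000 ** 3)    # 1,000,000,000,000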

The solution that is used for this problem is called smoothing: we estimate the probability of the trigram as a weighted sum of the trigram frequency, the bigram frequency, and the unigram frequency.

Prob(tI | tI-2 tI-1) = w3 FreqC(tI | tI-2 tI-1) + w2 FreqC(tI | tI-1) + w1 FreqC(tI)
for some fixed weights w1, w2, w3 that add up to 1.0; e.g. w1 = 0.1, w2 = 0.2, w3 = 0.7. For instance, we estimate the probability of seeing "jelly" following "burnt strawberry" as 0.7 times the frequency with which "burnt strawberry" is followed by "jelly" in the corpus, plus 0.2 times the frequency with which "strawberry" is followed by "jelly" in the corpus, plus 0.1 times the frequency of "jelly" in the corpus. Thus, we give preference to trigrams that actually occur in the corpus, while admitting the possibility of seeing new trigrams and even new bigrams.
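
Here is a sketch of this smoothed estimate in Python, using the weights from the example above. The three frequency functions are stand-ins for counts taken from the training corpus, and the numbers in the usage line are invented.

    # Smoothed trigram estimate: a weighted sum of the trigram, bigram, and
    # unigram frequencies.  The frequency functions would come from corpus counts.
    def smoothed_prob(t, t2, t1, freq_trigram, freq_bigram, freq_unigram,
                      w3=0.7, w2=0.2, w1=0.1):
        return (w3 * freq_trigram(t, t2, t1)
                + w2 * freq_bigram(t, t1)
                + w1 * freq_unigram(t))

    # Toy usage with invented frequencies: "jelly" never follows
    # "burnt strawberry" in the corpus, but the bigram and unigram terms
    # still give it a nonzero probability.
    print(smoothed_prob("jelly", "burnt", "strawberry",
                        freq_trigram=lambda t, t2, t1: 0.0,
                        freq_bigram=lambda t, t1: 0.01,
                        freq_unigram=lambda t: 0.001))
    # 0.7*0.0 + 0.2*0.01 + 0.1*0.001 = 0.0021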