Prob(T=t1 ... T[N]=tN | E=e1 ... E[N]=eN)is as large as possible.
For instance, in tagging the sentence "I can fish", we want to arrive somehow at the estimate that
Prob(T=pronoun, T=modal, T=verb | E="I", E="can", E="fish")is large, that
Prob(T=pronoun, T=verbl, T=noun | E="I", E="can", E="fish")is small and that the probability for any other sequence of three tags is tiny or zero.
Prob(T=t1 ... T[N]=tN | E=e1 ... E[N]=eN)is strictly correct. However, it is obviously rather cumbersome, and as we get to more complicated expressions would get impossibly cumbersome. The tradition in the literature of applied probability, which we follow below, is to be very unrigorous, not to say sloppy, about using clearly defined notations, and to write the above expression in the form
Prob(t1 ... tN | e1 ... eN)
Strictly speaking, this is nonsense; an expression like
Prob(pronoun,modal,verb | "I","can","fish")doesn't actually mean anything. It is the responsibility of the reader to keep track of what is actually meant (and the responsibility, not always observed, of the writer to explain clearly what his abbreviated notations actually mean.)
However the denominator above Prob(e1 ... eN) depends only on the sequence of elements, which is given. As this is the same for all choices of tags, we can ignore it in comparing one sequence of tags to another. (This is known as the "normalizing factor". It doesn't affect the relative sizes of the probablities of different sequence of tags; its only effect is to ensure that the sum of the probabilities over all possible sequence of tags adds up to 1.)
So we've re-expressed the problem as finding the sequence of tags that
maximizes the product
Prob(t1 ... tN) * Prob(e1 ... eN| t1 ... tN)
The first term here is the inherent likelihood of the sequence of tags, before we have seen what the elements are. The second is the likelihood that a particular sequence of tags t1 ... tN will be instantiated as the sequence of elements e1 ... eN. For instance, in the speech understanding problem, the first term is the probability that a speaker will speak a given sentence, sometimes called the "linguistic" model; the second is the probability, given the sentence, that he would pronounce it in the way that has been heard, sometimes called the "phonetic" model. (The interpretation of this product in the "part-of-speech" problem is not as natural.) We now have to estimate these two terms using independence assumptions.
We can use the rule for the probability of a conjunction to expand the
second term as follows:
Prob(e1 ... eN| t1 ... tN) =
Prob(e1 | t1 ... tN) * Prob(e2 | e1, t1 ... tN) * Prob(e3 | e1, e2, t1 ... tN) * ... Prob(eN | e1 ... eN-1, t1 ... tN)
We now make the following independence assumption:
Prob(eI | e1 ... eI-1, t1 ... tN) = Prob(eI | tI)
That is, once we have fixed the Ith tag tI, the likelihood of any given element eI is not affected by any preceding element or by any of the other tags. (Also, we assume that the value of the index I doesn't actually enter here; that is, we are interested in the probability that e.g. a verb will be "can", not the probability that a verb which is the second word in the sentence will be "can". This kind of thing is quite difficult to notate formally and precisely, which is one reason that writers don't even aim at a complete formal notation.)
Having made this assumption, we can rewrite the above product in the form
Prob(e1 | t1) * ... *
Prob(eN | tN)
We can now apply Bayes' Law to each of the above factors: Prob(eI | tI) = Prob(tI | eI) * Prob(eI) / Prob(tI)
These are all numbers we can estimate from the corpus.
We now turn to the first term in the original product above:
Prob(t1 ... tN)
Again using the rule for conjunctions, we can write this in the form,
Prob(t1 ... tN) = Prob(t1) * Prob(t2 | t1) * Prob(t3 | t1 t2) * Prob(t4 | t1 t2 t3 ) * ... * Prob(tN | t1 ... tN-1)
We now make the second independence assumption, known as the k-gram
assumption (for some small integer k), that the dependence of
the Ith tag on all the previous tags in fact reduces to its dependence on
the k-1 previous tags.
Prob(tI | t1 ... tI-1) = Prob(tI | tI+1-K ... tI-1)
For instance, with K=3 (the usual case) we get the trigram assumption Prob(tI | t1 ... tI-1) = Prob(tI | tI-2 tI-1)
With K=2 we get the bigram assumption Prob(tI | t1 ... tI-1) = Prob(tI | tI-1)
(Again we mean, not clearly indicated in the notation, that this is the probability of seeing a specified tag following K-1 specified tags at any position in the sentence. That is, e.g. the probability of seeing noun following determiner adjective, not the probability of seeing noun in the fifth position following determiner adjective in the third and fourth.)
In order to get a more uniform notation, we assume that every sentence
of tags is preceded by K-1 occurences of the special tag "*Start*".
We can then write the above product in the form
Prob(tI | tI+1-K ... tI-1)
To simplify the notation, we'll assume from here on down, that we are making the trigram assumption with K=3
Putting all this together we get the formula
Prob(t1 ... tN | e1 ... eN) = ProductI=1N Prob(tI | tI-2 tI-1) * Prob(tI | eI) * Prob(eI) / Prob(tI)
Again, since the eI are fixed, we can ignore the term
Prob(eI) , which is unaffected by the choice of
the tags, and just try to maximize the product
Prob(tI | tI-2 tI-1) *
Prob(tI | eI) /
Unfortunately, there is a problem with the first of these in particular; namely, that some possible triples of tags tI-2 tI-1 tI may actually never occur in the corpus C. This is unlikely in the case where the tags are parts of speech, but is altogether likely in the case where the tags are words. (If the active vocabulary is 10,000 words, then in principle there could be as many as a trillion different trigrams. Now, in fact, the vast majority of these are impossible. Nonetheless, there are many more trigrams that are possible and even not terribly unlikely than will appear in, say, a billion word corpus. That is, if you take a billion word training corpus, and then look at another million words, it is very likely that the new million words will include a substantial fraction of trigrams that never appear in the training corpus.)
The solution that is used for this problem is called smoothing ; We estimate the probability of the trigram as a weighted sum of the trigram frequency, the bigram frequency, and the unigram frequency.
Prob(tI | tI-2 tI-1) =
FreqC(tI | tI-2 tI-1)
FreqC(tI | tI-1)
FreqC(tI | tI-1)
for some fixed weights w1, w2, w3, that add up to 1.0; e.g. w1 = 0.1 w2 = 0.2, w3 =0.7. For instance, we estimate the probability of seeing "jelly" following "burnt strawberry" as 0.7 times the frequency with which "burnt strawberry" is followed by "jello" in the corpus, plus 0.2 times the frequency with which "strawberry" is followed by "jello" in the corpus, plus 0.1 times the frequency of "jello" in the corpus. Thus, we give preference to trigrams that actually occur in the corpus, while admitting the possibility of seeing new trigrams and even new bigrams.