- The elements are words, and the tags are parts of speech. In applications, "part of speech" often involves finer distinctions than the usual linguistic categories. For instance, the named entity recognition problem involves categorizing proper nouns as personal names, organizations, place names, etc.
- The elements are words and the tags are the meanings of the words (lexical disambiguation).
- The elements are phonemes, as analyzed in a recording of speech, and the tags are words.

Prob(T[1]=*t_{1}* ... T[N]=*t_{N}* | E[1]=*e_{1}* ... E[N]=*e_{N}*)

We will abbreviate this as
Prob(*t_{1} ... t_{N}* | *e_{1} ... e_{N}*), meaning "the probability that the
tagging of the sentence is the string of tags *t_{1} ... t_{N}*,
given that the sentence is the string of words *e_{1} ... e_{N}*."

For instance, in tagging the sentence "I can fish", we want to arrive somehow at the estimate that

Prob(<pronoun, modal, verb> | <"I", "can", "fish">)

is large, that

Prob(<pronoun, verb, noun> | <"I", "can", "fish">)

is small, and that the probability for any other sequence of three tags is tiny or zero. To estimate these probabilities, we will need to make some independence assumptions. These assumptions are not strictly valid, but we adopt them for two reasons:

- 1. they enable us to proceed, whereas without them we are stuck;
- 2. the overall system derived using these assumptions actually works pretty well.

By Bayes' Law,

Prob(*t_{1} ... t_{N}* | *e_{1} ... e_{N}*) =
Prob(*t_{1} ... t_{N}*) * Prob(*e_{1} ... e_{N}* | *t_{1} ... t_{N}*) / Prob(*e_{1} ... e_{N}*)

However the denominator above
Prob(*e_{1} ... e_{N}*)
depends only on the sequence of elements, which is given. As this is the
same for all choices of tags, we can ignore it in comparing one sequence
of tags to another. (This is known as the "normalizing factor". It doesn't
affect the relative sizes of the probabilities of different sequences of tags;
its only effect is to ensure that the sum of the probabilities over all
possible sequences of tags adds up to 1.)
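The point that the normalizing factor can safely be ignored can be checked with a small sketch. The scores below are invented purely for illustration; dividing every candidate's score by the same constant cannot change which candidate is best:

```python
# Hypothetical unnormalized scores Prob(tags) * Prob(words | tags)
# for three candidate taggings of the same sentence.
scores = {
    ("pronoun", "modal", "verb"): 0.012,
    ("pronoun", "verb", "noun"): 0.0003,
    ("noun", "noun", "noun"): 0.00001,
}

# Dividing every score by the same normalizing factor Prob(words)
# rescales the values but cannot change which tagging is best.
z = sum(scores.values())
normalized = {tags: s / z for tags, s in scores.items()}

best_raw = max(scores, key=scores.get)
best_norm = max(normalized, key=normalized.get)
assert best_raw == best_norm  # the argmax is unaffected
```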

So we've re-expressed the problem as finding the sequence of tags that
maximizes the product

Prob(*t_{1} ... t_{N}*) * Prob(*e_{1} ... e_{N}* | *t_{1} ... t_{N}*)

The first term here is the inherent likelihood of the sequence of tags, before we have seen what the elements are. The second is the likelihood that this particular sequence of tags would be realized as this particular sequence of elements.

We can use the rule for the probability of a conjunction to expand the
second term as follows:

Prob(*e_{1} ... e_{N}* | *t_{1} ... t_{N}*) =

Prob(*e_{1}* | *t_{1} ... t_{N}*) *
Prob(*e_{2}* | *e_{1}, t_{1} ... t_{N}*) * ... *
Prob(*e_{N}* | *e_{1} ... e_{N-1}, t_{1} ... t_{N}*)

We now make the following independence assumption:

Prob(*e_{I}* | *e_{1} ... e_{I-1}, t_{1} ... t_{N}*) = Prob(*e_{I}* | *t_{I}*)

That is, once we have fixed the Ith tag *t_{I}*, the Ith element *e_{I}* is independent of all the other tags and of the preceding elements.

Having made this assumption, we can rewrite the above product in the form

Prob(*e_{1}* | *t_{1}*) * Prob(*e_{2}* | *t_{2}*) * ... * Prob(*e_{N}* | *t_{N}*)

We can now apply Bayes' Law to each of the above factors:

Prob(*e_{I}* | *t_{I}*) = Prob(*t_{I}* | *e_{I}*) * Prob(*e_{I}*) / Prob(*t_{I}*)

These are all numbers we can estimate from the corpus.
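As a sketch of how these factors could be estimated, the following counts word/tag pairs in a toy tagged corpus (the corpus is invented here purely for illustration) and checks the Bayes identity Prob(e_{I} | t_{I}) = Prob(t_{I} | e_{I}) * Prob(e_{I}) / Prob(t_{I}) numerically:

```python
from collections import Counter

# A toy tagged corpus: (word, tag) pairs. Invented for illustration.
corpus = [
    ("I", "pronoun"), ("can", "modal"), ("fish", "verb"),
    ("I", "pronoun"), ("can", "verb"), ("fish", "noun"),
    ("fish", "noun"), ("can", "modal"), ("swim", "verb"),
]

n = len(corpus)
word_counts = Counter(w for w, t in corpus)
tag_counts = Counter(t for w, t in corpus)
pair_counts = Counter(corpus)

def p_word_given_tag(w, t):
    return pair_counts[(w, t)] / tag_counts[t]

def p_tag_given_word(t, w):
    return pair_counts[(w, t)] / word_counts[w]

def p_word(w):
    return word_counts[w] / n

def p_tag(t):
    return tag_counts[t] / n

# Bayes' Law: Prob(e | t) = Prob(t | e) * Prob(e) / Prob(t)
lhs = p_word_given_tag("fish", "noun")
rhs = p_tag_given_word("noun", "fish") * p_word("fish") / p_tag("noun")
assert abs(lhs - rhs) < 1e-12
```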

We now turn to the first term in the original product above:
Prob(*t_{1} ... t_{N}*)

Again using the rule for conjunctions, we can write this in the form,

Prob(*t_{1}*) * Prob(*t_{2}* | *t_{1}*) * Prob(*t_{3}* | *t_{1} t_{2}*) * ... * Prob(*t_{N}* | *t_{1} ... t_{N-1}*)

We now make the second independence assumption, known as the *K*-gram
assumption (for some small integer *K*): the dependence of
the Ith tag on all the previous tags in fact reduces to its dependence on
the *K-1* previous tags.

Prob(*t_{I}* | *t_{1} ... t_{I-1}*) = Prob(*t_{I}* | *t_{I+1-K} ... t_{I-1}*)

For instance, with *K=2* (the bigram assumption) this becomes Prob(*t_{I}* | *t_{I-1}*): the probability of a given tag depends only on the immediately preceding tag.

With *K=3* (the trigram assumption) it becomes Prob(*t_{I}* | *t_{I-2} t_{I-1}*).

(Again we mean, though it is not clearly indicated in the notation, that this is the probability of seeing a specified tag following *K-1* specified preceding tags.)

In order to get a more uniform notation, we assume that every sentence
of tags is preceded by *K-1* occurrences of the special tag "*Start*".
We can then write the above product in the form

Product_{I=1}^{N} Prob(*t_{I}* | *t_{I+1-K} ... t_{I-1}*)
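The Start padding can be sketched as follows: pad the tag sequence with *K-1* copies of "*Start*" and read off, for each position I, the context of the *K-1* previous tags (the function name and example tags here are illustrative):

```python
def kgram_contexts(tags, k):
    """Pad with k-1 Start tags and return (context, tag) pairs,
    one per position I = 1..N, where context is the k-1 previous tags."""
    padded = ["*Start*"] * (k - 1) + list(tags)
    return [
        (tuple(padded[i : i + k - 1]), padded[i + k - 1])
        for i in range(len(tags))
    ]

# Trigram case (K = 3) for the tagging of "I can fish":
contexts = kgram_contexts(["pronoun", "modal", "verb"], k=3)
# contexts[0] == (("*Start*", "*Start*"), "pronoun")
# contexts[1] == (("*Start*", "pronoun"), "modal")
# contexts[2] == (("pronoun", "modal"), "verb")
```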

To simplify the notation, we'll assume from here on that we are
making the trigram assumption, with *K=3*.

Putting all this together we get the formula

Prob(*t_{1} ... t_{N}* | *e_{1} ... e_{N}*) is proportional to

Product_{I=1}^{N} Prob(*t_{I}* | *t_{I-2} t_{I-1}*) * Prob(*t_{I}* | *e_{I}*) * Prob(*e_{I}*) / Prob(*t_{I}*)

Again, since the *e_{I}* are fixed, we can ignore the terms Prob(*e_{I}*), and find the sequence of tags maximizing

Product_{I=1}^{N} Prob(*t_{I}* | *t_{I-2} t_{I-1}*) * Prob(*t_{I}* | *e_{I}*) / Prob(*t_{I}*)

We can estimate each of these probabilities as the corresponding frequency in the corpus:

Prob(*t_{I}* | *t_{I-2} t_{I-1}*) ≈ Freq_{C}(*t_{I}* | *t_{I-2} t_{I-1}*)
Prob(*t_{I}* | *e_{I}*) ≈ Freq_{C}(*t_{I}* | *e_{I}*)
Prob(*t_{I}*) ≈ Freq_{C}(*t_{I}*)

where "Freq_{C}(X | Y)" is the frequency in the corpus C with which X occurs in the context Y.
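These frequency estimates amount to ratios of counts. A minimal sketch, with a toy corpus of tag sequences invented purely for illustration:

```python
from collections import Counter

# Toy corpus of tag sequences; invented for illustration.
tag_sentences = [
    ["pronoun", "modal", "verb"],
    ["pronoun", "verb", "noun"],
    ["pronoun", "modal", "verb"],
]

trigram_counts = Counter()
bigram_counts = Counter()
for tags in tag_sentences:
    padded = ["*Start*", "*Start*"] + tags
    for i in range(len(tags)):
        bigram_counts[tuple(padded[i : i + 2])] += 1
        trigram_counts[tuple(padded[i : i + 3])] += 1

def freq_trigram(t, t2, t1):
    """Freq_C(t | t2 t1): the fraction of occurrences of the
    bigram t2 t1 that are followed by t."""
    return trigram_counts[(t2, t1, t)] / bigram_counts[(t2, t1)]

# The bigram (pronoun, modal) is followed by "verb" in both of its occurrences:
assert freq_trigram("verb", "pronoun", "modal") == 1.0
```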

Unfortunately, there is a problem with the first of these in particular;
namely, that some possible triples of tags
*t_{I-2} t_{I-1} t_{I}* may actually
never occur in the corpus C. This is unlikely in the case where
the tags are parts of speech, but is altogether likely in the case
where the tags are words. (If the active vocabulary is 10,000 words,
then in principle there could be as many as a trillion different trigrams.
Now, in fact, the vast majority of these are impossible. Nonetheless,
there are many more trigrams that are possible and even not terribly
unlikely than will appear in, say, a billion word corpus. That is, if
you take a billion word training corpus, and then look at another million
words, it is very likely that the new million words will include a substantial
fraction of trigrams that never appear in the training corpus.)

The solution that is used for this problem is called *smoothing*:
we estimate the probability of the trigram as a weighted sum of
the trigram frequency, the bigram frequency, and the unigram frequency.

Prob(*t_{I}* | *t_{I-2} t_{I-1}*) =
λ_{1} Freq_{C}(*t_{I}* | *t_{I-2} t_{I-1}*) + λ_{2} Freq_{C}(*t_{I}* | *t_{I-1}*) + λ_{3} Freq_{C}(*t_{I}*)

for some fixed weights λ_{1}, λ_{2}, λ_{3} that are nonnegative and sum to 1.
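A minimal sketch of the smoothed estimate. The frequencies and weights below are invented for illustration; in practice the weights are tuned, e.g. on held-out data:

```python
def smoothed_trigram_prob(freq_tri, freq_bi, freq_uni,
                          lambda1=0.6, lambda2=0.3, lambda3=0.1):
    """Weighted sum of trigram, bigram, and unigram frequencies.
    The weights here are illustrative; they must sum to 1."""
    assert abs(lambda1 + lambda2 + lambda3 - 1.0) < 1e-12
    return lambda1 * freq_tri + lambda2 * freq_bi + lambda3 * freq_uni

# Even if the trigram never occurs in the corpus (freq_tri == 0.0),
# the smoothed estimate stays positive as long as the shorter-context
# frequencies do.
p = smoothed_trigram_prob(freq_tri=0.0, freq_bi=0.05, freq_uni=0.02)
assert p > 0.0
```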