- The elements are words, and the tags are parts of speech. In applications, "part of speech" often involves finer distinctions than the usual linguistic categories. For instance, the named entity recognition problem involves categorizing proper nouns as personal names, organizations, place names, etc.
- The elements are words and the tags are the meanings of the words (lexical disambiguation).
- The elements are phonemes, as analyzed in a recording of speech, and the tags are words.

Prob(T[1]=*t_{1}* ... T[N]=*t_{N}* | E[1]=*e_{1}* ... E[N]=*e_{N}*)

We will abbreviate this as
Prob(*t_{1} ... t_{N}* | *e_{1} ... e_{N}*), meaning "the probability that the
tagging of the sentence is the string of tags *t_{1} ... t_{N}*,
given that the sentence is the string of words *e_{1} ... e_{N}*."

For instance, in tagging the sentence "I can fish", we want to arrive somehow at the estimate that

Prob(<pronoun, modal, verb> | <"I", "can", "fish">)

is large, that

Prob(<pronoun, verb, noun> | <"I", "can", "fish">)

is small, and that the probability for any other sequence of three tags is tiny or zero. To estimate these probabilities, we will need to make some independence assumptions. These assumptions are not strictly valid, but we adopt them for two reasons:

- 1. they enable us to proceed, whereas without them we are stuck;
- 2. the overall system derived using these assumptions actually works pretty well.

By Bayes' Law,

Prob(*t_{1} ... t_{N}* | *e_{1} ... e_{N}*) =
Prob(*t_{1} ... t_{N}*) * Prob(*e_{1} ... e_{N}* | *t_{1} ... t_{N}*) / Prob(*e_{1} ... e_{N}*)

However the denominator above
Prob(*e_{1} ... e_{N}*)
depends only on the sequence of elements, which is given. As this is the
same for all choices of tags, we can ignore it in comparing one sequence
of tags to another. (This is known as the "normalizing factor". It doesn't
affect the relative sizes of the probabilities of different sequences of tags;
its only effect is to ensure that the sum of the probabilities over all
possible sequences of tags adds up to 1.)
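The point that the normalizing factor can safely be ignored can be checked with a small sketch. The scores below are invented purely for illustration; dividing every candidate's score by the same constant cannot change which candidate is best:

```python
# Hypothetical unnormalized scores Prob(tags) * Prob(words | tags)
# for three candidate taggings of the same sentence.
scores = {
    ("pronoun", "modal", "verb"): 0.012,
    ("pronoun", "verb", "noun"): 0.0003,
    ("noun", "noun", "noun"): 0.00001,
}

# Dividing every score by the same normalizing factor Prob(words)
# rescales the values but cannot change which tagging is best.
z = sum(scores.values())
normalized = {tags: s / z for tags, s in scores.items()}

best_raw = max(scores, key=scores.get)
best_norm = max(normalized, key=normalized.get)
assert best_raw == best_norm  # the argmax is unaffected
```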

So we've re-expressed the problem as finding the sequence of tags that
maximizes the product

Prob(*t_{1} ... t_{N}*) * Prob(*e_{1} ... e_{N}* | *t_{1} ... t_{N}*)

The first term here is the inherent likelihood of the sequence of tags, before we have seen what the elements are. The second is the likelihood that this particular sequence of tags would be realized as this particular sequence of elements.

We can use the rule for the probability of a conjunction to expand the
second term as follows:

Prob(*e_{1} ... e_{N}* | *t_{1} ... t_{N}*) =

Prob(*e_{1}* | *t_{1} ... t_{N}*) *
Prob(*e_{2}* | *e_{1}, t_{1} ... t_{N}*) * ... *
Prob(*e_{N}* | *e_{1} ... e_{N-1}, t_{1} ... t_{N}*)

We now make the following independence assumption:

Prob(*e_{I}* | *e_{1} ... e_{I-1}, t_{1} ... t_{N}*) = Prob(*e_{I}* | *t_{I}*)

That is, once we have fixed the Ith tag *t_{I}*, the Ith element *e_{I}* is independent of all the other tags and of the preceding elements.

Having made this assumption, we can rewrite the above product in the form

Prob(*e_{1}* | *t_{1}*) * Prob(*e_{2}* | *t_{2}*) * ... * Prob(*e_{N}* | *t_{N}*)

We can now apply Bayes' Law to each of the above factors:

Prob(*e_{I}* | *t_{I}*) = Prob(*t_{I}* | *e_{I}*) * Prob(*e_{I}*) / Prob(*t_{I}*)

These are all numbers we can estimate from the corpus.
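As a sketch of how these factors could be estimated, the following counts word/tag pairs in a toy tagged corpus (the corpus is invented here purely for illustration) and checks the Bayes identity Prob(e_{I} | t_{I}) = Prob(t_{I} | e_{I}) * Prob(e_{I}) / Prob(t_{I}) numerically:

```python
from collections import Counter

# A toy tagged corpus: (word, tag) pairs. Invented for illustration.
corpus = [
    ("I", "pronoun"), ("can", "modal"), ("fish", "verb"),
    ("I", "pronoun"), ("can", "verb"), ("fish", "noun"),
    ("fish", "noun"), ("can", "modal"), ("swim", "verb"),
]

n = len(corpus)
word_counts = Counter(w for w, t in corpus)
tag_counts = Counter(t for w, t in corpus)
pair_counts = Counter(corpus)

def p_word_given_tag(w, t):
    return pair_counts[(w, t)] / tag_counts[t]

def p_tag_given_word(t, w):
    return pair_counts[(w, t)] / word_counts[w]

def p_word(w):
    return word_counts[w] / n

def p_tag(t):
    return tag_counts[t] / n

# Bayes' Law: Prob(e | t) = Prob(t | e) * Prob(e) / Prob(t)
lhs = p_word_given_tag("fish", "noun")
rhs = p_tag_given_word("noun", "fish") * p_word("fish") / p_tag("noun")
assert abs(lhs - rhs) < 1e-12
```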

We now turn to the first term in the original product above:
Prob(*t_{1} ... t_{N}*)

Again using the rule for conjunctions, we can write this in the form,

Prob(*t_{1}*) * Prob(*t_{2}* | *t_{1}*) * Prob(*t_{3}* | *t_{1} t_{2}*) * ... * Prob(*t_{N}* | *t_{1} ... t_{N-1}*)

We now make the second independence assumption, known as the *K*-gram
assumption (for some small integer *K*): the dependence of
the Ith tag on all the previous tags in fact reduces to its dependence on
the *K-1* previous tags.

Prob(*t_{I}* | *t_{1} ... t_{I-1}*) = Prob(*t_{I}* | *t_{I+1-K} ... t_{I-1}*)

For instance, with *K=2* (the bigram assumption) this becomes Prob(*t_{I}* | *t_{I-1}*): the probability of a given tag depends only on the immediately preceding tag.

With *K=3* (the trigram assumption) it becomes Prob(*t_{I}* | *t_{I-2} t_{I-1}*).

(Again we mean, though it is not clearly indicated in the notation, that this is the probability of seeing a specified tag following *K-1* specified preceding tags.)

In order to get a more uniform notation, we assume that every sentence
of tags is preceded by *K-1* occurrences of the special tag "*Start*".
We can then write the above product in the form

Product_{I=1}^{N} Prob(*t_{I}* | *t_{I+1-K} ... t_{I-1}*)
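The Start padding can be sketched as follows: pad the tag sequence with *K-1* copies of "*Start*" and read off, for each position I, the context of the *K-1* previous tags (the function name and example tags here are illustrative):

```python
def kgram_contexts(tags, k):
    """Pad with k-1 Start tags and return (context, tag) pairs,
    one per position I = 1..N, where context is the k-1 previous tags."""
    padded = ["*Start*"] * (k - 1) + list(tags)
    return [
        (tuple(padded[i : i + k - 1]), padded[i + k - 1])
        for i in range(len(tags))
    ]

# Trigram case (K = 3) for the tagging of "I can fish":
contexts = kgram_contexts(["pronoun", "modal", "verb"], k=3)
# contexts[0] == (("*Start*", "*Start*"), "pronoun")
# contexts[1] == (("*Start*", "pronoun"), "modal")
# contexts[2] == (("pronoun", "modal"), "verb")
```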

To simplify the notation, we'll assume from here on that we are
making the trigram assumption, with *K=3*.

Putting all this together we get the formula

Prob(*t_{1} ... t_{N}* | *e_{1} ... e_{N}*) is proportional to

Product_{I=1}^{N} Prob(*t_{I}* | *t_{I-2} t_{I-1}*) * Prob(*t_{I}* | *e_{I}*) * Prob(*e_{I}*) / Prob(*t_{I}*)

Again, since the *e_{I}* are fixed, we can ignore the terms Prob(*e_{I}*), and find the sequence of tags maximizing

Product_{I=1}^{N} Prob(*t_{I}* | *t_{I-2} t_{I-1}*) * Prob(*t_{I}* | *e_{I}*) / Prob(*t_{I}*)

We can estimate each of these probabilities as the corresponding frequency in the corpus:

Prob(*t_{I}* | *t_{I-2} t_{I-1}*) ≈ Freq_{C}(*t_{I}* | *t_{I-2} t_{I-1}*)
Prob(*t_{I}* | *e_{I}*) ≈ Freq_{C}(*t_{I}* | *e_{I}*)
Prob(*t_{I}*) ≈ Freq_{C}(*t_{I}*)

where "Freq_{C}(X | Y)" is the frequency in the corpus C with which X occurs in the context Y.
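These frequency estimates amount to ratios of counts. A minimal sketch, with a toy corpus of tag sequences invented purely for illustration:

```python
from collections import Counter

# Toy corpus of tag sequences; invented for illustration.
tag_sentences = [
    ["pronoun", "modal", "verb"],
    ["pronoun", "verb", "noun"],
    ["pronoun", "modal", "verb"],
]

trigram_counts = Counter()
bigram_counts = Counter()
for tags in tag_sentences:
    padded = ["*Start*", "*Start*"] + tags
    for i in range(len(tags)):
        bigram_counts[tuple(padded[i : i + 2])] += 1
        trigram_counts[tuple(padded[i : i + 3])] += 1

def freq_trigram(t, t2, t1):
    """Freq_C(t | t2 t1): the fraction of occurrences of the
    bigram t2 t1 that are followed by t."""
    return trigram_counts[(t2, t1, t)] / bigram_counts[(t2, t1)]

# The bigram (pronoun, modal) is followed by "verb" in both of its occurrences:
assert freq_trigram("verb", "pronoun", "modal") == 1.0
```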

Unfortunately, there is a problem with the first of these in particular;
namely, that some possible triples of tags
*t_{I-2} t_{I-1} t_{I}* may actually
never occur in the corpus C. This is unlikely in the case where
the tags are parts of speech, but is altogether likely in the case
where the tags are words. (If the active vocabulary is 10,000 words,
then in principle there could be as many as a trillion different trigrams.
Now, in fact, the vast majority of these are impossible. Nonetheless,
there are many more trigrams that are possible and even not terribly
unlikely than will appear in, say, a billion word corpus. That is, if
you take a billion word training corpus, and then look at another million
words, it is very likely that the new million words will include a substantial
fraction of trigrams that never appear in the training corpus.)

The solution that is used for this problem is called *smoothing*:
we estimate the probability of the trigram as a weighted sum of
the trigram frequency, the bigram frequency, and the unigram frequency.

Prob(*t_{I}* | *t_{I-2} t_{I-1}*) =
λ_{1} Freq_{C}(*t_{I}* | *t_{I-2} t_{I-1}*) + λ_{2} Freq_{C}(*t_{I}* | *t_{I-1}*) + λ_{3} Freq_{C}(*t_{I}*)

for some fixed weights λ_{1}, λ_{2}, λ_{3} that are nonnegative and sum to 1.
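A minimal sketch of the smoothed estimate. The frequencies and weights below are invented for illustration; in practice the weights are tuned, e.g. on held-out data:

```python
def smoothed_trigram_prob(freq_tri, freq_bi, freq_uni,
                          lambda1=0.6, lambda2=0.3, lambda3=0.1):
    """Weighted sum of trigram, bigram, and unigram frequencies.
    The weights here are illustrative; they must sum to 1."""
    assert abs(lambda1 + lambda2 + lambda3 - 1.0) < 1e-12
    return lambda1 * freq_tri + lambda2 * freq_bi + lambda3 * freq_uni

# Even if the trigram never occurs in the corpus (freq_tri == 0.0),
# the smoothed estimate stays positive as long as the shorter-context
# frequencies do.
p = smoothed_trigram_prob(freq_tri=0.0, freq_bi=0.05, freq_uni=0.02)
assert p > 0.0
```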