So how do we evaluate Prob(C|D)? The multinomial naive Bayes method proceeds as follows:
First, as before, Prob(C|D) = Prob(D|C) Prob(C)/Prob(D).
Second, let T1, T2 ... Tm be the sequence of
lexical terms in D. We will
assume that the occurrence of a term Ti in the Ith place of D depends
only on the category C and, given C, is conditionally
independent of all the other terms in D and of the position I.
Therefore
Prob(D|C) = Prob(T1|C)*Prob(T2|C) * ... * Prob(Tm|C)
where Prob(Ti|C) means "the probability that a randomly chosen word
token in a document of type C is the word Ti".
Third, we estimate Prob(Ti|C) and Prob(C)
using the training set.
Prob(Ti|C) is estimated as the relative frequency
of Ti in the training documents of category C:
(number of occurrences of Ti in training documents of category C) /
(total number of word tokens in training documents of category C).
Prob(C) is estimated as the fraction of documents in the training set that
are of category C.
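To make the estimates concrete (the numbers here are made up purely for illustration): suppose that the training documents of category SPORTS contain 10,000 word tokens in all, that the word "goal" occurs among them 30 times, and that 50 of the 200 training documents are of category SPORTS. Then Prob("goal"|SPORTS) is estimated as 30/10,000 = 0.003, and Prob(SPORTS) is estimated as 50/200 = 0.25.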
Fourth, as usual in naive Bayes, we can ignore the denominator Prob(D), since it is independent of the value of C. Since we are given D and are just trying to compare different categories, we need not calculate Prob(D). (Prob(D) is known as a normalizing factor; its function is to ensure that the probabilities add up to 1.)
We can now calculate the product
Prob(Ci)*Prob(T1|Ci)*Prob(T2|Ci) * ... * Prob(Tm|Ci) for each
category Ci and choose the category that maximizes this.
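As an illustration, here is a short Python sketch of the procedure just described. It assumes, purely for the sake of the example, that the training set is represented as a list of (category, list-of-word-tokens) pairs; the function and variable names are not part of the method, only labels for this sketch.

from collections import Counter, defaultdict

def train_multinomial(training_set):
    # Estimate Prob(C) and Prob(Ti|C) as relative frequencies in the training set.
    doc_count = Counter()               # number of training documents per category
    word_count = defaultdict(Counter)   # word_count[c][t] = occurrences of t in category c
    total_words = Counter()             # total word tokens per category
    for category, words in training_set:
        doc_count[category] += 1
        for w in words:
            word_count[category][w] += 1
            total_words[category] += 1
    n_docs = sum(doc_count.values())
    prob_c = {c: doc_count[c] / n_docs for c in doc_count}
    prob_t_given_c = {c: {t: word_count[c][t] / total_words[c] for t in word_count[c]}
                      for c in word_count}
    return prob_c, prob_t_given_c

def classify_multinomial(document, prob_c, prob_t_given_c):
    # Choose the category maximizing Prob(C)*Prob(T1|C)*...*Prob(Tm|C).
    best_category, best_score = None, -1.0
    for c in prob_c:
        score = prob_c[c]
        for t in document:                          # every token, repetitions included
            score *= prob_t_given_c[c].get(t, 0.0)  # a word never seen with C gives 0
        if score > best_score:
            best_category, best_score = c, score
    return best_category

# Example use:
# training_set = [("SPORTS", ["the", "goal", "was", "scored"]),
#                 ("POLITICS", ["the", "vote", "was", "held"])]
# prob_c, prob_t_given_c = train_multinomial(training_set)
# classify_multinomial(["the", "goal"], prob_c, prob_t_given_c)   # -> "SPORTS"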
An alternative version of the method uses only the set of distinct terms appearing in D, rather than the full sequence of tokens. It proceeds as follows:
First, as before, Prob(C|D) = Prob(D|C) Prob(C)/Prob(D).
Second, let { T1, T2 ... Tm } be the set of distinct
lexical terms in D. We will
assume that the appearance
of a term Ti in D depends only on the category C and, given C,
is conditionally independent of the appearance of all the other terms in D. Therefore
Prob(D|C) = Prob(T1|C)*Prob(T2|C) * ... * Prob(Tm|C)
(where Prob(Ti|C)
means "the probability that a document of category C contains at least
one occurrence of Ti").
Third, we estimate Prob(Ti|C) and Prob(C) using the training set.
Prob(Ti|C) is estimated as the fraction of training documents of category C that contain
Ti.
The rest of the algorithm proceeds as above.
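Under the same illustrative assumptions about how the training set is represented, a parallel sketch for this second version might look as follows (again, the names are only labels for the sketch).

from collections import Counter, defaultdict

def train_set_of_words(training_set):
    # Prob(Ti|C) is estimated as the fraction of training documents of category C
    # that contain Ti at least once.
    doc_count = Counter()                    # documents per category
    docs_containing = defaultdict(Counter)   # docs_containing[c][t] = docs of c containing t
    for category, words in training_set:
        doc_count[category] += 1
        for t in set(words):                 # each distinct term counted once per document
            docs_containing[category][t] += 1
    n_docs = sum(doc_count.values())
    prob_c = {c: doc_count[c] / n_docs for c in doc_count}
    prob_t_given_c = {c: {t: docs_containing[c][t] / doc_count[c] for t in docs_containing[c]}
                      for c in docs_containing}
    return prob_c, prob_t_given_c

def classify_set_of_words(document, prob_c, prob_t_given_c):
    # As above, but each distinct term of the document is used exactly once.
    best_category, best_score = None, -1.0
    for c in prob_c:
        score = prob_c[c]
        for t in set(document):
            score *= prob_t_given_c[c].get(t, 0.0)
        if score > best_score:
            best_category, best_score = c, score
    return best_category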
The stochastic generative model corresponding to this procedure generates documents by a random process. The question then becomes: given a document generated in this way, what is the conditional probability that it belongs to a given category?
As in the discussion of naive Bayes generally as a stochastic generative model, the generative model here consists of two levels of roulette wheels.
Given the particular structure of these roulette wheels, the assignment of probabilities that makes the training set most likely is the one that sets each probability equal to the corresponding observed frequency in the training set.
First, you spin a roulette wheel to choose the value v of the classification attribute with probability Prob(X.C=v).
You now arbitrarily choose the number k of words in the document to be generated. For each category v you have a separate roulette wheel, with all possible words as values, where the probability of obtaining word u is Prob(X.T=u|X.C=v). You spin that roulette wheel k times. That gives you all the words in the document. You put them in random order, and there's your document.
(The fact that the number of words, k, is an arbitrary rather than a probabilistic choice corresponds to the fact that we are not using document length as a feature in categorization. Of course, one could, and that might often be reasonable -- some kinds of documents are typically longer than others. In that case, there would have to be another roulette wheel here for choosing k.)
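In the same illustrative Python as above, the two-level process for this first model might be sketched as follows; random.choices plays the role of a roulette wheel, and k is supplied arbitrarily by the caller, as in the text.

import random

def generate_document(k, prob_c, prob_t_given_c):
    # First wheel: choose the category v with probability Prob(X.C=v).
    categories = list(prob_c)
    v = random.choices(categories, weights=[prob_c[c] for c in categories])[0]
    # Second wheel (one per category): spin the word wheel for v a total of k times.
    words = list(prob_t_given_c[v])
    weights = [prob_t_given_c[v][w] for w in words]
    tokens = random.choices(words, weights=weights, k=k)
    random.shuffle(tokens)     # put the words in random order
    return v, tokens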
The generative model for the second version differs in the second step. For each category v and for each word w in the language you have a separate roulette wheel with two values, TRUE and FALSE, where the probability of TRUE is the probability that a document of category v contains at least one occurrence of w. Having chosen the category v, you spin the wheel for every word w; if it comes up TRUE, you include the word w in the document at least once (how many times to include it is, again, an arbitrary choice).
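A corresponding sketch for the second version, in which there is one TRUE/FALSE wheel per (category, word) pair; the probability of TRUE is taken to be the estimated probability that a document of that category contains the word at least once, and an included word is (arbitrarily) repeated just once.

import random

def generate_document_set_version(prob_c, prob_t_given_c):
    # First wheel: choose the category v with probability Prob(X.C=v).
    categories = list(prob_c)
    v = random.choices(categories, weights=[prob_c[c] for c in categories])[0]
    # One TRUE/FALSE wheel per word: include w if its wheel comes up TRUE.
    words = []
    for w, p in prob_t_given_c[v].items():
        if random.random() < p:    # the wheel comes up TRUE
            words.append(w)        # include w at least once (here, exactly once)
    random.shuffle(words)
    return v, words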