Naive Bayes for Classifying Text

Naive Bayes: Multinomial model

Suppose you have a collection of documents D1 ... Dn labelled as belonging to categories C1 ... Ck. A new document D is given for you to categorize. In a probabilistic approach, we are looking for the category C such that Prob(C|D) is maximal.

So how do we evaluate Prob(C|D)? The multinomial naive Bayes method proceeds as follows:

First, as before, Prob(C|D) = Prob(D|C) Prob(C)/Prob(D).

Second, let T1, T2 ... Tm be the sequence of lexical terms in D. We will assume that the occurrence of a term T in the I-th place of D depends only on the category C and, given C, is conditionally independent of all the other terms in D and of the position I. Therefore
Prob(D|C) = Prob(T1|C)*Prob(T2|C) * ... * Prob(Tm|C)
where Prob(Ti|C) means "the probability that a randomly chosen word token in a document of type C is the word Ti".

Third, we estimate Prob(Ti|C) and Prob(C) using the training set.
Prob(Ti|C) is estimated as the relative frequency of Ti in documents of category C =
(number of occurrences of Ti in documents of category C) / (total number of word tokens in documents of category C).
Prob(C) is estimated as the fraction of documents in the training set that are of category C.
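
As an illustrative sketch (the toy training set and variable names below are invented for the example, not part of the text), both estimates come from simple counting over the training set:

```python
from collections import Counter, defaultdict

# Toy training set (invented for illustration): each document is a list of word tokens.
training_set = [
    (["cheap", "pills", "buy", "now"], "spam"),
    (["meeting", "agenda", "attached"], "ham"),
    (["buy", "cheap", "meds", "now"], "spam"),
    (["lunch", "meeting", "tomorrow"], "ham"),
]

category_doc_counts = Counter()     # number of training documents per category
word_counts = defaultdict(Counter)  # word_counts[c][t] = occurrences of t in documents of category c
total_words = Counter()             # total number of word tokens per category

for tokens, category in training_set:
    category_doc_counts[category] += 1
    total_words[category] += len(tokens)
    word_counts[category].update(tokens)

def prob_category(c):
    # Prob(C): fraction of training documents that are of category c.
    return category_doc_counts[c] / sum(category_doc_counts.values())

def prob_term_given_category(t, c):
    # Prob(Ti|C): relative frequency of t among the word tokens of category c.
    return word_counts[c][t] / total_words[c]
```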

Fourth, as usual in Naive Bayes, we can ignore the denominator Prob(D), since it is independent of the value of C. Since we are given D and are just trying to compare different categories, we need not calculate Prob(D). (This is known as a normalizing factor; its function is to ensure that the probabilities add up to 1.)

We can now calculate the product
Prob(Ci)*Prob(T1|Ci)*Prob(T2|Ci) * ... * Prob(Tm|Ci) for each category Ci and choose the category that maximizes this.
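
Continuing the counting sketch above, here is one way this final comparison might look in code. Two practical liberties are taken that the text does not mention: the product is computed as a sum of logs to avoid numerical underflow on long documents, and add-one (Laplace) smoothing keeps a single unseen word from forcing the whole product to zero.

```python
import math

def classify_multinomial(tokens, vocabulary_size, smoothing=1.0):
    # Choose the category Ci that maximizes
    #   Prob(Ci) * Prob(T1|Ci) * ... * Prob(Tm|Ci),
    # computed here as a sum of logs.
    best_category, best_score = None, float("-inf")
    for c in category_doc_counts:
        score = math.log(prob_category(c))
        for t in tokens:
            # Add-one smoothing: an unseen word gets a small nonzero probability
            # instead of zeroing out the entire product.
            p = (word_counts[c][t] + smoothing) / (total_words[c] + smoothing * vocabulary_size)
            score += math.log(p)
        if score > best_score:
            best_category, best_score = c, score
    return best_category

vocabulary = {t for tokens, _ in training_set for t in tokens}
print(classify_multinomial(["buy", "cheap", "pills"], len(vocabulary)))  # expected: "spam"
```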

Bernoulli model

A variant of this is the Bernoulli model, as follows:

First, as before, Prob(C|D) = Prob(D|C) Prob(C)/Prob(D).

Second, let { T1, T2 ... Tm } be the set of distinct lexical terms in D. We will assume that the appearance of a term Ti depends only on the category C and is independent of all the other terms in D. Therefore
Prob(D|C) = Prob(T1|C)*Prob(T2|C) * ... * Prob(Tm|C)
(where Prob(Ti|C) means "the probability that a document of category C contains at least one occurrence of Ti").

Third, we estimate Prob(Ti|C) and Prob(C) using the training set.
Prob(Ti|C) is estimated as the fraction of documents of category C that contain Ti.

The rest of the algorithm proceeds as above.
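
A corresponding sketch of the Bernoulli variant, reusing the toy training set and the Prob(C) estimate from the multinomial sketch above; the counts here are document frequencies, D is reduced to its set of distinct terms, and smoothing is again an added assumption. Following the text, only the terms that actually appear in D are scored.

```python
import math
from collections import Counter, defaultdict

# doc_freq[c][t] = number of training documents of category c containing t at least once.
doc_freq = defaultdict(Counter)
for tokens, category in training_set:
    doc_freq[category].update(set(tokens))

def classify_bernoulli(tokens, smoothing=1.0):
    distinct_terms = set(tokens)   # D reduced to its set of distinct lexical terms
    best_category, best_score = None, float("-inf")
    for c in category_doc_counts:
        score = math.log(prob_category(c))
        for t in distinct_terms:
            # Prob(Ti|C): fraction of category-c documents containing Ti (smoothed).
            p = (doc_freq[c][t] + smoothing) / (category_doc_counts[c] + 2 * smoothing)
            score += math.log(p)
        if score > best_score:
            best_category, best_score = c, score
    return best_category

print(classify_bernoulli(["buy", "cheap", "pills"]))  # expected: "spam"
```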

Stochastic generative model

The stochastic generative model corresponding to this procedure generates documents by a random process. The question then becomes, given a document of this kind, what is the conditional probability that it belongs to a given category?

As in the discussion of Naive Bayes generally as a stochastic generative model, the stochastic generative model here consists of two levels of roulette wheels.

Given the particular structure of roulette wheels, the assignment of probabilities that makes the training set most likely is to make each probability equal to the observed frequency in the training set.
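
As a brief sketch of why (the notation n_i, p_i below is mine, not the text's): for a single roulette wheel whose i-th outcome was observed n_i times in the training set, the probabilities that maximize the likelihood of those observations solve

```latex
\max_{p_1,\dots,p_k}\; \sum_{i=1}^{k} n_i \log p_i
\quad\text{subject to}\quad \sum_{i=1}^{k} p_i = 1 ,
```

and the Lagrange condition n_i / p_i = lambda, combined with the constraint, gives p_i = n_i / (n_1 + ... + n_k): the observed frequency.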

Stochastic generative model for the multinomial model

First, you spin a roulette wheel to choose the value v of the classification attribute with probability Prob(X.C=v).

You now arbitrarily choose the number k of words in the document to be generated. For each category v you have a separate roulette wheel, with all possible words as values, where the probability of obtaining word u is Prob(X.T=u|X.C=v). You spin that roulette wheel k times. That gives you all the words in the document. You put them in random order, and there's your document.

(The fact that the number of words, k, is an arbitrary rather than a probabilistic choice corresponds to the fact that we're not using document length as a feature in categorization. Of course, one could, and that in fact might often be reasonable -- some kinds of documents are typically longer than others. In that case, there would have to be another roulette wheel here.)
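
A minimal sketch of this two-level process, reusing the Prob(C) and Prob(T|C) estimates from the multinomial sketch above (the function name is mine):

```python
import random

def generate_multinomial_document(k):
    # Top-level roulette wheel: choose the category v with probability Prob(X.C=v).
    categories = list(category_doc_counts)
    v = random.choices(categories, weights=[prob_category(c) for c in categories], k=1)[0]

    # Second-level roulette wheel for category v, spun k times: each spin
    # produces word u with probability Prob(X.T=u | X.C=v).
    vocab = list(word_counts[v])
    weights = [prob_term_given_category(u, v) for u in vocab]
    words = random.choices(vocab, weights=weights, k=k)

    random.shuffle(words)   # put the words in random order
    return v, words

print(generate_multinomial_document(5))
```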

Stochastic generative model for the Bernoulli model

The top-level roulette wheel, as before, is used to choose the classification category v with probability Prob(X.C=v).

Now, for each category v and for each word w in the language you have a separate roulette wheel, with two values: TRUE and FALSE. Having chosen the category v, for every word w you spin the roulette wheel. If it comes up TRUE, you include the word w in the document at least once -- the number of times to include it is arbitrary.
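
A matching sketch for the Bernoulli generative process, reusing the document-frequency counts and vocabulary from the earlier sketches (again, names are illustrative):

```python
import random

def generate_bernoulli_document(language_vocabulary):
    # Top-level roulette wheel: choose the category v with probability Prob(X.C=v).
    categories = list(category_doc_counts)
    v = random.choices(categories, weights=[prob_category(c) for c in categories], k=1)[0]

    # One TRUE/FALSE roulette wheel per word w, given category v.
    document = []
    for w in language_vocabulary:
        p_true = doc_freq[v][w] / category_doc_counts[v]   # fraction of v-documents containing w
        if random.random() < p_true:
            # w appears at least once; how many times is arbitrary (here: once).
            document.append(w)
    return v, document

print(generate_bernoulli_document(vocabulary))
```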