Naive Bayes for Classifying Text

Naive Bayes

Suppose you have a collection of documents D1 ... Dn labelled as belonging to categories C1 ... Ck. A new document D is given for you to categorize. In a probabilistic approach, we are looking for the category C such that Prob(C|D) is maximal.

So how do we evaluate Prob(C|D)? The multinomial naive Bayes method proceeds as follows:

First, as before, Prob(C|D) = Prob(D|C) Prob(C)/Prob(D).

Second, let T1, T2 ... Tm be the sequence of lexical terms in D. We will assume that the occurrence of a term T in the Ith place of D depends only on the category C and, given C, is conditionally independent of all the other terms in D and of the position I. Therefore
Prob(D|C) = Prob(T1|C)*Prob(T2|C) * ... * Prob(Tm|C)
where Prob(Ti|C) means "the probability that a randomly chosen word token in a document of category C is the word Ti".

Third, we estimate Prob(Ti|C) and Prob(C) using the training set.
Prob(Ti|C) is estimated as the relative frequency of Ti in documents of category C = (number of occurrences of Ti in C) / (total number of words in C).
Prob(C) is estimated as the fraction of documents in the training set that are of category C.

Fourth, note that Prob(D) is independent of the category C. Since we are given D and are just trying to compare different categories, we need not calculate Prob(D). (Prob(D) is known as a normalizing factor; its function is to ensure that the probabilities add up to 1.)

We can now calculate the product
Prob(Ci)*Prob(T1|Ci)*Prob(T2|Ci) * ... * Prob(Tm|Ci) for each category Ci and choose the category that maximizes this.
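
As an illustration, here is a minimal sketch of this procedure in Python. The function names and the (tokens, category) input format are invented for the example, and the sketch uses the raw relative frequencies as described above; a realistic implementation would also apply the log and Laplacian modifications discussed below.

from collections import Counter, defaultdict

def train_multinomial(labeled_docs):
    # labeled_docs: list of (tokens, category) pairs, where tokens is a list of words
    doc_count = Counter()               # number of training documents per category
    word_count = defaultdict(Counter)   # word_count[c][w] = occurrences of w in category c
    total_words = Counter()             # total number of word tokens per category
    for tokens, c in labeled_docs:
        doc_count[c] += 1
        for w in tokens:
            word_count[c][w] += 1
            total_words[c] += 1
    n_docs = sum(doc_count.values())
    prior = {c: doc_count[c] / n_docs for c in doc_count}    # Prob(C)
    return prior, word_count, total_words

def classify_multinomial(tokens, prior, word_count, total_words):
    best_c, best_score = None, -1.0
    for c in prior:
        score = prior[c]                                     # Prob(Ci)
        for w in tokens:                                     # * Prob(T1|Ci) * ... * Prob(Tm|Ci)
            score *= word_count[c][w] / total_words[c]
        if score > best_score:
            best_c, best_score = c, score
    return best_c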

Bernoulli model

A variant of this is the Bernoulli model, as follows:

First, as before, Prob(C|D) = Prob(D|C) Prob(C)/Prob(D).

Second, let { T1, T2 ... Tm } be the set of distinct lexical terms in D. We will assume that the appearance of a term Ti depends only on the category C and is independent of all the other terms in D. Therefore
Prob(D|C) = Prob(T1|C)*Prob(T2|C) * ... * Prob(Tm|C) (where Prob(Ti|C) means "the probability that a document of category C contains at least one occurrence of Ti").

Third, we estimate Prob(Ti|C) and Prob(C) using the training set.
Prob(Ti|C) is estimated as the fraction of documents in C that contain Ti.

The rest of the algorithm proceeds as above.
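
A corresponding sketch of the Bernoulli training step, under the same hypothetical (tokens, category) input format; classification then proceeds exactly as in the multinomial sketch, using set(tokens) in place of tokens.

from collections import Counter, defaultdict

def train_bernoulli(labeled_docs):
    doc_count = Counter()               # number of training documents per category
    containing = defaultdict(Counter)   # containing[c][w] = number of docs of category c containing w
    for tokens, c in labeled_docs:
        doc_count[c] += 1
        for w in set(tokens):           # each distinct term counted at most once per document
            containing[c][w] += 1
    n_docs = sum(doc_count.values())
    prior = {c: doc_count[c] / n_docs for c in doc_count}    # Prob(C)
    return prior, containing, doc_count

# Prob(Ti|C) is then estimated as containing[c][w] / doc_count[c].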

Modifications to the algorithms

Using logs
Actually computing this product with standard floating point will soon get into problems of underflow. Instead, one computes the logarithm:
log(Prob(Ci)*Prob(T1|Ci)*Prob(T2|Ci) * ... * Prob(Tm|Ci)) =
log(Prob(Ci)) + log(Prob(T1|Ci)) + log(Prob(T2|Ci)) + ... + log(Prob(Tm|Ci))

This transformation shows that the classifier returned by Naive Bayes is a linear discriminator; it corresponds to a hyperplane division under a vector model of documents (a single hyperplane for two categories, a collection of hyperplanes for multiple categories).
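
In code, the scoring loop of the multinomial sketch above could be rewritten in log space as follows. This assumes a smoothed estimate prob(w, c) that is never zero, as in the next subsection; otherwise log(0) would be undefined.

import math

def log_score(tokens, c, prior, prob):
    # prob(w, c) is assumed to return a smoothed, nonzero estimate of Prob(w|C)
    score = math.log(prior[c])
    for w in tokens:
        score += math.log(prob(w, c))
    return score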

Laplacian correction
If a document contains a term T that does not occur in any of the documents in category C, then Prob(T|C) will be estimated as 0. Then the product Prob(Ci)*Prob(T1|Ci)*Prob(T2|Ci) * ... * Prob(Tm|Ci) will be equal to 0, no matter how much other evidence there is favoring Ci.

The usual solution is to do a Laplacian correction by upping all the counts by 1. That is, for multinomial Naive Bayes,
let CW be the total number of occurrences of word W in documents of category C.
Let V be the number of different words in the training set (the size of the vocabulary).
Let |C| be the sum of the lengths of all the documents in the category C.
Then estimate Prob(W|C) as (1+CW)/(|C|+V).
Note that the sum over W of Prob(W|C) is then equal to 1, as it should be.
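
In terms of the multinomial sketch above, the smoothed estimate might look like this (vocab_size is the vocabulary size V, computed once from the training set):

def smoothed_prob_multinomial(w, c, word_count, total_words, vocab_size):
    # (1 + CW) / (|C| + V): CW = occurrences of w in category c,
    # |C| = total number of word tokens in c, V = vocabulary size
    return (1 + word_count[c][w]) / (total_words[c] + vocab_size)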

In the Bernoulli model, let CW be the number of documents in C that contain at least one occurrence of W.
Let V be, as before, the number of different words in the training set.
Let |C| be the number of documents in the category C.
Then estimate Prob(W|C) as (1+CW)/(|C|+V).
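
And the corresponding smoothed estimate for the Bernoulli sketch:

def smoothed_prob_bernoulli(w, c, containing, doc_count, vocab_size):
    # (1 + CW) / (|C| + V): CW = number of documents of category c containing w,
    # |C| = number of documents of category c, V = vocabulary size
    return (1 + containing[c][w]) / (doc_count[c] + vocab_size)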

The Laplacian correction can give strongly counter-intuitive results when applied to a data set with a small number of features and few instances of some of the feature values, but that situation does not arise in this application.

Features of Naive Bayes

Running time:
Offline (learning the classifier): O(max(size of the corpus, number of different words * number of categories)); in practice, this is almost always O(size of the corpus).

Online (applying the classifier to a document): O(length of document * number of categories).

Accuracy: Poor probability estimates but generally good classification decisions. That is, it generally makes the right guess, but is far too confident of its decision. It is not among the highest-quality classifiers.

Robust under noise: Small changes in the texts or the inclusion of a small number of anomalous documents lead to small changes in the probability estimates.

Use when: Comparatively short running time is more important than extreme accuracy.
A very large amount of labelled text is available, so better results can be obtained from a crude algorithm applied to a large body of examples than from a better algorithm applied to a data set that must be much smaller because of constraints on running time.