So how do we evaluate Prob(C|D)? The multinomial naive Bayes method proceeds as follows:

First, as before, Prob(C|D) = Prob(D|C) Prob(C)/Prob(D).

Second, let T1, T2 ... Tm be the sequence of
lexical terms in D. We will
assume that the occurrence of a term T in the Ith place of D depends
depends only on the category C and, given C, is conditionally
independent of all the other terms in D and of the position I.
Therefore

Prob(D|C) = Prob(T1|C)*Prob(T2|C) * ... * Prob(Tm|C)

where Prob(Ti|C) means "the probability that a randomly chosen word
token in C is the word Ti".

Third, we estimate Prob(Ti|C) and Prob(C) using the training set.

Prob(Ti|C) is estimated as the relative frequency
of Ti in documents of category C = (number of occurrences of Ti in C) /
(total number of words in C).

Prob(C) is estimated as the fraction of documents in the training set that
are of category C.

Fourth, note that Prob(D) is independent of the category C. Since we are
given D and are just trying to compare different categories, we need
not calculate Prob(D). (This is known as a * normalizing factor*)
its function is to ensure that the probabilities add up to 1.

We can now calculate the product

Prob(Ci)*Prob(T1|Ci)*Prob(T2|Ci) * ... * Prob(Tm|Ci) for each
category Ci and choose the category that maximizes this.

First, as before, Prob(C|D) = Prob(D|C) Prob(C)/Prob(D).

Second, let { T1, T2 ... Tm } be the set of * distinct *
lexical terms in D. We will
assume that the * appearance *
of a term Ti depends only on the category C and
is independent of all the other terms in D. Therefore

Prob(D|C) = Prob(T1|C)*Prob(T2|C) * ... * Prob(Tm|C) (where Prob(Ti|C)
means "the probability that a document of category C contains at least
one occurrence of Ti").

Third, we estimate Prob(Ti|C) and Prob(C) using the training set.

Prob(Ti|C) is estimated as the fraction of documents in C that contain
Ti.

The rest of the algorithm proceeds as above.

** Using logs **

Actually computing this product with standard floating
point will soon get into problems of underflow. Instead, one computes
the logarithm:

log(Prob(Ci)*Prob(T1|Ci)*Prob(T2|Ci) * ... * Prob(Tm|Ci)) =

log(Prob(Ci)) + log(Prob(T1|Ci)) + log(Prob(T2|Ci)) + ... + log(Prob(Tm|Ci))

This transformation shows that the classifier returned by Naive Bayes is a linear discrimination; corresponds to a hyper-plane division under a vector model of documents (a single hyperplane for two categories, a collection of hyperplanes for multiple categories).

** Laplacian correction **.

If document contains a term T that
does not occur in any of the documents in category C, then Prob(T|C) will
be estimated as 0. Then the product
Prob(Ci)*Prob(T1|Ci)*Prob(T2|Ci) * ... * Prob(Tm|Ci) will be equal to 0,
no matter how much other evidence there is favoring Ci.

The usual solution is to do a Laplacian correction by upping all the counts by i
1. That is, for multinomial Naive Bayes,

let C_{W} be the total number of occurrences of word W
in documents of category C.

Let V be the number of different words in D.

Let |C| be the sum of the lengths of all the documents in the category C.

Then estimate Prob(W|C) as (1+C_{W})/(|C|+V).

Note that the sum over W of Prob(W|D) is equal to 1, as it should be.

In the Bernoulli model, let C_{W} be the number of documents in C
that contain at least one occurrence of W.

Let V be the number of different words in D.

Let |C| be the number of documents in the category C.

Then estimate Prob(W|C) as (1+C_{W})/(|C|+V).

The Laplacian correction can give strongly counter-intuitive results when applied to a data set with a small number of features and a small number of total instances of some values of some of the features, but in this application that does not arise.

Offline (learning the classifier): O(max(size of the corpus, number of different words * number of categories)); in practice, this is almost always O(size of the corpus).

Online (applying the classifier to a document): O(length of document * number of categories).

** Accuracy: **
Poor probability estimates but generally good classification decisions. That
is, it generally makes the right guess, but is way over-confident of its
decision. It is not among the highest quality classifiers.

** Robust under noise: ** Small changes in the texts or inclusion of a small
number of anomolous documents lead to small changes in the probability
estimates.

** Use when: ** Comparatively short running time is more important than
extreme accuracy.

Very large amount of labelled text is available, so better results can be
gotten from a crude algorithm over a large body of examples than a better
algorithm over a data set that must be much smaller because of constraints
of running time.