Reading: Manning et al. chap 13-15, Chakrabarti chap. 5.
Given: A set of text documents and a set of categories.
Task: Categorize the documents.
Manual categorization
(Manning p. 322). Using human domain knowledge
and some experimentation, craft a query with high accuracy. Cites experiment
with 94% recall and 84% precision over 675 categories on Reuters news articles.
2 man days of work per category.
Note: Two functions here. The classifier maps a document to a category and the learning technique maps a labelled collection (training set) to a classifier.
There are many kinds of classifiers: linear separators, non-linear separators, nearest neighbors, sets of features, rule sets, etc. A category of classifiers is sometimes (not in Manning) known as a version space. Any practical learning technique outputs classifiers within some particular version space.
So how do we evaluate Prob(C|D)? The multinomial naive Bayes method proceeds as follows:
First, as before, Prob(C|D) = Prob(D|C) Prob(C)/Prob(D).
Second, let T1, T2 ... Tm be the sequence of
lexical terms in D. We will
assume that the occurrence of a term T in the Ith place of D
depends only on the category C and, given C, is conditionally
independent of all the other terms in D and of the position I.
Therefore
Prob(D|C) = Prob(T1|C)*Prob(T2|C) * ... * Prob(Tm|C)
where Prob(Ti|C) means "the probability that a randomly chosen word
token in C is the word Ti".
Third, we estimate Prob(Ti|C) and Prob(C) using the training set.
Prob(Ti|C) is estimated as the relative frequency
of Ti in documents of category C = (number of occurrences of Ti in C) /
(total number of words in C).
Prob(C) is estimated as the fraction of documents in the training set that
are of category C.
Fourth, note that Prob(D) is independent of the category C. Since we are given D and are just trying to compare different categories, we need not calculate Prob(D). (Prob(D) is known as a normalizing factor; its function is to ensure that the probabilities add up to 1.)
We can now calculate the product
Prob(Ci)*Prob(T1|Ci)*Prob(T2|Ci) * ... * Prob(Tm|Ci) for each
category Ci and choose the category that maximizes this.
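The four steps above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function names and the training-set format (a list of `(tokens, category)` pairs) are assumptions made for the example, and it uses the raw relative-frequency estimates with a direct product, so it has no protection against zero counts or underflow.

```python
from collections import Counter, defaultdict
from math import prod

def train_multinomial(training_set):
    """training_set: list of (list_of_tokens, category) pairs (assumed format).
    Assumes every category contains at least one token."""
    doc_counts = Counter()              # number of training documents per category
    term_counts = defaultdict(Counter)  # term_counts[c][t] = occurrences of t in category c
    for tokens, category in training_set:
        doc_counts[category] += 1
        term_counts[category].update(tokens)
    n_docs = sum(doc_counts.values())
    # Prob(C): fraction of training documents that are of category C.
    prior = {c: n / n_docs for c, n in doc_counts.items()}
    # Prob(T|C): occurrences of T in C divided by total number of words in C.
    cond = {}
    for c, counts in term_counts.items():
        total = sum(counts.values())
        cond[c] = {t: n / total for t, n in counts.items()}
    return prior, cond

def classify_multinomial(tokens, prior, cond):
    """Return the category maximizing Prob(C) * Prob(T1|C) * ... * Prob(Tm|C)."""
    scores = {c: prior[c] * prod(cond[c].get(t, 0.0) for t in tokens)
              for c in prior}
    return max(scores, key=scores.get), scores

training = [
    (["ball", "goal", "ball"], "sports"),
    (["goal", "team"], "sports"),
    (["vote", "team", "vote"], "politics"),
]
prior, cond = train_multinomial(training)
label, scores = classify_multinomial(["goal", "team"], prior, cond)
```

Here "goal" never occurs in the politics documents, so the politics score is exactly 0 and the document is assigned to sports.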
The Bernoulli model is similar. First, as before, Prob(C|D) = Prob(D|C) Prob(C)/Prob(D).
Second, let { T1, T2 ... Tm } be the set of distinct
lexical terms in D. We will
assume that the appearance
of a term Ti depends only on the category C and
is independent of all the other terms in D. Therefore
Prob(D|C) = Prob(T1|C)*Prob(T2|C) * ... * Prob(Tm|C) (where Prob(Ti|C)
means "the probability that a document of category C contains at least
one occurrence of Ti").
Third, we estimate Prob(Ti|C) and Prob(C) using the training set.
Prob(Ti|C) is estimated as the fraction of documents in C that contain
Ti.
The rest of the algorithm proceeds as above.
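A minimal Python sketch of this document-frequency variant follows. The function names and the `(tokens, category)` training format are assumptions for illustration. It follows the simplified description above, multiplying only over the distinct terms that appear in D; fuller treatments of the Bernoulli model also multiply in a factor for each vocabulary term that is absent from D, which this sketch omits.

```python
from collections import Counter, defaultdict
from math import prod

def train_bernoulli(training_set):
    """training_set: list of (list_of_tokens, category) pairs (assumed format)."""
    doc_counts = Counter()           # number of documents per category
    doc_freq = defaultdict(Counter)  # doc_freq[c][t] = documents of c containing t
    for tokens, category in training_set:
        doc_counts[category] += 1
        doc_freq[category].update(set(tokens))  # count each term once per document
    n_docs = sum(doc_counts.values())
    prior = {c: n / n_docs for c, n in doc_counts.items()}
    # Prob(T|C): fraction of documents in C that contain T.
    cond = {c: {t: n / doc_counts[c] for t, n in doc_freq[c].items()}
            for c in doc_counts}
    return prior, cond

def score_bernoulli(tokens, prior, cond):
    """Score each category using the set of distinct terms in the document."""
    distinct = set(tokens)
    return {c: prior[c] * prod(cond[c].get(t, 0.0) for t in distinct)
            for c in prior}

training = [
    (["ball", "goal", "ball"], "sports"),
    (["goal", "team"], "sports"),
    (["vote", "team", "vote"], "politics"),
]
prior, cond = train_bernoulli(training)
scores = score_bernoulli(["goal", "goal", "team"], prior, cond)
```

Note that the repeated "goal" in the test document contributes only once, which is the main difference from the multinomial model.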
Using logs
Computing this product directly in standard floating
point quickly runs into underflow. Instead, one computes
the logarithm:
log(Prob(Ci)*Prob(T1|Ci)*Prob(T2|Ci) * ... * Prob(Tm|Ci)) =
log(Prob(Ci)) + log(Prob(T1|Ci)) + log(Prob(T2|Ci)) + ... + log(Prob(Tm|Ci))
This transformation shows that the classifier returned by Naive Bayes is a linear discriminator; it corresponds to a hyperplane division under a vector model of documents (a single hyperplane for two categories, a collection of hyperplanes for multiple categories).
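The underflow problem is easy to reproduce. The sketch below uses made-up numbers (100 terms, each with probability 1e-4) and a hypothetical helper `log_score`; it assumes every probability is strictly positive, since log(0) is undefined.

```python
import math

def log_score(prior_c, term_probs):
    """log(Prob(C)) + sum of log(Prob(Ti|C)); all inputs must be > 0."""
    return math.log(prior_c) + sum(math.log(p) for p in term_probs)

# Illustrative values only: a 100-word document, each word with Prob(Ti|C) = 1e-4.
probs = [1e-4] * 100

# The true product is 0.5e-400, far below the smallest positive double,
# so the direct product underflows to 0.0.
product = 0.5 * math.prod(probs)

# The log-space score is an ordinary number of moderate size (about -921.7).
score = log_score(0.5, probs)
```

Because the logarithm is monotonic, choosing the category with the largest log score picks the same category as choosing the largest product would.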
Laplacian correction
If a document contains a term T that
does not occur in any of the documents in category C, then Prob(T|C) will
be estimated as 0. Then the product
Prob(Ci)*Prob(T1|Ci)*Prob(T2|Ci) * ... * Prob(Tm|Ci) will be equal to 0,
no matter how much other evidence there is favoring Ci.
The usual solution is to do a Laplacian correction by increasing all the counts by
1. That is, for multinomial Naive Bayes,
let CW be the total number of occurrences of word W
in documents of category C.
Let V be the number of different words in the vocabulary (that is, in the whole training set).
Let |C| be the sum of the lengths of all the documents in the category C.
Then estimate Prob(W|C) as (1+CW)/(|C|+V).
Note that the sum over W of Prob(W|C) is equal to 1, as it should be.
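A small sketch of the multinomial correction, with a hypothetical function name and toy data. It takes V to be the number of distinct words in the whole vocabulary, which is the reading under which the estimates sum to 1.

```python
from collections import Counter

def laplace_multinomial(category_tokens, vocabulary):
    """Estimate Prob(W|C) = (1 + CW) / (|C| + V) for every W in the vocabulary.

    category_tokens: all word tokens from the documents of category C.
    vocabulary: set of all distinct words (assumed to cover category_tokens).
    """
    counts = Counter(category_tokens)  # CW for each word W
    size = len(category_tokens)        # |C|: total length of documents in C
    v = len(vocabulary)                # V: number of distinct words
    return {w: (1 + counts[w]) / (size + v) for w in vocabulary}

vocabulary = {"ball", "goal", "team", "vote"}
probs = laplace_multinomial(["ball", "goal", "ball", "goal", "team"], vocabulary)
# "vote" never occurs in the category but still gets probability 1/9, not 0.
```

With these toy counts the estimates are 3/9, 3/9, 2/9 and 1/9, which add up to 1.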
In the Bernoulli model, let CW be the number of documents in C
that contain at least one occurrence of W.
Let |C| be the number of documents in the category C.
Then estimate Prob(W|C) as (1+CW)/(|C|+2). (The denominator adds 2 rather than V because, in this model, each word has only two possible outcomes per document: present or absent.)
The Laplacian correction can give strongly counter-intuitive results when applied to a data set with a small number of features and a small number of total instances of some values of some of the features, but in this application that does not arise.
Online (applying the classifier to a document): O(length of document * number of categories).
Accuracy: Poor probability estimates but generally good classification decisions. That is, it generally makes the right guess, but is way over-confident of its decision. It is not among the highest quality classifiers.
Robust under noise: Small changes in the texts or inclusion of a small number of anomalous documents lead to small changes in the probability estimates.
Use when: Comparatively short running time is more important than
extreme accuracy.
Very large amount of labelled text is available, so better results can be
gotten from a crude algorithm over a large body of examples than a better
algorithm over a data set that must be much smaller because of constraints
of running time.
These measures are generally used in a greedy way; that is, the features that have the top rankings in terms of these measures are selected. This does not necessarily give the best set of features, because features can be correlated. That is, F1 and F2 may both have high mutual information with the categories; but there may be no point in using both F1 and F2 as features, because the additional information provided by F2 is small. (Consider, as an extreme case, the case where F2=F1.)
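The weakness of greedy selection can be seen in a toy example. The sketch below uses mutual information as the ranking measure; the function names, feature names and data are all made up for illustration.

```python
from collections import Counter
from math import log2

def mutual_information(feature_values, labels):
    """Mutual information (in bits) between a discrete feature and the category."""
    n = len(labels)
    joint = Counter(zip(feature_values, labels))
    f_counts = Counter(feature_values)
    l_counts = Counter(labels)
    # Sum over observed (feature value, label) pairs of p(f,l) * log2(p(f,l) / (p(f) p(l))).
    return sum((n_fl / n) * log2(n_fl * n / (f_counts[f] * l_counts[l]))
               for (f, l), n_fl in joint.items())

def top_k_features(features, labels, k):
    """Greedy selection: rank each feature on its own and keep the best k."""
    ranked = sorted(features,
                    key=lambda name: mutual_information(features[name], labels),
                    reverse=True)
    return ranked[:k]

labels = ["a", "a", "b", "b"]
features = {
    "F1": [1, 1, 0, 0],  # perfectly predicts the category
    "F2": [1, 1, 0, 0],  # exact copy of F1, so it adds no new information
    "F3": [1, 0, 0, 0],  # weaker, but not redundant with F1
}
selected = top_k_features(features, labels, 2)
```

Greedy ranking selects F1 and F2, even though F2 is an exact copy of F1 and so contributes nothing once F1 has been chosen.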
Running time
Offline (computing centroids): Linear in the size of the training set.
Online (applying classifier): O(size of D * number of categories)
Same as naive Bayes.
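A minimal sketch of a centroid classifier, assuming documents are represented as sparse vectors (dictionaries from term to weight) and that Euclidean distance is the measure used; the function names and toy data are illustrative. For simplicity the distance loops over the union of terms in the document and the centroid, so this sketch does not itself achieve the online bound quoted above; an implementation that does would typically precompute centroid norms and work with dot products over the terms of D.

```python
import math
from collections import defaultdict

def centroid(vectors):
    """Average of a list of sparse vectors (dicts term -> weight)."""
    total = defaultdict(float)
    for vec in vectors:
        for term, weight in vec.items():
            total[term] += weight
    return {term: weight / len(vectors) for term, weight in total.items()}

def distance(u, v):
    """Euclidean distance between two sparse vectors."""
    return math.sqrt(sum((u.get(t, 0.0) - v.get(t, 0.0)) ** 2
                         for t in set(u) | set(v)))

def train_centroids(training_set):
    """Offline step: one pass over the training set, one centroid per category."""
    by_category = defaultdict(list)
    for vec, category in training_set:
        by_category[category].append(vec)
    return {c: centroid(vs) for c, vs in by_category.items()}

def classify(vec, centroids):
    """Online step: assign the category whose centroid is closest."""
    return min(centroids, key=lambda c: distance(vec, centroids[c]))

training = [
    ({"ball": 1.0, "goal": 1.0}, "sports"),
    ({"goal": 1.0}, "sports"),
    ({"vote": 1.0}, "politics"),
]
centroids = train_centroids(training)
label = classify({"goal": 1.0}, centroids)
```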
Single category
If you are trying to distinguish a single category C as opposed to "everything
else", then the centroid of "everything else" may not be particularly
meaningful. An alternative here is to compute the centroid X of the labelled
documents in C, and choose a threshold T. A document D is classified as
in C if dist(D,X) < T. The threshold T can be chosen by computing
dist(D,X) for all documents in the collection, and then choosing T so that
the number of false positives and false negatives in the training set
are equal (or, more generally, so that the total cost associated with the
false positives and false negatives is minimized). Note that this is now
a spherical discriminator rather than a linear one.
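One way to pick the threshold under the cost-minimizing criterion is to try each observed distance as a candidate. The sketch below assumes the distances dist(D,X) have already been computed for the labelled documents; the function name, the cost parameters and the numbers are illustrative.

```python
def choose_threshold(distances, in_category, fp_cost=1.0, fn_cost=1.0):
    """Return (T, cost) minimizing fp_cost * #false positives + fn_cost * #false negatives.

    distances[i] is dist(D_i, X); in_category[i] is True if D_i is labelled as in C.
    A document is classified as in C when its distance is strictly less than T.
    """
    best_t, best_cost = None, float("inf")
    # Only thresholds at observed distances (plus "accept everything") change the outcome.
    for t in sorted(set(distances)) + [float("inf")]:
        false_pos = sum(1 for d, c in zip(distances, in_category) if d < t and not c)
        false_neg = sum(1 for d, c in zip(distances, in_category) if d >= t and c)
        cost = fp_cost * false_pos + fn_cost * false_neg
        if cost < best_cost:  # strict comparison keeps the smallest T on ties
            best_t, best_cost = t, cost
    return best_t, best_cost

distances = [0.2, 0.4, 0.5, 0.9, 1.1]
in_category = [True, True, False, True, False]
t_equal, cost_equal = choose_threshold(distances, in_category)
t_skewed, cost_skewed = choose_threshold(distances, in_category, fn_cost=5.0)
```

With equal costs the sketch settles on T = 0.5 (one false negative); when false negatives are five times as costly it moves out to T = 1.1 and accepts one false positive instead.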
For classifying into several well-defined categories and a general "other", find the centroids of the categories and set a threshold. A document is classified as category C if it is within T of the centroid of C and closer to the centroid of C than any other category.
Choice of K: Various strategies. K=1 is too susceptible to noise. Usually K= a small odd number (3,5,7) but people have used K=50,100.
Implementation:
Offline:
Compute the coordinate of each word in the training set,
and save the training set in an inverted index.
Online:
{ for (each document DT in training set) DOTPROD[DT] := 0;
for (each word W in D)
for (each document DT that contains W)
DOTPROD[DT] += coordinate(W,D) * coordinate(W,DT);
NNS := K documents with maximal value of DOTPROD;
choose category by voting among NNS.
}
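The pseudocode above might be rendered in Python roughly as follows. This is a sketch under assumptions: documents are sparse vectors (dictionaries from word to coordinate) that have already been weighted and normalized, so that a larger dot product means a nearer neighbor; the function names and toy data are illustrative.

```python
from collections import Counter, defaultdict

def build_index(training_vectors):
    """Offline: inverted index mapping each word to (doc_id, coordinate) postings."""
    index = defaultdict(list)
    for doc_id, vec in enumerate(training_vectors):
        for word, coord in vec.items():
            index[word].append((doc_id, coord))
    return index

def knn_classify(query_vec, index, labels, k):
    """Online: accumulate dot products via the index, then vote among the top k."""
    dotprod = defaultdict(float)
    for word, q_coord in query_vec.items():
        for doc_id, d_coord in index.get(word, []):
            dotprod[doc_id] += q_coord * d_coord
    # Training documents sharing no word with the query have dot product 0
    # and never enter dotprod, so fewer than k neighbors may be returned.
    nearest = sorted(dotprod, key=dotprod.get, reverse=True)[:k]
    if not nearest:
        return None
    votes = Counter(labels[doc_id] for doc_id in nearest)
    return votes.most_common(1)[0][0]

training_vectors = [
    {"ball": 1.0, "goal": 1.0},
    {"goal": 1.0, "team": 1.0},
    {"vote": 1.0, "team": 1.0},
    {"vote": 1.0},
]
labels = ["sports", "sports", "politics", "politics"]
index = build_index(training_vectors)
label = knn_classify({"goal": 1.0, "team": 0.5}, index, labels, k=3)
```

The inner loops touch exactly the postings for the words of the query document, which is where the online cost of O(sum over words W in D of the number of documents containing W) comes from.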
Running time:
Offline (constructing the classifier): Linear. Time to construct the inverted
index.
Online (applying the classifier): O(sum over (word W in D) (number of documents
containing W)). Very slow. There exist exact algorithms that are faster,
but these require tremendous amounts of memory.
There exist approximate algorithms that are faster and have reasonable memory demands, some of which are used effectively. See "Nearest Neighbors in High-dimensional Space" by Piotr Indyk.
Noise sensitive: kNN is sensitive to noise, especially for small K.
Very slow online running time.
If there exists a linear separator for a category C, then it can be found using linear programming. If there does not exist a linear separator, then the problem of finding a separator that misclassifies the minimum number of items is NP-complete.
"No one ever got fired for using an SVM" (Manning p. 307)
Suppose that hyperplane P is a linear separator for a labelled data set D. The margin of P is the minimal distance from any element in D to P. The SVM of D is the hyperplane P with maximal margin.
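The definition of the margin can be made concrete with a few lines of Python. The sketch assumes the hyperplane is written as W·X = T for a weight vector W and a threshold T; the function name and the example points are illustrative.

```python
import math

def margin(w, t, points):
    """Minimal distance from any point to the hyperplane w . x = t."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(abs(sum(wi * xi for wi, xi in zip(w, x)) - t) / norm
               for x in points)

# Two points on the line x = 0 and two on the line x = 3 (one group per class).
points = [(0.0, 0.0), (0.0, 1.0), (3.0, 0.0), (3.0, 1.0)]

m_centered = margin((1.0, 0.0), 1.5, points)  # separator x = 1.5, halfway between
m_offset = margin((1.0, 0.0), 1.0, points)    # separator x = 1.0, closer to one group
```

Both hyperplanes separate the two groups, but the one at x = 1.5 has margin 1.5 while the one at x = 1.0 has margin only 1.0, so the former is preferred.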
The SVM generally predicts unseen points more accurately than other linear classifiers, because it requires that any future point classified as C must be quite far from any point labelled as not C. (There are also more sophisticated and cogent arguments for this conclusion.)
With a little algebra (see Manning p. 311-312), one can express the problem of finding the SVM as a problem of minimizing a quadratic function subject to linear constraints. (Essentially, the distance squared is a quadratic function on W and T and the constraint that all the points are correctly categorized is a set of linear constraints on W and T, though it takes a little munging to make this come out neatly.) This is a well-known optimization problem.
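For reference, the standard form of this optimization problem can be written as below. The notation is an assumption made for this sketch: the hyperplane is W·X = T, the training points are X_i, and y_i is +1 if X_i is labelled as in C and -1 otherwise. Manning states the same problem with a different sign convention for the threshold.

```latex
\min_{W,\,T} \;\; \tfrac{1}{2}\,\lVert W \rVert^{2}
\qquad \text{subject to} \qquad
y_i \,( W \cdot X_i - T ) \;\ge\; 1 \quad \text{for every training point } (X_i, y_i)
```

Under this scaling the closest points satisfy the constraint with equality and lie at distance 1/||W|| from the hyperplane, so minimizing ||W||^2 is the same as maximizing the margin.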
Measures of quality: For one category, precision/recall. Cost/benefit analysis: Percentage of false positives * cost of false positive + percentage of false negatives * cost of false negative. These costs are often not known precisely, and may change with time and circumstance.
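These single-category measures are straightforward to compute from labelled test results. In the sketch below the function name, the default costs and the toy data are illustrative.

```python
def evaluate_binary(actual, predicted, fp_cost=1.0, fn_cost=1.0):
    """actual / predicted: parallel lists of booleans (True = in the category)."""
    tp = sum(1 for a, p in zip(actual, predicted) if a and p)
    fp = sum(1 for a, p in zip(actual, predicted) if not a and p)
    fn = sum(1 for a, p in zip(actual, predicted) if a and not p)
    n = len(actual)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # Fraction of false positives times its cost plus fraction of false negatives times its cost.
    expected_cost = (fp / n) * fp_cost + (fn / n) * fn_cost
    return precision, recall, expected_cost

actual    = [True, True, True,  False, False, False, False, False]
predicted = [True, True, False, True,  False, False, False, False]
precision, recall, cost = evaluate_binary(actual, predicted, fp_cost=1.0, fn_cost=4.0)
```

Here there are two true positives, one false positive and one false negative, giving precision and recall of 2/3 each and an expected cost of 1/8 + 4/8 = 0.625 per document.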
For multiple exclusive categories, the situation is murkier. Overall accuracy is one option. A confusion matrix M[I,J] shows number of items actually of category I that are classified as J. This gives a good overall sense of where the system is going wrong, but not a single number that can be used for comparison. Again cost/benefit analysis can be used if there is a known cost associated with each kind of error.
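A confusion matrix and the overall accuracy derived from it can be sketched as follows. The matrix is stored sparsely as a counter keyed by (actual, predicted) pairs; the function names and data are illustrative.

```python
from collections import Counter

def confusion_matrix(actual, predicted):
    """M[(I, J)] = number of items actually of category I that are classified as J."""
    return Counter(zip(actual, predicted))

def accuracy(matrix):
    """Fraction of items on the diagonal of the confusion matrix."""
    total = sum(matrix.values())
    correct = sum(n for (i, j), n in matrix.items() if i == j)
    return correct / total

actual    = ["a", "a", "a", "b", "b", "c"]
predicted = ["a", "a", "b", "b", "b", "a"]
m = confusion_matrix(actual, predicted)
acc = accuracy(m)
```

In this toy example the off-diagonal entries show one "a" mistaken for "b" and the single "c" mistaken for "a", detail that the overall accuracy of 4/6 hides.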
Information theoretic measures such as cross entropy. If you know the classification, how many additional bits on average are needed to specify the actual category, in an efficient coding scheme?
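One reading of the question above is the conditional entropy of the actual category given the assigned one. The sketch below computes that quantity in bits from empirical counts; treating it as the intended measure is an assumption, and the function name and data are illustrative.

```python
from collections import Counter
from math import log2

def conditional_entropy(actual, predicted):
    """H(actual | predicted) in bits, estimated from empirical counts."""
    n = len(actual)
    joint = Counter(zip(actual, predicted))
    pred_counts = Counter(predicted)
    # -sum over (i, j) of p(i, j) * log2 p(i | j)
    return -sum((c / n) * log2(c / pred_counts[j])
                for (i, j), c in joint.items())

# A perfect classifier leaves nothing more to specify: 0 bits.
h_perfect = conditional_entropy(["a", "a", "b", "b"], ["a", "a", "b", "b"])

# A classifier that always answers "a" on a balanced two-category set
# leaves a full bit of uncertainty about the actual category.
h_constant = conditional_entropy(["a", "a", "b", "b"], ["a", "a", "a", "a"])
```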
Statistical measure: If the classification were the true distribution, what is the likelihood of this pattern of error?
Ideally, you would use each test set only once. That's not usually possible, but people do treat their test sets like gold; they are taken out of the closet very rarely and the human experimenters are careful not to actually look at them.
Experimental comparisons of ML techniques vary widely from one published paper to the next. The effectiveness of techniques depends delicately on features of the data set and specifics of how the learning is carried out.