Lecture 7: Classification

IR textbook

There is an extremely fine textbook for IR on the Web: Introduction to Information Retrieval by Christopher Manning, Prabhakar Raghavan, and Hinrich Schutze, Cambridge U. Press, 2008

Text Classification

Reading: Manning et al. chap 13-15, Chakrabarti chap. 5.

Given: A set of text documents and a set of categories.
Task: Categorize the documents.

Sample classification tasks (Manning)

Manual categorization
(Manning p. 322). Using human domain knowledge and some experimentation, craft a query with high accuracy. Manning cites an experiment that achieved 94% recall and 84% precision over 675 categories of Reuters news articles, at a cost of about 2 person-days of work per category.

Supervised Learning

Supervised learning. You have a corpus of documents correctly (presumably manually) labelled by category. Use ML techniques to learn a classifier which is a function that maps a document to a category.

Note: Two functions here. The classifier maps a document to a category and the learning technique maps a labelled collection (training set) to a classifier.

Classifiers come in many different kinds: linear separators, non-linear separators, nearest-neighbor classifiers, sets of features, rule sets, etc. A class of classifiers is sometimes (though not in Manning) known as a version space. Any practical learning technique outputs classifiers within some particular version space.

Naive Bayes

Suppose you have a collection of documents D1 ... Dn labelled as belonging to categories C1 ... Ck. A new document D is given for you to categorize. In a probabilistic approach, we are looking for the category C such that Prob(C|D) is maximal.

So how do we evaluate Prob(C|D)? The multinomial naive Bayes method proceeds as follows:

First, as before, Prob(C|D) = Prob(D|C) Prob(C)/Prob(D).

Second, let T1, T2 ... Tm be the sequence of lexical terms in D. We will assume that the occurrence of a term T in the I-th place of D depends only on the category C and, given C, is conditionally independent of all the other terms in D and of the position I. Therefore
Prob(D|C) = Prob(T1|C)*Prob(T2|C) * ... * Prob(Tm|C)
where Prob(Ti|C) means "the probability that a randomly chosen word token in C is the word Ti".

Third, we estimate Prob(Ti|C) and Prob(C) using the training set.
Prob(Ti|C) is estimated as the relative frequency of Ti in documents of category C = (number of occurrences of Ti in C) / (total number of words in C).
Prob(C) is estimated as the fraction of documents in the training set that are of category C.

Fourth, note that Prob(D) is independent of the category C. Since we are given D and are just trying to compare different categories, we need not calculate Prob(D). (Prob(D) is known as a normalizing factor; its function is to ensure that the probabilities add up to 1.)

We can now calculate the product
Prob(Ci)*Prob(T1|Ci)*Prob(T2|Ci) * ... * Prob(Tm|Ci) for each category Ci and choose the category that maximizes this.
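Here is a minimal Python sketch of this procedure. The data layout and the function names (train, classify) are invented for illustration; no log transform or smoothing is applied yet, since those modifications are discussed below.

from collections import Counter, defaultdict

def train(labelled_docs):
    # labelled_docs: list of (list_of_words, category) pairs
    prior = Counter(cat for _, cat in labelled_docs)      # document counts per category
    word_counts = defaultdict(Counter)                    # word-token counts per category
    for words, cat in labelled_docs:
        word_counts[cat].update(words)
    return prior, word_counts, len(labelled_docs)

def classify(words, prior, word_counts, n_docs):
    best_cat, best_score = None, -1.0
    for cat in prior:
        total = sum(word_counts[cat].values())            # total word tokens in category
        score = prior[cat] / n_docs                       # Prob(C)
        for w in words:
            score *= word_counts[cat][w] / total          # Prob(Ti|C), unsmoothed
        if score > best_score:
            best_cat, best_score = cat, score
    return best_cat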

Bernoulli model

A variant of this is the Bernoulli model, as follows:

First, as before, Prob(C|D) = Prob(D|C) Prob(C)/Prob(D).

Second, let { T1, T2 ... Tm } be the set of distinct lexical terms in D. We will assume that the appearance of a term Ti depends only on the category C and is independent of all the other terms in D. Therefore
Prob(D|C) = Prob(T1|C)*Prob(T2|C) * ... * Prob(Tm|C) (where Prob(Ti|C) means "the probability that a document of category C contains at least one occurrence of Ti").

Third, we estimate Prob(Ti|C) and Prob(C) using the training set.
Prob(Ti|C) is estimated as the fraction of documents in C that contain Ti.

The rest of the algorithm proceeds as above.
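For contrast, a minimal sketch of the Bernoulli estimate of Prob(Ti|C), using the same kind of toy input as the multinomial sketch above (names again invented for illustration):

from collections import Counter, defaultdict

def bernoulli_estimates(labelled_docs):
    # labelled_docs: list of (list_of_words, category) pairs
    doc_counts = Counter(cat for _, cat in labelled_docs)   # documents per category
    contains = defaultdict(Counter)                         # docs in C containing each word
    for words, cat in labelled_docs:
        contains[cat].update(set(words))                    # count each word once per document
    # Prob(Ti|C) = fraction of documents of category C that contain Ti
    return {cat: {w: contains[cat][w] / doc_counts[cat] for w in contains[cat]}
            for cat in doc_counts}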

Modifications to the algorithms

Using logs
Computing this product directly in standard floating point quickly runs into underflow. Instead, one computes the logarithm:
log(Prob(Ci)*Prob(T1|Ci)*Prob(T2|Ci) * ... * Prob(Tm|Ci)) =
log(Prob(Ci)) + log(Prob(T1|Ci)) + log(Prob(T2|Ci)) + ... + log(Prob(Tm|Ci))

This transformation shows that the classifier returned by Naive Bayes is a linear discriminator; it corresponds to a hyperplane division under a vector model of documents (a single hyperplane for two categories, a collection of hyperplanes for multiple categories).
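In code, this is a one-line change to the scoring step (a sketch with illustrative names; the probabilities are assumed to be nonzero, e.g. after the Laplacian correction described next):

import math

def log_score(words, prob_c, prob_w_given_c):
    # prob_c: Prob(C); prob_w_given_c: dict mapping word -> Prob(word|C)
    return math.log(prob_c) + sum(math.log(prob_w_given_c[w]) for w in words)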

Laplacian correction
If a document contains a term T that does not occur in any of the training documents in category C, then Prob(T|C) will be estimated as 0. Then the product Prob(Ci)*Prob(T1|Ci)*Prob(T2|Ci) * ... * Prob(Tm|Ci) will be equal to 0, no matter how much other evidence there is favoring Ci.

The usual solution is to apply a Laplacian correction, increasing all the counts by 1. That is, for multinomial Naive Bayes,
let CW be the total number of occurrences of word W in documents of category C.
Let V be the number of different words in the collection (the size of the vocabulary).
Let |C| be the sum of the lengths of all the documents in the category C.
Then estimate Prob(W|C) as (1+CW)/(|C|+V).
Note that the sum over W of Prob(W|C) is equal to 1, as it should be.

In the Bernoulli model, let CW be the number of documents in C that contain at least one occurrence of W.
Let V be the number of different words in the collection (the size of the vocabulary).
Let |C| be the number of documents in the category C.
Then estimate Prob(W|C) as (1+CW)/(|C|+V).
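A sketch of both smoothed estimates, assuming the count structures from the earlier sketches (all names illustrative):

def smoothed_multinomial(w, cat, word_counts, vocabulary):
    # word_counts[cat]: Counter of word-token occurrences in documents of category cat
    cw = word_counts[cat][w]
    size_c = sum(word_counts[cat].values())     # total word tokens in category cat
    return (1 + cw) / (size_c + len(vocabulary))

def smoothed_bernoulli(w, cat, doc_freq, n_docs_in_cat, vocabulary):
    # doc_freq[cat]: Counter of how many documents of category cat contain each word
    cw = doc_freq[cat][w]
    return (1 + cw) / (n_docs_in_cat[cat] + len(vocabulary))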

The Laplacian correction can give strongly counter-intuitive results when applied to a data set with a small number of features and few total instances of some feature values, but that situation does not arise in this application.

Features of Naive Bayes

Running time:
Offline (learning the classifier): O(max(size of the corpus, number of different words * number of categories)); in practice, this is almost always O(size of the corpus).

Online (applying the classifier to a document): O(length of document * number of categories).

Accuracy: Poor probability estimates but generally good classification decisions. That is, it generally makes the right guess, but is way over-confident of its decision. It is not among the highest quality classifiers.

Robust under noise: Small changes in the texts or the inclusion of a small number of anomalous documents lead to small changes in the probability estimates.

Use when: Comparatively short running time is more important than extreme accuracy, or a very large amount of labelled text is available, so that better results can be obtained by running a crude algorithm over a large body of examples than a better algorithm over a data set that must be much smaller because of running-time constraints.

Feature selection

ML techniques generally, naive Bayes particularly, and the Bernoulli model especially, work better if irrelevant features (words) can be excluded a priori. Including irrelevant features adds noise and creates the danger of overfitting (fitting the classifier to chance patterns in irrelevant features). The relevance of features to categories can be measured by information-theoretic measures (mutual information) or statistical measures (chi-squared). See Manning for details.

These measures are generally used in a greedy way; that is, the features with the top rankings on these measures are selected. This does not necessarily give the best set of features, because features can be correlated. That is, F1 and F2 may both have high mutual information with the categories, but there may be no point in using both F1 and F2 as features, because the additional information provided by F2 is small. (Consider, as an extreme case, the situation where F2=F1.)
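As one concrete possibility, here is a sketch of greedy chi-squared feature selection using scikit-learn (CountVectorizer, SelectKBest, and chi2 are standard scikit-learn components; the toy corpus and labels are invented):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

docs = ["cheap meds now", "meeting at noon", "cheap meds online", "project meeting notes"]
labels = ["spam", "ham", "spam", "ham"]

X = CountVectorizer().fit_transform(docs)           # bag-of-words counts
selector = SelectKBest(chi2, k=3).fit(X, labels)    # keep the 3 highest-scoring words
X_selected = selector.transform(X)                  # reduced feature matrix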

Rocchio classification

Learning: Treat each document as a normalized vector (unit length).
For each category C, compute the unnormalized centroid of the labelled documents in C. (I don't know why the centroid is unnormalized, which causes closeness as measured by dot product, by Euclidean distance, and by angle all to be different, but Manning et al. are quite specific about this.) Now for document D, find the closest centroid (on one of these measures) and put D into the corresponding category.
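A minimal numpy sketch of Rocchio classification, assuming the documents have already been converted to unit-length vectors (names and layout invented; Euclidean distance is used here, which is one of the measures mentioned above):

import numpy as np

def centroids(vectors, labels):
    # vectors: numpy array of shape (n_docs, n_terms), each row a unit-length document vector
    return {c: vectors[np.array(labels) == c].mean(axis=0) for c in set(labels)}

def rocchio_classify(doc_vec, cents):
    # assign to the category whose (unnormalized) centroid is closest
    return min(cents, key=lambda c: np.linalg.norm(doc_vec - cents[c]))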

Running time
Offline (computing centroids): Linear in the size of the training set.
Online (applying classifier): O(size of D * number of categories)
Same as naive Bayes.

Single category
If you are trying to distinguish a single category C as opposed to "everything else", then the centroid of "everything else" may not be particularly meaningful. An alternative here is to compute the centroid X of the labelled documents in C, and choose a threshold T. A document D is classified as in C if dist(D,X) < T. The threshold T can be chosen by computing dist(D,X) for all documents in the collection, and then choosing T so that the numbers of false positives and false negatives in the training set are equal (or, more generally, so that the total cost associated with the false positives and false negatives is minimized). Note that this is now a spherical discriminator rather than a linear one.

For classifying into several well-defined categories plus a general "other" category, find the centroids of the categories and set a threshold T. A document is classified as category C if it is within T of the centroid of C and closer to the centroid of C than to the centroid of any other category.
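A sketch of choosing the threshold T described above so that false positives and false negatives balance on the training set (names invented; dists are the distances dist(D,X) for the training documents, in_c their true in-category labels):

def choose_threshold(dists, in_c):
    # Try each observed distance as a candidate threshold and pick the one
    # where the counts of false positives and false negatives are closest.
    best_t, best_gap = None, float("inf")
    for t in sorted(set(dists)):
        fp = sum(1 for d, y in zip(dists, in_c) if d < t and not y)   # predicted in C, actually not
        fn = sum(1 for d, y in zip(dists, in_c) if d >= t and y)      # predicted not in C, actually in
        if abs(fp - fn) < best_gap:
            best_t, best_gap = t, abs(fp - fn)
    return best_t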

K Nearest neighbors

Classification: For document D, find the K documents in the training set nearest to D in the vector model. Have these documents vote on the category to be assigned to D, or give each a weighted vote proportional to its dot product with D.

Choice of K: Various strategies. K=1 is too susceptible to noise. Usually K is a small odd number (3, 5, or 7), but people have used K=50 or 100.

Implementation:
Offline: Compute the coordinate of each word in the training set, and save the training set in an inverted index.
Online:

# Python version of the online step. Assumes an inverted_index mapping each word
# to the training documents that contain it, a coordinate(w, d) term-weight function,
# a label(d) function, and a parameter K (names are illustrative).
from collections import defaultdict, Counter

dotprod = defaultdict(float)              # DOTPROD[DT], zero by default
for w in set(D):                          # each distinct word W in D
    for dt in inverted_index[w]:          # each training document containing W
        dotprod[dt] += coordinate(w, D) * coordinate(w, dt)
nns = sorted(dotprod, key=dotprod.get, reverse=True)[:K]            # K nearest neighbors
category = Counter(label(dt) for dt in nns).most_common(1)[0][0]    # majority vote

Running time: Offline (constructing the classifier): Linear. Time to construct the inverted index.
Online (applying the classifier): O(sum over (word W in D) (number of documents containing W)). Very slow. There exist exact algorithms that are faster, but these require tremendous amounts of memory.

There exist faster approximate algorithms with reasonable memory demands, some of which are used effectively. See "Nearest Neighbors in High-dimensional Space" by Piotr Indyk.

Noise sensitivity: kNN is sensitive to noise, especially for small K.

Very slow online running time.

Linear classifier

Categorize a vector V = < V1 ... Vk > as being in or out of category C according to whether
W1 V1 + ... + Wk Vk > T
for some set of weights W1 ... Wk and threshhold T.
Equivalently, there is a weight vector W such that the dot product W·V > T.
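In code, the decision rule is just a thresholded dot product (a tiny sketch with invented names):

def linear_classify(v, w, t):
    # v: document vector, w: weight vector, t: threshold T
    return sum(wi * vi for wi, vi in zip(w, v)) > t   # True means "in category C"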

If there exists a linear separator for a category C, then it can be found using linear programming. If there does not exist a linear separator, then the problem of finding a separator that misclassifies the minimum number of items is NP-complete.

Support Vector Machine

"No one ever got fired for using an SVM" (Manning p. 307)

Suppose that hyperplane P is a linear separator for a labelled data set D. The margin of P is the minimal distance from any element in D to P. The SVM of D is the hyperplane P with maximal margin.

In this figure, P1, P2, and P3 are all linear separators between the blue crosses and the red circles, but P2 is the SVM because it has the largest margin.

The SVM generally predicts unseen points more accurately than other linear classifiers, because it requires that any future point classified as C be quite far from every point labelled as not C. (There are also more sophisticated and cogent arguments for this conclusion.)

With a little algebra (see Manning p. 311-312), one can express the problem of finding the SVM as a problem of minimizing a quadratic function subject to linear constraints. (Essentially, the distance squared is a quadratic function on W and T and the constraint that all the points are correctly categorized is a set of linear constraints on W and T, though it takes a little munging to make this come out neatly.) This is a well-known optimization problem.

SVM on data that is not linearly separable

Suppose the data is not actually linearly separable (either because of inherent properties of the category or because of noise) but is nearly so. We can apply SVM as follows: Define data set E to be an improvement of data set D around hyperplane P with cost C, if E is identical to D except that some of the points have been moved, P is a linear separator for E, and C is proportional to the total distance that the points were moved.

Then find the hyperplane P for which the cheapest improvement around P has minimal cost.
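In practice one rarely solves this optimization by hand; a library soft-margin SVM is typically used. A sketch with scikit-learn (LinearSVC and TfidfVectorizer are standard scikit-learn components; the tiny corpus is invented, and the parameter C here corresponds loosely to the misclassification cost just described):

from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["cheap meds online", "team meeting today", "buy cheap now", "lunch meeting notes"]
labels = ["spam", "ham", "spam", "ham"]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)
clf = LinearSVC(C=1.0).fit(X, labels)      # C trades margin width against training errors
print(clf.predict(vec.transform(["cheap meeting"])))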

Evaluation

Divide the labelled corpus into a training set and a test set. Train the classifiers on the training set, test them on the test set.

Measures of quality: For one category, precision/recall. Cost/benefit analysis: Percentage of false positives * cost of false positive + percentage of false negatives * cost of false negative. These costs are often not known precisely, and may change with time and circumstance.

For multiple exclusive categories, the situation is murkier. Overall accuracy is one option. A confusion matrix M[I,J] shows the number of items actually of category I that are classified as J. This gives a good overall sense of where the system is going wrong, but not a single number that can be used for comparison. Again, cost/benefit analysis can be used if there is a known cost associated with each kind of error.

Information theoretic measures such as cross entropy. If you know the classification, how many additional bits on average are needed to specify the actual category, in an efficient coding scheme?

Statistical measure: If the classification were the true distribution, what is the likelihood of this pattern of error?
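For the per-category and multi-category measures above, a sketch using scikit-learn's metrics module (precision_score, recall_score, and confusion_matrix are standard; the toy labels are invented):

from sklearn.metrics import precision_score, recall_score, confusion_matrix

true_cats = ["sports", "politics", "sports", "arts", "politics"]
predicted = ["sports", "sports",   "sports", "arts", "politics"]

# One category vs. the rest: precision and recall for "sports"
y_true = [1 if c == "sports" else 0 for c in true_cats]
y_pred = [1 if c == "sports" else 0 for c in predicted]
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred))

# Multiple categories: M[i, j] = items of category i classified as j
print(confusion_matrix(true_cats, predicted, labels=["sports", "politics", "arts"]))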

Tuning

If you want to tune parameters associated with a learning method, or to choose between different learning methods, and you want an accurate measure of the accuracy of the final classifier, then you divide your labelled corpus into three: a training set, a validation set, and a test set. You run the learning methods over the training set, measure the quality of the output classifiers over the validation set, and choose the classifier that does best over the validation set. Then, once you have committed to a classifier, you measure its quality on the test set. You can't rely on the accuracy of the classifier on the validation set, since you have used the validation set to choose the classifier.
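A sketch of the three-way split with scikit-learn's train_test_split (the fractions, placeholder documents, and variable names are arbitrary):

from sklearn.model_selection import train_test_split

docs = ["doc %d" % i for i in range(10)]     # placeholder documents
labels = ["A", "B"] * 5                      # placeholder categories

# First carve off the test set, then split the remainder into training and validation sets.
docs_rest, docs_test, y_rest, y_test = train_test_split(docs, labels, test_size=0.2, random_state=0)
docs_train, docs_val, y_train, y_val = train_test_split(docs_rest, y_rest, test_size=0.25, random_state=0)
# Train on (docs_train, y_train), pick a classifier on (docs_val, y_val), report on (docs_test, y_test).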

Ideally, you would use each test set only once. That's not usually possible, but people do treat their test sets like gold; they are taken out of the closet very rarely and the human experimenters are careful not to actually look at them.

Experimental comparisons of ML techniques vary widely from one published paper to the next. The effectiveness of techniques depends delicately on features of the data set and specifics of how the learning is carried out.