**Reading: ** Manning et al. chap 13-15, Chakrabarti chap. 5.

Given: A set of text documents and a set of categories.

Task: Categorize the documents.

- Preprocessing for indexing: Determining the encoding (ASCII, Unicode, etc.), word segmentation, identifying the language.
- Spam detection
- Pornography detection. (Dr. Tao Yang mentioned this as a task that must be done with extremely high accuracy, because both false positives and false negatives are extremely undesirable.)
- Sentiment detection (favorable vs. unfavorable reviews).
- Email sorting.
- Topic specific search engines. (Use the classifier to segregate documents relevant to the topic and then use keyword search for subtopics.)

** Manual categorization **

(Manning p. 322). Using human domain knowledge
and some experimentation, craft a query with high accuracy. Manning cites an
experiment with 94% recall and 84% precision over 675 categories on Reuters
news articles, at a cost of 2 man-days of work per category.

Note: Two functions here. The * classifier* maps a document to a
category and the * learning technique * maps a labelled collection
(training set) to a classifier.

There are different classifiers of different kinds: linear separators,
non-linear separators, nearest neighbors, sets of features, rule sets, etc.
A category of classifiers is sometimes (not in Manning) known as a **
version space.** Any practical learning technique outputs classifiers within
some particular version space.

So how do we evaluate Prob(C|D)? The multinomial naive Bayes method proceeds as follows:

First, as before, Prob(C|D) = Prob(D|C) Prob(C)/Prob(D).

Second, let T1, T2 ... Tm be the sequence of
lexical terms in D. We will
assume that the occurrence of a term T in the Ith place of D
depends only on the category C and, given C, is conditionally
independent of all the other terms in D and of the position I.
Therefore

Prob(D|C) = Prob(T1|C)*Prob(T2|C) * ... * Prob(Tm|C)

where Prob(Ti|C) means "the probability that a randomly chosen word
token in C is the word Ti".

Third, we estimate Prob(Ti|C) and Prob(C) using the training set.

Prob(Ti|C) is estimated as the relative frequency
of Ti in documents of category C = (number of occurrences of Ti in C) /
(total number of words in C).

Prob(C) is estimated as the fraction of documents in the training set that
are of category C.

Fourth, note that Prob(D) is independent of the category C. Since we are
given D and are just trying to compare different categories, we need
not calculate Prob(D). (Prob(D) is known as a * normalizing factor*;
its function is to ensure that the probabilities add up to 1.)

We can now calculate the product

Prob(Ci)*Prob(T1|Ci)*Prob(T2|Ci) * ... * Prob(Tm|Ci) for each
category Ci and choose the category that maximizes this.
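The four steps above can be sketched in a few lines of Python. This is a minimal illustration on a made-up three-document training set, not a reference implementation; the names `train_multinomial`, `score`, `classify`, and `TRAINING` are hypothetical, and the estimates are the raw relative frequencies described above, with no smoothing.

```python
from collections import Counter

def train_multinomial(labelled_docs):
    """Estimate Prob(C) and Prob(T|C) from (category, tokens) pairs."""
    doc_counts = Counter()   # category -> number of documents
    term_counts = {}         # category -> Counter of term occurrences
    for category, tokens in labelled_docs:
        doc_counts[category] += 1
        term_counts.setdefault(category, Counter()).update(tokens)
    n_docs = sum(doc_counts.values())
    prior = {c: doc_counts[c] / n_docs for c in doc_counts}
    cond = {}
    for c, counts in term_counts.items():
        total = sum(counts.values())   # total number of words in C
        cond[c] = {t: n / total for t, n in counts.items()}
    return prior, cond

def score(tokens, category, prior, cond):
    """Prob(C) * Prob(T1|C) * ... * Prob(Tm|C), one factor per token."""
    p = prior[category]
    for t in tokens:
        p *= cond[category].get(t, 0.0)   # unseen term: estimate is 0
    return p

def classify(tokens, prior, cond):
    """Choose the category with the largest unnormalized score."""
    return max(prior, key=lambda c: score(tokens, c, prior, cond))

# Toy training set (invented for illustration).
TRAINING = [
    ("spam", "cheap pills cheap".split()),
    ("spam", "cheap offer".split()),
    ("ham", "meeting offer tomorrow".split()),
]
prior, cond = train_multinomial(TRAINING)
print(classify(["cheap", "offer"], prior, cond))
```

Here "cheap" never occurs in the "ham" documents, so the "ham" score for the test document is exactly 0.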

First, as before, Prob(C|D) = Prob(D|C) Prob(C)/Prob(D).

Second, let { T1, T2 ... Tm } be the set of * distinct *
lexical terms in D. We will
assume that the * appearance *
of a term Ti depends only on the category C and
is independent of all the other terms in D. Therefore

Prob(D|C) = Prob(T1|C)*Prob(T2|C) * ... * Prob(Tm|C) (where Prob(Ti|C)
means "the probability that a document of category C contains at least
one occurrence of Ti").

Third, we estimate Prob(Ti|C) and Prob(C) using the training set.

Prob(Ti|C) is estimated as the fraction of documents in C that contain
Ti.

The rest of the algorithm proceeds as above.
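A sketch of this document-level variant, under the same toy-data assumptions, might look as follows. The function names are hypothetical. The score multiplies one factor per distinct term of D, as described above, and Prob(T|C) is the fraction of documents in C that contain T.

```python
def train_bernoulli(labelled_docs):
    """Estimate Prob(C) and Prob(T|C) from (category, tokens) pairs."""
    docs_in = {}     # category -> number of documents
    docs_with = {}   # category -> {term: number of documents containing it}
    for category, tokens in labelled_docs:
        docs_in[category] = docs_in.get(category, 0) + 1
        seen = docs_with.setdefault(category, {})
        for term in set(tokens):   # count each document at most once per term
            seen[term] = seen.get(term, 0) + 1
    n_docs = sum(docs_in.values())
    prior = {c: n / n_docs for c, n in docs_in.items()}
    cond = {c: {t: n / docs_in[c] for t, n in docs_with[c].items()}
            for c in docs_in}
    return prior, cond

def bernoulli_score(tokens, category, prior, cond):
    """Prob(C) times Prob(T|C) for each distinct term T of the document."""
    p = prior[category]
    for term in set(tokens):
        p *= cond[category].get(term, 0.0)
    return p

# Toy training set (invented for illustration).
TRAINING_DOCS = [
    ("spam", "cheap pills cheap".split()),
    ("spam", "cheap offer".split()),
    ("ham", "meeting offer tomorrow".split()),
]
b_prior, b_cond = train_bernoulli(TRAINING_DOCS)
print(b_cond["spam"])
```

Repeating a word in the test document does not change the score here, because only distinct terms are counted.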

** Using logs **

Actually computing this product with standard floating
point will soon get into problems of underflow. Instead, one computes
the logarithm:

log(Prob(Ci)*Prob(T1|Ci)*Prob(T2|Ci) * ... * Prob(Tm|Ci)) =

log(Prob(Ci)) + log(Prob(T1|Ci)) + log(Prob(T2|Ci)) + ... + log(Prob(Tm|Ci))

This transformation shows that the classifier returned by Naive Bayes is a linear discriminator; it corresponds to a hyperplane division under a vector model of documents (a single hyperplane for two categories, a collection of hyperplanes for multiple categories).
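The underflow problem is easy to reproduce. The sketch below uses invented probabilities (200 terms, each with probability 0.01 under one category and 0.02 under another); the function names are hypothetical. The direct products both underflow to 0.0 in double precision, so the categories cannot be compared, while the sums of logs stay well within range.

```python
import math

def product_score(prior, term_probs):
    """Direct product Prob(C) * Prob(T1|C) * ... * Prob(Tm|C)."""
    p = prior
    for q in term_probs:
        p *= q
    return p

def log_score(prior, term_probs):
    """log Prob(C) + sum of log Prob(Ti|C); assumes all probabilities > 0."""
    return math.log(prior) + sum(math.log(q) for q in term_probs)

# Invented per-term probabilities for a 200-word document.
probs_a = [0.01] * 200
probs_b = [0.02] * 200

print(product_score(0.5, probs_a), product_score(0.5, probs_b))
print(log_score(0.5, probs_a), log_score(0.5, probs_b))
```

Since the logarithm is monotonic, choosing the category with the largest log score gives the same decision as choosing the largest product.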

** Laplacian correction **.

If a document contains a term T that
does not occur in any of the documents in category C, then Prob(T|C) will
be estimated as 0. Then the product
Prob(Ci)*Prob(T1|Ci)*Prob(T2|Ci) * ... * Prob(Tm|Ci) will be equal to 0,
no matter how much other evidence there is favoring Ci.

The usual solution is to do a Laplacian correction by increasing all the counts by 1. That is, for multinomial Naive Bayes,

let C_{W} be the total number of occurrences of word W
in documents of category C.

Let V be the number of different words in the vocabulary, that is, across the whole training collection.

Let |C| be the sum of the lengths of all the documents in the category C.

Then estimate Prob(W|C) as (1+C_{W})/(|C|+V).

Note that the sum over W of Prob(W|C) is equal to 1, as it should be.
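As a small check of the multinomial correction, the sketch below computes (1+C_{W})/(|C|+V) for a toy category and confirms that the estimates sum to 1. It assumes V counts the distinct words of the whole training vocabulary; the function name and data are invented for illustration.

```python
from collections import Counter

def laplace_multinomial(category_docs, vocabulary):
    """Smoothed Prob(W|C) = (1 + C_W) / (|C| + V) for every W in the vocabulary."""
    counts = Counter(t for doc in category_docs for t in doc)   # C_W
    total = sum(counts.values())                                # |C|
    v = len(vocabulary)                                         # V
    return {w: (1 + counts[w]) / (total + v) for w in vocabulary}

# Toy data: two documents in the category, five words in the vocabulary.
SPAM_DOCS = [["cheap", "pills", "cheap"], ["cheap", "offer"]]
VOCABULARY = {"cheap", "pills", "offer", "meeting", "tomorrow"}

smoothed = laplace_multinomial(SPAM_DOCS, VOCABULARY)
print(smoothed["cheap"], smoothed["meeting"], sum(smoothed.values()))
```

The word "meeting" never occurs in the category, yet it receives the small nonzero estimate 1/(|C|+V) instead of 0.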

In the Bernoulli model, let C_{W} be the number of documents in C
that contain at least one occurrence of W.

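Let |C| be the number of documents in the category C.

Then estimate Prob(W|C) as (1+C_{W})/(|C|+2). The 2 in the denominator (rather than V) reflects the two possible outcomes for each term in a document, present or absent; this is the form used in Manning et al.'s Bernoulli training procedure.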

The Laplacian correction can give strongly counter-intuitive results when applied to a data set with a small number of features and a small number of total instances of some values of some of the features, but in this application that does not arise.

Offline (learning the classifier): O(max(size of the corpus, number of different words * number of categories)); in practice, this is almost always O(size of the corpus).

Online (applying the classifier to a document): O(length of document * number of categories).

** Accuracy: **
Poor probability estimates but generally good classification decisions. That
is, it generally makes the right guess, but is way over-confident of its
decision. It is not among the highest quality classifiers.

** Robust under noise: ** Small changes in the texts or inclusion of a small
number of anomalous documents lead to small changes in the probability
estimates.

** Use when: ** Comparatively short running time is more important than
extreme accuracy.

A very large amount of labelled text is available, so better results can be
gotten from a crude algorithm over a large body of examples than a better
algorithm over a data set that must be much smaller because of constraints
of running time.

These measures are generally used in a * greedy* way; that is, the
features that have the top rankings in terms of these measures are selected.
This does not necessarily give the best * set* of features, because
features can be correlated. That is, F1 and F2 may both have high mutual
information with the categories; but there may be no point in using both
F1 and F2 as features, because the * additional* information
provided by F2 is small. (Consider, as an extreme case, the case where
F2=F1.)
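The extreme case F2=F1 can be checked numerically. The sketch below is an illustration on invented data, with a hypothetical helper `mutual_information`: it computes the mutual information (in bits) between a feature and the category labels, and shows that F2 ranks as highly as F1 on its own but adds nothing once F1 is already selected.

```python
import math
from collections import Counter

def mutual_information(feature_values, categories):
    """I(F;C) in bits, estimated from paired observations."""
    n = len(categories)
    joint = Counter(zip(feature_values, categories))
    f_counts = Counter(feature_values)
    c_counts = Counter(categories)
    # p(f,c) * log2( p(f,c) / (p(f) p(c)) ), summed over observed pairs
    return sum((n_fc / n) * math.log2(n_fc * n / (f_counts[f] * c_counts[c]))
               for (f, c), n_fc in joint.items())

# Invented data: four documents, two categories, three binary features.
CATEGORY = ["x", "x", "y", "y"]
F1 = [1, 1, 0, 0]
F2 = list(F1)          # exact copy of F1
F3 = [1, 0, 1, 0]      # unrelated to the category

mi_f1 = mutual_information(F1, CATEGORY)
mi_f2 = mutual_information(F2, CATEGORY)
mi_f3 = mutual_information(F3, CATEGORY)
# Information carried by the pair (F1, F2), treated as one combined feature.
mi_pair = mutual_information(list(zip(F1, F2)), CATEGORY)
print(mi_f1, mi_f2, mi_f3, mi_pair - mi_f1)
```

A greedy ranking would pick both F1 and F2 before F3, even though the second pick contributes zero additional bits.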

For each category C, compute the unnormalized centroid of the labelled documents in C. (I don't know why the centroid is unnormalized, which causes closeness as measured by dot product, by Euclidean distance, and by angle all to be different, but Manning et al. are quite specific about this.) Now for document D, find the closest centroid (on one of these measures) and put D into the corresponding category.
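A minimal sketch of this centroid method is given below. It assumes documents are already represented as dense vectors of equal length and, as one of the possible closeness measures, uses Euclidean distance; the names and the two-dimensional toy data are invented for illustration.

```python
import math

def centroid(vectors):
    """Componentwise mean of a list of equal-length vectors (not normalized)."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def train_centroids(labelled_vectors):
    """Map each category to the centroid of its labelled document vectors."""
    by_category = {}
    for category, vec in labelled_vectors:
        by_category.setdefault(category, []).append(vec)
    return {c: centroid(vecs) for c, vecs in by_category.items()}

def classify_by_centroid(vec, centroids):
    """Assign the category whose centroid is closest to vec."""
    return min(centroids, key=lambda c: euclidean(vec, centroids[c]))

# Toy labelled vectors (invented for illustration).
LABELLED = [
    ("sports", [3.0, 0.0]), ("sports", [1.0, 0.0]),
    ("politics", [0.0, 2.0]), ("politics", [0.0, 4.0]),
]
centroids = train_centroids(LABELLED)
print(centroids, classify_by_centroid([1.0, 1.0], centroids))
```

Swapping `euclidean` for a dot-product or angle-based measure can change the decision, since the centroids are not normalized.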

** Running time **

Offline (computing centroids): Linear in the size of the training set.

Online (applying classifier): O(size of D * number of categories)

Same as naive Bayes.

** Single category **

If you are trying to distinguish a single category C as opposed to "everything
else", then the centroid of "everything else" may not be particularly
meaningful. An alternative here is to compute the centroid X of the labelled
documents in C, and choose a threshold T. A document D is classified as
in C if dist(D,X) < T. The threshold T can be chosen by computing
dist(D,X) for all documents in the collection, and then choosing T so that
the number of false positives and false negatives in the training set
are equal (or, more generally, so that the total cost associated with the
false positives and false negatives is minimized). Note that this is now
a spherical discriminator rather than a linear one.
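One way to pick T in the cost-minimizing sense is a simple scan over candidate thresholds. The sketch below is an assumption-laden illustration: the candidates are taken midway between consecutive observed distances, the costs `fp_cost` and `fn_cost` are hypothetical parameters, and ties are broken in favor of the smallest threshold.

```python
def choose_threshold(distances, in_category, fp_cost=1.0, fn_cost=1.0):
    """Return (T, cost) minimizing fp_cost * #FP + fn_cost * #FN on the training set.

    distances[i] is dist(D_i, X); in_category[i] is True if D_i is labelled C.
    A document is classified as in C when its distance is < T.
    """
    ds = sorted(set(distances))
    candidates = [0.0] + [(a + b) / 2 for a, b in zip(ds, ds[1:])] + [ds[-1] + 1.0]
    best_t, best_cost = None, float("inf")
    for t in candidates:
        fp = sum(1 for d, y in zip(distances, in_category) if d < t and not y)
        fn = sum(1 for d, y in zip(distances, in_category) if d >= t and y)
        cost = fp_cost * fp + fn_cost * fn
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t, best_cost

# Invented distances to the centroid X and their true labels.
DISTANCES = [0.2, 0.4, 0.5, 0.9, 1.2]
IN_C = [True, True, False, True, False]

t_equal, cost_equal = choose_threshold(DISTANCES, IN_C)
t_fn_heavy, cost_fn_heavy = choose_threshold(DISTANCES, IN_C, fn_cost=5.0)
print(t_equal, cost_equal, t_fn_heavy, cost_fn_heavy)
```

Making false negatives more expensive pushes the threshold outward, so the sphere around X grows to include the distant positive example.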

For classifying into several well-defined categories and a general "other", find the centroids of the categories and set a threshold T. A document is classified as category C if it is within T of the centroid of C and closer to the centroid of C than to the centroid of any other category.

Choice of K: Various strategies. K=1 is too susceptible to noise. Usually K= a small odd number (3,5,7) but people have used K=50,100.

Implementation:

Offline:
Compute the coordinate of each word in the training set,
and save the training set in an inverted index.

Online:

    { for (each document DT in training set) DOTPROD[DT] := 0;
      for (each word W in D)
        for (each document DT that contains W)
          DOTPROD[DT] += coordinate(W,D) * coordinate(W,DT);
      NNS := K documents with maximal value of DOTPROD;
      choose category by voting among NNS.
    }
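The same procedure can be written as runnable Python. This is a sketch under assumptions: each document is a dictionary from word to coordinate (for example a TF-IDF weight), the function names are hypothetical, and training documents that share no word with D are simply skipped, since their dot product is 0.

```python
from collections import Counter, defaultdict

def build_index(training):
    """Offline step: inverted index word -> list of (doc_id, coordinate).

    training is a list of (category, {word: coordinate}) pairs.
    """
    index = defaultdict(list)
    for doc_id, (_, vec) in enumerate(training):
        for word, coord in vec.items():
            index[word].append((doc_id, coord))
    return index

def knn_classify(doc_vec, training, index, k=3):
    """Online step: accumulate dot products via the index, then vote."""
    dotprod = defaultdict(float)
    for word, coord in doc_vec.items():
        for doc_id, train_coord in index.get(word, []):
            dotprod[doc_id] += coord * train_coord
    nearest = sorted(dotprod, key=dotprod.get, reverse=True)[:k]
    votes = Counter(training[doc_id][0] for doc_id in nearest)
    return votes.most_common(1)[0][0] if votes else None

# Toy training set (invented for illustration).
TRAINING_SET = [
    ("sports", {"goal": 1.0, "team": 1.0}),
    ("sports", {"team": 1.0, "coach": 1.0}),
    ("politics", {"vote": 1.0, "party": 1.0}),
    ("politics", {"vote": 1.0, "team": 0.5}),
]
INDEX = build_index(TRAINING_SET)
print(knn_classify({"team": 1.0, "goal": 1.0}, TRAINING_SET, INDEX, k=3))
```

The inner loop touches only the postings of words that occur in D, which is where the running time stated below comes from.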

**Running time: **
Offline (constructing the classifier): Linear. Time to construct the inverted
index.

Online (applying the classifier): O(sum over (word W in D) (number of documents
containing W)). Very slow. There exist exact algorithms that are faster,
but these require tremendous amounts of memory.

There exist approximate algorithms that are faster and have reasonable memory demands, some of which are used effectively. See "Nearest Neighbors in High-dimensional Space" by Piotr Indyk.

** Noise sensitive ** kNN is sensitive to noise, especially for small k.

Very slow online running time.

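A linear separator for a category is a test on the document vector V of the form

W1*V1 + W2*V2 + ... + Wn*Vn > T

for some set of weights W and threshold T.

Equivalently, it is a weight vector W such that the dot product W · V > T.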

If there exists a linear separator for a category C, then it can be found using linear programming. If there does not exist a linear separator, then the problem of finding a separator that misclassifies the minimum number of items is NP-complete.

"No one ever got fired for using an SVM" (Manning p. 307)

Suppose that hyperplane P is a linear separator for a labelled data set D.
The * margin* of P is the minimal distance from any element in D to
P. The SVM of D is the hyperplane P with maximal margin.

In this figure, P1, P2, and P3 are all linear separators between the blue crosses and the red circles, but P2 is the SVM because it has the largest margin.
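The margin of a given separator is straightforward to compute. The sketch below does not find the SVM; it only compares two hand-picked separators on invented two-dimensional data, writing a hyperplane as the set of points V with W · V = T so that the distance from V to it is |W · V - T| / ||W||. The function names are hypothetical.

```python
import math

def dot(w, v):
    return sum(wi * vi for wi, vi in zip(w, v))

def separates(w, t, positives, negatives):
    """True if W . V > T for every positive and W . V < T for every negative."""
    return (all(dot(w, v) > t for v in positives)
            and all(dot(w, v) < t for v in negatives))

def margin(w, t, points):
    """Minimal distance from any point to the hyperplane W . V = T."""
    norm = math.sqrt(dot(w, w))
    return min(abs(dot(w, v) - t) / norm for v in points)

# Invented data and two candidate separators.
POSITIVES = [(3.0, 3.0), (4.0, 2.0)]
NEGATIVES = [(0.0, 0.0), (1.0, 0.0)]
ALL_POINTS = POSITIVES + NEGATIVES

margin_a = margin((1.0, 0.0), 2.0, ALL_POINTS)   # vertical line x = 2
margin_b = margin((1.0, 1.0), 3.5, ALL_POINTS)   # diagonal line x + y = 3.5
print(margin_a, margin_b)
```

Both lines separate the two classes, but the diagonal one has the larger margin and would be preferred under the maximal-margin criterion.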

The SVM generally predicts unseen points more accurately than other linear classifiers, because it requires that any future point classified as C must be quite far from any point labelled as not C. (There are also more sophisticated and cogent arguments for this conclusion.)

With a little algebra (see Manning p. 311-312), one can express the problem of finding the SVM as a problem of minimizing a quadratic function subject to linear constraints. (Essentially, the distance squared is a quadratic function on W and T and the constraint that all the points are correctly categorized is a set of linear constraints on W and T, though it takes a little munging to make this come out neatly.) This is a well-known optimization problem.

- E is discriminated by P with margin M;
- E differs from D in that some of the data points have been moved perpendicular to P into proper position;
- The total cost C is a linear combination of quadratic functions of M and the distances the points are moved.

Then find the value of P that minimizes the cost.

Measures of quality: For one category, precision/recall. Cost/benefit analysis: Percentage of false positives * cost of false positive + percentage of false negatives * cost of false negative. These costs are often not known precisely, and may change with time and circumstance.
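For a single category these measures reduce to a few counts. The sketch below is illustrative only: the function name, the cost parameters, and the toy predictions are assumptions, and the cost is computed as the fraction of false positives times its cost plus the fraction of false negatives times its cost.

```python
def evaluate(predicted, actual, fp_cost=1.0, fn_cost=1.0):
    """Precision, recall, and expected cost per document for one category."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)
    fp = sum(1 for p, a in pairs if p and not a)
    fn = sum(1 for p, a in pairs if not p and a)
    n = len(pairs)
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "cost": (fp / n) * fp_cost + (fn / n) * fn_cost,
    }

# Invented classifier output and true labels for eight documents.
PREDICTED = [True, True, True, False, False, False, False, True]
ACTUAL = [True, True, False, True, False, False, False, False]

report = evaluate(PREDICTED, ACTUAL, fp_cost=1.0, fn_cost=10.0)
print(report)
```

Because the costs are parameters, the same predictions can be re-scored whenever the estimated costs change.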

For multiple exclusive categories, the situation is murkier. Overall accuracy
is one option. A * confusion matrix * M[I,J] shows the number of items
actually of category I that are classified as J. This gives a good overall
sense of where the system is going wrong, but not a single number that can
be used for comparison. Again cost/benefit analysis can be used if
there is a known cost associated with each kind of error.
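A confusion matrix and the overall accuracy can be tallied directly from paired labels. In this sketch the matrix is stored as a `Counter` keyed by (actual, predicted) pairs rather than as a two-dimensional array; that representation, the function names, and the toy labels are choices made for illustration.

```python
from collections import Counter

def confusion_matrix(actual, predicted):
    """M[(I, J)] = number of items actually of category I classified as J."""
    return Counter(zip(actual, predicted))

def accuracy(matrix):
    """Fraction of items on the diagonal of the confusion matrix."""
    correct = sum(n for (i, j), n in matrix.items() if i == j)
    return correct / sum(matrix.values())

# Invented true and predicted categories for six documents.
ACTUAL_LABELS = ["a", "a", "b", "b", "c", "c"]
PREDICTED_LABELS = ["a", "b", "b", "b", "c", "a"]

M = confusion_matrix(ACTUAL_LABELS, PREDICTED_LABELS)
print(M[("a", "b")], M[("c", "a")], accuracy(M))
```

The off-diagonal entries show which pairs of categories are being confused, which the single accuracy figure hides.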

Information theoretic measures such as cross entropy. If you know the classification, how many additional bits on average are needed to specify the actual category, in an efficient coding scheme?

Statistical measure: If the classification were the true distribution, what is the likelihood of this pattern of error?

Ideally, you would use each test set only once. That's not usually possible, but people do treat their test sets like gold; they are taken out of the closet very rarely and the human experimenters are careful not to actually look at them.

Experimental comparisons of ML techniques vary widely from one published paper to the next. The effectiveness of techniques depends delicately on features of the data set and specifics of how the learning is carried out.