Linear Separators

Consider a classification problem of determining whether an instance X is a member of one particular category C. That is, the classification attribute has two values T and F.

Suppose that all the predictive attributes are numerical. Thus, an instance X can be considered to be a vector < X₁ ... X_k > where the X_i are all numbers.

The simplest kind of classifier for this case is a linear classifier which is a vector of k weights < W₁ ... W_k > and a threshhold T with the following rule:

If W₁X₁ + W₂X₂ + ... + W_kX_k > T then predict X.C=T else predict X.C=F.

For fixed W₁ ... W_k, the equation W₁X₁ + W₂X₂ + ... + W_kX_k = T defines a hyperplane in the k-dimensional space of values of X₁ ... X_k.
All points on one side of the hyperplane obey the equation
W₁X₁ + W₂X₂ + ... + W_kX_k > T
and are judged to be in category C. All points on the other side of the hyperplane obey the equation
W₁X₁ + W₂X₂ + ... + W_kX_k < T
and are judged to be outside category C.

Naive Bayes

Naive Bayes, applied to a problem where all the attributes are Boolean, is essentially a linear separator. We represent an instance as a tuple of 1's and -1's. Applying the logarithmic transformation, the Naive Bayes method states that the vector < x₁ ... x_k > is judged to be in category C if

log(P(I.A1= x₁ | I.C=T)) + log(P(I.A2= x₂ | I.C=T)) + ... + log(P(I.Ak= x_k | I.C=T)) + log(P(I.C=T)) >
log(P(I.A1= x₁ | I.C=F)) + log(P(I.A2= x₂ | I.C=F)) + ... + log(P(I.Ak= x_k | I.C=F)) + log(P(I.C=F))

However, this is equivalent to

(1/2) [log(P(I.A1=1 | I.C=T)) + log(P(I.A1=-1 | I.C=F)) - log(P(I.A1=1 | I.C=F)) - log(P(I.A1=-1| I.C=T))] x₁ + ... +
(1/2) [log(P(I.Ak=1 | I.C=T)) + log(P(I.Ak=-1 | I.C=F)) - log(P(I.Ak=1 | I.C=F)) - log(P(I.Ak=-1| I.C=T))] x_k
>
log(P(I.C=F)) - log(P(I.C=T)) +
(1/2) [log(P(I.A1=1 | I.C=T)) + log(P(I.A1=1 | I.C=F)) - log(P(I.A1=-1 | I.C=F)) - log(P(I.A1=-1| I.C=T))] + ... +
(1/2) [log(P(I.Ak=1 | I.C=T)) + log(P(I.Ak=1 | I.C=F)) - log(P(I.Ak=-1 | I.C=F)) - log(P(I.Ak=-1| I.C=T))]

which is a linear inequality in x₁ ... x_k

Therefore: let us consider a vector space in which each word in the language is a dimension, and let us say a document D corresponds to a vector V such that the coordinate of D in the dimension corresponding to word U is the number of occurrences of U in D (3 for "a" and "rose", 2 for "is"). Then the Naive Bayes classifier is a linear discriminator where the weight W_U for the dimension corresponding to U is equal to log(P(U|C)) - log(P(U|~C)) and the threshhold is log(P(~C)) - log(P(C)).

The Bernoulli model for Naive Bayes can also be construed as a linear separator; I leave that as an exercise.

Support Vector Machines

"No one ever got fired for using an SVM."
Introduction to Information Retrieval by Christopher Manning, Prabhakar Raghavan, and Hinrich Schutze, Cambridge U. Press, 2008

Suppose that hyperplane P is a linear separator for a labelled data set D. The margin of P is the minimal distance from any element in D to P. The SVM of D is the hyperplane P with maximal margin.

In this figure, P1, P2, and P3 are all linear separators between the blue crosses and the red circles, but P2 is the SVM because it has the largest margin.

The SVM generally predicts unseen points more accurately than any other linear classifiers, because it requires that any future point classified as C must be quite far from any point labelled as not C. (There are also more sophisticated and cogent arguments for this conclusion.)

With a little algebra, one can express the problem of finding the SVM as a problem of minimizing a quadratic function subject to linear constraints. (Essentially, the distance squared is a quadratic function on W and T and the constraint that all the points are correctly categorized is a set of linear constraints on W and T, though it takes a little munging to make this come out neatly.) This is a well-known optimization problem.

SVM on data that is not linearly separable

Suppose the data is not actually linearly separable (either because of inherent properties of the category or because of noise) but is nearly so. We can apply SVM as follows: Define data set E to be an improvement of data set D around hyperplane P with cost C, if

E is discriminated by P with margin M;
E differs from D in that some of the data points have been moved perpendicular to P into proper position;
The total cost C is a linear combination of quadratic functions of M and the distances the points are moved.

Then find the value of P that minimizes the cost.

Kernel Machines

As described in Russell and Norvig (pp. 749-751), SVMs can also be used for finding non-linear separators, essentially by mapping the problem to a higher dimensional space, where the additional dimensions correspond to non-linear terms.