(a good answer to assignment 3 tries two different choices of feature and provides some text suggesting why one does better than the other, by example or argument)

We consider now a more direct approach: learning a discriminative function directly. That is, given a set of training examples of the form <x, y>, we learn a function f(x) which returns an appropriate value of y. To keep things simple, we will focus on the problem of binary classification (only two possible values of y, 0 and 1).

One particularly simple case arises if the data is linearly separable -- that is, it is possible to draw a hyperplane in d-dimensional space so that all the training examples with y=0 are on one side of the plane, and all of the examples with y=1 are on the other side. The problem then reduces to one of finding this separating plane. In general, there will be multiple such planes ... which one should we choose? To answer this question, we define the margin m as the minimum distance from any point (in the training set) to the plane. Support vector machines choose the plane which maximizes this margin; they are called maximal margin classifiers. The points which are closest to the plane -- a distance m away -- are called the support vectors. Intuitively, these support vectors -- the points closest to the boundary between y=0 and y=1 -- are the ones determining the plane and hence the discrimination function f.
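The geometry here is easy to make concrete: the distance from a point x to the hyperplane w . x + b = 0 is |w . x + b| / ||w||, and the margin is the minimum of this distance over the training set. A minimal sketch (the data and the plane are made-up toy values, not from any real task):

```python
import numpy as np

# Hyperplane defined by normal vector w and offset b: w . x + b = 0.
# Distance from a point x to the plane is |w . x + b| / ||w||.
def margin(points, w, b):
    """Minimum distance from any training point to the hyperplane."""
    w = np.asarray(w, dtype=float)
    distances = np.abs(points @ w + b) / np.linalg.norm(w)
    return distances.min()

# Toy 2-D data, separated by the (assumed) plane x0 + x1 - 3 = 0.
X = np.array([[0.0, 1.0], [1.0, 0.0], [4.0, 4.0], [5.0, 3.0]])
m = margin(X, w=[1.0, 1.0], b=-3.0)  # smallest distance over the four points
```

The SVM training problem is to choose w and b so that this m is as large as possible while keeping the two classes on opposite sides.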

Of course, life is usually not so simple ... the data is not linearly separable. In some cases this is because of noise in the data (due, for example, to annotator errors). To accommodate noise, we introduce the idea of a soft margin, in which we allow some of the training points to be on the wrong side of the hyperplane. For these points, the 'slack' is the distance of the point from the plane. Putting the slack values of all the training points together gives a slack vector. The SVM defines some norm on this slack vector, and finds the plane minimizing the norm.
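One common way to write the slack (a sketch, using the conventional labels y in {-1, +1} rather than the 0/1 labels above) is xi_i = max(0, 1 - y_i (w . x_i + b)): zero for points comfortably on the correct side, growing with the size of the violation. The toy data below is invented for illustration:

```python
import numpy as np

# Soft-margin slack with labels y in {-1, +1}:
#   xi_i = max(0, 1 - y_i * (w . x_i + b))
# Points on the correct side of the margin get slack 0; points on the
# wrong side get slack proportional to how far over they are.
def slack_vector(X, y, w, b):
    return np.maximum(0.0, 1.0 - y * (X @ w + b))

X = np.array([[2.0, 2.0], [0.0, 0.0], [1.5, 1.5]])
y = np.array([1.0, -1.0, -1.0])  # third point is on the wrong side (noise)
xi = slack_vector(X, y, np.array([1.0, 1.0]), -3.0)
# One common formulation then minimizes ||w||^2 + C * sum(xi),
# i.e. the L1 norm of the slack vector traded off against the margin.
```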

To take a more linguistic example, suppose we are trying to detect sentences about foods. Suppose we treat the sentence as a set of words and the input attributes are the words in the sentence (that is, if English has 100,000 words, there are 100,000 binary attributes, most of which are zero). We'll find that we can create a fairly good classifier with these attributes, but we will be annoyed that it will sometimes fail because we need to take into account two words appearing together ... for example 'hot' and 'dog'. So we can introduce pairs of words as features. As we introduce additional features, we may be able to get to the point where we have a linearly separable problem. Unfortunately, there are a lot of such features ... 10,000,000,000, in fact. Even though most features will be 0 (so we could use some sparse vector representation), representing such feature vectors explicitly is inconvenient.
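The "set of words" view and the blow-up from pair features can be sketched in a few lines. This is a toy illustration with a made-up sentence, not any particular system's feature extractor:

```python
from itertools import combinations

# Sparse "set of words" representation: store only the attributes that are
# 1, as a set of strings, instead of a 100,000-dimensional 0/1 vector.
sentence = "the hot dog was delicious".split()
unigrams = set(sentence)

# Pair-of-words features square the feature space (~10^10 for a
# 100,000-word vocabulary), but any single sentence activates only a
# handful of them, so a sparse set representation stays small.
pairs = {frozenset(p) for p in combinations(sorted(unigrams), 2)}
assert frozenset({"hot", "dog"}) in pairs  # the pair feature we wanted
```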

To get around this problem, we observe that the classifier produced by the SVM has the form

f(x) = <w . x> + b

where w is the normal to the hyperplane, and < . > represents an inner (dot) product of two vectors. The w found by the SVM will be a linear combination of the training data points:

w = sum_i a_i y_i x_i

f(x) = sum_i a_i y_i <x_i . x> + b

The classifier function depends on the training points only through their dot products with the test point. Suppose now that we used a function F to transform the input vectors x:

f(x) = sum_i a_i y_i <F(x_i) . F(x)> + b

Thus we never have to compute the F(x)'s explicitly; we only need to provide a function which computes

K(x, z) = <F(x) . F(z)>

We call this a kernel function. In many cases it can be substantially faster to compute the kernel directly than to compute F and then the dot product.
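A standard example of this speed-up is the quadratic kernel K(x, z) = (x . z)^2, which equals the dot product in the space of all d^2 pairwise products of attributes -- exactly the "pairs of features" space above -- but costs O(d) instead of O(d^2). A minimal check (toy vectors, explicit F written out only to verify the identity):

```python
import numpy as np
from itertools import product

# Explicit feature map: all d^2 pairwise products x_i * x_j.
# We build it here only to check the kernel identity; in practice
# the whole point is never to construct it.
def F(x):
    return np.array([xi * xj for xi, xj in product(x, x)])

# Quadratic kernel: computed directly from the original d-dimensional
# vectors, in O(d) time.
def K(x, z):
    return float(np.dot(x, z)) ** 2

x = np.array([1.0, 2.0, 3.0])
z = np.array([0.0, 1.0, -1.0])
assert np.isclose(K(x, z), np.dot(F(x), F(z)))  # same value, O(d) vs O(d^2)
```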

The theory and software for Support Vector Machines have mostly been developed over the past few years. There is now a book about SVMs (An Introduction to Support Vector Machines, by Nello Cristianini and John Shawe-Taylor, 2000), as well as some short introductions, and several software packages, including SVMlight. SVMs are of interest because they have yielded the best results to date on several standard NLP tasks.

Taku Kudo and Yuji Matsumoto. Chunking with Support Vector Machines. Proc. NAACL-2001.

They also looked at the question of selecting the data to be annotated. Traditional training is based on sequential text annotation ... we just annotate a series of documents in sequence. Can we do better?

Ngai, G. and D. Yarowsky. Rule Writing or Annotation: Cost-efficient Resource Usage for Base Noun Phrase Chunking. Proc. ACL-2000.