G22.2591 - Advanced Natural Language Processing - Spring 2004

Lecture 4

(take questions on assignment 3)
(a good answer to assignment 3 tries two different choices of feature and provides some text suggesting why one does better than the other, by example or argument)

Text Chunking, cont'd

Support Vector Machines

The natural language processing tasks which we have considered, and most of those we will consider, are classification tasks:  we classify each token based on some features of the input, and possibly prior classifications.  In a generative probabilistic model like an HMM, we do this rather indirectly.  Let's call the input features x [a d-dimensional vector] and the output (classification) y.  [vectors are given in boldface]  We build (learn) a probabilistic model of possible feature outputs x for y = class A and a model of possible feature outputs x for y = class B.  When we get a new input x', we determine the probability of producing this output for class A and for class B, and choose the alternative for which the output is more likely.
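
To make this concrete, here is a small Python sketch of generative classification:  build a model of the features for each class, then pick the class whose model makes the observed features more likely.  It is much simpler than an HMM -- it treats the features as independent given the class -- and the class-conditional probability tables are invented for illustration.

# Sketch of generative classification: pick the class whose model
# assigns the higher probability to the observed feature vector x.
# The class-conditional probabilities below are invented for illustration,
# and features are treated as independent given the class.

import math

# P(feature_j = 1 | class), for two classes A and B and three binary features
p_feature_given_class = {
    'A': [0.8, 0.1, 0.5],
    'B': [0.2, 0.7, 0.5],
}

def log_likelihood(x, cls):
    """log P(x | cls) under the independence assumption."""
    total = 0.0
    for xj, pj in zip(x, p_feature_given_class[cls]):
        total += math.log(pj if xj == 1 else 1.0 - pj)
    return total

def classify(x):
    return max(('A', 'B'), key=lambda cls: log_likelihood(x, cls))

print(classify([1, 0, 1]))   # -> 'A'
print(classify([0, 1, 0]))   # -> 'B'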
 
We consider now a more direct approach:  learning a discriminative function directly.  That is, given a set of training examples of the form <x, y>, we learn a function f(x) which returns an appropriate value of y.  To keep things simple, we will focus on the problem of binary classification (only two possible values of y, which we will write as -1 and +1).

One particularly simple case arises if the data is linearly separable -- that is, it is possible to draw a hyperplane in d-dimensional space so that all the training examples with y = -1 are on one side of the plane, and all of the examples with y = +1 are on the other side.  The problem then reduces to one of finding this separating plane.  In general, there will be multiple such planes ... which one should we choose?  To answer this question, we define the margin m as the minimum distance from any point (in the training set) to the plane.  Support vector machines choose the plane which maximizes this margin;  they are called maximal margin classifiers.  The points which are closest to the plane -- a distance m away -- are called the support vectors.  Intuitively, these support vectors -- the points closest to the boundary between the two classes -- are the ones determining the plane and hence the discrimination function f.
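
To make the margin concrete, here is a small Python sketch:  for a candidate hyperplane <w . x> + b = 0, each point's distance to the plane is |<w . x> + b| / ||w||, and the margin is the smallest such distance.  The data points and the candidate (w, b) are made up.

# Sketch: computing the margin of a candidate separating hyperplane.
# The data points and the hyperplane (w, b) are invented for illustration.

import math

points = [((1.0, 2.0), +1), ((2.0, 3.0), +1),
          ((-1.0, -1.0), -1), ((-2.0, 0.0), -1)]

w = (1.0, 1.0)   # normal to the candidate hyperplane
b = -1.0

def distance_to_plane(x, w, b):
    """Unsigned distance from point x to the plane <w . x> + b = 0."""
    dot = sum(wj * xj for wj, xj in zip(w, x))
    return abs(dot + b) / math.sqrt(sum(wj * wj for wj in w))

# the margin is the distance of the closest training point;
# the points achieving it are the support vectors
margin = min(distance_to_plane(x, w, b) for x, y in points)
print(margin)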

Of course, life is usually not so simple ... the data is not linearly separable.  In some cases this is because of noise in the data (due, for example, to annotator errors).  To accommodate noise, we introduce the idea of a soft margin, in which we allow some of the training points to be on the wrong side of the hyperplane.  For each such point, the 'slack' measures how far it is on the wrong side.  Putting the slack values of all the training points together gives a slack vector.  The SVM defines some norm on this slack vector, and chooses the plane by trading off maximizing the margin against minimizing this norm.
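
Here is a small sketch of the slack computation, using the usual convention that a training point should satisfy y (<w . x> + b) >= 1 and that its slack is the amount by which it falls short (so points comfortably on the correct side get zero slack).  The data and the candidate (w, b) are again made up.

# Sketch: slack values for a candidate hyperplane that does not separate
# the data perfectly.  A point should satisfy y * (<w . x> + b) >= 1;
# its slack is the amount by which it falls short.  Data and (w, b) are
# invented; the second and fourth points are "noisy" and end up on the
# wrong side of the plane.

points = [((2.0, 2.0), +1), ((-0.2, 0.1), +1),
          ((-2.0, -2.0), -1), ((0.3, -0.1), -1)]

w = (1.0, 1.0)
b = 0.0

def slack(x, y, w, b):
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    return max(0.0, 1.0 - y * score)

slacks = [slack(x, y, w, b) for x, y in points]   # the slack vector
print(slacks)
print(sum(slacks))   # its 1-norm, one common choice of norm to keep small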

Kernel Methods

In other cases, it is in the nature of the problem, and of the input features selected, that the data will not be linearly separable.  For example, if the input features are the longitude and latitude of a city, and the classification desired is 'can commute to NYU', the dividing surface is (roughly) a circle, not a plane.  By mapping the input to a different space (for example, one where a coordinate is the distance from New York) we can transform the problem into one that is linearly separable.  Terminology:  we call the original input values the attributes, and the set of attributes the input space; the transformed values the features, and the set of features the feature space; and the function which maps attribute vectors into feature vectors F.
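
A toy version of the commuting example in Python:  the map F takes the (longitude, latitude) attributes to a single feature, the distance from New York, and in that one-dimensional feature space the class boundary is just a threshold.  The coordinates are approximate, the cutoff is made up, and plain Euclidean distance on degrees stands in for real geographic distance.

# Sketch: mapping attributes to features can make data linearly separable.
# In attribute space the boundary is (roughly) a circle around New York;
# in the 1-dimensional feature space it is a simple threshold.
# Coordinates are approximate; the cutoff and distance measure are
# simplifications for illustration.

import math

NYC = (-74.0, 40.7)          # (longitude, latitude)

def F(city):
    """Map the attribute vector (lon, lat) to one feature: distance from NYC."""
    return math.hypot(city[0] - NYC[0], city[1] - NYC[1])

cutoff = 1.0                  # 'can commute' if the feature is below this

cities = {'Hoboken': (-74.03, 40.74), 'Boston': (-71.06, 42.36)}
for name, coords in cities.items():
    print(name, 'commutable' if F(coords) < cutoff else 'not commutable')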

To take a more linguistic example, suppose we are trying to detect sentences about foods.  Suppose we treat the sentence as a set of words and the input attributes are the words in the sentence (that is, if English has 100,000 words, there are 100,000 binary attributes, most of which are zero).  We'll find that we can create a fairly good classifier with these attributes, but we will be annoyed that it will sometimes fail because we need to take into account two words appearing together ... for example 'hot' and 'dog'.  So we can introduce pairs of words as features.  As we introduce additional features, we may be able to get to the point where we have a linearly separable problem.  Unfortunately, there are a lot of such features ... 10,000,000,000, in fact.  Even though most features will be 0 (so we could use some sparse vector representation), representing such feature vectors explicitly is inconvenient.
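
Here is a sketch of what the explicit pair-of-words features look like for a toy vocabulary (the vocabulary and sentence are made up); with a real vocabulary of 100,000 words, the number of pairs is what produces the 10,000,000,000 figure above.

# Sketch: explicit word-pair features.  With a vocabulary of size V,
# single-word features give V dimensions; ordered pairs of words give
# V*V = 10,000,000,000 features for V = 100,000 (about half that if we
# only count unordered pairs, as below).  Vocabulary and sentence are
# invented for illustration.

from itertools import combinations

vocabulary = ['hot', 'dog', 'ate', 'the', 'a']

def word_features(sentence_words):
    present = set(sentence_words)
    return {w: int(w in present) for w in vocabulary}

def pair_features(sentence_words):
    present = set(sentence_words)
    return {(w1, w2): int(w1 in present and w2 in present)
            for w1, w2 in combinations(vocabulary, 2)}

sent = ['the', 'hot', 'dog']
print(word_features(sent))
print({k: v for k, v in pair_features(sent).items() if v})  # only the 1s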

To get around this problem, we observe that the classifier produced by the SVM has the form

f(x) = <w . x> + b

where w is the normal to the hyperplane, b is a scalar offset, and < . > represents an inner (dot) product of two vectors. The w found by the SVM will be a linear combination of the training data points:

w = sum(i) a_i y_i x_i

where the coefficients a_i are determined during training, and are nonzero only for the support vectors.  So

f(x) = sum(i) a_i y_i <x_i . x> + b

The classifier function thus depends on the training points only through their dot products with the test point.  Suppose now that we used a function F to transform the input vectors x_i into feature vectors F(x_i).  The formula would then be

f(x) = sum(i) a_i y_i <F(x_i) . F(x)> + b

Thus we never have to compute the F(x)'s explicitly;  we only need to provide a function which computes

K(x,z) = < F(x) . F(z)>

We call this a kernel function.  In many cases it can be substantially faster to compute the kernel directly than to compute F and then the dot product.
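
To tie this to the bag-of-words example above:  for word vectors x and z, the simple kernel K(x,z) = <x . z>^2 is exactly the dot product of the huge vectors whose entries are all the products x_i x_j (the word-pair features), so the dual form of the classifier never needs to build those vectors.  A small Python sketch, with made-up training points, labels, coefficients a_i, and offset b:

# Sketch: a kernel equal to the dot product in the word-pair feature space.
# For vectors x, z the degree-2 polynomial kernel <x . z>^2 equals the dot
# product of the explicit feature vectors F(x), F(z) whose entries are all
# ordered pairs x_i * x_j -- so we never have to build those vectors.
# Training data, labels, dual coefficients a_i, and offset b are invented.

def dot(x, z):
    return sum(xi * zi for xi, zi in zip(x, z))

def K(x, z):
    return dot(x, z) ** 2

def F(x):
    """Explicit map to the word-pair feature space (for checking only)."""
    return [xi * xj for xi in x for xj in x]

x = [1, 0, 1, 1]
z = [1, 1, 0, 1]
print(K(x, z), dot(F(x), F(z)))   # the two numbers agree

# Dual-form classifier: f(x) = sum(i) a_i y_i K(x_i, x) + b
training = [([1, 0, 1, 1], +1), ([0, 1, 1, 0], -1)]
a = [0.5, 0.5]       # made-up dual coefficients
b = 0.0

def f(x):
    return sum(ai * yi * K(xi, x) for (xi, yi), ai in zip(training, a)) + b

print(f([1, 0, 0, 1]))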

The theory and software for Support Vector Machines have mostly been developed over the past few years.  There is now a book about SVMs (An Introduction to Support Vector Machines, by Nello Cristianini and John Shawe-Taylor, 2000), as well as some short introductions, and several software packages, including SVMLight.  SVMs are of interest because they have yielded the best results to date on several standard NLP tasks.

Using SVMs for Chunking

The best performance reported to date on both the baseNP and text chunking tasks was obtained with a Support Vector Machine method.  For baseNP chunking, Kudo and Matsumoto report an F score of 94.22% with the small data set of Ramshaw and Marcus, and 95.77% by training on almost the entire Penn Treebank.

Taku Kudo and Yuji Matsumoto.  Chunking with Support Vector Machines.  Proc. NAACL 2001.
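
Very roughly, using an SVM for chunking means classifying each token into a chunk tag (B, I, or O) from a feature vector built out of the surrounding words, their part-of-speech tags, and the chunk tags already assigned to earlier tokens.  Here is a sketch of that feature extraction in Python; the window size, feature names, and example sentence are illustrative, not the exact configuration of the paper.

# Sketch: building per-token features for SVM-based chunking.
# Each token is classified into a chunk tag using features from a window
# of words and POS tags plus previously assigned chunk tags.  The window
# size, feature names, and example are illustrative only.

def token_features(words, tags, prev_chunks, i):
    feats = {}
    for offset in (-2, -1, 0, 1, 2):
        j = i + offset
        if 0 <= j < len(words):
            feats['word[%d]=%s' % (offset, words[j])] = 1
            feats['pos[%d]=%s' % (offset, tags[j])] = 1
    for offset in (-2, -1):                 # chunk tags already assigned
        j = i + offset
        if 0 <= j < len(prev_chunks):
            feats['chunk[%d]=%s' % (offset, prev_chunks[j])] = 1
    return feats

words = ['He', 'reckons', 'the', 'current', 'account', 'deficit']
tags  = ['PRP', 'VBZ', 'DT', 'JJ', 'NN', 'NN']
prev  = ['B-NP', 'B-VP']                    # decisions made so far
print(sorted(token_features(words, tags, prev, 2)))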

Hand Tagging vs. Machine Learning;  Active Learning

baseNP chunking is a task for which people (with some linguistics training) can write quite good rules fairly quickly.  This raises the practical question of whether we should be using machine learning at all.  Clearly if there is already a large relevant resource, it makes sense to learn from it.  However, if we have to develop a chunker for a new language, is it cheaper to annotate some data or to write the rules directly?  Ngai and Yarowsky addressed this question.

They also looked at the question of selecting the data to be annotated.  Traditional training is based on sequential text annotation ... we just annotate a series of documents in sequence.  Can we do better?  The idea of letting the learner choose the most informative examples for annotation is known as active learning; a sketch of one common selection strategy follows the reference below.

Grace Ngai and David Yarowsky.  Rule Writing or Annotation: Cost-efficient Resource Usage for Base Noun Phrase Chunking.  Proc. ACL 2000.
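
Here is a sketch of that selection strategy, uncertainty sampling:  with a classifier of the form f(x) = <w . x> + b, the examples the current model is least sure about are those with |f(x)| closest to zero, and those are handed to the annotator first.  The scoring function and pool of unlabeled examples are made up, and this is the general idea rather than the specific scheme studied in the paper.

# Sketch: active learning by uncertainty sampling.  From a pool of
# unlabeled examples, pick the ones the current classifier is least
# certain about (score closest to zero) and annotate those first.
# The scoring function and the pool of examples are invented.

def f(x):
    """Stand-in for the current classifier's decision value <w . x> + b."""
    w, b = (1.0, -2.0), 0.5
    return sum(wj * xj for wj, xj in zip(w, x)) + b

pool = [(3.0, 1.0), (0.4, 0.5), (2.0, 2.2), (-1.0, -0.8)]

# sort unlabeled examples by |f(x)|; annotate the most uncertain ones first
to_annotate = sorted(pool, key=lambda x: abs(f(x)))
print(to_annotate[:2])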