G22.2591 - Advanced Natural Language Processing - Spring 2004
(take questions on assignment 3)
(a good answer to assignment 3 tries two different choices of feature
and provides some text suggesting why one does better than the other,
by example or argument)
Text Chunking, cont'd
Support Vector Machines
The natural language processing tasks which we have considered, and most
of those we will consider, are classification tasks: we classify
each token based on some features of the input, and possibly on prior
classifications. In a generative probabilistic model like an HMM,
we do this rather indirectly. Let's call the input features x [a d-dimensional vector] and the
class y. [vectors are given in boldface] We build (learn) a
probabilistic model of possible feature
outputs x for y = class A
and a model of possible feature outputs x for y = class B. When we get
a new input x', we determine
the probability of producing this output for class A and for class B, and
choose the alternative for which the output is more likely.
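The generative procedure just described can be sketched in a few lines. This is only an illustration, not the HMM itself: it assumes each class is modeled by independent Gaussians over the features and that both classes are equally likely a priori. The function names and the toy data are invented for the example.

```python
import math

def fit_gaussian(points):
    """Estimate a per-dimension mean and variance from one class's training vectors."""
    n = len(points)
    dims = len(points[0])
    means = [sum(p[d] for p in points) / n for d in range(dims)]
    # A small floor on the variance avoids division by zero on degenerate data.
    variances = [max(sum((p[d] - means[d]) ** 2 for p in points) / n, 1e-6)
                 for d in range(dims)]
    return means, variances

def log_likelihood(x, model):
    """Log probability density of x under an independent-Gaussian class model."""
    means, variances = model
    return sum(-0.5 * math.log(2 * math.pi * v) - (xi - m) ** 2 / (2 * v)
               for xi, m, v in zip(x, means, variances))

def classify(x, model_a, model_b):
    """Choose the class whose model makes the observed x more likely."""
    if log_likelihood(x, model_a) >= log_likelihood(x, model_b):
        return "A"
    return "B"

# Toy training data (invented for illustration).
class_a = [(1.0, 1.2), (0.8, 1.0), (1.2, 0.9)]
class_b = [(3.0, 3.1), (3.2, 2.9), (2.8, 3.3)]
model_a = fit_gaussian(class_a)
model_b = fit_gaussian(class_b)

print(classify((1.1, 1.0), model_a, model_b))
print(classify((2.9, 3.0), model_a, model_b))
```

Note that nothing here learns a boundary between the classes; the decision falls out of comparing two separately learned models.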
We consider now a more direct approach: learning a discriminative
function directly. That is, given a set of training examples of
the form <x, y>, we learn
a function f(x) which returns
an appropriate value
of y. To keep things simple, we will focus on the problem of binary classification (only two
possible values of y, 0 and 1).
One particularly simple case arises if the data is linearly separable -- that is, it is
possible to draw a hyperplane in d-dimensional space so that all the
training examples with y=0 are on one side of the plane, and all of the
examples with y=1 are on the other side of the plane. The problem
then reduces to one of finding this separating plane. In general,
there will be multiple planes ... which one should we choose? To
answer this question, we define the margin
m as the minimum distance from any point (in the training set)
to the plane. Support vector machines choose the plane which
maximizes this margin; they are called maximal margin classifiers.
The points which are closest to the plane
-- a distance m away -- are called the support vectors.
Intuitively, these support vectors -- the points closest to the
boundary between y=0 and y=1 -- are the ones determining the plane and
hence the discrimination function f.
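To make the definitions of margin and support vector concrete, here is a small sketch that measures the margin of two candidate planes on a toy two-dimensional training set. It does not search for the best plane (a real SVM package solves an optimization problem for that). The data, the two planes, and the function names are assumptions made for the example.

```python
import math

def signed_distance(w, b, x):
    """Signed distance from point x to the hyperplane <w . x> + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm

def margin(w, b, points, labels):
    """Minimum distance from any training point to the plane.

    Labels are 0 or 1; points with y=1 should lie on the positive side.
    The result is negative if the plane puts some point on the wrong side.
    """
    return min((1 if y == 1 else -1) * signed_distance(w, b, x)
               for x, y in zip(points, labels))

def support_vectors(w, b, points, labels, tol=1e-9):
    """Training points lying a distance m (the margin) from the plane."""
    m = margin(w, b, points, labels)
    return [x for x in points if abs(abs(signed_distance(w, b, x)) - m) <= tol]

# Toy linearly separable training set (invented for illustration).
points = [(0.0, 0.0), (0.0, 1.0), (2.0, 0.0), (3.0, 1.0)]
labels = [0, 0, 1, 1]

plane_a = ((1.0, 0.0), -1.0)   # the line x1 = 1
plane_b = ((1.0, 0.0), -1.5)   # the line x1 = 1.5

print(margin(*plane_a, points, labels))           # margin of plane_a
print(margin(*plane_b, points, labels))           # margin of plane_b
print(support_vectors(*plane_a, points, labels))  # points closest to plane_a
```

Both planes separate the data, but plane_a leaves a wider gap, so a maximal margin classifier would prefer it. For this toy set plane_a is in fact the maximal margin plane.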
Of course, life is usually not so simple ... the data is not linearly
separable. In some cases this is because of noise in the data
(due, for example, to annotator errors). To accommodate noise, we
introduce the idea of a soft margin,
in which we allow some of the training points to be on the wrong side
of the hyperplane. For these points, the 'slack' is the distance
the point is from the plane. Putting the slack values of all the
training points together gives a slack vector. The SVM defines
some norm on this slack vector, and finds the plane minimizing the norm.
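The slack vector can be sketched as follows, using the simplified definition given above (slack is the distance to the plane for a point on the wrong side, and zero otherwise). Standard SVM formulations measure slack relative to the margin and trade it off against the margin width, so treat this as an illustration only. The data, the plane, and the function names are assumptions.

```python
import math

def slack_vector(w, b, points, labels):
    """Slack per training point: 0 on the correct side, else distance to the plane."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    slacks = []
    for x, y in zip(points, labels):
        sign = 1 if y == 1 else -1
        d = sign * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
        slacks.append(max(0.0, -d))   # d < 0 means the point is on the wrong side
    return slacks

def l1_norm(v):
    """Sum of the slack values."""
    return sum(abs(s) for s in v)

def l2_norm(v):
    """Euclidean length of the slack vector."""
    return math.sqrt(sum(s * s for s in v))

# Toy data: the last point is a "noisy" example labeled 0 among the 1s.
points = [(0.0, 0.0), (0.0, 1.0), (2.0, 0.0), (3.0, 1.0), (2.5, 0.5)]
labels = [0, 0, 1, 1, 0]
w, b = (1.0, 0.0), -1.0   # the line x1 = 1

slacks = slack_vector(w, b, points, labels)
print(slacks)
print(l1_norm(slacks), l2_norm(slacks))
```

Only the mislabeled point contributes slack; the choice of norm determines how heavily a few large violations count against many small ones.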
In other cases, it is in the nature of the problem, and the input
features selected, that the data will not be linearly separable.
For example, if the input features are the longitude and latitude of a
city, and the classification desired is 'can commute to NYU', the
dividing surface is (roughly) a circle, not a plane. By mapping
the input to a different space (for example, where a coordinate is the
distance from New York) we can transform the problem into one that is
linearly separable. Terminology: we call the original input
the attributes, and the set
of attributes the input space;
the transformed values the features,
and the set of features the feature
space; and the function which maps attribute vectors into
feature vectors F.
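The commuting example can be sketched as a mapping F from the attributes (latitude, longitude) to a single feature (distance from NYU), after which the classifier is just a threshold, i.e. a linear separator in the one-dimensional feature space. The coordinates are approximate, and the 80 km threshold is an arbitrary assumption chosen for illustration.

```python
import math

NYU = (40.7295, -73.9965)   # approximate latitude, longitude (assumed)

def distance_km(a, b):
    """Great-circle distance between two (lat, lon) points, by the haversine formula."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))

def F(attributes):
    """Map the attribute vector (lat, lon) to a one-dimensional feature vector."""
    return (distance_km(attributes, NYU),)

def can_commute(attributes, threshold_km=80.0):
    """Linear classifier in feature space: a threshold on the single feature."""
    return F(attributes)[0] < threshold_km

newark = (40.7357, -74.1724)   # approximate coordinates
boston = (42.3601, -71.0589)   # approximate coordinates

print(can_commute(newark))
print(can_commute(boston))
```

In the input space no straight line separates commutable from non-commutable cities, but in the feature space a single cut point does.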
To take a more linguistic example, suppose we are trying to detect
sentences about foods. Suppose we treat the sentence as a set of
words and the input attributes are the words in the sentence (that is,
if English has 100,000 words, there are 100,000 binary attributes, most
of which are zero). We'll find that we can create a fairly good
classifier with these attributes, but we will be annoyed that it will
sometimes fail because we need to take into account two words appearing
together ... for example 'hot' and 'dog'. So we can introduce
pairs of words as features. As we introduce additional features,
we may be able to get to the point where we have a linearly separable
problem. Unfortunately, there are a lot of such features ...
10,000,000,000, in fact. Even though most features will be 0 (so
we could use some sparse vector representation), representing such
feature vectors explicitly is inconvenient.
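A sparse representation of the word and word-pair features might look like the sketch below, which stores only the features that are present. The function names and the whitespace tokenization are simplifying assumptions. The count of 10,000,000,000 corresponds to ordered pairs over a 100,000-word vocabulary; counting unordered pairs would give roughly half that.

```python
from itertools import combinations

def word_attributes(sentence):
    """Sparse representation: the set of words present (all other attributes are 0)."""
    return set(sentence.lower().split())

def pair_features(sentence):
    """Sparse set of unordered word pairs that co-occur in the sentence."""
    words = sorted(word_attributes(sentence))
    return set(combinations(words, 2))

sentence = "I ate a hot dog"
print(word_attributes(sentence))
print(("dog", "hot") in pair_features(sentence))

vocabulary_size = 100_000
print(vocabulary_size ** 2)   # number of ordered word pairs
```

Even in sparse form the number of active pair features grows quadratically with sentence length, which is part of why an explicit representation is inconvenient.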
To get around this problem, we observe that the classifier produced by
the SVM has the form
f(x) = <w . x> + b
where w is the normal to the
hyperplane, and < . > represents an inner (dot) product of two
vectors. The w found by the
SVM will be a linear combination of the training data points:
w = sum(i) ai yi xi
so that
f(x) = sum(i) ai yi <xi . x> + b
The classifier function only depends on the dot product between
training points and the test point. Suppose now that we used a
function F to transform the input vectors xi into feature vectors F(xi). The formula
would then be
f(x) = sum(i) ai yi <F(xi) . F(x)> + b
Thus we never have to compute the F(x)'s explicitly; we only need
to provide a function which computes
K(x,z) = < F(x) . F(z)>
We call this a kernel function.
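The classifier in this form can be written so that the kernel is a parameter. The sketch below is illustrative: the support vectors, the multipliers ai, and the bias b are hand-picked (not learned) so that w = (1, 0) and b = -1. Note that the formula assumes the labels yi are coded as -1 and +1 rather than 0 and 1.

```python
def dot(x, z):
    """Ordinary inner product; this is the kernel for F = identity."""
    return sum(xi * zi for xi, zi in zip(x, z))

def decision_function(x, support, alphas, labels, b, kernel=dot):
    """f(x) = sum(i) ai yi K(xi, x) + b, with labels yi in {-1, +1}."""
    return sum(a * y * kernel(xi, x)
               for a, y, xi in zip(alphas, labels, support)) + b

# Hand-picked values for illustration (assumed, not the output of a trained SVM).
support = [(0.0, 0.0), (2.0, 0.0)]
labels = [-1, +1]
alphas = [0.5, 0.5]
b = -1.0

print(decision_function((3.0, 1.0), support, alphas, labels, b))  # positive side
print(decision_function((0.0, 1.0), support, alphas, labels, b))  # negative side
```

Swapping in a different kernel function changes the feature space without touching the rest of the classifier.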
In many cases it can be substantially faster to compute the kernel
directly than to compute F and then the dot product.
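As a small check of this claim, take F to be the map from a d-dimensional vector to all d*d ordered pairwise products of its components (the analogue of the word-pair features). The inner product in that feature space equals the square of the inner product in the input space, so the kernel needs about d multiplications instead of d*d. The names F and K follow the text; the test vectors are invented.

```python
from itertools import product

def dot(x, z):
    """Ordinary inner product of two equal-length vectors."""
    return sum(xi * zi for xi, zi in zip(x, z))

def F(x):
    """Explicit feature map: all ordered pairwise products xi * xj (d*d features)."""
    return [xi * xj for xi, xj in product(x, repeat=2)]

def K(x, z):
    """Quadratic kernel: the same inner product, computed in the input space."""
    return dot(x, z) ** 2

# Binary attribute vectors, as in the bag-of-words example (values invented).
x = (1, 0, 1, 1)
z = (1, 1, 0, 1)

print(len(F(x)))                  # d*d explicit features
print(dot(F(x), F(z)), K(x, z))   # the two values agree
```

With d = 100,000 the explicit map would produce 10,000,000,000 features per vector, while K still costs only one pass over the d attributes.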
The theory and software for Support Vector Machines have mostly been
developed over the past few years. There is now a book about SVMs
(An Introduction to Support Vector Machines,
by Nello Cristianini and John Shawe-Taylor, 2000), as well as some short
introductions, and several software packages, including SVMLight. SVMs are of
interest because they have yielded the best results to date on several
standard NLP tasks.
Using SVMs for Chunking
The best performance on the baseNP and chunking tasks was obtained
using a Support Vector Machine method, which achieved an accuracy
of 94.22% with the small data set of Ramshaw and Marcus, and 95.77% by
training on almost the entire Penn Treebank.
Hand Tagging vs. Machine Learning; Active Learning
baseNP chunking is a task for which people (with some linguistics
training) can write quite good rules fairly quickly. This raises
the practical question of whether we should be using machine learning
at all. Clearly if there is already a large relevant resource, it
makes sense to learn from it. However, if we have to develop a
chunker for a new language, is it cheaper to annotate some data or to
write the rules directly? Ngai and Yarowsky addressed this question.
They also looked at the question of selecting the data to be
annotated. Traditional training is based on sequential text
annotation ... we just annotate a series of documents in
sequence. Can we do better?