G22.2591 - Advanced Natural Language Processing - Spring 2011
Name Recognition, cont'd
Continue the discussion of using unsupervised clustering to improve name tagging.
Active learning for Name Tagging
In place of sequential annotation, we employ selective sampling, where
we select examples to be annotated. The hope is that by selecting
informative examples (given what we know so far), we can converge to an
accurate model much faster (i.e., with fewer annotated data points)
than if we annotated data sequentially.
The speed-up with active learning is, for a given level of
performance, the ratio of the number of examples which must be
annotated sequentially to the
number which must be annotated using active learning.
Active learning can produce large gains for NLP tasks because of the
Zipfian distribution of many phenomena in NLP ... a small number of
phenomena (words, structures) occur very frequently, and a large number of
phenomena occur rarely (the "tail" of the distribution). With sequential
annotation we keep labeling the same common phenomena again and again,
which may be a waste of time.
How do we decide which examples are the most informative? There are
two main approaches (a short code sketch follows this list):
- uncertainty-based sampling
- use some estimate of the uncertainty regarding the label for a
point; for example, for a probabilistic classifier involving a binary
choice, how close the probability is to 0.5. If there are many possible
labelings (for example, all the name labelings of a sentence), the margin
(the difference between the probabilities of the top two
hypotheses) can be used (Scheffer, Decomain, and Wrobel 2001).
- committee-based sampling
- create two or more classifiers and prefer data points on which
the classifiers disagree; for example, use two classifiers which are
trained on different subsets of the features.
If the classifiers produce a probability distribution over labels,
we can use a measure of the degree to which these two distributions
differ (for example, KL-divergence).
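To make these criteria concrete, here is a minimal Python sketch of
both measures (the classifiers and the unlabeled pool are hypothetical
stand-ins, not any particular system's API):

    import math

    def margin(probs):
        # probs: probability distribution over the candidate labelings
        # of one data point, e.g. [0.45, 0.40, 0.15].  A small margin
        # between the top two hypotheses means high uncertainty, so
        # smaller = more informative.
        top_two = sorted(probs, reverse=True)[:2]
        return top_two[0] - top_two[1]

    def kl_divergence(p, q):
        # KL(p || q); assumes both distributions assign nonzero
        # probability to every label.
        return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

    def committee_disagreement(p, q):
        # Symmetrized KL-divergence between the label distributions
        # produced by two committee members; larger = more
        # disagreement, hence a more informative data point.
        return kl_divergence(p, q) + kl_divergence(q, p)

Uncertainty-based sampling would then pick the unlabeled point with the
smallest margin; committee-based sampling would pick the point on which
the two members disagree most.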
With sequential labeling, common features will be heavily represented
in the training set. With uncertainty-based sampling, the active
learner may prefer features or combinations of features which have never
been seen before, and even the most common features may appear only once
in the training set. To provide better balance, some active learners add
a representativeness weight, so that data points involving
common features are preferred to those involving rare features.
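One simple way to realize such a weight (a sketch of the general idea,
not a specific published method): score each candidate by its average
similarity to the unlabeled pool, so that points in dense regions of
feature space -- those with common feature combinations -- are
preferred.

    import numpy as np

    def representativeness(X):
        # X: one L2-normalized feature vector per unlabeled point.
        # A point's weight is its average cosine similarity to the
        # whole pool; points with common feature combinations lie in
        # dense regions and therefore score high.
        return (X @ X.T).mean(axis=1)

    def selection_score(uncertainty, X, beta=1.0):
        # uncertainty: array of per-point informativeness scores.
        # beta controls how strongly common examples are preferred.
        return uncertainty * representativeness(X) ** beta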
In principle, we want to select an example, label it, update the model,
and then select the next example. But updating the model and rescoring
all the unlabeled points may be slow. A natural solution is to select
a batch of data points at each iteration. Unfortunately, if the same
criteria and model are used, the data points selected in the batch are
likely to be very similar, defeating the benefits of active learning.
So systems which do batch active learning need to add a diversity
condition -- requiring that the data points in the batch not be too
similar to one another.
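A simple greedy implementation of the diversity condition (a sketch;
the similarity threshold is an arbitrary choice):

    import numpy as np

    def select_batch(scores, X, k, max_sim=0.9):
        # scores: informativeness of each unlabeled point
        # X: L2-normalized feature vector for each point
        # Take points in order of decreasing score, but skip any
        # candidate whose cosine similarity to a point already in
        # the batch exceeds max_sim -- this enforces diversity.
        batch = []
        for i in np.argsort(-scores):
            if all(X[i] @ X[j] < max_sim for j in batch):
                batch.append(i)
            if len(batch) == k:
                break
        return batch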
Annotation speed and the unit of annotation
Speed-up is typically measured in terms of number of items annotated.
However, this may not be a realistic measure of true speed-up (reduction
in labeling time) in some cases. Suppose you are training a
classifier, and at each iteration you ask the user about the class of
a single named entity. If you ask the user about a single name in
the middle of a document, the user has to read at least the containing
sentence and perhaps several prior sentences. The burden per name
annotated is considerably less for sequential annotation. This
suggests that the optimal unit of annotation may be at least an entire
sentence.
Simulated active learning
Real active learning implies a person in the loop. That is a very
expensive approach if one wants to conduct multiple experiments with
different active learning strategies. Consequently most A.L. experiments
use simulated A.L., where the training corpus is actually labeled in
advance, but the labels are revealed to the learner only on request.
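In code, simulated active learning is just a loop over a fully labeled
corpus whose labels are hidden behind a lookup; a skeleton (train_fn
and score_fn are stand-ins for whatever tagger and selection criterion
are being tested):

    def simulated_active_learning(pool, gold_labels, seed, rounds,
                                  train_fn, score_fn):
        # gold_labels holds the pre-existing annotations; each one is
        # revealed to the learner only when its example is selected.
        labeled = {x: gold_labels[x] for x in seed}
        unlabeled = [x for x in pool if x not in labeled]
        model = train_fn(labeled)
        for _ in range(rounds):
            # "ask the annotator" about the most informative
            # remaining example
            x = max(unlabeled, key=lambda e: score_fn(model, e))
            labeled[x] = gold_labels[x]
            unlabeled.remove(x)
            model = train_fn(labeled)   # retrain on the enlarged set
        return model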
Active learning for named entities
Active learning has been applied to many NLP tasks ... POS tagging,
parsing, machine translation. There have been a few efforts to apply
it to named entity tagging:
Markus Becker, Ben Hachey, Beatrice Alex, and Claire Grover.
Optimising Selective Sampling for Bootstrapping Named Entity Recognition.
Proceedings of the ICML-2005 Workshop on Learning with Multiple Views.
They developed their system using a large annotated corpus (GENIA)
and then tried it on a new subdomain (astronomy) with limited training
data. They used a committee-based method involving two Markov Model
name taggers with different feature sets, and selected sentences based
on the difference of the probability distributions of NE tags. They
report typical reductions of 40% in the number of labels required to
reach the same level of performance.
Shen, D., Zhang, J., Su, J., Zhou, G., and Tan, C.
Multi-criteria-based active learning for named entity recognition.
Proceedings of ACL 2004.
Uses an SVM as a binary discriminative classifier to decide whether
a token is a member of a given name class (tags just one type of name
at a time). Incorporates all 3 measures (informativeness,
representativeness, and diversity). Uses distance to the hyperplane
to measure informativeness. Applied to both MUC-6 and GENIA tasks,
it finds typically a 60% reduction in labels required for the same
level of performance.
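Distance to the hyperplane is easy to compute for a linear SVM: a token
close to the separating hyperplane is one the classifier is least sure
about. A minimal sketch (w, b, and the token's feature vector are
assumed given):

    import numpy as np

    def distance_to_hyperplane(w, b, x):
        # For a linear SVM with weight vector w and bias b, the
        # unsigned distance of feature vector x from the hyperplane
        # is |w.x + b| / ||w||; smaller distance = less certain =
        # more informative.
        return abs(w @ x + b) / np.linalg.norm(w)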
Looking ahead -- lexico-syntactic patterns
Marti Hearst.
Automatic acquisition of hyponyms from large text corpora.
Proceedings of COLING 1992.
Rion Snow, Daniel Jurafsky, and Andrew Ng.
Learning syntactic patterns for automatic hypernym discovery.
Proceedings of NIPS 2004.