G22.2591 - Advanced Natural Language Processing - Spring 2009
Name Recognition, cont'd
Active learning for Name Tagging
In place of sequential annotation, we employ selective sampling,
where we select examples to be annotated. The hope is that by selecting
informative examples (given what we know so far), we can converge to an
accurate model much faster (i.e., with fewer annotated data points) than if
we annotated data sequentially.
The speed-up with active learning is, for a given level of performance, the
ratio of the number of examples which must be annotated sequentially to the
number which must be annotated using active learning. For example, if reaching
a given F score requires 10,000 sequentially annotated examples but only
4,000 actively selected ones, the speed-up is 2.5.
Active learning can produce large gains for NLP tasks because of the
Zipfian distribution of many phenomena in NLP ... a small number of
phenomena (words, structures) occur very frequently and a large number of
phenomena occur rarely (the "tail" of the distribution). With sequential
annotation we keep labeling the same common phenomena again and again,
which may be a waste of time.
How do we decide what are the most informative examples? There are two
main approaches:
- uncertainty-based sampling
- use some estimate of the uncertainty regarding the label for a data
point; for example, for a probabilistic classifier involving a binary
choice, how close the probability is to 0.5. If there are many possible
labelings (for example, all the name labelings of a sentence) the
margin (the difference of the probability of the top two
hypotheses) can be used (Scheffer, Decomain, and Wrobel 2001).
- committee-based sampling
- create two or more classifiers and prefer data points on which they
disagree; for example, use two classifiers which are trained on
different subsets of the features.
If the classifiers produce a probability distribution over labels,
we can use a measure of the degree to which these two distributions
differ (for example, KL-divergence)
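The two selection criteria can be sketched as follows; this is a minimal
illustration, and the probability vectors are made-up values rather than the
output of any of the systems discussed here:

```python
import math

def margin(probs):
    """Margin of the top two hypotheses: a smaller margin means the
    model is less certain, i.e. the example is more informative
    (the measure of Scheffer, Decomain, and Wrobel 2001)."""
    top = sorted(probs, reverse=True)
    return top[0] - top[1]

def kl_divergence(p, q):
    """KL(p || q) between two committee members' label distributions;
    a larger divergence means more disagreement, so the example is
    preferred for annotation."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# A model torn between two labels (small margin) is preferred over a
# confident one (large margin).
assert margin([0.45, 0.40, 0.15]) < margin([0.90, 0.07, 0.03])
```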
With sequential labeling, common features will be heavily represented
in the training set. With uncertainty-based sampling, the active learner
may prefer features or combinations of features which have never been
seen before, and even the most common features may only appear once in
the training set. To provide better balance, some active learners include
a representativeness weight, so that data points involving
common features are preferred to those involving rare features.
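A representativeness weight of this kind might be sketched as below; the
feature counts, the averaging scheme, and the `beta` interpolation weight are
all illustrative assumptions, not a specific published formula:

```python
from collections import Counter

def representativeness(features, feature_counts, pool_size):
    """Average relative frequency of an example's features in the
    unlabeled pool: examples built from common features score higher."""
    if not features:
        return 0.0
    return sum(feature_counts[f] for f in features) / (len(features) * pool_size)

def combined_score(uncertainty, rep, beta=0.5):
    """Mix informativeness with representativeness; beta is a
    hypothetical interpolation weight, chosen here for illustration."""
    return beta * uncertainty + (1 - beta) * rep

# Hypothetical feature counts over a pool of 100 unlabeled examples.
counts = Counter({"capitalized": 50, "suffix=-ion": 90, "rare-suffix": 1})
common = representativeness(["capitalized", "suffix=-ion"], counts, 100)
rare = representativeness(["rare-suffix"], counts, 100)
assert common > rare  # common-feature examples get the higher weight
```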
In principle, we want to select an example, label it, update the model,
and then select the next example. But updating the model and re-evaluating
all the unlabeled points may be slow. A natural solution is to select
a batch of data points at each iteration. Unfortunately, if the same
criteria and model are used, the data points selected in the batch are
likely to be very similar, defeating the benefits of active learning.
So systems which do batch active learning need to add a diversity
condition -- requiring that the data points in the batch not be too similar.
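One simple way to impose such a diversity condition is greedy selection with a
similarity threshold. This sketch uses Jaccard overlap of feature sets as the
similarity measure, which is one plausible choice rather than a prescribed one:

```python
def jaccard(a, b):
    """Jaccard similarity of two feature sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def select_batch(candidates, batch_size, max_similarity=0.5):
    """Greedy diverse batch selection: take (score, feature-set)
    candidates in order of decreasing informativeness, skipping any
    whose overlap with an already-chosen example exceeds the
    max_similarity threshold."""
    batch = []
    for score, feats in sorted(candidates, key=lambda c: c[0], reverse=True):
        if all(jaccard(feats, f) <= max_similarity for _, f in batch):
            batch.append((score, feats))
        if len(batch) == batch_size:
            break
    return batch

# The near-duplicate second candidate is skipped in favor of the third.
candidates = [(0.90, {"cap", "suffix=-son"}),
              (0.85, {"cap", "suffix=-son", "prev=Mr."}),
              (0.80, {"digits", "hyphen"})]
batch = select_batch(candidates, batch_size=2)
assert [s for s, _ in batch] == [0.90, 0.80]
```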
Annotation speed and the unit of annotation
Speed-up is typically measured in terms of number of items annotated.
However, this may not be a realistic measure of true speed-up (reduction
in labeling time) in some cases. Suppose you are training a named-entity
classifier, and at each iteration you ask the user about the class of
a single named entity. If you ask the user about a single name in
the middle of a document, the user has to read at least the containing
sentence and perhaps several prior sentences. The burden per name
annotated is considerably less for sequential annotation. This
suggests that the optimal unit of annotation may be at least an entire
sentence.
Simulated active learning
Real active learning implies a person in the loop. That is a very costly
approach if one wants to conduct multiple experiments with different
active learning strategies. Consequently most A.L. experiments use
simulated A.L., where the training corpus is actually labeled in advance
but the labels are revealed to the learner only on request.
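A simulated-A.L. loop can be sketched as follows; `train` and `select` are
placeholder hooks standing in for a real tagger and a real selection
criterion, and a gold label is revealed only when its example is chosen:

```python
def simulated_active_learning(pool, train, select, rounds):
    """Simulated active learning: `pool` maps each example to its gold
    label (annotated in advance), but a label reaches the learner only
    when `select` asks for that example."""
    labeled = {}
    unlabeled = dict(pool)
    for _ in range(rounds):
        if not unlabeled:
            break
        model = train(labeled)                    # retrain on labels so far
        chosen = select(model, list(unlabeled))   # pick the next example
        labeled[chosen] = unlabeled.pop(chosen)   # "reveal" its gold label
    return labeled

# Trivial stand-ins for the hooks, just to exercise the loop.
pool = {"sent1": "PER", "sent2": "ORG", "sent3": "O"}
acquired = simulated_active_learning(pool,
                                     train=lambda data: None,
                                     select=lambda model, xs: xs[0],
                                     rounds=2)
assert len(acquired) == 2
```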
Active learning for named entities
Active learning has been applied to many NLP tasks ... POS tagging,
parsing, machine translation. There have been a few efforts to apply
it to named entity tagging:
Markus Becker, Ben Hachey, Beatrice Alex, and Claire Grover.
Optimising Selective Sampling for Bootstrapping Named Entity Recognition.
Proceedings of the ICML-2005 Workshop on Learning with Multiple Views.
They developed their system using a large annotated corpus (GENIA)
and then tried it on a new subdomain (astronomy) with limited test
data. They used a committee-based method involving two Markov Model
name taggers with different feature sets and selected sentences based on
difference of distributions. They report typical reductions of 40% in
labels required to reach the same level of performance.
Shen, D., Zhang, J., Su, J., Zhou, G., and Tan, C.
Multi-criteria-based active learning for named entity recognition.
Proceedings of ACL 2004.
Uses an SVM as a binary discriminative classifier to decide whether
a token is a member of a given name class (tags just one type of name
at a time). Incorporates all 3 measures (informativeness,
representativeness, and diversity). Uses distance to the hyperplane
to measure informativeness. Applied to both MUC-6 and GENIA tasks,
finds typically a 60% reduction in labels required for the same level
of performance.
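Distance to the hyperplane as an informativeness measure can be illustrated
for a linear model; this is a hand-rolled sketch of the idea, not the
authors' SVM implementation, and the weights are made-up values:

```python
import math

def hyperplane_distance(weights, bias, x):
    """Distance of point x from the linear decision boundary
    w . x + b = 0, i.e. |w . x + b| / ||w||.  A small distance means
    the classifier is least certain about the point, hence most
    informative under a distance-to-hyperplane criterion."""
    dot = sum(w * xi for w, xi in zip(weights, x)) + bias
    return abs(dot) / math.sqrt(sum(w * w for w in weights))

# A point near the boundary is preferred over one far from it.
near = hyperplane_distance([3.0, 4.0], 0.0, [0.1, 0.0])
far = hyperplane_distance([3.0, 4.0], 0.0, [2.0, 2.0])
assert near < far
```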
Looking ahead -- lexico-syntactic patterns
Marti Hearst.
Automatic acquisition of hyponyms from large text corpora.
Proceedings of COLING 1992.
Rion Snow, Daniel Jurafsky, and Andrew Ng.
Learning syntactic patterns for automatic hypernym discovery.
Proceedings of NIPS 2004.