G22.2591 - Advanced Natural Language Processing - Spring 2009

Lecture 4

Name Recognition, cont'd

Active learning for Name Tagging

Basic idea

In place of sequential annotation, we employ selective sampling, where we select examples to be annotated. The hope is that by selecting informative examples (given what we know so far), we can converge to an accurate model much faster (i.e., with fewer annotated data points) than if we annotated data sequentially.

The speed-up from active learning is, for a given level of performance, the ratio of the number of examples that must be annotated sequentially to the number that must be annotated using active learning.
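For example (with made-up numbers), if sequential annotation requires 10,000 labeled sentences to reach a given F-measure and active learning reaches the same F-measure after 4,000, the speed-up is 10,000 / 4,000 = 2.5.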

Active learning can produce large gains for NLP tasks because of the Zipfian distribution of many phenomena in NLP ... a small number of phenomena (words, structures) occur very frequently and a large number of phenomena occur rarely (the "tail" of the distribution). With sequential annotation we keep labeling the same common phenomena again and again, which may be a waste of time.

How do we decide what are the most informative examples? There are two standard approaches:

uncertainty-based sampling
use some estimate of the uncertainty regarding the label for a data point; for example, for a probabilistic classifier making a binary choice, how close the probability is to 0.5. If there are many possible labelings (for example, all the name labelings of a sentence), the margin (the difference between the probabilities of the two top-ranked hypotheses) can be used (Scheffer, Decomain, and Wrobel 2001).
committee-based sampling
create two or more classifiers and prefer data points on which they disagree; for example, use two classifiers trained on different subsets of the features. If the classifiers produce a probability distribution over labels, we can use a measure of the degree to which these two distributions differ (for example, KL-divergence). A small sketch of both selection criteria follows below.
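As a concrete illustration, here is a minimal sketch (in Python with NumPy; the function names and toy probabilities are purely illustrative) of both criteria: margin-based uncertainty for a single classifier, and KL-divergence disagreement for a two-member committee.

    import numpy as np

    def margin(probs):
        # Margin of the top two label hypotheses; a smaller margin means
        # the classifier is less certain about this example.
        top_two = np.sort(np.asarray(probs))[-2:]
        return top_two[1] - top_two[0]

    def kl_divergence(p, q, eps=1e-12):
        # KL(p || q) between the label distributions of two committee members.
        p = np.asarray(p) + eps
        q = np.asarray(q) + eps
        return float(np.sum(p * np.log(p / q)))

    # Uncertainty-based sampling: query the example with the smallest margin.
    pool_probs = [np.array([0.55, 0.40, 0.05]),   # margin 0.15 -- informative
                  np.array([0.90, 0.08, 0.02])]   # margin 0.82 -- already confident
    query = min(range(len(pool_probs)), key=lambda i: margin(pool_probs[i]))

    # Committee-based sampling: query the example on which the two classifiers'
    # label distributions diverge most.
    probs_a = [np.array([0.6, 0.4]), np.array([0.5, 0.5])]
    probs_b = [np.array([0.2, 0.8]), np.array([0.5, 0.5])]
    query = max(range(len(probs_a)), key=lambda i: kl_divergence(probs_a[i], probs_b[i]))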

Pitfalls

Selecting outliers

With sequential labeling, common features will be heavily represented in the training set. With uncertainty-based sampling, the active learner may prefer data points whose features, or combinations of features, have never been seen before, and even the most common features may appear only once in the training set. To provide better balance, some active learners include a representativeness weight, so that data points involving common features are preferred to those involving rare features.
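One simple way to realize such a representativeness weight (a sketch only; the cosine-similarity density and the beta exponent are assumptions, not a specific published recipe) is to multiply each candidate's uncertainty by its average similarity to the rest of the unlabeled pool:

    import numpy as np

    def select_representative(uncertainty, X_pool, beta=1.0):
        # uncertainty: one informativeness score per row of X_pool.
        # Score = uncertainty * (average cosine similarity to the rest of the pool) ** beta,
        # so an informative but typical point beats an informative outlier.
        norms = np.linalg.norm(X_pool, axis=1, keepdims=True)
        unit = X_pool / np.maximum(norms, 1e-12)
        sim = unit @ unit.T                                    # pairwise cosine similarities
        density = (sim.sum(axis=1) - 1.0) / (len(X_pool) - 1)  # mean similarity to the others
        return int(np.argmax(uncertainty * density ** beta))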

Compute time

In principle, we want to select an example, label it, update the model, and then select the next example. But updating the model and re-evaluating all the unlabeled points may be slow. A natural solution is to select a batch of data points at each iteration. Unfortunately, if the same criteria and model are used, the data points selected in the batch are likely to be very similar, defeating the benefits of active learning. So systems which do batch active learning need to add a diversity condition -- requiring that the data points in the batch not be too similar.
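A minimal sketch of such a diversity condition (the greedy similarity threshold is just one plausible choice, not the only way to do it):

    import numpy as np

    def diverse_batch(ranked_candidates, X_pool, batch_size, max_sim=0.9):
        # ranked_candidates: pool indices sorted most-informative first.
        # Greedily add candidates, skipping any point whose cosine similarity
        # to a point already in the batch exceeds max_sim.
        norms = np.linalg.norm(X_pool, axis=1, keepdims=True)
        unit = X_pool / np.maximum(norms, 1e-12)
        batch = []
        for i in ranked_candidates:
            if all(float(unit[i] @ unit[j]) < max_sim for j in batch):
                batch.append(i)
            if len(batch) == batch_size:
                break
        return batch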

Annotation speed and the unit of annotation

Speed-up is typically measured in terms of the number of items annotated. However, this may not be a realistic measure of true speed-up (reduction in labeling time) in some cases. Suppose you are training a named-entity classifier, and at each iteration you ask the user about the class of a single named entity. If you ask the user about a single name in the middle of a document, the user has to read at least the containing sentence and perhaps several prior sentences. The reading burden per name annotated is considerably lower for sequential annotation, where the surrounding context has already been read. This suggests that the optimal unit of annotation may be at least an entire sentence.

Simulated active learning

Real active learning implies a person in the loop. That is a very costly approach if one wants to conduct multiple experiments with different active learning strategies. Consequently, most active learning experiments use simulated active learning, in which the training corpus has actually been labeled in advance but the labels are revealed to the learner only on request.
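In code, the simulation is just a loop that trains on the labels revealed so far and "uncovers" one more label per round (a sketch; train and select stand for any learner and any selection criterion, and the seed-set size is arbitrary):

    def simulated_active_learning(X, y_hidden, train, select, n_rounds, seed_size=10):
        # y_hidden holds the full gold annotation, but a label is copied into the
        # training set only when the selector asks for that example.
        labeled = list(range(seed_size))               # small seed set to get started
        unlabeled = list(range(seed_size, len(X)))
        for _ in range(n_rounds):
            model = train([X[i] for i in labeled], [y_hidden[i] for i in labeled])
            pick = select(model, [X[i] for i in unlabeled])   # position within unlabeled
            labeled.append(unlabeled.pop(pick))               # label "revealed" on request
        return train([X[i] for i in labeled], [y_hidden[i] for i in labeled]), labeled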

Active learning for named entities

Active learning has been applied to many NLP tasks ... POS tagging, parsing, machine translation. There have been a few efforts to apply it to named entity tagging:

Markus Becker, Ben Hachey, Beatrice Alex, and Claire Grover. Optimising Selective Sampling for Bootstrapping Named Entity Recognition. Proceedings of the ICML-2005 Workshop on Learning with Multiple Views.

They developed their system using a large annotated corpus (GENIA) and then tried it on a new subdomain (astronomy) with limited test data. They used a committee-based method involving two Markov Model name taggers with different feature sets, and selected sentences based on the difference between the taggers' distributions. They report typical reductions of 40% in the number of labels required to reach the same level of performance.

Shen, D., Zhang, J., Su, J., Zhou, G., and Tan, C. Multi-criteria-based active learning for named entity recognition. ACL 2004.

They use an SVM as a binary discriminative classifier to decide whether a token is a member of a given name class (tagging just one type of name at a time). The method incorporates all three measures (informativeness, representativeness, and diversity), using distance to the hyperplane to measure informativeness. Applied to both the MUC-6 and GENIA tasks, they find typically a 60% reduction in the number of labels required for the same level of performance.
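A sketch of the informativeness measure, using scikit-learn's LinearSVC as a stand-in (the toy feature vectors and labels are purely illustrative; Shen et al. used a full SVM over much richer token features):

    import numpy as np
    from sklearn.svm import LinearSVC

    def distance_to_hyperplane(svm, X_pool):
        # Geometric distance |w.x + b| / ||w||; the tokens closest to the
        # hyperplane are the most informative ones to ask about.
        return np.abs(svm.decision_function(X_pool)) / np.linalg.norm(svm.coef_)

    # Toy usage: one binary classifier for one name class.
    X_labeled = np.array([[0.0, 1.0], [1.0, 0.0], [0.9, 0.1], [0.1, 0.9]])
    y_labeled = np.array([1, 0, 0, 1])        # 1 = token belongs to the name class
    svm = LinearSVC().fit(X_labeled, y_labeled)

    X_pool = np.array([[0.5, 0.5], [0.95, 0.05]])
    query = int(np.argmin(distance_to_hyperplane(svm, X_pool)))  # nearest to the boundary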

Looking ahead -- lexico-syntactic patterns

Marti Hearst. Automatic acquisition of hyponyms from large text corpora. COLING 1992.

Rion Snow, Daniel Jurafsky, and Andrew Ng. Learning syntactic patterns for automatic hypernym discovery. NIPS 2004.