G22.2591 - Advanced Natural Language Processing - Spring 2011

Lecture 3

Name Recognition, cont'd

February 8, 2011

Selecting a model -- HMMs vs feature-based models

One can build a rather good NE tagger using quite simple information ... the likelihood of a given token preceding, being part of, or following a name of some type.  This information is effectively captured in an HMM.  HMMs have the benefit of being easy to train (just counting) and decode.  They are even easy to train incrementally, which can be helpful for semi-supervised and active learning.
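
As an illustration of how simple this can be, here is a minimal sketch (assuming a tag set read off the training data, add-one smoothing, and a rough vocabulary-size constant, none of which come from the lecture): counts from a tagged corpus define the model, and Viterbi search decodes new sentences.

    from collections import defaultdict
    import math

    def train_hmm(tagged_sentences):
        # tagged_sentences: list of [(token, tag), ...]; training is just counting
        trans = defaultdict(lambda: defaultdict(int))   # tag-to-tag transition counts
        emit = defaultdict(lambda: defaultdict(int))    # tag-to-token emission counts
        for sent in tagged_sentences:
            prev = "START"
            for token, tag in sent:
                trans[prev][tag] += 1
                emit[tag][token.lower()] += 1
                prev = tag
            trans[prev]["END"] += 1
        return trans, emit

    def log_prob(counts, key, alpha=1.0, size=50000):
        # add-alpha smoothing; 50000 is a placeholder vocabulary-size constant
        total = sum(counts.values())
        return math.log((counts.get(key, 0) + alpha) / (total + alpha * size))

    def viterbi(tokens, trans, emit):
        tags = list(emit.keys())
        # best[i][t] = (log score of best path ending in tag t at position i, previous tag)
        best = [{t: (log_prob(trans["START"], t) + log_prob(emit[t], tokens[0].lower()), None)
                 for t in tags}]
        for i, tok in enumerate(tokens[1:], start=1):
            column = {}
            for t in tags:
                score, prev = max((best[i - 1][p][0] + log_prob(trans[p], t)
                                   + log_prob(emit[t], tok.lower()), p) for p in tags)
                column[t] = (score, prev)
            best.append(column)
        # pick the best final tag (including the transition to END) and follow back-pointers
        tag = max(tags, key=lambda t: best[-1][t][0] + log_prob(trans[t], "END"))
        path = [tag]
        for i in range(len(tokens) - 1, 0, -1):
            tag = best[i][tag][1]
            path.append(tag)
        return list(reversed(path))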

But a tagger can benefit from more carefully selected evidence.  Such evidence is more readily incorporated into a feature-based model, where each instance (token) is characterized by a set of features.  However, feature-based models (except for Naive Bayes) are more difficult and expensive to train, particularly with large numbers of features.
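For concreteness, the sketch below shows the kind of per-token feature set a feature-based tagger might compute; the particular features and the name dictionary are illustrative assumptions, not the feature set of any specific system.

    def token_features(tokens, i, first_name_dict=frozenset()):
        # one feature dictionary per token; the classifier (MaxEnt, CRF, ...) is separate
        tok = tokens[i]
        return {
            "word=" + tok.lower(): 1,
            "is_capitalized": int(tok[:1].isupper()),
            "is_all_caps": int(tok.isupper()),
            "has_digit": int(any(c.isdigit() for c in tok)),
            "suffix3=" + tok[-3:].lower(): 1,
            "prev_word=" + (tokens[i - 1].lower() if i > 0 else "<s>"): 1,
            "next_word=" + (tokens[i + 1].lower() if i < len(tokens) - 1 else "</s>"): 1,
            "in_first_name_dict": int(tok.lower() in first_name_dict),
        }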

Semi-supervised learning of names

The methods described so far require the preparation of a substantial amount of training data.  Can this requirement be reduced?  Several papers from the late 1990's showed how this was possible through bootstrapping (semi-supervised learning):

Tomek Strzalkowski; Jin Wang.  A Self-Learning Universal Concept Spotter. COLING 96.

One of the earliest efforts at bootstrapping name categories from a small set of seeds, using features based on words and bigrams preceding, within, and following the name.  Demonstrated good results at finding organization names and moderately good results on products.

Michael Collins; Yoram Singer.  Unsupervised Models for Named Entity Classification.  EMNLP 99.

(presentation by Michal Novemsky)

Silviu Cucerzan; David Yarowsky.  Language Independent Named Entity Recognition Combining Morphological and Contextual Evidence.  EMNLP 99.

Ellen Riloff; Rosie Jones.  Learning Dictionaries for Information Extraction by Multi-Level Bootstrapping.  AAAI 99.

(presentation by Sunandan Chakraborty)

All of these efforts aim to produce taggers with moderate performance starting with minimal resources.  We showed that self-training can also be used to improve a tagger which has already been trained on a large annotated corpus.  The self-training strategy we used is quite simple: we tag a large corpus with an HMM name tagger and select those sentences for which the margin -- the difference between the probability of the most likely analysis and that of the second most likely analysis -- is large (a large margin indicates confidence in the most likely analysis).  We add the selected sentences to the tagged corpus and retrain the tagger.  (Heng Ji and Ralph Grishman.  Data selection in semi-supervised learning for name tagging.  ACL 06 Workshop on Information Extraction Beyond the Document.)
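
A minimal sketch of one round of this margin-based selection appears below.  The tagger object's n_best and train methods and the probability threshold are assumptions made for illustration; the paper's actual interfaces and selection thresholds may differ.

    def self_training_round(tagger, labeled_sentences, unlabeled_sentences, margin=0.2):
        # tag the unlabeled corpus, keep only confidently tagged sentences, and retrain
        selected = []
        for tokens in unlabeled_sentences:
            hypotheses = tagger.n_best(tokens, 2)          # [(tag sequence, probability), ...]
            if len(hypotheses) < 2:
                continue
            (best_tags, p1), (_, p2) = hypotheses[0], hypotheses[1]
            if p1 - p2 >= margin:                          # large margin = confident analysis
                selected.append(list(zip(tokens, best_tags)))
        tagger.train(labeled_sentences + selected)         # retrain on original + selected data
        return tagger, selected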

Self-training can also be used to update an old name tagger.  If a tagger is trained on an old annotated corpus, it will be missing many of the names which appear in contemporary text.  The tagger can be effectively updated through self-training on contemporary unannotated texts; this spares the effort of manually tagging new text.  (Cristina Mota and Ralph Grishman.  Updating a name tagger using contemporary unlabeled data.  ACL 2009.)

Problems of self-training

Several problems arise with self-training ("bootstrapping").  One problem is semantic drift ... a procedure which is given a seed consisting of one type of name gradually starts assigning that label to names of another type.  This is particularly likely if the two sets of names intersect, such as women's first names and names of flowers (Rose, Violet, ...).  More generally, these bootstrapping methods often lack a good stopping criterion, and so tend to label too many examples.

This problem is less severe if we assign labels to all the examples and learn all the types concurrently.  For example, capitalization is a fairly good clue to identifying names in English, and most names in news text are people, organizations, or locations.  So concurrent self-training on these three types can be quite effective.

Locally, we have investigated the problem of tagging technical terms, where capitalization does not help in identification.  We have shown the benefit of 'competition', where we identify and add additional name classes (and train on all the classes concurrently) in order to improve the training of the original classes.  (Roman Yangarber; Winston Lin; Ralph Grishman.  Unsupervised Learning of Generalized Names.  COLING 2002.)  A similar approach was taken in Weighted Mutual Exclusion Bootstrapping (WMEB) by McIntosh and Curran.  The disadvantage of this approach is that it requires manual analysis and creation of these competing classes.

The manual creation of competing classes can be avoided by using unsupervised term clustering (based on all the contexts in which a term appears) to create these negative categories (Tara McIntosh.  Unsupervised Discovery of Negative Categories in Lexicon Bootstrapping.  EMNLP 2010.)  Unsupervised clustering has also proven effective for limiting drift and halting bootstrapping in relation discovery (Ang Sun and Ralph Grishman.  Semi-supervised Semantic Pattern Discovery with Guidance from Unsupervised Pattern Clusters.  COLING 2010.)
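
As a rough sketch of the clustering idea (not McIntosh's actual system): represent each candidate term by the words in its contexts, cluster the terms with an off-the-shelf clusterer, and treat clusters containing no seed terms as candidate negative categories.

    from sklearn.cluster import KMeans
    from sklearn.feature_extraction import DictVectorizer

    def cluster_terms(term_contexts, n_clusters=10):
        # term_contexts: {term: {context word: count}} gathered over all occurrences of the term
        terms = list(term_contexts)
        X = DictVectorizer().fit_transform([term_contexts[t] for t in terms])
        labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(X)
        clusters = {}
        for term, label in zip(terms, labels):
            clusters.setdefault(label, []).append(term)
        return clusters

    # clusters containing none of the seed terms are candidates for negative (competing) categories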