G22.2591 - Advanced Natural Language Processing - Spring 2009

Lecture 3

Name Recognition, cont'd

Selecting a model -- HMMs vs feature-based models

One can build quite a good NE tagger using rather simple information ... the likelihood of a given token preceding, being part of, or following a name of some type.  This information is effectively captured in an HMM.  HMMs have the benefit of being easy to train (just counting) and easy to decode.  They are even easy to train incrementally, which can be helpful for semi-supervised and active learning.
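As a concrete illustration (a minimal sketch in Python, not code from the course), the following trains a bigram HMM tagger by counting and decodes with Viterbi search.  The BIO-style tag scheme and the add-one smoothing are assumptions chosen for brevity, not details of any particular system discussed here.

    from collections import defaultdict
    import math

    def train_hmm(tagged_sentences):
        """tagged_sentences: a list of sentences, each a list of (token, tag) pairs."""
        trans = defaultdict(lambda: defaultdict(int))  # prev tag -> tag -> count
        emit = defaultdict(lambda: defaultdict(int))   # tag -> token -> count
        for sent in tagged_sentences:
            prev = "<s>"
            for token, tag in sent:
                trans[prev][tag] += 1
                emit[tag][token.lower()] += 1
                prev = tag
            trans[prev]["</s>"] += 1                   # sentence-final transition
        return trans, emit

    def log_prob(table, key, item, n_outcomes):
        """Add-one smoothed conditional log probability of item given key."""
        count = table[key][item]
        total = sum(table[key].values())
        return math.log((count + 1) / (total + n_outcomes))

    def viterbi(tokens, trans, emit):
        """Return the most likely tag sequence for a list of tokens."""
        tags = list(emit.keys())
        vocab_size = len({w for d in emit.values() for w in d}) + 1
        n_tags = len(tags) + 1                         # +1 for </s>
        # best[i][t] = (log probability, backpointer) of the best path
        # ending at position i with tag t
        best = [{} for _ in tokens]
        for t in tags:
            best[0][t] = (log_prob(trans, "<s>", t, n_tags)
                          + log_prob(emit, t, tokens[0].lower(), vocab_size), None)
        for i in range(1, len(tokens)):
            for t in tags:
                e = log_prob(emit, t, tokens[i].lower(), vocab_size)
                best[i][t] = max(
                    (best[i - 1][p][0] + log_prob(trans, p, t, n_tags) + e, p)
                    for p in tags)
        last = max(tags, key=lambda t: best[-1][t][0]
                   + log_prob(trans, t, "</s>", n_tags))
        seq = [last]
        for i in range(len(tokens) - 1, 0, -1):
            seq.append(best[i][seq[-1]][1])            # follow backpointers
        return list(reversed(seq))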

But a tagger can benefit from more carefully selected evidence.  Such evidence is more readily incorporated into a feature-based model, where each instance (token) is characterized by a set of features.  However, feature-based models (except for Naive Bayes) are more difficult and expensive to train, particularly with large numbers of features.
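To make the contrast concrete, here is a minimal sketch of the kind of per-token feature representation such a model might use.  The specific features shown (case shape, suffix, neighboring words) are illustrative assumptions, not the feature set of any of the systems discussed in this course.

    def token_features(tokens, i):
        """Return a feature dictionary for tokens[i] in its sentence context."""
        w = tokens[i]
        return {
            "word": w.lower(),
            "is_capitalized": w[:1].isupper(),
            "is_all_caps": w.isupper(),
            "has_digit": any(c.isdigit() for c in w),
            "suffix3": w[-3:].lower(),                 # crude morphological cue
            "prev_word": tokens[i - 1].lower() if i > 0 else "<s>",
            "next_word": tokens[i + 1].lower() if i + 1 < len(tokens) else "</s>",
        }

    # Example: the features for "Grishman" in a short sentence.
    sent = ["Prof.", "Grishman", "visited", "IBM"]
    print(token_features(sent, 1))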

Semi-supervised learning of names

The methods described so far require the preparation of a substantial amount of training data.  Can this requirement be reduced?  We will now look at several papers that investigate this issue:

Tomek Strzalkowski; Jin Wang.  A Self-Learning Universal Concept Spotter. COLING 96.

Slide presentation.

Michael Collins; Yoram Singer.  Unsupervised Models for Named Entity Classification.  EMNLP 99.

Presentation by Omer Farukhan Gunes.

Silviu Cucerzan; David Yarowsky.  Language Independent Named Entity Recognition Combining Morphological and Contextual Evidence.  EMNLP 99.

Presentation by Wei Xu.

Locally, we have investigated the problem of tagging technical terms, where capitalization does not help in identification, and shown the benefit of 'competition', where we train taggers for several name types at once:

Roman Yangarber; Winston Lin; Ralph Grishman.  Unsupervised Learning of Generalized Names.  COLING 2002.

All of these efforts aim to produce taggers with moderate performance starting with minimal resources.  More recently, we have demonstrated the benefit of self-training for improving a tagger which has already been trained on a large annotated corpus.  The self-training strategy is quite simple: we tag a large corpus with an HMM name tagger and select those sentences for which the margin -- the difference between the probability of the most likely analysis and that of the second most likely analysis -- is large (a large margin indicates confidence in the most likely analysis).  We add the selected sentences to the tagged corpus and retrain the tagger.  (Heng Ji and Ralph Grishman.  Data Selection in Semi-supervised Learning for Name Tagging.  ACL 06 Workshop on Information Extraction Beyond the Document.)
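In outline, the selection step looks like the following sketch.  The tagger's two-best interface (nbest) and the threshold value are hypothetical stand-ins, and the margin is computed here as a difference of log probabilities, an implementation convenience rather than a detail taken from the paper.

    def select_confident(tagger, corpus, threshold):
        """Keep sentences whose decoding margin exceeds the threshold."""
        selected = []
        for sentence in corpus:
            # Hypothetical interface: the two most likely analyses,
            # each as a (tag_sequence, log_probability) pair.
            analyses = tagger.nbest(sentence, 2)
            if len(analyses) < 2:
                continue                               # no margin to measure
            (best_tags, best_lp), (_, second_lp) = analyses
            margin = best_lp - second_lp               # large margin = high confidence
            if margin > threshold:
                selected.append((sentence, best_tags))
        return selected

    # Self-training loop (schematic; retrain is a placeholder):
    # new_data = select_confident(tagger, large_corpus, threshold=5.0)
    # tagger = retrain(annotated_corpus + new_data)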