G22.2591 - Advanced Natural Language Processing - Spring 2004

Lecture 5

Name Recognition

Why name recognition?

Name recognition was introduced as a separate task in Message Understanding Conference - 6 (see also the paper by Grishman and Sundheim).  Through earlier IE evaluations, system developers came to recognize that name recognition and classification was an important part of text processing, even if it was not recognized as basic in linguistic study.  Making it a separate task encouraged research to improve this technology, and emphasized its value for a range of applications (document indexing, and later question answering).

For MUC-6, there were three name categories -- people, organizations, and locations.  Date, time, percentage, and currency expressions were also included under name recognition.  Some evaluations since then have added categories such as artifact, facility, and weapon.  In fact, some systems for open-domain question answering have added a very large number of categories (100+).  However, almost all studies have been done with the original set of three name categories.  Similar evaluations have been done for quite a few foreign languages;  the CoNLL-2002 shared task covered Dutch and Spanish;  the CoNLL-2003 shared task covered English and German.

How to measure scores?

Name recognition is scored by recall, precision, and F-measure (the harmonic mean of recall and precision).  The simplest rule is to require perfect match -- you only get credit if you get the type, start, and end of a name correct (this is the metric used, for example, in the Japanese IREX evaluation).  The MUC evaluations used a more generous scoring, with partial credit for identifying a name of any sort, for getting its type correct, and for getting its extent correct.
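The exact-match metric can be sketched in a few lines (a minimal scorer; the function and the key/response data below are hypothetical illustrations, not the actual MUC or IREX scoring software):

```python
def score(key, response):
    """Exact-match scoring: a name counts as correct only if its
    type, start, and end all match the key (as in IREX)."""
    key, response = set(key), set(response)
    correct = len(key & response)
    recall = correct / len(key) if key else 0.0
    precision = correct / len(response) if response else 0.0
    f = (2 * recall * precision / (recall + precision)
         if recall + precision else 0.0)
    return recall, precision, f

# Names represented as (type, start, end) token spans (toy data).
key = [("PER", 0, 2), ("ORG", 5, 7), ("LOC", 10, 11)]
response = [("PER", 0, 2), ("ORG", 5, 6), ("LOC", 10, 11)]
r, p, f = score(key, response)   # the ORG span's end is wrong: recall = precision = 2/3
```

Note that under this strict metric the response gets no credit at all for the partially-correct ORG name; the MUC scoring would give it partial credit.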

How well do people do?

In a small study for MUC-6 (17 articles), Sundheim reported <5% interannotator (key-to-key) error.  Agreement is probably enhanced in languages where names are capitalized and for text where the annotator is familiar with most of the names.  Without capitalization, it can be hard to tell unfamiliar organization names from common noun phrases.

Hand-coded rules

For a specific domain, it is possible to do very well with hand-coded rules and dictionaries.  On the MUC-6 evaluation (a very favorable situation, where the source and general topic of the test data was known in advance), the SRA system, based on hand-coded rules, got F=96.4.  Writing rules by hand, however, requires some skill and considerable time.

The hand-coded rules take advantage of both internal evidence (capitalization, and name-internal words such as corporate designators or personal titles) and external evidence (the words immediately before and after the name).  Note that sometimes the type decision is based upon left context, and sometimes upon right context, so it would be difficult for taggers which operate deterministically from left to right or from right to left to perform optimally.
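As an illustration of such rules (toy patterns, not the SRA system's actual rules), note how the first rule uses left context and the second uses right context:

```python
import re

# Left context: a title like "Mr." signals a person name.
PERSON = re.compile(r"\b(?:Mr\.|Ms\.|Dr\.)\s+((?:[A-Z][a-z]+\s?)+)")
# Right context: a corporate designator like "Inc." signals an organization.
ORG = re.compile(r"\b((?:[A-Z][a-z]+\s)+)(?:Inc\.|Corp\.|Ltd\.)")

text = "Mr. John Smith joined Acme Widgets Inc. last year."
persons = [m.group(1).strip() for m in PERSON.finditer(text)]  # ['John Smith']
orgs = [m.group(1).strip() for m in ORG.finditer(text)]        # ['Acme Widgets']
```

A real rule system would have many such patterns, plus dictionaries of known names, and some way of resolving conflicts among overlapping matches.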

Supervised training

Like POS tagging and chunking, named entity recognition has been tried with very many different machine learning methods.  More than for the syntactic tasks, performance on NE recognition depends on the variety of resources which are brought to bear.  The CoNLL evaluations are relatively 'pure' ... the systems basically just learn from the provided training corpus.  On the other hand, 'real' systems make use of as many name lists and as much training data as are available.  This has a substantial effect on performance.  In addition, performance is strongly affected by the domain of the training and test data.  These two effects can make it difficult to compare results across different evaluations.

As with chunking, NE tagging can be recast as a token classification task.  We will have an "O" tag (token is not part of a named entity), and "B-X" and "I-X" tags for each name type X.
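For example (a sketch of the encoding on a made-up sentence; the decoding helper is illustrative):

```python
# BIO encoding of a toy sentence: B-X opens a name of type X,
# I-X continues it, O marks tokens outside any name.
tokens = ["Fred", "Smith", "of", "New", "York", "visited", "IBM"]
tags   = ["B-PER", "I-PER", "O", "B-LOC", "I-LOC", "O", "B-ORG"]

def decode(tags):
    """Recover (type, start, end) name spans from a BIO tag sequence."""
    names, start = [], None
    for i, tag in enumerate(tags):
        if start is not None and not tag.startswith("I-"):
            names.append((tags[start][2:], start, i))   # close the open name
            start = None
        if tag.startswith("B-"):
            start = i                                   # open a new name
    if start is not None:
        names.append((tags[start][2:], start, len(tags)))
    return names

decode(tags)  # [('PER', 0, 2), ('LOC', 3, 5), ('ORG', 6, 7)]
```

A sequence classifier trained to predict these tags, token by token, thus solves the name recognition task.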

Markov Models for Name Recognition

The NYU Jet system uses a straightforward HMM for named entity tagging.

The simplest HMM has a single state for each name type, and a single state for not-a-name (NaN).  However, typically the first and last word of a name have different distributions, and the words immediately before or after a word often give a good indication of the name type (for example, 'Mr.' before a name is a clear indication of a person, while 'near' before a name probably indicates a location).  Therefore, we were able to create a more accurate model by having separate states for the words immediately before and after a name, and for the first and last tokens of a name.  This added about 2 points to recall (89 to 91) and 4 points to precision (82 to 86).
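Decoding with such a model uses the Viterbi algorithm; here is a minimal log-space sketch over the simplest state set, with a toy two-state model (illustrative numbers, not the Jet system's actual states or parameters):

```python
import math

def viterbi(words, states, start_p, trans_p, emit_p):
    """Most likely state sequence for the word sequence (log-space).
    Unknown words get a small floor probability."""
    def emit(s, w):
        return math.log(emit_p[s].get(w, 1e-6))
    # For each state, keep (score, best path ending in that state).
    col = {s: (math.log(start_p[s]) + emit(s, words[0]), [s]) for s in states}
    for w in words[1:]:
        new = {}
        for s in states:
            score, path = max(
                (col[p][0] + math.log(trans_p[p][s]) + emit(s, w), col[p][1])
                for p in states)
            new[s] = (score, path + [s])
        col = new
    return max(col.values(), key=lambda v: v[0])[1]

# Toy model: one person state and a not-a-name (NaN) state.
states = ["PER", "NaN"]
start_p = {"PER": 0.2, "NaN": 0.8}
trans_p = {"PER": {"PER": 0.6, "NaN": 0.4},
           "NaN": {"PER": 0.2, "NaN": 0.8}}
emit_p = {"PER": {"Smith": 0.5, "saw": 0.01},
          "NaN": {"Smith": 0.001, "saw": 0.3}}
viterbi(["Smith", "saw"], states, start_p, trans_p, emit_p)  # ['PER', 'NaN']
```

The richer model described above simply enlarges the state set (separate states for name-initial, name-final, pre-name, and post-name tokens); the decoding algorithm is unchanged.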

BBN's Nymble name tagger (Daniel M. Bikel; Scott Miller; Richard Schwartz; Ralph Weischedel.  Nymble: a High-Performance Learning Name-finder.  Proc. ANLP 97.) is perhaps the best-known name tagger.

They used several techniques to enhance performance over a basic HMM.  Most notably, they used bigram probabilities:  they differentiated between the probability of generating the first word of a name and subsequent words of a name.  The probability of generating the first word was made dependent on the prior state;  the probability of generating subsequent words was made dependent on the prior word.  The probability of a state transition was made dependent on the prior word.  This had to be combined with smoothing to handle the case of unseen bigrams.
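The smoothing idea can be sketched as linear interpolation between the bigram and unigram emission estimates (an illustrative scheme with a fixed weight; Nymble's actual back-off computes its weights from the data and differs in detail):

```python
from collections import defaultdict

class BigramEmitter:
    """Word-emission model P(word | prior word), interpolated with the
    unigram estimate P(word) to handle unseen bigrams."""
    def __init__(self, lam=0.8):
        self.lam = lam                    # weight on the bigram estimate
        self.bigram = defaultdict(lambda: defaultdict(int))
        self.unigram = defaultdict(int)
        self.total = 0

    def observe(self, prev_word, word):
        self.bigram[prev_word][word] += 1
        self.unigram[word] += 1
        self.total += 1

    def prob(self, prev_word, word):
        uni = self.unigram[word] / self.total if self.total else 0.0
        ctx = self.bigram[prev_word]
        bi = ctx[word] / sum(ctx.values()) if ctx else 0.0
        return self.lam * bi + (1 - self.lam) * uni

e = BigramEmitter()
e.observe("Mr.", "Smith"); e.observe("Mr.", "Jones"); e.observe("the", "company")
e.prob("Mr.", "Smith")   # 0.8 * (1/2) + 0.2 * (1/3), about 0.467
```

In the full model there is one such emission distribution per state, and a similar interpolation is needed for the word-conditioned transition probabilities.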

HMMs are generative models, and we noted before some difficulties with such models.  A generative model produces a joint probability over observation and label sequences;  typically we compute P(new state | prior state) and P(current word | current state).  It is difficult to represent long-range or multiple interacting features in such a formalism.  Instead, researchers have used functions which compute the state probability given the input -- a formalism which allows for a richer set of features.

Sekine et al. (Satoshi Sekine; Ralph Grishman; Hiroyuki Shinnou.  A Decision Tree Method for Finding and Classifying Names in Japanese Texts.  Sixth WVLC, 1998) used a decision tree method for Japanese named entity recognition.  The decision tree yielded information on the probability of the various tags.  A Viterbi algorithm then computed the most likely tagging of the entire sentence.

Borthwick et al. (Andrew Borthwick; John Sterling; Eugene Agichtein; Ralph Grishman.  Exploiting Diverse Knowledge Sources via Maximum Entropy in Named Entity Recognition.  Sixth WVLC, 1998) used a maximum entropy method to compute the tags.  Again, a Viterbi decoder was used to select the best tagging.  By itself the method did fairly well (92.2 F on the dry-run data).  More interestingly, it could be combined with the patterns of the NYU hand-coded-rule system, with each rule a separate feature.  The rule-based system by itself also got 92.2 F;  the combined system got 95.6 F, roughly on a par with the best commercial system.

McCallum (Maximum Entropy Markov Models for Information Extraction and Segmentation.  Andrew McCallum, Dayne Freitag and Fernando Pereira.  ICML-2000) describes general Maximum Entropy Markov Models (MEMMs) as computing P(current state | input, prior state) using maximum entropy methods.  The Ratnaparkhi POS tagger is close to this model.  McCallum notes that the Borthwick model is somewhat weaker in that the current state probability is conditioned only on the input, not on the prior state, and that may be why it did not do quite as well as the Nymble HMM model.

Discriminative training for HMMs

Another concern we had with HMMs was that the parameters learned may not be the optimal ones for the ultimate classification task.  As an alternative, we considered discriminative methods ... methods which are trained to make the discrimination between classes directly.  We considered one such approach, SVMs, last week.  Collins (Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms, EMNLP 02; Collins and Duffy, ACL 2002) has described a somewhat different approach.  The basic idea was to use error-driven training.  Collins reported a 15% reduction in error rate on a named entity tagging task by using this approach.
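The error-driven idea can be sketched as a structured perceptron for sequence tagging: decode with the current weights, and when the prediction is wrong, add the gold sequence's feature counts and subtract the predicted sequence's (greedy decoding and toy features for brevity; Collins used Viterbi decoding and a richer feature set):

```python
from collections import defaultdict

def features(words, tags):
    """Global feature counts for a tagged sentence: word/tag pairs
    and tag bigrams (an illustrative feature set)."""
    feats = defaultdict(int)
    prev = "START"
    for w, t in zip(words, tags):
        feats[("word-tag", w, t)] += 1
        feats[("tag-bigram", prev, t)] += 1
        prev = t
    return feats

def decode(words, weights, tagset):
    """Greedy left-to-right decoding under the current weights."""
    tags, prev = [], "START"
    for w in words:
        best = max(tagset, key=lambda t: weights[("word-tag", w, t)]
                                         + weights[("tag-bigram", prev, t)])
        tags.append(best)
        prev = best
    return tags

def perceptron_train(data, tagset, epochs=5):
    weights = defaultdict(float)
    for _ in range(epochs):
        for words, gold in data:
            pred = decode(words, weights, tagset)
            if pred != gold:                        # error-driven update
                for f, c in features(words, gold).items():
                    weights[f] += c
                for f, c in features(words, pred).items():
                    weights[f] -= c
    return weights

data = [(["Mr.", "Smith"], ["O", "B-PER"]), (["the", "company"], ["O", "O"])]
weights = perceptron_train(data, ["O", "B-PER"])
decode(["Mr.", "Smith"], weights, ["O", "B-PER"])   # ['O', 'B-PER']
```

Since the update is driven directly by tagging errors, training optimizes the discrimination itself rather than a generative model of the data.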

Looking ahead to next week ... unsupervised learning of names

Tomek Strzalkowski; Jin Wang.  A Self-Learning Universal Concept Spotter. COLING 96.

Michael Collins; Yoram Singer.  Unsupervised Models for Named Entity Classification.  EMNLP 99.

Silviu Cucerzan; David Yarowsky.  Language Independent Named Entity Recognition Combining Morphological and Contextual Evidence.  EMNLP 99.