G22.2591 - Advanced Topics in Natural Language Processing - Spring 2011

Lecture 2

February 1, 2011

Name Recognition and Classification

Why name recognition?

Name recognition was introduced as a separate task in Message Understanding Conference - 6 (see also Grishman and Sundheim COLING 1996).  Through earlier IE evaluations, system developers came to recognize that name recognition and classification was an important part of text processing, even if it was not recognized as basic in linguistic study.  Making it a separate task encouraged research to improve this technology, and emphasized its value for a range of applications (document indexing, and later question answering). Nadeau and Sekine 2008 provide a good recent survey of work on this task.

For MUC-6, there were three name categories -- people, organizations, and locations.  Date, time, percentage, and currency expressions were also included under name recognition.  Some evaluations since then have added individual categories ... artifact, facility, weapon, ... while others have developed definitions and taggers for larger (both broader and more finely grained) sets of categories, with up to 200 categories (Sekine and Nobata LREC 2004).  However, almost all studies have been done with the original set of three name categories.  Similar evaluations have been done for quite a few other languages;  the CoNLL-2002 shared task covered Dutch and Spanish;  the CoNLL-2003 shared task covered English and German.

These categories were developed for news reports, and provide good coverage for that subject matter.  Technical texts, however, require very different sets of name categories.  There has been particular interest in this decade in molecular biology and genomics texts.  The GENIA corpus is annotated with 36 categories, including proteins, DNA, RNA, cell lines and cell types, and there have been multi-site evaluations for tagging such text.

How to measure scores?

System performance is measured by comparison with a hand-prepared key.  The two basic measures are recall (= number of correct names / number of names in key) and precision (= number of correct names / number of names in system response).  As a single metric to rank systems we normally use F-measure, the harmonic mean of the two (= 2/(1/recall + 1/precision)).  The simplest rule is to require perfect match -- you only get credit if you get the type, start, and end of a name correct (this is the metric used, for example, in the Japanese IREX evaluation);  the MUC evaluations used a more generous scoring, with partial credit for identifying a name of any sort, for getting its type correct, and for getting its extent correct.
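As a concrete illustration, the three scores can be computed as follows (a minimal sketch; the three counts would come from aligning the system response against the key):

```python
def score_names(num_correct, num_in_key, num_in_response):
    """Recall, precision, and F-measure for exact-match name scoring."""
    recall = num_correct / num_in_key
    precision = num_correct / num_in_response
    # F-measure is the harmonic mean of recall and precision
    f = 2 * recall * precision / (recall + precision)
    return recall, precision, f
```

For example, 80 correct names against a 100-name key, with 90 names in the system response, gives recall 0.80, precision 0.89, and F = 0.84.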

How well do people do?

In a small study for MUC-6 (17 articles), Sundheim reported <5% interannotator (key-to-key) error.  Agreement is probably enhanced in languages where names are capitalized and for text where the annotator is familiar with most of the names.  Without capitalization, it can be hard to tell unfamiliar organization names from common noun phrases.

Hand-coded rules

For a specific domain, it is possible to do very well with hand-coded rules and dictionaries.  On the MUC-6 evaluation (a very favorable situation, where the source and general topic of the test data was known in advance), the SRA system, based on hand-coded rules, got F=96.4.  Writing rules by hand, however, requires some skill and considerable time.

The hand-coded rules take advantage of both the internal structure of a name and the context in which it appears.
Note that sometimes the type decision is based upon left context, and sometimes upon right context, so it would be difficult for taggers which operate deterministically from left to right or from right to left to perform optimally.

Supervised training

Like POS tagging and chunking, named entity recognition has been tried with very many different machine learning methods.  More than the syntactic tasks, performance on NE recognition depends on the variety of resources which are brought to bear.  CoNLL evaluations are relatively 'pure' ... the systems basically just learn from the provided training corpus.  On the other hand, 'real' systems make use of as many lists and as much training data as available.  This has a substantial effect on performance.  In addition, performance is strongly affected by the domain of the training and test data.  These two effects can make it difficult to compare results across different evaluations.

As with chunking, NE tagging can be recast as a token classification task.  We will have an "O" tag (token is not part of a named entity), and "B-X" and "I-X" tags for each name type X.
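The encoding step can be sketched as follows (token-index spans with exclusive ends are an assumed convention here, not part of any particular evaluation's format):

```python
def spans_to_bio(tokens, spans):
    """Encode name spans as per-token B-X / I-X / O tags.

    `spans` is a list of (start, end, type) tuples over token indices,
    with `end` exclusive.
    """
    tags = ["O"] * len(tokens)
    for start, end, name_type in spans:
        tags[start] = "B-" + name_type          # first token of the name
        for i in range(start + 1, end):
            tags[i] = "I-" + name_type          # continuation tokens
    return tags
```

For example, tagging "Mr. Ted Briggs visited New York" with a person span over tokens 1-2 and a location span over tokens 4-5 yields O, B-PER, I-PER, O, B-LOC, I-LOC.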

Markov Models for Name Recognition

Chapter 6 of J&M 2nd Edition provides a description of HMMs, Maximum Entropy, and Maximum Entropy Markov Models.

One of the simplest statistical, corpus-trained sequential models is the Hidden Markov Model.  HMMs are based on a generative model of a sentence:  we generate the words one at a time, in two steps, first selecting the next part of speech based on the parts of speech of the previous one or two words, and then selecting the word given the part of speech.  In a bigram model, the probability of generating word wi with tag ti is then

P ( ti | ti-1 ) P ( wi | ti )

Based on this model, we seek the most likely tag sequence for a sentence

argmax(T) product(i) P ( ti | ti-1 ) P ( wi | ti )

The probabilities can be easily estimated from a tagged corpus, using Maximum Likelihood Estimates.  The most likely tag sequence can then be determined using an HMM and the Viterbi decoder.
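A minimal Viterbi decoder for this bigram model might look like the following sketch (the log probabilities are assumed to have been precomputed by MLE from a tagged corpus; the start symbol and dictionary layout are illustrative choices):

```python
def viterbi(words, tags, log_trans, log_emit, start="<s>"):
    """Find argmax over tag sequences of sum_i log P(ti|ti-1) + log P(wi|ti).

    log_trans[(prev_tag, tag)] and log_emit[(tag, word)] hold log
    probabilities; missing entries are treated as log(0).
    """
    NEG_INF = float("-inf")
    # best[i][t] = score of the best tag sequence for words[:i+1] ending in t
    best = [{t: log_trans.get((start, t), NEG_INF) +
                log_emit.get((t, words[0]), NEG_INF) for t in tags}]
    back = []
    for w in words[1:]:
        scores, pointers = {}, {}
        for t in tags:
            s, p = max((best[-1][q] + log_trans.get((q, t), NEG_INF), q)
                       for q in tags)
            scores[t] = s + log_emit.get((t, w), NEG_INF)
            pointers[t] = p
        best.append(scores)
        back.append(pointers)
    # follow back-pointers from the best final tag
    last = max(tags, key=lambda t: best[-1][t])
    seq = [last]
    for pointers in reversed(back):
        seq.append(pointers[seq[-1]])
    return list(reversed(seq))
```

The same decoder works for POS tags or for BIO name tags; only the tag set and the trained tables change.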

The NYU Jet system uses a straightforward HMM for named entity tagging.

The simplest HMM has a single state for each name type, and a single state for not-a-name (NaN).  However, typically the first and last word of a name have different distributions, and the words immediately before or after a word often give a good indication of the name type (for example, 'Mr.' before a name is a clear indication of a person, while 'near' before a name probably indicates a location).  Therefore, we were able to create a more accurate model by having separate states for the words immediately before and after a name, and for the first and last tokens of a name.  This added about 2 points to recall (89 to 91) and 4 points to precision (82 to 86).

BBN's Nymble name tagger (Daniel M. Bikel; Scott Miller; Richard Schwartz; Ralph Weischedel.  Nymble: a High-Performance Learning Name-finder.  Proc. ANLP 97.) is perhaps the best-known name tagger.

They used several techniques to enhance performance over a basic HMM.  Most notably, they used bigram probabilities:  they differentiated between the probability of generating the first word of a name and subsequent words of a name.  The probability of generating the first word was made dependent on the prior state;  the probability of generating subsequent words was made dependent on the prior word.  The probability of a state transition was made dependent on the prior word.  This had to be combined with smoothing to handle the case of unseen bigrams.
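One simple form of smoothing is linear interpolation, sketched below.  The fixed weights and the particular backoff chain here are illustrative assumptions -- Nymble computed its backoff weights from the counts themselves rather than using fixed lambdas:

```python
def interp_word_prob(word, prev_word, state,
                     bigram_counts, context_counts,
                     unigram_counts, state_counts,
                     vocab_size, lam1=0.6, lam2=0.3):
    """Interpolated estimate of P(word | prev_word, state).

    Backs off from the bigram MLE to a unigram-within-state MLE and
    finally to a uniform distribution over the vocabulary, so unseen
    bigrams still receive nonzero probability.
    """
    ctx = context_counts.get((state, prev_word), 0)
    p_bigram = (bigram_counts.get((state, prev_word, word), 0) / ctx
                if ctx else 0.0)
    n_state = state_counts.get(state, 0)
    p_unigram = (unigram_counts.get((state, word), 0) / n_state
                 if n_state else 0.0)
    p_uniform = 1.0 / vocab_size
    return lam1 * p_bigram + lam2 * p_unigram + (1 - lam1 - lam2) * p_uniform
```

An unseen bigram such as ("Mr.", "Jones") then falls through to the unigram and uniform terms instead of getting probability zero.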

(student presentation)

HMMs are generative models which produce a joint probability over observation and label sequences;  typically we compute P(new state | prior state) and P(current word | current state).  It is difficult to represent long-range or multiple interacting features in such a formalism.  Instead, researchers have used functions which compute the state probability given the input -- a formalism which allows for a richer set of features.

Sekine et al. (Satoshi Sekine; Ralph Grishman; Hiroyuki Shinnou.  A Decision Tree Method for Finding and Classifying Names in Japanese Texts.  Sixth WVLC, 1998)  used a decision tree method for Japanese named entity recognition.  The decision tree yielded information on the probability of the various tags.  A Viterbi algorithm then computed the most likely tagging of the entire sentence.

Borthwick et al. (Andrew Borthwick; John Sterling; Eugene Agichtein; Ralph Grishman.  Exploiting Diverse Knowledge Sources via Maximum Entropy in Named Entity Recognition.  Sixth WVLC, 1998) used a maximum entropy method to compute the tags.  Again, a Viterbi decoder was used to select the best tagging.  By itself the method did fairly well (92.2 F on dry-run).  More interestingly, it could be combined with the patterns of the NYU hand-coded-rule system, with each rule a separate feature.  The rule-based system by itself also got 92.2 F;  the combined system got 95.6 F, roughly on a par with the best commercial system.

(presentation by Bonan Min)

McCallum (Maximum Entropy Markov Models for Information Extraction and Segmentation. Andrew McCallum, Dayne Freitag and Fernando Pereira. ICML-2000) describes general Maximum Entropy Markov Models (MEMMs) as computing P(current state | input, prior state) using Maximum Entropy methods.  McCallum notes that the Borthwick model is somewhat weaker in that the current state probability is conditioned only on the input, not on the prior state, and that may be why it did not do quite as well as the Nymble HMM model.
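The key difference shows up in the feature function: in an MEMM the prior state appears as a feature alongside the input, so the classifier models P(current state | input, prior state) directly.  A minimal feature extractor might look like this (the feature names are illustrative, not taken from any of the cited systems):

```python
def memm_features(tokens, i, prev_tag):
    """Features for predicting the tag of tokens[i].

    The prev_tag feature is what makes this an MEMM rather than a
    per-token classifier conditioned on the input alone.
    """
    word = tokens[i]
    return {
        "word=" + word.lower(): 1,
        "capitalized=" + str(word[:1].isupper()): 1,
        "prev_word=" + (tokens[i - 1].lower() if i > 0 else "<s>"): 1,
        "prev_tag=" + prev_tag: 1,  # state-transition information as a feature
    }
```

A maximum entropy classifier trained over such features gives the per-position state distributions, and Viterbi decoding over them gives the best tag sequence.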

He later describes a conditional random field (CRF) as an improvement over MEMMs for the NE task. (Early results for named entity recognition with conditional random fields, feature induction and web-enhanced lexicons.  Andrew McCallum and Wei Li. CoNLL 2003).

Discriminative methods

Another concern with HMMs is that the parameters learned may not be the optimal ones for the ultimate classification task.  As an alternative, discriminative methods are trained to make the discrimination between classes directly.  Collins (Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms, EMNLP 02; Collins and Duffy, ACL 2002) described an approach using error-driven HMM training and reported a 15% reduction in error rate on a named entity tagging task.  Support Vector Machines are currently the most widely used discriminative method in NLP, and have been effectively applied to the named entity task (Efficient support vector classifiers for named entity recognition, Hideki Isozaki and Hideto Kazawa, COLING 2002).

Looking ahead to next week ... unsupervised learning of names

Michael Collins; Yoram Singer.  Unsupervised Models for Named Entity Classification.  EMNLP 99.

Riloff, E. and Jones, R. (1999) "Learning Dictionaries for Information Extraction by Multi-Level Bootstrapping", Proceedings of the Sixteenth National Conference on Artificial Intelligence (AAAI-99) , 1999, pp. 474-479.