G22.2591 - Advanced Natural Language Processing - Spring 2004
Why name recognition?
Name recognition was introduced as a separate task in Message
Understanding Conference-6 (MUC-6; see also the paper by Grishman and
Sundheim). Through earlier IE evaluations, system developers came
to recognize that name recognition and classification was an important
part of text processing, even if it was not recognized as basic in
linguistic study. Making it a separate task encouraged research
to improve this technology, and emphasized its value for a range of
applications (document indexing, and later question answering).
For MUC-6, there were three name categories -- people, organizations,
and locations. Date, time, percentage, and currency expressions
were also included under name recognition. Some evaluations since
then have added categories ... artifact, facility, weapon, ... .
In fact, some systems for open-domain question answering have added a
very large (100+) number of categories. However, almost all
studies have been done with the original set of three name
categories. Similar evaluations have been done for a number of
other languages: the CoNLL-2002 shared task covered Dutch and Spanish,
and the CoNLL-2003 shared task covered English and German.
How to measure scores?
Name recognition is scored by recall, precision, and F-measure
(the harmonic mean of recall and precision). The simplest rule is to
require perfect match -- you only get credit if you get the type,
start, and end of a name correct (this is the metric used, for example,
in the Japanese IREX
evaluation); the MUC evaluations used a more generous
scoring, with partial score for identifying a name of any sort, for
getting its type correct, and for getting its extent correct.
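As a concrete illustration, here is a minimal sketch of exact-match
(IREX-style) scoring, assuming names are represented as
(type, start, end) triples; the MUC scheme would instead give partial
credit for type and extent separately.

    # Exact-match NE scoring: a system name counts as correct only if
    # its type, start, and end all match some name in the answer key.
    def score(key_names, system_names):
        key = set(key_names)          # e.g. {("PERSON", 3, 5), ...}
        sys = set(system_names)
        correct = len(key & sys)
        recall = correct / len(key) if key else 0.0
        precision = correct / len(sys) if sys else 0.0
        f = (2 * precision * recall / (precision + recall)
             if precision + recall else 0.0)
        return recall, precision, f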
How well do people do?
In a small study for MUC-6 (17 articles), Sundheim reported <5%
interannotator (key-to-key) error. Agreement is probably enhanced
in languages where names are capitalized and for text where the
annotator is familiar with most of the names. Without
capitalization, it can be hard to tell unfamiliar organization names
from common noun phrases.
For a specific domain, it is possible to do very well with hand-coded
rules and dictionaries. On the MUC-6 evaluation (a very favorable
situation, where the source and general topic of the test data was
known in advance), the SRA system, based on hand-coded rules, got
F=96.4. Writing rules by hand, however, requires some skill and
effort. The hand-coded rules take advantage of:
- known names (through lists of well-known places, organizations, and
people)
- characteristic suffixes for organizations (Corp., Associates,
...) and locations (Island, Bay)
- first names for people
- titles for people
- other mentions of the same name in an article
Note that sometimes the type decision is based upon left context, and
sometimes upon right context, so it would be difficult for taggers
which operate deterministically from left to right or from right to
left to perform optimally.
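As an illustration, here is a minimal sketch of how a few such rules
might be coded. The word lists and the rules themselves are invented
for the example and are far simpler than those in a real system.

    # Toy hand-coded name rules (not any actual system's rule set).
    TITLES = {"Mr.", "Ms.", "Mrs.", "Dr."}            # titles for people
    FIRST_NAMES = {"Fred", "Mary", "John"}            # first names for people
    ORG_SUFFIXES = {"Corp.", "Inc.", "Associates"}    # organization suffixes
    LOC_SUFFIXES = {"Island", "Bay", "River"}         # location suffixes

    def classify(tokens, i, j):
        """Guess the type of the capitalized sequence tokens[i..j]."""
        if i > 0 and tokens[i - 1] in TITLES:         # left context: a title
            return "PERSON"
        if tokens[i] in FIRST_NAMES:                  # first token is a first name
            return "PERSON"
        if tokens[j] in ORG_SUFFIXES:                 # right context: org suffix
            return "ORGANIZATION"
        if tokens[j] in LOC_SUFFIXES:                 # right context: loc suffix
            return "LOCATION"
        return None                                   # leave for other rules

Note how the title rule looks at left context while the suffix rules
look at the right edge of the name: exactly the mix of left and right
context mentioned above.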
Like POS tagging and chunking, named entity recognition has been tried
with a great many different machine learning methods. More than for the
syntactic tasks, performance on NE recognition depends on the variety
of resources which are brought to bear. CoNLL evaluations are
relatively 'pure' ... the systems basically just learn from the
provided training corpus. On the other hand, 'real' systems make
use of as many lists and as much training data as available. This
has a substantial effect on performance. In addition,
performance is strongly affected by the domain of the training and test
data. These two effects can make it difficult to compare results
across different evaluations.
As with chunking, NE tagging can be recast as a token classification
task. We will have an "O" tag (token is not part of a named
entity), a "B-X" tag (token begins a name of type X), and an "I-X"
tag (token continues a name of type X) for each name type X.
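For example (a constructed sentence):

    Fred     B-PERSON
    Smith    I-PERSON
    joined   O
    Acme     B-ORGANIZATION
    Corp.    I-ORGANIZATION
    in       O
    New      B-LOCATION
    York     I-LOCATION
    .        O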
Markov Models for Name Recognition
The NYU Jet system uses a straightforward HMM for named entity tagging.
The simplest HMM has a single state for
each name type, and a single state for not-a-name (NaN). However,
typically the first and last word of a name have different
distributions, and the words immediately before or after a name often
give a good indication of the name type (for example, 'Mr.' before a
name is a clear indication of a person, while 'near' before a name
probably indicates a location). Therefore, we were able to create
a more accurate model by having separate states for the words
immediately before and after a name, and for the first and last tokens
of a name. This added about 2 points to recall (89 to 91) and 4
points to precision (82 to 86).
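Concretely, the enlarged state set might look like the following
sketch (the exact state inventory in Jet may differ; this just
illustrates the idea):

    # Sketch of an enlarged HMM state set for NE tagging: for each name
    # type, separate states for the token just before the name, the first,
    # interior, and last tokens of the name, and the token just after it.
    NAME_TYPES = ["PERSON", "ORGANIZATION", "LOCATION"]
    states = ["NaN"]                      # not-a-name
    for t in NAME_TYPES:
        states += [t + "-before", t + "-first", t + "-middle",
                   t + "-last", t + "-after"]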
BBN's Nymble name tagger (Daniel M. Bikel; Scott Miller; Richard
Schwartz; Ralph Weischedel. Nymble: a
High-Performance Learning Name-finder. Proc. ANLP 97.) is
perhaps the best-known name tagger.
They used several techniques to enhance
performance over a basic HMM. Most notably, they used bigram
probabilities: they differentiated between the probability of
generating the first word of a name and subsequent words of a
name. The probability of generating the first word was made
dependent on the prior state; the probability of generating
subsequent words was made dependent on the prior word. The
probability of a state transition was made dependent on the prior
word. This had to be combined with smoothing to handle the case
of unseen bigrams.
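Schematically (and simplifying the paper's full back-off scheme),
Nymble estimates

    P(state | prior state, prior word)      -- state transition
    P(first word | state, prior state)      -- first word of a name region
    P(word | prior word, state)             -- subsequent words (bigrams)

each smoothed by backing off to less heavily conditioned distributions
when a word or bigram was not seen in training.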
HMMs are generative models, and we noted before some difficulties with
such models. A generative model produces a joint probability over
observation and label sequences; typically we compute P(new state
| prior state) and P(current word | current state). It is
difficult to represent long-range or multiple interacting features in
such a formalism. Instead, researchers have used functions which
compute the state probability given the input -- a formalism which
allows for a richer set of features.
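Written out for the simple HMM, the joint probability of a word
sequence and a state sequence is

    P(words, states) = product over i of
                       P(state_i | state_i-1) * P(word_i | state_i)

so any feature of the input must be squeezed into the word-generation
probabilities; overlapping or long-range features do not fit naturally.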
Sekine et al. (Satoshi Sekine; Ralph
Grishman; Hiroyuki Shinnou. A Decision Tree
Method for Finding and Classifying Names in Japanese Texts
Sixth WVLC, 1998) used a decision tree method for Japanese named
entity recognition. The decision tree yielded information on the probability
of the various tags. A Viterbi algorithm then computed the most
likely tagging of the entire sentence.
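This decoding scheme recurs in several of the systems below: a local
classifier assigns a probability to each tag at each token, and Viterbi
search finds the most probable tag sequence. A minimal sketch, assuming
prob[i][t] (the classifier's probability of tag t at token i) and a 0/1
table trans[(t1, t2)] of legal tag pairs are given:

    # Viterbi decoding over per-token tag probability distributions.
    def viterbi(prob, tags, trans):
        n = len(prob)
        best = [{t: prob[0][t] for t in tags}]   # best[i][t]: best score ending in t
        back = [{}]                              # back[i][t]: best predecessor of t
        for i in range(1, n):
            best.append({})
            back.append({})
            for t in tags:
                score, prev = max((best[i - 1][p] * trans[(p, t)] * prob[i][t], p)
                                  for p in tags)
                best[i][t], back[i][t] = score, prev
        t = max(best[n - 1], key=best[n - 1].get)  # best final tag
        path = [t]
        for i in range(n - 1, 0, -1):
            t = back[i][t]
            path.append(t)
        return path[::-1]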
Borthwick et al. (Andrew Borthwick; John Sterling; Eugene Agichtein;
Ralph Grishman. Exploiting Diverse
Knowledge Sources via Maximum Entropy in Named Entity Recognition
Sixth WVLC, 1998) used a maximum entropy method to compute the
tags. Again, a Viterbi decoder was used to select the best
tagging. By itself the method did fairly well (92.2 F on
dry-run). More interestingly, it could be combined with the
patterns of the NYU hand-coded-rule system, with each rule a separate
feature. The rule-based system by itself also got 92.2 F;
the combined system got 95.6 F, roughly on a par with the best
systems. McCallum et al. (Andrew McCallum; Dayne Freitag; Fernando
Pereira. Maximum Entropy Markov Models for Information Extraction and
Segmentation. ICML-2000) describe general Maximum Entropy Markov
Models (MEMMs) as computing
P(current state | input, prior state) using Maximum Entropy methods.
The Ratnaparkhi POS tagger is close to this model. McCallum notes
that the Borthwick model is somewhat weaker in that the current
state probability is conditioned only on the input, not on the
prior state, and that may be why it did not do quite as well as the
Nymble HMM model.
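In the notation above, the contrast is

    Borthwick:  P(tag_i | input)
    MEMM:       P(state_i | state_i-1, input)

so the MEMM, like an HMM, can let the choice of tag depend on the tag
assigned to the previous token.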
Discriminative training for HMMs
Another concern we had with HMMs was that the parameters learned may
not be the optimal ones for the ultimate classification task. As
an alternative, we considered discriminative
methods ... methods which were trained to make the
discrimination between classes directly. We considered one such
approach, SVMs, last week. Collins (Discriminative
Training Methods for Hidden Markov Models: Theory and Experiments with
Perceptron Algorithms. EMNLP 02; Collins and Duffy, ACL 2002) has
described a somewhat different approach. The basic idea was to
use error-driven training. Collins reported a 15% reduction in
error rate on a named entity tagging task by using this approach.
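The heart of the method is a perceptron-style update over whole tag
sequences: decode each training sentence with the current weights, and
if the best-scoring sequence is not the correct one, add the feature
counts of the correct sequence and subtract those of the prediction.
A minimal sketch, assuming functions features(words, tags), returning
a dict of feature counts, and decode(words, weights), returning the
best-scoring tag sequence (in Collins' work, by Viterbi search):

    # Sketch of Collins-style (structured) perceptron training.
    from collections import defaultdict

    def train(corpus, features, decode, epochs=10):
        weights = defaultdict(float)
        for _ in range(epochs):
            for words, gold_tags in corpus:
                guess_tags = decode(words, weights)
                if guess_tags != gold_tags:        # error-driven update
                    for f, v in features(words, gold_tags).items():
                        weights[f] += v
                    for f, v in features(words, guess_tags).items():
                        weights[f] -= v
        return weights

Collins also found that averaging the weight vectors over all updates
(the averaged perceptron) gives more reliable results; that refinement
is omitted from the sketch.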
Looking ahead to next week ... unsupervised learning of names