CSCI-GA.2590 - Natural Language Processing - Spring 2013 - Prof.
Lecture 6 Outline
March 5, 2013
Named Entity Tagging
As we have noted before, identifying names is an important part of many
natural language processing applications. Names are very common
in most types of text, and -- unlike general vocabulary -- cannot be
looked up in a dictionary.
The simple ChunkPatterns set in Jet treats any sequence of capitalized words as a name. This
is very crude ... it doesn't handle names at the beginning of sentences
or names with some lower-case words, such as "University of
Pennsylvania" or "City of New York". It doesn't work for
headlines, and such a strategy would not work for many languages with
no case information (Chinese, Japanese) or where other nouns are
capitalized (German). Furthermore, it doesn't classify names
(people vs. companies, for example), although that is essential for
almost any real application.
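This crude capitalized-sequence strategy can be sketched in a few lines (a hypothetical illustration, not Jet's actual ChunkPatterns code):

```python
import re

def find_candidate_names(text):
    """Treat any maximal sequence of capitalized words as a name --
    the crude strategy described above."""
    return re.findall(r"[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*", text)
```

Note how it splits "University of Pennsylvania" at the lower-case "of", and would equally pick up an ordinary capitalized sentence-initial word.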
Fortunately, name identification has become a widely-studied task since
the mid-90's, so there are now many corpora annotated with name
information, in many languages. The 'standard set', introduced at the
Message Understanding Conference - 6 (MUC-6) in 1995, recognizes three types of
names -- people, organizations, and locations -- as well as four other
types of expressions -- dates, times, percentages, and monetary
amounts. These corpora have been used to develop both detailed
hand-coded rules and statistical models.
These three name types dominate in 'hard news', but there are many
other types of names, such as names of products ("Cheerios",
"Prius"), works of art ("Mona Lisa"), and laws (the "Glass-Steagall Act").
Furthermore, specialized domains may have their own quite different
name types, such as gene names in the biology literature or disease
names in medical reports.
Some names are simply memorized -- for example, the names of well known
companies (IBM, Ford). Other names can be identified and
classified based on both internal and external evidence. Examples
of internal evidence are common first names ("Fred Kumquat") or
corporate suffixes ("Blightly Associates", "Zippo Corp.");
examples of external evidence are
titles ("President Huber") and verbs which take human subjects ("Zenca
died"). Such evidence can be used by both hand-coded and statistical name taggers.
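As a sketch, internal and external evidence of this kind might be combined as follows (the function and its word lists are invented for illustration; a real tagger would draw on much larger resources):

```python
# Tiny illustrative word lists -- placeholders, not real resources.
FIRST_NAMES = {"Fred", "Mary", "John"}
CORP_SUFFIXES = {"Corp.", "Inc.", "Associates", "Co."}
TITLES = {"President", "Dr.", "Mr.", "Ms."}

def classify_name(tokens, preceding_word=None):
    """Classify a candidate name using internal evidence (its own
    tokens) and one piece of external evidence (the preceding word)."""
    if tokens[-1] in CORP_SUFFIXES:       # internal: corporate suffix
        return "ORGANIZATION"
    if tokens[0] in FIRST_NAMES:          # internal: common first name
        return "PERSON"
    if preceding_word in TITLES:          # external: preceding title
        return "PERSON"
    return "UNKNOWN"
```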
Many different statistical models have been used for named entity
tagging; HMMs were one of the first and are still commonly
used (see Nymble: a
High-Performance Learning Name-finder and the Advanced
NLP notes on NE). The simplest HMM has one state for each
type of name,
plus one state for "other". However, such a model does not
capture any context information. To include context information
and a bit of internal structure, Jet uses a more elaborate HMM, with 6
states for each name type. Other HMMs for name recognition,
such as Nymble, capture some context by
conditioning the transition and emission probabilities on the prior word.
(See diagrams of the Nymble and Jet HMM networks.)
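A minimal Viterbi decoder for the simplest such HMM -- one state per name type plus "other" -- might look like this (only two states and invented probabilities, purely for illustration):

```python
import math

STATES = ["PERSON", "OTHER"]
START = {"PERSON": 0.1, "OTHER": 0.9}
TRANS = {"PERSON": {"PERSON": 0.5, "OTHER": 0.5},
         "OTHER":  {"PERSON": 0.1, "OTHER": 0.9}}
# Emission probabilities over a toy vocabulary (invented numbers).
EMIT = {"PERSON": {"fred": 0.4, "kumquat": 0.4, "met": 0.1, "yesterday": 0.1},
        "OTHER":  {"fred": 0.05, "kumquat": 0.05, "met": 0.45, "yesterday": 0.45}}

def viterbi(words):
    """Return the most probable state sequence (computed in log space)."""
    V = [{s: math.log(START[s]) + math.log(EMIT[s][words[0]]) for s in STATES}]
    back = []
    for w in words[1:]:
        col, ptr = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: V[-1][p] + math.log(TRANS[p][s]))
            col[s] = V[-1][prev] + math.log(TRANS[prev][s]) + math.log(EMIT[s][w])
            ptr[s] = prev
        V.append(col)
        back.append(ptr)
    # Follow back-pointers from the best final state.
    state = max(STATES, key=lambda s: V[-1][s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return list(reversed(path))
```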
The name tagger for Jet is run with the command "tagNames". The parameter "NameTags.fileName" specifies
the HMM to be used for name tagging. Jet includes the "MUCnameHMM.txt" file, an HMM
trained on the MUC-7 name corpus. The tagger produces annotations
of type ENAMEX for names and of type TIMEX for dates and times.
Maximum entropy modeling (J&M sec. 6.6 - 6.8)
(J&M present Maximum entropy within the broader context of logistic
regression; for a more direct presentation, see Berger's
tutorial or Manning & Schutze, sec. 16.2)
Up to now, we have looked at one simple statistical model --
HMMs. HMMs are very simple to train and have proven quite
effective for tasks such as POS tagging and name tagging.
However, if we want to refine and improve a basic HMM model based on
our intuition or linguistic analysis, things get more
complicated. We can split states (for example, separate states
for first and last name) but if we split too much we will run into
sparse data problems.
In practice, we may
(through our linguistic analysis or intuition) be able to identify a
heterogeneous set of features which contribute in some way to a choice
about the current word: whether the current story is a sports
article; whether it's in the first sentence of the story;
whether there's a preposition anywhere in the last five words;
whether this word is on a word list; and so forth. Maximum
entropy modeling provides one mathematically well-founded method for
combining such features in a single probability model.
The features provide constraints on the model. We'd like to have
a probability distribution which, outside of these constraints, is as
uniform as possible -- has the maximum entropy among all models which
satisfy these constraints.
Suppose we have a tagging task, where we want to assign a tag t to a
word w based on the 'context' h of w (the words around w, including w
itself). In other words, we want to compute p(h, t). We will
specify a set of K features in the form of binary-valued indicator
functions f_i(h, t). For example,

f_1(h, t) = 1 if the preceding word in h is "to" and t = "VB", and 0 otherwise.
We will use a log-linear model, where

p(h, t) = (1/Z) Π_{i=1..K} α_i^{f_i(h, t)}

where α_i is the weight for feature i, and Z is a
normalizing constant. The goal of the learning procedure is to
determine the values of the α_i's so that the expected
value of each f_i under the model,

Σ_{h,t} p(h, t) f_i(h, t),

is equal to its expected value (i.e., its average value) over the
training set of N words (whose contexts are h_1, ..., h_N and tags
t_1, ..., t_N):

(1/N) Σ_{j=1..N} f_i(h_j, t_j)
The procedure for finding the α_i's is called generalized iterative scaling. A
number of packages are now available which implement this algorithm
and improvements thereon.
MaxEnt is typically used for a multi-class classifier. We are
given a set of training data, where each datum is labeled with a set of
features and a class (tag). Each feature-class pair constitutes
an indicator function. We train a classifier using this data,
computing the α_i's. We can then classify new data by selecting
the class (tag) which maximizes p(h, t).
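A bare-bones version of training by generalized iterative scaling, plus classification, might look like this. It is a sketch, not a production implementation: it normalizes per context (a conditional model, which is how such classifiers are typically implemented), it assumes every training event has the same number of active features so the usual GIS "correction feature" can be dropped, and indicator functions never observed with a class simply keep weight 1.

```python
from collections import defaultdict

def train_gis(data, classes, n_iter=50):
    """data: list of (feature_set, label) pairs. Each (feature, class)
    pair acts as one binary indicator function f_i(h, t)."""
    feats = {(f, t) for fs, _ in data for f in fs for t in classes}
    alpha = {i: 1.0 for i in feats}
    C = max(len(fs) for fs, _ in data)   # GIS scaling constant
    emp = defaultdict(float)             # empirical expectation of each f_i
    for fs, label in data:
        for f in fs:
            emp[(f, label)] += 1.0 / len(data)
    for _ in range(n_iter):
        model = defaultdict(float)       # model expectation of each f_i
        for fs, _ in data:
            score = {t: 1.0 for t in classes}
            for t in classes:
                for f in fs:
                    score[t] *= alpha[(f, t)]
            Z = sum(score.values())
            for t in classes:
                for f in fs:
                    model[(f, t)] += (score[t] / Z) / len(data)
        for i in feats:                  # scale weights toward agreement
            if emp[i] > 0:
                alpha[i] *= (emp[i] / model[i]) ** (1.0 / C)
    return alpha

def classify(alpha, fs, classes):
    """Pick the class (tag) with the highest model score."""
    score = {t: 1.0 for t in classes}
    for t in classes:
        for f in fs:
            score[t] *= alpha.get((f, t), 1.0)
    return max(classes, key=lambda t: score[t])
```

On each iteration, a weight is raised when the model under-predicts its indicator relative to the training data, and lowered when it over-predicts it.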
Sequence classifiers: Maximum Entropy Markov Models (J&M 6.8)
Maximum entropy modeling can be combined with a Markov model, so that,
for each state, the probability of a transition to that state is
computed by a Max Ent model, based on the prior state and arbitrary
features of the input sequence. The result is a Maximum Entropy
Markov Model (MEMM). Decoding (selecting the best tag sequence)
can be done deterministically left-to-right or (better) using Viterbi
decoding like an HMM.
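The deterministic left-to-right decoding option can be sketched as follows; classify_fn stands in for a trained MaxEnt classifier (a hypothetical interface):

```python
def greedy_memm_decode(words, classify_fn):
    """Tag left to right; the classifier at each position sees features
    of the input word plus the previously assigned tag."""
    tags = []
    prev_tag = "START"
    for w in words:
        feats = {"word=" + w.lower(),
                 "prev_tag=" + prev_tag,
                 "capitalized=" + str(w[0].isupper())}
        prev_tag = classify_fn(feats)
        tags.append(prev_tag)
    return tags
```

Because each decision is made once and never revisited, an early mistake propagates; Viterbi decoding avoids this by keeping all tag hypotheses alive.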
When using a package such as MaxEnt, the
computational linguist's job becomes one of feature engineering --
identifying the features which will be most predictive of the tags we
are trying to assign. Simple features are the current token (or the
previous token, or the next token) having a particular value, or being
on a particular list (such as a list of common titles or common first
names).
Keep in mind that each feature contributes an independent factor to
the probability. Consequently if you believe a pair of features
appearing together is more significant than each separately, the
conjunction can be defined as an additional feature. Carrying this
to its logical conclusion, one can make complete name patterns from
a hand-coded name tagger into individual features.
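For instance, defining such conjunctions as additional features might be done with a helper like this (a hypothetical sketch):

```python
def add_conjunctions(feats, pairs):
    """Add a combined feature for each pair believed to be more
    significant together than either feature is separately."""
    out = set(feats)
    for a, b in pairs:
        if a in feats and b in feats:
            out.add(a + "&" + b)
    return out
```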
MaxEnt can even be used to combine the output of different systems, encoding
each system's output as a set of features (Borthwick, Sterling,
Agichtein and Grishman, WVLC 1998). A related technique can
be used to adapt a tagger to a new domain where limited training data
is available: first train the tagger on the original domain (with
extensive training data); then train the tagger on the new domain,
treating the output of the tagger trained on the original domain as
an additional feature (Florian et al. NAACL 2004).