G22.2590 - Natural Language Processing - Spring 2010 - Prof. Grishman
Lecture 7 Outline
March 4, 2010
Named Entity Tagging
As we have noted before, identifying names is an important part of many
natural language processing applications. Names are very common
in most types of text, and -- unlike general vocabulary -- cannot be
looked up in a dictionary.
The simple ChunkPatterns set in Jet treats any sequence of capitalized
words as a name. This is very crude: it doesn't handle names at the
beginning of sentences or names with some lower-case words, such as
"University of Pennsylvania" or "City of New York". It doesn't work for
headlines, and such a strategy would not work for languages with no case
information (Chinese, Japanese) or languages in which common nouns are
also capitalized (German). Furthermore, it doesn't classify names
(people vs. companies, for example), although that is essential for
almost any real application.
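To see why this strategy breaks down, here is a minimal Python sketch of
it (an illustration of the idea only, not Jet's actual ChunkPatterns
code):

    import re

    # Treat any maximal run of capitalized words as a name.
    CAP_RUN = re.compile(r'(?:[A-Z][a-z]+)(?:\s+[A-Z][a-z]+)*')

    def naive_names(text):
        return CAP_RUN.findall(text)

    print(naive_names("Yesterday Fred Kumquat joined the University of Pennsylvania."))
    # ['Yesterday Fred Kumquat', 'University', 'Pennsylvania']
    # Both failure modes appear: the sentence-initial "Yesterday" is swept
    # into the name, and the lower-case "of" splits "University of
    # Pennsylvania" in two.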
Fortunately, name identification has become a widely-studied task over
the last decade, so there are now many corpora annotated with name
information, in many languages. The 'standard set', introduced at the
Sixth Message Understanding Conference (MUC-6) in 1995, recognizes three types of
names -- people, organizations, and locations -- as well as four other
types of expressions -- dates, times, percentages, and monetary
amounts. These corpora have been used to develop both detailed
hand-coded rules and statistical models.
Some names are simply memorized -- for example, the names of well known
companies (IBM, Ford). Other names can be identified and
classified based on both internal and external evidence. Examples
of internal evidence are common first names ("Fred Kumquat") or
corporate suffixes ("Blightly Associates", "Zippo Corp.");
examples of external evidence are
titles ("President Huber") and verbs which take human subjects ("Zenca
died"). Such evidence can be used by both hand-coded and
Many different statistical models have been used for named entity
tagging; HMMs were one of the first and remain one of the most
popular (see "Nymble: a High-Performance Learning Name-finder" and the
advanced notes on NE). The simplest HMM has one state for each
type of name,
plus one state for "other". However, such a model does not
capture any context information. To include context information
and a bit of internal structure, Jet uses a more elaborate HMM, with 6
states for each name type. Other HMMs for name recognition
condition the transition and emission probabilities on the prior word.
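To make the simplest version concrete, here is a toy Python sketch (with
invented probabilities, and not Jet's model) of an HMM with one state per
name type plus "other", decoded with the standard Viterbi algorithm:

    STATES = ["PERSON", "ORGANIZATION", "LOCATION", "OTHER"]

    # TRANS[s1][s2] = P(next state = s2 | current state = s1)
    TRANS = {s: {t: (0.7 if s == t else 0.1) for t in STATES} for s in STATES}
    START = {s: 0.25 for s in STATES}

    # Toy emission probabilities P(word | state); a real model would be
    # estimated from a name-annotated corpus.
    LEXICON = {"PERSON": {"fred": 0.3, "kumquat": 0.1},
               "ORGANIZATION": {"ibm": 0.4, "ford": 0.2},
               "LOCATION": {"york": 0.3}}

    def emit(state, word):
        return LEXICON.get(state, {}).get(word.lower(), 0.001)

    def viterbi(words):
        # best[s] = (probability of the best path ending in s, that path)
        best = {s: (START[s] * emit(s, words[0]), [s]) for s in STATES}
        for w in words[1:]:
            best = {s: max(((p * TRANS[r][s] * emit(s, w), path + [s])
                            for r, (p, path) in best.items()),
                           key=lambda x: x[0])
                    for s in STATES}
        return max(best.values(), key=lambda x: x[0])[1]

    print(viterbi(["Fred", "Kumquat", "visited", "IBM"]))
    # ['PERSON', 'PERSON', 'PERSON', 'ORGANIZATION'] with these toy numbers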
The name tagger for Jet is run with the command "tagNames". The parameter "NameTags.fileName" specifies
the HMM to be used for name tagging. Jet includes the "MUCnameHMM.txt" file, an HMM
trained on the MUC-7 name corpus. The tagger produces annotations
of type ENAMEX for names
and type TIMEX for dates and times.
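In a Jet properties file this might be configured with lines like the
sketch below; apart from tagNames, NameTags.fileName, and MUCnameHMM.txt,
which come from these notes, the layout and the other entries are
assumptions about the configuration format:

    # hypothetical fragment of a Jet properties file
    processSentence = tokenize, tagNames
    NameTags.fileName = MUCnameHMM.txt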
Maximum entropy modeling (J&M sec. 6.6 - 6.8)
(J&M present Maximum entropy within the broader context of logistic
regression; for a more direct presentation, see Berger's
tutorial or Manning & Schütze, sec. 16.2)
Up to now, we have looked at one simple type of statistical model --
the HMM. HMMs are very simple to train and have proven quite
effective for tasks such as POS tagging and name tagging.
However, if we want to refine and improve a basic HMM model based on
our intuition or linguistic analysis, things get more
complicated. We can split states (for example, separate states
for first and last name) but if we split too much we will run into
sparse data problems.
In practice, we may
(through our linguistic analysis or intuition) be able to identify a
heterogeneous set of features which contribute in some way to a choice
about the current word: whether the current story is a sports
article; whether it's in the first sentence of the story;
whether there's a preposition anywhere in the last five words;
whether this word is on a word list; and so forth. Maximum
entropy modeling provides one
mathematically well-founded method for combining such features in a
single model.
The features provide constraints on the model. We'd like to have
a probability distribution which, outside of these constraints, is as
uniform as possible -- has the maximum entropy among all models which
satisfy these constraints.
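As a tiny worked example, suppose the tag set is {VB, NN, JJ, OTHER} and
the only constraint is that p(VB) + p(NN) = 0.6. Many distributions
satisfy this constraint; the maximum entropy distribution is the one
that is otherwise as uniform as possible: p(VB) = p(NN) = 0.3 and
p(JJ) = p(OTHER) = 0.2.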
Suppose we have a tagging task, where we want to assign a tag t to a
word w based on the 'context' h of w (the words around w, including w
itself). In other words, we want to compute p(h, t). We will
specify a set of K features in the form of binary-valued indicator
functions f_i(h, t). For example,

    f_1(h, t) = 1 if the preceding word in h is "to" and t = "VB"
We will use a log-linear model, where

    p(h, t) = (1/Z) prod_{i=1..K} alpha_i^f_i(h, t)

where alpha_i is the weight for feature i, and Z is a normalizing
constant. The goal of the learning procedure is to determine the values
of the alpha_i's so that the expected value of each f_i under the model,

    sum_{h,t} p(h, t) f_i(h, t)

is equal to its expected value (i.e., its average value) over the
training set of N words (whose contexts are h_1, ..., h_N and whose
tags are t_1, ..., t_N):

    (1/N) sum_{j=1..N} f_i(h_j, t_j)

The procedure for finding the alpha_i's is called generalized iterative
scaling. A number of packages are now available which implement this
algorithm and improvements thereon.
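As a concrete (if unoptimized) illustration, here is a Python sketch of
GIS for the conditional form of the model, p(t|h); the data format is
invented for the example, and the usual 'slack' feature is omitted for
brevity:

    from collections import defaultdict

    def train_gis(data, tags, iterations=100):
        # data: list of (set of active context features, tag) pairs.
        # One indicator function per (feature, tag) pair seen in training.
        feats = sorted({(f, t) for ctx, t in data for f in ctx})
        # GIS assumes a constant number of active features per event;
        # C approximates that here (slack feature omitted).
        C = max(len(ctx) for ctx, _ in data)
        alpha = {k: 1.0 for k in feats}

        def p_tag(ctx):
            # p(t | ctx) is proportional to the product of the active alphas.
            s = {t: 1.0 for t in tags}
            for t in tags:
                for f in ctx:
                    s[t] *= alpha.get((f, t), 1.0)
            z = sum(s.values())
            return {t: v / z for t, v in s.items()}

        # Empirical expectation of each indicator over the training set.
        emp = defaultdict(float)
        for ctx, t in data:
            for f in ctx:
                emp[(f, t)] += 1.0 / len(data)

        for _ in range(iterations):
            # Expectation of each indicator under the current model.
            mod = defaultdict(float)
            for ctx, _ in data:
                probs = p_tag(ctx)
                for t in tags:
                    for f in ctx:
                        mod[(f, t)] += probs[t] / len(data)
            # GIS update: scale each alpha toward the empirical expectation.
            for k in feats:
                alpha[k] *= (emp[k] / mod[k]) ** (1.0 / C)

        return alpha, p_tag

    data = [({"prev=to", "capitalized"}, "VB"),
            ({"prev=to"}, "VB"),
            ({"capitalized"}, "NNP")]
    alpha, p_tag = train_gis(data, tags=["VB", "NNP"])
    print(p_tag({"prev=to"}))  # p(VB | prev=to) should be near 1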
MaxEnt is typically used to build a multi-class classifier. We are
given a set of training data, where each datum is labeled with a set of
features and a class (tag). Each feature-class pair constitutes
an indicator function. We train a classifier using this data,
computing the alpha_i's. We can then classify new data by selecting
the class (tag) which maximizes p(h, t).
Sequence classifiers: Maximum
entropy Markov models (J&M 6.8)
Maximum entropy modeling can be combined with a Markov model, so that,
for each state, the probability of a transition to that state is
computed by a MaxEnt model, based on the prior state and arbitrary
features of the input sequence. The result is a Maximum Entropy
Markov Model (MEMM). Decoding (selecting the best tag sequence)
can be done deterministically left-to-right or (better) using Viterbi
decoding like an HMM.
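A Python sketch of such decoding follows; next_tag_probs stands in for
the trained MaxEnt model, and both it and the toy feature extractor are
assumptions made for the example:

    def viterbi_memm(words, tags, next_tag_probs, extract_features):
        # In an MEMM the transition and emission collapse into a single
        # conditional model, so Viterbi maximizes a product of
        # P(tag | previous tag, features of the input) terms.
        best = {t: (next_tag_probs("START", extract_features(words, 0)).get(t, 0.0),
                    [t])
                for t in tags}
        for i in range(1, len(words)):
            feats = extract_features(words, i)
            best = {t: max(((p * next_tag_probs(prev, feats).get(t, 0.0), path + [t])
                            for prev, (p, path) in best.items()),
                           key=lambda x: x[0])
                    for t in tags}
        return max(best.values(), key=lambda x: x[0])[1]

    def extract_features(words, i):
        return {"cap": words[i][0].isupper(), "word": words[i].lower()}

    def next_tag_probs(prev_tag, feats):
        # Toy stand-in for a trained MaxEnt model.
        if feats["cap"] and prev_tag in ("START", "NAME"):
            return {"NAME": 0.8, "OTHER": 0.2}
        return {"NAME": 0.1, "OTHER": 0.9}

    print(viterbi_memm(["Fred", "Kumquat", "slept"], ["NAME", "OTHER"],
                       next_tag_probs, extract_features))
    # ['NAME', 'NAME', 'OTHER'] with this toy model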
Feature engineering: when using a package such as MaxEnt, the
computational linguist's job becomes one of feature engineering --
identifying the features which will be most predictive of the tags we
are trying to assign.
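For example, a feature function encoding the cues mentioned earlier
(sports article, first sentence, nearby preposition, word lists) might
look like this Python sketch, in which every list and feature name is
illustrative:

    PREPOSITIONS = {"in", "on", "at", "to", "of", "for"}
    FIRST_NAMES = {"fred", "mary", "john"}

    def features(words, i, is_sports_article, sentence_index):
        w = words[i]
        return {
            "sports_article": is_sports_article,
            "first_sentence": sentence_index == 0,
            "prep_in_last_5": any(x.lower() in PREPOSITIONS
                                  for x in words[max(0, i - 5):i]),
            "on_first_name_list": w.lower() in FIRST_NAMES,
            "capitalized": w[0].isupper(),
            "prev_word": words[i - 1].lower() if i > 0 else "<s>",
        }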