CSCI-GA.2590 - Natural Language Processing - Spring 2013 - Prof. Grishman

Lecture 6 Outline

March 5, 2013

Named Entity Tagging

As we have noted before, identifying names is an important part of many natural language processing applications.  Names are very common in most types of text, and -- unlike general vocabulary -- cannot be looked up in a dictionary.

The simple ChunkPatterns set for Jet treats any sequence of capitalized words as a name.  This is very crude ... it doesn't handle names at the beginning of sentences or names containing some lower-case words, such as "University of Pennsylvania" or "City of New York".  It doesn't work for headlines, and such a strategy would not work at all for languages with no case information (Chinese, Japanese) or in which all nouns, not just names, are capitalized (German).  Furthermore, it doesn't classify names (people vs. companies, for example), although that is essential for almost any real application.
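
As a rough sketch (not the actual ChunkPatterns rules), the heuristic can be written in a few lines of Python, and it fails in exactly the ways just described:

    # minimal sketch of the "sequence of capitalized words" heuristic
    # (an illustration only, not the Jet ChunkPatterns implementation)
    import re

    def crude_names(tokens):
        """Return spans (start, end) of maximal runs of capitalized tokens."""
        names = []
        i = 0
        while i < len(tokens):
            if re.match(r"^[A-Z][a-z]*$", tokens[i]):
                j = i
                while j < len(tokens) and re.match(r"^[A-Z][a-z]*$", tokens[j]):
                    j += 1
                names.append((i, j))
                i = j
            else:
                i += 1
        return names

    sent = "Yesterday Fred Smith visited the University of Pennsylvania".split()
    print([" ".join(sent[s:e]) for s, e in crude_names(sent)])
    # -> ['Yesterday Fred Smith', 'University', 'Pennsylvania']
    # "Yesterday" is wrongly swept into the name (sentence-initial capital), and
    # "University of Pennsylvania" is split because "of" is lower case.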

Fortunately, name identification has become a widely-studied task since the mid-90's, so there are now many corpora annotated with name information, in many languages.  The 'standard set', introduced at Message Understanding Conference - 6 in 1995, recognizes three types of names -- people, organizations, and locations -- as well as four other types of expressions -- dates, times, percentages, and monetary amounts.  These corpora have been used to develop both detailed hand-coded rules and statistical models.

These three name types dominate in 'hard news', but there are many other types of names, such as names of products ("Cheerios", "Prius"), works of art ("Mona Lisa"), and laws (the "Glass-Steagall Act"). Furthermore, specialized domains may have their own quite different name types, such as gene names in the biology literature or disease names in medical reports.

Some names are simply memorized -- for example, the names of well known companies (IBM, Ford).  Other names can be identified and classified based on both internal and external evidence.  Examples of internal evidence are common first names ("Fred Kumquat") or corporate suffixes ("Blightly Associates", "Zippo Corp.");  examples of external evidence are titles ("President Huber") and verbs which take human subjects ("Zenca died").  Such evidence can be used by both hand-coded and corpus-trained models.
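
A hand-coded classifier built on such evidence might look, in skeletal form, like the following sketch (the word lists here are tiny stand-ins for the much longer lists a real system would use):

    # sketch of evidence-based name classification; the lists are illustrative only
    FIRST_NAMES   = {"fred", "mary", "john"}                  # internal evidence
    CORP_SUFFIXES = {"associates", "corp.", "inc.", "co."}    # internal evidence
    TITLES        = {"president", "mr.", "dr."}               # external evidence

    def classify_name(name_tokens, prev_token=None):
        """Guess PERSON / ORGANIZATION for a candidate name from simple evidence."""
        first, last = name_tokens[0].lower(), name_tokens[-1].lower()
        if last in CORP_SUFFIXES:                    # "Blightly Associates", "Zippo Corp."
            return "ORGANIZATION"
        if first in FIRST_NAMES:                     # "Fred Kumquat"
            return "PERSON"
        if prev_token and prev_token.lower() in TITLES:   # "President Huber"
            return "PERSON"
        return "UNKNOWN"

    print(classify_name(["Zippo", "Corp."]))                  # ORGANIZATION
    print(classify_name(["Huber"], prev_token="President"))   # PERSON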

Many different statistical models have been used for named entity tagging;  HMMs were one of the first and are still commonly used (see Nymble: a High-Performance Learning Name-finder and the Advanced NLP notes on NE).  The simplest HMM has one state for each type of name, plus one state for "other".  However, such a model does not capture any context information.  To include context information and a bit of internal structure, Jet uses a more elaborate HMM, with 6 states for each name type.  Other HMMs for name recognition, such as Nymble, capture some context by conditioning the transition and emission probabilities on the prior word.
(See diagrams of the Nymble and Jet HMM networks.)
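
To make the structure concrete, here is a toy HMM name tagger with the "one state per name type, plus other" topology and Viterbi decoding.  The probabilities are invented for illustration; a real tagger estimates them from annotated text, and Nymble additionally conditions them on the prior word.

    # toy HMM name tagger: one state per name type plus "other"
    # (all probabilities here are invented for illustration only)
    import math

    STATES = ["PERSON", "ORGANIZATION", "LOCATION", "OTHER"]
    start  = {"PERSON": 0.1, "ORGANIZATION": 0.1, "LOCATION": 0.1, "OTHER": 0.7}
    trans  = {s: {t: (0.6 if s == t else 0.4 / 3) for t in STATES} for s in STATES}

    def emit(state, word):
        """Toy emission probability based on a crude capitalization feature."""
        if word[0].isupper():
            return 0.2 if state != "OTHER" else 0.05
        return 0.01 if state != "OTHER" else 0.3

    def viterbi(words):
        # best[i][s] = log probability of the best path ending in state s at position i
        best = [{s: math.log(start[s]) + math.log(emit(s, words[0])) for s in STATES}]
        back = [{}]
        for i in range(1, len(words)):
            best.append({}); back.append({})
            for s in STATES:
                prev, score = max(
                    ((p, best[i-1][p] + math.log(trans[p][s])) for p in STATES),
                    key=lambda x: x[1])
                best[i][s] = score + math.log(emit(s, words[i]))
                back[i][s] = prev
        # follow back-pointers from the best final state
        tags = [max(best[-1], key=best[-1].get)]
        for i in range(len(words) - 1, 0, -1):
            tags.append(back[i][tags[-1]])
        return list(reversed(tags))

    print(list(zip("Fred Smith visited Boston".split(),
                   viterbi("Fred Smith visited Boston".split()))))
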
The name tagger for Jet is run with the command "tagNames".  The parameter "NameTags.fileName" specifies the HMM to be used for name tagging.  Jet includes the "MUCnameHMM.txt" file, an HMM trained on the MUC-7 name corpus.  The tagger produces annotations of type ENAMEX for names and type TIMEX for dates and times.
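
In practice this means setting a few lines in the Jet properties file; an illustrative fragment is shown below.  Only NameTags.fileName and MUCnameHMM.txt come from the discussion above; the other property names are assumptions about a typical Jet configuration and may differ in a given Jet release (check the Jet documentation).

    # illustrative Jet properties fragment (names other than NameTags.fileName are assumed)
    Jet.dataPath      = data
    processSentence   = tokenize, tagNames
    NameTags.fileName = MUCnameHMM.txt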

Maximum entropy modeling (J&M sec. 6.6 - 6.8)

(J&M present Maximum entropy within the broader context of logistic regression;  for a more direct presentation, see Berger's online tutorial or Manning & Schutze, sec. 16.2) 

Up to now, we have looked at one simple statistical model -- HMMs.  HMMs are very simple to train and have proven quite effective for tasks such as POS tagging and name tagging.  However, if we want to refine and improve a basic HMM model based on our intuition or linguistic analysis, things get more complicated.  We can split states (for example, separate states for first and last name) but if we split too much we will run into sparse data problems.

In practice, we may (through our linguistic analysis or intuition) be able to identify a rather heterogeneous set of features which contribute in some way to a choice about the current word:  whether the current story is a sports article;  whether it's in the first sentence of the story;  whether there's a preposition anywhere in the last five words;  whether this word is on a word list;  and so forth.  Maximum entropy modeling provides one mathematically well-founded method for combining such features in a probabilistic model.

The features provide constraints on the model.  We'd like to have a probability distribution which, outside of these constraints, is as uniform as possible -- has the maximum entropy among all models which satisfy these constraints.

Suppose we have a tagging task, where we want to assign a tag t to a word w based on the 'context' h of w (the words around w, including w itself).  In other words, we want to compute p(h, t).  We will specify a set of K features in the form of binary-valued indicator functions f_i(h, t).  For example,

f_1(h, t) = 1 if the preceding word in h is "to" and t = "VB"
          = 0 otherwise
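
Written as code, such an indicator function is just a boolean test on the (context, tag) pair.  A minimal Python sketch, representing the context h as the token list plus the index of the current word (that representation is a choice made here for illustration, not part of the model):

    # the feature f_1 above, written as a Python function
    def f_1(h, t):
        tokens, i = h                      # context: token list and current position
        return 1 if i > 0 and tokens[i-1] == "to" and t == "VB" else 0

    h = ("I want to leave".split(), 3)     # current word is "leave"
    print(f_1(h, "VB"), f_1(h, "NN"))      # prints: 1 0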

We will use a log-linear model, where

p(h, t) = (1/Z) Π_{i=1..K} α_i^{f_i(h, t)}

where α_i is the weight for feature i, and Z is a normalizing constant.  The goal of the learning procedure is to determine the values of the α_i's so that the expected value of each f_i under the model,

Σ_{h,t} p(h, t) f_i(h, t)

is equal to its average value over the training set of N words (whose contexts are h_1, ..., h_N and whose tags are t_1, ..., t_N):

(1/N) Σ_{j=1..N} f_i(h_j, t_j)

The procedure for finding the α_i's is called generalized iterative scaling.  A number of packages are now available which implement this algorithm and improvements thereon.

MaxEnt is typically used as a multi-class classifier.  We are given a set of training data, where each datum is labeled with a set of features and a class (tag).  Each feature-class pair constitutes an indicator function.  We train a classifier on this data, computing the α_i's.  We can then classify new data by selecting the class (tag) which maximizes p(h, t).
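
To see how the pieces fit together, here is a minimal Python sketch of GIS training and classification.  It is a simplification (no correction feature, a fixed number of iterations, expectations computed over the training contexts only, so the model is used in its conditional form p(t | h)), and the two features and three training examples are toy data; real packages are considerably more careful.

    # minimal sketch of MaxEnt training with generalized iterative scaling (GIS)
    import math

    def posterior(h, alpha, features, tags):
        """p(t | h), proportional to the product of alpha_i ** f_i(h, t)."""
        score = {t: math.prod(a ** f(h, t) for a, f in zip(alpha, features)) for t in tags}
        Z = sum(score.values())
        return {t: s / Z for t, s in score.items()}

    def train_gis(data, features, tags, iterations=100):
        """data: list of (h, t) pairs; features: list of indicator functions f_i(h, t)."""
        N, K = len(data), len(features)
        # C = maximum number of active features on any (h, t) pair
        C = max(sum(f(h, t) for f in features) for h, _ in data for t in tags)
        # empirical average of each feature over the training data
        emp = [sum(f(h, t) for h, t in data) / N for f in features]
        alpha = [1.0] * K
        for _ in range(iterations):
            # model expectation of each feature, averaged over the training contexts
            model = [0.0] * K
            for h, _ in data:
                p = posterior(h, alpha, features, tags)
                for i, f in enumerate(features):
                    model[i] += sum(p[t] * f(h, t) for t in tags) / N
            for i in range(K):
                if emp[i] > 0 and model[i] > 0:
                    alpha[i] *= (emp[i] / model[i]) ** (1.0 / C)
        return alpha

    def classify(h, alpha, features, tags):
        """Select the tag maximizing p(h, t) (equivalently p(t | h) for fixed h)."""
        p = posterior(h, alpha, features, tags)
        return max(p, key=p.get)

    # toy usage: choose VB vs. NN for a word from one feature of its context
    tags = ["VB", "NN"]
    features = [
        lambda h, t: 1 if h["prev"] == "to" and t == "VB" else 0,
        lambda h, t: 1 if h["prev"] == "the" and t == "NN" else 0,
    ]
    data = [({"prev": "to"}, "VB"), ({"prev": "the"}, "NN"), ({"prev": "to"}, "VB")]
    alpha = train_gis(data, features, tags)
    print(classify({"prev": "to"}, alpha, features, tags))    # VB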

Sequence classifiers:  Maximum Entropy Markov Models (J&M 6.8)

Maximum entropy modeling can be combined with a Markov model, so that, for each state, the probability of a transition to that state is computed by a MaxEnt model, based on the prior state and arbitrary features of the input sequence.  The result is a Maximum Entropy Markov Model (MEMM).  Decoding (selecting the best tag sequence) can be done deterministically left-to-right or (better) with Viterbi decoding, as in an HMM.
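
A sketch of Viterbi decoding for an MEMM follows.  The function p_dist stands in for the trained MaxEnt model that returns p(state | prior state, features of the current input position); its scores are invented for illustration.

    # sketch of Viterbi decoding for an MEMM
    import math

    STATES = ["PERSON", "ORGANIZATION", "LOCATION", "OTHER"]
    START  = "START"

    def p_dist(prev_state, words, i):
        """Stand-in for a trained MaxEnt model: p(state | prev_state, features of words[i])."""
        cap = words[i][0].isupper()
        raw = {}
        for s in STATES:
            score = 2.0 if (cap and s != "OTHER") or (not cap and s == "OTHER") else 1.0
            if s == prev_state:            # mild preference for staying in the same state
                score *= 1.5
            raw[s] = score
        Z = sum(raw.values())
        return {s: v / Z for s, v in raw.items()}

    def viterbi(words):
        best = [{s: math.log(p) for s, p in p_dist(START, words, 0).items()}]
        back = [{}]
        for i in range(1, len(words)):
            best.append({}); back.append({})
            dists = {p: p_dist(p, words, i) for p in STATES}   # one model call per prior state
            for s in STATES:
                prev, score = max(((p, best[i-1][p] + math.log(dists[p][s])) for p in STATES),
                                  key=lambda x: x[1])
                best[i][s], back[i][s] = score, prev
        tags = [max(best[-1], key=best[-1].get)]
        for i in range(len(words) - 1, 0, -1):
            tags.append(back[i][tags[-1]])
        return list(reversed(tags))

    print(viterbi("Fred Smith visited Boston".split()))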

Feature engineering

When using a package such as MaxEnt, the computational linguist's job becomes one of feature engineering -- identifying the features which will be most predictive of the tags we are trying to assign. Simple features are the current token (or the previous token, or the next token) having a particular value, or being on a particular list (such as a list of common titles or common first names).

Keep in mind that each feature contributes an independent factor to the probability.  Consequently, if you believe that a pair of features appearing together is more significant than each feature separately, the conjunction can be defined as an additional feature (as in the sketch below).  Carrying this to its logical conclusion, one can turn the complete name patterns of a hand-coded name tagger into individual features.
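
A sketch of such a feature extractor, combining simple token and list features with one conjunction feature (the word lists are, again, only illustrative):

    # sketch of a feature extractor for name tagging
    TITLES      = {"president", "mr.", "dr."}
    FIRST_NAMES = {"fred", "mary", "john"}

    def features(tokens, i):
        """Binary features for the token at position i, as feature-name strings."""
        feats = set()
        feats.add("cur=" + tokens[i].lower())                  # current token
        if i > 0:
            feats.add("prev=" + tokens[i-1].lower())           # previous token
        if i + 1 < len(tokens):
            feats.add("next=" + tokens[i+1].lower())           # next token
        if tokens[i][0].isupper():
            feats.add("capitalized")
        if tokens[i].lower() in FIRST_NAMES:
            feats.add("on_first_name_list")                    # list membership
        if i > 0 and tokens[i-1].lower() in TITLES:
            feats.add("prev_is_title")
        # conjunction feature: capitalized AND preceded by a title
        if "capitalized" in feats and "prev_is_title" in feats:
            feats.add("capitalized_and_prev_is_title")
        return feats

    print(features("President Huber spoke".split(), 1))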

MaxEnt can even be used to combine the output of different systems, encoding each system's output as a set of features (Borthwick, Sterling, Agichtein and Grishman, WVLC 1998). A related technique can be used to adapt a tagger to a new domain where limited training data is available: first train the tagger on the original domain (with extensive training data); then train the tagger on the new domain, treating the output of the tagger trained on the original domain as an additional feature (Florian et al. NAACL 2004).