G22.2590 - Natural Language Processing - Spring 2008 Prof. Grishman
Lecture 8 Outline
March 13, 2008
Discuss term projects.
Maximum entropy modeling
(Manning & Schutze, sec. 16.2; see also Berger's
Up to now, we have looked at one simple statistical model --
HMMs. HMMs are very simple to train and have proven quite
effective for tasks such as POS tagging and name tagging.
However, if we want to refine and improve a basic HMM model based on
our intuition or linguistic analysis, things get more
complicated. We can split states (for example, separate states
for first and last name) but if we split too much we will run into
sparse data problems.
In practice, we may
(through our linguistic analysis or intuition) be able to identify a
heterogeneous set of features which contribute in some way to a choice
about the current word: whether the current story is a sports
article; whether the word is in the first sentence of the story;
whether there's a preposition anywhere in the last five words;
whether this word is on a word list; and so forth. Maximum
entropy modeling provides one
mathematically well-founded method for combining such features in a
probabilistic model.
The features provide constraints on the model. We'd like to have
a probability distribution which, outside of these constraints, is as
uniform as possible -- that is, one which has the maximum entropy among
all models which satisfy these constraints.
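Stated as an optimization problem, the idea looks roughly as follows. This is the standard textbook formulation rather than something spelled out in these notes; the symbols $x$ (an outcome), $f_i$ (the feature functions that encode the constraints), $K$ (the number of features) and $\tilde{p}$ (the empirical distribution of the training data) are introduced here for illustration.

```latex
p^{*} = \arg\max_{p} H(p), \qquad H(p) = -\sum_{x} p(x) \log p(x)

\text{subject to} \quad \sum_{x} p(x)\, f_i(x) = \sum_{x} \tilde{p}(x)\, f_i(x) \quad (i = 1, \ldots, K),
\qquad \sum_{x} p(x) = 1
```

With no feature constraints at all, the solution is the uniform distribution; each added feature pulls the model away from uniform only as far as the training data require.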
Suppose we have a tagging task, where we want to assign a tag $t$ to a
word $w$ based on the 'context' $h$ of $w$ (the words around $w$, including $w$
itself). In other words, we want to compute $p(h, t)$. We will
specify a set of $K$ features in the form of binary-valued indicator
functions $f_i(h, t)$. For example,
$$f_1(h, t) = \begin{cases} 1 & \text{if the preceding word in } h \text{ is "to" and } t = \text{"VB"} \\ 0 & \text{otherwise} \end{cases}$$
We will use a log-linear model, where
$$p(h, t) = \frac{1}{Z} \prod_{i=1}^{K} \alpha_i^{f_i(h, t)}$$
where $\alpha_i$ is the weight for feature $i$, and $Z$ is a
normalizing constant. The goal of the learning procedure is to
determine the values of the $\alpha_i$'s so that the expected
value of each $f_i$ under the model,
$$\sum_{h,t} p(h, t) \, f_i(h, t)$$
is equal to its expected value (i.e., its average value) over the
training set of $N$ words (whose contexts are $h_1, ..., h_N$ and whose tags are $t_1, ..., t_N$):
$$\frac{1}{N} \sum_{j=1}^{N} f_i(h_j, t_j)$$
The procedure for finding the $\alpha_i$'s is called generalized iterative scaling.
A number of packages are now available which implement this algorithm
and improvements thereon.
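To make the training criterion concrete, here is a minimal Python sketch of generalized iterative scaling for a joint model over a tiny, fully enumerated event space. It is an illustration under stated assumptions, not the code of any particular package: the function name `gis`, the four-event space, the second feature and the training sample are all invented for the example, and the sketch assumes at least one feature fires on some event. GIS requires the feature values of every event to sum to the same constant, so the sketch appends a correction feature to guarantee this.

```python
import math


def gis(events, features, sample, iterations=50):
    """Fit weights alpha_i for p(x) = (1/Z) * prod_i alpha_i ** f_i(x)."""
    # GIS needs sum_i f_i(x) to be the same constant C for every event x,
    # so we append a correction feature that makes up the difference.
    C = max(sum(f(x) for f in features) for x in events)
    feats = list(features) + [lambda x: C - sum(f(x) for f in features)]

    # Average value of each feature over the training sample.
    empirical = [sum(f(x) for x in sample) / len(sample) for f in feats]
    alphas = [1.0] * len(feats)

    def distribution():
        # Log-linear model: unnormalized score, then divide by Z.
        scores = [math.prod(a ** f(x) for a, f in zip(alphas, feats))
                  for x in events]
        z = sum(scores)
        return [s / z for s in scores]

    for _ in range(iterations):
        p = distribution()
        # Expected value of each feature under the current model.
        expected = [sum(px * f(x) for px, x in zip(p, events)) for f in feats]
        # Multiplicative update: move each weight toward agreement.
        alphas = [a * (e / m) ** (1.0 / C) if m > 0 else a
                  for a, e, m in zip(alphas, empirical, expected)]
    return alphas, dict(zip(events, distribution()))


# Hypothetical toy problem: the context h is just the preceding word.
EVENTS = [(h, t) for h in ("to", "the") for t in ("VB", "NN")]
FEATURES = [
    lambda x: 1 if x == ("to", "VB") else 0,   # f_1 from the text
    lambda x: 1 if x[1] == "NN" else 0,        # invented second feature
]
SAMPLE = ([("to", "VB")] * 3 + [("to", "NN")]
          + [("the", "NN")] * 3 + [("the", "VB")])

alphas, model = gis(EVENTS, FEATURES, SAMPLE)
for event in EVENTS:
    print(event, round(model[event], 3))
```

After training, the model's expected value for each feature matches its average over the sample, and events that no feature distinguishes (here the two "NN" events) share their probability mass equally, which is the "as uniform as possible" behaviour described above.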
Maximum entropy modeling can be combined with a Markov model, so that,
for each state, the probability of a transition to that state is
computed by a Max Ent model. The result is a Maximum Entropy
Markov Model (MEMM).
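A small Python sketch may help show how the two pieces fit together. It assumes one simple way of setting up an MEMM: a single log-linear model gives the probability of the next tag from the previous tag and the current word, and Viterbi search finds the best tag sequence. The tag set, the three features and the weights are made up for illustration; in practice the weights would be learned from training data.

```python
import math

TAGS = ["NAME", "OTHER"]          # hypothetical tag set
WEIGHTS = [2.0, 3.0, 4.0]         # hypothetical alpha values, not trained


def features(prev_tag, word, tag):
    """Binary indicator features on (previous tag, current word, tag)."""
    return [
        1.0 if word[0].isupper() and tag == "NAME" else 0.0,
        1.0 if prev_tag == "NAME" and tag == "NAME" else 0.0,
        1.0 if word.islower() and tag == "OTHER" else 0.0,
    ]


def transition_probs(prev_tag, word):
    """Max Ent distribution over the next tag, normalized per context."""
    scores = {
        t: math.prod(a ** f for a, f in zip(WEIGHTS, features(prev_tag, word, t)))
        for t in TAGS
    }
    z = sum(scores.values())
    return {t: s / z for t, s in scores.items()}


def viterbi(words):
    """Return the most probable tag sequence under the MEMM."""
    # best[tag] = (log probability, tag path) of the best path ending in tag
    best = {"START": (0.0, [])}
    for word in words:
        new_best = {}
        for prev, (logp, path) in best.items():
            for tag, p in transition_probs(prev, word).items():
                candidate = (logp + math.log(p), path + [tag])
                if tag not in new_best or candidate[0] > new_best[tag][0]:
                    new_best[tag] = candidate
        best = new_best
    return max(best.values(), key=lambda entry: entry[0])[1]


tags = viterbi(["Fred", "Smith", "resigned"])
print(tags)
```

Unlike an HMM, the features here can look at arbitrary properties of the current word (capitalization, word lists and so on) without adding states.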
Feature engineering: when using a Max Ent package, the
computational linguist's job becomes one of feature engineering --
identifying the features which will be most predictive of the tags we
are trying to assign.
Capturing Semantic Constraints in
(continued from last week)
How do we capture the semantic
constraints in a domain? Let's consider the executive succession
domain ... keeping track of people who were hired for or who left
executive jobs. In general, articles which contain information
about executive succession also talk about other stuff, but we will
only be concerned for the moment with references to executive
succession. Other information in the article will be ignored.
We are going to look for patterns like
"appointed" person "as" position
company "named" person
company "selected" person "as" position
The first problem we face in trying to make these patterns a bit more
general is that we may have different inflected forms of each
verb. A headline might have a present tense, for example
WorldCom appoints Fred Smith as vice
president for lunar phone service
The directors of WorldCom appoint Fred
so maybe we need a pattern like
("appoint" | "appoints") person "as" position
That's not very convenient; we'd like to express the pattern in
terms of the base form of the
verb. Fortunately, the Jet
English lexicon assigns a feature
structure to every inflected form of the verb, including a pa feature of the form [head = base-form], so we can write this more
[Note: this requires that one use both the Jet lexicon and the
statistical part-of-speech tagger; in this case, the tagger is
used to filter the entries
provided by the lexicon, using the Jet command pruneTags.]
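The idea of matching on the base form instead of listing inflected forms can be illustrated outside Jet. The Python sketch below is a toy and not the Jet API or lexicon format: the `LEXICON` dictionary, its entries and the helper names are invented, and each word is given a single category, which sidesteps the ambiguity that the tagger resolves in Jet.

```python
# Toy lexicon (hypothetical): every inflected form carries a feature
# structure whose pa.head is the base form of the verb.
LEXICON = {
    "appoint":    {"cat": "v",    "pa": {"head": "appoint"}},
    "appoints":   {"cat": "tv",   "pa": {"head": "appoint"}},
    "appointed":  {"cat": "tv",   "pa": {"head": "appoint"}},
    "appointing": {"cat": "ving", "pa": {"head": "appoint"}},
    "named":      {"cat": "tv",   "pa": {"head": "name"}},
    "selected":   {"cat": "tv",   "pa": {"head": "select"}},
}


def has_head(word, base_form):
    """True if the lexicon entry for word has the given base form as head."""
    entry = LEXICON.get(word.lower())
    return entry is not None and entry["pa"]["head"] == base_form


def find_verb(tokens, base_form):
    """Positions of all tokens that are inflected forms of base_form."""
    return [i for i, tok in enumerate(tokens) if has_head(tok, base_form)]


headline = "WorldCom appoints Fred Smith as vice president".split()
sentence = "The directors of WorldCom appoint Fred".split()
print(find_verb(headline, "appoint"), find_verb(sentence, "appoint"))
```

One test on the head feature replaces the whole disjunction of inflected forms, which is the convenience the pa feature provides.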
This still doesn't address the problem of verb groups which include "appoint":
Enron has appointed Fred Smith as
treasurer for the day.
Enron will appoint Fred Smith as comptroller.
We could write a verb group pattern for each verb, in which we
constrain the head of the group,
vg-appoint := [constit cat=tv pa=[head=appoint]] | [constit cat=w] vg-inf-appoint
| tv-vbe vg-ving-appoint;
vg-inf-appoint := [constit cat=v
pa=[head=appoint]] | "be" vg-ving-appoint;
vg-ving-appoint := [constit cat=ving pa=[head=appoint]];
when vg-appoint add [constit cat=vgroup-appoint];
and then create a unique verb group category, but that's clearly
inefficient. Instead we create a general verb group constituent
which has a pa property equal to the pa of the head of the
phrase, by writing a general verb group pattern which propagates
the information from the head to the phrase. This can be done in
the Jet pattern language using a variable (a symbol beginning with a
capital letter) for a feature:
vg := [constit cat=tv pa=PA-verb] | [constit cat=w] vg-inf | tv-vbe vg-ving;
vg-inf := [constit cat=v pa=PA-verb] | "be" vg-ving;
vg-ving := [constit cat=ving pa=PA-verb];
when vg add [constit cat=vgroup pa=PA-verb];
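The effect of the variable can be mimicked in a few lines of Python. This is a sketch of the propagation idea only, not of how Jet evaluates patterns: the token dictionaries, the function names, and the treatment of tv-vbe as a token category are assumptions made for the example. Each matcher returns the position after the match together with the pa of the head verb, and that pa is copied onto the new vgroup constituent.

```python
def match_ving(toks, i):
    # vg-ving := [constit cat=ving pa=PA-verb]
    if i < len(toks) and toks[i]["cat"] == "ving":
        return i + 1, toks[i]["pa"]
    return None


def match_inf(toks, i):
    # vg-inf := [constit cat=v pa=PA-verb] | "be" vg-ving
    # The longer alternative is tried first so "be appointing" gets
    # the head of "appointing" rather than of "be".
    if i < len(toks) and toks[i]["word"] == "be":
        m = match_ving(toks, i + 1)
        if m:
            return m
    if i < len(toks) and toks[i]["cat"] == "v":
        return i + 1, toks[i]["pa"]
    return None


def match_vg(toks, i):
    # vg := [constit cat=tv pa=PA-verb] | [constit cat=w] vg-inf | tv-vbe vg-ving
    if i >= len(toks):
        return None
    cat = toks[i]["cat"]
    if cat == "w":
        return match_inf(toks, i + 1)
    if cat == "tv-vbe":
        return match_ving(toks, i + 1)
    if cat == "tv":
        return i + 1, toks[i]["pa"]
    return None


def add_vgroups(toks):
    """Scan left to right and build one vgroup per match, copying the head's pa."""
    groups, i = [], 0
    while i < len(toks):
        m = match_vg(toks, i)
        if m:
            end, pa = m
            groups.append({"cat": "vgroup", "pa": pa, "span": (i, end)})
            i = end
        else:
            i += 1
    return groups


tokens = [
    {"word": "Enron", "cat": "name"},
    {"word": "will", "cat": "w"},
    {"word": "be", "cat": "v", "pa": {"head": "be"}},
    {"word": "appointing", "cat": "ving", "pa": {"head": "appoint"}},
    {"word": "Fred", "cat": "name"},
]
groups = add_vgroups(tokens)
print(groups)
```

A single set of verb group rules now serves every verb, and later patterns can test the pa of the vgroup as if it were the pa of the head verb.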
We can take an exactly parallel approach for noun groups. The Jet
lexicon assigns to each form of the noun (singular and plural) a feature
pa = [head = base-form-of-noun number = singular or plural], and
we can propagate this information in the same way from the head of the
noun group to be a feature on the noun group itself. With rare
exceptions, selectional constraints act between the heads of the noun
and verb groups.
For verb groups, we still have to write
[constit cat=vgroup pa=[head=appoint]] | [constit cat=vgroup pa=[head=name]] | ...
in order to capture the alternative (synonymous) verbs for hiring
someone. To make this neater, Jet provides a separate component
-- a semantic concept
hierarchy or ontology -- for grouping together related words.
The concept hierarchy allows us to create a tree of concepts, and to
associate one or more words with a concept. We associate the
verbs similar to 'appoint' with a concept node cAppoint in the hierarchy, and
then test the head of the verb group against that node instead of listing the verbs;
this test matches any word associated with the cAppoint node, or a node below
cAppoint in the hierarchy.
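The lookup this implies is a walk up the tree from the word's concept. The Python sketch below illustrates the idea and is not the Jet ontology component: apart from cAppoint and the verbs appoint, name and select, which come from the text, the concept names, the word assignments and the function `isa` are invented for the example.

```python
# Child concept -> parent concept (hypothetical tree).
PARENT = {
    "cAppoint": "cPersonnelChange",
    "cResign": "cPersonnelChange",
    "cPromote": "cAppoint",
}

# Base form of a word -> the concept it is associated with.
CONCEPT_OF = {
    "appoint": "cAppoint",
    "name": "cAppoint",
    "select": "cAppoint",
    "promote": "cPromote",
    "resign": "cResign",
}


def isa(word, concept):
    """True if word is attached to concept or to any node below it."""
    node = CONCEPT_OF.get(word)
    while node is not None:
        if node == concept:
            return True
        node = PARENT.get(node)   # climb one level; None at the root
    return False


print(isa("name", "cAppoint"), isa("promote", "cAppoint"), isa("resign", "cAppoint"))
```

Adding a new synonym then means adding one entry to the hierarchy, with no change to the patterns.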
With this, we are ready to put together patterns
for finding instances of appointment events. We will have
very modest goals for this example, and will only look for person -
position pairs (we will consider how to capture the organization name
in a later version). There are three patterns we look for:
- active clauses (where we just look for the VP): ...
- passive clauses: Fred was appointed as dogcatcher
- nominalizations: the appointment of Fred as dogcatcher
In this version, very little allowance is made for modifiers which may
interrupt the pattern (other than modifiers in noun groups); the only
modifier allowed is an age after a name: "Fred Smith, 42, ". Also,
this version does not impose constraints on the classes of the noun
groups, though we certainly could. This will give us some
additional recall but at some loss of precision.