### Lecture 8 Outline

March 13, 2008

Discuss term projects.

### Maximum entropy modeling

Up to now, we have looked at one simple statistical model -- HMMs.  HMMs are very simple to train and have proven quite effective for tasks such as POS tagging and name tagging.  However, if we want to refine and improve a basic HMM model based on our intuition or linguistic analysis, things get more complicated.  We can split states (for example, separate states for first and last name) but if we split too much we will run into sparse data problems.

In practice, we may (through our linguistic analysis or intuition) be able to identify a rather heterogeneous set of features which contribute in some way to a choice about the current word:  whether the current story is a sports article;  whether it's in the first sentence of the story;  whether there's a preposition anywhere in the last five words;  whether this word is on a word list;  and so forth.  Maximum entropy modeling provides one mathematically well-founded method for combining such features in a probabilistic model.

The features provide constraints on the model.  We'd like to have a probability distribution which, outside of these constraints, is as uniform as possible -- has the maximum entropy among all models which satisfy these constraints.

Suppose we have a tagging task, where we want to assign a tag t to a word w based on the 'context' h of w (the words around w, including w itself).  In other words, we want to compute p(h, t).  We will specify a set of K features in the form of binary-valued indicator functions fi(h, t).  For example,

f1(h, t) = 1 if the preceding word in h is "to" and t = "VB"
= 0 otherwise
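
Such an indicator function is easy to sketch in Python; the dict representation of the context h (and its key names) is an illustrative assumption, not part of the formalism:

```python
# Minimal sketch: represent the context h as a dict of observations.
# The key names ("prev_word", "word") are invented for this example.

def f1(h, t):
    """1 if the preceding word in h is "to" and the proposed tag is VB."""
    return 1 if h.get("prev_word") == "to" and t == "VB" else 0

print(f1({"prev_word": "to", "word": "run"}, "VB"))   # -> 1
print(f1({"prev_word": "the", "word": "run"}, "VB"))  # -> 0
```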

We will use a log-linear model, where

p(h, t) = (1/Z) product(i=1 to K) alphai ^ fi(h, t)

where alphai is the weight for feature i, and Z is a normalizing constant.  The goal of the learning procedure is to determine the values of the alphai's so that the expected value of each fi under the model

sum(h, t) p(h, t) fi(h, t)

is equal to its average value over the training set of N words (whose contexts are h1, ..., hN and whose tags are t1, ..., tN):

(1/N) sum(j=1 to N) fi(hj, tj)

The procedure for finding the alphai's is called generalized iterative scaling.  A number of packages are now available which implement this algorithm and improvements thereon.
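
As an illustration, here is a toy version of this training loop in Python.  The data and features are invented for the sketch, and real packages implement GIS far more carefully; in this toy, each event (h, t) activates exactly one feature, so the GIS constant C is 1 and each update is a simple ratio of empirical to model expectation.

```python
# Toy sketch of generalized iterative scaling (GIS); all names and data
# here are illustrative, not from any particular package.

TAGS = ["VB", "NN"]

# training set: (context, gold tag); the context is just the preceding word
train = [("to", "VB"), ("to", "VB"), ("to", "NN"), ("the", "NN")]

# one indicator feature per (preceding word, tag) pair; each event (h, t)
# then activates exactly one feature, so the GIS constant C = 1
feats = [(h, t) for h in ("to", "the") for t in TAGS]

def f(i, h, t):
    return 1 if (h, t) == feats[i] else 0

K, N = len(feats), len(train)

# empirical average of each feature: (1/N) sum_j f_i(h_j, t_j)
emp = [sum(f(i, h, t) for h, t in train) / N for i in range(K)]

alpha = [1.0] * K
for _ in range(10):
    # unnormalized weight of each event (h_j, t): product of active alphas
    w = {}
    for j in range(N):
        h = train[j][0]
        for t in TAGS:
            prod = 1.0
            for i in range(K):
                if f(i, h, t):
                    prod *= alpha[i]
            w[(j, t)] = prod
    Z = sum(w.values())
    # model expectation of each feature under p(h, t) = w / Z
    model = [sum(w[(j, t)] * f(i, train[j][0], t) for (j, t) in w) / Z
             for i in range(K)]
    # GIS update; with C = 1 the exponent 1/C disappears
    for i in range(K):
        if model[i] > 0:
            alpha[i] *= emp[i] / model[i]

def p_tag(h, t):
    """conditional p(t | h) derived from the fitted weights"""
    scores = {u: 1.0 for u in TAGS}
    for u in TAGS:
        for i in range(K):
            if f(i, h, u):
                scores[u] *= alpha[i]
    return scores[t] / sum(scores.values())

print(round(p_tag("to", "VB"), 3))  # -> 0.667, matching the empirical 2/3
```

With one indicator per (context, tag) cell, the fitted model simply reproduces the empirical conditional distribution; the interesting cases are those where features overlap and GIS must balance competing constraints.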

Maximum entropy modeling can be combined with a Markov model, so that, for each state, the probability of a transition to that state is computed by a Max Ent model.  The result is a Maximum Entropy Markov Model (MEMM).
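
Decoding with an MEMM uses the Viterbi algorithm, with the transition probabilities supplied by the per-state Max Ent models.  A minimal sketch, with invented probability tables standing in for the trained models p(tag | previous tag, context):

```python
# Sketch of MEMM decoding; the table p plays the role of the per-state
# maximum entropy model, and its numbers are invented for illustration.
import math

TAGS = ["NN", "VB"]
p = {
    ("START", "flies"): {"NN": 0.7, "VB": 0.3},
    ("NN", "like"):     {"NN": 0.1, "VB": 0.9},
    ("VB", "like"):     {"NN": 0.8, "VB": 0.2},
}

def viterbi(words, tags, p):
    # best[t] = (log prob of best path ending in tag t, that path)
    best = {"START": (0.0, [])}
    for w in words:
        new = {}
        for t in tags:
            new[t] = max(
                (lp + math.log(p[(prev, w)][t]), path + [t])
                for prev, (lp, path) in best.items()
            )
        best = new
    return max(best.values())[1]

print(viterbi(["flies", "like"], TAGS, p))  # -> ['NN', 'VB']
```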

Feature engineering:  when using a maximum entropy package, the computational linguist's job becomes one of feature engineering -- identifying the features which will be most predictive of the tags we are trying to assign.

### Capturing Semantic Constraints in Jet Patterns

(continued from last week)
How do we capture the constraints in a domain?  Let's consider the executive succession domain ... keeping track of people who were hired for or who left executive jobs.  In general, articles which contain information about executive succession also discuss other topics, but for the moment we will be concerned only with references to executive succession; other information in the article will be ignored.

We are going to look for patterns like

company "appointed" person "as" position
company "named" person "as" position
company "selected" person "as" position

The first problem we face in trying to make these patterns a bit more general is that we may have different inflected forms of each verb.  A headline might have a present tense, for example

WorldCom appoints Fred Smith as vice president for lunar phone service
The directors of WorldCom appoint Fred Smith ...

so maybe we need a pattern like

company ("appointed" | "appoint" | "appoints") person "as" position

That's not very convenient;  we'd like to express the pattern in terms of the base form of the verb.  Fortunately, the Jet English lexicon assigns a feature structure to every inflected form of the verb, including a pa feature of the form [head = base-form], so we can write this more succinctly as

company [constit cat=tv pa=[head=appoint]] person "as" position

[Note:  this requires that one use both the Jet lexicon and the statistical part-of-speech tagger;  in this case, the tagger is used to filter the entries provided by the lexicon, using the Jet command pruneTags.]

This still doesn't address the problem of verb groups which include "appoint" ...

Enron has appointed Fred Smith as treasurer for the day.
Enron will appoint Fred Smith as comptroller.

We could write a verb group pattern for each verb, in which we constrain the head of the verb group

vg-appoint := [constit cat=tv pa=[head=appoint]] | [constit cat=w] vg-inf-appoint | tv-vbe vg-ving-appoint;
vg-inf-appoint := [constit cat=v pa=[head=appoint]] | "be" vg-ving-appoint;
vg-ving-appoint := [constit cat=ving pa=[head=appoint]];

and then create a unique verb group category for each verb, but that's clearly inefficient.  Instead we create a general verb group constituent which has a pa property equal to the pa of the head of the phrase, by writing a general verb group pattern which propagates the information from the head to the phrase.  This can be done in the Jet pattern language by using a variable (a symbol beginning with a capital letter) for a feature:

vg := [constit cat=tv pa=PA-verb] | [constit cat=w] vg-inf | tv-vbe vg-ving;
vg-inf := [constit cat=v pa=PA-verb] | "be" vg-ving;
vg-ving := [constit cat=ving pa=PA-verb];

We can take an exactly parallel approach for noun groups.  The Jet lexicon assigns to each form of the noun (singular and plural) a feature
pa = [head = base-form-of-noun  number = singular or plural], and we can propagate this information in the same way from the head of the noun group to be a feature on the noun group itself.  With rare exceptions, selectional constraints act between the heads of the noun and verb groups.
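
An illustrative sketch of this idea in Python (hypothetical names, not the Jet API): once the head's pa feature has been propagated up to the group, a selectional check between a verb group and a noun group only needs to consult the two heads.

```python
# Hypothetical sketch of head-feature propagation; make_group and the
# lexicon entries are invented for illustration.

def make_group(words, head_index, lexicon):
    """Build a group whose pa feature is copied from its head word."""
    return {"words": words, "pa": lexicon[words[head_index]]}

lexicon = {
    "appointed": {"head": "appoint"},
    "directors": {"head": "director", "number": "plural"},
}

vg = make_group(["has", "appointed"], 1, lexicon)         # head: main verb
ng = make_group(["the", "new", "directors"], 2, lexicon)  # head: last noun

# a selectional constraint consults only the propagated heads
print(vg["pa"]["head"], ng["pa"]["head"])  # -> appoint director
```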

For verb groups, we still have to write