March 27, 2008

## Probabilistic Grammars

(J&M Chapter 12)

We face a quandary in building broad-coverage grammars.  As we encounter new constructs and extend the grammar by adding new productions, we find that these productions also provide alternative, unwanted parses for 'ordinary' sentences.  We can try to address this problem by adding feature-based constraints, but the resulting grammar becomes very hard to maintain.

One solution we already have considered is to perform only partial parsing -- to recognize a limited set of constituents, and not to insist on getting a complete syntactic analysis of every sentence.

An alternative solution is to employ probabilistic context-free grammars (PCFGs).  We associate a probability with each production in the grammar, with the requirement that the probabilities of the productions expanding a given symbol sum to 1.  We can then view the grammar as a probabilistic sentence generator, just as we did with Markov Models and HMMs.  We start with the sentence symbol, and at each step expand the leftmost non-terminal, choosing a production based on its probability.  In contrast to HMMs, the PCFG allows us to easily capture more global constraints:  for example, that each sentence has a subject and object.
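The generator view can be sketched in a few lines of Python. This is only an illustration: the toy grammar, its probabilities, and the names `GRAMMAR`, `check_normalized`, and `generate` are all made up for the example. Any symbol without productions is treated as a word.

```python
import random

# Hypothetical toy grammar: symbol -> list of (right-hand side, probability).
GRAMMAR = {
    "S":   [(("NP", "VP"), 1.0)],
    "NP":  [(("she",), 0.4), (("dogs",), 0.3), (("Det", "N"), 0.3)],
    "VP":  [(("V", "NP"), 0.7), (("V",), 0.3)],
    "Det": [(("the",), 1.0)],
    "N":   [(("book",), 0.5), (("dog",), 0.5)],
    "V":   [(("sees",), 0.6), (("sleeps",), 0.4)],
}

def check_normalized(grammar, tol=1e-9):
    """Verify that the productions expanding each symbol sum to 1."""
    for symbol, productions in grammar.items():
        total = sum(p for _, p in productions)
        if abs(total - 1.0) > tol:
            raise ValueError(f"productions for {symbol} sum to {total}")

def generate(grammar, start="S", rng=random):
    """Generate one sentence by repeatedly expanding the leftmost symbol."""
    pending = [start]          # symbols still to be expanded, left to right
    words = []
    while pending:
        symbol = pending.pop(0)
        if symbol not in grammar:      # no productions: treat as a word
            words.append(symbol)
            continue
        choices = [rhs for rhs, _ in grammar[symbol]]
        weights = [p for _, p in grammar[symbol]]
        rhs = rng.choices(choices, weights=weights, k=1)[0]
        pending[0:0] = rhs             # put the expansion at the front
    return words

check_normalized(GRAMMAR)
print(" ".join(generate(GRAMMAR)))
```

The toy grammar is not recursive, so generation always terminates; a recursive grammar would terminate only with some probability.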

When we parse with a PCFG, we seek the most probable analysis of a sentence (just as we sought the most probable state sequence for an HMM).  The probability of a parse is the product of the probabilities of all the productions that are used to expand symbols in the tree.
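As a small worked example of this product, the sketch below scores one tree. The tree encoding (nested tuples of the form `(label, child, ...)` with words as plain strings), the table `RULE_PROB`, and its probabilities are assumptions made for the illustration.

```python
# Hypothetical production probabilities: (symbol, right-hand side) -> probability.
RULE_PROB = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("she",)): 0.4,
    ("NP", ("dogs",)): 0.6,
    ("VP", ("V", "NP")): 0.7,
    ("VP", ("V",)): 0.3,
    ("V", ("sees",)): 1.0,
}

def tree_prob(tree):
    """Multiply the probabilities of every production used in the tree."""
    if isinstance(tree, str):          # a word contributes no production
        return 1.0
    label, *children = tree
    # The right-hand side is the sequence of child labels (or words).
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    p = RULE_PROB[(label, rhs)]
    for child in children:
        p *= tree_prob(child)
    return p

tree = ("S", ("NP", "she"), ("VP", ("V", "sees"), ("NP", "dogs")))
print(tree_prob(tree))   # 1.0 * 0.4 * 0.7 * 1.0 * 0.6, about 0.168
```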

Why does a PCFG help us in managing large grammars?  The probabilities provide a way of choosing between parses, preferring the more probable ones.  If we add an additional production to handle some very rare phenomenon we had not previously encountered, we will assign a very low probability to that production, so (roughly speaking) we will generate an analysis using that production only if an analysis using 'ordinary' (more probable) productions is not available.

Of course, if we had to assign such probabilities by hand, judging which productions are more likely, the task of grammar preparation would be almost impossible.  However, if we have a 'tree bank' -- a large set of sentences with manually-assigned parses -- we can build a PCFG in a very straightforward fashion by counting the number of times each production appears in the tree bank.  The probability of a production r(n) expanding a symbol n is

P(r(n)) = count(r(n)) / count(n)

An early tree bank for this purpose was developed by the speech group at IBM Research (Yorktown Heights);  the most influential and widely-used tree bank was developed at the University of Pennsylvania by Mitch Marcus and his colleagues.  It has led to the development of tree banks for other languages, including Japanese, German, Czech, and Chinese.
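The relative-frequency estimate P(r(n)) = count(r(n)) / count(n) can be sketched as follows. The two-tree "tree bank", the nested-tuple tree encoding, and the function names are invented for the illustration; a real tree bank would be read from annotated files.

```python
from collections import Counter

def productions(tree):
    """Yield (symbol, right-hand side) for every expanded node in a tree."""
    if isinstance(tree, str):          # words expand nothing
        return
    label, *children = tree
    yield label, tuple(c if isinstance(c, str) else c[0] for c in children)
    for child in children:
        yield from productions(child)

def estimate_pcfg(treebank):
    """Estimate P(r(n)) = count(r(n)) / count(n) from a list of trees."""
    rule_counts = Counter()
    symbol_counts = Counter()
    for tree in treebank:
        for lhs, rhs in productions(tree):
            rule_counts[(lhs, rhs)] += 1
            symbol_counts[lhs] += 1
    return {rule: n / symbol_counts[rule[0]] for rule, n in rule_counts.items()}

# Hypothetical two-sentence tree bank.
TREEBANK = [
    ("S", ("NP", "she"), ("VP", ("V", "sleeps"))),
    ("S", ("NP", "dogs"), ("VP", ("V", "see"), ("NP", "she"))),
]

for (lhs, rhs), p in sorted(estimate_pcfg(TREEBANK).items()):
    print(f"{lhs} -> {' '.join(rhs)}  {p:.3f}")
```

Because each count is divided by the count of its own left-hand symbol, the estimated probabilities for each symbol sum to 1 automatically.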

### Parsing with Probabilistic Grammars

Minimally modifying a parser to incorporate probabilities is relatively simple:  we retain a probability with each partial parse (with each node in the bottom-up parser, with each edge in the chart parser).  When a production is added to the partial parse, we multiply in the probability of that production.

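If we are factoring the parse tree, combining all partial analyses of the same word sequence as the same symbol, then we need to retain only the most probable analysis:  a less probable analysis of the same constituent can never be part of the most probable complete parse.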
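One way to see both ideas together is a CKY-style chart fill, sketched below. This is not necessarily the parser used in these notes: it assumes a grammar in Chomsky normal form (binary productions plus word productions), and the grammar, probabilities, and names are all hypothetical. Each chart cell keeps only the best probability for each symbol over a span.

```python
from collections import defaultdict

# Hypothetical CNF grammar.  LEXICAL: word -> [(symbol, probability)].
LEXICAL = {
    "she":   [("NP", 0.3)],
    "fish":  [("NP", 0.3)],
    "forks": [("NP", 0.2)],
    "eats":  [("V", 1.0)],
    "with":  [("P", 1.0)],
}
# BINARY: (parent, left child, right child, probability).
BINARY = [
    ("S", "NP", "VP", 1.0),
    ("VP", "V", "NP", 0.6),
    ("VP", "VP", "PP", 0.4),
    ("NP", "NP", "PP", 0.2),
    ("PP", "P", "NP", 1.0),
]

def viterbi_cky(words, lexical, binary, start="S"):
    """Return (probability, tree) of the most probable parse, or None."""
    n = len(words)
    best = defaultdict(dict)   # best[(i, j)][symbol] = (prob, backpointer)
    for i, word in enumerate(words):
        for sym, p in lexical.get(word, []):
            if p > best[(i, i + 1)].get(sym, (0.0, None))[0]:
                best[(i, i + 1)][sym] = (p, word)
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):
                for parent, left, right, p in binary:
                    if left in best[(i, k)] and right in best[(k, j)]:
                        # Multiply in the probability of the new production.
                        cand = p * best[(i, k)][left][0] * best[(k, j)][right][0]
                        # Retain only the most probable analysis per symbol and span.
                        if cand > best[(i, j)].get(parent, (0.0, None))[0]:
                            best[(i, j)][parent] = (cand, (k, left, right))
    if start not in best[(0, n)]:
        return None

    def build(i, j, sym):
        _, back = best[(i, j)][sym]
        if isinstance(back, str):      # word production
            return (sym, back)
        k, left, right = back
        return (sym, build(i, k, left), build(k, j, right))

    return best[(0, n)][start][0], build(0, n, start)

prob, tree = viterbi_cky("she eats fish with forks".split(), LEXICAL, BINARY)
print(prob, tree)
```

With these made-up numbers, the analysis that attaches "with forks" to the verb phrase (about 0.00432) beats the one that attaches it to "fish" (about 0.00216), so only the former survives in the cell for "eats fish with forks".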

The efficiency of probabilistic parsing can be improved by doing a best-first search (extending the most probable hypothesis) or a heuristic search.

### Extending Probabilistic Grammars:  Probabilistic Lexicalized CFGs

PCFGs assume that probabilities are independent of context.  This is not a very accurate assumption;  for example, pronouns occur much more often as the subject of a sentence than as the direct object.  We can make more accurate probability estimates by conditioning the probability of a production on additional context, for example the parent symbol in the parse tree.
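One simple way to condition on the parent, sketched here under the assumption that trees are nested tuples `(label, child, ...)` with words as strings, is to rewrite each label as `label^parent` before counting productions. The function name and the `ROOT` marker are invented for the illustration.

```python
def annotate_parents(tree, parent="ROOT"):
    """Relabel every node as 'label^parent'; words are left unchanged."""
    if isinstance(tree, str):
        return tree
    label, *children = tree
    return (f"{label}^{parent}",
            *(annotate_parents(child, label) for child in children))

tree = ("S", ("NP", "she"), ("VP", ("V", "sees"), ("NP", "them")))
print(annotate_parents(tree))
```

After this transformation, a subject NP (`NP^S`) and an object NP (`NP^VP`) are distinct symbols, so counting productions over the relabeled trees gives them separate probability distributions.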

In addition, PCFGs do not take the lexical items into account, even though we recognize that some lexical items are much more likely in some environments.  There are dependencies between words and constructions, which we are familiar with as subcategorization.  For example,  in the context  "I ___ that Fred has two dogs.", it is much more likely that the blank is filled by the word "know" or "believe", and not by "am" or "sold".  In addition, there are dependencies between the words of a sentence.  For example, in the context "I bought a ____ about computational linguistics.", it is much more likely that the blank is filled with "book" than "dog".

To handle such dependencies, we need to make productions dependent on the lexical items in the sentence.  But which lexical items?  We don't have enough data to estimate probabilities for all combinations of words.  To limit the lexical items used to condition the probabilities, we use the notion of head introduced as part of constituent structure and employed in feature grammars.  For each production, we designate one element of the right-hand side as the head of the construct;  for example, the verb in a verb phrase production.  We pass up the tree the word associated with the head element, so that there is a head word h(n) associated with each node n in the tree.  If we have a production A --> B C D, and C is the head element of the production, then h(A) = h(C) is the head of the phrase, and we say that h(B) and h(D) are dependents of the head of the phrase.
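Passing head words up the tree can be sketched as follows. The head table `HEAD_CHILD`, the fallback to the rightmost child, and the tree encoding (nested tuples in which every word sits under its own part-of-speech node) are assumptions made for the illustration, not a complete set of head rules.

```python
# Hypothetical head table: for each symbol, the child symbol that is its head.
HEAD_CHILD = {"S": "VP", "VP": "V", "NP": "N", "PP": "P"}

def is_preterminal(tree):
    """True for a node of the form (tag, word)."""
    return len(tree) == 2 and isinstance(tree[1], str)

def head_word(tree):
    """Return h(n): the word passed up from the head element of the node."""
    if is_preterminal(tree):
        return tree[1]
    label, *children = tree
    wanted = HEAD_CHILD.get(label)
    for child in children:
        if child[0] == wanted:
            return head_word(child)
    return head_word(children[-1])     # assumed fallback: rightmost child

def lexicalize(tree):
    """Relabel every node as 'label[head word]'."""
    label, *children = tree
    if is_preterminal(tree):
        return (f"{label}[{tree[1]}]", tree[1])
    return (f"{label}[{head_word(tree)}]",
            *(lexicalize(child) for child in children))

tree = ("S",
        ("NP", ("Pro", "I")),
        ("VP", ("V", "bought"),
               ("NP", ("Det", "a"), ("N", "book"))))
print(lexicalize(tree))
```

Here h(S) = h(VP) = "bought", and the heads of the two NPs, "I" and "book", are its dependents.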

We then compute the probability of the tree as the product, over all nodes n, of

P(r(n) | n, h(n)) * P(h(n) | n, h(parent(n)))

where r(n) is the production used to expand node n.

The quality of a parser is measured in terms of how many of the constituents were correctly identified.  A simple PCFG, trained on the Penn Tree Bank (1 million words), may score in the 70's (percent);  the best lexicalized PCFGs get almost 90% correct when tested on the same type of text.
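Constituent-based scoring is commonly reported as labeled precision and recall over (symbol, start, end) spans. The sketch below is one simplified version, with assumptions stated up front: trees are nested tuples in which every word sits under a part-of-speech node, part-of-speech nodes are not counted as constituents, and duplicate spans are collapsed by the use of sets. The example trees and function names are invented.

```python
def constituents(tree, start=0):
    """Return (set of (label, start, end) spans, end position) for a tree."""
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return set(), start + 1        # part-of-speech node: covers one word
    spans = set()
    pos = start
    for child in children:
        child_spans, pos = constituents(child, pos)
        spans |= child_spans
    spans.add((label, start, pos))
    return spans, pos

def precision_recall(gold, predicted):
    """Labeled constituent precision and recall of predicted against gold."""
    gold_spans, _ = constituents(gold)
    pred_spans, _ = constituents(predicted)
    correct = len(gold_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    return precision, recall

# Hypothetical gold parse (PP attached to the verb phrase) ...
gold = ("S", ("NP", ("Pro", "she")),
        ("VP", ("VP", ("V", "eats"), ("NP", ("N", "fish"))),
               ("PP", ("P", "with"), ("NP", ("N", "forks")))))
# ... and a parser output that attaches the PP to the noun phrase instead.
predicted = ("S", ("NP", ("Pro", "she")),
             ("VP", ("V", "eats"),
                    ("NP", ("NP", ("N", "fish")),
                           ("PP", ("P", "with"), ("NP", ("N", "forks"))))))

print(precision_recall(gold, predicted))   # 6 of 7 constituents match each way
```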