G22.2590 - Natural Language Processing - Spring 2003 Prof. Grishman

Lecture 6 Outline

February 27, 2003

Facing reality: problems of grammatical coverage and ambiguity

Until the mid-90's, the primary approach to developing a syntactic analyzer was to have linguists develop the necessary grammar and dictionary.  Unfortunately, we don't really know how to write a “complete” and “tight” grammar and dictionary (where tight means that it doesn't produce lots of incorrect analyses).  Beyond the “core grammar” generally discussed by linguists, there are a large number of relatively rare constructs.  If we simply add productions for all these rare constructs, they end up 'firing' when we don't want them, producing lots of bad parses.  This leads us to write ever-more-complex constraints on the grammar.

Good hand-written grammars/parsers can produce a full (but not necessarily fully correct) analysis of about 70% of sentences for newspaper text (the rate is considerably better for simpler and more uniform text, such as some types of manuals and other specialized texts).  These grammars are quite large (at least several hundred productions) and complex (many features and grammatical constraints).

To improve their performance, most systems are able to return partial analyses if a full sentence analysis cannot be obtained.  Including both full and partial analyses, such parsers get about 65% of syntactic structures correct.

To compound the problem, if we are successful in parsing, we may get a very large number of parses for a single sentence if we rely on grammatical constraints alone.  For example, the ambiguity due to attachment problems increases exponentially with the number of ambiguous modifiers.
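
For example, a chain of prepositional-phrase modifiers at the end of a sentence can attach in a number of ways that grows with the Catalan numbers: two PPs allow 2 readings, three allow 5, four allow 14, and so on.  A minimal sketch in Python that computes these counts:

    from math import comb

    def attachments(n):
        # Catalan number C(2n, n) / (n + 1): the number of distinct
        # attachment structures for a chain of n ambiguous modifiers.
        return comb(2 * n, n) // (n + 1)

    for n in range(1, 8):
        print(n, attachments(n))    # 1, 2, 5, 14, 42, 132, 429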

Partial Parsers:  Overview

How can we process unrestricted text without laboriously building such a complex grammar and parser?  One possibility is to do partial parsing (J&M 10.5):  instead of building a full parse for each sentence, we just identify a few more basic structures, such as names, noun groups, and verb groups.  The goal is to identify types of constructs which can be reliably identified based on local evidence.  The constructs listed are relatively simple --- they are not recursive and avoid attachment ambiguity.  So we may hope that they are easier to identify than full parses.  In particular, it may be convenient to state them in terms of finite-state patterns, which makes recognition very fast.  The downside is that, because they do not provide as much syntactic structure, they leave more work for subsequent (semantic) processing to do.

Finding these structures is not simply a matter of taking existing grammar rules and applying a parser bottom-up.  We are making deterministic choices, so we must try to determine the correct noun and verb groups (for example) without benefit of the constraints provided by the 'upper' syntactic structure.  This can be done by using a part-of-speech tagger to resolve part-of-speech ambiguities.  Or it can be done by writing rules for these constructs which make some limited use of local context.  (Or both.)

To the extent that we are successful we will have an analyzer which will be relatively fast (because it is making deterministic choices) and robust with respect to variations in global grammatical structure (since we are relying more heavily on local clues).

Typically, we will use a cascade (sequence) of such partial analyzers ... for example, first identifying names, then noun groups and verb groups.  Each stage will make use of the analysis performed by the previous stages.
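
As a sketch in Python (the stage names are invented for illustration; this is not Jet's API), a cascade is just a list of annotators applied in order over a shared document:

    def run_cascade(doc, stages):
        # Each stage reads the text plus the annotations added by earlier
        # stages, and appends its own annotations to the document.
        for stage in stages:
            stage(doc)
        return doc

    # For example, mirroring the order described above:
    #   run_cascade(doc, [tag_names, tag_noun_groups, tag_verb_groups])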

Partial Parsers:  Strategies ... Symbolic Learners

Each stage of analysis can be performed either by symbolic rules (typically, finite-state patterns) or by a probabilistic model (such as an HMM).  The symbolic rules can either be written by hand or learned from a corpus (e.g., Brill tagger).  Learning symbolic rules has the benefit (over probabilistic models) that the results are inspectable and in some cases can even be improved based on linguistic insights.

One of the most popular symbolic learners is Transformation-Based Learning (TBL), introduced by Brill.  In transformation-based tagging, we begin by assigning each word its most common part-of-speech.  We then apply a series of transformations to the corpus.  Each transformation has the general form
if (some condition on the previous and following tags) then change the current tag from X to Y
The condition can take the form 'the previous tag is Z', or 'the tag of the word 2 back is W', etc.  For example, a learned transformation might be:  change the tag from NN (noun) to VB (verb) if the previous tag is TO (as in 'to run').  Each transformation is tried at each position of the corpus, although it will 'fire' only at those positions meeting the condition.

Transformation-based learning creates a set of possible transformations (transformation templates).  Starting with the initial tag assignments (of the most common part of speech for each word), we try all possible transformations, and select the one which produces the maximum improvement in accuracy (measured against a hand-tagged corpus).  We apply that transformation, and then repeat the learning process.  This yields a series of transformations which can be used to tag new text.  
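
The greedy loop can be sketched in a few lines of Python.  (This is a simplified illustration: the only transformation template considered is 'if the previous tag is P, change F to T', conditions are tested against the pre-transformation tags, and efficiency is ignored.)

    def apply(tags, rule):
        # rule = (prev, frm, to): if the previous tag is prev,
        # change the current tag from frm to to.
        prev, frm, to = rule
        out = list(tags)
        for i in range(1, len(tags)):
            if tags[i] == frm and tags[i - 1] == prev:
                out[i] = to
        return out

    def errors(tags, gold):
        return sum(t != g for t, g in zip(tags, gold))

    def learn(tags, gold, tagset, max_rules=10):
        # tags: initial assignment (most common tag for each word);
        # gold: the hand-tagged reference corpus.
        rules = []
        for _ in range(max_rules):
            candidates = [(p, f, t) for p in tagset for f in tagset
                                    for t in tagset if f != t]
            best = min(candidates,
                       key=lambda r: errors(apply(tags, r), gold))
            if errors(apply(tags, best), gold) >= errors(tags, gold):
                break               # no transformation improves accuracy
            tags = apply(tags, best)
            rules.append(best)
        return rules

Even this tiny template space is cubic in the size of the tag set, which is the trade-off discussed next.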

There is a trade-off between the size of the space of possible transformations and the training time ... allowing a larger space can improve potential performance but greatly slows down training.  In practice, TBL taggers are much slower to train than HMMs.  We use both HMMs and hand-written finite-state rules in Jet, but have not incorporated any symbolic learners.

JET:  an Architecture for Cascaded Annotation

The design of Jet is typical of systems whose main model of processing is a cascade of analyzers, each of which adds some structural information to the text.  Many current systems for information extraction and similar tasks use a similar design.

Central to Jet is the notion of an annotated document.  The Document consists of a text (which normally does not change) and a set of annotations.  Each annotation consists of a type, a span (a start and end point in the document), and a set of zero or more features.  The main annotation type we have used so far is the constit annotation, which represents a syntactic constituent.  It has a feature cat, which records the syntactic category, and may have other features, such as number.
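
A minimal sketch of this data model (Python is used here for brevity; Jet itself is written in Java, and these class names are illustrative only):

    from dataclasses import dataclass, field

    @dataclass
    class Annotation:
        type: str              # e.g. "token", "constit"
        start: int             # start of the span (character offset)
        end: int               # end of the span (character offset)
        features: dict = field(default_factory=dict)   # e.g. {"cat": "n"}

    @dataclass
    class Document:
        text: str                                      # normally unchanged
        annotations: list = field(default_factory=list)

        def annotate(self, type, start, end, **features):
            self.annotations.append(Annotation(type, start, end, features))

    # doc = Document("cats sleep")
    # doc.annotate("constit", 0, 4, cat="n", number="plural")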

When we use the console to try a sentence, Jet creates a one-line document and submits it for processing.  If we are processing an entire document (as we shall do starting next week), Jet begins by identifying the portions of the document to be processed (document zoning) and dividing the document zones into sentences.
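
As a rough illustration of the sentence-splitting step (a deliberately naive sketch; Jet's actual zoner and splitter are more careful):

    import re

    # Naive rule: a sentence is any run of text ending in '.', '!' or '?'.
    # A real splitter must also handle abbreviations ("Dr."), decimal
    # points, quotations, and zones with no final punctuation.
    SENTENCE = re.compile(r"[^.!?]+[.!?]")

    def split_sentences(zone):
        return [m.group().strip() for m in SENTENCE.finditer(zone)]

    print(split_sentences("Jet tags text. It adds annotations!"))
    # ['Jet tags text.', 'It adds annotations!']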

Jet applies to each sentence a series of actions, as specified by the processSentence keyword in the parameter file.  Most of these actions specify annotators -- programs which add annotations to the document.  Examples of annotators are

tokenizer
(applied automatically to text typed at the console)  divides text into tokens, adding token annotations
lexLookup
looks up tokens in the dictionary and adds constit annotations
tagPOS
uses an HMM POS tagger to assign constit annotations with cat = Penn part-of-speech to tokens
tagJet
uses an HMM POS tagger to assign tagger annotations with cat = Penn part-of-speech, and then converts them to constit annotations with cat = Jet part-of-speech
pruneTags
uses an HMM POS tagger to assign tagger annotations with cat = Penn part-of-speech, and then uses these tags to prune constit annotations which are not compatible with the tagger tags (normally used after lexLookup)
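
Putting these together: the annotators to be run, and their order, are given by the processSentence line of the parameter file.  A hypothetical fragment (the annotator names are from the list above, but the exact syntax and the pattern-set name chunkPatterns are illustrative assumptions, not verified Jet syntax; pat(...) is explained in the next section):

    processSentence = lexLookup, pruneTags, pat(chunkPatterns)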

Jet patterns

In addition to these annotators, Jet allows you to build an annotator from a set of rules involving finite-state patterns.  Each pattern is a regular (finite-state) expression involving literals (which match specific tokens) and annotations.  If the pattern is matched, Jet can add additional annotations, as well as print messages.  We associate a pattern set name with a set of such pattern-action rules, and treat pat(pattern-set-name) as an annotator.

As an example, we look at a small set of patterns to identify noun and verb groups.
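
To convey the flavor of such rules, here is a sketch using ordinary Python regular expressions over part-of-speech tags (this is not Jet's pattern language):

    import re

    # A noun group: optional determiner, any adjectives, one or more nouns.
    # A verb group: optional auxiliary, any adverbs, one verb.
    # Tags are joined with trailing spaces so that ordinary regular
    # expressions can serve as the finite-state machinery; \b keeps a
    # match from starting in the middle of a tag.
    NG = re.compile(r"\b(?:det )?(?:adj )*(?:n )+")
    VG = re.compile(r"\b(?:aux )?(?:adv )*(?:v )")

    def chunk(tags, pattern, label):
        s = "".join(t + " " for t in tags)
        spans = []
        for m in pattern.finditer(s):
            first = s[:m.start()].count(" ")   # index of first token
            last = s[:m.end()].count(" ")      # one past the last token
            spans.append((label, first, last))
        return spans

    tags = ["det", "adj", "n", "v", "det", "n"]
    print(chunk(tags, NG, "ng"))   # [('ng', 0, 3), ('ng', 4, 6)]
    print(chunk(tags, VG, "vg"))   # [('vg', 3, 4)]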