CSCI-GA.2590 - Natural Language Processing - Spring 2013 Prof. Grishman

Lecture 5 Outline

February 26, 2013

Facing reality: problems of grammatical coverage and ambiguity

Until the mid-90's, the primary approach to developing a syntactic analyzer was to have linguists develop the necessary grammar and dictionary. Unfortunately, we don't really know how to write a “complete” and “tight” grammar and dictionary (where tight means that it doesn't produce lots of incorrect analyses).  Beyond the “core grammar” generally discussed by linguists, there are a large number of relatively rare constructs.  If we simply add productions for all these rare constructs,  they end up 'firing' when we don't want them, producing lots of bad parses.  This leads us to write ever-more-complex constraints on the grammar.

Good hand-written grammars/parsers can produce a full (but not necessarily fully correct) analysis of about 70% of sentences for newspaper text (the rate is considerably better for simpler and more uniform text, such as some types of manuals and other specialized texts).  These grammars are quite large (at least several hundred productions) and complex (many features and grammatical constraints).

To improve their performance, most systems are able to return partial analyses if a full sentence analysis cannot be obtained.  Including both full and partial analyses, such parsers get about 65% of syntactic structures correct.

To compound the problem, if we are successful in parsing, we may get a very large number of parses for a single sentence if we rely on grammatical constraints alone.  For example, the ambiguity due to attachment problems increases exponentially with the number of ambiguous modifiers.
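
To get a feel for how fast this grows, here is a small sketch in Python. It relies on a standard combinatorial fact, not on anything specific to these notes: the number of distinct binary bracketings of a sequence of n + 1 items is the n-th Catalan number, which is often used as a rough model of attachment ambiguity. The function name is made up for illustration.

```python
from math import comb

def catalan(n: int) -> int:
    """Number of distinct binary bracketings of a sequence of n + 1 items."""
    return comb(2 * n, n) // (n + 1)

# Print the count for 1 through 8 items-to-attach; the values grow rapidly.
for n in range(1, 9):
    print(n, catalan(n))
```

Under this model, eight ambiguous attachments already allow more than a thousand structures, which is why grammatical constraints alone are not enough to pick a parse.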

Two basic approaches are currently used to address these problems:  partial parsers and statistical parsers based on treebanks.  We shall consider the approach based on partial parsers over the next few weeks, and will consider statistical parsers later in the semester.

Partial Parsers:  Overview (J&M 13.5)

How can we process unrestricted text without laboriously building such a complex grammar and parser?  One possibility is to do partial parsing:  instead of building a full parse for each sentence, we just identify a few more basic structures, such as names, noun groups, and verb groups.  The goal is to identify types of constructs which can be reliably identified based on local evidence.  The constructs listed are relatively simple:  they are not recursive and avoid attachment ambiguity.  So we may hope that they are easier to identify than full parses.  In particular, it may be convenient to state them in terms of finite state patterns, which makes recognition very fast.  The downside is that, because they do not provide as much syntactic structure, they leave more work for subsequent (semantic) processing to do.

Finding these structures is not simply a matter of taking existing grammar rules and applying a parser bottom-up.  We are making deterministic choices, so we must try to determine the correct noun and verb groups (for example) without benefit of the constraints provided by the 'upper' syntactic structure.  This can be done by using a part-of-speech tagger to resolve part-of-speech ambiguities.  Or it can be done by writing rules for these constructs which make some limited use of local context. (Or both.)

To the extent that we are successful we will have an analyzer which will be relatively fast (because it is making deterministic choices) and robust with respect to variations in global grammatical structure (since we are relying more heavily on local clues).

Typically, we will use a cascade (sequence) of such partial analyzers ... for example, first identifying names, then noun groups and verb groups.  Each stage will make use of the analysis performed by the previous stages.

JET:  an Architecture for Cascaded Annotation

The design of Jet is typical of systems whose main model of processing is a cascade of analyzers, each of which adds some structural information to the text.  Many current systems for information extraction and similar tasks use a similar design.

Some of these design ideas were spread as part of the Tipster architecture, a design developed for the US Government in the mid-1990's.  Several implementations were made of this architecture, notably the GATE system developed at the Univ. of Sheffield and now widely used in Europe.  The Unstructured Information Management Architecture (UIMA), which was originally developed by IBM Research, is now available as Apache-licensed open source software.

Central to Jet is the notion of an annotated document.  The Document consists of a text (which normally does not change) and a set of annotations.  Each annotation consists of a type, a span (a start and end point in the document), and a set of zero or more features.  The main annotation type we have used so far is the constit annotation, which represents a syntactic constituent.  It has a feature cat, which records the syntactic category, and may have other features, such as number.
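
The annotated-document idea can be sketched in a few lines of Python. This is only an illustration of the data structure described above, not Jet's actual Java API; the class and method names (Annotation, Document, annotate, annotations_of_type, covered_text) are assumptions chosen for this example.

```python
from dataclasses import dataclass, field

@dataclass
class Annotation:
    type: str                      # e.g. "token" or "constit"
    start: int                     # offset of the first character of the span
    end: int                       # offset just past the last character
    features: dict = field(default_factory=dict)

@dataclass
class Document:
    text: str                      # the text itself normally does not change
    annotations: list = field(default_factory=list)

    def annotate(self, type, start, end, **features):
        """Add an annotation of the given type over text[start:end]."""
        ann = Annotation(type, start, end, dict(features))
        self.annotations.append(ann)
        return ann

    def annotations_of_type(self, type):
        return [a for a in self.annotations if a.type == type]

    def covered_text(self, ann):
        return self.text[ann.start:ann.end]

doc = Document("dogs bark")
doc.annotate("constit", 0, 4, cat="n", number="plural")
doc.annotate("constit", 5, 9, cat="v")
for ann in doc.annotations_of_type("constit"):
    print(doc.covered_text(ann), ann.features)
```

Keeping the annotations separate from the text (stand-off annotation) is what lets later stages add structure without disturbing what earlier stages recorded.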

When we use the console to try a sentence, Jet creates a 1-line document and submits it for processing.  If we are processing an entire document (as we shall starting next week), Jet begins by identifying the portions of the document to be processed (document zoning) and dividing the document zones into sentences.

Jet applies to each sentence a series of actions, as specified by the processSentence keyword in the parameter file.  Most of these actions specify annotators -- programs which add annotations to the document.  Examples of annotators are

divides text into tokens, adding token annotations
looks up tokens in the dictionary and adds constit annotations
uses an HMM POS tagger to assign constit annotations with cat = Penn part-of-speech to tokens
uses an HMM POS tagger to assign tagger annotations with cat = Penn part-of-speech, and then converts them to constit annotations with cat = Jet part-of-speech
pruneTags:  uses an HMM POS tagger to assign tagger annotations with cat = Penn part-of-speech, and then uses these tags to prune constit annotations which are not compatible with the tagger tags (normally used after lexLookup)
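
The way a sequence of annotators builds on one another can be sketched as follows. This is a toy Python illustration, not Jet code: the document is a plain dictionary, the two annotator functions only mimic the tokenizing and dictionary-lookup steps described above, and the dictionary entries are invented.

```python
import re

def tokenize(doc):
    """Add one token annotation per whitespace-delimited word."""
    for m in re.finditer(r"\S+", doc["text"]):
        doc["annotations"].append({"type": "token", "span": (m.start(), m.end())})

# Made-up dictionary: word -> possible syntactic categories.
TOY_DICTIONARY = {"the": ["det"], "dog": ["n"], "barks": ["v", "n"]}

def lex_lookup(doc):
    """Add a constit annotation for each dictionary category of each token."""
    tokens = [a for a in doc["annotations"] if a["type"] == "token"]
    for tok in tokens:
        start, end = tok["span"]
        word = doc["text"][start:end].lower()
        for cat in TOY_DICTIONARY.get(word, []):
            doc["annotations"].append(
                {"type": "constit", "span": (start, end), "cat": cat})

def process_sentence(text, actions):
    """Apply each action in order; later actions see earlier annotations."""
    doc = {"text": text, "annotations": []}
    for action in actions:
        action(doc)
    return doc

doc = process_sentence("The dog barks", [tokenize, lex_lookup])
for ann in doc["annotations"]:
    print(ann)
```

Note that the ambiguous word gets two constit annotations; a later stage (such as tag-based pruning) would be responsible for removing the one that does not fit.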

Jet patterns

In addition to these annotators, Jet allows you to build an annotator from a set of rules involving finite-state patterns.  Each pattern is a regular (finite-state) expression involving literals (which match specific tokens) and annotations.  If the pattern is matched, Jet can add additional annotations, as well as print messages.  We associate a pattern set name with a set of such pattern-action rules, and treat pat(pattern-set-name) as an annotator.

As an example, we look at a small set of patterns to identify noun and verb groups.
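
For readers without the Jet pattern file at hand, the flavor of a finite-state noun group pattern can be sketched in Python. This is not Jet's pattern language; it is a regular expression over Penn part-of-speech tags, and the particular pattern (optional determiner, any number of adjectives, one or more common nouns) is a deliberately small assumption for illustration. Verb groups could be handled by an analogous pattern.

```python
import re

# Optional determiner, any adjectives, then one or more common nouns.
# Each tag is followed by a space; (?<!\S) forces matches to start at a tag boundary.
NOUN_GROUP = re.compile(r"(?<!\S)(?:DT )?(?:JJ )*(?:NNS? )+")

def find_noun_groups(tagged):
    """Return (first, last) token index pairs, last exclusive, for each noun group."""
    tags = "".join(tag + " " for _, tag in tagged)
    groups = []
    for m in NOUN_GROUP.finditer(tags):
        first = tags[:m.start()].count(" ")   # tokens before the match
        last = tags[:m.end()].count(" ")      # tokens up to the end of the match
        groups.append((first, last))
    return groups

sentence = [("the", "DT"), ("old", "JJ"), ("dog", "NN"),
            ("chased", "VBD"), ("cats", "NNS")]
for first, last in find_noun_groups(sentence):
    print(" ".join(word for word, _ in sentence[first:last]))
```

Because the pattern is regular and the matching is greedy and deterministic, recognition takes a single left-to-right pass, which is the source of the speed noted above.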

Corpus Trained Partial Parsers (J&M 13.5.2)

Noun and verb groups have been extensively studied, and one can do quite well at recognizing them using a small set of rules.  However, to get the best coverage we may want to consider a corpus-trained strategy, as we did for part-of-speech tagging.

To do so, we want to reduce noun group finding to a token classification problem.  At first glance, you might think that we need only two classes, N (in a noun group) and O (outside a noun group), but this cannot distinguish two consecutive tokens which belong to one noun group from two consecutive tokens which belong to two adjacent noun groups.  To make this distinction, we introduce three classes, B (beginning token of a noun group), I (inside ... second or subsequent token of a noun group), and O (outside a noun group).  To handle noun and verb groups, we would need 5 classes, B-N, I-N, B-V, I-V, and O.
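
The encoding can be made concrete with a short Python sketch. The function name and the representation of a chunk as (start, end, label) with an exclusive end index are assumptions made for this example.

```python
def chunks_to_bio(n_tokens, chunks):
    """Convert chunks given as (start, end, label), end exclusive, to B/I/O tags."""
    tags = ["O"] * n_tokens
    for start, end, label in chunks:
        tags[start] = "B-" + label
        for i in range(start + 1, end):
            tags[i] = "I-" + label
    return tags

tokens = ["She", "gave", "the", "boy", "a", "book"]
# Two adjacent noun groups: [the boy] [a book]
chunks = [(0, 1, "N"), (1, 2, "V"), (2, 4, "N"), (4, 6, "N")]
for token, tag in zip(tokens, chunks_to_bio(len(tokens), chunks)):
    print(token, tag)
```

With only N and O, the last four tokens would all be tagged N and the boundary between the two noun groups would be lost; the B tag on "a" preserves it.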

The "chunker" (noun and verb group tagger) can then be implemented by training either a set of symbolic rules (typically, finite-state patterns) or a probabilistic model (such as an HMM). We already saw (from our part-of-speech tagger) how to implement a classifier using an HMM.

Transformation-Based Learning

Learning symbolic rules has the benefit (over probabilistic models) that the results are inspectable and in some cases can even be improved based on linguistic insights.

One of the most popular symbolic learners is Transformation-Based Learning, introduced by Brill.  In transformation-based tagging, we begin by assigning each word its most common part-of-speech.  We then apply a series of transformations to the corpus.  Each transformation has the general form
if (some condition on the previous and following tags) then change the current tag from X to Y
where the condition can take the form 'the previous tag is z', or 'the tag of the word 2 back is w', etc.  Each transformation is tried at each position of the corpus, although it will 'fire' only on those meeting the condition.

Transformation-based learning creates a set of possible transformations (transformation templates).  Starting with the initial tag assignments (of the most common part of speech for each word), we try all possible transformations, and select the one which produces the maximum improvement in accuracy (measured against a hand-tagged corpus).  We apply that transformation, and then repeat the learning process.  This yields a series of transformations which can be used to tag new text. 
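
The greedy learning loop can be sketched in Python as follows. This is a toy version under stated assumptions, not Brill's implementation: it uses a single template ('the previous tag is z'), tests each rule's condition against the tags as they stood before the rule was applied, and trains on a six-word made-up corpus. All function names are hypothetical.

```python
from collections import Counter

def apply_rule(tags, rule):
    """Rule (prev, old, new): change old to new wherever the previous tag is prev."""
    prev, old, new = rule
    # Simplification: conditions are tested against the tags as they were
    # before this rule was applied anywhere in the sequence.
    return [new if i > 0 and tag == old and tags[i - 1] == prev else tag
            for i, tag in enumerate(tags)]

def count_errors(tags, gold):
    return sum(t != g for t, g in zip(tags, gold))

def learn(words, gold, max_rules=10):
    # Initial assignment: each word gets its most common tag in the training data.
    by_word = {}
    for word, tag in zip(words, gold):
        by_word.setdefault(word, Counter())[tag] += 1
    tags = [by_word[word].most_common(1)[0][0] for word in words]

    tagset = sorted(set(gold))
    rules = []
    for _ in range(max_rules):
        best_rule, best_errors = None, count_errors(tags, gold)
        # Try every instantiation of the template; keep the one with fewest errors.
        for prev in tagset:
            for old in tagset:
                for new in tagset:
                    if old == new:
                        continue
                    errors = count_errors(apply_rule(tags, (prev, old, new)), gold)
                    if errors < best_errors:
                        best_rule, best_errors = (prev, old, new), errors
        if best_rule is None:          # no transformation improves accuracy
            break
        rules.append(best_rule)
        tags = apply_rule(tags, best_rule)
    return rules, tags

words = ["the", "race", "to", "race", "the", "race"]
gold = ["DT", "NN", "TO", "VB", "DT", "NN"]
rules, tags = learn(words, gold)
print(rules)
print(tags)
```

The learned rule list is ordered; to tag new text, one would assign each word its most common tag and then apply the rules in the order they were learned.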

Using the Penn Tree Bank as training data, a very successful chunker has been built using transformation-based learning (Lance Ramshaw and Mitch Marcus.  Text Chunking using Transformation-Based Learning  (WVLC 1995)).

There is a trade-off between the size of the space of possible transformations and the training time:  allowing a larger space can improve potential performance but greatly slows down training.  In practice, TBL systems are much slower to train than HMMs.  We use both HMMs and hand-written finite-state rules in Jet, but have not incorporated any symbolic learners.

Evaluating chunkers (J&M 13.5.3)

Our noun/verb group tagger ('chunker') won't be perfect;  as with our POS tagger, we need to evaluate it by hand-annotating a substantial corpus and then comparing the results of our chunker against this standard.

What metric should we use?  As discussed last week, we can map a chunk annotation into an assignment of a tag to each word.  For example, if we are tagging just noun groups, we can tag the first word of a noun group as B-NG, subsequent words of a noun group as I-NG, and words not in a noun group as O.  Then we can measure the accuracy of our tagging (correct tags / total number of words).  This can be readily extended to other chunk types, or multiple chunk types.

However, accuracy can be a deceptive measure for less frequent annotations.  If we are scoring verb group annotations, and 80% of words are not part of a verb group (i.e., get tag O), then a tagger which can't find any verb groups at all would get an 80% accuracy score.
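
The arithmetic behind this example is easy to check in Python. The ten-word tag sequence below is made up so that exactly 80% of the words carry the tag O; the B-V / I-V tag names follow the B/I/O scheme used for chunk tags.

```python
key = ["O"] * 8 + ["B-V", "I-V"]     # made-up 10-word sentence with one verb group
response = ["O"] * len(key)          # a "tagger" that never finds a verb group

accuracy = sum(k == r for k, r in zip(key, response)) / len(key)
verb_group_words_found = sum(k != "O" and k == r for k, r in zip(key, response))
print(accuracy, verb_group_words_found)
```

The per-word accuracy looks respectable even though not a single verb group word was tagged correctly.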

So for annotations which can span one or more words we typically use recall and precision measures. In comparing the key and system response, we count the total number of annotations in the key, the total number in the response, and the number of correct annotations (annotations in the response which exactly match one in the key (same starting and ending word)).  We then compute
recall = number correct / number in key
precision = number correct / number in response
F = harmonic mean of recall and precision  = 2 / (1 / recall + 1 / precision)
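
These measures can be computed from B/I/O tag sequences with a short Python sketch. This is an illustration of the definitions above, not Jet's scorer; the function names are hypothetical, and treating a stray I tag as the start of a new chunk is an assumption made so that every tag sequence can be scored.

```python
def bio_to_chunks(tags):
    """Return the set of (start, end, label) chunks, end exclusive, in a tag sequence."""
    chunks = set()
    start = label = None
    for i, tag in enumerate(list(tags) + ["O"]):   # sentinel closes a final chunk
        continues = tag.startswith("I-") and label == tag[2:]
        if label is not None and not continues:
            chunks.add((start, i, label))
            start = label = None
        # Assumption: an I tag with no open chunk of that label starts a new chunk.
        if tag.startswith("B-") or (tag.startswith("I-") and label is None):
            start, label = i, tag[2:]
    return chunks

def score(key_tags, response_tags):
    """Return (recall, precision, F), counting only exact-span matches as correct."""
    key = bio_to_chunks(key_tags)
    response = bio_to_chunks(response_tags)
    correct = len(key & response)
    recall = correct / len(key) if key else 0.0
    precision = correct / len(response) if response else 0.0
    # Algebraically the same as 2 / (1 / recall + 1 / precision), but safe at zero.
    f = 2 * precision * recall / (precision + recall) if correct else 0.0
    return recall, precision, f

key = ["B-N", "I-N", "O", "B-N", "I-N", "I-N"]
response = ["B-N", "I-N", "O", "O", "B-N", "I-N"]
print(score(key, response))
```

In this example the second noun group in the response overlaps the key but starts one word late, so it counts as incorrect under exact matching.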

To assist in evaluating chunkers and other similar annotators, Jet provides the capability to annotate entire documents and the ability to score the output for precision and recall.