G22.2590 - Natural Language Processing - Spring 2003 Prof. Grishman
Lecture 6 Outline
February 27, 2003
Facing reality: problems of grammatical coverage and ambiguity
Until the mid-90's, the primary approach to developing a syntactic analyzer
was to have linguists develop the necessary grammar and dictionary. Unfortunately,
we don't really know how to write a “complete” and “tight” grammar and dictionary
(where tight means that it doesn't produce lots of incorrect analyses).
Beyond the “core grammar” generally discussed by linguists, there are a
large number of relatively rare constructs. If we simply add productions
for all these rare constructs, they end up 'firing' when we don't
want them, producing lots of bad parses. This leads us to write ever-more-complex
constraints on the grammar.
Good hand-written grammars/parsers can produce a full (but not necessarily
fully correct) analysis of about 70% of sentences for newspaper text (the
rate is considerably better for simpler and more uniform text, such as some
types of manuals and other specialized texts). These grammars are
quite large (at least several hundred productions) and complex (many features
and grammatical constraints).
To improve their performance, most systems are able to return partial analyses
if a full sentence analysis cannot be obtained. Including both full
and partial analyses, such parsers get about 65% of syntactic structures correct.
To compound the problem, if we are successful in parsing, we may get a
very large number of parses for a single sentence if we rely on grammatical
constraints alone. For example, the ambiguity due to attachment problems
increases exponentially with the number of ambiguous modifiers.
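The growth rate can be made concrete. Under the usual binary-bracketing model of attachment (a standard observation about PP attachment, not specific to these notes), a head followed by n ambiguous modifiers has a number of attachment structures given by the nth Catalan number, which grows exponentially in n. A quick check:

```python
from math import comb

def catalan(n: int) -> int:
    """Number of binary attachment structures for n ambiguous modifiers."""
    return comb(2 * n, n) // (n + 1)

# The count explodes quickly: 1, 2, 5, 14, 42, 132, ...
for n in range(1, 7):
    print(n, catalan(n))
```

So a sentence with six ambiguous prepositional phrases already has over a hundred structurally distinct parses from attachment alone.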
Partial Parsers: Overview
How can we process unrestricted text without laboriously building such
a complex grammar and parser? One possibility is to do partial parsing
(J&M 10.5): instead of building a full parse for each sentence,
we just identify a few more basic structures, such as:
- Proper noun phrases (names)
- Noun groups (nouns with their left modifiers)
- Verb groups (auxiliaries + head verb)
The goal is to identify types of constructs which can be reliably identified
based on local evidence. The constructs listed are relatively
simple --- they are not recursive and avoid attachment ambiguity.
So we may hope that they are easier to identify than full parses.
In particular, it may be convenient to state them in terms of finite-state
patterns, which makes recognition very fast. The downside is that,
because they do not provide as much syntactic structure, they leave more
work for subsequent (semantic) processing to do.
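A noun-group recognizer of this kind can be sketched as a regular expression over part-of-speech tags. The pattern below (determiner? adjective* noun+) and the tag names are illustrative assumptions, not Jet's actual rules:

```python
import re

def noun_groups(tagged):
    """Return (start, end) token spans matching the toy pattern DT? JJ* NN+.

    tagged is a list of (word, tag) pairs; tags follow the Penn convention.
    Tag-boundary handling is simplified for the sketch.
    """
    tag_string = " ".join(tag for _, tag in tagged)
    spans = []
    for m in re.finditer(r"(?:DT )?(?:JJ )*(?:NNS?|NNP)(?: (?:NNS?|NNP))*", tag_string):
        # Map character offsets in tag_string back to token indices
        # by counting the spaces that precede / fall inside the match.
        start = tag_string[: m.start()].count(" ")
        end = start + m.group().count(" ") + 1
        spans.append((start, end))
    return spans

sent = [("the", "DT"), ("quick", "JJ"), ("brown", "JJ"), ("fox", "NN"),
        ("jumped", "VBD"), ("over", "IN"), ("the", "DT"), ("lazy", "JJ"), ("dog", "NN")]
print(noun_groups(sent))   # → [(0, 4), (6, 9)]
```

Because the pattern is finite-state, matching is a single linear pass, which is what makes this style of partial parsing fast.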
Finding these structures is not simply a matter of taking existing grammar
rules and applying a parser bottom-up. We are making deterministic
choices, so we must try to determine the correct noun and verb groups
(for example) without benefit of the constraints provided by the 'upper'
syntactic structure. This can be done by using a part-of-speech tagger
to resolve part-of-speech ambiguities. Or it can be done by writing
rules for these constructs which make some limited use of local context.
To the extent that we are successful, we will have an analyzer which will
be relatively fast (because it is making deterministic choices) and robust
with respect to variations in global grammatical structure (since we are
relying more heavily on local clues).
Typically, we will use a cascade (sequence) of such partial analyzers
... for example, first identifying names, then noun groups and verb groups.
Each stage will make use of the analysis performed by the previous stage.
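The cascade idea can be sketched as simple function composition: each stage reads the annotations produced so far and adds its own. The stage functions below are deliberately trivial stand-ins, invented for illustration:

```python
def find_names(tokens, annotations):
    """Toy name finder: mark capitalized tokens (except at sentence start)."""
    for i, tok in enumerate(tokens):
        if i > 0 and tok[0].isupper():
            annotations.append(("name", i, i + 1))
    return annotations

def find_noun_groups(tokens, annotations):
    """Toy noun-group finder that respects spans already marked as names."""
    named = {i for (t, s, e) in annotations if t == "name" for i in range(s, e)}
    for i, tok in enumerate(tokens):
        if tok in ("cat", "dog") and i not in named:   # stand-in lexical test
            annotations.append(("ngroup", i, i + 1))
    return annotations

def cascade(tokens, stages):
    """Run each stage in order; later stages see earlier annotations."""
    annotations = []
    for stage in stages:
        annotations = stage(tokens, annotations)
    return annotations

print(cascade("the dog saw Fred".split(), [find_names, find_noun_groups]))
```

The point of the ordering is visible in `find_noun_groups`: it consults the name annotations added by the earlier stage before making its own deterministic choices.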
Partial Parsers: Strategies ... Symbolic Learners
Each stage of analysis can be performed either by symbolic rules (typically,
finite-state patterns) or by a probabilistic model (such as an HMM). The
symbolic rules can either be written by hand or learned from a corpus (e.g.,
Brill tagger). Learning symbolic rules has the benefit (over probabilistic
models) that the results are inspectable and in some cases can even be improved
based on linguistic insights.
One of the most popular symbolic learners is Transformation-Based Learning,
introduced by Brill. In transformation-based tagging, we begin by
assigning each word its most common part-of-speech. We then apply
a series of transformations to the corpus. Each transformation has
the general form
if (some condition on the previous and following tags) then change
the current tag from X to Y
The condition can take the form 'the previous tag is z', or 'the tag of
the word 2 back is w', etc. Each transformation is tried at each position
of the corpus, although it will 'fire' only at those positions meeting the condition.
Transformation-based learning creates a set of possible transformations
(transformation templates). Starting with the initial tag assignments
(of the most common part of speech for each word), we try all possible transformations,
and select the one which produces the maximum improvement in accuracy (measured
against a hand-tagged corpus). We apply that transformation, and then
repeat the learning process. This yields a series of transformations
which can be used to tag new text.
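One greedy round of this learning loop can be sketched as follows. The tiny hand-tagged corpus, the tagset, and the single template ("if the previous tag is prev, change the current tag from x to y") are invented for illustration; a real TBL tagger uses many templates and a large corpus:

```python
from collections import Counter

# Hand-tagged corpus: (word, correct tag).  Invented for illustration.
corpus = [("the", "DET"), ("can", "N"), ("rusted", "V"),
          ("the", "DET"), ("dog", "N"), ("can", "V"), ("run", "V")]

# Initial state: assign each word its most common tag in the corpus.
most_common = {}
for word in {w for w, _ in corpus}:
    most_common[word] = Counter(t for w, t in corpus if w == word).most_common(1)[0][0]

words = [w for w, _ in corpus]
gold = [t for _, t in corpus]
tags = [most_common[w] for w in words]

def apply_rule(tags, rule):
    """Rule (prev, x, y): if the previous tag is prev, change the current
    tag from x to y.  Conditions are tested against the unmodified tags."""
    prev, x, y = rule
    out = list(tags)
    for i in range(1, len(tags)):
        if tags[i] == x and tags[i - 1] == prev:
            out[i] = y
    return out

def accuracy(tags):
    return sum(a == b for a, b in zip(tags, gold))

# One round of greedy learning: try every instantiation of the template
# and keep the rule giving the maximum improvement against the gold tags.
tagset = set(gold)
best_rule, best_score = None, accuracy(tags)
for prev in tagset:
    for x in tagset:
        for y in tagset:
            score = accuracy(apply_rule(tags, (prev, x, y)))
            if score > best_score:
                best_rule, best_score = (prev, x, y), score
print(best_rule, best_score)
```

Here the initial assignment mistags the second "can" as a noun; the learner discovers the rule "change N to V after an N", which corrects it. In the full algorithm this loop repeats, relearning over the corrected corpus, to produce the ordered series of transformations.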
There is a trade-off between the size of the space of possible transformations
and the training time ... allowing a larger space can improve potential
performance but greatly slows down training. In practice, TBLs are
much slower to train than HMMs. We use both HMMs and hand-written
finite-state rules in Jet, but have not incorporated any symbolic learners.
JET: an Architecture for Cascaded Annotation
The design of Jet is typical of systems whose main model of processing
is a cascade of analyzers, each of which adds some structural information
to the text. Many current systems for information extraction and similar
tasks use a similar design.
Central to Jet is the notion of an annotated document. The
Document consists of a text (which normally does not change) and a set of
annotations. Each annotation consists of a type, a span (a start
and end point in the document), and a set of zero or more features. The
main annotation type we have used so far is the constit annotation,
which represents a syntactic constituent. It has a feature cat, which
records the syntactic category, and may have other features, such as number.
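The annotated-document model just described can be sketched in a few lines. The class names mirror the description above, but this rendering (in Python rather than Jet's Java) is an assumption for illustration:

```python
from dataclasses import dataclass, field

@dataclass
class Annotation:
    type: str          # e.g. "token" or "constit"
    span: tuple        # (start, end) positions in the document text
    features: dict = field(default_factory=dict)   # e.g. {"cat": "n"}

@dataclass
class Document:
    text: str          # the text itself normally does not change
    annotations: list = field(default_factory=list)

    def add(self, ann):
        self.annotations.append(ann)

    def annotations_of_type(self, ann_type):
        return [a for a in self.annotations if a.type == ann_type]

doc = Document("Cats sleep.")
doc.add(Annotation("constit", (0, 4), {"cat": "n", "number": "plural"}))
print(doc.annotations_of_type("constit")[0].features["cat"])   # → n
```

Because each annotator only appends annotations and never rewrites the text, stages in the cascade can be developed and tested independently.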
When we use the console to try a sentence, Jet creates a 1-line document
and submits it for processing. If we are processing an entire document
(as we shall starting next week), Jet begins by identifying the portions
of the document to be processed (document zoning) and dividing the document
zones into sentences.
Jet applies to each sentence a series of actions, as specified by the processSentence
keyword in the parameter file. Most of these actions specify
annotators -- programs which add annotations to the document. Examples
of annotators are:
- tokenization (applied automatically to text typed at the console): divides the
text into tokens, adding token annotations
- lexLookup: looks up tokens in the dictionary and adds constit annotations
- an HMM POS tagger which assigns constit annotations
with cat = Penn part-of-speech to tokens
- an HMM POS tagger which assigns tagger annotations
with cat = Penn part-of-speech, and then converts them to constit
annotations with cat = Jet part-of-speech
- an HMM POS tagger which assigns tagger annotations
with cat = Penn part-of-speech, and then uses these tags to prune
constit annotations which are not compatible with the tagger
tags (normally used after lexLookup)
In addition to these annotators, Jet allows you to build an annotator from
a set of rules involving finite-state patterns. Each pattern is a
regular (finite-state) expression involving literals (which match specific
tokens) and annotations. If the pattern is matched, Jet can add additional
annotations, as well as print messages. We associate a pattern set
name with a set of such pattern-action rules, and treat pat(pattern-set-name)
as an annotator.
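The effect of such a pattern-action rule can be mimicked in a few lines: match a sequence of constit categories and, on success, add a covering annotation. The flat-tuple representation and the rule itself are assumptions for illustration, not Jet's actual pattern language:

```python
# Annotations as (type, start, end, features) tuples over token positions.
anns = [("constit", 0, 1, {"cat": "det"}),
        ("constit", 1, 2, {"cat": "adj"}),
        ("constit", 2, 3, {"cat": "n"})]

def apply_pattern(anns, cat_seq, new_cat):
    """If consecutive constit annotations match cat_seq, add a covering
    constit annotation with category new_cat (the rule's action)."""
    cats = [(a[1], a[2], a[3]["cat"]) for a in anns if a[0] == "constit"]
    added = []
    for i in range(len(cats) - len(cat_seq) + 1):
        window = cats[i : i + len(cat_seq)]
        if [c for _, _, c in window] == list(cat_seq):
            added.append(("constit", window[0][0], window[-1][1], {"cat": new_cat}))
    return anns + added

result = apply_pattern(anns, ["det", "adj", "n"], "ngroup")
print(result[-1])   # → ('constit', 0, 3, {'cat': 'ngroup'})
```

A pattern set is then just an ordered collection of such rules, and running the set over the document is itself one more annotator in the cascade.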
As an example, we look at a small set of patterns to identify noun and
verb groups.