G22.2590 - Natural Language Processing - Spring 2008 Prof. Grishman
Lecture 6 Outline
February 28, 2008
Facing reality: problems of grammatical coverage and ambiguity
Until the mid-90's, the primary approach to developing a syntactic
analyzer was to have linguists develop the necessary grammar and
dictionary. Unfortunately, we don't really know how to write a
“complete” and “tight” grammar and dictionary (where tight
means that it doesn't produce lots of incorrect analyses). Beyond
the “core grammar” generally discussed by linguists, there are a
large number of relatively rare constructs. If we simply add
productions for all these rare constructs, they end up 'firing'
when we don't
want them, producing lots of bad parses. This leads us to write
ever-more-complex constraints on the grammar.
Good hand-written grammars/parsers can produce a full (but not
necessarily fully correct) analysis of about 70% of sentences for
newspaper text (the rate is considerably better for simpler and more
uniform text, such as some types of manuals and other specialized
texts). These grammars are
quite large (at least several hundred productions) and complex (many
and grammatical constraints).
To improve their performance, most systems are able to return partial analyses
if a full sentence analysis cannot be obtained. Including both full
and partial analyses, such parsers get about 65% of syntactic
To compound the problem, if we are successful in parsing, we may get
a very large number of parses for a single sentence if we rely on
grammatical constraints alone. For example, the ambiguity due to
attachment problems increases exponentially with the number of
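To give a feel for this growth: a standard way to model attachment ambiguity is to count the distinct binary bracketings of a sequence of constituents, which is given by the Catalan numbers. The sketch below simply computes those numbers; equating each bracketing with one parse is a simplifying assumption, not a claim about any particular grammar.

```python
from math import comb

def catalan(n: int) -> int:
    """Number of distinct binary bracketings of a sequence of n + 1 items."""
    return comb(2 * n, n) // (n + 1)

# Print the count for 1 to 8 attachable items (illustrative range only).
for n in range(1, 9):
    print(n, catalan(n))
```

The counts grow rapidly (1, 2, 5, 14, 42, ...), which is why grammatical constraints alone leave so many candidate analyses.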
Two basic approaches are currently used to address these
problems: partial parsers and statistical parsers based on
treebanks. We shall consider the approach based on partial
parsers over the next few weeks, and will consider statistical parsers
later in the semester.
Partial Parsers: Overview
How can we process unrestricted text without laboriously building such
a complex grammar and parser? One possibility is to do partial
parsing (J&M 10.5): instead of building a full parse for each
sentence, we just identify a few more basic structures, such as
- Proper noun phrases (names)
- Noun groups (nouns with their left modifiers)
- Verb groups (auxiliaries + head verb)
The goal is to identify types of constructs which can be reliably
identified based on local evidence. The constructs listed are
simple: they are not recursive and avoid attachment ambiguities.
So we may hope that they are easier to identify than full parses.
In particular, it may be convenient to state them in terms of finite-state
patterns, which makes recognition very fast. The downside is that,
because they do not provide as much syntactic structure, they leave
more work for subsequent (semantic) processing to do.
Finding these structures is not simply a matter of taking existing
grammar rules and applying a parser bottom-up. We are making deterministic
choices, so we must try to determine the correct noun and verb
groups (for example) without benefit of the constraints provided by the
'upper' syntactic structure. This can be done by using a
part-of-speech tagger to resolve part-of-speech ambiguities. Or
it can be done by writing rules for these constructs which make some
limited use of local context.
To the extent that we are successful we will have an analyzer which
will be relatively fast (because it is making deterministic choices)
and robust with respect to variations in global grammatical structure
(since we are
relying more heavily on local clues).
Typically, we will use a cascade (sequence) of such partial
analyzers: for example, first identifying names, then noun groups
and verb groups. Each stage will make use of the analysis
performed by the previous stages.
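A cascade of this kind can be sketched as a list of functions, each taking the text plus the annotations produced so far and returning an extended list. Everything here is hypothetical: the function names, the (type, start, end) tuples, and the crude regular expressions are for illustration only and are not how any real analyzer identifies names or noun groups.

```python
import re
from typing import Callable, List, Tuple

# An annotation in this sketch is just (type, start, end).
Annotation = Tuple[str, int, int]
Stage = Callable[[str, List[Annotation]], List[Annotation]]

def find_names(text: str, annotations: List[Annotation]) -> List[Annotation]:
    """Stage 1: mark runs of two or more capitalized words as names."""
    found = [("name", m.start(), m.end())
             for m in re.finditer(r"[A-Z][a-z]+(?: [A-Z][a-z]+)+", text)]
    return annotations + found

def find_noun_groups(text: str, annotations: List[Annotation]) -> List[Annotation]:
    """Stage 2: determiner + lowercase word, plus every name found by stage 1."""
    groups = [("ngroup", m.start(), m.end())
              for m in re.finditer(r"\b(?:the|a|an) [a-z]+", text)]
    groups += [("ngroup", start, end)
               for kind, start, end in annotations if kind == "name"]
    return annotations + groups

def run_cascade(text: str, stages: List[Stage]) -> List[Annotation]:
    """Apply each stage in order, passing along the accumulated annotations."""
    annotations: List[Annotation] = []
    for stage in stages:
        annotations = stage(text, annotations)
    return annotations

result = run_cascade("the president met Fred Smith", [find_names, find_noun_groups])
for kind, start, end in result:
    print(kind, start, end)
```

The point of the design is that the second stage never re-examines the raw text for names; it relies on what the first stage already decided.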
JET: an Architecture for Cascaded Annotation
The design of Jet is typical of systems whose main model of processing
is a cascade of analyzers, each of which adds some structural
information to the text. Many current systems for information extraction and
similar tasks use a similar design.
Some of these design ideas were spread as part of the Tipster
architecture, a design developed for the US Government in the
mid-1990's. Several implementations were made of this
architecture, notably the GATE
system developed at the Univ. of Sheffield and now widely used in
Europe. The Unstructured
Information Management Architecture (UIMA) now being promoted by
IBM also incorporates some of these ideas.
Central to Jet is the notion of an annotated document.
The Document consists of a text (which normally does not change)
and a set of annotations. Each annotation consists of a
type, a span (a start and end point in the document), and a set of zero
or more features. The main annotation type we have used so far is
the constit annotation, which represents a syntactic
constituent. It has a feature cat, which
records the syntactic category, and may have other features, such as number.
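The annotated-document model can be pictured with a small data structure. This is only a Python sketch of the idea described above (type, span, features); the class and method names are assumptions made for illustration and do not reproduce Jet's actual Java classes.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class Annotation:
    ann_type: str                 # e.g. "constit" or "token"
    start: int                    # span start (character offset)
    end: int                      # span end (exclusive)
    features: Dict[str, str] = field(default_factory=dict)

@dataclass
class Document:
    text: str                     # normally left unchanged
    annotations: List[Annotation] = field(default_factory=list)

    def annotate(self, ann_type: str, start: int, end: int, **features: str) -> Annotation:
        """Add an annotation over text[start:end] and return it."""
        ann = Annotation(ann_type, start, end, dict(features))
        self.annotations.append(ann)
        return ann

    def annotations_at(self, start: int, ann_type: Optional[str] = None) -> List[Annotation]:
        """All annotations beginning at `start`, optionally filtered by type."""
        return [a for a in self.annotations
                if a.start == start and (ann_type is None or a.ann_type == ann_type)]

    def covered_text(self, ann: Annotation) -> str:
        return self.text[ann.start:ann.end]

doc = Document("cats sleep")
doc.annotate("constit", 0, 4, cat="n", number="plural")
doc.annotate("constit", 5, 10, cat="v")
for a in doc.annotations:
    print(a.ann_type, doc.covered_text(a), a.features)
```

Because annotations only point into the text, several analyzers can add competing or overlapping annotations without ever modifying the text itself.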
When we use the console to try a sentence, Jet creates a 1-line
document and submits it for processing. If we are processing an
entire document (as we shall be doing starting next week), Jet begins by
identifying the portions
of the document to be processed (document zoning) and dividing the
zones into sentences.
Jet applies to each sentence a series of actions, as specified by the processSentence
keyword in the parameter file. Most of these actions specify
annotators -- programs which add annotations to the document.
Examples of annotators are
|divides text into tokens, adding token annotations
|looks up tokens in the dictionary and adds constit annotations
|uses an HMM POS tagger to assign constit annotations with cat = Penn part-of-speech to tokens
|uses an HMM POS tagger to assign tagger annotations with cat = Penn part-of-speech, and then converts them to constit annotations with cat = Jet part-of-speech
|uses an HMM POS tagger to assign tagger annotations with cat = Penn part-of-speech, and then uses these tags to prune constit annotations which are not compatible with the tags (normally used after lexLookup)
In addition to these annotators, Jet allows you to build an annotator
from a set of rules
involving finite-state patterns. Each pattern is a
regular (finite-state) expression involving literals (which match
tokens) and annotations. If the pattern is matched, Jet can add
annotations, as well as print messages. We associate a pattern
name with a set of such pattern-action rules, and treat pat(pattern-set-name)
as an annotator.
As an example, we look at a small set of
patterns to identify noun and verb groups.
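To illustrate the general idea of finite-state patterns over annotations (and not Jet's own pattern language, whose syntax is not shown here), the following sketch matches regular expressions against a sequence of Penn part-of-speech tags. The two patterns are deliberately minimal assumptions: a noun group is an optional determiner, any adjectives, and one or more common nouns; a verb group is an optional modal followed by one or more verb forms.

```python
import re

# Patterns over a space-terminated tag string; both are illustrative simplifications.
PATTERNS = [
    ("ngroup", re.compile(r"(?:DT )?(?:JJ )*(?:NNS? )+")),
    ("vgroup", re.compile(r"(?:MD )?(?:VB[DZPNG]? )+")),
]

def find_groups(tagged):
    """tagged: list of (word, tag). Returns a list of (label, [words])."""
    starts, encoded = [], ""
    for _, tag in tagged:
        starts.append(len(encoded))      # character offset of this tag
        encoded += tag + " "
    index_at = {off: i for i, off in enumerate(starts)}
    index_at[len(encoded)] = len(tagged)

    groups, i = [], 0
    while i < len(tagged):
        for label, pattern in PATTERNS:
            m = pattern.match(encoded, starts[i])   # anchored at token i
            if m:
                j = index_at[m.end()]               # first token after the match
                groups.append((label, [w for w, _ in tagged[i:j]]))
                i = j
                break
        else:
            i += 1                                  # no pattern starts here
    return groups

tagged = [("the", "DT"), ("old", "JJ"), ("dog", "NN"), ("has", "VBZ"),
          ("been", "VBN"), ("chasing", "VBG"), ("cats", "NNS"),
          ("in", "IN"), ("the", "DT"), ("garden", "NN")]
for label, words in find_groups(tagged):
    print(label, " ".join(words))
```

Each match is taken greedily and deterministically from left to right, which is what makes this style of recognition fast.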
Corpus Trained Partial Parsers
Noun and verb groups have been extensively studied, and one can do
quite well at recognizing them using a small set of rules.
However, to get the best coverage we may want to consider a
corpus-trained strategy, as we did for part-of-speech tagging.
To do so, we want to reduce noun group finding to a token
classification problem. At first glance, you might think that we
need only two classes, N (in a noun group) and O (outside a noun
group), but this scheme cannot distinguish two consecutive tokens which
belong to one noun group from two which belong to consecutive noun groups. To
make this distinction, we introduce three classes, B (beginning token
of a noun group), I (inside ... second or subsequent token of a noun
group), and O (outside a noun group). To handle noun and verb
groups, we would need 5 classes, B-N, I-N, B-V, I-V, and O.
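The B/I/O encoding can be made concrete with a short sketch that converts noun-group spans to per-token classes and back. The function names and the use of (start, end) token spans with an exclusive end are assumptions made for this illustration.

```python
def to_bio(n_tokens, groups):
    """groups: non-overlapping (start, end) token spans, end exclusive."""
    tags = ["O"] * n_tokens
    for start, end in groups:
        tags[start] = "B"
        for i in range(start + 1, end):
            tags[i] = "I"
    return tags

def from_bio(tags):
    """Recover (start, end) spans from a B/I/O tag sequence."""
    groups, start = [], None
    for i, tag in enumerate(tags):
        if tag in ("B", "O") and start is not None:
            groups.append((start, i))    # close the group that was open
            start = None
        if tag == "B" or (tag == "I" and start is None):
            start = i                    # tolerate an I with no preceding B
    if start is not None:
        groups.append((start, len(tags)))
    return groups

tokens = ["gave", "the", "boy", "a", "book"]
groups = [(1, 3), (3, 5)]                # [the boy] [a book]
tags = to_bio(len(tokens), groups)
print(list(zip(tokens, tags)))
print(from_bio(tags))
```

With only N and O, "the boy a book" would be tagged N N N N and the boundary between the two adjacent groups would be lost; the B on "a" preserves it.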
The "chunker" (noun and verb group tagger) can then be implemented by
training either a set of symbolic rules
(typically, finite-state patterns) or a probabilistic model (such as
an HMM). We already saw (from our part-of-speech tagger) how to
implement a classifier using an HMM.
Using symbolic rules has the benefit (over probabilistic models) that the results are
inspectable and in some cases can even be improved based on linguistic insight.
One of the most popular symbolic learners is Transformation-Based
Learning, introduced by Brill. In transformation-based tagging,
we begin by
assigning each word its most common part-of-speech. We then apply
a series of transformations to the corpus. Each transformation has
the general form
if (some condition on the previous and following tags) then
change the current tag from X to Y
where the condition can take the form 'the previous tag is z', or 'the tag of
the word 2 back is w', etc. Each transformation is tried at each
position of the corpus, although it will 'fire' only on those meeting
the condition.
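Applying a single transformation can be sketched as follows. The representation (a named tuple with an offset and a context tag) is an assumption for illustration, covering only conditions of the form 'the tag at a fixed offset is z'. The sketch tests conditions against the tags as they stood before the pass; other application orders are possible.

```python
from typing import List, NamedTuple

class Transformation(NamedTuple):
    from_tag: str      # X: tag to be changed
    to_tag: str        # Y: replacement tag
    offset: int        # -1 = previous tag, -2 = tag two back, +1 = following tag
    context_tag: str   # tag required at position i + offset

def apply_transformation(tags: List[str], rule: Transformation) -> List[str]:
    """Try the rule at every position; it fires only where the condition holds."""
    result = list(tags)
    for i, tag in enumerate(tags):
        j = i + rule.offset
        if tag == rule.from_tag and 0 <= j < len(tags) and tags[j] == rule.context_tag:
            result[i] = rule.to_tag
    return result

# "He is expected to race tomorrow", with "race" initially tagged as a noun.
initial = ["PRP", "VBZ", "VBN", "TO", "NN", "NN"]
rule = Transformation(from_tag="NN", to_tag="VB", offset=-1, context_tag="TO")
print(apply_transformation(initial, rule))
```

Here the rule fires only on "race", whose previous tag is TO; "tomorrow" keeps its NN tag.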
Transformation-based learning creates a set of possible transformations
(transformation templates). Starting with the initial tag
assignments (of the most common part of speech for each word), we try
all possible transformations, and select the one which produces the
maximum improvement in accuracy (measured against a hand-tagged
corpus). We apply that transformation, and then repeat the
learning process. This yields a series of transformations which
can be used to tag new text.
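The greedy learning loop just described can be sketched in a few lines. This is a toy version under several assumptions: the corpus is one flat tag sequence, only two templates are used (previous tag and following tag), candidate rules are proposed only at positions that are currently wrong, and ties are broken by sort order. Real implementations differ in all of these details.

```python
def apply_rule(tags, rule):
    """rule = (from_tag, to_tag, offset, context_tag)."""
    from_tag, to_tag, offset, context = rule
    out = list(tags)
    for i, tag in enumerate(tags):
        j = i + offset
        if tag == from_tag and 0 <= j < len(tags) and tags[j] == context:
            out[i] = to_tag
    return out

def n_correct(tags, gold):
    return sum(t == g for t, g in zip(tags, gold))

def candidate_rules(tags, gold, offsets=(-1, 1)):
    """Instantiate the templates at every position that is currently mis-tagged."""
    rules = set()
    for i, (t, g) in enumerate(zip(tags, gold)):
        if t != g:
            for off in offsets:
                j = i + off
                if 0 <= j < len(tags):
                    rules.add((t, g, off, tags[j]))
    return rules

def learn(initial, gold, max_rules=10):
    """Repeatedly pick the rule with the largest accuracy gain and apply it."""
    tags, learned = list(initial), []
    while len(learned) < max_rules:
        base = n_correct(tags, gold)
        best_rule, best_gain = None, 0
        for rule in sorted(candidate_rules(tags, gold)):
            gain = n_correct(apply_rule(tags, rule), gold) - base
            if gain > best_gain:
                best_rule, best_gain = rule, gain
        if best_rule is None:          # no rule improves accuracy: stop
            break
        tags = apply_rule(tags, best_rule)
        learned.append(best_rule)
    return learned, tags

# "He is expected to race tomorrow . The race ended ."
initial = ["PRP", "VBZ", "VBN", "TO", "NN", "NN", ".", "DT", "NN", "VBD", "."]
gold    = ["PRP", "VBZ", "VBN", "TO", "VB", "NN", ".", "DT", "NN", "VBD", "."]
learned, final = learn(initial, gold)
print(learned)
print(final == gold)
```

Every candidate is scored by re-tagging the whole corpus, which hints at why training cost grows quickly with the number of templates.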
Using the Penn Tree Bank as training data, a very successful chunker
has been built using transformation-based learning (Lance Ramshaw and
Mitch Marcus, "Text Chunking using Transformation-Based Learning",
WVLC 1995).
There is a trade-off between the size of the space of possible
transformations and the training time: allowing a larger space can
improve performance but greatly slows down training. In practice, TBLs are
much slower to train than HMMs. We use both HMMs and hand-written
finite-state rules in Jet, but have not incorporated any symbolic learners.