CSCI-GA.2590 - Natural Language Processing - Spring 2013 Prof. Grishman
Lecture 5 Outline
February 26, 2013
Facing reality: problems of grammatical coverage and ambiguity
Until the mid-90's, the primary approach to developing a syntactic
analyzer was to have linguists develop the necessary grammar and
dictionary. Unfortunately, we don't really know how to write a
“complete” and “tight” grammar and dictionary (where tight
means that it doesn't produce lots of incorrect analyses). Beyond
the “core grammar” generally discussed by linguists, there are a
large number of relatively rare constructs. If we simply add
productions for all these rare constructs, they end up 'firing'
when we don't
want them, producing lots of bad parses. This leads us to write
ever-more-complex constraints on the grammar.
Good hand-written grammars/parsers can produce a full (but not
necessarily fully correct) analysis of about 70% of sentences for
newspaper text (the rate is considerably better for simpler and more
uniform text, such as some types of manuals and other specialized
texts). These grammars are
quite large (at least several hundred productions) and complex (with many
grammatical constraints).
To improve their performance, most systems are able to return partial analyses
if a full sentence analysis cannot be obtained. Including both full
and partial analyses, such parsers get about 65% of syntactic
To compound the problem, if we are successful in parsing, we may get
a very large number of parses for a single sentence if we rely on
grammatical constraints alone. For example, the ambiguity due to
attachment problems increases exponentially with the number of
Two basic approaches are currently used to address these
problems: partial parsers and statistical parsers based on
treebanks. We shall consider the approach based on partial
parsers over the next few weeks, and will consider statistical parsers
later in the semester.
Partial Parsers: Overview (J&M 13.5)
How can we process unrestricted text without laboriously building such
a complex grammar and parser? One possibility is to do partial
parsing: instead of building a full parse for each
sentence, we just identify a few more basic structures, such as
- Proper noun phrases (names)
- Noun groups (nouns with their left modifiers)
- Verb groups (auxiliaries + head verb)
The goal is to identify types of constructs which can be reliably
identified based on local evidence. The constructs listed are
simple: they are not recursive and avoid attachment problems,
so we may hope that they are easier to identify than full parses.
In particular, it may be convenient to state them in terms of finite-state
patterns, which makes recognition very fast. The downside is that,
because they do not provide as much syntactic structure, they leave
more work for subsequent (semantic) processing to do.
Finding these structures is not simply a matter of taking existing
grammar rules and applying a parser bottom-up. We are making deterministic
choices, so we must try to determine the correct noun and verb
groups (for example) without benefit of the constraints provided by the
'upper' syntactic structure. This can be done by using a
part-of-speech tagger to resolve part-of-speech ambiguities, or
it can be done by writing rules for these constructs which make some
limited use of local context.
To the extent that we are successful we will have an analyzer which
will be relatively fast (because it is making deterministic choices)
and robust with respect to variations in global grammatical structure
(since we are
relying more heavily on local clues).
Typically, we will use a cascade (sequence) of such partial
analyzers: for example, first identifying names, then noun groups
and verb groups. Each stage will make use of the analysis
performed by the previous stages.
JET: an Architecture for Cascaded Annotation
The design of Jet is typical of systems whose main model of processing
is a cascade of analyzers, each of which adds some structural information
to the text. Many current systems for information extraction and
similar tasks use a similar design.
Some of these design ideas were spread as part of the Tipster
architecture, a design developed for the US Government in the
mid-1990's. Several implementations were made of this
architecture, notably the GATE
system developed at the Univ. of Sheffield and now widely used in
Europe. The Unstructured
Information Management Architecture (UIMA), which was originally
developed by IBM Research, is now available as Apache-licensed
open source software.
Central to Jet is the notion of an annotated document.
The Document consists of a text (which normally does not change)
and a set of annotations. Each annotation consists of a
type, a span (a start and end point in the document), and a set of zero
or more features. The main annotation type we have used so far is
the constit annotation, which represents a syntactic
constituent. It has a feature cat, which
records the syntactic category, and may have other features, such as number.
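The annotated-document model described above can be sketched in a few lines. This is only an illustration in Python, not Jet's actual (Java) API: the class and method names (Annotation, Document, annotate, annotations_at, covered_text) are hypothetical, and spans are assumed to be character offsets with an exclusive end.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    type: str        # e.g. "token" or "constit"
    start: int       # offset of the first character of the span
    end: int         # offset just past the last character of the span
    features: dict = field(default_factory=dict)


@dataclass
class Document:
    text: str        # the text itself is not modified by annotators
    annotations: list = field(default_factory=list)

    def annotate(self, type, start, end, **features):
        """Add an annotation over text[start:end] and return it."""
        ann = Annotation(type, start, end, features)
        self.annotations.append(ann)
        return ann

    def annotations_at(self, start, type=None):
        """Annotations beginning at `start`, optionally filtered by type."""
        return [a for a in self.annotations
                if a.start == start and (type is None or a.type == type)]

    def covered_text(self, ann):
        """The stretch of text spanned by an annotation."""
        return self.text[ann.start:ann.end]


# A one-line document with two constit annotations (toy categories).
doc = Document("dogs bark")
doc.annotate("constit", 0, 4, cat="n", number="plural")
doc.annotate("constit", 5, 9, cat="v")
```

Keeping the text fixed and storing all analysis as separate annotations is what lets later stages of a cascade read what earlier stages produced without re-parsing the text.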
When we use the console to try a sentence, Jet creates a 1-line
document and submits it for processing. If we are processing an
entire document (as we shall starting next week), Jet begins by
identifying the portions
of the document to be processed (document zoning) and dividing the
zones into sentences.
Jet applies to each sentence a series of actions, as specified by the processSentence
keyword in the parameter file. Most of these actions specify
annotators: programs which add annotations to the document.
Examples of annotators are ones which
- divide text into tokens, adding token annotations
- look up tokens in the dictionary and add constit annotations
- use an HMM POS tagger to assign constit annotations with
  cat = Penn part-of-speech to tokens
- use an HMM POS tagger to assign tagger annotations with
  cat = Penn part-of-speech, and then convert them to constit
  annotations with cat = Jet part-of-speech
- use an HMM POS tagger to assign tagger annotations with
  cat = Penn part-of-speech, and then use these tags to
  prune constit annotations which are not compatible with the
  tags (normally used after lexLookup)
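The idea of applying a sequence of annotators to each sentence can be sketched as follows. This is a simplified Python illustration under stated assumptions, not Jet code: the function names (tokenize, lex_lookup, process_sentence), the tuple representation of annotations, and the toy dictionary are all hypothetical.

```python
import re

# Toy dictionary mapping words to possible categories (assumed for illustration).
TOY_DICTIONARY = {"the": ["det"], "dogs": ["n"], "bark": ["n", "v"]}


def tokenize(doc):
    """Add one 'token' annotation per whitespace-delimited token."""
    for m in re.finditer(r"\S+", doc["text"]):
        doc["annotations"].append(("token", m.start(), m.end(), {}))


def lex_lookup(doc):
    """Add a 'constit' annotation for each dictionary category of each token."""
    tokens = [a for a in doc["annotations"] if a[0] == "token"]
    for _, start, end, _ in tokens:
        word = doc["text"][start:end].lower()
        for cat in TOY_DICTIONARY.get(word, []):
            doc["annotations"].append(("constit", start, end, {"cat": cat}))


def process_sentence(text, actions):
    """Run each annotator in order; each one sees the earlier annotations."""
    doc = {"text": text, "annotations": []}
    for action in actions:
        action(doc)
    return doc


doc = process_sentence("The dogs bark", [tokenize, lex_lookup])
constits = [(doc["text"][s:e], f["cat"])
            for t, s, e, f in doc["annotations"] if t == "constit"]
```

Here "bark" receives two constit annotations (noun and verb); a later stage, such as tag-based pruning, would be responsible for removing the one that does not fit the context.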
In addition to these annotators, Jet allows you to build an annotator
from a set of rules based on
finite-state patterns. Each pattern is a
regular (finite-state) expression involving literals (which match
tokens) and annotations. If the pattern is matched, Jet can add
annotations, as well as print messages. We associate a pattern-set
name with a set of such pattern-action rules, and treat pat(pattern-set-name)
as an annotator. As an example, we look at a small set of
patterns to identify noun and verb groups.
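To make the idea of a finite-state pattern over annotations concrete, here is a minimal Python sketch of a noun group recognizer. It is not written in Jet's pattern language; the pattern (an optional determiner, any number of adjectives, one or more nouns), the Penn-style tags, and the function name find_noun_groups are assumptions chosen for illustration.

```python
def find_noun_groups(tagged):
    """Return (start, end) token spans matching  DT? JJ* (NN|NNS)+ .

    `tagged` is a list of (word, tag) pairs; `end` is exclusive.
    Matching is greedy and left to right, with no backtracking.
    """
    tags = [tag for _, tag in tagged]
    spans, i, n = [], 0, len(tags)
    while i < n:
        j = i
        if j < n and tags[j] == "DT":           # optional determiner
            j += 1
        while j < n and tags[j] == "JJ":        # zero or more adjectives
            j += 1
        k = j
        while k < n and tags[k] in ("NN", "NNS"):   # one or more nouns
            k += 1
        if k > j:                               # at least one noun: a match
            spans.append((i, k))
            i = k
        else:                                   # no match starting at i
            i += 1
    return spans


sentence = [("the", "DT"), ("old", "JJ"), ("dog", "NN"),
            ("chased", "VBD"), ("cats", "NNS")]
noun_groups = find_noun_groups(sentence)
```

Because the pattern has no recursion, each token is examined a bounded number of times, which is why recognition with such patterns is fast.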
Corpus Trained Partial Parsers (J&M 13.5.2)
Noun and verb groups have been extensively studied, and one can do
quite well at recognizing them using a small set of rules.
However, to get the best coverage we may want to consider a
corpus-trained strategy, as we did for part-of-speech tagging.
To do so, we want to reduce noun group finding to a token
classification problem. At first glance, you might think that we
need only two classes, N (in a noun group) and O (outside a noun
group), but this cannot distinguish two consecutive tokens which
belong to a single noun group from two consecutive tokens which belong to
two adjacent noun groups. To
make this distinction, we introduce three classes, B (beginning token
of a noun group), I (inside: second or subsequent token of a noun
group), and O (outside a noun group). To handle noun and verb
groups, we would need 5 classes, B-N, I-N, B-V, I-V, and O.
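The mapping between chunks and per-token classes can be sketched as a pair of small functions. This is an illustrative Python sketch; the function names and the (start, end, label) span representation with an exclusive end are assumptions, and ill-formed tag sequences (an I tag with no preceding B) are simply ignored.

```python
def spans_to_bio(n_tokens, chunks):
    """Map non-overlapping (start, end, label) chunks to B-/I-/O tags."""
    tags = ["O"] * n_tokens
    for start, end, label in chunks:
        tags[start] = "B-" + label
        for i in range(start + 1, end):
            tags[i] = "I-" + label
    return tags


def bio_to_spans(tags):
    """Recover (start, end, label) chunks from a B-/I-/O tag sequence."""
    spans, start, label = [], None, None
    for i, tag in enumerate(tags + ["O"]):      # sentinel closes a final chunk
        if start is not None and not tag.startswith("I-"):
            spans.append((start, i, label))     # current chunk ends before i
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return spans


tokens = ["She", "gave", "the", "boy", "a", "book"]
chunks = [(0, 1, "N"), (1, 2, "V"), (2, 4, "N"), (4, 6, "N")]
tags = spans_to_bio(len(tokens), chunks)
```

In this example "the boy" and "a book" are adjacent noun groups. With only N and O classes both would be tagged N N N N and could not be told apart from a single four-token group; the B tag on "a" marks the boundary.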
The "chunker" (noun and verb group tagger) can then be implemented by
training either a set of symbolic rules
(typically, finite-state patterns) or a probabilistic model (such as
an HMM). We already saw (from our part-of-speech tagger) how to
implement a classifier using an HMM. Learning symbolic
rules has the benefit (over probabilistic models) that the results are
inspectable and in some cases can even be improved based on linguistic
knowledge.
One of the most popular symbolic learners is Transformation-Based
Learning, introduced by Brill. In transformation-based tagging,
we begin by
assigning each word its most common part-of-speech. We then apply
a series of transformations to the corpus. Each transformation has
the general form
    if (some condition on the previous and following tags) then
        change the current tag from X to Y
where the condition can take the form 'the previous tag is z', or 'the tag of
the word 2 back is w', etc. Each transformation is tried at each
position of the corpus, although it will 'fire' only on those meeting
the condition.
Transformation-based learning creates a set of possible transformations
(transformation templates). Starting with the initial tag
assignments (of the most common part of speech for each word), we try
all possible transformations, and select the one which produces the
maximum improvement in accuracy (measured against a hand-tagged
corpus). We apply that transformation, and then repeat the
learning process. This yields a series of transformations which
can be used to tag new text.
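The greedy learning loop can be sketched as follows. This is a deliberately tiny Python illustration, not Brill's implementation: it assumes a single transformation template ("change tag X to Y if the previous tag is Z"), tests conditions against the tags as they stood before the rule was applied, and uses a toy seven-word corpus. All names here are hypothetical.

```python
def apply_rule(tags, rule):
    """Apply rule = (from_tag, to_tag, prev_tag) at every position.

    Conditions are checked against the tags before this rule is applied.
    """
    from_tag, to_tag, prev_tag = rule
    out = list(tags)
    for i in range(1, len(tags)):
        if tags[i] == from_tag and tags[i - 1] == prev_tag:
            out[i] = to_tag
    return out


def num_correct(tags, gold):
    return sum(t == g for t, g in zip(tags, gold))


def learn(words, gold, most_common_tag, max_rules=10):
    """Greedily pick the rule with the largest accuracy gain, then repeat."""
    tags = [most_common_tag[w] for w in words]   # initial assignment
    tagset = sorted(set(gold))
    rules = []
    for _ in range(max_rules):
        base = num_correct(tags, gold)
        best_rule, best_gain = None, 0
        for x in tagset:                         # instantiate the template
            for y in tagset:
                if x == y:
                    continue
                for z in tagset:
                    candidate = apply_rule(tags, (x, y, z))
                    gain = num_correct(candidate, gold) - base
                    if gain > best_gain:
                        best_rule, best_gain = (x, y, z), gain
        if best_rule is None:                    # no rule improves accuracy
            break
        tags = apply_rule(tags, best_rule)
        rules.append(best_rule)
    return rules, tags


words = ["the", "race", "ended", "to", "race", "is", "fun"]
gold = ["DT", "NN", "VBD", "TO", "VB", "VBZ", "JJ"]
most_common = {"the": "DT", "race": "NN", "ended": "VBD",
               "to": "TO", "is": "VBZ", "fun": "JJ"}
rules, final_tags = learn(words, gold, most_common)
```

On this toy corpus the only error in the initial assignment is the second "race", and the learner finds the single rule "change NN to VB if the previous tag is TO". The triple loop over the tag set also shows why a larger space of templates makes training slow: every candidate rule is scored against the whole corpus on every iteration.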
Using the Penn Tree Bank as training data, a very successful chunker
has been built using transformation-based learning (Lance Ramshaw and
Mitch Marcus. Text Chunking using
Transformation-Based Learning (WVLC 1995)).
There is a trade-off between the size of the space of possible
transformations and the training time: allowing a larger space can improve
performance but greatly slow down training. In practice, TBLs are
much slower to train than HMMs. We use both HMMs and hand-written
finite-state rules in Jet, but have not incorporated any symbolic
learners.
Evaluating chunkers (J&M
Our noun/verb group tagger
('chunker') won't be perfect; as with our POS tagger, we need to
evaluate it by hand-annotating a substantial corpus and then comparing
the results of our chunker against this standard.
What metric should we use? As discussed last week, we can map a
chunk annotation into an
assignment of a tag to each word. For example, if we are tagging
just noun groups, we can tag the first word of a noun group as B-NG,
subsequent words of a noun group as I-NG, and words not in a noun group
as O. Then we can measure the accuracy of our tagging (correct
tags / total number of words). This can be readily extended to
other chunk types, or multiple chunk types.
However, accuracy can be a deceptive measure for less frequent
annotations. If we are scoring verb group annotations, and 80% of
words are not part of a verb group (i.e., get tag O), then a tagger
which can't find any verb groups at all would get an 80% accuracy score.
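A short Python calculation illustrates the point. The ten-word tag sequence and the tag names B-VG and I-VG are made up for the example.

```python
# Key: 8 of the 10 words are outside any verb group (tag O).
gold = ["O"] * 8 + ["B-VG", "I-VG"]

# Response of a "tagger" that never finds a verb group.
response = ["O"] * len(gold)

accuracy = sum(g == r for g, r in zip(gold, response)) / len(gold)
```

The accuracy is 0.8 even though the response contains no verb groups at all.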
So for annotations which can span one or more words we typically use
recall and precision measures. In comparing the key and system
response, we count the total number of annotations in the key, the
total number in the response, and the number of correct annotations
(annotations in the response which exactly match one in the key (same
starting and ending word)). We then compute
recall = number correct / number in key
precision = number correct / number in response
F = harmonic mean of recall and precision = 2 / (1 / recall + 1 / precision)
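These measures can be computed directly from the two sets of annotations. The following Python sketch is an illustration only (it is not Jet's scorer); the function name score, the (start, end, type) representation, and the example spans are assumptions. F is computed as 2 * precision * recall / (precision + recall), which is algebraically the same as 2 / (1 / recall + 1 / precision).

```python
def score(key, response):
    """Return (recall, precision, F) for two collections of annotations.

    Each annotation is a (start, end, type) triple; a response annotation
    is correct only if it exactly matches an annotation in the key.
    """
    key, response = set(key), set(response)
    correct = len(key & response)
    recall = correct / len(key) if key else 0.0
    precision = correct / len(response) if response else 0.0
    if precision + recall > 0:
        f = 2 * precision * recall / (precision + recall)
    else:
        f = 0.0
    return recall, precision, f


key = [(0, 2, "NG"), (3, 5, "NG"), (6, 7, "NG"), (8, 10, "NG")]
response = [(0, 2, "NG"), (3, 4, "NG"), (6, 7, "NG")]
recall, precision, f = score(key, response)
```

Here two of the three response annotations are correct (the span (3, 4) has the wrong end point, so it earns no credit), giving recall 2/4, precision 2/3, and F = 4/7.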
To assist in evaluating chunkers and other similar annotators, Jet
provides the capability to annotate entire
documents and the ability to score the output
for precision and recall.