G22.2591 - Advanced Natural Language Processing - Spring 2004
Lecture 3
Part of speech tagging using maximum entropy
Ratnaparkhi used MaxEnt for POS tagging. For a sentence with
words w_1, ..., w_n and tags t_1, ..., t_n,
the features combined with t_i were w_{i-2}, w_{i-1},
w_i, w_{i+1}, w_{i+2}, t_{i-1}, and the pair t_{i-2} t_{i-1}.
In general, all features which occurred at least 10 times were
kept. The search algorithm used was a beam search, keeping the N
best tag sequences at each word. He got about 96.5% tagging
accuracy.
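The feature templates and the beam search described above can be sketched as follows. This is a minimal illustration, not Ratnaparkhi's implementation: the padding symbol, the feature-string format, and the `log_prob` callback (standing in for a trained MaxEnt model) are all assumptions made for the example.

```python
from typing import Callable, List, Sequence, Tuple

PAD = "<pad>"  # hypothetical boundary symbol, not from the lecture


def context_features(words: Sequence[str], i: int,
                     prev_tag: str, prev2_tag: str) -> List[str]:
    """Features for position i: a five-word window plus the two previous tags."""
    def w(j: int) -> str:
        return words[j] if 0 <= j < len(words) else PAD

    return [
        f"w-2={w(i - 2)}",
        f"w-1={w(i - 1)}",
        f"w0={w(i)}",
        f"w+1={w(i + 1)}",
        f"w+2={w(i + 2)}",
        f"t-1={prev_tag}",
        f"t-2,t-1={prev2_tag},{prev_tag}",
    ]


def beam_search(words: Sequence[str], tagset: Sequence[str],
                log_prob: Callable[[List[str], str], float],
                beam_size: int = 3) -> List[str]:
    """Keep the beam_size best tag sequences after each word.

    log_prob(features, tag) is assumed to return log P(tag | features)
    from some trained model; it is a placeholder here.
    """
    beam: List[Tuple[float, List[str]]] = [(0.0, [])]
    for i in range(len(words)):
        candidates = []
        for score, tags in beam:
            prev_tag = tags[-1] if len(tags) >= 1 else PAD
            prev2_tag = tags[-2] if len(tags) >= 2 else PAD
            feats = context_features(words, i, prev_tag, prev2_tag)
            for tag in tagset:
                candidates.append((score + log_prob(feats, tag), tags + [tag]))
        # Prune to the best beam_size partial sequences.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beam = candidates[:beam_size]
    return beam[0][1]
```

Because the tag features depend only on the two previous tags, an exact Viterbi search would also be possible; the beam is a cheaper approximation.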
Ratnaparkhi points out that MaxEnt has the advantage of allowing
specialized features (like TBL) and providing probabilities (like an
HMM). As an example of specialized features, he tried using
conjunctions of features for difficult words, but found very little
gain.
He noted that he could improve performance to 97% by using training and
test data from only one of the Treebank annotators.
Summary thoughts on POS tagging
- the Penn tag set is now the standard for assessing English
tagging, but it forces annotators to make some hard decisions, and has
an interannotator error of around 3%
- there are many supervised learning methods which can get 96-97%
accuracy on held-out Wall Street Journal data; we looked at HMMs,
TBL, and MaxEnt; the error in the Penn Treebank probably masks
differences between the methods
- unsupervised methods can do quite well using a dictionary,
essentially bootstrapping from those (many) words which have only a
single part-of-speech; unsupervised TBL by itself can do almost
as well as supervised methods; adding a bit of supervised
training gets near the performance of supervised training on the entire
PTB; Baum-Welch can do almost as well, but needs clever grouping
of ambiguity classes
- ENGCG tags can be more consistently assigned, and hand-written
ENGCG rules can get 98%+ accuracy
Text Chunking (J&M sec. 10.5)
What is text chunking?
Text chunking subsumes a range of tasks. The simplest is finding
'noun groups' or 'base NPs' ... non-recursive noun phrases up to the
head (for English). More ambitious systems may add additional
chunk types, such as verb groups, or may seek a complete partitioning
of the sentence into chunks of different types:
[NP He ] [VP reckons ] [NP the current account deficit ] [VP will narrow ] [PP to ] [NP only $1.8 billion ] [PP in ] [NP September ] .
In any case, the chunks are non-recursive structures which can
potentially be handled by finite-state methods.
Quite high performance on NP chunking can be obtained with a small
number of regular expressions (J&M, sec. 10.5). With a
larger rule set, using Constraint Grammar rules, Voutilainen reports
recall of 98%+ with precision of 95-98% for noun chunks.
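As an illustration of the regular-expression approach, here is a small sketch that matches a pattern over the sequence of Penn part-of-speech tags. The pattern (optional determiner or possessive pronoun, any adjectives, one or more nouns) is a hypothetical example, not the rule set from J&M or from Voutilainen, and it deliberately ignores pronouns, numbers, and other cases a real chunker would need.

```python
import re
from typing import List, Sequence, Tuple

# Hypothetical noun-group pattern over angle-bracketed POS tags.
NP_PATTERN = re.compile(r"(?:<(?:DT|PRP\$)>)?(?:<JJ[RS]?>)*(?:<NNP?S?>)+")


def chunk_noun_groups(tagged: Sequence[Tuple[str, str]]) -> List[Tuple[int, int]]:
    """Return (start, end) token spans, end exclusive, of matched noun groups."""
    tag_string = "".join(f"<{tag}>" for _, tag in tagged)
    # Map the character offset of each tag boundary to a token index.
    offset_to_token = {}
    pos = 0
    for idx, (_, tag) in enumerate(tagged):
        offset_to_token[pos] = idx
        pos += len(tag) + 2  # tag plus the two angle brackets
    offset_to_token[pos] = len(tagged)
    return [(offset_to_token[m.start()], offset_to_token[m.end()])
            for m in NP_PATTERN.finditer(tag_string)]


sentence = [("He", "PRP"), ("reckons", "VBZ"), ("the", "DT"),
            ("current", "JJ"), ("account", "NN"), ("deficit", "NN"),
            ("will", "MD"), ("narrow", "VB")]
spans = chunk_noun_groups(sentence)
```

On this sentence the pattern finds the single span covering "the current account deficit"; the pronoun "He" is missed because the toy pattern has no rule for it.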
Why do text chunking?
Full parsing is expensive, and is not very robust.
Partial parsing can be much faster and more robust, yet may be sufficient
for many applications (information extraction, question answering). It
can also serve as a possible first step for full parsing.
Learning methods for text chunking
In his paper on POS tagging, Church also described a method for finding
base noun phrases. For each pair of parts of speech, he
determined the probability that there is an open bracket (the start of
a base NP) between the two words, and the probability that there is a
close bracket (end of a base NP) between the two words. He
conducted an informal test (243 NPs) and reported very good results
(238 correct). However, his notion of chunks is more restrictive
than later test sets, so the results are not directly comparable.
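The bracket probabilities described above could be estimated by simple counting, as in the sketch below. This is an assumption about how such a table might be built from a bracketed training corpus, not a reconstruction of Church's program; the data layout (tag lists plus end-exclusive NP spans) is hypothetical, and brackets at the very start or end of a sentence are ignored for brevity.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple

TagPair = Tuple[str, str]


def bracket_probabilities(
    corpus: Sequence[Tuple[List[str], List[Tuple[int, int]]]]
) -> Tuple[Dict[TagPair, float], Dict[TagPair, float]]:
    """Estimate P(open bracket) and P(close bracket) between each tag pair.

    Each corpus item is (tags, spans), where spans are (start, end) token
    indices of base NPs with end exclusive.
    """
    pair_counts: Counter = Counter()
    open_counts: Counter = Counter()
    close_counts: Counter = Counter()
    for tags, spans in corpus:
        starts = {s for s, _ in spans}
        ends = {e for _, e in spans}
        # Gap i lies between token i-1 and token i.
        for i in range(1, len(tags)):
            pair = (tags[i - 1], tags[i])
            pair_counts[pair] += 1
            if i in starts:
                open_counts[pair] += 1
            if i in ends:
                close_counts[pair] += 1
    p_open = {p: open_counts[p] / n for p, n in pair_counts.items()}
    p_close = {p: close_counts[p] / n for p, n in pair_counts.items()}
    return p_open, p_close
```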
Ramshaw and Marcus adapted the TBL method which had been introduced by
Brill for POS tagging. They pointed out that one-level bracketing
can be restated as a word tagging task. For NP chunking, they
used 3 tags: I (inside a baseNP), O (outside a baseNP), and B
(the start of a baseNP which immediately follows another baseNP).
Initial tags were assigned based on the most likely tag for a given
part-of-speech. The contexts for TBL rules included words,
part-of-speech assignments, and prior IOB tags.
Results can be scored based on the correct assignment of tags, or on
recall and precision of complete baseNPs. The latter is normally
used as the metric, since it corresponds to the actual objective; the
tags are only an intermediate representation, and different tag sets
can be used for it.
R&M obtained about 92% recall and precision with their system for
baseNPs, using 200K words of training. (Without lexical
information, they got about 90.5% recall and precision.)
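Scoring on complete baseNPs can be sketched as follows: decode each I/O/B tag sequence into spans, then count a predicted span as correct only if both of its boundaries match a gold span. The function names are hypothetical, and this is a simplified illustration rather than any official scoring script.

```python
from typing import Sequence, Set, Tuple


def iob_to_spans(tags: Sequence[str]) -> Set[Tuple[int, int]]:
    """Decode I/O/B tags into (start, end) spans, end exclusive."""
    spans = set()
    start = None
    for i, tag in enumerate(tags):
        # O or B closes any baseNP that is currently open.
        if tag in ("O", "B") and start is not None:
            spans.add((start, i))
            start = None
        # B always opens a baseNP; I opens one only if none is open.
        if tag == "B" or (tag == "I" and start is None):
            start = i
    if start is not None:
        spans.add((start, len(tags)))
    return spans


def chunk_precision_recall(gold: Sequence[str],
                           predicted: Sequence[str]) -> Tuple[float, float]:
    """Precision and recall over complete baseNPs (exact boundary match)."""
    gold_spans = iob_to_spans(gold)
    pred_spans = iob_to_spans(predicted)
    correct = len(gold_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    return precision, recall


gold = ["I", "O", "I", "I", "B", "I"]
pred = ["I", "O", "I", "I", "I", "I"]  # misses the boundary between two NPs
precision, recall = chunk_precision_recall(gold, pred)
```

In this example five of the six tags are right, yet only one of the three gold baseNPs is recovered, which shows why tag accuracy overstates chunking performance.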
R&M mention two major sources of error (and these are also error
sources for simple finite-state patterns for baseNP): participles
and conjunction. Whether a participle is part of a noun phrase
will depend on the particular choice of words:
He enjoys writing letters.
He sells writing paper.
and sometimes is genuinely ambiguous ...
He enjoys baking potatoes.
He has broken bottles in the basement.
The rules for conjoined NPs are complicated by the bracketing rules of
the Penn Tree Bank. Conjoined prenominal nouns are generally
treated as part of a single baseNP: "brick and mortar university"
(with "brick and mortar" modifying "university"). Conjoined heads
with shared modifiers are also to be treated as a single baseNP:
"ripe apples and bananas"; however, if the modifier is not shared,
there are two baseNPs: "ripe apples and cinnamon" (Bracketing
Guidelines for Treebank II Style Penn Treebank Project, section 8.1,
p. 135).
Modifier sharing, however, is sometimes hard for people to judge and is
not always consistently annotated in the Treebank. This limits
the maximum performance of any Treebank-based NP tagger.
Text chunking as a shared task
Lots of people have tried text chunking, using many different learning
methods. The CoNLL-2000
shared task (organized by the Special Interest Group of the ACL on
Computational Natural Language Learning) included both full chunking
and noun phrase (group) chunking. Like R&M, they used data
derived from the Penn Treebank. I am not aware of any separate study
of human performance (the best systems reach almost 96% against the
test data, so human consistency is probably at least that high).
Looking ahead
The best performance on the baseNP and chunking tasks was obtained
using a Support Vector Machine method, which reached an accuracy
of 94.22% with the small data set of Ramshaw and Marcus, and 95.77%
when trained on almost the entire Penn Treebank.
baseNP chunking is a task for which people (with some linguistics
training) can write quite good rules fairly quickly. This raises
the practical question of whether we should be using machine learning
at all. Clearly if there is already a large relevant resource, it
makes sense to learn from it. However, if we have to develop a
chunker for a new language, is it cheaper to annotate some data or to
write the rules directly? Ngai and Yarowsky addressed this
question: