G22.2591 - Advanced Natural Language Processing - Spring 2004

Lecture 3

Part of speech tagging using maximum entropy

Ratnaparkhi used MaxEnt for POS tagging.  For a sentence with words w1, ..., wn and tags t1, ..., tn, the features combined with ti were the words wi-2, wi-1, wi, wi+1, wi+2, the previous tag ti-1, and the tag pair ti-2 ti-1.  In general, all features which occurred at least 10 times were kept.  The search algorithm used was a beam search, keeping the N best tag sequences at each word.  He got about 96.5% tagging accuracy.
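
To make the feature set concrete, here is a minimal Python sketch of the kind of context features such a tagger conditions on when predicting ti; the function name and padding symbols are mine, not Ratnaparkhi's, and the real model adds further features (e.g. for rare words).

    def extract_features(words, tags, i):
        """Context features for predicting the tag of words[i], in the spirit of
        Ratnaparkhi (1996): surrounding words plus the previous one and two tags."""
        pad_w = ["<s>", "<s>"] + words + ["</s>", "</s>"]
        pad_t = ["<s>", "<s>"] + tags           # tags are only known to the left of i
        j = i + 2                               # index of word i in the padded list
        return {
            "w-2": pad_w[j - 2],
            "w-1": pad_w[j - 1],
            "w0":  pad_w[j],
            "w+1": pad_w[j + 1],
            "w+2": pad_w[j + 2],
            "t-1": pad_t[i + 1],
            "t-2 t-1": pad_t[i] + " " + pad_t[i + 1],
        }

    # Features for tagging "account", given the tags already assigned to its left.
    words = ["the", "current", "account", "deficit"]
    tags = ["DT", "JJ"]                         # tags for words[0..1]
    print(extract_features(words, tags, 2))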

Ratnaparkhi points out that MaxEnt has the advantage of allowing specialized features (like TBL) and providing probabilities (like an HMM).  As an example of specialized features, he tried using conjunctions of features for difficult words, but found very little gain.

He noted that he could improve performance to 97% by using training and test data from only one of the Treebank annotators.

Adwait Ratnaparkhi.  A Maximum Entropy Model for Part-Of-Speech Tagging (EMNLP 1996)

Summary thoughts on POS tagging

Text Chunking  (J&M sec. 10.5)

What is text chunking?

Text chunking subsumes a range of tasks.  The simplest is finding 'noun groups' or 'base NPs' ... non-recursive noun phrases up to the head (for English).  More ambitious systems may add additional chunk types, such as verb groups, or may seek a complete partitioning of the sentence into chunks of different types:

[NP He ] [VP reckons ] [NP the current account deficit ] [VP will narrow ] [PP to ] [NP only $1.8 billion ] [PP in ] [NP September ] .

In any case, the chunks are non-recursive structures which can potentially be handled by finite-state methods.

Steve Abney, Parsing by Chunks

Quite high performance on NP chunking can be obtained with a small number of regular expressions (J&M, sec. 10.5).  With a larger rule set, using Constraint Grammar rules, Voutilainen reports recall of 98%+ with precision of 95-98% for noun chunks.

Atro Voutilainen,  NPtool, a Detector of English Noun Phrases, WVLC 93.
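
As a rough illustration of the finite-state approach, here is a toy Python sketch that matches a single determiner-adjective-noun pattern over the POS tags of a sentence.  The pattern and helper function are illustrative only; real rule sets (like Voutilainen's) are far larger.

    import re

    # Toy noun-group pattern over POS tags: optional determiner, any number of
    # adjectives or nouns, ending in a noun (the head).
    NP_PATTERN = re.compile(r"(DT )?(JJ |NN |NNS )*(NN|NNS)\b")

    def np_chunks(tagged):
        """Return (start, end) word spans of noun groups in a [(word, tag), ...] list."""
        tag_string = ""
        offsets = []                    # word index owning each character position
        for i, (_, tag) in enumerate(tagged):
            offsets.extend([i] * (len(tag) + 1))
            tag_string += tag + " "
        return [(offsets[m.start()], offsets[m.end() - 1] + 1)
                for m in NP_PATTERN.finditer(tag_string)]

    sent = [("the", "DT"), ("current", "JJ"), ("account", "NN"), ("deficit", "NN"),
            ("will", "MD"), ("narrow", "VB")]
    for s, e in np_chunks(sent):
        print(" ".join(w for w, _ in sent[s:e]))   # -> the current account deficit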

Why do text chunking?

Full parsing is expensive, and is not very robust.

Partial parsing can be much faster and more robust, and may be sufficient for many applications (IE, QA).  It can also serve as a possible first step for full parsing.

Learning methods for text chunking      

In his paper on POS tagging, Church also described a method for finding base noun phrases.  For each pair of parts of speech, he determined the probability that there is an open bracket (the start of a base NP) between the two words, and the probability that there is a close bracket (end of a base NP) between the two words.  He conducted an informal test (243 NPs) and reported very good results (238 correct).  However, his notion of chunks is more restrictive than later test sets, so the results are not directly comparable.

Ken Church, A Stochastic Parts Program and Noun Phrase Parser for Unrestricted Text, Second Conference on Applied Natural Language Processing, 1988.
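
A rough sketch of the counting involved, assuming training sentences in which base-NP boundaries are marked with '[' and ']' interleaved among the POS tags; the input format is mine, and Church's actual model is more elaborate than this.

    from collections import defaultdict

    def train_bracket_probs(tagged_sents):
        """For each adjacent POS pair, estimate the probability that a base-NP open
        bracket '[' falls between them, and likewise for a close bracket ']'.
        A '#' pseudo-tag marks the sentence boundary."""
        pair_count = defaultdict(int)
        open_count = defaultdict(int)
        close_count = defaultdict(int)
        for sent in tagged_sents:
            prev = "#"
            pending_open = pending_close = False
            for tok in sent + ["#"]:
                if tok == "[":
                    pending_open = True
                elif tok == "]":
                    pending_close = True
                else:
                    pair_count[(prev, tok)] += 1
                    if pending_open:
                        open_count[(prev, tok)] += 1
                    if pending_close:
                        close_count[(prev, tok)] += 1
                    pending_open = pending_close = False
                    prev = tok
        p_open = {k: open_count[k] / n for k, n in pair_count.items()}
        p_close = {k: close_count[k] / n for k, n in pair_count.items()}
        return p_open, p_close

    p_open, p_close = train_bracket_probs(
        [["[", "DT", "NN", "]", "VBZ", "[", "NN", "]"]])
    print(p_open[("#", "DT")], p_close[("NN", "#")])   # -> 1.0 1.0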

Ramshaw and Marcus adapted the TBL method which had been introduced by Brill for POS tagging.  They pointed out that one-level bracketing can be restated as a word tagging task.  For NP chunking, they used 3 tags:  I (inside a baseNP), O (outside a baseNP), and B (the start of a baseNP which immediately follows another baseNP).  Initial tags were assigned based on the most likely tag for a given part-of-speech.  The contexts for TBL rules included words, part-of-speech assignments, and prior IOB tags.
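
Here is a small Python sketch of that encoding, converting bracketed base NPs into per-word I/O/B tags; the helper name and bracketed input format are mine.

    def to_iob(tokens):
        """Convert a token stream with '[' and ']' marking base NPs into
        (word, tag) pairs: I = inside an NP, O = outside, B = first word of an
        NP that immediately follows another NP."""
        out = []
        in_np = False
        first = False              # next NP word is the first word of its NP
        prev_tag = "O"
        for tok in tokens:
            if tok == "[":
                in_np, first = True, True
            elif tok == "]":
                in_np = False
            else:
                if not in_np:
                    tag = "O"
                elif first and prev_tag in ("I", "B"):
                    tag = "B"      # new NP starts right after another NP
                else:
                    tag = "I"
                out.append((tok, tag))
                prev_tag = tag
                first = False
        return out

    print(to_iob(["[", "the", "deal", "]", "[", "the", "companies", "]", "announced"]))
    # [('the', 'I'), ('deal', 'I'), ('the', 'B'), ('companies', 'I'), ('announced', 'O')]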

Results can be scored based on the correct assignment of tags, or on recall and precision of complete baseNPs.  The latter is normally used as the metric, since it corresponds to the actual objective -- different tag sets can be used as an intermediate representation.  R&M obtained about 92% recall and precision with their system for baseNPs, using 200K words of training.  (Without lexical information, they got about 90.5% recall and precision.)
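
A sketch of the chunk-level scoring, under the same I/O/B encoding: recover the NP spans from the gold and predicted tag sequences and compare them as sets (function names are mine).

    def np_spans(tags):
        """Return the set of (start, end) spans encoded by an I/O/B tag sequence."""
        spans, start = set(), None
        for i, t in enumerate(tags):
            if t in ("B", "O") and start is not None:   # current NP ends before i
                spans.add((start, i))
                start = None
            if t == "B" or (t == "I" and start is None):
                start = i
        if start is not None:
            spans.add((start, len(tags)))
        return spans

    def chunk_precision_recall(gold, pred):
        """Precision and recall over complete base NPs, the usual chunking metric."""
        g, p = np_spans(gold), np_spans(pred)
        correct = len(g & p)
        return (correct / len(p) if p else 0.0,
                correct / len(g) if g else 0.0)

    gold = ["I", "I", "O", "I", "I"]     # two gold NPs
    pred = ["I", "I", "O", "B", "O"]     # one NP correct, one wrong
    print(chunk_precision_recall(gold, pred))   # -> (0.5, 0.5)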

R&M mention two major sources of error (and these are also error sources for simple finite-state patterns for baseNP):  participles and conjunction.  Whether a participle is part of a noun phrase will depend on the particular choice of words

He enjoys writing letters.
He sells writing paper.

and sometimes is genuinely ambiguous ...

He enjoys baking potatoes.
He has broken bottles in the basement.

The rules for conjoined NPs are complicated by the bracketing rules of the Penn Treebank.  Conjoined prenominal nouns are generally treated as part of a single baseNP:  "brick and mortar university" (with "brick and mortar" modifying "university").  Conjoined heads with shared modifiers are also to be treated as a single baseNP:  "ripe apples and bananas";  however, if the modifier is not shared, there are two baseNPs:  "ripe apples and cinnamon".  (Bracketing Guidelines for Treebank II Style, Penn Treebank Project, section 8.1, p. 135.)  Modifier sharing, however, is sometimes hard for people to judge and is not always consistently annotated in the Treebank.  This limits the maximum performance of any Treebank-based NP tagger.

Lance Ramshaw and Mitch Marcus.  Text Chunking using Transformation-Based Learning  (WVLC 1995)

Text chunking as a shared task

Lots of people have tried text chunking, using many different learning methods.  The CoNLL-2000 shared task (organized by the Special Interest Group of the ACL on Computational Natural Language Learning) included both full chunking and noun phrase (group) chunking.  Like R&M, they used data derived from the Penn Treebank.  I am not aware of any separate study of human performance on this task (the best systems get almost 96% against the test data, so human consistency is probably at least that high).
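
The CoNLL-2000 data is widely redistributed; for example, NLTK packages it as a corpus, so (assuming NLTK is installed and the conll2000 corpus data has been downloaded) the word/POS/chunk-tag format can be inspected directly.  Note that NLTK reports chunk tags in B-NP/I-NP/O form rather than R&M's I/O/B scheme.

    import nltk
    from nltk.corpus import conll2000
    from nltk.chunk import tree2conlltags

    nltk.download("conll2000", quiet=True)      # fetch the corpus if not present

    # NP-only view of the training data, as shallow bracketed trees.
    train = conll2000.chunked_sents("train.txt", chunk_types=["NP"])

    # Flatten the first sentence to (word, POS, chunk-tag) triples.
    for word, pos, chunk in tree2conlltags(train[0]):
        print(word, pos, chunk)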

Looking ahead

The best performance on the baseNP and chunking tasks was obtained using a Support Vector Machine method.  Kudo and Matsumoto obtained an F-score of 94.22% with the small data set of Ramshaw and Marcus, and 95.77% by training on almost the entire Penn Treebank.

Taku Kudo and Yuji Matsumoto.  Chunking with Support Vector Machines (NAACL 2001)
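
As a very rough sketch of the approach (not Kudo and Matsumoto's actual system): train a classifier on word/POS context features plus previously assigned chunk tags, and tag left to right.  The sketch below uses scikit-learn's LinearSVC and greedy decoding, whereas K&M used polynomial-kernel SVMs and a more careful decoding scheme.

    from sklearn.feature_extraction import DictVectorizer
    from sklearn.svm import LinearSVC

    def token_features(words, poss, chunk_tags, i):
        """Word/POS context in a 5-word window plus the two previous chunk tags."""
        feats = {}
        for k in range(-2, 3):
            j = i + k
            feats["w%d" % k] = words[j] if 0 <= j < len(words) else "<pad>"
            feats["p%d" % k] = poss[j] if 0 <= j < len(poss) else "<pad>"
        feats["c-1"] = chunk_tags[i - 1] if i >= 1 else "<pad>"
        feats["c-2"] = chunk_tags[i - 2] if i >= 2 else "<pad>"
        return feats

    def train_chunker(sentences):
        """sentences: list of (words, pos_tags, chunk_tags) triples."""
        X, y = [], []
        for words, poss, chunks in sentences:
            for i in range(len(words)):
                X.append(token_features(words, poss, chunks, i))
                y.append(chunks[i])
        vec = DictVectorizer()
        clf = LinearSVC()              # linear kernel, unlike K&M's polynomial kernels
        clf.fit(vec.fit_transform(X), y)
        return vec, clf

    def chunk(vec, clf, words, poss):
        """Greedy left-to-right tagging with the trained classifier."""
        tags = []
        for i in range(len(words)):
            feats = token_features(words, poss, tags, i)
            tags.append(clf.predict(vec.transform([feats]))[0])
        return tags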

baseNP chunking is a task for which people (with some linguistics training) can write quite good rules fairly quickly.  This raises the practical question of whether we should be using machine learning at all.  Clearly if there is already a large relevant resource, it makes sense to learn from it.  However, if we have to develop a chunker for a new language, is it cheaper to annotate some data or to write the rules directly?  Ngai and Yarowsky addressed this question:

Grace Ngai and David Yarowsky.  Rule Writing or Annotation: Cost-efficient Resource Usage for Base Noun Phrase Chunking (ACL 2000)