G22.2591 - Advanced Natural Language Processing - Spring 2004


Corpus Methods for Natural Language Processing

Mondays 5:00-6:50 pm, 102 WWH

Prof. Ralph Grishman
715 Broadway, Room 703
212-998-3497
grishman@cs.nyu.edu
office hour:  Monday noon-1 pm

Lecture 1

Background

In the introductory natural language processing course, we considered a number of applications involving natural language -- machine translation, information extraction, question answering.  We saw how each of these applications required us to analyze the structure of the text, and to classify pieces of the text.  We considered a number of types of analysis which were common to most of these applications, including part-of-speech tagging, syntactic analysis, and name classification.

Natural language processing has been going on for a long time;  there were several major projects already in the 1950's, primarily for machine translation.  Until the late 1980's, most language analyzers were based on hand-written rules and were evaluated fairly anecdotally.  For several reasons, including dissatisfaction with the old approaches and the availability of much more on-line data, the paradigm shifted in the 1990's to emphasize the use of large text corpora.  These corpora were used to evaluate language processing systems and to train language processing systems.

We are going to study a number of these NLP analysis tasks, and for each task will consider how the task is defined, how performance is evaluated, and what methods have been developed for it.  Our emphasis will be on the linguistic tasks rather than the machine learning methods, but the course will have to include some discussion of a variety of learning methods.

Course Organization

For each task, we will have a background lecture discussing the task, and then will read a number of papers about the task.  Students will give presentations summarizing and commenting on these papers, and we will have small assignments associated with the papers we read.

In addition, each student will prepare a term project involving more extensive experiments on one language processing task.

There is no exam.  Grades will be based on the small assignments, student presentations, and the term project.

Almost all of the papers we will look at have been published through the Assn. for Computational Linguistics (ACL).  It maintains a fairly complete on-line archive of proceedings going back 25 years, and in some cases 40 years, at http://acl.ldc.upenn.edu/ .  Studying the original papers will give some historical perspective on how the field is developing.  However, I will also give references to two excellent texts which are suitable for background reading in natural language processing and basic statistical methods:

Daniel Jurafsky and James Martin, Speech and Language Processing.  Prentice Hall, 2000. (The textbook for the basic natural language processing course;  cited as J&M)

Christopher Manning and Hinrich Schütze, Foundations of Statistical Natural Language Processing.  MIT Press, 1999. (cited as M&S)

An earlier and less comprehensive book on corpus-based methods is:  Eugene Charniak, Statistical Language Learning.  MIT Press, 1993.

Standard abbreviations for citations (all available through the ACL archive):

    ACL = Proc. of the Annual Conference of the Assn. for Computational Linguistics
    ANLP = Proc. of the Conf. on Applied Natural Language Processing
    COLING = Proc. of Int'l Conf. on Computational Linguistics
    EMNLP = Proc. of the Conf. on Empirical Methods in Natural Language Processing
    CL = Computational Linguistics (journal)

Part of Speech Tagging

(J&M, chapter 8;  M&S, chapter 10;  Charniak, sec. 3.2)

Tag Sets

Words are assigned parts of speech in order to capture generalizations about grammatically well-formed sentences, such as "The noun is adjective."  Determining the parts of speech of the words in a sentence can help us to identify the syntactic structure of the sentence, and in some cases determine the pronunciation or meaning of individual words ("Did he cross the desert?" vs. "Did he desert the army?").

There is no unique set of part-of-speech tags.  Words can be grouped in different ways to capture different generalizations, and into coarser or finer categories (J&M 8.2).  The first large (1 million word) 'balanced' corpus was the Brown Corpus, collected by Kucera and Francis at Brown in the 1960's;  it used a set of 87 tags.  Currently the most widely used tag sets are those for the Penn Tree Bank (45 tags) and for the British National Corpus (CLAWS C5 tag set, 61 tags).  We shall concentrate on the Penn Tag Set.

Having a relatively small tag set makes it somewhat easier for people to tag the text, but it loses some distinctions which may be grammatically important.  For example, this tag set does not distinguish between prepositions and subordinating conjunctions, or between auxiliary and main verbs.

Evaluation

In principle, evaluation is relatively easy:  we hand-code a 'gold standard' corpus, compare our system output with the gold standard, word for word, and see what percentage are correct.  (This assumes that we agree on what a word is -- the tokenization rules.  For languages with white space between words, this is fairly simple;  for Japanese and Chinese there needs to be prior agreement on what constitutes a word.)
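As a concrete illustration, here is a minimal Python sketch of this word-by-word scoring (the data and function name are invented), assuming the gold standard and the system output have been tokenized identically:

    # Minimal sketch of word-for-word tagging evaluation.  Assumes gold and
    # system output are token-aligned lists of (word, tag) pairs, i.e., that
    # both sides agree on the tokenization.
    def tagging_accuracy(gold, system):
        """Fraction of tokens whose system tag matches the gold tag."""
        assert len(gold) == len(system), "tokenization mismatch"
        correct = sum(1 for (_, gtag), (_, stag) in zip(gold, system)
                      if gtag == stag)
        return correct / len(gold)

    gold   = [("He", "PRP"), ("crossed", "VBD"), ("the", "DT"), ("desert", "NN")]
    system = [("He", "PRP"), ("crossed", "VBN"), ("the", "DT"), ("desert", "NN")]
    print(tagging_accuracy(gold, system))    # 0.75 -- one tag of four is wrong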

One factor to be kept in mind in evaluation is the type of text being tagged.  Taggers will typically do much better on some types of texts than others;  in particular, corpus-trained taggers will do better on texts similar to those on which they were trained.

Human performance -- tagging guidelines

You might think that native English speakers ought to be able to do a good job tagging English text, particularly since they are taught about parts of speech in school.  This is not the case.  While for most words the tags are straightforward, there are many difficult and subtle cases:

particle (He talked over the deal.) vs. preposition (He talked over the telephone.)

past tense (The horse walked past the barn.) vs. past participle (The horse walked past the barn fell.)

noun vs. adjective (The executive decision.)

noun vs. present participle (Fishing can be fun.)

To improve consistency, any effort at manual annotation requires an annotation guide.  For the Penn Tag Set, the manual is 34 pages and includes guidelines for many of the more difficult decisions, such as those listed above.  Still, people don't always follow the guidelines, and there are cases not covered by the guidelines;  the inter-annotator disagreement rate on this task is several percent.

Once a portion of the corpus has been tagged, one can consider training an automatic tagger and then having an annotator correct this output rather than tag data from scratch.  This can improve throughput, but may introduce biases into the data.

Automatic Taggers

Doing fairly well, once one has a corpus, is not hard.  The simplest strategy -- using the most common POS for each word -- is about 91% accurate on the Brown Corpus (Charniak, p. 49).  However, 91% is not very good for POS tagging:  it means that in a typical newspaper sentence (about 20 words) there will be two incorrect tags.
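For concreteness, here is a small Python sketch of this baseline (the toy training data and the NN fallback for unseen words are illustrative choices, not part of any published tagger):

    from collections import Counter, defaultdict

    # Baseline: tag each word with its most frequent tag in the training data.
    def train_baseline(tagged_tokens):
        counts = defaultdict(Counter)                # word -> tag frequencies
        for word, tag in tagged_tokens:
            counts[word][tag] += 1
        return {w: c.most_common(1)[0][0] for w, c in counts.items()}

    def baseline_tag(words, lexicon, default="NN"):  # guess NN for unseen words
        return [(w, lexicon.get(w, default)) for w in words]

    train = [("the", "DT"), ("can", "MD"), ("can", "NN"), ("can", "MD"),
             ("rusted", "VBD")]
    print(baseline_tag(["the", "can", "rusted"], train_baseline(train)))
    # [('the', 'DT'), ('can', 'MD'), ('rusted', 'VBD')]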

Stochastic Taggers

Several stochastic taggers were built in the late 1980's, based on bigram or trigram models (J&M, sec. 8.5).  Probably the most often cited of these is Ken Church, A Stochastic Parts Program and Noun Phrase Parser for Unrestricted Text, Second Conference on Applied Natural Language Processing, 1988.   These taggers are based on a generative model of a sentence:  given the preceding words, we generate the next word in two steps, first selecting the next part of speech based on the parts of speech of the previous one or two words, and then selecting the word given that part of speech.  In the bigram case, the probability of generating word w_i with tag t_i is

P(t_i | t_{i-1}) P(w_i | t_i)

Based on this model, we seek the most likely tag sequence for a sentence

T* = argmax_T prod_i P(t_i | t_{i-1}) P(w_i | t_i)

The probabilities can be easily estimated from a tagged corpus, using Maximum Likelihood Estimates.  The most likely tag sequence can then be determined using an HMM and the Viterbi decoder.  Both bigram and trigram models have been used, with accuracies (on the Brown Corpus and Penn Tree Bank) of about 96%.
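A minimal Python sketch of Viterbi decoding for the bigram model may make this concrete (the tiny probability tables are invented, and a small floor probability stands in for real smoothing and unknown-word handling):

    import math

    # Viterbi decoding for the bigram tagging model.  p_trans[prev][t]
    # approximates P(t | prev) and p_emit[t][w] approximates P(w | t),
    # as estimated by MLE from a tagged corpus.
    def viterbi(words, tags, p_trans, p_emit, start="<s>", floor=1e-12):
        def lp(table, a, b):                  # log P(b | a), with a floor
            return math.log(table.get(a, {}).get(b, floor))
        # delta[t] = log-prob of the best tag sequence ending in tag t
        delta = {t: lp(p_trans, start, t) + lp(p_emit, t, words[0])
                 for t in tags}
        backptrs = []                         # one backpointer table per word
        for w in words[1:]:
            new_delta, ptr = {}, {}
            for t in tags:
                best_prev = max(tags,
                                key=lambda p: delta[p] + lp(p_trans, p, t))
                new_delta[t] = (delta[best_prev] + lp(p_trans, best_prev, t)
                                + lp(p_emit, t, w))
                ptr[t] = best_prev
            delta = new_delta
            backptrs.append(ptr)
        # recover the best sequence by following backpointers from the end
        best = max(tags, key=lambda t: delta[t])
        seq = [best]
        for ptr in reversed(backptrs):
            seq.append(ptr[seq[-1]])
        seq.reverse()
        return seq

    tags = {"DT", "NN", "VB"}
    p_trans = {"<s>": {"DT": 0.8, "NN": 0.1, "VB": 0.1},
               "DT":  {"NN": 0.9, "DT": 0.05, "VB": 0.05},
               "NN":  {"VB": 0.5, "NN": 0.3, "DT": 0.2},
               "VB":  {"DT": 0.6, "NN": 0.3, "VB": 0.1}}
    p_emit = {"DT": {"the": 0.9},
              "NN": {"dog": 0.5, "barks": 0.1},
              "VB": {"barks": 0.5}}
    print(viterbi(["the", "dog", "barks"], tags, p_trans, p_emit))
    # ['DT', 'NN', 'VB']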

One problem with this method (and indeed with all HMM taggers) is unknown words (J&M p. 314, M&S p. 351) -- words which do not appear in the training corpus.  One can assume that such words can be any part of speech, or any open-class part of speech (noun, verb, adjective, adverb), with a probability corresponding to the distribution of these parts of speech in the training corpus (or the distribution of parts of speech of words with frequency 1 in the training corpus).  Some improvement can be obtained by computing probabilities based on the suffix (last few characters) of a word.  For example, a word ending in "-ly" is probably an adverb, and one ending in "-ing" is probably a present participle (in inflected languages, suffixes are an even clearer indication of part of speech).  Finally, part-of-speech lists without probabilities may be available from dictionaries.
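A hedged sketch of the suffix idea in Python (the suffix lengths, the frequency-1 cutoff, and the NN fallback are illustrative choices rather than any particular published scheme):

    from collections import Counter

    # Estimate tag distributions for word suffixes from the rare words in a
    # tagged corpus, on the theory that rare words behave like unknown words.
    def suffix_tag_counts(tagged_tokens, max_suffix=3, max_freq=1):
        word_freq = Counter(w for w, t in tagged_tokens)
        by_suffix = {}
        for w, t in tagged_tokens:
            if word_freq[w] <= max_freq:              # rare words only
                for k in range(1, max_suffix + 1):
                    by_suffix.setdefault(w[-k:].lower(), Counter())[t] += 1
        return by_suffix

    def guess_tag(word, by_suffix, default="NN", max_suffix=3):
        for k in range(max_suffix, 0, -1):            # longest suffix first
            dist = by_suffix.get(word[-k:].lower())
            if dist:
                return dist.most_common(1)[0][0]
        return default                                # no evidence: fall back

    counts = suffix_tag_counts([("boldly", "RB"), ("eating", "VBG")])
    print(guess_tag("happily", counts))               # 'RB', via "-ly"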

Transformation-Based Tagging

Early attempts at rule-based taggers, in the 1960's and 70's, had not been very successful.  Brill (A Simple Rule-Based Part of Speech Tagger, ANLP 1992) addressed the question of whether good rules could be produced automatically from a tagged corpus (J&M, sec. 8.6).  In his tagger, words are first tagged with their most likely tag.  The tagger then applies a number of transformations of the form 'in context C, if a word is tagged A, change its tag to B'.  The context may be that the preceding/following word is tagged X, that one of the two preceding/following words is tagged X, that the preceding word is tagged X and the following word is tagged Y, etc.  The rule learner starts with the initial tagging (each word assigned its most common tag), tries all possible transformations, and selects the one which produces the greatest improvement (maximizes errors corrected minus errors introduced).  For example, the best transformation for the Penn Tree Bank states that if the prior tag is TO and the current tag is NN, change the current tag to VB.  The transformation is applied and the rule learner operates again on the corrected corpus;  this repeats until the improvement from any rule falls below some threshold.  Reported accuracies on the Penn Tree Bank are about 96 to 96.5%.  (demo)
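To make the learning loop concrete, here is a compact Python sketch of one round of greedy rule selection, simplified to the single context template 'previous tag is X' over a flat tag sequence (the real learner tries many templates across a full corpus):

    from collections import Counter

    # Score every rule "if the previous tag is PREV and the current tag is
    # FROM, change it to TO" by (errors corrected - errors introduced), and
    # return the best one.
    def best_transformation(current, gold, tagset):
        net = Counter()
        for i in range(1, len(current)):
            prev = current[i - 1]
            if current[i] != gold[i]:
                net[(prev, current[i], gold[i])] += 1   # would fix this error
            else:
                for to in tagset - {current[i]}:
                    net[(prev, current[i], to)] -= 1    # would break this tag
        rule, score = net.most_common(1)[0]
        return rule, score

    def apply_rule(current, rule):
        prev, frm, to = rule
        return [to if i > 0 and current[i - 1] == prev and t == frm else t
                for i, t in enumerate(current)]

    # Reproduces the TO/NN/VB example from the text.
    gold    = ["TO", "VB", "DT", "NN"]
    current = ["TO", "NN", "DT", "NN"]      # initial most-frequent-tag output
    rule, score = best_transformation(current, gold, {"TO", "VB", "DT", "NN"})
    print(rule, score)                      # ('TO', 'NN', 'VB') 1
    print(apply_rule(current, rule))        # ['TO', 'VB', 'DT', 'NN']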

One benefit of transformation-based learners (TBLs) is that the rules are inspectable;  in theory, one can use a TBL to generate rules and then improve them by hand.  Satoshi Sekine has done this here at NYU with the tagger for his OAK system, pushing the performance on the PTB up to 97%.

Limitations

Problems with these taggers:  cases requiring longer context (VBN vs. VBD;  words following conjunctions). (demo)
Neither HMMs nor TBLs do well with these;  can any tagger handle such cases?

Looking ahead to next week

Training HMMs -- the forward-backward algorithm (J&M appendix D)

Using forward-backward for training a model without a tagged corpus:  the Xerox tagger

Doug Cutting, Julian Kupiec, Jan Pedersen, and Penelope Sibun.  A Practical Part-of-Speech Tagger.  (ANLP 1992)

Constraint grammar tagger

Pasi Tapanainen and Atro Voutilainen.  Tagging accurately - Don't guess if you know.  (ANLP 1994)
Christer Samuelsson and Atro Voutilainen.  Comparing a Linguistic and a Stochastic Tagger.  (ACL 1997)

Maximum entropy modeling

Adwait Ratnaparkhi.  A Maximum Entropy Model for Part-Of-Speech Tagging.  (EMNLP 1996)