G22.2591 - Advanced Topics in Natural Language Processing - Spring 2011 - Prof. Grishman

Corpus-Based Methods in NLP


Lecture 1 Outline

January 25, 2011

Introduction

Why do we need NLP?
To perform these tasks we need to analyze the language at many levels
Each of these levels of analysis requires detailed information about the language

Shift starting in the late 1980s (Church [ANLP 1988], DeRose [CL 1988]) to training analyzers from corpora
Supervised methods are less successful for semantic tasks
Types of learning
Goal of this course:  look at a number of NLP tasks and see how they have been addressed by supervised and semi-supervised learning.  In particular, we will consider what properties of the data allow us to learn from unannotated data:  the distributional hypothesis, parallel data.

Course organization:

References

Almost all of the papers we will look at have been published through the Assn. for Computational Linguistics (ACL).  It maintains a fairly complete on-line archive of proceedings going back 30 years, and in some cases 40 years, at http://aclweb.org/anthology-new/ .  Studying the original papers will give some historical perspective on how the field is developing.  However, I will also give references to two excellent texts which are suitable for background reading in natural language processing and basic statistical methods:

Daniel Jurafsky and James Martin, Speech and Language Processing.  Prentice Hall, first edition, 2000;  second edition, 2008. (The textbook for the basic natural language processing course;  cited as J&M)

Christopher Manning and Hinrich Schütze, Foundations of Statistical Natural Language Processing.  MIT Press, 1999. (cited as M&S)

An earlier and less comprehensive book on corpus-based methods is:  Eugene Charniak, Statistical Language Learning.  MIT Press, 1993.

A recent book on some of the topics of this course is: Steven Abney, Semisupervised Learning for Computational Linguistics, Chapman and Hall, 2008.

Standard abbreviations for citations (all available through the ACL archive):

    ACL = Proc. of the Annual Conference of the Assn. for Computational Linguistics
    NAACL = Proc. of the Conf. of the North American Chapter of the ACL
    EACL = Proc. of the Conf. of the European Chapter of the ACL
    ANLP = Proc. of the Conf. on Applied Natural Language Processing (ended 2000)
    COLING = Proc. of Int'l Conf. on Computational Linguistics
    EMNLP = Empirical Methods in Natural Language Processing
    CL = Computational Linguistics (journal)

Perspectives

Classifiers:

Semi-supervised learning strategies:  self-training

procedure selfTrain (L0, U) {
  // L0 is the labeled data, U the unlabeled data
  c = train(L0)                        // train initial classifier on the seed data
  loop until (stopping criterion is met) {
    // label the unlabeled data with the current classifier,
    // then select the most confidently labeled examples
    L = L0 + select (label (U, c))
    c = train(L)                       // retrain on the augmented labeled set
  }
  return c
}
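The procedure above can be made concrete with a small runnable sketch.  Here train, label, and select are simple stand-ins (a one-dimensional nearest-centroid classifier on toy data, with the margin between the two nearest centroids as the confidence score), not any particular NLP system; the stopping criterion is a fixed number of rounds.

```python
def train(labeled):
    """Fit a nearest-centroid 'classifier': the mean feature value per class."""
    sums, counts = {}, {}
    for x, y in labeled:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def label(unlabeled, c):
    """Label each point with its nearest centroid; confidence = margin
    between the nearest and second-nearest centroid."""
    out = []
    for x in unlabeled:
        ranked = sorted(c, key=lambda y: abs(x - c[y]))
        best = ranked[0]
        margin = (abs(x - c[ranked[1]]) - abs(x - c[best])
                  if len(ranked) > 1 else 1.0)
        out.append((x, best, margin))
    return out

def select(guesses, threshold=1.0):
    """Keep only the confidently labeled examples (margin above threshold)."""
    return [(x, y) for x, y, m in guesses if m >= threshold]

def self_train(L0, U, rounds=5):
    c = train(L0)
    for _ in range(rounds):            # stopping criterion: fixed round count
        L = L0 + select(label(U, c))
        c = train(L)
    return c

L0 = [(0.0, 'a'), (10.0, 'b')]         # two seed examples
U = [1.0, 2.0, 8.0, 9.0, 5.2]          # unlabeled pool
model = self_train(L0, U)
```

On this data the four clearly separated points are absorbed into the labeled set, while the ambiguous point 5.2 never clears the confidence threshold and stays unused; this illustrates why the select step matters, since adding low-confidence labels would let errors reinforce themselves across rounds.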

Looking ahead:  papers targeted for student presentations for next week

Daniel M. Bikel; Scott Miller; Richard Schwartz; Ralph Weischedel.  Nymble: a High-Performance Learning Name-finder.  Proc. ANLP 97.

Andrew Borthwick, John Sterling, Eugene Agichtein, and Ralph Grishman.  Exploiting Diverse Knowledge Sources via Maximum Entropy in Named Entity Recognition.  Proc. Sixth Workshop on Very Large Corpora (WVLC), 1998.