Statistical Natural Language Processing

Course#: CSCI-GA.3033-001
Instructor: Slav Petrov
Lecture: Thursdays 5:00-6:50PM, Room 312 Warren Weaver Hall
Mailing List:
Office hours: By appointment

Class Summary:

In this course we will explore statistical, model-based approaches to natural language processing. There will be a focus on corpus-driven methods that make use of supervised and unsupervised machine learning approaches and algorithms. We will examine some of the core tasks in natural language processing, starting with simple word-based models for text classification and building up to rich, structured models for syntactic parsing and machine translation. In each case we will discuss recent research progress in the area and how to design efficient systems for practical user applications.
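As a taste of the simplest of these tasks, here is a minimal sketch of a word-based Naive Bayes text classifier in Java (the course's programming language). This is purely illustrative, not the course code base; the class and method names are made up, and it uses add-one smoothing over a shared vocabulary:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical word-based Naive Bayes classifier: picks the label
// maximizing log P(label) + sum over words of log P(word | label),
// estimating P(word | label) from counts with add-one smoothing.
public class NaiveBayes {
    private final Map<String, Map<String, Integer>> wordCounts = new HashMap<>();
    private final Map<String, Integer> labelCounts = new HashMap<>();
    private final Set<String> vocab = new HashSet<>();
    private int totalDocs = 0;

    public void train(String label, String text) {
        totalDocs++;
        labelCounts.merge(label, 1, Integer::sum);
        Map<String, Integer> counts =
            wordCounts.computeIfAbsent(label, k -> new HashMap<>());
        for (String w : text.toLowerCase().split("\\s+")) {
            counts.merge(w, 1, Integer::sum);
            vocab.add(w);
        }
    }

    public String classify(String text) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String label : labelCounts.keySet()) {
            // Log prior: fraction of training documents with this label.
            double score = Math.log(labelCounts.get(label) / (double) totalDocs);
            Map<String, Integer> counts = wordCounts.get(label);
            int total = counts.values().stream().mapToInt(Integer::intValue).sum();
            for (String w : text.toLowerCase().split("\\s+")) {
                int c = counts.getOrDefault(w, 0);
                // Add-one smoothed log likelihood of each word.
                score += Math.log((c + 1.0) / (total + vocab.size()));
            }
            if (score > bestScore) {
                bestScore = score;
                best = label;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        NaiveBayes nb = new NaiveBayes();
        nb.train("sports", "the team won the game");
        nb.train("politics", "the senate passed the bill");
        System.out.println(nb.classify("the game was close")); // prints "sports"
    }
}
```

The assignments elaborate on exactly this kind of skeleton: a simple generative model first, then error analysis and redesign (e.g., better features or discriminative training).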

This course assumes a good background in basic probability and strong ability and interest in programming (in Java). There will be five course assignments and a final project. In the assignments, you will construct basic systems for core NLP tasks and then improve them through a cycle of error analysis and model redesign. The assignments make up 40% of your final grade. For the final project you will be able to choose a single topic or application and investigate it in greater depth. The final project will be 30% of your final grade. The remaining 30% will come from class participation.

The class is open to graduate as well as undergraduate students.


The primary textbook for this course will be:
Jurafsky and Martin, Speech and Language Processing, Second Edition
Make sure to get the purple second edition of the book and not the white first edition.

The following book is useful as secondary literature, especially since it is available online:
Manning and Schuetze, Foundations of Statistical Natural Language Processing

Additional readings will come from recent research papers.

Syllabus (subject to change):

9/6: Introduction & Language Modeling
  Textbook: Jurafsky & Martin Chapters 1, 4 (or Manning & Schuetze Chapters 1-3, 6)
  Papers: MT Tutorial, Chen & Goodman, Kneser-Ney, Large Language Models
  Assignments: Assignment 1 out

9/13: Text Classification via Naive Bayes & Maximum Entropy
  Papers: Classification Tutorial, MaxEnt Tutorial, Generative and Discriminative Classifiers

9/20: More Classification
  Textbook: Jurafsky & Martin Chapters 6.6, 19.1, 20 (or Manning & Schuetze Chapter 7)
  Papers: Graphical Models, Latent Dirichlet Allocation
  Assignments: Assignment 1 due, Assignment 2 out

9/27: Part-of-Speech Tagging
  Textbook: Jurafsky & Martin Chapter 5 (or Manning & Schuetze Chapter 10)
  Papers: TnT Tagger, Toutanova & Manning '00

10/4: Advanced Part-of-Speech Tagging
  Textbook: Jurafsky & Martin Chapter 6 (or Manning & Schuetze Chapter 9)
  Papers: Merialdo '94, CRFs, Johnson '07, Bilingual POS Induction
  Assignments: Assignment 2 due, Assignment 3 out

10/11: Word Alignments
  Textbook: Jurafsky & Martin Chapter 25 (or Manning & Schuetze Chapter 13)
  Papers: MT Tutorial, Overview, IBM Models, HMM-Alignments, Agreement

10/18: Phrase-Based Machine Translation
  Textbook: Chapters 5 & 6 available online
  Papers: Decoding, Phrases, Moses
  Assignments: Assignment 3 due, Assignment 4 out

10/25: Syntactic Parsing
  Textbook: Jurafsky & Martin Chapters 12, 13 (or Manning & Schuetze Chapters 3, 11), Dependency Parsing Chapters 1-4 (available from NYU network)
  Papers: Best-First, A*, K-Best, Shift-Reduce Parsing
  Assignments: Project Proposal due

11/1: Hurricane Sandy (no class)

11/8: Advanced Constituency Parsing
  Textbook: Jurafsky & Martin Chapter 14 (or Manning & Schuetze Chapter 12)
  Papers: Unlexicalized, Lexicalized, Latent Variable
  Assignments: Assignment 4 due, Assignment 5 out

11/15: Semantic Parsing

11/22: Thanksgiving (no class)
  Assignments: Assignment 5 due

11/29: Hierarchical (Syntax-Based) Machine Translation
  Papers: Hiero, GHKM, Syntax vs. Phrases, Synchronous Grammars

12/6: Sentiment Analysis, Summarization
  Textbook: Pang & Lee Sentiment Analysis Book, Jurafsky & Martin Chapters 23.3-23.7
  Papers: Sentiment: Aspects, Lexicons, Summarization; Summarization: Query, N-Gram, Topical

12/13: Final Presentations
  Papers: How to (not) write a paper, How to give a talk
  Assignments: Project due 12/20 (at midnight)


There will be five assignments in which you will build systems for various NLP tasks. You will be provided with a (Java) code base that handles most of the basic infrastructure, so you will only need to write the core components. For the final project you can choose any NLP topic, even one that was not covered in class. The scope of the project should be comparable to that of two assignments. Working in groups is encouraged for the final project. More guidelines and advice will be provided in class.