Statistical Natural Language Processing

Course#: CSCI-GA.3033-001
Instructor: Slav Petrov
Lecture: Tuesdays 5:10-7:00PM, Warren Weaver Hall Room 317
Mailing List: On Piazza
Office hours: By appointment

Announcements:

9/01/13: We will be using Piazza for class discussions. Please follow this link.
7/02/13: Webpage is up.

Class Summary:

In this course we will explore statistical, model-based approaches to natural language processing. There will be a focus on corpus-driven methods that make use of supervised and unsupervised machine learning approaches and algorithms. We will examine some of the core tasks in natural language processing, starting with simple word-based models for text classification and building up to rich, structured models for syntactic parsing and machine translation. In each case we will discuss recent research progress in the area and how to design efficient systems for practical user applications.

This course assumes a good background in basic probability and both a strong ability and an interest in programming (in Java). There will be five course assignments and a final project. In the assignments, you will construct basic systems for core NLP tasks and then improve them through a cycle of error analysis and model redesign. The assignments make up 40% of your final grade. For the final project you will be able to choose a single topic or application and investigate it in greater depth. The final project will be 30% of your final grade. The remaining 30% will come from class participation.

The class is open to graduate as well as undergraduate students.

Textbooks:

The primary textbook for this course will be:
Jurafsky and Martin, Speech and Language Processing, Second Edition
Make sure to get the purple second edition of the book and not the white first edition.

The following book is useful as secondary literature, especially since it is available online:
Manning and Schuetze, Foundations of Statistical Natural Language Processing

Additional readings will come from recent research papers.

Syllabus:

Date | Topic | Textbook Reading | Recent Papers Reading | Assignments
9/3 | Introduction & Language Modeling | Jurafsky & Martin Chapters 1, 4 (or Manning & Schuetze Chapters 1-3, 6) | MT Tutorial, Chen & Goodman, Kneser-Ney, Large Language Models | Assignment 1 out
9/10 | Text Classification via Naive Bayes & Maximum Entropy | | Classification Tutorial, MaxEnt Tutorial, Generative and Discriminative Classifiers | Assignment 2 out
9/17 | More Classification, Clustering | Jurafsky & Martin Chapters 6.6, 19.1, 20 (or Manning & Schuetze Chapter 7) | Graphical Models, Latent Dirichlet Allocation | Assignment 1 due
9/24 | Part-of-Speech Tagging | Jurafsky & Martin Chapter 5 (or Manning & Schuetze Chapter 10) | TnT Tagger, Toutanova & Manning '00 | Assignment 2 due, Assignment 3 out
10/1 | Advanced Part-of-Speech Tagging | Jurafsky & Martin Chapter 6 (or Manning & Schuetze Chapter 9) | Merialdo '94, CRFs, Johnson '07, Bilingual POS Induction |
10/8 | Word Alignments | Jurafsky & Martin Chapter 25 (or Manning & Schuetze Chapter 13) | MT Tutorial, Overview, IBM Models, HMM-Alignments, Agreement | Assignment 3 due, Assignment 4 out
10/15 | Fall Break | | |
10/22 | Phrase-Based Translation | Chapters 5 & 6 (available online) | Decoding, Phrases, Moses | Assignment 4 due
10/29 | Constituency Parsing | Jurafsky & Martin Chapters 12, 13, 14 (or Manning & Schuetze Chapters 3, 11, 12) | Best-First, Unlexicalized, Lexicalized, Latent Variable | Project Proposal due, Assignment 5 out
11/5 | Dependency Parsing | Dependency Parsing Chapters 1-4 (available from NYU network) | MST Parsing, Vine-Pruning, Shift-Reduce Parsing |
11/12 | Hierarchical (Syntax-Based) Translation | | Hiero, GHKM, Syntax vs. Phrases, Synchronous Grammars | Assignment 5 due
11/19 | Semantic Parsing | Jurafsky & Martin Chapters 17, 18, 19, 20.9 | Joint, ILP for SRL Framenet Parsing, Compositional, Question Answering |
11/26 | Deep Learning for NLP | | Guide, LMs, Semantic Compositionality |
12/3 | Sentiment Analysis, Summarization | Pang & Lee Sentiment Analysis Book, Jurafsky & Martin Chapter 23.3 - 23.7 | Sentiment: Aspects, Lexicons, Summarization; Summarization: Query, N-Gram, Topical |
12/10 | Final Presentations | | How to (not) write a paper, How to give a talk | Project due 12/20 (at midnight)

Assignments:

There will be five assignments in which you will build systems for various NLP tasks. You will be provided with a (Java) code base that handles most of the basic infrastructure, so you will only need to write the core components. For the final project you can choose any NLP topic, even one that was not covered in class. The scope of the project should be comparable to that of two assignments. Working in groups is encouraged for the final project. More guidelines and advice will be provided in class.

Leaderboard:

Final standings, sorted by average rank.

Rank | Handle | #1 WER | #2 Accuracy | #3 Accuracy | #4 AER | #5 Accuracy

The leaderboard is based on code from Wang Ling and Chris Dyer, which in turn is based on this course.
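
For readers curious how a "sorted by average rank" standing can be computed, here is a minimal Java sketch. It is not the actual leaderboard code: the class name, the handles, and the scores are all made up for illustration. It assumes that every handle has a score for every assignment, that ties within an assignment are broken by insertion order, and that error-rate metrics (WER, AER) are ranked lowest-first while accuracy is ranked highest-first.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not the course's leaderboard implementation.
class AverageRankLeaderboard {

    /** Ranks handles on one assignment; rank 1 is best. Ties keep insertion order. */
    static Map<String, Integer> rankOne(Map<String, Double> scores, boolean lowerIsBetter) {
        List<String> handles = new ArrayList<>(scores.keySet());
        Comparator<String> byScore = Comparator.comparingDouble(h -> scores.get(h));
        // List.sort is stable, so equal scores stay in their original order.
        handles.sort(lowerIsBetter ? byScore : byScore.reversed());
        Map<String, Integer> ranks = new LinkedHashMap<>();
        for (int i = 0; i < handles.size(); i++) {
            ranks.put(handles.get(i), i + 1);
        }
        return ranks;
    }

    /** Orders handles by their mean rank across all assignments (best first). */
    static List<String> standings(List<Map<String, Integer>> perAssignmentRanks) {
        Map<String, Double> rankSum = new LinkedHashMap<>();
        for (Map<String, Integer> ranks : perAssignmentRanks) {
            for (Map.Entry<String, Integer> e : ranks.entrySet()) {
                rankSum.merge(e.getKey(), e.getValue().doubleValue(), Double::sum);
            }
        }
        int n = perAssignmentRanks.size();
        List<String> handles = new ArrayList<>(rankSum.keySet());
        handles.sort(Comparator.comparingDouble(h -> rankSum.get(h) / n));
        return handles;
    }

    /** Builds made-up scores for two assignments and returns the final order. */
    static List<String> demoStandings() {
        Map<String, Double> wer = new LinkedHashMap<>();   // error rate: lower is better
        wer.put("alice", 0.30);
        wer.put("bob", 0.25);
        wer.put("carol", 0.40);
        Map<String, Double> accuracy = new LinkedHashMap<>(); // higher is better
        accuracy.put("alice", 0.90);
        accuracy.put("bob", 0.85);
        accuracy.put("carol", 0.88);
        return standings(Arrays.asList(rankOne(wer, true), rankOne(accuracy, false)));
    }

    public static void main(String[] args) {
        // Average ranks: alice 1.5, bob 2.0, carol 2.5
        System.out.println(demoStandings()); // prints [alice, bob, carol]
    }
}
```

Averaging ranks rather than raw scores keeps metrics on different scales (error rates versus accuracies) comparable, at the cost of ignoring how large the gaps between submissions are.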