Statistical Natural Language Processing

Course#: CSCI-GA.3033-001
Instructor: Slav Petrov
Lecture: Tuesdays 5:00-6:50PM, Room 312 Warren Weaver Hall
Mailing List: csci_ga_3033_001_fa11@cs.nyu.edu
Office hours: By appointment

Announcements:

10/22/11: Project Proposals are due Nov. 1, before class.
10/22/11: Assignment 4 is out, due Nov. 8, before class.
10/12/11: Assignment 3 is out, due Oct. 25, before class.
10/5/11: First part of assignment 2 is now optional.
10/5/11: Extra class on language modeling scheduled for Oct. 11.
9/27/11: Assignment 2 is out, due Oct. 11, at 5 p.m.
9/17/11: Note the room change!
9/17/11: Posted lecture notes for remaining classes in September.
9/17/11: Posted class schedule for rest of semester (subject to change based on interest).
9/13/11: Assignment 1 is out, due Sept. 27, before class.
9/13/11: Posted lecture notes and reading for first classes.
8/28/11: Posted lecture slides for first class.
5/5/11: Webpage is up.

Class Summary:

In this course we will explore statistical, model-based approaches to natural language processing. There will be a focus on corpus-driven methods that make use of supervised and unsupervised machine learning approaches and algorithms. We will examine some of the core tasks in natural language processing, starting with simple word-based models for text classification and building up to rich, structured models for syntactic parsing and machine translation. In each case we will discuss recent research progress in the area and how to design efficient systems for practical user applications.

This course assumes a good background in basic probability, as well as strong programming ability (in Java) and an interest in programming. There will be four course assignments and a final project. In the assignments, you will construct basic systems for core NLP tasks and then improve them through a cycle of error analysis and model redesign. Each of the four assignments is worth 10% of your final grade. For the final project you will be able to choose a single topic or application and investigate it in greater depth. The final project is worth 30% of your final grade. The remaining 30% will come from class participation.

The class is open to graduate as well as undergraduate students.

Textbooks:

The primary textbook for this course will be:
Jurafsky and Martin, Speech and Language Processing, Second Edition
Make sure to get the purple second edition of the book and not the white first edition.

The following book is useful as secondary literature, especially since it is available online:
Manning and Schuetze, Foundations of Statistical Natural Language Processing

Additional readings will come from recent research papers.

Syllabus:

Date | Topic | Textbook Reading | Recent Papers Reading | Assignments
9/6 | Introduction | Jurafsky & Martin Chapter 1 (or Manning & Schuetze Chapters 1-3) | |
9/13 | Language Models | Jurafsky & Martin Chapter 4 (or Manning & Schuetze Chapter 6) | MT Tutorial, Chen & Goodman, Kneser-Ney, Large Language Models | Assignment 1 out
9/20 | Text Classification, Word Sense Disambiguation | | Classification Tutorial, MaxEnt Tutorial, Generative and Discriminative Classifiers |
9/27 | More Classification | Jurafsky & Martin Chapters 6.6, 19.1, 20 (or Manning & Schuetze Chapter 7) | Graphical Models, Latent Dirichlet Allocation | Assignment 1 due, Assignment 2 out
10/4 | Part-of-Speech Tagging | Jurafsky & Martin Chapter 5 (or Manning & Schuetze Chapter 10) | TnT Tagger, Toutanova & Manning '00 |
10/11 | Language Modeling Intricacies | | | Assignment 2 due, Assignment 3 out
10/18 | Advanced Part-of-Speech Tagging | Jurafsky & Martin Chapter 6 (or Manning & Schuetze Chapter 9) | Merialdo '94, CRFs, Prototype-Driven Induction, Johnson '07 |
10/25 | Word Alignments | Jurafsky & Martin Chapter 25 (or Manning & Schuetze Chapter 13) | MT Tutorial, Overview, IBM Models, HMM-Alignments, Agreement | Assignment 3 due, Assignment 4 out
11/1 | Phrase-Based Translation | Handout provided in class | Decoding, Phrases, Moses | Project Proposal due
11/8 | Syntactic Parsing | Jurafsky & Martin Chapters 12, 13 (or Manning & Schuetze Chapters 3, 11), Dependency Parsing Chapters 1-4 (available from NYU network) | Best-First, A*, K-Best, MST Parsing, Shift-Reduce Parsing | Assignment 4 due
11/15 | Advanced Syntactic Parsing | Jurafsky & Martin Chapter 14 (or Manning & Schuetze Chapter 12) | Unlexicalized, Lexicalized, Latent Variable, 3rd Order Parsing, Dual Decomposition |
11/22 | Hierarchical (Syntax-Based) Translation | | Hiero, GHKM, Syntax vs. Phrases, Synchronous Grammars |
11/29 | Sentiment Analysis, Summarization | Pang & Lee Sentiment Analysis Book, Jurafsky & Martin Chapter 23.3 - 23.7 | Sentiment: Aspects, Lexicons, Summarization; Summarization: Query, N-Gram, Topical |
12/6 | Final Presentations | | How to (not) write a paper, How to give a talk | Project due 12/18

Assignments:

There will be four assignments in which you will build systems for various NLP tasks. You will be provided with a (Java) code base that handles most of the basic infrastructure, so you will only need to write the core components.