This course will consider how methods of natural language processing
can be used to bridge this gap: to extract information automatically
from text, creating databases from news, scientific publications,
and medical records. When coupled with social media, such extraction
can offer near-real-time event tracking. We will
consider several levels of text analysis, including word- and phrase-level
analysis (for document retrieval and sentiment analysis), syntactic analysis
(grammars and parsing), semantic analysis (word and sentence meaning),
and discourse analysis (pronoun resolution and text structure).
We will use both systems based on hand-coded rules and those trained
automatically from corpora using statistical methods.
During the course you will use and extend JET, a suite of text processing tools written in Java, building up all the basic components of an information extraction system. There will be 10 small weekly assignments (some 'paper-and-pencil', some running and modifying JET), a term project, and a final exam.
Students should have
CSCI-GA.1180, Mathematical Techniques for Computer Science Applications, while not a prerequisite, does provide a solid background for the statistical methods employed in this course.
Textbook: Jurafsky and Martin, Speech and Language Processing (Second edition, Prentice Hall)
For further information, you can consult last year's web pages at http://cs.nyu.edu/courses/spring13/CSCI-GA.2590-001/index.html. You may also want to look at the pages of our natural language research group, the Proteus Project, and in particular its publication list.
Note: Earlier versions of this course used Lisp, and this language is mentioned in earlier course descriptions. While we still do some research using Lisp, the course will primarily use Java. For Python fans, we will also make available Python versions of some components.
For further information, contact Prof. Grishman.