CSCI-GA.2590 - Natural Language Processing - Spring 2013 Prof. Grishman

Term Project

March 12, 2013

You must submit a term project on material connected to the course; this is worth 30% of your grade.  You have wide latitude in what you do for the project.  It may be a project based on Jet, a separate programming project, or a research paper.  A Jet or programming project must be accompanied by a separate, well-written description of the project;  an analysis of the data and your system's performance will be an important part of the grade.

Joint projects are permitted.

The general idea is to do something interesting which will require you to confront some of the 'real problems' of doing NLP.  Real NLP is hard ... don't be too ambitious, or at least have a fall-back plan if your ambitions are not realized.

Possible projects

  1. In this course, we present the use of Jet for a specific extraction task -- 'executive succession' -- using the pattern matching tools in Jet.  One possible project involves adapting the system to other event or relation types.  For the past several years we have participated in a multi-site evaluation organized by NIST, the English Slot Filling task of Knowledge Base Population.   This involves extracting information about people and organizations comparable to what appears in Wikipedia infoboxes;  for example, for a person, their date of birth, place of birth, date of death (if no longer alive), spouse, employer, organizational memberships, title, etc.;  for an organization, their date of founding,  headquarters location, number of employees, top employees, website, etc.  We would expect you to select a small subset of either person or organization properties.  You would start by gathering some examples of your own and writing some patterns; later in the semester we can provide some training and test data.
  2. As an alternative to KBP on general news, you could do a richer analysis of some narrow sublanguage within the news, such as weather forecasts, death notices, cooking recipes, sports results, etc.  If you are considering one of these, you should begin by marking up a few documents by hand to see what is feasible.  You can do this extraction either with Jet or with your own program (e.g., with Perl).
  3. Extract time expressions (including actual dates, "Friday", "last month", "two weeks ago") and normalize them (figure out a date or date range, given the date of an article).
  4. Extension of Jet syntactic patterns to more constructs (e.g., a rich variety of modifiers for noun and verb groups).  This should include some performance analysis.
  5. Building your own HMM and training it to identify names, noun groups, or time expressions.
  6. Feature engineering:  training a maximum entropy tagger to identify names, noun groups, or time expressions, extending the feature set to get good performance.
  7. Implementing Brin's method for finding a relationship from the Web (requires ability to do Web queries automatically).
  8. Preparing a context-free semantic grammar and dictionary for a sublanguage, and testing it on a small sample of the sublanguage.
  9. Foreign language analysis:  building a POS tagger or even a chunker for another language.
  10. Research report on some topic not covered in the course (e.g., morphological analysis of morphologically rich languages, question answering, summarization, or machine translation methods).  The paper should show some understanding of what problems have and have not been addressed by current technology.
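
To give a feel for the scale of project 3 (time expressions), here is a minimal Python sketch, not a reference solution.  It handles only the three kinds of expression mentioned in that item ("Friday", "last month", "two weeks ago").  The names extract and normalize are made up for this illustration, and the rule that a bare weekday refers to the most recent such day on or before the article date is an assumption you would want to check against real news text.

```python
import re
from datetime import date, timedelta

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]
NUMBER_WORDS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
                "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10}

# A deliberately tiny pattern: weekday names, "last month", "<n> day(s)/week(s) ago".
TIMEX = re.compile(
    r"\b(?:" + "|".join(WEEKDAYS)
    + r"|last month"
    + r"|(?:\d+|" + "|".join(NUMBER_WORDS) + r") (?:day|week)s? ago"
    + r")\b",
    re.IGNORECASE,
)


def extract(text):
    """Return the time expressions found in text, in order of appearance."""
    return [m.group(0) for m in TIMEX.finditer(text)]


def normalize(expr, article_date):
    """Map a time expression to a (start, end) date range, or None if unknown."""
    text = expr.lower().strip()

    if text in WEEKDAYS:
        # Assumption: a bare weekday is the most recent such day
        # on or before the article date.
        back = (article_date.weekday() - WEEKDAYS.index(text)) % 7
        day = article_date - timedelta(days=back)
        return (day, day)

    if text == "last month":
        # Last day of the previous month, then back to its first day.
        end = article_date.replace(day=1) - timedelta(days=1)
        return (end.replace(day=1), end)

    m = re.fullmatch(r"(\w+) (day|week)s? ago", text)
    if m:
        word, unit = m.groups()
        n = int(word) if word.isdigit() else NUMBER_WORDS.get(word)
        if n is None:
            return None
        day = article_date - timedelta(days=n * (7 if unit == "week" else 1))
        return (day, day)

    return None


if __name__ == "__main__":
    article_date = date(2013, 3, 12)
    for expr in extract("The board met Friday, two weeks ago."):
        print(expr, normalize(expr, article_date))
```

For an article dated March 12, 2013 (a Tuesday), this sketch maps "Friday" to March 8, "two weeks ago" to February 26, and "last month" to the range February 1 through February 28.  A real project would need many more patterns (absolute dates, "next week", "the first quarter") and an evaluation against hand-annotated text.
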

What were the best recent projects?

extraction systems for {criminal verdicts, lay-offs, weather reports, sports game summaries, death notices}, including evaluation (using Jet and using Python)
extracting family relationships from the Bible
noun / verb group patterns, with evaluation on larger corpus (using Jet)
question-answering (NL data base interface) system for {stock quotes, train schedules} (from scratch)
a parser for feature grammars (on top of Jet)
name recognizer / linker for Web pages (using HMM)
literature surveys of {coreference analysis, IE pattern learning}

Due dates

Brief project description: March 29th; 1-2 paragraphs, email to

Project: May 7th (last class)

   2% penalty for each weekday late