Candidate: Roman Yangarber
Advisor: Ralph Grishman

Scenario Customization for Information Extraction

1:00 p.m., Thursday, September 28, 2000
12th floor conference room, 719 Broadway


Information Extraction (IE) is an emerging NLP technology whose function is to process unstructured natural language text, locate specific pieces of information, or facts, in the text, and use these facts to fill a database. IE systems today are commonly based on pattern matching. The core IE engine uses a cascade of pattern sets of increasing linguistic complexity. Each pattern consists of a regular expression and an associated mapping from syntactic to logical form. The pattern sets are customized for each new topic, as defined by the set of facts to be extracted.
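
As a rough illustration of what "a regular expression and an associated mapping" can look like, the following Python sketch pairs one regular expression with a function that builds a logical form from the match. It is not the system described in the thesis: the `Pattern` class, the `extract` function, the management-succession example and the field names are all assumptions made for this sketch, and the cascade of pattern sets is not modeled.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Pattern:
    """One pattern: a regular expression plus a mapping to a logical form."""
    regex: re.Pattern
    to_logical_form: Callable[[re.Match], dict]


# Hypothetical pattern for a "management succession" topic (assumed example).
PATTERNS = [
    Pattern(
        regex=re.compile(
            r"(?P<company>[A-Z]\w+) (?:appointed|named) "
            r"(?P<person>[A-Z]\w+ [A-Z]\w+) as (?P<post>\w+)"
        ),
        to_logical_form=lambda m: {
            "event": "hire",
            "company": m["company"],
            "person": m["person"],
            "post": m["post"],
        },
    ),
]


def extract(sentence: str) -> Optional[dict]:
    """Return the logical form of the first matching pattern, or None."""
    for pattern in PATTERNS:
        match = pattern.regex.search(sentence)
        if match:
            return pattern.to_logical_form(match)
    return None


fact = extract("Acme appointed Jane Smith as president.")
```

Each returned dictionary corresponds to one database record; a sentence that matches no pattern yields `None`.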

Construction of a pattern base for a new topic is recognized as a time-consuming and expensive process, and as a principal roadblock to wider use of IE technology. An effective pattern base must be precise and must have wide coverage. This thesis addresses the problem of porting an IE system to a new topic in two stages.

First, we introduce a set of tools for building patterns manually from examples. To adapt the IE system to a new subject domain quickly, the user chooses a set of example sentences from a training text, and specifies how each example maps to the extracted event--its logical form. The system then applies meta-rules to transform the example automatically into a general set of patterns. This effectively shifts the portability bottleneck from building patterns to finding good examples.
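
The idea of expanding one example into several patterns can be sketched as follows. This is only an assumed illustration, not the thesis's meta-rule mechanism: the two meta-rules (`active`, `passive`), the `generalize` and `match_event` helpers, and the role names are all hypothetical, and proper names are approximated by capitalized words.

```python
import re

NAME = r"[A-Z]\w+"  # crude stand-in for a proper name (assumption)


def active(subj, verb, obj):
    # "Acme hired Smith"
    return rf"(?P<{subj}>{NAME}) {verb['past']} (?P<{obj}>{NAME})"


def passive(subj, verb, obj):
    # "Smith was hired by Acme"
    return rf"(?P<{obj}>{NAME}) (?:was|were) {verb['participle']} by (?P<{subj}>{NAME})"


# Hypothetical meta-rules: each turns one example into one syntactic variant.
META_RULES = [active, passive]


def generalize(subj_role, verb, obj_role):
    """Expand one user example into a list of compiled patterns."""
    return [re.compile(rule(subj_role, verb, obj_role)) for rule in META_RULES]


def match_event(patterns, sentence):
    """Return the role assignment from the first matching pattern, or None."""
    for pattern in patterns:
        match = pattern.search(sentence)
        if match:
            return match.groupdict()
    return None


# The user supplies a single example: "Acme hired Smith" with the roles below.
hire_patterns = generalize(
    "company", {"past": "hired", "participle": "hired"}, "person"
)
```

Under these assumptions, both "Acme hired Smith" and "Smith was hired by Acme" map to the same role assignment, although the user provided only the active example.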

Second, we propose a novel methodology for discovering good examples automatically from a large unannotated corpus of text. The system is initially seeded with a small set of relevant patterns provided by the user. An unsupervised learning procedure then identifies new patterns and classes of related terms on successive iterations. We present experimental results confirming that the discovered patterns are of high quality, as measured by precision and recall.
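
A seeded, iterative discovery loop of this general kind might be sketched as below. This is a generic bootstrapping illustration under stated assumptions, not the procedure evaluated in the thesis: documents are reduced to sets of candidate pattern strings, the scoring formula is an assumed heuristic, the toy corpus is invented, and the discovery of classes of related terms is not modeled.

```python
def discover_patterns(docs, seeds, iterations=3):
    """Grow a pattern set from seeds.

    docs: list of sets, each holding the candidate patterns found in one document.
    A document counts as relevant if it contains any accepted pattern.
    """
    accepted = set(seeds)
    for _ in range(iterations):
        candidates = set().union(*docs) - accepted
        best, best_score = None, 0.0
        for cand in sorted(candidates):  # sorted for a deterministic tie-break
            containing = [d for d in docs if cand in d]
            hits = sum(1 for d in containing if d & accepted)
            # Assumed heuristic: fraction of relevant documents times support.
            score = hits / len(containing) * hits
            if score > best_score:
                best, best_score = cand, score
        if best is None:  # no candidate co-occurs with accepted patterns
            break
        accepted.add(best)
    return accepted


# Invented toy corpus: each set lists the patterns seen in one document.
corpus = [
    {"company appoint person", "person resign"},
    {"company appoint person", "person succeed person"},
    {"person resign", "person succeed person"},
    {"team win game"},
    {"team win game", "company report profit"},
]
found = discover_patterns(corpus, seeds={"company appoint person"})
```

On this toy corpus the loop adds the two patterns that co-occur with the seed, directly or through an earlier discovery, and leaves the unrelated patterns out.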