This thesis proposes a novel approach for exploring Information
Extraction scenarios. Information Extraction, or IE, is a task
aiming at finding events and relations in natural language texts
that meet a user's demand. However, it is often difficult to
formulate, or even define, events that both satisfy a user's
need and are technically feasible to extract. Furthermore, most existing IE
systems need to be tuned for a new scenario with proper training
data in advance. As a result, a system designer usually needs to
understand what a user wants to know in order to maximize system
performance, while the user has to understand how the system will
perform in order to maximize their own satisfaction.
In this thesis, we focus on maximizing the variety of scenarios that the system can handle instead of trying to improve the accuracy on a particular scenario. In traditional IE systems, a relation is defined a priori by a user and is identified by a set of patterns that are manually crafted or acquired in advance. We propose a technique called Unrestricted Relation Discovery, which defers the decision of what is a relation and what is not until the very end of processing, so that a relation can be defined a posteriori. This laziness greatly broadens the types of relations the system can handle. Furthermore, we use the notion of recurrent relations to measure how useful each relation is. This way, we can discover new IE scenarios without fully specifying definitions or patterns, which leads to Preemptive Information Extraction, where a system offers a user a portfolio of extractable relations to choose from.
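As a rough illustration of deferring the relation decision and ranking candidates by recurrence, the following minimal Python sketch may help. It is not the implementation used in this thesis: the function name `discover_recurrent_relations`, the `(article_id, pattern, entity_pair)` input format, the `min_articles` threshold, and the choice to measure recurrence as the number of distinct articles are all assumptions made for illustration, and the sample data is invented.

```python
from collections import defaultdict


def discover_recurrent_relations(candidates, min_articles=2):
    """Group candidate tuples by pattern and keep the patterns that recur.

    candidates: iterable of (article_id, pattern, entity_pair) tuples
                (hypothetical format; no relation type is fixed in advance).
    Returns a list of (pattern, article_count, instances), most recurrent first.
    """
    articles_by_pattern = defaultdict(set)
    instances_by_pattern = defaultdict(list)

    # Collect everything first; nothing is accepted or rejected at this stage.
    for article_id, pattern, entity_pair in candidates:
        articles_by_pattern[pattern].add(article_id)
        instances_by_pattern[pattern].append(entity_pair)

    # Only at the very end decide what counts as a relation:
    # here (an assumption), a pattern seen in enough distinct articles.
    portfolio = [
        (pattern, len(articles), instances_by_pattern[pattern])
        for pattern, articles in articles_by_pattern.items()
        if len(articles) >= min_articles
    ]
    portfolio.sort(key=lambda item: (-item[1], item[0]))
    return portfolio


# Invented sample data, for illustration only.
candidates = [
    ("a1", "X acquired Y", ("AcmeCorp", "BetaInc")),
    ("a2", "X acquired Y", ("GammaLtd", "DeltaCo")),
    ("a3", "X was hit by Y", ("Coastville", "Storm Alpha")),
    ("a4", "X acquired Y", ("EpsilonAG", "ZetaLLC")),
    ("a4", "X visited Y", ("Minister A", "Country B")),
]
portfolio = discover_recurrent_relations(candidates)
for pattern, article_count, instances in portfolio:
    print(pattern, article_count, instances)
```

In this toy run only the pattern that appears in several articles survives the final filter; the resulting list plays the role of the portfolio from which a user would pick relations of interest.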
We used one year of news articles obtained from the Web as a development set. We discovered dozens of scenarios that are similar to existing scenarios attempted by many IE systems, as well as scenarios that are relatively novel. We evaluated the existing scenarios against the Automatic Content Extraction (ACE) event corpus and obtained reasonable performance. We believe this system will shed new light on IE research by providing a variety of experimental IE scenarios.