MNS IDM'2000 Project Report Format

AUTOMATED STRUCTURING OF TEXT INFORMATION

PI: Ralph Grishman

Co-PI: Satoshi Sekine

Computer Science Department
New York University

Contact Information

Prof. Ralph Grishman
Computer Science Department
New York University
715 Broadway, 7th Floor
New York, NY 10003
Phone: (212) 998-3497
Fax: (212) 995-4123
Email: grishman@cs.nyu.edu

PROJECT WWW PAGE

cs.nyu.edu/cs/projects/proteus

Project Award Information

Award Number: IIS-0081962
Duration: 09/01/00 - 08/31/03
Title: Automated Structuring of Text Information

Keywords

information extraction, information access

Research Goals

At present, access to the information in large-scale text collections is largely limited to keyword-based searches which retrieve entire documents or passages. While such tools are often satisfactory in retrieving information on general topics, they provide little support for accessing information involving specific relationships, events, or facts.

Information extraction technology offers the possibility of creating structured, tabular representations of selected relations from large text collections, representations which can support more detailed document querying. Until now, however, developing extraction systems for a broad range of relations has been too expensive and time-consuming for extraction to be considered as a means of document indexing. Recent developments in extraction system customization promise to ease this task substantially, and so to make this approach to document indexing feasible.

This research project will:

use corpus-based techniques to automatically identify the most common relationships within a sublanguage (the set of texts concerning a particular subject matter);
use corpus-based methods to identify the different ways in which these relations can be expressed in the text;
construct extraction systems which will extract information about these relationships from new text, building tabular summaries;
provide a user interface for querying these relationships and accessing the underlying documents.

Taken together, these tools should offer significant new capabilities for accessing the information in large text collections.
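As a rough illustration of what a tabular summary with links back to the source documents might look like, the following minimal sketch stores extracted relations as rows and filters them by field. The row layout, the names ExtractedFact and query, and the sample rows are all hypothetical and invented for illustration; they are not taken from the project's software.

```python
from typing import List, NamedTuple, Optional


class ExtractedFact(NamedTuple):
    """One row of a hypothetical tabular summary of extracted relations."""
    relation: str
    subject: str
    obj: str
    doc_id: str  # pointer back to the underlying document


def query(table: List[ExtractedFact],
          relation: Optional[str] = None,
          subject: Optional[str] = None,
          obj: Optional[str] = None) -> List[ExtractedFact]:
    """Return the rows that match every field the caller specified."""
    return [
        row for row in table
        if (relation is None or row.relation == relation)
        and (subject is None or row.subject == subject)
        and (obj is None or row.obj == obj)
    ]


# Invented sample rows, for illustration only.
facts = [
    ExtractedFact("appoint", "Acme Corp", "J. Smith", "doc-001"),
    ExtractedFact("appoint", "Widget Inc", "A. Jones", "doc-002"),
    ExtractedFact("resign", "J. Smith", "Widget Inc", "doc-003"),
]

appointments = query(facts, relation="appoint")
smith_docs = [row.doc_id for row in query(facts, subject="J. Smith")]
```

A query of this kind returns both the matching facts and the identifiers of the documents they came from, which is the sort of access that keyword search over whole documents does not directly support.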

Current Research Activities

Our focus during the first half year of this grant has been on the algorithms and infrastructure for discovering the primary patterns of a sublanguage.

Pattern discovery is a bootstrapping process which begins with the identification of some major semantic classes in the domain and some members of these classes. A sample of text in the sublanguage is tagged with these classes. The sample is then analyzed syntactically and all possible patterns (e.g., subject-verb-object patterns) are extracted. Patterns which appear in the sample with high frequency, and in particular with high frequency relative to text outside the sublanguage, are identified. These patterns may lead to the augmentation of the semantic classes and thence further pattern discovery.
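The frequency-based selection step can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's actual scoring method: patterns are represented as (subject class, verb, object class) tuples, the score is a smoothed ratio of the pattern's relative frequency inside the sublanguage sample to its relative frequency in background text, and the function name rank_patterns, the min_count threshold, and the toy data are all hypothetical.

```python
from collections import Counter


def rank_patterns(domain_patterns, background_patterns,
                  min_count=2, smoothing=1.0):
    """Rank patterns by how much more frequent they are in the
    sublanguage sample than in background text."""
    domain_counts = Counter(domain_patterns)
    background_counts = Counter(background_patterns)
    domain_total = sum(domain_counts.values())
    background_total = sum(background_counts.values())
    # Number of distinct patterns, used for add-one style smoothing.
    vocab = len(set(domain_counts) | set(background_counts))

    scored = []
    for pattern, count in domain_counts.items():
        if count < min_count:
            continue  # ignore patterns that are rare in the sample
        domain_rate = (count + smoothing) / (domain_total + smoothing * vocab)
        background_rate = ((background_counts[pattern] + smoothing)
                           / (background_total + smoothing * vocab))
        scored.append((pattern, count, domain_rate / background_rate))

    # Highest ratio first; ties broken by raw count, then by pattern.
    scored.sort(key=lambda item: (-item[2], -item[1], item[0]))
    return scored


# Invented toy data: each tuple is one subject-verb-object occurrence.
domain_sample = (
    [("COMPANY", "appoint", "PERSON")] * 3
    + [("PERSON", "resign", "POSITION")] * 2
    + [("PERSON", "say", "-")] * 3
)
background_sample = (
    [("PERSON", "say", "-")] * 6
    + [("COMPANY", "appoint", "PERSON")]
    + [("PERSON", "eat", "FOOD")]
)

ranked = rank_patterns(domain_sample, background_sample)
```

In this toy data the "say" pattern is frequent in the sample but even more frequent in background text, so it ranks below the two domain-specific patterns. In a bootstrapping loop, the top-ranked patterns would then suggest new members for the semantic classes before the next round of discovery.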

Work over the past year, on English texts from several topic areas, has demonstrated the general efficacy of this approach [1]. In particular, we have succeeded in using these patterns to create information extraction systems for one of these topics [2]. More recently, these discovery methods have been extended to Japanese [3]. Some of the modifications needed to accommodate Japanese have provided insights which may allow us to generalize the methodology for English; we intend to explore this over the next few months.

Effective pattern discovery requires a uniform representation of the syntactic relations in text. The greater the degree to which these relations are normalized (by identifying 'logical syntactic relations' and filling syntactic 'gaps'), the more successful discovery will be. However, the best syntactic analyzers for English, which are based on the Penn Tree Bank, produce an augmented surface representation. We have developed a procedure for transforming this representation into one which is more heavily normalized. We intend to use this normalized representation over the next year for our discovery procedures.
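To illustrate why normalization helps, the sketch below maps an active clause and its passive counterpart to the same logical subject-verb-object relation, so that both count toward a single pattern. It is a deliberately simplified assumption-laden example: the project's procedure operates on Penn Tree Bank style parses, whereas the flat SurfaceClause record, the LogicalRelation type, and the normalize function here are hypothetical and handle only the passive case.

```python
from typing import NamedTuple, Optional


class SurfaceClause(NamedTuple):
    """A simplified surface analysis of one clause (hypothetical format)."""
    subject: Optional[str]
    verb: str                      # lemma of the main verb
    obj: Optional[str]
    passive: bool = False
    by_agent: Optional[str] = None  # the "by ..." phrase of a passive, if any


class LogicalRelation(NamedTuple):
    """Normalized relation: logical subject, verb, logical object."""
    subject: Optional[str]
    verb: str
    obj: Optional[str]


def normalize(clause: SurfaceClause) -> LogicalRelation:
    """Map a surface clause to its logical relation.

    For a passive clause the surface subject is the logical object and
    the by-agent (if present) is the logical subject.
    """
    if clause.passive:
        return LogicalRelation(clause.by_agent, clause.verb, clause.subject)
    return LogicalRelation(clause.subject, clause.verb, clause.obj)


# "The board appointed Smith." versus "Smith was appointed by the board."
active = SurfaceClause("the board", "appoint", "Smith")
passive = SurfaceClause("Smith", "appoint", None,
                        passive=True, by_agent="the board")

same_relation = normalize(active) == normalize(passive)
```

Without this step the two sentences would yield different surface patterns and their counts would be split; a fuller normalization would also fill gaps such as the missing subjects of relative and infinitival clauses.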

Project References

[1] Roman Yangarber, Ralph Grishman, Pasi Tapanainen, and Silja Huttunen. Unsupervised Discovery of Scenario-Level Patterns for Information Extraction. Proc. Sixth Applied Natural Language Processing Conf., Seattle, WA, April-May, 2000, 282-289.

[2] Roman Yangarber, Ralph Grishman, Pasi Tapanainen, and Silja Huttunen. Automatic Acquisition of Domain Knowledge for Information Extraction. Proc. 18th Int'l Conf. on Computational Linguistics (COLING 2000), Saarbrücken, Germany, July-August 2000, 940-946.

[3] Kiyoshi Sudo, Satoshi Sekine, and Ralph Grishman. Automatic Pattern Acquisition for Japanese Information Extraction. Proc. First International Conf. on Human Language Technology Research (HLT 2001), San Diego, CA, March, 2001, 51-58.