
AUTOMATED STRUCTURING OF TEXT INFORMATION

PI: Ralph Grishman
Co-PI: Satoshi Sekine

Computer Science Department, New York University

Contact Information

Prof. Ralph Grishman
Computer Science Department
New York University
715 Broadway, 7th Floor
New York, NY 10003
Phone: (212) 998-3497
Fax: (212) 995-4123
Email: grishman@cs.nyu.edu

Project WWW Page

www.cs.nyu.edu/cs/projects/proteus

Project Award Information

Award Number: IIS-0081962
Duration: 09/01/00 - 08/31/03
Title: Automated Structuring of Text Information

Keywords

information extraction, information access

Project Summary

At present, access to the information in large-scale text collections is largely limited to keyword-based searches which retrieve entire documents or passages. While such tools are often satisfactory in retrieving information on general topics, they provide little support for accessing information involving specific relationships, events, or facts. Information extraction technology offers the possibility of creating structured, tabular representations of selected relations from large text collections, representations which can support more detailed document querying. Until now, however, developing extraction systems for a broad range of relations has been too expensive and time-consuming for extraction to be used in this way. Recent developments in extraction system customization promise to ease this task substantially, and so to make this approach to document indexing feasible.

This research project will

use corpus-based techniques to automatically identify the most common relationships within a sublanguage (the set of texts concerning a particular subject matter)
use corpus-based methods to identify the different ways in which these relations can be expressed in the text
construct extraction systems which will extract information about these relationships from new text, building tabular summaries
provide a user interface for querying these relationships and accessing the underlying documents

Taken together, these tools should offer significant new capabilities for accessing the information in large text collections.

Project Publications

[1] Kiyoshi Sudo. Japanese Information Extraction with Automatically Extracted Patterns. Proceedings of the Student Research Workshop at the 39th Annual Meeting of the Association for Computational Linguistics; Toulouse, France, July, 2001.

[2] Adam Meyers, Michiko Kosaka, Ralph Grishman, and Shubin Zhao. Covering Treebanks with GLARF. Proceedings of the Workshop on Sharing Tools and Resources for Research and Education, ACL / EACL 2001, Toulouse, France, July 2001.

[3] Ralph Grishman. Adaptive Information Extraction and Sublanguage Analysis. Working Notes of the Workshop on Adaptive Text Extraction and Mining, Seventeenth International Joint Conference on Artificial Intelligence (IJCAI-2001), Seattle, Washington, August 5, 2001.

[4] Adam Meyers, Michiko Kosaka, Satoshi Sekine, Ralph Grishman, and Shubin Zhao. Parsing and GLARFing. Proceedings of RANLP - 2001, Recent Advances in Natural Language Processing, Tzigov Chark, Bulgaria, September 2001.

[5] Yusuke Shinyama, Satoshi Sekine, Kiyoshi Sudo, and Ralph Grishman. Automatic Paraphrase Acquisition from News Articles. Proceedings of HLT 2002 (Human Language Technology Conference), San Diego, California, March 2002.

[6] Ralph Grishman, Silja Huttunen, and Roman Yangarber. Real-Time Event Extraction for Infectious Disease Outbreaks. Proceedings of HLT 2002 (Human Language Technology Conference), San Diego, California, March 2002.

[7] Satoshi Sekine, Kiyoshi Sudo, and Chikashi Nobata. Extended Named Entity Hierarchy. To appear in Proceedings of LREC 2002 (Language Resources and Evaluation Conference), Las Palmas, Canary Islands, Spain, May 2002.

[8] Silja Huttunen, Roman Yangarber, and Ralph Grishman. Diversity of Scenarios in Information Extraction. To appear in Proceedings of LREC 2002 (Language Resources and Evaluation Conference), Las Palmas, Canary Islands, Spain, May 2002.

[9] Adam Meyers, Ralph Grishman, and Michiko Kosaka. Formal Mechanisms for Capturing Regularizations. To appear in Proceedings of LREC 2002 (Language Resources and Evaluation Conference), Las Palmas, Canary Islands, Spain, May 2002.

Current Research Activities

The focus of our research has been on the problem of customization of extraction systems: easily moving an extraction system to a new task or domain. To create an extraction system for a new set of relations or events, one must determine how these relations are expressed in text, and the range of arguments with which these relations can appear. We are developing procedures to discover this information from text collections with minimal user involvement.

The extraction system uses a set of patterns to encode the different linguistic forms which can be used to express a particular semantic relation. Pattern discovery is a bootstrapping process which begins with the identification of some major semantic classes in the domain and some members of these classes. A sample of text in the domain is tagged with these classes. The sample is then analyzed syntactically and all possible patterns (e.g., subject-verb-object patterns) are extracted. Patterns which appear in the sample with high frequency, and in particular with high frequency relative to text outside the domain, are identified. Applying this method to English texts from several topic areas, we have demonstrated the general efficacy of this approach [10]. In particular, we have succeeded in using these patterns to create information extraction systems for one of these topics [11]. More recently, these discovery methods have been extended to Japanese [12, 1].
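The ranking step described above can be illustrated with a minimal sketch. It assumes that patterns have already been reduced to subject-verb-object triples over semantic classes, and it scores each pattern by the ratio of its relative frequency in the domain sample to its smoothed relative frequency in out-of-domain text. The function name, the smoothing, the threshold, and the toy triples are all assumptions made for illustration; the scoring actually used in [10] may differ.

```python
from collections import Counter


def rank_patterns(domain_patterns, background_patterns, min_count=2):
    """Rank patterns by how much more frequent they are inside the domain.

    Hypothetical helper: a simple frequency ratio, not the published metric.
    """
    domain = Counter(domain_patterns)
    background = Counter(background_patterns)
    n_domain = sum(domain.values())
    n_background = sum(background.values())
    scored = []
    for pattern, count in domain.items():
        if count < min_count:
            continue  # ignore patterns that are rare in the domain sample
        domain_rate = count / n_domain
        # Add-one smoothing so patterns unseen outside the domain
        # do not cause a division by zero.
        background_rate = (background[pattern] + 1) / (n_background + 1)
        scored.append((pattern, domain_rate / background_rate))
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored


# Toy triples over semantic classes (illustrative data only).
domain_sample = [
    ("COMPANY", "appoint", "PERSON"),
    ("COMPANY", "appoint", "PERSON"),
    ("COMPANY", "appoint", "PERSON"),
    ("PERSON", "say", "STATEMENT"),
    ("PERSON", "say", "STATEMENT"),
]
background_sample = [
    ("PERSON", "say", "STATEMENT"),
    ("PERSON", "say", "STATEMENT"),
    ("PERSON", "say", "STATEMENT"),
    ("PERSON", "visit", "LOCATION"),
]

ranked = rank_patterns(domain_sample, background_sample)
for pattern, score in ranked:
    print(pattern, round(score, 2))
```

In this toy sample the "appoint" pattern ranks above the "say" pattern, because "say" is common in text outside the domain as well.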

The procedures just described can find the patterns associated with a set of relations, but cannot identify which patterns are synonymous (that is, associated with the same relation). To determine that two patterns are paraphrases, we have made use of articles from different newspapers which report on the same event on the same day. Within these articles we can often identify (based on an overlap in names and other terms) sentences or portions of sentences conveying the same information; these sentence portions are generally paraphrases [5].
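The idea of pairing sentences by name overlap can be sketched as follows. This is a simplified illustration, assuming whitespace-tokenized sentences and a known set of names; the function names, the overlap threshold, and the example sentences are hypothetical and are not taken from [5].

```python
def shared_names(sentence_a, sentence_b, names):
    """Return the known names that occur in both sentences."""
    words_a = set(sentence_a.split())
    words_b = set(sentence_b.split())
    return words_a & words_b & names


def paraphrase_candidates(article_a, article_b, names, min_shared=2):
    """Pair sentences from two same-day articles that share enough names."""
    pairs = []
    for sent_a in article_a:
        for sent_b in article_b:
            if len(shared_names(sent_a, sent_b, names)) >= min_shared:
                pairs.append((sent_a, sent_b))
    return pairs


# Made-up sentences standing in for two newspapers' reports of one event.
known_names = {"Acme", "Smith", "Tokyo"}
article_a = [
    "Acme named Smith as president",
    "Shares rose in Tokyo",
]
article_b = [
    "Smith was appointed president of Acme",
    "Analysts in Tokyo were surprised",
]

candidates = paraphrase_candidates(article_a, article_b, known_names)
for pair in candidates:
    print(pair)
```

Only the first sentences of the two articles are paired, since they share two names; the sentences mentioning Tokyo share a single name and fall below the assumed threshold.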

These pattern discovery procedures rely in turn on the identification of words (or names) belonging to the major classes of entities for a domain. We have taken two approaches to enhance our ability to identify such terms. We have designed a large inventory of classes of names appearing in general newspaper text [7], and are developing a tagger to identify such names in text. In addition, we have developed a tool for identifying additional members of a class of names automatically from a large text corpus using a bootstrapping method. We begin with a small 'seed' set, find patterns which appear frequently as contexts of these names, and then use these patterns to find additional names; this process can be repeated to gradually enlarge the class of names.
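A minimal sketch of such a bootstrapping loop is given below. It assumes that a context is simply the pair of words immediately before and after a name, and that a context is trusted once it has been seen with at least two known names; both choices, along with the function names and the toy sentences, are assumptions for illustration rather than a description of the actual tool.

```python
from collections import Counter


def context_of(tokens, index):
    """Context of a token: the (previous word, next word) pair around it."""
    left = tokens[index - 1] if index > 0 else "<s>"
    right = tokens[index + 1] if index + 1 < len(tokens) else "</s>"
    return (left, right)


def bootstrap(sentences, seeds, rounds=2, min_context_count=2):
    """Grow a class of names from a seed set by alternating two steps:
    collect frequent contexts of known names, then accept new words
    that occur in those contexts."""
    names = set(seeds)
    for _ in range(rounds):
        context_counts = Counter()
        for tokens in sentences:
            for i, tok in enumerate(tokens):
                if tok in names:
                    context_counts[context_of(tokens, i)] += 1
        good = {c for c, n in context_counts.items() if n >= min_context_count}
        new_names = set()
        for tokens in sentences:
            for i, tok in enumerate(tokens):
                if tok not in names and context_of(tokens, i) in good:
                    new_names.add(tok)
        if not new_names:
            break  # nothing new was found, so further rounds cannot help
        names |= new_names
    return names


# Toy tokenized corpus (illustrative only).
corpus = [
    ["outbreak", "of", "cholera", "in", "Peru"],
    ["outbreak", "of", "measles", "in", "Kenya"],
    ["outbreak", "of", "ebola", "in", "Uganda"],
    ["cases", "of", "ebola", "were", "reported"],
    ["cases", "of", "dengue", "were", "confirmed"],
    ["cases", "of", "cholera", "were", "reported"],
]

learned = bootstrap(corpus, seeds={"cholera", "measles"})
print(sorted(learned))
```

In this toy corpus the first round adds "ebola" through the context "of ... in"; once "ebola" is known, the context "of ... were" becomes frequent enough to add "dengue" in the second round, which shows how repetition gradually enlarges the class.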

Effective pattern discovery requires a uniform representation of the syntactic relations in text. The greater the degree to which these relations are normalized (by identifying 'logical syntactic relations' and filling syntactic 'gaps'), the more successful discovery will be. The best syntactic analyzers for English, however, based on the Penn Treebank, produce an augmented surface representation. We have produced a procedure for transforming this representation into one which is more heavily normalized [2, 4, 9] and have begun using this representation for our discovery procedures.
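The benefit of normalization can be shown with a very small sketch. It assumes a surface clause is given as a dictionary with a verb, a voice, and surface roles, and it maps active and passive clauses to the same logical subject and object. The dictionary keys and the function name are hypothetical, and this covers only one kind of regularization; it is not the transformation procedure of [2, 4, 9].

```python
def normalize(clause):
    """Map a surface clause to logical subject and object roles.

    Assumed input keys: 'verb', 'voice', 'subject', and either
    'object' (active) or 'by_phrase' (passive).
    """
    verb = clause["verb"]
    if clause.get("voice") == "passive":
        # In a passive clause the surface subject is the logical object,
        # and the by-phrase (if present) supplies the logical subject.
        return {
            "verb": verb,
            "logical_subject": clause.get("by_phrase"),
            "logical_object": clause.get("subject"),
        }
    return {
        "verb": verb,
        "logical_subject": clause.get("subject"),
        "logical_object": clause.get("object"),
    }


# "The board appointed Smith" versus "Smith was appointed by the board".
active = {"verb": "appoint", "voice": "active",
          "subject": "board", "object": "Smith"}
passive = {"verb": "appoint", "voice": "passive",
           "subject": "Smith", "by_phrase": "board"}

print(normalize(active))
print(normalize(passive))
```

Because both clauses normalize to the same structure, a discovery procedure counting patterns would treat them as two occurrences of one pattern rather than one occurrence each of two patterns.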

In addition to this work on discovery procedures, we have developed a prototype system for document access using information extraction [6]. This system is focused on the domain of infectious disease outbreaks. Each day, a web crawler obtains articles reporting disease outbreaks from both medical and general news sources. An extraction system for this domain pulls out information on the disease, date, location, and type and number of victims. The resulting database is made available through a web interface which allows searching and sorting on any database attribute. Once records of interest are located, the documents associated with those records can be obtained with a single mouse click, and the relevant text in the document is highlighted and color-coded.
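The kind of searching and sorting such a database supports can be sketched with an in-memory SQLite table. The table name, column names, document identifiers, and all records below are invented for illustration and do not reflect the schema or contents of the prototype in [6].

```python
import sqlite3

# Hypothetical schema: one row per extracted outbreak record,
# with a pointer back to the source document.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE outbreak ("
    "disease TEXT, report_date TEXT, location TEXT, "
    "victim_type TEXT, victim_count INTEGER, doc_id TEXT)"
)
records = [  # made-up records, for illustration only
    ("cholera", "2002-01-10", "Location A", "people", 40, "doc-001"),
    ("measles", "2002-01-12", "Location B", "children", 15, "doc-002"),
    ("cholera", "2002-01-15", "Location C", "people", 12, "doc-003"),
]
conn.executemany("INSERT INTO outbreak VALUES (?, ?, ?, ?, ?, ?)", records)

SORTABLE = {"disease", "report_date", "location",
            "victim_type", "victim_count", "doc_id"}


def search(conn, disease=None, order_by="report_date"):
    """Return records, optionally filtered by disease, sorted by one attribute."""
    if order_by not in SORTABLE:
        # Column names cannot be bound as parameters, so check them explicitly.
        raise ValueError(f"cannot sort on {order_by!r}")
    query = ("SELECT disease, report_date, location, "
             "victim_type, victim_count, doc_id FROM outbreak")
    params = []
    if disease is not None:
        query += " WHERE disease = ?"
        params.append(disease)
    query += f" ORDER BY {order_by}"
    return conn.execute(query, params).fetchall()


results = search(conn, disease="cholera", order_by="victim_count")
for row in results:
    print(row)
```

Each returned row carries a document identifier, which is what would let an interface link a record back to the article it was extracted from.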

Additional Project References

[10] Roman Yangarber, Ralph Grishman, Pasi Tapanainen, and Silja Huttunen. Unsupervised Discovery of Scenario-Level Patterns for Information Extraction. Proc. Sixth Applied Natural Language Processing Conf., Seattle, WA, April-May, 2000, 282-289.

[11] Roman Yangarber, Ralph Grishman, Pasi Tapanainen, and Silja Huttunen. Automatic Acquisition of Domain Knowledge for Information Extraction. Proc. 18th Int'l Conf. on Computational Linguistics (COLING 2000), Saarbrücken, Germany, July-August 2000, 940-946.

[12] Kiyoshi Sudo, Satoshi Sekine, and Ralph Grishman. Automatic Pattern Acquisition for Japanese Information Extraction. Proc. First International Conf. on Human Language Technology Research (HLT 2001), San Diego, CA, March, 2001, 51-58.