Automated Structuring of Text Information

IIS-0081962

Principal Investigator

Ralph Grishman
Department of Computer Science

New York University

715 Broadway, 7th Floor

New York, NY 10003

Phone: (212) 998-3497

Fax: (212) 995-4123

grishman@cs.nyu.edu

http://www.cs.nyu.edu/grishman

Co-PI

Satoshi Sekine
Department of Computer Science

New York University

719 Broadway, 7th Floor

New York, NY 10003

Phone: (212) 998-3175

Fax: (212) 995-4123

sekine@cs.nyu.edu

http://nlp.cs.nyu.edu/sekine

Keywords

information extraction, information access

Project Summary

At present, access to the information in large-scale text collections is largely limited to keyword-based searches, which retrieve entire documents or passages. While such tools are often satisfactory for retrieving information on general topics, they provide little support for accessing information involving specific relationships, events, or facts. Information extraction technology offers the possibility of creating structured, tabular representations of selected relations from large text collections; such representations can support more detailed document querying. Until now, however, developing extraction systems for a broad range of relations has been too expensive and time-consuming for this use to be practical. Recent developments in extraction system customization promise to ease this task substantially, making this approach to document indexing feasible.

This research project is

  • using corpus-based techniques to automatically identify the most common relationships within a topic or domain, and the different ways in which these relations are expressed
  • constructing extraction systems which extract information about these relationships from text, building tabular summaries
  • providing a Web-based user interface for querying these relationships and accessing the underlying documents
Taken together, these tools should offer significant new capabilities for accessing the information in large text collections.

Publications

[1] Kiyoshi Sudo. Japanese Information Extraction with Automatically Extracted Patterns. Proc. Student Research Workshop, 39th Annual Meeting Assn. for Computational Linguistics, Toulouse, France, July 2001.

[2] Adam Meyers, Michiko Kosaka, Ralph Grishman, and Shubin Zhao. Covering Treebanks with GLARF. Proc. Workshop on Sharing Tools and Resources for Research and Education, ACL / EACL 2001, Toulouse, France, July 2001.

[3] Ralph Grishman. Adaptive Information Extraction and Sublanguage Analysis. Working Notes, Workshop on Adaptive Text Extraction and Mining, Seventeenth Int'l Joint Conf. on Artificial Intelligence (IJCAI-2001), Seattle, Washington, August 5, 2001.

[4] Adam Meyers, Michiko Kosaka, Satoshi Sekine, Ralph Grishman, and Shubin Zhao. Parsing and GLARFing. Proc. RANLP - 2001, Recent Advances in Natural Language Processing, Tzigov Chark, Bulgaria, September 2001.

[5] Yusuke Shinyama, Satoshi Sekine, Kiyoshi Sudo, and Ralph Grishman. Automatic Paraphrase Acquisition from News Articles. Proc. HLT 2002 (Human Language Technology Conference), San Diego, California, March 2002.

[6] Ralph Grishman, Silja Huttunen, and Roman Yangarber. Real-Time Event Extraction for Infectious Disease Outbreaks. Proc. HLT 2002 (Human Language Technology Conference), San Diego, California, March 2002.

[7] Satoshi Sekine, Kiyoshi Sudo, and Chikashi Nobata. Extended Named Entity Hierarchy. Proc. LREC 2002 (Language Resources and Evaluation Conference), Las Palmas, Canary Islands, Spain, May 2002.

[8] Silja Huttunen, Roman Yangarber, and Ralph Grishman. Diversity of Scenarios in Information Extraction. Proc. LREC 2002 (Language Resources and Evaluation Conference), Las Palmas, Canary Islands, Spain, May 2002.

[9] Adam Meyers, Ralph Grishman, and Michiko Kosaka. Formal Mechanisms for Capturing Regularizations. Proc. LREC 2002 (Language Resources and Evaluation Conference), Las Palmas, Canary Islands, Spain, May 2002.

[10] Roman Yangarber, Winston Lin, and Ralph Grishman. Unsupervised Learning of Generalized Names. Proc. 19th International Conference on Computational Linguistics (COLING 2002), Taipei, Taiwan, August 2002.

[11] Silja Huttunen, Roman Yangarber, and Ralph Grishman. Complexity of Event Structure in IE Scenarios. Proc. 19th International Conference on Computational Linguistics (COLING 2002), Taipei, Taiwan, August 2002.

[12] Ralph Grishman. Discovery Methods for Information Extraction. Proc. Workshop on Spontaneous Speech Processing and Recognition, Tokyo, Japan, April 2003.

[13] Ralph Grishman, Silja Huttunen, and Roman Yangarber. Information Extraction for Enhanced Access to Disease Outbreak Reports. J. Biomedical Informatics, vol. 35 (2002), pp. 236-246.

[14] Roman Yangarber. Counter-training in Discovery of Semantic Patterns. Proc. 41st Annual Meeting Assn. for Computational Linguistics (ACL-2003), Sapporo, Japan, July 2003.

Research Activities

The focus of this research has been on the problem of moving extraction systems to new tasks or domains. Building an extraction system for a new domain requires discovering the patterns which express the primary relations of the domain; discovering the terms which are involved in these relations; and determining which patterns correspond to the same relation (pattern paraphrase).

The extraction system uses a set of patterns to encode the different linguistic forms which can be used to express a particular semantic relation. Pattern discovery begins with the identification of some major semantic classes in the domain and some members of these classes. A sample of text in the domain is tagged with these classes. The sample is then analyzed syntactically, all possible patterns (e.g., subject-verb-object patterns) are extracted, and those patterns which appear with much higher frequency in the sample than in text outside the domain are selected. Applying this method to texts from several topic areas, we have demonstrated the general efficacy of this approach to finding relevant extraction patterns for English [10, 14] and Japanese [1].
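
A minimal sketch of this frequency-based ranking step, in Python, under simplifying assumptions: the hypothetical inputs domain_counts and background_counts map each candidate pattern (already tagged with semantic classes) to its occurrence count in the domain sample and in out-of-domain text, respectively. The actual scoring used in [10, 14] is more elaborate; this illustrates only the relative-frequency idea.

    import math

    def rank_patterns(domain_counts, background_counts, smoothing=1.0):
        """Score each pattern by how much more frequent it is in the
        domain sample than in background text (log frequency ratio)."""
        domain_total = sum(domain_counts.values())
        background_total = sum(background_counts.values())
        scores = {}
        for pattern, count in domain_counts.items():
            p_domain = count / domain_total
            p_background = ((background_counts.get(pattern, 0) + smoothing)
                            / (background_total + smoothing))
            scores[pattern] = math.log(p_domain / p_background)
        # The highest-scoring patterns are the best domain candidates.
        return sorted(scores, key=scores.get, reverse=True)

    # Hypothetical subject-verb-object patterns from an outbreak sample:
    domain = {('DISEASE', 'kill', 'PERSON'): 40, ('PERSON', 'say', '*'): 55}
    background = {('PERSON', 'say', '*'): 900, ('DISEASE', 'kill', 'PERSON'): 2}
    print(rank_patterns(domain, background))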

Extraction systems also require lists of domain-specific terms; for example, our extraction system involving disease outbreaks requires a list of diseases. We have developed a bootstrapping procedure for discovering such terms automatically from a large text corpus, using only a small 'seed' set of initial terms [10]. As in the case of patterns, we have developed procedures which can learn several classes of terms concurrently ('competitive learning'). Competitive learning yields term lists of substantially higher quality than learning one class of terms at a time.
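
The bootstrapping loop can be sketched as follows; this is a deliberately simplified, hypothetical version (contexts are just the adjacent words, and scoring is a raw count), whereas the procedure of [10] uses richer contexts and learns several competing classes at once.

    def bootstrap_terms(sentences, seeds, rounds=3, per_round=5):
        """Grow a term list from a small seed set by alternately
        collecting contexts of known terms and accepting new words
        that occur in those contexts."""
        terms = set(seeds)
        for _ in range(rounds):
            # 1. Collect the contexts in which known terms appear.
            contexts = set()
            for sent in sentences:
                for i, tok in enumerate(sent):
                    if tok in terms:
                        contexts.add((sent[i - 1] if i > 0 else '<s>',
                                      sent[i + 1] if i + 1 < len(sent) else '</s>'))
            # 2. Score unknown words by how often they occur in those contexts.
            candidates = {}
            for sent in sentences:
                for i, tok in enumerate(sent):
                    if tok in terms:
                        continue
                    ctx = (sent[i - 1] if i > 0 else '<s>',
                           sent[i + 1] if i + 1 < len(sent) else '</s>')
                    if ctx in contexts:
                        candidates[tok] = candidates.get(tok, 0) + 1
            # 3. Accept the highest-scoring candidates as new terms.
            terms.update(sorted(candidates, key=candidates.get,
                                reverse=True)[:per_round])
        return terms

    sentences = [['an', 'outbreak', 'of', 'cholera', 'was', 'reported'],
                 ['an', 'outbreak', 'of', 'ebola', 'was', 'reported']]
    print(bootstrap_terms(sentences, seeds={'cholera'}))  # adds 'ebola'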

The procedures just described can find the patterns associated with a set of relations, but cannot identify which patterns are synonymous (associated with the same relation). To determine that two patterns are paraphrases, we have made use of articles from different newspapers, published on the same day and reporting the same event. Within these articles we can often identify, based on an overlap in names and other terms, sentences or portions of sentences conveying the same information; these sentence portions are generally paraphrases [5].
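
The alignment step can be illustrated with a short sketch; extract_names below is a stand-in for a real named-entity tagger, and the min_shared threshold is a hypothetical parameter, not a value from [5].

    def extract_names(sentence):
        # Placeholder NE tagger: treat capitalized tokens as names.
        return {tok for tok in sentence.split() if tok[0].isupper()}

    def candidate_paraphrases(article_a, article_b, min_shared=2):
        """Pair up sentences from two same-day articles that share
        enough names to be reporting the same fact."""
        pairs = []
        for sa in article_a:
            for sb in article_b:
                shared = extract_names(sa) & extract_names(sb)
                if len(shared) >= min_shared:
                    pairs.append((sa, sb, shared))
        return pairs

    a = ["Vice President Gore visited Sapporo on Monday."]
    b = ["Gore arrived in Sapporo at the start of his tour."]
    print(candidate_paraphrases(a, b))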

Effective pattern discovery requires a uniform representation of the syntactic relations in text: the greater the degree to which these relations are normalized (by identifying 'logical' syntactic relations and filling syntactic 'gaps'), the more successful discovery will be. The best syntactic analyzers for English, however, being based on the Penn Tree Bank, produce an augmented surface representation. We have developed a procedure for transforming this representation into a more heavily normalized one [2, 4, 9] and have begun using this representation in our discovery procedures.
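
One regularization the transformation performs can be sketched on a toy triple representation (relation, head, dependent); the triple format and relation names here are hypothetical, not the actual notation of [2, 4, 9]. Rewriting a passive clause into its logical form lets a single pattern cover both "Merck appointed Smith" and "Smith was appointed by Merck".

    def regularize_passive(triples):
        """Map the surface subject and by-object of a passive clause
        to the logical object and subject, respectively."""
        passive_verbs = {head for rel, head, dep in triples
                         if rel == 'passive'}
        out = []
        for rel, head, dep in triples:
            if rel == 'passive':
                continue                              # drop the marker
            if head in passive_verbs and rel == 'subject':
                out.append(('object', head, dep))     # surface subj -> logical obj
            elif head in passive_verbs and rel == 'by-object':
                out.append(('subject', head, dep))    # by-phrase -> logical subj
            else:
                out.append((rel, head, dep))
        return out

    surface = [('passive', 'appointed', '+'),
               ('subject', 'appointed', 'Smith'),
               ('by-object', 'appointed', 'Merck')]
    print(regularize_passive(surface))
    # [('object', 'appointed', 'Smith'), ('subject', 'appointed', 'Merck')]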

In addition to this work on discovery procedures, we have developed a prototype system for document access using information extraction [6, 13], focused on the domain of infectious disease outbreaks. The system uses a web crawler to obtain articles reporting disease outbreaks from both medical and general news sources. An extraction system pulls out information on the disease, date, location, and type and number of victims. The resulting database is made available through a web interface which allows searching and sorting on any database attribute. Once records of interest are located, the documents associated with those records can be obtained with a single mouse click, with the relevant text highlighted and color-coded. A small user study demonstrated that people could collect relevant documents on specific disease outbreaks substantially more quickly with this interface than with a conventional web-search tool.
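
The kind of tabular store and query this interface supports can be sketched as follows; the field names and the sample row are hypothetical, not the actual system's schema.

    import sqlite3

    conn = sqlite3.connect(':memory:')
    conn.execute("""CREATE TABLE outbreaks
                    (disease TEXT, date TEXT, location TEXT,
                     victim_type TEXT, victim_count INTEGER, doc_url TEXT)""")
    conn.execute("INSERT INTO outbreaks VALUES ('cholera', '2001-07-12', "
                 "'Nigeria', 'people', 500, 'http://example.org/article1')")

    # e.g., all cholera outbreaks, most recent first; doc_url links each
    # record back to the source document for one-click viewing.
    for row in conn.execute("""SELECT date, location, victim_count, doc_url
                               FROM outbreaks
                               WHERE disease = 'cholera'
                               ORDER BY date DESC"""):
        print(row)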

Project Websites

http://nlp.cs.nyu.edu/
The home page of the Proteus Project, the center for research in natural language processing at New York University.