Computer Science Department, New York University
This research project will
[2] Adam Meyers, Michiko Kosaka, Ralph Grishman, and Shubin Zhao. Covering Treebanks with GLARF. Proceedings of the Workshop on Sharing Tools and Resources for Research and Education, ACL / EACL 2001, Toulouse, France, July 2001.
[3] Ralph Grishman. Adaptive Information Extraction and Sublanguage Analysis. Working Notes of the Workshop on Adaptive Text Extraction and Mining, Seventeenth International Joint Conference on Artificial Intelligence (IJCAI-2001), Seattle, Washington, August 5, 2001.
[4] Adam Meyers, Michiko Kosaka, Satoshi Sekine, Ralph Grishman, and Shubin Zhao. Parsing and GLARFing. Proceedings of RANLP - 2001, Recent Advances in Natural Language Processing, Tzigov Chark, Bulgaria, September 2001.
[5] Yusuke Shinyama, Satoshi Sekine, Kiyoshi Sudo, and Ralph Grishman. Automatic Paraphrase Acquisition from News Articles. Proceedings of HLT 2002 (Human Language Technology Conference), San Diego, California, March 2002.
[6] Ralph Grishman, Silja Huttunen, and Roman Yangarber. Real-Time Event Extraction for Infectious Disease Outbreaks. Proceedings of HLT 2002 (Human Language Technology Conference), San Diego, California, March 2002.
[7] Satoshi Sekine, Kiyoshi Sudo, and Chikashi Nobata. Extended Named Entity Hierarchy. To appear in Proceedings of LREC 2002 (Language Resources and Evaluation Conference), Las Palmas, Canary Islands, Spain, May 2002.
[8] Silja Huttunen, Roman Yangarber, and Ralph Grishman. Diversity of Scenarios in Information Extraction. To appear in Proceedings of LREC 2002 (Language Resources and Evaluation Conference), Las Palmas, Canary Islands, Spain, May 2002.
[9] Adam Meyers, Ralph Grishman, and Michiko Kosaka. Formal Mechanisms for Capturing Regularizations. To appear in Proceedings of LREC 2002 (Language Resources and Evaluation Conference), Las Palmas, Canary Islands, Spain, May 2002.
The extraction system uses a set of patterns to encode the different linguistic forms which can be used to express a particular semantic relation. Pattern discovery is a bootstrapping process which begins with the identification of some major semantic classes in the domain and some members of these classes. A sample of text in the domain is tagged with these classes. The sample is then analyzed syntactically and all possible patterns (e.g., subject-verb-object patterns) are extracted. Patterns which appear in the sample with high frequency, and in particular with high frequency relative to text outside the domain, are identified. Applying this method to English texts from several topic areas, we have demonstrated the general efficacy of this approach [10]. In particular, we have succeeded in using these patterns to create information extraction systems for one of these topics [11]. More recently, these discovery methods have been extended to Japanese [12, 1].
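The ranking step described above can be sketched in a few lines: score each candidate pattern by how much more frequent it is in the domain sample than in background text. The semantic-class triples, counts, and add-one smoothing below are illustrative assumptions, not the system's actual scoring function:

```python
from collections import Counter

def rank_patterns(domain_patterns, background_patterns, min_count=2):
    """Score candidate extraction patterns by how much more often they
    occur in the domain sample than in background (out-of-domain) text."""
    dom = Counter(domain_patterns)
    bg = Counter(background_patterns)
    dom_total = sum(dom.values())
    bg_total = sum(bg.values())
    scores = {}
    for pat, count in dom.items():
        if count < min_count:          # ignore rare, unreliable patterns
            continue
        p_dom = count / dom_total
        # add-one smoothing so unseen background patterns get a nonzero estimate
        p_bg = (bg.get(pat, 0) + 1) / (bg_total + len(dom))
        scores[pat] = p_dom / p_bg     # relative frequency ratio
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy subject-verb-object patterns already tagged with semantic classes
domain = [("COMPANY", "appoint", "PERSON")] * 5 + [("PERSON", "say", "THING")] * 3
background = [("PERSON", "say", "THING")] * 10 + [("COMPANY", "appoint", "PERSON")]
ranked = rank_patterns(domain, background)
print(ranked[0][0])  # the domain-specific appointment pattern ranks first
```

A pattern like "say" is frequent in the domain sample too, but its high background frequency pushes it below the genuinely domain-specific pattern.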
The procedures just described can find the patterns associated with a set of relations, but cannot identify which patterns are synonymous (are associated with the same relation). To determine that two patterns are paraphrases, we have made use of articles from different newspapers, published on the same day and reporting on the same event. Within these articles we can often identify (based on an overlap in names and other terms) sentences or portions of sentences conveying the same information; these sentence portions are generally paraphrases [5].
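The alignment step can be sketched as follows: pair sentences from two same-day articles that mention enough of the same names. The sentence lists, name set, and overlap threshold are hypothetical; the actual system uses richer term matching than this substring check:

```python
def shared_names(sentence, names):
    """Names from the inventory that occur in this sentence (substring match)."""
    return {n for n in names if n in sentence}

def align_paraphrases(article_a, article_b, names, min_shared=2):
    """Pair sentences from two same-day articles on the same event that
    mention at least `min_shared` of the same names; such pairs are
    candidate paraphrases."""
    pairs = []
    for sa in article_a:
        for sb in article_b:
            if len(shared_names(sa, names) & shared_names(sb, names)) >= min_shared:
                pairs.append((sa, sb))
    return pairs

names = {"Acme", "Smith"}
article_a = ["Acme named Smith as president .", "The weather was mild ."]
article_b = ["Smith was appointed president of Acme .", "Stocks fell ."]
pairs = align_paraphrases(article_a, article_b, names)
print(pairs)  # the two appointment sentences are paired
```

The aligned pair exhibits two surface forms ("X named Y as president" / "Y was appointed president of X") that can then be recorded as paraphrastic patterns.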
These pattern discovery procedures rely in turn on the identification of words (or names) belonging to the major classes of entities for a domain. We have taken two approaches to enhance our ability to identify such terms. We have designed a large inventory of classes of names appearing in general newspaper text [7], and are developing a tagger to identify such names in text. In addition, we have developed a tool for identifying additional members of a class of names automatically from a large text corpus using a bootstrapping method. We begin with a small 'seed' set, find patterns which appear frequently as contexts of these names, and then use these patterns to find additional names; this process can be repeated to gradually enlarge the class of names.
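The seed-based bootstrapping loop can be sketched as follows. The single-word left/right contexts, the toy corpus, and the round and pattern limits are simplifying assumptions; the actual tool works over a large corpus with richer context patterns:

```python
from collections import Counter

def bootstrap_names(corpus, seeds, rounds=2, top_patterns=3):
    """Grow a name class from a small seed set: find the most frequent
    (left, right) word contexts of known names, then accept as new names
    any tokens appearing in those contexts.  Repeat for several rounds."""
    names = set(seeds)
    for _ in range(rounds):
        contexts = Counter()
        for tokens in corpus:
            for i, tok in enumerate(tokens):
                if tok in names:
                    left = tokens[i - 1] if i > 0 else "<s>"
                    right = tokens[i + 1] if i + 1 < len(tokens) else "</s>"
                    contexts[(left, right)] += 1
        best = {c for c, _ in contexts.most_common(top_patterns)}
        for tokens in corpus:
            for i, tok in enumerate(tokens):
                left = tokens[i - 1] if i > 0 else "<s>"
                right = tokens[i + 1] if i + 1 < len(tokens) else "</s>"
                if (left, right) in best:
                    names.add(tok)
    return names

corpus = [
    "an outbreak of Ebola was reported".split(),
    "an outbreak of cholera was reported".split(),
    "an outbreak of measles was reported".split(),
]
expanded = bootstrap_names(corpus, seeds={"Ebola"})
print(sorted(expanded))  # cholera and measles join the disease class
```

Starting from the single seed "Ebola", the context ("of", "was") is learned, and the other disease names occurring in that context are pulled into the class on the next pass.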
Effective pattern discovery requires a uniform representation of the syntactic relations in text. The greater the degree to which these relations are normalized (by identifying 'logical syntactic relations' and filling syntactic 'gaps'), the more successful discovery will be. The best syntactic analyzers for English, however, being based on the Penn Treebank, produce an augmented surface representation. We have developed a procedure for transforming this representation into one which is more heavily normalized [2, 4, 9] and have begun using this representation for our discovery procedures.
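The payoff of such normalization can be illustrated with a toy regularizer: a passive clause and an active clause expressing the same relation map to the same logical subject-verb-object triple, so the discovery procedure sees one pattern instead of two. The tuple format and `regularize` function below are illustrative only, not the GLARF representation itself:

```python
def regularize(clause):
    """Map a surface (subject, verb, object, voice) tuple to a logical
    (agent, verb, patient) triple, normalizing passives to active form."""
    subj, verb, obj, voice = clause
    if voice == "passive":
        # "Smith was appointed by Acme" -> logical (Acme, appoint, Smith)
        return (obj, verb, subj)
    return (subj, verb, obj)

# Two surface realizations of the same relation collapse to one pattern
passive = ("Smith", "appoint", "Acme", "passive")
active = ("Acme", "appoint", "Smith", "active")
print(regularize(passive) == regularize(active))  # True
```

With only surface relations, the passive and active variants would have to be discovered (and counted) separately, diluting the frequency evidence for each.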
In addition to this work on discovery procedures, we have developed a prototype system for document access using information extraction [6]. This system is focused on the domain of infectious disease outbreaks. The system uses a web crawler to obtain articles reporting disease outbreaks, from both medical and general news sources, on a daily basis. An extraction system for this domain pulls out information on the disease, date, location, and type and number of victims. The resulting database is made available through a web interface which allows searching and sorting on any database attribute. Once records of interest are located, the documents associated with those records can be easily obtained with a single mouse click, and the relevant text in the document is highlighted and color-coded.
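The extracted records and the attribute-based search they support could be modeled as follows; the field names, sample records, and `search` helper are illustrative assumptions about the system's data model, not its actual schema:

```python
from dataclasses import dataclass

@dataclass
class OutbreakRecord:
    """One extracted outbreak event, linked back to its source document."""
    disease: str
    date: str
    location: str
    victim_type: str
    victim_count: int
    doc_id: str  # used to retrieve and highlight the source article

def search(records, **criteria):
    """Filter records on any combination of attributes, as the web
    interface's search-and-sort facility does."""
    return [r for r in records
            if all(getattr(r, k) == v for k, v in criteria.items())]

records = [
    OutbreakRecord("cholera", "2002-03-01", "Lima, Peru", "residents", 24, "doc17"),
    OutbreakRecord("Ebola", "2002-03-02", "Gabon", "villagers", 11, "doc18"),
]
hits = search(records, disease="cholera")
print(hits[0].doc_id)  # doc17 — the article to fetch and highlight
```

Keeping the `doc_id` on every record is what makes the one-click jump from a database row back to the highlighted source passage possible.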
[11] Roman Yangarber, Ralph Grishman, Pasi Tapanainen, and Silja Huttunen. Automatic Acquisition of Domain Knowledge for Information Extraction. Proc. 18th Int'l Conf. on Computational Linguistics (COLING 2000), Saarbrücken, Germany, July-August 2000, 940-946.
[12] Kiyoshi Sudo, Satoshi Sekine, and Ralph Grishman. Automatic Pattern Acquisition for Japanese Information Extraction. Proc. First International Conf. on Human Language Technology Research (HLT 2001), San Diego, CA, March 2001, 51-58.