This research project is
 Adam Meyers, Michiko Kosaka, Ralph Grishman, and Shubin Zhao. Covering Treebanks with GLARF. Proc. Workshop on Sharing Tools and Resources for Research and Education, ACL / EACL 2001, Toulouse, France, July 2001.
 Ralph Grishman. Adaptive Information Extraction and Sublanguage Analysis. Working Notes, Workshop on Adaptive Text Extraction and Mining, Seventeenth Int'l Joint Conf. on Artificial Intelligence (IJCAI-2001), Seattle, Washington, August 5, 2001.
 Adam Meyers, Michiko Kosaka, Satoshi Sekine, Ralph Grishman, and Shubin Zhao. Parsing and GLARFing. Proc. RANLP-2001, Recent Advances in Natural Language Processing, Tzigov Chark, Bulgaria, September 2001.
 Yusuke Shinyama, Satoshi Sekine, Kiyoshi Sudo, and Ralph Grishman. Automatic Paraphrase Acquisition from News Articles. Proc. HLT 2002 (Human Language Technology Conference), San Diego, California, March 2002.
 Ralph Grishman, Silja Huttunen, and Roman Yangarber. Real-Time Event Extraction for Infectious Disease Outbreaks. Proc. HLT 2002 (Human Language Technology Conference), San Diego, California, March 2002.
 Satoshi Sekine, Kiyoshi Sudo, and Chikashi Nobata. Extended Named Entity Hierarchy. Proc. LREC 2002 (Language Resources and Evaluation Conference), Las Palmas, Canary Islands, Spain, May 2002.
 Silja Huttunen, Roman Yangarber, and Ralph Grishman. Diversity of Scenarios in Information Extraction. Proc. LREC 2002 (Language Resources and Evaluation Conference), Las Palmas, Canary Islands, Spain, May 2002.
 Adam Meyers, Ralph Grishman, and Michiko Kosaka. Formal Mechanisms for Capturing Regularizations. Proc. LREC 2002 (Language Resources and Evaluation Conference), Las Palmas, Canary Islands, Spain, May 2002.
 Roman Yangarber, Winston Lin, and Ralph Grishman. Unsupervised Learning of Generalized Names. Proc. 19th International Conference on Computational Linguistics (COLING 2002), Taipei, Taiwan, August 2002.
 Silja Huttunen, Roman Yangarber, and Ralph Grishman. Complexity of Event Structure in IE Scenarios. Proc. 19th International Conference on Computational Linguistics (COLING 2002), Taipei, Taiwan, August 2002.
 Ralph Grishman. Discovery Methods for Information Extraction. Proc. Workshop on Spontaneous Speech Processing and Recognition, Tokyo, Japan, April 2003.
 Ralph Grishman, Silja Huttunen, and Roman Yangarber. Information Extraction for Enhanced Access to Disease Outbreak Reports. J. Biomedical Informatics, vol. 25 (2002), p. 236.
 Roman Yangarber. Counter-training in Discovery of Semantic Patterns. Proc. 41st Annual Meeting Assn. for Computational Linguistics (ACL-2003), Sapporo, Japan, July 2003.
The extraction system uses a set of patterns to encode the different linguistic forms that can express a particular semantic relation. Pattern discovery begins by identifying some major semantic classes in the domain and some members of each class. A sample of text in the domain is tagged with these classes and then analyzed syntactically. All possible patterns (e.g., subject-verb-object patterns) are extracted, and those that appear with much higher frequency in the sample than in text outside the domain are selected. Applying this method to texts from several topic areas, we have demonstrated its general efficacy in finding relevant extraction patterns for English [10, 14] and Japanese.
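The selection step can be illustrated with a minimal sketch. It assumes patterns have already been extracted as (subject class, verb, object class) tuples, and it uses a smoothed frequency ratio as the selection criterion. The function name, the thresholds, and the add-one smoothing are illustrative assumptions, not the scoring actually used in the project.

```python
from collections import Counter


def select_patterns(domain_patterns, background_patterns, min_ratio=3.0, min_count=2):
    """Return (pattern, ratio) pairs for patterns that are much more
    frequent in the domain sample than in out-of-domain text.

    Both arguments are lists of hashable patterns, e.g.
    (subject_class, verb, object_class) tuples. Thresholds are hypothetical.
    """
    dom = Counter(domain_patterns)
    bg = Counter(background_patterns)
    dom_total = sum(dom.values())
    bg_total = sum(bg.values())
    vocab = len(set(dom) | set(bg))

    selected = []
    for pattern, count in dom.items():
        if count < min_count:
            continue  # too rare in the domain sample to trust
        # Add-one smoothing so patterns unseen in the background
        # do not cause a division by zero.
        p_dom = (count + 1) / (dom_total + vocab)
        p_bg = (bg[pattern] + 1) / (bg_total + vocab)
        ratio = p_dom / p_bg
        if ratio >= min_ratio:
            selected.append((pattern, ratio))

    # Highest ratio first; ties broken by the pattern itself.
    selected.sort(key=lambda item: (-item[1], item[0]))
    return selected
```

With a toy sample in which ("DISEASE", "kill", "PERSON") is frequent in the domain and absent from the background, while ("PERSON", "say", "THING") is common in both, only the former would be selected.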
Extraction systems also require lists of domain-specific terms; for example, our extraction system for disease outbreaks requires a list of diseases. We have developed a bootstrapping procedure for discovering such terms automatically from a large text corpus, using only a small 'seed' set of initial terms. As in the case of patterns, we have developed procedures that can learn several classes of terms concurrently ('competitive learning'). Competitive learning yields term lists of substantially higher quality than learning one class of terms at a time.
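The idea of learning several classes concurrently can be sketched as follows. This is a simplified illustration, not the project's algorithm: it assumes the corpus has been reduced to (context, term) pairs, and it lets a context or a term be claimed by a class only when no other class competes for it. The function name and data layout are hypothetical.

```python
from collections import defaultdict


def bootstrap_terms(occurrences, seeds, rounds=3):
    """Grow term lists for several classes at once from small seed sets.

    occurrences: iterable of (context, term) pairs from the corpus.
    seeds: dict mapping class name -> set of seed terms.
    Returns a dict mapping class name -> set of learned terms.
    """
    terms_by_context = defaultdict(set)
    contexts_by_term = defaultdict(set)
    for context, term in occurrences:
        terms_by_context[context].add(term)
        contexts_by_term[term].add(context)

    learned = {cls: set(terms) for cls, terms in seeds.items()}
    for _ in range(rounds):
        # A context is claimed by a class only if it occurs with that
        # class's known terms and with no other class's known terms.
        context_owner = {}
        for context, terms in terms_by_context.items():
            owners = [cls for cls, known in learned.items() if terms & known]
            if len(owners) == 1:
                context_owner[context] = owners[0]

        # An unassigned term joins a class only if all of its claimed
        # contexts belong to that single class.
        assigned = set().union(*learned.values())
        additions = defaultdict(set)
        for term, contexts in contexts_by_term.items():
            if term in assigned:
                continue
            owners = {context_owner[c] for c in contexts if c in context_owner}
            if len(owners) == 1:
                additions[owners.pop()].add(term)

        if not additions:
            break  # nothing new was learned; stop early
        for cls, new_terms in additions.items():
            learned[cls] |= new_terms
    return learned
```

In this sketch, a context such as "X was reported" that occurs with seeds of two classes is claimed by neither, so it cannot pull unrelated terms into either list. Learning one class alone has no such check.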
The procedures just described can find the patterns associated with a set of relations, but cannot identify which patterns are synonymous (that is, associated with the same relation). To determine that two patterns are paraphrases, we have made use of articles from different newspapers on the same day, reporting on the same event. Within these articles we can often identify (based on an overlap in names and other terms) sentences or portions of sentences conveying the same information; these sentence portions are generally paraphrases.
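The overlap test can be sketched in a few lines. The example assumes each sentence arrives with the set of names and other terms already identified in it. The function name and the min_shared threshold are illustrative assumptions, and the sentences used to try it out are invented.

```python
def pair_by_shared_names(article_a, article_b, min_shared=2):
    """Pair sentences from two articles on the same event that share names.

    Each article is a list of (sentence, names) tuples, where names is a
    set of names and other terms found in the sentence. For every sentence
    in article_a, the sentence in article_b with the largest overlap is
    kept, provided at least min_shared items are shared.
    Returns a list of (sentence_a, sentence_b, sorted_shared_names).
    """
    pairs = []
    for sent_a, names_a in article_a:
        best = None
        for sent_b, names_b in article_b:
            shared = names_a & names_b
            if len(shared) >= min_shared and (best is None or len(shared) > len(best[1])):
                best = (sent_b, shared)
        if best is not None:
            pairs.append((sent_a, best[0], sorted(best[1])))
    return pairs
```

Given "Acme named Jane Doe president on Tuesday." in one article and "Jane Doe was appointed president of Acme." in another, the shared names "Acme" and "Jane Doe" would pair the two sentences as candidate paraphrases.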
Effective pattern discovery requires a uniform representation of the syntactic relations in text. The greater the degree to which these relations are normalized (by identifying 'logical syntactic relations' and filling syntactic 'gaps'), the more successful discovery will be. However, the best syntactic analyzers for English, which are based on the Penn Treebank, produce an augmented surface representation. We have developed a procedure for transforming this representation into one which is more heavily normalized [2, 4, 9] and have begun using this representation for our discovery procedures.
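To see why normalization helps discovery, consider a toy example covering only one regularization, the passive. The clause dictionary used here is a hypothetical structure invented for illustration; it is not the representation produced by the procedure described in [2, 4, 9].

```python
def normalize_clause(clause):
    """Map a surface clause to a logical (subject, verb, object) triple.

    clause is a dict with keys 'verb' and 'voice' plus optional
    'subject', 'object', and 'by_phrase'. Missing roles become None.
    """
    if clause.get("voice") == "passive":
        # The surface subject of a passive is the logical object;
        # the by-phrase, if present, supplies the logical subject.
        return (clause.get("by_phrase"), clause["verb"], clause.get("subject"))
    return (clause.get("subject"), clause["verb"], clause.get("object"))
```

Under this sketch, "the virus infected the patients" and "the patients were infected by the virus" both map to ("virus", "infect", "patients"), so the two surface forms count as one pattern rather than two.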
In addition to this work on discovery procedures, we have developed a prototype system for document access using information extraction [6, 13], focused on the domain of infectious disease outbreaks. The system uses a web crawler to obtain articles reporting disease outbreaks from both medical and general news sources. An extraction system pulls out information on the disease, date, location, and type and number of victims. The resulting database is made available through a web interface which allows searching and sorting on any database attribute. Once records of interest are located, the documents associated with those records can be obtained with a single mouse click, and the relevant text in the document is highlighted and color-coded. A small user study demonstrated that people could collect relevant documents on specific disease outbreaks substantially more quickly with this interface than with a conventional web-search tool.
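The searching and sorting described above might look roughly like the following sketch. The record fields mirror the attributes listed in the text; the class name, field names, the document_id link, and the search function are assumptions made for illustration and do not describe the prototype's actual schema or interface.

```python
from dataclasses import dataclass


@dataclass
class OutbreakRecord:
    """One extracted record; field names are hypothetical."""
    disease: str
    date: str          # ISO format (YYYY-MM-DD), so string order is date order
    location: str
    victim_type: str
    victim_count: int
    document_id: str   # link back to the source document


def search(records, sort_by=None, **criteria):
    """Return records matching every attribute=value criterion,
    optionally sorted on any attribute."""
    matches = [
        record for record in records
        if all(getattr(record, name) == value for name, value in criteria.items())
    ]
    if sort_by is not None:
        matches.sort(key=lambda record: getattr(record, sort_by))
    return matches
```

A call such as search(records, disease="cholera", sort_by="date") would return the cholera records in date order, each carrying the identifier of the document it was extracted from.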