This research project demonstrated this potential by
[2] Adam Meyers, Michiko Kosaka, Ralph Grishman, and Shubin Zhao. Covering Treebanks with GLARF. Proc. Workshop on Sharing Tools and Resources for Research and Education, ACL / EACL 2001, Toulouse, France, July 2001.
[3] Ralph Grishman. Adaptive Information Extraction and Sublanguage Analysis. Working Notes, Workshop on Adaptive Text Extraction and Mining, Seventeenth Int'l Joint Conf. on Artificial Intelligence (IJCAI-2001), Seattle, Washington, August 5, 2001.
[4] Adam Meyers, Michiko Kosaka, Satoshi Sekine, Ralph Grishman, and Shubin Zhao. Parsing and GLARFing. Proc. RANLP - 2001, Recent Advances in Natural Language Processing, Tzigov Chark, Bulgaria, September 2001.
[5] Yusuke Shinyama, Satoshi Sekine, Kiyoshi Sudo, and Ralph Grishman. Automatic Paraphrase Acquisition from News Articles. Proc. HLT 2002 (Human Language Technology Conference), San Diego, California, March 2002.
[6] Ralph Grishman, Silja Huttunen, and Roman Yangarber. Real-Time Event Extraction for Infectious Disease Outbreaks. Proc. HLT 2002 (Human Language Technology Conference), San Diego, California, March 2002.
[7] Satoshi Sekine, Kiyoshi Sudo, and Chikashi Nobata. Extended Named Entity Hierarchy. Proc. LREC 2002 (Language Resources and Evaluation Conference), Las Palmas, Canary Islands, Spain, May 2002.
[8] Silja Huttunen, Roman Yangarber, and Ralph Grishman. Diversity of Scenarios in Information Extraction. Proc. LREC 2002 (Language Resources and Evaluation Conference), Las Palmas, Canary Islands, Spain, May 2002.
[9] Adam Meyers, Ralph Grishman, and Michiko Kosaka. Formal Mechanisms for Capturing Regularizations. Proc. LREC 2002 (Language Resources and Evaluation Conference), Las Palmas, Canary Islands, Spain, May 2002.
[10] Roman Yangarber, Winston Lin, and Ralph Grishman. Unsupervised Learning of Generalized Names. Proc. 19th International Conference on Computational Linguistics (COLING 2002), Taipei, Taiwan, August 2002.
[11] Silja Huttunen, Roman Yangarber, and Ralph Grishman. Complexity of Event Structure in IE Scenarios. Proc. 19th International Conference on Computational Linguistics (COLING 2002), Taipei, Taiwan, August 2002.
[12] Ralph Grishman. Discovery Methods for Information Extraction. Proc. Workshop on Spontaneous Speech Processing and Recognition, Tokyo, Japan, April 2003.
[13] Ralph Grishman, Silja Huttunen, and Roman Yangarber. Information Extraction for Enhanced Access to Disease Outbreak Reports. J. Biomedical Informatics, vol. 25 (2002), p. 236.
[14] Roman Yangarber. Counter-training in Discovery of Semantic Patterns. Proc. 41st Annual Meeting Assn. for Computational Linguistics (ACL-2003), Sapporo, Japan, July 2003.
[15] Yusuke Shinyama and Satoshi Sekine. Paraphrase Acquisition for Information Extraction. The Second International Workshop on Paraphrasing: Paraphrase Acquisition and Applications (IWP2003) at ACL 2003, Sapporo, Japan, July 2003.
[16] Winston Lin, Roman Yangarber, and Ralph Grishman. Bootstrapped Learning of Semantic Classes from Positive and Negative Examples. Proc. ICML-2003 Workshop on The Continuum from Labeled to Unlabeled Data, Washington, D.C., August 2003.
[17] Heng Ji and Ralph Grishman. Applying Coreference to Improve Name Recognition. Proc. ACL 2004 Workshop on Reference Resolution and Its Applications, Barcelona, Spain, July 2004.
The extraction system uses a set of patterns to encode the different linguistic forms which can be used to express a particular semantic relation. Pattern discovery begins with the identification of some major semantic classes in the domain and some members of these classes. A sample of text in the domain is tagged with these classes. The sample is then analyzed syntactically, all possible patterns (e.g., subject-verb-object patterns) are extracted, and those patterns which appear with much higher frequency in the sample than in text outside the domain are selected. Applying this method to texts from several topic areas, we have demonstrated the general efficacy of this approach to finding relevant extraction patterns for English [10, 14] and Japanese [1].
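The frequency-based selection step can be sketched as follows. This is an illustrative simplification, not the project's actual scoring function: the pattern tuples, the add-one smoothing, and the ratio threshold are all assumptions made for the example.

```python
from collections import Counter

def rank_patterns(domain_patterns, background_patterns, min_ratio=5.0):
    """Score candidate patterns by their relative frequency in the domain
    sample versus a background corpus, keeping those whose domain
    frequency is disproportionately high."""
    dom = Counter(domain_patterns)
    bg = Counter(background_patterns)
    dom_total = sum(dom.values())
    bg_total = sum(bg.values())
    ranked = []
    for pattern, count in dom.items():
        dom_rate = count / dom_total
        # Smooth the background rate so patterns unseen outside the
        # domain are not divided by zero.
        bg_rate = (bg[pattern] + 1) / (bg_total + len(dom))
        ratio = dom_rate / bg_rate
        if ratio >= min_ratio:
            ranked.append((pattern, ratio))
    return sorted(ranked, key=lambda x: x[1], reverse=True)
```

Here a pattern might be a subject-verb-object triple over semantic classes, e.g. `("DISEASE", "infect", "PERSON")`; a pattern frequent in outbreak reports but rare in general news would receive a high ratio and be selected.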
Extraction systems also require lists of domain-specific terms; for example, our extraction system for disease outbreaks requires a list of diseases. We have developed a bootstrapping procedure for discovering such terms automatically from a large text corpus, using only a small 'seed' set of initial terms [10, 16]. As in the case of patterns, we have developed procedures which can learn several classes of terms concurrently ('competitive learning'); this yields term lists of substantially higher quality than learning one class of terms at a time. Most recently, we have demonstrated the ability to improve the performance of a name recognizer by using information from coreference -- other names or noun phrases, in the same document or related documents, which may refer to the same entity [17].
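The bootstrapping loop can be sketched minimally as follows. The context representation (a single token on each side) and the context-selection threshold are illustrative simplifications, not the project's actual procedure:

```python
from collections import Counter

def bootstrap_terms(sentences, seeds, iterations=2, top_contexts=3):
    """One illustrative bootstrapping loop: collect the local contexts
    of known terms, keep the most frequent contexts, and accept any new
    token appearing in one of those contexts as a term."""
    terms = set(seeds)
    for _ in range(iterations):
        contexts = Counter()
        for sent in sentences:
            for i, tok in enumerate(sent):
                if tok in terms:
                    left = sent[i - 1] if i > 0 else "<s>"
                    right = sent[i + 1] if i + 1 < len(sent) else "</s>"
                    contexts[(left, right)] += 1
        best = {c for c, _ in contexts.most_common(top_contexts)}
        for sent in sentences:
            for i, tok in enumerate(sent):
                left = sent[i - 1] if i > 0 else "<s>"
                right = sent[i + 1] if i + 1 < len(sent) else "</s>"
                if (left, right) in best:
                    terms.add(tok)
    return terms
```

For instance, starting from the seed `{"cholera"}` over sentences such as "outbreak of cholera reported" and "outbreak of ebola reported", the shared context `("of", "reported")` would pull in "ebola" as a new term. Competitive learning would additionally train several classes at once, letting each class claim the terms that fit it best.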
The procedures just described can find the patterns associated with a set of relations, but cannot identify which patterns are synonymous (associated with the same relation). To determine that two patterns are paraphrases, we have made use of articles from different newspapers, published on the same day and reporting on the same event. Within these articles we can often identify (based on an overlap in names and other terms) sentences or portions of sentences conveying the same information; these sentence portions are generally paraphrases [5, 15].
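The alignment step based on name overlap can be sketched as follows; the Jaccard measure and the 0.5 threshold are assumptions made for the example, not the published method's exact criteria:

```python
def name_overlap(names_a, names_b):
    """Jaccard overlap between the named entities of two sentences."""
    a, b = set(names_a), set(names_b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def paraphrase_candidates(article_a, article_b, threshold=0.5):
    """Pair up sentences from two same-day articles whose name overlap
    exceeds a threshold; such pairs likely convey the same fact in
    different words, so their patterns are paraphrase candidates."""
    pairs = []
    for i, names_a in enumerate(article_a):
        for j, names_b in enumerate(article_b):
            score = name_overlap(names_a, names_b)
            if score >= threshold:
                pairs.append((i, j, score))
    return pairs
```

Each article is represented here as a list of per-sentence name sets; a pair of aligned sentences then supplies two patterns presumed to express the same relation.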
Effective pattern discovery requires a uniform representation of the syntactic relations in text: the greater the degree to which these relations are normalized (by identifying 'logical syntactic relations' and filling syntactic 'gaps'), the more successful discovery will be. The best syntactic analyzers for English, however, being based on the Penn Treebank, produce an augmented surface representation. We have developed a procedure for transforming this representation into a more heavily normalized one [2, 4, 9] and have begun using this representation for our discovery procedures.
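One kind of regularization can be illustrated for the agentive passive: in "the villagers were infected by the virus", the surface subject is the logical object and the 'by'-phrase supplies the logical subject. The relation labels and triple encoding below are hypothetical, chosen only for the sketch; they are not the actual transformation procedure of [2, 4, 9]:

```python
def regularize(relations):
    """Rewrite surface dependency triples (head, relation, dependent)
    into logical ones, handling only the agentive passive here."""
    out = []
    passive_subj = {}  # verb -> surface subject of a passive clause
    by_agent = {}      # verb -> object of the 'by' phrase
    for head, rel, dep in relations:
        if rel == "nsubjpass":
            passive_subj[head] = dep
        elif rel == "agent":
            by_agent[head] = dep
        else:
            out.append((head, rel, dep))
    for verb, subj in passive_subj.items():
        out.append((verb, "obj", subj))  # surface subject -> logical object
        if verb in by_agent:
            out.append((verb, "sbj", by_agent[verb]))  # agent -> logical subject
    return out
```

After such normalization, active and passive clauses yield the same subject-verb-object pattern, so the discovery procedure need not learn each surface variant separately.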
In addition to this work on discovery procedures, we have developed a prototype system for document access using information extraction [6, 13], focused on the domain of infectious disease outbreaks. The system uses a web crawler to obtain articles reporting disease outbreaks from both medical and general news sources. An extraction system pulls out information on the disease, date, location, and type and number of victims. The resulting database is made available through a web interface which allows searching and sorting on any database attribute. Once records of interest are located, the documents associated with those records can be obtained with a single mouse click, and the relevant text in each document is highlighted and color-coded. A small user study demonstrated that people could collect relevant documents on specific disease outbreaks substantially more quickly with this interface than with a conventional web-search tool.