My recent work in NLP has focused on Information Extraction, automatic knowledge acquisition, and customization tools.


Information Extraction (IE) is a text understanding task that involves finding facts in natural language texts and transforming them into a logical or structured representation, e.g., a table in a relational database. Our IE engine (based on the initial Proteus system developed by R. Grishman) has been customized to extract facts on many different topics. For example, for Infectious Disease Outbreaks (sample database snapshots 1 and 2) the system finds reports of epidemics around the world. For each outbreak the system determines the name of the disease, the location, the date, the number of victims, whether they are sick or dead, a short description of the victims, and a link back to the original document (for more details, please see the papers in jBiomedInfo and HLT-2002).
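To make the "structured representation" concrete, here is a minimal sketch of what one extracted outbreak record might look like. The class name, field names, and sample values are assumptions made for illustration; they are not the schema of the actual Proteus database.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class OutbreakRecord:
    """One row of a hypothetical outbreak table; field names are illustrative."""
    disease: str
    location: str
    date: Optional[str]
    victim_count: Optional[int]
    victim_status: Optional[str]       # "sick" or "dead"
    victim_description: Optional[str]
    source_document: str               # link back to the original report


# Invented values, used only to show the shape of a filled record.
record = OutbreakRecord(
    disease="cholera",
    location="Exampleville",
    date="2002-01-15",
    victim_count=12,
    victim_status="sick",
    victim_description="schoolchildren",
    source_document="reports/example-0001.txt",
)

row = asdict(record)  # plain dict, ready to insert into a relational table
print(row["disease"], row["victim_count"])
```

Fields that a report may leave unstated (date, victim count, and so on) are typed as optional, since an IE system often fills only part of a template.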

Building an IE system is a complex, multi-stage process. Among the steps involved are:

i. Searching for relevant domain knowledge;
ii. Customizing and tuning the Knowledge Bases.

Searching for domain-specific knowledge is a highly labor-intensive task, because it involves accounting for the numerous ways in which the sought information can be expressed in text. Examples of such knowledge are semantic patterns (as in a semantic grammar) and lexicons of domain-specific terms. It is fairly easy to think of a few common expressions, which will yield a baseline level of performance. However, finding the less frequent expressions needed to improve coverage is difficult and time-consuming: as Zipf's Law predicts, the many remaining expressions are each individually rare.

Once linguistic knowledge is found, it needs to be incorporated into the system's Knowledge Bases. This is itself a delicate task. For example, the semantic patterns need to be combined and generalized to maximize coverage while avoiding conflicts, and the lexicon must be organized into a semantic hierarchy.

These two phases roughly correspond to the basic (i.) and the applied (ii.) aspects of research related to IE.

Automatic knowledge acquisition:

Linguistic knowledge in Natural Language understanding systems is commonly stratified across several levels, and typical state-of-the-art IE systems require domain-specific knowledge at each of them. How can the system aid the user in the search for this linguistic knowledge?

We introduced an approach to unsupervised, or minimally supervised, knowledge acquisition (see chapter in M.T. Pazienza's book). The approach entails bootstrapping a comprehensive knowledge base from a large raw, un-annotated corpus, using a small set of seed elements. This approach is embodied in several algorithms for discovery of knowledge from text.

Pattern discovery:

(Yangarber, ACL-2003) describes an algorithm, ExDisco, for automatically learning high-quality patterns from a large corpus. The patterns can then be incorporated into an IE system.

The algorithm bootstraps two sets of data points in parallel, one in the space of patterns and one in the dual space of documents. The two sets are statistically correlated with each other and with the topic of interest. The topic is specified by a few seed patterns; for example, in the Infectious Disease scenario, we can seed the system with the patterns "disease killed N people" and "victims were diagnosed with disease". The algorithm searches the corpus for documents matched by the initial patterns, and then iteratively searches these documents for more relevant patterns.
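The loop described above can be sketched as follows. This is a toy illustration, not the published ExDisco implementation: each document is assumed to be already reduced to the set of candidate patterns it contains, and the scoring formula, the `min_support` threshold, and the function name `bootstrap_patterns` are all assumptions chosen for brevity.

```python
import math


def bootstrap_patterns(corpus, seeds, iterations=2, min_support=2):
    """Grow a set of patterns from seeds; corpus is a list of pattern sets."""
    accepted = set(seeds)
    for _ in range(iterations):
        # Documents matched by any accepted pattern are treated as relevant.
        relevant = [doc for doc in corpus if doc & accepted]
        candidates = set().union(*corpus) - accepted
        best, best_score = None, 0.0
        for pattern in sorted(candidates):
            in_relevant = sum(1 for doc in relevant if pattern in doc)
            total = sum(1 for doc in corpus if pattern in doc)
            if in_relevant < min_support:
                continue
            # Assumed score: precision against relevant documents,
            # weighted by the log of the pattern's support.
            score = (in_relevant / total) * math.log(in_relevant + 1)
            if score > best_score:
                best, best_score = pattern, score
        if best is None:
            break  # no candidate is well enough supported
        accepted.add(best)
    return accepted


# Toy corpus: each document is the set of patterns found in it.
corpus = [
    {"disease killed N people", "officials reported outbreak"},
    {"victims were diagnosed with disease", "officials reported outbreak"},
    {"officials reported outbreak", "hospital admitted patients"},
    {"company appointed person", "shares rose"},
    {"shares rose", "hospital admitted patients"},
]
seeds = {"disease killed N people", "victims were diagnosed with disease"}

learned = bootstrap_patterns(corpus, seeds)
print(sorted(learned - seeds))
```

Each round re-derives the relevant documents from the current pattern set, so the pattern set and the document set grow together; adding one pattern per round keeps the sketch conservative.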

Name and term discovery:

Text understanding tasks typically require identification of names of entities belonging to a certain semantic type. E.g., in the Infectious Disease domain we require the names of diseases; names of disease agents (i.e., organisms which cause disease, such as bacteria, viruses, fungi, algae, parasites); names of disease vectors (i.e., carriers, like rats, mosquitoes, etc.); names of drugs used in treatment; and names of locations (cities, villages, provinces, etc.), which are essential in other domains as well.

These names can in principle be derived from fixed lists, but that is unsatisfactory for several (practical and philosophical) reasons. We described an algorithm, Nomen (Yangarber et al., COLING-2002), for discovering names automatically from a large corpus. The algorithm starts with a few seed names in several semantic categories, and grows the categories simultaneously by iteratively collecting local contexts that are typical for names in these categories. Learning multiple categories at once provides mutual negative evidence, which improves precision.
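The idea of growing several categories at once, with each category supplying negative evidence for the others, can be sketched as follows. This is a simplified illustration and not the published Nomen algorithm: the corpus is assumed to be pre-processed into (context, name) pairs, a context is rejected outright if it co-occurs with a name from a competing category, and the function name `grow_categories` and all data are invented for the example.

```python
def grow_categories(occurrences, seeds, rounds=2):
    """Grow name sets per category from (context, name) pairs and seed names."""
    categories = {cat: set(names) for cat, names in seeds.items()}
    all_contexts = {ctx for ctx, _ in occurrences}
    for _ in range(rounds):
        accepted = {}
        for cat, names in categories.items():
            # Names already assigned to competing categories.
            others = set().union(
                *(n for c, n in categories.items() if c != cat)
            )
            contexts = set()
            for ctx in all_contexts:
                seen = {name for c, name in occurrences if c == ctx}
                # Keep a context only if it occurs with a known name of this
                # category and with no name of a competing category
                # (the mutual negative evidence).
                if seen & names and not seen & others:
                    contexts.add(ctx)
            accepted[cat] = contexts
        for cat, contexts in accepted.items():
            for ctx, name in occurrences:
                unassigned = not any(name in n for n in categories.values())
                if ctx in contexts and unassigned:
                    categories[cat].add(name)
    return categories


# Toy data: "_" marks the slot where the name occurred.
occurrences = [
    ("outbreak of _", "cholera"),
    ("outbreak of _", "dengue"),
    ("cases of _ were reported", "dengue"),
    ("cases of _ were reported", "measles"),
    ("officials in _", "Nairobi"),
    ("officials in _", "Kampala"),
    ("reported in _", "cholera"),
    ("reported in _", "Nairobi"),
    ("reported in _", "Kampala"),
]
seeds = {"disease": {"cholera"}, "location": {"Nairobi"}}

result = grow_categories(occurrences, seeds)
print({cat: sorted(names) for cat, names in result.items()})
```

In this toy run the ambiguous context "reported in _" is never accepted, because it occurs with seed names from both categories. With a single category it would have been accepted and would have pulled location names into the disease list.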

Customization tools:

PET (Yangarber & Grishman, Tipster-1998) is a suite of interactive graphical tools, built on top of the core Proteus IE engine, for customizing the knowledge bases for a new domain. The intended product of PET is a complete, ready-to-run IE system.

To adapt the IE system to a new domain, the user chooses a set of examples in a training text, and for each example gives the logical form which the example induces. The system then applies syntactic meta-rules to transform the example automatically into a general set of patterns.
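As a rough illustration of how one example can be expanded into a family of patterns, the sketch below generates a few syntactic variants of a single subject-verb-object clause. The specific variants (passive, relative clause, reduced relative), the `ClauseExample` type, and the pattern notation are assumptions made for this example; they are not PET's actual meta-rules or pattern language.

```python
from typing import List, NamedTuple


class ClauseExample(NamedTuple):
    """A user-supplied example, reduced to its clause skeleton (hypothetical)."""
    subject: str          # semantic class of the subject, e.g. "DISEASE"
    verb_past: str        # simple past form of the verb
    verb_participle: str  # past participle of the verb
    obj: str              # semantic class of the object, e.g. "VICTIMS"


def generalize(example: ClauseExample) -> List[str]:
    """Expand one clause example into several syntactic variants."""
    s, past, part, o = example
    return [
        f"{s} {past} {o}",         # active clause, as given by the user
        f"{o} BE {part} by {s}",   # passive clause; BE stands for any form of "be"
        f"{s}, which {past} {o}",  # relative clause on the subject
        f"{o} {part} by {s}",      # reduced relative clause
    ]


patterns = generalize(ClauseExample("DISEASE", "killed", "killed", "VICTIMS"))
for pattern in patterns:
    print(pattern)
```

All variants would map to the same logical form supplied by the user, which is what lets a single annotated example cover several surface realizations.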

PET provides editors for the Knowledge Bases in the IE engine: the lexicon, the semantic concept hierarchy, the logical predicates, and the pattern base. PET also provides a Document/Template Browser, which allows the user to visit textual documents, to apply the IE system to the documents, to browse the resulting templates in graphical form, and to evaluate the performance of the system.
