My recent work in NLP has focused on:
- Unsupervised acquisition of domain knowledge from large text corpora
- Integrated environment for building and customizing systems for Information Extraction
Information Extraction (IE) is a text understanding task which involves
finding facts in natural language texts, and transforming them into a
logical or structured representation, e.g., a table in a relational database.
Our IE engine (based on the initial Proteus system developed by R. Grishman)
has been customized to extract facts on many different topics, viz.:
- Executive Management Succession
- Corporate Mergers & Acquisitions
- Rocket/Missile Launches
- Airplane Crashes
- Natural Disasters
- Infectious Disease Outbreaks
For example, for Infectious Disease Outbreaks (sample database snapshots 1 and 2) the system finds reports of epidemics around the world. For each outbreak the system determines the name of the disease, the location, the date, the number of victims, whether they are sick or dead, a short description of the victims, and a link back to the original document (for more details, please see the papers in jBiomedInfo and HLT-2002).
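For illustration, such a per-outbreak record could be represented as the following structure; this is a minimal sketch with hypothetical field names, not the actual database schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class OutbreakEvent:
    """One extracted fact: a reported disease outbreak (hypothetical schema)."""
    disease: str                  # name of the disease
    location: str                 # where the outbreak occurred
    date: str                     # date of the outbreak or report
    victim_count: Optional[int]   # number of victims, if stated
    victim_status: str            # "sick" or "dead"
    victim_descriptor: str        # short description of the victims
    source_url: str               # link back to the original document
```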
Building an IE system is a complex, multi-stage process. Among the steps are:
- (i.) search for relevant domain knowledge;
- (ii.) customization and tuning of the Knowledge Bases.
These two phases roughly correspond to the basic (i.) and the applied (ii.)
aspects of research related to IE.
- Searching for domain-specific knowledge is a highly labor-intensive
task, because it involves accounting for the numerous ways in which the sought
information can be expressed in text. Examples of such knowledge are semantic
patterns (as in a semantic grammar) and lexicons of domain-specific terms.
It is fairly easy to think of a few common expressions, which will yield a
baseline level of performance. However, finding the less frequent expressions
to improve coverage is quite difficult and time-consuming (as predicted by
Zipf's law).
- Once linguistic knowledge is found, it needs to be incorporated into the
system's Knowledge Bases. This is itself a delicate task. E.g., the
semantic patterns need to be combined and generalized to maximize coverage
while avoiding conflicts; a sketch of such generalization follows below. The
lexicon must be organized into a semantic hierarchy.
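As a toy illustration of pattern generalization (not the actual mechanism inside the engine), two concrete patterns can be merged by replacing lexical items with concept classes; the class names here are hypothetical:

```python
# Toy illustration: merging concrete patterns via concept classes.
# The class names ("DISEASE", "PERSON") are hypothetical.
concrete = [("ebola", "killed", "villagers"),
            ("cholera", "killed", "residents")]
classes = {"ebola": "DISEASE", "cholera": "DISEASE",
           "villagers": "PERSON", "residents": "PERSON"}
general = {tuple(classes.get(t, t) for t in p) for p in concrete}
# general == {("DISEASE", "killed", "PERSON")}:
# one generalized pattern now covers both concrete examples.
```

The delicacy lies in choosing classes broad enough to improve coverage without introducing conflicts with patterns elsewhere in the pattern base.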
Automatic knowledge acquisition:
Linguistic knowledge in Natural Language understanding systems is commonly
stratified across several levels. Typical state-of-the-art IE systems
require domain-specific knowledge, including (inter alia):
- semantic patterns, for locating facts in text;
- concept classes, for semantic generalization; and
- specialized lexicons, of terms that may not appear in general-purpose dictionaries.
How can the system aid the user in the search for linguistic knowledge?
We introduced an approach to unsupervised, or minimally supervised, knowledge
acquisition (see chapter in
M.T. Pazienza's book). The approach entails bootstrapping a comprehensive
knowledge base from a large, raw, un-annotated corpus, using a small set of
seed elements. This approach is embodied in several algorithms for discovery
of knowledge from text.
Pattern discovery:
We described an algorithm, ExDisco, for learning quality patterns
automatically from a large corpus. The patterns can then be incorporated into
an IE system.
The algorithm bootstraps two sets of data points in parallel, in the dual
spaces of patterns and documents, which are statistically correlated with
each other and with the topic of interest. The topic is specified by a few
seed patterns; for example, in the Infectious Disease scenario, we can seed
the system with the patterns "disease killed N people" and "victims
were diagnosed with disease". The algorithm searches the corpus for
documents matched by the initial patterns, and then iteratively searches
these documents for more relevant patterns, as in the sketch below.
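The following is a minimal sketch of this dual bootstrapping loop, assuming documents have already been reduced to sets of subject-verb-object triples; the scoring function and parameter names are simplifications for illustration, not the published formulation:

```python
from collections import Counter

def bootstrap_patterns(docs, seeds, rounds=5, per_round=1):
    """docs: list of sets of (subj, verb, obj) triples; seeds: initial triples."""
    patterns = set(seeds)
    for _ in range(rounds):
        # Documents matched by any accepted pattern are presumed on-topic.
        relevant = [d for d in docs if patterns & d]
        if not relevant:
            break
        counts = Counter(p for d in relevant for p in d)

        def score(p):
            support = counts[p]                      # hits among relevant docs
            total = sum(1 for d in docs if p in d)   # hits in the whole corpus
            return (support / total) * support       # precision weighted by support

        # Promote the best-correlated new pattern(s) and repeat.
        new = sorted((p for p in counts if p not in patterns),
                     key=score, reverse=True)[:per_round]
        if not new:
            break
        patterns.update(new)
    return patterns

# Hypothetical usage: seed with one pattern and let the loop harvest more.
seeds = {("disease", "killed", "people")}
```

In the full algorithm the pattern and document scores are recomputed against each other at every iteration, and the top-ranked candidates can be vetted between rounds; the sketch keeps only the skeleton.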
Name and term discovery:
Text understanding tasks typically require identification of names of entities
belonging to a certain semantic type. E.g., in the Infectious Disease domain
we require the names of diseases; names of disease agents (i.e., organisms
which cause disease, such as bacteria, viruses, fungi, algae, or parasites);
names of disease vectors (i.e., carriers, such as rats or mosquitoes); names
of drugs used in treatment; and names of locations (cities, villages,
provinces, etc.), which are essential in other domains as well.
These names can in principle be derived from fixed lists, but that is
unsatisfactory for several (practical and philosophical) reasons. We
described an algorithm, Nomen (Yangarber et al.,
COLING-2002), for discovering names automatically from a large corpus.
The algorithm starts with a few seed names in several semantic categories, and
grows the categories simultaneously, by iteratively collecting local contexts
which are typical for names in these categories. Learning multiple categories
provides mutual negative evidence, which improves precision.
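A minimal sketch of this multi-category bootstrapping follows, assuming tokenized sentences and one-word left/right contexts; the acceptance test and thresholds are illustrative simplifications of the published method:

```python
from collections import defaultdict

def bootstrap_names(sentences, seeds, rounds=3, min_count=2):
    """sentences: list of token lists; seeds: {category: iterable of names}."""
    names = {c: set(s) for c, s in seeds.items()}
    for _ in range(rounds):
        # Collect (left, right) contexts around currently-known names.
        ctx = {c: defaultdict(int) for c in names}
        for toks in sentences:
            for i, tok in enumerate(toks):
                left = toks[i - 1] if i > 0 else "<s>"
                right = toks[i + 1] if i + 1 < len(toks) else "</s>"
                for c, known in names.items():
                    if tok in known:
                        ctx[c][(left, right)] += 1
        # Accept a context for a category only if no competing category
        # exhibits it: mutual negative evidence keeps precision high.
        for c in names:
            good = {w for w, n in ctx[c].items()
                    if n >= min_count
                    and all(w not in ctx[o] for o in names if o != c)}
            # Harvest new names that occur in the accepted contexts.
            for toks in sentences:
                for i, tok in enumerate(toks):
                    left = toks[i - 1] if i > 0 else "<s>"
                    right = toks[i + 1] if i + 1 < len(toks) else "</s>"
                    if ((left, right) in good
                            and all(tok not in s for s in names.values())):
                        names[c].add(tok)
    return names
```

Growing the categories simultaneously is what supplies the mutual negative evidence: a context claimed by a competing category is rejected, so the name sets stay clean as they grow.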
Integrated customization environment:
PET, the Proteus Extraction Toolkit (Yangarber & Grishman, Tipster-1998), is a suite of interactive graphical tools,
built on top of the core Proteus IE engine, for customizing the knowledge
bases for a new domain. The intended product of PET is a complete,
ready-to-run IE system.
To adapt the IE system to a new domain, the user chooses a set of
examples in a training text, and for each example gives the logical
form which the example induces. The system then applies syntactic
meta-rules to transform the example automatically into a general set of
patterns (see the sketch below).
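As a string-level illustration only (the actual meta-rules operate on syntactic structure inside the engine), a single training example can be expanded into a family of variant patterns; the rule set and names here are hypothetical:

```python
def apply_meta_rules(subj, verb_past, obj):
    """Expand one seed example into hypothetical syntactic variants."""
    return [
        f"{subj} {verb_past} {obj}",         # active: "IBM appointed Smith"
        f"{obj} was {verb_past} by {subj}",  # passive: "Smith was appointed by IBM"
        f"{obj} , {verb_past} by {subj}",    # reduced relative: "Smith, appointed by IBM"
    ]

# From the example "IBM appointed Smith", with logical form
# succession(company=IBM, person=Smith), the generalized pattern set
# would match the same fact in any of these syntactic realizations:
print(apply_meta_rules("COMPANY", "appointed", "PERSON"))
```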
PET provides editors for the Knowledge Bases in the IE engine -- the
lexicon, the semantic concept hierarchy, the logical predicates, and the
pattern base. PET also provides a Document/Template Browser, which allows
the user to visit textual documents, to apply the IE system to the
documents, to browse the resulting templates in graphical form, and to
evaluate the performance of the system.