G22.2591 - Advanced Natural Language Processing - Spring 2004

Lecture 11

Information Extraction

Information extraction is a very general term, including extraction of named entities and extraction of discourse entities (roughly, noun phrase classification and coreference).  Here we will be concerned with event or relation extraction:  extracting a particular type of relation or event from text.  We will specify a (data base) relation, define the meaning of this relation, and expect our extraction system to populate this data base from text.  In extraction terminology, the data base relation is often called a template, the individual attributes (components) of the relation are called slots, and the values placed in the slots are called slot fills.

Semi-structured vs. free text extraction

This still covers a wide range of extraction tasks.  One distinction among these tasks is between semi-structured text  and free text.  In semi-structured text, considerable information is conveyed by the position, layout, and format of text.  This applies to a lot of web-based information.  Examples include job announcements, meeting announcements, and sales catalogs (automatic scanning of product information on the web is a big business).  In semi-structured extraction, the text to be analyzed is often not full sentences, but rather short phrases, and rarely includes the sort of linguistic anaphora we recently considered.  An added simplification in many (but not all) such cases is that each web page corresponds to a single instance of a relation.

In free text extraction, most of the information is carried in the text itself, although there may be some positional information (e.g., headlines).  The text may be relatively complex, involving full sentences and anaphoric relations.  Typical applications are extraction from news (either print news or transcripts of broadcast news) and extraction from scientific papers and reports.

Since the tasks are quite different (with a continuum of tasks in between), the methods are also quite different (though there are some common aspects).  Extraction from semi-structured text is basically a tagging task, based on specific tokens and token features, whereas free text extraction involves a wide range of linguistic processing, including part-of-speech tagging, name tagging, chunking, and possibly anaphora resolution. 

Semi-structured extraction:  tasks

The term wrapper  refers to extraction systems for semi-structured text.  There are a number of standard tasks for semi-structured text extraction;  many of them are available through the Repository of Test Domains for Information Extraction, part of a larger resource on information extraction created by Ion Muslea.  The tasks include seminar announcements, restaurant ratings, and stock quotes.  In general, test data preparation for such tasks is relatively straightforward.

Free text extraction:  tasks and evaluations

Free text extraction has developed under a series of evaluations conducted by the U.S. Government:  first, the Message Understanding Conferences (from the late 1980's to the late 1990's) and more recently the ACE (Automatic Content Extraction) evaluations, since 1999.  Task definition and data preparation are much harder for free text extraction because they involve interpreting the information conveyed in text, information which can be expressed in many different ways.  For the MUC evaluations, the event to be extracted was fairly specific:  the hiring or firing of an executive by a company;  a satellite launching;  a terrorist incident.  Nonetheless, in each case there are problems in determining what constitutes a reportable event.  Who is an 'executive'?  Does a rumor that someone will be hired constitute an event?  What makes an incident a terrorist incident?  How do we distinguish one event from two closely related events?

If we answered these questions in strictly linguistic terms (a terrorist incident involves the word "bomb" or "murder" or ...), we would essentially remove part of the task, which is to determine all the linguistic means which can be used to express a particular type of event.  We therefore want to give a semantic specification of the task.  This specification must be given (at present) in natural language, and may involve a great deal of detail (in order to ensure consistency of annotation).  Addressing this detail may distract from the linguistic problems.

In the ACE evaluations, the Government has shifted to more general relations and events, such as a person being at a location, a person having some social relation to another person, etc.  Unfortunately, these relations have been considerably harder to pin down than the MUC events.

Associated with both MUC and ACE are scoring metrics and scoring programs to compute these metrics.  For MUC, the metrics were recall (correct slot fills in system response / total slot fills in key) and precision (correct slot fills in system response / total slots filled in system response);  F measure (the harmonic mean of recall and precision, F = 2PR / (P + R)) was used as an overall figure of merit.  The metrics for ACE are more complex.
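
As a concrete illustration of the MUC-style metrics, here is a minimal sketch in Python.  It assumes that slot fills can be compared by exact match and that the key and the response are each given as a set of (slot, value) pairs;  the function name muc_scores and the sample fills are made up for this example, and the real MUC scorer handled alignment and partial credit in ways not shown here.  F is computed as the balanced harmonic mean of precision and recall.

```python
def muc_scores(key_fills, response_fills):
    """Return (recall, precision, F) for two sets of (slot, value) pairs.

    Simplifying assumption: a response fill is correct only if the
    identical (slot, value) pair appears in the key.
    """
    correct = len(key_fills & response_fills)
    recall = correct / len(key_fills) if key_fills else 0.0
    precision = correct / len(response_fills) if response_fills else 0.0
    if precision + recall == 0:
        return recall, precision, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return recall, precision, f_measure


# Hypothetical answer key and system response for one hiring event.
key = {
    ("company", "Acme Corp."),
    ("person", "Jane Smith"),
    ("position", "CEO"),
    ("action", "hire"),
}
response = {
    ("company", "Acme Corp."),
    ("person", "Jane Smith"),
    ("position", "chairman"),   # wrong fill: hurts precision
}                               # missing "action": hurts recall

recall, precision, f_measure = muc_scores(key, response)
print(f"recall={recall:.2f} precision={precision:.2f} F={f_measure:.2f}")
```

Here two of the four key fills are found (recall 0.50) and two of the three response fills are correct (precision 0.67), so F falls between the two but closer to the lower value.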

Semi-structured text extraction:  methods

An overview of work on extraction, particularly for semi-structured text, can be found in the Workshops on Machine Learning for Information Extraction (ML&IE-1999) and on Adaptive Text Extraction and Mining (ATEM-2003, ATEM-2001).

As noted before, the semi-structured tasks are similar to tagging tasks such as named entity tagging, and some of the same learning methods have been applied.  For example, Freitag and McCallum (Information Extraction with HMMs and Shrinkage, ML&IE-1999) describe an application of HMMs to two extraction problems:  seminar announcements and corporate acquisition announcements (a free-text task).  In each case, they defined a separate HMM for each slot, with background states (not part of the slot fill) and target states (part of the slot fill).  ('Shrinkage' refers to smoothing of emission probabilities.)
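
The idea of background and target states can be sketched with a toy two-state HMM for a single slot, decoded with the Viterbi algorithm.  This is only an illustration under stated assumptions:  the probabilities below are invented by hand rather than trained, the state topology is far simpler than those in the paper, and shrinkage is replaced by a crude constant floor (unk) for unseen words.

```python
import math

STATES = ("background", "target")


def viterbi(tokens, start_p, trans_p, emit_p, unk=1e-6):
    """Return the most probable state sequence for tokens (log space)."""
    # Each column maps state -> (best log-probability, previous state).
    columns = [{
        s: (math.log(start_p[s]) + math.log(emit_p[s].get(tokens[0], unk)), None)
        for s in STATES
    }]
    for tok in tokens[1:]:
        column = {}
        for s in STATES:
            prev, score = max(
                ((p, columns[-1][p][0] + math.log(trans_p[p][s])) for p in STATES),
                key=lambda pair: pair[1],
            )
            column[s] = (score + math.log(emit_p[s].get(tok, unk)), prev)
        columns.append(column)
    # Follow back-pointers from the best final state.
    state = max(STATES, key=lambda s: columns[-1][s][0])
    path = [state]
    for column in reversed(columns[1:]):
        state = column[state][1]
        path.append(state)
    path.reverse()
    return path


# Hand-set (hypothetical) parameters for a "speaker" slot.
start_p = {"background": 0.9, "target": 0.1}
trans_p = {
    "background": {"background": 0.8, "target": 0.2},
    "target": {"background": 0.4, "target": 0.6},
}
emit_p = {
    "background": {"speaker": 0.2, ":": 0.3, "will": 0.2, "talk": 0.2, "the": 0.1},
    "target": {"dr.": 0.5, "smith": 0.3, "jones": 0.2},
}

tokens = ["speaker", ":", "dr.", "smith", "will", "talk"]
states = viterbi(tokens, start_p, trans_p, emit_p)
speaker_fill = [tok for tok, st in zip(tokens, states) if st == "target"]
print(speaker_fill)
```

The tokens assigned to the target state form the slot fill;  with one such model per slot, each slot is extracted independently.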

Cohen and Jensen (A Structured Wrapper Induction System for Extracting Information from Semi-Structured Documents, ATEM-2001) describe a learning algorithm for a rule-based tagger for Web pages.  This tagger was used to extract information about job ads for WhizBang!  In addition to typical text predicates (field is preceded by word X or is followed by word Y) they used predicates associated with the structure of the document (treating the HTML document as a tree-structured document), such as the tag labels on the path immediately above a text string.
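
To make the two kinds of predicate concrete, here is a small sketch using Python's standard html.parser module.  It is not the system described in the paper:  the class PathCollector, the predicates preceded_by and path_ends_with, the sample page, and the hand-written rule are all assumptions made for illustration (in a wrapper induction system the rule would be learned).

```python
from html.parser import HTMLParser


class PathCollector(HTMLParser):
    """Record each non-blank text string with the tags on the path above it."""

    VOID = {"br", "hr", "img", "input", "meta", "link"}  # tags with no close

    def __init__(self):
        super().__init__()
        self.stack = []    # currently open tags, outermost first
        self.fields = []   # list of (text, tag_path) pairs

    def handle_starttag(self, tag, attrs):
        if tag not in self.VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.fields.append((text, tuple(self.stack)))


def preceded_by(fields, i, word):
    """Text predicate: the previous text string ends with the given word."""
    return i > 0 and fields[i - 1][0].lower().endswith(word.lower())


def path_ends_with(fields, i, *tags):
    """Structure predicate: the innermost enclosing tags are exactly these."""
    return fields[i][1][-len(tags):] == tags


page = (
    "<html><body><table>"
    "<tr><td><b>Title:</b></td><td>Software Engineer</td></tr>"
    "<tr><td><b>Location:</b></td><td>Pittsburgh, PA</td></tr>"
    "</table></body></html>"
)
parser = PathCollector()
parser.feed(page)
parser.close()

# Hand-written rule: a job title follows "Title:" and sits directly in a table cell.
titles = [
    text
    for i, (text, path) in enumerate(parser.fields)
    if preceded_by(parser.fields, i, "title:")
    and path_ends_with(parser.fields, i, "tr", "td")
]
print(titles)
```

The structural predicate is what distinguishes the cell contents from the bold label, which sits one level deeper (under a b tag).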

Free text extraction:  methods

For free text event extraction, the clues indicating slot fillers will be more linguistic:  verbs and nouns indicating the event.  For example, for the hiring/firing events, we might have patterns such as

company hired person
the appointment of person by company

In some systems, a separate pattern is created for each slot;  in others, multiple slots can appear in a single pattern.  Generating slot fills separately means that a separate process is required to assemble them into a single relation or event.
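
The two hiring patterns can be sketched as multi-slot patterns matched against text that has already been name-tagged.  This is a minimal sketch, assuming the preprocessor delivers each sentence as a list of (text, kind) chunks where kind is COMPANY, PERSON, or WORD;  that representation, the function names, and the sample sentences are assumptions for illustration only.

```python
SLOT_TYPES = {"COMPANY", "PERSON"}

# Each pattern is a sequence of slot types and literal (lower-case) words.
PATTERNS = [
    ["COMPANY", "hired", "PERSON"],
    ["the", "appointment", "of", "PERSON", "by", "COMPANY"],
]


def match_at(pattern, chunks, start):
    """Try to match pattern at chunks[start:]; return slot fills or None."""
    if start + len(pattern) > len(chunks):
        return None
    fills = {}
    for element, (text, kind) in zip(pattern, chunks[start:]):
        if element in SLOT_TYPES:
            if kind != element:
                return None
            fills[element.lower()] = text
        elif kind != "WORD" or text.lower() != element:
            return None
    return fills


def extract(chunks):
    """Return one dict of slot fills per pattern match in the sentence."""
    events = []
    for pattern in PATTERNS:
        for start in range(len(chunks)):
            fills = match_at(pattern, chunks, start)
            if fills is not None:
                events.append(fills)
    return events


sentence1 = [("Acme Corp.", "COMPANY"), ("hired", "WORD"),
             ("Jane Smith", "PERSON"), ("yesterday", "WORD")]
sentence2 = [("Analysts", "WORD"), ("praised", "WORD"), ("the", "WORD"),
             ("appointment", "WORD"), ("of", "WORD"), ("Jane Smith", "PERSON"),
             ("by", "WORD"), ("Acme Corp.", "COMPANY")]

events1 = extract(sentence1)
events2 = extract(sentence2)
print(events1)
print(events2)
```

Because each pattern here fills both slots at once, no separate assembly step is needed;  with one pattern per slot, the company and person fills would have to be merged into a single event afterwards.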

To account for variations in syntax, these patterns should not be applied to raw text but rather to text which has been preprocessed, at least to find names and chunks, and (for some systems) full parses.  In addition, because the explicit arguments of an event may be anaphoric references, we need to include anaphora resolution in such a system.  So a minimal system will have

lexical analysis (POS tagging) --> NE tagging --> syntactic analysis (chunks or full parses) --> event patterns --> anaphora resolution

In the early MUCs, all of these stages were built using carefully constructed hand-coded rules.  Getting moderate performance with hand-coded rules was not too difficult, but getting good performance (F > 0.60) was very hard, requiring extensive corpus study and careful rule balancing.  As a result, several groups turned to corpus-trained methods based on a syntactically-analyzed corpus.

Supervised learning methods

Ellen Riloff.  (1993) Automatically Constructing a Dictionary for Information Extraction Tasks.
Proc. Eleventh National Conf. on Artificial Intelligence (AAAI-93), AAAI Press/The MIT Press, pp. 811-816.

One of the first efforts at automating the creation of an extraction system, this paper described a semi-automatic process.  If a noun phrase which appears as the subject or object of a verb corresponds to a slot filler, the system generates a candidate verb + NP pattern for filling that slot.  These patterns are then manually reviewed and some patterns are discarded.  The remaining patterns are used for processing new (test) documents and filling slots.

S. Miller, M. Crystal, H. Fox, L. Ramshaw, R. Schwartz, R. Stone, R. Weischedel, and the Annotation Group (BBN Technologies).  BBN: Description of the SIFT System as Used for MUC-7.  MUC-7 Proceedings.

For MUC-7, BBN introduced a statistical model for recognizing binary relations between entities -- the 'template relation' task introduced in that evaluation. (This task involved a small number of relations, such as person -- organization, and organization -- location.)  They used a generative model based on a parse tree augmented with semantic labels.  The augmentation is somewhat complicated (see Figure 3 of the paper).  In simplified terms, if a relation connects nodes A and B in the parse tree, and the lowest node dominating both A and B is C, then they add a semantic label to A, B, and C, and to all nodes on the paths from C to A and B.  In addition, in some cases a node is added to the tree to indicate the type of relation and the argument.
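
The simplified version of the augmentation described above can be sketched as follows.  This is an assumption-laden illustration, not the scheme of the paper:  it attaches one and the same label to every node on the two paths, whereas the paper distinguishes several kinds of labels and sometimes inserts extra nodes.  The Node class, the function names, and the relation name employee-of are hypothetical.

```python
class Node:
    """A parse tree node with a parent pointer and an optional semantic label."""

    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)
        self.parent = None
        self.sem = None   # semantic label; None means unaugmented
        for child in self.children:
            child.parent = self


def ancestors(node):
    """Return [node, parent, grandparent, ..., root]."""
    chain = []
    while node is not None:
        chain.append(node)
        node = node.parent
    return chain


def augment(a, b, relation):
    """Label A, B, their lowest common dominating node C, and the paths between."""
    chain_a, chain_b = ancestors(a), ancestors(b)
    on_chain_a = {id(n) for n in chain_a}
    # C is the first ancestor of B that also dominates A.
    c = next(n for n in chain_b if id(n) in on_chain_a)
    for chain in (chain_a, chain_b):
        for n in chain:
            n.sem = relation
            if n is c:
                break
    return c


# (S (NP-per Smith) (VP (V joined) (NP-org Acme)))
person = Node("NP-per")
verb = Node("V")
org = Node("NP-org")
vp = Node("VP", [verb, org])
s = Node("S", [person, vp])

lowest = augment(person, org, "employee-of")
labeled = [n.label for n in (s, person, vp, verb, org) if n.sem is not None]
print(lowest.label, labeled)
```

The verb node is left unlabeled because it lies on neither path;  only the nodes connecting the two arguments carry the relation.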

A large training corpus of this form is generated in a semi-automatic fashion.  The relations are first annotated by hand.  The sentences are then parsed using a TreeBank-based parser, and the resulting (syntactic) tree is augmented with information about the relations.  In this way a training corpus of about 1/2 million words was produced.  From this training corpus they then produce a lexicalized probabilistic context-free grammar.

This grammar is then used to parse new (test) text;  and the relations present are gleaned from the semantic labels (if any) on the trees.

Dmitry Zelenko, Chinatsu Aone, and Anthony Richardella.  Kernel Methods for Relation Extraction.  J. Machine Learning Research 3 (2003) 1083-1106.

SRA addressed the same relation-extraction problem differently.  They used a partial parser (roughly, a chunker) and they used a discriminative method (SVMs) instead of a generative one. The parse tree nodes contain a type and a head or text field (Figure 1).  To represent a relation, the nodes get a 'role' field;  for example, to capture a person-affiliation relation, one node (the person) gets role=member and one node (the organization) gets role=affiliation.

As we have briefly discussed, one advantage of SVMs is that we do not have to explicitly enumerate the features which are used to classify examples;  it is sufficient to provide a kernel function which, roughly speaking, computes a similarity between examples.  As their kernel, they used a measure of similarity between two trees.  Basically, two trees are considered similar if their roots have the same type and role, and each has a subsequence of children (not necessarily consecutive) with the same types and roles.  The value of the similarity depends on how many such subsequences exist, and how spread out they are.  All the training examples are converted into such shallow parse trees with role labels, and used to train the system;  the SVM can then classify new examples of possible relations.
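
The flavor of such a tree kernel can be conveyed by a brute-force sketch.  It follows the verbal description above rather than the exact definitions in the paper:  the Node class, the text-equality node similarity, and the decay parameter lam (which penalizes spread-out subsequences) are assumptions, and the enumeration of all subsequences is exponential, where the paper gives an efficient dynamic-programming computation.

```python
from itertools import combinations


class Node:
    """Shallow parse node with a type, an optional role, and a text field."""

    def __init__(self, type_, role=None, text="", children=()):
        self.type = type_
        self.role = role
        self.text = text
        self.children = list(children)


def matches(n1, n2):
    """Nodes are comparable only if type and role agree."""
    return n1.type == n2.type and n1.role == n2.role


def node_sim(n1, n2):
    """Assumed node similarity: 1 if the text fields are equal, else 0."""
    return 1.0 if n1.text == n2.text else 0.0


def kernel(n1, n2, lam=0.5):
    if not matches(n1, n2):
        return 0.0
    return node_sim(n1, n2) + children_kernel(n1.children, n2.children, lam)


def children_kernel(c1, c2, lam):
    """Sum over equal-length child subsequences whose nodes match pairwise."""
    total = 0.0
    for length in range(1, min(len(c1), len(c2)) + 1):
        for i in combinations(range(len(c1)), length):
            for j in combinations(range(len(c2)), length):
                pairs = [(c1[a], c2[b]) for a, b in zip(i, j)]
                if all(matches(x, y) for x, y in pairs):
                    # Spread = span covered by each subsequence.
                    spread = (i[-1] - i[0] + 1) + (j[-1] - j[0] + 1)
                    total += lam ** spread * sum(kernel(x, y, lam) for x, y in pairs)
    return total


def example(person, org):
    return Node("S", children=[
        Node("PER", role="member", text=person),
        Node("V", text="joined"),
        Node("ORG", role="affiliation", text=org),
    ])


t1 = example("Smith", "Acme")
t2 = example("Jones", "IBM")
t3 = Node("S", children=[Node("PER", text="Smith"), Node("V", text="left")])

print(kernel(t1, t2), kernel(t1, t3))
```

In this example t1 and t2 share the same types and roles and differ only in their names, so they score higher against each other than either does against t3, whose person node carries no role.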

They obtain an F measure of 0.87 for person-affiliation and 0.83 for organization-location, although this is with hand-checked parses.

Unsupervised methods

Pattern/instance bootstrapping:

Sergey Brin.  Extracting Patterns and Relations from the World Wide Web.  In Proc. World Wide Web and Databases International Workshop, pages 172-183. Number 1590 in LNCS, Springer, March 1998.

Eugene Agichtein and Luis Gravano.  Snowball: Extracting Relations from Large Plain-Text Collections.  In Proc. 5th ACM International Conference on Digital Libraries (ACM DL), 2000.

Discovery from relevant documents: 

Riloff, E. (1996) Automatically Generating Extraction Patterns from Untagged Text.  Proc. Thirteenth National Conference on Artificial Intelligence (AAAI-96), 1996, pp. 1044-1049.

Roman Yangarber; Ralph Grishman; Pasi Tapanainen; Silja Huttunen.  Automatic Acquisition of Domain Knowledge for Information Extraction.  Proc. COLING 2000.

Kiyoshi Sudo, Satoshi Sekine and Ralph Grishman.  An Improved Extraction Pattern Representation Model for Automatic IE Pattern Acquisition.  Proceedings of ACL 2003; Sapporo, Japan.

Cross-language projection:

Riloff, E., Schafer, C., and Yarowsky, D. (2002) Inducing Information Extraction Systems for New Languages via Cross-Language Projection.  Proc. 19th International Conference on Computational Linguistics (COLING 2002).