CSCI-GA.2590 - Natural Language Processing - Spring 2013    Prof. Grishman

Lecture 9 Outline

April 2, 2013

Term project discussion.

Reference: Tarragona lecture notes.

Learning to Extract

Tasks of Information Extraction

There are two main tasks for information extraction: identifying the entities mentioned in a text (the "entity mentions") and the predications asserted about these entities.

The entity mentions are of three types: named mentions, nominal mentions, and pronominal mentions. The pronominal mentions (and some of the nominal mentions) are references to previously mentioned entities. Resolving these is the subject of reference resolution, which we will consider separately later in the course. For named and nominal mentions, we want to be able to define (semantic) classes of entities and identify instances of these classes. We have already done this for broad classes of names. For nominal mentions the semantic class is generally determined by the head of the phrase.

The predications we need to identify range from simple relations to complex events which may have multiple arguments and modifiers (time, location, ...). We will focus on simple binary relations with two arguments and no modifiers, mainly because these have been most intensively studied, particularly from a machine learning point of view. Furthermore, we will limit ourselves to those cases where both arguments are explicitly mentioned in the same sentence (we will relax this a bit when we discuss reference resolution). Relations were introduced with MUC-7 and expanded with ACE and KBP (see slides). Learning to extract general events is relatively uncharted territory.

Learning strategies

As we have done in past weeks, we will use these tasks to demonstrate a range of learning strategies we have discussed before and to introduce some new ones. These include supervised classifiers, semi-supervised bootstrapping (including co-training), active learning, and distant supervision.

We should keep in mind that, while learning methods can in some cases beat the best hand-coded systems, this is not always the case: for relations and events, some mix of learning and hand-coded rules is so far hard to beat.

Named entities

We have studied supervised systems for named entity recognition, using HMMs, MEMMs, and CRFs, and they yield good performance for frequently-occurring tags. This is a practical approach for NE systems because corpora for NEs can usually be annotated fairly quickly and reliably.
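
To make the supervised setting concrete, here is a minimal sketch of a CRF name tagger in the BIO style; the feature set and the toy training sentence are invented for illustration, and the sklearn-crfsuite package is assumed to be available (it is not part of the course materials).

    # Minimal sketch of a supervised CRF name tagger (sklearn-crfsuite assumed).
    # The training sentence and BIO tags are toy examples, not course data.
    import sklearn_crfsuite

    def token_features(sent, i):
        word = sent[i]
        return {
            'word.lower': word.lower(),
            'word.istitle': word.istitle(),                       # internal ("spelling") evidence
            'prev.word': sent[i-1].lower() if i > 0 else '<S>',   # contextual evidence
            'next.word': sent[i+1].lower() if i < len(sent)-1 else '</S>',
        }

    train_sents = [['Mr.', 'Smith', 'visited', 'Paris', '.']]
    train_tags  = [['O', 'B-PERSON', 'O', 'B-LOCATION', 'O']]

    X = [[token_features(s, i) for i in range(len(s))] for s in train_sents]
    crf = sklearn_crfsuite.CRF(algorithm='lbfgs', max_iterations=50)
    crf.fit(X, train_tags)
    print(crf.predict(X))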

Nonetheless, there was interest from quite early on in semi-supervised methods which could greatly reduce the amount of data that needs to be annotated. Collins and Singer (1999) described a bootstrapping procedure for this task -- more specifically, a co-training procedure, where the bootstrapping alternates between two disjoint sets of features. Their NE task involved news text and the standard NE labels: PERSON, ORGANIZATION, LOCATION. They divided the features used for NE into two sets, 'spelling' (internal) features and contextual features. Starting with a few high-precision seed (spelling) rules, they labeled some examples and then trained a contextual-feature classifier; using the contextual-feature classifier, they labeled additional examples and trained a spelling-feature classifier. After many iterations they combined the two classifiers and produced a high-quality NE classifier.
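
A rough sketch of the co-training loop, assuming each candidate name carries two feature views ('spelling' and 'context'); the real algorithm induces ranked decision-list rules, which is simplified here to a precision threshold on feature counts.

    # Simplified co-training sketch in the spirit of Collins & Singer (1999).
    # Each example is a dict with two feature views: ex['spelling'], ex['context'].
    from collections import Counter, defaultdict

    def induce_rules(labeled, view, min_count=3, min_precision=0.95):
        # keep feature -> class rules whose precision on the labeled data is high
        stats = defaultdict(Counter)
        for ex, cls in labeled:
            for f in ex[view]:
                stats[f][cls] += 1
        rules = {}
        for f, counts in stats.items():
            cls, n = counts.most_common(1)[0]
            if n >= min_count and n / sum(counts.values()) >= min_precision:
                rules[f] = cls
        return rules

    def classify(ex, view, rules):
        votes = Counter(rules[f] for f in ex[view] if f in rules)
        return votes.most_common(1)[0][0] if votes else None

    def cotrain(examples, seed_spelling_rules, iterations=5):
        spelling_rules, context_rules = dict(seed_spelling_rules), {}
        for _ in range(iterations):
            # label with spelling rules, induce contextual rules from those labels
            labeled = [(ex, classify(ex, 'spelling', spelling_rules)) for ex in examples]
            context_rules = induce_rules([p for p in labeled if p[1]], 'context')
            # label with contextual rules, induce spelling rules from those labels
            labeled = [(ex, classify(ex, 'context', context_rules)) for ex in examples]
            spelling_rules = induce_rules([p for p in labeled if p[1]], 'spelling')
            spelling_rules.update(seed_spelling_rules)   # never discard the seeds
        return spelling_rules, context_rules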

This is a nice result but has several significant limitations. It only addresses the problem of classifying names, assuming that identification is easy or has already been done. This is more-or-less true for English (because of capitalization) but not for monocase input or for many other languages. In addition, the procedure basically terminates by labeling all the examples. This is fine if the name classes taken together account for almost all the names in the corpus. If not, one ends up with a low-precision tagger. This can be remedied by introducing additional classes so that most names are accounted for, either manually (Yangarber and Lin [NYU]) or automatically (McIntosh).

For fine-grained entity sets, defining enough classes to cover the space of possible names is not practical. Instead we rank the names in terms of their likelihood of belonging to the class and then separately choose a point in the ranking as the cutoff ... names above the cutoff are included in the set. One such system has been implemented at NYU by Bonan Min and Chen Chen. This system computes the similarity between the seeds and all candidate names, using the PMI between a name and its syntactic context. It uses active learning, in which the system suggests possible additional seeds but they are reviewed by the user before being added to the seed set.
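
A sketch of how such a ranking might be computed; the particular representation below (PMI-weighted context vectors compared by cosine similarity to a centroid of the seed vectors) is an assumption for illustration, not a description of the Min and Chen implementation.

    # Sketch: rank candidate names by distributional similarity to a seed set.
    import math
    from collections import Counter, defaultdict

    def pmi_vectors(observations):
        # observations: (name, syntactic_context) pairs extracted from parsed text
        name_ctx = defaultdict(Counter)
        name_tot, ctx_tot = Counter(), Counter()
        for name, ctx in observations:
            name_ctx[name][ctx] += 1
            name_tot[name] += 1
            ctx_tot[ctx] += 1
        total = sum(name_tot.values())
        vectors = {}
        for name, ctxs in name_ctx.items():
            vec = {ctx: math.log(n * total / (name_tot[name] * ctx_tot[ctx]))
                   for ctx, n in ctxs.items()}
            vectors[name] = {c: w for c, w in vec.items() if w > 0}   # keep positive PMI
        return vectors

    def cosine(u, v):
        num = sum(w * v[c] for c, w in u.items() if c in v)
        den = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
        return num / den if den else 0.0

    def rank_candidates(vectors, seeds):
        centroid = Counter()
        for s in seeds:
            centroid.update(vectors[s])
        ranking = sorted(((cosine(vectors[n], centroid), n)
                          for n in vectors if n not in seeds), reverse=True)
        return ranking   # a cutoff point is then chosen separately in this ranking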

Relations

To fit the training of relation extractors into the patterns we have been studying, we need to cast relation extraction as a classification problem. We do so by considering every pair of entity mentions in the same sentence as a data point; each data point is classified as some type of relation or no relation.
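
A minimal sketch of the candidate-generation step under this casting; the mention representation is hypothetical.

    # Every pair of entity mentions in a sentence becomes one classification instance.
    from itertools import combinations

    def candidate_pairs(sentence_mentions):
        # sentence_mentions: the entity mentions found in one sentence
        for m1, m2 in combinations(sentence_mentions, 2):
            yield (m1, m2)   # to be labeled with a relation type or NO_RELATION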

Supervised relation classifiers use a wide range of features: the word sequence between the arguments (and the individual words), the dependency path between the arguments (and the words on the path), the semantic types of the arguments, ... . Tree kernels measure the similarity between paths or subtrees and can be used to directly compare training and test examples. The best classifiers, trained and tested on ACE data (with perfect entities), get F scores of 70 to 75.
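
A sketch of flat features such a classifier might use; the feature names and the mention representation are invented for illustration (tree-kernel systems compare parse structures directly rather than extracting flat features like these).

    # Hypothetical feature extractor for one candidate mention pair.
    def relation_features(tokens, m1, m2, dep_path):
        # tokens: sentence tokens; m1, m2: (start, end, semantic_type), m1 before m2;
        # dep_path: labels/words on the dependency path between the two heads
        between = tokens[m1[1]:m2[0]]
        return {
            'arg1.type': m1[2],
            'arg2.type': m2[2],
            'type.pair': m1[2] + '-' + m2[2],
            'words.between': '_'.join(w.lower() for w in between),
            'num.words.between': len(between),
            'dep.path': '_'.join(dep_path),
        }
    # These feature dictionaries can be fed to any standard classifier (e.g. a
    # maximum-entropy or SVM model) with one class per relation type plus NO_RELATION.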

The bootstrapping approach to relation discovery was suggested by Brin and elaborated by Agichtein (2000). In this approach we start with one high-precision pattern for a relation, gather a set of argument pairs from the corpus, and then look for additional phrases connecting these pairs. We then gather additional pairs connected by the new phrases.
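
A sketch of this pattern/pair bootstrapping loop; the two corpus-matching helpers are passed in as parameters because their details are assumptions, not part of any specific system.

    # Brin/Agichtein-style bootstrapping of a single relation.
    def bootstrap(corpus, seed_patterns, find_pairs, find_patterns, iterations=5):
        # find_pairs(corpus, patterns): argument pairs connected by any of the patterns
        # find_patterns(corpus, pairs): phrases connecting any of the pairs elsewhere
        patterns, pairs = set(seed_patterns), set()
        for _ in range(iterations):
            pairs |= find_pairs(corpus, patterns)      # step 1: patterns -> new pairs
            patterns |= find_patterns(corpus, pairs)   # step 2: pairs -> new patterns
        return pairs, patterns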

The key to the co-training approach, both for names and relations, is that each feature set is sufficient to determine the class (of NE or relation). This is largely true for the spelling features of NEs: most names are of only one type, though there are exceptions. It is less true for the arguments of relations: many argument pairs correspond to more than one relation. This gets relation learners into trouble, a problem referred to as semantic drift.

Several measures have been taken to reduce this drift. Both Brin and Agichtein selected functional relations [X, Y]: for each X there is a unique value Y (Brin used book authors, Agichtein used company headquarters). Agichtein then introduced a pattern confidence metric
Conf(P) = P.positive / (P.positive + P.negative)
However, not all relations are functional, so this is not a universal solution.
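
Roughly, a match of pattern P counts as positive under this metric if the Y it extracts agrees with the Y already known for that X, and negative if it conflicts; a minimal sketch under that reading (the helper below is illustrative, not Agichtein's code).

    # Pattern confidence for a functional relation, following Conf(P) above.
    def pattern_confidence(matches, known_y):
        # matches: (x, y) pairs extracted by the pattern; known_y: seed table x -> y
        positive = sum(1 for x, y in matches if x in known_y and known_y[x] == y)
        negative = sum(1 for x, y in matches if x in known_y and known_y[x] != y)
        return positive / (positive + negative) if positive + negative else 0.0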

A second approach is based on active learning, which involves having the user review the patterns being added during bootstrapping. Ang Sun [NYU] developed such an approach using a co-testing framework. This has since been generalized by Lisheng Fu [NYU], who combined human review of new low-confidence patterns with automatic acceptance of high-confidence patterns.
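
A sketch of how such review might be folded into bootstrapping; the thresholds and the review callback are assumptions for illustration, not a description of either NYU system.

    # Screen candidate patterns: auto-accept confident ones, ask a human about the rest.
    def screen_patterns(candidates, confidence, ask_user, hi=0.9, lo=0.5):
        accepted = []
        for p in candidates:
            c = confidence(p)
            if c >= hi:
                accepted.append(p)              # high confidence: accept automatically
            elif c >= lo and ask_user(p):       # middling confidence: human review
                accepted.append(p)
            # candidates below the low threshold are simply dropped
        return accepted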

Distant supervision works roughly like a half iteration of bootstrapping, but with a large set of seeds. Given a data base with a large set of instances of a relation and a large raw text corpus, we use the instances to label the corpus and then train a classifier over the corpus. We encounter some of the same problems as with bootstrapping ... a pair may be connected by many relations besides those in the data base.
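
A sketch of the labeling step; the data structures below are hypothetical.

    # Distant supervision: project data-base instances onto a raw corpus.
    def distant_label(sentences, kb):
        # sentences: (tokens, mentions) pairs, mentions given as entity strings;
        # kb: dict mapping an (arg1, arg2) pair to the relation between them
        labeled = []
        for tokens, mentions in sentences:
            for a1 in mentions:
                for a2 in mentions:
                    if a1 == a2:
                        continue
                    # some labels will be noisy: the sentence may not actually
                    # express the data-base relation for this pair
                    label = kb.get((a1, a2), 'NO_RELATION')
                    labeled.append((tokens, a1, a2, label))
        return labeled   # then train an ordinary relation classifier on these labels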