G22.2591 - Advanced Natural Language Processing - Spring 2011

Lecture 8

Discussion of assignment results.

Discussion of Assignment #3

Relation Extraction (cont'd)

The pipeline

As we saw in looking at examples of ACE relations, identifying relations accurately may depend heavily on first identifying and classifying their arguments accurately. The relation is quite different for "French auto worker" and "Ford auto worker". In particular, many relation taggers operate by considering every pair of entity mentions (basically, noun phrases representing people, places, or organizations) and applying a classifier to the pair, returning either no relation or the type of some relation. Poor performance of the entity mention tagger therefore severely compromises the relation tagger.
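The pairwise scheme just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the Mention type, the toy_classifier rules, and the relation labels EMPLOYMENT and CITIZENSHIP are hypothetical stand-ins for a trained classifier and the actual ACE inventory.

```python
from itertools import combinations
from typing import Callable, List, NamedTuple, Optional, Tuple


class Mention(NamedTuple):
    text: str    # the noun phrase
    etype: str   # entity type assigned by the mention tagger, e.g. "PER"
    start: int   # token offset, used only to keep mentions in text order


def tag_relations(
    mentions: List[Mention],
    classify: Callable[[Mention, Mention], Optional[str]],
) -> List[Tuple[str, str, str]]:
    """Apply the classifier to every pair of mentions; None means 'no relation'."""
    relations = []
    for m1, m2 in combinations(sorted(mentions, key=lambda m: m.start), 2):
        label = classify(m1, m2)
        if label is not None:
            relations.append((m1.text, label, m2.text))
    return relations


def toy_classifier(m1: Mention, m2: Mention) -> Optional[str]:
    # Hypothetical rules standing in for a trained model; they look only at
    # the entity types, which is why tagger errors propagate directly.
    if m1.etype == "ORG" and m2.etype == "PER":
        return "EMPLOYMENT"
    if m1.etype == "GPE" and m2.etype == "PER":
        return "CITIZENSHIP"
    return None


worker = Mention("auto worker", "PER", 1)
print(tag_relations([Mention("Ford", "ORG", 0), worker], toy_classifier))
print(tag_relations([Mention("French", "GPE", 0), worker], toy_classifier))
# If the mention tagger mislabels "Ford" as a person, the relation is lost.
print(tag_relations([Mention("Ford", "PER", 0), worker], toy_classifier))
```

The last call shows the dependence on the earlier stage: one wrong entity type and the pair is classified as unrelated.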

To make matters worse, what the relation tagger finds is a relation between noun phrases. (In ACE parlance, these are relation mentions, which represent relations between entity mentions.) If the noun phrase is an anaphoric reference to a named entity ["Bernard Madoff was jailed yesterday. ... His wife Mabel remained in seclusion." or "... The swindler's wife, Mabel, remained in seclusion."], we really want to recover the name of the entity. This requires establishing coreference between entity mentions.

In some cases, this problem can be reduced by taking advantage of the interaction between stages -- by joint inference.  For example, if a semantic relation is consistent with one choice of name tags or coreference relations, a system may prefer that choice.

Semi-supervised methods

The simplest 'models' for identifying relations in text are sets of fixed token sequences (along with the types of the arguments); because all intervening words must be included, these rarely provide good coverage unless the arguments are adjacent. Somewhat more flexible are paths in a dependency tree, but these still require exact match at the path level -- they make use of only positive training examples.

Last week we considered three types of models for relation extraction which learn from both positive and negative examples.
In the papers we looked at, all of these models were trained using supervised methods (annotated corpora).

However, as we have discussed, annotated corpora are expensive -- particularly carefully annotated corpora.  Relation annotation is more laborious than name annotation:  for each positive instance, we have to mark two entities and a relation type.  Annotation decisions may be more difficult, leading to inconsistencies [LDC revised the ACE relation categories several times in an effort to get more consistent annotation].  Furthermore, a hand-annotated relation corpus is likely to be quite incomplete -- more so than a name-annotated corpus.  Trying to keep in mind a large number of relation types makes the task harder [MUC-7 had only 3 types of relations but ACE has about 20].  Identifying long-distance relations in long sentences is particularly difficult.  Even dual annotation and adjudication [used for some ACE corpora] is not sufficient. 

So it's difficult to prepare a large, high-quality annotated corpus, but a large corpus is needed for good coverage.  The solution, as for named entities, is to use semi-supervised methods.  All these methods involve bootstrapping which alternates between finding pairs of arguments and finding the contexts ('patterns') of these arguments.  This is analogous to the co-training (between spelling features and context features) used in semi-supervised NE tagging.

All of these methods are based on named arguments ... anaphoric arguments are not considered.

Sergey Brin. Extracting Patterns and Relations from the World Wide Web.  In Proc. World Wide Web and Databases International Workshop, pages 172-183. Number 1590 in LNCS, Springer, March 1998.

Brin describes a procedure, DIPRE, to acquire author-title pairs from the Web starting from a small seed (5 pairs).  With D denoting the document collection (here, the Web), the top-level algorithm is
  1. R = seed  // initialize the set of pairs, R
  2. O = findOccurrences(R, D)  // find all occurrences of R in D
  3. P = genPatterns (O)  //  generate a set of patterns from these occurrences
  4. R = matches(P, D)  //  retrieve pairs in D matching P
  5. if R is large enough, quit, else go to step 2
The pattern is a 5-tuple:  (order, urlprefix, prefix, middle, suffix).  If order is true, an author-title pair matches a pattern if a page whose URL begins with urlprefix contains the sequence prefix-author-middle-title-suffix.
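As a rough illustration of how such a 5-tuple might be matched, here is a minimal Python sketch. Several things are assumptions not taken from the notes: pages are given as (url, text) tuples, the author and title are captured with a simple non-greedy regular expression rather than whatever expressions Brin used, and order = False is taken to mean that the title precedes the author. The example URL and page text are made up.

```python
import re
from typing import Iterable, NamedTuple, Set, Tuple


class Pattern(NamedTuple):
    order: bool      # True: author appears before title
    urlprefix: str
    prefix: str
    middle: str
    suffix: str


def matches(pattern: Pattern, pages: Iterable[Tuple[str, str]]) -> Set[Tuple[str, str]]:
    """Return the (author, title) pairs found by one pattern."""
    # prefix, middle, and suffix are literal strings, so escape them.
    regex = re.compile(
        re.escape(pattern.prefix) + r"(.+?)"
        + re.escape(pattern.middle) + r"(.+?)"
        + re.escape(pattern.suffix)
    )
    pairs = set()
    for url, text in pages:
        if not url.startswith(pattern.urlprefix):
            continue
        for first, second in regex.findall(text):
            # Assumption: when order is False the title comes first.
            author, title = (first, second) if pattern.order else (second, first)
            pairs.add((author, title))
    return pairs


pages = [(
    "http://example.org/books/list.html",
    "<li><b>Isaac Asimov</b> wrote <i>The Robots of Dawn</i></li>"
    "<li><b>Charles Dickens</b> wrote <i>Great Expectations</i></li>",
)]
p = Pattern(True, "http://example.org/books/", "<li><b>", "</b> wrote <i>", "</i>")
print(sorted(matches(p, pages)))
```

Note that an empty suffix would let the second group match a single character, so a real implementation would need a more constrained expression for the title.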

genPatterns operates by grouping all occurrences with the same order and middle, and then computing the longest matching urlprefix, prefix, and suffix for each group.  To avoid generating overly general patterns, Brin defines specificity = total length of urlprefix, prefix, middle, and suffix, and requires that (specificity of the pattern) × (number of books matching the pattern) exceed a threshold.
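The grouping and specificity test can be sketched as follows. This is a simplified reading of the description above, not Brin's code: the Occurrence record and the threshold values are hypothetical, specificity is computed as the total length exactly as these notes state it, and "longest matching prefix" is interpreted as the longest common ending of the text preceding the first argument (with the suffix being the longest common beginning of the text following the second argument). Any further refinements in the paper are omitted.

```python
import os
from collections import defaultdict
from typing import List, NamedTuple, Tuple


class Occurrence(NamedTuple):
    author: str
    title: str
    order: bool
    url: str
    prefix: str   # text just before the first argument
    middle: str   # text between the two arguments
    suffix: str   # text just after the second argument


def common_prefix(strings: List[str]) -> str:
    # os.path.commonprefix compares character by character, not by path part.
    return os.path.commonprefix(strings)


def common_suffix(strings: List[str]) -> str:
    return common_prefix([s[::-1] for s in strings])[::-1]


def specificity(urlprefix: str, prefix: str, middle: str, suffix: str) -> int:
    # Total length, following the wording of these notes.
    return len(urlprefix) + len(prefix) + len(middle) + len(suffix)


def gen_patterns(occurrences: List[Occurrence], threshold: int) -> List[Tuple]:
    groups = defaultdict(list)
    for occ in occurrences:
        groups[(occ.order, occ.middle)].append(occ)

    patterns = []
    for (order, middle), group in groups.items():
        urlprefix = common_prefix([o.url for o in group])
        prefix = common_suffix([o.prefix for o in group])
        suffix = common_prefix([o.suffix for o in group])
        n_books = len({(o.author, o.title) for o in group})
        # Keep the pattern only if it is specific enough for its support.
        if specificity(urlprefix, prefix, middle, suffix) * n_books > threshold:
            patterns.append((order, urlprefix, prefix, middle, suffix))
    return patterns


occs = [
    Occurrence("Isaac Asimov", "The Robots of Dawn", True,
               "http://example.org/books/sf.html",
               "<ul><li><b>", "</b> wrote <i>", "</i></li><li>"),
    Occurrence("Charles Dickens", "Great Expectations", True,
               "http://example.org/books/classics.html",
               "</li><li><b>", "</b> wrote <i>", "</i></li></ul>"),
]
print(gen_patterns(occs, threshold=100))
```

Raising the threshold makes the procedure more conservative: a short pattern supported by only a few books is discarded.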

Basic characteristics:

Eugene Agichtein and Luis Gravano. Snowball: Extracting Relations from Large Plain-Text Collections. In Proc. 5th ACM International Conference on Digital Libraries (ACM DL), 2000.
(student presentation)

Deepak Ravichandran and Eduard Hovy, Learning Surface Text Patterns for a Question-Answering System, ACL 2002.

Motivated by work on question answering.  Given a <query, answer> seed pair for a relation, collects from a corpus a set of sentences containing the pair.  Using a suffix tree, collects all shared substrings amongst the sentences which include both members of the pair.  Converts each such substring into a pattern (replacing query and answer by variables).  Then computes the precision of each pattern = Ca/Co, where Co is the number of times this pattern matches the corpus with a given query term and any string for the answer, and Ca is the number of times this pattern matches the corpus with a given query term and the correct answer.
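The precision computation can be illustrated with a small Python sketch. It is not the authors' implementation: the placeholders <NAME> and <ANSWER>, the function name pattern_precision, and the toy sentences are assumptions made for the example, and "any string for the answer" is approximated by a single whitespace-delimited token in pre-tokenized text.

```python
import re
from typing import List


def pattern_precision(pattern: str, query: str, answer: str, sentences: List[str]) -> float:
    """Precision of a surface pattern = Ca / Co.

    Co: matches with the query term filled in and any token as the answer.
    Ca: those matches whose answer token is the correct answer.
    """
    # Assumes the pattern contains exactly one <NAME> and one <ANSWER>.
    before, after = pattern.replace("<NAME>", query).split("<ANSWER>")
    any_answer = re.compile(re.escape(before) + r"(\S+)" + re.escape(after))

    c_o = c_a = 0
    for sentence in sentences:
        for m in any_answer.finditer(sentence):
            c_o += 1
            if m.group(1) == answer:
                c_a += 1
    return c_a / c_o if c_o else 0.0


sentences = [
    "Mozart was born in 1756 .",
    "Mozart was born in Salzburg .",
    "Mozart ( 1756 - 1791 ) was a composer .",
]
print(pattern_precision("<NAME> was born in <ANSWER>", "Mozart", "1756", sentences))
print(pattern_precision("<NAME> ( <ANSWER> -", "Mozart", "1756", sentences))
```

On this toy corpus the "was born in" pattern also picks up a birthplace, so it scores lower for a birth-year question than the parenthesized pattern does.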

Razvan Bunescu and Raymond J. Mooney. Learning to Extract Relations from the Web using Minimal Supervision. ACL 2007.

(student presentation)

Looking ahead:  Unsupervised methods

Takaaki Hasegawa, Satoshi Sekine, and Ralph Grishman. Discovering Relations among Named Entities from Large Corpora. ACL 2004.

Looking ahead: event/scenario extraction

Riloff, E. (1996) "Automatically Generating Extraction Patterns from Untagged Text".  Proc. Thirteenth National Conference on Artificial Intelligence (AAAI-96), 1996, pp. 1044-1049.

Roman Yangarber, Ralph Grishman, Pasi Tapanainen, and Silja Huttunen.  Automatic Acquisition of Domain Knowledge for Information Extraction.  Proc. COLING 2000.