G22.2591 - Advanced Natural Language Processing - Spring 2011
Discussion of assignment results.
Discussion of Assignment #3
Relation Extraction (cont'd)
As we saw in looking at examples of ACE relations, identifying
relations accurately may depend heavily on first identifying
and classifying their arguments accurately. The relation is
quite different for 'French auto worker' and 'Ford auto
worker'. In particular, many relation taggers operate by
considering every pair of entity mentions (basically, noun phrases
representing people, places, or organizations) and applying a
classifier to the pair, returning either no relation or
the type of some relation. Poor performance of the entity
mention tagger severely compromises the relation tagger.
To make matters worse, what the relation tagger finds is a relation
between noun phrases. (In ACE parlance, these are relation
mentions, which represent relations between entity mentions.) If
the noun phrase is an anaphoric reference to a named entity
['Bernard Madoff was jailed yesterday. ... His wife Mabel remained
in seclusion.' or '... The swindler's wife, Mabel, remained in
seclusion.'], we really want to recover the name of the entity.
This requires establishing coreference between entity mentions.
In some cases, this problem can be reduced by taking advantage of
the interaction between stages -- by joint
inference. For example, if a semantic relation is
consistent with one choice of name tags or coreference relations, a
system may prefer that choice.
The simplest 'models' for identifying relations in text are sets
of fixed token sequences (along with the types of the arguments);
because all intervening words must be included, these rarely
provide good coverage unless the arguments are adjacent.
Somewhat more flexible are paths in a dependency tree, but
these still require exact match at the path level -- they make
use of only positive training examples.
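A minimal sketch of such a fixed token-sequence model may make the coverage problem concrete; the patterns, entity types, and sentence below are invented for illustration:

```python
# A sketch of the simplest relation 'model': fixed token sequences with
# typed argument slots. Patterns, entity types, and the sentence are
# invented for illustration.

# Each pattern: (arg1 type, required intervening tokens, arg2 type, relation)
PATTERNS = [
    ("PERSON", ("works", "for"), "ORGANIZATION", "EMPLOYMENT"),
    ("PERSON", (",", "president", "of"), "ORGANIZATION", "EMPLOYMENT"),
]

def match_relations(tokens, mentions):
    """mentions: list of (start, end, type) spans over tokens."""
    found = []
    for (s1, e1, t1) in mentions:
        for (s2, e2, t2) in mentions:
            for (a1, middle, a2, rel) in PATTERNS:
                # every intervening token must match exactly --
                # hence the poor coverage noted above
                if t1 == a1 and t2 == a2 and tuple(tokens[e1:s2]) == middle:
                    found.append((tokens[s1:e1], rel, tokens[s2:e2]))
    return found

tokens = ["John", "Smith", "works", "for", "Ford"]
mentions = [(0, 2, "PERSON"), (4, 5, "ORGANIZATION")]
print(match_relations(tokens, mentions))
```

Any extra or reordered token between the arguments defeats the pattern, which is why such sets rarely generalize beyond adjacent arguments.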
Last week we considered three types of models for relation extraction
which learn from both positive and negative examples.
In the papers we looked at, all of these models were trained using
supervised methods (annotated corpora).
- probabilistic context-free grammars with both
syntactic and semantic annotations
- feature-based models, using a rich set of features (including
word level and syntactic features)
- kernel-based models, where the kernel measures tree similarity
and is used to compare a test example to positive and negative training
examples
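To give one concrete flavor of a tree kernel, here is a toy common-subtree count in the style of Collins & Duffy; the trees and labels are invented, and this is a simplification of the kernels in the papers we discussed:

```python
# A toy subtree kernel: similarity between two parse trees = number of
# shared subtrees. Trees are nested tuples (label, child, child, ...),
# with words as plain strings. Invented example, not from the papers.

def production(node):
    """Label of a node plus the labels (or words) of its children."""
    return (node[0], tuple(c if isinstance(c, str) else c[0] for c in node[1:]))

def common_subtrees_at(n1, n2):
    """Number of common subtrees rooted at this pair of nodes."""
    if production(n1) != production(n2):
        return 0
    count = 1
    for c1, c2 in zip(n1[1:], n2[1:]):
        if isinstance(c1, tuple):
            # child may be expanded or left as a bare nonterminal
            count *= 1 + common_subtrees_at(c1, c2)
    return count

def nodes(tree):
    yield tree
    for child in tree[1:]:
        if isinstance(child, tuple):
            yield from nodes(child)

def tree_kernel(t1, t2):
    """Sum the rooted counts over all pairs of nodes."""
    return sum(common_subtrees_at(a, b) for a in nodes(t1) for b in nodes(t2))

t1 = ("NP", ("DT", "the"), ("NN", "worker"))
t2 = ("NP", ("DT", "the"), ("NN", "plant"))
print(tree_kernel(t1, t2))  # 3 shared subtrees
```

The kernel value can then feed an SVM or nearest-neighbor comparison against positive and negative training examples.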
However, as we have discussed, annotated corpora are expensive --
particularly relation-annotated corpora. Relation annotation is more laborious than
name annotation: for each positive instance, we have to mark two
entities and a relation type. Annotation decisions may be more
difficult, leading to inconsistencies [LDC revised the ACE relation
categories several times in an effort to get more consistent
annotation]. Furthermore, a hand-annotated relation corpus is
likely to be quite incomplete
-- more so than a name-annotated corpus. Trying to keep in mind a
large number of relation types makes the task harder [MUC-7 had only 3
types of relations but ACE has about 20]. Identifying
long-distance relations in long sentences is particularly
difficult. Even dual annotation and adjudication [used for some
ACE corpora] are not sufficient.
So it's difficult to prepare a large, high-quality annotated corpus,
but a large corpus is needed for good coverage. The solution, as
for named entities, is to use semi-supervised methods. All these
methods involve bootstrapping, which alternates between
finding pairs of arguments and finding the contexts ('patterns') of
these arguments. This is analogous to the co-training (between
spelling features and context features) used in semi-supervised NE
tagging. All of these methods are based on named arguments ... anaphoric
arguments are not considered.
Sergei Brin. Extracting Patterns and Relations from the World Wide
Web. Proc. World Wide Web and Databases International Workshop,
pages 172-183. Number 1590 in LNCS, Springer, March 1998.
Brin describes a procedure, DIPRE, to acquire author-title pairs from the
Web starting from a small seed (5 pairs). A pattern is a 5-tuple:
(order, urlprefix, prefix, middle, suffix). If order is true, an
author-title pair matches a pattern if a page whose URL begins with
urlprefix contains the sequence prefix-author-middle-title-suffix.
The top-level algorithm, over a document collection D:
1. R = seed // initialize the set of pairs, R
2. O = findOccurrences(R, D) // find all occurrences of R in D
3. P = genPatterns(O) // generate a set of patterns from these occurrences
4. R = matches(P, D) // retrieve pairs in D matching P
5. if R is large enough, quit; else go to step 2
genPatterns operates by grouping all occurrences with the same order
and middle, and then computing the longest matching urlprefix, prefix,
and suffix. To avoid generating overly general patterns, Brin defines
specificity = total length of urlprefix, prefix, middle, and suffix,
and requires that the specificity of a pattern x the number of books
matching the pattern exceed a threshold.
- rigid pattern
- URL constraint
- minimal constraints on argument strings
- no attempt to gauge recall
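The DIPRE loop can be sketched roughly as follows. This simplification drops order and urlprefix, treats pages as plain strings, and invents all of the data; real DIPRE also applies the specificity test before accepting a pattern.

```python
# A toy sketch of the DIPRE bootstrapping loop. DIPRE's patterns are
# (order, urlprefix, prefix, middle, suffix) over crawled pages; here we
# drop order/urlprefix and use plain strings. All data is invented.
import re

def find_occurrences(pairs, docs, ctx=10):
    """Record a (prefix, middle, suffix) context for each seed pair found."""
    occs = []
    for doc in docs:
        for author, title in pairs:
            i, j = doc.find(author), doc.find(title)
            if 0 <= i < j:
                occs.append((doc[max(0, i - ctx):i],
                             doc[i + len(author):j],
                             doc[j + len(title):j + len(title) + ctx]))
    return occs

def common_suffix(strings):
    s = strings[0]
    while s and not all(x.endswith(s) for x in strings):
        s = s[1:]
    return s

def common_prefix(strings):
    s = strings[0]
    while s and not all(x.startswith(s) for x in strings):
        s = s[:-1]
    return s

def gen_patterns(occs):
    """Group occurrences by middle; keep longest shared prefix/suffix."""
    by_middle = {}
    for pre, mid, suf in occs:
        by_middle.setdefault(mid, []).append((pre, suf))
    return [(common_suffix([p for p, _ in ctxs]), mid,
             common_prefix([s for _, s in ctxs]))
            for mid, ctxs in by_middle.items()]

def matches(patterns, docs):
    """Retrieve all (author, title) pairs matching any pattern."""
    found = set()
    for pre, mid, suf in patterns:
        rx = re.compile(re.escape(pre) + r"(.+?)" + re.escape(mid)
                        + r"(.+?)" + re.escape(suf))
        for doc in docs:
            for m in rx.finditer(doc):
                found.add((m.group(1), m.group(2)))
    return found

docs = ["<li>Isaac Asimov, <i>Foundation</i></li>",
        "<li>Frank Herbert, <i>Dune</i></li>"]
R = {("Isaac Asimov", "Foundation")}   # step 1: seed
O = find_occurrences(R, docs)          # step 2
P = gen_patterns(O)                    # step 3
R = matches(P, docs)                   # step 4: now includes Herbert/Dune
```

One iteration already generalizes from the single seed to the second, unseen book on the page.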
Eugene Agichtein and Luis Gravano. Snowball: Extracting Relations
from Large Plain-Text Collections. Proc. 5th ACM International
Conference on Digital Libraries (ACM DL), 2000.
- NE requirements on arg strings
- flexible patterns
- run on news articles
- assigns confidences to patterns based on functionality of
relation, multiple seeds
- estimates recall
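Snowball's pattern confidence can be sketched as follows, relying (as Snowball does for its organization-headquarters relation) on the relation being functional: if a pattern proposes a different value for an argument whose value we already trust, that match counts as a negative. The organizations and cities below are invented.

```python
# Sketch of Snowball-style pattern confidence: positive / (positive + negative),
# where negatives come from contradicting known tuples of a functional relation.
# All data below is invented.

def pattern_confidence(proposed, known):
    """proposed: (org, loc) tuples extracted by one pattern.
    known: dict org -> trusted loc (from seeds / earlier iterations)."""
    positive = negative = 0
    for org, loc in proposed:
        if org in known:
            if loc == known[org]:
                positive += 1
            else:
                negative += 1    # functional relation: conflicting value
    # tuples with unknown orgs are ignored in this simplified version
    return positive / (positive + negative) if positive + negative else 0.0

known = {"Microsoft": "Redmond", "Exxon": "Irving"}
proposed = [("Microsoft", "Redmond"),      # agrees: positive
            ("Exxon", "Houston"),          # contradicts: negative
            ("Intel", "Santa Clara")]      # unknown org: ignored
print(pattern_confidence(proposed, known))  # 1 / (1 + 1) = 0.5
```

Low-confidence patterns are discarded, and tuple confidences are in turn derived from the confidences of the patterns that extracted them.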
Deepak Ravichandran and Eduard Hovy, Learning Surface
Text Patterns for a Question-Answering System, ACL 2002.
Motivated by work on question answering. Given a <query,
answer> seed pair for a relation, collects from a corpus a set of
sentences containing the pair. Using a suffix tree, collects all
shared substrings amongst the sentences which include both members of
the pair. Converts each such substring into a pattern (replacing
query and answer by variables). It then computes the precision of
each pattern as Ca/Co, where Co is the number of times the pattern
matches the corpus with a given query term and any string in the
answer position, and Ca is the number of times it matches with the
given query term and the correct answer.
- tested on a variety of relations
- some similarity to Agichtein, but does not cite Brin or Agichtein
- not iterative
- rigid pattern
- QA task: one argument given
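The precision score Ca/Co can be sketched as below. The surface pattern, corpus, and query/answer pair are invented, and real Ravichandran & Hovy patterns are harvested from suffix-tree substrings rather than written by hand.

```python
# Sketch of the Ravichandran & Hovy pattern precision score, Ca/Co.
# <NAME> and <ANSWER> mark the query and answer slots. Invented data.
import re

def pattern_precision(pattern, corpus, query, answer):
    """Co: matches with the query term and ANY string in the answer slot.
    Ca: matches with the query term and the correct answer."""
    rx = re.compile(pattern.replace("<NAME>", re.escape(query))
                           .replace("<ANSWER>", r"(\S+)"))
    co = ca = 0
    for sentence in corpus:
        for m in rx.finditer(sentence):
            co += 1
            if m.group(1) == answer:
                ca += 1
    return ca / co if co else 0.0

corpus = ["Mozart was born in 1756 .",
          "Mozart was born in Salzburg .",
          "Gauss was born in 1777 ."]
print(pattern_precision(r"<NAME> was born in <ANSWER>",
                        corpus, "Mozart", "1756"))  # 1 of 2 matches -> 0.5
```

Note how the ambiguity of "born in" (year vs. place) shows up directly as reduced precision.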
Razvan Bunescu and Raymond J. Mooney. Learning to Extract Relations
from the Web using Minimal Supervision. ACL 2007.
Looking ahead: unsupervised relation discovery
Takaaki Hasegawa, Satoshi Sekine, and Ralph Grishman. Discovering
Relations among Named Entities from Large Corpora. ACL 2004.
- unsupervised: gathers all frequent relations between
arguments of specified types (relies on NE tagger)
Looking ahead: event/scenario extraction