G22.2591 - Advanced Natural Language Processing - Spring 2009
Further discussion of Assignment #2 results.
Discussion of Assignment #3
Discussion of the Text Analysis Conference (TAC) 2009 Knowledge-Base
Population (KBP) task as a possible source of term projects.
- corpus of about 1M news articles
- data base of about 100K people, orgs, and locations (derived
from Wikipedia infoboxes)
  - e.g., data base for person has birthdate, nationality,
    immediate relatives, ...
  - where appropriate, a data base value is both a string and a link
    to another data base entry
- given a person, org, or location:
  - if not in data base, create data base entry from news articles
  - if already in data base, fill in missing attributes in entry
- Recall important
- Need to extract from single instances
- Need to extract when arguments are anaphoric
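As a rough illustration of the KBP setup described above, the following Python sketch models a data base entry whose values carry a string and an optional link, and handles the two cases (create a missing entry, or fill in only the missing attributes). All names here (SlotValue, KBEntry, populate, the slot names) are hypothetical and not taken from the TAC task definition; entries are keyed by name purely for simplicity.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class SlotValue:
    text: str                    # the value as a string
    link: Optional[str] = None   # id of another entry, where appropriate


@dataclass
class KBEntry:
    name: str
    entity_type: str             # "person", "org", or "location"
    slots: Dict[str, SlotValue] = field(default_factory=dict)


def fill_missing(entry: KBEntry, extracted: Dict[str, SlotValue]) -> List[str]:
    """Add extracted values only for attributes the entry lacks."""
    filled = []
    for slot, value in extracted.items():
        if slot not in entry.slots:
            entry.slots[slot] = value
            filled.append(slot)
    return filled


def populate(kb: Dict[str, KBEntry], name: str, entity_type: str,
             extracted: Dict[str, SlotValue]) -> KBEntry:
    """Create the entry if it is absent, then fill in missing attributes."""
    if name not in kb:
        kb[name] = KBEntry(name=name, entity_type=entity_type)
    fill_missing(kb[name], extracted)
    return kb[name]
```

In this sketch existing values are never overwritten, which mirrors the "fill in missing attributes" case; how to reconcile conflicting values is left open.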
Relation Extraction (cont'd)
As we saw in looking at examples of ACE relations, identifying
relations accurately may depend heavily on first identifying
and classifying their arguments accurately. The relation is
quite different for "French auto worker" (nationality) and "Ford auto
worker" (employment). In particular, many relation taggers operate by
considering every pair of entity mentions (basically, noun phrases
representing people, places, or organizations) and applying a
classifier to the pair, returning either no relation or
the type of some relation. Poor performance of the entity
mention tagger severely compromises the relation tagger.
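The pairwise scheme just described can be sketched in Python as follows. This is a toy under stated assumptions: the Mention type, the rule-based toy_classify function, and the labels "citizenship" and "employment" are all hypothetical stand-ins (a real tagger would use a trained classifier and the ACE relation inventory).

```python
from itertools import permutations
from typing import Callable, List, NamedTuple, Optional, Tuple


class Mention(NamedTuple):
    text: str
    etype: str   # entity type assigned by the entity mention tagger
    start: int   # index of first token
    end: int     # index one past the last token


def toy_classify(arg1: Mention, arg2: Mention) -> Optional[str]:
    """Hypothetical rule: a premodifier directly before a PER mention."""
    if arg1.etype == "PER" and arg2.end == arg1.start:
        if arg2.etype == "GPE":
            return "citizenship"
        if arg2.etype == "ORG":
            return "employment"
    return None  # no relation


def tag_relations(mentions: List[Mention],
                  classify: Callable[[Mention, Mention], Optional[str]]
                  ) -> List[Tuple[str, str, str]]:
    """Apply the classifier to every ordered pair of entity mentions."""
    found = []
    for arg1, arg2 in permutations(mentions, 2):
        label = classify(arg1, arg2)
        if label is not None:
            found.append((arg1.text, label, arg2.text))
    return found
```

With "Ford" typed as ORG the toy rule yields an employment relation; if the entity tagger mistakenly types it as GPE, the same pair comes out as citizenship, which is the dependence on argument classification noted above.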
To make matters worse, what the relation tagger finds is a relation
between noun phrases. (In ACE parlance, they are relation
mentions which represent relations between entity mentions.)
If the noun phrase is an anaphoric reference to a named entity
(e.g., "Bernard Madoff was jailed yesterday. ... His wife Mabel remained
in seclusion." or "... The swindler's wife, Mabel, remained in
seclusion."), we really want to recover the name of the entity.
This requires establishing coreference between entity mentions.
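A minimal Python sketch of this last step, assuming coreference chains are already available from some upstream component: each relation mention argument is replaced by a name from its chain. The tuple layouts, the kind labels ("NAME", "NOMINAL", "PRONOUN"), and the choice of the longest name as canonical are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Mention = Tuple[str, str, str]    # (mention_id, text, kind)
Relation = Tuple[str, str, str]   # (arg1 mention_id, label, arg2 mention_id)


def canonical_names(chains: List[List[Mention]]) -> Dict[str, str]:
    """Map every mention in a chain to the longest NAME mention in it."""
    names: Dict[str, str] = {}
    for chain in chains:
        named = [text for _, text, kind in chain if kind == "NAME"]
        if not named:
            continue  # chain has no name; leave its mentions unmapped
        best = max(named, key=len)
        for mention_id, _, _ in chain:
            names[mention_id] = best
    return names


def resolve_relations(relations: List[Relation],
                      chains: List[List[Mention]]
                      ) -> List[Tuple[str, str, str]]:
    """Rewrite relation mentions as relations between named entities."""
    names = canonical_names(chains)
    text = {mid: t for chain in chains for mid, t, _ in chain}
    # Fall back to the mention's own text when its chain has no name.
    return [(names.get(a, text[a]), label, names.get(b, text[b]))
            for a, label, b in relations]
```

Applied to the example above, a relation found between "His" and "wife" would be reported as a relation between "Bernard Madoff" and "Mabel", provided the coreference chains are correct.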
Semi- and un-supervised methods
All these methods involve bootstrapping which alternates between
finding pairs of arguments and finding the contexts ('patterns') of
these arguments. This is analogous to the co-training (between
spelling features and context features) used in semi-supervised NE tagging.
All of these methods are based on named arguments ... anaphoric
arguments are not considered.
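The alternation between argument pairs and patterns can be sketched in Python as below. This is a deliberately simplified illustration, not any one paper's algorithm: a pattern is just the literal string between the two arguments, arguments are matched by a crude capitalized-word regular expression (NAME), and there is no pattern confidence or filtering. All function names are hypothetical.

```python
import re
from typing import Iterable, List, Set, Tuple

Pair = Tuple[str, str]

# Crude stand-in for a named argument: one or more capitalized words.
NAME = r"([A-Z][a-z]+(?: [A-Z][a-z]+)*)"


def find_patterns(sentences: List[str], pairs: Iterable[Pair]) -> Set[str]:
    """Collect the text between known argument pairs (arg1 before arg2)."""
    patterns = set()
    for s in sentences:
        for a, b in pairs:
            i, j = s.find(a), s.find(b)
            if i != -1 and j != -1 and i + len(a) <= j:
                middle = s[i + len(a):j]
                if middle.strip():
                    patterns.add(middle)
    return patterns


def find_pairs(sentences: List[str], patterns: Iterable[str]) -> Set[Pair]:
    """Collect argument pairs that occur on either side of a pattern."""
    pairs = set()
    for p in patterns:
        regex = re.compile(NAME + re.escape(p) + NAME)
        for s in sentences:
            for m in regex.finditer(s):
                pairs.add((m.group(1), m.group(2)))
    return pairs


def bootstrap(sentences: List[str], seeds: Iterable[Pair],
              rounds: int = 3) -> Tuple[Set[Pair], Set[str]]:
    """Alternate pattern finding and pair finding until nothing new appears."""
    pairs, patterns = set(seeds), set()
    for _ in range(rounds):
        new_patterns = find_patterns(sentences, pairs)
        new_pairs = find_pairs(sentences, new_patterns)
        if new_patterns <= patterns and new_pairs <= pairs:
            break
        patterns |= new_patterns
        pairs |= new_pairs
    return pairs, patterns
```

Because nothing here checks pattern quality, one bad pattern can pull in wrong pairs that then generate more bad patterns; the confidence and recall estimates noted for the individual papers address that drift in different ways.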
Sergey Brin. Extracting Patterns and Relations from the World Wide Web.
In Proc. World Wide Web and Databases International Workshop,
pages 172-183. Number 1590 in LNCS, Springer, March 1998.
- rigid pattern
- URL constraint
- minimal constraints on argument strings
- no attempt to gauge recall
Eugene Agichtein and Luis Gravano. Snowball: Extracting Relations
from Large Plain-Text Collections. In Proc. 5th ACM International
Conference on Digital Libraries (ACM DL), 2000.
- NE requirements on arg strings
- flexible patterns
- run on news articles
- assigns confidences to patterns based on functionality of
  relation, multiple seeds
- estimates recall
Deepak Ravichandran and Eduard Hovy. Learning Surface Text Patterns
for a Question-Answering System. ACL 2002.
- tested on a variety of relations
- does not cite Brin or Agichtein
- not iterative
- rigid pattern
- QA task: one argument given
- assigns confidences to patterns based on precision on retrieving seed
  (assumes Web search retrieving multiple instances of relation)
Takaaki Hasegawa, Satoshi Sekine, and Ralph Grishman. Discovering
Relations among Named Entities from Large Corpora.
- unsupervised: gathers all frequent relations between
  arguments of specified types (relies on NE tagger)
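The unsupervised idea noted above (gathering frequent relations between arguments of specified types) can be illustrated with the Python sketch below. It assumes an NE tagger has already produced (arg1, arg2, words between) instances for a single pair of types, keeps only pairs seen at least min_count times, and groups pairs whose context words are similar. The greedy grouping, the cosine threshold, and all names are simplifying assumptions for illustration, not the published procedure.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple

Pair = Tuple[str, str]
Instance = Tuple[str, str, List[str]]   # (arg1, arg2, words between them)


def context_vectors(instances: Iterable[Instance],
                    min_count: int = 2) -> Dict[Pair, Counter]:
    """Bag of context words per argument pair, for frequent pairs only."""
    counts: Counter = Counter()
    vectors: Dict[Pair, Counter] = defaultdict(Counter)
    for a, b, words in instances:
        counts[(a, b)] += 1
        vectors[(a, b)].update(w.lower() for w in words)
    return {p: v for p, v in vectors.items() if counts[p] >= min_count}


def cosine(u: Counter, v: Counter) -> float:
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0


def cluster(vectors: Dict[Pair, Counter],
            threshold: float = 0.5) -> List[List[Pair]]:
    """Greedy grouping: join a cluster only if similar to all its members."""
    clusters: List[List[Pair]] = []
    for pair in sorted(vectors):
        for members in clusters:
            if all(cosine(vectors[pair], vectors[m]) >= threshold
                   for m in members):
                members.append(pair)
                break
        else:
            clusters.append([pair])
    return clusters
```

Each resulting cluster is a candidate relation; the most frequent context words in a cluster could serve as a label for it. Since the instances come from an NE tagger, tagging errors propagate directly into the clusters.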
Looking ahead: event/scenario extraction