G22.2591 - Advanced Natural Language Processing - Spring 2004

Lecture 12

Information Extraction:  Unsupervised Learning

(course evaluation today)

We considered last week some of the methods for learning extraction patterns from annotated corpora.  Developing annotated corpora for information extraction is particularly problematic because there are so many scenarios (event types), and we need a separate annotation for each scenario.  We therefore consider this week how such systems could be developed with very little training data.

Several of these systems are based on bootstrapping methods.  Bootstrapping relies on redundancy: multiple features, each of which is correlated with an instance of the class or relation of interest.  We will consider examples of two types of bootstrapping: pattern/instance bootstrapping and pattern/relevant document bootstrapping.  In pattern/instance bootstrapping, a pair of names in a given context is likely to be an instance of a relation if (1) other pairs appearing in this context are instances of the relation or (2) this pair, appearing in other contexts, is an instance of the relation.  This is similar to the bootstrapping used for unsupervised name discovery.
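
The pattern/instance loop can be sketched as follows.  This is only a toy illustration, not the procedure of any of the papers below: it assumes names have already been tagged, so that each occurrence is a (name1, context, name2) triple, it treats the literal context string as the pattern, and it omits the pattern and instance scoring a practical system would need to keep errors from propagating.  The function name, the data, and the author/book relation are all made up for the example.

```python
# Toy pattern/instance bootstrapping over pre-tagged occurrences.
# Each occurrence is (name1, context, name2); all data here is illustrative.
OCCURRENCES = [
    ("Isaac Asimov", "wrote", "Foundation"),
    ("Isaac Asimov", ", author of", "Foundation"),
    ("Frank Herbert", ", author of", "Dune"),
    ("William Gibson", "wrote", "Neuromancer"),
    ("Frank Herbert", "'s novel", "Dune"),
    ("Ursula Le Guin", "'s novel", "The Dispossessed"),
    ("Isaac Asimov", "was born in", "Petrovichi"),
]
SEEDS = {("Isaac Asimov", "Foundation")}


def bootstrap(occurrences, seeds, rounds=2):
    """Alternate between growing the pattern set and the instance set."""
    pairs = set(seeds)
    patterns = set()
    for _ in range(rounds):
        # A context becomes a pattern if a known pair appears in it.
        patterns |= {ctx for a, ctx, b in occurrences if (a, b) in pairs}
        # Any pair appearing in an accepted pattern becomes a new instance.
        pairs |= {(a, b) for a, ctx, b in occurrences if ctx in patterns}
    return pairs, patterns


if __name__ == "__main__":
    pairs, patterns = bootstrap(OCCURRENCES, SEEDS)
    print(sorted(patterns))
    print(sorted(pairs))
```

Starting from a single seed pair, the second round picks up the "'s novel" context only because the first round added a pair that occurs in it; the "was born in" context is never accepted because no known pair appears in it.
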
In pattern/relevant document bootstrapping, a pair of names in a given context is likely to be an instance of an event if (1) other pairs appearing in this context are instances of the event or (2) the pair appears more frequently in documents containing other instances of the event than in the rest of the collection.
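
A rough sketch of the document-based variant is given here.  It is a simplification under stated assumptions, not the algorithm of any paper listed below: each document is assumed to be already reduced to a set of candidate pattern strings, a document counts as relevant if it contains any accepted pattern, and the score (precision times the number of relevant documents containing the pattern) and the `min_precision` threshold are hypothetical choices made for the example.

```python
# Toy pattern/relevant-document bootstrapping.
# Each document is a set of candidate patterns; all data is illustrative.
DOCS = [
    {"company appoint person", "person resign", "person succeed person"},
    {"company appoint person", "person resign"},
    {"company appoint person", "company report profit"},
    {"person resign", "person succeed person"},
    {"company report profit", "share rise"},
    {"company report profit", "share rise"},
]


def bootstrap_patterns(docs, seed_patterns, rounds=3, min_precision=0.6):
    """Add, in each round, the candidate most concentrated in relevant docs."""
    accepted = set(seed_patterns)
    for _ in range(rounds):
        best, best_score = None, 0.0
        # Sort the candidates so that ties are broken deterministically.
        for p in sorted(set().union(*docs) - accepted):
            containing = [d for d in docs if p in d]
            # A document is relevant if it holds any accepted pattern.
            in_relevant = sum(1 for d in containing if d & accepted)
            precision = in_relevant / len(containing)
            score = precision * in_relevant  # hypothetical scoring function
            if precision >= min_precision and score > best_score:
                best, best_score = p, score
        if best is None:
            break  # no remaining candidate is concentrated enough
        accepted.add(best)
    return accepted


if __name__ == "__main__":
    print(sorted(bootstrap_patterns(DOCS, {"company appoint person"})))
```

Each accepted pattern enlarges the set of relevant documents, which in turn can raise the precision of patterns that were rejected in an earlier round; in this data "person succeed person" only qualifies after "person resign" has been accepted.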

Pattern/instance bootstrapping:

Sergey Brin.  Extracting Patterns and Relations from the World Wide Web.  In Proc. World Wide Web and Databases International Workshop, pages 172-183. Number 1590 in LNCS, Springer, March 1998.

Eugene Agichtein and Luis Gravano.  Snowball: Extracting Relations from Large Plain-Text Collections.  In Proc. 5th ACM International Conference on Digital Libraries (ACM DL), 2000.

Discovery from relevant documents: 

Riloff, E. (1996) "Automatically Generating Extraction Patterns from Untagged Text".  Proc. Thirteenth National Conference on Artificial Intelligence (AAAI-96), 1996, pp. 1044-1049.

Roman Yangarber; Ralph Grishman; Pasi Tapanainen; Silja Huttunen.  Automatic Acquisition of Domain Knowledge for Information Extraction.  Proc. COLING 2000.


Kiyoshi Sudo, Satoshi Sekine and Ralph Grishman.  An Improved Extraction Pattern Representation Model for Automatic IE Pattern Acquisition.  Proceedings of ACL 2003; Sapporo, Japan.

Presentation by the (first) author.

Cross-language projection:

Riloff, E., Schafer, C., and Yarowsky, D. (2002) Inducing Information Extraction Systems for New Languages via Cross-Language Projection.  Proc. 19th International Conference on Computational Linguistics (COLING 2002).

Presentation by Ben Wellington.