Ensemble Semantics: A General Information Extraction Framework
Speaker: Patrick Pantel, University of Southern California
Location: Warren Weaver Hall 1302
Date: April 14, 2010, 11:30 a.m.
Host: Richard Cole
Mounting evidence shows that combining information sources and information extraction algorithms leads to improvements in several knowledge acquisition tasks, such as entity extraction, fact extraction, and entailment rule acquisition. In this talk, we present Ensemble Semantics (ES), a general framework for modeling information extraction algorithms that combine multiple sources of information and multiple extractors. We show large gains in entity extraction by combining state-of-the-art distributional and pattern-based extractors with a large set of features from a 600-million-document web crawl, one year of query logs, and a snapshot of Wikipedia. We explore the hypothesis that although distributional and pattern-based algorithms are complementary, they do not exhaust the semantic space; other sources of evidence can be leveraged to better combine them. A detailed analysis of feature correlations and interactions shows that query log and web crawl features yield the highest gains, but easily accessible Wikipedia features also improve over current state-of-the-art systems. We then take a closer look at Yahoo!'s distributional set expansion extractor and study the impact of editor-chosen seeds on extraction performance. We show that, in general, few seeds are needed to saturate a distributional model, and that the model is very sensitive to the composition of the seed set, resulting in tremendous variance in expansion performance. We study this sensitivity further and show that untrained editors are terrible at choosing the right seeds, and we propose an algorithm for helping editors choose better seeds.
Patrick Pantel is a Senior Scientist at Yahoo! Labs and a Research Assistant Professor in the Natural Language Group at the USC Information Sciences Institute, where he conducts research in large-scale natural language processing, text mining, and knowledge acquisition. In 2003, he received a Ph.D. in Computing Science from the University of Alberta in Edmonton, Canada.
Refreshments will be offered starting 15 minutes prior to the scheduled start of the talk.