Learning Correspondence Representations for Natural Language Processing
Speaker: John Blitzer, University of California, Berkeley
Location: Warren Weaver Hall 1302
Date: March 5, 2010, 11:30 a.m.
Host: Richard Cole
The key to creating scalable, robust natural language processing (NLP) systems is to exploit correspondences between known and unknown linguistic structure. Natural language processing has experienced tremendous success over the past two decades, but our most successful systems are still limited to the domains and languages where we have large amounts of hand-annotated data. Unfortunately, these domains and languages represent a tiny portion of the total linguistic data in the world. No matter the task, we always encounter unknown linguistic features, such as words and syntactic constituents, that we never observed when estimating our models. This talk is about linking these linguistic features to one another through correspondence representations.
The first part describes a technique to learn lexical correspondences for domain adaptation of sentiment analysis systems. These systems predict the general attitude of an essay toward a particular topic. In this setting, words that are highly predictive in one domain may not be present in another. We show how to build a correspondence representation between words in different domains using projections to low-dimensional, real-valued spaces. Unknown words are projected onto this representation and related directly to known features via Euclidean distance. The correspondence representation allows us to train significantly more robust models in new domains, and we achieve a 40% relative reduction in error due to adaptation over a state-of-the-art system.
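As a rough illustration of the idea of projecting words into a low-dimensional, real-valued space and relating unknown words to known ones by Euclidean distance, here is a minimal sketch. It assumes a truncated SVD of a toy word-by-context co-occurrence matrix as the projection; this is a stand-in chosen for brevity and is not necessarily the method presented in the talk. The words, counts, and the helper names `project` and `nearest_known` are all hypothetical.

```python
import numpy as np

# Hypothetical toy data: each row is a word, each column counts how often the
# word co-occurs with one of three shared context features. Values are invented.
words = ["excellent", "terrible", "fast-paced", "blurry"]
cooc = np.array([
    [8.0, 1.0, 0.0],   # "excellent"  (known: appears in the source domain)
    [0.0, 1.0, 9.0],   # "terrible"   (known: appears in the source domain)
    [7.0, 2.0, 1.0],   # "fast-paced" (unknown: appears only in the new domain)
    [1.0, 0.0, 8.0],   # "blurry"     (unknown: appears only in the new domain)
])

def project(matrix, k):
    """Project rows onto the top-k singular directions (assumed projection)."""
    u, s, _ = np.linalg.svd(matrix, full_matrices=False)
    return u[:, :k] * s[:k]

def nearest_known(embeddings, query_idx, known_idx):
    """Return the index of the known word closest to the query in Euclidean distance."""
    dists = [np.linalg.norm(embeddings[query_idx] - embeddings[i]) for i in known_idx]
    return known_idx[int(np.argmin(dists))]

emb = project(cooc, k=2)
known = [0, 1]
for q in [2, 3]:
    print(words[q], "->", words[nearest_known(emb, q, known)])
```

In this toy setup, each unknown word is linked to the known word whose context profile is most similar, which is the sense in which a shared low-dimensional space can relate features across domains.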
The second part describes a technique to learn syntactic correspondences between languages for machine translation. Syntactic machine translation models exploit syntactic correspondences to translate grammatical structures (e.g., subjects, verbs, and objects) from one language to another. Given pairs of sentences that are translations of one another, we build a latent correspondence grammar which links grammatical structures in one language to grammatical structures in another. The syntactic correspondences induced by our grammar significantly improve a state-of-the-art Chinese-English machine translation system.
John Blitzer is a postdoctoral fellow in the computer science department at the University of California, Berkeley, working with Dan Klein. He completed his PhD in computer science at the University of Pennsylvania under Fernando Pereira, and in 2008 spent six months as a visiting researcher in the natural language computing group at Microsoft Research Asia.
John's research focuses on applications of machine learning to natural language. In particular, he is interested in exploiting unlabeled data and other sources of side information to improve supervised models. He has applied these techniques to tagging, parsing, entity recognition, web search, and machine translation. To learn more about John's research interests, please visit his web page: http://john.blitzer.com
Refreshments will be offered starting 15 minutes prior to the scheduled start of the talk.