Web-scale Multi-lingual Opinion Mining
Speaker: Ryan McDonald, Google
Location: Warren Weaver Hall 1302
Date: October 30, 2009, 11:30 a.m.
Host: Mehryar Mohri
Documents, news, blogs, forums, consumer reviews, etc. The web is full of pages that contain unstructured (or semi-structured) natural language. Building intelligent algorithms that learn to analyze this content in order to extract meaningful associations and facts has become an important part of all search engines. This need is amplified when one considers the long tail of search, which contains a variety of questions and investigative queries.
One avenue of research that has garnered a lot of attention is to analyze these texts for opinions, to extract and parse these opinions, and to summarize them for easy digestion by users. Understanding which texts (or text fragments) are opinions and identifying the topics and sentiment within them can be a powerful tool. Applications include filtering web pages in fact-driven searches, improving results for searches like "best clam chowder Boston" and "hotels Paris child friendly", brand-name tracking and management, and improving vertical search for product, travel and local service queries.
In this talk I will describe a joint research and engineering effort at Google to build an opinion mining infrastructure. In particular, I will discuss how we build large multi-lingual opinion lexicons using label propagation over an automatically induced graph of phrases from the web. The resulting lexicons are on a scale not previously seen in the literature. These lexicons not only contain the usual adjectives, nouns and verbs, but also multi-word expressions such as "just what the doctor ordered" and "run of the mill", misspellings, vulgarity and more. I will present an evaluation of the lexicons, both qualitative and quantitative, exploring their quality relative to lexicons extracted from WordNet, multi-lingual systems that use translation, and classifiers learned from labeled consumer reviews. The experiments ultimately show that a meta-learning system that takes into account features from all sources and learns a single unified model across all languages provides the best performance.
Ryan McDonald is a Senior Research Scientist at Google in New York. Before joining Google, Ryan received a B.Sc. from the University of Toronto, an M.Eng. from the University of Pennsylvania, and a Ph.D. from the University of Pennsylvania. Ryan's research focuses on how computers can learn to analyze and summarize large collections of unstructured text. His thesis work investigated the complexity of learning and predicting syntactic dependency representations of sentences -- a syntactic formalism used in numerous applications from machine translation to question answering. Ryan recently co-authored a book titled "Dependency Parsing", which is available as of February 2009. While at Google, Ryan has been part of an effort to extract, parse, and summarize opinions from the web. Personal webpage: http://ryanmcd.com
Refreshments will be offered starting 15 minutes prior to the scheduled start of the talk.