Speaker: Michel Galley, Stanford University
Location: Warren Weaver Hall 1302
Date: April 16, 2010, 11:30 a.m.
Host: Richard Cole
As the Internet becomes increasingly heterogeneous linguistically, information access becomes more challenging, since retrieved documents need to be translated into a wide range of languages. Fortunately, machine translation (MT) has made significant progress in recent years thanks to a shift towards corpus-based and statistical methods, which address the challenge of building MT systems for many language pairs. However, translations produced by statistical MT systems remain poor and often unintelligible for the more structurally different language pairs, e.g., Chinese to English. The root of the problem is that most state-of-the-art techniques parameterize the translation process using words or word sequences, which do not capture the kind of global features required to perform the long-distance (e.g., sentence-level) transformations needed to produce grammatical output. To address this shortcoming, I will describe a technique for learning syntactic translation models that map grammatical structures (e.g., verb and noun phrases) from one language to another. The key benefit of this technique is that it copes with major structural divergences between languages by learning non-isomorphic tree mappings; this learning runs in linear time and thus scales easily to hundreds of millions of words. But despite being able to model global syntactic relationships, syntactic MT systems are prohibitively slow at test time because they rely on high-order polynomial-time algorithms, and thus often fail to deliver on-the-fly translations. To address this issue, I will also present simpler translation models that retain the main advantages of syntactic systems (i.e., parsimony and good generalization on unseen data) while requiring no syntactic annotation. These simpler models can be decoded in linear time without resorting to any tree-based decoding algorithm, while delivering a competitive level of translation quality.
The latter work forms the basis of the Stanford MT system, which ranked second in the 2009 NIST MT evaluation in Arabic-to-English.
Michel Galley is a Research Associate in the CS department at Stanford University, working on statistical machine translation with Prof. Christopher Manning. He received a Ph.D. degree from Columbia University in 2007, and a Diploma in Computer Science from Ecole Polytechnique Fédérale de Lausanne (EPFL). His dissertation on abstractive summarization of conversational speech was supervised by Prof. Kathleen McKeown. During his graduate school years, he was a frequent visitor to Kevin Knight and Daniel Marcu's research group at USC/ISI, where he worked on syntax-based machine translation. Before entering graduate school, Michel was a research consultant in the Spoken Dialog Systems group at Bell Labs, Murray Hill. More information can be found at http://cs.stanford.edu/~mgalley.
Refreshments will be offered starting 15 minutes prior to the scheduled start of the talk.