Tree-Structured Models of Multitext: Theory, Design and Experiments
Candidate: Benjamin Wellington
Advisor: I. Dan Melamed

Abstract

Statistical machine translation (SMT) systems use empirical models to simulate the act of human translation between language pairs. This dissertation surveys the ability of currently popular syntax-aware SMT systems to model real-world multitext, and shows different types of linguistic phenomena occurring in natural language translation that these popular systems cannot capture. It then proposes a new grammar formalism, Generalized Multitext Grammar (GMTG), and a generalization of Chomsky Normal Form, that allows us to build an efficient SMT system using previously developed parsing techniques. The dissertation addresses many software engineering issues that arise when doing syntax-based SMT using large corpora and lays out a object-oriented design for a translation toolkit. Using the toolkit, we show that a tree-transduction based SMT system, which uses modern machine learning algorithms, outperforms a generative baseline.