Language is complex, but our labeled data sets generally aren't. For example, treebanks specify coarse categories like noun phrases and verb phrases, but they say nothing about richer phenomena such as agreement, case, and definiteness. In this talk, I will present a latent variable model for natural language parsing that learns these underlying complexities automatically. I will describe a state-splitting approach that begins with a trivial X-bar grammar and iteratively refines it: in each step, latent variables refine the previous model, until a final, full-complexity model is reached. Because each refinement introduces only limited additional complexity, learning can be done efficiently and effectively in both a generative and a discriminative framework. In the generative variant, the latent variables split grammar symbols; for example, noun phrases are first split into subjects and objects, then into singular and plural, and so on. A split-and-merge technique allocates the refinements only where necessary, allowing different grammar symbols to specialize to different degrees. I will also present a discriminative multiscale variant that splits grammar rules rather than grammar symbols. In this approach, complexity need not be uniform across the grammar, yielding space savings of several orders of magnitude. Both variants achieve state-of-the-art parsing accuracies across an array of languages, in a fully language-general fashion.
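
The split-and-merge refinement loop can be pictured with a small sketch. The Python below is a hypothetical, heavily simplified illustration, not the talk's actual implementation: it only tracks how many latent subsymbols each grammar symbol carries, and the EM-based likelihood criterion is replaced by a toy estimate_gain stub (in the real model, merge decisions are made per individual split and the gain is estimated from treebank likelihood).

```python
# Hypothetical sketch of iterative split-and-merge symbol refinement.
# Each grammar symbol starts with one latent subsymbol; in every round its
# subsymbols are split in two, and splits whose estimated likelihood gain is
# too small are merged back, so symbols specialize to different degrees.

def split_merge_round(subsymbol_counts, estimate_gain, merge_threshold=0.01):
    """Perform one split-merge round.

    subsymbol_counts: dict mapping a grammar symbol (e.g. "NP") to its
        current number of latent subsymbols.
    estimate_gain: callable(symbol) -> float, a stand-in for the EM-based
        estimate of how much splitting that symbol helps data likelihood.
    """
    refined = {}
    for symbol, count in subsymbol_counts.items():
        if estimate_gain(symbol) < merge_threshold:
            refined[symbol] = count        # merge back: the split is not worth it
        else:
            refined[symbol] = count * 2    # keep the split: double the subsymbols
    return refined

if __name__ == "__main__":
    # Trivial X-bar starting point: one subsymbol per symbol.
    grammar = {"NP": 1, "VP": 1, "PP": 1, ",": 1}
    # Toy gain function: pretend phrasal symbols benefit from splitting
    # while punctuation does not (a placeholder for the real EM estimate).
    gain = lambda sym: 0.5 if sym.isalpha() else 0.001
    for i in range(3):
        grammar = split_merge_round(grammar, gain)
        print(f"round {i + 1}: {grammar}")
```

Run on this toy setup, NP, VP, and PP grow to 8 subsymbols after three rounds while the punctuation symbol stays at 1, which is the behavior the abstract describes: refinement is allocated only where it pays off.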