Candidate: Satoshi Sekine
Advisor: Ralph Grishman

Corpus-based Parsing and Sublanguage Studies

11:30 a.m., Wednesday, April 8, 1998
12th floor conference room
719 Broadway


This thesis addresses two main topics: a corpus-based parser and a study of sublanguage.

A novel approach to corpus-based parsing is proposed. In this framework, a probabilistic grammar is constructed whose rules are partial trees from a syntactically bracketed corpus. The distinctive feature is that the partial trees are multi-layered. In other words, only a small number of non-terminals are used to cut the initial trees; other grammatical nodes are embedded into the partial trees, and hence into the grammar rules. Good parsing performance was obtained, even with small training corpora. Several techniques were developed to improve the parser's accuracy, including, in particular, two methods for incorporating lexical information. One method uses probabilities of binary lexical dependencies; the other directly lexicalizes the grammar rules. Because the grammar rules are long, the number of rules is huge: more than thirty thousand from a corpus of one million words. A parsing algorithm that can efficiently handle such a large grammar is described. A Japanese parser based on the same idea was also developed.
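
As an illustration of the multi-layered partial trees described above, the following minimal sketch cuts a bracketed tree only at a small set of non-terminals and keeps every other node embedded in the rule body. It is not the thesis implementation. The cut set {S, NP}, the nested-tuple tree encoding, the function names, and the relative-frequency estimate of rule probabilities are all assumptions made for this example.

```python
from collections import Counter

# Assumption: the small set of non-terminals at which trees are cut.
CUT_LABELS = {"S", "NP"}


def extract_rules(tree, cut_labels=CUT_LABELS):
    """Split one bracketed tree into multi-layered partial-tree rules.

    A tree is a nested tuple (label, child, ...); a leaf is a
    part-of-speech tag given as a plain string (hypothetical encoding).
    Returns a list of (lhs, rhs) pairs, where rhs is a bracketed string.
    """
    rules = []

    def render(node, is_root):
        if isinstance(node, str):
            return node  # POS tag, used as a terminal symbol
        label, children = node[0], node[1:]
        if not is_root and label in cut_labels:
            # Cut here: the subtree becomes its own rule, and only the
            # label remains in the enclosing rule.
            rules.append((label, render(node, True)))
            return label
        inner = " ".join(render(child, False) for child in children)
        # Nodes outside the cut set stay embedded, brackets included.
        return inner if is_root else f"({label} {inner})"

    rules.append((tree[0], render(tree, True)))
    return rules


def estimate_probs(rules):
    """Relative-frequency estimate of P(rhs | lhs); an assumption here."""
    rule_counts = Counter(rules)
    lhs_counts = Counter(lhs for lhs, _ in rules)
    return {rule: n / lhs_counts[rule[0]] for rule, n in rule_counts.items()}


example_tree = ("S", ("NP", "DT", "NN"), ("VP", "VBD", ("NP", "DT", "NN")))
example_rules = extract_rules(example_tree)
# example_rules == [("NP", "DT NN"), ("NP", "DT NN"), ("S", "NP (VP VBD NP)")]
example_probs = estimate_probs(example_rules)
```

Because VP is not in the cut set, it stays inside the S rule, so that single rule spans two layers of the original tree. This is also why such rules grow long and numerous.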

Corpus-based sublanguage studies were conducted to relate the notion of sublanguage to lexical and syntactic properties of a text. A statistical method based on word frequencies was developed to define sublanguages within a collection of documents; this method was evaluated by identifying the sublanguage of new documents. Relative frequencies of different syntactic structures were used to assess the domain dependency of syntactic structure in a multi-domain corpus. Cross-entropy measurements showed a clear distinction between fiction and non-fiction domains. Experiments were then performed in which grammars trained on individual domains, or sets of domains, were used to parse texts in the same or other domains. The results correlate with the measurements of syntactic variation across domains; in particular, the best performance is achieved using grammars trained on the same or similar domains.
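
The cross-entropy comparison of syntactic structure across domains can be sketched as follows. This is a minimal example under stated assumptions, not the procedure used in the thesis. The rule counts are made-up toy data, the add-alpha smoothing of the model distribution is an assumption, and the function names are hypothetical.

```python
import math


def relative_freqs(counts, vocab, alpha=0.0):
    """Relative frequency of each structure, with optional add-alpha smoothing."""
    total = sum(counts.values()) + alpha * len(vocab)
    return {x: (counts.get(x, 0) + alpha) / total for x in vocab}


def cross_entropy(test_counts, model_counts, alpha=1.0):
    """H(p, q) = -sum_x p(x) * log2 q(x), in bits.

    p is the unsmoothed distribution of structures in the test domain;
    q is the smoothed distribution in the model domain (smoothing is an
    assumption, needed so that q(x) > 0 for every structure).
    """
    vocab = set(test_counts) | set(model_counts)
    p = relative_freqs(test_counts, vocab)
    q = relative_freqs(model_counts, vocab, alpha)
    return -sum(p[x] * math.log2(q[x]) for x in vocab if p[x] > 0)


# Toy counts of syntactic structures in two domains (illustrative only).
fiction = {"S -> NP VP": 8, "NP -> PRP": 6, "NP -> DT NN": 2}
news = {"S -> NP VP": 8, "NP -> DT NN": 7, "NP -> NNP NNP": 5}

same_domain = cross_entropy(fiction, fiction)
other_domain = cross_entropy(fiction, news)
```

With these toy counts, `same_domain` is lower than `other_domain`: the model built from the other domain assigns low probability to structures that are frequent in the test domain.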

The parsing and sublanguage techniques were applied to speech recognition. Sublanguage techniques were able to increase recognition accuracy, and some promising cases were found where the parser was able to correct recognition errors.