Search Problems for Speech and Audio Sequences
Candidate: Eugene Weinstein
Advisor: Mehryar Mohri


The modern proliferation of very large audio, video, and biological databases has created a need for the design of effective methods for indexing and searching highly variable or uncertain data. Classical search and indexing algorithms deal with clean or perfect input sequences. However, an index created from speech transcriptions is marked with errors and uncertainties stemming from the use of imperfect statistical models in the speech recognition process. Similarly, automatic transcription of music, such as assigning a sequence of notes to represent a stream of music audio, is prone to errors. How can we generalize search and indexing algorithms to deal with such uncertain inputs?

This thesis presents several novel algorithms, analyses, and general techniques and tools for effective indexing and search that not only tolerate but actually exploit this uncertainty. In particular, it develops an algorithmic foundation for music identification, or content-based music search; presents novel automata-theoretic results applicable generally to a variety of search and indexing tasks; and describes new algorithms for topic segmentation, or automatic splitting of speech streams into topic-coherent segments.

We devise a new technique for music identification in which each song is represented by a distinct sequence of music sounds, called "music phonemes." In our approach, we learn the set of music phonemes, as well as a unique sequence of music phonemes characterizing each song, from training data using an unsupervised algorithm. We also propose a novel application of factor automata to create a compact mapping of music phoneme sequences to songs. Using these techniques, we construct an efficient and robust music identification system for a large database of songs.

We further design new algorithms for compact indexing of uncertain inputs based on suffix and factor automata and give novel theoretical guarantees for their space requirements. Suffix automata and factor automata represent the set of all suffixes or substrings of a set of strings, and are used in numerous indexing and search tasks, including the music identification system just mentioned. We show that the suffix automaton or factor automaton of a set of strings U has at most 2Q-2 states, where Q is the number of nodes of a prefix-tree representing the strings in U, a significant improvement over previous work. We also describe a matching new linear-time algorithm for constructing the suffix automaton S or factor automaton F of U in time O(|S|).

We also define a new quality measure for topic segmentation systems and design a discriminative topic segmentation algorithm for speech inputs, thus facilitating effective indexation of spoken audio collections. The new quality measure improves on previously used criteria and is correlated with human judgment of topic-coherence. Our segmentation algorithm uses a novel general topical similarity score based on word co-occurrence statistics. This new algorithm outperforms previous methods in experiments over speech and text streams. We further demonstrate that the performance of segmentation algorithms can be improved by using a lattice of competing hypotheses over the speech stream rather than just the one-best hypothesis as input.