Course#: CSCI-GA.2585-001

Instructor: Cyril Allauzen

Grader/TA: Aleks Kracun

Site/slides credit: Mehryar Mohri

Description

This course gives a computer science presentation of automatic speech
recognition, the problem of turning human speech into written
transcripts. Over the course of the semester, this class will cover many
of the essential algorithms for creating large-scale speech recognition
systems. The coverage will include the theoretical and practical aspects
of the algorithms and techniques now used in the vast majority of speech
recognition systems in both industry and academia. Besides covering
classical speech recognition algorithms developed over the past several
decades, the course material will also treat a sampling of recent
developments in this dynamic field.

Many of the learning and search algorithms and techniques currently
used in natural language processing, computational biology, and other
areas of application of machine learning were originally designed for
tackling speech recognition problems. Speech recognition continues to
feed computer science with challenging problems, in particular because
of the size of the learning and search problems it generates.

The objective of the course is thus not just to familiarize students with particular algorithms used in speech recognition, but to use them as a basis for exploring general text, speech, and machine learning algorithms relevant to a variety of other areas in computer science. The course will allow students to work with real speech recognition systems by making use of several software libraries implementing the algorithms covered in the lectures.

This course is also open to undergraduate students.

Lectures

The following lecture plan roughly covers the topics planned for the course. This list is subject to revision as the semester progresses.

- Lecture 01: Introduction to speech recognition, statistical formulation.
- Lecture 02: Finite automata and transducers.
- Lecture 03: Weighted transducer algorithms.
- Lecture 04: Weighted transducer software library. [OpenFst tutorial]
- Lecture 05: n-gram language models.
- Lecture 06: Language modeling software library.
- Lecture 07: Expectation-maximization (EM) algorithm, hidden Markov models (HMMs).
- Lecture 08: Acoustic models, Gaussian mixture models, neural networks.
- Lecture 09: Pronunciation models, decision trees, context-dependent models.
- Lecture 10: Search algorithms, transducer optimizations, Viterbi decoder.
- Lecture 11: Dynamic recognition transducer construction.
- Lecture 12: N-best algorithms, lattice generation, rescoring.
- Lecture 13: Adaptation.
- Guest Lecture 1: Brian Roark: Advanced topics in language modeling: MaxEnt, features and marginal distribution constraints.
- Guest Lecture 2: Shankar Kumar: Statistical Models for Machine Translation.
- Guest Lecture 3: Diamantino Caseiro: Progress in Exponential Language Models.
- Guest Lecture 4: Andrew Senior: Large Vocabulary Continuous Speech Recognition with Long Short-Term Memory Recurrent Networks.

Reading and Software Material

There is no single textbook covering the material presented in this course. The following are some recommended books or papers. An extensive list of recommended papers for further reading is provided in the lecture slides.

Books

- Daniel Jurafsky and James H. Martin. *Speech and Language Processing, 2nd Edition*. Pearson Prentice Hall, 2008.
- Frederick Jelinek. *Statistical Methods for Speech Recognition*. MIT Press, Cambridge, MA, 1998.
- Lawrence Rabiner and Biing-Hwang Juang. *Fundamentals of Speech Recognition*. Prentice Hall, 1993.

- B. H. Juang and L. R. Rabiner. *Automatic Speech Recognition - A Brief History of the Technology*. Elsevier Encyclopedia of Language and Linguistics, Second Edition, 2005.
- Mehryar Mohri. Statistical Natural Language Processing. In M. Lothaire, editor, Applied Combinatorics on Words. Cambridge University Press, 2005.
- Mehryar Mohri. Weighted automata algorithms. In Manfred Droste, Werner Kuich, and Heiko Vogler, editors, Handbook of Weighted Automata. Monographs in Theoretical Computer Science, pages 213-254. Springer, 2009.
- Mehryar Mohri, Fernando C. N. Pereira, and Michael Riley. Speech recognition with weighted finite-state transducers. In Larry Rabiner and Fred Juang, editors, Handbook on Speech Processing and Speech Communication, Part E: Speech recognition. Springer-Verlag, Heidelberg, Germany, 2008.
- Lawrence Rabiner. A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proceedings of the IEEE, Vol. 77, No. 2, pp. 257-286, 1989.

- OpenFst Library (Finite-State Transducer Library).
- OpenGrm NGram Library (Language Modeling Library).
- Kaldi (Speech Recognition Library).
- OpenGrm Thrax (Rewrite Grammar Library).

Locations and Times

Room 201 Warren Weaver Hall, 251 Mercer Street.

Wednesdays 7:10PM - 9:00PM.

Instructor office hours: Fridays 12:30PM - 1:30PM, room 328 WWH.

Prerequisites

Familiarity with the basics of linear algebra, probability, and analysis of algorithms. No specific knowledge of signal processing or other engineering material is assumed. An interest or background in machine learning is helpful.

A working familiarity with Linux or other shell-based development environments will be necessary to complete the homework assignments. The assignments and the project will involve working with multiple software libraries with varying degrees of user-friendliness. Students will be expected to work independently through software installation issues and similar challenges in the course of completing the assignments.

Coursework

There will be 3-4 assignments and a final project.

The standard high level of integrity is expected from all students, as with all CS courses.

Homework assignments

Previous years