Speech, Audio, Image and Signal Processing

Image Deblurring

Recent SIGGRAPH Paper on Removing Camera Shake (with Rob Fergus, Bill Freeman and Barun Singh from MIT and Aaron Hertzmann in Toronto.)

Nonlinear Blind Sensor Fusion and Identification

When several uncharacterized sensors measure the same unknown signal or image, we would like to simultaneously combine the measurements into an estimate of the true source (fusion) and learn the properties of the individual sensors (identification). This paper presents a model in which sensors perform (time-invariant) linear filtering followed by pointwise nonlinear squashing with additive noise, and shows how, given several such noisy nonlinear observations, it is possible to recover the true signal and also estimate the sensor parameters. The setup assumes that both the linear filtering and the nonlinear squashing are spatially (temporally) invariant, but makes no prior assumptions (such as smoothness, sparsity or heavy-tailed marginals) about the signal being recovered. It is therefore appropriate for a variety of source distributions, such as astronomical images, speech signals and hyperspectral satellite data, which may violate one or more standard prior assumptions. The estimation algorithm minimizes the sum of squared errors between the predicted sensor outputs and the sensor readings actually observed, using an efficient procedure isomorphic to the backpropagation algorithm. The setup can be thought of as learning the weights and the unknown common input for several one-layer neural networks given their outputs.
See the paper I wrote about this, which appeared at the IEEE Information Fusion 2005 conference.
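
As a rough illustration of the sensor model and objective described above, here is a minimal sketch in plain Python. It is not the code from the paper: the tanh squashing function, the per-sensor gain parameter and all function names are assumptions made for this example, and the additive noise term is left out. It only shows the forward model and the squared-error objective; the minimization itself is not implemented.

```python
import math

def sensor_output(signal, taps, gain=1.0):
    """Causal FIR filtering followed by a pointwise tanh squashing.

    tanh and `gain` are illustrative assumptions, not taken from the paper.
    """
    out = []
    for t in range(len(signal)):
        # Linear stage: sum of taps[k] * signal[t - k] over the valid lags.
        acc = sum(taps[k] * signal[t - k] for k in range(len(taps)) if t - k >= 0)
        # Nonlinear stage: pointwise squashing of the filtered value.
        out.append(math.tanh(gain * acc))
    return out

def squared_error(signal, sensors, readings):
    """Sum of squared errors between predicted and observed sensor outputs."""
    total = 0.0
    for (taps, gain), observed in zip(sensors, readings):
        predicted = sensor_output(signal, taps, gain)
        total += sum((p - o) ** 2 for p, o in zip(predicted, observed))
    return total

# Two hypothetical sensors observing the same (made-up) source signal.
source = [0.0, 0.5, -0.3, 0.8, 0.1]
sensors = [([1.0, 0.5], 1.0), ([0.7, -0.2, 0.1], 2.0)]
readings = [sensor_output(source, taps, gain) for taps, gain in sensors]
```

Fusion and identification then amount to searching over both the candidate signal and the sensor parameters for the values that make this error small; since each sensor is a one-layer network, the gradients with respect to the taps and the shared input can be obtained by backpropagation.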

Time-Domain Segmental Speech Processing

We present a purely time-domain approach to speech processing which identifies waveform samples at the boundaries between glottal pulse periods (in voiced speech) or at the boundaries of unvoiced segments. An efficient algorithm for inferring these boundaries and estimating the average spectra of voiced and unvoiced regions is derived from a simple probabilistic generative model. Competitive results are presented on pitch tracking, voiced/unvoiced detection and timescale modification; all these tasks and several others can be performed using the single segmentation provided by inference in the model.
Kannan Achan, Brendan Frey, Aaron Hertzmann and I wrote a paper, A Segment-Based Probabilistic Generative Model of Speech, describing this work, which appeared at ICASSP 2005.

Recovering the Speech Wave from a Spectrogram

Often you have a spectrogram of speech (perhaps obtained from modifying the spectrogram of an original waveform) and you want to recover a speech wave consistent with it. Using a generative model of time-domain speech signals and their spectrograms, we can efficiently find the maximum a posteriori speech signal given the spectrogram. In contrast to techniques that alternate between estimating the phase and a spectrally-consistent signal, our technique directly infers the speech signal, thus jointly optimizing the phase and a spectrally-consistent signal.
Kannan Achan, Brendan Frey and I wrote a paper, Probabilistic Inference of Speech Signals from Phaseless Spectrogram, describing this work, which appeared at NIPS 2003.

Auditory Scene Analysis, Separation and Denoising

Source separation, or computational auditory scene analysis, attempts to extract individual acoustic objects from input which contains a mixture of sounds from different sources, altered by the acoustic environment. Unmixing algorithms such as ICA and its extensions recover sources by reweighting multiple observation sequences, and thus cannot operate when only a single observation signal is available. Another technique, which I call refiltering, recovers sources by a nonstationary reweighting ("masking") of frequency sub-bands from a single recording. This technique has been used in the CASA literature, but normally the masking signals are computed using built-in knowledge or pre-designed signal-to-noise estimators. I think there is a lot of potential in applying statistical algorithms to learn this masking function.
My paper, Factorial Models and Refiltering for Speech Separation and Denoising, which appeared in Eurospeech 03, presents a simple factorial model, MAXVQ, and applies it to monaural separation and denoising.
In earlier work, I presented results of a similar factorial HMM system which learns on recordings of single speakers and can then separate mixtures using only one observation signal by computing the masking during inference and then refiltering.
A paper on this work, One Microphone Source Separation, appeared in NIPS13.
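
To make the refiltering idea concrete, here is a minimal sketch of time-varying masking of frequency bins, written in plain Python. It is only an illustration under simplifying assumptions: it uses non-overlapping rectangular frames and a naive DFT (a practical system would use overlapping, windowed frames and an FFT), the function names are made up for this example, and the masks are supplied by hand rather than learned or inferred as in the papers above.

```python
import cmath
import math

def dft(frame):
    """Naive discrete Fourier transform of one frame."""
    n = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spectrum):
    """Naive inverse DFT; keeps only the real part of the result."""
    n = len(spectrum)
    return [(sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n)
                 for k in range(n)) / n).real
            for t in range(n)]

def refilter(samples, masks, frame_len):
    """Reweight each frequency bin of each frame by its own mask value.

    masks[i][k] is the weight for bin k of frame i. For a real-valued output
    the mask should be symmetric: masks[i][k] == masks[i][frame_len - k].
    Assumes len(samples) is a multiple of frame_len.
    """
    out = []
    for i, start in enumerate(range(0, len(samples), frame_len)):
        spectrum = dft(samples[start:start + frame_len])
        weighted = [m * c for m, c in zip(masks[i], spectrum)]
        out.extend(idft(weighted))
    return out

# A mixture of two tones; the mask keeps only the lower one (bins 1 and 7).
frame_len = 8
low = [math.cos(2 * math.pi * t / frame_len) for t in range(16)]
high = [math.cos(2 * math.pi * 3 * t / frame_len) for t in range(16)]
mixture = [a + b for a, b in zip(low, high)]
keep_low = [[0, 1, 0, 0, 0, 0, 0, 1]] * 2
recovered = refilter(mixture, keep_low, frame_len)
```

Here the mask is constant across frames because the toy sources are stationary; for speech the mask would change from frame to frame, which is where the learning problem lies.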

Articulatory speech processing

In the case of human speech perception and production, an important class of generative models with hidden states are called articulatory models and relate the movements of a talker's mouth to the sequence of sounds produced. Linguistic theories and substantial psychophysical evidence argue strongly that articulatory model inversion plays an important role in speech perception and recognition in the brain. Unfortunately, despite potential engineering advantages and evidence for being part of the human strategy, such inversion of speech production models is absent in almost all artificial speech processing systems.
My Eurospeech'97 paper, Towards articulatory speech recognition, described my early attempts to apply this general idea to real speech data.
My thesis work took these ideas further and involved a series of experiments which investigated articulatory speech processing using real speech production data from a database containing simultaneous audio and mouth movement recordings. I found that it is possible to learn simple low-dimensional models which accurately capture the structure observed in such real production data. These models can be used to learn a forward synthesis system which generates sounds from articulatory movements. I also developed an inversion algorithm which estimates movements from an acoustic signal. I demonstrated the use of articulatory movements, both true and recovered, in a simple speech recognition task, showing the possibility of doing true articulatory speech recognition in artificial systems.
Here are the slides from my thesis talk. You can also view the slides from my thesis defense talk online.
A paper, Constrained Hidden Markov Models, which briefly discusses the inversion algorithm and its applications, appeared in NIPS'99.

Time scale modification of speech

Simply resampling a speech waveform to speed it up or slow it down gives unnatural-sounding results (like playing a record at the wrong speed). Instead, the important features in the speech signal need to be preserved and played back at a higher or lower rate. This is typically done in the spectral domain using spectrogram resampling techniques.
John Hopfield proposed the simple idea of using upward-going zero crossings of the original waveform as feature points. These are extremely easy to detect, and allow all processing to be done entirely in the time domain and entirely causally. For voiced sections of speech they approximately pick out pitch periods while for unvoiced sections they pick out small sections of the noise bursts.
Here are some (probably familiar) example wav files. [original] [2x faster] [2x slower]
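
The zero-crossing feature points described above are simple enough to sketch in a few lines of plain Python. The detection function follows the description directly. The two rescaling functions are only my illustrative assumption about one naive way to use the resulting segments (repeat each segment to slow down, keep every other segment to speed up) and are not necessarily how the example files were produced. All function names are made up for this sketch.

```python
def upward_zero_crossings(samples):
    """Indices i where the waveform goes from negative to non-negative."""
    return [i for i in range(1, len(samples))
            if samples[i - 1] < 0.0 <= samples[i]]

def split_segments(samples, marks):
    """Cut the waveform into consecutive segments at the marked indices."""
    bounds = [0] + [m for m in marks if 0 < m < len(samples)] + [len(samples)]
    return [list(samples[a:b]) for a, b in zip(bounds, bounds[1:])]

def slow_down_2x(samples):
    """Naive assumption: play every segment twice."""
    segments = split_segments(samples, upward_zero_crossings(samples))
    return [x for seg in segments for x in seg * 2]

def speed_up_2x(samples):
    """Naive assumption: keep only every other segment."""
    segments = split_segments(samples, upward_zero_crossings(samples))
    return [x for seg in segments[::2] for x in seg]
```

Because each crossing is detected from the current and previous sample only, the marking step needs no look-ahead, which is what makes causal processing possible.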

Hidden speaking mode

At one of the Johns Hopkins summer research workshops on speech recognition (WS96), I was a member of Mari Ostendorf's team working on a project to model pronunciation variations in conversational speech using a hidden "speaking mode".
This project is described in an ICSLP96 paper,
Modeling systematic variations in pronunciation via a language-dependent hidden speaking mode.

Simple psychophysics

I have played with a variety of simple psychophysical experiments using speech data. They have all involved only informal testing on myself and a few other listeners, but here they are:
- Short time window reversal and phase manipulation.
  Take a speech signal and divide it into non-overlapping windows. Reverse each window and concatenate. For window-length == 1 sample there is no change. For window-length == entire utterance things sound reversed and unintelligible. However, up to about window-length == 30ms the speech remains understandable. Instead of reversing the window in time, you can perform other manipulations which preserve its spectral magnitude but change its spectral phase to random phase, minimum phase, constant phase, etc.
- Changing speakers every phoneme.
  Take a multispeaker database in which many speakers are saying the same utterance. Using either hand markings or the results of a forced-alignment procedure, mark the phoneme/phone boundaries for each speaker. Reconstruct the utterance by taking the correct phonemes in the correct order but from a different randomly chosen speaker for each phoneme. (Sanjoy Mahajan originally suggested this experiment in 1995.)
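
The short time window reversal from the first experiment above can be sketched in plain Python as follows. The function names and the 16 kHz sample rate in the helper are assumptions for illustration only; loading and saving audio is left out.

```python
def reverse_windows(samples, window_len):
    """Reverse each non-overlapping window of the signal and concatenate."""
    if window_len < 1:
        raise ValueError("window_len must be at least 1 sample")
    out = []
    for start in range(0, len(samples), window_len):
        # The last window may be shorter than window_len; reverse it as is.
        out.extend(reversed(samples[start:start + window_len]))
    return out

def ms_to_samples(ms, sample_rate=16000):
    """Convert a window length in milliseconds to a whole number of samples."""
    return int(round(ms * sample_rate / 1000))
```

A window of one sample leaves the signal unchanged and a window covering the whole utterance reverses it completely, matching the two extremes described above.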

Sam Roweis, Vision, Learning and Graphics Group, NYU, www.cs.nyu.edu/~roweis