Speech, Audio, Image and Signal Processing
- Recent SIGGRAPH Paper on
Removing Camera Shake
(with Rob Fergus, Bill Freeman and Barun Singh from MIT and Aaron
Hertzmann in Toronto).
Nonlinear Blind Sensor Fusion and Identification
When several uncharacterized sensors measure the same unknown signal
or image, we would like to simultaneously combine the measurements
into an estimate of the true source (fusion) and learn the properties of
the individual sensors (identification). This paper presents a model
in which sensors perform (time-invariant) linear filtering followed by
pointwise nonlinear squashing with additive noise and shows how, given
several such noisy nonlinear observations, it is possible to recover
the true signal and also estimate the sensor parameters. The setup
assumes that both the linear filtering and the nonlinear squashing are
spatially (temporally) invariant, but does not make any prior
assumptions (such as smoothness, sparsity or heavy-tailed marginals)
about the signal being recovered and thus is appropriate for a variety
of source distributions, such as astronomical images, speech signals
and hyperspectral satellite data which may violate one or more
standard prior assumptions. An estimation algorithm
minimizes the sum of squared errors between the predicted sensor
outputs and the sensor readings actually observed, using an efficient
procedure isomorphic to the backpropagation algorithm. The setup can
be thought of as learning the weights and unknown common input for
several one-layer neural networks given their outputs.
- See the paper I wrote
about this, which appeared at IEEE Information Fusion 2005.
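To make the setup concrete, here is a minimal numpy sketch of the model and its fitting on a synthetic problem. The sizes, tanh squashing and plain gradient descent are my illustrative choices, not details taken from the paper; the point is only that the hand-derived gradients are a one-layer backpropagation pass.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy instance: K sensors each apply an unknown causal FIR filter to one
# common signal, then a pointwise tanh squashing, plus a little noise.
T, K, L = 200, 3, 5                       # signal length, sensors, filter taps
s_true = rng.standard_normal(T)
h_true = rng.standard_normal((K, L))

def conv_mat(s, L):
    """T x L matrix S with (S @ h)[t] = sum_l s[t-l] h[l] (causal filtering)."""
    S = np.zeros((len(s), L))
    for l in range(L):
        S[l:, l] = s[:len(s) - l]
    return S

def forward(s, h):
    return np.tanh(conv_mat(s, h.shape[1]) @ h.T).T   # (K, T) sensor outputs

y = forward(s_true, h_true) + 0.01 * rng.standard_normal((K, T))

# Jointly estimate signal and filters by gradient descent on the summed
# squared error between predicted and observed sensor outputs.
s = 0.1 * rng.standard_normal(T)
h = 0.1 * rng.standard_normal((K, L))
lr = 0.005
for _ in range(5000):
    S = conv_mat(s, L)
    pred = np.tanh(S @ h.T).T             # (K, T)
    delta = (pred - y) * (1.0 - pred**2)  # backprop through the tanh
    gh = delta @ S                        # (K, L) filter gradients
    gs = np.zeros(T)                      # gradient wrt the common input
    for k in range(K):
        for l in range(L):
            gs[:T - l] += h[k, l] * delta[k, l:]
    h -= lr * gh
    s -= lr * gs

final_err = np.mean((forward(s, h) - y) ** 2)
```

As with any blind problem, the recovered signal and filters are only identifiable up to scale and shift ambiguities; the sketch just checks that the squared prediction error falls.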
Time-Domain Segmental Speech Processing
We present a purely time-domain approach to speech processing which
identifies waveform samples at the boundaries between glottal pulse
periods (in voiced speech) or at the boundaries of unvoiced
segments. An efficient algorithm for inferring these boundaries and
estimating the average spectra of voiced and unvoiced regions is
derived from a simple probabilistic generative model. Competitive
results are presented on pitch tracking, voiced/unvoiced detection and
timescale modification; all these tasks and several others can be
performed using the single segmentation provided by inference in the model.
- Kannan Achan, Brendan Frey, Aaron Hertzmann and I wrote a paper
A Segment-Based Probabilistic Generative Model of Speech
describing this work which appeared at ICASSP 2005.
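For intuition about what the boundary markers look like, here is a toy sketch that marks approximate pitch-period boundaries of a synthetic voiced signal using a simple autocorrelation peak. This heuristic is my illustration only; the paper infers the boundaries with a probabilistic generative model, not autocorrelation.

```python
import numpy as np

fs = 8000
f0 = 125.0                                  # true pitch; fs/f0 = 64 samples
t = np.arange(int(0.1 * fs)) / fs
# crude stand-in for a voiced waveform: fundamental plus one harmonic
x = np.sin(2 * np.pi * f0 * t) + 0.5 * np.sin(4 * np.pi * f0 * t)

ac = np.correlate(x, x, mode="full")[len(x) - 1:]   # autocorrelation
lo, hi = fs // 400, fs // 60                # search the 60-400 Hz pitch range
period = lo + int(np.argmax(ac[lo:hi]))     # lag of the strongest repeat
boundaries = np.arange(0, len(x), period)   # one marker per glottal period
```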
Recovering the Speech Wave from a Spectrogram
- Often you have a spectrogram of speech (perhaps obtained from
modifying the spectrogram of an original waveform) and you want to
recover a speech wave consistent with it. Using a generative model of
time-domain speech signals and their spectrograms, we can efficiently
find the maximum a posteriori speech signal given the spectrogram. In
contrast to techniques that alternate between estimating
the phase and a spectrally-consistent signal, our technique directly
infers the speech signal, thus jointly optimizing the phase and the
signal estimate.
- Kannan Achan, Brendan Frey and I wrote a paper
Probabilistic Inference of
Speech Signals from Phaseless Spectrograms, describing
this work, which appeared at NIPS 2003.
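For contrast, here is a sketch of the alternating baseline the summary above refers to: repeatedly impose the target magnitude spectrogram and re-estimate phase by analysis and resynthesis (a Griffin-Lim-style loop). The hand-rolled STFT sizes are illustrative choices; this shows the technique we compare against, not our direct MAP inference.

```python
import numpy as np

def stft(x, n=256, hop=64):
    w = np.hanning(n)
    return np.array([np.fft.rfft(w * x[i:i + n])
                     for i in range(0, len(x) - n + 1, hop)])

def istft(X, n=256, hop=64):
    # windowed overlap-add, normalized by the accumulated squared window
    w = np.hanning(n)
    out = np.zeros(hop * (len(X) - 1) + n)
    norm = np.zeros_like(out)
    for i, F in enumerate(X):
        out[i * hop:i * hop + n] += w * np.fft.irfft(F, n)
        norm[i * hop:i * hop + n] += w ** 2
    return out / np.maximum(norm, 1e-8)

rng = np.random.default_rng(1)
target = np.sin(2 * np.pi * 440 / 8000 * np.arange(4096))
mag = np.abs(stft(target))                 # the phaseless spectrogram

x = rng.standard_normal(4096)              # start from a random signal
err0 = np.mean((np.abs(stft(x)) - mag) ** 2)
for _ in range(50):
    X = stft(x)
    X = mag * np.exp(1j * np.angle(X))     # impose magnitude, keep phase
    x = istft(X)[:4096]                    # project back to a real signal
err = np.mean((np.abs(stft(x)) - mag) ** 2)
```

Each pass through the loop alternates between the phase estimate and a spectrally-consistent signal, which is exactly the two-step structure our method avoids.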
Auditory Scene Analysis, Separation and Denoising
Source separation, or computational auditory scene analysis, attempts
to extract individual acoustic objects from input
which contains a mixture of sounds from
different sources, altered by the acoustic environment.
Unmixing algorithms such as ICA and its extensions recover sources
by reweighting multiple observation sequences, and thus cannot
operate when only a single observation signal is available. Another
technique, which I call refiltering, recovers sources by a
nonstationary reweighting (``masking'') of frequency sub-bands from a
single recording. This technique has been used in the CASA literature,
but normally the masking signals are computed using built-in knowledge
or pre-designed signal-to-noise estimators. I think there is a lot of
potential in applying statistical learning
algorithms to learn this masking function.
- My paper,
Factorial Models and Refiltering for
Speech Separation and Denoising,
which appeared at Eurospeech 2003, presents a simple factorial model,
MAXVQ, and applies it to monaural separation and denoising.
- In earlier work, I presented results of a similar
factorial HMM system which learns on recordings of single
speakers and can then separate mixtures using only one observation
signal by computing the masking during inference and then refiltering.
- A paper on this work,
One Microphone Source Separation appeared in NIPS13.
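The refiltering idea itself is easy to demonstrate with an oracle mask on synthetic sources; the whole research problem, of course, is computing the mask without access to the clean sources, which is what the models above learn to do. A minimal sketch with two tones standing in for speech and interference:

```python
import numpy as np

fs = 8000
t = np.arange(2 * fs) / fs
s1 = np.sin(2 * np.pi * 300 * t)           # stand-in for "speech"
s2 = np.sin(2 * np.pi * 1700 * t)          # stand-in for "noise"
mix = s1 + s2                              # the single observation signal

n, hop = 256, 128
w = np.hanning(n)
def spec(x):
    return np.array([np.fft.rfft(w * x[i:i + n])
                     for i in range(0, len(x) - n + 1, hop)])

M = spec(mix)
mask = np.abs(spec(s1)) > np.abs(spec(s2)) # oracle binary mask per sub-band
est = mask * M                             # refilter: reweight the sub-bands

# overlap-add resynthesis of the masked spectrogram
out = np.zeros(hop * (len(est) - 1) + n)
norm = np.zeros_like(out)
for i, F in enumerate(est):
    out[i * hop:i * hop + n] += w * np.fft.irfft(F, n)
    norm[i * hop:i * hop + n] += w ** 2
s1_hat = out / np.maximum(norm, 1e-8)
```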
Articulatory speech processing
- In the case of human speech perception and production, an
important class of generative models with hidden states
are called articulatory models and relate the movements of
a talker's mouth to the sequence of sounds produced. Linguistic
theories and substantial psychophysical evidence argue strongly that
articulatory model inversion plays an important role in speech
perception and recognition in the brain. Unfortunately, despite its
potential engineering advantages and the evidence that it is part of the
human strategy, such inversion of speech production models is absent
from almost all artificial speech processing systems.
- My Eurospeech'97 paper,
Towards articulatory speech recognition, described my early
attempts to apply this general idea to real speech data.
- My thesis work took these ideas further and
involved a series of experiments which
investigated articulatory speech processing using real speech
production data from a database containing simultaneous audio and
mouth movement recordings. I found that it is
possible to learn simple low-dimensional models which accurately
capture the structure observed in such real production data.
These models can be used to learn a forward synthesis
system which generates sounds from articulatory movements. I also developed
an inversion algorithm which estimates movements from an
acoustic signal alone. I demonstrated the use of articulatory
movements, both true and recovered, in a simple speech recognition
task, showing the possibility of doing true articulatory speech
recognition in artificial systems.
- Here are the slides from my
thesis defense talk, which you can also view online.
- A paper,
Constrained Hidden Markov Models, which briefly discusses the
inversion algorithm and its applications, appeared in NIPS'99.
Time scale modification of speech
- Simply resampling a speech waveform to speed it up or slow it down
gives unnatural sounding results (like playing a record at the wrong
speed). What needs to be done is to preserve important features in the
speech signal and play those features back at a higher or lower rate.
This is typically done in the spectral domain using spectrogram
manipulation techniques.
- John Hopfield proposed the simple idea of using upward-going zero
crossings of the original waveform as feature points. These are
extremely easy to detect, and allow all processing to be done entirely in
the time domain and entirely causally. For voiced sections of speech
they approximately pick out pitch periods while for unvoiced sections
they pick out small sections of the noise bursts.
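A sketch of the idea (my rendering, not Hopfield's): find the upward-going zero crossings, then repeat or drop whole inter-crossing segments to slow down or speed up playback without changing pitch. The toy signal below stands in for a voiced waveform.

```python
import numpy as np

fs = 8000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 120 * t) + 0.3 * np.sin(2 * np.pi * 240 * t)

# upward-going zero crossings: the feature points
up = np.flatnonzero((x[:-1] < 0) & (x[1:] >= 0)) + 1
segments = [x[a:b] for a, b in zip(up[:-1], up[1:])]

# play each segment twice for 2x slower; keep every other one for 2x faster
slow = np.concatenate([s for seg in segments for s in (seg, seg)])
fast = np.concatenate(segments[::2])
```

On this signal each segment is one pitch period (about fs/120 samples), so both outputs preserve the local waveform shape, and everything is done causally in the time domain.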
- Here are some (probably familiar) example wav files.
Hidden speaking mode
- I have played with a variety of simple psychophysical experiments
using speech data. They have all involved only informal testing on
myself and a few other listeners, but here they are:
- Short time window reversal and phase manipulation.
Take a speech signal and divide it into non-overlapping
windows. Reverse each window and concatenate. For window-length == 1
sample there is no change. For window-length == entire utterance
things sound reversed and unintelligible. However, for window lengths
up to about 30ms the speech remains understandable. Instead of reversing the
window in time you can perform other manipulations which preserve its
spectral magnitude but change its spectral phase to random phase,
minimum phase, constant phase, etc.
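The window-reversal manipulation is simple to try yourself; here is a sketch on a synthetic tone (substitute any speech waveform for `x`):

```python
import numpy as np

def reverse_windows(x, win):
    # chop into non-overlapping windows, reverse each in time, reconcatenate
    n = len(x) // win * win                 # drop the ragged tail
    y = x[:n].reshape(-1, win)[:, ::-1].ravel()
    return np.concatenate([y, x[n:]])

fs = 16000
x = np.sin(2 * np.pi * 220 * np.arange(fs) / fs)
y = reverse_windows(x, int(0.030 * fs))     # 30 ms windows
```

Note the manipulation is an involution: applying it twice with the same window length recovers the original signal, and a one-sample window leaves it unchanged.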
- Changing speakers every phoneme.
Take a multispeaker database in which many speakers are saying the
same utterance. Using either hand markings or the results of a
forced-alignment procedure, mark the phoneme/phone boundaries for each
speaker. Reconstruct the utterance by taking the correct phonemes in
the correct order but from a different randomly chosen speaker for
each phoneme. (Sanjoy Mahajan originally suggested this experiment in 1995.)
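Given aligned phone boundaries, the reconstruction step is just concatenation. Below is a schematic sketch in which the recordings, boundaries and phone count are fake placeholders standing in for a real aligned multispeaker database:

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholders: recordings[k] is speaker k's utterance and bounds[k] its
# phone boundaries, all aligned to the same 10-phone sequence.
recordings = [np.sin(2 * np.pi * (100 + 40 * k) * np.arange(8000) / 8000)
              for k in range(3)]
bounds = [np.linspace(0, 8000, 11, dtype=int) for _ in range(3)]

pieces = []
for p in range(10):
    k = rng.integers(3)                     # pick a random speaker per phone
    a, b = bounds[k][p], bounds[k][p + 1]
    pieces.append(recordings[k][a:b])
patchwork = np.concatenate(pieces)          # correct phones, shuffled voices
```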
Sam Roweis, Vision, Learning and Graphics Group,