
Machine Learning Algorithms
Learning Human Motion
 Graham Taylor has posted some
videos and some code
associated with our recent NIPS paper.
Learning When Being Missing is Informative
 Often elements are missing from datasets we collect,
but in a way that is far from random. Ben Marlin, Rich Zemel and I
explored the topic of unsupervised learning in the presence of such
nonignorable missing data when the missing data mechanism is itself
unknown. Our paper discusses several classes of missing data
mechanisms for
categorical data and develops learning and inference methods for two
specific models. We also present empirical results using synthetic data
which show that these algorithms can recover both the unknown
selection model parameters and the underlying data model parameters
to a high degree of accuracy. We then apply the algorithms to
real data from the domain of collaborative filtering, and report
initial results.
 Our paper about this work was at AISTATS 2005:
Unsupervised Learning with Nonignorable
Missing Data.
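
As a toy illustration of why such missingness matters (this example is mine, not from the paper), here is a small Python sketch in which high ratings are preferentially observed, so the observed mean is badly biased while an observed-at-random sample is not:

```python
# Toy illustration (not the paper's model): data where missingness
# depends on the value itself, so the mechanism is nonignorable.
import numpy as np

rng = np.random.default_rng(0)
ratings = rng.integers(1, 6, size=10000)          # true ratings 1..5

# MCAR: every entry observed with the same probability
mcar_mask = rng.random(10000) < 0.5

# Nonignorable (MNAR): high ratings are far more likely to be observed,
# as when users only bother rating items they like
p_observe = {1: 0.1, 2: 0.2, 3: 0.4, 4: 0.7, 5: 0.9}
mnar_mask = rng.random(10000) < np.vectorize(p_observe.get)(ratings)

print("true mean:         ", ratings.mean())
print("MCAR observed mean:", ratings[mcar_mask].mean())  # close to true mean
print("MNAR observed mean:", ratings[mnar_mask].mean())  # biased upward
```

Ignoring the selection mechanism here would badly overestimate how much people like things, which is why the selection model has to be learned jointly with the data model.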
Inference in Nonlinear Systems with Embedded HMMs
 We describe a Markov chain method for sampling from the distribution
of the hidden state sequence in a nonlinear dynamical system, given a
sequence of observations. This method updates all states in the
sequence simultaneously using an embedded Hidden Markov Model (HMM).
An update begins with the creation of "pools" of candidate states at each
time. We then define an embedded HMM, whose states are indexes within
these pools. Using a forward-backward dynamic programming algorithm,
we can efficiently choose a state sequence with the appropriate
probabilities from the exponentially large number of state sequences
that pass through states in these pools.
 Matt Beal, Radford Neal and I presented this work at
NIPS-2003 in the paper
Inferring State Sequences for
Nonlinear Systems with Embedded Hidden Markov Models.
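
A minimal sketch of one such update is below; the one-dimensional model, the pool-building scheme, and the pool size are all illustrative stand-ins for the choices made in the paper:

```python
# Sketch of one embedded-HMM update for a toy 1-D nonlinear model.
import numpy as np

rng = np.random.default_rng(1)

def logsumexp(a, axis):
    m = a.max(axis=axis, keepdims=True)
    return (m + np.log(np.exp(a - m).sum(axis=axis, keepdims=True))).squeeze(axis)

def sample_categorical(logp):
    p = np.exp(logp - logp.max())
    return rng.choice(len(p), p=p / p.sum())

def log_trans(x_prev, x_curr):
    # illustrative nonlinear dynamics: x_t ~ N(sin(x_{t-1}), 0.5^2)
    return -0.5 * ((x_curr - np.sin(x_prev)) / 0.5) ** 2

def log_obs(x, y):
    # illustrative observation model: y_t ~ N(x_t, 1)
    return -0.5 * (x - y) ** 2

def embedded_hmm_update(y, current, pool_size=10):
    T = len(y)
    # 1. create a pool of candidate states at each time; the current
    #    state is always included, so staying put is always possible
    pools = [np.append(rng.normal(current[t], 1.0, pool_size - 1), current[t])
             for t in range(T)]
    # 2. forward pass of the embedded HMM whose states are pool indexes
    alphas = [log_obs(pools[0], y[0])]
    for t in range(1, T):
        trans = log_trans(pools[t - 1][:, None], pools[t][None, :])
        alphas.append(log_obs(pools[t], y[t]) +
                      logsumexp(alphas[-1][:, None] + trans, axis=0))
    # 3. backward sampling of one index per time, so the whole state
    #    sequence is redrawn in a single update
    idx = sample_categorical(alphas[-1])
    new = [pools[-1][idx]]
    for t in range(T - 2, -1, -1):
        idx = sample_categorical(alphas[t] + log_trans(pools[t], pools[t + 1][idx]))
        new.append(pools[t][idx])
    return np.array(new[::-1])
```

The point of the construction is that the dynamic program considers all pool_size^T sequences through the pools at a cost that is only quadratic in the pool size per time step.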
Efficient Learning with Hidden Variables
 There is a close relationship between bound optimization
algorithms such as Expectation-Maximization (EM)
and direct optimization algorithms such as
gradient-based methods for parameter learning.
Under certain conditions bound optimizers
exhibit quasi-Newton convergence behavior, while under others
they exhibit poor, first-order convergence.
In particular, for the EM algorithm
we show that if a certain measure of the proportion of missing
information is small, then EM exhibits quasi-Newton behavior;
when it is large, EM converges slowly.
Based on this analysis, we present a new
Expectation-Conjugate-Gradient (ECG) algorithm for maximum
likelihood estimation in latent variable models,
and report empirical
results, showing that, as predicted by the theory, ECG outperforms EM
in certain cases.
 Ruslan Salakhutdinov and I wrote a paper for ICML-2003 on a
very simple but very effective way of speeding up EM using
"overstepping", "momentum" or "overrelaxation" with an adaptive step size:
Adaptive Overrelaxed Bound Optimization Methods.
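
The overstepping idea itself fits in a few lines. Here is a toy Python sketch on a two-component Gaussian mixture with unit variances and equal weights; the particular step-size schedule (grow on success, reset on failure) is an illustrative choice, not the exact adaptive scheme from the paper:

```python
# Overrelaxed EM sketch: take the plain EM step M(theta), then move
# theta + eta * (M(theta) - theta) with an adaptive step size eta.
import numpy as np

rng = np.random.default_rng(2)
x = np.concatenate([rng.normal(-2, 1, 500), rng.normal(3, 1, 500)])

def loglik(mu):
    return np.sum(np.log(0.5 * np.exp(-0.5 * (x - mu[0]) ** 2) +
                         0.5 * np.exp(-0.5 * (x - mu[1]) ** 2)))

def em_map(mu):
    # E-step: responsibilities; M-step: responsibility-weighted means
    r0 = np.exp(-0.5 * (x - mu[0]) ** 2)
    r1 = np.exp(-0.5 * (x - mu[1]) ** 2)
    g = r0 / (r0 + r1)
    return np.array([np.sum(g * x) / np.sum(g),
                     np.sum((1 - g) * x) / np.sum(1 - g)])

mu, eta = np.array([-0.5, 0.5]), 1.0
for _ in range(100):
    proposal = mu + eta * (em_map(mu) - mu)      # overrelaxed step
    if loglik(proposal) > loglik(mu):
        mu, eta = proposal, eta * 1.1            # accept and grow the step
    else:
        mu, eta = em_map(mu), 1.0                # fall back to plain EM

print(mu)   # approximately the true component means
```

Because every accepted step is checked against the likelihood, the method can never do worse than plain EM at that iteration, which is what makes the overstepping safe.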
 Ruslan Salakhutdinov, Zoubin Ghahramani and I had a more
theoretical paper, also at ICML-2003, which
discusses the EM algorithm in particular,
analyzes its convergence compared to direct conjugate-gradient optimizations
and proposes a new algorithm for switching between the two strategies:
Optimization with EM and Expectation-Conjugate-Gradient.
 At UAI-2003, we had a theoretical paper covering a broader class
of algorithms which analyzed the general convergence behaviour of many
bound optimization algorithms including EM.
On the Convergence of Bound Optimization Algorithms.
(After following this link, there is a note which mentions
an important typo appearing in the official version of this paper.)
Alternate Objective Functions for Markovian Fields
 In labelling or prediction tasks, a trained model's test performance
is often based on the quality of its single-time marginal distributions over
labels rather than its joint distribution over label sequences.
We propose using a new cost function for discriminative
learning that more accurately reflects such test time conditions.
We present an efficient method to
compute the gradient of this cost for Maximum Entropy Markov Models,
Conditional Random Fields, and for an extension of these models
involving hidden states. Our experimental results show that the new
cost can give significant improvements and that it provides a novel and
effective way of dealing with the 'label-bias' problem.
 Our paper
An Alternate Objective Function for Markovian Fields
appeared in ICML-2002, joint work with Yee Whye Teh and Sham Kakade.
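
The single-time marginals at the heart of this cost come from a standard forward-backward pass over the label chain. A toy Python sketch (the scores are arbitrary illustrative numbers, and the gradient computation is omitted):

```python
# Compute single-time label marginals on a chain and score them under
# the alternate, marginal-based cost.
import numpy as np

def logsumexp(a, axis):
    m = a.max(axis=axis, keepdims=True)
    return (m + np.log(np.exp(a - m).sum(axis=axis, keepdims=True))).squeeze(axis)

def chain_log_marginals(node_lp, trans_lp):
    """Forward-backward on a label chain; returns log p(y_t = k | x)."""
    T, K = node_lp.shape
    alpha, beta = np.zeros((T, K)), np.zeros((T, K))
    alpha[0] = node_lp[0]
    for t in range(1, T):
        alpha[t] = node_lp[t] + logsumexp(alpha[t - 1][:, None] + trans_lp, axis=0)
    for t in range(T - 2, -1, -1):
        beta[t] = logsumexp(trans_lp + (node_lp[t + 1] + beta[t + 1])[None, :], axis=1)
    return alpha + beta - logsumexp(alpha[-1], axis=0)

# arbitrary illustrative scores for a 3-step chain over 2 labels
node_lp = np.log([[0.6, 0.4], [0.5, 0.5], [0.3, 0.7]])
trans_lp = np.log([[0.8, 0.2], [0.2, 0.8]])
labels = [0, 0, 1]

marg = chain_log_marginals(node_lp, trans_lp)
# the alternate cost penalizes each single-time marginal at the true
# label, rather than the log probability of the whole joint sequence
marginal_cost = -sum(marg[t, y] for t, y in enumerate(labels))
print(marginal_cost)
```

The sequence-level likelihood sums log scores along one path, whereas this cost sums the log of each time-step's marginal, which is exactly the quantity evaluated at test time in per-label tasks.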
Self-organized (constrained) hidden Markov models
 Hidden variables are an essential feature of many statistical models of
data. Adding topology to hidden variables can often lead to powerful
algorithms that learn highly structured representations. For my thesis
I developed a model closely related to hidden Markov models (HMMs)
which adds topology to the hidden states. This is similar to the
topology given to units in a Kohonen map.
 A paper,
Constrained Hidden Markov Models appeared in NIPS-12.
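
A minimal sketch of the kind of constraint involved: restricting an HMM's transition matrix so each state can only move to its neighbors on a ring gives the hidden states a 1-D topology, much as in a Kohonen map. The neighborhood width here is an illustrative choice:

```python
# Build a ring-constrained HMM transition matrix: each hidden state may
# only transition to states within `width` positions of itself.
import numpy as np

def ring_transition_matrix(n_states, width=1):
    A = np.zeros((n_states, n_states))
    for i in range(n_states):
        for d in range(-width, width + 1):
            A[i, (i + d) % n_states] = 1.0    # allow only nearby states
    return A / A.sum(axis=1, keepdims=True)   # normalize each row

A = ring_transition_matrix(8)
print(A[0])   # probability mass only on the ring neighbors of state 0
```

Zeroing the forbidden transitions before (or during) EM training forces the learned state sequence to move smoothly through the topology, which is what produces the structured representations.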
LinearGaussian models
 Many common statistical, engineering and machine learning models
that employ linear estimators with Gaussian noise can be thought of
in a common framework. I have written a review
article with Zoubin Ghahramani which covers many such models and their
relationships.

A unifying review of linear Gaussian models
appeared in Neural Computation 11(2).
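
The single generative family the review covers can be sketched directly (dimensions and parameter values below are illustrative): x_{t+1} = A x_t + w and y_t = C x_t + v with Gaussian noises w and v. Setting A = 0 yields static models like factor analysis and PCA, while keeping the dynamics yields Kalman-filter-style state-space models:

```python
# Sample from a generic linear-Gaussian state-space model.
import numpy as np

def sample_lgm(A, C, Q, R, T, rng):
    k, p = A.shape[0], C.shape[0]
    x, y = np.zeros((T, k)), np.zeros((T, p))
    x[0] = rng.multivariate_normal(np.zeros(k), Q)
    for t in range(T):
        if t > 0:
            x[t] = A @ x[t - 1] + rng.multivariate_normal(np.zeros(k), Q)
        y[t] = C @ x[t] + rng.multivariate_normal(np.zeros(p), R)
    return x, y

rng = np.random.default_rng(3)
A = 0.9 * np.eye(2)              # dynamics; A = 0 would give a static model
C = rng.normal(size=(4, 2))      # observation / factor-loading matrix
Q, R = np.eye(2), 0.1 * np.eye(4)
x, y = sample_lgm(A, C, Q, R, T=100, rng=rng)
```

Each special case in the review corresponds to a particular choice or limit of A, C, Q and R, which is what makes the unified treatment possible.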
Principal component analysis
 The eigenvectors and eigenvalues found by conventional
principal component analysis (PCA) are also the maximum likelihood
solutions to a simple probabilistic generative model closely related
to factor analysis. This model can be learned with the EM algorithm
which in some cases can be more efficient than other traditional
ways of computing the principal components. It also gives a proper
probabilistic interpretation of PCA.
 My paper
EM algorithms for PCA and SPCA in NIPS-10 introduces this
model and the associated algorithms.
Essentially the same model
has been developed independently by Michael Tipping and Chris Bishop
both at Microsoft Research, Cambridge.
My earlier
technical report version
contains some figures of an experiment using a very large natural
image dataset which are not in the NIPS paper.
 The MATLAB code for implementing this
algorithm is also available from my code page.
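
The iterations themselves are short; here is a numpy sketch of the zero-noise EM updates on synthetic, illustrative data:

```python
# EM for PCA (zero-noise limit): alternate a linear E-step (project the
# data onto the current subspace) and M-step (re-fit the mapping).
import numpy as np

rng = np.random.default_rng(4)
C_true = rng.normal(size=(3, 2))
Y = C_true @ rng.normal(size=(2, 500)) + 0.1 * rng.normal(size=(3, 500))
Y -= Y.mean(axis=1, keepdims=True)          # work with zero-mean data

C = rng.normal(size=(3, 2))                 # random initial subspace
for _ in range(200):
    X = np.linalg.solve(C.T @ C, C.T @ Y)   # E-step: least-squares projection
    C = Y @ X.T @ np.linalg.inv(X @ X.T)    # M-step: re-fit the mapping

# the columns of C now span the principal subspace of Y (equal to the
# span of the leading eigenvectors, up to rotation within the subspace)
```

Each iteration costs only O(dnk) for d-dimensional data, n cases and k components, which is the source of the efficiency advantage when k is much smaller than d.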
Linear Heteroencoders
 It is well known that if you train a linear network with a
bottleneck as an autoencoder the hidden units extract the principal
subspace of the data. (This is true even if the output units are
nonlinear.) In other words, linear autoencoders with bottlenecks do
PCA. But what if you train a linear heteroencoder (input-output)
network to do supervised learning? With no bottleneck, these networks
do linear regression. What happens with a bottleneck? It turns out
that there is a simple, closed form expression for the "restricted
rank regression" that such networks achieve and that it is closely
related to a statistical concept called canonical correlations.
 Carlos Brody and I have written a short tech report, titled
Linear Heteroencoders, which gives a tutorial review of these
ideas and a simple derivation of the form of the optimal restricted
rank linear mapping from some inputs to outputs.
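
For the plain unweighted least-squares case, the closed form can be sketched in numpy (data and the bottleneck size k below are illustrative): fit ordinary least squares, then project onto the leading left singular vectors of the fitted values. The canonical-correlations view arises under the appropriate whitening of inputs and outputs:

```python
# Restricted rank regression: the best rank-k linear map from X to Y
# is the OLS solution projected onto the top-k left singular vectors
# of the OLS fitted values.
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(5, 300))                  # inputs  (5-D, 300 cases)
Y = rng.normal(size=(4, 5)) @ X + 0.1 * rng.normal(size=(4, 300))

k = 2                                          # bottleneck width
B_ols = Y @ X.T @ np.linalg.inv(X @ X.T)       # full-rank regression
U = np.linalg.svd(B_ols @ X)[0]                # SVD of the fitted values
W = U[:, :k] @ U[:, :k].T @ B_ols              # best rank-k linear map
# W is exactly what a trained linear network with a width-k bottleneck
# computes between its inputs and outputs
```

With no bottleneck (k equal to the output dimension) the projection is the identity and W reduces to ordinary linear regression, matching the no-bottleneck case described above.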
Computing with action potentials
 In his 1995 Nature article, John Hopfield proposed
an encoding of information in the timing of action potentials
that allowed interesting computations to be performed using
a simple coincidence detection mechanism. Hopfield, Carlos Brody
and I have explored the machine learning side of this idea
and applied it to a simple digit recognition task.
 Our paper,
Computing with action potentials appeared in NIPS-10.
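
A toy sketch of the flavor of the encoding (the details below are illustrative, not the published scheme): input values are mapped to spike latencies, and a detector with tuned delays fires when the delayed spikes coincide. A pleasant side effect of logarithmic latencies is that uniformly scaling the input shifts all spike times equally, so the detector still fires:

```python
# Timing code plus coincidence detection, in miniature.
import numpy as np

def spike_times(x):
    return -np.log(x)            # larger inputs fire earlier

def coincidence(times, delays, tol=0.05):
    aligned = times + delays
    return aligned.max() - aligned.min() < tol   # spikes near-simultaneous?

pattern = np.array([0.9, 0.5, 0.2])
delays = -spike_times(pattern)   # delays tuned so this pattern aligns

print(coincidence(spike_times(pattern), delays))         # detector fires
print(coincidence(spike_times(pattern * 2.0), delays))   # scaled input: still fires
print(coincidence(spike_times(np.array([0.3, 0.8, 0.4])), delays))
```

The detector thus responds to the relative pattern of inputs rather than their overall intensity, using nothing more than delays and a coincidence check.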
Sam Roweis, Vision, Learning and Graphics Group,
NYU, www.cs.nyu.edu/~roweis
