Machine Learning Algorithms

Learning Human Motion

Graham Taylor has posted some videos,and some code] associated with our recent NIPS paper.

Learning When Being Missing is Informative

Often elements are missing from datasets we collect, but in a way that is far from random. Ben Marlin, Rich Zemel and I explored the topic of unsupervised learning in the presence of such non-ignorable missing data when the missing data mechanism is itself unknown. Our paper discusses several classes of missing data mechanisms for categorical data and develops learning and inference methods for two specic models. We also present empirical results using synthetic data which show that these algorithms can recover both the unknown selection model parameters and the underlying data model param- eters to a high degree of accuracy. We also apply the algorithms to real data from the domain of collaborative filtering, and report initial results.
Our paper about this work was at AISTATS 2005: Unsupervised Learning with Non-ignorable Missing Data..

Inference in Nonlinear Systems with Embedded HMMs

We describe a Markov chain method for sampling from the distribution of the hidden state sequence in a non-linear dynamical system, given a sequence of observations. This method updates all states in the sequence simultaneously using an embedded Hidden Markov Model (HMM). An update begins with the creation of "pools" of candidate states at each time. We then define an embedded HMM, whose states are indexes within these pools. Using a forward-backward dynamic programming algorithm, we can efficiently choose a state sequence with the appropriate probabilities from the exponentially large number of state sequences that pass through states in these pools.
Matt Beal, Radford Neal and I presented this work at NIPS2003 in the paper Inferring State Sequences for Non-linear Systems with Embedded Hidden Markov Models.

Efficient Learning with Hidden Variables

There is a close relationship between bound optimization algorithms such as Expectation-Maximization and direct optimization algorithms such as gradient-based methods for parameter learning. Under certain conditions bound-optimizers exhibit Quasi-Newton convergence behavior, under others these algorithms possess poor, first-order convergence. In particular, for the EM algorithm we show that if a certain measure of the proportion of missing information is small, then EM exhibits Quasi-Newton behavior; when it is large, EM converges slowly. Based on this analysis, we present a new Expectation-Conjugate-Gradient (ECG) algorithm for maximum likelihood estimation, in latent variable models, and report empirical results, showing that, as predicted by the theory, ECG outperforms EM in certain cases.
Ruslan Salakhutdinov and I wrote a paper for ICML2003 on a very simple but very effective way of speeding up EM using "overstepping", "momentum" or "overrelaxation" with an adaptive step size: Adaptive Overrelaxed Bound Optimization Methods. .
Ruslan Salakhutdinov, Zoubin Ghahramani and I had a more theoetical paper, also at ICML03, which discusses the EM algorithm in particular, analyzes its convergence compared to direct conjugate-gradient optimizations and proposes a new algorithm for switching between the two strategies: Optimization with EM and Expectation-Conjugate-Gradient. .
At UAI03, we had a theoretical paper covering a broader class of algorithms which analyzed the general convergence behaviour of many bound optimization algorithms including EM. (After following this link, there is a note which mentions an important typo appearing in the official version of this paper.) On the Convergence of Bound Optimization Algorithms .

Alternate Objective Functions for Markovian Fields

In labelling or prediction tasks, a trained model's test performance is often based on the quality of its single-time marginal distributions over labels rather than its joint distribution over label sequences. We propose using a new cost function for discriminative learning that more accurately reflects such test time conditions. We present an efficient method to compute the gradient of this cost for Maximum Entropy Markov Models, Conditional Random Fields, and for an extension of these models involving hidden states. Our experimental results show that the new cost can give significant improvements and that it provides a novel and effective way of dealing with the 'label-bias' problem.
Our paper An Alternate Objective Function for Markovian Fields appeared in ICML2002, joint work with Yee Whye Teh and Sham Kakade.

Self-organized (constrained) hidden Markov models

Hidden variables are an essential feature of many statistical models of data. Adding topology to hidden variables can often lead to powerful algorithms that learn highly structured representations. For my thesis I developed a model closed related to hidden Markov models (HMMs) which adds topology to the hidden states. This is similar to the topology given to units in a Kohonen map.
A paper, Constrained Hidden Markov Models appeared in NIPS12.

Linear-Gaussian models

Many common statistical, engineering and machine learning models that employ linear estimators with Gaussian noise can be thought of in a common framework. I have written a review article with Zoubin Ghahramani which covers many such models and their relationships.
A unifying review of linear Gaussian models appeared in Neural Computation 11(2).

Principal component analysis

The eigenvectors and eigenvalues found by conventional principal component analysis (PCA) are also the maximum likelihood solutions to a simple probabilistic generative model closely related to factor analysis. This model can be learned with the EM algorithm which in some cases can be more efficient than other traditional ways of computing the principal components. It also gives a proper probabilistic interpretation is PCA.
My paper EM algorithms for PCA and SPCA in NIPS10 introduces this model and the associated algorithms.
Essentially the same model has been developed independently by Michael Tipping and Chris Bishop both at Microsoft Research, Cambridge.
My earlier technical report version contains some figures of an experiement using a very large natural image dataset which are not in the NIPS paper.
The MATLAB code for implementing this algorithm is also available from my code page.

Linear Heteroencoders

It is well known that if you train a linear network with a bottleneck as an autoencoder the hidden units extract the principal subspace of the data. (This is true even if the output units are nonlinear.) In other words, linear autoencoders with bottlenecks do PCA. But what if you train a linear heteroencoder (input-output) network to do supervised learning? With no bottleneck, these networks do linear regression. What happens with a bottleneck? It turns out that there is a simple, closed form expression for the "restricted rank regression" that such networks achieve and that it is closely related to a statistical concept called canonical correlations.
Carlos Brody and I have written a short tech report, titled Linear Heteroencoders gives a tutorial review of these ideas and a simple derivation of the form of the optimal restricted rank linear mapping from some inputs to outputs.

Computing with action potentials

In his 1995 Nature article, John Hopfield proposed an encoding of information in the timing of action potentials that allowed interesting computations to be performed using a simple coincidence detection mechanism. Hopfield, Carlos Brody and myself have explored the machine learning side of this idea and applied it to a simple digit recognition task.
Our paper, Computing with action potentials appeared in NIPS10.

Sam Roweis, Vision, Learning and Graphics Group, NYU, www.cs.nyu.edu/~roweis