RESEARCH PROJECTS
This page is my research statement; below you can find more details about the projects I have worked on:
Sparse Coding: algorithms that learn sparse representations of data
Learning Invariant Representations: algorithms that learn representations that are locally shift-invariant
Learning Deep Hierarchies of Features: unsupervised algorithms have also been used to train deep networks, i.e. models composed of a sequence of nonlinear transformations
Object Recognition: deep networks producing sparse and shift-invariant representations have been applied to generic object recognition and handwriting recognition
Image Denoising and Compression: an example of how unsupervised algorithms producing sparse representations are able to learn image statistics, with applications to image denoising and compression
Text Retrieval and Classification: semi-supervised algorithms have also been used to train a deep network producing representations of text documents
Energy-Based Models: a general framework used to describe both non-probabilistic and probabilistic models, and a powerful tool for devising computationally efficient algorithms
Autonomous Robot Navigation: yet another application of unsupervised algorithms to extract visual features that are used for navigation
Automatic Recognition of Biological Particles: an approach for the detection and classification of biological particles in microscopic images
OVERVIEW
I am interested in designing general algorithms that can be applied to a variety of tasks while being computationally efficient enough for real-time applications. Labeled data is often very expensive to produce, scarce, and noisy. In this scenario the paucity of labeled data prevents standard supervised learning algorithms from producing good estimates of the model parameters, and the system fails to generalize to previously unseen test data. Most of my work has focused on the design of Unsupervised Learning algorithms. These algorithms aim at modeling the distribution of the input data and discovering its underlying hidden structure, without relying on “labels” specifying the ultimate task a user might be interested in. Unsupervised learning algorithms can leverage large amounts of unlabeled data to produce representations of the input that are more suitable for subsequent supervised modules.
In particular, my research has focused on devising unsupervised algorithms that learn sparse representations, that is, representations whose components are zero most of the time. Such representations are useful because (1) they can be very high-dimensional, and features are more likely to become linearly separable in a high-dimensional space, (2) they are more robust to noise, (3) they often produce a part-based representation that is easier to interpret, and (4) they can be learned very efficiently.
The long-term goal of my research is to learn efficient representations that are invariant to irrelevant transformations, and hierarchical. To this end I have worked on Deep Networks.
These methods have been applied to a variety of datasets and tasks, such as: generic visual object recognition, handwriting recognition, image denoising and compression, text retrieval and classification, and autonomous robot navigation.
I have also been interested in developing the Energy-Based framework for both Supervised and Unsupervised Learning. Energy-Based Models are akin to unnormalized factor graphs, and they are useful because they provide a single framework to describe both probabilistic models and more discriminative methods, such as maximum-margin learning methods and neural networks. The theory of Energy-Based Models gives guidelines for building learning algorithms that are able to scale to large datasets and to solve the task efficiently during inference.
SPARSE CODING
Sparse coding algorithms learn representations of the input data that have only a few components that are significantly non-zero, i.e. that are sparse. A sparse representation can be overcomplete, when its dimensionality is higher than the dimensionality of the input.
For instance, Gabor Wavelets produce a sparse representation of natural images. A sparse coding algorithm is not only able to produce a similar sparse representation, but also, it adapts the analysis and synthesis transformations to the data (for instance, the filter banks used in the decomposition).
Sparse representations are useful because (1) features might become more linearly separable in high-dimensional spaces, (2) the sparsity constraint can be used to make training more efficient (see the section about Energy-Based Models below), (3) sparse overcomplete representations are more robust to noise, and (4) they have been shown to be biologically plausible.
In our work we are interested in learning a nonlinear transformation between the input and the sparse representation. This is a feed-forward mapping that produces a representation very efficiently (e.g. through a learned filter bank followed by a nonlinearity), allowing the system to be used in real-time applications. No iterative process is necessary to compute the representation after training, unlike other sparse coding algorithms proposed in the literature. Thanks to the nonlinearities, we can also construct deep networks by stacking these adaptive transformations on top of each other to produce hierarchical and more abstract representations.
In order to assess the quality of the representation during training, we require that the input can be reconstructed from the representation, as a way to make sure that most of the relevant information has been preserved in the mapping. This feedback path, mapping the representation back into input space, is used during training only, and it can simply consist of a set of linear filters. The parameters of this model (the feed-forward and feedback mappings) are trained by minimizing a loss function with terms such as the reconstruction error of the input and a sparsity penalty on the representation.
The figure on the right shows an example of the sparse features that are learned by applying the algorithm to natural images. The features are localized, oriented edge detectors at different scales. Training these filters from random initial conditions requires a few thousand parameter updates, as shown in the animation on the left.
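The training loop described above can be sketched as follows. This is a minimal illustration, not the exact algorithm from the papers: the dimensions, the tanh nonlinearity, and the plain L1 sparsity penalty are assumptions chosen for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

n_in, n_code = 16, 64  # overcomplete: the code is larger than the input
W_enc = rng.normal(0, 0.1, (n_code, n_in))  # feed-forward "filter bank"
W_dec = rng.normal(0, 0.1, (n_in, n_code))  # linear feedback (decoder) path
lam, lr = 0.1, 0.01                         # sparsity weight, learning rate

def encode(x):
    # Fast feed-forward mapping: no iterative inference is needed.
    return np.tanh(W_enc @ x)

def loss(x):
    # Reconstruction error plus an L1 sparsity penalty on the code.
    z = encode(x)
    return np.sum((x - W_dec @ z) ** 2) + lam * np.sum(np.abs(z))

def sgd_step(x):
    global W_enc, W_dec
    z = np.tanh(W_enc @ x)
    d_rec = 2 * (W_dec @ z - x)                  # d loss / d reconstruction
    g_dec = np.outer(d_rec, z)
    d_z = W_dec.T @ d_rec + lam * np.sign(z)     # back-prop through decoder + L1
    g_enc = np.outer(d_z * (1 - z ** 2), x)      # tanh derivative
    W_dec -= lr * g_dec
    W_enc -= lr * g_enc

X = rng.normal(size=(200, n_in))                 # stand-in for image patches
before = np.mean([loss(x) for x in X])
for _ in range(20):
    for x in X:
        sgd_step(x)
after = np.mean([loss(x) for x in X])
```

After training, only the cheap feed-forward `encode` is needed; the decoder is discarded, as described above.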
PAPERS

Efficient Learning of Sparse Overcomplete Representations with an Energy-Based Model (NIPS 2006)

Sparse Feature Learning for Deep Belief Networks (NIPS 2007)
LEARNING INVARIANT REPRESENTATIONS
A sparse coding algorithm can be modified to learn representations that are not only sparse, but also invariant to some transformation. Invariance can be learned by adding terms to the loss function that is minimized during training, exploiting the knowledge we have about the transformation, or exploiting general properties such as temporal coherence across consecutive video frames or spatial stationarity in static images. Another possibility is to change the architecture of the system to produce the invariance we want. Here is an example.
Consider the problem of detecting the orientation of bars placed at random locations in a binary image. The feed-forward path has a (learned) filter bank detecting bars at each possible orientation. By taking the largest value in the output map produced by each filter, we obtain a representation that is invariant to shifts. In the figure, the first filter detects vertical bars. No matter where a vertical bar is located, the largest value of the convolution of the input with this filter will always be the same (hence the invariance to shifts). Since during training the filters are learned by reconstructing the input from the representation, we have to be able to recover the information about the bar location in order to generate good reconstructions. Hence, we store the positions of the largest values in another set of variables, called the "transformation parameters". The feedback path is then able to reconstruct the input by using not only the sparse and shift-invariant representation, but also the transformation parameters that specify where the features were found in the image.
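The max-pooling-plus-location scheme just described can be sketched as follows; this is a hypothetical 1-D example, not the actual code.

```python
import numpy as np

def pool_with_params(feature_map):
    """Shift-invariant code: keep each filter's strongest response (the
    representation) plus where it occurred (the 'transformation parameters')."""
    idx = feature_map.argmax(axis=1)
    val = feature_map[np.arange(feature_map.shape[0]), idx]
    return val, idx

def unpool(val, idx, width):
    """Feedback path: place each response back at its stored location,
    so the input can be reconstructed during training."""
    rec = np.zeros((len(val), width))
    rec[np.arange(len(val)), idx] = val
    return rec

# One filter's response map to a bar at position 3, then at position 7.
resp_a = np.zeros((1, 10)); resp_a[0, 3] = 1.0
resp_b = np.zeros((1, 10)); resp_b[0, 7] = 1.0
code_a, where_a = pool_with_params(resp_a)
code_b, where_b = pool_with_params(resp_b)
# code_a equals code_b: the pooled representation is shift-invariant;
# where_a differs from where_b: the location survives for reconstruction.
```

The same idea extends to 2-D maps by pooling over both spatial axes.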
PAPERS

Unsupervised Learning of Invariant Feature Hierarchies with Applications to Object Recognition (CVPR 2007)
LEARNING DEEP HIERARCHIES OF FEATURES
The unsupervised learning algorithms described above can be used as building blocks to train deep networks in a layer-wise fashion. Once a layer has been trained, it is used to produce the input to train the layer above. Since these algorithms are based on general principles, namely reconstruction to preserve information and sparsity to regularize the representation, they are “replicable”, i.e. they can be used to train each layer of a hierarchical model in sequence.
Using deep models has an important computational advantage. A deep model can reuse intermediate representations to learn more abstract representations that, hopefully, are more easily correlated to the underlying causes generating the data.
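The greedy layer-wise procedure can be sketched as follows. Here `train_layer` is only a stand-in for any of the unsupervised modules above (a random projection with a nonlinearity, for illustration), and the layer sizes are made up.

```python
import numpy as np

rng = np.random.default_rng(1)

def train_layer(X, n_code):
    """Stand-in for one unsupervised module (e.g. a sparse coder): in a real
    system this would minimize reconstruction error plus a sparsity penalty."""
    W = rng.normal(0, 0.1, (n_code, X.shape[1]))
    return lambda H: np.tanh(H @ W.T)

def train_deep(X, layer_sizes):
    """Greedy layer-wise training: each layer is trained on the
    representation produced by the layers below it, in sequence."""
    encoders, H = [], X
    for n in layer_sizes:
        f = train_layer(H, n)   # train this layer in isolation...
        H = f(H)                # ...then feed its output upward
        encoders.append(f)
    return encoders, H

X = rng.normal(size=(100, 784))              # e.g. 28x28 digit images, flattened
encoders, top = train_deep(X, [256, 64, 10]) # hypothetical layer sizes
```

Each stage sees only the output of the stage below, which is what makes the procedure "replicable" in the sense described above.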
For instance, by applying a sparse coding algorithm to handwritten digits we are able to learn feature detectors that resemble the “strokes” from which the digits can be formed (the figure shows a small selection of these filters). These detectors reveal low-level and fairly localized correlations in the input data.
By using the representation produced by these detectors as input to another sparse coding stage that has only 10 units in its representation, we are actually able to learn the highly nonlinear mapping between input pixels and labels in an unsupervised way. The figure below shows what each unit in this second-level representation is encoding by mapping it back into input space (through the feedback paths of the second and first stages). The second-stage detectors are able to put together the “strokes” detected by the first stage, discovering frequently occurring patterns. Such a highly nonlinear mapping could not be discovered by using only a single layer of processing.
OBJECT RECOGNITION
Deep architectures trained in an unsupervised way to produce sparse and locally shift-invariant representations have been applied to a variety of datasets. By using the representation as input to a linear supervised classifier, we have built a system for real-time object recognition.
We currently hold the performance record for recognizing the handwritten digits of the MNIST dataset, both using the original training dataset (error rate of 0.6%) and the training dataset augmented with elastically distorted digits (error rate of 0.4%; in the latter case, the performance matches that achieved by Patrice Simard, ICDAR 2003).
We have also applied these techniques to recognize object categories in the Caltech 101 dataset, which has 101 natural object categories and up to 30 training instances per class. We reported an average accuracy of 54%; the accuracy on a few example categories is given in the figure below.
IMAGE DENOISING AND COMPRESSION
Sparse coding algorithms can also be used to denoise images. Assuming we know a model of the corruption process (e.g. additive Gaussian noise), we can train the forward path of the model to map noisy inputs to a sparse representation, and force the feedback path to reconstruct the original uncorrupted input from that sparse representation. After training, the model denoises an image by first mapping it into a sparse representation, and then mapping the sparse representation back into input space through the transformation given by the feedback path. Since a sparse representation concentrates the energy in a few coefficients, adding noise to the input changes the representation very little in terms of the relative magnitude of the active versus approximately zero coefficients, yielding good denoising results.
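The key detail in this training procedure is the asymmetric pairing: the forward path sees a corrupted input, while the reconstruction error is measured against the clean one. A minimal sketch, assuming additive Gaussian noise and a tiny one-layer model (the dimensions and learning rate are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma, lr = 16, 0.5, 0.005
W_enc = rng.normal(0, 0.1, (32, n))  # forward path: noisy input -> code
W_dec = rng.normal(0, 0.1, (n, 32))  # feedback path: code -> clean image

def denoise(x_noisy):
    # After training: encode the noisy input, then map the code back.
    return W_dec @ np.tanh(W_enc @ x_noisy)

def train_step(x_clean):
    """Corrupt the input, encode it, but penalize error vs the CLEAN target."""
    global W_enc, W_dec
    x_noisy = x_clean + sigma * rng.normal(size=n)  # known corruption model
    z = np.tanh(W_enc @ x_noisy)
    err = W_dec @ z - x_clean                       # clean target here
    g_dec = 2 * np.outer(err, z)
    g_enc = np.outer((W_dec.T @ (2 * err)) * (1 - z ** 2), x_noisy)
    W_dec -= lr * g_dec
    W_enc -= lr * g_enc
    return np.sum(err ** 2)

X = rng.normal(size=(200, n))                       # stand-in for image patches
first = np.mean([train_step(x) for x in X])
for _ in range(20):
    last = np.mean([train_step(x) for x in X])
```

A real system would train on natural image patches, whose statistics (unlike this random stand-in data) make the noisy-to-clean mapping genuinely learnable.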
Our experiments reported state-of-the-art performance in denoising natural images corrupted by additive Gaussian noise at medium and high noise levels. The figure below shows an example: the original image is on the left, the corrupted image in the middle, and the denoised image on the right.
We also did experiments in text document image compression. After learning sparse representations of document images, a page is compressed by encoding the representation and the location of patches that contain characters. Very promising results were obtained; an example of document image reconstruction is given below (compression ratio equal to 95, percentage of flipped pixels 1.35%).
TEXT RETRIEVAL AND CLASSIFICATION
We devised a semi-supervised algorithm to learn text document representations, and we used these representations for classification and retrieval tasks. Nowadays the most popular document representation is the bag-of-words representation, which stores the number of times each word of the dictionary appears in a given document. This is conceivably suboptimal because it assumes that words are independent of each other, disregarding synonyms and polysemous words.
In this work, we have developed an algorithm to train a deep network that extracts representations of documents. Training proceeds by minimizing an objective function that combines the unsupervised principle of good reconstruction of the input from the representation with the supervised information (e.g. the topic) that might accompany the document. The figure on the left shows that a representation learned by combining both objectives outperforms the reweighted input bag-of-words representation (tf-idf) as well as the representation produced by the same algorithm trained fully unsupervised (20 Newsgroups dataset). Also, the performance does not degrade significantly if we compress the representation as we go from the first to the last layer: in this experiment the first layer produces a 200-dimensional representation, while the top layer produces a representation with just 20 components. If the last layer of the deep network has only two components, then we can map the data into the plane for visualization. The figure on the right shows the learned mapping for the documents of the Ohsumed dataset (medical abstracts). The 30,000-dimensional input space is mapped into the plane in such a way that documents belonging to the same class are clustered together, and similar topics are placed next to each other.
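The structure of such a combined objective can be sketched as follows. This is only an illustration of the reconstruction-plus-supervision idea: the dimensions, the weighting `alpha`, and the softmax classifier on top of the code are assumptions, not the exact model from the work.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_code, n_classes, alpha = 100, 20, 4, 0.5  # hypothetical sizes/weight

W_enc = rng.normal(0, 0.1, (n_code, n_in))   # document -> compact code
W_dec = rng.normal(0, 0.1, (n_in, n_code))   # code -> reconstructed input
W_cls = rng.normal(0, 0.1, (n_classes, n_code))  # code -> topic prediction

def objective(x, y):
    """Semi-supervised objective: a reconstruction term (always available)
    plus a supervised term used only when a label y is present."""
    z = np.tanh(W_enc @ x)
    rec = np.sum((x - W_dec @ z) ** 2)           # unsupervised term
    if y is None:                                # unlabeled document
        return rec
    logits = W_cls @ z
    logp = logits - np.log(np.sum(np.exp(logits)))  # log-softmax
    return rec - alpha * logp[y]                 # add cross-entropy term

x = rng.normal(size=n_in)      # stand-in for a word-count vector
labeled = objective(x, 2)      # document with a known topic
unlabeled = objective(x, None) # document without a label
```

Because the supervised term is simply skipped for unlabeled documents, the same objective can be minimized over a mixture of labeled and unlabeled data.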
ENERGYBASED MODELS
The Energy-Based Model is a framework for devising efficient learning algorithms. In the supervised case, an energy is associated with each pair of input sample and target value. Inference consists of minimizing the energy with respect to the target, and learning consists of defining a loss functional, a function of the energy, whose minimization yields the desired behavior, i.e. the energy is lowest for the target that corresponds to the true label. Probabilistic models are an instance of Energy-Based Models: they correspond to models trained with a particular choice of loss functional, the negative log-likelihood, which is often intractable to compute. Energy-Based Models are useful in high-dimensional spaces, where the normalization requirement imposed by probabilistic models becomes computationally intractable. In some sense, an Energy-Based Model can be interpreted as a “local probabilistic model”, because it only aims to assign the correct shape to the distribution of the data around regions of interest, such as those populated by training samples. There are conditions on the energy and the loss functional that guarantee learning and correct inference, and there are extensions to the case in which models have latent variables. The figure below shows the energy functions of two models trained with different loss functionals. Despite the different shapes of the energies, the two models achieve the same goal, i.e. they give lower energy to the training samples (the blue dots).
In Unsupervised Learning no labels are provided. Therefore, the energy is a function of the input data, and possibly of some latent variables. The goal of learning is to assign low energy to samples that are similar to training samples, and higher energy to other points in input space (or at least to those points in a neighborhood of the training samples). This problem can be interpreted as “energy carving”, because learning consists of adjusting the parameters of the model to carve the energy surface in such a way that training samples always lie in the valleys of the surface (i.e. at local minima). In general, it is not sufficient to adjust the parameters so that only the energy of the training samples is decreased, because the energy surface might then become flat, corresponding to a model that cannot discriminate between points that are similar to the training samples and points that are dissimilar.
Some unsupervised learning algorithms are able to learn good energy surfaces by using a loss functional that has two terms: the first decreases the energy of the training samples, while the second increases the energy of a suitably chosen point in input space. However, it can be difficult and computationally expensive to identify a good candidate for pulling up the energy, above all in high-dimensional spaces. Other methods achieve the same energy-carving goal by replacing this explicit pull-up term in the loss with constraints on the representation, such as sparsity. By making sure that the internal representation of the model can carry only a limited amount of information, the energy surface can take the right shape without using a contrastive term in the loss functional. This can lead to the design of far more efficient unsupervised learning algorithms.
For instance, the figure below shows the energy surface (in this case the energy is the reconstruction error) produced by different algorithms after training on a dataset of points sampled from a spiral (the points in magenta). We can see the energy surface produced by A) PCA, B) K-Means, C) a 2-1-2 autoencoder neural net, D) a 2-20-2 autoencoder neural net trained by maximum likelihood, E) a 2-20-2 autoencoder neural net trained with a maximum-margin loss, and F) a 2-20-2 network trained with a sparse coding algorithm. All these algorithms are able to avoid flat energy surfaces, and the apparently very different shapes of these energies can be understood through the principle described above of explicit or implicit energy pull-up.
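The K-Means case makes the implicit pull-up idea concrete: the energy is the reconstruction error under a 1-of-k code, and the small codebook is the capacity constraint that keeps the surface from going flat, with no contrastive term at all. A rough sketch, using points on a circle as a stand-in for the spiral data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Training points sampled along a 1-D curve embedded in the plane.
t = rng.uniform(0, 2 * np.pi, 500)
X = np.stack([np.cos(t), np.sin(t)], axis=1)

def kmeans(X, k=16, iters=20):
    # Plain Lloyd iterations from a random subset of the data.
    mu = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        assign = ((X[:, None] - mu[None]) ** 2).sum(-1).argmin(axis=1)
        for j in range(k):
            if np.any(assign == j):
                mu[j] = X[assign == j].mean(axis=0)
    return mu

def energy(x, mu):
    """K-Means energy: reconstruction error under a 1-of-k code. The small
    codebook limits the information the code can carry, so the energy is
    low only near the data, without any explicit pull-up term."""
    return ((mu - x) ** 2).sum(axis=1).min()

mu = kmeans(X)
e_train = np.mean([energy(x, mu) for x in X])  # low: on the data manifold
e_far = energy(np.array([3.0, 3.0]), mu)       # high: off the manifold
```

The sparse coding algorithms above exploit the same principle, with sparsity rather than a 1-of-k code as the capacity constraint.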
AUTONOMOUS ROBOT NAVIGATION
The above-mentioned unsupervised learning algorithms have also been used to extract visual features for the DARPA project called LAGR (Learning Applied to Ground Robots). The features have been used to perform long-range stereo, helping the robot navigate in unknown outdoor environments.
AUTOMATIC RECOGNITION OF BIOLOGICAL PARTICLES
While I was at Caltech, I worked on a system for the automatic recognition of biological particles in microscopic images, such as pollen grains and cells in human specimens. The system is composed of (1) a detector, and (2) a classifier assigning each particle to its category. The figure shows the tool that was used to collect the pollen data. After acquiring an image from the microscope, an expert had to detect and label the pollen grains, which were subsequently used to train the algorithm for automatic recognition. In this image, there are two Chinese elm pollen grains; the other particles are just debris.