
 Time Period: 2008 – present.
 Participants: Karol Gregor, Koray Kavukcuoglu, Arthur Szlam, Rob Fergus, Yann LeCun (Courant Institute/CBLL).
 Sponsors: ONR, NSF.
 Description: We are developing unsupervised (and semi-supervised) learning algorithms to learn
visual features. The methods are based on the idea that a good feature representation should be high-dimensional
(so as to facilitate the separability of the categories), should contain enough information to reconstruct the
input near regions of high data density, and should not contain enough information to reconstruct inputs in
regions of low data density. This led us to use sparsity criteria on the feature vectors. Many of the sparse coding
methods we have developed include a feed-forward predictor (a so-called encoder) that can quickly produce an
approximation of the optimal sparse representation of the input. This allows us to use the learned feature
extractor in real-time object recognition systems. Several variants of sparse coding are proposed, including one
that uses group sparsity to produce locally invariant features, two methods that separate the "what" from
the "where" using temporal constancy criteria, and two methods for convolutional sparse coding, in which the
dictionary elements are convolution kernels applied to images.
Examples of filters learned from natural image patches using the Predictive Sparse Decomposition algorithm (PSD).
The objective of sparse coding is to reconstruct an input
vector (e.g. an image patch) as a linear combination of
a small number of vectors picked from a large dictionary.
This is commonly known as basis pursuit.
In the most common form, the feature vector Z that represents an input
vector Y (the optimal code for Y) is the one that minimizes the
following L2-L1 energy function:
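The equation itself appears only as a figure on the original page; in standard notation (assuming Wd denotes the dictionary matrix, z_j the components of the code Z, and α the sparsity weight), the energy has the form:

```latex
E(Y, Z) = \left\| Y - W_d Z \right\|_2^2 \;+\; \alpha \sum_j |z_j|
```

The first (L2) term measures reconstruction error; the second (L1) term makes the code sparse.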
The process can be described by a factor graph:
where red rectangles represent factors, circles represent
variables, and blue modules represent deterministic functional
relationships.
As shown by Olshausen and Field in 1997, the basis functions can be
learned by minimizing the free energy, averaged over a training set of
input samples, with respect to the dictionary matrix Wd whose columns
are the dictionary elements. The norm of the dictionary elements is
constrained to be less than 1. The minimization can be performed
through stochastic gradient descent.
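The learning loop above can be sketched as follows. This is a minimal NumPy sketch, not the authors' code: ISTA is used for the inner sparse inference, and the learning rate, sparsity weight, and iteration counts are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def ista(y, Wd, alpha, n_steps=50):
    """Find the code z minimizing ||y - Wd z||^2 + alpha*||z||_1 via ISTA."""
    L = np.linalg.norm(Wd, 2) ** 2          # Lipschitz constant of the gradient
    z = np.zeros(Wd.shape[1])
    for _ in range(n_steps):
        grad = Wd.T @ (Wd @ z - y)          # gradient of the reconstruction term
        z = z - grad / L
        z = np.sign(z) * np.maximum(np.abs(z) - alpha / L, 0.0)  # soft shrinkage
    return z

def train_dictionary(Y, n_atoms, alpha=0.1, lr=0.05, n_epochs=5):
    """SGD on the dictionary; columns are constrained to norm <= 1."""
    Wd = rng.standard_normal((Y.shape[1], n_atoms))
    Wd /= np.linalg.norm(Wd, axis=0)
    for _ in range(n_epochs):
        for y in Y:
            z = ista(y, Wd, alpha)           # infer the optimal sparse code
            residual = y - Wd @ z
            Wd += lr * np.outer(residual, z)  # gradient step on ||y - Wd z||^2
            norms = np.maximum(np.linalg.norm(Wd, axis=0), 1.0)
            Wd /= norms                       # project columns back to norm <= 1
    return Wd
```

In practice each training sample would be a flattened image patch (e.g. 144 components for 12x12 pixels).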
Predictive Sparse Decomposition (PSD) 
One problem with traditional sparse coding is that inference is
somewhat slow. Given an input vector, finding the corresponding code
vector requires an L2/L1 optimization. Having to do this for every
patch in an image would preclude the use of sparse coding for
high-speed image recognition.
Predictive Sparse Decomposition (PSD) alleviates this problem by
training a simple, feedforward encoder module to predict an
approximation to the optimal sparse code. The factor graph is shown
below.
This can be seen as a kind of auto-encoder. The energy
function has an additional term, which measures the discrepancy
between the predicted code and the actual code. The encoder can take
several functional forms. In the simplest instance, it is simply
a linear transform, followed by a sigmoid non-linearity and
a trainable diagonal gain matrix.
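A minimal sketch of this encoder and of the resulting PSD energy. The names are hypothetical, tanh stands in for the sigmoid-type non-linearity, and the unit trade-off weights between the three terms are an illustrative assumption:

```python
import numpy as np

def psd_encoder(y, We, b, g):
    """Feed-forward predictor: linear transform We y + b, a tanh
    non-linearity, then a trainable diagonal gain g."""
    return g * np.tanh(We @ y + b)

def psd_energy(y, z, Wd, We, b, g, alpha):
    """PSD energy: reconstruction + L1 sparsity + code-prediction terms."""
    reconstruction = np.sum((y - Wd @ z) ** 2)
    sparsity = alpha * np.sum(np.abs(z))
    prediction = np.sum((z - psd_encoder(y, We, b, g)) ** 2)
    return reconstruction + sparsity + prediction
```

At training time the energy is minimized over z for each sample, then Wd and the encoder parameters are updated; at test time the optimization is skipped and the encoder output is used directly.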
The animation below shows the filters being learned by the PSD
learning procedure, when trained on 12x12 pixel patches from
natural images.
Learned Iterative Shrinkage and Thresholding Algorithm (LISTA) 
The simple encoder described above generally doesn't predict sparse
codes. In this work (published at ICML 2010), we propose to build the
encoder architecture so as to follow that of the so-called ISTA or
FISTA algorithm (Fast Iterative Shrinkage and Thresholding Algorithm),
which has been shown to converge to the optimal sparse code
solution. We build the encoder as a feed-forward network with the same
architecture as a time-unrolled version of the ISTA/FISTA
algorithm (truncated to a few iterations). The block diagram of
ISTA/FISTA and the corresponding encoder is shown below. With ISTA and
FISTA, the two matrices We and S have prescribed values: We is the
transpose of the dictionary matrix, and S is the identity minus some
constant times the Gram matrix of the dictionary elements. In our method,
called LISTA (Learned ISTA), the matrices are trained so as to produce
the best approximation of the optimal sparse code with the given
number of iterations.
The approximations are shown experimentally to be considerably better
than simply truncating the FISTA algorithm to the same number of
iterations. A learned version of the Coordinate Descent algorithm
(which is faster than FISTA) is also presented.
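The unrolled encoder can be sketched as follows. This is a minimal NumPy sketch: in LISTA proper, We, S, and the threshold are trained by back-propagation (omitted here); the comment shows their ISTA-prescribed values.

```python
import numpy as np

def shrink(x, theta):
    """Soft shrinkage (thresholding) non-linearity."""
    return np.sign(x) * np.maximum(np.abs(x) - theta, 0.0)

def lista_encoder(y, We, S, theta, n_iter=3):
    """ISTA unrolled in time and truncated to n_iter feed-forward stages."""
    b = We @ y                  # the input is filtered once, then reused
    z = shrink(b, theta)
    for _ in range(n_iter - 1):
        z = shrink(b + S @ z, theta)
    return z

# ISTA-prescribed values for a dictionary Wd with Lipschitz constant L:
#   We = Wd.T / L,   S = I - Wd.T @ Wd / L,   theta = alpha / L
# LISTA instead learns We, S, and theta to minimize the code prediction error.
```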
Convolutional Sparse Coding 
A problem with sparse coding, when applied to images, is that it is
trained on individual patches, yet it is often used convolutionally
(applied to sliding windows over the whole image). Because the
training is done on individual patches, the dictionary elements must
cover the entire patch. Hence we see the same Gabor filter replicated at
multiple shifted locations over the patch. This is very wasteful when
features are to be extracted over entire images using a sliding-window
approach: the code ends up being extremely redundant.
Convolutional sparse coding consists in viewing the dictionary
elements as convolution kernels applied to a large image region,
as shown in the diagram below.
With this method, the dictionary elements are considerably
less redundant. Since the convolution operation produces shifted
versions of each dictionary element, the
algorithm doesn't need to learn multiple shifted replicas
of each filter to cover the patches. This results in considerably
more diverse filters, such as center-surround, cross and corner
detectors, Gabors at multiple frequencies, etc.
Our colleagues Zeiler, Taylor, Krishnan, and Fergus described
such a method, called deconvolutional networks, in a
CVPR 2010 publication. Our method is a convolutional version
of PSD, which includes a fast encoder. The fast encoder uses
the soft-shrinkage function as its non-linearity.
Sparse features learned with convolutional PSD yield much better results
on Caltech-101 than sparse features learned with conventional
sparse coding (or PSD) trained at the patch level:
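The reconstruction side of convolutional sparse coding can be sketched as follows (hypothetical helper names; a plain-loop "full" convolution is used for clarity rather than an optimized routine). Each dictionary kernel is convolved with its own sparse feature map, and the results are summed:

```python
import numpy as np

def conv2d_full(fmap, kernel):
    """'Full' 2D convolution of a feature map with a kernel, via plain loops."""
    H, W = fmap.shape
    kh, kw = kernel.shape
    out = np.zeros((H + kh - 1, W + kw - 1))
    for i in range(H):
        for j in range(W):
            # each non-zero code unit stamps a shifted copy of the kernel
            out[i:i + kh, j:j + kw] += fmap[i, j] * kernel
    return out

def conv_reconstruct(fmaps, kernels):
    """Reconstruct an image as the sum over dictionary kernels convolved
    with their (sparse) feature maps."""
    return sum(conv2d_full(z, k) for z, k in zip(fmaps, kernels))
```

Because shifts come for free from the convolution, one kernel covers every position, which is why the learned filters need not be shifted replicas of each other.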
Convolutional Matching Pursuit 
Matching Pursuit is a "greedy" form of sparse coding.
We have proposed a convolutional form of Matching Pursuit
that produces extremely parsimonious representations
of images.
The picture below (left) shows examples of filters learned with
convolutional MP. On the right, we see an image of a face
(preprocessed with high-pass filtering and contrast normalization),
together with its reconstruction with convolutional MP using only 40
non-zero coefficients.
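The greedy selection rule of Matching Pursuit, in its basic non-convolutional form, can be sketched as follows (a minimal NumPy sketch; dictionary atoms are assumed to have unit norm):

```python
import numpy as np

def matching_pursuit(y, Wd, n_coeffs):
    """Greedy sparse coding: repeatedly pick the dictionary atom most
    correlated with the current residual, and subtract its contribution."""
    residual = y.astype(float).copy()
    z = np.zeros(Wd.shape[1])
    for _ in range(n_coeffs):
        correlations = Wd.T @ residual
        k = np.argmax(np.abs(correlations))   # best-matching atom
        z[k] += correlations[k]
        residual -= correlations[k] * Wd[:, k]
    return z
```

The convolutional variant applies the same rule with the atoms replaced by kernels at every image position, which is what makes 40 coefficients enough to reconstruct a whole face.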
Learning Invariant Features with Group Sparsity 
Group sparsity consists in arranging the components of the sparse code into
groups. The sparsity criterion applies to groups: it tends to minimize
the number of groups that contain non-zero code components, but if a
group is active, it doesn't restrict the number of non-zero components
within the group. This is achieved by using the L2 norm of the code components
in each group as the sparsity criterion.
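The criterion can be sketched as follows (hypothetical function name; groups are given as lists of component indices):

```python
import numpy as np

def group_sparsity(z, groups):
    """Sum of per-group L2 norms: penalizes the number of active groups,
    but not the number of active components within an active group."""
    return sum(np.linalg.norm(z[g]) for g in groups)
```

Because the L2 norm of a group grows sub-linearly in the number of its active components, turning on a second unit in an already-active group costs less than activating a new group, which is what pushes co-activated filters into the same group.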
During learning, the group sparsity criterion tends to produce filters
within a group that fire together. Hence, filters within a group tend
to be very similar (e.g. Gabor edges at similar orientations).
In the chart below, the code components were arranged in a 2D toroidal
topology, and the groups were overlapping squares (as shown). The
filters arrange themselves in a smoothly varying topographic map.
Learning Locally-Connected Filters with Group Sparsity 
Sparse coding was originally proposed as a model of simple cells in
the visual cortex. In the present project, the filters are arranged
over an input image. Each filter has a local receptive field on the
image. A layer of complex-cell-like pooling combines the outputs of
local groups of units. A sparsity criterion is applied to these
groups. The result is a topographic map with pinwheel patterns
similar to those observed in the visual cortex.
Publications 
Arthur Szlam, Koray Kavukcuoglu, Yann LeCun:
Convolutional Matching Pursuit and Dictionary Training, arXiv:1010.0422, October 2010.
Karol Gregor, Yann LeCun:
Emergence of Complex-Like Cells in a Temporal Product Network with Local Receptive Fields, arXiv:1006.0448, June 2010.
Karol Gregor and Yann LeCun:
Learning Fast Approximations of Sparse Coding, Proc. International Conference on Machine Learning (ICML'10), 2010.
Y-Lan Boureau, Francis Bach, Yann LeCun and Jean Ponce:
Learning Mid-Level Features for Recognition, Proc. International Conference on Computer Vision and Pattern Recognition (CVPR'10), IEEE, 2010.
Yann LeCun, Koray Kavukcuoglu and Clément Farabet:
Convolutional Networks and Applications in Vision, Proc. International Symposium on Circuits and Systems (ISCAS'10), IEEE, 2010.
Kevin Jarrett, Koray Kavukcuoglu, Marc'Aurelio Ranzato and Yann LeCun:
What is the Best Multi-Stage Architecture for Object Recognition?, Proc. International Conference on Computer Vision (ICCV'09), IEEE, 2009.
Koray Kavukcuoglu, Marc'Aurelio Ranzato, Rob Fergus and Yann LeCun:
Learning Invariant Features through Topographic Filter Maps, Proc. International Conference on Computer Vision and Pattern Recognition (CVPR'09), IEEE, 2009.
Koray Kavukcuoglu, Marc'Aurelio Ranzato and Yann LeCun:
Fast Inference in Sparse Coding Algorithms with Applications to Object Recognition, Tech Report CBLL-TR-2008-12-01, Computational and Biological Learning Lab, Courant Institute, NYU, 2008.
Marc'Aurelio Ranzato, Y-Lan Boureau and Yann LeCun:
Sparse Feature Learning for Deep Belief Networks, Advances in Neural Information Processing Systems (NIPS 2007), 2007.
Marc'Aurelio Ranzato and Yann LeCun:
A Sparse and Locally Shift Invariant Feature Extractor Applied to Document Images, Proc. International Conference on Document Analysis and Recognition (ICDAR), 2007.
Yann LeCun, Sumit Chopra, Marc'Aurelio Ranzato and Fu-Jie Huang:
Energy-Based Models in Document Recognition and Computer Vision, Proc. International Conference on Document Analysis and Recognition (ICDAR), (keynote address), 2007.
Marc'Aurelio Ranzato, Fu-Jie Huang, Y-Lan Boureau and Yann LeCun:
Unsupervised Learning of Invariant Feature Hierarchies with Applications to Object Recognition, Proc. Computer Vision and Pattern Recognition Conference (CVPR'07), IEEE Press, 2007.
Marc'Aurelio Ranzato, Y-Lan Boureau, Sumit Chopra and Yann LeCun:
A Unified Energy-Based Framework for Unsupervised Learning, Proc. Conference on AI and Statistics (AIStats), 2007.
Marc'Aurelio Ranzato, Christopher Poultney, Sumit Chopra and Yann LeCun:
Efficient Learning of Sparse Representations with an Energy-Based Model, in J. Platt et al. (Eds), Advances in Neural Information Processing Systems (NIPS 2006), MIT Press, 2006.