This is the list of final project topics.
Projects can be done individually, in groups of 2, or (with
permission) in groups of 3.
You can propose your own project topic, subject to approval by the instructor.
You must send a .tar or .tgz file to the TA with your code and a PDF file
describing your project and the results you obtained.
Sparse Coding in Unusual Algebras |
Sparse coding is a feature extraction method that maps inputs X
to codes (feature vectors) Z by minimizing an energy function of
the type: E(X,Z,W) = || X - WZ ||^2 + alpha|Z|
where W is a so-called dictionary matrix (whose columns are normalized
to 1), and |Z| denotes the L1 norm of Z (the sum of the absolute
values of the components). The optimal code Z* for a given input X
is the one that minimizes E(X,Z,W). Learning the W matrix
can be performed using stochastic gradient descent
W <- W - eta.dE(X,Z*,W)/dW and renormalizing the columns of W
to 1 at every step.
The idea is to replace the additive combination of the basis functions
by a max operation or some other operation. In regular sparse coding,
the reconstruction is of the form W.Z = SUM_i Wi.Zi where Wi is a
column of W. In our "unusual" sparse coding, the reconstruction would
be: W.Z = MAX_i Wi.Zi, or W.Z = 1/b.log[ SUM_i exp(b.Wi.Zi) ] where b
is some constant, or W.Z = [ SUM_i (Wi.Zi)^p ]^{1/p} (the latter
assumes positive X), or some other combination rule.
Implement one or more of these exotic sparse coding schemes, with the
learning procedure. Train it on 12x12 patches extracted from natural
images.
A number of papers on a similar topic (although in a
probabilistic/Bayesian framework) can be found
on Joerg Luecke's
page, particularly his NIPS 2010, NIPS 2009, and JMLR 2008
papers.
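As a rough starting point, here is a NumPy sketch of the standard scheme
(linear reconstruction, subgradient inference, SGD dictionary updates);
the "unusual" variants amount to replacing the W @ z reconstruction below
with the max, soft-max, or Lp combinations above. All names, step sizes,
and constants are illustrative, not from any course code:

    import numpy as np

    def infer_code(x, W, alpha=0.1, lr=0.01, steps=100):
        """Find z* minimizing ||x - W z||^2 + alpha |z|_1 by plain
        (sub)gradient descent; ISTA would converge faster."""
        z = np.zeros(W.shape[1])
        for _ in range(steps):
            z -= lr * (-2.0 * W.T @ (x - W @ z) + alpha * np.sign(z))
        return z

    def train_dictionary(patches, n_atoms=64, eta=0.01, alpha=0.1):
        d = patches.shape[1]                    # 144 for 12x12 patches
        W = np.random.randn(d, n_atoms)
        W /= np.linalg.norm(W, axis=0)          # unit-norm columns
        for x in patches:
            z = infer_code(x, W, alpha)
            W += eta * np.outer(x - W @ z, z)   # SGD step (factor 2 in eta)
            W /= np.linalg.norm(W, axis=0) + 1e-12  # renormalize columns
        return W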
Parallelizing Stochastic Gradient |
Stochastic gradient is difficult to parallelize because each sample is
run sequentially (unlike batch training where the samples can be run
in parallel). The present project idea is to train multiple convnets
in parallel with stochastic gradient, and to synchronize their weight
parameters periodically, but rarely enough so that the communication
cost is minimal.
Lush provides an object serialization infrastructure that makes it
easy to run multiple Lush processes that talk to each other through
sockets.
One idea would be to implement a form of Augmented Lagrangian method
for parallelizing stochastic gradient on multiple machines.
More information can be obtained from Sixin Zhang (sixin.zhang AT gmail DOT com)
A similar project was carried out by David Eigen (deigen AT cs DOT nyu DOT edu)
in 2010. You can get his code and build on top of it. See if you can do better.
Details on augmented Lagrangian methods can be found in Stephen Boyd's
paper
Distributed Optimization and Statistical Learning via the Alternating
Direction Method of Multipliers.
note: Stephen Boyd is currently on sabbatical at Courant.
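As an illustration of the synchronization scheme (not David Eigen's code,
and simulated serially rather than over sockets), here is a toy NumPy
sketch in which several SGD workers on a linear least-squares model
average their weights only every sync_every steps:

    import numpy as np

    def parallel_sgd(X, Y, n_workers=4, sync_every=50, eta=0.01):
        d = X.shape[1]
        workers = [np.zeros(d) for _ in range(n_workers)]
        shards = np.array_split(np.random.permutation(len(X)), n_workers)
        for t in range(min(len(s) for s in shards)):
            for k in range(n_workers):
                i = shards[k][t]
                err = X[i] @ workers[k] - Y[i]   # squared-loss residual
                workers[k] = workers[k] - eta * err * X[i]
            if (t + 1) % sync_every == 0:        # rare, cheap communication
                mean = np.mean(workers, axis=0)
                workers = [mean.copy() for _ in range(n_workers)]
        return np.mean(workers, axis=0)

An Augmented Lagrangian / ADMM variant would replace the plain averaging
with a consensus step that also maintains dual variables per worker.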
Trainable Contrast Normalization |
Convolutional nets and multilayer neural nets perform well
when the state of each layer has zero mean, and when the
state variables all have similar standard deviations.
The goal of this project is to implement a trainable subtractive
and divisive contrast normalization layer for convolutional networks.
(currently, such modules are set up by hand and not trained).
Subtractive contrast normalization is a trainable convolutional module
such that every variable in a set of feature maps (a 3D array) is
replaced by itself minus a (trainable) linear combination of its
neighbors, in such a way that the output is small (e.g. so that it
minimizes the L2 norm of the output state).
The divisive contrast normalization is identical, except that the
input variables are first mapped through a function that approximates
log(abs()). The function must be smoothly clipped so that it doesn't
blow up around zero. The outputs are mapped through the inverse
function. The sign must be treated separately.
Students should contact Ross Goroshin for a version of the code
which can be improved upon (rgoroshin AT gmail DOT com).
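To make the task concrete, here is a toy NumPy sketch for a single
feature map: a trainable neighborhood kernel is fitted by gradient
descent so that the subtractive output has a small L2 norm. The real
module would be convolutional over a full stack of maps and handle
borders properly (np.roll wraps around); sizes and rates are made up:

    import numpy as np

    def shifted(x, a, b, c):
        return np.roll(x, (a - c, b - c), axis=(0, 1))

    def train_subtractive_kernel(maps, ksize=5, eta=1e-4, epochs=5):
        c = ksize // 2
        k = np.zeros((ksize, ksize))        # trainable neighbor weights
        for _ in range(epochs):
            for x in maps:                  # each x: 2D float feature map
                y = x.copy()                # y = x - sum_ab k[a,b]*shift(x)
                for a in range(ksize):
                    for b in range(ksize):
                        y -= k[a, b] * shifted(x, a, b, c)
                for a in range(ksize):      # gradient of ||y||^2 w.r.t. k
                    for b in range(ksize):
                        g = -2.0 * np.sum(y * shifted(x, a, b, c))
                        k[a, b] -= eta * g
                k[c, c] = 0.0               # a pixel may not predict itself
        return k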
Contracting Auto-Encoder |
Implement, in Lush or EBLearn, the Contracting Auto-Encoder model from
Yoshua Bengio's group.
The basic idea of CAE is to train an auto-encoder with a
regularization term that shrinks the sum of the squares of the entries
of the Jacobian of the auto-encoder function.
Please use
the original
ICML 2011 paper as a reference.
The Lush code from the backprop homework can be used as a basis.
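For reference, the penalty has a convenient closed form when the encoder
is a single sigmoid layer h = sigmoid(Wx + b), since the Jacobian entries
are h_j(1 - h_j) W_ji. A minimal NumPy sketch of that term (to be added,
with some weight, to the usual reconstruction loss):

    import numpy as np

    def sigmoid(u):
        return 1.0 / (1.0 + np.exp(-u))

    def contractive_penalty(x, W, b):
        h = sigmoid(W @ x + b)
        dh = h * (1.0 - h)                   # elementwise sigmoid'
        # ||J||_F^2 = sum_j (h_j (1 - h_j))^2 * sum_i W_ji^2
        return np.sum(dh**2 * np.sum(W**2, axis=1))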
Multi-step Denoising Autoencoder |
The denoising auto-encoder from Pascal Vincent is an auto-encoder
trained to "denoise" corrupted input samples.
The idea is to implement a denoising auto-encoder that has a recurrent
hidden layer. The recurrent hidden layer can be unfolded a few times,
so as to implement backprop-through-time.
The Lush code from the backprop homework can be used as a basis.
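A forward-pass sketch of one possible architecture (all names and shapes
are illustrative): the corrupted input is injected at every step, the
recurrent weights are shared across the T unrolled steps, and the stored
states are what backprop-through-time would use:

    import numpy as np

    def forward(x_corrupted, W_in, W_h, W_out, T=3):
        h = np.tanh(W_in @ x_corrupted)
        states = [h]
        for _ in range(T - 1):               # shared recurrent weights
            h = np.tanh(W_h @ h + W_in @ x_corrupted)
            states.append(h)
        x_hat = W_out @ h                    # reconstruction of the clean x
        return x_hat, states                 # states are kept for BPTT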
Differential Training |
Normally, one would train a learning machine to map
the inputs X1, X2,.... to their respective outputs Y1, Y2,...
In differential training, the regular square loss function
Lr(W,Xk,Yk) = || Yk - F(Xk,W) ||^2 (where F(Xk,W) is the
input-output function of the machine and W is its trainable
parameter) is augmented with an extra term that trains the
"slope" of the machine's function to be correct between two samples:
Ld(W,Xi,Yi,Xj,Yj) = || (Yi - Yj) - (F(Xi,W) - F(Xj,W)) ||^2
This is particularly interesting when Yi=Yj:
Ld(W,Xi,Xj) = || F(Xi,W) - F(Xj,W) ||^2
The overall loss is then:
L(W) = SUM_k Lr(W,Xk,Yk) + alpha SUM_ij Ld(W,Xi,Yi,Xj,Yj)
where alpha controls the importance of the differential term,
and where the SUM_ij is taken over a well-chosen subset
of pairs of samples, for example only pairs of samples
of the same category, or only pairs of samples of the same
category that are nearest neighbors in input space.
The advantage of differential training is that it may accelerate
learning: the alpha constant could be made rather large (without
changing the learning rate), because the differential term only
involves the differences of gradients of samples that are similar.
The idea is somewhat reminiscent of
the "Tangent
Prop" method.
Performance could be tested on
the MNIST dataset
(see instructions at the bottom of this page).
The Lush code from the backprop homework can be used as a basis.
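The combined loss is easy to write down explicitly; a small sketch, where
F(x, W) is any differentiable model and `pairs` is the chosen subset of
sample pairs (e.g. same-class nearest neighbors):

    import numpy as np

    def differential_loss(F, W, X, Y, pairs, alpha=1.0):
        Lr = sum(np.sum((Y[k] - F(X[k], W))**2) for k in range(len(X)))
        Ld = sum(np.sum(((Y[i] - Y[j]) - (F(X[i], W) - F(X[j], W)))**2)
                 for (i, j) in pairs)
        return Lr + alpha * Ld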
Joint Embedding of Inputs and Categories |
Training a classifier is complicated when the number of categories is
very large. One idea that recently became popular is to learn a joint
embedding of inputs and labels to a common embedding space.
In other words, given an input Xi and a label Yi, one trains two
machines Fx(Xi,Wx) and Fy(Yi,Wy) so that Fy(Yi,Wy) is closer to
Fx(Xi,Wx) than any other Fy(Yk,Wy) for other labels Yk.
A number
of papers from
Samy Bengio describe this idea and similar ones, particularly:
- G. Chechik, V. Sharma, U. Shalit, and S. Bengio.
Large scale online learning of image similarity through ranking.
Journal of Machine Learning Research, JMLR, 11:1109-1135, 2010.
- S. Bengio, J. Weston, and D. Grangier.
Label embedding trees for large multi-class tasks. NIPS 2010
- J. Weston, S. Bengio, and N. Usunier. Large scale image
annotation: Learning to rank with joint word-image embeddings. In
Proceedings of the European Conference on Machine Learning and
Principles and Practice of Knowledge Discovery in Databases,
ECML-PKDD, 2010
- J. Weston, S. Bengio, and N. Usunier. Large scale image
annotation: Learning to rank with joint word-image embeddings.
Machine Learning Journal, 81(1):21-35, 2010
The code from the backprop homework can be used as a basis.
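One common training criterion for such embeddings (a WSABIE-style ranking
hinge, shown here only as an illustrative sketch) pushes the embedding of
the correct label closer to the input embedding than that of a sampled
incorrect label, by some margin:

    import numpy as np

    def ranking_loss(fx, fy_pos, fy_neg, margin=1.0):
        # fx = Fx(Xi, Wx); fy_pos = Fy(Yi, Wy); fy_neg = Fy(Yk, Wy), k != i
        d_pos = np.sum((fx - fy_pos)**2)
        d_neg = np.sum((fx - fy_neg)**2)
        return max(0.0, margin + d_pos - d_neg)  # zero when ranked correctly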
Regularization through Persistent Contrastive Divergence |
Complex learning machines, neural nets in particular, map training samples
to their desired targets, but also map lots of nonsensical inputs to
perfectly formed targets. For example, if one trains a neural net on the
MNIST digits, many nonsensical combinations of input pixels will produce a
perfect output configuration and be recognized as one of the 10 digits with
a high score. This makes it difficult to use these systems not only to
recognize categories, but also to reject inputs that do not look like any
of the categories.
One can easily generate such non-sensical inputs by performing gradient descent
in input space so as to produce a perfect output. For example, if one wants
to produce a garbage input that is recognized as category Y, one simply
starts from a random input X and performs the iteration:
X <- X - dC(X,Y)/dX where C(X,Y) = || F(X,W) - Y ||^2 and
F(X,W) is the input-output function of the machine for the
(fixed) parameters W.
The resulting X will be a "hallucinated" sample of class Y.
The idea is to maintain and update a number of "hallucinated" vectors
and to train the machine to produce a "junk" target vector (e.g. all
-1's if we use a place code for the categories) when shown these
hallucinations. One would maintain 10-100 hallucinated vectors and
update them periodically, as indicated above, so that each produces a
perfect target vector for one of the categories (e.g. each of the 10
categories would be assigned 1/10th of the hallucination vectors). The
weights of the machine would be updated so
that the output of all the hallucinated vectors is the "junk" category
vector. The coefficient on the gradient for this update would have
to be kept smaller than for the regular update, in case some of the
hallucinations converge to one of the training samples. To prevent
that from happening too often, some noise should be added to the
hallucination vectors.
The whole procedure could be seen as a form of "persistent contrastive
divergence", if the hallucinations were updated with the so-called
Hybrid Monte Carlo method. Papers on persistent CD and HMC are
available
on Geoff
Hinton's page. His "contrastive back-propagation" is essentially
identical to what is described above (without the persistent CD part).
- Hinton, G. E., Osindero, S., Welling, M. and Teh, Y.
Unsupervised Discovery of Non-linear Structure using Contrastive Backpropagation.
Cognitive Science, 30:4, pp 725-731, 2006.
- Tieleman, T. and Hinton, G. E. Using Fast Weights to Improve
Persistent Contrastive Divergence. Proc. 26th International
Conference on Machine Learning, 2009
Performance could be tested on
the MNIST dataset
(see instructions at the bottom of this page).
The code from the backprop homework can be used as a basis.
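The hallucination step itself is just gradient descent in input space; a
hedged sketch (grad_x must be supplied by backprop through the frozen
network, and the noise term is the safeguard mentioned above):

    import numpy as np

    def hallucinate(grad_x, y_target, shape, lr=0.1, steps=200, noise=0.01):
        """grad_x(x, y) returns dC/dx for C(x, y) = ||F(x, W) - y||^2,
        computed by backprop through the network with W held fixed."""
        x = np.random.randn(*shape)
        for _ in range(steps):
            x -= lr * grad_x(x, y_target)
            x += noise * np.random.randn(*shape)  # keeps x off the data
        return x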
Hessian-Free Optimization |
Implement James Martens's "Hessian Free Optimization" algorithm for
recurrent neural nets. See
his ICML'10
and ICML'11 papers.
The code from the backprop homework can be used as a basis.
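The central trick in Hessian-free optimization is computing Hessian-vector
products without ever forming the Hessian, and feeding them to conjugate
gradient. Martens uses exact products (the R-operator, or Gauss-Newton
products); the finite-difference version below is a simple stand-in to
convey the idea:

    import numpy as np

    def hessian_vector_product(grad_fn, w, v, eps=1e-5):
        """Approximate H v, where H is the Hessian of the loss at w and
        grad_fn(w) returns the gradient vector."""
        return (grad_fn(w + eps * v) - grad_fn(w - eps * v)) / (2.0 * eps)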
Learning the Non-Linearity in Neural Nets |
The idea is to parameterize the non-linearity in neural nets so that
it can be learned, just like the weights. It may be a good idea to
restrict ourselves to monotonically increasing functions. The question
would be to find a good parameterization of monotonically increasing
functions (e.g. using some polynomial family, perhaps
Chebyshev
polynomials).
Performance could be tested on
the MNIST dataset
(see instructions at the bottom of this page).
The code from the backprop homework can be used as a basis.
Alternatively, the Lush script mentioned at the bottom of this
page can be used (it uses the gblearn2 library).
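One plausible parameterization (an assumption for illustration, not
prescribed by the project): a positive combination of shifted tanh units,
which is monotonically increasing by construction since every term is:

    import numpy as np

    def softplus(u):
        return np.log1p(np.exp(u))           # keeps coefficients positive

    def learned_nonlinearity(x, a, b):
        # a, b: trainable vectors of length K; f is increasing in x
        return sum(softplus(ak) * np.tanh(x + bk) for ak, bk in zip(a, b))

The gradients with respect to a and b are straightforward, so these
parameters can be trained by backprop alongside the weights.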
ConvNet on the STL-10 Dataset |
Train a ConvNet on the STL-10 dataset.
The Lush script mentioned at the bottom of this page can be modified
for this purpose (it uses the Lush gblearn2 library). Alternatively,
EBLearn can be used.
Port Torch7's CUDA/SSE backend to Lush |
(for the hackers)
Torch 7 is a software
environment for machine learning with an efficient numerical back-end
in C, and a front-end interpreter based on
the Lua interpreted language.
One nice aspect of Torch7 is its numerical back-end, which is
built around an array structure similar to Lush's idx.
It has optimized numerical functions written in SSE assembly
code, as well as some functions written in CUDA to run on NVidia GPUs.
This project will be to port Torch7's CUDA/SSE backend to Lush.
Details can be obtained from Clement Farabet (clement.farabet AT gmail
DOT com) or Soumith Chintala (sc3104 AT nyu DOT edu).
The MNIST training script provided at the bottom of this page
can be used for speed comparison.
Graph Compiler for Lush |
(for the hackers)
Design a mechanism in Lush that will take a graph of interconnected
computing nodes, and will produce optimized and compilable Lush
functions for the fprop and bprop. We assume that the fprop and
bprop functions of each of the nodes are known.
Real hackers should have a look at Jeff Siskind and Barak
Pearlmutter's publications and code on the topic,
particularly their Stalingrad compiler and automatic differentiator
(not for the faint of heart).
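As a toy illustration of the composition idea (a linear chain rather than
a general graph, and in Python rather than Lush): given each node's fprop
and bprop, whole-network fprop and bprop functions can be assembled
mechanically; a real version would topologically sort a DAG and emit
compilable code:

    def make_fprop_bprop(nodes):
        # nodes: list of (fprop, bprop) pairs; bprop(x, dy) returns dx
        # given the node's input x and the gradient dy of its output.
        def fprop(x):
            xs = [x]
            for f, _ in nodes:
                xs.append(f(xs[-1]))
            return xs                        # intermediates kept for bprop
        def bprop(xs, dy):
            for (_, b), x in zip(reversed(nodes), reversed(xs[:-1])):
                dy = b(x, dy)
            return dy
        return fprop, bprop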
Recurrent ConvNet for Image segmentation |
The idea is to train a convolutional network to perform image segmentation
(for example, to label the edges in an image). This will be tested on
the Berkeley image segmentation dataset.
Background information on using ConvNets for segmentation can be found
here.
The convnet architecture would be recurrent, in that multiple
successive layers would share the same convolution kernels.
The Lush code from the backprop homework can be used as a basis.
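A forward-pass sketch of the recurrent (weight-shared) aspect, using
SciPy's convolution for brevity; the gradient of the shared kernel is the
sum of its gradients over all applications, which is exactly what
backprop through the unrolled layers computes:

    import numpy as np
    from scipy.signal import convolve2d

    def recurrent_conv_forward(x, kernel, depth=3):
        states = [x]
        for _ in range(depth):               # same kernel at every layer
            x = np.tanh(convolve2d(x, kernel, mode='same'))
            states.append(x)
        return x, states                     # states needed for backprop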
Belief Propagation for Music Composition |
The purpose of the project is to build a factor graph model
that implements the rules of simple counterpoint.
Given a few notes, the model can be used to "fill in the blanks"
by finding melodies that satisfy all the rules.
Each rule can be seen as a factor in a factor graph.
The best note sequence will be obtained with Belief Propagation
or some other efficient inference procedure.
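For a chain-structured factor graph (each rule constraining consecutive
notes), max-product reduces to Viterbi-style dynamic programming; a
sketch, with made-up score tables standing in for the counterpoint rules
(rules spanning more than two notes would require general belief
propagation):

    import numpy as np

    def best_melody(unary, pairwise):
        """unary: (T, P) score of each pitch at each position;
        pairwise: (P, P) score of pitch b following pitch a.
        Returns the highest-scoring pitch sequence."""
        T, P = unary.shape
        dp = unary[0].copy()
        back = np.zeros((T, P), dtype=int)
        for t in range(1, T):
            cand = dp[:, None] + pairwise    # cand[a, b]: prev a -> cur b
            back[t] = np.argmax(cand, axis=0)
            dp = cand[back[t], np.arange(P)] + unary[t]
        seq = [int(np.argmax(dp))]
        for t in range(T - 1, 0, -1):
            seq.append(int(back[t][seq[-1]]))
        return seq[::-1]

"Filling in the blanks" amounts to clamping the given notes, e.g. by
setting the unary scores of all other pitches at those positions to
-infinity.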
Segmentation-Free OCR with a ConvNet |
Train the convolutional net implemented in Lush on snippets of images
extracted from scanned pages. The snippets should include neighboring
characters, so that the network output will be robust to "distracting"
characters on each side of the character to be recognized.
The purpose is to demonstrate a segmentation-free "brute force" OCR by
sweeping the convolutional network over an entire page of text.
Students can use the convolutional network implementations from the gblearn2
library in Lush (in lush/packages/gblearn2) or from the EBLearn C++
library.
The Google 1000 dataset may be used for this; there is a Python
script to download the entire dataset.
Predicting Financial Data |
We have a dataset consisting of description vectors of various
companies, together with a variable that indicates whether the company
defaulted on their loans.
The project consists in predicting whether the company will default
using various methods, including neural nets, logistic regression,
SVM, and perhaps other methods (each project team should pick a good
subset).
The complication resides in the fact that some variables are missing,
hence a latent variable inference model should be used.
Contact the instructor to obtain the dataset.
Lp pooling for Convolutional Nets |
Convolutional networks have been trained with a number of different
pooling functions, such as average, average followed by tanh, and max.
This project concerns the implementation and test of other pooling
functions, such as (SUM_i X_i^p)^(1/p), also known as "Lp norm",
as well as 1/b log[ SUM_i exp(b*X_i) ], also known as log-sum pooling.
This will have to be implemented either within Lush, Torch7, or EBLearn,
and tested on the MNIST or CIFAR-10 dataset.
Students should contact Pierre Sermanet for the code and datasets
(sermanet AT cs DOT nyu DOT edu).
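Both pooling functions are one-liners on a pooling window (shown here on
a flattened window; in the convnet they would be applied over each
spatial region of each feature map):

    import numpy as np

    def lp_pool(x, p=2.0):
        return np.sum(np.abs(x)**p)**(1.0 / p)    # Lp-norm pooling

    def log_sum_exp_pool(x, b=1.0):
        # tends to max pooling as b grows, to average-like for small b
        return np.log(np.sum(np.exp(b * x))) / b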
Implement Standard Learning Algorithms with Eblearn |
Eblearn is a C++ library that implements
classes and functionalities similar to Lush's gblearn2 package for gradient-based
and energy-based learning.
This series of projects consists in implementing a number of standard algorithms
and applications using eblearn. Suggested algorithms include Adaboost, PCA,
K-Means, Mixtures of Gaussians.
Students should contact Pierre Sermanet (pierre DOT sermanet AT gmail
DOT com), the principal maintainer of EBLearn, for information about
which modules and functionality should be implemented.
Text/Image Segmenter for DjVuLibre |
Implement a foreground/background segmenter for DjVuLibre.
DjVuLibre is the open source version of the DjVu system.
It currently lacks a good segmenter that can separate
the text and drawings from the backgrounds and continuous-tone
images in a scanned document.
The project will consist in building a foreground/background segmenter
by using various methods, such as color clustering, spectral
clustering, and others.
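The simplest baseline along those lines is 2-means clustering of pixel
colors; a hedged sketch (a usable segmenter would need spatial smoothing
and better handling of continuous-tone regions):

    import numpy as np

    def two_means_segment(img, iters=20):
        """img: (H, W, 3) array. Returns a boolean foreground mask."""
        pix = img.reshape(-1, 3).astype(float)
        centers = pix[np.random.choice(len(pix), 2, replace=False)]
        for _ in range(iters):
            d = ((pix[:, None, :] - centers[None, :, :])**2).sum(-1)
            assign = np.argmin(d, axis=1)
            for c in range(2):
                if np.any(assign == c):
                    centers[c] = pix[assign == c].mean(axis=0)
        fg = np.argmin(centers.sum(axis=1))   # darker cluster = text/ink
        return (assign == fg).reshape(img.shape[:2])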
Learning with Conjugate Gradient in Eblearn |
Implement and test Polak-Ribiere and Fletcher-Reeves Conjugate
Gradient optimizations within
the Eblearn C++ library.
Compare it with stochastic gradient on regression problems.
This comes down to implementing a new subclass of the "parameter"
class in which the "update" method will be redefined.
Optional: implement quasi-Newton methods such as limited-memory BFGS.
References:
- Y. LeCun, L. Bottou, G. Orr, and K. Muller,
"Efficient BackProp,"
in Neural Networks: Tricks of the Trade, 1998.
- Martin F. Møller,
A Scaled
Conjugate Gradient Algorithm for Fast Supervised Learning, 1993.
- D. C. Liu and J. Nocedal,
"On the Limited Memory BFGS Method for Large Scale Optimization,"
Mathematical Programming B, 45(3), pp. 503-528, 1989.
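For orientation, nonlinear conjugate gradient with the Polak-Ribiere
update looks as follows (a sketch: the fixed step stands in for the line
search that a real implementation, e.g. the redefined "update" method,
must provide; Fletcher-Reeves only changes the beta formula):

    import numpy as np

    def conjugate_gradient(grad_fn, w, step=1e-2, iters=100):
        g = grad_fn(w)
        d = -g
        for _ in range(iters):
            w = w + step * d                  # a line search belongs here
            g_new = grad_fn(w)
            # Polak-Ribiere beta, clamped at zero ("PR+")
            beta = max(0.0, g_new @ (g_new - g) / (g @ g + 1e-12))
            d = -g_new + beta * d
            g = g_new
        return w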
Getting the MNIST dataset |
Some projects require you to download the MNIST dataset. Instructions are available below.
The MNIST dataset is composed of 60,000 training samples and 10,000
test samples of handwritten digits. Details about MNIST are available
here.
An additional set of digits produced by randomly distorting the
original MNIST images is available (see below).
To obtain the data:
BE CAREFUL: your browser may gunzip the files without telling you.
Each of those files is a Lush matrix that can be loaded into lush
with (load-matrix "thefile"), or more efficiently, it can be
memory-mapped in RAM with (mmap-matrix "thefile").
A Lush script is available to train a convolutional network
on MNIST. The script can be obtained
here. The convolutional net has 20 feature maps at the first
and second layers, and 50 at the third and fourth.
The script should give you 0.7% to 0.8% error on the test set after
training on the original training set (no distorted samples).
The net is quite a bit bigger than the one described in the
paper "Gradient-Based Learning Applied to Document Recognition".