
Machine Learning and Pattern Recognition: Final Projects


[ Course Homepage | Schedule and Course Material | Mailing List ]

This is the list of final project topics.

Projects can be done individually, in groups of 2, or (with permission) in groups of 3.

You can propose your own project topic, subject to approval by the instructor.

You must send a .tar or .tgz file to the TA with your code and a PDF file describing your project and the results you obtained.

Sparse Coding in Unusual Algebras

Sparse coding is a feature extraction method that maps inputs X to codes (feature vectors) Z by minimizing an energy function of the type: E(X,Z,W) = || X - WZ ||^2 + alpha |Z|, where W is a so-called dictionary matrix (whose columns are normalized to 1), and |Z| denotes the L1 norm of Z (the sum of the absolute values of the components). The optimal code Z* for a given input X is the one that minimizes E(X,Z,W). Learning the W matrix can be performed using stochastic gradient descent: W <- W - eta.dE(X,Z*,W)/dW, renormalizing the columns of W to 1 at every step.

The idea is to replace the additive combination of the basis functions by a max operation or some other operation. In regular sparse coding, the reconstruction is of the form W.Z = SUM_i Wi.Zi, where Wi is a column of W. In our "unusual" sparse coding, the reconstruction would be W.Z = MAX_i Wi.Zi, or W.Z = 1/b.log[ SUM_i exp(b.Wi.Zi) ] where b is some constant, or W.Z = [ SUM_i (Wi.Zi)^p ]^{1/p} (the latter assumes positive X), or some other combination rule.

Implement one or more of these exotic sparse coding schemes, with the learning procedure. Train it on 12x12 patches extracted from natural images.
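
To make the recipe concrete, here is a minimal NumPy sketch of the log-sum-exp variant (the actual project should be done in Lush; the architecture sizes, step sizes, and the ISTA-style inference loop below are illustrative assumptions, not the required implementation). The code Z is inferred by gradient steps on the reconstruction term followed by soft-thresholding for the L1 term, and the dictionary is updated by stochastic gradient with column renormalization.

import numpy as np

def lse_reconstruct(W, Z, b=5.0):
    """Soft-max reconstruction: R_j = (1/b) log SUM_i exp(b * W[j,i] * Z[i])."""
    A = b * W * Z[None, :]                       # (n_out, n_code)
    m = A.max(axis=1, keepdims=True)             # for numerical stability
    S = np.exp(A - m)
    R = (m[:, 0] + np.log(S.sum(axis=1))) / b
    P = S / S.sum(axis=1, keepdims=True)         # softmax weights: dR_j/d(W[j,i]*Z[i])
    return R, P

def infer_code(X, W, alpha=0.1, b=5.0, lr=0.05, n_steps=100):
    """ISTA-style inference of the sparse code Z* for one patch X."""
    Z = np.zeros(W.shape[1])
    for _ in range(n_steps):
        R, P = lse_reconstruct(W, Z, b)
        err = R - X
        grad = 2.0 * (err[:, None] * P * W).sum(axis=0)           # d||X-R||^2 / dZ_i
        Z = Z - lr * grad
        Z = np.sign(Z) * np.maximum(np.abs(Z) - lr * alpha, 0.0)  # soft threshold (L1 term)
    return Z

def dictionary_step(X, W, Z, eta=0.01, b=5.0):
    """One stochastic gradient step on W, then renormalize the columns to 1."""
    R, P = lse_reconstruct(W, Z, b)
    err = R - X
    gradW = 2.0 * err[:, None] * P * Z[None, :]   # d||X-R||^2 / dW[j,i]
    W = W - eta * gradW
    return W / np.maximum(np.linalg.norm(W, axis=0, keepdims=True), 1e-8)

# toy usage on random 12x12 "patches" (real patches would come from natural images)
rng = np.random.RandomState(0)
W = rng.randn(144, 256); W /= np.linalg.norm(W, axis=0, keepdims=True)
for _ in range(10):
    X = rng.randn(144)
    Z = infer_code(X, W)
    W = dictionary_step(X, W, Z)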

A number of papers on a similar topic (although in a probabilistic/Bayesian framework) can be found on Joerg Luecke's page, particularly his NIPS 2010, NIPS 2009, and JMLR 2008 papers.

Parallelizing Stochastic Gradient

Stochastic gradient is difficult to parallelize because each sample is run sequentially (unlike batch training where the samples can be run in parallel). The present project idea is to train multiple convnets in parallel with stochastic gradient, and to synchronize their weight parameters periodically, but rarely enough so that the communication cost is minimal.
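
As a purely illustrative sketch of the synchronization idea (not the Lush/sockets setup described above, and on a toy linear least-squares problem rather than a convnet), here several simulated workers run SGD on their own data shards and their parameter vectors are averaged every sync_every steps; the data sizes and step sizes are arbitrary placeholders.

import numpy as np

rng = np.random.RandomState(0)
n, d, n_workers = 10000, 20, 4
w_true = rng.randn(d)
X = rng.randn(n, d)
y = X @ w_true + 0.1 * rng.randn(n)

# split the data among workers
shards = np.array_split(np.arange(n), n_workers)
params = [np.zeros(d) for _ in range(n_workers)]

lr, sync_every, n_steps = 0.01, 100, 5000
for t in range(n_steps):
    for k in range(n_workers):
        i = rng.choice(shards[k])            # one sample per step
        err = X[i] @ params[k] - y[i]
        params[k] -= lr * err * X[i]         # SGD step on the squared loss
    if (t + 1) % sync_every == 0:            # rare synchronization to keep communication cheap
        avg = np.mean(params, axis=0)
        params = [avg.copy() for _ in range(n_workers)]

print("error:", np.linalg.norm(np.mean(params, axis=0) - w_true))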

Lush provides an object serialization infrastructure that makes it easy to run multiple Lush processes that talk to each other through sockets.

One idea would be to implement a form of Augmented Lagrangian method for parallelizing stochastic gradient on multiple machines.

More information can be obtained from Sixin Zhang (sixin.zhang AT gmail DOT com)

A similar project was carried out by David Eigen (deigen AT cs DOT nyu DOT edu) in 2010. You can get his code and build on top of it. See if you can do better.

Details on augmented Lagrangian methods can be found in Stephen Boyd's paper Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers.

Note: Stephen Boyd is currently on sabbatical at Courant.

Trainable Contrast Normalization

Convolutional nets and multilayer neural nets perform well when the state of each layer has zero mean, and when the state variables all have similar standard deviations.

The goal of this project is to implement a trainable subtractive and divisive contrast normalization layer for convolutional networks (currently, such modules are set up by hand rather than trained).

Subtractive contrast normalization is a trainable convolutional module such that every variable in a set of feature maps (a 3D array) is replaced by itself minus a (trainable) linear combination of its neighbors, in such a way that the output is small (e.g. so that it minimizes the L2 norm of the output state).
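
A minimal NumPy sketch of the subtractive part, under simplifying assumptions (a single feature map instead of a 3D array of maps, circular padding via np.roll, and arbitrary step sizes): each output value is the input minus a learned linear combination of its neighbors, and the neighbor weights are fit by gradient descent to make the output small in the L2 sense.

import numpy as np

def neighbor_offsets(radius=2):
    """All (dy, dx) offsets in a (2r+1)x(2r+1) window, excluding the center."""
    return [(dy, dx) for dy in range(-radius, radius + 1)
                     for dx in range(-radius, radius + 1) if (dy, dx) != (0, 0)]

def subtractive_norm(X, w, offsets):
    """Y = X - SUM_k w_k * shift(X, offset_k)   (circular padding for simplicity)."""
    Y = X.copy()
    for wk, (dy, dx) in zip(w, offsets):
        Y -= wk * np.roll(X, shift=(dy, dx), axis=(0, 1))
    return Y

def train_weights(patches, radius=2, lr=1e-3, n_epochs=50):
    """Fit the neighbor weights so that the output has a small mean squared value."""
    offsets = neighbor_offsets(radius)
    w = np.zeros(len(offsets))
    for _ in range(n_epochs):
        for X in patches:
            Y = subtractive_norm(X, w, offsets)
            # d mean(Y^2) / dw_k = -2 * mean(Y * shift(X, offset_k))
            grad = np.array([-2.0 * np.mean(Y * np.roll(X, (dy, dx), axis=(0, 1)))
                             for (dy, dx) in offsets])
            w -= lr * grad
    return w, offsets

rng = np.random.RandomState(0)
patches = [rng.randn(12, 12) + 3.0 for _ in range(50)]   # toy stand-ins for image patches
w, offsets = train_weights(patches)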

The divisive contrast normalization is identical, except that the input variables are first mapped through a function that approximates log(abs()). The function must be smoothly clipped so that it doesn't blow up around zero. The outputs are mapped through the inverse function. The sign must be treated separately.

Students should contact Ross Goroshin for a version of the code which can be improved upon (rgoroshin AT gmail DOT com).

Contracting Auto-Encoder

Implement in Lush or EBLearn the Contracting Auto-Encoder Model from Yoshua Bengio and his group.

The basic idea of the CAE is to train an auto-encoder with a regularization term that penalizes the sum of the squares of the entries of the Jacobian of the encoder function (the derivatives of the hidden representation with respect to the input).
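
For reference, here is a small NumPy sketch of the CAE objective for a tied-weights sigmoid encoder with a linear decoder (an assumed architecture; the hyperparameter lam and the layer sizes are placeholders). For such an encoder the Jacobian penalty has a closed form. Gradients would come from backprop, e.g. on top of the homework code.

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cae_loss(x, W, b, c, lam=0.1):
    """Reconstruction loss + contractive penalty.

    Encoder:  h = sigmoid(W x + b)
    Decoder:  x_hat = W^T h + c              (tied weights, linear decoder: an assumption)
    Penalty:  ||J||_F^2 = SUM_i (h_i (1 - h_i))^2 * ||W_i||^2
              since dh_i/dx_j = h_i (1 - h_i) W_ij.
    """
    h = sigmoid(W @ x + b)
    x_hat = W.T @ h + c
    recon = np.sum((x - x_hat) ** 2)
    jac_pen = np.sum((h * (1.0 - h)) ** 2 * np.sum(W ** 2, axis=1))
    return recon + lam * jac_pen

rng = np.random.RandomState(0)
x = rng.rand(144)                       # e.g. a 12x12 patch, flattened
W = 0.1 * rng.randn(64, 144)
b, c = np.zeros(64), np.zeros(144)
print(cae_loss(x, W, b, c))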

Please use the original ICML 2011 paper as a reference.

The Lush code from the backprop homework can be used as a basis.

Multi-step Denoising Autoencoder

The denoising auto-encoder from Pascal Vincent is an auto-encoder trained to "denoise" corrupted input samples.

The idea is to implement a denoising auto-encoder that has a recurrent hidden layer. The recurrent hidden layer can be unfolded a few times, so as to implement backprop-through-time.
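
A minimal NumPy sketch of the forward pass, under assumed choices (tanh recurrent hidden layer unfolded for T steps, Gaussian corruption, linear decoder, squared reconstruction error against the clean input). Training would unfold this graph and run backprop-through-time, e.g. on top of the homework code.

import numpy as np

def forward(x_clean, W_in, W_rec, W_out, b_h, b_out, T=3, noise=0.3, rng=None):
    """Unfolded recurrent denoising auto-encoder: returns reconstruction and loss."""
    rng = rng or np.random.RandomState(0)
    x_noisy = x_clean + noise * rng.randn(*x_clean.shape)   # corrupt the input
    h = np.zeros(W_rec.shape[0])
    for _ in range(T):                                      # unfold the recurrent hidden layer
        h = np.tanh(W_in @ x_noisy + W_rec @ h + b_h)
    x_hat = W_out @ h + b_out                               # linear decoder
    loss = np.sum((x_hat - x_clean) ** 2)
    return x_hat, loss

rng = np.random.RandomState(0)
d, n_hid = 144, 64
params = (0.1 * rng.randn(n_hid, d), 0.1 * rng.randn(n_hid, n_hid),
          0.1 * rng.randn(d, n_hid), np.zeros(n_hid), np.zeros(d))
x = rng.rand(d)
x_hat, loss = forward(x, *params, rng=rng)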

The Lush code from the backprop homework can be used as a basis.

Differential Training

Normally, one would train a learning machine to map the inputs X1, X2, ... to their respective outputs Y1, Y2, ...

In differential training, the regular square loss Lr(W,Xk,Yk) = || Yk - F(Xk,W) ||^2 (where F(Xk,W) is the input-output function of the machine and W is its trainable parameter) is augmented with an extra term that trains the "slope" of the machine's function to be correct between two samples: Ld(W,Xi,Yi,Xj,Yj) = || (Yi - Yj) - (F(Xi,W) - F(Xj,W)) ||^2. This is particularly interesting when Yi = Yj, in which case Ld(W,Xi,Xj) = || F(Xi,W) - F(Xj,W) ||^2. The overall loss is then L(W) = SUM_k Lr(W,Xk,Yk) + alpha SUM_ij Ld(W,Xi,Yi,Xj,Yj), where alpha controls the importance of the differential term, and where the SUM_ij is taken over a well-chosen subset of pairs of samples, for example only pairs of samples of the same category, or only pairs of the same category that are nearest neighbors in input space.
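
To make the loss concrete, here is a small NumPy sketch that assumes a linear machine F(X,W) = W X (so the gradients are easy to write by hand); the pairs, dimensions, and step size are placeholders. For the neural nets in the project, the same two terms would be handled by the homework backprop code.

import numpy as np

def diff_training_step(W, Xb, Yb, pairs, alpha=1.0, lr=0.01):
    """One gradient step on L = SUM_k ||Y_k - W X_k||^2
                             + alpha * SUM_(i,j) ||(Y_i - Y_j) - (W X_i - W X_j)||^2."""
    grad = np.zeros_like(W)
    for x, y in zip(Xb, Yb):                          # regular square-loss term
        grad += 2.0 * np.outer(W @ x - y, x)
    for i, j in pairs:                                # differential ("slope") term
        dx, dy = Xb[i] - Xb[j], Yb[i] - Yb[j]
        grad += 2.0 * alpha * np.outer(W @ dx - dy, dx)
    return W - lr * grad

rng = np.random.RandomState(0)
Xb = [rng.randn(10) for _ in range(8)]
Yb = [rng.randn(3) for _ in range(8)]
pairs = [(0, 1), (2, 3), (4, 5)]                      # e.g. same-category pairs
W = 0.01 * rng.randn(3, 10)
for _ in range(100):
    W = diff_training_step(W, Xb, Yb, pairs)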

The advantage of differential training is that it may accelerate learning: the alpha constant could be made rather large (without changing the learning rate), because the differential term only involves the differences of gradients of samples that are similar.

The idea is somewhat reminiscent of the "Tangent Prop" method.

Performance could be tested on the MNIST dataset (see instructions at the bottom of this page).

The Lush code from the backprop homework can be used as a basis.

Joint Embedding of Inputs and Categories

Training a classifier is complicated when the number of categories is very large. One idea that has recently become popular is to learn a joint embedding of inputs and labels into a common embedding space. In other words, given an input Xi and a label Yi, one trains two machines Fx(Xi,Wx) and Fy(Yi,Wy) so that Fy(Yi,Wy) is closer to Fx(Xi,Wx) than any other Fy(Yk,Wy) for other labels Yk.
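
One common way to train such embeddings is a hinge ranking loss; here is a minimal NumPy sketch with assumed linear embeddings (Fx(x) = Wx x, and Fy(y) = the y-th column of Wy) and dot-product similarity. This is only an illustration of the loss and its gradients; the papers below describe the actual large-scale methods.

import numpy as np

def ranking_step(Wx, Wy, x, y_pos, y_neg, margin=1.0, lr=0.01):
    """One SGD step on max(0, margin - fx . fy_pos + fx . fy_neg)."""
    fx = Wx @ x
    f_pos, f_neg = Wy[:, y_pos], Wy[:, y_neg]
    loss = margin - fx @ f_pos + fx @ f_neg
    if loss > 0:                                   # only update on margin violations
        Wx -= lr * np.outer(f_neg - f_pos, x)      # dL/dWx = (f_neg - f_pos) x^T
        Wy[:, y_pos] += lr * fx                    # dL/df_pos = -fx
        Wy[:, y_neg] -= lr * fx                    # dL/df_neg = +fx
    return Wx, Wy

rng = np.random.RandomState(0)
d_in, d_emb, n_labels = 50, 16, 1000
Wx = 0.01 * rng.randn(d_emb, d_in)
Wy = 0.01 * rng.randn(d_emb, n_labels)
x, y_true, y_wrong = rng.randn(d_in), 3, 7         # one input, its label, a wrong label
Wx, Wy = ranking_step(Wx, Wy, x, y_true, y_wrong)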

A number of papers from Samy Bengio describe this idea and similar ones, particularly:

  • G. Chechik, V. Sharma, U. Shalit, and S. Bengio. Large scale online learning of image similarity through ranking. Journal of Machine Learning Research, JMLR, 11:1109-1135, 2010.
  • S. Bengio, J. Weston, and D. Grangier. Label embedding trees for large multi-class tasks. NIPS 2010
  • J. Weston, S. Bengio, and N. Usunier. Large scale image annotation: Learning to rank with joint word-image embeddings. In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, ECML-PKDD, 2010
  • J. Weston, S. Bengio, and N. Usunier. Large scale image annotation: Learning to rank with joint word-image embeddings. Machine Learning Journal, 81(1):21-35, 2010

The code from the backprop homework can be used as a basis.

Regularization through Persistent Contrastive Divergence

Complex learning machines, and neural nets in particular, map training samples to their desired targets, but also map lots of nonsensical inputs to perfectly formed targets. For example, if one trains a neural net on the MNIST digits, many nonsensical combinations of input pixels will produce perfect output configurations and will be recognized as one of the 10 digits with a high score. This makes it difficult to use these systems not only to recognize categories, but also to reject things that do not look like any of the categories.

One can easily generate such non-sensical inputs by performing gradient descent in input space so as to produce a perfect output. For example, if one wants to produce a garbage input that is recognized as category Y, one simply starts from a random input X and performs the iteration: X <- X - dC(X,Y)/dX where C(X,Y) = || F(X,W) - Y ||^2 and F(X,W) is the input-output function of the machine for the (fixed) parameters W. The resulting X will be a "hallucinated" sample of class Y.
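
Here is a minimal NumPy sketch of that hallucination step for an assumed one-hidden-layer tanh network (sizes and step size are placeholders); in the project, the same gradient-descent-in-input-space step would be applied to the network from the homework code.

import numpy as np

def hallucinate(x, y_target, W1, b1, W2, b2, lr=0.1, n_steps=200):
    """Gradient descent in input space on C(X,Y) = ||F(X,W) - Y||^2
    for a fixed one-hidden-layer tanh network F (the weights are not updated)."""
    for _ in range(n_steps):
        h = np.tanh(W1 @ x + b1)                 # forward pass
        out = W2 @ h + b2
        d_out = 2.0 * (out - y_target)           # dC/d out
        d_h = (W2.T @ d_out) * (1.0 - h ** 2)    # backprop through tanh
        d_x = W1.T @ d_h                         # dC/dX
        x = x - lr * d_x
    return x

rng = np.random.RandomState(0)
d_in, n_hid, n_out = 784, 100, 10
W1, b1 = 0.1 * rng.randn(n_hid, d_in), np.zeros(n_hid)
W2, b2 = 0.1 * rng.randn(n_out, n_hid), np.zeros(n_out)
y = -np.ones(n_out); y[3] = 1.0                  # place code for class 3
x_hallu = hallucinate(rng.randn(d_in), y, W1, b1, W2, b2)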

The idea is to maintain and update a number of "hallucinated" vectors and to train the machine to produce a "junk" target vector (e.g. all -1's if we use a place code for the categories) when shown these hallucinations. One would maintain 10-100 hallucinated vectors, and update them periodically so that each produces a perfect target vector for one of the categories, as indicated above (e.g. each of the 10 categories would be assigned 1/10th of the hallucination vectors). The weights of the machine would then be updated so that the output for all the hallucinated vectors is the "junk" category vector. The coefficient on the gradient for this update would have to be kept smaller than for the regular update, in case some of the hallucinations converge to one of the training samples. To prevent that from happening too often, some noise should be added to the hallucination vectors.

The whole procedure could be seen as a form of "persistent contrastive divergence", if the hallucinations were updated with the so-called Hybrid Monte Carlo method. Papers on persistent CD and HMC are available on Geoff Hinton's page. His "contrastive back-propagation" is essentially identical to what is described above (without the persistent CD part).

  • Hinton, G. E., Osindero, S., Welling, M. and Teh, Y. Unsupervised Discovery of Non-linear Structure using Contrastive Backpropagation. Cognitive Science, 30:4, pp 725-731, 2006.
  • Tieleman, T. and Hinton, G. E. Using Fast Weights to Improve Persistent Contrastive Divergence. Proc. 26th International Conference on Machine Learning, 2009

Performance could be tested on the MNIST dataset (see instructions at the bottom of this page).

The code from the backprop homework can be used as a basis.

Hessian-Free Optimization

Implement James Martens's "Hessian Free Optimization" algorithm for recurrent neural nets. See his ICML'10 and ICML'11 papers.

The code from the backprop homework can be used as a basis.

Learning the Non-Linearity in Neural Nets

The idea is to parameterize the non-linearity in neural nets so that it can be learned, just like the weights. It may be a good idea to restrict ourselves to monotonically increasing functions. The main question is to find a good parameterization of monotonically increasing functions (e.g. using some polynomial family, perhaps Chebyshev polynomials).
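
One possible parameterization, shown below as a NumPy sketch, is a positive combination of shifted tanh units rather than the polynomial family suggested above (this is an assumption for illustration, not the required choice): since each term is non-decreasing in x and the weights exp(a_k) are positive, the learned function stays monotone for any parameter values. The parameter gradients are what would be fed into backprop.

import numpy as np

def learned_nonlinearity(x, a, mu, s):
    """Monotone nonlinearity: f(x) = SUM_k exp(a_k) * tanh(s_k^2 * (x - mu_k)).
    A positive combination of non-decreasing functions is non-decreasing."""
    x = np.asarray(x)[..., None]                       # broadcast over the k units
    return np.sum(np.exp(a) * np.tanh((s ** 2) * (x - mu)), axis=-1)

def d_nonlinearity_d_params(x, a, mu, s):
    """Gradients of f(x) with respect to the parameters, for use inside backprop."""
    x = np.asarray(x)[..., None]
    t = np.tanh((s ** 2) * (x - mu))
    sech2 = 1.0 - t ** 2
    d_a = np.exp(a) * t                                # df/da_k
    d_mu = -np.exp(a) * sech2 * (s ** 2)               # df/dmu_k
    d_s = np.exp(a) * sech2 * 2.0 * s * (x - mu)       # df/ds_k
    return d_a, d_mu, d_s

a = np.zeros(5); mu = np.linspace(-2, 2, 5); s = np.ones(5)
xs = np.linspace(-3, 3, 7)
print(learned_nonlinearity(xs, a, mu, s))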

Performance could be tested on the MNIST dataset (see instructions at the bottom of this page).

The code from the backprop homework can be used as a basis. Alternatively, the Lush script mentioned at the bottom of this page can be used (it uses the gblearn2 library).

ConvNet on STL-10

Train a ConvNet on the STL-10 dataset.

The Lush script mentioned at the bottom of this page can be modified for this purpose (it uses the Lush gblearn2 library). Alternatively, EBLearn can be used.

Port Torch7's CUDA/SSE backend to Lush

(for the hackers)

Torch7 is a software environment for machine learning with an efficient numerical back-end in C, and a front-end interpreter based on the Lua language.

One nice aspect of Torch7 is its numerical backend, which is built around an array structure similar to Lush's idx. It has optimized numerical functions written in SSE assembly code, as well as some functions written in CUDA to run on NVidia GPUs.

This project will be to port Torch7's CUDA/SSE backend to Lush.

Details can be obtained from Clement Farabet (clement.farabet AT gmail DOT com) or Soumith Chintalah (sc3104 AT nyu DOT edu).

The MNIST training script provided at the bottom of this page can be used for speed comparison.

Flow graph compiler

(for the hackers)

Design a mechanism in Lush that takes a graph of interconnected computing nodes and produces optimized, compilable Lush functions for the fprop and bprop passes. We assume that the fprop and bprop functions of each of the nodes are known.

Real hackers should have a look at Jeff Siskind and Barak Pearlmutter's publications and code on the topic, particularly their Stalingrad compiler and automatic differentiator (not for the faint of heart).

Recurrent ConvNet for Image segmentation

The idea is to train a convolutional network to perform image segmentation (for example, to label the edges in an image). This will be tested on the Berkeley image segmentation dataset.

Background information on using ConvNets for segmentation can be found here.

The convnet architecture would be recurrent, in that multiple successive layers would share the same convolution kernels.

The Lush code from the backprop homework can be used as a basis.

Belief Propagation for Music Composition

The purpose of the project is to build a factor graph model that implements the rules of simple counterpoint.

Given a few notes, the model can be used to "fill in the blanks" by finding melodies that satisfy all the rules.

Each rule can be seen as a factor in a factor graph.

The best note sequence will be obtained with Belief Propagation or some other efficient inference procedure.

Train a Full-Page OCR

Train the convolutional net implemented in Lush on snippets of images extracted from scanned pages. The snippets should include neighboring characters, so that the network's output will be robust to "distracting" characters on each side of the character to be recognized.

The purpose is to demonstrate a segmentation-free "brute force" OCR by sweeping the convolutional network over an entire page of text.

Students can use the convolutional network implementations from the gblearn2 library in Lush (in lush/packages/gblearn2) or from the EBLearn C++ library.

The Google 1000 dataset may be used for this. There is a Python script to download the entire dataset.

Predicting Financial Data

We have a dataset consisting of description vectors of various companies, together with a variable that indicates whether the company defaulted on its loans.

The project consists in predicting whether a company will default, using various methods including neural nets, logistic regression, SVMs, and perhaps others (each project team should pick a good subset).

The complication is that some variables are missing, hence a model with latent-variable inference should be used.

Contact the instructor to obtain the dataset.

Lp pooling for Convolutional Nets

Convolutional networks have been trained with a number of different pooling functions, such as average, average followed by tanh, and max.

This project concerns the implementation and test of other pooling functions, such as (SUM_i X_i^p)^(1/p), also known as "Lp norm" pooling, as well as 1/b log[ SUM_i exp(b*X_i) ], also known as log-sum pooling.
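
For reference, here is a small NumPy sketch of the two pooling rules over non-overlapping windows of a single 2D feature map (the window size, the constants p and b, and the single-map setting are illustrative assumptions; the project itself requires an implementation inside Lush, Torch7, or EBLearn).

import numpy as np

def lp_pool(x, p=2.0, size=2):
    """Lp pooling over non-overlapping size x size windows:
       out = (SUM_i x_i^p)^(1/p).  Assumes non-negative inputs for general p."""
    h, w = x.shape[0] // size * size, x.shape[1] // size * size
    blocks = x[:h, :w].reshape(h // size, size, w // size, size)
    return np.sum(blocks ** p, axis=(1, 3)) ** (1.0 / p)

def log_sum_pool(x, b=5.0, size=2):
    """Log-sum ("soft max") pooling: out = (1/b) log SUM_i exp(b * x_i)."""
    h, w = x.shape[0] // size * size, x.shape[1] // size * size
    blocks = x[:h, :w].reshape(h // size, size, w // size, size)
    m = blocks.max(axis=(1, 3), keepdims=True)          # subtract the max for stability
    return m[:, 0, :, 0] + np.log(np.sum(np.exp(b * (blocks - m)), axis=(1, 3))) / b

x = np.abs(np.random.RandomState(0).randn(12, 12))
print(lp_pool(x, p=4).shape, log_sum_pool(x).shape)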

This will have to be implemented within Lush, Torch7, or EBLearn, and tested on the MNIST or CIFAR-10 dataset.

Students should contact Pierre Sermanet for the code and datasets (sermanet AT cs DOT nyu DOT edu).

Implement Standard Learning Algorithms with Eblearn

Eblearn is a C++ library that implements classes and functionalities similar to Lush's gblearn2 package for gradient-based and energy-based learning.

This series of projects consists in implementing a number of standard algorithms and applications using eblearn. Suggested algorithms include Adaboost, PCA, K-Means, Mixtures of Gaussians.

Students should contact Pierre Sermanet (pierre DOT sermanet AT gmail DOT com), the principal maintainer of EBLearn, for information about which modules and functionality should be implemented.

Text/Image Segmenter for DjVuLibre

Implement a foreground/background segmenter for DjVuLibre. DjVuLibre is the open source version of the DjVu system. It currently lacks a good segmenter that can separate the text and drawings from the backgrounds and continuous-tone images in a scanned document.

The project will consist in building a foreground/background segmenter by using various methods, such as color clustering, spectral clustering, and others.

Learning with Conjugate Gradient in Eblearn

Implement and test Polak-Ribiere and Fletcher-Reeves Conjugate Gradient optimizations within the Eblearn C++ library.

Compare them with stochastic gradient on regression problems.

This comes down to implementing a new subclass of the "parameter" class in which the "update" method will be redefined.

Optional: implement quasi-Newton methods such as limited-memory BFGS.
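
As a reference for the update rule, here is a small NumPy sketch of nonlinear conjugate gradient with the Polak-Ribière(+) formula and a simple backtracking line search, applied to a toy regression problem (the line-search constants and the toy problem are placeholders; the project requires implementing this inside EBLearn's "parameter" class).

import numpy as np

def polak_ribiere_cg(f, f_grad, w0, n_iters=100, tol=1e-8):
    """Nonlinear CG with the Polak-Ribiere(+) beta and backtracking line search."""
    w = w0.copy()
    g = f_grad(w)
    d = -g
    for _ in range(n_iters):
        step, fw = 1.0, f(w)                                   # backtracking line search
        while f(w + step * d) > fw + 1e-4 * step * (g @ d) and step > 1e-12:
            step *= 0.5
        w = w + step * d
        g_new = f_grad(w)
        if np.linalg.norm(g_new) < tol:
            break
        beta = max(0.0, g_new @ (g_new - g) / (g @ g))         # PR+: reset when negative
        d = -g_new + beta * d
        g = g_new
    return w

# toy regression problem: minimize ||X w - y||^2
rng = np.random.RandomState(0)
X, w_true = rng.randn(200, 10), rng.randn(10)
y = X @ w_true
f = lambda w: np.sum((X @ w - y) ** 2)
grad = lambda w: 2.0 * X.T @ (X @ w - y)
w = polak_ribiere_cg(f, grad, np.zeros(10))
print(np.linalg.norm(w - w_true))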

Getting the MNIST dataset

Some projects require you to download the MNIST dataset. Instructions are available below. The MNIST dataset is composed of 60,000 training samples and 10,000 test samples of handwritten digits. Details about MNIST are available here.

An additional set of digits produced by randomly distorting the original MNIST images is available (see below).

To obtain the data:

  • Go to http://www.cs.nyu.edu/~yann/datasets/mnist-distort/.
  • Download and gunzip the training images: train-images-idx3-ubyte.gz, and the labels for those: train-labels-idx1-ubyte.gz.
  • Download and gunzip the test images: t10k-images-idx3-ubyte-0.gz, and the labels for them: t10k-labels-idx1-ubyte-0.gz
  • If you have an account on dept.cs.nyu.edu, you can do:
     rsync -av dept.cs.nyu.edu:/home/yann/www.cs.nyu.edu/html/datasets/mnist-distort.
     
    and type your password when prompted. This will download the entire set (152MB, distorted digits included).
BE CAREFUL: your browser may gunzip the files without telling you.

Each of those files is a Lush matrix that can be loaded into lush with (load-matrix "thefile"), or more efficiently, it can be memory-mapped in RAM with (mmap-matrix "thefile").
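
If you want to inspect the data outside of Lush, the files use the standard MNIST IDX byte layout (two zero bytes, a type byte of 0x08 for unsigned bytes, a dimension-count byte, big-endian 32-bit dimensions, then the raw data). Here is a small Python reader sketch under that assumption; the file names follow the ones listed above.

import gzip
import numpy as np

def read_idx_ubyte(path):
    """Read an (optionally gzipped) IDX file of unsigned bytes into a NumPy array."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rb") as f:
        header = f.read(4)
        assert header[0] == 0 and header[1] == 0 and header[2] == 0x08   # ubyte data
        ndim = header[3]
        dims = [int.from_bytes(f.read(4), "big") for _ in range(ndim)]
        data = np.frombuffer(f.read(), dtype=np.uint8)
    return data.reshape(dims)

images = read_idx_ubyte("train-images-idx3-ubyte")   # shape (60000, 28, 28)
labels = read_idx_ubyte("train-labels-idx1-ubyte")   # shape (60000,)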

A Lush script is available to train a convolutional network on MNIST. The script can be obtained here. The convolutional net has 20 feature maps at the first and second layers, and 50 at the third and fourth. The script should give you 0.7% to 0.8% error on the test set after training on the original training set (no distorted samples).

The net is quite a bit bigger than the one described in the paper "Gradient-Based Learning Applied to Document Recognition".
