Vision Meets Machine Learning

Davi Geiger

Graduate Division

Computer Science

This is a hands-on course that covers basic problems in computer vision and the use of machine learning to solve them better. Students will develop their own algorithms, building on existing public software, with the aim of reaching or surpassing state-of-the-art results. In parallel, we will read the basic papers on each topic together, and classes will go over these papers. The topics covered are recognition, classification, pose, segmentation, motion, and action recognition. Basic knowledge of linear algebra and programming skills are required, as well as willingness to program on GPUs (students will have access to GPUs).

TAs: Salil Kapur, e-mail: and others to be confirmed

Office Hours: Wednesday 4:00 pm to 5:30 pm, on the 4th floor, WWH building

Prof. Geiger Office Hours and Location: Tuesday 1:00 pm to 2:00 pm, on the 4th floor, WWH building, room 407

Course Structure: theory and laboratory work. Groups of three students will work together to develop vision applications. Python and TensorFlow will be the programming language and environment. It is worthwhile to know the computer vision library OpenCV.

Projects will be developed in class and as homework. Evaluation is based on participation and project results.


1. Object Classification

A. Learning representations by back-propagating errors. Rumelhart, Hinton, Williams, Nature 1986.

  BackPropagation Neural Networks

Relevance: it is the original backpropagation paper for multilayer neural networks (although earlier versions of the idea exist).
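To make the idea concrete, here is a minimal sketch of backpropagation for a network with one hidden layer, written in plain Python. The layer sizes, sigmoid units, squared-error loss, and all function names (`forward`, `backward`, `train_step`, `init_params`) are illustrative assumptions and are not taken from the paper or the course software.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(params, x):
    """One hidden layer of sigmoid units followed by one sigmoid output unit."""
    w1, b1, w2, b2 = params
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(w1, b1)]
    output = sigmoid(sum(w * h for w, h in zip(w2, hidden)) + b2)
    return hidden, output

def loss(params, x, target):
    """Squared error for a single training example."""
    _, output = forward(params, x)
    return 0.5 * (output - target) ** 2

def backward(params, x, target):
    """Gradients of the loss, obtained by propagating the error backwards."""
    w1, b1, w2, b2 = params
    hidden, output = forward(params, x)
    # Error signal at the output unit (derivative w.r.t. its pre-activation).
    delta_out = (output - target) * output * (1.0 - output)
    grad_w2 = [delta_out * h for h in hidden]
    grad_b2 = delta_out
    # Error signal at each hidden unit, passed back through the weights w2.
    delta_hidden = [delta_out * w * h * (1.0 - h) for w, h in zip(w2, hidden)]
    grad_w1 = [[d * xi for xi in x] for d in delta_hidden]
    grad_b1 = delta_hidden
    return grad_w1, grad_b1, grad_w2, grad_b2

def init_params(n_in, n_hidden, seed=0):
    """Small random weights and zero biases (an arbitrary choice for this sketch)."""
    rng = random.Random(seed)
    w1 = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hidden)]
    b1 = [0.0] * n_hidden
    w2 = [rng.uniform(-1, 1) for _ in range(n_hidden)]
    return w1, b1, w2, 0.0

def train_step(params, x, target, lr=0.5):
    """One gradient-descent update on a single example."""
    w1, b1, w2, b2 = params
    g_w1, g_b1, g_w2, g_b2 = backward(params, x, target)
    w1 = [[w - lr * g for w, g in zip(row, g_row)] for row, g_row in zip(w1, g_w1)]
    b1 = [b - lr * g for b, g in zip(b1, g_b1)]
    w2 = [w - lr * g for w, g in zip(w2, g_w2)]
    return w1, b1, w2, b2 - lr * g_b2
```

Calling `train_step` repeatedly on one example should drive `loss` down. In the course the same computation would be handled by TensorFlow's automatic differentiation rather than written by hand.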

B. ImageNet Classification with Deep Convolutional Neural Networks. Krizhevsky, Sutskever, and Hinton. NIPS 2012.

  AlexNet

Relevance: it is the original deep CNN paper in computer vision. It introduces some concepts, such as dropout, that are not currently being used but are interesting and may have future impact.
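As a quick illustration of dropout, the sketch below uses the "inverted" variant, which rescales the surviving activations at training time. This is an assumption made for simplicity: the paper instead scales the outputs at test time. The function name and signature are hypothetical.

```python
import random

def dropout(activations, drop_prob, rng, training=True):
    """Inverted dropout: zero each unit with probability drop_prob, rescale the rest."""
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    if not training or drop_prob == 0.0:
        # At test time every unit is kept and nothing is rescaled.
        return list(activations)
    keep_prob = 1.0 - drop_prob
    # Dividing by keep_prob keeps the expected activation unchanged.
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]
```

With `drop_prob=0.5`, roughly half of the units are zeroed on each call and the rest are doubled, so the average activation stays about the same.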

C. Deep Residual Learning for Image Recognition. Kaiming He et al. Dec 2015.

  Residual Network

Relevance: the ResNet model is the best and simplest CNN architecture today, though this is worth checking.
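The core idea of a residual block, adding the input back to the output of a few layers, can be sketched as follows. This is a simplified illustration that uses fully connected layers in plain Python; the paper uses convolutions and batch normalization, and the helper names here are hypothetical.

```python
def relu(v):
    return [max(0.0, x) for x in v]

def linear(weights, bias, v):
    """Fully connected layer: weights is a list of rows, one per output unit."""
    return [sum(w * x for w, x in zip(row, v)) + b
            for row, b in zip(weights, bias)]

def residual_block(x, w1, b1, w2, b2):
    """Compute relu(F(x) + x), where F is two linear layers with a ReLU between them."""
    residual = linear(w2, b2, relu(linear(w1, b1, x)))
    # The shortcut connection adds the unchanged input to the learned residual.
    return relu([r + xi for r, xi in zip(residual, x)])
```

If all weights and biases are zero, the block returns its (non-negative) input unchanged, which shows why extra residual blocks can default to the identity mapping.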

D. Very Deep Convolutional Networks for Large-Scale Image Recognition. Karen Simonyan and Andrew Zisserman. 2014.

  VGG Network

Relevance: even though it does not perform as well as ResNet, it has a multiscale representation that is needed in vision, and it has been modified for other applications in ways we will need in this course. So we will follow it.

2. Object Classification and Localization

SSD: Single Shot MultiBox Detector. Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, Alexander C. Berg.
Relevance: it works well and it is fast. It uses and modifies VGG. Note that there is follow-up work with a small improvement in performance that replaces VGG with ResNet, but it is not publicly available: DSSD: Deconvolutional Single Shot Detector. Cheng-Yang Fu, Wei Liu, Ananth Ranga, Ambrish Tyagi, Alexander C. Berg. DSSD
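One building block of SSD training is matching default boxes to ground-truth boxes by their overlap. The sketch below computes intersection over union (IoU) and assigns each default box to its best ground-truth box when the overlap exceeds 0.5. It covers only this thresholded step and omits the paper's additional rule that every ground-truth box gets its single best default box; the function names and the box format `(xmin, ymin, xmax, ymax)` are assumptions made for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (xmin, ymin, xmax, ymax)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def match_default_boxes(default_boxes, gt_boxes, threshold=0.5):
    """For each default box, return the index of its matched ground-truth box, or -1."""
    matches = []
    for default in default_boxes:
        best_iou, best_idx = 0.0, -1
        for idx, gt in enumerate(gt_boxes):
            overlap = iou(default, gt)
            if overlap > best_iou:
                best_iou, best_idx = overlap, idx
        # Boxes below the overlap threshold are treated as background.
        matches.append(best_idx if best_iou > threshold else -1)
    return matches
```

Default boxes marked `-1` would be treated as negatives during training, while matched boxes contribute to both the localization and the classification loss.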

3. Image Segmentation

A. Graph Cut algorithm (Unsupervised Technique). Graph Cuts—Combinatorial Optimization in Vision (Chapter 2). Hiroshi Ishikawa. In Image Processing and Analysis with Graphs: Theory and Practice, edited by Olivier Lézoray and Leo Grady, CRC Press, July 2012.
  Relevance: it is the best unsupervised technique and was developed at NYU.
B. Fully Convolutional Networks for Semantic Segmentation. Jonathan Long, Evan Shelhamer, and Trevor Darrell. UC Berkeley, 2014.
    Image Segmentation
Relevance: it combines the concepts of the unsupervised technique (3.A.) with a supervised technique.

4. Pose Estimation
A. Synergistic Face Detection and Pose Estimation with Energy-Based Models. NIPS 2004.
  Yann.pdf
Relevance: it expands the problem to also consider pose estimation.

5. Motion
To be developed.

6. Adversarial Networks
A. Generative Adversarial Networks. Goodfellow et al.
Relevance: the foundational paper.
To be developed.

Homework 1. Run SSD on your laptop and connect it to the Kinect so that it runs on the RGB output of the Kinect. Bring it to class on 9/28 so we can all see it working.