All the talks and posters on this page are provided in DjVu and PDF.
Some are also provided in ODP (OpenOffice's OpenDocument presentation
format) and PPT (MS PowerPoint). Caution: a few PDF files do not
display correctly in Ghostview.
We very strongly encourage interested readers to use the DjVu
versions: they display instantly, load much faster, and have no
compatibility problems.
djview4:
A free, open-source DjVu viewer for Windows, Mac OS X, and Linux.
A similar seminar was given at the Xerox Research Centre Europe
in Grenoble the following day.
Title: Learning Feature Hierarchies for Vision
Abstract: Intelligent perceptual tasks such as vision and
audition require the construction of good internal
representations. Theoretical and empirical evidence suggests that the
perceptual world is best represented by a multi-stage hierarchy in
which features in successive stages are increasingly global,
invariant, and abstract. An important challenge for Machine Learning
is to devise "deep learning" methods for multi-stage architectures that
can automatically learn good feature hierarchies from labeled and
unlabeled data.
A class of such methods that combines unsupervised sparse coding and
supervised refinement will be described. We demonstrate the use of
these deep learning methods to train convolutional networks
(ConvNets). ConvNets are biologically-inspired architectures
consisting of multiple stages of filter banks interspersed with
non-linear operations and spatial pooling operations, analogous to
the simple cells and complex cells in the mammalian visual cortex.
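To make the stage structure concrete, here is a minimal NumPy sketch of
a single ConvNet stage. The tanh non-linearity, 2x2 max pooling, and
random filters are illustrative assumptions; the actual systems use
learned filters and several stacked stages.

    import numpy as np

    def conv2d_valid(image, kernel):
        # Correlate a 2-D image with a 2-D kernel (valid region only).
        H, W = image.shape
        kh, kw = kernel.shape
        out = np.empty((H - kh + 1, W - kw + 1))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
        return out

    def convnet_stage(image, filters, pool=2):
        # One stage: filter bank -> tanh non-linearity -> max pooling.
        maps = []
        for k in filters:
            fmap = np.tanh(conv2d_valid(image, k))    # filtered feature map
            h = fmap.shape[0] - fmap.shape[0] % pool  # crop so pooling
            w = fmap.shape[1] - fmap.shape[1] % pool  # windows tile evenly
            fmap = fmap[:h, :w].reshape(h // pool, pool, w // pool, pool)
            maps.append(fmap.max(axis=(1, 3)))        # max over each window
        return np.stack(maps)                         # (n_filters, h', w')

    # 8 random 5x5 filters (placeholders for learned ones) on a 32x32 input.
    rng = np.random.default_rng(0)
    features = convnet_stage(rng.standard_normal((32, 32)),
                             rng.standard_normal((8, 5, 5)))
    print(features.shape)   # -> (8, 14, 14)

Stacking several such stages, with the pooled maps of one stage feeding
the filter bank of the next, yields the increasingly global and
invariant features described above.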
A number of applications will be shown through videos and live demos,
including a category-level object recognition system that can be
trained on the fly, a pedestrian detector, a system that recognizes
human activities in videos, and a trainable vision system for off-road
mobile robot navigation.
A new kind of "dataflow" computer architecture, dubbed NeuFlow, was
designed to run these algorithms (and other vision and recognition
algorithms) in real time on small, embeddable platforms. An FPGA
implementation of NeuFlow running various vision applications will be
shown. An ASIC being designed in collaboration with e-lab at Yale
will be capable of 700 giga-operations per second while consuming
less than 3 watts.
Abstract: Animals and humans autonomously learn to perceive and
navigate the world. What "learning algorithm" does the cortex use to
organize itself? Could computers and robots learn to perceive the way
animals do, by just observing the world and moving around it? This
constitutes a major challenge for machine learning and computer
vision.
The visual cortex uses a multi-stage hierarchy of representations,
from pixels to edges, motifs, parts, objects, and scenes. A new
branch of machine learning research, known as "deep learning," is
producing new algorithms that can learn such multi-stage hierarchies
of representations from raw inputs. I will describe a
biologically-inspired, trainable vision architecture called
a convolutional network. It consists of multiple stages of filter banks,
non-linear operations, and spatial pooling operations, analogous to
the simple cells and complex cells in the mammalian visual
cortex. Convolutional nets are first trained with unlabeled samples
using a learning method based on sparse coding, and subsequently
fine-tuned using labeled samples with a gradient-based supervised
learning algorithm.
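As a toy illustration of that two-phase schedule, the sketch below
trains a single linear encoder/decoder with an L1 sparsity penalty
(standing in for the sparse-coding step) and then fine-tunes the
encoder together with a logistic classifier. The synthetic data and
all constants are assumptions for illustration; the systems in the
talk use multi-stage convolutional architectures.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.standard_normal((200, 16))                   # toy unlabeled inputs
    y = (X @ rng.standard_normal(16) > 0).astype(float)  # toy binary labels

    We = 0.1 * rng.standard_normal((16, 8))              # encoder weights
    Wd = 0.1 * rng.standard_normal((8, 16))              # decoder weights
    v = np.zeros(8)                                      # classifier weights
    lam, lr, n = 0.1, 0.05, len(X)

    # Phase 1: unsupervised pre-training. Gradient descent on the
    # reconstruction error plus an L1 sparsity penalty on the codes.
    for _ in range(500):
        Z = X @ We                       # codes (features)
        E = Z @ Wd - X                   # reconstruction error
        Wd -= lr * (2 * Z.T @ E) / n
        We -= lr * (2 * X.T @ (E @ Wd.T) + lam * X.T @ np.sign(Z)) / n

    # Phase 2: supervised fine-tuning. Logistic loss; gradients flow
    # through the encoder as well as the classifier.
    for _ in range(500):
        Z = X @ We
        p = 1.0 / (1.0 + np.exp(-(Z @ v)))   # predicted probabilities
        g = (p - y) / n                      # dLoss / dlogits
        v -= lr * (Z.T @ g)
        We -= lr * (X.T @ np.outer(g, v))

    print("train accuracy:", float(((p > 0.5) == (y > 0.5)).mean()))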
A number of applications will be shown through videos and live demos,
including a category-level object recognition system that can be
trained on the fly, a pedestrian detector, a system that recognizes
human activities in videos, and a trainable vision system for off-road
mobile robot navigation. A very fast implementation of these systems
on specialized hardware will be shown. It is based on a new
programmable and reconfigurable "dataflow" architecture dubbed
NeuFlow.
2010-07-12: Five Lectures at the PCMI Summer School
Five lectures on machine learning and object recognition at the Park City Mathematics Institute
Graduate Summer School, organized by the Institute for Advanced Study.
Podcast of an interview with Yann LeCun about research at CBLL in vision, learning, robotics
and neuroscience. This is an 18-minute interview conducted by the New York Academy of Sciences:
[MP3 from NYAS]
2009-03-02: Distinguished Lecture at NEC Labs, Princeton, NJ
Intelligent tasks, such as visual perception, auditory perception, and
language understanding, require the construction of good internal
representations of the world. Internal representations (or "features")
must be invariant (or robust) to irrelevant variations of the input,
but must preserve the information relevant to the task. An important
goal of our research is to devise methods that can automatically learn
good internal representations from labeled and unlabeled data.
Results from theoretical analysis, and experimental evidence from
visual neuroscience, suggest that the visual world is best represented
by a multi-stage hierarchy, in which features in successive stages are
increasingly global, invariant, and abstract. The main question is how
one can train such deep architectures from unlabeled data and limited
amounts of labeled data.
Several methods have recently been proposed to train deep
architectures in an unsupervised fashion. Each layer of the deep
architecture is composed of a feed-forward encoder which computes a
feature vector from the input, and a feed-back decoder which
reconstructs the input from the features. The training shapes an
energy landscape with low valleys around the training samples and high
plateaus everywhere else. A number of such layers can be stacked and
trained sequentially. A particular class of methods for deep
energy-based unsupervised learning, which imposes sparsity constraints
on the features, will be described. When applied to natural image
patches, the method produces hierarchies of filters similar to those
found in the mammalian visual cortex. A simple modification of the
sparsity criterion produces locally-invariant features with
characteristics similar to those of hand-designed features such as SIFT.
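For concreteness, the sketch below shows the core sparse-coding step in
isolation: inferring a sparse code z for an input x by minimizing an
energy ||x - D z||^2 + lam * ||z||_1 with ISTA-style iterative
shrinkage. The random dictionary D and the constants are illustrative
stand-ins for the learned decoder filters.

    import numpy as np

    def soft_threshold(u, t):
        # Proximal operator of the L1 norm (element-wise shrinkage).
        return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

    def infer_sparse_code(x, D, lam=0.1, n_steps=100):
        # Minimize ||x - D z||^2 + lam * ||z||_1 over z (ISTA iterations).
        L = np.linalg.norm(D, 2) ** 2        # step size from spectral norm
        z = np.zeros(D.shape[1])
        for _ in range(n_steps):
            grad = D.T @ (D @ z - x)         # gradient of the quadratic term
            z = soft_threshold(z - grad / L, lam / (2 * L))
        return z

    rng = np.random.default_rng(0)
    D = rng.standard_normal((16, 32))        # overcomplete dictionary
    D /= np.linalg.norm(D, axis=0)           # (random stand-in, unit columns)
    x = rng.standard_normal(16)
    z = infer_sparse_code(x, D)
    print("nonzero code entries:", int((z != 0).sum()), "out of", z.size)

In the methods described above, the dictionary itself is learned so
that these sparse codes reconstruct the training data, and a
feed-forward encoder is trained to approximate the minimization.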
An application to category-level object recognition with invariance to
pose and illumination will be described. By stacking multiple stages
of sparse features, and refining the whole system with supervised
training, state-of-the-art accuracy can be achieved on standard datasets
with very few labeled samples. Another application to vision-based
navigation for off-road mobile robots will be shown. After a phase of
off-line unsupervised learning, the system autonomously learns to
discriminate obstacles from traversable areas at long range using
labels produced with stereo vision for nearby areas.
This is joint work with Y-Lan Boureau, Karol Gregor, Raia Hadsell,
Koray Kavukcuoglu, and Marc'Aurelio Ranzato.
2008-09-05: Machine Learning Summer School
Four lectures at the 2008 Machine Learning Summer School,
Ile de Ré, France, September 5, 2008.
Abstract: A long-term goal of Machine Learning research is to
solve highly complex "intelligent" tasks, such as visual perception,
auditory perception, and language understanding. To reach that goal,
the ML community must solve two problems: the Deep Learning Problem,
and the Partition Function Problem.
There is considerable theoretical and empirical evidence that complex
tasks, such as invariant object recognition in vision, require "deep"
architectures, composed of multiple layers of trainable non-linear
modules. The Deep Learning Problem is related to the difficulty of
training such deep architectures.
Several methods have recently been proposed to train (or pre-train)
deep architectures in an unsupervised fashion. Each layer of the deep
architecture is composed of an encoder which computes a feature vector
from the input, and a decoder which reconstructs the input from the
features. A large number of such layers can be stacked and trained
sequentially, thereby learning a deep hierarchy of features with
increasing levels of abstraction. The training of each layer can be
seen as shaping an energy landscape with low valleys around the
training samples and high plateaus everywhere else. Forming these
high plateaus constitutes the so-called Partition Function problem.
A particular class of methods for deep energy-based unsupervised
learning, which solves the Partition Function problem by imposing
sparsity constraints on the features, will be described. The method
can learn
multiple levels of sparse and overcomplete representations of
data. When applied to natural image patches, the method produces
hierarchies of filters similar to those found in the mammalian visual
cortex.
An application to category-level object recognition with invariance to
pose and illumination will be described (with a live demo). Another
application to vision-based navigation for off-road mobile robots will
be described (with videos). The system autonomously learns to
discriminate obstacles from traversable areas at long range.
This is joint work with Y-Lan Boureau, Sumit Chopra, Raia Hadsell,
Fu-Jie Huang, Koray Kavukcuoglu, and Marc'Aurelio Ranzato.
2007-12-08: Interview with Yann LeCun at VideoLectures.net
Yann LeCun, Sumit Chopra, Marc'Aurelio Ranzato and Fu-Jie Huang:
Energy-Based Models in Document Recognition and Computer Vision,
Proc. International Conference on Document Analysis and Recognition (ICDAR),
2007, [key=lecun-icdar-keynote-07].
Abstract:
Over the last few years, the Machine Learning and Natural Language
Processing communities have devoted a considerable amount of work to learning
models whose outputs are "structured", such as sequences of characters
and words in a human language. The methods of choice include
Conditional Random Fields, Hidden Markov SVMs, and Maximum Margin
Markov Networks. These models can be seen as un-normalized versions of
discriminative Hidden Markov Models. It may come as a surprise to the
ICDAR community that this class of models was originally developed in
the handwriting recognition community in the mid-90's to train
handwriting recognition systems discriminatively at the word level. The
various approaches can be described in a unified manner through the
concept of "Energy-Based Model" (EBM). EBMs capture dependencies between
variables by associating a scalar energy to each configuration of the
variables. Given a set of observed variables (e.g., an image),
inference in an EBM consists of finding configurations of the
unobserved variables (e.g., a recognized word or sentence) that
minimize the energy. Training an EBM consists of designing a loss function whose
minimization will shape the energy surface so that correct variable
configurations have lower energies than incorrect configurations. The
main advantage of the EBM approach is to circumvent one of the main
difficulties associated with training probabilistic models: keeping
them properly normalized, a potentially intractable problem with
complex models. Energy-Based learning has been applied with
considerable success to such problems as handwriting recognition,
natural language processing, biological sequence analysis, computer
vision (object detection and recognition), image segmentation, image
restoration, unsupervised feature learning, and dimensionality
reduction. Several specific applications will be described (and, for
some, illustrated with real-time demonstrations) including: a check
reading system; a real-time system for simultaneously detecting human
faces in images and estimating their pose; an unsupervised method for
learning invariant feature hierarchies; and a real-time system for
detecting and recognizing generic object categories in images, such as
airplanes, cars, animals, and people.
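The toy sketch below illustrates the bare EBM recipe described above,
not any of the systems just listed: a scalar energy E(x, y), inference
by picking the y with the lowest energy, and updates that push the
energy of the correct configuration below that of the incorrect
minimum. The linear energy and the perceptron-style update are
illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    n_classes, dim = 4, 8
    W = np.zeros((n_classes, dim))       # one parameter row per value of y

    def energy(x, y):
        return -float(W[y] @ x)          # scalar energy E(x, y)

    def infer(x):
        # Inference: find the configuration y that minimizes the energy.
        return int(np.argmin([energy(x, y) for y in range(n_classes)]))

    # Toy data: the correct label is the index of the largest of the
    # first n_classes input features.
    X = rng.standard_normal((300, dim))
    Y = X[:, :n_classes].argmax(axis=1)

    # Training: when the minimum-energy answer is wrong, lower the energy
    # of the correct configuration and raise the energy of the offender.
    for x, y in zip(X, Y):
        y_hat = infer(x)
        if y_hat != y:
            W[y] += x                    # decreases E(x, y)
            W[y_hat] -= x                # increases E(x, y_hat)

    print("training accuracy:",
          np.mean([infer(x) == y for x, y in zip(X, Y)]))

Because only energy differences matter for inference, no normalization
over configurations is ever computed, which is the point made above
about circumventing the partition function.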
2006-06-09: Keynote Speech at the Canadian Robotics and Vision Conference, Quebec
Supervised and Unsupervised Learning with Energy-Based Models:
[Slides in DjVu (6.9MB)]
[Slides in PDF (20.6MB)]
This is a good overview of research at CBLL: energy-based learning,
object recognition, face detection, unsupervised feature learning,
and robot vision and navigation.
2005-07-19: Tutorial on Energy-Based Models, Invariant Recognition, Trainable Metrics, and Graph Transformer Network, IPAM Summer School, UCLA
VideoLectures.net also has videos of two "lunch-time debates" (or panel discussions)
in which Yann was a participant, together with Rob Schapire, David McAllester,
Yasemin Altun, Mikhail Belkin, Yoram Singer, and John Langford.
2005-03-23: Invariant Recognition of Generic Object Categories with Energy-Based Models, MSRI Workshop on Object Recognition
A 1-hour talk on invariant recognition of generic object categories,
and on energy-based models, given at the Workshop on Object
Recognition at the Mathematical Sciences Research Institute in
Berkeley.
2001-10-22: DjVu, a compression technique and software platform for distributing scanned documents, digital documents, and high-resolution images on the Web, UIUC Distinguished Lecture
A 1-hour Distinguished Lecture (with video) given on October 22, 2001,
at the University of Illinois at Urbana-Champaign.