Computer Science Department

Computer Science Colloquium

Lagrangian Difference Learning with Applications to Vision

Yann LeCun
NEC Labs America

Monday, April 14, 2003
11:00 a.m.
Room 1302 WWH
251 Mercer Street
New York, NY 10012-1185

Host: Richard Cole, cole@cs.nyu.edu, 212-998-3119
Directions: http://cs.nyu.edu/csweb/Location/directions.html
Colloquium Information: http://cs.nyu.edu/csweb/Calendar/colloquium/index.html

Abstract

Machine learning and statistical modeling are at the core of many recent advances in data mining, biological data analysis, information retrieval, human-computer interfaces, and time series prediction. In the most common learning paradigm, the inputs are fed to a hand-crafted feature extractor that produces fixed-size vectors, which are then sent to a trainable classifier, trained in supervised mode from labelled data.
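As a rough illustration of the paradigm just described, the sketch below feeds variable-length inputs through a hand-crafted feature extractor that produces fixed-size vectors, then trains a linear classifier on labelled examples. The feature choice, the perceptron update, the toy data, and all function names are assumptions made for this example and are not taken from the talk.

```python
# Hypothetical toy pipeline: hand-crafted features -> fixed-size vector -> trainable classifier.

def extract_features(signal):
    """Hand-crafted extractor: maps a variable-length sequence to a fixed-size vector."""
    n = len(signal)
    mean = sum(signal) / n
    energy = sum(x * x for x in signal) / n
    return [mean, energy, 1.0]  # the trailing 1.0 acts as a bias input


def train_perceptron(samples, labels, epochs=20, lr=0.1):
    """Supervised training of a linear classifier on labelled feature vectors."""
    w = [0.0] * len(samples[0])
    for _ in range(epochs):
        for x, y in zip(samples, labels):  # each label y is -1 or +1
            score = sum(wi * xi for wi, xi in zip(w, x))
            if y * score <= 0:  # misclassified: move w toward y * x
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
    return w


def predict(w, signal):
    """Classify a raw input by extracting features and thresholding the linear score."""
    x = extract_features(signal)
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1


# Made-up data: low-energy signals are labelled -1, high-energy signals +1.
raw = [[0.1, -0.2, 0.1], [0.0, 0.1], [2.0, -2.5, 3.0, -2.0], [-3.0, 2.5, 2.0]]
labels = [-1, -1, 1, 1]
weights = train_perceptron([extract_features(s) for s in raw], labels)
```

Only the classifier weights are learned here; the extractor stays fixed, which is the limitation the rest of the abstract addresses.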

Many of the classical "grand challenges" of AI, such as 3D object recognition, continuous speech recognition, and natural language understanding, have been (with some notable exceptions) largely out of reach of machine learning because of the overwhelming dimension of the input signal (e.g., the pixels of an image), because of the complex invariances in natural signals, and because of the difficulties of representing the relevant data as fixed-size vectors.

Visual object recognition is a particularly interesting problem, not only because of its potential practical impact, but also because it poses the most challenging scientific questions. Cracking it will require building very large learning systems composed of multiple heterogeneous modules with millions of adjustable parameters, trained on millions of examples so as to optimize a global performance measure. Training a complete recognition system from raw pixels to object categories requires new ways of integrating heterogeneous trainable modules such as object detectors, segmentors, feature extractors, object recognizers, and models of composite objects, so that they can be trained cooperatively. It requires trainable modules that can manipulate structured data such as graphs and sequences, rather than just fixed-size vectors. Finally, it also requires new ways to construct objective functions that accurately measure the overall performance of the system while being easy to optimize.

We first propose a methodology to construct objective functions for such systems. We assume that the stable states of the system are extrema (saddle points) of a Lagrange function, and show that a large number of popular supervised and unsupervised learning algorithms can be written as the difference between two extrema of this Lagrange function that result from different sets of constraints. Back-propagation, Boltzmann Machines, discriminative training algorithms for Hidden Markov Models, and many other algorithms (old and new) can be written in that form.
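One way to picture a loss written as the difference between two extrema under different constraints is sketched below. It is a simplified stand-in, not the speaker's formulation: it assumes a system that assigns an energy to each candidate output and uses a soft minimum (a free energy) in place of the saddle points of a Lagrange function. The function names and the numbers are hypothetical. The loss is the extremum with the output clamped to the desired label minus the extremum with the output left free.

```python
import math


def free_energy(energies, allowed, beta=1.0):
    """Soft minimum of the energies over the allowed outputs: -(1/beta) * log-sum-exp."""
    m = min(energies[k] for k in allowed)  # subtract the minimum for numerical stability
    total = sum(math.exp(-beta * (energies[k] - m)) for k in allowed)
    return m - math.log(total) / beta


def difference_loss(energies, target, beta=1.0):
    """Constrained extremum (output clamped to target) minus unconstrained extremum."""
    constrained = free_energy(energies, [target], beta)
    unconstrained = free_energy(energies, range(len(energies)), beta)
    return constrained - unconstrained


energies = [2.0, 0.5, 1.0]  # made-up energies for three candidate outputs
loss = difference_loss(energies, target=1)

# For beta = 1 the difference equals the negative log-likelihood of a softmax over -energy.
probs = [math.exp(-e) for e in energies]
nll = -math.log(probs[1] / sum(probs))
```

Under these assumptions, a large `beta` turns the soft minimum into a hard one, and the loss becomes the energy of the desired output minus the lowest energy overall, which is zero exactly when the desired output already has the lowest energy.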

We show that the Lagrangian extremization procedure can be applied to systems composed of multiple interconnected modules that operate on vectors, vector sequences, or valued graphs. Such systems, called Graph Transformer Networks (GTN), can deal with inputs that are not easily handled by traditional learning systems, such as probabilistic finite-state machines, and other combinatorial objects.

A practical application of GTNs will be briefly described. It combines convolutional network character recognizers, stochastic language models, and a discriminative Lagrangian Difference criterion to recognize bank checks with record accuracy. This system is integrated in several commercial recognition engines, and currently reads an estimated 10% to 20% of all the checks written in the US.

Convolutional networks are gradient-based learning systems whose multilayer architecture is loosely inspired by biological visual systems. They can be trained to recognize images directly from pixel data with a high degree of invariance to translations, geometric distortions, and intra-class variability. Applications of convolutional nets to face detection, 3D object recognition, and automatic TV sport classification will be briefly described. A live demo of a convolutional net that can simultaneously segment and recognize handwritten digit strings will be shown.
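The translation property mentioned above can be illustrated with a one-dimensional toy, assuming a single hand-picked kernel and a global max pooling step. Real convolutional networks are multilayer, operate on 2D images, and learn their kernels by gradient descent; the names and values below are hypothetical.

```python
def conv1d(signal, kernel):
    """Valid cross-correlation: the same kernel weights are applied at every position."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]


def global_max_pool(feature_map):
    """Keep only the strongest response, discarding where it occurred."""
    return max(feature_map)


kernel = [1.0, -1.0, 1.0]   # hand-picked detector; a trained network would learn this
pattern = [1.0, -1.0, 1.0]  # the pattern to detect

# The same pattern placed at two different positions in a length-10 signal.
a = [0.0] * 2 + pattern + [0.0] * 5
b = [0.0] * 6 + pattern + [0.0] * 1

fa = conv1d(a, kernel)
fb = conv1d(b, kernel)
```

Because the weights are shared across positions, shifting the input shifts the peak of the feature map by the same amount, and pooling over positions then gives the same response for both inputs.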

Part of this work is joint with Leon Bottou, Yoshua Bengio, and Patrick Haffner. On-line demos of convolutional nets are available at http://yann.lecun.com/exdb/lenet/index.html . Relevant publications are at http://yann.lecun.com/exdb/publis/index.html . Software is available at http://lush.sf.net
