Action Recognition in Video

Time Period: 2009 - present.
Participants: Graham Taylor, Rob Fergus, Chris Bregler, Yann LeCun (Courant Institute/CBLL).
Sponsors: DARPA, ONR.
Description: A trainable system was built to recognize actions in videos. The first layer is a Convolutional Gated Restricted Boltzmann Machine (CGRBM), which is trained in an unsupervised manner and automatically learns features that primarily encode motion. The second layer uses sparse coding to learn mid-level features, also in an unsupervised manner. The resulting feature vectors are pooled over time with a max-pooling operation and fed to a Support Vector Machine. Excellent performance was obtained on the Hollywood-2 dataset. A similar system was built to recognize actions on the KTH dataset. It also uses a CGRBM at the first layer, but uses a 3D (spatio-temporal) convolutional network architecture for the following layers.
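
The temporal max-pooling step mentioned in the description can be illustrated with a minimal sketch. This is not the project's code: the function name `temporal_max_pool`, the toy feature values, and the assumption that each time step yields one fixed-length feature vector are all hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def temporal_max_pool(frame_features: Sequence[Sequence[float]]) -> List[float]:
    """Collapse a (time x feature) sequence into a single vector by taking,
    for each feature dimension, its maximum over all time steps."""
    if not frame_features:
        raise ValueError("need at least one time step")
    dim = len(frame_features[0])
    if any(len(f) != dim for f in frame_features):
        raise ValueError("all time steps must have the same feature dimension")
    # zip(*...) regroups the values by feature dimension across time.
    return [max(column) for column in zip(*frame_features)]


# Hypothetical toy clip: 3 time steps, 4 mid-level features each.
clip = [
    [0.1, 0.0, 0.7, 0.2],
    [0.4, 0.0, 0.3, 0.9],
    [0.2, 0.5, 0.1, 0.0],
]
pooled = temporal_max_pool(clip)
print(pooled)  # [0.4, 0.5, 0.7, 0.9]
```

Under these assumptions, clips of different lengths all map to a vector of the same size, which is what allows a fixed-input classifier such as a Support Vector Machine to be applied to the pooled features.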

149. Graham W. Taylor, Rob Fergus, Yann LeCun and Christoph Bregler: Convolutional Learning of Spatio-temporal Features, Proc. European Conference on Computer Vision (ECCV'10), 2010.
