Action & Activity Recognition 2

What is action recognition?

Given an input video or image, perform appropriate processing and output the corresponding "action label".


CNNs for Action / Activity Recognition [1]

Why CNN?

  • Convolutional neural networks report the best performance in static image classification.
  • They automatically learn to extract generic features that transfer well across data sets.

Strategies for temporal fusion

  • Single Frame CNN (baseline)

    • Network sees one frame at a time
    • No temporal information
  • Late Fusion CNN

    • Network sees two frames separated by F = 15 frames
    • Both frames go into separate pathways
    • Only the last layers have access to temporal information
  • Early Fusion CNN

    • Modify the convolutional filters in the first layer to incorporate temporal information.
      • Filters of size $11 \times 11 \times 3 \times T$, where $T$ is the temporal context ($T = 10$); see the sketch after this list
  • Slow Fusion CNN

    • Layers higher in the hierarchy have access to larger temporal context
    • Learn motion patterns at different scales
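To make the fusion variants concrete, here is a minimal PyTorch sketch (the layer sizes are illustrative assumptions, loosely following the $11 \times 11$, stride-3 first layer of Karpathy et al.):

```python
# Minimal sketch of early vs. late fusion (illustrative layer sizes).
import torch
import torch.nn as nn

T = 10  # temporal context for early fusion

# Early fusion: the first filter spans the full temporal extent,
# i.e. an 11 x 11 x 3 x T filter realized as Conv3d with kernel depth T.
early_conv1 = nn.Conv3d(3, 96, kernel_size=(T, 11, 11), stride=(1, 3, 3))

clip = torch.randn(1, 3, T, 170, 170)        # (batch, c, l, h, w)
print(early_conv1(clip).shape)               # -> [1, 96, 1, 54, 54]

# Late fusion: two frames far apart share one 2D tower; their features
# are concatenated only in the fully connected part of the network.
tower = nn.Sequential(nn.Conv2d(3, 96, 11, stride=3), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten())
fused = torch.cat([tower(clip[:, :, 0]), tower(clip[:, :, -1])], dim=1)
```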

Multiresolution CNN

Faster training by reducing the input size from $170 \times 170$ to $89 \times 89$ pixels

💡 Idea: take advantage of the camera bias present in many online videos: the object of interest often occupies the center region.


  • The context stream receives the downsampled frames at half the original spatial resolution ($89 \times 89$ pixels)
  • The fovea stream receives the center $89 \times 89$ region at the original resolution

$\rightarrow$ The total input dimensionality is halved.
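A minimal sketch of how the two input streams could be constructed (shapes assume the paper's $170 \times 170$ frames; stream names follow the text above):

```python
# Minimal sketch of the multiresolution input pipeline.
import torch
import torch.nn.functional as F

frame = torch.randn(1, 3, 170, 170)          # one 170 x 170 RGB frame

# Context stream: the whole frame, downsampled to half resolution.
context = F.interpolate(frame, size=(89, 89), mode='bilinear',
                        align_corners=False)

# Fovea stream: the center 89 x 89 region at the original resolution.
top = (170 - 89) // 2
fovea = frame[:, :, top:top + 89, top:top + 89]

# Each stream feeds its own conv tower; activations are merged later.
print(context.shape, fovea.shape)            # both (1, 3, 89, 89)
```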

Evaluation

Dataset: Sports-1M (1 million videos, 487 sports activity classes)

Encoding image and optical flow separately (two-stream CNNs) [2]

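The spatial stream operates on RGB frames; the temporal stream consumes a stack of $L$ consecutive dense optical flow fields, giving a $2L$-channel input ($L = 10$ in the paper). A minimal sketch of the flow stacking, using OpenCV's Farnebäck flow as a stand-in for the flow method used in the paper:

```python
# Minimal sketch: build the 2L-channel temporal-stream input from
# grayscale frames. Frame loading/decoding is assumed to happen elsewhere.
import cv2
import numpy as np

def stacked_flow(frames, L=10):
    """frames: list of at least L + 1 grayscale frames (H x W uint8)."""
    channels = []
    for t in range(L):
        flow = cv2.calcOpticalFlowFarneback(
            frames[t], frames[t + 1], None,
            pyr_scale=0.5, levels=3, winsize=15,
            iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
        channels.extend([flow[..., 0], flow[..., 1]])   # x- and y-flow
    return np.stack(channels)                           # (2L, H, W)
```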

3D convolutions for action recognition (C3D)


Notation:

  • video clips $\in \mathbb{R}^{c \times l \times h \times w}$

    • $c$: #channels
    • $l$: length in number of frames
    • $h, w$: height and width of the frame
  • 3D convolution and pooling kernels of size $d \times k \times k$

    • $d$: kernel temporal depth
    • $k$: kernel spatial size

C3D: $3 \times 3 \times 3$ convolutions with stride 1 in space and time
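In PyTorch terms, the first C3D block might look like this (the 16-frame, $112 \times 112$ input and the $1 \times 2 \times 2$ first pooling follow the common C3D setup; the channel count is illustrative):

```python
# Minimal sketch of a C3D-style first block: 3 x 3 x 3 convolution with
# stride 1, followed by 1 x 2 x 2 pooling that leaves time untouched.
import torch
import torch.nn as nn

block = nn.Sequential(
    nn.Conv3d(3, 64, kernel_size=3, stride=1, padding=1),
    nn.ReLU(),
    # The first pooling keeps the temporal length so that temporal
    # information is not collapsed too early.
    nn.MaxPool3d(kernel_size=(1, 2, 2), stride=(1, 2, 2)),
)

clip = torch.randn(1, 3, 16, 112, 112)   # c x l x h x w: a 16-frame clip
print(block(clip).shape)                 # -> torch.Size([1, 64, 16, 56, 56])
```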


Recurrent Convolutional Networks / CNN-RNN [3]


LRCN

  • Task-specific instantiations:
  • Activity recognition (average the per-frame predictions)
  • Image captioning (the image features are fed into the RNN at each step)
  • Video description (sequence-to-sequence model)
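A minimal LRCN-style sketch for the activity recognition case: per-frame CNN features feed an LSTM and the frame-level predictions are averaged (the small CNN here is a hypothetical stand-in for the paper's backbone):

```python
# Minimal LRCN-style model: CNN per frame -> LSTM -> average over time.
import torch
import torch.nn as nn

class LRCN(nn.Module):
    def __init__(self, num_classes, feat_dim=256, hidden=512):
        super().__init__()
        self.cnn = nn.Sequential(                    # stand-in backbone
            nn.Conv2d(3, 32, 7, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, feat_dim))
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, num_classes)

    def forward(self, clip):                         # (B, T, 3, H, W)
        B, T = clip.shape[:2]
        feats = self.cnn(clip.flatten(0, 1)).view(B, T, -1)
        out, _ = self.lstm(feats)                    # (B, T, hidden)
        return self.fc(out).mean(dim=1)              # average over frames

logits = LRCN(num_classes=487)(torch.randn(2, 16, 3, 112, 112))
```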


Comparison of architectures

  • Type of convolution operators and layers

    • 2D kernels (image-based) vs.
    • 3D kernels (video-based)
  • Input streams

    • RGB (spatial stream), usually used in single-stream networks
    • Precomputed optical flow (temporal stream)
    • Further streams possible (e.g. depth, human bounding boxes)
  • Fusion strategy across multiple frames

    • Feature aggregation over time
    • Recurrent layers, such as LSTM

$\rightarrow$ Modern architectures are usually a combination of the above!

Fair comparison of the architectures is difficult!

  • Different pre-training of the models; some are trained from scratch
  • Activity recognition datasets have been too small for analysis of deep learning approaches $\rightarrow$ pre-training matters even more

Evolution of Activity Recognition Datasets

  • Construction of large-scale video datasets is much harder than for images 🤪
  • Common datasets are too small for a proper analysis of deep learning methods

Evaluation of Action Recognition Architectures [4]

Contributions

  • Release of the Kinetics dataset, the first large-scale dataset for activity recognition

  • Benchmarking of three "classic" architectures for activity recognition

    • Note: a fair comparison is still quite difficult, since the models differ in their modalities and pre-training
  • New Architecture: I3D

    • 3D CNN based on the Inception-V1 CNN (GoogLeNet)
    • "Inflation" of trained 2D filters into the 3D model

Evaluation of 3 “classic” architectures


  • ConvNet + LSTM (9M Parameters)

    • Underlying CNN for feature extraction: Inception-V1
    • LSTM with 512 hidden units (after the last AvgPool layer) + FC layer
    • Estimating the action from the resulting prediction sequence (see the sketch after this list):
      • Training: output at each time-step used for loss calculation
      • Testing: output of the last frame used for final prediction
    • Pre-trained on ImageNet
    • Preprocessing: down-sampling from 25 fps to 5 fps
  • 3D-ConvNet (79M Parameters)

    • Spatio-temporal filters, C3D architecture
    • High number of parameters $\rightarrow$ harder to train 🤪
    • CNN Input: 16-frame snippets
    • Classification: score averaging over all snippets of the video
    • Trained from scratch
  • Two-Stream CNN (12M Parameters)

    • Underlying CNN for feature extraction: Inception-V1
    • Spatial (RGB) and Temporal (Optical Flow) streams trained separately
    • Prediction by score averaging
    • CNN Pre-trained on ImageNet
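A minimal sketch of the ConvNet+LSTM output handling described above: during training, the loss is applied to the output at every time step; at test time only the last step's prediction is used (the class count and sequence length are illustrative):

```python
# Minimal sketch of per-step training loss vs. last-step test prediction.
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()
logits = torch.randn(4, 25, 400)      # (batch, time steps, classes)
labels = torch.randint(0, 400, (4,))  # one action label per clip

# Training: every time step contributes to the loss.
loss = criterion(logits.flatten(0, 1), labels.repeat_interleave(25))

# Testing: only the prediction at the last time step is used.
pred = logits[:, -1].argmax(dim=1)
```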

Evaluation

  • Two-stream networks are still the clear winners
  • The 3D-ConvNet shows poor performance despite its very high number of parameters
    • Note: it is the only architecture trained from scratch

Inflated 3D CNN (I3D)

💡 Idea: transfer knowledge from image recognition tasks into 3D CNNs

I3D Architecture


  • Inception-V1 architecture extended to 3D

  • Filters and pooling kernels are inflated with an additional time dimension ($N \times N \rightarrow N \times N \times N$)

  • 👍 Advantage: pre-training on ImageNet is possible (the learned weights of the 2D filters are repeated $N$ times along the time dimension and rescaled by $1/N$; see the sketch after this list)

  • Note: the 3D extension is not fully symmetric with respect to pooling (the time dimension is treated differently from the space dimensions)

    • The first two max-pooling layers do not perform temporal pooling
    • Later max-pooling layers use symmetric $3 \times 3 \times 3$ kernels
  • Evaluation

    • I3D outperforms image-based approaches on each of the streams
    • Combination of RGB input and optical flow still very useful
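A minimal sketch of the inflation trick referenced above: the pretrained 2D kernel is repeated $N$ times along the new time dimension and rescaled by $1/N$, so that the inflated filter initially reproduces the 2D response on a temporally constant ("boring") video:

```python
# Minimal sketch of 2D -> 3D filter inflation for I3D-style bootstrapping.
import torch

def inflate(w2d, N):
    """w2d: (out_c, in_c, k, k) conv weight -> (out_c, in_c, N, k, k)."""
    # Repeat along a new time axis and divide by N so the temporal sum
    # of responses to a constant video matches the original 2D output.
    return w2d.unsqueeze(2).repeat(1, 1, N, 1, 1) / N

w2d = torch.randn(64, 3, 7, 7)   # e.g. an Inception-V1 first-layer kernel
w3d = inflate(w2d, N=7)          # -> torch.Size([64, 3, 7, 7, 7])
```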

The role of pre-training

Pre-training on a video dataset (in addition to the ImageNet pre-training)

  • Pre-training on MiniKinetics
  • For 3D ConvNets, using additional data for pre-training is crucial
  • For 2D ConvNets, the difference seems to be smaller

$\rightarrow$ Pre-training is crucial

$\rightarrow$ I3D is the new state-of-the-art model


  1. A. Karpathy et al. "Large-scale Video Classification with Convolutional Neural Networks." In CVPR 2014.

  2. K. Simonyan and A. Zisserman. "Two-Stream Convolutional Networks for Action Recognition in Videos." In NIPS 2014.

  3. J. Donahue et al. "Long-term Recurrent Convolutional Networks for Visual Recognition and Description." In CVPR 2015.

  4. J. Carreira and A. Zisserman. "Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset." In CVPR 2017. https://doi.org/10.1109/CVPR.2017.502
