People Detection: Deep Learning Approaches

Docs

CV Lecture

Deep Learning for Object Detection

People detections is a special case of object detection (one of the most challenging object classes to detect)
Recently, most detectors are trained for the more challenging task of multi-object detection
- Goal: Given an image, detect all instances of, say, 1000 different object classes
- “Person” always one of the classes
Speed is an issue
- Sliding Window: Look at each position, each scale
- Cascades look at each position too
  - They just take a shorter look at most positions/scales
- Region Proposals: Avoid useless positions/scales from the beginning

Region Proposals

💡Idea
- Identify image regions that are likely to contain an object
- Don’t care about the object class in the regions at this point
Characterization of a general object
- Find “blobby” regions
- Find connected regions that are somehow distinct from their surroundings
Requirements
- FAST!!!
- High recall
- Can allow a relatively high amount of false positives
2 main categories
- Grouping methods
  - Generate proposals based on hierarchically grouping meaningful image regions
  - Often better localization
  - E.g. Selective search
- Window scoring methods
  - Generate a large amount of windows
  - Use a quickly computed cue to discard unlikely windows (“objectness” measure)
  - Often faster

For more details and comparison, see: Overview of Region-based Object Detectors

R-CNN ¹

Idea and structure

Training

Train AlexNet on ImageNet (1000 classes)
Re-initialize last layers to a different dimension (depending on the #classes of the new classifier) and train new model
Train a classifier
- Binary SVMs (e.g. is human? yes/no) for each object class $\rightarrow$ $C$ SVMs in our case
- The outputs of pool5 of the retrained AlexNet are used as features
Improve the region proposals
- Use a regression model to improve the estimated locatin of the object
  - Input: features of proposed region (pool5)
  - output: x, y, width, height of the estimated region

Downsides

Speed: Need to forward-pass EACH region proposal through entire CNN!!!
SVM & BBox regressor are trained after CNN is fixed
- No simultaneous update/adaptation of CNN features possible
Complexity: multi-stage approach

Improvement:

For 1: Can we make (part of) the CNN run only once for all proposals?
For 2&3: Can we make the CNN perform these steps?

Fast R-CNN ²

Overview

ROI pooling

Conv layers don’t care about input size, FC layers do
ROI pooling: warp the variable size ROIs into in a predefined fix size shape.

End-to-end training

Instead of SVM & Regressor just add corresponding losses and train the system for both (multitask)
Gradients can backprop. into feature layers through ROI pooling layers (just as with normal maxpool layers)
End-to-end brings slight improvement 👏
Softmax (integrated) loss slightly but consistently outperforms external classifier 👏

Fast R-CNN vs R-CNN

Speed:

Downsides

Majority of time is lost for region proposals
Model is also not fully end-to-end: proposals come from “outside”
(Can we include them in the CNN as well? 🤔)

Faster R-CNN³

Overview

Region Proposal Network (RPN)

Input: Feature map from larger conv network of size $C \times W \times H$
Output
- List of $p$ proposals
- “Objectness” score of size $p \times 6$
  - $p \times 4$ coordinates (top-left and bottom-right $(x,y) $ coordinates) for bounding box
  - $p \times 2$ for objectness (with vs. without object) per location
General approach:
- Take a mini net (RPN) and slide it over the feature map (stepsize 1)
- At each position evaluate $k$ different window sizes for objectness
- Results in approx. $W \times H \times k$ windows/proposals
Fully convolutional network
Anchors: tackle the scale problem of the feature map
- Initial reference boxes consisting of aspect ratio and scale, centered at sliding window
- 3 scales and 3 aspect ratios = 9 anchors
Layers
- reg layer: regression of the reference anchor
- cls layer: object/no object score

Loss

Need a label for each anchor to train the objectness classification
Labelling anchors
- Positive: highest IoU with groundtruth or IoU > 0.7 (can be more than one)
  - Also store the association between anchor and groundtruth box
- Negative: others, if their IoU < 0.3
- Other anchors do not contribute to training
$\rightarrow$ Convert to classification problem
RPN multitask loss:
- $N\_{cls}$: Batch size (256)
- $N\_{reg}$: number of window positions ($\approx$ 2400)
- $\lambda = 10$

Training

As in paper

Jointly

Train everything in one go
Combination of four losses
- objectness classification
- anchor regression
- object class classification
- detection regression

Why two regression losses?
Anchor regression directly impacts the feature used for detection. Detection regression merely improves final localization

Comparison between all the R-CNNs

SSD Detector ⁴

Motivation

Thus far, deep multiclass detectors rely on variants of three steps:

generate bounding boxes (proposals)
resample pixels/features in boxes to uniform size
apply high quality classifier

Can we avoid / speed up any of those steps to increase overall speed?

Overview

💡Core Idea: Use a set of fixed default boxes at each position in a feature map (similar to anchors)
Classify object class and box regression for each default box
pply boxes at different layers in the ConvNet
- Use layers of different sizes
- Avoids the need for rescaling

Structure

Detectors at various stages with varying numbers of default boxes
Resulting number of detections is fixed
Reduced by non maximum suppression

Girshick, R., Donahue, J., Darrell, T., & Malik, J. (2014). Rich feature hierarchies for accurate object detection and semantic segmentation. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 580–587. https://doi.org/10.1109/CVPR.2014.81 ↩ ↩︎
Girshick, R. (2015). Fast R-CNN. Proceedings of the IEEE International Conference on Computer Vision, 2015 International Conference on Computer Vision, ICCV 2015, 1440–1448. https://doi.org/10.1109/ICCV.2015.169 ↩ ↩︎
Ren, S., He, K., Girshick, R., & Sun, J. (2017). Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6), 1137–1149. https://doi.org/10.1109/TPAMI.2016.2577031 ↩ ↩︎
Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu: “SSD: Single Shot MultiBox Detector”, 2016; arXiv:1512.02325. ↩︎

Last updated on 2024-09-05

← People Detection: Part-based Approaches 2021-07-07

Tracking 2021-07-19 →

People Detection: Deep Learning Approaches

Deep Learning for Object Detection

Region Proposals

R-CNN 1

Idea and structure

Training

Downsides

Fast R-CNN 2

Overview

ROI pooling

End-to-end training

Fast R-CNN vs R-CNN

Downsides

Faster R-CNN3

Overview

Region Proposal Network (RPN)

Loss

Training

As in paper

Jointly

Comparison between all the R-CNNs

SSD Detector 4

Motivation

Overview

Structure

R-CNN ¹

Fast R-CNN ²

Faster R-CNN³

SSD Detector ⁴