People Detection: Deep Learning Approaches

Deep Learning for Object Detection

  • People detection is a special case of object detection (people are one of the most challenging object classes to detect)
  • Recently, most detectors are trained for the more challenging task of multi-object detection
    • Goal: Given an image, detect all instances of, say, 1000 different object classes
    • “Person” always one of the classes
  • Speed is an issue
    • Sliding Window: Look at each position, each scale
    • Cascades look at each position too
      • They just take a shorter look at most positions/scales
    • Region Proposals: Avoid useless positions/scales from the beginning
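To get a feeling for why speed is an issue, the following minimal sketch counts how many windows an exhaustive sliding-window search has to evaluate. The image size, window size, stride, and scale set are illustrative assumptions, not values from the lecture.

```python
def count_sliding_windows(img_w, img_h, win_w, win_h, stride, scales):
    """Count the windows visited by an exhaustive sliding-window search.

    The window size stays fixed; the image is rescaled once per scale.
    """
    total = 0
    for s in scales:
        w, h = int(img_w * s), int(img_h * s)
        if w < win_w or h < win_h:
            continue  # the window no longer fits into the rescaled image
        nx = (w - win_w) // stride + 1  # horizontal positions
        ny = (h - win_h) // stride + 1  # vertical positions
        total += nx * ny
    return total


if __name__ == "__main__":
    # Hypothetical setup: 640x480 image, 64x128 person window, stride 8.
    n = count_sliding_windows(640, 480, 64, 128, stride=8,
                              scales=[1.0, 0.8, 0.64, 0.5])
    print("windows to classify:", n)
```

Every one of these windows needs a classifier evaluation. Cascades shorten the evaluation per window, whereas region proposals reduce the number of windows itself.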

Region Proposals

  • 💡Idea

    • Identify image regions that are likely to contain an object
    • Don’t care about the object class in the regions at this point
  • Characterization of a general object

    • Find “blobby” regions
    • Find connected regions that are somehow distinct from their surroundings
  • Requirements

    • FAST!!!
    • High recall
    • Can allow a relatively high amount of false positives
  • 2 main categories

    • Grouping methods

      • Generate proposals based on hierarchically grouping meaningful image regions

      • Often better localization

      • E.g. Selective search

    • Window scoring methods

      • Generate a large amount of windows
      • Use a quickly computed cue to discard unlikely windows (“objectness” measure)
      • Often faster
For more details and comparison, see: Overview of Region-based Object Detectors
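The "high recall" requirement can be made concrete with a small sketch: a ground-truth object counts as found if at least one proposal overlaps it sufficiently. The box format `(x1, y1, x2, y2)` and the IoU threshold of 0.5 are assumptions chosen for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def proposal_recall(gt_boxes, proposals, thresh=0.5):
    """Fraction of ground-truth boxes covered by at least one proposal."""
    if not gt_boxes:
        return 1.0
    hits = sum(1 for g in gt_boxes
               if any(iou(g, p) >= thresh for p in proposals))
    return hits / len(gt_boxes)


if __name__ == "__main__":
    gt = [(0, 0, 10, 10), (20, 20, 30, 30)]
    props = [(1, 1, 10, 10), (50, 50, 60, 60)]
    print(proposal_recall(gt, props))
```

False positives among the proposals do not lower this number, which is why a proposal method can afford many of them: the later classifier rejects them, but an object missed at this stage cannot be recovered.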


R-CNN

Idea and structure



  1. Train AlexNet on ImageNet (1000 classes)


  2. Re-initialize the last layers with a new output dimension (depending on the number of classes of the new classifier) and retrain the model


  3. Train a classifier

    • Binary SVMs (e.g. is human? yes/no) for each object class $\rightarrow$ $C$ SVMs in our case

    • The outputs of pool5 of the retrained AlexNet are used as features


  4. Improve the region proposals

    • Use a regression model to improve the estimated location of the object

      • Input: features of the proposed region (pool5)
      • Output: x, y, width, height of the estimated region

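The regression step (step 4) can be sketched as follows. The notes only state that the regressor outputs x, y, width, and height; the offset parameterization below (center shifts relative to the proposal size, log-scale factors for width and height) is the one commonly used with R-CNN-style detectors and is an assumption here. Boxes are given as `(cx, cy, w, h)`.

```python
import math


def encode(proposal, gt):
    """Regression targets that map a proposal onto its ground-truth box."""
    px, py, pw, ph = proposal
    gx, gy, gw, gh = gt
    return ((gx - px) / pw,        # horizontal shift, relative to width
            (gy - py) / ph,        # vertical shift, relative to height
            math.log(gw / pw),     # log width ratio
            math.log(gh / ph))     # log height ratio


def decode(proposal, t):
    """Apply predicted offsets t to a proposal to obtain the refined box."""
    px, py, pw, ph = proposal
    tx, ty, tw, th = t
    return (px + pw * tx, py + ph * ty, pw * math.exp(tw), ph * math.exp(th))


if __name__ == "__main__":
    proposal = (50.0, 50.0, 20.0, 40.0)
    gt = (55.0, 60.0, 30.0, 20.0)
    targets = encode(proposal, gt)
    print(targets)
    print(decode(proposal, targets))  # recovers gt up to rounding
```

In this sketch the regressor would be trained to predict `encode(proposal, gt)` from the pool5 features; at test time `decode` turns its prediction into the refined box.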


  1. Speed: Need to forward-pass EACH region proposal through entire CNN!!!

  2. SVM & BBox regressor are trained after CNN is fixed

    • No simultaneous update/adaptation of CNN features possible
  3. Complexity: multi-stage approach


  • For 1: Can we make (part of) the CNN run only once for all proposals?

  • For 2&3: Can we make the CNN perform these steps?

Fast R-CNN 2



ROI pooling

  • Conv layers don’t care about input size, FC layers do
  • ROI pooling: map each variable-size ROI to a predefined fixed-size shape (max-pool over a fixed grid of bins)
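A minimal single-channel sketch of ROI max pooling, assuming the ROI is given in feature-map cells as `(x1, y1, x2, y2)` with inclusive corners. The bin boundaries use floor for the start and ceil for the end, which is one common choice; real implementations differ in rounding details.

```python
def ceil_div(a, b):
    """Integer ceiling division for non-negative a and positive b."""
    return -(-a // b)


def roi_max_pool(fmap, roi, out_h, out_w):
    """Max-pool the ROI of a 2D feature map (list of rows) to out_h x out_w."""
    x1, y1, x2, y2 = roi
    roi_h, roi_w = y2 - y1 + 1, x2 - x1 + 1
    pooled = []
    for i in range(out_h):
        ys = y1 + (i * roi_h) // out_h            # first row of bin i
        ye = y1 + ceil_div((i + 1) * roi_h, out_h)  # one past the last row
        row = []
        for j in range(out_w):
            xs = x1 + (j * roi_w) // out_w
            xe = x1 + ceil_div((j + 1) * roi_w, out_w)
            row.append(max(fmap[y][x]
                           for y in range(ys, ye)
                           for x in range(xs, xe)))
        pooled.append(row)
    return pooled


if __name__ == "__main__":
    fmap = [[r * 4 + c for c in range(4)] for r in range(4)]
    print(roi_max_pool(fmap, (0, 0, 3, 3), 2, 2))
    print(roi_max_pool(fmap, (1, 1, 3, 2), 2, 2))
```

Both ROIs yield a 2x2 output although they differ in size, which is exactly what the FC layers need.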

End-to-end training

  • Instead of a separate SVM and regressor, add the corresponding losses to the network and train it for both tasks jointly (multitask)
  • Gradients can backpropagate into the feature layers through the ROI pooling layers (just as with normal max-pool layers)
  • End-to-end brings slight improvement 👏
  • Softmax (integrated) loss slightly but consistently outperforms external classifier 👏
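The multitask objective for a single ROI can be sketched as a softmax classification loss plus a box regression loss. The details below (class 0 as background, smooth L1 for regression, regression counted only for non-background ROIs, weight `lam`) follow the usual Fast R-CNN formulation and are assumptions relative to these notes.

```python
import math


def smooth_l1(x):
    """Quadratic near zero, linear for |x| >= 1."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1 else ax - 0.5


def softmax_cross_entropy(logits, label):
    """Negative log-probability of the true class under a softmax."""
    m = max(logits)  # subtract the maximum for numerical stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]


def fast_rcnn_loss(logits, label, box_pred, box_target, lam=1.0):
    """Classification loss plus regression loss for one ROI."""
    cls = softmax_cross_entropy(logits, label)
    loc = 0.0
    if label > 0:  # background ROIs (label 0) have no box to regress
        loc = sum(smooth_l1(p - t) for p, t in zip(box_pred, box_target))
    return cls + lam * loc


if __name__ == "__main__":
    print(fast_rcnn_loss([0.2, 2.0, -1.0], 1,
                         (0.1, 0.0, 0.2, 0.0), (0.0, 0.0, 0.0, 0.0)))
```

Because both terms are differentiable with respect to the network outputs, one backward pass updates the shared features for classification and localization at the same time.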

Fast R-CNN vs R-CNN




  • The majority of the time is now spent on computing region proposals

  • Model is also not fully end-to-end: proposals come from “outside”

    (Can we include them in the CNN as well? 🤔)

Faster R-CNN 3



Region Proposal Network (RPN)

  • Input: Feature map from larger conv network of size $C \times W \times H$

  • Output

    • List of $p$ proposals
    • Output of size $p \times 6$
      • $p \times 4$ coordinates (top-left and bottom-right $(x,y)$ coordinates) for the bounding box
      • $p \times 2$ for the “objectness” score (with vs. without object) per location
  • General approach:

    • Take a mini net (RPN) and slide it over the feature map (stepsize 1)
    • At each position evaluate $k$ different window sizes for objectness
    • Results in approx. $W \times H \times k$ windows/proposals


  • Fully convolutional network

  • Anchors: tackle the scale problem of the feature map

    • Initial reference boxes consisting of aspect ratio and scale, centered at sliding window
    • 3 scales and 3 aspect ratios = 9 anchors
  • Layers

    • reg layer: regression of the reference anchor
    • cls layer: object/no object score
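Anchor generation can be sketched as follows: at every feature-map position, $k = 9$ reference boxes are centered on the corresponding image location. The concrete scales (128, 256, 512 pixels), aspect ratios (1:2, 1:1, 2:1), and the feature stride of 16 are the values commonly reported for Faster R-CNN and are assumptions with respect to these notes.

```python
import math


def make_anchors(fm_w, fm_h, stride=16,
                 scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return anchors as (x1, y1, x2, y2) in image coordinates.

    ratio = height / width; every anchor of scale s has area s * s.
    """
    anchors = []
    for y in range(fm_h):
        for x in range(fm_w):
            # Center of the feature-map cell, projected back to the image.
            cx, cy = (x + 0.5) * stride, (y + 0.5) * stride
            for s in scales:
                for r in ratios:
                    w = s / math.sqrt(r)
                    h = s * math.sqrt(r)
                    anchors.append((cx - w / 2, cy - h / 2,
                                    cx + w / 2, cy + h / 2))
    return anchors


if __name__ == "__main__":
    anchors = make_anchors(fm_w=4, fm_h=3)
    print(len(anchors))  # W * H * k boxes
```

The reg layer then predicts offsets relative to each of these reference boxes, and the cls layer predicts one object/no-object score per anchor.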


  • Need a label for each anchor to train the objectness classification

  • Labelling anchors

    • Positive: highest IoU with groundtruth or IoU > 0.7 (can be more than one)
      • Also store the association between anchor and groundtruth box
    • Negative: others, if their IoU < 0.3
    • Other anchors do not contribute to training

    $\rightarrow$ Convert to classification problem
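The labelling rules above can be sketched as a small function. Boxes are assumed to be `(x1, y1, x2, y2)` tuples; the label encoding (1 = positive, 0 = negative, -1 = ignored) is a convention chosen for this example.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def label_anchors(anchors, gt_boxes, pos_thresh=0.7, neg_thresh=0.3):
    """Return (labels, matched): a label per anchor and, for positives,
    the index of the associated ground-truth box (None otherwise)."""
    matched = [None] * len(anchors)
    if not anchors or not gt_boxes:
        return [0] * len(anchors), matched
    labels = [-1] * len(anchors)
    ious = [[iou(a, g) for g in gt_boxes] for a in anchors]
    # Rule: IoU > 0.7 is positive, IoU < 0.3 (with every box) is negative.
    for i, row in enumerate(ious):
        best_j = max(range(len(gt_boxes)), key=lambda j: row[j])
        if row[best_j] > pos_thresh:
            labels[i], matched[i] = 1, best_j
        elif row[best_j] < neg_thresh:
            labels[i] = 0
    # Rule: the anchor(s) with the highest IoU for a box are positive too.
    for j in range(len(gt_boxes)):
        best = max(row[j] for row in ious)
        if best > 0:
            for i, row in enumerate(ious):
                if row[j] == best:
                    labels[i], matched[i] = 1, j
    return labels, matched


if __name__ == "__main__":
    anchors = [(0, 0, 10, 10), (0, 0, 10, 8), (100, 100, 110, 110), (0, 0, 10, 5)]
    print(label_anchors(anchors, [(0, 0, 10, 10)]))
```

The "highest IoU" rule guarantees that every ground-truth box gets at least one positive anchor, even when no anchor exceeds the 0.7 threshold.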

  • RPN multitask loss:

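    $$
    L(\{p_i\}, \{t_i\}) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*) + \lambda \frac{1}{N_{reg}} \sum_i p_i^* L_{reg}(t_i, t_i^*)
    $$

    • $p_i$, $p_i^*$: predicted objectness and ground-truth label (1 for positive, 0 for negative) of anchor $i$
    • $t_i$, $t_i^*$: predicted and ground-truth box parameters of anchor $i$
    • $N_{cls}$: Batch size (256)
    • $N_{reg}$: number of window positions ($\approx$ 2400)
    • $\lambda = 10$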
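As a rough sketch of how the two normalizers and $\lambda$ enter the computation, assume the form used in the Faster R-CNN paper: a log loss on the objectness of every sampled anchor, plus a smooth L1 regression loss that is counted only for positive anchors. Function and argument names are hypothetical.

```python
import math


def smooth_l1(x):
    """Quadratic near zero, linear for |x| >= 1."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1 else ax - 0.5


def rpn_loss(probs, labels, t_pred, t_true,
             n_cls=256, n_reg=2400, lam=10.0):
    """probs[i]: predicted objectness of anchor i; labels[i]: 1 or 0;
    t_pred[i], t_true[i]: 4 predicted / target box parameters."""
    cls_sum, reg_sum = 0.0, 0.0
    for p, y, tp, tt in zip(probs, labels, t_pred, t_true):
        p = min(max(p, 1e-12), 1.0 - 1e-12)  # avoid log(0)
        cls_sum += -math.log(p) if y == 1 else -math.log(1.0 - p)
        if y == 1:  # regression is only defined for positive anchors
            reg_sum += sum(smooth_l1(a - b) for a, b in zip(tp, tt))
    return cls_sum / n_cls + lam * reg_sum / n_reg


if __name__ == "__main__":
    print(rpn_loss([0.9, 0.2], [1, 0],
                   [(0.1, 0.0, 0.0, 0.0), (0.0, 0.0, 0.0, 0.0)],
                   [(0.0, 0.0, 0.0, 0.0), (1.0, 1.0, 1.0, 1.0)]))
```

With $N_{cls} = 256$ and $N_{reg} \approx 2400$, the choice $\lambda = 10$ brings the two terms to roughly the same weight.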


As in paper



  • Train everything in one go

  • Combination of four losses

    • objectness classification
    • anchor regression
    • object class classification
    • detection regression

Why two regression losses?

Anchor regression directly affects the features used for detection, because the regressed proposals determine which regions are pooled. Detection regression merely refines the final localization.

Comparison between all the R-CNNs


SSD Detector 4


Thus far, deep multiclass detectors rely on variants of three steps:

  • generate bounding boxes (proposals)
  • resample pixels/features in boxes to uniform size
  • apply high quality classifier

Can we avoid / speed up any of those steps to increase overall speed?



  • 💡Core Idea: Use a set of fixed default boxes at each position in a feature map (similar to anchors)
  • Classify object class and box regression for each default box
  • Apply boxes at different layers in the ConvNet
    • Use layers of different sizes
    • Avoids the need for rescaling



  • Detectors at various stages with varying numbers of default boxes
  • The resulting number of raw detections is fixed (determined by the number of default boxes)
  • Overlapping detections are then reduced by non-maximum suppression
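Greedy non-maximum suppression can be sketched in a few lines: keep the highest-scoring box, discard all boxes that overlap it too much, and repeat. The box format `(x1, y1, x2, y2)` and the IoU threshold of 0.5 are assumptions for this example; in a multi-class detector this would typically be run per class.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_thresh=0.5):
    """Return the indices of the boxes that survive, best score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep box i only if it does not overlap an already kept box.
        if all(iou(boxes[i], boxes[k]) <= iou_thresh for k in keep):
            keep.append(i)
    return keep


if __name__ == "__main__":
    boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
    scores = [0.9, 0.8, 0.7]
    print(nms(boxes, scores))
```

In the usage example the second box overlaps the first with an IoU of about 0.68 and is suppressed, while the distant third box is kept.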

