People Detection: Global Approaches

Motivation

Why people detection?

  • Person Re-Identification

  • Person Tracking

  • Security (e.g. Border Control)

  • Automotive (e.g. Collision Prevention)

  • Interaction (e.g. Xbox Kinect)

  • Medical (e.g. Patient Monitoring)

  • Commercial (e.g. Customer Counting)

Why is people detection difficult?

  • Clothing

    The large variety of clothing styles increases the variation in appearance

  • Accessories

    Occlusions by accessories, e.g. backpack, umbrella, handbag, …

  • Articulation

    Unlike faces, which are mostly rigid, people can take on many different poses

  • Clutter

    People frequently overlap each other in images (crowds)

Categories

Still image vs. video

Still image based

  • Mostly based on gray-value information from visual images
  • Other possible cues: color, infra-red, radar, stereo
  • 👍 Advantage: Applicable in a wider variety of applications
  • 👎 Disadvantages
    • Often more difficult (only a single frame)
    • Typically performs worse than video-based techniques

Video based

  • Background modeling

  • Temporal information (speed, position in earlier frames)

  • Optical flow

  • Can be (re-)initialized by still image approach

  • 👎 Disadvantage: Hard to apply in unconstrained scenarios

Global vs. parts

Global approaches

  • Holistic model, e.g. one feature for the whole person

  • 👍 Advantages

    • typically simple model
    • work well for low resolutions
  • 👎 Disadvantages

    • problems with occlusions
    • problems with articulations

Part-based approaches

  • Model body sub-parts separately

  • 👍 Advantages

    • deal better with moving body parts (poses)
    • able to handle occlusions, overlaps
    • sharing of training data
  • 👎 Disadvantages

    • require more complex reasoning
    • problems with low resolutions

Discriminative vs. generative

Generative model

  • Models how data (i.e. person images) is generated

  • 👍 Advantages

    • possibly interpretable, i.e. one knows why a sample is accepted or rejected
    • models the object class, so samples can be drawn from it
  • 👎 Disadvantages

    • also models variability that is unimportant to the classification task
    • often hard to build a good model with few parameters

Discriminative model

  • Only decides, for given data, whether it shows a person or not

  • 👍 Advantages

    • appealing when it is infeasible to model the data itself
    • currently often excels in practice
  • 👎 Disadvantages

    • often cannot provide uncertainty estimates for its predictions
    • not interpretable

Typical components of global approaches

Detection via classification (binary classifier)

Sliding window: Scan window at different positions and scales

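A minimal sketch of the sliding-window scan, assuming a 64x128 window, a stride of 8 pixels and a scale step of 1.2. The function name and all default values are illustrative assumptions, not taken from a particular detector. Each yielded window would be passed to the binary classifier.

```python
def sliding_windows(img_w, img_h, win_w=64, win_h=128, stride=8, scale_step=1.2):
    """Yield (x, y, w, h) candidate windows over all positions and scales."""
    scale = 1.0
    # Grow the window until it no longer fits into the image.
    while win_w * scale <= img_w and win_h * scale <= img_h:
        w, h = int(win_w * scale), int(win_h * scale)
        step = max(1, int(stride * scale))  # stride grows with the window
        for y in range(0, img_h - h + 1, step):
            for x in range(0, img_w - w + 1, step):
                yield (x, y, w, h)
        scale *= scale_step  # next, larger window


windows = list(sliding_windows(80, 128))
```

Implementations often rescale the image instead (an image pyramid) so that the classifier always sees a fixed-size window; scaling the window, as done here, is only the simpler variant to write down.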

Gradient based

  • Popular and successful in the vision community

  • Avoid hard decisions (compared to edge based features)

  • Examples

    • Histogram of Oriented Gradients (HOG)
    • Scale-Invariant Feature Transform (SIFT)
    • Gradient Location and Orientation Histogram (GLOH)
  • Computing gradients

    • Centered

      $$ f^{\prime}(x)=\lim_{h \rightarrow 0} \frac{f(x+h)-f(x-h)}{2 h} $$
    • Gradient magnitude

      $$ s = \sqrt{s_x^2 + s_y^2} $$
    • Gradient orientation

      $$ \theta=\arctan \left(\frac{s_{y}}{s_{x}}\right) $$

  • Gradient in image

    • Image: discrete, 2-dimensional signal

    • Use filter mask to compute gradient

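      • $x$-direction, e.g. the centered-difference mask:

        $$ \begin{pmatrix} -1 & 0 & 1 \end{pmatrix} $$
      • $y$-direction, e.g. the transposed mask:

        $$ \begin{pmatrix} -1 & 0 & 1 \end{pmatrix}^{T} $$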
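The gradient computation can be sketched in plain Python as follows. This is a minimal illustration under a few assumptions: the image is a list of rows of gray values, the centered mask [-1 0 1] is used without the factor 1/2, border pixels are clamped, and the orientation is folded into the unsigned range [0, 180) degrees. The function name is hypothetical.

```python
import math


def image_gradients(img):
    """Return (magnitude, orientation) maps for a gray-value image.

    img is a list of rows; orientation is in degrees in [0, 180).
    """
    h, w = len(img), len(img[0])
    mag = [[0.0] * w for _ in range(h)]
    ori = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Centered differences; indices are clamped at the border (assumption).
            sx = img[y][min(x + 1, w - 1)] - img[y][max(x - 1, 0)]
            sy = img[min(y + 1, h - 1)][x] - img[max(y - 1, 0)][x]
            mag[y][x] = math.hypot(sx, sy)  # sqrt(sx^2 + sy^2)
            # atan2 avoids the division by zero of arctan(sy / sx).
            ori[y][x] = math.degrees(math.atan2(sy, sx)) % 180.0
    return mag, ori


# A vertical edge: the gradient points in the x-direction (orientation 0).
mag, ori = image_gradients([[0, 0, 10, 10]] * 3)
```

In practice this would be done with array operations (e.g. NumPy) rather than Python loops; the loops are used here only to keep each step visible.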

Edge based

Wavelet based

HOG people detector [1]

For more details, see: Histogram of Oriented Gradients (HOG)

  • Gradient-based feature descriptor developed for people detection
  • Global descriptor for the complete body
  • High-dimensional (typically ~4000 dimensions)
  • Very promising results on challenging data sets

Phases

Learning Phase

[Figure: overview of the learning phase]

Detection Phase

[Figure: overview of the detection phase]

How does the HOG descriptor work?

  1. Compute gradients on an image region of 64x128 pixels
  2. Compute gradient orientation histograms on cells of 8x8 pixels (8x16 cells in total). Typical histogram size: 9 bins
  3. Normalize the histograms within overlapping blocks of 2x2 cells (7x15 blocks in total). Block descriptor size: 4 x 9 = 36
  4. Concatenate the block descriptors $\rightarrow$ 7 x 15 x 4 x 9 = 3780-dimensional feature vector
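The arithmetic behind the 3780 dimensions can be checked with a few lines of Python. The function and parameter names are illustrative; the block stride of one cell is the assumption that makes neighbouring 2x2 blocks overlap.

```python
def hog_descriptor_length(win_w=64, win_h=128, cell=8, block=2, bins=9, block_stride=1):
    """Length of the HOG feature vector for one detection window."""
    cells_x, cells_y = win_w // cell, win_h // cell       # 8 x 16 cells
    blocks_x = (cells_x - block) // block_stride + 1      # 7 blocks per row
    blocks_y = (cells_y - block) // block_stride + 1      # 15 blocks per column
    per_block = block * block * bins                      # 4 x 9 = 36
    return blocks_x * blocks_y * per_block


length = hog_descriptor_length()  # 7 * 15 * 36 = 3780
```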

1. Gradients

  • Convolution with [-1 0 1] filters (x and y direction)

  • Compute gradient magnitude and direction

    • Per pixel: the color channel with the greatest gradient magnitude provides the final gradient (so color information is used)
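The per-pixel channel selection can be sketched as follows, assuming the four direct neighbours of a pixel are given as (R, G, B) tuples. The function name and argument layout are hypothetical and chosen only for illustration.

```python
import math


def dominant_channel_gradient(left, right, up, down):
    """Return (magnitude, sx, sy) of the color channel with the largest gradient.

    left, right, up, down are the (R, G, B) values of the neighbouring pixels.
    """
    best = (0.0, 0.0, 0.0)
    for c in range(3):
        sx = right[c] - left[c]   # [-1 0 1] in x-direction
        sy = down[c] - up[c]      # [-1 0 1] in y-direction
        magnitude = math.hypot(sx, sy)
        if magnitude > best[0]:
            best = (magnitude, sx, sy)
    return best


# The green channel has the strongest gradient here and is selected.
g = dominant_channel_gradient((0, 0, 0), (3, 10, 1), (0, 0, 0), (4, 0, 1))
```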

2. Cell histograms

  • 9 bins for gradient orientations (0-180 degrees, i.e. unsigned gradients)
  • Each pixel votes with its gradient magnitude
  • Votes are interpolated trilinearly
    • bilinearly into spatial cells
    • linearly into orientation bins
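A simplified sketch of the voting for a single cell is shown below. It covers only the linear interpolation between orientation bins; the bilinear interpolation into neighbouring spatial cells is deliberately left out. Bin centers at 10, 30, ..., 170 degrees and the function name are assumptions made for this illustration.

```python
import math


def cell_histogram(magnitudes, orientations, bins=9):
    """Magnitude-weighted orientation histogram of one cell (0-180 degrees).

    magnitudes and orientations are flat lists over the pixels of the cell.
    Each vote is split linearly between the two nearest bin centers.
    """
    bin_width = 180.0 / bins
    hist = [0.0] * bins
    for mag, ang in zip(magnitudes, orientations):
        pos = (ang % 180.0) / bin_width - 0.5  # position relative to bin centers
        lower = math.floor(pos)
        frac = pos - lower
        hist[lower % bins] += mag * (1.0 - frac)   # wraps around at 0/180 degrees
        hist[(lower + 1) % bins] += mag * frac
    return hist


# A gradient at 20 degrees lies halfway between the bin centers 10 and 30.
h = cell_histogram([4.0], [20.0])
```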

3. Blocks

  • Overlapping blocks of 2x2 cells
  • Cell histograms are concatenated and then normalized
  • Normalization
    • different norms possible (L2, L2-Hys, etc.)
    • add a small epsilon to the norm to avoid division by zero
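As an illustration, block normalization with the L2-Hys scheme (L2 normalization, clipping, renormalization) might look like this. The clipping value 0.2 follows the description in [1]; the epsilon value and the function name are assumptions.

```python
import math


def normalize_block(cell_hists, eps=1e-5, clip=0.2):
    """Concatenate the cell histograms of one block and apply L2-Hys."""
    v = [x for hist in cell_hists for x in hist]  # 4 cells x 9 bins = 36 values

    def l2_normalize(vec):
        # eps keeps the norm non-zero for blocks without any gradient.
        norm = math.sqrt(sum(x * x for x in vec) + eps ** 2)
        return [x / norm for x in vec]

    v = l2_normalize(v)
    v = [min(x, clip) for x in v]  # limit the influence of very strong gradients
    return l2_normalize(v)


block = normalize_block([[1.0] * 9 for _ in range(4)])
```

Dropping the clipping and the second normalization gives the plain L2 variant.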

4. The final HOG descriptor

Concatenation of block descriptors

Visualization

[Figure: visualization of the HOG descriptor]

From feature to detector

  • Simple linear SVM on top of the HOG Features

    • Fast (one inner product per evaluation window)

      for an entire image it’s a vector-matrix multiplication

  • Gaussian kernel SVM

    • slightly better classification accuracy

    • but considerable increase in computation time
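The cost argument for the linear SVM can be made concrete with a short sketch: each window costs one inner product, and stacking all window descriptors as rows turns the whole image into one matrix-vector product. The weight vector `w`, bias `b` and threshold are placeholders for values that would come from training.

```python
def svm_scores(features, w, b):
    """Linear SVM decision values w.x + b, one per window descriptor."""
    return [sum(wi * xi for wi, xi in zip(w, x)) + b for x in features]


def detect(features, w, b, threshold=0.0):
    """Indices of the windows classified as 'person'."""
    return [i for i, s in enumerate(svm_scores(features, w, b)) if s > threshold]


# Toy example with 2-dimensional "descriptors" instead of 3780 dimensions.
hits = detect([[2.0, 0.0], [0.0, 2.0], [1.0, 0.5]], w=[1.0, -1.0], b=-0.5)
```

A Gaussian kernel SVM would instead evaluate the kernel against every support vector for every window, which is where the additional computation time comes from.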

Silhouette Matching [2]

Idea

  • 🎯 Goal: align known object shapes with image

  • Requirements for an alignment algorithm

    • high detection rate

    • few false positives

    • robustness

    • computationally inexpensive

Computational complexity

Complexity is O(#positions × #templates × #contour pixels × size of search region)

Distance transform

Used to compare/align two (typically binary) shapes

  1. Compute the distance from each pixel to the nearest edge pixel

    • here the Euclidean distances are approximated by the 2-3 distance
  2. Overlay the second shape on the distance transform

  3. Accumulate the distances along shape 2

  4. Find the best matching position by an exhaustive search

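However:

  • The accumulated distance is not symmetric: matching shape 2 against the distance transform of shape 1 generally gives a different value than the reverse
  • The accumulated distance has to be normalized w.r.t. the length of the shapes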
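Steps 1 to 4, including the normalization by the shape length, can be sketched as follows. For brevity this illustration uses a brute-force, exact Euclidean distance transform instead of the 2-3 approximation, and represents the second shape as a list of (dx, dy) contour offsets. All function names are hypothetical.

```python
import math


def distance_transform_bruteforce(edges):
    """Exact Euclidean distance of every pixel to the nearest edge pixel.

    edges is a 2D list of 0/1 with at least one edge pixel (slow, illustration only).
    """
    pts = [(x, y) for y, row in enumerate(edges) for x, v in enumerate(row) if v]
    return [[min(math.hypot(x - ex, y - ey) for ex, ey in pts)
             for x in range(len(edges[0]))] for y in range(len(edges))]


def best_match(dt, template, tmpl_w, tmpl_h):
    """Exhaustive search; returns (mean distance, x, y) of the best placement."""
    h, w = len(dt), len(dt[0])
    best = None
    for y in range(h - tmpl_h + 1):
        for x in range(w - tmpl_w + 1):
            # Accumulate distances along the template and normalize by its length.
            score = sum(dt[y + dy][x + dx] for dx, dy in template) / len(template)
            if best is None or score < best[0]:
                best = (score, x, y)
    return best


edges = [[0] * 6 for _ in range(6)]
for x, y in [(3, 2), (4, 2), (3, 3)]:   # a small L-shaped edge structure
    edges[y][x] = 1
dt = distance_transform_bruteforce(edges)
match = best_match(dt, [(0, 0), (1, 0), (0, 1)], tmpl_w=2, tmpl_h=2)
```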

Chamfer matching

[Figure: chamfer matching]
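One common way to write the chamfer distance is the mean distance-transform value along the template contour, which includes the normalization by the shape length. The symbols are notation introduced here for illustration: $T$ is the set of template contour pixels, $I$ the edge image, and $d_I(t)$ the distance from pixel $t$ to the nearest edge pixel in $I$.

$$ D_{\text{chamfer}}(T, I) = \frac{1}{|T|} \sum_{t \in T} d_I(t) $$

A value of 0 means that every template contour pixel lies exactly on an edge pixel of the image.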

Efficient Implementation

The distance transform can be efficiently computed by two scans over the complete image

  • Forward-Scan

    • starts in the upper-left corner and moves from left to right, top to bottom

    • uses the following mask (2-3 distance weights, 0 marks the current pixel)

          3 2 3
          2 0

  • Backward-Scan

    • starts in the lower-right corner and moves from right to left, bottom to top

    • uses the following mask (2-3 distance weights, 0 marks the current pixel)

            0 2
          3 2 3
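A minimal sketch of the two-scan computation is given below. It assumes the usual 2-3 weights (cost 2 for horizontal and vertical neighbours, cost 3 for diagonal neighbours); the mask layout and the function name are assumptions of this illustration.

```python
INF = float("inf")

# (dx, dy, cost) of the neighbours that were already visited in each scan.
FORWARD = ((-1, 0, 2), (-1, -1, 3), (0, -1, 2), (1, -1, 3))
BACKWARD = ((1, 0, 2), (1, 1, 3), (0, 1, 2), (-1, 1, 3))


def distance_transform_23(edges):
    """2-3 distance transform of a binary edge image (2D list of 0/1)."""
    h, w = len(edges), len(edges[0])
    d = [[0 if edges[y][x] else INF for x in range(w)] for y in range(h)]

    def relax(x, y, mask):
        for dx, dy, cost in mask:
            nx, ny = x + dx, y + dy
            if 0 <= nx < w and 0 <= ny < h:
                d[y][x] = min(d[y][x], d[ny][nx] + cost)

    # Forward scan: upper-left corner, left to right, top to bottom.
    for y in range(h):
        for x in range(w):
            relax(x, y, FORWARD)
    # Backward scan: lower-right corner, right to left, bottom to top.
    for y in range(h - 1, -1, -1):
        for x in range(w - 1, -1, -1):
            relax(x, y, BACKWARD)
    return d


edges = [[0] * 5 for _ in range(5)]
edges[2][2] = 1                      # single edge pixel in the center
d = distance_transform_23(edges)     # d[2][3] is 2, d[3][3] is 3
```

Dividing the result by 2 gives an approximation of the Euclidean distance in pixels.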

Advantages

  • Fast
  • Good performance on uncluttered images (with few background structures)

Disadvantages

  • Poor performance on cluttered images
  • Needs a huge number of people silhouettes

Template Hierarchy

  • Reduce the number of silhouettes to consider
  • The shapes are clustered by similarity

  • Goal: Reduce search effort by discarding unlikely regions with minimal computational effort

  • Idea:

    • subsample the image and search first at a coarse scale

    • only consider regions with a low distance when searching for a match on finer scales

  • Need to find reasonable thresholds
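The coarse-to-fine idea can be sketched as follows. As a simplification, the coarse level is modelled as a coarse grid of search positions rather than an actually subsampled image, and `score(x, y)` stands for any matching distance (lower is better). The function name, step size and threshold are illustrative assumptions.

```python
def coarse_to_fine_search(score, width, height, coarse_step=4, threshold=3.0):
    """Return (distance, x, y) of the best position, or None if nothing passes."""
    # Coarse level: evaluate only every coarse_step-th position.
    candidates = [(x, y)
                  for y in range(0, height, coarse_step)
                  for x in range(0, width, coarse_step)
                  if score(x, y) <= threshold]
    best = None
    # Fine level: search only the neighbourhood of promising coarse positions.
    for cx, cy in candidates:
        for y in range(max(0, cy - coarse_step + 1), min(height, cy + coarse_step)):
            for x in range(max(0, cx - coarse_step + 1), min(width, cx + coarse_step)):
                s = score(x, y)
                if best is None or s < best[0]:
                    best = (s, x, y)
    return best


def toy_score(x, y):
    """Toy distance with its minimum at position (5, 6)."""
    return abs(x - 5) + abs(y - 6)


result = coarse_to_fine_search(toy_score, width=12, height=12)
```

The threshold controls the trade-off named above: if it is too low, the true match is discarded at the coarse level; if it is too high, almost every region is searched at the fine level.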


  [1] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), San Diego, CA, USA, 2005, pp. 886-893 vol. 1, doi: 10.1109/CVPR.2005.177.

  [2] D. M. Gavrila and V. Philomin, “Real-time object detection for ‘smart’ vehicles,” Proceedings of the Seventh IEEE International Conference on Computer Vision, Kerkyra, Greece, 1999, pp. 87-93 vol. 1, doi: 10.1109/ICCV.1999.791202.