<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Lecture | Haobin Tan</title><link>https://haobin-tan.netlify.app/tags/lecture/</link><atom:link href="https://haobin-tan.netlify.app/tags/lecture/index.xml" rel="self" type="application/rss+xml"/><description>Lecture</description><generator>Hugo Blox Builder (https://hugoblox.com)</generator><language>en-us</language><lastBuildDate>Tue, 20 Jul 2021 00:00:00 +0000</lastBuildDate><image><url>https://haobin-tan.netlify.app/media/icon_hu7d15bc7db65c8eaf7a4f66f5447d0b42_15095_512x512_fill_lanczos_center_3.png</url><title>Lecture</title><link>https://haobin-tan.netlify.app/tags/lecture/</link></image><item><title>Computer Vision for Human-Computer Interaction</title><link>https://haobin-tan.netlify.app/docs/ai/computer-vision/cv-lecture/</link><pubDate>Fri, 06 Nov 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/computer-vision/cv-lecture/</guid><description>&lt;style type="text/css">
.tg {border-collapse:collapse;border-spacing:0;}
.tg td{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
overflow:hidden;padding:10px 5px;word-break:normal;}
.tg th{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
font-weight:normal;overflow:hidden;padding:10px 5px;word-break:normal;}
.tg .tg-fymr{border-color:inherit;font-weight:bold;text-align:left;vertical-align:top}
.tg .tg-0pky{border-color:inherit;text-align:left;vertical-align:top}
&lt;/style>
&lt;table class="tg">
&lt;thead>
&lt;tr>
&lt;th class="tg-fymr">Name&lt;/th>
&lt;th class="tg-0pky">Computer Vision for Human-Computer Interaction&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td class="tg-fymr">Semester&lt;/td>
&lt;td class="tg-0pky">WS 20/21&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td class="tg-fymr">Language&lt;/td>
&lt;td class="tg-0pky">English, German&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td class="tg-fymr">Lecturer(s)&lt;/td>
&lt;td class="tg-0pky">&lt;a href="https://cvhci.anthropomatik.kit.edu/people_596.php" target="_blank" rel="noopener noreferrer">Prof. Dr.-Ing. Rainer Stiefelhagen&lt;/a>&lt;br>&lt;a href="https://cvhci.anthropomatik.kit.edu/people_713.php" target="_blank" rel="noopener noreferrer">Dr.-Ing. Muhammad Saquib Sarfraz&lt;/a>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td class="tg-fymr">Credits&lt;/td>
&lt;td class="tg-0pky">6&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td class="tg-fymr">Homepages&lt;/td>
&lt;td class="tg-0pky">&lt;a href="https://cvhci.anthropomatik.kit.edu/600_1979.php">&lt;span style="color:#905">https://cvhci.anthropomatik.kit.edu/600_1979.php&lt;/span>&lt;/a>&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table></description></item><item><title>Pattern Recognition</title><link>https://haobin-tan.netlify.app/docs/ai/computer-vision/cv-lecture/02-pattern-recognition/</link><pubDate>Fri, 06 Nov 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/computer-vision/cv-lecture/02-pattern-recognition/</guid><description>&lt;h2 id="why-pattern-recognition-and-what-is-it">Why pattern recognition and what is it?&lt;/h2>
&lt;h3 id="what-is-machine-learning">What is machine learning?&lt;/h3>
&lt;ul>
&lt;li>Motivation: &lt;span style="color:red">Some problems are very hard to solve by writing a computer program by hand&lt;/span>&lt;/li>
&lt;li>Learn common patterns based on either
&lt;ul>
&lt;li>
&lt;p>a priori knowledge or&lt;/p>
&lt;/li>
&lt;li>
&lt;p>statistical information&lt;/p>
&lt;ul>
&lt;li>Important for the adaptability to different tasks/domains&lt;/li>
&lt;li>Try to mimic human learning / better understand human learning&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Machine learning is concerned with developing generic algorithms that are able to solve problems by &lt;strong>learning from example data&lt;/strong>&lt;/li>
&lt;/ul>
&lt;h2 id="classifiers">Classifiers&lt;/h2>
&lt;ul>
&lt;li>
&lt;p>Given an input pattern $\mathbf{x}$, assign it to a class $\omega\_i$&lt;/p>
&lt;ul>
&lt;li>&lt;em>Example: Given an image, assign label “face” or “non-face”&lt;/em>&lt;/li>
&lt;li>$\mathbf{x}$: can be an image, a video, or (more commonly) any feature vector that can be extracted from them&lt;/li>
&lt;li>$\omega\_i$: desired (discrete) class label
&lt;ul>
&lt;li>If the “class label” is a real number or a vector &amp;ndash;&amp;gt; Regression task&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>ML: Use example patterns with given class labels to automatically learn&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-11-07%2012.50.02.png" alt="截屏2020-11-07 12.50.02" style="zoom:80%;" />
&lt;/li>
&lt;li>
&lt;p>Example&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2020-11-07%2012.50.58.png" alt="截屏2020-11-07 12.50.58">&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Classification process&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-11-07%2012.52.02.png" alt="截屏2020-11-07 12.52.02" style="zoom:80%;" />
&lt;/li>
&lt;/ul>
&lt;h3 id="bayes-classification">Bayes Classification&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>Given a feature vector $\mathbf{x}$, want to know which class $\omega\_i$ is most likely&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Use Bayes’ rule: Decide for the class $\omega\_i$ with maximum posterior probability&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2020-11-07%2012.53.45.png" alt="截屏2020-11-07 12.53.45">&lt;/p>
&lt;/li>
&lt;li>
&lt;p>🔴 Problem: $p(x|\omega\_i)$ (and to a lesser degree $P(\omega\_i)$) is usually &lt;span style="color:red">unknown and often hard to estimate from data&lt;/span>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Priors&lt;/strong> describe what we know about the classes &lt;em>before&lt;/em> observing anything&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Can be used to model prior knowledge&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Sometimes easy to estimate (counting)&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Example&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-11-07%2012.56.58.png" alt="截屏2020-11-07 12.56.58" style="zoom:80%;" />
&lt;/li>
&lt;/ul>
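&lt;p>The decision rule above can be sketched in a few lines; this is a minimal 1-D sketch in which the Gaussian class-conditionals and the priors are made-up toy numbers, not values from the lecture:&lt;/p>

```python
import math

def gaussian_pdf(x, mu, sigma):
    """1-D Gaussian density."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def bayes_classify(x, classes):
    """Decide for the class with maximum posterior P(w_i|x).
    The evidence p(x) is the same for all classes, so comparing
    the numerators p(x|w_i) * P(w_i) is sufficient."""
    return max(classes, key=lambda c: gaussian_pdf(x, c["mu"], c["sigma"]) * c["prior"])

# toy "face" vs. "non-face" model (hand-picked parameters, for illustration only)
classes = [
    {"name": "face", "mu": 2.0, "sigma": 1.0, "prior": 0.3},
    {"name": "non-face", "mu": -1.0, "sigma": 1.0, "prior": 0.7},
]
print(bayes_classify(2.5, classes)["name"])  # face
```

&lt;p>Note how the prior enters the decision: with a small $P(\omega\_i)$, a pattern must lie clearly nearer that class mean before the posterior wins.&lt;/p>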
&lt;h3 id="gaussian-mixture-models">Gaussian Mixture Models&lt;/h3>
&lt;h4 id="gaussian-classification">Gaussian classification&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>Assumption:
&lt;/p>
$$
\mathrm{p}\left(\mathbf{x} | \omega_{\mathrm{i}}\right) \sim \mathrm{N}(\boldsymbol{\mu}, \mathbf{\Sigma})= \frac{1}{(2 \pi)^{d / 2}|\Sigma|^{1 / 2}} \exp \left[-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{\top} \boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right]
$$
&lt;ul>
&lt;li>👆 This makes estimation easier
&lt;ul>
&lt;li>Only $\boldsymbol{\mu}, \mathbf{\Sigma}$ need to be estimated&lt;/li>
&lt;li>To reduce parameters, the covariance matrix can be restricted
&lt;ul>
&lt;li>
&lt;p>Diagonal matrix &amp;ndash;&amp;gt; Dimensions uncorrelated&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Multiple of unit matrix &amp;ndash;&amp;gt; Dimensions uncorrelated with same variance&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>🔴 Problem: if the assumption(s) do not hold, the model does not represent reality well &amp;#x1f622;&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Estimation of $\boldsymbol{\mu}, \mathbf{\Sigma}$ with Maximum (Log-)Likelihood&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Use the parameters that best explain the data (highest likelihood):
&lt;/p>
$$
\begin{aligned}
\operatorname{Lik}(\boldsymbol{\mu}, \mathbf{\Sigma}) &amp;= p(\text{data}|\boldsymbol{\mu}, \mathbf{\Sigma}) \\\\
&amp;= p(\mathbf{x}\_0, \mathbf{x}\_1, \dots, \mathbf{x}\_n|\boldsymbol{\mu}, \mathbf{\Sigma}) \\\\
&amp;= p\left(\mathbf{x}\_{0} | \boldsymbol{\mu}, \mathbf{\Sigma}\right) \cdot p\left(\mathbf{x}\_{1} | \boldsymbol{\mu}, \mathbf{\Sigma}\right) \cdot \ldots \cdot p\left(\mathbf{x}\_{\mathrm{n}} | \boldsymbol{\mu}, \mathbf{\Sigma}\right)
\end{aligned}
$$
$$
\operatorname{LogLik}(\boldsymbol{\mu}, \mathbf{\Sigma}) = \log(\operatorname{Lik}(\boldsymbol{\mu}, \mathbf{\Sigma})) = \sum\_{i=0}^n \log p(\mathbf{x}\_i | \boldsymbol{\mu}, \mathbf{\Sigma})
$$
&lt;p>&amp;ndash;&amp;gt; Maximize $\log(\operatorname{Lik}(\boldsymbol{\mu}, \mathbf{\Sigma}))$ over $\boldsymbol{\mu}, \mathbf{\Sigma}$&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
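&lt;p>For a single Gaussian, the maximum of the log-likelihood has a closed form: $\boldsymbol{\mu}$ is the sample mean and $\mathbf{\Sigma}$ the sample covariance (with normalization $1/n$, not $1/(n-1)$). A 1-D sketch:&lt;/p>

```python
def gaussian_mle(samples):
    """Closed-form maximum-likelihood estimates of a 1-D Gaussian."""
    n = len(samples)
    mu = sum(samples) / n
    # the ML estimate divides by n (not n-1); it maximizes the log-likelihood
    var = sum((x - mu) ** 2 for x in samples) / n
    return mu, var

mu, var = gaussian_mle([1.0, 2.0, 3.0, 4.0])
print(mu, var)  # 2.5 1.25
```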
&lt;h4 id="gaussian-mixture-models-gmms">Gaussian Mixture Models (GMMs)&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>Approximate true density function using a &lt;strong>weighted sum&lt;/strong> of several Gaussians
&lt;/p>
$$
\mathrm{p}(\mathbf{x})=\sum\_{i} \mathrm{w}\_{i} \frac{1}{(2 \pi)^{\mathrm{d}/2}|\mathbf{\Sigma}\_i|^{1 / 2}} \exp \left[-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}\_i)^{\top} \boldsymbol{\Sigma}\_i^{-1}(\mathbf{x}-\boldsymbol{\mu}\_i)\right] \qquad \text{with } \sum\_i w\_i = 1
$$
&lt;/li>
&lt;li>
&lt;p>Any density can be approximated this way with arbitrary precision&lt;/p>
&lt;ul>
&lt;li>But might need many Gaussians&lt;/li>
&lt;li>Difficult to estimate many parameters&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Use &lt;strong>Expectation Maximization (EM) Algorithm&lt;/strong> to estimate parameters of the Gaussians as well as the weights&lt;/p>
&lt;ol>
&lt;li>
&lt;p>Initialize parameters of GMM randomly&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Repeat until convergence&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Expectation (E)&lt;/strong> step:&lt;/p>
&lt;p>Compute the probability $p\_{ij}$ that data point $i$ belongs to Gaussian $j$&lt;/p>
&lt;ul>
&lt;li>Take the value of each Gaussian at point $i$ and normalize so they sum up to one&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Maximization (M)&lt;/strong> step:&lt;/p>
&lt;p>Compute new GMM parameters using soft assignments $p\_{ij}$&lt;/p>
&lt;ul>
&lt;li>Maximum Likelihood with data weighted according to $p\_{ij}$&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol>
&lt;/li>
&lt;/ul>
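&lt;p>The two steps above can be sketched for a 1-D mixture; this minimal version uses a fixed number of iterations instead of a proper convergence test, and a crude deterministic initialization instead of a random one:&lt;/p>

```python
import math

def em_gmm_1d(data, k=2, iters=50):
    """EM for a 1-D Gaussian mixture: alternate soft assignments (E step)
    and weighted maximum-likelihood updates (M step)."""
    # crude initialization: spread the means over the data range, unit variance
    lo, hi = min(data), max(data)
    mus = [lo + (i + 0.5) * (hi - lo) / k for i in range(k)]
    variances = [1.0] * k
    weights = [1.0 / k] * k
    for _ in range(iters):
        # E step: p_ij = responsibility of Gaussian j for data point i
        resp = []
        for x in data:
            vals = [w * math.exp(-0.5 * (x - m) ** 2 / v) / math.sqrt(2 * math.pi * v)
                    for w, m, v in zip(weights, mus, variances)]
            s = sum(vals)
            resp.append([v / s for v in vals])
        # M step: weighted ML estimates using the soft counts
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(data)
            mus[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            variances[j] = sum(r[j] * (x - mus[j]) ** 2 for r, x in zip(resp, data)) / nj
    return weights, mus, variances

data = [0.1, -0.2, 0.0, 5.1, 4.9, 5.0]
w, m, v = em_gmm_1d(data)
print(sorted(m))  # two means, one near 0 and one near 5
```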
&lt;div class="flex px-4 py-3 mb-6 rounded-md bg-primary-100 dark:bg-primary-900">
&lt;span class="pr-3 pt-1 text-primary-600 dark:text-primary-300">
&lt;svg height="24" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24">&lt;path fill="none" stroke="currentColor" stroke-linecap="round" stroke-linejoin="round" stroke-width="1.5" d="m11.25 11.25l.041-.02a.75.75 0 0 1 1.063.852l-.708 2.836a.75.75 0 0 0 1.063.853l.041-.021M21 12a9 9 0 1 1-18 0a9 9 0 0 1 18 0m-9-3.75h.008v.008H12z"/>&lt;/svg>
&lt;/span>
&lt;span class="dark:text-neutral-300">&lt;h4 id="parametric-vs-non-parametric">parametric vs. non-parametric&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Parametric&lt;/strong> classifiers&lt;/p>
&lt;ul>
&lt;li>assume a specific form of probability distribution with some parameters&lt;/li>
&lt;li>only the parameters need to be estimated&lt;/li>
&lt;li>👍 Advantage: Need less training data because fewer parameters have to be estimated&lt;/li>
&lt;li>👎 Disadvantage: Only work well if the model fits the data&lt;/li>
&lt;li>Examples: Gaussian and GMMs&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Non-parametric&lt;/strong> classifiers&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Do NOT assume a specific form of probability distribution&lt;/p>
&lt;/li>
&lt;li>
&lt;p>👍 Advantage: Work well for all types of distributions&lt;/p>
&lt;/li>
&lt;li>
&lt;p>👎 Disadvantage: Need more data to correctly estimate the distribution&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Examples: Parzen windows, k-nearest neighbors&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/span>
&lt;/div>
&lt;div class="flex px-4 py-3 mb-6 rounded-md bg-primary-100 dark:bg-primary-900">
&lt;span class="pr-3 pt-1 text-primary-600 dark:text-primary-300">
&lt;svg height="24" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24">&lt;path fill="none" stroke="currentColor" stroke-linecap="round" stroke-linejoin="round" stroke-width="1.5" d="m11.25 11.25l.041-.02a.75.75 0 0 1 1.063.852l-.708 2.836a.75.75 0 0 0 1.063.853l.041-.021M21 12a9 9 0 1 1-18 0a9 9 0 0 1 18 0m-9-3.75h.008v.008H12z"/>&lt;/svg>
&lt;/span>
&lt;span class="dark:text-neutral-300">&lt;h4 id="generative-vs-discriminative">generative vs. discriminative&lt;/h4>
&lt;ul>
&lt;li>A method that models $P(\omega\_i)$ and $p(\mathbf{x}|\omega\_i)$ &lt;em>explicitly&lt;/em> is called a &lt;strong>generative&lt;/strong> model
&lt;ul>
&lt;li>$p(\mathbf{x}|\omega\_i)$ allows to generate new samples of class $\omega\_i$&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>The other common approach uses &lt;strong>discriminative&lt;/strong> models
&lt;ul>
&lt;li>directly model $p(\omega\_i|\mathbf{x})$ or just output a decision $\omega\_i$ given an input pattern $\mathbf{x}$&lt;/li>
&lt;li>easier to train because they solve a simpler problem &amp;#x1f44f;&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/span>
&lt;/div>
&lt;h3 id="linear-discriminant-functions">Linear Discriminant Functions&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>Separate two classes $\omega\_1, \omega\_2$ with a linear hyperplane
&lt;/p>
$$
y(x)=w^{T} x+w_{0}
$$
&lt;p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-11-07%2016.45.44.png" alt="截屏2020-11-07 16.45.44" style="zoom:80%;" />&lt;/p>
&lt;ul>
&lt;li>Decide $\omega\_1$ if $y(x) > 0$ else $\omega\_2$&lt;/li>
&lt;li>$w$: normal vector of the hyperplane&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Example:&lt;/p>
&lt;ul>
&lt;li>Perceptron (see: &lt;a href="https://haobin-tan.netlify.app/docs/ai/deep-learning/nn-basics/perceptron/">Perceptron&lt;/a>)&lt;/li>
&lt;li>Linear SVM&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
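&lt;p>The decision rule $y(x)=w^{T} x+w\_{0}$ in code; the hyperplane parameters here are hypothetical, as if produced by a perceptron or a linear SVM:&lt;/p>

```python
def linear_decision(x, w, w0):
    """Linear discriminant y(x) = w^T x + w0; decide omega_1 if y(x) > 0."""
    y = sum(wi * xi for wi, xi in zip(w, x)) + w0
    return "omega_1" if y > 0 else "omega_2"

# hypothetical hyperplane: normal vector w = (1, 1), offset w0 = -1
print(linear_decision((2.0, 0.5), (1.0, 1.0), -1.0))  # omega_1, since y = 1.5
```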
&lt;h3 id="support-vector-machines">Support Vector Machines&lt;/h3>
&lt;p>See: &lt;a href="https://haobin-tan.netlify.app/docs/ai/machine-learning/classification/svm/support-vector-machine/">SVM&lt;/a>&lt;/p>
&lt;h4 id="linear-svms">Linear SVMs&lt;/h4>
&lt;ul>
&lt;li>If the input space is already high-dimensional, linear SVMs can often perform well too&lt;/li>
&lt;li>👍 Advantages:
&lt;ul>
&lt;li>Speed: Only one scalar product for classification&lt;/li>
&lt;li>Memory: Only one vector $w$ needs to be stored&lt;/li>
&lt;li>Training: Training is much faster&lt;/li>
&lt;li>Model selection: Only one parameter to optimize&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="k-nearest-neighbours">K-nearest Neighbours&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>💡 Look at the $k$ closest training samples and assign the most frequent label among them&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Model consists of all training samples&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Pro: No information is lost&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Con: A lot of data to manage&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Naïve implementation: compute distance to each training sample every time&lt;/p>
&lt;ul>
&lt;li>Distance metric is needed (Important design parameter!)
&lt;ul>
&lt;li>$L\_1$, $L\_2$, $L\_{\infty}$, Mahalanobis, &amp;hellip; or&lt;/li>
&lt;li>Problem-specific distances&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>kNN is often a good classifier, but:&lt;/p>
&lt;ul>
&lt;li>Needs enough data&lt;/li>
&lt;li>Scalability issues&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
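&lt;p>The naïve implementation described above, with squared Euclidean ($L\_2$) distance as the metric and toy training samples:&lt;/p>

```python
from collections import Counter

def knn_classify(x, train, k=3):
    """Naive k-NN: compute the distance to every training sample each time,
    then vote among the k closest labels. Metric: squared L2 distance."""
    def dist2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    nearest = sorted(train, key=lambda s: dist2(x, s[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

train = [((0.0, 0.0), "A"), ((0.1, 0.2), "A"), ((0.3, 0.1), "A"),
         ((5.0, 5.0), "B"), ((5.2, 4.9), "B")]
print(knn_classify((0.2, 0.2), train))  # A
```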
&lt;div class="flex px-4 py-3 mb-6 rounded-md bg-primary-100 dark:bg-primary-900">
&lt;span class="pr-3 pt-1 text-primary-600 dark:text-primary-300">
&lt;svg height="24" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24">&lt;path fill="none" stroke="currentColor" stroke-linecap="round" stroke-linejoin="round" stroke-width="1.5" d="m11.25 11.25l.041-.02a.75.75 0 0 1 1.063.852l-.708 2.836a.75.75 0 0 0 1.063.853l.041-.021M21 12a9 9 0 1 1-18 0a9 9 0 0 1 18 0m-9-3.75h.008v.008H12z"/>&lt;/svg>
&lt;/span>
&lt;span class="dark:text-neutral-300">More see: &lt;a href="https://haobin-tan.netlify.app/docs/ai/machine-learning/classification/k-nearest-neighbor/">k-NN&lt;/a>&lt;/span>
&lt;/div>
&lt;h2 id="clustering">Clustering&lt;/h2>
&lt;ul>
&lt;li>New problem setting
&lt;ul>
&lt;li>Only data points are given, NO class labels&lt;/li>
&lt;li>Find structures in given data&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Generally no single correct solution possible&lt;/li>
&lt;/ul>
&lt;h3 id="k-means">K-means&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>Algorithm&lt;/p>
&lt;ol>
&lt;li>
&lt;p>Randomly initialize k cluster centers&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Repeat until convergence:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Assign all data points to closest cluster center&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Compute new cluster center as mean of assigned data points&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol>
&lt;/li>
&lt;li>
&lt;p>👍 Pros: Simple and efficient&lt;/p>
&lt;/li>
&lt;li>
&lt;p>👎 Cons:&lt;/p>
&lt;ul>
&lt;li>$k$ needs to be known in advance&lt;/li>
&lt;li>Results depend on initialization&lt;/li>
&lt;li>Does not work well for clusters that are not hyperspherical (round) or clusters that overlap&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Very similar to the EM algorithm&lt;/p>
&lt;ul>
&lt;li>Uses hard assignments instead of probabilistic assignments (EM)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
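&lt;p>A minimal 1-D sketch of the algorithm; note that the result indeed depends on the (here fixed) initialization:&lt;/p>

```python
def kmeans(data, centers, iters=20):
    """k-means with hard assignments: assign each point to the closest
    center, then recompute each center as the mean of its points."""
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for x in data:
            j = min(range(len(centers)), key=lambda i: (x - centers[i]) ** 2)
            clusters[j].append(x)
        # keep the old center if a cluster happens to be empty
        centers = [sum(c) / len(c) if c else m for c, m in zip(clusters, centers)]
    return centers

data = [0.0, 0.2, 0.1, 5.0, 5.1, 4.9]
print([round(c, 1) for c in kmeans(data, [0.0, 1.0])])  # [0.1, 5.0]
```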
&lt;h3 id="agglomerative-hierarchical-clustering">Agglomerative Hierarchical Clustering&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>Algorithm&lt;/p>
&lt;ol>
&lt;li>
&lt;p>Start with one cluster for each data point&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Repeat&lt;/p>
&lt;ul>
&lt;li>Merge two closest clusters&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol>
&lt;/li>
&lt;li>
&lt;p>Several possibilities to measure cluster distance&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Min: minimal distance between elements&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Max: maximal distance between elements&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Avg: average distance between elements&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Mean: distance between cluster means&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Result is a tree called a &lt;strong>dendrogram&lt;/strong>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Example:&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-11-07%2017.02.40.png" alt="截屏2020-11-07 17.02.40" style="zoom:67%;" />
&lt;/li>
&lt;/ul>
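&lt;p>A sketch of the algorithm using the “Min” criterion (single linkage) on 1-D points; it stops at a target number of clusters and returns them rather than building the full dendrogram:&lt;/p>

```python
def agglomerative(points, target_clusters):
    """Agglomerative clustering, 'Min' criterion (single linkage):
    start with one cluster per point, repeatedly merge the two closest."""
    clusters = [[p] for p in points]
    def min_dist(a, b):
        return min(abs(x - y) for x in a for y in b)
    while len(clusters) > target_clusters:
        # find the pair of clusters with the smallest inter-cluster distance
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min_dist(clusters[i], clusters[j])
                if best is None or best[0] > d:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters.pop(j)
    return [sorted(c) for c in clusters]

print(agglomerative([0.0, 0.3, 5.0, 5.2, 9.0], 2))  # [[0.0, 0.3], [5.0, 5.2, 9.0]]
```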
&lt;h2 id="curse-of-dimensionality">Curse of dimensionality&lt;/h2>
&lt;ul>
&lt;li>
&lt;p>In computer vision, the extracted feature vectors are often &lt;strong>high&lt;/strong>-dimensional&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Many intuitions about linear algebra are no longer valid in high-dimensional spaces 🤪&lt;/p>
&lt;ul>
&lt;li>Classifiers often work better in low-dimensional spaces&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>These problems are called “&lt;strong>curse of dimensionality&lt;/strong>” &amp;#x1f47f;&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h3 id="example">Example&lt;/h3>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-11-07%2017.05.38.png" alt="截屏2020-11-07 17.05.38" style="zoom:80%;" />
&lt;h3 id="dimensionality-reduction">Dimensionality reduction&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>PCA: Leave out dimensions and minimize the reconstruction error made&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-11-07%2017.06.38.png" alt="截屏2020-11-07 17.06.38" style="zoom:80%;" />
&lt;/li>
&lt;li>
&lt;p>LDA: Maximize class separability&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-11-07%2017.06.58.png" alt="截屏2020-11-07 17.06.58" style="zoom:67%;" />&lt;/li>
&lt;/ul></description></item><item><title>Face Detection: Color-Based</title><link>https://haobin-tan.netlify.app/docs/ai/computer-vision/cv-lecture/03-face-detection-color/</link><pubDate>Fri, 06 Nov 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/computer-vision/cv-lecture/03-face-detection-color/</guid><description>&lt;h2 id="tldr">TL;DR&lt;/h2>
&lt;ul>
&lt;li>
&lt;p>Different color spaces and classifiers can be used&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Models: histograms, Gaussian Models, Mixture of Gaussians Model&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Histogram-backprojection / Histogram matching&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Bayes classifier&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Discriminative Classifiers (ANN, SVM)&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Bayesian classifier and ANN seem to work well&lt;/p>
&lt;ul>
&lt;li>Sufficient training data is needed for modeling the pdf, in particular for Bayesian approach (positive &amp;amp; negative pdfs learned)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Advantages: Fast, rotation &amp;amp; scale invariant, robust against occlusions&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Disadvantages:&lt;/p>
&lt;ul>
&lt;li>Affected by illumination&lt;/li>
&lt;li>Cannot distinguish head and hands&lt;/li>
&lt;li>Skin-colored objects in the background problematic&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Metric: ROC curve used to compare classification results / methods&lt;/p>
&lt;/li>
&lt;/ul>
&lt;hr>
&lt;h2 id="color-based-face-detection-overview">Color-based face detection overview&lt;/h2>
&lt;p>💡 &lt;strong>Idea: human skin has consistent color, which is distinct from many objects&lt;/strong>&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2020-11-10%2014.57.37.png" alt="截屏2020-11-10 14.57.37">&lt;/p>
&lt;p>Possible approach:&lt;/p>
&lt;ol>
&lt;li>Find skin colored pixels&lt;/li>
&lt;li>Group skin colored pixels&lt;/li>
&lt;/ol>
&lt;ul>
&lt;li>(and apply some heuristics) to find the face&lt;/li>
&lt;/ul>
&lt;h2 id="color">Color&lt;/h2>
&lt;ul>
&lt;li>&lt;strong>Grayscale&lt;/strong> Image: Each pixel represented by &lt;strong>one&lt;/strong> number (typically integer between 0 and 255)&lt;/li>
&lt;li>&lt;strong>Color&lt;/strong> image: Pixels represented by &lt;strong>three&lt;/strong> numbers&lt;/li>
&lt;/ul>
&lt;p>Different representations exist &amp;ndash;&amp;gt; “Color Spaces”&lt;/p>
&lt;h3 id="color-spaces">Color spaces&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>RGB&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>most widely used&lt;/p>
&lt;/li>
&lt;li>
&lt;p>specifies colors in terms of the primary colors &lt;strong>red (R), green (G), and blue (B)&lt;/strong>&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2020-11-10%2015.00.08-20201110184617048.png" alt="截屏2020-11-10 15.00.08">&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>HSV/HSI&lt;/strong>: &lt;strong>hue (H)&lt;/strong>, &lt;strong>saturation (S)&lt;/strong> and &lt;strong>value(V)/intensity (I)&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Closely related to human perception (hue, colorfulness and brightness)&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2020-11-10%2017.27.38.png" alt="截屏2020-11-10 17.27.38">&lt;/p>
&lt;ul>
&lt;li>Hue: &amp;ldquo;color&amp;rdquo;&lt;/li>
&lt;li>Saturation: how &amp;ldquo;pure&amp;rdquo; the color is&lt;/li>
&lt;li>Value: &amp;ldquo;lightness&amp;rdquo;&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Class Y spaces&lt;/strong>: YCbCr (Digital Video), YIQ (NTSC), YUV (PAL)&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Y channel contains brightness, other two channels store chrominance (U=B-Y, V=R-Y)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Conversion from RGB to Yxx is a linear transformation&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2020-11-10%2018.18.27.png" alt="截屏2020-11-10 18.18.27">&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Perceptually uniform spaces&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Perceived color difference is proportional to the difference in color values&lt;/li>
&lt;li>Euclidean distance can be used for color comparison&lt;/li>
&lt;/ul>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2020-11-10%2018.19.07.png" alt="截屏2020-11-10 18.19.07">&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Chromatic Color Spaces&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Two color channels containing chrominance (color) information&lt;/p>
&lt;ul>
&lt;li>HS (taken from HSV)&lt;/li>
&lt;li>UV (taken from YUV)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Normalized rg from RGB:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>r = R / (R+G+B)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>g = G / (R+G+B)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>b = B / (R+G+B)&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Sometimes it is argued that chromatic skin color models are more robust&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h4 id="problems">Problems&lt;/h4>
&lt;ul>
&lt;li>Reflected color depends on spectrum of the light source (and properties of the object / surface)&lt;/li>
&lt;li>If the light source / illumination changes, the reflected color signal changes!!! 🤪&lt;/li>
&lt;/ul>
&lt;h2 id="how-to-model-skin-color">How to model skin color?&lt;/h2>
&lt;ul>
&lt;li>
&lt;p>&lt;a href="#histogram-as-skin-color-model">Non-parametric models: typically histograms&lt;/a>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;a href="#parametric-models">Parametric models&lt;/a>&lt;/p>
&lt;ul>
&lt;li>Gaussian Model&lt;/li>
&lt;li>Gaussian Mixture Model&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Or just learn decision boundaries between classes (&lt;a href="#discriminative-models--classifiers">discriminative model&lt;/a>)&lt;/p>
&lt;ul>
&lt;li>ANN, SVM, &amp;hellip;&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="histogram-as-skin-color-model">Histogram as skin color model&lt;/h3>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2020-11-10%2018.34.57.png" alt="截屏2020-11-10 18.34.57">&lt;/p>
&lt;ul>
&lt;li>👍 Advantages: Works very well in practice&lt;/li>
&lt;li>👎 Disadvantages
&lt;ul>
&lt;li>Memory size quickly gets high&lt;/li>
&lt;li>A large number of labelled skin and non-skin samples is needed!&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h4 id="histogram-backprojection">Histogram Backprojection&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>The simplest (and fastest) way to utilize histogram information&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Each pixel in the backprojection is set to the value of the (skin-color) histogram bin indexed by the color of the respective pixel&lt;/p>
&lt;ul>
&lt;li>A color $x$ is considered as skin color if $H\_{+}(x) > \theta$&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>E.g.&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-07-22%2022.20.33.png" alt="截屏2021-07-22 22.20.33">&lt;/p>
&lt;/li>
&lt;/ul>
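&lt;p>A minimal sketch of backprojection over quantized color bins; the skin-color histogram values here are made up for illustration:&lt;/p>

```python
def backproject(image, skin_hist, threshold=0.2):
    """Histogram backprojection: each pixel gets the value of the skin-color
    histogram bin indexed by its color; thresholding yields a binary skin mask."""
    return [[1 if skin_hist.get(px, 0.0) > threshold else 0 for px in row]
            for row in image]

# toy normalized skin-color histogram over quantized color bins (made up)
skin_hist = {10: 0.5, 11: 0.3, 12: 0.15, 200: 0.05}
image = [[10, 11, 200],
         [12, 10, 255]]
print(backproject(image, skin_hist))  # [[1, 1, 0], [0, 1, 0]]
```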
&lt;h4 id="histogram-matching">Histogram Matching&lt;/h4>
&lt;ul>
&lt;li>Backprojection
&lt;ul>
&lt;li>works well when the color distribution of the target is monomodal&lt;/li>
&lt;li>is not optimal when the target is multi-colored &amp;#x1f622;&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>🔧 Solution: Build a histogram of the image within the search window, and compare it to the target histogram.
&lt;ul>
&lt;li>distance metrics for histograms, e.g.:
&lt;ul>
&lt;li>
&lt;p>Bhattacharyya distance&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Histogram intersection&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Earth mover’s distance, &amp;hellip;&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
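&lt;p>Histogram intersection, one of the similarity measures listed above, sketched for normalized histograms (the bin values are hypothetical):&lt;/p>

```python
def histogram_intersection(h1, h2):
    """Similarity of two normalized histograms: sum of bin-wise minima.
    1.0 means identical histograms, 0.0 means no overlap at all."""
    bins = set(h1).union(h2)
    return sum(min(h1.get(b, 0.0), h2.get(b, 0.0)) for b in bins)

target = {0: 0.5, 1: 0.3, 2: 0.2}  # hypothetical target histogram
window = {0: 0.4, 1: 0.4, 2: 0.2}  # histogram of the current search window
print(round(histogram_intersection(target, window), 2))  # 0.9
```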
&lt;h4 id="histogram-backprojection-vs-matching">Histogram Backprojection vs. Matching&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>Histogram Backprojection&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Compares color of a single pixel with color model&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Fast and simple&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Can only cope well with mono-modal distributions&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Sufficient for skin-color classification&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Histogram Matching / Intersection&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Compares color histogram of image patch with color model&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Better performance&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Can cope with multi-modal distributions&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Computationally expensive&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="parametric-models">Parametric models&lt;/h3>
&lt;h4 id="gaussian-density-models">Gaussian Density Models&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>Gaussian Densities&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Assume that the distribution of skin colors p(x) has a parametric functional form&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Most common function: Gaussian function $\mathrm{G}(\mathbf{x} ; \mu, \mathbf{C})$
&lt;/p>
$$
p(x | \text{skin})=G(x ; \mu, C)=\frac{1}{(2 \pi)^{d / 2}|C|^{1 / 2} }\exp \left\\{-1 / 2(x-\mu)^{\top} C^{-1}(x-\mu)\right\\}
$$
&lt;ul>
&lt;li>Mean $\mu$ and covariance matrix $C$ are estimated from a training set of skin colors $S = \{x\_1, x\_2, \ldots, x\_N\}$:
&lt;ul>
&lt;li>$\mu = E\{x\}$&lt;/li>
&lt;li>$C = E\{(\boldsymbol{x}-\mu)(\boldsymbol{x}-\mu)^T\}$&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>A color is considered as skin color if&lt;/p>
&lt;ul>
&lt;li>$p(x|\text{skin}) > \theta$&lt;/li>
&lt;li>$p(x|\text{skin}) > p(x|\text{non-skin})$&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
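&lt;p>A sketch of the Gaussian model in a 2-D chromatic space (e.g. normalized rg); the mean, covariance, and threshold $\theta$ are made-up values, not a trained model:&lt;/p>

```python
import math

def skin_likelihood(x, mu, cov):
    """2-D Gaussian skin-color likelihood p(x|skin) = G(x; mu, C).
    cov is a 2x2 covariance matrix [[c00, c01], [c10, c11]]."""
    det = cov[0][0] * cov[1][1] - cov[0][1] * cov[1][0]
    inv = [[cov[1][1] / det, -cov[0][1] / det],
           [-cov[1][0] / det, cov[0][0] / det]]
    d = [x[0] - mu[0], x[1] - mu[1]]
    # squared Mahalanobis distance d^T C^-1 d
    md2 = (d[0] * (inv[0][0] * d[0] + inv[0][1] * d[1])
           + d[1] * (inv[1][0] * d[0] + inv[1][1] * d[1]))
    return math.exp(-0.5 * md2) / (2 * math.pi * math.sqrt(det))

# hypothetical skin model in normalized rg coordinates
mu = (0.45, 0.30)
cov = [[0.002, 0.0], [0.0, 0.002]]
theta = 1.0
print(skin_likelihood((0.44, 0.31), mu, cov) > theta)  # True: close to the mean
```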
&lt;h4 id="mixture-of-gaussian-models">Mixture of Gaussian Models&lt;/h4>
$$
p(x)=\sum\_{i=1}^{K} \pi\_{i} G\left(x, \mu\_{i}, C\_{i}\right)
$$
&lt;ul>
&lt;li>
&lt;p>Parameter set $\Phi$ can be estimated using the &lt;strong>EM&lt;/strong> algorithm&lt;/p>
&lt;ul>
&lt;li>Iteratively changes parameters so as to maximize the log-likelihood of the training set:
$$
L=\log \prod\_{i=1}^{N} p\left(x\_{i} \mid \Phi\right)
$$&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>A color is considered as skin color if&lt;/p>
&lt;ul>
&lt;li>$p(x|\text{skin}) > \theta$&lt;/li>
&lt;li>$p(x|\text{skin}) > p(x|\text{non-skin})$&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h4 id="bayes-classifier">Bayes Classifier&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>Skin Classification using &lt;strong>Bayes Decision Rule&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Minimum cost decision rule&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Classify pixel to skin class if $P(\text{Skin} | x)>P(\text{Non-Skin} | x)$&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Decision Rule:
&lt;/p>
$$
\frac{p(\mathbf{x} \mid \text {Skin})}{p(\mathbf{x} \mid \text {Non-Skin})} \geq \frac{P(\text {Non-Skin})}{P(\text {Skin})}
$$
&lt;/li>
&lt;li>
&lt;p>The class-conditional densities $p(x|\omega\_i)$ can be estimated from the corresponding histograms:
&lt;/p>
$$
p\left(x \mid \omega\_{i}\right)=h\_{i}(x) / \sum\_{x} h\_{i}(x)
$$
&lt;ul>
&lt;li>$h\_i(x)$: count of pixels from class $\omega\_{i}$ that have value $x$&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="discriminative-models--classifiers">Discriminative Models / Classifiers&lt;/h3>
&lt;ul>
&lt;li>Artificial Neural Networks&lt;/li>
&lt;li>Support Vector Machine&lt;/li>
&lt;/ul>
&lt;h2 id="performance-measures">Performance Measures&lt;/h2>
&lt;h3 id="for-classification">For classification&lt;/h3>
&lt;p>When comparing recognition hypotheses with ground-truth annotations, one has to consider four cases:&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/confusion-matrix.png" alt="Measuring Performance: The Confusion Matrix – Glass Box" style="zoom: 40%;" />
&lt;blockquote>
&lt;p>More see: &lt;a href="https://haobin-tan.netlify.app/docs/ai/machine-learning/ml-fundamentals/evaluation/">Evaluation&lt;/a>&lt;/p>
&lt;/blockquote>
&lt;h4 id="roc-receiver-operating-characteristic">ROC (Receiver Operating Characteristic)&lt;/h4>
&lt;ul>
&lt;li>Used for the task of classification&lt;/li>
&lt;li>Measures the trade-off between true positive rate and false positive rate&lt;/li>
&lt;/ul>
$$
\begin{array}{l}
\text { true positive rate }=\frac{\mathrm{TP}}{\mathrm{Pos}}=\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}} \\\\
\text { false positive rate }=\frac{\mathrm{FP}}{\mathrm{Neg}}=\frac{\mathrm{FP}}{\mathrm{FP}+\mathrm{TN}}
\end{array}
$$
&lt;ul>
&lt;li>
&lt;p>Each prediction hypothesis generally has an associated probability value or score&lt;/p>
&lt;/li>
&lt;li>
&lt;p>The performance values can therefore be plotted into a graph, using each possible score as a threshold&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Example:&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2020-11-12%2023.27.18.png" alt="截屏2020-11-12 23.27.18">&lt;/p>
&lt;/li>
&lt;/ul>
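&lt;p>Computing the ROC points by sweeping every score as a threshold, on toy scores and labels:&lt;/p>

```python
def roc_points(scores, labels):
    """Sweep each distinct score as a threshold and compute (FPR, TPR) pairs.
    labels: 1 = positive, 0 = negative; predict positive if score >= threshold."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = []
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / neg, tp / pos))
    return points

scores = [0.9, 0.8, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0]
print(roc_points(scores, labels))  # the curve ends at (1.0, 1.0)
```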
&lt;h3 id="skin-color-analysis-and-comparison">Skin-color: Analysis and Comparison&lt;/h3>
&lt;p>Conclusions &lt;sup id="fnref:1">&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref">1&lt;/a>&lt;/sup>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Bayesian approach and MLP worked best&lt;/p>
&lt;ul>
&lt;li>Bayesian approach needs much more memory&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Approach is largely unaffected by choice of color space, but&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Results degraded when only chrominance channels were used&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h2 id="from-skin-colored-pixels-to-faces">From Skin-Colored Pixels to Faces&lt;/h2>
&lt;ul>
&lt;li>
&lt;p>Skin-colored pixels need to be grouped into object representations&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-11-13%2014.56.21.png" alt="截屏2020-11-13 14.56.21" style="zoom:80%;" />
&lt;/li>
&lt;li>
&lt;p>🔴 Problems:&lt;/p>
&lt;ul>
&lt;li>skin-colored background,&lt;/li>
&lt;li>further skin-colored body parts (hands, arms, &amp;hellip;),&lt;/li>
&lt;li>Noise, &amp;hellip;&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="perceptual-grouping">Perceptual Grouping&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Morphological Operators&lt;/strong>: Operators performing an action on shapes where the input and output is a binary image.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Threshold each pixel&amp;rsquo;s skin affiliation &amp;ndash;&amp;gt; Binary Image&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2020-11-13%2014.58.11.png" alt="截屏2020-11-13 14.58.11">&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Morphological Erosion&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;em>Remove&lt;/em> pixels from edges of objects&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Set pixel value to &lt;strong>min&lt;/strong> value of surrounding pixels&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2020-11-13%2015.00.53.png" alt="截屏2020-11-13 15.00.53">&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Morphological Dilatation&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;em>Add&lt;/em> pixels to edges of objects&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Set pixel value to &lt;strong>max&lt;/strong> value of surrounding pixels&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2020-11-13%2015.41.11.png" alt="截屏2020-11-13 15.41.11">&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Morphological Opening&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Apply erosion, then dilatation&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2020-11-13%2015.42.38.png" alt="截屏2020-11-13 15.42.38">&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Goal:&lt;/p>
&lt;ul>
&lt;li>Smooth outline&lt;/li>
&lt;li>Open small bridges&lt;/li>
&lt;li>Eliminate outliers&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Morphological Closing&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Apply dilatation, then erosion&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2020-11-13%2015.45.25.png" alt="截屏2020-11-13 15.45.25">&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Goal:&lt;/p>
&lt;ul>
&lt;li>Smooth inner edges&lt;/li>
&lt;li>Connect small distances&lt;/li>
&lt;li>Fill unwanted holes&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Apply morphological closing then morphological opening&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Resulting image is reduced to connected regions of skin color (blobs)&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2020-11-13%2015.59.57.png" alt="截屏2020-11-13 15.59.57">&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
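&lt;p>The four operators can be sketched as min/max filters over a 3x3 neighborhood in plain NumPy (a simplified version; real implementations allow arbitrary structuring elements):&lt;/p>

```python
import numpy as np

def erode(img):
    """Morphological erosion: set each pixel to the min of its 3x3 neighborhood."""
    padded = np.pad(img, 1, mode="edge")
    out = padded[1:-1, 1:-1].copy()
    for dy in (0, 1, 2):
        for dx in (0, 1, 2):
            out = np.minimum(out, padded[dy:dy + img.shape[0], dx:dx + img.shape[1]])
    return out

def dilate(img):
    """Morphological dilatation: set each pixel to the max of its 3x3 neighborhood."""
    padded = np.pad(img, 1, mode="edge")
    out = padded[1:-1, 1:-1].copy()
    for dy in (0, 1, 2):
        for dx in (0, 1, 2):
            out = np.maximum(out, padded[dy:dy + img.shape[0], dx:dx + img.shape[1]])
    return out

def opening(img):
    """Erosion, then dilatation: eliminates outliers and opens small bridges."""
    return dilate(erode(img))

def closing(img):
    """Dilatation, then erosion: fills small holes and connects small gaps."""
    return erode(dilate(img))
```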
&lt;h3 id="from-skin-blobs-to-faces">From Skin Blobs To Faces&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>Goal: align bounding box around face candidate&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2020-11-13%2016.01.23.png" alt="截屏2020-11-13 16.01.23">&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Important for:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Face Recognition&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Head Pose Estimation&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Different approaches:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Choose cluster with biggest size&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Ellipse fitting (approximate face region by ellipse)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Heuristics to distinguish between different skin clusters&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Use temporal information (tracking)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Facial Feature Detection&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&amp;hellip;&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
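&lt;p>The &amp;ldquo;choose cluster with biggest size&amp;rdquo; approach can be sketched as flood-fill connected-component labeling on the binary skin mask (4-connectivity, for illustration only):&lt;/p>

```python
import numpy as np

def biggest_blob(binary):
    """Label 4-connected components by flood fill; return a mask of the largest one."""
    h, w = binary.shape
    labels = np.zeros((h, w), dtype=int)
    n_blobs = 0
    for sy in range(h):
        for sx in range(w):
            if binary[sy, sx] and labels[sy, sx] == 0:
                n_blobs += 1
                labels[sy, sx] = n_blobs
                stack = [(sy, sx)]
                while stack:
                    y, x = stack.pop()
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                        if ny in range(h) and nx in range(w) \
                                and binary[ny, nx] and labels[ny, nx] == 0:
                            labels[ny, nx] = n_blobs
                            stack.append((ny, nx))
    if n_blobs == 0:
        return np.zeros_like(binary)
    sizes = [(labels == i).sum() for i in range(1, n_blobs + 1)]
    return labels == 1 + int(np.argmax(sizes))
```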
&lt;div class="footnotes" role="doc-endnotes">
&lt;hr>
&lt;ol>
&lt;li id="fn:1">
&lt;p>S. L. Phung, A. Bouzerdoum and D. Chai, &amp;ldquo;Skin segmentation using color pixel classification: analysis and comparison,&amp;rdquo; in IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 1, pp. 148-154, Jan. 2005, doi: 10.1109/TPAMI.2005.17.&amp;#160;&lt;a href="#fnref:1" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;/ol>
&lt;/div></description></item><item><title>Face Detection: Neural-Network-Based</title><link>https://haobin-tan.netlify.app/docs/ai/computer-vision/cv-lecture/04-face-detection-ann/</link><pubDate>Fri, 13 Nov 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/computer-vision/cv-lecture/04-face-detection-ann/</guid><description>&lt;h2 id="motivation">Motivation&lt;/h2>
&lt;ul>
&lt;li>Idea: Use a search-window to scan over an image&lt;/li>
&lt;li>Train a classifier to decide whether the search window contains a face or not&lt;/li>
&lt;/ul>
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-100" >&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%e6%88%aa%e5%b1%8f2020-11-13%2016.16.57.png" alt="截屏2020-11-13 16.16.57" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h2 id="detection">Detection&lt;/h2>
&lt;h3 id="simple-neuron-model">Simple neuron model&lt;/h3>
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-100" >&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%e6%88%aa%e5%b1%8f2020-11-13%2016.20.47.png" alt="截屏2020-11-13 16.20.47" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h3 id="topologies">Topologies&lt;/h3>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-11-13%2016.21.15.png" alt="截屏2020-11-13 16.21.15" style="zoom:67%;" />
&lt;h3 id="parameters">Parameters&lt;/h3>
&lt;p>Adjustable Parameters are&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Connection weights (to be learned)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Activation function (fixed)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Number of layers (fixed)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Number of neurons per layer (fixed)&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h3 id="training">Training&lt;/h3>
&lt;p>Backpropagation with gradient descent&lt;/p>
&lt;h2 id="neural-network-based-face-detection1">Neural Network Based Face Detection&lt;sup id="fnref:1">&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref">1&lt;/a>&lt;/sup>&lt;/h2>
&lt;ul>
&lt;li>Idea: Use an artificial neural network to detect upright frontal faces
&lt;ul>
&lt;li>
&lt;p>Network receives as input a 20x20 pixel region of an image&lt;/p>
&lt;/li>
&lt;li>
&lt;p>output ranges from -1 (no face present) to +1 (face present)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>the neural network „face-filter“ is applied at every location in the image&lt;/p>
&lt;/li>
&lt;li>
&lt;p>to detect faces with different sizes, the input image is repeatedly scaled down&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="network-topology">Network Topology&lt;/h3>
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-100" >&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%e6%88%aa%e5%b1%8f2020-11-13%2016.28.33.png" alt="截屏2020-11-13 16.28.33" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;ul>
&lt;li>20x20 pixel input retina&lt;/li>
&lt;li>4 types of receptive hidden fields&lt;/li>
&lt;li>One real-valued output&lt;/li>
&lt;/ul>
&lt;h3 id="system-overview">System Overview&lt;/h3>
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-100" >&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%e6%88%aa%e5%b1%8f2020-11-13%2016.29.19.png" alt="截屏2020-11-13 16.29.19" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h3 id="network-training">Network Training&lt;/h3>
&lt;h4 id="training-set">Training Set&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>1050 normalized face images&lt;/p>
&lt;/li>
&lt;li>
&lt;p>15 face images generated from each original face image by rotating and scaling&lt;/p>
&lt;/li>
&lt;li>
&lt;p>1000 randomly chosen non-face images&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h4 id="preprocessing">Preprocessing&lt;/h4>
&lt;ul>
&lt;li>correct for different lighting conditions (overall brightness, shadows)&lt;/li>
&lt;li>rescale images to fixed size&lt;/li>
&lt;/ul>
&lt;h4 id="histogram-equalization">Histogram equalization&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>Defines a mapping of gray levels $p$ into gray levels $q$ such that the distribution of $q$ is close to being uniform&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Stretches contrast (expands the range of gray levels)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Transforms different input images so that they have similar intensity distributions (thus reducing the effect of different illumination)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Example&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-11-13%2016.32.18.png" alt="截屏2020-11-13 16.32.18" style="zoom:67%;" />
&lt;/li>
&lt;li>
&lt;p>Algorithm&lt;/p>
&lt;ul>
&lt;li>
&lt;p>The probability of an occurrence of a pixel of level $i$ in the image:
&lt;/p>
$$
p\left(x\_{i}\right)=\frac{n\_{i}}{n}, \qquad i \in 0, \ldots, L-1
$$
&lt;ul>
&lt;li>$L$: number of gray levels&lt;/li>
&lt;li>$n\_i$: number of occurences of gray level $i$&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Define $c$ as the cumulative distribution function:
&lt;/p>
$$
c(i)=\sum\_{j=0}^{i} p\left(x\_{j}\right)
$$
&lt;/li>
&lt;li>
&lt;p>A transformation of the form
&lt;/p>
$$
y\_i = T(x\_i) = c(i), \qquad y\_i \in [0, 1]
$$
&lt;p>
will produce a level $y$ for each level $x$ in the original image, such that the cumulative probability function of $y$ is linearized across the value range. Finally, the values are rescaled to the output gray-level range:
&lt;/p>
$$
y\_{i}^{\prime}=y\_{i} \cdot(\max -\min )+\min
$$
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
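&lt;p>The algorithm above translates almost directly into NumPy (a minimal sketch for single-channel images):&lt;/p>

```python
import numpy as np

def equalize(img, levels=256):
    """Histogram equalization via the cumulative distribution c(i)."""
    n = img.size
    p = np.bincount(img.ravel(), minlength=levels) / n      # p(x_i) = n_i / n
    c = np.cumsum(p)                                        # c(i)
    y = c[img]                                              # y_i = T(x_i) = c(i), in [0, 1]
    lo, hi = 0, levels - 1
    return np.round(y * (hi - lo) + lo).astype(img.dtype)   # y'_i = y_i * (max - min) + min
```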
&lt;h4 id="training-procedure">Training Procedure&lt;/h4>
&lt;ol>
&lt;li>
&lt;p>Randomly choose 1000 non-face images&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Train network to produce 1 for faces, -1 for non-faces&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Run network on images containing no faces. Collect subimages in which network incorrectly identifies a face (output &amp;gt; 0)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Select up to 250 of these „false positives“ at random and add them to the training set as negative examples&lt;/p>
&lt;/li>
&lt;/ol>
&lt;h3 id="neural-network-based-face-filter">Neural Network Based Face Filter&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>Output of ANN defines a filter for faces&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Search&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Scan input image with search window, apply ANN to search window&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Input image needs to be rescaled in order to detect faces with different size&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Output needs to be post-processed&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Noise removal&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Merging overlapping detections&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Speed up can be achieved&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Increase step size&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Make ANN more flexible to translation&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Hierarchical, pyramidal search&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="localization-and-ground-truth">Localization and Ground-Truth&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>For localization, the test data is mostly annotated with ground-truth bounding boxes&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Comparing hypotheses to Ground-Truth&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Overlap
&lt;/p>
$$
O = \frac{\text{GT } \cap \text{ DET}}{\text{GT } \cup \text{ DET}}
$$
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-100" >&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%e6%88%aa%e5%b1%8f2020-11-13%2016.43.11.png" alt="截屏2020-11-13 16.43.11" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;blockquote>
&lt;p>Also called &lt;strong>Intersection over Union (IoU)&lt;/strong>&lt;/p>
&lt;/blockquote>
&lt;/li>
&lt;li>
&lt;p>Often used as threshold: Overlap &amp;gt; 50%&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
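&lt;p>For axis-aligned boxes given as corner coordinates $(x\_1, y\_1, x\_2, y\_2)$, the overlap can be computed as:&lt;/p>

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width/height of the intersection rectangle (0 if the boxes do not overlap)
    iw = max(0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union else 0.0
```

&lt;p>E.g. iou((0, 0, 2, 2), (1, 1, 3, 3)) gives $1/7$, well below the 50% threshold.&lt;/p>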
&lt;div class="footnotes" role="doc-endnotes">
&lt;hr>
&lt;ol>
&lt;li id="fn:1">
&lt;p>&lt;em>Neural Network Based Face Detection, by Henry A. Rowley, Shumeet Baluja, and Takeo Kanade. IEEE Transactions on Pattern Analysis and Machine Intelligence, volume 20, number 1, pages 23-38, January 1998.&lt;/em>&amp;#160;&lt;a href="#fnref:1" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;/ol>
&lt;/div></description></item><item><title>Face Recognition: Traditional Approaches</title><link>https://haobin-tan.netlify.app/docs/ai/computer-vision/cv-lecture/05-face_recognition-traditional/</link><pubDate>Thu, 04 Feb 2021 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/computer-vision/cv-lecture/05-face_recognition-traditional/</guid><description>&lt;h2 id="face-recognition-for-human-computer-interaction-hci">Face Recognition for Human-Computer Interaction (HCI)&lt;/h2>
&lt;h3 id="main-problem">Main Problem&lt;/h3>
&lt;blockquote>
&lt;p>The variations between the images of the same face due to illumination and viewing direction are almost always larger than image variations due to change in face identity.&lt;/p>
&lt;p>&amp;ndash; Moses, Adini, Ullman, ECCV‘94&lt;/p>
&lt;/blockquote>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-04%2023.57.03.png" alt="截屏2021-02-04 23.57.03">&lt;/p>
&lt;h3 id="closed-set-vs-open-set-identification">Closed Set vs. Open Set Identification&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Closed-Set Identification&lt;/strong>
&lt;ul>
&lt;li>The system reports which person from the gallery is shown on the test image: Who is this person?&lt;/li>
&lt;li>Performance metric: Correct identification rate&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Open-Set Identification&lt;/strong>
&lt;ul>
&lt;li>The system first decides whether the person in the test image is known or unknown. If the person is known, the system reports who it is.&lt;/li>
&lt;li>Performance metric
&lt;ul>
&lt;li>&lt;strong>False accept&lt;/strong>: The invalid identity is accepted as one of the individuals in the database.&lt;/li>
&lt;li>&lt;strong>False reject&lt;/strong>: An individual is rejected even though he/she is present in the database.&lt;/li>
&lt;li>&lt;strong>False classify&lt;/strong>: An individual in the database is correctly accepted but misclassified as one of the other individuals in the training data&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="authenticationverification">Authentication/Verification&lt;/h3>
&lt;p>A person claims to be a particular member. The system decides whether the test image and the training image show the same person: Is the person who they claim to be?&lt;/p>
&lt;p>Performance metric:&lt;/p>
&lt;ul>
&lt;li>False Reject Rate (FRR): Rate of rejecting a valid identity&lt;/li>
&lt;li>False Accept Rate (FAR): Rate of incorrectly accepting an invalid identity.&lt;/li>
&lt;/ul>
&lt;h2 id="feature-based-geometrical-approaches">Feature-based (Geometrical) approaches&lt;/h2>
&lt;p>&amp;ldquo;Face Recognition: Features versus Templates&amp;rdquo; &lt;sup id="fnref:1">&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref">1&lt;/a>&lt;/sup>&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-05 00.12.27.png" alt="截屏2021-02-05 00.12.27" style="zoom:67%;" />
&lt;ul>
&lt;li>Eyebrow thickness and vertical position at the eye center position&lt;/li>
&lt;li>A coarse description of the left eyebrow‘s arches&lt;/li>
&lt;li>Nose vertical position and width&lt;/li>
&lt;li>Mouth vertical position, width, height upper and lower lips&lt;/li>
&lt;li>Eleven radii describing the chin shape&lt;/li>
&lt;li>Face width at nose position&lt;/li>
&lt;li>Face width halfway between nose tip and eyes&lt;/li>
&lt;/ul>
&lt;h3 id="classification">Classification&lt;/h3>
&lt;p>&lt;strong>Nearest neighbor classifier&lt;/strong> with &lt;strong>Mahalanobis distance&lt;/strong> as the distance metric:
&lt;/p>
$$
\Delta_{j}(x)=\left(x-m_{j}\right)^{T} \Sigma^{-1}\left(x-m_{j}\right)
$$
&lt;ul>
&lt;li>$x$: input face image&lt;/li>
&lt;li>$m\_j$: average vector representing the $j$-th person&lt;/li>
&lt;li>$\Sigma$: Covariance matrix&lt;/li>
&lt;/ul>
&lt;p>Different people are characterized only by their average feature vector.&lt;/p>
&lt;p>The distribution is common and estimated by using all the examples in the training set.&lt;/p>
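&lt;p>A minimal sketch of this nearest-neighbor classifier (the means and covariance below are toy values, not from the paper):&lt;/p>

```python
import numpy as np

def classify(x, means, cov):
    """Return the index j minimizing the Mahalanobis distance (x - m_j)^T Sigma^-1 (x - m_j)."""
    inv = np.linalg.inv(cov)
    dists = [(x - m) @ inv @ (x - m) for m in means]
    return int(np.argmin(dists))

# Toy example: two persons' average feature vectors, shared identity covariance
means = [np.array([0.0, 0.0]), np.array([10.0, 0.0])]
cov = np.eye(2)
```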
&lt;h2 id="appearance-based-approaches">Appearance-based approaches&lt;/h2>
&lt;p>Can be either&lt;/p>
&lt;ul>
&lt;li>&lt;strong>&lt;a href="#holistic-appearance-based-approaches">holistic&lt;/a>&lt;/strong> (process the whole face as the input), or&lt;/li>
&lt;li>&lt;a href="#local-appearance-based-approach">&lt;strong>local / fiducial&lt;/strong>&lt;/a> (process facial features, such as eyes, mouth, etc. seperately)&lt;/li>
&lt;/ul>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-08%2010.28.36.png" alt="截屏2021-02-08 10.28.36">&lt;/p>
&lt;p>Processing steps: align faces with facial landmarks&lt;/p>
&lt;ul>
&lt;li>Use manually labeled or automatically detected eye centers&lt;/li>
&lt;li>Normalize face images to a common coordinate system, removing translation, rotation, and scaling factors&lt;/li>
&lt;li>Crop off unnecessary background&lt;/li>
&lt;/ul>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-08%2010.30.38.png" alt="截屏2021-02-08 10.30.38">&lt;/p>
&lt;h2 id="holistic-appearance-based-approaches">Holistic appearance-based approaches&lt;/h2>
&lt;h3 id="eigenfaces">Eigenfaces&lt;/h3>
&lt;h4 id="-idea">💡 Idea&lt;/h4>
&lt;p>A face image defines a point in the high dimensional image space.&lt;/p>
&lt;p>Different face images share a number of similarities with each other&lt;/p>
&lt;ul>
&lt;li>They can be described by a relatively low dimensional subspace&lt;/li>
&lt;li>Project the face images into an appropriately chosen subspace and perform classification by similarity computation (distance, angle)
&lt;ul>
&lt;li>Dimensionality reduction procedure used here is called &lt;mark>&lt;strong>Karhunen-Loève transformation&lt;/strong>&lt;/mark> or &lt;mark>&lt;strong>principal component analysis (PCA)&lt;/strong>&lt;/mark>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h4 id="objective">Objective&lt;/h4>
&lt;p>Find the vectors that best account for the distribution of face images within the entire image space&lt;/p>
&lt;h4 id="pca">PCA&lt;/h4>
&lt;blockquote>
&lt;p>For more details see: &lt;a href="https://haobin-tan.netlify.app/docs/ai/machine-learning/unsupervised/pca/">Principle Component Analysis (PCA)&lt;/a>&lt;/p>
&lt;/blockquote>
&lt;ul>
&lt;li>Find direction vectors so as to minimize the average projection error&lt;/li>
&lt;li>Project on the linear subspace spanned by these vectors&lt;/li>
&lt;li>Use covariance matrix to find these direction vectors&lt;/li>
&lt;li>Project on the largest K direction vectors to reduce dimensionality&lt;/li>
&lt;/ul>
&lt;p>PCA for eigenfaces:&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-09%2011.11.40.png" alt="截屏2021-02-09 11.11.40">
&lt;/p>
$$
\begin{array}{l}
Y=\left[y\_{1}, y\_{2}, y\_{3}, \ldots, y\_{K}\right] \\\\
m=\frac{1}{K}\sum y \\\\
C=(Y-m)(Y-m)^{T} \\\\
D=U^{T} C U \\\\
\Omega=U^{\top}(y-m)
\end{array}
$$
&lt;p>
where&lt;/p>
&lt;ul>
&lt;li>
&lt;p>$y$: Face image&lt;/p>
&lt;/li>
&lt;li>
&lt;p>$Y$: Face matrix&lt;/p>
&lt;/li>
&lt;li>
&lt;p>$m$: Mean face&lt;/p>
&lt;/li>
&lt;li>
&lt;p>$C$: Covariance matrix&lt;/p>
&lt;/li>
&lt;li>
&lt;p>$D$: Eigenvalues&lt;/p>
&lt;/li>
&lt;li>
&lt;p>$U$: Eigenvectors&lt;/p>
&lt;/li>
&lt;li>
&lt;p>$\Omega$: Representation coefficients&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h4 id="training">Training&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>Acquire initial set of face images (training set):
&lt;/p>
$$
Y = [y\_1, y\_2, \dots, y\_K]
$$
&lt;/li>
&lt;li>
&lt;p>Calculate the eigenfaces/eigenvectors from the training set, keeping only the $M$ images/vectors corresponding to the highest eigenvalues
&lt;/p>
$$
U = (u\_1, u\_2, \dots, u\_M)
$$
&lt;/li>
&lt;li>
&lt;p>Calculate representation of each known individual $k$ in face space
&lt;/p>
$$
\Omega\_k = U^T(y\_k - m)
$$
&lt;/li>
&lt;/ul>
&lt;h4 id="testing">Testing&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>Project input new image &lt;em>y&lt;/em> into face space
&lt;/p>
$$
\Omega = U^T(y - m)
$$
&lt;/li>
&lt;li>
&lt;p>Find most likely candidate class $k$ by distance computation
&lt;/p>
$$
\epsilon\_k = \\|\Omega - \Omega\_k\\| \quad \text{for all } \Omega\_k
$$
&lt;/li>
&lt;/ul>
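&lt;p>The training and testing steps can be sketched end-to-end in NumPy; random vectors stand in for the flattened face images:&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(0)
K, d, M = 6, 16, 3                       # 6 "faces" of dimension 16, keep 3 eigenfaces
Y = rng.normal(size=(d, K))              # columns are face images y_k (toy data)

# Training
m = Y.mean(axis=1, keepdims=True)        # mean face
C = (Y - m) @ (Y - m).T                  # covariance (scatter) matrix
eigvals, eigvecs = np.linalg.eigh(C)     # eigenvalues in ascending order
U = eigvecs[:, ::-1][:, :M]              # top-M eigenvectors = eigenfaces
Omega = U.T @ (Y - m)                    # representation of each known individual

# Testing: project a probe image and find the nearest known face
y = Y[:, [2]]                            # probe = image of person 2
omega = U.T @ (y - m)
k = int(np.argmin(np.linalg.norm(Omega - omega, axis=0)))
```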
&lt;h4 id="projections-onto-the-face-space">&lt;strong>Projections onto the face space&lt;/strong>&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>Principal components are called “&lt;strong>eigenfaces&lt;/strong>” and they span the “face space”.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Images can be reconstructed by their projections in face space:&lt;/p>
&lt;/li>
&lt;/ul>
$$
Y\_f = \sum\_{i=1}^{M} \omega\_i u\_i
$$
&lt;p>The appearance of faces in face space does not change a lot&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Difference of mean-adjusted image $(Y-m)$ and projection $Y\_f$ gives a measure of &lt;em>„faceness“&lt;/em>&lt;/p>
&lt;ul>
&lt;li>Distance from face space can be used to detect faces&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Different cases of projections onto face space&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-09%2011.38.26.png" alt="截屏2021-02-09 11.38.26" style="zoom:80%;" />
&lt;ul>
&lt;li>
&lt;p>Case 1: Projection of a &lt;em>known&lt;/em> individual&lt;/p>
&lt;p>$\rightarrow$ Near face space ($\epsilon &lt; \theta\_{\delta}$) and near known face $\Omega\_k$ ($\epsilon\_k &lt; \theta\_{\epsilon}$)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Case 2: Projection of an &lt;em>unkown&lt;/em> individual&lt;/p>
&lt;p>$\rightarrow$ Near face space, far from reference vectors&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Case 3 and 4: not a face (far from face space)&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h4 id="pca-for-face-matching-and-recognition">PCA for face matching and recognition&lt;/h4>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-09%2011.53.26.png" alt="截屏2021-02-09 11.53.26">&lt;/p>
&lt;ul>
&lt;li>Projects all faces onto a &lt;strong>universal&lt;/strong> eigenspace to “encode” via principal components&lt;/li>
&lt;li>Uses inverse-distance as a similarity measure $S(p,g)$ for matching &amp;amp; recognition&lt;/li>
&lt;/ul>
&lt;h4 id="problems-and-shortcomings">Problems and shortcomings&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>Eigenfaces do NOT distinguish between shape and appearance&lt;/p>
&lt;/li>
&lt;li>
&lt;p>PCA does NOT use class information&lt;/p>
&lt;ul>
&lt;li>PCA projections are optimal for reconstruction from a low-dimensional basis, but they may not be optimal from a discrimination standpoint&lt;/li>
&lt;/ul>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-09%2011.58.01.png" alt="截屏2021-02-09 11.58.01" style="zoom:67%;" />
&lt;/li>
&lt;/ul>
&lt;h3 id="fisherface">Fisherface&lt;/h3>
&lt;h4 id="linear-discriminant-analysis-lda">Linear Discriminant Analysis (LDA)&lt;/h4>
&lt;blockquote>
&lt;p>For more details about LDA, see: &lt;a href="https://haobin-tan.netlify.app/docs/ai/machine-learning/non-parametric/lda-summary/">LDA Summary&lt;/a>&lt;/p>
&lt;/blockquote>
&lt;ul>
&lt;li>A.k.a. &lt;strong>Fisher’s Linear Discriminant&lt;/strong>&lt;/li>
&lt;li>Preserves separability of classes&lt;/li>
&lt;li>Maximizes ratio of projected between-classes to projected within-class scatter&lt;/li>
&lt;/ul>
$$
W\_{\mathrm{fld}}=\arg \underset{W}{\max } \frac{\left|W^{T} S\_{B} W\right|}{\left|W^{T} S\_{W} W\right|}
$$
&lt;p>Where&lt;/p>
&lt;ul>
&lt;li>$S\_{B}=\sum\_{i=1}^{c}\left|x\_{i}\right|\left(\mu\_{i}-\mu\right)\left(\mu\_{i}-\mu\right)^{T}$: Between-class scatter
&lt;ul>
&lt;li>$c$: Number of classes&lt;/li>
&lt;li>$\mu\_i$: mean of class $X\_i$&lt;/li>
&lt;li>$|X\_i|$: number of samples of $X\_i$&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>$S\_{W}=\sum\_{i=1}^{c} \sum\_{x\_{k} \in X\_{i}}\left(x\_{k}-\mu\_{i}\right)\left(x\_{k}-\mu\_{i}\right)^{T}$: Within-class scatter&lt;/li>
&lt;/ul>
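&lt;p>A sketch of building the scatter matrices and finding the Fisher directions as eigenvectors of $S\_W^{-1} S\_B$ (toy 2D data; this assumes $S\_W$ is invertible, which is why Fisherfaces applies PCA before LDA):&lt;/p>

```python
import numpy as np

def fisher_directions(X, y, n_dirs=1):
    """Maximize |W^T S_B W| / |W^T S_W W| via the eigenvectors of S_W^{-1} S_B."""
    mu = X.mean(axis=0)
    d = X.shape[1]
    S_B, S_W = np.zeros((d, d)), np.zeros((d, d))
    for c in np.unique(y):
        Xc = X[y == c]
        diff = (Xc.mean(axis=0) - mu)[:, None]
        S_B += len(Xc) * diff @ diff.T                             # between-class scatter
        S_W += (Xc - Xc.mean(axis=0)).T @ (Xc - Xc.mean(axis=0))   # within-class scatter
    eigvals, eigvecs = np.linalg.eig(np.linalg.inv(S_W) @ S_B)
    order = np.argsort(eigvals.real)[::-1]                         # largest ratio first
    return eigvecs.real[:, order[:n_dirs]]
```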
&lt;p>&lt;strong>LDA vs. PCA&lt;/strong>&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-09%2012.25.35.png" alt="截屏2021-02-09 12.25.35" style="zoom:67%;" />
&lt;h4 id="lda-for-fisherfaces">LDA for Fisherfaces&lt;/h4>
&lt;p>Fisher’s Linear Discriminant&lt;/p>
&lt;ul>
&lt;li>projects away the within-class variation (lighting, expressions) found in training set&lt;/li>
&lt;li>preserves the separability of the classes.&lt;/li>
&lt;/ul>
&lt;h2 id="local-appearance-based-approach">Local appearance-based approach&lt;/h2>
&lt;p>Local vs Holistic approaches:&lt;/p>
&lt;ul>
&lt;li>Local variations in facial appearance (different expression, occlusion, lighting)
&lt;ul>
&lt;li>lead to modifications of the entire representation in holistic approaches,&lt;/li>
&lt;li>while in local approaches ONLY the corresponding local region is affected&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Face images exhibit different local statistics (high frequency at edges, low frequency in smooth regions). These varying statistics are easier to represent linearly with a local representation.&lt;/li>
&lt;li>Local approaches facilitate the weighting of each local region in terms of their effect on face recognition.&lt;/li>
&lt;/ul>
&lt;h3 id="modular-eigen-spaces">Modular Eigen Spaces&lt;/h3>
&lt;p>Classification using fiducial regions instead of using entire face &lt;sup id="fnref:2">&lt;a href="#fn:2" class="footnote-ref" role="doc-noteref">2&lt;/a>&lt;/sup>.&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-09%2012.59.14.png" alt="截屏2021-02-09 12.59.14">&lt;/p>
&lt;h3 id="local-pca-modular-pca">Local PCA (Modular PCA)&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>Face images are divided into $N$ smaller sub-images&lt;/p>
&lt;/li>
&lt;li>
&lt;p>PCA is applied on each of these sub-images&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-09%2013.01.08-20210723110742829.png" alt="截屏2021-02-09 13.01.08">&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Performed &lt;strong>better&lt;/strong> than global PCA on large variations of illumination and expression&lt;/p>
&lt;/li>
&lt;li>
&lt;p>No improvement under variation of pose&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h3 id="local-feature-based">Local Feature based&lt;/h3>
&lt;p>🎯 Objective: To mitigate the effect of expression, illumination, and occlusion variations by performing local analysis and by fusing the outputs of extracted local features at the feature or at the decision level.&lt;/p>
&lt;h4 id="gabor-filters">Gabor Filters&lt;/h4>
&lt;h4 id="elastic-bunch-graphs-ebg">Elastic Bunch Graphs (EBG)&lt;/h4>
&lt;h4 id="local-binary-pattern-lbp-histogram">Local Binary Pattern (LBP) Histogram&lt;/h4>
&lt;div class="footnotes" role="doc-endnotes">
&lt;hr>
&lt;ol>
&lt;li id="fn:1">
&lt;p>&lt;a href="http://cbcl.mit.edu/people/poggio/journals/brunelli-poggio-IEEE-PAMI-1993.pdf">http://cbcl.mit.edu/people/poggio/journals/brunelli-poggio-IEEE-PAMI-1993.pdf&lt;/a>&amp;#160;&lt;a href="#fnref:1" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:2">
&lt;p>Pentland, Moghaddam and Starner, &amp;ldquo;View-based and modular eigenspaces for face recognition,&amp;rdquo; &lt;em>1994 Proceedings of IEEE Conference on Computer Vision and Pattern Recognition&lt;/em>, Seattle, WA, USA, 1994, pp. 84-91, doi: 10.1109/CVPR.1994.323814.&amp;#160;&lt;a href="#fnref:2" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;/ol>
&lt;/div></description></item><item><title>Face Recognition: Features</title><link>https://haobin-tan.netlify.app/docs/ai/computer-vision/cv-lecture/06-face_recognition-features/</link><pubDate>Mon, 15 Feb 2021 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/computer-vision/cv-lecture/06-face_recognition-features/</guid><description>&lt;h2 id="local-appearance-based-face-recognition">Local Appearance-based Face Recognition&lt;/h2>
&lt;p>🎯 Objective: To mitigate the effect of expression, illumination, and occlusion variations by performing local analysis and by fusing the outputs of extracted local features at the feature or at the decision level.&lt;/p>
&lt;p>Some popular facial descriptions achieving good results&lt;/p>
&lt;ul>
&lt;li>Local Binary Pattern Histogram (LBPH)&lt;/li>
&lt;li>Gabor Feature&lt;/li>
&lt;li>Discrete Cosine Transform (DCT)&lt;/li>
&lt;li>SIFT&lt;/li>
&lt;li>etc.&lt;/li>
&lt;/ul>
&lt;h3 id="local-binary-pattern-histogram-lbph1">Local binary Pattern Histogram (LBPH)&lt;sup id="fnref:1">&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref">1&lt;/a>&lt;/sup>&lt;/h3>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-16%2011.10.39.png" alt="截屏2021-02-16 11.10.39" style="zoom:80%;" />
&lt;ul>
&lt;li>
&lt;p>Divide image into cells&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Compare each pixel to each of its neighbors&lt;/p>
&lt;ul>
&lt;li>Where a neighbor&amp;rsquo;s value is greater than or equal to the threshold value (here, the center pixel&amp;rsquo;s value), write &amp;ldquo;1&amp;rdquo;&lt;/li>
&lt;li>Otherwise, write &amp;ldquo;0&amp;rdquo;&lt;/li>
&lt;/ul>
&lt;p>$\rightarrow$ gives a binary number&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Convert binary into decimal&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Compute the histogram over the cell&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Use the histogram for classification&lt;/p>
&lt;ul>
&lt;li>SVM&lt;/li>
&lt;li>Histogram-distances&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
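&lt;p>The per-pixel code can be sketched for a single 3x3 neighborhood (using the common &amp;ldquo;greater than or equal to the center&amp;rdquo; comparison):&lt;/p>

```python
import numpy as np

def lbp_value(cell):
    """LBP code of the center pixel of a 3x3 cell: threshold the 8 neighbors
    against the center, read the bits clockwise, convert binary to decimal."""
    c = cell[1, 1]
    neighbors = [cell[0, 0], cell[0, 1], cell[0, 2], cell[1, 2],
                 cell[2, 2], cell[2, 1], cell[2, 0], cell[1, 0]]
    bits = ["1" if n >= c else "0" for n in neighbors]
    return int("".join(bits), 2)
```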
&lt;blockquote>
&lt;p>Tutorials and explanation:&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://towardsdatascience.com/face-recognition-how-lbph-works-90ec258c3d6b">Face Recognition: Understanding LBPH Algorithm&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://www.youtube.com/watch?v=h-z9-bMtd7w">how is the LBP |Local Binary Pattern| values calculated? ~ xRay Pixy&lt;/a>&lt;/li>
&lt;/ul>
&lt;/blockquote>
&lt;h3 id="high-dim-dense-local-feature-extraction">&lt;strong>High dim. dense local Feature Extraction&lt;/strong>&lt;/h3>
&lt;ul>
&lt;li>Computing features densely (e.g. on overlapping patches in many scales in the image)&lt;/li>
&lt;li>Problem: very very high dimensionality!!!&lt;/li>
&lt;li>Solution: Encode into a compact form
&lt;ul>
&lt;li>Bag of Visual Word (BoVW) model&lt;/li>
&lt;li>Fisher encoding&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h4 id="fisher-vector-encoding">Fisher Vector Encoding&lt;/h4>
&lt;ul>
&lt;li>Aggregates feature vectors into a compact representation&lt;/li>
&lt;li>Fitting a parametric generative model (e.g. Gaussian Mixture Model)&lt;/li>
&lt;li>Encode the derivative of the model&amp;rsquo;s log-likelihood w.r.t. its parameters&lt;/li>
&lt;/ul>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-16%2011.38.19.png" alt="截屏2021-02-16 11.38.19">&lt;/p>
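&lt;p>A minimal sketch of the most common Fisher-vector block, the gradient with respect to the GMM component means. It assumes the diagonal-covariance GMM (weights, means, standard deviations) has already been fitted to a descriptor corpus; full Fisher vectors typically also append variance gradients and apply power/L2 normalization, which are omitted here:&lt;/p>

```python
import numpy as np

def fisher_vector_means(X, weights, means, sigmas):
    """Gradient of the GMM log-likelihood w.r.t. the component means.
    X: (N, D) local descriptors; weights: (K,); means, sigmas: (K, D)."""
    N = X.shape[0]
    # Soft assignments (responsibilities) gamma_nk of descriptors to components.
    log_p = (np.log(weights)[None, :]
             - 0.5 * np.sum(np.log(2 * np.pi * sigmas**2)[None, :, :]
                            + ((X[:, None, :] - means[None, :, :])
                               / sigmas[None, :, :])**2, axis=2))
    gamma = np.exp(log_p - log_p.max(axis=1, keepdims=True))
    gamma /= gamma.sum(axis=1, keepdims=True)
    # Aggregate normalised residuals per component into one (K*D)-dim vector.
    fv = (gamma[:, :, None] * (X[:, None, :] - means[None, :, :])
          / sigmas[None, :, :]).sum(axis=0)
    fv /= (N * np.sqrt(weights)[:, None])
    return fv.ravel()
```

Note the output dimension is fixed (K·D) regardless of how many descriptors were extracted, which is exactly the compact encoding the slide describes.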
&lt;h2 id="face-recognition-across-pose-alignment">Face recognition across pose (Alignment)&lt;/h2>
&lt;p>Problem&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Different view-point / head orientation&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-16%2011.40.44.png" alt="截屏2021-02-16 11.40.44" style="zoom:80%;" />
&lt;/li>
&lt;li>
&lt;p>Recognition results degrade when images of different head orientations have to be matched 😭&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Major directions to address the face recognition across pose problem&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Geometric pose normalization (image affine warps)&lt;/li>
&lt;li>2D specific pose models, image rendering at pixel or feature level (2D+3D approaches)&lt;/li>
&lt;li>3D face Model fitting&lt;/li>
&lt;/ul>
&lt;h3 id="pose-normalization">Pose Normalization&lt;/h3>
&lt;h4 id="-idea">💡 &lt;strong>Idea&lt;/strong>&lt;/h4>
&lt;ul>
&lt;li>Find several facial features (mesh)&lt;/li>
&lt;li>Use complete mesh to normalize face&lt;/li>
&lt;/ul>
&lt;p>Here we will use &lt;strong>2D Active Appearance Models&lt;/strong>&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-16%2011.51.52.png" alt="截屏2021-02-16 11.51.52" style="zoom:80%;" />
&lt;ul>
&lt;li>
&lt;p>A texture and shape-based parametric model&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Efficient fitting algorithm: &lt;strong>Inverse compositional (IC)&lt;/strong> algorithm&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h4 id="model-and-fitting">Model and fitting&lt;/h4>
&lt;p>Independent shape and appearance model
&lt;/p>
$$
\begin{array}{c}
\text{shape:} \quad s=\left(x\_{1}, y\_{1}, x\_{2}, y\_{2}, \cdots, x\_{v}, y\_{v}\right)^{T}=s\_{0}+\sum\_{i=1}^{n} p\_{i} s\_{i} \\\\
\text{appearance:} \quad A(x)=A\_{0}(x)+\sum\_{i=1}^{m} \lambda\_{i} A\_{i}(x) \quad \forall x \in s\_{0}
\end{array}
$$
&lt;p>
Fitting goal:
&lt;/p>
$$
\arg \min \_{p, \lambda} \sum\_{x \in s\_{0}}\left[A\_{0}(x)+\sum\_{i=1}^{m} \lambda\_{i} A\_{i}(x)-I(W(x ; p))\right]^{2}
$$
&lt;p>
Fitting examples&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Fitted mesh&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-16%2012.02.54.png" alt="截屏2021-02-16 12.02.54">&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Mismatched mesh&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-16%2012.03.27.png" alt="截屏2021-02-16 12.03.27">&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>The fitted model can be used to warp the image to a frontal pose (e.g. using a piecewise affine transformation of the mesh triangles)&lt;/p>
&lt;figure>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-16%2012.13.08.png">&lt;figcaption>
&lt;h4>Faces with different poses from the FERET database and their pose-aligned images&lt;/h4>
&lt;/figcaption>
&lt;/figure>
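&lt;p>In a piecewise affine warp, each mesh triangle is mapped to its counterpart in the mean shape, and the 6-parameter affine transform of a triangle is fixed exactly by its three point correspondences. A minimal sketch of solving for one triangle&amp;rsquo;s 2×3 affine matrix (a full warp would additionally rasterize each triangle and sample the source image):&lt;/p>

```python
import numpy as np

def affine_from_triangle(src, dst):
    """Solve for the 2x3 affine matrix M with  dst = M @ [x, y, 1]^T,
    given three source/destination point pairs (one mesh triangle).
    Six unknowns, six equations: an exact solution."""
    A = np.hstack([np.asarray(src, float), np.ones((3, 1))])  # rows [x, y, 1]
    # Solve A @ M.T = dst for the 2x3 parameter matrix M.
    return np.linalg.solve(A, np.asarray(dst, float)).T
```

Any pixel inside the source triangle is then mapped by `M @ [x, y, 1]`.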
&lt;h4 id="results">Results&lt;/h4>
&lt;ul>
&lt;li>Much better results under pose variations compared to simple affine transform&lt;/li>
&lt;li>Different warping functions can be used
&lt;ul>
&lt;li>Piecewise affine transformation worked best&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Approach works well with local-DCT-based approach
&lt;ul>
&lt;li>but not so well with holistic approaches, such as Eigenfaces (PCA) 🤪&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h2 id="face-recogntion-using-3d-models2">Face Recogntion using 3D Models&lt;sup id="fnref:2">&lt;a href="#fn:2" class="footnote-ref" role="doc-noteref">2&lt;/a>&lt;/sup>&lt;/h2>
&lt;ul>
&lt;li>A method for face recognition across variations in pose and illumination.&lt;/li>
&lt;li>Simulates the process of image formation in 3D space.&lt;/li>
&lt;li>Estimates 3D shape and texture of faces from single images by fitting a statistical morphable model of 3D faces to images.&lt;/li>
&lt;li>Faces are represented by model parameters for 3D shape and texture.&lt;/li>
&lt;/ul>
&lt;h4 id="model-based-recognition">Model-based Recognition&lt;/h4>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-16%2012.19.23.png" alt="截屏2021-02-16 12.19.23" style="zoom:67%;" />
&lt;h4 id="face-vectors">Face vectors&lt;/h4>
&lt;p>The morphable face model is based on a vector space representation of faces that is constructed such that &lt;strong>any combination of shape and texture vectors $S\_i$ and $T\_i$ describes a realistic human face&lt;/strong>:
&lt;/p>
$$
S=\sum\_{i=1}^{m} a\_{i} S\_{i} \quad T=\sum\_{i=1}^{m} b\_{i} T\_{i}
$$
&lt;p>
The definition of shape and texture vectors is based on a reference face $\mathbf{I}\_0$.&lt;/p>
&lt;p>The location of the vertices of the mesh in Cartesian coordinates is $(x\_k, y\_k, z\_k)$ with colors $(R\_k, G\_k, B\_k)$&lt;/p>
&lt;p>Reference shape and texture vectors are defined by:
&lt;/p>
$$
\begin{array}{l}
S\_{0}=\left(x\_{1}, y\_{1}, z\_{1}, x\_{2}, \ldots, x\_{n}, y\_{n}, z\_{n}\right)^{T} \\\\
T\_{0}=\left(R\_{1}, G\_{1}, B\_{1}, R\_{2}, \ldots, R\_{n}, G\_{n}, B\_{n}\right)^{T}
\end{array}
$$
&lt;p>
To encode a novel scan $\mathbf{I}$, the flow field from $\mathbf{I}\_0$ to $\mathbf{I}$ is computed.&lt;/p>
&lt;h4 id="pca">PCA&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>PCA is performed on the set of shape and texture vectors separately.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Eigenvectors form an orthogonal basis:
&lt;/p>
$$
\mathbf{S}=\overline{\mathbf{s}}+\sum\_{i=1}^{m-1} \alpha\_{i} \cdot \mathbf{s}\_{i}, \quad \mathbf{T}=\overline{\mathbf{t}}+\sum\_{i=1}^{m-1} \beta\_{i} \cdot \mathbf{t}\_{i}
$$
&lt;/li>
&lt;li>
&lt;p>Example&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-16%2020.36.08.png" alt="截屏2021-02-16 20.36.08" style="zoom:67%;" />
&lt;/li>
&lt;/ul>
&lt;h3 id="model-based-image-analysis">Model-based Image Analysis&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>🎯 Goal: find shape and texture coefficients describing a 3D face model such that rendering produces an image $\mathbf{I}\_{\text{model}}$ that is as similar as possible to $\mathbf{I}\_{\text{input}}$&lt;/p>
&lt;/li>
&lt;li>
&lt;p>For initialization, 7 facial feature points, such as the corners of the eyes or the tip of the nose, should be labelled manually&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-16%2020.38.43.png" alt="截屏2021-02-16 20.38.43">&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Model fitting: Minimize
&lt;/p>
$$
E\_{I}=\sum\_{x, y}\left\\|\mathbf{I}\_{\text {input }}(x, y)-\mathbf{I}\_{\text {model }}(x, y)\right\\|^{2}
$$
&lt;ul>
&lt;li>Shape, texture, transformation, and illumination are optimized for the entire face and refined for each segment.&lt;/li>
&lt;li>Complex iterative optimization procedure&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
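&lt;p>As a toy illustration, the cost $E\_I$ itself is just a sum of squared per-pixel differences; the hard (and omitted) part of the method is rendering $\mathbf{I}\_{\text{model}}$ from the current shape, texture, camera, and illumination parameters inside the iterative optimization:&lt;/p>

```python
import numpy as np

def image_error(I_input, I_model):
    """E_I: sum of squared per-pixel colour differences between the
    input image and the current model rendering."""
    return np.sum((np.asarray(I_input, float) - np.asarray(I_model, float))**2)
```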
&lt;h2 id="databases">Databases&lt;/h2>
&lt;ul>
&lt;li>Necessary to develop and improve algorithms&lt;/li>
&lt;li>Provide common testbeds and benchmarks which allow for comparing different approaches&lt;/li>
&lt;li>Different databases focus on different problems&lt;/li>
&lt;/ul>
&lt;p>Well-known databases for face recognition&lt;/p>
&lt;ul>
&lt;li>FERET&lt;/li>
&lt;li>FRVT&lt;/li>
&lt;li>FRGC&lt;/li>
&lt;li>CMU-PIE&lt;/li>
&lt;li>BANCA&lt;/li>
&lt;li>XM2VTS&lt;/li>
&lt;li>&amp;hellip;&lt;/li>
&lt;/ul>
&lt;h3 id="observations">Observations&lt;/h3>
&lt;ul>
&lt;li>One 3-D image is &lt;em>more powerful&lt;/em> for face recognition than one 2-D image.&lt;/li>
&lt;li>One high resolution 2-D image is &lt;em>more powerful&lt;/em> for face recognition than one 3-D image.&lt;/li>
&lt;li>Using 4 or 5 well-chosen 2-D face images is &lt;em>more powerful&lt;/em> for face recognition than one 3-D face image or multi-modal 3D+2D face.&lt;/li>
&lt;/ul>
&lt;h4 id="wild-face-datasets">Wild Face Datasets&lt;/h4>
&lt;h4 id="labeled-faces-in-the-wild-dataset-lfw">&lt;strong>Labeled Faces In the Wild Dataset (LFW)&lt;/strong>&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>Face Verification: Given a pair of images, specify whether they belong to the same person&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-16%2020.44.55.png" alt="截屏2021-02-16 20.44.55" style="zoom:80%;" />
&lt;/li>
&lt;li>
&lt;p>13K images, 5.7K people&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Standard benchmark in the community&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Several test protocols depending upon availability of training data within and outside the dataset.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h4 id="youtube-faces-dataset-ytf">&lt;strong>YouTube Faces Dataset (YTF)&lt;/strong>&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>Video Face Verification: Given a pair of videos, specify whether they belong to the same person&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-16%2020.46.03.png" alt="截屏2021-02-16 20.46.03" style="zoom:80%;" />
&lt;/li>
&lt;li>
&lt;p>3425 videos, 1595 people&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Standard benchmark in the community&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Wide pose, expression and illumination variation&lt;/p>
&lt;/li>
&lt;/ul>
&lt;div class="footnotes" role="doc-endnotes">
&lt;hr>
&lt;ol>
&lt;li id="fn:1">
&lt;p>T. Ahonen, A. Hadid and M. Pietikainen, &amp;ldquo;Face Description with Local Binary Patterns: Application to Face Recognition,&amp;rdquo; in IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 12, pp. 2037-2041, Dec. 2006, doi: 10.1109/TPAMI.2006.244.&amp;#160;&lt;a href="#fnref:1" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:2">
&lt;p>V. Blanz and T. Vetter, &amp;ldquo;Face recognition based on fitting a 3D morphable model,&amp;rdquo; in IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 25, no. 9, pp. 1063-1074, Sept. 2003, doi: 10.1109/TPAMI.2003.1227983.&amp;#160;&lt;a href="#fnref:2" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;/ol>
&lt;/div></description></item><item><title>Face Recognition: Deep Learning</title><link>https://haobin-tan.netlify.app/docs/ai/computer-vision/cv-lecture/07-face_recognition-deep_learning/</link><pubDate>Mon, 15 Feb 2021 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/computer-vision/cv-lecture/07-face_recognition-deep_learning/</guid><description>&lt;h2 id="deepface-1">DeepFace &lt;sup id="fnref:1">&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref">1&lt;/a>&lt;/sup>&lt;/h2>
&lt;h3 id="main-idea">Main idea&lt;/h3>
&lt;p>Learn a deep NN (7 layers, 20 million parameters) directly on the RGB pixels of 4 million identity-labeled face images.&lt;/p>
&lt;h3 id="alignment">Alignment&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>Use 6 fiducial points for 2D warp&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Then 67 points for 3D model&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Frontalize the face for input to NN&lt;/p>
&lt;/li>
&lt;/ul>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-16%2020.51.01.png" alt="截屏2021-02-16 20.51.01" style="zoom:67%;" />
&lt;h3 id="representation">Representation&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>The output is fed into a $k$-way softmax, which generates a probability distribution over class labels.&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-16%2020.52.06.png" alt="截屏2021-02-16 20.52.06">&lt;/p>
&lt;/li>
&lt;li>
&lt;p>🎯 Goal of training: &lt;strong>maximize the probability of the correct class&lt;/strong>&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h2 id="facenet2">FaceNet&lt;sup id="fnref:2">&lt;a href="#fn:2" class="footnote-ref" role="doc-noteref">2&lt;/a>&lt;/sup>&lt;/h2>
&lt;h4 id="idea">💡Idea&lt;/h4>
&lt;ul>
&lt;li>Map images to a compact Euclidean space, where &lt;strong>distances correspond to face similarity&lt;/strong>&lt;/li>
&lt;li>Find $f(x)\ \in \mathbb{R}^d$ for image $x$, so that
&lt;ul>
&lt;li>$d^2(f(x\_1), f(x\_2)) \rightarrow \text{small}$, if $x\_1, x\_2 \in \text{same identity}$&lt;/li>
&lt;li>$d^2(f(x\_1), f(x\_3)) \rightarrow \text{large}$, if $x\_1, x\_3 \in \text{different identities}$&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="system-architecture">System architecture&lt;/h3>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-16%2021.04.45.png" alt="截屏2021-02-16 21.04.45">&lt;/p>
&lt;ul>
&lt;li>CNN: optimized embedding&lt;/li>
&lt;li>Triplet-based loss function: training&lt;/li>
&lt;/ul>
&lt;h3 id="triplet-loss">Triplet loss&lt;/h3>
&lt;p>Image triplets:&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-16%2021.06.14.png" alt="截屏2021-02-16 21.06.14" style="zoom:67%;" />
$$
\begin{array}{c}
\left\\|f\left(x\_{i}^{a}\right)-f\left(x\_{i}^{p}\right)\right\\|\_{2}^{2}+\alpha&lt;\left\\|f\left(x\_{i}^{a}\right)-f\left(x\_{i}^{n}\right)\right\\|\_{2}^{2} \\\\
\forall\left(f\left(x\_{i}^{a}\right), f\left(x\_{i}^{p}\right), f\left(x\_{i}^{n}\right)\right) \in \mathcal{T}
\end{array}
$$
where
&lt;ul>
&lt;li>
&lt;p>$x\_i^a$: Anchor image&lt;/p>
&lt;/li>
&lt;li>
&lt;p>$x\_i^p$: Positive image&lt;/p>
&lt;/li>
&lt;li>
&lt;p>$x\_i^n$: Negative image&lt;/p>
&lt;/li>
&lt;li>
&lt;p>$\mathcal{T}$: Set of all possible triplets in the training set&lt;/p>
&lt;/li>
&lt;li>
&lt;p>$\alpha$: Margin between positive and negative pairs&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>Total Loss function to be minimized:
&lt;/p>
$$
L=\sum\_{i=1}^{N}\left[\left\\|f\left(x\_{i}^{a}\right)-f\left(x\_{i}^{p}\right)\right\\|\_{2}^{2}-\left\\|f\left(x\_{i}^{a}\right)-f\left(x\_{i}^{n}\right)\right\\|\_{2}^{2}+\alpha\right]\_{+}
$$
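&lt;p>A minimal NumPy sketch of the triplet loss over a batch of precomputed embeddings (in practice this is implemented inside a deep-learning framework so gradients flow back into the CNN):&lt;/p>

```python
import numpy as np

def triplet_loss(f_a, f_p, f_n, alpha=0.2):
    """Triplet loss over a batch of embeddings.
    f_a, f_p, f_n: (N, d) anchor / positive / negative embeddings."""
    d_ap = np.sum((f_a - f_p)**2, axis=1)   # squared anchor-positive distance
    d_an = np.sum((f_a - f_n)**2, axis=1)   # squared anchor-negative distance
    # Hinge: only triplets that violate the margin contribute.
    return np.sum(np.maximum(0.0, d_ap - d_an + alpha))
```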
&lt;h3 id="triplet-selection">Triplet selection&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>Online Generation&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Select only the &lt;strong>semi-hard negatives&lt;/strong> and use all anchor-positive pairs of a mini-batch&lt;/p>
&lt;p>$\rightarrow$ Select $x\_i^n$ such that
&lt;/p>
$$
\left\\|f\left(x\_{i}^{a}\right)-f\left(x\_{i}^{p}\right)\right\\|\_{2}^{2}&lt;\left\\|f\left(x\_{i}^{a}\right)-f\left(x\_{i}^{n}\right)\right\\|\_{2}^{2}
$$
&lt;/li>
&lt;/ul>
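&lt;p>A literal reading of the selection criterion above, sketched for one anchor-positive pair: among the candidate negatives of the mini-batch, keep those farther from the anchor than the positive, and pick the closest of them (the hardest negative that still satisfies the constraint). The fallback for when no candidate qualifies is an assumption made here for robustness:&lt;/p>

```python
import numpy as np

def pick_semi_hard_negative(f_a, f_p, f_negs):
    """Return the index of a semi-hard negative for one (anchor, positive)
    pair.  f_a, f_p: (d,) embeddings; f_negs: (M, d) candidate negatives."""
    d_ap = np.sum((f_a - f_p)**2)
    d_an = np.sum((f_a - f_negs)**2, axis=1)
    valid = np.where(d_an > d_ap)[0]        # criterion: d_ap < d_an
    if valid.size:
        return int(valid[np.argmin(d_an[valid])])
    return int(np.argmin(d_an))             # fallback: hardest negative
```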
&lt;h3 id="results">Results&lt;/h3>
&lt;ul>
&lt;li>LFW: 99.63% $\pm$ 0.09&lt;/li>
&lt;li>YouTube Faces DB: 95.12% $\pm$ 0.39&lt;/li>
&lt;/ul>
&lt;h2 id="deep-face-recognition-3">Deep Face Recognition &lt;sup id="fnref:3">&lt;a href="#fn:3" class="footnote-ref" role="doc-noteref">3&lt;/a>&lt;/sup>&lt;/h2>
&lt;p>&lt;strong>Key Questions&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Can large scale datasets be built with minimal human intervention? &lt;a href="#dataset-collection">Yes&lt;/a>!&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Can we propose a convolutional neural network which can compete with that of internet giants like Google and Facebook? &lt;a href="#convolutional-neural-network">Yes&lt;/a>!&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h3 id="dataset-collection">Dataset Collection&lt;/h3>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Candidate list generation&lt;/strong>: &lt;strong>Finding names of celebrities&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Tap the knowledge on the web&lt;/li>
&lt;li>5000 identities&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Manual verification of celebrities: Finding Popular Celebrities&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Collect representative images for each celebrity&lt;/p>
&lt;/li>
&lt;li>
&lt;p>200 images/identity&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Remove people with low representation on Google.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Remove overlap with public benchmarks&lt;/p>
&lt;/li>
&lt;li>
&lt;p>2622 celebrities for the final dataset&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Rank image sets&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>2000 images per identity&lt;/li>
&lt;li>Searching by appending keyword “actor”&lt;/li>
&lt;li>Learning a classifier using the data obtained in the previous step.&lt;/li>
&lt;li>Ranking 2000 images and selecting top 1000 images&lt;/li>
&lt;li>Approx. 2.6 Million images of 2622 celebrities&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Near duplicate removal&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>VLAD descriptor based near duplicate removal&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Manual filtering&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Curating the dataset further using manual checks&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol>
&lt;h3 id="convolutional-neural-network">Convolutional Neural Network&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>The “Very Deep” Architecture&lt;/p>
&lt;ul>
&lt;li>
&lt;p>3 x 3 Convolution Kernels (Very small)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Conv. Stride 1 px.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>ReLU non-linearity&lt;/p>
&lt;/li>
&lt;li>
&lt;p>No local contrast normalisation&lt;/p>
&lt;/li>
&lt;li>
&lt;p>3 Fully connected layers&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Training&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Random Gaussian Initialization&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Stochastic Gradient Descent with back prop.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Batch Size: 256&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Incremental FC layer training&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Learning Task Specific Embedding&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Learning embedding by minimizing triplet loss
&lt;/p>
$$
\sum\_{(a, p, n) \in T} \max \left\\{0, \alpha-\left\\|\mathbf{x}\_{a}-\mathbf{x}\_{n}\right\\|\_{2}^{2}+\left\\|\mathbf{x}\_{a}-\mathbf{x}\_{p}\right\\|\_{2}^{2}\right\\}
$$
&lt;/li>
&lt;li>
&lt;p>Learning a projection from 4096 to 1024 dimensions&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Online triplet formation at the beginning of each iteration&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Fine tuned on target datasets&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Only the projection layers are learnt&lt;/strong>&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;div class="footnotes" role="doc-endnotes">
&lt;hr>
&lt;ol>
&lt;li id="fn:1">
&lt;p>Y. Taigman, M. Yang, M. Ranzato and L. Wolf, &amp;ldquo;DeepFace: Closing the Gap to Human-Level Performance in Face Verification,&amp;rdquo; 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 2014, pp. 1701-1708, doi: 10.1109/CVPR.2014.220.&amp;#160;&lt;a href="#fnref:1" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:2">
&lt;p>Schroff, Florian &amp;amp; Kalenichenko, Dmitry &amp;amp; Philbin, James. (2015). FaceNet: A unified embedding for face recognition and clustering. 815-823. 10.1109/CVPR.2015.7298682.&amp;#160;&lt;a href="#fnref:2" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:3">
&lt;p>Omkar M. Parkhi, Andrea Vedaldi and Andrew Zisserman. Deep Face Recognition. In Xianghua Xie, Mark W. Jones, and Gary K. L. Tam, editors, Proceedings of the British Machine Vision Conference (BMVC), pages 41.1-41.12. BMVA Press, September 2015.&amp;#160;&lt;a href="#fnref:3" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;/ol>
&lt;/div></description></item><item><title>Facial Feature Detection</title><link>https://haobin-tan.netlify.app/docs/ai/computer-vision/cv-lecture/08-facial-features-detection/</link><pubDate>Thu, 18 Feb 2021 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/computer-vision/cv-lecture/08-facial-features-detection/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;h3 id="what-are-facial-features">What are facial features?&lt;/h3>
&lt;p>Facial features are the &lt;strong>salient parts of a face region which carry meaningful information&lt;/strong>.&lt;/p>
&lt;ul>
&lt;li>E.g. eye, eyebrow, nose, mouth&lt;/li>
&lt;li>A.k.a &lt;mark>&lt;strong>facial landmarks&lt;/strong>&lt;/mark>&lt;/li>
&lt;/ul>
&lt;h3 id="what-is-facial-feature-detection">What is facial feature detection?&lt;/h3>
&lt;p>Facial feature detection is defined as methods of &lt;strong>locating the specific areas of a face&lt;/strong>.&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-18%2023.13.42.png" alt="截屏2021-02-18 23.13.42" style="zoom:80%;" />
&lt;h3 id="applications-of-facial-feature-detection">Applications of facial feature detection&lt;/h3>
&lt;ul>
&lt;li>Face recognition&lt;/li>
&lt;li>Model-based head pose estimation&lt;/li>
&lt;li>Eye gaze tracking&lt;/li>
&lt;li>Facial expression recognition&lt;/li>
&lt;li>Age modeling&lt;/li>
&lt;/ul>
&lt;h3 id="problems-in-facial-feature-detection">Problems in facial feature detection&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Identity variations&lt;/strong>&lt;/p>
&lt;p>Each person has unique facial parts&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Expression variations&lt;/strong>&lt;/p>
&lt;p>Some facial features change their state (e.g. eye blinks).&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Head rotations&lt;/strong>&lt;/p>
&lt;p>If a head orientation changes, the visual appearance also changes.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Scale variations&lt;/strong>&lt;/p>
&lt;p>Changes in resolution and distance to the camera affect appearance.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Lighting conditions&lt;/strong>&lt;/p>
&lt;p>Light has non-linear effects on the pixel values of an image.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Occlusions&lt;/strong>&lt;/p>
&lt;p>Hair or glasses might hide facial features.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h2 id="older-approaches-from-face-detection">Older approaches (from face detection)&lt;/h2>
&lt;ul>
&lt;li>Integral projections + geometric constraints&lt;/li>
&lt;li>Haar-Filter Cascades&lt;/li>
&lt;li>PCA-based methods (Modular Eigenspace)&lt;/li>
&lt;li>Morphable 3D Model&lt;/li>
&lt;/ul>
&lt;h2 id="statistical-appearance-models">Statistical appearance models&lt;/h2>
&lt;ul>
&lt;li>💡 Idea: make use of prior-knowledge, i.e. models, to reduce the complexity of the task&lt;/li>
&lt;li>Needs to be able to deal with variability $\rightarrow$ &lt;strong>deformable models&lt;/strong>&lt;/li>
&lt;li>Use statistical models of shape and texture to find facial landmark points&lt;/li>
&lt;li>Good models should
&lt;ul>
&lt;li>Capture the various characteristics of the object to be detected&lt;/li>
&lt;li>Be a compact representation in order to avoid heavy calculation&lt;/li>
&lt;li>Be robust against noise&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="basic-idea">Basic idea&lt;/h3>
&lt;ol>
&lt;li>&lt;strong>Training&lt;/strong> stage: construction of models&lt;/li>
&lt;li>&lt;strong>Test&lt;/strong> stage: Search the region of interest (ROI)&lt;/li>
&lt;/ol>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-19%2011.30.55.png" alt="截屏2021-02-19 11.30.55" style="zoom:80%;" />
&lt;h3 id="appearance-models">Appearance models&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>Represent both &lt;strong>texture&lt;/strong> and &lt;strong>shape&lt;/strong>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Statistical model learned from training data&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Modeling shape variability&lt;/p>
&lt;ul>
&lt;li>Landmark points&lt;/li>
&lt;/ul>
$$
x=\left[x\_{1}, y\_{1}, x\_{2}, y\_{2}, \ldots, x\_{n}, y\_{n}\right]^{T}
$$
&lt;ul>
&lt;li>
&lt;p>Model
&lt;/p>
$$
x \approx \bar{x}+P\_{s} b\_{s}
$$
&lt;ul>
&lt;li>$\bar{x}$: Mean vector&lt;/li>
&lt;li>$P\_s$: Eigenvectors of covariance matrix&lt;/li>
&lt;li>$b\_s = P\_s^T(x - \bar{x})$&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Modeling intensity variability:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Gray values
&lt;/p>
$$
h=\left[g\_{1}, g\_{2}, \ldots, g\_{k}\right]^{T}
$$
&lt;/li>
&lt;li>
&lt;p>Model
&lt;/p>
$$
h \approx \bar{h} + P\_ib\_i
$$
&lt;ul>
&lt;li>$\bar{h}$: Mean vector&lt;/li>
&lt;li>$P\_i$: Eigenvectors of covariance matrix&lt;/li>
&lt;li>$b\_i = P\_i^T(h - \bar{h})$&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="training-of-appearance-models">Training of appearance models&lt;/h3>
&lt;h4 id="1-construct-a-shape-model-with-principal-component-analysis-pca">1. Construct a shape model with Principal component analysis (PCA)&lt;/h4>
&lt;p>A shape is represented with manually labeled points.&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-19%2011.40.18.png" alt="截屏2021-02-19 11.40.18">&lt;/p>
&lt;p>The shape model approximates the shape of an object.&lt;/p>
&lt;h5 id="procrustes-analysis">&lt;strong>Procrustes Analysis&lt;/strong>&lt;/h5>
&lt;p>Align all the shapes together to remove translation, rotation, and scaling&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-19%2011.51.46.png" alt="截屏2021-02-19 11.51.46">&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-19%2011.52.05.png" alt="截屏2021-02-19 11.52.05">&lt;/p>
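&lt;p>The alignment of one shape to a reference can be sketched as ordinary Procrustes analysis: subtract the centroid (translation), normalize the norm (scale), and recover the optimal rotation from an SVD. A minimal NumPy sketch (the full algorithm iterates this against a re-estimated mean shape):&lt;/p>

```python
import numpy as np

def procrustes_align(shape, reference):
    """Align one shape (n, 2) to a reference shape by removing
    translation, scale, and rotation."""
    s = np.asarray(shape, float)
    r = np.asarray(reference, float)
    s = s - s.mean(axis=0)          # remove translation
    r = r - r.mean(axis=0)
    s /= np.linalg.norm(s)          # remove scale
    r /= np.linalg.norm(r)
    # Optimal rotation from the SVD of the cross-covariance
    # (ignoring the rare reflection case for brevity).
    u, _, vt = np.linalg.svd(s.T @ r)
    return s @ (u @ vt)
```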
&lt;p>&lt;strong>PCA&lt;/strong>&lt;/p>
&lt;p>The positions of labeled points are
&lt;/p>
$$
x = \bar{x}+P\_{s} b\_{s}
$$
&lt;ul>
&lt;li>$\bar{x}$: Mean shape&lt;/li>
&lt;li>$P\_s$: Orthogonal modes of variation obtained by PCA&lt;/li>
&lt;li>$b\_s$: Shape parameters in the projected space&lt;/li>
&lt;/ul>
&lt;p>The shapes are represented with fewer parameters ($\operatorname{Dim}(x) > \operatorname{Dim}(b\_s)$)&lt;/p>
&lt;p>Generating plausible shapes:&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-19%2012.38.11.png" alt="截屏2021-02-19 12.38.11" style="zoom:80%;" />
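&lt;p>Building the linear shape model $x \approx \bar{x}+P\_{s} b\_{s}$ from aligned training shapes can be sketched directly with an eigendecomposition of the covariance matrix; the 98% variance threshold below is an arbitrary choice for illustration:&lt;/p>

```python
import numpy as np

def build_shape_model(shapes, var_kept=0.98):
    """Fit the model  x = mean + P @ b  from training shapes.
    shapes: (N, 2n) stacked landmark vectors (already aligned)."""
    mean = shapes.mean(axis=0)
    centered = shapes - mean
    cov = centered.T @ centered / (shapes.shape[0] - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)          # ascending order
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]
    # Keep just enough modes to explain `var_kept` of the variance.
    k = int(np.searchsorted(np.cumsum(eigvals) / eigvals.sum(), var_kept)) + 1
    return mean, eigvecs[:, :k], eigvals[:k]

def project(x, mean, P):
    """b = P^T (x - mean): shape parameters of a new shape."""
    return P.T @ (x - mean)

def reconstruct(b, mean, P):
    """x = mean + P @ b: shape from parameters."""
    return mean + P @ b
```

Because few modes carry most of the variance, $\operatorname{Dim}(b\_s)$ is much smaller than $\operatorname{Dim}(x)$.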
&lt;h4 id="2-construct-a-texture-model-which-represents-grey-scale-or-color-values-at-each-point">2. Construct a texture model which represents grey-scale (or color) values at each point&lt;/h4>
&lt;p>Warp the image so that the labeled points fit on the mean shape&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-19%2012.39.55.png" alt="截屏2021-02-19 12.39.55" style="zoom:80%;" />
&lt;p>Then normalize the intensity on the &lt;em>shape-free&lt;/em> patch.&lt;/p>
&lt;h5 id="texture-warping">Texture warping&lt;/h5>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-19%2012.58.42.png" alt="截屏2021-02-19 12.58.42" />
&lt;h4 id="texture-model">Texture model&lt;/h4>
&lt;p>The pixel values on the shape-free patch
&lt;/p>
$$
g = \bar{g} + P\_g b\_g
$$
&lt;ul>
&lt;li>$\bar{g}$ : Mean of normalized pixel values&lt;/li>
&lt;li>$P\_g$ : Orthogonal modes of variation obtained by PCA&lt;/li>
&lt;li>$b\_g$: Texture parameters in the projected space&lt;/li>
&lt;/ul>
&lt;p>The pixel values (appearance) are represented with fewer parameters ($\operatorname{Dim}(g) > \operatorname{Dim}(b\_g)$)&lt;/p>
&lt;h4 id="3-model-the-correlation-between-shapes-and-grey-level-models">3. Model the correlation between shapes and grey-level models&lt;/h4>
&lt;p>The concatenated vector is
&lt;/p>
$$
b=\left(\begin{array}{c}
W\_{s} b\_{s} \\\\
b\_{g}
\end{array}\right)
$$
&lt;p>
Apply PCA:
&lt;/p>
$$
b=P\_{c} c=\left(\begin{array}{l}
P\_{c s} \\\\
P\_{c g}
\end{array}\right)c
$$
&lt;p>
Now the parameter $\mathbf{c}$ can control both shape and grey-level models&lt;/p>
&lt;ul>
&lt;li>
&lt;p>The shape model
&lt;/p>
$$
x=\bar{x}+P\_{s} W\_{s}^{-1} P\_{c s} c
$$
&lt;/li>
&lt;li>
&lt;p>The grey-level model
&lt;/p>
$$
g=\bar{g}+P\_{g} P\_{c g} c
$$
&lt;/li>
&lt;/ul>
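&lt;p>A minimal sketch of this third step: stack the weighted shape parameters and texture parameters of every training sample, run a further PCA on the concatenation, and obtain one combined parameter vector $c$ per face. Array shapes and the use of a full-rank SVD are illustrative assumptions:&lt;/p>

```python
import numpy as np

def combine_models(B_s, B_g, W_s):
    """Combined PCA over concatenated (W_s b_s ; b_g) vectors.
    B_s: (N, k_s) shape params, B_g: (N, k_g) texture params,
    W_s: (k_s, k_s) weight matrix balancing shape vs. texture units."""
    B = np.hstack([B_s @ W_s.T, B_g])               # rows are the b vectors
    mean = B.mean(axis=0)
    _, _, vt = np.linalg.svd(B - mean, full_matrices=False)
    P_c = vt.T                                      # combined eigenvectors
    C = (B - mean) @ P_c                            # parameters c per sample
    return P_c, C
```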
&lt;p>&lt;strong>Examples of synthesized faces&lt;/strong>&lt;/p>
&lt;p>Various objects can be synthesized by controlling the parameter $\mathbf{c}$&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-19%2013.07.53.png" alt="截屏2021-02-19 13.07.53" style="zoom:80%;" />
&lt;h3 id="dataset-for-building-model">Dataset for Building Model&lt;/h3>
&lt;p>IMM data set from Danish Technical University&lt;/p>
&lt;ul>
&lt;li>
&lt;p>240 images of size 640×480; 40 individuals, with 36 males and 4 females.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Each Subject 6 shots, with different pose, expressions and illuminations.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Each image is labeled with 58 landmarks; 3 closed and 4 open point-paths.&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-19%2013.09.03.png" alt="截屏2021-02-19 13.09.03">&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h3 id="image-interpretation-with-models">Image Interpretation with Models&lt;/h3>
&lt;ul>
&lt;li>🎯 &lt;strong>Goal: find the set of parameters which best match the model to the image&lt;/strong>
&lt;ul>
&lt;li>Optimize some cost function&lt;/li>
&lt;li>Difficult optimization problem&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Set of parameters
&lt;ul>
&lt;li>Defines shape, position, appearance&lt;/li>
&lt;li>Can be used for further processing
&lt;ul>
&lt;li>
&lt;p>Position of landmarks&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Face recognition&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Facial expression recognition&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Pose estimation&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Problem: Optimizing the model fit
&lt;ul>
&lt;li>&lt;a href="#active-shape-models-asm">Active Shape Models&lt;/a>&lt;/li>
&lt;li>&lt;a href="#active-appearance-models-aam">Active Appearance Models&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="active-shape-models-asm">Active Shape Models (ASM)&lt;/h3>
&lt;p>Given a rough starting position, create an instance $\mathbf{X}$ of the model using&lt;/p>
&lt;ul>
&lt;li>shape parameters $b$&lt;/li>
&lt;li>translation $T=(X\_t,Y\_t)$&lt;/li>
&lt;li>scale $s$&lt;/li>
&lt;li>rotation $\theta$&lt;/li>
&lt;/ul>
&lt;p>Iterative approach:&lt;/p>
&lt;ol>
&lt;li>Examine the region of the image around each point $\mathbf{X}\_i$ to find the best nearby match $\mathbf{X}\_i^\prime$&lt;/li>
&lt;li>Update parameters $(b, T, s, \theta)$ to best fit the new points $\mathbf{X}$ (constrain the model parameters to be within three standard deviations)&lt;/li>
&lt;li>Repeat until convergence&lt;/li>
&lt;/ol>
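&lt;p>The iteration above can be sketched in a heavily simplified form: shape parameters only (no pose parameters $T, s, \theta$), with the profile-normal image search abstracted into a &lt;code>find_best_match&lt;/code> callback and unit-variance modes assumed for the $\pm 3$ standard-deviation clamp:&lt;/p>

```python
import numpy as np

def asm_fit(x0, mean, P, find_best_match, n_iter=20, limit=3.0):
    """Simplified ASM loop: (1) move each point to its best nearby match,
    (2) project onto the shape model and clamp b to +-3 std devs,
    (3) repeat.  x0, mean: (2n,) shape vectors; P: (2n, k) modes."""
    x = x0.copy()
    std = np.ones(P.shape[1])   # assumption: unit-variance modes
    for _ in range(n_iter):
        targets = find_best_match(x)               # step 1: image search
        b = P.T @ (targets - mean)                 # step 2: fit the model
        b = np.clip(b, -limit * std, limit * std)  # plausibility constraint
        x = mean + P @ b
    return x
```

The clamp is what keeps the fitted shape a plausible instance of the statistical model even when individual point matches are bad.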
&lt;p>In practice: &lt;strong>search along profile normals&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>The optimal parameters are searched from &lt;strong>multi-resolution&lt;/strong> images hierarchically (faster algorithm)&lt;/p>
&lt;ol>
&lt;li>Search for the object in a coarse image&lt;/li>
&lt;li>Refine the location in a series of higher resolution images.&lt;/li>
&lt;/ol>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-19%2013.31.48.png" alt="截屏2021-02-19 13.31.48">&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Example of search&lt;/strong>&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-19%2013.32.21.png" alt="截屏2021-02-19 13.32.21" style="zoom:80%;" />
&lt;h4 id="disadvantages">Disadvantages&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>Uses mainly shape constraints for search&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Does not take advantage of texture across the target&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h3 id="active-appearance-models-aam">Active Appearance Models (AAM)&lt;/h3>
&lt;ul>
&lt;li>Optimize parameters, so as to minimize the difference of a synthesized image and the target image&lt;/li>
&lt;li>Solved using a gradient-descent approach&lt;/li>
&lt;/ul>
&lt;h4 id="fitting-aams">Fitting AAMs&lt;/h4>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-19%2015.51.19.png" alt="截屏2021-02-19 15.51.19" style="zoom:80%;" />
&lt;p>Learning linear relation matrix $\mathbf{R}$ using multi-variate linear regression&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Generate training set by perturbing model parameters for training images&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Include small displacements in position, scale, and orientation&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Record perturbation and image difference&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Experimentally, optimal perturbation around 0.5 standard deviations for each parameter&lt;/p>
&lt;/li>
&lt;/ul>
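&lt;p>Once the perturbation/image-difference training pairs have been generated (the rendering step is not shown here), learning $\mathbf{R}$ is a single multi-variate least-squares fit. A minimal sketch under that assumption:&lt;/p>

```python
import numpy as np

def learn_update_matrix(perturbations, residuals):
    """Fit R such that  delta_params = R @ delta_image  in the
    least-squares sense.
    perturbations: (N, p) known parameter displacements,
    residuals:     (N, m) the image differences they produced."""
    # Solve residuals @ R.T = perturbations for R.T by least squares.
    R_T, *_ = np.linalg.lstsq(residuals, perturbations, rcond=None)
    return R_T.T   # (p, m)
```

At test time, each AAM iteration multiplies the current image difference by $\mathbf{R}$ to predict a parameter update.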
&lt;h3 id="asm-vs-aam">ASM vs. AAM&lt;/h3>
&lt;p>&lt;strong>ASM&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Seeks to match a set of model points to an image, constrained by a statistical model of shape&lt;/li>
&lt;li>Matches model points using an &lt;strong>iterative&lt;/strong> technique (a variant of the EM algorithm)&lt;/li>
&lt;li>A search is made around the current position of each point to find a nearby point which best matches texture for the landmark&lt;/li>
&lt;li>Parameters of the shape model are then updated to move the model points closer to the new points in the image&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>AAM&lt;/strong>: matches both position of model points and representation of texture of the object to an image&lt;/p>
&lt;ul>
&lt;li>Uses the difference between current synthesized image and target image to update parameters&lt;/li>
&lt;li>Typically, fewer landmark points are needed&lt;/li>
&lt;/ul>
&lt;h3 id="summary-of-asm-and-aam">Summary of ASM and AAM&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>Statistical appearance models provide a compact representation&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Can model variations such as different identities, facial expression, appearances, etc.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Labeled training images are needed (very time-consuming) 🤪&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Original formulation of ASM and AAM is computationally expensive (i.e. slow) 🤪&lt;/p>
&lt;/li>
&lt;li>
&lt;p>But, efficient extensions and speed-ups exist!&lt;/p>
&lt;ul>
&lt;li>Multi-resolution search&lt;/li>
&lt;li>Constrained AAM search&lt;/li>
&lt;li>Inverse compositional AAMs (CMU)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Usage&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Facial fiducial point detection&lt;/strong>&lt;/li>
&lt;li>Face recognition, pose estimation&lt;/li>
&lt;li>Facial expression analysis&lt;/li>
&lt;li>Audio-visual speech recognition&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h2 id="more-modern-approaches-conditional-random-forests-for-real-time-facial-feature-detection1">More Modern Approaches: &lt;strong>Conditional Random Forests&lt;/strong> For Real Time Facial Feature Detection&lt;sup id="fnref:1">&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref">1&lt;/a>&lt;/sup>&lt;/h2>
&lt;h3 id="basics">Basics&lt;/h3>
&lt;h4 id="regression-tree">Regression tree&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>Basically like a classification decision tree&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Inner nodes contain decisions that are comparisons of numbers&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Leaves contain numbers or multi-dimensional vectors of numbers&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Example&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-19%2016.02.51.png" alt="截屏2021-02-19 16.02.51">&lt;/p>
&lt;/li>
&lt;/ul>
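The node/leaf structure above fits in a few lines; this is an illustrative sketch in which inner nodes are `(feature, threshold, left, right)` tuples and leaves are the stored numbers or vectors.

```python
def predict(tree, x):
    """Walk a regression tree down to a leaf for sample x.
    Inner nodes: (feature_index, threshold, left_subtree, right_subtree).
    Leaves: anything that is not a tuple (a number or a vector)."""
    if not isinstance(tree, tuple):
        return tree                       # leaf: return the stored value
    feature, threshold, left, right = tree
    child = left if x[feature] <= threshold else right
    return predict(child, x)

# Hypothetical toy tree: split on feature 0, then on feature 1
tree = (0, 5.0, [1.0, 2.0], (1, 3.0, [0.0, 0.0], [9.0, 9.0]))
```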
&lt;h4 id="random-regression-forests">Random regression forests&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>Set of random regression trees&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Random&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Different trees are trained on random subsets of the training data&lt;/p>
&lt;/li>
&lt;li>
&lt;p>After training, predictions for unseen samples can be made by averaging the predictions from all the individual regression trees&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-19%2016.03.52.png" alt="截屏2021-02-19 16.03.52">&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
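Both ingredients, random subsets for training and averaging for prediction, can be sketched as follows; `train_tree` is a hypothetical placeholder for any regression-tree learner that returns a callable.

```python
import numpy as np

def train_forest(X, y, train_tree, n_trees=10, subset_frac=0.5, rng=None):
    """Train each tree of the forest on a random subset of the training data."""
    rng = np.random.default_rng(rng)
    n = len(X)
    trees = []
    for _ in range(n_trees):
        idx = rng.choice(n, size=max(1, int(subset_frac * n)), replace=False)
        trees.append(train_tree(X[idx], y[idx]))
    return trees

def forest_predict(trees, x):
    """For an unseen sample, average the predictions of all individual trees."""
    return np.mean([tree(x) for tree in trees], axis=0)
```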
&lt;h3 id="basic-idea-1">Basic idea&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>Train a different set of trees for each head pose.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>The leaf nodes accumulate votes for the different facial fiducial points&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-19%2016.21.08.png" alt="截屏2021-02-19 16.21.08">&lt;/p>
&lt;h3 id="regression-forests-training">Regression forests training&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>Each tree is trained on a randomly selected set of images.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Extract patches in each image&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Training goal: accumulate probability for a feature point $C\_n$ given a patch $P$ at the leaf node&lt;/p>
&lt;ul>
&lt;li>Each patch is represented by appearance features $I$, and displacement vectors $D$ (offsets) to each of the facial fiducial feature point. I.e. $P = \\{I, D\\}$&lt;/li>
&lt;li>A simple patch comparison is used as the tree-node splitting criterion&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="regression-forests-testing">Regression forests testing&lt;/h3>
&lt;ul>
&lt;li>Given: a random face image&lt;/li>
&lt;li>Extract a dense set of patches from the image&lt;/li>
&lt;li>Feed all patches to all trees in the forest&lt;/li>
&lt;li>Get for each patch $P\_i$ a corresponding set of leaves&lt;/li>
&lt;li>A density estimator for the locations of the facial feature points is calculated&lt;/li>
&lt;li>Run meanshift to find all locations&lt;/li>
&lt;/ul>
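The final meanshift step can be sketched as a fixed-point iteration over the 2-D vote cloud with a Gaussian kernel; the bandwidth here is a hypothetical free parameter, not a value from the paper.

```python
import numpy as np

def meanshift_mode(votes, start, bandwidth=5.0, n_iter=30):
    """Shift a point to a density mode of the vote cloud: repeatedly replace it
    with the Gaussian-weighted mean of all votes until it stops moving."""
    votes = np.asarray(votes, dtype=float)
    x = np.asarray(start, dtype=float)
    for _ in range(n_iter):
        d2 = np.sum((votes - x) ** 2, axis=1)
        w = np.exp(-d2 / (2.0 * bandwidth ** 2))   # Gaussian kernel weights
        x_new = (w[:, None] * votes).sum(axis=0) / w.sum()
        if np.linalg.norm(x_new - x) < 1e-6:
            break
        x = x_new
    return x
```

Because distant votes get near-zero weight, outlier votes (e.g. from patches belonging to a different feature) barely affect the located mode.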
&lt;h3 id="conditional-regression-forest">Conditional Regression Forest&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>A conditional regression forest works similarly.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>For &lt;strong>training&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Compute a probability for a concrete head pose&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Divide the training set into disjoint subsets according to head pose&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Train a regression forest for each subset&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>For &lt;strong>testing&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Estimate the probabilities for each head pose&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Select trees from different regression forests&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Estimate the density function for all facial feature points.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Finalize the exact position by clustering all candidate votes for a given facial feature point (e.g., with meanshift)&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
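The test-time tree selection can be sketched as drawing trees from the pose-specific forests in proportion to the estimated head-pose probabilities; the rounding scheme below is an illustrative choice, not the paper's exact procedure.

```python
import numpy as np

def select_trees(pose_forests, pose_probs, n_trees, rng=None):
    """Compose a forest for a test image: take from each pose-specific forest a
    number of trees proportional to that pose's estimated probability."""
    rng = np.random.default_rng(rng)
    probs = np.asarray(pose_probs, dtype=float)
    counts = np.round(probs / probs.sum() * n_trees).astype(int)
    selected = []
    for forest, k in zip(pose_forests, counts):
        k = min(k, len(forest))
        idx = rng.choice(len(forest), size=k, replace=False)
        selected.extend(forest[i] for i in idx)
    return selected
```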
&lt;h3 id="experiments-and-results">Experiments and results&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>Training set:&lt;/p>
&lt;ul>
&lt;li>13233 face images from LFW Database&lt;/li>
&lt;li>10 annotated facial feature points per face image&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Training&lt;/p>
&lt;ul>
&lt;li>Maximum tree depth = 20&lt;/li>
&lt;li>2500 splitting candidates and 25 thresholds per split&lt;/li>
&lt;li>1500 images to train each tree&lt;/li>
&lt;li>200 patches per image (20 * 20 pixels).&lt;/li>
&lt;li>For head pose, two different partitions with 3 and 5 head poses are generated (pose estimation accuracy: 72.5%)&lt;/li>
&lt;li>Required time for face detection and head pose estimation is 33 ms.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Results&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/result.png" alt="result">&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h2 id="cnn-based-models">CNN based models&lt;/h2>
&lt;p>&lt;strong>Stacked Hourglass Network&lt;/strong> &lt;sup id="fnref:2">&lt;a href="#fn:2" class="footnote-ref" role="doc-noteref">2&lt;/a>&lt;/sup>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Fully-convolutional neural network&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Repeated down- and upsampling + shortcut connections&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Based on RGB face image, produce one heatmap for each landmark&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Heatmaps are transformed into numerical coordinates using DSNT&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-19%2016.51.43.png" alt="截屏2021-02-19 16.51.43">&lt;/p>
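The DSNT step amounts to taking the expected coordinate under the heatmap once it is normalized to a probability distribution; a minimal numpy sketch with coordinates scaled to $[-1, 1]$ (the scaling convention used by DSNT):

```python
import numpy as np

def dsnt(heatmap):
    """Differentiable spatial-to-numerical transform: normalize the heatmap and
    return the expected (x, y) landmark coordinate in [-1, 1] per axis."""
    h, w = heatmap.shape
    p = heatmap / heatmap.sum()             # probability distribution over pixels
    # Pixel-center coordinate grids in [-1, 1]
    xs = (2.0 * np.arange(w) + 1.0) / w - 1.0
    ys = (2.0 * np.arange(h) + 1.0) / h - 1.0
    x = (p.sum(axis=0) * xs).sum()          # expectation over columns
    y = (p.sum(axis=1) * ys).sum()          # expectation over rows
    return x, y
```

Unlike a hard argmax, this expectation is differentiable in the heatmap values, so coordinate-level losses can be backpropagated through it into the network.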
&lt;div class="footnotes" role="doc-endnotes">
&lt;hr>
&lt;ol>
&lt;li id="fn:1">
&lt;p>M. Dantone, J. Gall, G. Fanelli and L. Van Gool, &amp;ldquo;Real-time facial feature detection using conditional regression forests,&amp;rdquo; 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 2012, pp. 2578-2585, doi: 10.1109/CVPR.2012.6247976.&amp;#160;&lt;a href="#fnref:1" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:2">
&lt;p>Newell, A., Yang, K., &amp;amp; Deng, J. (2016). Stacked hourglass networks for human pose estimation. &lt;em>Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)&lt;/em>, &lt;em>9912 LNCS&lt;/em>, 483–499. &lt;a href="https://doi.org/10.1007/978-3-319-46484-8_29">https://doi.org/10.1007/978-3-319-46484-8_29&lt;/a>&amp;#160;&lt;a href="#fnref:2" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;/ol>
&lt;/div></description></item><item><title>Facial Expression Recognition</title><link>https://haobin-tan.netlify.app/docs/ai/computer-vision/cv-lecture/09-facial-expression-recognition/</link><pubDate>Thu, 18 Feb 2021 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/computer-vision/cv-lecture/09-facial-expression-recognition/</guid><description>&lt;h2 id="what-is-facial-expression-analysis">What is facial expression analysis?&lt;/h2>
&lt;h3 id="what-is-facial-expression">What is Facial Expression?&lt;/h3>
&lt;p>Facial expressions are the &lt;strong>facial changes in response to a person&amp;rsquo;s internal emotional states, intentions, or social communications.&lt;/strong>&lt;/p>
&lt;h3 id="role-of-facial-expressions">Role of facial expressions&lt;/h3>
&lt;ul>
&lt;li>One of the &lt;strong>most powerful, natural, and immediate ways&lt;/strong> (for human beings) to communicate emotions and intentions&lt;/li>
&lt;li>Face can express emotion &lt;strong>sooner&lt;/strong> than people verbalize or realize feelings&lt;/li>
&lt;li>Faces and facial expressions are an &lt;strong>important aspect&lt;/strong> in interpersonal communication and man-machine interfaces&lt;/li>
&lt;/ul>
&lt;h3 id="facial-expressions">Facial Expressions&lt;/h3>
&lt;ul>
&lt;li>Facial expression(s):
&lt;ul>
&lt;li>
&lt;p>nonverbal communication&lt;/p>
&lt;/li>
&lt;li>
&lt;p>voluntary / involuntary&lt;/p>
&lt;/li>
&lt;li>
&lt;p>results from one or more motions or positions of the muscles of the face&lt;/p>
&lt;/li>
&lt;li>
&lt;p>closely associated with our emotions&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>The fact: Most people&amp;rsquo;s success rate at reading emotions from facial expression is &lt;strong>only a little over 50 percent&lt;/strong>.&lt;/li>
&lt;/ul>
&lt;h4 id="facial-expression-analysis-vs-emotion-analysis">Facial expression analysis vs. Emotion analysis&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>Emotion analysis requires &lt;strong>higher level knowledge&lt;/strong>, such as context information.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Besides emotions, facial expressions can also express intention, cognitive processes, physical effort, etc.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h4 id="emotions-conveyed-by-facial-expressions">Emotions conveyed by Facial Expressions&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>Six basic emotions (assumed to be innate)&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-100" >&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%e6%88%aa%e5%b1%8f2021-02-19%2017.20.59.png" alt="截屏2021-02-19 17.20.59" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h3 id="basic-structure-of-facial-expression-analysis-systems">Basic structure of facial expression analysis systems&lt;/h3>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-19%2017.21.36.png" alt="截屏2021-02-19 17.21.36" style="zoom:80%;" />
&lt;h2 id="levels-of-description">Levels of description&lt;/h2>
&lt;h3 id="emotions">Emotions&lt;/h3>
&lt;h4 id="discrete-classes">Discrete classes&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>Six basic emotions&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-100" >&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%e6%88%aa%e5%b1%8f2021-02-19%2017.27.26.png" alt="截屏2021-02-19 17.27.26" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Positive, neutral, negative&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h4 id="continuous-valued-dimensions">Continuous valued dimensions&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>Emotions as a continuum along 2–3 dimensions&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Circumplex model by Russell&lt;/p>
&lt;ul>
&lt;li>Valence: unpleasant - pleasant&lt;/li>
&lt;li>Arousal: low – high activation&lt;/li>
&lt;/ul>
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-100" >&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%e6%88%aa%e5%b1%8f2021-02-19%2017.28.58.png" alt="截屏2021-02-19 17.28.58" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h3 id="facial-action-units-aus">Facial Action Units (AUs)&lt;/h3>
&lt;h4 id="facial-action-coding-system-facs">Facial Action Coding System (FACS)&lt;/h4>
&lt;ul>
&lt;li>A human-observer based system designed to &lt;strong>detect subtle changes in facial features&lt;/strong>&lt;/li>
&lt;li>Viewing videotaped facial behavior in &lt;em>slow&lt;/em> motion, a trained observer can manually FACS-code all possible facial displays&lt;/li>
&lt;li>These facial displays are referred to as &lt;strong>action units (AU)&lt;/strong> and may occur individually or in combinations.&lt;/li>
&lt;/ul>
&lt;h4 id="action-units-aus">Action Units (AUs)&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>There are 44 AUs&lt;/p>
&lt;/li>
&lt;li>
&lt;p>30 AUs related to contractions of special facial muscles&lt;/p>
&lt;ul>
&lt;li>
&lt;p>12 AUs for upper face&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-19%2017.32.01.png" alt="截屏2021-02-19 17.32.01" style="zoom:80%;" />
&lt;/li>
&lt;li>
&lt;p>18 AUs for lower face&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-19%2017.32.29.png" alt="截屏2021-02-19 17.32.29" style="zoom:80%;" />
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Anatomic basis of the remaining 14 is unspecified $\rightarrow$ referred to in Facial Action Coding System (FACS) as miscellaneous actions&lt;/p>
&lt;/li>
&lt;li>
&lt;p>For action units that vary in intensity, a 5-point ordinal scale is used to measure the degree of muscle contraction&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h4 id="combination-of-aus">Combination of AUs&lt;/h4>
&lt;p>More than 7000 different AU combinations have been observed.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Additive&lt;/strong>: appearance of single AUs does NOT change. E.g.&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-19%2017.36.39.png" alt="截屏2021-02-19 17.36.39" style="zoom:80%;" />
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Nonadditive&lt;/strong>: appearance of single AUs does change. E.g.&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-19%2017.37.05.png" alt="截屏2021-02-19 17.37.05" style="zoom:80%;" />
&lt;/li>
&lt;/ul>
&lt;h4 id="individual-differences-in-subjects">Individual Differences in Subjects&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>Variations in appearance&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Face shape,&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Texture&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Color&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Facial and scalp hair&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>due to sex, ethnic background, and age differences&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Variations in expressiveness&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h4 id="transitions-among-expressions">Transitions Among Expressions&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>Simplifying assumption: &lt;strong>expressions are singular and begin and end with a neutral position&lt;/strong>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Transitions from one action unit or combination of action units to another may involve NO intervening neutral state.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Parsing the stream of behavior is an essential requirement of a robust facial analysis system, and training data are needed that include dynamic combinations of action units, which may be either additive or nonadditive.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h4 id="intensity-of-facial-expression">Intensity of Facial Expression&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>Facial actions can vary in intensity&lt;/p>
&lt;/li>
&lt;li>
&lt;p>FACS coding uses 5-point intensity scale to describe intensity variation of action units&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Some related action units function as sets to represent intensity variation.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>E.g. in the eye region, action units 41, 42, and 43 or 45 can represent intensity variation from slightly drooped to closed eyes.&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-100" >&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%e6%88%aa%e5%b1%8f2021-02-19%2017.43.17.png" alt="截屏2021-02-19 17.43.17" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h4 id="relation-to-other-facial-behavior-or-nonfacial-behavior">Relation to other Facial Behavior or Nonfacial Behavior&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>Facial expression is one of several channels of nonverbal communication.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>The message values of various modes may differ depending on context.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>For robustness, should be integrated with&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Gesture&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Prosody&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Speech&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h2 id="different-datasets-and-systems">Different datasets and systems&lt;/h2>
&lt;h3 id="using-geometric-features--ann-2001--early-work">Using geometric features + ANN (2001 / early work)&lt;/h3>
&lt;p>&lt;strong>Recognizing Action Units for Facial Expression Analysis&lt;/strong>&lt;sup id="fnref:1">&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref">1&lt;/a>&lt;/sup>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>An &lt;strong>Automatic Facial Analysis (AFA)&lt;/strong> system to analyze facial expressions based on both &lt;strong>permanent facial features (brows, eyes, mouth)&lt;/strong> and &lt;strong>transient facial features (depending on facial furrows)&lt;/strong> in nearly frontal-view image sequences.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>A group of action units (neutral expression, six upper-face AUs, and ten lower-face AUs) is recognized, whether they occur alone or in combination.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h4 id="cohn-kanade-au-coded-facial-expression-database">Cohn-Kanade AU-Coded Facial Expression Database&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>100 subjects from varying ethnic backgrounds.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>23 different facial expressions (single action units and combinations of action units)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Frontal faces, small head motion&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Variations in lighting&lt;/p>
&lt;ul>
&lt;li>ambient lighting&lt;/li>
&lt;li>single-high-intensity lamp&lt;/li>
&lt;li>dual high-intensity lamps with reflective umbrellas&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Coded with FACS and assigned emotion-specified labels&lt;/strong> (happy, surprise, anger, disgust, fear, sadness)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Example&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-100" >&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%e6%88%aa%e5%b1%8f2021-02-19%2021.40.16.png" alt="截屏2021-02-19 21.40.16" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h4 id="feature-based-automatic-facial-action-analysis-afa-system">Feature-based Automatic Facial Action Analysis (AFA) System&lt;/h4>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-19%2021.41.28.png" alt="截屏2021-02-19 21.41.28" style="zoom:80%;" />
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Feature detection &amp;amp; feature location&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Region of the face and locations of individual face features are detected automatically in the initial frame using a neural-network-based approach&lt;/li>
&lt;li>Contours of face features and components adjusted manually in the initial frame&lt;/li>
&lt;li>Face features are then tracked automatically
&lt;ul>
&lt;li>&lt;strong>permanent features&lt;/strong> (e.g., brows, eyes, lips)&lt;/li>
&lt;li>&lt;strong>transient features&lt;/strong> (lines and furrows)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Feature extraction&lt;/strong>: Group facial features into separate collections of feature parameters&lt;/p>
&lt;ul>
&lt;li>15 normalized upper face parameters&lt;/li>
&lt;li>9 normalized lower face parameters&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Parameters fed to two neural-network-based classifiers&lt;/strong>&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h4 id="facial-feature-extraction">Facial Feature Extraction&lt;/h4>
&lt;p>Multistate Facial Component Models of a Frontal Face&lt;/p>
&lt;ul>
&lt;li>Permanent components/features
&lt;ul>
&lt;li>Lip&lt;/li>
&lt;li>Eye&lt;/li>
&lt;li>Brow&lt;/li>
&lt;li>Cheek&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Transient component/features
&lt;ul>
&lt;li>&lt;strong>Furrows&lt;/strong> and &lt;strong>wrinkles&lt;/strong> appear perpendicular to the direction of the motion of the activated muscles&lt;/li>
&lt;li>Classification
&lt;ul>
&lt;li>present (appear, deepen or lengthen)&lt;/li>
&lt;li>absent&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Detection
&lt;ul>
&lt;li>Canny edge detector&lt;/li>
&lt;li>Nasal root / crow’s-feet wrinkles&lt;/li>
&lt;li>Nasolabial furrows&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
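A sketch of the transient-feature decision, under the assumption that presence is judged by comparing Canny edge density inside a furrow region against the neutral frame (the threshold ratio here is a hypothetical parameter, not a value from the paper):

```python
import numpy as np

def furrow_present(edges_now, edges_neutral, region, ratio=1.5):
    """Classify a transient feature (furrow/wrinkle) as present if the edge
    density inside `region` grew noticeably relative to the neutral frame.
    `edges_now` / `edges_neutral` are binary Canny edge maps; `region` is a
    boolean mask over the nasal root, crow's-feet, or nasolabial area."""
    d_now = edges_now[region].mean()
    d_neutral = edges_neutral[region].mean()
    return d_now > ratio * max(d_neutral, 1e-6)
```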
&lt;h4 id="facial-feature-representation">Facial Feature Representation&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>Face coordinate system&lt;/p>
&lt;ul>
&lt;li>$x = $ line between inner corners of eyes&lt;/li>
&lt;li>$y = $ perpendicular to x&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Group facial features&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>upper face&lt;/strong> features: 15 parameters&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Example&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-100" >&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%e6%88%aa%e5%b1%8f2021-02-19%2021.54.56.png" alt="截屏2021-02-19 21.54.56" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>lower face&lt;/strong> features: 9 parameters&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Example&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-19%2021.55.15.png" alt="截屏2021-02-19 21.55.15" style="zoom:67%;" />
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
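The face coordinate system above can be constructed directly from the two inner eye corners; a sketch (the `(x, y)` point format and the midpoint origin are illustrative assumptions):

```python
import numpy as np

def face_coordinate_transform(left_eye_inner, right_eye_inner):
    """Build a map from image points into the face frame: the x-axis runs along
    the line between the inner eye corners, the y-axis is perpendicular to it,
    and the origin is placed at the midpoint between the two corners."""
    l = np.asarray(left_eye_inner, dtype=float)
    r = np.asarray(right_eye_inner, dtype=float)
    x_axis = (r - l) / np.linalg.norm(r - l)
    y_axis = np.array([-x_axis[1], x_axis[0]])   # perpendicular to the eye line
    origin = (l + r) / 2.0
    R = np.stack([x_axis, y_axis])               # rows are the new axes
    return lambda p: R @ (np.asarray(p, dtype=float) - origin)
```

Expressing the feature parameters in this frame makes them invariant to in-plane rotation and translation of the face.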
&lt;h4 id="au-recognition-by-neural-networks">AU Recognition by Neural Networks&lt;/h4>
&lt;ul>
&lt;li>Three layer neural networks (one hidden layer)&lt;/li>
&lt;li>Standard back-propagation method
&lt;ul>
&lt;li>Separate networks for upper- / lower face&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-19%2021.56.35.png" alt="截屏2021-02-19 21.56.35" style="zoom:80%;" />
&lt;h3 id="using-appearance-based-features--svm-2006">Using appearance-based features + SVM (2006)&lt;/h3>
&lt;p>&lt;strong>Automatic Recognition of Facial Actions in Spontaneous Expression&lt;/strong>&lt;sup id="fnref:2">&lt;a href="#fn:2" class="footnote-ref" role="doc-noteref">2&lt;/a>&lt;/sup>&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-100" >&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%e6%88%aa%e5%b1%8f2021-02-19%2022.40.45.png" alt="截屏2021-02-19 22.40.45" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h4 id="ru-facs-data-set">RU-FACS data set&lt;/h4>
&lt;ul>
&lt;li>Contains spontaneous expressions&lt;/li>
&lt;li>100 subjects&lt;/li>
&lt;/ul>
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-100" >&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%e6%88%aa%e5%b1%8f2021-02-19%2022.41.42.png" alt="截屏2021-02-19 22.41.42" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h3 id="using-deep-features-cnn--fusion-2013">Using Deep features (CNN) + fusion (2013)&lt;/h3>
&lt;h4 id="emotion-recognition-in-the-wild-challenge-emotiw">&lt;strong>Emotion Recognition in the Wild Challenge (EmotiW)&lt;/strong>&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>🎯 Goal: Move to more realistic, out-of-the-lab data&lt;/p>
&lt;/li>
&lt;li>
&lt;p>AFEW Dataset (Acted Facial Expressions in the Wild)&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Extracted from movies&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Annotated with six basic emotions&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Movie clips from 330 subjects, age range: 1-70&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Semi-automatic annotation pipeline&lt;/p>
&lt;ul>
&lt;li>Recommender system + manual annotation&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-100" >&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%e6%88%aa%e5%b1%8f2021-02-19%2022.45.20.png" alt="截屏2021-02-19 22.45.20" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h4 id="2013-winner">2013 Winner&lt;/h4>
&lt;p>&lt;strong>Combining Modality Specific Deep Neural Networks for Emotion Recognition in Video&lt;/strong>&lt;sup id="fnref:3">&lt;a href="#fn:3" class="footnote-ref" role="doc-noteref">3&lt;/a>&lt;/sup>&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-19%2022.47.35.png" alt="截屏2021-02-19 22.47.35" style="zoom:80%;" />
&lt;h5 id="convolutional-network">&lt;strong>Convolutional Network&lt;/strong>&lt;/h5>
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-100" >&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%e6%88%aa%e5%b1%8f2021-02-19%2022.49.28.png" alt="截屏2021-02-19 22.49.28" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;ul>
&lt;li>Inputs are images of size 40x40, cropped randomly&lt;/li>
&lt;li>Four layers, 3 convolutions followed by max or average pooling and a fully-connected layer&lt;/li>
&lt;/ul>
&lt;h5 id="representing-video-sequence">Representing video sequence&lt;/h5>
&lt;ul>
&lt;li>CNN gives 7-dim output per frame&lt;/li>
&lt;li>Multiple frames are averaged into 10 vectors describing the sequence
&lt;ul>
&lt;li>For shorter sequences, frames / vectors get expanded (duplicated)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Results in 70-dim feature vector (10*7)&lt;/li>
&lt;li>Classification with SVM&lt;/li>
&lt;/ul>
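The aggregation steps above can be sketched as follows, assuming per-frame 7-dimensional probability vectors from the CNN; the exact duplication scheme for short sequences is an illustrative choice.

```python
import numpy as np

def sequence_descriptor(frame_probs, n_segments=10):
    """Average per-frame CNN outputs (7 emotion probabilities per frame) over
    n_segments temporal bins and concatenate them into one fixed-length vector
    (10 * 7 = 70 dimensions). Shorter sequences are expanded by duplication."""
    frame_probs = np.asarray(frame_probs, dtype=float)   # (n_frames, 7)
    n = len(frame_probs)
    if n < n_segments:
        # Duplicate frames so that every temporal bin is non-empty
        idx = np.repeat(np.arange(n), int(np.ceil(n_segments / n)))
        frame_probs = frame_probs[idx]
        n = len(frame_probs)
    bins = np.array_split(np.arange(n), n_segments)
    segments = [frame_probs[b].mean(axis=0) for b in bins]
    return np.concatenate(segments)                      # shape (n_segments * 7,)
```

The resulting 70-dimensional descriptor has a fixed length regardless of clip duration, which is what allows a standard SVM to classify the sequence.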
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-100" >&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%e6%88%aa%e5%b1%8f2021-02-19%2022.56.28.png" alt="截屏2021-02-19 22.56.28" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h5 id="other-features">Other Features&lt;/h5>
&lt;ul>
&lt;li>&amp;ldquo;Bag of Mouth&amp;rdquo;&lt;/li>
&lt;li>Audio-features&lt;/li>
&lt;/ul>
&lt;h4 id="typical-pipline">Typical Pipline&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>Face detection and alignment&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Extract various features and different representations&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Build multiple classifiers&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Fusion of results&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h4 id="other-applications">Other Applications&lt;/h4>
&lt;ul>
&lt;li>Pain Analysis&lt;/li>
&lt;li>Analysis of psychological disorders&lt;/li>
&lt;li>Workload / stress analysis&lt;/li>
&lt;li>Adaptive user interfaces&lt;/li>
&lt;li>Advertisement&lt;/li>
&lt;/ul>
&lt;div class="footnotes" role="doc-endnotes">
&lt;hr>
&lt;ol>
&lt;li id="fn:1">
&lt;p>Y. . -I. Tian, T. Kanade and J. F. Cohn, &amp;ldquo;Recognizing action units for facial expression analysis,&amp;rdquo; in IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 2, pp. 97-115, Feb. 2001, doi: 10.1109/34.908962.&amp;#160;&lt;a href="#fnref:1" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:2">
&lt;p>Littlewort, Gwen &amp;amp; Frank, Mark &amp;amp; Lainscsek, Claudia &amp;amp; Fasel, Ian &amp;amp; Movellan, Javier. (2006). Automatic Recognition of Facial Actions in Spontaneous Expressions. Journal of Multimedia. 1. 10.4304/jmm.1.6.22-35.&amp;#160;&lt;a href="#fnref:2" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:3">
&lt;p>Kahou, Samira Ebrahimi &amp;amp; Pal, Christopher &amp;amp; Bouthillier, Xavier &amp;amp; Froumenty, Pierre &amp;amp; Gulcehre, Caglar &amp;amp; Memisevic, Roland &amp;amp; Vincent, Pascal &amp;amp; Courville, Aaron &amp;amp; Bengio, Y. &amp;amp; Ferrari, Raul &amp;amp; Mirza, Mehdi &amp;amp; Jean, Sébastien &amp;amp; Carrier, Pierre-Luc &amp;amp; Dauphin, Yann &amp;amp; Boulanger-Lewandowski, Nicolas &amp;amp; Aggarwal, Abhishek &amp;amp; Zumer, Jeremie &amp;amp; Lamblin, Pascal &amp;amp; Raymond, Jean-Philippe &amp;amp; Wu, Zhenzhou. (2013). Combining modality specific deep neural networks for emotion recognition in video. ICMI 2013 - Proceedings of the 2013 ACM International Conference on Multimodal Interaction. 543-550. 10.1145/2522848.2531745.&amp;#160;&lt;a href="#fnref:3" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;/ol>
&lt;/div></description></item><item><title>People Detection: Global Approaches</title><link>https://haobin-tan.netlify.app/docs/ai/computer-vision/cv-lecture/10-person_detection-holistic_models/</link><pubDate>Thu, 18 Feb 2021 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/computer-vision/cv-lecture/10-person_detection-holistic_models/</guid><description>&lt;h2 id="motivation">Motivation&lt;/h2>
&lt;h3 id="why-people-detection">Why people detection?&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>Person Re-Identification&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Person Tracking&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Security (e.g. Border Control)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Automotive (e.g. Collision Prevention)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Interaction (e.g. Xbox Kinect)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Medical (e.g. Patient Monitoring)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Commercial (e.g. Customer Counting)&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h3 id="why-is-people-detection-difficult">Why is people detection difficult?&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Clothing&lt;/strong>&lt;/p>
&lt;p>Large variety of clothing styles causes greater appearance variety&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Accessories&lt;/strong>
Occlusions by accessories. E.g. backpack, umbrella, handbag, &amp;hellip;&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Articulation&lt;/strong>
Faces are mostly rigid. Persons can take on many different poses&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Clutter&lt;/strong>
People frequently overlap each other in images (crowds)&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h2 id="categories">Categories&lt;/h2>
&lt;h3 id="still-image-vs-video">Still image vs. video&lt;/h3>
&lt;p>&lt;strong>Still image based&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Mostly based on gray-value information from visual images&lt;/li>
&lt;li>Other possible cues: color, infra-red, radar, stereo&lt;/li>
&lt;li>👍 Advantage: Applicable in wider variety of applications&lt;/li>
&lt;li>👎 Disadvantages
&lt;ul>
&lt;li>Often more difficult (only a single frame)&lt;/li>
&lt;li>Performs worse than video-based techniques&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Video based&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Background modeling&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Temporal information (speed, position in earlier frames)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Optical flow&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Can be (re-)initialized by still image approach&lt;/p>
&lt;/li>
&lt;li>
&lt;p>👎 Disadvantage: Hard to apply in unconstrained scenarios&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h3 id="global-vs-parts">Global vs. parts&lt;/h3>
&lt;p>&lt;strong>Global approaches&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Holistic model, e.g. one feature for whole person&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-19%2023.32.50.png" alt="截屏2021-02-19 23.32.50">&lt;/p>
&lt;/li>
&lt;li>
&lt;p>👍 Advantages&lt;/p>
&lt;ul>
&lt;li>typically simple model&lt;/li>
&lt;li>work well for low resolutions&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>👎 Disadvantages&lt;/p>
&lt;ul>
&lt;li>problems with occlusions&lt;/li>
&lt;li>problems with articulations&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Part-based approaches&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Model body sub-parts separately&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-19%2023.33.55.png" alt="截屏2021-02-19 23.33.55">&lt;/p>
&lt;/li>
&lt;li>
&lt;p>👍 Advantages&lt;/p>
&lt;ul>
&lt;li>deal better with moving body parts (poses)&lt;/li>
&lt;li>able to handle occlusions, overlaps&lt;/li>
&lt;li>sharing of training data&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>👎 Disadvantages&lt;/p>
&lt;ul>
&lt;li>require more complex reasoning&lt;/li>
&lt;li>problems with low resolutions&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="discriminative-vs-generative">discriminative vs. generative&lt;/h3>
&lt;p>&lt;strong>Generative model&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Models how data (i.e. person images) is generated&lt;/p>
&lt;/li>
&lt;li>
&lt;p>👍 Advantages&lt;/p>
&lt;ul>
&lt;li>possibly interpretable, i.e. know why reject/accept&lt;/li>
&lt;li>models the object class/can draw samples&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>👎 Disadvantages&lt;/p>
&lt;ul>
&lt;li>models variability that is unimportant to the classification task&lt;/li>
&lt;li>often hard to build good model with few parameters&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Discriminative model&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Can only discriminate, for given data, whether it is a person or not&lt;/p>
&lt;/li>
&lt;li>
&lt;p>👍 Advantages&lt;/p>
&lt;ul>
&lt;li>appealing when infeasible to model data itself&lt;/li>
&lt;li>currently often excel in practice&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>👎 Disadvantages&lt;/p>
&lt;ul>
&lt;li>often can’t provide uncertainty in predictions&lt;/li>
&lt;li>non-interpretable&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h2 id="typical-components-of-global-approaches">Typical components of global approaches&lt;/h2>
&lt;h3 id="detection-via-classification-binary-classifier">Detection via classification (binary classifier)&lt;/h3>
&lt;p>&lt;strong>Sliding window&lt;/strong>: Scan window at different &lt;strong>positions and scales&lt;/strong>&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-20%2013.49.47.png" alt="截屏2021-02-20 13.49.47">&lt;/p>
&lt;h3 id="gradient-based">Gradient based&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>Popular and successful in the vision community&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Avoid hard decisions (compared to edge based features)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Examples&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Histogram of Oriented Gradients (HOG)&lt;/strong>&lt;/li>
&lt;li>&lt;strong>Scale-Invariant Feature Transform (SIFT)&lt;/strong>&lt;/li>
&lt;li>&lt;strong>Gradient Location and Orientation Histogram (GLOH)&lt;/strong>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Computing gradients&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Centered
&lt;/p>
$$
f^{\prime}(x)=\lim \_{h \rightarrow 0} \frac{f(x+h)-f(x-h)}{2 h}
$$
&lt;/li>
&lt;li>
&lt;p>Gradient &lt;strong>magnitude&lt;/strong>
&lt;/p>
$$
s = \sqrt{s\_x^2 + s\_y^2}
$$
&lt;/li>
&lt;li>
&lt;p>Gradient &lt;strong>orientation&lt;/strong>
&lt;/p>
$$
\theta=\arctan \left(\frac{s\_{y}}{s\_{x}}\right)
$$
&lt;p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-20%2013.55.54.png" alt="截屏2021-02-20 13.55.54">&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Gradient in image&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Image: discrete, 2-dimensional signal&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Use filter mask to compute gradient&lt;/p>
&lt;ul>
&lt;li>
&lt;p>$x$-direction:&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-20%2014.50.04.png" alt="截屏2021-02-20 14.50.04" style="zoom: 67%;" />
&lt;/li>
&lt;li>
&lt;p>$y$-direction&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-20%2014.50.15.png" alt="截屏2021-02-20 14.50.15" style="zoom:67%;" />
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
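&lt;p>The gradient formulas above, applied to a discrete image with the centered filter masks (a small illustrative sketch on a synthetic intensity ramp):&lt;/p>

```python
import numpy as np

# Toy image whose brightness increases linearly along x.
img = np.tile(np.arange(8.0), (8, 1))

# Centered differences via the [-1 0 1] masks (border pixels left at 0).
s_x = np.zeros_like(img)
s_y = np.zeros_like(img)
s_x[:, 1:-1] = img[:, 2:] - img[:, :-2]   # x-direction
s_y[1:-1, :] = img[2:, :] - img[:-2, :]   # y-direction

magnitude = np.sqrt(s_x**2 + s_y**2)      # s = sqrt(s_x^2 + s_y^2)
orientation = np.arctan2(s_y, s_x)        # theta = arctan(s_y / s_x)

print(magnitude[4, 4], orientation[4, 4])  # 2.0 0.0 (pure horizontal gradient)
```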
&lt;h3 id="edge-based">Edge based&lt;/h3>
&lt;h3 id="wavelet-based">Wavelet based&lt;/h3>
&lt;h2 id="hog-people-detector-1">HOG people detector &lt;sup id="fnref:1">&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref">1&lt;/a>&lt;/sup>&lt;/h2>
&lt;blockquote>
&lt;p>More see: &lt;a href="https://haobin-tan.netlify.app/docs/ai/computer-vision/object-detection/hog/">Histogram of Oriented Gradients (HOG)&lt;/a>&lt;/p>
&lt;/blockquote>
&lt;ul>
&lt;li>Gradient-based feature descriptor developed for people detection&lt;/li>
&lt;li>Global descriptor for the complete body&lt;/li>
&lt;li>High-dimensional (typically ~4000 dimensions)&lt;/li>
&lt;li>Very promising results on challenging data sets&lt;/li>
&lt;/ul>
&lt;h3 id="phases">Phases&lt;/h3>
&lt;h4 id="learning-phase">Learning Phase&lt;/h4>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-20%2017.13.56.png" alt="截屏2021-02-20 17.13.56" style="zoom:80%;" />
&lt;h4 id="detection-phase">Detection Phase&lt;/h4>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-20%2017.14.25.png" alt="截屏2021-02-20 17.14.25" style="zoom:80%;" />
&lt;h3 id="how-hog-descriptor-works">How HOG descriptor works?&lt;/h3>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-20%2017.19.38.png" alt="截屏2021-02-20 17.19.38" style="zoom:80%;" />
&lt;ol>
&lt;li>Compute gradients on an image region of 64x128 pixels&lt;/li>
&lt;li>Compute gradient orientation histograms on &lt;em>cells&lt;/em> of 8x8 pixels (in total 8x16 cells).
typical histogram size: 9 bins&lt;/li>
&lt;li>Normalize histograms within overlapping &lt;em>blocks&lt;/em> of 2x2 cells (in total 7x15 blocks)
block descriptor size: 4x9 = 36&lt;/li>
&lt;li>Concatenate block descriptors $\rightarrow$ 7 x 15 x 4 x 9 = 3780 dimensional feature vector&lt;/li>
&lt;/ol>
&lt;h4 id="1-gradients">1. Gradients&lt;/h4>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-20%2017.21.44.png" alt="截屏2021-02-20 17.21.44" style="zoom:80%;" />
&lt;ul>
&lt;li>
&lt;p>Convolution with [-1 0 1] filters (x and y direction)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Compute gradient magnitude and direction&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Per pixel: color channel with greatest magnitude is used for final gradient (color is used!)&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-20%2017.22.45.png" alt="截屏2021-02-20 17.22.45">&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h4 id="2-cell-histograms">2. &lt;strong>Cell histograms&lt;/strong>&lt;/h4>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-20%2017.23.45.png" alt="截屏2021-02-20 17.23.45" style="zoom:67%;" />
&lt;ul>
&lt;li>9 bins for gradient orientations (0-180 degrees)&lt;/li>
&lt;li>Filled with magnitudes&lt;/li>
&lt;li>Interpolated trilinearly
&lt;ul>
&lt;li>bilinearly into spatial cells&lt;/li>
&lt;li>linearly into orientation bins&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h4 id="3-blocks">3. Blocks&lt;/h4>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-20%2017.25.24.png" alt="截屏2021-02-20 17.25.24" style="zoom:80%;" />
&lt;ul>
&lt;li>Overlapping blocks of 2x2 cells&lt;/li>
&lt;li>Cell histograms are concatenated and then normalized&lt;/li>
&lt;li>Normalization
&lt;ul>
&lt;li>different norms possible (L2, L2hys etc.)&lt;/li>
&lt;li>add a normalization epsilon to avoid division by zero&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h4 id="4-the-final-hog-descriptor">4. &lt;strong>The final HOG descriptor&lt;/strong>&lt;/h4>
&lt;p>Concatenation of block descriptors&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-20%2017.32.07.png" alt="截屏2021-02-20 17.32.07">&lt;/p>
&lt;p>Visualization&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-20%2017.32.24.png" alt="截屏2021-02-20 17.32.24" style="zoom:80%;" />
&lt;h3 id="from-feature-to-detector">From feature to detector&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>Simple linear SVM on top of the HOG Features&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Fast (one inner product per evaluation window)&lt;/p>
&lt;p>for an entire image it’s a vector-matrix multiplication&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Gaussian kernel SVM&lt;/p>
&lt;ul>
&lt;li>
&lt;p>slightly better classification accuracy&lt;/p>
&lt;/li>
&lt;li>
&lt;p>but considerable increase in computation time&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
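&lt;p>Why the linear SVM is fast: each window costs one inner product, and all windows of an image stacked row-wise cost a single matrix-vector multiplication. The weight vector and features here are random placeholders, not a trained detector:&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=3780)                 # placeholder SVM weight vector
b = -1.0                                  # placeholder bias
feats = rng.normal(size=(500, 3780))      # HOG features of 500 windows

# One window: a single inner product.
score_0 = feats[0] @ w + b
# All windows of an image at once: one matrix-vector multiplication.
scores = feats @ w + b
detections = np.flatnonzero(scores > 0)   # windows classified as "person"
```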
&lt;h2 id="silhouette-matching-2">Silhouette Matching &lt;sup id="fnref:2">&lt;a href="#fn:2" class="footnote-ref" role="doc-noteref">2&lt;/a>&lt;/sup>&lt;/h2>
&lt;h3 id="idea">Idea&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>🎯 &lt;strong>Goal: align known object shapes with image&lt;/strong>&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-20%2017.38.46.png" alt="截屏2021-02-20 17.38.46">&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Requirements for an alignment algorithm&lt;/p>
&lt;ul>
&lt;li>
&lt;p>high detection rate&lt;/p>
&lt;/li>
&lt;li>
&lt;p>few false positives&lt;/p>
&lt;/li>
&lt;li>
&lt;p>robustness&lt;/p>
&lt;/li>
&lt;li>
&lt;p>computationally inexpensive&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="computational-complexity">Computational complexity&lt;/h3>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-20%2017.41.11.png" alt="截屏2021-02-20 17.41.11" style="zoom:67%;" />
&lt;p>Complexity is &lt;strong>O(#positions * #templates * #contourpixels * sizeof(searchregion))&lt;/strong>&lt;/p>
&lt;h3 id="distance-transform">Distance transform&lt;/h3>
&lt;p>Used to compare/align two (typically binary) shapes&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-20%2017.44.59.png" alt="截屏2021-02-20 17.44.59">&lt;/p>
&lt;ol>
&lt;li>
&lt;p>Compute the distance from each pixel to the nearest edge pixel&lt;/p>
&lt;ul>
&lt;li>here the Euclidean distances are approximated by the &lt;strong>2-3 distance&lt;/strong>&lt;/li>
&lt;/ul>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-20%2017.45.23.png" alt="截屏2021-02-20 17.45.23" style="zoom:80%;" />
&lt;/li>
&lt;li>
&lt;p>Overlay second shape over distance transform&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-20%2017.45.42.png" alt="截屏2021-02-20 17.45.42">&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Accumulate distances along shape 2&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Find best matching position by an exhaustive search&lt;/p>
&lt;/li>
&lt;/ol>
&lt;p>However:&lt;/p>
&lt;ul>
&lt;li>2-3 distance is not symmetric&lt;/li>
&lt;li>2-3 distance has to be normalized w.r.t. the length of the shapes&lt;/li>
&lt;/ul>
&lt;h3 id="chamfer-matching">Chamfer matching&lt;/h3>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-20%2017.47.50.png" alt="截屏2021-02-20 17.47.50">&lt;/p>
&lt;h4 id="efficient-implementation">&lt;strong>Efficient Implementation&lt;/strong>&lt;/h4>
&lt;p>The distance transform can be efficiently computed by two scans over the complete image&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Forward-Scan&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>starts in the upper-left corner and moves from left to right, top to bottom&lt;/p>
&lt;/li>
&lt;li>
&lt;p>uses the following mask&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-20%2017.50.24.png" alt="截屏2021-02-20 17.50.24">&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Backward-Scan&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>starts in the lower-right corner and moves from right to left, bottom to top&lt;/p>
&lt;/li>
&lt;li>
&lt;p>uses the following mask&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-20%2017.50.50.png" alt="截屏2021-02-20 17.50.50">&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
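&lt;p>The two-scan distance transform can be sketched as follows (naive Python loops for clarity; edge pixels get distance 0, axial neighbors add 2, diagonal neighbors add 3, matching the 2-3 masks above):&lt;/p>

```python
import numpy as np

def chamfer_dt(edges):
    """Two-scan 2-3 distance transform over a boolean edge map."""
    INF = 10**6
    d = np.where(edges, 0, INF).astype(np.int64)
    H, W = d.shape
    # Forward scan: upper-left corner, left to right, top to bottom.
    for y in range(H):
        for x in range(W):
            if y > 0:
                d[y, x] = min(d[y, x], d[y - 1, x] + 2)
                if x > 0:
                    d[y, x] = min(d[y, x], d[y - 1, x - 1] + 3)
                if x < W - 1:
                    d[y, x] = min(d[y, x], d[y - 1, x + 1] + 3)
            if x > 0:
                d[y, x] = min(d[y, x], d[y, x - 1] + 2)
    # Backward scan: lower-right corner, right to left, bottom to top.
    for y in range(H - 1, -1, -1):
        for x in range(W - 1, -1, -1):
            if y < H - 1:
                d[y, x] = min(d[y, x], d[y + 1, x] + 2)
                if x < W - 1:
                    d[y, x] = min(d[y, x], d[y + 1, x + 1] + 3)
                if x > 0:
                    d[y, x] = min(d[y, x], d[y + 1, x - 1] + 3)
            if x < W - 1:
                d[y, x] = min(d[y, x], d[y, x + 1] + 2)
    return d

edges = np.zeros((5, 5), dtype=bool)
edges[2, 2] = True
dt = chamfer_dt(edges)
print(dt[2, 2], dt[2, 3], dt[1, 1])  # 0 2 3
```

The chamfer matching score of a template is then the (length-normalized) sum of `dt` values along the overlaid template contour.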
&lt;p>Advantages&lt;/p>
&lt;ul>
&lt;li>Fast&lt;/li>
&lt;li>Good performance on uncluttered images (with few background structures)&lt;/li>
&lt;/ul>
&lt;p>Disadvantages&lt;/p>
&lt;ul>
&lt;li>Bad performance for cluttered images&lt;/li>
&lt;li>Needs a huge number of people silhouettes&lt;/li>
&lt;/ul>
&lt;h3 id="template-hierarchy">Template Hierarchy&lt;/h3>
&lt;ul>
&lt;li>Reduce the number of silhouettes to consider&lt;/li>
&lt;li>The shapes are clustered by similarity&lt;/li>
&lt;/ul>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-20%2017.52.49.png" alt="截屏2021-02-20 17.52.49">&lt;/p>
&lt;h3 id="coarse-to-fine-search">Coarse-To-Fine Search&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>Goal: Reduce search effort by discarding unlikely regions with minimal computational effort&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Idea:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>subsample the image and search first at a coarse scale&lt;/p>
&lt;/li>
&lt;li>
&lt;p>only consider regions with a low distance when searching for a match on finer scales&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Need to find reasonable thresholds&lt;/p>
&lt;/li>
&lt;/ul>
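&lt;p>A toy sketch of the coarse-to-fine idea (the distance maps, subsampling factor, and threshold are made-up illustration values): only coarse positions below the threshold are refined on the finer scale.&lt;/p>

```python
import numpy as np

def coarse_to_fine(dist_coarse, dist_fine, factor=2, thresh=5.0):
    """Keep coarse-scale positions with low chamfer distance, then search
    only the corresponding fine-scale neighborhoods."""
    candidates = np.argwhere(dist_coarse < thresh)   # promising coarse cells
    refined = []
    for cy, cx in candidates:
        y0, x0 = cy * factor, cx * factor            # map to fine scale
        patch = dist_fine[y0:y0 + factor, x0:x0 + factor]
        dy, dx = np.unravel_index(patch.argmin(), patch.shape)
        refined.append((y0 + dy, x0 + dx, patch.min()))
    return refined

dist_coarse = np.full((4, 4), 10.0); dist_coarse[1, 2] = 1.0
dist_fine = np.full((8, 8), 10.0); dist_fine[3, 5] = 0.5
matches = coarse_to_fine(dist_coarse, dist_fine)
```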
&lt;div class="footnotes" role="doc-endnotes">
&lt;hr>
&lt;ol>
&lt;li id="fn:1">
&lt;p>N. Dalal and B. Triggs, &amp;ldquo;Histograms of oriented gradients for human detection,&amp;rdquo; 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), San Diego, CA, USA, 2005, pp. 886-893 vol. 1, doi: 10.1109/CVPR.2005.177.&amp;#160;&lt;a href="#fnref:1" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:2">
&lt;p>D. M. Gavrila and V. Philomin, &amp;ldquo;Real-time object detection for &amp;ldquo;smart&amp;rdquo; vehicles,&amp;rdquo; Proceedings of the Seventh IEEE International Conference on Computer Vision, Kerkyra, Greece, 1999, pp. 87-93 vol.1, doi: 10.1109/ICCV.1999.791202.&amp;#160;&lt;a href="#fnref:2" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;/ol>
&lt;/div></description></item><item><title>People Detection: Part-based Approaches</title><link>https://haobin-tan.netlify.app/docs/ai/computer-vision/cv-lecture/11-person_detection-part_based/</link><pubDate>Wed, 07 Jul 2021 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/computer-vision/cv-lecture/11-person_detection-part_based/</guid><description>&lt;h2 id="motivation">Motivation&lt;/h2>
&lt;ul>
&lt;li>Model body-parts separately&lt;/li>
&lt;li>Break down an object’s overall variability into more manageable pieces&lt;/li>
&lt;li>Pieces can be classified by less complex classifiers&lt;/li>
&lt;li>Apply prior knowledge by (manually) splitting the global object into meaningful parts&lt;/li>
&lt;li>Advantages
&lt;ul>
&lt;li>deal better with moving body parts (poses)&lt;/li>
&lt;li>able to handle occlusions, overlaps&lt;/li>
&lt;li>sharing of training data&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Disadvantages
&lt;ul>
&lt;li>require more complex reasoning&lt;/li>
&lt;li>problems with low resolutions&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="part-based-models">Part-based models&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>Two main components&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>parts&lt;/strong> (2D image fragments)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>structure&lt;/strong> (configuration of parts) $\rightarrow$ often also &lt;em>part-combination method&lt;/em>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Fixed spatial layout&lt;/p>
&lt;ul>
&lt;li>Local parts are modeled to have a mostly fixed position and orientation with respect to the object or detection window center&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Flexible Spatial Layout&lt;/p>
&lt;ul>
&lt;li>local parts are allowed to shift in location and scale&lt;/li>
&lt;li>can better handle deformations or articulation changes&lt;/li>
&lt;li>well suited for non-rigid objects&lt;/li>
&lt;li>spatial relations are often modeled probabilistically&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h2 id="the-mohan-people-detector-1">The Mohan People Detector &lt;sup id="fnref:1">&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref">1&lt;/a>&lt;/sup>&lt;/h2>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-07-13%2021.17.24.png" alt="截屏2021-07-13 21.17.24">&lt;/p>
&lt;ul>
&lt;li>4 parts
&lt;ul>
&lt;li>face and shoulder&lt;/li>
&lt;li>legs&lt;/li>
&lt;li>right arm&lt;/li>
&lt;li>left arm&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Fixed layout
&lt;ul>
&lt;li>Body parts are not always at the exact same position&lt;/li>
&lt;li>Allow local shifts: in position and in scale&lt;/li>
&lt;li>Best location has to be found for each detection window&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Combination: Classifier (SVM)&lt;/li>
&lt;li>Detection
&lt;ul>
&lt;li>sliding window approach&lt;/li>
&lt;li>64x128 pixels&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h2 id="the-implicit-shape-model-ism-2">The Implicit Shape Model (ISM) &lt;sup id="fnref:2">&lt;a href="#fn:2" class="footnote-ref" role="doc-noteref">2&lt;/a>&lt;/sup>&lt;/h2>
&lt;p>&lt;strong>💡 Main ideas&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Automatically learn a large number of local parts that occur on the object (referred to as visual vocabulary, bag of words or codebook)&lt;/li>
&lt;li>Learn a star-topology structural model
&lt;ul>
&lt;li>features are considered independent given the object’s center&lt;/li>
&lt;li>likely relative positions are learned from data&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>5 steps&lt;/strong>&lt;/p>
&lt;ol>
&lt;li>&lt;a href="#part-detectionlocalization">Part detection/localization&lt;/a>&lt;/li>
&lt;li>&lt;a href="#part-description">Part description&lt;/a>&lt;/li>
&lt;li>&lt;a href="#learning-part-appearances">Learning part appearance&lt;/a>&lt;/li>
&lt;li>&lt;a href="#learning-the-spatial-layout-of-parts">Learning theh spatial layout of parts&lt;/a>&lt;/li>
&lt;li>&lt;a href="#combination-of-part-detections">Combination of part detections&lt;/a>&lt;/li>
&lt;/ol>
&lt;h3 id="part-detectionlocalization">Part Detection/Localization&lt;/h3>
&lt;p>A good part decomposition needs to be&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Repeatable&lt;/p>
&lt;p>We should be able to find the part despite articulation or image transformations (e.g. invariance to rotation, perspective, lighting)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Distinctive&lt;/p>
&lt;ul>
&lt;li>A part should not be easily confused with other parts; the region should contain an “interesting” structure&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Compact&lt;/p>
&lt;p>No lengthy or strangely shaped parts&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Efficient&lt;/p>
&lt;p>Computationally inexpensive to detect or represent&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Cover&lt;/p>
&lt;p>Parts need to sufficiently cover the object&lt;/p>
&lt;/li>
&lt;/ul>
&lt;blockquote>
&lt;h4 id="local-features">Local features&lt;/h4>
&lt;p>Two components of local features:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>key- or interest-points&lt;/strong> (&lt;em>&amp;ldquo;Where is it?&amp;rdquo;&lt;/em>)
&lt;ul>
&lt;li>specify repeatable points on the object&lt;/li>
&lt;li>consist of x-, y-position and scale&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>local (keypoint) descriptors&lt;/strong> (&lt;em>&amp;ldquo;What does it look like?&amp;rdquo;&lt;/em>)
&lt;ul>
&lt;li>describe the area around an interest point&lt;/li>
&lt;li>i.e. define the feature representation of an interest point&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;p>General approach&lt;/p>
&lt;ul>
&lt;li>Find keypoints using keypoint detector&lt;/li>
&lt;li>Define region around keypoint&lt;/li>
&lt;li>Normalize region&lt;/li>
&lt;li>Compute local descriptor&lt;/li>
&lt;li>Compare descriptors&lt;/li>
&lt;/ul>
&lt;/blockquote>
&lt;h4 id="keypoint-detectors">Keypoint detectors&lt;/h4>
&lt;p>Find reproducible, scale invariant local keypoints in an image&lt;/p>
&lt;p>Keypoint Localization&lt;/p>
&lt;ul>
&lt;li>Goals
&lt;ul>
&lt;li>repeatable detection&lt;/li>
&lt;li>precise localization&lt;/li>
&lt;li>interesting content&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Idea: Look for two-dimensional signal changes&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Hessian Detector&lt;/strong>&lt;/p>
&lt;p>&lt;strong>Search for strong second derivatives in two orthogonal directions&lt;/strong> (Hessian determinant)
&lt;/p>
$$
\operatorname{Hessian}(I)=\left[\begin{array}{ll}
I\_{x x} &amp; I\_{x y} \\\\
I\_{x y} &amp; I\_{y y}
\end{array}\right]
$$
$$
\operatorname{det}(\operatorname{Hessian}(I))=I\_{x x} I\_{y y}-I\_{x y}^{2}
$$
&lt;p>Second Partial Derivative Test: If $\det(H) > 0$, we have a local minimum or maximum.&lt;/p>
&lt;p>Example:&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-07-13%2022.29.41.png" alt="截屏2021-07-13 22.29.41">&lt;/p>
&lt;p>Responses:&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-07-13%2022.37.39.png" alt="截屏2021-07-13 22.37.39" style="zoom:67%;" />
&lt;p>&lt;strong>Handle scale&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Scale Space&lt;/p>
&lt;p>Not only detect a distinctive position, but also a characteristic scale around an interest point&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-07-13%2022.41.21.png" alt="截屏2021-07-13 22.41.21">&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Scale Invariance&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Same operator responses, if the patch contains the same image up to a scale factor&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-07-13%2022.44.08.png" alt="截屏2021-07-13 22.44.08">&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Automatic Scale Selection: Function responses for increasing scale (scale signature)&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Laplacian-of-Gaussian (LoG)&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-07-13%2022.47.07.png" alt="截屏2021-07-13 22.47.07" style="zoom:67%;" />
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="part-description">Part Description&lt;/h3>
&lt;p>Distinctly describe local keypoints and achieve orientation invariance&lt;/p>
&lt;h4 id="local-descriptors">Local Descriptors&lt;/h4>
&lt;ul>
&lt;li>Goal: Describe (local) region around a keypoint&lt;/li>
&lt;li>Most available descriptors focus on &lt;em>edge/gradient&lt;/em> information
&lt;ul>
&lt;li>Capture boundary and texture information&lt;/li>
&lt;li>Color is still used relatively seldom&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Orientation Invariance&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Compute orientation histogram&lt;/li>
&lt;li>Select dominant orientation&lt;/li>
&lt;li>Normalize: rotate to fixed orientation&lt;/li>
&lt;/ul>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-07-13%2022.51.44.png" alt="截屏2021-07-13 22.51.44" style="zoom:67%;" />
&lt;ul>
&lt;li>
&lt;p>&lt;strong>The SIFT descriptor&lt;/strong>: Histogram of gradient orientations&lt;/p>
&lt;ul>
&lt;li>
&lt;p>captures important texture information&lt;/p>
&lt;/li>
&lt;li>
&lt;p>robust to small translations / affine deformations&lt;/p>
&lt;/li>
&lt;li>
&lt;p>How it works (similar to HOG):&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-07-13%2022.53.18.png" alt="截屏2021-07-13 22.53.18" style="zoom:80%;" />
&lt;ul>
&lt;li>region rescaled to a grid of 16x16 pixels (8x8 in image)&lt;/li>
&lt;li>4x4 regions (2x2 in image) = 16 histograms (concatenated)&lt;/li>
&lt;li>histograms: 8 orientation bins, gradients weighted by gradient magnitude&lt;/li>
&lt;li>final descriptor has 128 dimensions and is normalized to compensate for illumination differences&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;blockquote>
&lt;p>A brief introduction: &lt;a href="https://youtu.be/4AvTMVD9ig0">SIFT - 5 Minutes with Cyrill&lt;/a>&lt;/p>
&lt;p>A nice explanation: (source: &lt;a href="https://gilscvblog.com/2013/08/18/a-short-introduction-to-descriptors/">https://gilscvblog.com/2013/08/18/a-short-introduction-to-descriptors/&lt;/a>)&lt;/p>
&lt;p>SIFT was presented in 1999 by David Lowe and includes both a keypoint detector and descriptor. SIFT is computed as follows:&lt;/p>
&lt;ol>
&lt;li>First, detect keypoints using the SIFT detector, which also detects scale and orientation of the keypoint.&lt;/li>
&lt;li>Next, for a given keypoint, warp the region around it to canonical orientation and scale and resize the region to 16X16 pixels.&lt;/li>
&lt;/ol>
&lt;p>&lt;a href="https://gilscvblog.files.wordpress.com/2013/08/figure3.jpg">&lt;img src="https://gilscvblog.files.wordpress.com/2013/08/figure3.jpg?w=600&amp;amp;h=192" alt="SIFT - warping the region around the keypoint">&lt;/a>&lt;/p>
&lt;ol start="3">
&lt;li>
&lt;p>Compute the gradient for each pixel (orientation and magnitude).&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Divide the pixels into 16 squares of 4x4 pixels each.&lt;/p>
&lt;/li>
&lt;/ol>
&lt;p>&lt;a href="https://gilscvblog.files.wordpress.com/2013/08/figure4.jpg">&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/figure4.jpg" alt="SIFT - dividing to squares and calculating orientation">&lt;/a>&lt;/p>
&lt;ol start="5">
&lt;li>For each square, compute gradient direction histogram over 8 directions&lt;/li>
&lt;/ol>
&lt;p>&lt;a href="https://gilscvblog.files.wordpress.com/2013/08/figure5.jpg">&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/figure5.jpg" alt="SIFT - calculating histograms of gradient orientation">&lt;/a>&lt;/p>
&lt;ol start="6">
&lt;li>Concatenate the histograms to obtain a 128 (16*8) dimensional feature vector:&lt;/li>
&lt;/ol>
&lt;p>&lt;a href="https://gilscvblog.files.wordpress.com/2013/08/figure6.jpg">&lt;img src="https://gilscvblog.files.wordpress.com/2013/08/figure6.jpg?w=600&amp;amp;h=50" alt="SIFT - concatenating histograms from different squares">&lt;/a>&lt;/p>
&lt;p>SIFT descriptor illustration:&lt;/p>
&lt;p>&lt;a href="https://gilscvblog.files.wordpress.com/2013/08/figure7.jpg">&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/figure7.jpg" alt="SIFT descriptors illustration">&lt;/a>&lt;/p>
&lt;p>SIFT is invariant to illumination changes, as gradients are invariant to light intensity shift. It’s also somewhat invariant to rotation, as histograms do not contain any geometric information.&lt;/p>
&lt;/blockquote>
&lt;p>&lt;strong>Shape Context Descriptor&lt;/strong>&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-07-13%2022.58.08.png" alt="截屏2021-07-13 22.58.08">&lt;/p>
&lt;h4 id="what-local-features-should-i-use">&lt;strong>What Local Features Should I Use?&lt;/strong>&lt;/h4>
&lt;ul>
&lt;li>Best choice often application dependent
&lt;ul>
&lt;li>Harris-/Hessian-Laplace/DoG work well for many natural categories&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>More features are better
&lt;ul>
&lt;li>combining several detectors often helps&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="learning-part-appearances">Learning Part Appearances&lt;/h3>
&lt;h4 id="visual-vocabulary">Visual Vocabulary&lt;/h4>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-07-13%2023.12.03.png" alt="截屏2021-07-13 23.12.03">&lt;/p>
&lt;ol>
&lt;li>Detect keypoints on all person training examples&lt;/li>
&lt;li>Compute local descriptors for all keypoints&lt;/li>
&lt;/ol>
&lt;p>$\rightarrow$ Result: Large set of local image descriptors that all occur on people&lt;/p>
&lt;p>Group visually similar local descriptors&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-07-13%2023.14.16.png" alt="截屏2021-07-13 23.14.16">&lt;/p>
&lt;ul>
&lt;li>
&lt;p>similar local descriptors = parts that are reoccurring&lt;/p>
&lt;/li>
&lt;li>
&lt;p>parts that occur only rarely are discarded (they could result from noise or background structures)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>result: descriptor groups representing human body parts&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Grouping Algorithms / Clustering&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Partitional Clustering
&lt;ul>
&lt;li>K-Means&lt;/li>
&lt;li>Gaussian Mixture Clustering (EM)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Hierarchical or Agglomerative Clustering
&lt;ul>
&lt;li>Single-Link (minimum)&lt;/li>
&lt;li>Group-Average&lt;/li>
&lt;li>Ward’s method (minimum variance)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
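&lt;p>A minimal k-means sketch for building the visual vocabulary: the cluster centers play the role of visual words. The descriptors here are synthetic, and the farthest-point initialization is an assumption chosen to keep the small example deterministic:&lt;/p>

```python
import numpy as np

def kmeans(X, k=2, iters=10):
    """Toy k-means: cluster local descriptors X (n x d) into k visual words."""
    # farthest-point initialization (deterministic for this example)
    centers = [X[0]]
    for _ in range(k - 1):
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[d.argmax()])
    centers = np.array(centers)
    for _ in range(iters):
        # assign every descriptor to its nearest center ...
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # ... and move each center to the mean of its group
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers, labels

# Two well-separated synthetic descriptor groups -> two visual words.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.1, (30, 8)), rng.normal(5, 0.1, (30, 8))])
centers, labels = kmeans(X, k=2)
```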
&lt;h3 id="learning-the-spatial-layout-of-parts">Learning the Spatial Layout of Parts&lt;/h3>
&lt;p>&lt;strong>Spatial Occurrence (Star-Model)&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Record spatial occurrence&lt;/p>
&lt;ul>
&lt;li>
&lt;p>match vocabulary entries to training images&lt;/p>
&lt;/li>
&lt;li>
&lt;p>record occurrence distributions with respect to object center (location $(x, y)$ and scale)&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-07-13%2023.18.32.png" alt="截屏2021-07-13 23.18.32">&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Generalized Hough Transform&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>For every feature, store possible “occurrences”&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-07-13%2023.19.41.png" alt="截屏2021-07-13 23.19.41">&lt;/p>
&lt;/li>
&lt;li>
&lt;p>For new image, let the matched features vote for possible object positions&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-07-13%2023.20.24.png" alt="截屏2021-07-13 23.20.24">&lt;/p>
&lt;/li>
&lt;/ul>
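&lt;p>The voting step can be illustrated with a toy generalized Hough transform (a hypothetical two-word vocabulary with made-up offsets; real ISM votes also include scale and are weighted):&lt;/p>

```python
from collections import defaultdict

# Each vocabulary entry stores learned offsets from the part to the object center.
occurrences = {
    "head": [(0, -40)],   # object center lies 40 px below a head match
    "foot": [(0, 40)],    # ... and 40 px above a foot match
}
matches = [("head", (50, 90)), ("foot", (50, 10)), ("head", (120, 95))]

votes = defaultdict(int)
for word, (x, y) in matches:
    for dx, dy in occurrences[word]:
        votes[(x + dx, y + dy)] += 1   # each match votes for a center hypothesis

center, count = max(votes.items(), key=lambda kv: kv[1])
print(center, count)  # (50, 50) 2  -> head and foot agree on one person
```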
&lt;h3 id="combination-of-part-detections">Combination of Part Detections&lt;/h3>
&lt;p>ISM Detection Procedure:&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-07-13%2023.21.20.png" alt="截屏2021-07-13 23.21.20">&lt;/p>
&lt;div class="footnotes" role="doc-endnotes">
&lt;hr>
&lt;ol>
&lt;li id="fn:1">
&lt;p>A. Mohan, C. Papageorgiou and T. Poggio, &amp;ldquo;Example-based object detection in images by components,&amp;rdquo; in &lt;em>IEEE Transactions on Pattern Analysis and Machine Intelligence&lt;/em>, vol. 23, no. 4, pp. 349-361, April 2001, doi: 10.1109/34.917571.&amp;#160;&lt;a href="#fnref:1" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:2">
&lt;p>Leibe, B. &amp;amp; Leonardis, Ales &amp;amp; Schiele, B.. (2004). Combined object categorization and segmentation with an implicit shape model. Proc. 8th Eur. Conf. Comput. Vis. (ECCV). 2.&amp;#160;&lt;a href="#fnref:2" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;/ol>
&lt;/div></description></item><item><title>People Detection: Deep Learning Approaches</title><link>https://haobin-tan.netlify.app/docs/ai/computer-vision/cv-lecture/12-person_detection-deep_learning/</link><pubDate>Fri, 16 Jul 2021 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/computer-vision/cv-lecture/12-person_detection-deep_learning/</guid><description>&lt;h2 id="deep-learning-for-object-detection">Deep Learning for Object Detection&lt;/h2>
&lt;ul>
&lt;li>People detection is a special case of object detection (one of the most challenging object classes to detect)&lt;/li>
&lt;li>Recently, most detectors are trained for the more challenging task of multi-object detection
&lt;ul>
&lt;li>Goal: Given an image, detect all instances of, say, 1000 different object classes&lt;/li>
&lt;li>“Person” always one of the classes&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Speed&lt;/strong> is an issue
&lt;ul>
&lt;li>&lt;strong>Sliding Window&lt;/strong>: Look at each position, each scale&lt;/li>
&lt;li>&lt;strong>Cascades&lt;/strong> look at each position too
&lt;ul>
&lt;li>They just take a shorter look at most positions/scales&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Region Proposals&lt;/strong>: Avoid useless positions/scales from the beginning&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="region-proposals">Region Proposals&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>💡Idea&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Identify image regions that are likely to contain an object&lt;/strong>&lt;/li>
&lt;li>&lt;strong>Don’t care about the object class in the regions at this point&lt;/strong>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Characterization of a general object&lt;/p>
&lt;ul>
&lt;li>Find “blobby” regions&lt;/li>
&lt;li>Find connected regions that are somehow distinct from their surroundings&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Requirements&lt;/p>
&lt;ul>
&lt;li>FAST!!!&lt;/li>
&lt;li>High recall&lt;/li>
&lt;li>Can tolerate a relatively high number of false positives&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>2 main categories&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Grouping methods&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Generate proposals based on hierarchically grouping meaningful image regions&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Often better localization&lt;/p>
&lt;/li>
&lt;li>
&lt;p>E.g. Selective search&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-07-16%2022.54.32-20210719164944153.png" alt="截屏2021-07-16 22.54.32">&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Window scoring methods&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Generate a large amount of windows&lt;/li>
&lt;li>Use a quickly computed cue to discard unlikely windows (“objectness” measure)&lt;/li>
&lt;li>Often faster&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
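&lt;p>As a rough illustration of the window-scoring idea (not any specific published method), the following sketch slides fixed-size windows over an image and ranks them with a cheap, made-up &amp;ldquo;objectness&amp;rdquo; cue, here simply the summed gradient magnitude inside the window:&lt;/p>

```python
import numpy as np

def score_windows(image, win=32, stride=16, top_k=5):
    """Toy window-scoring proposal method: slide fixed-size windows
    over the image and rank them by a cheap 'objectness' cue
    (here the summed gradient magnitude inside each window)."""
    gy, gx = np.gradient(image.astype(float))
    mag = np.abs(gx) + np.abs(gy)
    proposals = []
    h, w = image.shape
    for y in range(0, h - win + 1, stride):
        for x in range(0, w - win + 1, stride):
            score = mag[y:y + win, x:x + win].sum()
            proposals.append((score, (x, y, win, win)))
    proposals.sort(key=lambda p: p[0], reverse=True)
    return [box for _, box in proposals[:top_k]]  # highest-scoring windows first

# Usage: a blank image with one textured square; the top proposal
# should land on the textured region.
np.random.seed(0)
img = np.zeros((96, 96))
img[40:72, 40:72] = np.random.rand(32, 32)
print(score_windows(img)[0])
```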
&lt;div class="flex px-4 py-3 mb-6 rounded-md bg-primary-100 dark:bg-primary-900">
&lt;span class="pr-3 pt-1 text-primary-600 dark:text-primary-300">
&lt;svg height="24" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24">&lt;path fill="none" stroke="currentColor" stroke-linecap="round" stroke-linejoin="round" stroke-width="1.5" d="m11.25 11.25l.041-.02a.75.75 0 0 1 1.063.852l-.708 2.836a.75.75 0 0 0 1.063.853l.041-.021M21 12a9 9 0 1 1-18 0a9 9 0 0 1 18 0m-9-3.75h.008v.008H12z"/>&lt;/svg>
&lt;/span>
&lt;span class="dark:text-neutral-300">For more details and comparison, see: &lt;a href="https://haobin-tan.netlify.app/docs/ai/computer-vision/object-detection/overview-region-based-detectors/">Overview of Region-based Object Detectors&lt;/a>&lt;/span>
&lt;/div>
&lt;h2 id="r-cnn-1">R-CNN &lt;sup id="fnref:1">&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref">1&lt;/a>&lt;/sup>&lt;/h2>
&lt;h3 id="idea-and-structure">Idea and structure&lt;/h3>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-07-19%2016.46.22.png" alt="截屏2021-07-19 16.46.22">&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-07-19%2016.50.11.png" alt="截屏2021-07-19 16.50.11">&lt;/p>
&lt;h3 id="training">Training&lt;/h3>
&lt;ol>
&lt;li>
&lt;p>Train AlexNet on ImageNet (1000 classes)&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-07-19%2016.53.30.png" alt="截屏2021-07-19 16.53.30">&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Re-initialize the last layers to a different dimension (depending on the number of classes of the new classifier) and train the new model&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-07-19%2016.54.29.png" alt="截屏2021-07-19 16.54.29">&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Train a classifier&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Binary&lt;/strong> SVMs (e.g. is human? yes/no) for each object class $\rightarrow$ $C$ SVMs in our case&lt;/p>
&lt;/li>
&lt;li>
&lt;p>The outputs of pool5 of the retrained AlexNet are used as features&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-07-19%2016.58.56.png" alt="截屏2021-07-19 16.58.56">&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Improve the region proposals&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Use a regression model to improve the estimated location of the object&lt;/p>
&lt;ul>
&lt;li>Input: features of proposed region (pool5)&lt;/li>
&lt;li>Output: x, y, width, height of the estimated region&lt;/li>
&lt;/ul>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-07-19%2016.59.53.png" alt="截屏2021-07-19 16.59.53">&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol>
&lt;h3 id="downsides">Downsides&lt;/h3>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Speed&lt;/strong>: Need to forward-pass &lt;strong>EACH&lt;/strong> region proposal through entire CNN!!!&lt;/p>
&lt;/li>
&lt;li>
&lt;p>SVM &amp;amp; BBox regressor are trained after CNN is fixed&lt;/p>
&lt;ul>
&lt;li>No simultaneous update/adaptation of CNN features possible&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Complexity: multi-stage approach&lt;/p>
&lt;/li>
&lt;/ol>
&lt;p>Improvement:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>For 1: Can we make (part of) the CNN run only once for all proposals?&lt;/p>
&lt;/li>
&lt;li>
&lt;p>For 2&amp;amp;3: Can we make the CNN perform these steps?&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h2 id="fast-r-cnn-2">Fast R-CNN &lt;sup id="fnref:2">&lt;a href="#fn:2" class="footnote-ref" role="doc-noteref">2&lt;/a>&lt;/sup>&lt;/h2>
&lt;h3 id="overview">Overview&lt;/h3>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-07-19%2021.02.20.png" alt="截屏2021-07-19 21.02.20">&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-07-19%2021.03.23.png" alt="截屏2021-07-19 21.03.23" style="zoom:80%;" />
&lt;h3 id="roi-pooling">ROI pooling&lt;/h3>
&lt;ul>
&lt;li>Conv layers don’t care about input size, FC layers do&lt;/li>
&lt;li>&lt;strong>ROI pooling&lt;/strong>: warp the variable-size ROIs into a predefined fixed-size shape.&lt;/li>
&lt;/ul>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-07-19%2021.14.14.png" alt="截屏2021-07-19 21.14.14" style="zoom:67%;" />
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/1*5V5mycIRNu-mK-rPywL57w-20210719211433781.gif" alt="Image for post" style="zoom:67%;" />
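&lt;p>A minimal NumPy sketch of the ROI max-pooling operation (single channel, integer bin edges; real implementations pool every channel and handle sub-pixel ROI boundaries):&lt;/p>

```python
import numpy as np

def roi_pool(feature_map, roi, out_size=2):
    """Minimal ROI max-pooling sketch: split a variable-size ROI of a
    2D feature map into an out_size x out_size grid of bins and take
    the max of each bin, yielding a fixed-size output for the FC layers."""
    x0, y0, x1, y1 = roi  # ROI corners in feature-map coordinates
    region = feature_map[y0:y1, x0:x1]
    h, w = region.shape
    # Bin edges (bins become uneven when the ROI size is not divisible)
    ys = np.linspace(0, h, out_size + 1).astype(int)
    xs = np.linspace(0, w, out_size + 1).astype(int)
    out = np.empty((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            out[i, j] = region[ys[i]:ys[i + 1], xs[j]:xs[j + 1]].max()
    return out

fmap = np.arange(64, dtype=float).reshape(8, 8)
print(roi_pool(fmap, (0, 0, 5, 7)))  # 7x5 ROI -> fixed 2x2 output
```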
&lt;h3 id="end-to-end-training">End-to-end training&lt;/h3>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-07-19%2021.16.14.png" alt="截屏2021-07-19 21.16.14" style="zoom:70%;" />
&lt;ul>
&lt;li>Instead of SVM &amp;amp; Regressor just add corresponding losses and train the system for both (multitask)&lt;/li>
&lt;li>Gradients can backprop. into feature layers through ROI pooling layers (just as with normal maxpool layers)&lt;/li>
&lt;li>End-to-end brings slight improvement 👏&lt;/li>
&lt;li>Softmax (integrated) loss slightly but consistently outperforms external classifier 👏&lt;/li>
&lt;/ul>
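&lt;p>The multitask objective can be sketched as softmax cross-entropy plus a smooth-L1 box-regression term that is only active for non-background ROIs. The class scores and box values below are illustrative numbers, not real network outputs:&lt;/p>

```python
import numpy as np

def smooth_l1(x):
    """Smooth L1 (Huber-like) loss used for box regression in Fast R-CNN."""
    x = np.abs(x)
    return np.where(x < 1, 0.5 * x ** 2, x - 0.5).sum()

def multitask_loss(class_scores, true_class, box_pred, box_target, lam=1.0):
    """Sketch of the Fast R-CNN multitask loss for one ROI:
    softmax cross-entropy over classes + smooth-L1 box regression,
    where the regression term only counts for non-background ROIs."""
    # Softmax cross-entropy for the classification head
    scores = class_scores - class_scores.max()          # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum())
    cls_loss = -log_probs[true_class]
    # Box regression is skipped if the ROI is background (class 0)
    loc_loss = smooth_l1(box_pred - box_target) if true_class > 0 else 0.0
    return cls_loss + lam * loc_loss

scores = np.array([0.2, 2.0, 0.1])                      # 3 classes, 0 = background
loss = multitask_loss(scores, 1, np.array([0.1, 0.2, 0.0, 0.0]), np.zeros(4))
print(loss)
```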
&lt;h3 id="fast-r-cnn-vs-r-cnn">&lt;strong>Fast R-CNN vs R-CNN&lt;/strong>&lt;/h3>
&lt;p>Speed:&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-07-19%2021.21.53.png" alt="截屏2021-07-19 21.21.53" style="zoom:67%;" />
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-07-19%2021.23.51.png" alt="截屏2021-07-19 21.23.51" style="zoom:67%;" />
&lt;h3 id="downsides-1">Downsides&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>The majority of runtime is spent on computing region proposals&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Model is also not fully end-to-end: proposals come from “outside”&lt;/p>
&lt;p>(Can we include them in the CNN as well? 🤔)&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h2 id="faster-r-cnn3">Faster R-CNN&lt;sup id="fnref:3">&lt;a href="#fn:3" class="footnote-ref" role="doc-noteref">3&lt;/a>&lt;/sup>&lt;/h2>
&lt;h3 id="overview-1">Overview&lt;/h3>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/1*JQfhkHK6V8NRuh-97Pg4lQ-20210719212516440.png" alt="Image for post" style="zoom:67%;" />
&lt;h3 id="region-proposal-network-rpn">&lt;strong>Region Proposal Network (RPN)&lt;/strong>&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>Input: Feature map from larger conv network of size $C \times W \times H$&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Output&lt;/p>
&lt;ul>
&lt;li>List of $p$ proposals&lt;/li>
&lt;li>&amp;ldquo;Objectness&amp;rdquo; score of size $p \times 6$
&lt;ul>
&lt;li>$p \times 4$ coordinates (top-left and bottom-right $(x,y)$ coordinates) for bounding box&lt;/li>
&lt;li>$p \times 2$ for objectness (with vs. without object) per location&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>General approach:&lt;/p>
&lt;ul>
&lt;li>Take a mini net (RPN) and slide it over the feature map (stepsize 1)&lt;/li>
&lt;li>At each position evaluate $k$ different window sizes for objectness&lt;/li>
&lt;li>Results in approx. $W \times H \times k$ windows/proposals&lt;/li>
&lt;/ul>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/1*PszFnq3rqa_CAhBrI94Eeg-20210719214347181.png" alt="Image for post">&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Fully convolutional network&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Anchors&lt;/strong>: tackle the scale problem of the feature map&lt;/p>
&lt;ul>
&lt;li>Initial reference boxes consisting of aspect ratio and scale, centered at sliding window&lt;/li>
&lt;li>3 scales and 3 aspect ratios = 9 anchors&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Layers&lt;/p>
&lt;ul>
&lt;li>reg layer: regression of the reference anchor&lt;/li>
&lt;li>cls layer: object/no object score&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
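&lt;p>The 9 reference anchors per sliding-window position can be generated as follows (a sketch using the scales and aspect ratios from the Faster R-CNN paper; each anchor keeps an area of roughly scale², while the ratio redistributes it between width and height):&lt;/p>

```python
import numpy as np

def make_anchors(center, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Generate the 9 Faster R-CNN reference anchors (3 scales x 3 aspect
    ratios) centered at one sliding-window position. Each anchor preserves
    an area of scale**2; the ratio trades width against height."""
    cx, cy = center
    anchors = []
    for s in scales:
        for r in ratios:
            w = s * np.sqrt(1.0 / r)   # wider for small ratios
            h = s * np.sqrt(r)         # taller for large ratios
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors

print(len(make_anchors((0, 0))))  # 9 anchors per position
```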
&lt;h4 id="loss">Loss&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>Need a label for each anchor to train the objectness classification&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Labelling anchors&lt;/p>
&lt;ul>
&lt;li>Positive: highest IoU with groundtruth &lt;em>or&lt;/em> IoU &amp;gt; 0.7 (can be more than one)
&lt;ul>
&lt;li>Also store the association between anchor and groundtruth box&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Negative: others, if their IoU &amp;lt; 0.3&lt;/li>
&lt;li>Other anchors do not contribute to training&lt;/li>
&lt;/ul>
&lt;p>$\rightarrow$ Convert to classification problem&lt;/p>
&lt;/li>
&lt;li>
&lt;p>RPN multitask loss:&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-07-23%2023.39.45.png" alt="截屏2021-07-23 23.39.45">&lt;/p>
&lt;ul>
&lt;li>$N\_{cls}$: Batch size (256)&lt;/li>
&lt;li>$N\_{reg}$: number of window positions ($\approx$ 2400)&lt;/li>
&lt;li>$\lambda = 10$&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
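&lt;p>The anchor-labelling rule above boils down to an IoU computation. A plain-Python sketch (simplified: the &amp;ldquo;highest IoU per ground-truth box&amp;rdquo; clause is omitted for brevity):&lt;/p>

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x0, y0, x1, y1)."""
    ix0, iy0 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix1, iy1 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def label_anchor(anchor, gt_boxes, pos_thr=0.7, neg_thr=0.3):
    """Label one anchor for RPN training: positive (1) if its IoU with
    some ground-truth box exceeds pos_thr, negative (0) if all IoUs
    fall below neg_thr, otherwise ignored (None)."""
    best = max(iou(anchor, gt) for gt in gt_boxes)
    if best > pos_thr:
        return 1
    if best < neg_thr:
        return 0
    return None

gt = [(10, 10, 50, 50)]
print(label_anchor((12, 12, 52, 52), gt))     # heavy overlap -> positive
print(label_anchor((100, 100, 140, 140), gt)) # no overlap -> negative
```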
&lt;h3 id="training-1">Training&lt;/h3>
&lt;h4 id="as-in-paper">As in paper&lt;/h4>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-07-19%2021.54.06.png" alt="截屏2021-07-19 21.54.06" style="zoom:67%;" />
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-07-19%2021.54.33.png" alt="截屏2021-07-19 21.54.33" style="zoom:67%;" />
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-07-19%2021.54.59.png" alt="截屏2021-07-19 21.54.59" style="zoom:67%;" />
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-07-19%2021.55.21.png" alt="截屏2021-07-19 21.55.21" style="zoom:67%;" />
&lt;h3 id="jointly">Jointly&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>Train everything in one go&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Combination of four losses&lt;/p>
&lt;ul>
&lt;li>objectness classification&lt;/li>
&lt;li>anchor regression&lt;/li>
&lt;li>object class classification&lt;/li>
&lt;li>detection regression&lt;/li>
&lt;/ul>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/fast_rcnn_loss.png" alt="fast_rcnn_loss" style="zoom: 67%;" />
&lt;/li>
&lt;/ul>
&lt;blockquote>
&lt;p>Why two regression losses?&lt;/p>
&lt;p>Anchor regression directly impacts the features used for detection; detection regression merely improves the final localization.&lt;/p>
&lt;/blockquote>
&lt;h3 id="comparison-between-all-the-r-cnns">Comparison between all the R-CNNs&lt;/h3>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-07-19%2021.58.56.png" alt="截屏2021-07-19 21.58.56" style="zoom:80%;" />
&lt;h2 id="ssd-detector-4">SSD Detector &lt;sup id="fnref:4">&lt;a href="#fn:4" class="footnote-ref" role="doc-noteref">4&lt;/a>&lt;/sup>&lt;/h2>
&lt;h3 id="motivation">Motivation&lt;/h3>
&lt;p>Thus far, deep multiclass detectors rely on variants of three steps:&lt;/p>
&lt;ul>
&lt;li>generate bounding boxes (proposals)&lt;/li>
&lt;li>resample pixels/features in boxes to uniform size&lt;/li>
&lt;li>apply high quality classifier&lt;/li>
&lt;/ul>
&lt;p>Can we avoid / speed up any of those steps to increase overall speed?&lt;/p>
&lt;h3 id="overview-2">Overview&lt;/h3>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-07-19%2023.05.14.png" alt="截屏2021-07-19 23.05.14">&lt;/p>
&lt;ul>
&lt;li>💡&lt;strong>Core Idea: Use a set of fixed default boxes at each position in a feature map (similar to anchors)&lt;/strong>&lt;/li>
&lt;li>Classify object class and box regression for each default box&lt;/li>
&lt;li>Apply boxes at different layers in the ConvNet
&lt;ul>
&lt;li>Use layers of different sizes&lt;/li>
&lt;li>Avoids the need for rescaling&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="structure">Structure&lt;/h3>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-07-19%2023.05.56.png" alt="截屏2021-07-19 23.05.56">&lt;/p>
&lt;ul>
&lt;li>Detectors at various stages with varying numbers of default boxes&lt;/li>
&lt;li>Resulting number of detections is fixed&lt;/li>
&lt;li>Reduced by non-maximum suppression&lt;/li>
&lt;/ul>
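&lt;p>Non-maximum suppression, which reduces the fixed set of detections to the final ones, can be sketched as a greedy procedure:&lt;/p>

```python
def nms(boxes, scores, iou_thr=0.5):
    """Greedy non-maximum suppression: repeatedly keep the
    highest-scoring box and drop all remaining boxes that overlap it
    by more than iou_thr (boxes given as (x0, y0, x1, y1))."""
    def iou(a, b):
        ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
        ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
        area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
        return inter / (area(a) + area(b) - inter)

    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_thr]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # the overlapping duplicate of box 0 is suppressed
```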
&lt;div class="footnotes" role="doc-endnotes">
&lt;hr>
&lt;ol>
&lt;li id="fn:1">
&lt;p>Girshick, R., Donahue, J., Darrell, T., &amp;amp; Malik, J. (2014). Rich feature hierarchies for accurate object detection and semantic segmentation. &lt;em>Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition&lt;/em>, 580–587. &lt;a href="https://doi.org/10.1109/CVPR.2014.81">https://doi.org/10.1109/CVPR.2014.81&lt;/a>&amp;#160;&lt;a href="#fnref:1" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:2">
&lt;p>Girshick, R. (2015). Fast R-CNN. &lt;em>Proceedings of the IEEE International Conference on Computer Vision&lt;/em>, &lt;em>2015 International Conference on Computer Vision&lt;/em>, &lt;em>ICCV 2015&lt;/em>, 1440–1448. &lt;a href="https://doi.org/10.1109/ICCV.2015.169">https://doi.org/10.1109/ICCV.2015.169&lt;/a>&amp;#160;&lt;a href="#fnref:2" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:3">
&lt;p>Ren, S., He, K., Girshick, R., &amp;amp; Sun, J. (2017). Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. &lt;em>IEEE Transactions on Pattern Analysis and Machine Intelligence&lt;/em>, &lt;em>39&lt;/em>(6), 1137–1149. &lt;a href="https://doi.org/10.1109/TPAMI.2016.2577031">https://doi.org/10.1109/TPAMI.2016.2577031&lt;/a>&amp;#160;&lt;a href="#fnref:3" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:4">
&lt;p>Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu: “SSD: Single Shot MultiBox Detector”, 2016; &lt;a href="http://arxiv.org/abs/1512.02325">arXiv:1512.02325&lt;/a>.&amp;#160;&lt;a href="#fnref:4" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;/ol>
&lt;/div></description></item><item><title>Tracking</title><link>https://haobin-tan.netlify.app/docs/ai/computer-vision/cv-lecture/13-tracking/</link><pubDate>Mon, 19 Jul 2021 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/computer-vision/cv-lecture/13-tracking/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;h3 id="tracking-vs-detection">Tracking Vs. Detection&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Detection&lt;/strong>: Find an object in a &lt;strong>single image&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Face, person, body part, facial landmarks, &amp;hellip;&lt;/li>
&lt;li>No assumption about dynamics, temporal consistency made&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Tracking&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>determine a target&amp;rsquo;s locations (and/or rotation, deformation, pose, &amp;hellip;) &lt;strong>over a sequence of images&lt;/strong>&lt;/p>
&lt;p>i.e.: determine the target&amp;rsquo;s &lt;strong>state&lt;/strong> (location and/or rotation, deformation, pose, &amp;hellip;) &lt;strong>over a sequence&lt;/strong> of &lt;strong>observations&lt;/strong> derived from images&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Provides object positions (etc.) in each frame&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="motivation">Motivation&lt;/h3>
&lt;ul>
&lt;li>Use more than one image to analyse the scene&lt;/li>
&lt;li>Use a-priori knowledge to improve analysis
&lt;ul>
&lt;li>system dynamics, imaging / measurement process, &amp;hellip;&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="target-types">Target types&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Single objects&lt;/strong>: face, person, &amp;hellip;&lt;/li>
&lt;li>&lt;strong>Multiple objects&lt;/strong>: group of people, head and hands, &amp;hellip;&lt;/li>
&lt;li>&lt;strong>Articulated body&lt;/strong>: full body, hand&lt;/li>
&lt;/ul>
&lt;h3 id="sensor-setup">Sensor setup&lt;/h3>
&lt;ul>
&lt;li>Single camera&lt;/li>
&lt;li>Multiple cameras&lt;/li>
&lt;li>Active cameras&lt;/li>
&lt;li>Cameras + microphones&lt;/li>
&lt;/ul>
&lt;h3 id="observations-used-for-tracking">observations used for tracking&lt;/h3>
&lt;ul>
&lt;li>Templates&lt;/li>
&lt;li>Color&lt;/li>
&lt;li>Foreground-Background segmentation&lt;/li>
&lt;li>Edges&lt;/li>
&lt;li>Dense Disparity&lt;/li>
&lt;li>Optical flow&lt;/li>
&lt;li>Detectors (body, body parts)&lt;/li>
&lt;/ul>
&lt;h2 id="tracking-as-state-estimation">&lt;strong>Tracking as State Estimation&lt;/strong>&lt;/h2>
&lt;ul>
&lt;li>Want to predict state of the system (position, pose, &amp;hellip;)
&lt;ul>
&lt;li>But state cannot directly be measured&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Only certain observations (measurements) can be made
&lt;ul>
&lt;li>But observations are noisy! (due to measurement errors)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;p>What is the most likely state $x$ of the system at a given time, given a sequence of observations $Z\_t$ ?
&lt;/p>
$$
\arg \max \_{x\_{t}} p\left(x\_{t} \mid Z\_{t}\right)
$$
&lt;ul>
&lt;li>
&lt;p>$x\_t$: state of the system at time $t$&lt;/p>
&lt;/li>
&lt;li>
&lt;p>$z\_t$: Observation / measurement about certain aspects of the system at time $t$&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Observations up to time $t$: $z\_{1:t}$ or $Z\_t$&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h3 id="bayes-filter">Bayes Filter&lt;/h3>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-07-19%2023.53.57.png" alt="截屏2021-07-19 23.53.57">&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Assume state $x$ to be Markov process
&lt;/p>
$$
p\left(x\_{t} \mid x\_{t-1}, x\_{t-2}, . ., x\_{0}\right)=p\left(x\_{t} \mid x\_{t-1}\right)
$$
&lt;/li>
&lt;li>
&lt;p>States $x$ generate observations $z$
&lt;/p>
$$
p\left(z\_{t} \mid x\_{t}, x\_{t-1}, . ., x\_{0}\right)=p\left(z\_{t} \mid x\_{t}\right)
$$
&lt;/li>
&lt;li>
&lt;p>Want to estimate most likely state $x\_t$ given sequence $Z\_t$:
&lt;/p>
$$
\arg \max \_{x\_{t}} p\left(x\_{t} \mid Z\_{t}\right)
$$
&lt;/li>
&lt;li>
&lt;p>Can be estimated &lt;strong>recursively&lt;/strong>&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-07-20%2010.01.52.png" alt="截屏2021-07-20 10.01.52">&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Need:&lt;/p>
&lt;ul>
&lt;li>Process model: $p(x\_t | x\_{t-1})$&lt;/li>
&lt;li>Measurement model: $p(z\_t | x\_t)$&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
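&lt;p>On a discrete state space, the recursive estimate can be sketched in a few lines; the transition matrix and likelihood below are made-up toy numbers:&lt;/p>

```python
import numpy as np

def bayes_filter_step(belief, transition, likelihood):
    """One step of the recursive Bayes filter on a discrete state space.
    belief[j]        = p(x_{t-1} = j | Z_{t-1})
    transition[i, j] = process model p(x_t = i | x_{t-1} = j)
    likelihood[i]    = measurement model p(z_t | x_t = i)
    Returns p(x_t | Z_t)."""
    predicted = transition @ belief          # prediction with the process model
    posterior = likelihood * predicted       # correction with the new observation
    return posterior / posterior.sum()       # normalize

# Toy example: 3 positions, the target tends to move one cell to the right
belief = np.array([1.0, 0.0, 0.0])
transition = np.array([[0.2, 0.0, 0.0],
                       [0.8, 0.2, 0.0],
                       [0.0, 0.8, 1.0]])
likelihood = np.array([0.1, 0.8, 0.1])       # sensor strongly suggests cell 1
print(bayes_filter_step(belief, transition, likelihood))
```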
&lt;blockquote>
&lt;p>Helpful resource:&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://www.youtube.com/watch?v=qDvd5lu80bA&amp;amp;ab_channel=Udacity">Bayes Filters&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://zhuanlan.zhihu.com/p/37028239">概率机器人——贝叶斯滤波&lt;/a>&lt;/li>
&lt;/ul>
&lt;/blockquote>
&lt;h3 id="kalman-filter">Kalman filter&lt;/h3>
&lt;ul>
&lt;li>An instance of a Bayes filter&lt;/li>
&lt;li>Assumes
&lt;ul>
&lt;li>&lt;em>Linear&lt;/em> state propagation and measurement model&lt;/li>
&lt;li>&lt;em>Gaussian&lt;/em> process and measurement noise&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;p>The process to be estimated:
&lt;/p>
$$
\begin{array}{ll}
x\_{k}=A x\_{k-1}+w\_{k-1} &amp; \quad p(w) \sim N(0, Q) \\\\
z\_{k}=H x\_{k}+v\_{k} &amp; \quad p(v) \sim N(0, R)
\end{array}
$$
&lt;ul>
&lt;li>$x\_k$: state at time $k$&lt;/li>
&lt;li>$A$: transition matrix&lt;/li>
&lt;li>$z\_k$: observation at time $k$&lt;/li>
&lt;li>$H$: measurement matrix&lt;/li>
&lt;li>$p(w) \sim N(0, Q)$: process noise&lt;/li>
&lt;li>$p(v) \sim N(0, R)$: measurement noise&lt;/li>
&lt;/ul>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-07-20%2010.16.25.png" alt="截屏2021-07-20 10.16.25">&lt;/p>
&lt;p>Note:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>The simple Kalman Filter is NOT applicable when the process to be estimated is NOT linear or the measurement relationship to the process is NOT linear.&lt;/p>
&lt;p>$\rightarrow$ The &lt;strong>Extended Kalman Filter (EKF)&lt;/strong> linearizes about the current mean and covariance&lt;/p>
&lt;/li>
&lt;/ul>
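&lt;p>A minimal NumPy sketch of one Kalman predict/correct cycle in the notation above, applied to a toy constant-velocity tracking model (all matrices below are illustrative choices):&lt;/p>

```python
import numpy as np

def kalman_step(x, P, z, A, H, Q, R):
    """One predict/correct cycle of the (linear) Kalman filter for the
    process model x_k = A x_{k-1} + w, z_k = H x_k + v with
    w ~ N(0, Q), v ~ N(0, R). x is the state estimate, P its covariance."""
    # Predict
    x_pred = A @ x
    P_pred = A @ P @ A.T + Q
    # Correct
    S = H @ P_pred @ H.T + R                     # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)          # Kalman gain
    x_new = x_pred + K @ (z - H @ x_pred)
    P_new = (np.eye(len(x)) - K @ H) @ P_pred
    return x_new, P_new

# Constant-velocity model: state = (position, velocity), measure position only
A = np.array([[1.0, 1.0], [0.0, 1.0]])
H = np.array([[1.0, 0.0]])
Q = 0.01 * np.eye(2)
R = np.array([[0.5]])
x, P = np.zeros(2), np.eye(2)
for z in [1.0, 2.0, 3.0, 4.0]:                   # noise-free track moving at speed 1
    x, P = kalman_step(x, P, np.array([z]), A, H, Q, R)
print(x)  # estimate approaches position 4, velocity 1
```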
&lt;h3 id="paticle-filter">Paticle Filter&lt;/h3>
&lt;blockquote>
&lt;p>Helpful resources:&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://www.youtube.com/watch?v=_LjBba2hnfk&amp;amp;ab_channel=CyrillStachniss">Particle Filters Basic Idea&lt;/a>&lt;/li>
&lt;/ul>
&lt;/blockquote>
&lt;ul>
&lt;li>The Kalman Filter often fails when the measurement density is &lt;em>multimodal / non-Gaussian.&lt;/em>&lt;/li>
&lt;li>A &lt;strong>Particle Filter&lt;/strong> represents and propagates arbitrary probability distributions. They are represented by a &lt;em>set of weighted samples&lt;/em>.
&lt;ul>
&lt;li>The Particle Filtering is a &lt;em>numerical&lt;/em> technique (unlike the Kalman filter which is analytical).&lt;/li>
&lt;li>Like a Kalman Filter, a Particle Filter incorporates a &lt;em>dynamic model&lt;/em> describing system dynamics&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h4 id="bayesian-tracking">Bayesian Tracking&lt;/h4>
&lt;p>Bayes rule applied to tracking
&lt;/p>
$$
\arg \max \_{x\_{t}} p\left(x\_{t} \mid Z\_{t}\right)=\arg \max \_{x\_{t}} p\left(z\_{t} \mid x\_{t}\right) p\left(x\_{t} \mid Z\_{t-1}\right)
$$
$$
p\left(x\_{t} \mid Z\_{t-1}\right)=\int\_{x\_{t-1}} p\left(x\_{t} \mid x\_{t-1}\right) p\left(x\_{t-1} \mid Z\_{t-1}\right) d x\_{t-1}
$$
&lt;p>Simplifying assumption (Markov):
&lt;/p>
$$
p\left(x\_{t} \mid X\_{t-1}\right)=p\left(x\_{t} \mid x\_{t-1}\right)
$$
&lt;p>
where&lt;/p>
&lt;ul>
&lt;li>$x\_t$: state at time $t$&lt;/li>
&lt;li>$z\_t$: observation at time $t$&lt;/li>
&lt;li>$X\_t$: history of states up to the time $t$&lt;/li>
&lt;li>$Z\_t$: history of observations up to $t$&lt;/li>
&lt;/ul>
&lt;h4 id="observation-and-motion-model">Observation and Motion Model&lt;/h4>
&lt;ul>
&lt;li>$p(z\_t | x\_t)$: The likelihood that $z\_t$ is observed, given that the true state of the system is represented by $x\_t$&lt;/li>
&lt;li>$p(x\_{t} | x\_{t-1})$: The likelihood that the state of the system is $x\_t$ when the previous state was $x\_{t-1}$&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Factored Sampling&lt;/strong>&lt;/p>
&lt;p>Probability density function is represented by weighted samples (&amp;ldquo;particles&amp;rdquo;)&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-07-20%2016.05.42.png" alt="截屏2021-07-20 16.05.42">&lt;/p>
&lt;h4 id="particle-filter-pf">&lt;strong>Particle Filter (PF)&lt;/strong>&lt;/h4>
&lt;p>For a PF tracker, you need&lt;/p>
&lt;ul>
&lt;li>
&lt;p>a set of $N$ weighted samples (particles) at time $k$
&lt;/p>
$$
\left\\{\left(s\_{k}^{(i)}, \pi\_{k}^{(i)}\right) \mid i=1 \dots N\right\\}
$$
&lt;/li>
&lt;li>
&lt;p>the motion model
&lt;/p>
$$
s\_{k}^{(i)} \leftarrow s\_{k-1}^{(i)}
$$
&lt;/li>
&lt;li>
&lt;p>the observation model
&lt;/p>
$$
\pi\_{k}^{(i)} \leftarrow s\_{k}^{(i)}
$$
&lt;/li>
&lt;/ul>
&lt;h4 id="the-condensation-algorithm">&lt;strong>The Condensation Algorithm&lt;/strong>&lt;/h4>
&lt;p>A popular instance of a particle filter in Computer Vision&lt;/p>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Select&lt;/strong>&lt;/p>
&lt;p>Randomly select $N$ new samples $S\_{k}^{(i)}$ from the old sample set $S\_{k-1}^{(i)}$ according to their weights $\pi\_{k-1}^{(i)}$&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Predict&lt;/strong>&lt;/p>
&lt;p>Propagate the samples using the motion model&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Measure&lt;/strong>&lt;/p>
&lt;p>Calculate weights for the new samples using the observation model
&lt;/p>
$$
\pi\_{k}^{(i)}=p\left(z\_{k} \mid x\_{k}=s\_{k}^{(i)}\right)
$$
&lt;/li>
&lt;/ol>
&lt;p>Illustration:&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-07-20%2016.16.46.png" alt="截屏2021-07-20 16.16.46">&lt;/p>
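&lt;p>The three Condensation steps map almost directly to code. A toy 1D sketch (the motion and observation models below are made up for illustration):&lt;/p>

```python
import math
import random

def condensation_step(particles, weights, motion, observation_model):
    """One Select-Predict-Measure cycle of the Condensation algorithm.
    particles:            list of states s_{k-1}^(i)
    weights:              their weights pi_{k-1}^(i) (normalized)
    motion(s):            samples a new state from p(x_k | x_{k-1} = s)
    observation_model(s): evaluates p(z_k | x_k = s) for the current frame"""
    n = len(particles)
    # 1. Select: resample N particles according to the old weights
    selected = random.choices(particles, weights=weights, k=n)
    # 2. Predict: propagate each sample with the motion model
    predicted = [motion(s) for s in selected]
    # 3. Measure: reweight with the observation likelihood, then normalize
    new_weights = [observation_model(s) for s in predicted]
    total = sum(new_weights)
    return predicted, [w / total for w in new_weights]

# Toy 1D example: the (hypothetical) target sits at position 5
random.seed(0)
motion = lambda s: s + random.gauss(0, 0.5)          # random-walk dynamics
obs = lambda s: math.exp(-0.5 * (s - 5.0) ** 2)      # likelihood peaks at 5
particles = [random.uniform(0, 10) for _ in range(200)]
weights = [1.0 / 200] * 200
for _ in range(10):
    particles, weights = condensation_step(particles, weights, motion, obs)
best = particles[max(range(200), key=lambda i: weights[i])]
print(best)  # the strongest particle lies near the target
```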
&lt;p>How to get the target position?&lt;/p>
&lt;ul>
&lt;li>Cluster the particle set and search for the highest mode&lt;/li>
&lt;li>Just take the strongest particle&lt;/li>
&lt;/ul>
&lt;p>How many particles are needed?&lt;/p>
&lt;ul>
&lt;li>Depends strongly on the dimension of the state space!&lt;/li>
&lt;li>Tracking 1 object in the image plane typically requires 50-500 particles&lt;/li>
&lt;/ul>
&lt;h4 id="problem">Problem&lt;/h4>
&lt;p>&lt;strong>The Dimensionality Problem&lt;/strong>&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-07-20%2016.18.25.png" alt="截屏2021-07-20 16.18.25" style="zoom:67%;" />
&lt;h2 id="examples">Examples&lt;/h2>
&lt;h3 id="tracking-one-face-with-a-particle-filter">Tracking one Face with a Particle Filter&lt;/h3>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-07-20%2016.28.40.png" alt="截屏2021-07-20 16.28.40" style="zoom:67%;" />
&lt;ul>
&lt;li>
&lt;p>State: ($x$, $y$, scale)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Observations: skin color&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Procedure:&lt;/p>
&lt;ol>
&lt;li>
&lt;p>Select and predict samples&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Measurement step&lt;/p>
&lt;ul>
&lt;li>
&lt;p>For each particle&lt;/p>
&lt;ul>
&lt;li>Count supporting skin pixels in box defined by ($x$, $y$, scale)&lt;/li>
&lt;li>Particle weights determined based on skin color support&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Particle with &lt;em>maximum&lt;/em> weight chosen as best solution&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol>
&lt;/li>
&lt;/ul>
&lt;h3 id="tracking-multiple-objects">Tracking multiple objects&lt;/h3>
&lt;p>Two different approaches:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>A dedicated tracker for each of the objects&lt;/strong>
&lt;ul>
&lt;li>Start with one tracker; once an object is tracked, initialize another tracker to search for further objects&lt;/li>
&lt;li>&lt;span style="color:green">Typically fast and well parallelizable&lt;/span>&lt;/li>
&lt;li>&lt;span style="color:red">Optimal global assignment / tracking difficult to find, Information has to be shared across trackers to find a good assignment&lt;/span>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>A single tracker in a joint state space&lt;/strong>
&lt;ul>
&lt;li>&lt;span style="color:green">Easier to find optimal assignment&lt;/span>&lt;/li>
&lt;li>&lt;span style="color:red">Number of objects has to be known in advance&lt;/span>&lt;/li>
&lt;li>&lt;span style="color:red">State space becomes high dimensional (curse of dimensionality)&lt;/span>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="face-and-head-pose-tracking">Face and Head Pose Tracking&lt;/h3>
&lt;ul>
&lt;li>Particle filter: Head-pose estimation integrated in the tracker&lt;/li>
&lt;li>Observation model
&lt;ul>
&lt;li>Use bank of face detectors for different poses&lt;/li>
&lt;li>Update particle weights with score of matching detector, i.e. the detector with closest angle to hypothesis&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Dynamical model: Gaussian noise, no explicit velocity model&lt;/li>
&lt;li>Occlusion handling
&lt;ul>
&lt;li>Set particle weight to zero, if it is too close to another track’s center&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul></description></item><item><title>Tracking 2</title><link>https://haobin-tan.netlify.app/docs/ai/computer-vision/cv-lecture/14-tracking2/</link><pubDate>Tue, 20 Jul 2021 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/computer-vision/cv-lecture/14-tracking2/</guid><description>&lt;h2 id="multi-camera-systems">Multi-Camera Systems&lt;/h2>
&lt;h3 id="type-of-multi-camera-systems">Type of multi-camera systems&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Stereo-camera system&lt;/strong> (narrow baseline)&lt;/p>
&lt;p>​
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-100" >&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%e6%88%aa%e5%b1%8f2021-07-20%2017.13.58-20210720171721568.png" alt="截屏2021-07-20 17.13.58" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;ul>
&lt;li>Close distance and equal orientation&lt;/li>
&lt;li>An object’s appearance is almost the same in both cameras&lt;/li>
&lt;li>Allows for calculation of a dense disparity map&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Wide-baseline multi-camera system&lt;/strong>&lt;/p>
&lt;p>​
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-100" >&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%e6%88%aa%e5%b1%8f2021-07-20%2017.15.59.png" alt="截屏2021-07-20 17.15.59" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Arbitrary distance and orientation, overlapping field of view&lt;/p>
&lt;/li>
&lt;li>
&lt;p>An object’s appearance is different in each of the cameras&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-100" >&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%e6%88%aa%e5%b1%8f2021-07-20%2017.16.15.png" alt="截屏2021-07-20 17.16.15" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Allows for 3D localization of objects in the joint field of view&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Multi-camera network&lt;/strong>&lt;/p>
&lt;p>​ &lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-07-20%2017.16.55.png" alt="截屏2021-07-20 17.16.55" style="zoom:67%;" />&lt;/p>
&lt;ul>
&lt;li>Non-overlapping field of view&lt;/li>
&lt;li>An object’s appearance differs strongly from one camera to another&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="3d-to-2d-projection-pinhole-camera-model">3D to 2D projection: Pinhole Camera Model&lt;/h3>
&lt;p>Summary:&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-100" >&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%e6%88%aa%e5%b1%8f2021-07-24%2018.49.45.png" alt="截屏2021-07-24 18.49.45" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-100" >&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%e6%88%aa%e5%b1%8f2021-07-20%2017.19.45.png" alt="截屏2021-07-20 17.19.45" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
$$
z^{\prime} = -f
$$
$$
\frac{y^{\prime}}{-f}=\frac{y}{z} \Rightarrow y^{\prime}=\frac{-f y}{z}
$$
$$
\frac{x^{\prime}}{-f}=\frac{x}{z} \Rightarrow x^{\prime}=\frac{-f x}{z}
$$
&lt;p>Pixel coordinates $(u, v)$ of the projected points on &lt;strong>image plane&lt;/strong>&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-100" >&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%e6%88%aa%e5%b1%8f2021-07-20%2018.24.21.png" alt="截屏2021-07-20 18.24.21" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
$$
\begin{array}{l}
\boldsymbol{u}=\boldsymbol{k}\_{u} \boldsymbol{x}^{\prime}+\boldsymbol{u}\_{\mathrm{0}} \\\\
\boldsymbol{v}=-\boldsymbol{k}\_{v} \boldsymbol{y}^{\prime}+\boldsymbol{v}\_{\mathrm{0}}
\end{array}
$$
&lt;p>
where $k\_u$ and $k\_v$ are &lt;strong>scaling factors&lt;/strong> which denote the ratio between world and pixel coordinates.&lt;/p>
&lt;p>In matrix formulation:
&lt;/p>
$$
\left(\begin{array}{l}
u \\\\
v
\end{array}\right)=\left(\begin{array}{cc}
k\_{u} &amp; 0 \\\\
0 &amp; -k\_{v}
\end{array}\right)\left(\begin{array}{l}
x^{\prime} \\\\
y^{\prime}
\end{array}\right)+\left(\begin{array}{l}
u\_{0} \\\\
v\_{0}
\end{array}\right)
$$
&lt;p>
&lt;strong>Perspective Projection&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>internal camera parameters
&lt;/p>
$$
\begin{array}{l}
\alpha\_{u}=k\_{u} f \\\\
\alpha\_{v}=-k\_{v} f \\\\
u\_{0} \\\\
v\_{0}
\end{array}
$$
&lt;ul>
&lt;li>have to be known to perform the projection&lt;/li>
&lt;li>they depend on the camera only&lt;/li>
&lt;li>Perform calibration to estimate&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
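The projection chain above (perspective division onto the image plane, then the pixel mapping) can be sketched numerically. The point, focal length, and scaling factors below are illustrative values, not from the lecture:

```python
def project_point(p_cam, f, k_u, k_v, u0, v0):
    """Project a point given in camera coordinates to pixel coordinates
    using the pinhole model with the image plane at z' = -f."""
    x, y, z = p_cam
    # Perspective division onto the image plane
    x_img = -f * x / z
    y_img = -f * y / z
    # Scale to pixels and shift by the principal point (u0, v0)
    u = k_u * x_img + u0
    v = -k_v * y_img + v0
    return u, v

def project_point_alpha(p_cam, alpha_u, alpha_v, u0, v0):
    """Same projection expressed with the internal parameters
    alpha_u = k_u * f and alpha_v = -k_v * f."""
    x, y, z = p_cam
    return (-alpha_u * x / z + u0, -alpha_v * y / z + v0)

# Both formulations agree; e.g. a point (1, 2, 10) with f = 1,
# k_u = k_v = 100, principal point (320, 240) lands near (310, 260).
```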
&lt;h4 id="calibration">Calibration&lt;/h4>
&lt;p>&lt;strong>Intrinsics parameters&lt;/strong>: describe the optical properties of each camera (“the camera model”)&lt;/p>
&lt;ul>
&lt;li>$f$: focal length&lt;/li>
&lt;li>$c\_x, c\_y$: the principal point (&amp;ldquo;optical center&amp;rdquo;), sometimes also denoted as $u\_0, v\_0$&lt;/li>
&lt;li>$K\_1, \dots, K\_n$: distortion parameters (radial and tangential)&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Extrinsic parameters&lt;/strong>: describe the location of each camera with respect to a global coordinate system&lt;/p>
&lt;ul>
&lt;li>$\mathbf{T}$: translation vector&lt;/li>
&lt;li>$\mathbf{R}$: $3 \times 3$ rotation matrix&lt;/li>
&lt;/ul>
&lt;p>Transformation of world coordinate of point $p^* = (x, y, z)$ to camera coordinate $p$:
&lt;/p>
$$
p = \mathbf{R} (x, y, z)^T + \mathbf{T}
$$
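As a minimal sketch, the world-to-camera transform is a single matrix-vector product plus a translation; the rotation and translation values here are illustrative:

```python
import numpy as np

def world_to_camera(p_world, R, T):
    """p = R (x, y, z)^T + T: world coordinates into the camera frame."""
    return R @ np.asarray(p_world, dtype=float) + T

# Illustrative extrinsics: 90-degree rotation about the z-axis, unit shift in x
R = np.array([[0.0, -1.0, 0.0],
              [1.0,  0.0, 0.0],
              [0.0,  0.0, 1.0]])
T = np.array([1.0, 0.0, 0.0])
p = world_to_camera((1.0, 0.0, 0.0), R, T)  # rotated to (0, 1, 0), shifted to (1, 1, 0)
```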
&lt;p>
Calibration steps&lt;/p>
&lt;ol>
&lt;li>For each camera: A calibration target with a known geometry is captured from multiple views&lt;/li>
&lt;li>The corner points are extracted (semi-)automatically&lt;/li>
&lt;li>The locations of the corner points are used to estimate the intrinsics iteratively&lt;/li>
&lt;li>Once the intrinsics are known, a fixed calibration target is captured from all of the cameras to estimate the extrinsics&lt;/li>

&lt;/ol>
&lt;h3 id="triangulation">Triangulation&lt;/h3>
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-100" >&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%e6%88%aa%e5%b1%8f2021-07-20%2019.14.22.png" alt="截屏2021-07-20 19.14.22" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
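In practice the lines of view do not intersect exactly, so the 3D location is taken as the least-squares point closest to all viewing rays. A minimal sketch of this midpoint method (camera centers and ray directions below are illustrative):

```python
import numpy as np

def triangulate_midpoint(centers, directions):
    """Least-squares 3D point closest to a set of camera rays.

    Each ray is a camera center c_i plus a direction d_i; minimizing the
    summed squared distances of X to all rays gives the normal equations
    sum_i (I - d_i d_i^T) X = sum_i (I - d_i d_i^T) c_i.
    """
    A = np.zeros((3, 3))
    b = np.zeros(3)
    for c, d in zip(centers, directions):
        d = d / np.linalg.norm(d)
        P = np.eye(3) - np.outer(d, d)  # projector orthogonal to the ray
        A += P
        b += P @ c
    return np.linalg.solve(A, b)

# Two rays that both pass through (0, 0, 5)
centers = [np.array([-1.0, 0.0, 0.0]), np.array([1.0, 0.0, 0.0])]
dirs = [np.array([1.0, 0.0, 5.0]), np.array([-1.0, 0.0, 5.0])]
X = triangulate_midpoint(centers, dirs)
```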
&lt;ul>
&lt;li>Assumption: the object location is known in multiple views&lt;/li>
&lt;li>Ideally: The intersection of the lines-of-view determines the 3D location&lt;/li>
&lt;li>Practically: least-squares approximation&lt;/li>
&lt;/ul></description></item><item><title>Body Pose</title><link>https://haobin-tan.netlify.app/docs/ai/computer-vision/cv-lecture/15-pose/</link><pubDate>Wed, 07 Jul 2021 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/computer-vision/cv-lecture/15-pose/</guid><description>&lt;h2 id="kinect">Kinect&lt;/h2>
&lt;h3 id="what-is-kinect">What is Kinect?&lt;/h3>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-07-24%2021.16.14.png" alt="截屏2021-07-24 21.16.14" style="zoom:80%;" />
&lt;ul>
&lt;li>Fusion of two groundbreaking new technologies
&lt;ul>
&lt;li>A cheap and fast &lt;strong>RGB-D sensor&lt;/strong>&lt;/li>
&lt;li>A reliable Skeleton Tracking&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="structured-light">Structured light&lt;/h3>
&lt;ul>
&lt;li>Kinect uses Structured Light to simulate a stereo camera system&lt;/li>
&lt;li>Kinect provides a unique texture for every point of the image, therefore only block matching is required&lt;/li>
&lt;/ul>
&lt;h2 id="pose-recognition-for-user-interaction">Pose Recognition for User Interaction&lt;/h2>
&lt;p>A few constraints:&lt;/p>
&lt;ul>
&lt;li>Extremely low latency.&lt;/li>
&lt;li>Low computational power.&lt;/li>
&lt;li>High recognition rate, without false positives.&lt;/li>
&lt;li>No personalized training step.&lt;/li>
&lt;li>Few people at once.&lt;/li>
&lt;li>Complex poses will be common.&lt;/li>
&lt;/ul>
&lt;h2 id="pose-recognition1">Pose Recognition&lt;sup id="fnref:1">&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref">1&lt;/a>&lt;/sup>&lt;/h2>
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-100" >&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%e6%88%aa%e5%b1%8f2021-07-07%2023.38.56.png" alt="截屏2021-07-07 23.38.56" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h3 id="1st-step-pixel-classification">1st step: Pixel classification&lt;/h3>
&lt;p>&lt;strong>Speed&lt;/strong> is the key&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Uses only one disparity image.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>It classifies each pixel independently.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>The process of feature extraction is &lt;em>simultaneous&lt;/em> to the classification.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Simplest possible feature: difference between two pixels.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Classification is done through &lt;strong>Random Decision Forests&lt;/strong>.&lt;/p>
&lt;ul>
&lt;li>Learning
&lt;ul>
&lt;li>Randomly choose a set of thresholds and features for splits.&lt;/li>
&lt;li>Pick the threshold and feature that provide the &lt;em>largest&lt;/em> information gain.&lt;/li>
&lt;li>Recurse until a certain accuracy or maximum depth is reached.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-07-07%2023.41.56.png" alt="截屏2021-07-07 23.41.56" style="zoom:80%;" />
&lt;/li>
&lt;li>
&lt;p>Everything has an optimal GPU implementation.&lt;/p>
&lt;/li>
&lt;/ul>
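The "difference between two pixels" feature can be sketched as follows, in the spirit of the depth-comparison features of Shotton et al.: the probe offsets are scaled by the inverse depth at the pixel so the feature is depth invariant. Returning a large constant for probes that fall outside the image is an assumption of this sketch:

```python
import numpy as np

def depth_difference_feature(depth, px, offset_u, offset_v):
    """Compare the depth at two offsets around pixel px; offsets are
    scaled by 1 / depth(px) for depth invariance."""
    y, x = px
    d = depth[y, x]

    def probe(offset):
        oy = int(round(y + offset[0] / d))
        ox = int(round(x + offset[1] / d))
        h, w = depth.shape
        if 0 <= oy < h and 0 <= ox < w:
            return depth[oy, ox]
        return 1e6  # large constant for probes falling outside the image

    return probe(offset_u) - probe(offset_v)

# On a flat depth map the feature is zero for any pair of offsets
flat = np.full((10, 10), 2.0)
f = depth_difference_feature(flat, (5, 5), (0, 2), (0, -2))  # 0.0
```

The decision forest then thresholds many such features to classify the pixel into a body part.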
&lt;h3 id="training">Training&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>Key: using a huge amount of training data&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Synthetic Training DB&lt;/strong>&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-07-07%2023.45.46.png" alt="截屏2021-07-07 23.45.46" style="zoom: 67%;" />
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-07-07%2023.46.11.png" alt="截屏2021-07-07 23.46.11" style="zoom: 67%;" />
&lt;/li>
&lt;li>
&lt;p>Pixel classification results&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-07-07%2023.47.02.png" alt="截屏2021-07-07 23.47.02" style="zoom:67%;" />
&lt;/li>
&lt;/ul>
&lt;h3 id="joint-estimation">Joint estimation&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>Use mean shift clustering on the pixels with &lt;strong>Gaussian kernel&lt;/strong> to infer the center of clusters.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Clustering is done in 3D space, but every pixel is &lt;em>weighted&lt;/em> by its world surface area to get depth invariance.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Finally the &lt;em>sum&lt;/em> of the weighted pixels is used as a &lt;strong>confidence measure&lt;/strong>.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Results&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-07-07%2023.49.21.png" alt="截屏2021-07-07 23.49.21" style="zoom:67%;" />
&lt;/li>
&lt;/ul>
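The clustering step can be sketched as weighted mean shift with a Gaussian kernel. The bandwidth and data below are illustrative; in the original method the weights are the pixels' world surface areas, and the final sum of kernel weights serves as the confidence measure:

```python
import numpy as np

def mean_shift(points, weights, start, bandwidth=0.1, iters=20):
    """Weighted mean-shift mode seeking with a Gaussian kernel.
    Returns the mode estimate and the final sum of kernel weights,
    which can serve as a confidence measure."""
    points = np.asarray(points, dtype=float)
    m = np.asarray(start, dtype=float)
    for _ in range(iters):
        d2 = np.sum((points - m) ** 2, axis=1)       # squared distances to m
        k = weights * np.exp(-d2 / (2 * bandwidth ** 2))
        m = (k[:, None] * points).sum(axis=0) / k.sum()
    return m, k.sum()

# Three nearly coincident 3D votes: the mode lands at their center
pts = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0], [1.02, 1.0, 1.0]]
mode, conf = mean_shift(pts, np.ones(3), start=[0.9, 1.0, 1.0])
```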
&lt;h3 id="criticism-">Criticism 👎&lt;/h3>
&lt;ul>
&lt;li>Not open source&lt;/li>
&lt;li>Biased towards upper body frontal poses&lt;/li>
&lt;li>Very difficult to improve or adapt.&lt;/li>
&lt;/ul>
&lt;h2 id="pose-estimation-without-kinect-convolutional-pose-machines-2">Pose Estimation without Kinect: Convolutional Pose Machines &lt;sup id="fnref:2">&lt;a href="#fn:2" class="footnote-ref" role="doc-noteref">2&lt;/a>&lt;/sup>&lt;/h2>
&lt;h3 id="pose-machine">Pose Machine&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>Unconstrained 2D-pose estimation on real world RGB images.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Outputs confidence maps for every joint of the skeleton.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Works in multiple stages refining the confidence maps.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>💡 &lt;strong>Idea:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Local image evidence is weak (first stage confidence maps)&lt;/strong>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Part context can be a strong cue (confidence maps of other body joints)&lt;/strong>&lt;/p>
&lt;p>&lt;strong>➔ Use confidence maps of all body joints of the previous stage to refine current results&lt;/strong>&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-100" >&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%e6%88%aa%e5%b1%8f2021-07-08%2000.02.49.png" alt="截屏2021-07-08 00.02.49" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h4 id="example">Example&lt;/h4>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-07-11%2012.38.10.png" alt="截屏2021-07-11 12.38.10" style="zoom:67%;" />
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-07-11%2012.38.27.png" alt="截屏2021-07-11 12.38.27" style="zoom:67%;" />
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-07-11%2012.38.40.png" alt="截屏2021-07-11 12.38.40" style="zoom:67%;" />
&lt;h4 id="details">Details&lt;/h4>
&lt;p>We denote the pixel location of the $p$-th anatomical landmark (referred to as a &lt;strong>part&lt;/strong>) as $Y\_{p} \in \mathcal{Z} \subset \mathbb{R}^{2}$&lt;/p>
&lt;ul>
&lt;li>$\mathcal{Z}$: set of all $(u, v)$ locations in an image&lt;/li>
&lt;/ul>
&lt;p>🎯 &lt;strong>Goal: to predict the image locations $Y = (Y\_1, \dots, Y\_P)$ for all $P$ parts.&lt;/strong>&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-100" >&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%e6%88%aa%e5%b1%8f2021-07-11%2015.22.48.png" alt="截屏2021-07-11 15.22.48" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>In each stage $t \in \{1 \dots T\}$, the classifiers $g\_t$ predict beliefs for assigning a location to each part $Y\_{p}=z, \forall z \in \mathcal{Z}$, based on&lt;/p>
&lt;ul>
&lt;li>features extracted from the image at the location $z$ denoted by $\mathbf{x}\_{z} \in \mathbb{R}^{d}$ and&lt;/li>
&lt;li>contextual information from the preceding classifier in the neighborhood around each $Y\_p$ in stage $t$.&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>First stage&lt;/strong>&lt;/p>
&lt;p>A classifier in the first stage $t = 1$ produces the following belief values:
&lt;/p>
$$
g\_1(\mathbf{x}\_z) \rightarrow \\{b\_1^p(Y\_p=z)\\}\_{p \in \\{0 \dots P\\}}
$$
&lt;ul>
&lt;li>$b\_{1}^{p}\left(Y\_{p}=z\right)$: score predicted by the classifier $g\_1$ for assigning the $p$-th part in the first stage at image location $z$.&lt;/li>
&lt;/ul>
&lt;p>We represent all the beliefs of part $p$ evaluated at every location $z = (u, v)^T$ in the image as $\mathbf{b}\_{t}^{p} \in \mathbb{R}^{w \times h}$:
&lt;/p>
$$
\mathbf{b}\_{t}^{p}[u, v]=b\_{t}^{p}\left(Y\_{p}=z\right)
$$
&lt;ul>
&lt;li>$w, h$: width and height of the image, respectively&lt;/li>
&lt;/ul>
&lt;p>For convenience, we denote the collection of belief maps for all the parts as $\mathbf{b}\_{t} \in \mathbb{R}^{w \times h \times(P+1)}$ ($+1$ for background)&lt;/p>
&lt;p>&lt;strong>Subsequent stages&lt;/strong>&lt;/p>
&lt;p>The classifier predicts a belief for assigning a location to each part $Y\_{p}=z, \forall z \in \mathcal{Z}$, based on&lt;/p>
&lt;ul>
&lt;li>features of the image data $\mathbf{x}\_{z}^{t} \in \mathbb{R}^{d}$ and&lt;/li>
&lt;li>contextual information from the preceding classifier in the neighborhood around each $Y\_p$&lt;/li>
&lt;/ul>
$$
g\_{t}\left(\mathbf{x}\_{z}^{t}, \psi\_{t}\left(z, \mathbf{b}\_{t-1}\right)\right) \rightarrow \left\\{b\_{t}^{p}\left(Y\_{p}=z\right)\right\\}\_{p \in \\{0 \ldots P\\}}
$$
&lt;ul>
&lt;li>$\psi\_{t>1}(\cdot)$: mapping from the beliefs $b\_{t−1}$ to context features.&lt;/li>
&lt;/ul>
&lt;p>In each stage, the computed beliefs provide an increasingly refined estimate for the location of each part.&lt;/p>
&lt;h3 id="confidence-maps-generation">Confidence maps generation&lt;/h3>
&lt;p>&lt;strong>Fully Convolutional Network (FCN)&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Does not have Fully Connected Layers.&lt;/li>
&lt;li>The same network can be applied to arbitrary image sizes.&lt;/li>
&lt;li>Similar to a sliding window approach, but more efficient&lt;/li>
&lt;/ul>
&lt;h3 id="cpm">CPM&lt;/h3>
&lt;p>The prediction and image feature computation modules of a pose machine can be replaced by a deep convolutional architecture allowing for both image and contextual feature representations to be learned directly from data.&lt;/p>
&lt;p>Advantage of convolutional architectures: completely differentiable $\rightarrow$ enabling end-to-end joint training of all stages 👍&lt;/p>
&lt;figure>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-07-08%2012.34.37.png"
alt="CPM structure">&lt;figcaption>
&lt;p>CPM structure&lt;/p>
&lt;/figcaption>
&lt;/figure>
&lt;h3 id="learning-in-cpm">Learning in CPM&lt;/h3>
&lt;p>Potential problem of a network with a large number of layers: &lt;strong>vanishing gradient&lt;/strong>&lt;/p>
&lt;p>Solution&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Define a loss function at the output of each stage $t$ that minimizes the $l\_2$ distance between the predicted ($b\_{t}^{p}$) and ideal ($b\_{*}^{p}\left(Y\_{p}=z\right)$) belief maps for each part.&lt;/p>
&lt;ul>
&lt;li>The ideal belief map for a part $p$, $b\_{*}^{p}\left(Y\_{p}=z\right)$, are created by putting Gaussian peaks at ground truth locations of each body part $p$.&lt;/li>
&lt;/ul>
&lt;p>Cost function we aim to minimize at the output of each stage at each level:
&lt;/p>
$$
f\_{t}=\sum\_{p=1}^{P+1} \sum\_{z \in \mathcal{Z}}\left\|b\_{t}^{p}(z)-b\_{*}^{p}(z)\right\|\_{2}^{2} .
$$
&lt;ul>
&lt;li>$P$: all body parts&lt;/li>
&lt;li>$\mathcal{Z}$: set of all image locations in a belief map&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>The overall objective for the full architecture is obtained by adding the losses at each stage:
&lt;/p>
$$
\mathcal{F}=\sum\_{t=1}^{T} f\_{t}
$$
&lt;/li>
&lt;/ul>
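A sketch of the training targets and per-stage loss described above; the Gaussian width `sigma` is an assumed hyperparameter:

```python
import numpy as np

def ideal_belief_map(w, h, gt_uv, sigma=1.5):
    """Ideal belief map b*_p for one part: a Gaussian peak at the
    part's ground-truth image location (gu, gv)."""
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    gu, gv = gt_uv
    return np.exp(-((u - gu) ** 2 + (v - gv) ** 2) / (2 * sigma ** 2))

def stage_loss(pred_maps, ideal_maps):
    """f_t: summed squared differences between predicted and ideal
    belief maps over all parts and image locations."""
    return float(np.sum((np.asarray(pred_maps) - np.asarray(ideal_maps)) ** 2))
```

The overall objective is then the sum of these stage losses, which is what makes the intermediate supervision counteract vanishing gradients.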
&lt;div class="footnotes" role="doc-endnotes">
&lt;hr>
&lt;ol>
&lt;li id="fn:1">
&lt;p>J. Shotton et al., &amp;ldquo;Real-time human pose recognition in parts from single depth images,&amp;rdquo; CVPR 2011, 2011, pp. 1297-1304, doi: 10.1109/CVPR.2011.5995316.&amp;#160;&lt;a href="#fnref:1" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:2">
&lt;p>Shih-En Wei, Varun Ramakrishna, Takeo Kanade, and Yaser Sheikh. Convolutional pose machines. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4724–4732, 2016&amp;#160;&lt;a href="#fnref:2" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;/ol>
&lt;/div></description></item><item><title>Gesture Recognition</title><link>https://haobin-tan.netlify.app/docs/ai/computer-vision/cv-lecture/16-gesture_recognition/</link><pubDate>Tue, 20 Jul 2021 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/computer-vision/cv-lecture/16-gesture_recognition/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;h3 id="gesture">Gesture&lt;/h3>
&lt;ul>
&lt;li>a movement usually of the body or limbs that expresses or emphasizes an idea, sentiment, or attitude&lt;/li>
&lt;li>the use of motions of the limbs or body as a means of expression&lt;/li>
&lt;/ul>
&lt;h3 id="automatic-gesture-recognition">Automatic Gesture Recognition&lt;/h3>
&lt;ul>
&lt;li>A gesture recognition system generates a &lt;em>semantic description&lt;/em> for certain body motions&lt;/li>
&lt;li>Gesture recognition exploits the power of &lt;em>non-verbal communication,&lt;/em> which is very common in human-human interaction&lt;/li>
&lt;li>Gesture recognition is often built on top of a &lt;em>human motion tracker&lt;/em>&lt;/li>
&lt;/ul>
&lt;h3 id="applications">Applications&lt;/h3>
&lt;ul>
&lt;li>Multimodal Interaction
&lt;ul>
&lt;li>Gestures + Speech recognition&lt;/li>
&lt;li>Gestures + gaze&lt;/li>
&lt;li>Human-Robot Interaction&lt;/li>
&lt;li>Interaction with Smart Environments&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Understanding Human Interaction&lt;/li>
&lt;/ul>
&lt;h3 id="types-of-gestures">Types of Gestures&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>Hand &amp;amp; arm gestures&lt;/p>
&lt;ul>
&lt;li>Pointing Gestures&lt;/li>
&lt;li>Sign Language&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Head gestures&lt;/p>
&lt;ul>
&lt;li>Nodding, head shaking, turning, pointing&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Body gestures&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-100" >&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%e6%88%aa%e5%b1%8f2021-07-20%2022.36.21.png" alt="截屏2021-07-20 22.36.21" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h2 id="automatic-gesture-recognition-1">Automatic Gesture Recognition&lt;/h2>
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-100" >&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%e6%88%aa%e5%b1%8f2021-07-20%2022.37.29.png" alt="截屏2021-07-20 22.37.29" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;ul>
&lt;li>Feature Acquisition
&lt;ul>
&lt;li>Appearances: Markers, color, motion, shape, segmentation, stereo, local descriptors, space-time interest points, &amp;hellip;&lt;/li>
&lt;li>Model based: body- or hand-models&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Classifiers
&lt;ul>
&lt;li>SVM, ANN, HMMs, Adaboost, Dec. Trees, Deep Learning &amp;hellip;&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="hidden-markov-models-hmms-for-gesture-recognition">Hidden Markov Models (HMMs) for Gesture Recognition&lt;/h3>
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-100" >&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%e6%88%aa%e5%b1%8f2021-07-20%2022.40.46.png" alt="截屏2021-07-20 22.40.46" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&amp;ldquo;&lt;strong>hidden&lt;/strong>&amp;rdquo;: conclusions are drawn from the observations WITHOUT knowing the &lt;em>hidden&lt;/em> sequence of states&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Markov assumption&lt;/strong> (1st order): the next state depends ONLY on the current state (not on the complete state history)&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>A Hidden Markov Model is a five-tuple
&lt;/p>
$$
(S, \pi, \mathbf{A}, B, V)
$$
&lt;ul>
&lt;li>$S = \\{s\_1, s\_2, \dots, s\_n\\}$: set of &lt;strong>states&lt;/strong>&lt;/li>
&lt;li>$\pi$: the &lt;strong>initial probability&lt;/strong> distribution
&lt;ul>
&lt;li>$\pi(s\_i)$ = probability of $s\_i$ being the first state of a state sequence&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>$\mathbf{A} = (a\_{ij})$: the matrix of &lt;strong>state transition probabilities&lt;/strong>
&lt;ul>
&lt;li>$a\_{ij}$: probability of state $s\_j$ following $s\_i$&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>$B = \\{b\_1, b\_2, \dots, b\_n\\}$: the set of &lt;strong>emission probability&lt;/strong> distributions/densities
&lt;ul>
&lt;li>$b\_i(x)$: probability of observing $x$ when the system is in state $s\_i$&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>$V$: the observable &lt;strong>feature space&lt;/strong>
&lt;ul>
&lt;li>Can be discrete ($V = \\{x\_1, x\_2, \dots, x\_v\\}$) or continuous ($V = \mathbb{R}^d$)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h4 id="properties-of-hmms">&lt;strong>Properties of HMMs&lt;/strong>&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>For the initial probabilities:
&lt;/p>
$$
\sum\_i \pi(s\_i) = 1
$$
&lt;ul>
&lt;li>Often simplified by
$$
\pi(s\_1) = 1, \quad \pi(s\_i) = 0 \text{ for } i > 1
$$&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>For state transition probabilities:
&lt;/p>
$$
\forall i: \sum\_j a\_{ij} = 1
$$
&lt;ul>
&lt;li>Often: $a\_{ij} = 0$ for most $j$ except for a few states&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>When $V = \\{x\_1, x\_2, \dots, x\_v\\}$ then $b\_i$ are discrete probability distributions, the HMMs are called &lt;strong>discrete HMMs&lt;/strong>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>When $V = \mathbb{R}^d$ then $b\_i$ are continuous probability density functions, the HMMs are called &lt;strong>continuous (density) HMMs&lt;/strong>&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h4 id="hmm-topologies">&lt;strong>HMM Topologies&lt;/strong>&lt;/h4>
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-100" >&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%e6%88%aa%e5%b1%8f2021-07-20%2023.06.32.png" alt="截屏2021-07-20 23.06.32" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h4 id="the-observation-model">&lt;strong>The Observation Model&lt;/strong>&lt;/h4>
&lt;p>Most popular: &lt;strong>Gaussian mixture models&lt;/strong>
&lt;/p>
$$
P\left(x\_{t} \mid s\_{j}\right)=\sum\_{k=1}^{n\_{j}} c\_{j k} \cdot \frac{1}{\sqrt{(2 \pi)^{n}\left|\Sigma\_{j k}\right|}} e^{-\frac{1}{2}\left(x\_{t}-\mu\_{j k}\right)^{\mathrm{T}} \Sigma\_{j k}^{-1}\left(x\_{t}-\mu\_{j k}\right)}
$$
&lt;ul>
&lt;li>$n\_j$: number of Gaussians (in state $j$)&lt;/li>
&lt;li>$c\_{jk}$: mixture weight for $k$-th Gaussian (in state $j$)&lt;/li>
&lt;li>$\mu\_{jk}$: means of $k$-th Gaussian (in state $j$)&lt;/li>
&lt;li>$\Sigma\_{jk}$: covariance matrix of $k$-th Gaussian (in state $j$)&lt;/li>
&lt;/ul>
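The mixture density above can be evaluated directly; a minimal sketch for one state, with illustrative mixture parameters:

```python
import numpy as np

def gmm_emission(x, c, mu, Sigma):
    """P(x | s_j) under a Gaussian mixture: c are the mixture weights,
    mu the component means, Sigma the covariance matrices."""
    x = np.asarray(x, dtype=float)
    d = x.shape[0]
    p = 0.0
    for ck, mk, Sk in zip(c, mu, Sigma):
        diff = x - mk
        norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(Sk))
        p += ck * np.exp(-0.5 * diff @ np.linalg.solve(Sk, diff)) / norm
    return p

# A single standard 2D Gaussian evaluated at its mean gives 1 / (2*pi)
p = gmm_emission([0.0, 0.0], [1.0], [np.zeros(2)], [np.eye(2)])
```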
&lt;h4 id="three-main-tasks-with-hmms">&lt;strong>Three Main Tasks with HMMs&lt;/strong>&lt;/h4>
&lt;p>Given an HMM $\lambda$ and an observation $x\_1, x\_2, \dots, x\_T$&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>The evaluation problem&lt;/strong>&lt;/p>
&lt;p>compute the probability of the observation $p(x\_1, x\_2, \dots, x\_T | \lambda)$&lt;/p>
&lt;p>$\rightarrow$ &amp;ldquo;Forward Algorithm&amp;rdquo;&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>The decoding problem&lt;/strong>&lt;/p>
&lt;p>compute the most likely state sequence $s\_{q1}, s\_{q2}, \dots, s\_{qT}$, i.e.
&lt;/p>
$$
\operatorname{argmax}\_{q\_1, \ldots, q\_T} p\left(q\_{1}, \ldots, q\_{T} \mid x\_{1}, x\_{2}, \ldots, x\_{T}, \lambda\right)
$$
&lt;p>
$\rightarrow$ &amp;ldquo;Viterbi-Algorithm&amp;rdquo;&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>The learning/optimization problem&lt;/strong>&lt;/p>
&lt;p>Find an HMM $\lambda^\prime$ s.t. $p\left(x\_{1}, x\_{2}, \ldots, x\_{T} \mid \lambda^{\prime}\right)>p\left(x\_{1}, x\_{2}, \ldots, x\_{T} \mid \lambda\right)$&lt;/p>
&lt;p>$\rightarrow$ &amp;ldquo;Baum-Welch-Algo&amp;rdquo;, &amp;ldquo;Viterbi-Learning&amp;rdquo;&lt;/p>
&lt;/li>
&lt;/ul>
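The forward algorithm for the evaluation problem fits in a few lines for a discrete HMM; the toy two-state model below is illustrative:

```python
import numpy as np

def forward(pi, A, B, obs):
    """Compute p(x_1, ..., x_T | lambda) for a discrete HMM.
    pi: initial state probabilities (n,), A: transition matrix (n, n),
    B: emission probabilities (n, |V|), obs: observation index sequence."""
    pi, A, B = map(np.asarray, (pi, A, B))
    alpha = pi * B[:, obs[0]]          # initialize with the first observation
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]  # propagate states, absorb observation
    return alpha.sum()

# Two states, two symbols: state 0 always emits symbol 0, state 1 symbol 1
pi = [1.0, 0.0]
A = [[0.5, 0.5], [0.0, 1.0]]
B = [[1.0, 0.0], [0.0, 1.0]]
p = forward(pi, A, B, [0, 1])  # the 0 -> 1 transition has probability 0.5
```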
&lt;h3 id="sign-language-recognition">Sign Language Recognition&lt;/h3>
&lt;ul>
&lt;li>American Sign Language (ASL)
&lt;ul>
&lt;li>6000 gestures describe persons, places and things&lt;/li>
&lt;li>Exact meaning and strong rules of context and grammar for each&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Sign recognition
&lt;ul>
&lt;li>HMM ideal for complex and structured hand gestures of ASL&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h4 id="feature-extraction">Feature extraction&lt;/h4>
&lt;ul>
&lt;li>Camera located in either a 1st-person or a 2nd-person view&lt;/li>
&lt;li>Segment hand blobs by a skin color model&lt;/li>
&lt;/ul>
&lt;h4 id="hmm-for-american-sign-language">&lt;strong>HMM for American Sign Language&lt;/strong>&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>Four-State HMM for each word&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-100" >&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%e6%88%aa%e5%b1%8f2021-07-20%2023.39.28.png" alt="截屏2021-07-20 23.39.28" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Training&lt;/p>
&lt;ul>
&lt;li>Automatic segmentation of sentences in five portions&lt;/li>
&lt;li>Initial estimates by iterative Viterbi-alignment&lt;/li>
&lt;li>Then Baum-Welch re-estimation&lt;/li>
&lt;li>No context used&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Recognition&lt;/p>
&lt;ul>
&lt;li>With and without part-of-speech grammar&lt;/li>
&lt;li>All features / only relative features used&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h4 id="asl-results">ASL Results&lt;/h4>
&lt;p>&lt;strong>Desk-based&lt;/strong>&lt;/p>
&lt;p>348 training and 94 testing sentences without contexts&lt;/p>
&lt;p>Accuracy:
&lt;/p>
$$
Acc = \frac{N-D-S-I}{N}
$$
&lt;ul>
&lt;li>$N$: #Words&lt;/li>
&lt;li>$D$: #Deletions&lt;/li>
&lt;li>$S$: #Substitutions&lt;/li>
&lt;li>$I$: #Insertions&lt;/li>
&lt;/ul>
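The accuracy measure is a one-liner; the counts below are illustrative, not the reported ASL results:

```python
def word_accuracy(n_words, deletions, substitutions, insertions):
    """Acc = (N - D - S - I) / N."""
    return (n_words - deletions - substitutions - insertions) / n_words

acc = word_accuracy(100, 5, 10, 5)  # (100 - 20) / 100 = 0.8
```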
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-100" >&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%e6%88%aa%e5%b1%8f2021-07-20%2023.42.19.png" alt="截屏2021-07-20 23.42.19" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>&lt;strong>Wearable-based&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>400 training sentences and 100 for testing&lt;/li>
&lt;li>Test 5-word sentences&lt;/li>
&lt;li>Restricted and unrestricted similar!&lt;/li>
&lt;/ul>
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-100" >&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%e6%88%aa%e5%b1%8f2021-07-20%2023.43.41.png" alt="截屏2021-07-20 23.43.41" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h3 id="pointing-gesture-recognition">Pointing Gesture Recognition&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>Pointing gestures&lt;/p>
&lt;ul>
&lt;li>
&lt;p>are used to specify objects and locations&lt;/p>
&lt;/li>
&lt;li>
&lt;p>can be needful to resolve ambiguities in verbal statements&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Definition: Pointing gesture = movement of the arm towards a pointing target&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Tasks&lt;/p>
&lt;ul>
&lt;li>Detect occurrence of human pointing gestures in natural arm movements&lt;/li>
&lt;li>Extract the 3D pointing direction&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="interaction-in-a-smart-room">Interaction in a Smart Room&lt;/h3></description></item><item><title>Action &amp; Activity Recognition</title><link>https://haobin-tan.netlify.app/docs/ai/computer-vision/cv-lecture/17-action-activity-recognition/</link><pubDate>Tue, 20 Jul 2021 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/computer-vision/cv-lecture/17-action-activity-recognition/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;h3 id="motivation">Motivation&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>Gain a higher level understanding of the scene, e.g.&lt;/p>
&lt;ul>
&lt;li>What are these persons doing (walking, sitting, working, hiding)?&lt;/li>
&lt;li>How are they doing it?&lt;/li>
&lt;li>What is going on in the scene (meeting, party, telephone conversation, etc&amp;hellip;)?&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Applications&lt;/p>
&lt;ul>
&lt;li>video indexing/analysis,&lt;/li>
&lt;li>smart-rooms,&lt;/li>
&lt;li>patient monitoring,&lt;/li>
&lt;li>surveillance,&lt;/li>
&lt;li>robots etc.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="actions-activities">Actions, Activities&lt;/h3>
&lt;h4 id="event">Event&lt;/h4>
&lt;ul>
&lt;li>“a thing that happens or takes place”&lt;/li>
&lt;li>Examples
&lt;ul>
&lt;li>Gestures&lt;/li>
&lt;li>Actions (running, drinking, standing up, etc.)&lt;/li>
&lt;li>Activities (preparing a meal, playing a game, etc.)&lt;/li>
&lt;li>Nature event (fire, storm, earthquake, etc.)&lt;/li>
&lt;li>&amp;hellip;&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h4 id="human-actions">Human actions&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>Def 1: &lt;strong>Physical body motion&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>E.g.: Walking, boxing, clapping, bending, &amp;hellip;&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Def 2: &lt;strong>Interaction with environment on specific purpose&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>E.g.&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-07-21%2009.47.42.png" alt="截屏2021-07-21 09.47.42">&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h4 id="activities">Activities&lt;/h4>
&lt;ul>
&lt;li>Complex sequence of action,&lt;/li>
&lt;li>Possibly performed by multiple humans,&lt;/li>
&lt;li>Typically longer temporal duration&lt;/li>
&lt;li>Examples
&lt;ul>
&lt;li>Preparing a meal&lt;/li>
&lt;li>Having a meeting&lt;/li>
&lt;li>Shaking hands&lt;/li>
&lt;li>Football team scoring a goal&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h4 id="actions--activity-hierarchy">Actions / Activity Hierarchy&lt;/h4>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-07-21%2009.50.08.png" alt="截屏2021-07-21 09.50.08" style="zoom:67%;" />
&lt;p>Example: Small groups (meetings)&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Individual actions&lt;/strong>: Speaking, writing, listening, walking, standing up, sitting down, “fidgeting”,&amp;hellip;&lt;/li>
&lt;li>&lt;strong>Group activities&lt;/strong>: Meeting start, end, discussion, presentation, monologue, dialogue, white board, note-taking&lt;/li>
&lt;li>Often audio-visual cues&lt;/li>
&lt;/ul>
&lt;h3 id="approaches">Approaches&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Time series classification&lt;/strong> problem similar to speech/gesture recognition&lt;/p>
&lt;ul>
&lt;li>Typical classifiers:
&lt;ul>
&lt;li>HMMs and variants (e.g. Coupled HMMs, Layered HMMs)&lt;/li>
&lt;li>Dynamic Bayesian Networks (DBN)&lt;/li>
&lt;li>Recurrent neural networks&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Classification&lt;/strong> problem similar to object recognition/detection&lt;/p>
&lt;ul>
&lt;li>Typical classifiers:
&lt;ul>
&lt;li>Template matching&lt;/li>
&lt;li>Boosting&lt;/li>
&lt;li>Bag-of-Words SVMs&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Deep Learning approaches:
&lt;ul>
&lt;li>2D CNN (e.g. Two-Stream CNN, Temporal Segment Network)&lt;/li>
&lt;li>3D CNN (e.g. C3D, I3D)&lt;/li>
&lt;li>LSTM on top of 2D/3D CNN&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h2 id="recognition-with-local-feature-descriptors">Recognition with local feature descriptors&lt;/h2>
&lt;ul>
&lt;li>Try to model both Space and Time
&lt;ul>
&lt;li>Combine spatial and motion descriptors to model an action&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Action == Space-time objects
&lt;ul>
&lt;li>Transfer object detectors to action recognition&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="space-time-features--boosting">Space-Time Features + Boosting&lt;/h3>
&lt;h4 id="-idea">💡 Idea&lt;/h4>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-07-21%2010.57.41.png" alt="截屏2021-07-21 10.57.41">&lt;/p>
&lt;ul>
&lt;li>Extract many features describing the relevant content of an image sequence
&lt;ul>
&lt;li>&lt;strong>Histogram of oriented gradients (HOG)&lt;/strong> to describe appearance&lt;/li>
&lt;li>&lt;strong>Histogram of oriented flow (HOF)&lt;/strong> to describe motion in video&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Use &lt;strong>Boosting&lt;/strong> to select and combine good features for classification&lt;/li>
&lt;/ul>
&lt;h4 id="action-features">Action features&lt;/h4>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-07-21%2011.03.34.png" alt="截屏2021-07-21 11.03.34">&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Action volume = space-time cuboid region around the head (duration of action)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Encoded with block-histogram features $f\_{\theta}(\cdot)$
&lt;/p>
$$
\theta=(x, y, t, d x, d y, d t, \beta, \varphi)
$$
&lt;ul>
&lt;li>Location: $(x, y, t)$&lt;/li>
&lt;li>Space-time extent: $(d x, d y, d t)$&lt;/li>
&lt;li>Type of block: $\beta \in \\{\text{Plain, Temp-2, Spat-4}\\}$&lt;/li>
&lt;li>Type of histogram: $\varphi$
&lt;ul>
&lt;li>Histogram of optical flow (HOF)&lt;/li>
&lt;li>Histogram of oriented gradient (HOG)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Example&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-07-21%2011.07.18.png" alt="截屏2021-07-21 11.07.18" style="zoom:80%;" />
&lt;/li>
&lt;/ul>
&lt;h5 id="histogram-features">Histogram features&lt;/h5>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-07-21%2011.12.17.png" alt="截屏2021-07-21 11.12.17" style="zoom:80%;" />
&lt;ul>
&lt;li>(simplified) Histogram of oriented gradient (HOG)
&lt;ul>
&lt;li>Apply gradient operator to each frame within the sequence (e.g. Sobel)&lt;/li>
&lt;li>Bin gradients discretized in 4 orientations to block-histogram&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Histogram of optical flow (HOF)
&lt;ul>
&lt;li>Calculate optical flow (OF) between frames&lt;/li>
&lt;li>Bin OF vectors discretized in 4 direction bins (+1 bin for no motion) to block-histogram&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Normalized action cuboid has size 14x14x8 with units corresponding to 5x5x5 pixels&lt;/li>
&lt;/ul>
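&lt;p>The simplified block histograms can be sketched in NumPy. This is an illustrative reduction, not the lecture's implementation: it uses plain finite differences instead of a Sobel filter and unweighted orientation counts.&lt;/p>

```python
import numpy as np

def orientation_histogram(frame, n_bins=4):
    """Bin image gradient orientations into n_bins (simplified HOG).

    Illustration only: plain finite differences instead of Sobel,
    and unweighted counts (real HOG weights bins by gradient magnitude).
    """
    gy, gx = np.gradient(frame.astype(float))
    angles = np.arctan2(gy, gx) % np.pi                 # orientations in [0, pi)
    bins = (angles / np.pi * n_bins).astype(int).clip(0, n_bins - 1)
    hist = np.bincount(bins.ravel(), minlength=n_bins).astype(float)
    return hist / hist.sum()                            # normalized block histogram

# A horizontal intensity ramp: all gradients point in x-direction (bin 0)
frame = np.tile(np.arange(16.0), (16, 1))
h = orientation_histogram(frame)
```

&lt;p>An HOF block histogram would be built the same way from optical-flow vectors, with one extra bin for &amp;ldquo;no motion&amp;rdquo;.&lt;/p>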
&lt;h4 id="action-learning">Action Learning&lt;/h4>
&lt;ul>
&lt;li>Use &lt;strong>boosting&lt;/strong> method (e.g. AdaBoost) to classify features within an action volume&lt;/li>
&lt;li>Features: Block-histogram features&lt;/li>
&lt;/ul>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-07-21%2010.57.41-20210721152454899.png" alt="截屏2021-07-21 10.57.41">&lt;/p>
&lt;h5 id="boosting">Boosting&lt;/h5>
&lt;ul>
&lt;li>
&lt;p>A &lt;strong>weak classifier&lt;/strong> &lt;em>h&lt;/em> is a classifier with accuracy only slightly better than chance&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Boosting combines a number of weak classifiers so that the ensemble is arbitrarily accurate&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-07-21%2011.22.36.png" alt="截屏2021-07-21 11.22.36">&lt;/p>
&lt;ul>
&lt;li>Allows the use of simple (weak) classifiers without loss of accuracy&lt;/li>
&lt;li>Selects features and trains the classifier&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
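&lt;p>The boosting idea can be sketched with a minimal AdaBoost using one-dimensional threshold stumps as weak classifiers. The toy data and round count are made up for illustration; the lecture's classifier operates on block-histogram features instead.&lt;/p>

```python
import numpy as np

def adaboost_stumps(X, y, rounds=5):
    """Minimal AdaBoost: repeatedly pick the weighted-error-minimizing
    threshold stump, then upweight misclassified samples."""
    n = len(X)
    w = np.full(n, 1.0 / n)                  # sample weights
    ensemble = []                            # (alpha, threshold, polarity)
    for _ in range(rounds):
        best = None
        for t in X:                          # candidate thresholds
            for pol in (1, -1):
                pred = np.where(X >= t, pol, -pol)
                err = w[pred != y].sum()
                if best is None or err < best[0]:
                    best = (err, t, pol)
        err, t, pol = best
        err = min(max(err, 1e-10), 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)    # weak-classifier weight
        pred = np.where(X >= t, pol, -pol)
        w *= np.exp(-alpha * y * pred)           # reweight samples
        w /= w.sum()
        ensemble.append((alpha, t, pol))
    return ensemble

def predict(ensemble, X):
    score = sum(a * np.where(X >= t, p, -p) for a, t, p in ensemble)
    return np.sign(score)

X = np.array([1.0, 2.0, 3.0, 6.0, 7.0, 8.0])     # toy 1-D features
y = np.array([-1, -1, -1, 1, 1, 1])
model = adaboost_stumps(X, y)
```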
&lt;h3 id="space-time-interest-points-stip--bag-of-words-bow">Space-Time Interest Points (STIP) + Bag-of-Words (BoW)&lt;/h3>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-07-21%2011.25.50.png" alt="截屏2021-07-21 11.25.50">&lt;/p>
&lt;p>Inspired by &lt;strong>Bag-of-Words (BoW)&lt;/strong> model for object classification&lt;/p>
&lt;h4 id="bag-of-words-bow-model">&lt;strong>Bag-of-Words (BoW) model&lt;/strong>&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>&amp;ldquo;Visual Word&amp;rdquo; vocabulary learning&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Cluster local features&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Visual Words = Cluster Means&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>BoW feature calculation&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Assign each local feature to the most similar visual word&lt;/p>
&lt;/li>
&lt;li>
&lt;p>BoW feature = Histogram of visual word occurrences within a region&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>The histogram can be used to classify objects (e.g. with an SVM)&lt;/p>
&lt;/li>
&lt;/ul>
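&lt;p>The BoW feature calculation can be sketched as follows. The two-word vocabulary is hard-coded here for brevity; in the full pipeline it would come from clustering the training descriptors (e.g. with k-means).&lt;/p>

```python
import numpy as np

def bow_histogram(descriptors, vocabulary):
    """Assign each local descriptor to its nearest visual word
    (a vocabulary row) and count occurrences."""
    # Pairwise squared distances: (n_descriptors, n_words)
    d = ((descriptors[:, None, :] - vocabulary[None, :, :]) ** 2).sum(-1)
    words = d.argmin(axis=1)                 # nearest visual word per descriptor
    hist = np.bincount(words, minlength=len(vocabulary)).astype(float)
    return hist / hist.sum()                 # normalized BoW feature

vocab = np.array([[0.0, 0.0], [10.0, 10.0]])             # two "visual words"
desc = np.array([[0.1, 0.2], [9.8, 10.1], [0.3, -0.1]])  # local features
h = bow_histogram(desc, vocab)               # two hits for word 0, one for word 1
```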
&lt;blockquote>
&lt;p>Bag of Visual Words (&lt;a href="http://vision.stanford.edu/teaching/cs231a_autumn1112/lecture/lecture15_bow_part-based_cs231a_marked.pdf">Stanford CS231 slides&lt;/a>)&lt;/p>
&lt;ol>
&lt;li>
&lt;p>Feature detection and representation&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-07-26%2000.30.18.png" alt="截屏2021-07-26 00.30.18" style="zoom:67%;" />
&lt;/li>
&lt;li>
&lt;p>Codewords dictionary formation&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-07-26%2000.30.38.png" alt="截屏2021-07-26 00.30.38" style="zoom:67%;" />
&lt;/li>
&lt;li>
&lt;p>Bag of word representation&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-07-26%2000.31.41.png" alt="截屏2021-07-26 00.31.41" style="zoom:67%;" />
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-07-26%2000.31.55.png" alt="截屏2021-07-26 00.31.55" style="zoom:67%;" />
&lt;/li>
&lt;/ol>
&lt;/blockquote>
&lt;h4 id="space-time-features-detector">&lt;strong>Space-Time Features: Detector&lt;/strong>&lt;/h4>
&lt;p>&lt;strong>Space-Time Interest Points (STIP)&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Space-Time Extension of Harris Operator
&lt;ul>
&lt;li>Add dimensionality of time to the second moment matrix&lt;/li>
&lt;li>Look for maxima in extended Harris corner function H&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Detection depends on spatio-temporal scale&lt;/li>
&lt;li>Extract features at multiple levels of spatio-temporal scales (dense scale sampling)&lt;/li>
&lt;/ul>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-07-21%2012.45.51.png" alt="截屏2021-07-21 12.45.51">&lt;/p>
&lt;h4 id="space-time-features-descriptor">&lt;strong>Space-Time Features: Descriptor&lt;/strong>&lt;/h4>
&lt;p>Compute histogram descriptors of space-time volumes in neighborhood of detected points&lt;/p>
&lt;ul>
&lt;li>Compute a 4-bin HOG for each cube in 3x3x2 space-time grid&lt;/li>
&lt;li>Compute a 5-bin HOF for each cube in 3x3x2 space-time grid&lt;/li>
&lt;/ul>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-07-21%2012.47.19.png" alt="截屏2021-07-21 12.47.19">&lt;/p>
&lt;h4 id="action-classification">&lt;strong>Action classification&lt;/strong>&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Spatio-temporal Bag-of-Words (BoW)&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Build Visual vocabulary of local feature representations using k-means clustering&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Assign each feature in a video to nearest vocabulary word&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Compute histogram of visual word occurrences over the space-time volume of a video sequence&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>SVM classification&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Combine different feature types using multichannel $\chi^{2}$ Kernel&lt;/li>
&lt;li>One-against-all approach in case of multi-class classification&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
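&lt;p>A sketch of the multichannel $\chi^{2}$ kernel. Combining the per-channel kernels by multiplication is one common variant and is assumed here; the histogram values are invented for illustration.&lt;/p>

```python
import numpy as np

def chi2_kernel(H1, H2, gamma=1.0):
    """Chi-square kernel between histogram sets H1 (n, d) and H2 (m, d):
    K(x, y) = exp(-gamma * sum_i (x_i - y_i)^2 / (x_i + y_i))."""
    diff = H1[:, None, :] - H2[None, :, :]
    summ = H1[:, None, :] + H2[None, :, :]
    safe = np.where(summ > 0, summ, 1.0)                # avoid division by zero
    chi2 = np.where(summ > 0, diff ** 2 / safe, 0.0).sum(-1)
    return np.exp(-gamma * chi2)

def multichannel_kernel(channels_a, channels_b):
    """Combine per-channel chi2 kernels (e.g. one for HOG, one for HOF)
    by elementwise multiplication."""
    K = np.ones((len(channels_a[0]), len(channels_b[0])))
    for Ha, Hb in zip(channels_a, channels_b):
        K *= chi2_kernel(Ha, Hb)
    return K

hog = np.array([[0.5, 0.5], [1.0, 0.0]])    # BoW histograms, HOG channel
hof = np.array([[0.2, 0.8], [0.6, 0.4]])    # BoW histograms, HOF channel
K = multichannel_kernel([hog, hof], [hog, hof])
```

&lt;p>The resulting kernel matrix can be passed to a kernel SVM; one-against-all then handles the multi-class case.&lt;/p>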
&lt;h3 id="dense-trajectories-3">Dense Trajectories &lt;sup id="fnref:1">&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref">1&lt;/a>&lt;/sup>&lt;/h3>
&lt;ul>
&lt;li>Dense sampling improves results over sparse interest points for image classification&lt;/li>
&lt;li>The 2D space domain and 1D time domain in videos have very different characteristics $\rightarrow$ use them both&lt;/li>
&lt;/ul>
&lt;h4 id="feature-trajectories">&lt;strong>Feature trajectories&lt;/strong>&lt;/h4>
&lt;ul>
&lt;li>Efficient for representing videos
&lt;ul>
&lt;li>Extracted using KLT tracker or matching SIFT descriptors between frames&lt;/li>
&lt;li>However, their quantity and quality are generally not sufficient 🤪&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>State of the art: videos are now described by &lt;strong>dense&lt;/strong> trajectories&lt;/li>
&lt;/ul>
&lt;h4 id="dense-trajectories">&lt;strong>Dense Trajectories&lt;/strong>&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>Obtain trajectories by &lt;strong>optical flow&lt;/strong> tracking on densely sampled points&lt;/p>
&lt;ul>
&lt;li>Sampling
&lt;ul>
&lt;li>Sample feature points every 5th pixel&lt;/li>
&lt;li>Remove untrackable points (structure / Eigenvalue analysis)&lt;/li>
&lt;li>Sample points on eight different scales&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Tracking
&lt;ul>
&lt;li>Tracking by median filtering in the OF-Field&lt;/li>
&lt;li>Trajectory length is fixed (e.g. 15 frames)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Feature tracking&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Points of subsequent frames are concatenated to form a trajectory&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Trajectories are limited to $L$ frames in order to avoid drift from their initial location&lt;/p>
&lt;/li>
&lt;li>
&lt;p>The shape of a trajectory of length $L$ is described by the sequence
&lt;/p>
$$
S=\left(\Delta P\_{t}, \ldots, \Delta P\_{t+L-1}\right)
$$
&lt;/li>
&lt;li>
&lt;p>The resulting vector is normalized by
&lt;/p>
$$
\begin{array}{c}
\Delta P\_{t}=\left(P\_{t+1}-P\_{t}\right)=\left(x\_{t+1}-x\_{t}, y\_{t+1}-y\_{t}\right) \\\\
S^{\prime}=\frac{\left(\Delta P\_{t}, \ldots, \Delta P\_{t+L-1}\right)}{\sum\_{j=t}^{t+L-1}\left\|\Delta P\_{j}\right\|}
\end{array}
$$
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
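&lt;p>The normalized shape descriptor $S^{\prime}$ from the formulas above can be computed directly; the sample track is made up:&lt;/p>

```python
import numpy as np

def trajectory_shape(points):
    """Normalized trajectory shape descriptor S':
    displacements Delta P_t divided by the sum of their magnitudes."""
    P = np.asarray(points, dtype=float)
    dP = P[1:] - P[:-1]                       # Delta P_t = P_{t+1} - P_t
    norm = np.linalg.norm(dP, axis=1).sum()   # sum_j ||Delta P_j||
    return dP / norm

pts = [(0, 0), (1, 0), (2, 0), (2, 1)]        # tracked point over L=3 steps
S = trajectory_shape(pts)                     # shape (3, 2)
```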
&lt;h4 id="trajectory-descriptors">&lt;strong>Trajectory descriptors&lt;/strong>&lt;/h4>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-07-21%2015.29.28.png" alt="截屏2021-07-21 15.29.28">&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Histogram of Oriented Gradient (HOG)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Histogram of Optical Flow (HOF)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>HOGHOF&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Motion Boundary Histogram (MBH)&lt;/p>
&lt;ul>
&lt;li>Take local gradients of x-y flow components and compute HOG as in static images&lt;/li>
&lt;/ul>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-07-21%2015.31.09.png" alt="截屏2021-07-21 15.31.09">&lt;/p>
&lt;/li>
&lt;/ul>
&lt;div class="footnotes" role="doc-endnotes">
&lt;hr>
&lt;ol>
&lt;li id="fn:1">
&lt;p>Wang, Heng, et al. &amp;ldquo;Dense trajectories and motion boundary descriptors for action recognition.&amp;rdquo; International journal of computer vision 103.1 (2013): 60-79.&amp;#160;&lt;a href="#fnref:1" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;/ol>
&lt;/div></description></item><item><title>Action &amp; Activity Recognition 2</title><link>https://haobin-tan.netlify.app/docs/ai/computer-vision/cv-lecture/18-action-activity-recognition-2/</link><pubDate>Tue, 20 Jul 2021 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/computer-vision/cv-lecture/18-action-activity-recognition-2/</guid><description>&lt;p>&lt;strong>What is action recognition?&lt;/strong>&lt;/p>
&lt;p>Given an input video/image, perform some appropriate processing, and output the “action label”&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-07-21%2021.09.22.png" alt="截屏2021-07-21 21.09.22">&lt;/p>
&lt;h2 id="cnns-for-action--activity-recognition-1">CNNs for Action / Activity Recognition &lt;sup id="fnref:1">&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref">1&lt;/a>&lt;/sup>&lt;/h2>
&lt;p>Why CNN?&lt;/p>
&lt;ul>
&lt;li>Convolutional neural networks report the best performance in static image classification.&lt;/li>
&lt;li>They automatically learn to extract generic features that transfer well across data sets.&lt;/li>
&lt;/ul>
&lt;h3 id="strategies-for-temporal-fusion">Strategies for temporal fusion&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Single Frame CNN (baseline)&lt;/strong>&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-07-21%2021.43.07.png" alt="截屏2021-07-21 21.43.07" style="zoom:50%;" />
&lt;ul>
&lt;li>Network sees one frame at a time&lt;/li>
&lt;li>No temporal information&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Late Fusion CNN&lt;/strong>&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-07-21%2021.43.23.png" alt="截屏2021-07-21 21.43.23" style="zoom:50%;" />
&lt;ul>
&lt;li>Network sees two frames separated by F = 15 frames&lt;/li>
&lt;li>Both frames go into separate pathways&lt;/li>
&lt;li>Only the last layers have access to temporal information&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Early Fusion CNN&lt;/strong>&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-07-21%2021.44.19.png" alt="截屏2021-07-21 21.44.19" style="zoom:50%;" />
&lt;ul>
&lt;li>Modify the convolutional filters in the first layer to incorporate temporal information.
&lt;ul>
&lt;li>Filters of $11 \times 11 \times 3 \times T$ , where $T$ is the temporal context ($T=10$)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Slow Fusion CNN&lt;/strong>&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-07-21%2021.45.52.png" alt="截屏2021-07-21 21.45.52" style="zoom:50%;" />
&lt;ul>
&lt;li>Layers higher in the hierarchy have access to larger temporal context&lt;/li>
&lt;li>Learn motion patterns at different scales&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
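&lt;p>The four fusion strategies differ mainly in how frames are arranged at the network input; the layouts can be sketched as array shapes (frame gap $F$ and temporal context $T$ follow the numbers above, the video tensor itself is random):&lt;/p>

```python
import numpy as np

T, F, H, W = 10, 15, 170, 170
video = np.random.rand(40, 3, H, W)      # 40 RGB frames, layout (t, c, h, w)

# Single frame: one frame at a time, no temporal information
single = video[0]                        # (3, H, W)

# Late fusion: two frames F = 15 apart, fed into separate pathways
late_a, late_b = video[0], video[F]      # each (3, H, W)

# Early fusion: T consecutive frames stacked along the channel axis,
# so first-layer filters of size 11 x 11 x 3T see temporal context
early = video[:T].reshape(3 * T, H, W)   # (30, H, W)
```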
&lt;h3 id="multiresolution-cnn">Multiresolution CNN&lt;/h3>
&lt;p>Faster training by reducing input size from $170 \times 170$ to $89 \times 89$&lt;/p>
&lt;p>💡 Idea: takes advantage of the &lt;strong>camera bias&lt;/strong> present in many online videos, since the object of interest often occupies the center region.&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-07-21%2021.48.32.png" alt="截屏2021-07-21 21.48.32">&lt;/p>
&lt;ul>
&lt;li>The &lt;strong>context stream&lt;/strong> receives the downsampled frames at half the original spatial resolution (89 × 89 pixels)&lt;/li>
&lt;li>The &lt;strong>fovea stream&lt;/strong> receives the center 89 × 89 region at the original resolution&lt;/li>
&lt;/ul>
&lt;p>$\rightarrow$ The total input dimensionality is halved.&lt;/p>
&lt;h3 id="evaluation">Evaluation&lt;/h3>
&lt;p>Dataset: Sports-1M (1 Million videos, 487 sport activities classes)&lt;/p>
&lt;h2 id="encoding-image-and-optical-flow-separately-two-stream-cnns-2">Encoding image and optical flow separately (two-stream CNNs) &lt;sup id="fnref:2">&lt;a href="#fn:2" class="footnote-ref" role="doc-noteref">2&lt;/a>&lt;/sup>&lt;/h2>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-07-21%2022.35.36.png" alt="截屏2021-07-21 22.35.36">&lt;/p>
&lt;h2 id="3d-convolutions-for-action-recognition-c3d">3D convolutions for action recognition (C3D)&lt;/h2>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-07-21%2022.53.06.png" alt="截屏2021-07-21 22.53.06">&lt;/p>
&lt;p>Notations:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>video clips $\in c \times l \times h \times w$&lt;/p>
&lt;ul>
&lt;li>$c$: #channels&lt;/li>
&lt;li>$l$: length in number of frames&lt;/li>
&lt;li>$h, w$: height and width of the frame&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>3D convolution and pooling $\in d \times k \times k$&lt;/p>
&lt;ul>
&lt;li>$d$: kernel temporal depth&lt;/li>
&lt;li>$k$: kernel spatial size&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;p>C3D: 3 x 3 x 3 convolutions with stride 1 in space and time&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-07-21%2022.57.53.png" alt="截屏2021-07-21 22.57.53">&lt;/p>
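&lt;p>A naive single-channel 3D convolution makes the $d \times k \times k$ kernels concrete (C3D uses $3 \times 3 \times 3$ with stride 1); real implementations are vectorized and multi-channel:&lt;/p>

```python
import numpy as np

def conv3d_valid(volume, kernel):
    """Single-channel 3D convolution, stride 1, no padding."""
    D, Kh, Kw = kernel.shape
    L, H, W = volume.shape
    out = np.zeros((L - D + 1, H - Kh + 1, W - Kw + 1))
    for t in range(out.shape[0]):            # slide over time ...
        for i in range(out.shape[1]):        # ... and space
            for j in range(out.shape[2]):
                out[t, i, j] = (volume[t:t+D, i:i+Kh, j:j+Kw] * kernel).sum()
    return out

clip = np.random.rand(16, 8, 8)              # l x h x w, one channel
k = np.ones((3, 3, 3)) / 27                  # 3x3x3 averaging kernel
out = conv3d_valid(clip, k)                  # (14, 6, 6)
```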
&lt;h2 id="recurrent-convolutional-networks--cnn-rnn-3">Recurrent Convolutional Networks / CNN-RNN &lt;sup id="fnref:3">&lt;a href="#fn:3" class="footnote-ref" role="doc-noteref">3&lt;/a>&lt;/sup>&lt;/h2>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-07-21%2023.04.00.png" alt="截屏2021-07-21 23.04.00">&lt;/p>
&lt;h3 id="lrcn">LRCN&lt;/h3>
&lt;ul>
&lt;li>Task-specific instantiation&lt;/li>
&lt;li>Activity recognition (average frame representations)&lt;/li>
&lt;li>Image captioning (feed image info to each RNN)&lt;/li>
&lt;li>Video description (sequence-to-sequence models)&lt;/li>
&lt;/ul>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-07-21%2023.06.41.png" alt="截屏2021-07-21 23.06.41">&lt;/p>
&lt;h2 id="comparison-of-architectures">Comparison of architectures&lt;/h2>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Type of convolutional operators and layers&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>2D kernels (image-based) vs.&lt;/li>
&lt;li>3D kernels (video-based)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Input streams&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>RGB (spatial stream), usually used in single-stream networks&lt;/li>
&lt;li>Precomputed optical flow (temporal stream)&lt;/li>
&lt;li>Further streams possible (e.g. depth, human bounding boxes)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Fusion strategy across multiple frames&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Feature aggregation over time&lt;/li>
&lt;li>Recurrent layers, such as LSTM&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>$\rightarrow$ Modern architectures are usually a combination of the above!&lt;/strong>&lt;/p>
&lt;p>&lt;strong>Fair comparison of the architectures is difficult!&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>different pre-training of models, some are trained from scratch&lt;/li>
&lt;li>Activity recognition datasets have been too small for analysis of deep learning approaches $\rightarrow$ pre-training matters even more&lt;/li>
&lt;/ul>
&lt;h2 id="evolution-of-activity-recognition-datasets">Evolution of Activity Recognition Datasets&lt;/h2>
&lt;ul>
&lt;li>Construction of large-scale video datasets much harder than for images 🤪&lt;/li>
&lt;li>Common datasets too tiny for proper research of deep methods&lt;/li>
&lt;/ul>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-07-21%2023.15.57.png" alt="截屏2021-07-21 23.15.57" style="zoom:67%;" />
&lt;h2 id="evaluation-of-action-recognition-architectures-4">Evaluation of Action Recognition Architectures &lt;sup id="fnref:4">&lt;a href="#fn:4" class="footnote-ref" role="doc-noteref">4&lt;/a>&lt;/sup>&lt;/h2>
&lt;p>Contributions&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Release of the &lt;em>&lt;strong>Kinetics&lt;/strong>&lt;/em> dataset - a first large-scale dataset for Activity Recognition&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Benchmarking of three „classic“ architectures for activity recognition&lt;/p>
&lt;ul>
&lt;li>Note: fair comparison is still quite difficult, since models still differ in their modalities and pre-training basis&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>New Architecture: &lt;strong>I3D&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>3D CNN based Inception-V1 CNN (Google LeNet)&lt;/li>
&lt;li>&amp;ldquo;Inflation&amp;rdquo; of trained 2-D filters into the 3-D model&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="evaluation-of-3-classic-architectures">Evaluation of 3 &amp;ldquo;classic&amp;rdquo; architectures&lt;/h3>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-07-21%2023.25.40.png" alt="截屏2021-07-21 23.25.40">&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>ConvNet + LSTM (9M Parameters)&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Underlying CNN for feature extraction: Inception-V1&lt;/li>
&lt;li>LSTM with 512 hidden units (after the last AvgPool layer) + FC layer&lt;/li>
&lt;li>Estimating the action from the resulting prediction &lt;strong>Sequence&lt;/strong>:
&lt;ul>
&lt;li>Training: &lt;strong>output at each time-step used for loss&lt;/strong> calculation&lt;/li>
&lt;li>Testing: &lt;strong>output of the last frame used for final prediction&lt;/strong>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Pre-trained on ImageNet&lt;/strong>&lt;/li>
&lt;li>Preprocessing Steps: down-sampling from 25 to 5 fps&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>3D - ConvNet (79M Parameters)&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Spatio-temporal filters, C3D architecture&lt;/li>
&lt;li>High number of parameters $\rightarrow$ harder to train 🤪&lt;/li>
&lt;li>CNN Input: 16-frame snippets&lt;/li>
&lt;li>Classification: score averaging over each snippet in the video&lt;/li>
&lt;li>&lt;strong>Trained from scratch&lt;/strong>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Two Stream CNN (12 M Parameters)&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Underlying CNN for feature extraction: Inception-V1&lt;/li>
&lt;li>Spatial (RGB) and Temporal (Optical Flow) streams trained separately&lt;/li>
&lt;li>Prediction by score averaging&lt;/li>
&lt;li>CNN Pre-trained on ImageNet&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;p>Evaluation&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Two-Stream are still the clear winners&lt;/strong>&lt;/li>
&lt;li>3D-CNNs show poor performance and a very high number of parameters
&lt;ul>
&lt;li>Note: this is the only architecture trained from scratch&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="inflated-3d-cnn-i3d">Inflated 3D CNN (I3D)&lt;/h3>
&lt;p>💡 Idea: transfer the knowledge from the image recognition tasks in 3-D CNNs&lt;/p>
&lt;p>&lt;strong>I3D Architecture&lt;/strong>&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-07-21%2023.45.31.png" alt="截屏2021-07-21 23.45.31">&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Inception-V1 architecture extended to 3D&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Filters and pooling kernels inflated with the time dimension ($N \times N \rightarrow N \times N \times N$)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>👍 Advantage: Pre-training on Image-Net possible (Learned weights of 2-D filters repeated N times along the time dimension)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Note: the 3-D extension is not fully symmetric with respect to pooling (the time dimension is treated differently from the spatial dimensions)&lt;/p>
&lt;ul>
&lt;li>First two max-pooling layers &lt;strong>do not perform temporal pooling&lt;/strong>&lt;/li>
&lt;li>Late max-pooling layers use symmetric 3x3x3 kernels&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Evaluation&lt;/p>
&lt;ul>
&lt;li>I3D outperforms image-based approaches on each of the streams&lt;/li>
&lt;li>Combination of RGB input and optical flow still very useful&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
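&lt;p>The filter inflation can be sketched as follows: repeat the 2-D weights $N$ times along time and rescale by $1/N$, so that a &amp;ldquo;boring video&amp;rdquo; of identical frames produces the same response as the single image.&lt;/p>

```python
import numpy as np

def inflate_2d_filter(w2d, n_t):
    """Inflate an N x N filter into an n_t x N x N one by repeating it
    along time and rescaling, preserving responses on repeated frames."""
    return np.repeat(w2d[None, :, :], n_t, axis=0) / n_t

w2d = np.random.rand(3, 3)                    # trained 2-D filter
w3d = inflate_2d_filter(w2d, 3)               # (3, 3, 3)

patch = np.random.rand(3, 3)
resp_2d = (w2d * patch).sum()                              # image response
resp_3d = (w3d * np.repeat(patch[None], 3, axis=0)).sum()  # boring-video response
```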
&lt;h3 id="the-role-of-pre-training">The role of pre-training&lt;/h3>
&lt;p>&lt;strong>Pre-training on a video dataset (additionally to the Image-Net pre-training)&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Pre-training on MiniKinetics&lt;/li>
&lt;li>For 3D ConvNets, using additional data for pre-training is crucial&lt;/li>
&lt;li>For 2D ConvNets, the difference seems to be smaller&lt;/li>
&lt;/ul>
&lt;p>$\rightarrow$ Pre-training is crucial&lt;/p>
&lt;p>$\rightarrow$ I3D is the new State-of-The art model&lt;/p>
&lt;div class="footnotes" role="doc-endnotes">
&lt;hr>
&lt;ol>
&lt;li id="fn:1">
&lt;p>Karpathy, Andrej, et al. &amp;ldquo;Large-scale video classification with convolutional neural networks.&amp;rdquo; Computer Vision and Pattern Recognition (CVPR), 2014&amp;#160;&lt;a href="#fnref:1" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:2">
&lt;p>K. Simonyan, and A. Zisserman. Two-Stream Convolutional Networks for Action Recognition in Videos. In &lt;em>NIPS&lt;/em> 2015.&amp;#160;&lt;a href="#fnref:2" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:3">
&lt;p>J. Donahue, et al. Long-term Recurrent Convolutional Networks for Visual Recognition and Description. In &lt;em>CVPR&lt;/em> 2015.&amp;#160;&lt;a href="#fnref:3" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:4">
&lt;p>Carreira, J., &amp;amp; Zisserman, A. (2017). Quo Vadis, action recognition? A new model and the kinetics dataset. &lt;em>Proceedings - 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017&lt;/em>, &lt;em>2017&lt;/em>-&lt;em>January&lt;/em>, 4724–4733. &lt;a href="https://doi.org/10.1109/CVPR.2017.502">https://doi.org/10.1109/CVPR.2017.502&lt;/a>&amp;#160;&lt;a href="#fnref:4" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;/ol>
&lt;/div></description></item><item><title>Clean Code</title><link>https://haobin-tan.netlify.app/docs/cs/software-engineering/implementing-high-quality-systems/clean-code/</link><pubDate>Fri, 06 Nov 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/cs/software-engineering/implementing-high-quality-systems/clean-code/</guid><description>&lt;h2 id="motivation">Motivation&lt;/h2>
&lt;p>&lt;strong>Readability&lt;/strong> of code is important&lt;/p>
&lt;ul>
&lt;li>Code is much more often read than written&lt;/li>
&lt;li>You write code for the next human to read it, not for the compiler/interpreter/computer!&lt;/li>
&lt;/ul>
&lt;h2 id="object-oriented-design-ood">Object-Oriented Design (OOD)&lt;/h2>
&lt;blockquote>
&lt;p>A design strategy to build a system “made up of interacting objects that maintain their own local state and provide operations on that state information.”&lt;/p>
&lt;p>[Sommerville]&lt;/p>
&lt;/blockquote>
&lt;p>&lt;strong>SOLID&lt;/strong> principles: Five principles of good OO design&lt;/p>
&lt;ul>
&lt;li>&lt;strong>S&lt;/strong>ingle Responsibility Principle (SRP)&lt;/li>
&lt;li>&lt;strong>O&lt;/strong>pen Closed Principle (OCP)&lt;/li>
&lt;li>&lt;strong>L&lt;/strong>iskov Substitution Principle (LSP)&lt;/li>
&lt;li>&lt;strong>I&lt;/strong>nterface Segregation Principle (ISP)&lt;/li>
&lt;li>&lt;strong>D&lt;/strong>ependency Inversion Principle (DIP)&lt;/li>
&lt;/ul>
&lt;h3 id="single-responsibility-principle-srp">Single Responsibility Principle (SRP)&lt;/h3>
&lt;blockquote>
&lt;p>“There should never be more than one reason for a class to change.“
— R. Martin&lt;/p>
&lt;/blockquote>
&lt;ul>
&lt;li>
&lt;p>Each responsibility deals with &lt;strong>one core concern&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>It may also deal with further (cross-cutting) concerns&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Bad smell: Big class (~ &amp;gt;200 LOC, &amp;gt;15 methods/fields)&lt;/p>
&lt;ul>
&lt;li>Useful refactoring: Extract class&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Benefits:&lt;/p>
&lt;ul>
&lt;li>Code is easier to understand&lt;/li>
&lt;li>Adding/modifying functionality should affect few classes&lt;/li>
&lt;li>Risk of breaking code is minimised&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h4 id="insertion-command-query-separation">Insertion: Command-Query-Separation&lt;/h4>
&lt;ul>
&lt;li>Separate commands (actions) from simple queries (requests)&lt;/li>
&lt;li>Reason
&lt;ul>
&lt;li>Commands are expected to have side effects on an object’s state&lt;/li>
&lt;li>Queries should not change the state of an object&lt;/li>
&lt;li>Appropriate designs are simpler to understand and easier to test&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
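&lt;p>The command-query split can be illustrated with a small hypothetical class (not from the lecture; Python is used here for brevity):&lt;/p>

```python
class BankAccount:
    """Commands mutate state and return nothing; queries report
    state without side effects."""

    def __init__(self, balance=0):
        self._balance = balance

    def deposit(self, amount):
        """Command: changes the object's state, returns nothing."""
        self._balance += amount

    def balance(self):
        """Query: no side effects, so it is trivial to test and
        safe to call any number of times."""
        return self._balance

acct = BankAccount()
acct.deposit(100)
```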
&lt;h3 id="open-closed-principle-ocp">Open Closed Principle (OCP)&lt;/h3>
&lt;blockquote>
&lt;p>“Software entities (classes, modules, functions, etc.) should be open for extension, but closed for modification.”
— R. Martin, paraphrasing B. Meyer&lt;/p>
&lt;/blockquote>
&lt;ul>
&lt;li>
&lt;p>💡 Idea: Modify behaviour by adding new code, NOT by changing old code&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Strongly related to the “Information Hiding Principle”&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Example: Drawing a list of shapes using a switch statement&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-java" data-lang="java">&lt;span class="line">&lt;span class="cl">&lt;span class="k">for&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">Shape&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">shape&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">ShapeList&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="k">switch&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">shape&lt;/span>&lt;span class="p">.&lt;/span>&lt;span class="na">getType&lt;/span>&lt;span class="p">())&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="p">{&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="k">case&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">SQUARE&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">square&lt;/span>&lt;span class="p">.&lt;/span>&lt;span class="na">draw&lt;/span>&lt;span class="p">();&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">break&lt;/span>&lt;span class="p">;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="k">case&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">CIRCLE&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">circle&lt;/span>&lt;span class="p">.&lt;/span>&lt;span class="na">draw&lt;/span>&lt;span class="p">();&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">break&lt;/span>&lt;span class="p">;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="p">}&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Needs to be modified for new shapes 🤪&lt;/p>
&lt;p>Solution: use abstractions to keep the function open for extension&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-java" data-lang="java">&lt;span class="line">&lt;span class="cl">&lt;span class="k">for&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">Shape&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">shape&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">ShapeList&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="n">shape&lt;/span>&lt;span class="p">.&lt;/span>&lt;span class="na">draw&lt;/span>&lt;span class="p">();&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;/li>
&lt;/ul>
&lt;h3 id="liskov-substitution-principle-lsp">Liskov Substitution Principle (LSP)&lt;/h3>
&lt;blockquote>
&lt;p>“Functions that use pointers or references to base classes must be able to &lt;strong>use&lt;/strong> objects of &lt;strong>derived classes without knowing&lt;/strong> it.”&lt;/p>
&lt;p>&lt;em>— R. Martin&lt;/em>&lt;/p>
&lt;/blockquote>
&lt;h4 id="example">Example&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>Square &lt;strong>is-a&lt;/strong> Rectangle? Only in a mathematical sense!&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Square &lt;strong>can-NOT-substitute&lt;/strong> Rectangle, because it offers limited behaviour (&lt;code>setWidth&lt;/code> and &lt;code>setHeight&lt;/code> are dependent)&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2020-11-15%2014.38.09.png" alt="截屏2020-11-15 14.38.09">&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>LSP is related to B. Meyer‘s &lt;strong>Design by Contract&lt;/strong> (DbC):&lt;/p>
&lt;blockquote>
&lt;p>&lt;em>“When redefining a routine [in a derivative], you may only replace its&lt;/em> &lt;strong>precondition&lt;/strong> &lt;em>by a&lt;/em> &lt;strong>weaker&lt;/strong> &lt;em>one, and its&lt;/em> &lt;strong>postcondition&lt;/strong> &lt;em>by a&lt;/em> &lt;strong>stronger&lt;/strong> &lt;em>one.”&lt;/em>&lt;/p>
&lt;p>&lt;em>— B. Meyer&lt;/em>&lt;/p>
&lt;/blockquote>
&lt;ul>
&lt;li>In our case, rectangle&amp;rsquo;s &lt;code>setWidth&lt;/code> postcondition: &lt;code>width = w&lt;/code> and &lt;code>height = h&lt;/code>&lt;/li>
&lt;li>Square&amp;rsquo;s &lt;code>setWidth&lt;/code> postcondition: &lt;code>width = w&lt;/code> and &lt;code>height = w&lt;/code>&lt;/li>
&lt;li>Only weaker preconditions and stronger postconditions are allowed, as only they preserve substitutability. It is not allowed to change conditions to &lt;em>arbitrarily different&lt;/em> ones&lt;/li>
&lt;/ul>
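&lt;p>&lt;em>Illustrative sketch (not from the lecture; class and method names are made up):&lt;/em> a client written against &lt;code>Rectangle&lt;/code>’s contract observes different behaviour when handed a &lt;code>Square&lt;/code>, because the coupled setters change &lt;code>setWidth&lt;/code>’s postcondition:&lt;/p>

```java
// Sketch of the Square/Rectangle LSP violation discussed above.
class Rectangle {
    protected int width, height;
    public void setWidth(int w)  { width = w; }   // postcondition: width = w, height unchanged
    public void setHeight(int h) { height = h; }
    public int area() { return width * height; }
}

class Square extends Rectangle {
    // Square couples both setters to keep its invariant,
    // thereby altering Rectangle's postconditions.
    @Override public void setWidth(int w)  { width = w; height = w; }
    @Override public void setHeight(int h) { width = h; height = h; }
}

class LspDemo {
    // A client relying on Rectangle's contract:
    static int clientArea(Rectangle r) {
        r.setWidth(5);
        r.setHeight(4);
        return r.area(); // expects 5 * 4 = 20 for any Rectangle
    }
}
```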
&lt;p>Possible solution according to Liskov:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Square/Rectangle &lt;strong>can-substitute&lt;/strong> Shape,&lt;/p>
&lt;/li>
&lt;li>
&lt;p>if Shape collects&lt;/p>
&lt;ul>
&lt;li>
&lt;p>less specific behaviour&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2020-11-15%2014.43.10.png" alt="截屏2020-11-15 14.43.10">&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Alternative: Drop &lt;code>height = h&lt;/code> from Rectangle’s postcondition&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="interface-segregation-principle-isp">Interface Segregation Principle (ISP)&lt;/h3>
&lt;blockquote>
&lt;p>“Clients should not be forced to depend upon interfaces that they do not use.”&lt;/p>
&lt;p>&lt;em>— R. Martin&lt;/em>&lt;/p>
&lt;/blockquote>
&lt;p>Interfaces should be kept as lean as possible&lt;/p>
&lt;ul>
&lt;li>&lt;strong>High cohesion&lt;/strong>: Interfaces should only be concerned with single concepts&lt;/li>
&lt;li>&lt;strong>Interface pollution&lt;/strong>: Interfaces should not depend on other interfaces just because a subclass requires them&lt;/li>
&lt;li>Interfaces should be separated if used by different clients&lt;/li>
&lt;li>Refactorings: Extract interface/superclass&lt;/li>
&lt;/ul>
&lt;p>Example:&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-11-15%2014.48.06.png" alt="截屏2020-11-15 14.48.06" style="zoom:80%;" />
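&lt;p>&lt;em>A minimal sketch (illustrative names, not from the lecture):&lt;/em> instead of one fat device interface, each concept gets its own lean interface, and a client depends only on the one it actually uses:&lt;/p>

```java
// Segregated interfaces: one concept per interface.
interface Printer { String print(String doc); }
interface Scanner { String scan(); }

// A multi-function device may implement both concepts...
class MultiFunctionDevice implements Printer, Scanner {
    public String print(String doc) { return "printed:" + doc; }
    public String scan() { return "scanned"; }
}

// ...while this client is only coupled to Printer,
// so it is not forced to depend on scanning.
class PrintClient {
    static String run(Printer p) { return p.print("report"); }
}
```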
&lt;h3 id="dependency-inversion-principle-dip">Dependency Inversion Principle (DIP)&lt;/h3>
&lt;blockquote>
&lt;p>“&lt;strong>A.&lt;/strong> High level modules should not depend upon low level modules. Both should &lt;strong>depend upon abstractions&lt;/strong>.&lt;/p>
&lt;p>&lt;strong>B.&lt;/strong> Abstractions should not depend upon details. Details should depend upon abstractions.”&lt;/p>
&lt;p>&lt;em>— R. Martin&lt;/em>&lt;/p>
&lt;/blockquote>
&lt;p>Example:&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2020-11-15%2014.50.48.png" alt="截屏2020-11-15 14.50.48">&lt;/p>
&lt;p>Better design:&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2020-11-15%2014.51.23.png" alt="截屏2020-11-15 14.51.23">&lt;/p>
&lt;h4 id="why-inversion">&lt;strong>Why “Inversion”?&lt;/strong>&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>An interface has been used to &lt;em>invert&lt;/em> the dependency between packages&lt;/p>
&lt;/li>
&lt;li>
&lt;p>But in general: Add abstract concept that both classes A and B depend on&lt;/p>
&lt;/li>
&lt;/ul>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-11-15%2014.53.02.png" alt="截屏2020-11-15 14.53.02" style="zoom:80%;" />
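&lt;p>&lt;em>A minimal sketch of the inverted dependency (class names are made up for this example):&lt;/em> the high-level module and the low-level detail both depend on an abstraction that sits between them:&lt;/p>

```java
// The abstraction both sides depend on.
interface MessageSink { void write(String msg); }

// Low-level detail: one concrete sink among many possible ones.
class MemorySink implements MessageSink {
    final StringBuilder out = new StringBuilder();
    public void write(String msg) { out.append(msg); }
}

// High-level module: knows only the MessageSink abstraction,
// the concrete sink is injected from outside.
class Reporter {
    private final MessageSink sink;
    Reporter(MessageSink sink) { this.sink = sink; }
    void report(String s) { sink.write("[report] " + s); }
}
```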
&lt;h2 id="more-principles">More Principles&lt;/h2>
&lt;h3 id="law-of-demeter-dont-talk-to-strangers">Law of Demeter (don’t talk to strangers)&lt;/h3>
&lt;blockquote>
&lt;p>A module should not know about the innards of the objects it manipulates.&lt;/p>
&lt;/blockquote>
&lt;ul>
&lt;li>
&lt;p>Corresponds to the bad smell “Message Chains”:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-java" data-lang="java">&lt;span class="line">&lt;span class="cl">&lt;span class="n">value&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">getClassA&lt;/span>&lt;span class="p">().&lt;/span>&lt;span class="na">getClassB&lt;/span>&lt;span class="p">().&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="p">...&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="p">.&lt;/span>&lt;span class="na">getNeededValue&lt;/span>&lt;span class="p">();&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>👆 Ties code to a particular class structure, which is likely to break. 😢&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Rule: A method &lt;code>m&lt;/code> of a class &lt;code>C&lt;/code> should only call the methods of&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;code>C&lt;/code>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>An object created by &lt;code>m&lt;/code>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>An object passed as an argument to &lt;code>m&lt;/code>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>An object held in an instance variable of &lt;code>C&lt;/code>&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;p>Example:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Violation&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-java" data-lang="java">&lt;span class="line">&lt;span class="cl">&lt;span class="kd">class&lt;/span> &lt;span class="nc">Motor&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="p">{&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="kd">public&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="kt">void&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="nf">startEngine&lt;/span>&lt;span class="p">()&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="p">{&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="c1">// start the motor&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="p">}&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="p">}&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-java" data-lang="java">&lt;span class="line">&lt;span class="cl">&lt;span class="kd">class&lt;/span> &lt;span class="nc">Car&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="p">{&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="kd">public&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">Motor&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">motor&lt;/span>&lt;span class="p">;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="kd">public&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="nf">Car&lt;/span>&lt;span class="p">()&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="p">{&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="n">motor&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">new&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">Motor&lt;/span>&lt;span class="p">();&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="p">}&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="p">}&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-java" data-lang="java">&lt;span class="line">&lt;span class="cl">&lt;span class="kd">class&lt;/span> &lt;span class="nc">Driver&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="p">{&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="kd">public&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="kt">void&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="nf">drive&lt;/span>&lt;span class="p">()&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="p">{&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="n">Car&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">myCar&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">new&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">Car&lt;/span>&lt;span class="p">();&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="n">myCar&lt;/span>&lt;span class="p">.&lt;/span>&lt;span class="na">motor&lt;/span>&lt;span class="p">.&lt;/span>&lt;span class="na">startEngine&lt;/span>&lt;span class="p">();&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="c1">// violation!!!&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="p">}&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="p">}&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;/li>
&lt;li>
&lt;p>Solution&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-java" data-lang="java">&lt;span class="line">&lt;span class="cl">&lt;span class="kd">class&lt;/span> &lt;span class="nc">Car&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="p">{&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="kd">private&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">Motor&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">motor&lt;/span>&lt;span class="p">;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="kd">public&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="nf">Car&lt;/span>&lt;span class="p">()&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="p">{&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="n">motor&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">new&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">Motor&lt;/span>&lt;span class="p">();&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="p">}&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="kd">public&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="kt">void&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="nf">getReadyToDrive&lt;/span>&lt;span class="p">()&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="p">{&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="k">this&lt;/span>&lt;span class="p">.&lt;/span>&lt;span class="na">motor&lt;/span>&lt;span class="p">.&lt;/span>&lt;span class="na">startEngine&lt;/span>&lt;span class="p">()&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="p">}&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="p">}&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-java" data-lang="java">&lt;span class="line">&lt;span class="cl">&lt;span class="kd">class&lt;/span> &lt;span class="nc">Driver&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="p">{&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="kd">public&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="kt">void&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="nf">drive&lt;/span>&lt;span class="p">()&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="p">{&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="n">Car&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">myCar&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">new&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">Car&lt;/span>&lt;span class="p">();&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="n">myCar&lt;/span>&lt;span class="p">.&lt;/span>&lt;span class="na">getReadyToDrive&lt;/span>&lt;span class="p">();&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="p">}&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="p">}&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;/li>
&lt;/ul>
&lt;h3 id="boy-scout-rule">Boy Scout Rule&lt;/h3>
&lt;blockquote>
&lt;p>„Leave the campground cleaner than you found it!“
&lt;em>— The Boy Scouts of America&lt;/em>&lt;/p>
&lt;/blockquote>
&lt;ul>
&lt;li>
&lt;p>Code degrades as time passes&lt;/p>
&lt;/li>
&lt;li>
&lt;p>We seldom start with a greenfield&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;em>Being honest:&lt;/em>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;em>To the code&lt;/em>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;em>To your colleagues&lt;/em>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;em>To yourself about the code&lt;/em>&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;em>Refactor your code before checking it in&lt;/em>&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h3 id="principle-of-least-surprise">Principle of Least Surprise&lt;/h3>
&lt;blockquote>
&lt;p>Any function or class should implement the behaviours that another programmer could reasonably expect&lt;/p>
&lt;p>Also called &lt;a href="https://en.wikipedia.org/wiki/Principle_of_least_astonishment">&lt;strong>principle of least astonishment&lt;/strong> (&lt;strong>POLA&lt;/strong>)&lt;/a>&lt;/p>
&lt;p>&amp;ldquo;If a necessary feature has a high astonishment factor, it may be necessary to redesign the feature.&amp;rdquo;&lt;/p>
&lt;/blockquote>
&lt;ul>
&lt;li>If &lt;strong>obvious behaviour&lt;/strong> remains unimplemented, readers and users&amp;hellip;
&lt;ul>
&lt;li>no longer depend on their &lt;strong>intuition&lt;/strong> about function names&lt;/li>
&lt;li>fall back on reading internals&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="coding-conventions">Coding Conventions&lt;/h3>
&lt;h4 id="naming">Naming&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>Standardised (with respect to a project or team)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Meaningful, i.e. clear for everyone&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Intention-revealing&lt;/strong>:&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-11-15%2015.36.36.png" alt="截屏2020-11-15 15.36.36" style="zoom:67%;" />
&lt;/li>
&lt;li>
&lt;p>Make meaningful distinction and avoid disinformation&lt;/p>
&lt;ul>
&lt;li>Hints on &lt;strong>context&lt;/strong>&lt;/li>
&lt;li>Hints on &lt;strong>types&lt;/strong>&lt;/li>
&lt;li>Certain &lt;strong>prefixes&lt;/strong>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Avoid &lt;strong>noninformation&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Except for well-accepted cases (e.g. &lt;code>i&lt;/code> as a loop counter)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
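&lt;p>&lt;em>A small sketch of an intention-revealing rename (the domain and names are invented for illustration):&lt;/em> both methods compute the same thing, but only one tells the reader what and why:&lt;/p>

```java
class NamingDemo {
    // Unclear: what do d and t stand for? A reader must guess.
    static int f(int d, int t) { return d - t; }

    // Intention-revealing version of the same computation:
    // the names carry the context, no comment needed.
    static int remainingVacationDays(int totalVacationDays, int daysTaken) {
        return totalVacationDays - daysTaken;
    }
}
```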
&lt;h4 id="commenting">Commenting&lt;/h4>
&lt;blockquote>
&lt;p>“Don’t comment bad code—rewrite it.“&lt;/p>
&lt;p>&lt;em>— B. W. Kernighan, P. J. Plaugher&lt;/em>&lt;/p>
&lt;/blockquote>
&lt;p>Good comments are&lt;/p>
&lt;ul>
&lt;li>&lt;strong>explaining&lt;/strong>
&lt;ul>
&lt;li>Legal issues&lt;/li>
&lt;li>Performance issues&lt;/li>
&lt;li>Train of thought&lt;/li>
&lt;li>Intent&lt;/li>
&lt;li>Algorithms&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Good comments are &lt;strong>warning&lt;/strong>
&lt;ul>
&lt;li>Of consequences&lt;/li>
&lt;li>Over importance&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Good comments are &lt;strong>informative&lt;/strong>
&lt;ul>
&lt;li>Open issues, to-dos&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;p>Whenever possible, use &lt;strong>well-named code&lt;/strong> to tell what is done&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Intermediate variables explaining steps&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Extra methods encapsulating expressions&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2020-11-15%2015.44.54.png" alt="截屏2020-11-15 15.44.54">&lt;/p>
&lt;/li>
&lt;/ul>
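&lt;p>&lt;em>An illustrative sketch (classes and thresholds invented for the example):&lt;/em> an extra method encapsulating the expression replaces the comment that would otherwise explain it:&lt;/p>

```java
class Employee {
    int age;
    int yearsOfService;
    Employee(int age, int yearsOfService) {
        this.age = age;
        this.yearsOfService = yearsOfService;
    }
}

class PensionCheck {
    // Before: callers wrote
    //   if (e.age > 65 && e.yearsOfService > 10)  // check pension eligibility
    // After: the method name says what the condition means.
    static boolean isEligibleForPension(Employee e) {
        return e.age > 65 && e.yearsOfService > 10;
    }
}
```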
&lt;h4 id="formatting">&lt;strong>Formatting&lt;/strong>&lt;/h4>
&lt;ul>
&lt;li>Visually representing levels of cohesion&lt;/li>
&lt;li>&lt;strong>Vertical&lt;/strong> openness between concepts,
&lt;ul>
&lt;li>e.g. declarations&lt;/li>
&lt;li>e.g. add blank lines after imports or after a method is finished&lt;/li>
&lt;li>lines that are &lt;strong>related&lt;/strong> should be written &lt;strong>densely together&lt;/strong>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Horizontal&lt;/strong> openness
&lt;ul>
&lt;li>to accentuate operators / operator precedence&lt;/li>
&lt;li>to separate parameters&lt;/li>
&lt;li>use spaces to emphasize elements and indent to make scopes visible&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="dont-repeat-yourself-dry">Don’t repeat yourself (DRY)&lt;/h3>
&lt;blockquote>
&lt;p>Do not duplicate pieces of code!&lt;/p>
&lt;/blockquote>
&lt;ul>
&lt;li>Copy &amp;amp; paste decreases&amp;hellip;
&lt;ul>
&lt;li>&lt;strong>Maintainability&lt;/strong>: Losing track of copies&lt;/li>
&lt;li>&lt;strong>Understandability&lt;/strong>
&lt;ul>
&lt;li>Code is less compact&lt;/li>
&lt;li>An identical concept needs to be understood multiple times&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Evolvability&lt;/strong>
&lt;ul>
&lt;li>Need to find and modify all copies when removing bugs or changing behaviour&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Duplicated code fosters errors and inconsistencies&lt;/li>
&lt;/ul>
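&lt;p>&lt;em>A minimal sketch (the price-formatting example is invented):&lt;/em> instead of pasting the same formatting expression at several call sites, the concept lives in one helper, so a bug fix or behaviour change happens in exactly one place:&lt;/p>

```java
class Prices {
    // Before refactoring, this expression was copy-pasted at several call sites:
    //   String label = name + ": " + (cents / 100) + "." + (cents % 100) + " EUR";
    // (note the latent bug: 105 cents would render as "1.5" instead of "1.05")
    static String formatPrice(String name, int cents) {
        return name + ": " + (cents / 100) + "."
                + String.format("%02d", cents % 100) + " EUR";
    }
}
```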
&lt;h3 id="keep-it-simple-stupid-kiss">Keep it simple, stupid (KISS)&lt;/h3>
&lt;blockquote>
&lt;p>“Make everything as &lt;strong>simple&lt;/strong> as possible, but not simpler”&lt;/p>
&lt;p>— &lt;em>Albert Einstein&lt;/em>&lt;/p>
&lt;/blockquote>
&lt;ul>
&lt;li>
&lt;p>Good code is easy to understand by anybody&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Good code addresses the problem adequately&lt;/p>
&lt;/li>
&lt;li>
&lt;p>For example, if an &lt;code>IEnumerable&lt;/code> is suitable, do not use an &lt;code>ICollection&lt;/code> or even an &lt;code>IList&lt;/code>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Techniques which help ensure that your code is understandable by others:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Code reviews&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Pair programming&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="you-aint-gonna-need-it-yagni">You ain’t gonna need it (YAGNI)&lt;/h3>
&lt;blockquote>
&lt;p>Only implement required features!&lt;/p>
&lt;/blockquote>
&lt;ul>
&lt;li>
&lt;p>Featurism is &lt;strong>costly&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>unrequested features need to be tested, documented&lt;/p>
&lt;/li>
&lt;li>
&lt;p>over-engineered systems sacrifice maintainability, as they are overly complex (KISS)&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Beware of &lt;strong>optimisations&lt;/strong>!&lt;/p>
&lt;ul>
&lt;li>Often merely treat symptoms&lt;/li>
&lt;li>Too costly to be done prematurely&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="single-level-of-abstraction-sla">Single Level of Abstraction (SLA)&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>Newspaper metaphor:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Good newspaper articles are well-ordered&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Navigation with details increasing:&lt;/p>
&lt;ul>
&lt;li>headline (very high abstraction)&lt;/li>
&lt;li>text with synopsis (high abstraction)&lt;/li>
&lt;li>rest (details)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Statements within a function&lt;/strong> should be at the same abstraction level&lt;/p>
&lt;ul>
&lt;li>if not, extract expressions/statements of higher detail into an own method&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Functions in a class&lt;/strong>: The abstraction level should decrease depth-first when reading from top to bottom&lt;/p>
&lt;/li>
&lt;/ul>
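&lt;p>&lt;em>A small sketch of the newspaper metaphor in code (the order-processing domain is invented):&lt;/em> the top-level method reads like a headline, and each statement stays at the same abstraction level, with the details extracted one level down:&lt;/p>

```java
class OrderProcessor {
    // "Headline" level: every statement is equally abstract.
    static String process(int itemCents, int quantity) {
        int total = totalCents(itemCents, quantity);
        return receiptLine(total);
    }

    // Detail level: the arithmetic lives here, not in process().
    private static int totalCents(int itemCents, int quantity) {
        return itemCents * quantity;
    }

    // Detail level: the formatting lives here.
    private static String receiptLine(int totalCents) {
        return "total=" + totalCents;
    }
}
```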
&lt;h2 id="refactoring">Refactoring&lt;/h2>
&lt;blockquote>
&lt;p>&lt;em>If it stinks, change it.&lt;/em>&lt;/p>
&lt;/blockquote>
&lt;ul>
&lt;li>
&lt;p>Methods tend to grow during development&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Bad odour (smell) of a long method arises&lt;/p>
&lt;/li>
&lt;li>
&lt;p>What to do? Extract cohesive parts into new methods&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2020-11-15%2016.15.51.png" alt="截屏2020-11-15 16.15.51">&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h3 id="what-is-refactoring">What is Refactoring?&lt;/h3>
&lt;blockquote>
&lt;p>A „&lt;strong>disciplined technique for restructuring&lt;/strong> an existing body of code, altering its internal structure &lt;strong>without changing its external behavior&lt;/strong>.“&lt;/p>
&lt;p>&lt;em>— M. Fowler&lt;/em>&lt;/p>
&lt;/blockquote>
&lt;h3 id="the-first-rule-in-refactoring">&lt;strong>The First Rule in Refactoring&lt;/strong>&lt;/h3>
&lt;p>&lt;strong>Refactor with tests only!&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Good tests help to prevent introducing bugs into the program through refactoring&lt;/li>
&lt;/ul>
&lt;h3 id="bad-smells">Bad Smells&lt;/h3>
&lt;p>Bad code smells: symptoms of deeper problems&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Long method&lt;/strong>: having code blocks led by comments&lt;/p>
&lt;ul>
&lt;li>👨‍⚕️Cure: Extract Method: extract commented block&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Duplicated code&lt;/strong>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Feature envy&lt;/strong>: class A excessively calls another class B’s methods&lt;/p>
&lt;ul>
&lt;li>👨‍⚕️Cure: parts of A’s methods want to be in class B
&lt;ol>
&lt;li>Extract Method: extract code block calling class B&lt;/li>
&lt;li>Move Method: move extracted part to class B&lt;/li>
&lt;/ol>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Data class&lt;/strong>: class merely holds data (&amp;ldquo;dumb data holder&amp;rdquo;)&lt;/p>
&lt;ul>
&lt;li>👨‍⚕️Cure: enforce information hiding principle, collect functionality
&lt;ol>
&lt;li>Encapsulate field: getter/setter instead of public access&lt;/li>
&lt;li>Remove setting method: only for read-only values&lt;/li>
&lt;li>Move method: collect functionality implemented elsewhere
&lt;ul>
&lt;li>think about responsibilities of the class&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Large/God class&lt;/strong>: class tries to do too much&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Inappropriate intimacy&lt;/strong>: class has dependencies on implementation details of another class&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&amp;hellip;&lt;/p>
&lt;/li>
&lt;/ul>
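&lt;p>&lt;em>The extract-then-move cure for feature envy can be sketched as follows (illustrative classes, not from the lecture):&lt;/em> the distance computation envied &lt;code>Point&lt;/code>’s data, so after Extract Method and Move Method it lives next to that data:&lt;/p>

```java
class Point {
    final double x, y;
    Point(double x, double y) { this.x = x; this.y = y; }

    // After Extract Method + Move Method:
    // the computation sits with the data it uses.
    double distanceTo(Point other) {
        double dx = x - other.x;
        double dy = y - other.y;
        return Math.sqrt(dx * dx + dy * dy);
    }
}

class Route {
    // Before: Route computed sqrt((a.x-b.x)^2 + (a.y-b.y)^2) inline,
    // reaching into Point's fields (feature envy).
    static double legLength(Point a, Point b) {
        return a.distanceTo(b);
    }
}
```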
&lt;blockquote>
&lt;p>For the full catalog, see: &lt;a href="https://www.refactoring.com/catalog/index.html">https://www.refactoring.com/catalog/index.html&lt;/a>&lt;/p>
&lt;/blockquote>
&lt;h3 id="when-to-refactor">&lt;strong>When to Refactor?&lt;/strong>&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>It is not that simple to find out &lt;strong>when to refactor&lt;/strong>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>So-called “&lt;strong>bad smells&lt;/strong>” in code may give a good indication when refactoring is worthwhile&lt;/p>
&lt;/li>
&lt;li>
&lt;p>More general guidelines&lt;/p>
&lt;ul>
&lt;li>when you find yourself looking up details frequently
&lt;ul>
&lt;li>&lt;em>what was the order of the method parameters again?&lt;/em>&lt;/li>
&lt;li>&lt;em>where was this method again and what does it do?&lt;/em>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>when you feel the need to write a &lt;strong>comment&lt;/strong>
&lt;ul>
&lt;li>&lt;strong>try to refactor the code so that the comment becomes superfluous&lt;/strong>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="limitations">Limitations&lt;/h3>
&lt;ul>
&lt;li>May influence performance negatively 🤪
&lt;ul>
&lt;li>However, it is recommended to do the refactoring first&lt;/li>
&lt;li>and the performance tuning on the cleaner code afterwards&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h2 id="appendix">Appendix&lt;/h2>
&lt;h3 id="separation-of-concerns-soc">Separation of Concerns (SoC)&lt;/h3>
&lt;blockquote>
&lt;p>Each module should be focused on a single concern.&lt;/p>
&lt;/blockquote>
&lt;ul>
&lt;li>👍 Benefits
&lt;ul>
&lt;li>Loose coupling, high cohesion&lt;/li>
&lt;li>Better testability: each test stays focused on one module&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Some concerns may crosscut a system‘s core concerns
&lt;ul>
&lt;li>Typical crosscutting concerns:
&lt;ul>
&lt;li>Tracing/Logging&lt;/li>
&lt;li>Security&lt;/li>
&lt;li>Transactionality&lt;/li>
&lt;li>Caching&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Aspect Oriented Programming (AOP) provides adequate concepts&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="order-of-implementation">Order of Implementation&lt;/h3>
&lt;p>For the implementation (and unit testing later) always try to &lt;strong>move from the least-coupled to the most-coupled classes&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>avoids unnecessary creation of “stubs”&lt;/li>
&lt;/ul>
&lt;p>Example&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2020-11-15%2016.47.49.png" alt="截屏2020-11-15 16.47.49">&lt;/p>
&lt;h3 id="use-a-version-control-system">Use a Version Control System&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>History&lt;/strong> of commented changes&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Shared working in a team, even on the same artefacts; branching and merging&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Tagging&lt;/strong> versions as pre-release etc.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Reverting&lt;/strong> to previous revisions&lt;/p>
&lt;ul>
&lt;li>
&lt;p>reduces fears of breaking code&lt;/p>
&lt;/li>
&lt;li>
&lt;p>encourages a programmer‘s willingness to refactor code&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="test-first">Test First&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>Test-Driven Development&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Clean tests should follow the &lt;strong>F.I.R.S.T.&lt;/strong> rules&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>F&lt;/strong>ast: to run them frequently&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>I&lt;/strong>ndependent: A failing test does not influence others&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>R&lt;/strong>epeatable: in any environment, so there is no excuse for failing tests&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>S&lt;/strong>elf-Validating: Tests either pass or fail automatically&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>T&lt;/strong>imely: Tests are written right before production code&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Tests should follow same standards as production code&lt;/p>
&lt;ul>
&lt;li>and be executed in a continuous manner
&lt;ul>
&lt;li>so-called continuous integration&lt;/li>
&lt;li>reduces fear of breaking code&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="static-code-analyses">Static Code Analyses&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>Classes of metrics&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Duplication (detection of DRY violation)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Unit tests (test coverage should be &amp;gt; 90%)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Complexity (avg. LoC per class)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Potential bugs&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Coding rules&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Comments&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Architecture &amp;amp; design&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h3 id="code-reviews">Code Reviews&lt;/h3>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Explaining&lt;/strong> your code to others helps&amp;hellip;&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Detecting errors&lt;/strong> and unclean passages&lt;/li>
&lt;li>&lt;strong>Spreading knowledge&lt;/strong> through a team,&lt;/li>
&lt;li>esp. to less experienced colleagues
&lt;ul>
&lt;li>
&lt;p>about design principles&lt;/p>
&lt;/li>
&lt;li>
&lt;p>about further aspects of the system under development&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Refactoring&lt;/strong> helps to &lt;strong>instantly apply&lt;/strong> suggestions, so follow-up ideas can be given in one session&lt;/p>
&lt;ul>
&lt;li>Works only in small groups with few opinions&lt;/li>
&lt;li>In larger groups, &lt;strong>design reviews&lt;/strong> are better suited&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul></description></item></channel></rss>