<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Face | Haobin Tan</title><link>https://haobin-tan.netlify.app/tags/face/</link><atom:link href="https://haobin-tan.netlify.app/tags/face/index.xml" rel="self" type="application/rss+xml"/><description>Face</description><generator>Hugo Blox Builder (https://hugoblox.com)</generator><language>en-us</language><lastBuildDate>Thu, 18 Feb 2021 00:00:00 +0000</lastBuildDate><image><url>https://haobin-tan.netlify.app/media/icon_hu7d15bc7db65c8eaf7a4f66f5447d0b42_15095_512x512_fill_lanczos_center_3.png</url><title>Face</title><link>https://haobin-tan.netlify.app/tags/face/</link></image><item><title>Face Detection: Color-Based</title><link>https://haobin-tan.netlify.app/docs/ai/computer-vision/cv-lecture/03-face-detection-color/</link><pubDate>Fri, 06 Nov 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/computer-vision/cv-lecture/03-face-detection-color/</guid><description>&lt;h2 id="tldr">TL;DR&lt;/h2>
&lt;ul>
&lt;li>
&lt;p>Different color spaces and classifiers can be used&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Models: histograms, Gaussian Models, Mixture of Gaussians Model&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Histogram-backprojection / Histogram matching&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Bayes classifier&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Discriminative Classifiers (ANN, SVM)&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Bayesian classifier and ANN seem to work well&lt;/p>
&lt;ul>
&lt;li>Sufficient training data is needed for modeling the pdf, in particular for the Bayesian approach (positive &amp;amp; negative pdfs are learned)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Advantages: Fast, rotation &amp;amp; scale invariant, robust against occlusions&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Disadvantages:&lt;/p>
&lt;ul>
&lt;li>Affected by illumination&lt;/li>
&lt;li>Cannot distinguish head and hands&lt;/li>
&lt;li>Skin-colored objects in the background problematic&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Metric: ROC curve used to compare classification results / methods&lt;/p>
&lt;/li>
&lt;/ul>
&lt;hr>
&lt;h2 id="color-based-face-detection-overview">Color-based face detection overview&lt;/h2>
&lt;p>💡 &lt;strong>Idea: human skin has a consistent color that is distinct from that of many other objects&lt;/strong>&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2020-11-10%2014.57.37.png" alt="截屏2020-11-10 14.57.37">&lt;/p>
&lt;p>Possible approach:&lt;/p>
&lt;ol>
&lt;li>Find skin-colored pixels&lt;/li>
&lt;li>Group skin-colored pixels (and apply some heuristics) to find the face&lt;/li>
&lt;/ol>
&lt;h2 id="color">Color&lt;/h2>
&lt;ul>
&lt;li>&lt;strong>Grayscale&lt;/strong> Image: Each pixel represented by &lt;strong>one&lt;/strong> number (typically integer between 0 and 255)&lt;/li>
&lt;li>&lt;strong>Color&lt;/strong> image: Pixels represented by &lt;strong>three&lt;/strong> numbers&lt;/li>
&lt;/ul>
&lt;p>Different representations exist &amp;ndash;&amp;gt; &amp;ldquo;Color Spaces&amp;rdquo;&lt;/p>
&lt;h3 id="color-spaces">Color spaces&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>RGB&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>most widely used&lt;/p>
&lt;/li>
&lt;li>
&lt;p>specifies colors in terms of the primary colors &lt;strong>red (R), green (G), and blue (B)&lt;/strong>&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2020-11-10%2015.00.08-20201110184617048.png" alt="截屏2020-11-10 15.00.08">&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>HSV/HSI&lt;/strong>: &lt;strong>hue (H)&lt;/strong>, &lt;strong>saturation (S)&lt;/strong> and &lt;strong>value(V)/intensity (I)&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Closely related to human perception (hue, colorfulness and brightness)&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2020-11-10%2017.27.38.png" alt="截屏2020-11-10 17.27.38">&lt;/p>
&lt;ul>
&lt;li>Hue: &amp;ldquo;color&amp;rdquo;&lt;/li>
&lt;li>Saturation: how &amp;ldquo;pure&amp;rdquo; the color is&lt;/li>
&lt;li>Value: &amp;ldquo;lightness&amp;rdquo;&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Class Y spaces&lt;/strong>: YCbCr (Digital Video), YIQ (NTSC), YUV (PAL)&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Y channel contains brightness, other two channels store chrominance (U=B-Y, V=R-Y)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Conversion from RGB to Yxx is a linear transformation&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2020-11-10%2018.18.27.png" alt="截屏2020-11-10 18.18.27">&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Perceptually uniform spaces&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Perceived color difference is proportional to the difference in color values&lt;/li>
&lt;li>Euclidean distance can be used for color comparison&lt;/li>
&lt;/ul>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2020-11-10%2018.19.07.png" alt="截屏2020-11-10 18.19.07">&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Chromatic Color Spaces&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Two color channels containing chrominance (color) information&lt;/p>
&lt;ul>
&lt;li>HS (taken from HSV)&lt;/li>
&lt;li>UV (taken from YUV)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Normalized rg from RGB:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>r = R / (R+G+B)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>g = G / (R+G+B)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>b = B / (R+G+B)&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Sometimes it is argued that chromatic skin color models are more robust&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
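&lt;p>The normalized-rg conversion above can be sketched in NumPy (function name and array shapes are illustrative):&lt;/p>

```python
import numpy as np

def rgb_to_normalized_rg(image):
    """Convert an HxWx3 RGB image to normalized rg chromaticity.

    r = R/(R+G+B), g = G/(R+G+B); b = B/(R+G+B) is redundant
    since r + g + b = 1.
    """
    rgb = image.astype(np.float64)
    total = rgb.sum(axis=2, keepdims=True)
    total[total == 0] = 1.0  # avoid division by zero on black pixels
    normalized = rgb / total
    return normalized[..., 0], normalized[..., 1]
```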
&lt;h4 id="problems">Problems&lt;/h4>
&lt;ul>
&lt;li>Reflected color depends on spectrum of the light source (and properties of the object / surface)&lt;/li>
&lt;li>If the light source / illumination changes, the reflected color signal changes!!! 🤪&lt;/li>
&lt;/ul>
&lt;h2 id="how-to-model-skin-color">How to model skin color?&lt;/h2>
&lt;ul>
&lt;li>
&lt;p>&lt;a href="#histogram-as-skin-color-model">Non-parametric models: typically histograms&lt;/a>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;a href="#parametric-models">Parametric models&lt;/a>&lt;/p>
&lt;ul>
&lt;li>Gaussian Model&lt;/li>
&lt;li>Gaussian Mixture Model&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Or just learn decision boundaries between classes (&lt;a href="#discriminative-models--classifiers">discriminative model&lt;/a>)&lt;/p>
&lt;ul>
&lt;li>ANN, SVM, &amp;hellip;&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="histogram-as-skin-color-model">Histogram as skin color model&lt;/h3>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2020-11-10%2018.34.57.png" alt="截屏2020-11-10 18.34.57">&lt;/p>
&lt;ul>
&lt;li>👍 Advantages: Works very well in practice&lt;/li>
&lt;li>👎 Disadvantages
&lt;ul>
&lt;li>Memory size quickly gets high&lt;/li>
&lt;li>A large number of labelled skin and non-skin samples is needed!&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h4 id="histogram-backprojection">Histogram Backprojection&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>The simplest (and fastest) way to utilize histogram information&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Each pixel in the backprojection is set to the value of the (skin-color) histogram bin indexed by the color of the respective pixel&lt;/p>
&lt;ul>
&lt;li>A color $x$ is considered as skin color if $H\_{+}(x) > \theta$&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>E.g.&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-07-22%2022.20.33.png" alt="截屏2021-07-22 22.20.33">&lt;/p>
&lt;/li>
&lt;/ul>
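&lt;p>A minimal backprojection sketch, assuming a pre-computed normalized skin-color histogram over a two-channel chromaticity image (names and bin count are illustrative):&lt;/p>

```python
import numpy as np

def backproject(image, skin_hist, bins=32, theta=0.1):
    """Histogram backprojection: each pixel gets the value of the
    skin-color histogram bin indexed by its color; pixels whose
    bin value exceeds theta are classified as skin.

    image: HxWx2 chromaticity image with values in [0, 1)
    skin_hist: bins x bins normalized skin-color histogram
    """
    idx = np.clip((image * bins).astype(int), 0, bins - 1)
    bp = skin_hist[idx[..., 0], idx[..., 1]]  # look up H_+(x) per pixel
    return bp, bp > theta                     # backprojection and skin mask
```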
&lt;h4 id="histogram-matching">Histogram Matching&lt;/h4>
&lt;ul>
&lt;li>Backprojection
&lt;ul>
&lt;li>is good when the color distribution of the target is monomodal.&lt;/li>
&lt;li>is not optimal when the target is multi-colored! &amp;#x1f622;&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>🔧 Solution: Build a histogram of the image within the search window, and compare it to the target histogram.
&lt;ul>
&lt;li>distance metrics for histograms, e.g.:
&lt;ul>
&lt;li>
&lt;p>Bhattacharyya distance&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Histogram intersection&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Earth mover&amp;rsquo;s distance, &amp;hellip;&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
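&lt;p>The first two metrics can be sketched for normalized histograms (function names are illustrative):&lt;/p>

```python
import numpy as np

def bhattacharyya(h1, h2):
    """Bhattacharyya distance between two normalized histograms
    (0 for identical distributions)."""
    bc = np.sum(np.sqrt(h1 * h2))  # Bhattacharyya coefficient
    return np.sqrt(max(0.0, 1.0 - bc))

def intersection(h1, h2):
    """Histogram intersection: 1.0 for identical normalized histograms."""
    return np.sum(np.minimum(h1, h2))
```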
&lt;h4 id="histogram-backprojection-vs-matching">Histogram Backprojection vs. Matching&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>Histogram Backprojection&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Compares color of a single pixel with color model&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Fast and simple&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Can only cope well with mono-modal distributions&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Sufficient for skin-color classification&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Histogram Matching / Intersection&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Compares color histogram of image patch with color model&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Better performance&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Can cope with multi-modal distributions&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Computationally expensive&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="parametric-models">Parametric models&lt;/h3>
&lt;h4 id="gaussian-density-models">Gaussian Density Models&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>Gaussian Densities&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Assume that the distribution of skin colors p(x) has a parametric functional form&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Most common function: Gaussian function $\mathrm{G}(\mathbf{x} ; \mu, \mathbf{C})$
&lt;/p>
$$
p(x | \text{skin})=G(x ; \mu, C)=\frac{1}{(2 \pi)^{d / 2}|C|^{1 / 2} }\exp \left\\{-1 / 2(x-\mu)^{\top} C^{-1}(x-\mu)\right\\}
$$
&lt;ul>
&lt;li>Mean $\mu$ and covariance matrix $C$ are estimated from a training set of skin colors $S = \{x\_1, x\_2, \ldots, x\_N\}$:
&lt;ul>
&lt;li>$\mu = E\{x\}$&lt;/li>
&lt;li>$C = E\{(\boldsymbol{x}-\mu)(\boldsymbol{x}-\mu)^T\}$&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>A color is considered as skin color if&lt;/p>
&lt;ul>
&lt;li>$p(x|\text{skin}) > \theta$&lt;/li>
&lt;li>$p(x|\text{skin}) > p(x|\text{non-skin})$&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
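&lt;p>A sketch of fitting and evaluating the Gaussian skin-color model with NumPy (names are illustrative; assumes the skin samples are given as an N x d array):&lt;/p>

```python
import numpy as np

def fit_gaussian(samples):
    """Estimate mean and covariance from skin-color samples (N x d)."""
    mu = samples.mean(axis=0)
    C = np.cov(samples, rowvar=False)
    return mu, C

def gaussian_density(x, mu, C):
    """Evaluate p(x|skin) = G(x; mu, C) for a single color vector x."""
    d = mu.shape[0]
    diff = x - mu
    norm = 1.0 / ((2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(C)))
    return norm * np.exp(-0.5 * diff @ np.linalg.inv(C) @ diff)
```

A color is then classified as skin if this density exceeds a threshold $\theta$.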
&lt;h4 id="mixture-of-gaussian-models">Mixture of Gaussian Models&lt;/h4>
$$
p(x)=\sum\_{i=1}^{K} \pi\_{i} G\left(x, \mu\_{i}, C\_{i}\right)
$$
&lt;ul>
&lt;li>
&lt;p>Parameter set $\Phi$ can be estimated using the &lt;strong>EM&lt;/strong> algorithm&lt;/p>
&lt;ul>
&lt;li>Iteratively changes parameters so as to maximize the log-likelihood of the training set:
$$
L=\log \prod\_{i=1}^{N} p\left(x\_{i} \mid \Phi\right)
$$&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>A color is considered as skin color if&lt;/p>
&lt;ul>
&lt;li>$p(x|\text{skin}) > \theta$&lt;/li>
&lt;li>$p(x|\text{skin}) > p(x|\text{non-skin})$&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
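&lt;p>A sketch of the mixture model, assuming scikit-learn&amp;rsquo;s EM implementation is available (names and the choice of K are illustrative):&lt;/p>

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_skin_gmm(skin_pixels, K=3, seed=0):
    """Fit p(x|skin) as a mixture of K Gaussians via EM."""
    gmm = GaussianMixture(n_components=K, covariance_type='full',
                          random_state=seed)
    gmm.fit(skin_pixels)
    return gmm

def is_skin(gmm, x, theta):
    """score_samples returns log p(x); compare exp(log p(x)) to theta."""
    return np.exp(gmm.score_samples(np.atleast_2d(x)))[0] > theta
```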
&lt;h4 id="bayes-classifier">Bayes Classifier&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>Skin Classification using &lt;strong>Bayes Decision Rule&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Minimum cost decision rule&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Classify pixel to skin class if $P(\text{Skin} | x)>P(\text{Non-Skin} | x)$&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Decision Rule:
&lt;/p>
$$
\frac{p(\mathbf{x} \mid \text {Skin})}{p(\mathbf{x} \mid \text {Non-Skin})} \geq \frac{P(\text {Non-Skin})}{P(\text {Skin})}
$$
&lt;/li>
&lt;li>
&lt;p>The class-conditional densities $p(x|\omega)$ can be estimated from the corresponding histograms:
&lt;/p>
$$
p\left(x \mid \omega\_{i}\right)=h\_{i}(x) / \sum\_{x} h\_{i}(x)
$$
&lt;ul>
&lt;li>$h\_i(x)$: count of pixels from class $\omega\_{i}$ that have value $x$&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
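&lt;p>The decision rule above can be sketched directly on the class histograms (names and the prior P(Skin) are illustrative):&lt;/p>

```python
import numpy as np

def bayes_skin_mask(indices, h_skin, h_nonskin, p_skin=0.3):
    """Bayes decision rule on color histograms.

    indices: integer color-bin index per pixel (HxW)
    h_skin, h_nonskin: raw pixel counts h_i(x) per color bin
    Classify as skin if p(x|Skin)/p(x|Non-Skin) >= P(Non-Skin)/P(Skin).
    """
    eps = 1e-12
    p_x_skin = h_skin / max(h_skin.sum(), 1)           # p(x | skin)
    p_x_nonskin = h_nonskin / max(h_nonskin.sum(), 1)  # p(x | non-skin)
    ratio = p_x_skin[indices] / (p_x_nonskin[indices] + eps)
    return ratio >= (1 - p_skin) / p_skin
```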
&lt;h3 id="discriminative-models--classifiers">Discriminative Models / Classifiers&lt;/h3>
&lt;ul>
&lt;li>Artificial Neural Networks&lt;/li>
&lt;li>Support Vector Machine&lt;/li>
&lt;/ul>
&lt;h2 id="performance-measures">Performance Measures&lt;/h2>
&lt;h3 id="for-classification">For classification&lt;/h3>
&lt;p>When comparing recognition hypotheses with ground-truth annotations, one has to consider four cases:&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/confusion-matrix.png" alt="Measuring Performance: The Confusion Matrix – Glass Box" style="zoom: 40%;" />
&lt;blockquote>
&lt;p>More see: &lt;a href="https://haobin-tan.netlify.app/docs/ai/machine-learning/ml-fundamentals/evaluation/">Evaluation&lt;/a>&lt;/p>
&lt;/blockquote>
&lt;h4 id="roc-receiver-operating-characteristic">ROC (Receiver Operating Characteristic)&lt;/h4>
&lt;ul>
&lt;li>Used for the task of classification&lt;/li>
&lt;li>Measures the trade-off between true positive rate and false positive rate&lt;/li>
&lt;/ul>
$$
\begin{array}{l}
\text { true positive rate }=\frac{\mathrm{TP}}{\mathrm{Pos}}=\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}} \\\\
\text { false positive rate }=\frac{\mathrm{FP}}{\mathrm{Neg}}=\frac{\mathrm{FP}}{\mathrm{FP}+\mathrm{TN}}
\end{array}
$$
&lt;ul>
&lt;li>
&lt;p>Each prediction hypothesis generally has an associated probability value or score&lt;/p>
&lt;/li>
&lt;li>
&lt;p>The performance values can therefore be plotted into a graph for each possible score as a threshold&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Example:&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2020-11-12%2023.27.18.png" alt="截屏2020-11-12 23.27.18">&lt;/p>
&lt;/li>
&lt;/ul>
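&lt;p>The true/false positive rates above can be computed for every candidate threshold as follows (a sketch; names are illustrative):&lt;/p>

```python
import numpy as np

def roc_points(scores, labels):
    """Compute (FPR, TPR) pairs by sweeping every score as a threshold.

    scores: classifier score per sample; labels: 1 = positive, 0 = negative.
    """
    pos = labels.sum()
    neg = labels.size - pos
    points = []
    for t in np.unique(scores):
        predicted = scores >= t
        tp = np.logical_and(predicted, labels == 1).sum()
        fp = np.logical_and(predicted, labels == 0).sum()
        points.append((fp / neg, tp / pos))  # (false pos. rate, true pos. rate)
    return points
```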
&lt;h3 id="skin-color-analysis-and-comparison">Skin-color: Analysis and Comparison&lt;/h3>
&lt;p>Conclusions &lt;sup id="fnref:1">&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref">1&lt;/a>&lt;/sup>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Bayesian approach and MLP worked best&lt;/p>
&lt;ul>
&lt;li>Bayesian approach needs much more memory&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Approach is largely unaffected by choice of color space, but&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Results degraded when only chrominance channels were used&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h2 id="from-skin-colored-pixels-to-faces">From Skin-Colored Pixels to Faces&lt;/h2>
&lt;ul>
&lt;li>
&lt;p>Skin-colored pixels need to be grouped into object representations&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-11-13%2014.56.21.png" alt="截屏2020-11-13 14.56.21" style="zoom:80%;" />
&lt;/li>
&lt;li>
&lt;p>🔴 Problems:&lt;/p>
&lt;ul>
&lt;li>skin-colored background,&lt;/li>
&lt;li>further skin-colored body parts (hands, arms, &amp;hellip;),&lt;/li>
&lt;li>Noise, &amp;hellip;&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="perceptual-grouping">Perceptual Grouping&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Morphological Operators&lt;/strong>: Operators performing an action on shapes, where both the input and the output are binary images.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Threshold each pixel&amp;rsquo;s skin affiliation &amp;ndash;&amp;gt; Binary Image&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2020-11-13%2014.58.11.png" alt="截屏2020-11-13 14.58.11">&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Morphological Erosion&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;em>Remove&lt;/em> pixels from edges of objects&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Set pixel value to &lt;strong>min&lt;/strong> value of surrounding pixels&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2020-11-13%2015.00.53.png" alt="截屏2020-11-13 15.00.53">&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Morphological Dilation&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;em>Add&lt;/em> pixels to edges of objects&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Set pixel value to &lt;strong>max&lt;/strong> value of surrounding pixels&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2020-11-13%2015.41.11.png" alt="截屏2020-11-13 15.41.11">&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Morphological Opening&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Apply erosion, then dilation&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2020-11-13%2015.42.38.png" alt="截屏2020-11-13 15.42.38">&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Goal:&lt;/p>
&lt;ul>
&lt;li>Smooth outline&lt;/li>
&lt;li>Open small bridges&lt;/li>
&lt;li>Eliminate outliers&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Morphological Closing&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Apply dilation, then erosion&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2020-11-13%2015.45.25.png" alt="截屏2020-11-13 15.45.25">&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Goal:&lt;/p>
&lt;ul>
&lt;li>Smooth inner edges&lt;/li>
&lt;li>Connect small distances&lt;/li>
&lt;li>Fill unwanted holes&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Apply morphological closing, then morphological opening&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Resulting image is reduced to connected regions of skin color (blobs)&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2020-11-13%2015.59.57.png" alt="截屏2020-11-13 15.59.57">&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
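&lt;p>A pure-NumPy sketch of the min/max formulation of erosion and dilation (with a 3x3 neighborhood), combined into closing followed by opening as above (names are illustrative):&lt;/p>

```python
import numpy as np

def erode(mask):
    """Erosion: each pixel becomes the min of its 3x3 neighborhood
    (zero padding, so objects touching the border shrink too)."""
    padded = np.pad(mask, 1, constant_values=0)
    stack = [padded[i:i + mask.shape[0], j:j + mask.shape[1]]
             for i in range(3) for j in range(3)]
    return np.min(stack, axis=0)

def dilate(mask):
    """Dilation: each pixel becomes the max of its 3x3 neighborhood."""
    padded = np.pad(mask, 1, constant_values=0)
    stack = [padded[i:i + mask.shape[0], j:j + mask.shape[1]]
             for i in range(3) for j in range(3)]
    return np.max(stack, axis=0)

def clean_skin_mask(mask):
    """Closing (dilate, then erode) followed by opening (erode, then dilate)."""
    closed = erode(dilate(mask))
    return dilate(erode(closed))
```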
&lt;h3 id="from-skin-blobs-to-faces">From Skin Blobs To Faces&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>Goal: align bounding box around face candidate&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2020-11-13%2016.01.23.png" alt="截屏2020-11-13 16.01.23">&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Important for:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Face Recognition&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Head Pose Estimation&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Different approaches:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Choose cluster with biggest size&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Ellipse fitting (approximate face region by ellipse)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Heuristics to distinguish between different skin clusters&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Use temporal information (tracking)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Facial Feature Detection&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&amp;hellip;&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;div class="footnotes" role="doc-endnotes">
&lt;hr>
&lt;ol>
&lt;li id="fn:1">
&lt;p>S. L. Phung, A. Bouzerdoum and D. Chai, &amp;ldquo;Skin segmentation using color pixel classification: analysis and comparison,&amp;rdquo; in IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 1, pp. 148-154, Jan. 2005, doi: 10.1109/TPAMI.2005.17.&amp;#160;&lt;a href="#fnref:1" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;/ol>
&lt;/div></description></item><item><title>Face Detection: Neural-Network-Based</title><link>https://haobin-tan.netlify.app/docs/ai/computer-vision/cv-lecture/04-face-detection-ann/</link><pubDate>Fri, 13 Nov 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/computer-vision/cv-lecture/04-face-detection-ann/</guid><description>&lt;h2 id="motivation">Motivation&lt;/h2>
&lt;ul>
&lt;li>Idea: Use a search-window to scan over an image&lt;/li>
&lt;li>Train a classifier to decide whether the search window contains a face or not&lt;/li>
&lt;/ul>
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-100" >&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%e6%88%aa%e5%b1%8f2020-11-13%2016.16.57.png" alt="截屏2020-11-13 16.16.57" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h2 id="detection">Detection&lt;/h2>
&lt;h3 id="simple-neuron-model">Simple neuron model&lt;/h3>
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-100" >&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%e6%88%aa%e5%b1%8f2020-11-13%2016.20.47.png" alt="截屏2020-11-13 16.20.47" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h3 id="topologies">Topologies&lt;/h3>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-11-13%2016.21.15.png" alt="截屏2020-11-13 16.21.15" style="zoom:67%;" />
&lt;h3 id="parameters">Parameters&lt;/h3>
&lt;p>Adjustable Parameters are&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Connection weights (to be learned)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Activation function (fixed)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Number of layers (fixed)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Number of neurons per layer (fixed)&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h3 id="training">Training&lt;/h3>
&lt;p>Backpropagation with gradient descent&lt;/p>
&lt;h2 id="neural-network-based-face-detection1">Neural Network Based Face Detection&lt;sup id="fnref:1">&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref">1&lt;/a>&lt;/sup>&lt;/h2>
&lt;ul>
&lt;li>Idea: Use an artificial neural network to detect upright frontal faces
&lt;ul>
&lt;li>
&lt;p>Network receives as input a 20x20 pixel region of an image&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Output ranges from -1 (no face present) to +1 (face present)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>The neural network &amp;ldquo;face filter&amp;rdquo; is applied at every location in the image&lt;/p>
&lt;/li>
&lt;li>
&lt;p>To detect faces of different sizes, the input image is repeatedly scaled down&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="network-topology">Network Topology&lt;/h3>
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-100" >&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%e6%88%aa%e5%b1%8f2020-11-13%2016.28.33.png" alt="截屏2020-11-13 16.28.33" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;ul>
&lt;li>20x20 pixel input retina&lt;/li>
&lt;li>4 types of receptive hidden fields&lt;/li>
&lt;li>One real-valued output&lt;/li>
&lt;/ul>
&lt;h3 id="system-overview">System Overview&lt;/h3>
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-100" >&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%e6%88%aa%e5%b1%8f2020-11-13%2016.29.19.png" alt="截屏2020-11-13 16.29.19" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h3 id="network-training">Network Training&lt;/h3>
&lt;h4 id="training-set">Training Set&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>1050 normalized face images&lt;/p>
&lt;/li>
&lt;li>
&lt;p>15 additional face images generated from each original face image by rotation and scaling&lt;/p>
&lt;/li>
&lt;li>
&lt;p>1000 randomly chosen non-face images&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h4 id="preprocessing">Preprocessing&lt;/h4>
&lt;ul>
&lt;li>correct for different lighting conditions (overall brightness, shadows)&lt;/li>
&lt;li>rescale images to fixed size&lt;/li>
&lt;/ul>
&lt;h4 id="histogram-equalization">Histogram equalization&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>Defines a mapping of gray levels $p$ into gray levels $q$ such that the distribution of $q$ is close to being uniform&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Stretches contrast (expands the range of gray levels)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Transforms different input images so that they have similar intensity distributions (thus reducing the effect of different illumination)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Example&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-11-13%2016.32.18.png" alt="截屏2020-11-13 16.32.18" style="zoom:67%;" />
&lt;/li>
&lt;li>
&lt;p>Algorithm&lt;/p>
&lt;ul>
&lt;li>
&lt;p>The probability of an occurrence of a pixel of level $i$ in the image:
&lt;/p>
$$
p\left(x\_{i}\right)=\frac{n\_{i}}{n}, \qquad i \in 0, \ldots, L-1
$$
&lt;ul>
&lt;li>$L$: number of gray levels&lt;/li>
&lt;li>$n\_i$: number of occurrences of gray level $i$&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Define $c$ as the cumulative distribution function:
&lt;/p>
$$
c(i)=\sum\_{j=0}^{i} p\left(x\_{j}\right)
$$
&lt;/li>
&lt;li>
&lt;p>Create a transformation of the form
&lt;/p>
$$
y\_i = T(x\_i) = c(i), \qquad y\_i \in [0, 1]
$$
&lt;p>
This produces a level $y$ for each level $x$ in the original image, such that the cumulative probability function of $y$ is linearized across the value range. Finally, rescale to the output gray-level range:
&lt;/p>
$$
y\_{i}^{\prime}=y\_{i} \cdot(\max -\min )+\min
$$
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
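&lt;p>The algorithm above translates directly into NumPy (a sketch; names are illustrative):&lt;/p>

```python
import numpy as np

def equalize(image, L=256):
    """Histogram equalization following the steps above:
    p(x_i) = n_i / n, c(i) = cumulative sum of p,
    y'_i = c(i) * (max - min) + min.
    """
    hist = np.bincount(image.ravel(), minlength=L)
    p = hist / image.size              # p(x_i) = n_i / n
    c = np.cumsum(p)                   # cumulative distribution c(i)
    lo, hi = image.min(), image.max()
    mapping = c * (hi - lo) + lo       # rescale to the gray-level range
    return mapping[image].astype(np.uint8)
```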
&lt;h4 id="training-procedure">Training Procedure&lt;/h4>
&lt;ol>
&lt;li>
&lt;p>Randomly choose 1000 non-face images&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Train network to produce 1 for faces, -1 for non-faces&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Run network on images containing no faces. Collect subimages in which the network incorrectly identifies a face (output &amp;gt; 0)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Select up to 250 of these &amp;ldquo;false positives&amp;rdquo; at random and add them to the training set as negative examples&lt;/p>
&lt;/li>
&lt;/ol>
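&lt;p>The bootstrapping procedure above can be sketched as a loop; &lt;code>train&lt;/code> and &lt;code>predict&lt;/code> are hypothetical stand-ins for the network training and filtering steps:&lt;/p>

```python
import random

def bootstrap_negatives(train, predict, faces, nonface_pool, rounds=3, cap=250):
    """Bootstrap training sketch: repeatedly train, collect false
    positives on face-free images, and add up to `cap` of them to the
    negative training set. (The pool may contain already-used negatives;
    a real implementation would mine fresh scenery images each round.)
    """
    negatives = random.sample(nonface_pool, min(1000, len(nonface_pool)))
    for _ in range(rounds):
        model = train(faces, negatives)
        false_pos = [x for x in nonface_pool if predict(model, x) > 0]
        random.shuffle(false_pos)
        negatives.extend(false_pos[:cap])
    return model, negatives
```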
&lt;h3 id="neural-network-based-face-filter">Neural Network Based Face Filter&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>Output of ANN defines a filter for faces&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Search&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Scan input image with search window, apply ANN to search window&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Input image needs to be rescaled in order to detect faces of different sizes&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Output needs to be post-processed&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Noise removal&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Merging overlapping detections&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Speed up can be achieved&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Increase step size&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Make ANN more flexible to translation&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Hierarchical, pyramidal search&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="localization-and-ground-truth">Localization and Ground-Truth&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>For localization, the test data is usually annotated with ground-truth bounding boxes&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Comparing hypotheses to Ground-Truth&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Overlap
&lt;/p>
$$
O = \frac{|\text{GT} \cap \text{DET}|}{|\text{GT} \cup \text{DET}|}
$$
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-100" >&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%e6%88%aa%e5%b1%8f2020-11-13%2016.43.11.png" alt="截屏2020-11-13 16.43.11" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;blockquote>
&lt;p>Also called &lt;strong>Intersection over Union (IoU)&lt;/strong>&lt;/p>
&lt;/blockquote>
&lt;/li>
&lt;li>
&lt;p>Often used as threshold: Overlap&amp;gt;50%&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
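&lt;p>For axis-aligned boxes, the overlap can be computed as follows (a sketch; the corner-coordinate box format is an assumption):&lt;/p>

```python
def iou(boxA, boxB):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(boxA[0], boxB[0]), max(boxA[1], boxB[1])
    ix2, iy2 = min(boxA[2], boxB[2]), min(boxA[3], boxB[3])
    iw, ih = max(0, ix2 - ix1), max(0, iy2 - iy1)
    inter = iw * ih
    union = ((boxA[2] - boxA[0]) * (boxA[3] - boxA[1])
             + (boxB[2] - boxB[0]) * (boxB[3] - boxB[1]) - inter)
    return inter / union if union > 0 else 0.0
```

A detection then counts as correct if, e.g., its IoU with the ground-truth box exceeds 0.5.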
&lt;div class="footnotes" role="doc-endnotes">
&lt;hr>
&lt;ol>
&lt;li id="fn:1">
&lt;p>&lt;em>Neural Network Based Face Detection, by Henry A. Rowley, Shumeet Baluja, and Takeo Kanade. IEEE Transactions on Pattern Analysis and Machine Intelligence, volume 20, number 1, pages 23-38, January 1998.&lt;/em>&amp;#160;&lt;a href="#fnref:1" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;/ol>
&lt;/div></description></item><item><title>Face Recognition: Traditional Approaches</title><link>https://haobin-tan.netlify.app/docs/ai/computer-vision/cv-lecture/05-face_recognition-traditional/</link><pubDate>Thu, 04 Feb 2021 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/computer-vision/cv-lecture/05-face_recognition-traditional/</guid><description>&lt;h2 id="face-recognition-for-human-computer-interaction-hci">Face Recognition for Human-Computer Interaction (HCI)&lt;/h2>
&lt;h3 id="main-problem">Main Problem&lt;/h3>
&lt;blockquote>
&lt;p>The variations between the images of the same face due to illumination and viewing direction are almost always larger than image variations due to change in face identity.&lt;/p>
&lt;p>&amp;ndash; Moses, Adini, Ullman, ECCV‘94&lt;/p>
&lt;/blockquote>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-04%2023.57.03.png" alt="截屏2021-02-04 23.57.03">&lt;/p>
&lt;h3 id="closed-set-vs-open-set-identification">Closed Set vs. Open Set Identification&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Closed-Set Identification&lt;/strong>
&lt;ul>
&lt;li>The system reports which person from the gallery is shown on the test image: Who is he?&lt;/li>
&lt;li>Performance metric: Correct identification rate&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Open-Set Identification&lt;/strong>
&lt;ul>
&lt;li>The system first decides whether the person on the test image is known or unknown. If he is a known person, who is he?&lt;/li>
&lt;li>Performance metric
&lt;ul>
&lt;li>&lt;strong>False accept&lt;/strong>: An invalid identity is accepted as one of the individuals in the database.&lt;/li>
&lt;li>&lt;strong>False reject&lt;/strong>: An individual is rejected even though he/she is present in the database.&lt;/li>
&lt;li>&lt;strong>False classify&lt;/strong>: An individual in the database is correctly accepted but misclassified as one of the other individuals in the training data&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="authenticationverification">Authentication/Verification&lt;/h3>
&lt;p>A person claims to be a particular member. The system decides whether the test image and the training image show the same person: Is he who he claims to be?&lt;/p>
&lt;p>Performance metric:&lt;/p>
&lt;ul>
&lt;li>False Reject Rate (FRR): Rate of rejecting a valid identity&lt;/li>
&lt;li>False Accept Rate (FAR): Rate of incorrectly accepting an invalid identity.&lt;/li>
&lt;/ul>
&lt;h2 id="feature-based-geometrical-approaches">Feature-based (Geometrical) approaches&lt;/h2>
&lt;p>&amp;ldquo;Face Recognition: Features versus Templates&amp;rdquo; &lt;sup id="fnref:1">&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref">1&lt;/a>&lt;/sup>&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-05 00.12.27.png" alt="截屏2021-02-05 00.12.27" style="zoom:67%;" />
&lt;ul>
&lt;li>Eyebrow thickness and vertical position at the eye center position&lt;/li>
&lt;li>A coarse description of the left eyebrow‘s arches&lt;/li>
&lt;li>Nose vertical position and width&lt;/li>
&lt;li>Mouth vertical position, width, and height of upper and lower lips&lt;/li>
&lt;li>Eleven radii describing the chin shape&lt;/li>
&lt;li>Face width at nose position&lt;/li>
&lt;li>Face width halfway between nose tip and eyes&lt;/li>
&lt;/ul>
&lt;h3 id="classification">Classification&lt;/h3>
&lt;p>&lt;strong>Nearest neighbor classifier&lt;/strong> with &lt;strong>Mahalanobis distance&lt;/strong> as the distance metric:
&lt;/p>
$$
\Delta_{j}(x)=\left(x-m_{j}\right)^{T} \Sigma^{-1}\left(x-m_{j}\right)
$$
&lt;ul>
&lt;li>$x$: input face image&lt;/li>
&lt;li>$m\_j$: average vector representing the $j$-th person&lt;/li>
&lt;li>$\Sigma$: Covariance matrix&lt;/li>
&lt;/ul>
&lt;p>Different people are characterized only by their average feature vector.&lt;/p>
&lt;p>The covariance matrix $\Sigma$ is shared by all classes and is estimated using all the examples in the training set.&lt;/p>
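&lt;p>This classifier can be sketched as follows (names are illustrative; assumes per-person mean feature vectors and a shared covariance matrix):&lt;/p>

```python
import numpy as np

def classify(x, means, cov):
    """Nearest-neighbor classification with the Mahalanobis distance
    Delta_j(x) = (x - m_j)^T Sigma^{-1} (x - m_j); returns the index of
    the person whose mean feature vector is closest."""
    inv = np.linalg.inv(cov)
    dists = [float((x - m) @ inv @ (x - m)) for m in means]
    return int(np.argmin(dists))
```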
&lt;h2 id="appearance-based-approaches">Appearance-based approaches&lt;/h2>
&lt;p>Can be either&lt;/p>
&lt;ul>
&lt;li>&lt;strong>&lt;a href="#holistic-appearance-based-approaches">holistic&lt;/a>&lt;/strong> (process the whole face as the input), or&lt;/li>
&lt;li>&lt;a href="#local-appearance-based-approach">&lt;strong>local / fiducial&lt;/strong>&lt;/a> (process facial features, such as eyes, mouth, etc. seperately)&lt;/li>
&lt;/ul>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-08%2010.28.36.png" alt="截屏2021-02-08 10.28.36">&lt;/p>
&lt;p>Processing steps: align faces with facial landmarks&lt;/p>
&lt;ul>
&lt;li>Use manually labeled or automatically detected eye centers&lt;/li>
&lt;li>Normalize face images to a common coordinate system, removing translation, rotation and scaling factors&lt;/li>
&lt;li>Crop off unnecessary background&lt;/li>
&lt;/ul>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-08%2010.30.38.png" alt="截屏2021-02-08 10.30.38">&lt;/p>
&lt;h2 id="holistic-appearance-based-approaches">Holistic appearance-based approaches&lt;/h2>
&lt;h3 id="eigenfaces">Eigenfaces&lt;/h3>
&lt;h4 id="-idea">💡 Idea&lt;/h4>
&lt;p>A face image defines a point in the high dimensional image space.&lt;/p>
&lt;p>Different face images share a number of similarities with each other&lt;/p>
&lt;ul>
&lt;li>They can be described by a relatively low dimensional subspace&lt;/li>
&lt;li>Project the face images into an appropriately chosen subspace and perform classification by similarity computation (distance, angle)
&lt;ul>
&lt;li>The dimensionality reduction procedure used here is called the &lt;mark>&lt;strong>Karhunen-Loève transformation&lt;/strong>&lt;/mark> or &lt;mark>&lt;strong>principal component analysis (PCA)&lt;/strong>&lt;/mark>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h4 id="objective">Objective&lt;/h4>
&lt;p>Find the vectors that best account for the distribution of face images within the entire image space&lt;/p>
&lt;h4 id="pca">PCA&lt;/h4>
&lt;blockquote>
&lt;p>For more details see: &lt;a href="https://haobin-tan.netlify.app/docs/ai/machine-learning/unsupervised/pca/">Principal Component Analysis (PCA)&lt;/a>&lt;/p>
&lt;/blockquote>
&lt;ul>
&lt;li>Find direction vectors so as to minimize the average projection error&lt;/li>
&lt;li>Project on the linear subspace spanned by these vectors&lt;/li>
&lt;li>Use covariance matrix to find these direction vectors&lt;/li>
&lt;li>Project on the largest K direction vectors to reduce dimensionality&lt;/li>
&lt;/ul>
&lt;p>PCA for eigenfaces:&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-09%2011.11.40.png" alt="截屏2021-02-09 11.11.40">
&lt;/p>
$$
\begin{array}{l}
Y=\left[y\_{1}, y\_{2}, y\_{3}, \ldots, y\_{K}\right] \\\\
m=\frac{1}{K}\sum y \\\\
C=(Y-m)(Y-m)^{T} \\\\
D=U^{T} C U \\\\
\Omega=U^{\top}(y-m)
\end{array}
$$
&lt;p>
where&lt;/p>
&lt;ul>
&lt;li>
&lt;p>$y$: Face image&lt;/p>
&lt;/li>
&lt;li>
&lt;p>$Y$: Face matrix&lt;/p>
&lt;/li>
&lt;li>
&lt;p>$m$: Mean face&lt;/p>
&lt;/li>
&lt;li>
&lt;p>$C$: Covariance matrix&lt;/p>
&lt;/li>
&lt;li>
&lt;p>$D$: Eigenvalues&lt;/p>
&lt;/li>
&lt;li>
&lt;p>$U$: Eigenvectors&lt;/p>
&lt;/li>
&lt;li>
&lt;p>$\Omega$: Representation coefficients&lt;/p>
&lt;/li>
&lt;/ul>
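&lt;p>The PCA steps above can be sketched as follows. Since $d \gg K$ for images, a standard trick (not spelled out in the notes) is to eigendecompose the small $K \times K$ matrix $(Y-m)^T(Y-m)$ and map its eigenvectors back to image space; the helper names are illustrative:&lt;/p>

```python
import numpy as np

def compute_eigenfaces(Y, M):
    """Y: (d, K) matrix whose columns are vectorized face images.
    Returns the mean face m and the M eigenvectors (eigenfaces) of
    C = (Y - m)(Y - m)^T with the largest eigenvalues."""
    m = Y.mean(axis=1, keepdims=True)
    A = Y - m
    # Eigenvectors of A^T A (K x K) give those of A A^T (d x d) cheaply:
    # if A^T A v = lam v, then A A^T (A v) = lam (A v)
    vals, vecs = np.linalg.eigh(A.T @ A)
    order = np.argsort(vals)[::-1][:M]
    U = A @ vecs[:, order]           # map back to image space
    U /= np.linalg.norm(U, axis=0)   # normalize eigenfaces
    return m, U

def project(y, m, U):
    """Representation coefficients Omega = U^T (y - m)."""
    return U.T @ (y - m)
```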
&lt;h4 id="training">Training&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>Acquire initial set of face images (training set):
&lt;/p>
$$
Y = [y\_1, y\_2, \dots, y\_K]
$$
&lt;/li>
&lt;li>
&lt;p>Calculate the eigenfaces/eigenvectors from the training set, keeping only the $M$ images/vectors corresponding to the highest eigenvalues
&lt;/p>
$$
U = (u\_1, u\_2, \dots, u\_M)
$$
&lt;/li>
&lt;li>
&lt;p>Calculate representation of each known individual $k$ in face space
&lt;/p>
$$
\Omega\_k = U^T(y\_k - m)
$$
&lt;/li>
&lt;/ul>
&lt;h4 id="testing">Testing&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>Project input new image &lt;em>y&lt;/em> into face space
&lt;/p>
$$
\Omega = U^T(y - m)
$$
&lt;/li>
&lt;li>
&lt;p>Find most likely candidate class $k$ by distance computation
&lt;/p>
$$
\epsilon\_k = \\|\Omega - \Omega\_k\\| \quad \text{for all } \Omega\_k
$$
&lt;/li>
&lt;/ul>
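&lt;p>Training and testing then reduce to projecting faces and comparing coefficient vectors. A hypothetical sketch, assuming a mean face &lt;code>m&lt;/code> and eigenface matrix &lt;code>U&lt;/code> have already been computed:&lt;/p>

```python
import numpy as np

def recognize(y, m, U, known_omegas):
    """Project a probe face into face space and return the index of the
    nearest known representation Omega_k plus its distance epsilon_k."""
    omega = U.T @ (y - m)
    eps = [np.linalg.norm(omega - ok) for ok in known_omegas]
    k = int(np.argmin(eps))
    return k, eps[k]
```

In practice the distance is also thresholded ($\epsilon\_k &lt; \theta\_{\epsilon}$) to reject unknown individuals.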
&lt;h4 id="projections-onto-the-face-space">&lt;strong>Projections onto the face space&lt;/strong>&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>Principal components are called “&lt;strong>eigenfaces&lt;/strong>” and they span the “face space”.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Images can be reconstructed by their projections in face space:&lt;/p>
&lt;/li>
&lt;/ul>
$$
Y\_f = \sum\_{i=1}^{M} \omega\_i u\_i
$$
&lt;p>Appearance of faces in face-space does not change a lot&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Difference of mean-adjusted image $(Y-m)$ and projection $Y\_f$ gives a measure of &lt;em>„faceness“&lt;/em>&lt;/p>
&lt;ul>
&lt;li>Distance from face space can be used to detect faces&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Different cases of projections onto face space&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-09%2011.38.26.png" alt="截屏2021-02-09 11.38.26" style="zoom:80%;" />
&lt;ul>
&lt;li>
&lt;p>Case 1: Projection of a &lt;em>known&lt;/em> individual&lt;/p>
&lt;p>$\rightarrow$ Near face space ($\epsilon &lt; \theta\_{\delta}$) and near known face $\Omega\_k$ ($\epsilon\_k &lt; \theta\_{\epsilon}$)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Case 2: Projection of an &lt;em>unknown&lt;/em> individual&lt;/p>
&lt;p>$\rightarrow$ Near face space, far from reference vectors&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Case 3 and 4: not a face (far from face space)&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
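&lt;p>The distance-from-face-space test above can be sketched directly: reconstruct the mean-adjusted image from its projection and measure the residual (a minimal illustration, assuming orthonormal eigenfaces &lt;code>U&lt;/code>):&lt;/p>

```python
import numpy as np

def distance_from_face_space(y, m, U):
    """Residual between the mean-adjusted image and its face-space
    reconstruction Y_f; a large residual suggests 'not a face'."""
    phi = y - m
    y_f = U @ (U.T @ phi)  # reconstruction from the coefficients omega_i
    return np.linalg.norm(phi - y_f)
```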
&lt;h4 id="pca-for-face-matching-and-recognition">PCA for face matching and recognition&lt;/h4>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-09%2011.53.26.png" alt="截屏2021-02-09 11.53.26">&lt;/p>
&lt;ul>
&lt;li>Projects all faces onto a &lt;strong>universal&lt;/strong> eigenspace to “encode” via principal components&lt;/li>
&lt;li>Uses inverse-distance as a similarity measure $S(p,g)$ for matching &amp;amp; recognition&lt;/li>
&lt;/ul>
&lt;h4 id="problems-and-shortcomings">Problems and shortcomings&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>Eigenfaces do NOT distinguish between shape and appearance&lt;/p>
&lt;/li>
&lt;li>
&lt;p>PCA does NOT use class information&lt;/p>
&lt;ul>
&lt;li>PCA projections are optimal for reconstruction from a low dimensional basis, they may not be optimal from a discrimination standpoint&lt;/li>
&lt;/ul>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-09%2011.58.01.png" alt="截屏2021-02-09 11.58.01" style="zoom:67%;" />
&lt;/li>
&lt;/ul>
&lt;h3 id="fisherface">Fisherface&lt;/h3>
&lt;h4 id="linear-discriminant-analysis-lda">Linear Discriminant Analysis (LDA)&lt;/h4>
&lt;blockquote>
&lt;p>For more details about LDA, see: &lt;a href="https://haobin-tan.netlify.app/docs/ai/machine-learning/non-parametric/lda-summary/">LDA Summary&lt;/a>&lt;/p>
&lt;/blockquote>
&lt;ul>
&lt;li>A.k.a. &lt;strong>Fisher&amp;rsquo;s Linear Discriminant&lt;/strong>&lt;/li>
&lt;li>Preserves separability of classes&lt;/li>
&lt;li>Maximizes ratio of projected between-classes to projected within-class scatter&lt;/li>
&lt;/ul>
$$
W\_{\mathrm{fld}}=\arg \underset{W}{\max } \frac{\left|W^{T} S\_{B} W\right|}{\left|W^{T} S\_{W} W\right|}
$$
&lt;p>Where&lt;/p>
&lt;ul>
&lt;li>$S\_{B}=\sum\_{i=1}^{c}\left|X\_{i}\right|\left(\mu\_{i}-\mu\right)\left(\mu\_{i}-\mu\right)^{T}$: Between-class scatter
&lt;ul>
&lt;li>$c$: Number of classes&lt;/li>
&lt;li>$\mu\_i$: mean of class $X\_i$&lt;/li>
&lt;li>$|X\_i|$: number of samples of $X\_i$&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>$S\_{W}=\sum\_{i=1}^{c} \sum\_{x\_{k} \in X\_{i}}\left(x\_{k}-\mu\_{i}\right)\left(x\_{k}-\mu\_{i}\right)^{T}$: Within-class scatter&lt;/li>
&lt;/ul>
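&lt;p>The scatter matrices and the projection $W\_{\mathrm{fld}}$ can be computed as below; a small regularizer on $S\_W$ is added for numerical stability (an implementation choice, not part of the definition):&lt;/p>

```python
import numpy as np

def fisher_directions(X, labels, n_dirs):
    """X: (N, d) samples, labels: (N,) class ids.
    Builds S_B and S_W as defined above and returns the n_dirs
    directions maximizing |W^T S_B W| / |W^T S_W W|."""
    mu = X.mean(axis=0)
    d = X.shape[1]
    S_B = np.zeros((d, d))
    S_W = np.zeros((d, d))
    for c in np.unique(labels):
        Xc = X[labels == c]
        mu_c = Xc.mean(axis=0)
        diff = (mu_c - mu)[:, None]
        S_B += len(Xc) * diff @ diff.T
        S_W += (Xc - mu_c).T @ (Xc - mu_c)
    # Generalized eigenproblem S_W^{-1} S_B w = lambda w
    vals, vecs = np.linalg.eig(np.linalg.inv(S_W + 1e-6 * np.eye(d)) @ S_B)
    order = np.argsort(vals.real)[::-1][:n_dirs]
    return vecs[:, order].real
```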
&lt;p>&lt;strong>LDA vs. PCA&lt;/strong>&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-09%2012.25.35.png" alt="截屏2021-02-09 12.25.35" style="zoom:67%;" />
&lt;h4 id="lda-for-fisherfaces">LDA for Fisherfaces&lt;/h4>
&lt;p>Fisher’s Linear Discriminant&lt;/p>
&lt;ul>
&lt;li>projects away the within-class variation (lighting, expressions) found in training set&lt;/li>
&lt;li>preserves the separability of the classes.&lt;/li>
&lt;/ul>
&lt;h2 id="local-appearance-based-approach">Local appearance-based approach&lt;/h2>
&lt;p>Local vs Holistic approaches:&lt;/p>
&lt;ul>
&lt;li>Local variations in facial appearance (different expression, occlusion, lighting)
&lt;ul>
&lt;li>lead to modifications on the entire representation in the holistic approaches&lt;/li>
&lt;li>while in local approaches ONLY the corresponding local region is affected&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Face images have locally varying statistics (high frequency at the edges, low frequency in smooth regions). These varying statistics are easier to represent linearly using a local representation.&lt;/li>
&lt;li>Local approaches facilitate the weighting of each local region in terms of their effect on face recognition.&lt;/li>
&lt;/ul>
&lt;h3 id="modular-eigen-spaces">Modular Eigen Spaces&lt;/h3>
&lt;p>Classification using fiducial regions instead of using entire face &lt;sup id="fnref:2">&lt;a href="#fn:2" class="footnote-ref" role="doc-noteref">2&lt;/a>&lt;/sup>.&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-09%2012.59.14.png" alt="截屏2021-02-09 12.59.14">&lt;/p>
&lt;h3 id="local-pca-modular-pca">Local PCA (Modular PCA)&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>Face images are divided into $N$ smaller sub-images&lt;/p>
&lt;/li>
&lt;li>
&lt;p>PCA is applied on each of these sub-images&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-09%2013.01.08-20210723110742829.png" alt="截屏2021-02-09 13.01.08">&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Performed &lt;strong>better&lt;/strong> than global PCA on large variations of illumination and expression&lt;/p>
&lt;/li>
&lt;li>
&lt;p>No improvements under variation of pose&lt;/p>
&lt;/li>
&lt;/ul>
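&lt;p>A hypothetical sketch of the sub-image split: each image is cut into a grid of blocks and PCA (via SVD) is run per block independently. The grid size and component count are illustrative parameters:&lt;/p>

```python
import numpy as np

def modular_pca(images, grid=(2, 2), n_components=5):
    """images: (K, H, W) stack of aligned faces.
    Splits each image into grid blocks and fits a separate PCA basis
    per block; returns {(row, col): (mean, basis)}."""
    K, H, W = images.shape
    bh, bw = H // grid[0], W // grid[1]
    models = {}
    for r in range(grid[0]):
        for c in range(grid[1]):
            block = images[:, r*bh:(r+1)*bh, c*bw:(c+1)*bw].reshape(K, -1)
            m = block.mean(axis=0)
            # PCA of the centered block matrix via SVD
            _, _, Vt = np.linalg.svd(block - m, full_matrices=False)
            models[(r, c)] = (m, Vt[:n_components])
    return models
```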
&lt;h3 id="local-feature-based">Local Feature based&lt;/h3>
&lt;p>🎯 Objective: To mitigate the effect of expression, illumination, and occlusion variations by performing local analysis and by fusing the outputs of extracted local features at the feature or at the decision level.&lt;/p>
&lt;h4 id="gabor-filters">Gabor Filters&lt;/h4>
&lt;h4 id="elastic-bunch-graphs-ebg">Elastic Bunch Graphs (EBG)&lt;/h4>
&lt;h4 id="local-binary-pattern-lbp-histogram">Local Binary Pattern (LBP) Histogram&lt;/h4>
&lt;div class="footnotes" role="doc-endnotes">
&lt;hr>
&lt;ol>
&lt;li id="fn:1">
&lt;p>&lt;a href="http://cbcl.mit.edu/people/poggio/journals/brunelli-poggio-IEEE-PAMI-1993.pdf">http://cbcl.mit.edu/people/poggio/journals/brunelli-poggio-IEEE-PAMI-1993.pdf&lt;/a>&amp;#160;&lt;a href="#fnref:1" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:2">
&lt;p>Pentland, Moghaddam and Starner, &amp;ldquo;View-based and modular eigenspaces for face recognition,&amp;rdquo; &lt;em>1994 Proceedings of IEEE Conference on Computer Vision and Pattern Recognition&lt;/em>, Seattle, WA, USA, 1994, pp. 84-91, doi: 10.1109/CVPR.1994.323814.&amp;#160;&lt;a href="#fnref:2" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;/ol>
&lt;/div></description></item><item><title>Face Recognition: Features</title><link>https://haobin-tan.netlify.app/docs/ai/computer-vision/cv-lecture/06-face_recognition-features/</link><pubDate>Mon, 15 Feb 2021 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/computer-vision/cv-lecture/06-face_recognition-features/</guid><description>&lt;h2 id="local-appearance-based-face-recognition">Local Appearance-based Face Recognition&lt;/h2>
&lt;p>🎯 Objective: To mitigate the effect of expression, illumination, and occlusion variations by performing local analysis and by fusing the outputs of extracted local features at the feature or at the decision level.&lt;/p>
&lt;p>Some popular facial descriptions achieving good results&lt;/p>
&lt;ul>
&lt;li>Local binary Pattern Histogram (LBPH)&lt;/li>
&lt;li>Gabor Feature&lt;/li>
&lt;li>Discrete Cosine Transform (DCT)&lt;/li>
&lt;li>SIFT&lt;/li>
&lt;li>etc.&lt;/li>
&lt;/ul>
&lt;h3 id="local-binary-pattern-histogram-lbph1">Local binary Pattern Histogram (LBPH)&lt;sup id="fnref:1">&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref">1&lt;/a>&lt;/sup>&lt;/h3>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-16%2011.10.39.png" alt="截屏2021-02-16 11.10.39" style="zoom:80%;" />
&lt;ul>
&lt;li>
&lt;p>Divide image into cells&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Compare each pixel to each of its neighbors&lt;/p>
&lt;ul>
&lt;li>Where the neighbor&amp;rsquo;s value is greater than or equal to the threshold value (here, the center pixel&amp;rsquo;s value), write &amp;ldquo;1&amp;rdquo;&lt;/li>
&lt;li>Otherwise, write &amp;ldquo;0&amp;rdquo;&lt;/li>
&lt;/ul>
&lt;p>$\rightarrow$ gives a binary number&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Convert binary into decimal&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Compute the histogram over the cell&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Use the histogram for classification&lt;/p>
&lt;ul>
&lt;li>SVM&lt;/li>
&lt;li>Histogram-distances&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
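&lt;p>The steps above can be sketched for a single 3×3 neighborhood and one cell. The clockwise neighbor ordering is one common convention; other orderings give equally valid (but different) codes:&lt;/p>

```python
import numpy as np

def lbp_code(patch):
    """LBP code of the center pixel of a 3x3 patch: threshold the 8
    neighbors against the center and read the bits as a binary number."""
    center = patch[1, 1]
    # clockwise, starting at the top-left neighbor
    neighbors = [patch[0, 0], patch[0, 1], patch[0, 2], patch[1, 2],
                 patch[2, 2], patch[2, 1], patch[2, 0], patch[1, 0]]
    bits = [1 if n >= center else 0 for n in neighbors]
    return sum(b << i for i, b in enumerate(reversed(bits)))

def lbp_histogram(cell):
    """Histogram of LBP codes over one cell (the local descriptor)."""
    H, W = cell.shape
    codes = [lbp_code(cell[r-1:r+2, c-1:c+2])
             for r in range(1, H - 1) for c in range(1, W - 1)]
    return np.bincount(codes, minlength=256)
```

Per-cell histograms are then concatenated into the face descriptor and compared with histogram distances or fed to an SVM.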
&lt;blockquote>
&lt;p>Tutorials and explanation:&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://towardsdatascience.com/face-recognition-how-lbph-works-90ec258c3d6b">Face Recognition: Understanding LBPH Algorithm&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://www.youtube.com/watch?v=h-z9-bMtd7w">how is the LBP |Local Binary Pattern| values calculated? ~ xRay Pixy&lt;/a>&lt;/li>
&lt;/ul>
&lt;/blockquote>
&lt;h3 id="high-dim-dense-local-feature-extraction">&lt;strong>High dim. dense local Feature Extraction&lt;/strong>&lt;/h3>
&lt;ul>
&lt;li>Computing features densely (e.g. on overlapping patches at many scales in the image)&lt;/li>
&lt;li>Problem: very high dimensionality!&lt;/li>
&lt;li>Solution: Encode into a compact form
&lt;ul>
&lt;li>Bag of Visual Word (BoVW) model&lt;/li>
&lt;li>Fisher encoding&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h4 id="fisher-vector-encoding">Fisher Vector Encoding&lt;/h4>
&lt;ul>
&lt;li>Aggregates feature vectors into a compact representation&lt;/li>
&lt;li>Fitting a parametric generative model (e.g. Gaussian Mixture Model)&lt;/li>
&lt;li>Encode derivative of the likelihood of model w.r.t its parameters&lt;/li>
&lt;/ul>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-16%2011.38.19.png" alt="截屏2021-02-16 11.38.19">&lt;/p>
&lt;h2 id="face-recognition-across-pose-alignment">Face recognition across pose (Alignment)&lt;/h2>
&lt;p>Problem&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Different view-point / head orientation&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-16%2011.40.44.png" alt="截屏2021-02-16 11.40.44" style="zoom:80%;" />
&lt;/li>
&lt;li>
&lt;p>Recognition results degrade when images of different head orientations have to be matched 😭&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Major directions to address the face recognition across pose problem&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Geometric pose normalization (image affine warps)&lt;/li>
&lt;li>2D specific pose models, image rendering at pixel or feature level (2D+3D approaches)&lt;/li>
&lt;li>3D face Model fitting&lt;/li>
&lt;/ul>
&lt;h3 id="pose-normalization">Pose Normalization&lt;/h3>
&lt;h4 id="-idea">💡 &lt;strong>Idea&lt;/strong>&lt;/h4>
&lt;ul>
&lt;li>Find several facial features (mesh)&lt;/li>
&lt;li>Use complete mesh to normalize face&lt;/li>
&lt;/ul>
&lt;p>Here we will use &lt;strong>2D Active Appearance Models&lt;/strong>&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-16%2011.51.52.png" alt="截屏2021-02-16 11.51.52" style="zoom:80%;" />
&lt;ul>
&lt;li>
&lt;p>A texture and shape-based parametric model&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Efficient fitting algorithm: &lt;strong>Inverse compositional (IC)&lt;/strong> algorithm&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h4 id="model-and-fitting">Model and fitting&lt;/h4>
&lt;p>Independent shape and appearance model
&lt;/p>
$$
\begin{array}{c}
\text{shape:} \quad s=\left(x\_{1}, y\_{1}, x\_{2}, y\_{2}, \cdots, x\_{v}, y\_{v}\right)^{T}=s\_{0}+\sum\_{i=1}^{n} p\_{i} s\_{i} \\\\
\text{appearance:} \quad A(x)=A\_{0}(x)+\sum\_{i=1}^{m} \lambda\_{i} A\_{i}(x) \quad \forall x \in s\_{0}
\end{array}
$$
&lt;p>
Fitting goal:
&lt;/p>
$$
\arg \min \_{p, \lambda} \sum\_{x \in s\_{0}}\left[A\_{0}(x)+\sum\_{i=1}^{m} \lambda\_{i} A\_{i}(x)-I(W(x ; p))\right]^{2}
$$
&lt;p>
Fitting examples&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Fitted mesh&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-16%2012.02.54.png" alt="截屏2021-02-16 12.02.54">&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Mismatched mesh&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-16%2012.03.27.png" alt="截屏2021-02-16 12.03.27">&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>The fitted model can be used to warp the image to a frontal pose (e.g. using a piecewise affine transformation of mesh triangles)&lt;/p>
&lt;figure>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-16%2012.13.08.png">&lt;figcaption>
&lt;h4>Faces with different poses from the FERET database and their pose-aligned images&lt;/h4>
&lt;/figcaption>
&lt;/figure>
&lt;h4 id="results">Results&lt;/h4>
&lt;ul>
&lt;li>Much better results under pose variations compared to simple affine transform&lt;/li>
&lt;li>Different warping functions can be used
&lt;ul>
&lt;li>Piecewise affine transformation worked best&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Approach works well with local-DCT-based approach
&lt;ul>
&lt;li>but not so well with holistic approaches, such as Eigenfaces (PCA) 🤪&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h2 id="face-recogntion-using-3d-models2">Face Recogntion using 3D Models&lt;sup id="fnref:2">&lt;a href="#fn:2" class="footnote-ref" role="doc-noteref">2&lt;/a>&lt;/sup>&lt;/h2>
&lt;ul>
&lt;li>A method for face recognition across variations in pose and illumination.&lt;/li>
&lt;li>Simulates the process of image formation in 3D space.&lt;/li>
&lt;li>Estimates 3D shape and texture of faces from single images by fitting a statistical morphable model of 3D faces to images.&lt;/li>
&lt;li>Faces are represented by model parameters for 3D shape and texture.&lt;/li>
&lt;/ul>
&lt;h4 id="model-based-recognition">Model-based Recognition&lt;/h4>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-16%2012.19.23.png" alt="截屏2021-02-16 12.19.23" style="zoom:67%;" />
&lt;h4 id="face-vectors">Face vectors&lt;/h4>
&lt;p>The morphable face model is based on a vector space representation of faces that is constructed such that &lt;strong>any combination of shape and texture vectors $S\_i$ and $T\_i$ describes a realistic human face&lt;/strong>:
&lt;/p>
$$
S=\sum\_{i=1}^{m} a\_{i} S\_{i} \quad T=\sum\_{i=1}^{m} b\_{i} T\_{i}
$$
&lt;p>
The definition of shape and texture vectors is based on a reference face $\mathbf{I}\_0$.&lt;/p>
&lt;p>The location of the vertices of the mesh in Cartesian coordinates is $(x\_k, y\_k, z\_k)$ with colors $(R\_k, G\_k, B\_k)$&lt;/p>
&lt;p>Reference shape and texture vectors are defined by:
&lt;/p>
$$
\begin{array}{l}
S\_{0}=\left(x\_{1}, y\_{1}, z\_{1}, x\_{2}, \ldots, x\_{n}, y\_{n}, z\_{n}\right)^{T} \\\\
T\_{0}=\left(R\_{1}, G\_{1}, B\_{1}, R\_{2}, \ldots, R\_{n}, G\_{n}, B\_{n}\right)^{T}
\end{array}
$$
&lt;p>
To encode a novel scan $\mathbf{I}$, the flow field from $\mathbf{I}\_0$ to $\mathbf{I}$ is computed.&lt;/p>
&lt;h4 id="pca">PCA&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>PCA is performed on the set of shape and texture vectors separately.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Eigenvectors form an orthogonal basis:
&lt;/p>
$$
\mathbf{S}=\overline{\mathbf{s}}+\sum\_{i=1}^{m-1} \alpha\_{i} \cdot \mathbf{s}\_{i}, \quad \mathbf{T}=\overline{\mathbf{t}}+\sum\_{i=1}^{m-1} \beta\_{i} \cdot \mathbf{t}\_{i}
$$
&lt;/li>
&lt;li>
&lt;p>Example&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-16%2020.36.08.png" alt="截屏2021-02-16 20.36.08" style="zoom:67%;" />
&lt;/li>
&lt;/ul>
&lt;h3 id="model-based-image-analysis">Model-based Image Analysis&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>🎯 Goal: find shape and texture coefficients describing a 3D face model such that rendering produces an image $\mathbf{I}\_{\text{model}}$ that is as similar as possible to $\mathbf{I}\_{\text{input}}$&lt;/p>
&lt;/li>
&lt;li>
&lt;p>For initialization 7 facial feature points, such as the corners of the eyes or tip of the nose, should be labelled manually&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-16%2020.38.43.png" alt="截屏2021-02-16 20.38.43">&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Model fitting: Minimize
&lt;/p>
$$
E\_{I}=\sum\_{x, y}\left\\|\mathbf{I}\_{\text {input }}(x, y)-\mathbf{I}\_{\text {model }}(x, y)\right\\|^{2}
$$
&lt;ul>
&lt;li>Shape, texture, transformation, and illumination are optimized for the entire face and refined for each segment.&lt;/li>
&lt;li>Complex iterative optimization procedure&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h2 id="databases">Databases&lt;/h2>
&lt;ul>
&lt;li>Necessary to develop and improve algorithms&lt;/li>
&lt;li>Provide common testbeds and benchmarks which allow for comparing different approaches&lt;/li>
&lt;li>Different databases focus on different problems&lt;/li>
&lt;/ul>
&lt;p>Well-known databases for face recognition&lt;/p>
&lt;ul>
&lt;li>FERET&lt;/li>
&lt;li>FRVT&lt;/li>
&lt;li>FRGC&lt;/li>
&lt;li>CMU-PIE&lt;/li>
&lt;li>BANCA&lt;/li>
&lt;li>XM2VTS&lt;/li>
&lt;li>&amp;hellip;&lt;/li>
&lt;/ul>
&lt;h3 id="observations">Observations&lt;/h3>
&lt;ul>
&lt;li>One 3-D image is &lt;em>more powerful&lt;/em> for face recognition than one 2-D image.&lt;/li>
&lt;li>One high resolution 2-D image is &lt;em>more powerful&lt;/em> for face recognition than one 3-D image.&lt;/li>
&lt;li>Using 4 or 5 well-chosen 2-D face images is &lt;em>more powerful&lt;/em> for face recognition than one 3-D face image or multi-modal 3D+2D face.&lt;/li>
&lt;/ul>
&lt;h4 id="wild-face-datasets">Wild Face Datasets&lt;/h4>
&lt;h4 id="labeled-faces-in-the-wild-dataset-lfw">&lt;strong>Labeled Faces In the Wild Dataset (LFW)&lt;/strong>&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>Face Verification: Given a pair of images specify whether they belong to the same person&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-16%2020.44.55.png" alt="截屏2021-02-16 20.44.55" style="zoom:80%;" />
&lt;/li>
&lt;li>
&lt;p>13K images, 5.7K people&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Standard benchmark in the community&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Several test protocols depending upon availability of training data within and outside the dataset.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h4 id="youtube-faces-dataset-ytf">&lt;strong>YouTube Faces Dataset (YTF)&lt;/strong>&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>Video Face Verification: Given a pair of videos specify whether they belong to the same person&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-16%2020.46.03.png" alt="截屏2021-02-16 20.46.03" style="zoom:80%;" />
&lt;/li>
&lt;li>
&lt;p>3425 videos, 1595 people&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Standard benchmark in the community&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Wide pose, expression and illumination variation&lt;/p>
&lt;/li>
&lt;/ul>
&lt;div class="footnotes" role="doc-endnotes">
&lt;hr>
&lt;ol>
&lt;li id="fn:1">
&lt;p>T. Ahonen, A. Hadid and M. Pietikainen, &amp;ldquo;Face Description with Local Binary Patterns: Application to Face Recognition,&amp;rdquo; in IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 12, pp. 2037-2041, Dec. 2006, doi: 10.1109/TPAMI.2006.244.&amp;#160;&lt;a href="#fnref:1" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:2">
&lt;p>V. Blanz and T. Vetter, &amp;ldquo;Face recognition based on fitting a 3D morphable model,&amp;rdquo; in IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 25, no. 9, pp. 1063-1074, Sept. 2003, doi: 10.1109/TPAMI.2003.1227983.&amp;#160;&lt;a href="#fnref:2" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;/ol>
&lt;/div></description></item><item><title>Face Recognition: Deep Learning</title><link>https://haobin-tan.netlify.app/docs/ai/computer-vision/cv-lecture/07-face_recognition-deep_learning/</link><pubDate>Mon, 15 Feb 2021 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/computer-vision/cv-lecture/07-face_recognition-deep_learning/</guid><description>&lt;h2 id="deepface-1">DeepFace &lt;sup id="fnref:1">&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref">1&lt;/a>&lt;/sup>&lt;/h2>
&lt;h3 id="main-idea">Main idea&lt;/h3>
&lt;p>Learn a deep (7-layer) NN (20 million parameters) on 4 million identity-labeled face images, directly on RGB pixels.&lt;/p>
&lt;h3 id="alignment">Alignment&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>Use 6 fiducial points for 2D warp&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Then 67 points for 3D model&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Frontalize the face for input to NN&lt;/p>
&lt;/li>
&lt;/ul>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-16%2020.51.01.png" alt="截屏2021-02-16 20.51.01" style="zoom:67%;" />
&lt;h3 id="representation">Representation&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>Output is fed in $k$-way softmax, that generates probability distribution over class labels.&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-16%2020.52.06.png" alt="截屏2021-02-16 20.52.06">&lt;/p>
&lt;/li>
&lt;li>
&lt;p>🎯 Goal of training: &lt;strong>maximize the probability of the correct class&lt;/strong>&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h2 id="facenet2">FaceNet&lt;sup id="fnref:2">&lt;a href="#fn:2" class="footnote-ref" role="doc-noteref">2&lt;/a>&lt;/sup>&lt;/h2>
&lt;h4 id="idea">💡Idea&lt;/h4>
&lt;ul>
&lt;li>Map images to a compact Euclidean space, where &lt;strong>distances correspond to face similarity&lt;/strong>&lt;/li>
&lt;li>Find $f(x) \in \mathbb{R}^d$ for image $x$, so that
&lt;ul>
&lt;li>$d^2(f(x\_1), f(x\_2)) \rightarrow \text{small}$, if $x\_1, x\_2 \in \text{same identity}$&lt;/li>
&lt;li>$d^2(f(x\_1), f(x\_3)) \rightarrow \text{large}$, if $x\_1, x\_3 \in \text{different identities}$&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="system-architecture">System architecture&lt;/h3>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-16%2021.04.45.png" alt="截屏2021-02-16 21.04.45">&lt;/p>
&lt;ul>
&lt;li>CNN: optimized embedding&lt;/li>
&lt;li>Triplet-based loss function: training&lt;/li>
&lt;/ul>
&lt;h3 id="triplet-loss">Triplet loss&lt;/h3>
&lt;p>Image triplets:&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-16%2021.06.14.png" alt="截屏2021-02-16 21.06.14" style="zoom:67%;" />
$$
\begin{array}{c}
\left\\|f\left(x\_{i}^{a}\right)-f\left(x\_{i}^{p}\right)\right\\|\_{2}^{2}+\alpha&lt;\left\\|f\left(x\_{i}^{a}\right)-f\left(x\_{i}^{n}\right)\right\\|\_{2}^{2} \\\\
\forall\left(f\left(x\_{i}^{a}\right), f\left(x\_{i}^{p}\right), f\left(x\_{i}^{n}\right)\right) \in \mathcal{T}
\end{array}
$$
&lt;p>where&lt;/p>
&lt;ul>
&lt;li>
&lt;p>$x\_i^a$: Anchor image&lt;/p>
&lt;/li>
&lt;li>
&lt;p>$x\_i^p$: Positive image&lt;/p>
&lt;/li>
&lt;li>
&lt;p>$x\_i^n$: Negative image&lt;/p>
&lt;/li>
&lt;li>
&lt;p>$\mathcal{T}$: Set of all possible triplets in the training set&lt;/p>
&lt;/li>
&lt;li>
&lt;p>$\alpha$: Margin between positive and negative pairs&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>Total Loss function to be minimized:
&lt;/p>
$$
L=\sum\_{i}^{N}\left[\left\\|f\left(x\_{i}^{a}\right)-f\left(x\_{i}^{p}\right)\right\\|\_{2}^{2}-\left\\|f\left(x\_{i}^{a}\right)-f\left(x\_{i}^{n}\right)\right\\|\_{2}^{2}+\alpha\right]\_{+}
$$
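&lt;p>A minimal NumPy sketch of this loss; the hinge $[\cdot]\_{+}$ means triplets that already satisfy the margin contribute zero (the function name and default margin are illustrative):&lt;/p>

```python
import numpy as np

def triplet_loss(anchor, positive, negative, alpha=0.2):
    """Per-triplet loss ||f(a)-f(p)||^2 - ||f(a)-f(n)||^2 + alpha,
    clipped at zero, summed over the batch (embeddings on last axis)."""
    d_pos = np.sum((anchor - positive) ** 2, axis=-1)
    d_neg = np.sum((anchor - negative) ** 2, axis=-1)
    return np.maximum(d_pos - d_neg + alpha, 0.0).sum()
```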
&lt;h3 id="triplet-selection">Triplet selection&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>Online Generation&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Select only the &lt;strong>semi-hard negatives&lt;/strong> and using all anchor-positive pairs of mini-batch&lt;/p>
&lt;p>$\rightarrow$ Select $x\_i^n$ such that
&lt;/p>
$$
\left\\|f\left(x\_{i}^{a}\right)-f\left(x\_{i}^{p}\right)\right\\|\_{2}^{2}&lt;\left\\|f\left(x\_{i}^{a}\right)-f\left(x\_{i}^{n}\right)\right\\|\_{2}^{2}
$$
&lt;/li>
&lt;/ul>
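&lt;p>One common reading of semi-hard mining, sketched within a mini-batch: a negative is semi-hard if it is farther than the positive but still inside the margin, so it yields a positive loss without being pathologically hard. The margin upper bound is an assumption of this sketch, not stated in the notes:&lt;/p>

```python
import numpy as np

def semi_hard_negatives(emb, labels, anchor_idx, pos_idx, alpha=0.2):
    """Indices of semi-hard negatives for one anchor-positive pair:
    d_pos < d(a, n) < d_pos + alpha, restricted to other identities."""
    a = emb[anchor_idx]
    d_pos = np.sum((a - emb[pos_idx]) ** 2)
    d = np.sum((emb - a) ** 2, axis=1)
    mask = (labels != labels[anchor_idx]) & (d > d_pos) & (d < d_pos + alpha)
    return np.nonzero(mask)[0]
```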
&lt;h3 id="results">Results&lt;/h3>
&lt;ul>
&lt;li>LFW: 99.63% $\pm$ 0.09&lt;/li>
&lt;li>YouTube Faces DB: 95.12% $\pm$ 0.39&lt;/li>
&lt;/ul>
&lt;h2 id="deep-face-recognition-3">Deep Face Recognition &lt;sup id="fnref:3">&lt;a href="#fn:3" class="footnote-ref" role="doc-noteref">3&lt;/a>&lt;/sup>&lt;/h2>
&lt;p>&lt;strong>Key Questions&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Can large scale datasets be built with minimal human intervention? &lt;a href="#dataset-collection">Yes&lt;/a>!&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Can we propose a convolutional neural network which can compete with that of internet giants like Google and Facebook? &lt;a href="#convolutional-neural-network">Yes&lt;/a>!&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h3 id="dataset-collection">Dataset Collection&lt;/h3>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Candidate list generation&lt;/strong>: &lt;strong>Finding names of celebrities&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Tap the knowledge on the web&lt;/li>
&lt;li>5000 identities&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Manual verification of celebrities: Finding Popular Celebrities&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Collect representative images for each celebrity&lt;/p>
&lt;/li>
&lt;li>
&lt;p>200 images/identity&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Remove people with low representation on Google.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Remove overlap with public benchmarks&lt;/p>
&lt;/li>
&lt;li>
&lt;p>2622 celebrities for the final dataset&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Rank image sets&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>2000 images per identity&lt;/li>
&lt;li>Searching by appending keyword “actor”&lt;/li>
&lt;li>Learning a classifier using data obtained in the previous step.&lt;/li>
&lt;li>Ranking 2000 images and selecting top 1000 images&lt;/li>
&lt;li>Approx. 2.6 Million images of 2622 celebrities&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Near duplicate removal&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>VLAD descriptor based near duplicate removal&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Manual filtering&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Curating the dataset further using manual checks&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol>
&lt;h3 id="convolutional-neural-network">Convolutional Neural Network&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>The “Very Deep” Architecture&lt;/p>
&lt;ul>
&lt;li>
&lt;p>3 x 3 Convolution Kernels (Very small)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Conv. Stride 1 px.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Relu non-linearity&lt;/p>
&lt;/li>
&lt;li>
&lt;p>No local contrast normalisation&lt;/p>
&lt;/li>
&lt;li>
&lt;p>3 Fully connected layers&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Training&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Random Gaussian Initialization&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Stochastic Gradient Descent with back prop.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Batch Size: 256&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Incremental FC layer training&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Learning Task Specific Embedding&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Learning embedding by minimizing triplet loss
&lt;/p>
$$
\sum\_{(a, p, n) \in T} \max \left\\{0, \alpha-\left\\|\mathbf{x}\_{a}-\mathbf{x}\_{n}\right\\|\_{2}^{2}+\left\\|\mathbf{x}\_{a}-\mathbf{x}\_{p}\right\\|\_{2}^{2}\right\\}
$$
&lt;/li>
&lt;li>
&lt;p>Learning a projection from 4096 to 1024 dimensions&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Online triplet formation at the beginning of each iteration&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Fine tuned on target datasets&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Only the projection layers learnt&lt;/strong>&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;div class="footnotes" role="doc-endnotes">
&lt;hr>
&lt;ol>
&lt;li id="fn:1">
&lt;p>Y. Taigman, M. Yang, M. Ranzato and L. Wolf, &amp;ldquo;DeepFace: Closing the Gap to Human-Level Performance in Face Verification,&amp;rdquo; 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 2014, pp. 1701-1708, doi: 10.1109/CVPR.2014.220.&amp;#160;&lt;a href="#fnref:1" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:2">
&lt;p>Schroff, Florian &amp;amp; Kalenichenko, Dmitry &amp;amp; Philbin, James. (2015). FaceNet: A unified embedding for face recognition and clustering. 815-823. 10.1109/CVPR.2015.7298682.&amp;#160;&lt;a href="#fnref:2" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:3">
&lt;p>Omkar M. Parkhi, Andrea Vedaldi and Andrew Zisserman. Deep Face Recognition. In Xianghua Xie, Mark W. Jones, and Gary K. L. Tam, editors, Proceedings of the British Machine Vision Conference (BMVC), pages 41.1-41.12. BMVA Press, September 2015.&amp;#160;&lt;a href="#fnref:3" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;/ol>
&lt;/div></description></item><item><title>Facial Feature Detection</title><link>https://haobin-tan.netlify.app/docs/ai/computer-vision/cv-lecture/08-facial-features-detection/</link><pubDate>Thu, 18 Feb 2021 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/computer-vision/cv-lecture/08-facial-features-detection/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;h3 id="what-are-facial-features">What are facial features?&lt;/h3>
&lt;p>Facial features are the &lt;strong>salient parts of a face region which carry meaningful information&lt;/strong>.&lt;/p>
&lt;ul>
&lt;li>E.g. eye, eyebrow, nose, mouth&lt;/li>
&lt;li>A.k.a &lt;mark>&lt;strong>facial landmarks&lt;/strong>&lt;/mark>&lt;/li>
&lt;/ul>
&lt;h3 id="what-is-facial-feature-detection">What is facial feature detection?&lt;/h3>
&lt;p>Facial feature detection is defined as methods of &lt;strong>locating the specific areas of a face&lt;/strong>.&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-18%2023.13.42.png" alt="截屏2021-02-18 23.13.42" style="zoom:80%;" />
&lt;h3 id="applications-of-facial-feature-detection">Applications of facial feature detection&lt;/h3>
&lt;ul>
&lt;li>Face recognition&lt;/li>
&lt;li>Model-based head pose estimation&lt;/li>
&lt;li>Eye gaze tracking&lt;/li>
&lt;li>Facial expression recognition&lt;/li>
&lt;li>Age modeling&lt;/li>
&lt;/ul>
&lt;h3 id="problems-in-facial-feature-detection">Problems in facial feature detection&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Identity variations&lt;/strong>&lt;/p>
&lt;p>Each person has unique facial parts.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Expression variations&lt;/strong>&lt;/p>
&lt;p>Some facial features change their state (e.g. eye blinks).&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Head rotations&lt;/strong>&lt;/p>
&lt;p>If a head orientation changes, the visual appearance also changes.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Scale variations&lt;/strong>&lt;/p>
&lt;p>Changes in resolution and distance to the camera affect appearance.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Lighting conditions&lt;/strong>&lt;/p>
&lt;p>Light has non-linear effects on the pixel values of an image.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Occlusions&lt;/strong>&lt;/p>
&lt;p>Hair or glasses might hide facial features.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h2 id="older-approaches-from-face-detection">Older approaches (from face detection)&lt;/h2>
&lt;ul>
&lt;li>Integral projections + geometric constraints&lt;/li>
&lt;li>Haar-Filter Cascades&lt;/li>
&lt;li>PCA-based methods (Modular Eigenspace)&lt;/li>
&lt;li>Morphable 3D Model&lt;/li>
&lt;/ul>
&lt;h2 id="statistical-appearance-models">Statistical appearance models&lt;/h2>
&lt;ul>
&lt;li>💡 Idea: make use of prior-knowledge, i.e. models, to reduce the complexity of the task&lt;/li>
&lt;li>Needs to be able to deal with variability $\rightarrow$ &lt;strong>deformable models&lt;/strong>&lt;/li>
&lt;li>Use statistical models of shape and texture to find facial landmark points&lt;/li>
&lt;li>Good models should
&lt;ul>
&lt;li>Capture the various characteristics of the object to be detected&lt;/li>
&lt;li>Be a compact representation in order to avoid heavy calculation&lt;/li>
&lt;li>Be robust against noise&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="basic-idea">Basic idea&lt;/h3>
&lt;ol>
&lt;li>&lt;strong>Training&lt;/strong> stage: construction of models&lt;/li>
&lt;li>&lt;strong>Test&lt;/strong> stage: Search the region of interest (ROI)&lt;/li>
&lt;/ol>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-19%2011.30.55.png" alt="截屏2021-02-19 11.30.55" style="zoom:80%;" />
&lt;h3 id="appearance-models">Appearance models&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>Represent both &lt;strong>texture&lt;/strong> and &lt;strong>shape&lt;/strong>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Statistical model learned from training data&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Modeling shape variability&lt;/p>
&lt;ul>
&lt;li>Landmark points&lt;/li>
&lt;/ul>
$$
x=\left[x\_{1}, y\_{1}, x\_{2}, y\_{2}, \ldots, x\_{n}, y\_{n}\right]^{T}
$$
&lt;ul>
&lt;li>
&lt;p>Model
&lt;/p>
$$
x \approx \bar{x}+P\_{s} b\_{s}
$$
&lt;ul>
&lt;li>$\bar{x}$: Mean vector&lt;/li>
&lt;li>$P\_s$: Eigenvectors of covariance matrix&lt;/li>
&lt;li>$b\_s = P\_s^T(x - \bar{x})$&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Modeling intensity variability:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Gray values
&lt;/p>
$$
h=\left[g\_{1}, g\_{2}, \ldots, g\_{k}\right]^{T}
$$
&lt;/li>
&lt;li>
&lt;p>Model
&lt;/p>
$$
h \approx \bar{h} + P\_ib\_i
$$
&lt;ul>
&lt;li>$\bar{h}$: Mean vector&lt;/li>
&lt;li>$P\_i$: Eigenvectors of covariance matrix&lt;/li>
&lt;li>$b\_i = P\_i^T(h - \bar{h})$&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="training-of-appearance-models">Training of appearance models&lt;/h3>
&lt;h4 id="1-construct-a-shape-model-with-principal-component-analysis-pca">1. Construct a shape model with Principal component analysis (PCA)&lt;/h4>
&lt;p>A shape is represented with manually labeled points.&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-19%2011.40.18.png" alt="截屏2021-02-19 11.40.18">&lt;/p>
&lt;p>The shape model approximates the shape of an object.&lt;/p>
&lt;h5 id="procrustes-analysis">&lt;strong>Procrustes Analysis&lt;/strong>&lt;/h5>
&lt;p>Align all the shapes together to remove translation, rotation and scaling&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-19%2011.51.46.png" alt="截屏2021-02-19 11.51.46">&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-19%2011.52.05.png" alt="截屏2021-02-19 11.52.05">&lt;/p>
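&lt;p>The alignment of one shape pair can be sketched with SciPy (illustrative only; building the model iterates alignment over all training shapes):&lt;/p>

```python
import numpy as np
from scipy.spatial import procrustes

# Two toy 4-point shapes: shape_b is shape_a rotated 90 degrees,
# scaled by 3 and translated, i.e. identical up to a similarity transform.
shape_a = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
theta = np.pi / 2
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
shape_b = 3.0 * shape_a.dot(R.T) + np.array([5.0, -2.0])

# procrustes standardizes both shapes and removes translation,
# scaling and rotation; disparity measures the remaining difference.
m1, m2, disparity = procrustes(shape_a, shape_b)
print(disparity)  # close to 0: the shapes differ only by a similarity transform
```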
&lt;p>&lt;strong>PCA&lt;/strong>&lt;/p>
&lt;p>The positions of labeled points are
&lt;/p>
$$
x = \bar{x}+P\_{s} b\_{s}
$$
&lt;ul>
&lt;li>$\bar{x}$: Mean shape&lt;/li>
&lt;li>$P\_s$: Orthogonal modes of variation obtained by PCA&lt;/li>
&lt;li>$b\_s$: Shape parameters in the projected space&lt;/li>
&lt;/ul>
&lt;p>The shapes are represented with fewer parameters ($\operatorname{Dim}(x) > \operatorname{Dim}(b\_s)$)&lt;/p>
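&lt;p>The PCA shape model can be sketched in NumPy (a minimal version; &lt;code>shapes&lt;/code> is assumed to be an array of already-aligned landmark vectors):&lt;/p>

```python
import numpy as np

def build_shape_model(shapes, n_modes=5):
    """PCA shape model; shapes is (num_samples, 2n) of aligned landmarks."""
    x_bar = shapes.mean(axis=0)                  # mean shape
    cov = np.cov(shapes - x_bar, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)       # eigh returns ascending order
    order = np.argsort(eigvals)[::-1][:n_modes]  # keep the largest modes
    return x_bar, eigvecs[:, order]              # x_bar and P_s

def project(x, x_bar, P_s):
    """Shape parameters b_s = P_s^T (x - x_bar)."""
    return P_s.T.dot(x - x_bar)

def reconstruct(b_s, x_bar, P_s):
    """Approximate shape x = x_bar + P_s b_s."""
    return x_bar + P_s.dot(b_s)
```

&lt;p>Keeping all modes makes the reconstruction exact; truncating to a few modes gives the compact representation described above.&lt;/p>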
&lt;p>Generating plausible shapes:&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-19%2012.38.11.png" alt="截屏2021-02-19 12.38.11" style="zoom:80%;" />
&lt;h4 id="2-construct-a-texture-model-which-represents-grey-scale-or-color-values-at-each-point">2. Construct a texture model which represents grey-scale (or color) values at each point&lt;/h4>
&lt;p>Warp the image so that the labeled points fit on the mean shape&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-19%2012.39.55.png" alt="截屏2021-02-19 12.39.55" style="zoom:80%;" />
&lt;p>Then normalize the intensity on the &lt;em>shape-free&lt;/em> patch.&lt;/p>
&lt;h5 id="texture-warping">Texture warping&lt;/h5>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-19%2012.58.42.png" alt="截屏2021-02-19 12.58.42" />
&lt;h4 id="texture-model">Texture model&lt;/h4>
&lt;p>The pixel values on the shape-free patch
&lt;/p>
$$
g = \bar{g} + P\_g b\_g
$$
&lt;ul>
&lt;li>$\bar{g}$ : Mean of normalized pixel values&lt;/li>
&lt;li>$P\_g$ : Orthogonal modes of variation obtained by PCA&lt;/li>
&lt;li>$b\_g$: Texture parameters in the projected space&lt;/li>
&lt;/ul>
&lt;p>The pixel values (appearance) are represented with fewer parameters ($\operatorname{Dim}(g) > \operatorname{Dim}(b\_g)$)&lt;/p>
&lt;h4 id="3-model-the-correlation-between-shapes-and-grey-level-models">3. Model the correlation between shapes and grey-level models&lt;/h4>
&lt;p>The concatenated vector is
&lt;/p>
$$
b=\left(\begin{array}{c}
W\_{s} b\_{s} \\\\
b\_{g}
\end{array}\right)
$$
&lt;p>
Apply PCA:
&lt;/p>
$$
b=P\_{c} c=\left(\begin{array}{l}
P\_{c s} \\\\
P\_{c g}
\end{array}\right)c
$$
&lt;p>
Now the parameter $\mathbf{c}$ can control both shape and grey-level models&lt;/p>
&lt;ul>
&lt;li>
&lt;p>The shape model
&lt;/p>
$$
x=\bar{x}+P\_{s} W\_{s}^{-1} P\_{c s} c
$$
&lt;/li>
&lt;li>
&lt;p>The grey-level model
&lt;/p>
$$
g=\bar{g}+P\_{g} P\_{c g} c
$$
&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Examples of synthesized faces&lt;/strong>&lt;/p>
&lt;p>Various objects can be synthesized by controlling the parameter $\mathbf{c}$&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-19%2013.07.53.png" alt="截屏2021-02-19 13.07.53" style="zoom:80%;" />
&lt;h3 id="dataset-for-building-model">Dataset for Building Model&lt;/h3>
&lt;p>IMM data set from Danish Technical University&lt;/p>
&lt;ul>
&lt;li>
&lt;p>240 images of size 640*480; 40 individuals (36 male, 4 female).&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Each subject has 6 shots with different poses, expressions, and illumination.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Each image is labeled with 58 landmarks; 3 closed and 4 open point-paths.&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-19%2013.09.03.png" alt="截屏2021-02-19 13.09.03">&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h3 id="image-interpretation-with-models">Image Interpretation with Models&lt;/h3>
&lt;ul>
&lt;li>🎯 &lt;strong>Goal: find the set of parameters which best match the model to the image&lt;/strong>
&lt;ul>
&lt;li>Optimize some cost function&lt;/li>
&lt;li>Difficult optimization problem&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Set of parameters
&lt;ul>
&lt;li>Defines shape, position, appearance&lt;/li>
&lt;li>Can be used for further processing
&lt;ul>
&lt;li>
&lt;p>Position of landmarks&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Face recognition&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Facial expression recognition&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Pose estimation&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Problem: Optimizing the model fit
&lt;ul>
&lt;li>&lt;a href="#active-shape-models-asm">Active Shape Models&lt;/a>&lt;/li>
&lt;li>&lt;a href="#active-appearance-models-aam">Active Appearance Models&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="active-shape-models-asm">Active Shape Models (ASM)&lt;/h3>
&lt;p>Given a rough starting position, create an instance $\mathbf{X}$ of the model using&lt;/p>
&lt;ul>
&lt;li>shape parameters $b$&lt;/li>
&lt;li>translation $T=(X\_t,Y\_t)$&lt;/li>
&lt;li>scale $s$&lt;/li>
&lt;li>rotation $\theta$&lt;/li>
&lt;/ul>
&lt;p>Iterative approach:&lt;/p>
&lt;ol>
&lt;li>Examine region of the image around $\mathbf{X}\_i$ to find the best nearby match for the point $\mathbf{X}\_i^\prime$&lt;/li>
&lt;li>Update parameters $(b, T, s, \theta)$ to best fit the new points $\mathbf{X}$ (constrain the model parameters to be within three standard deviations)&lt;/li>
&lt;li>Repeat until convergence&lt;/li>
&lt;/ol>
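&lt;p>One iteration of this loop can be sketched as follows (highly simplified: &lt;code>find_best_nearby&lt;/code> stands in for the local image search, and the similarity-transform update $(T, s, \theta)$ is omitted):&lt;/p>

```python
import numpy as np

def asm_iteration(X, x_bar, P_s, eigvals, find_best_nearby):
    """One ASM update: local search, model projection, parameter clamping."""
    # 1. For each current point, find the best nearby match in the image.
    points = X.reshape(-1, 2)
    X_new = np.array([find_best_nearby(p) for p in points]).ravel()
    # 2. Project into the shape model and constrain the parameters
    #    to be within three standard deviations of the mean.
    b = P_s.T.dot(X_new - x_bar)
    limit = 3.0 * np.sqrt(eigvals)
    b = np.clip(b, -limit, limit)
    # 3. Regenerate a plausible shape from the clamped parameters.
    return x_bar + P_s.dot(b)
```

&lt;p>Here &lt;code>eigvals&lt;/code> are the variances of the retained shape modes; the clamp is what keeps every intermediate shape plausible.&lt;/p>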
&lt;p>In practice: &lt;strong>search along profile normals&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>The optimal parameters are searched from &lt;strong>multi-resolution&lt;/strong> images hierarchically (faster algorithm)&lt;/p>
&lt;ol>
&lt;li>Search for the object in a coarse image&lt;/li>
&lt;li>Refine the location in a series of higher resolution images.&lt;/li>
&lt;/ol>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-19%2013.31.48.png" alt="截屏2021-02-19 13.31.48">&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Example of search&lt;/strong>&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-19%2013.32.21.png" alt="截屏2021-02-19 13.32.21" style="zoom:80%;" />
&lt;h4 id="disadvantages">Disadvantages&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>Uses mainly shape constraints for search&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Does not take advantage of texture across the target&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h3 id="active-appearance-models-aam">Active Appearance Models (AAM)&lt;/h3>
&lt;ul>
&lt;li>Optimize parameters, so as to minimize the difference of a synthesized image and the target image&lt;/li>
&lt;li>Solved using a gradient-descent approach&lt;/li>
&lt;/ul>
&lt;h4 id="fitting-aams">Fitting AAMs&lt;/h4>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-19%2015.51.19.png" alt="截屏2021-02-19 15.51.19" style="zoom:80%;" />
&lt;p>Learning linear relation matrix $\mathbf{R}$ using multi-variate linear regression&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Generate training set by perturbing model parameters for training images&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Include small displacements in position, scale, and orientation&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Record perturbation and image difference&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Experimentally, optimal perturbation around 0.5 standard deviations for each parameter&lt;/p>
&lt;/li>
&lt;/ul>
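&lt;p>The regression step can be sketched with an ordinary least-squares fit (illustrative; &lt;code>perturbations&lt;/code> and &lt;code>image_diffs&lt;/code> stand in for the training matrices collected as described above):&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical training data, generated as described above: known parameter
# perturbations delta_c and the image (texture) differences delta_g they cause.
M = rng.standard_normal((4, 10))               # stand-in forward model
perturbations = rng.standard_normal((200, 4))  # delta_c for each trial
image_diffs = perturbations.dot(M)             # recorded delta_g

# Multi-variate linear regression for R such that delta_c = delta_g . R
R, _, _, _ = np.linalg.lstsq(image_diffs, perturbations, rcond=None)

# At test time an observed image difference predicts the parameter update.
delta_c = np.array([0.5, -1.0, 0.2, 0.0])
print(delta_c.dot(M).dot(R))  # recovers approximately [0.5, -1.0, 0.2, 0.0]
```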
&lt;h3 id="asm-vs-aam">ASM vs. AAM&lt;/h3>
&lt;p>&lt;strong>ASM&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Seeks to match a set of model points to an image, constrained by a statistical model of shape&lt;/li>
&lt;li>Matches model points using an &lt;strong>iterative&lt;/strong> technique (variant of EM-algorithm)&lt;/li>
&lt;li>A search is made around the current position of each point to find a nearby point which best matches texture for the landmark&lt;/li>
&lt;li>Parameters of the shape model are then updated to move the model points closer to the new points in the image&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>AAM&lt;/strong>: matches both position of model points and representation of texture of the object to an image&lt;/p>
&lt;ul>
&lt;li>Uses the difference between current synthesized image and target image to update parameters&lt;/li>
&lt;li>Typically, fewer landmark points are needed&lt;/li>
&lt;/ul>
&lt;h3 id="summary-of-asm-and-aam">Summary of ASM and AAM&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>Statistical appearance models provide a compact representation&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Can model variations such as different identities, facial expression, appearances, etc.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Labeled training images are needed (very time-consuming) 🤪&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Original formulation of ASM and AAM is computationally expensive (i.e. slow) 🤪&lt;/p>
&lt;/li>
&lt;li>
&lt;p>But, efficient extensions and speed-ups exist!&lt;/p>
&lt;ul>
&lt;li>Multi-resolution search&lt;/li>
&lt;li>Constrained AAM search&lt;/li>
&lt;li>Inverse compositional AAMs (CMU)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Usage&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Facial fiducial point detection&lt;/strong>&lt;/li>
&lt;li>Face recognition, pose estimation&lt;/li>
&lt;li>Facial expression analysis&lt;/li>
&lt;li>Audio-visual speech recognition&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h2 id="more-modern-approaches-conditional-random-forests-for-real-time-facial-feature-detection1">More Modern Approaches: &lt;strong>Conditional Random Forests&lt;/strong> For Real Time Facial Feature Detection&lt;sup id="fnref:1">&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref">1&lt;/a>&lt;/sup>&lt;/h2>
&lt;h3 id="basics">Basics&lt;/h3>
&lt;h4 id="regression-tree">Regression tree&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>Basically like a classification decision tree&lt;/p>
&lt;/li>
&lt;li>
&lt;p>In the nodes: decisions are comparisons of numbers&lt;/p>
&lt;/li>
&lt;li>
&lt;p>In the leaves: numbers or multidimensional vectors of numbers&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Example&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-19%2016.02.51.png" alt="截屏2021-02-19 16.02.51">&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h4 id="random-regression-forests">Random regression forests&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>Set of random regression trees&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Random&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Different trees trained on random subset of training data&lt;/p>
&lt;/li>
&lt;li>
&lt;p>After training, predictions for unseen samples can be made by averaging the predictions from all the individual regression trees&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-19%2016.03.52.png" alt="截屏2021-02-19 16.03.52">&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
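&lt;p>The averaging behaviour can be demonstrated with scikit-learn&amp;rsquo;s &lt;code>RandomForestRegressor&lt;/code> (a generic regressor on toy data, not the patch-based forest used in the paper):&lt;/p>

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 10.0, size=(500, 1))
y = np.sin(X).ravel()  # noiseless toy regression target

# Each tree sees a bootstrap subset of the data; the forest prediction
# is the average of the individual tree predictions.
forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

x_query = np.array([[3.0]])
tree_preds = [tree.predict(x_query)[0] for tree in forest.estimators_]
print(np.mean(tree_preds), forest.predict(x_query)[0])  # the two agree
```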
&lt;h3 id="basic-idea-1">Basic idea&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>Train a different set of trees for each head pose.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>The leaf nodes accumulate votes for the different facial fiducial points&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-19%2016.21.08.png" alt="截屏2021-02-19 16.21.08">&lt;/p>
&lt;h3 id="regression-forests-training">Regression forests training&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>Each tree is trained on a randomly selected set of images.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Extract patches in each image&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Training goal: accumulate probability for a feature point $C\_n$ given a patch $P$ at the leaf node&lt;/p>
&lt;ul>
&lt;li>Each patch is represented by appearance features $I$, and displacement vectors $D$ (offsets) to each of the facial fiducial feature points. I.e. $P = \\{I, D\\}$&lt;/li>
&lt;li>A simple patch comparison is used as Tree-node splitting criterion&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="regression-forests-testing">Regression forests testing&lt;/h3>
&lt;ul>
&lt;li>Given: a random face image&lt;/li>
&lt;li>Extract a dense set of patches from the image&lt;/li>
&lt;li>Feed all patches to all trees in the forest&lt;/li>
&lt;li>Get for each patch $P\_i$ a corresponding set of leaves&lt;/li>
&lt;li>A density estimator for the location of the facial feature points is calculated&lt;/li>
&lt;li>Run meanshift to find all locations&lt;/li>
&lt;/ul>
&lt;h3 id="conditional-regression-forest">Conditional Regression Forest&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>Conditional regression trees work similarly.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>For &lt;strong>training&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Compute a probability for a concrete head pose&lt;/p>
&lt;/li>
&lt;li>
&lt;p>For each head pose divide the training set in disjoint subsets according to the pose&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Train a regression forest for each subset&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>For &lt;strong>testing&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Estimate the probabilities for each head pose&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Select trees from different regression forests&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Estimate the density function for all facial feature points.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Finalize the exact position by clustering over all candidate votes for a given facial feature point (e.g., by meanshift).&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
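&lt;p>The final clustering step can be sketched with scikit-learn&amp;rsquo;s &lt;code>MeanShift&lt;/code> (illustrative; &lt;code>votes&lt;/code> is a hypothetical set of 2-D location votes for one feature point):&lt;/p>

```python
import numpy as np
from sklearn.cluster import MeanShift

rng = np.random.default_rng(0)
# Hypothetical 2-D location votes for one facial feature point: most votes
# fall near the true location (60, 40), plus a few scattered outliers.
votes = np.vstack([
    rng.normal(loc=[60.0, 40.0], scale=1.5, size=(200, 2)),
    rng.uniform(0.0, 100.0, size=(20, 2)),
])

ms = MeanShift(bandwidth=5.0).fit(votes)
# scikit-learn orders cluster centers by descending support, so the first
# center is the strongest mode of the vote density.
print(ms.cluster_centers_[0])  # close to [60, 40]
```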
&lt;h3 id="experiments-and-results">Experiments and results&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>Training set:&lt;/p>
&lt;ul>
&lt;li>13233 face images from LFW Database&lt;/li>
&lt;li>10 annotated facial feature points per face image&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Training&lt;/p>
&lt;ul>
&lt;li>Maximum tree depth = 20&lt;/li>
&lt;li>2500 splitting candidates and 25 thresholds per split&lt;/li>
&lt;li>1500 images to train each tree&lt;/li>
&lt;li>200 patches per image (20 * 20 pixels).&lt;/li>
&lt;li>For head pose, two different subsets with 3 and 5 head poses are generated (accuracy 72.5%)&lt;/li>
&lt;li>Required time for face detection and head pose estimation is 33 ms.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Results&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/result.png" alt="result">&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h2 id="cnn-based-models">CNN based models&lt;/h2>
&lt;p>&lt;strong>Stacked Hourglass Network&lt;/strong> &lt;sup id="fnref:2">&lt;a href="#fn:2" class="footnote-ref" role="doc-noteref">2&lt;/a>&lt;/sup>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Fully-convolutional neural network&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Repeated down- and upsampling + shortcut connections&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Based on RGB face image, produce one heatmap for each landmark&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Heatmaps are transformed into numerical coordinates using DSNT&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-19%2016.51.43.png" alt="截屏2021-02-19 16.51.43">&lt;/p>
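&lt;p>The DSNT step is essentially a differentiable expectation over a normalized heatmap; a minimal NumPy sketch (not the authors&amp;rsquo; implementation):&lt;/p>

```python
import numpy as np

def dsnt(heatmap):
    """Expected (x, y) coordinate under the normalized heatmap, in [-1, 1]."""
    h, w = heatmap.shape
    p = heatmap / heatmap.sum()                # normalize into a probability map
    xs = (2.0 * np.arange(w) + 1.0) / w - 1.0  # pixel-center x coordinates
    ys = (2.0 * np.arange(h) + 1.0) / h - 1.0  # pixel-center y coordinates
    x = (p.sum(axis=0) * xs).sum()             # E[x]
    y = (p.sum(axis=1) * ys).sum()             # E[y]
    return x, y

# A heatmap with a single peak maps to that peak's normalized position.
hm = np.zeros((8, 8))
hm[2, 5] = 1.0
x, y = dsnt(hm)
print(x, y)  # 0.375 -0.375
```

&lt;p>Because the output is an expectation, gradients flow back through the heatmap, which is what makes end-to-end training possible.&lt;/p>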
&lt;div class="footnotes" role="doc-endnotes">
&lt;hr>
&lt;ol>
&lt;li id="fn:1">
&lt;p>M. Dantone, J. Gall, G. Fanelli and L. Van Gool, &amp;ldquo;Real-time facial feature detection using conditional regression forests,&amp;rdquo; 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 2012, pp. 2578-2585, doi: 10.1109/CVPR.2012.6247976.&amp;#160;&lt;a href="#fnref:1" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:2">
&lt;p>Newell, A., Yang, K., &amp;amp; Deng, J. (2016). Stacked hourglass networks for human pose estimation. &lt;em>Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)&lt;/em>, &lt;em>9912 LNCS&lt;/em>, 483–499. &lt;a href="https://doi.org/10.1007/978-3-319-46484-8_29">https://doi.org/10.1007/978-3-319-46484-8_29&lt;/a>&amp;#160;&lt;a href="#fnref:2" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;/ol>
&lt;/div></description></item><item><title>Facial Expression Recognition</title><link>https://haobin-tan.netlify.app/docs/ai/computer-vision/cv-lecture/09-facial-expression-recognition/</link><pubDate>Thu, 18 Feb 2021 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/computer-vision/cv-lecture/09-facial-expression-recognition/</guid><description>&lt;h2 id="what-is-facial-expression-analysis">What is facial expression analysis?&lt;/h2>
&lt;h3 id="what-is-facial-expression">What is Facial Expression?&lt;/h3>
&lt;p>Facial expressions are the &lt;strong>facial changes in response to a person&amp;rsquo;s internal emotional states, intentions, or social communications.&lt;/strong>&lt;/p>
&lt;h3 id="role-of-facial-expressions">Role of facial expressions&lt;/h3>
&lt;ul>
&lt;li>Almost the &lt;strong>most powerful, natural, and immediate way&lt;/strong> (for human beings) to communicate emotions and intentions&lt;/li>
&lt;li>Face can express emotion &lt;strong>sooner&lt;/strong> than people verbalize or realize feelings&lt;/li>
&lt;li>Faces and facial expressions are an &lt;strong>important aspect&lt;/strong> in interpersonal communication and man-machine interfaces&lt;/li>
&lt;/ul>
&lt;h3 id="facial-expressions">Facial Expressions&lt;/h3>
&lt;ul>
&lt;li>Facial expression(s):
&lt;ul>
&lt;li>
&lt;p>nonverbal communication&lt;/p>
&lt;/li>
&lt;li>
&lt;p>voluntary / involuntary&lt;/p>
&lt;/li>
&lt;li>
&lt;p>results from one or more motions or positions of the muscles of the face&lt;/p>
&lt;/li>
&lt;li>
&lt;p>closely associated with our emotions&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>The fact: Most people&amp;rsquo;s success rate at reading emotions from facial expression is &lt;strong>only a little over 50 percent&lt;/strong>.&lt;/li>
&lt;/ul>
&lt;h4 id="facial-expression-analysis-vs-emotion-analysis">Facial expression analysis vs. Emotion analysis&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>Emotion analysis requires &lt;strong>higher level knowledge&lt;/strong>, such as context information.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Besides emotions, facial expressions can also express intention, cognitive processes, physical effort, etc.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h4 id="emotions-conveyed-by-facial-expressions">Emotions conveyed by Facial Expressions&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>Six basic emotions (assumed to be innate)&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-100" >&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%e6%88%aa%e5%b1%8f2021-02-19%2017.20.59.png" alt="截屏2021-02-19 17.20.59" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h3 id="basic-structure-of-facial-expression-analysis-systems">Basic structure of facial expression analysis systems&lt;/h3>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-19%2017.21.36.png" alt="截屏2021-02-19 17.21.36" style="zoom:80%;" />
&lt;h2 id="levels-of-description">Levels of description&lt;/h2>
&lt;h3 id="emotions">Emotions&lt;/h3>
&lt;h4 id="discrete-classes">Discrete classes&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>Six basic emotions&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-100" >&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%e6%88%aa%e5%b1%8f2021-02-19%2017.27.26.png" alt="截屏2021-02-19 17.27.26" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Positive, neutral, negative&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h4 id="continuous-valued-dimensions">Continuous valued dimensions&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>Emotions as a continuum along 2/3 dimension&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Circumplex model by Russel&lt;/p>
&lt;ul>
&lt;li>Valence: unpleasant - pleasant&lt;/li>
&lt;li>Arousal: low – high activation&lt;/li>
&lt;/ul>
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-100" >&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%e6%88%aa%e5%b1%8f2021-02-19%2017.28.58.png" alt="截屏2021-02-19 17.28.58" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h3 id="facial-action-units-aus">Facial Action Units (AUs)&lt;/h3>
&lt;h4 id="facial-action-coding-system-facs">Facial Action Coding System (FACS)&lt;/h4>
&lt;ul>
&lt;li>A human-observer based system designed to &lt;strong>detect subtle changes in facial features&lt;/strong>&lt;/li>
&lt;li>Viewing videotaped facial behavior in &lt;em>slow&lt;/em> motion, trained observers can manually FACS-code all possible facial displays&lt;/li>
&lt;li>These facial displays are referred to as &lt;strong>action units (AU)&lt;/strong> and may occur individually or in combinations.&lt;/li>
&lt;/ul>
&lt;h4 id="action-units-aus">Action Units (AUs)&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>There are 44 AUs&lt;/p>
&lt;/li>
&lt;li>
&lt;p>30 AUs related to contractions of special facial muscles&lt;/p>
&lt;ul>
&lt;li>
&lt;p>12 AUs for upper face&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-19%2017.32.01.png" alt="截屏2021-02-19 17.32.01" style="zoom:80%;" />
&lt;/li>
&lt;li>
&lt;p>18 AUs for lower face&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-19%2017.32.29.png" alt="截屏2021-02-19 17.32.29" style="zoom:80%;" />
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Anatomic basis of the remaining 14 is unspecified $\rightarrow$ referred to in Facial Action Coding System (FACS) as miscellaneous actions&lt;/p>
&lt;/li>
&lt;li>
&lt;p>For action units that vary in intensity, a 5-point ordinal scale is used to measure the degree of muscle contraction&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h4 id="combination-of-aus">Combination of AUs&lt;/h4>
&lt;p>More than 7000 different AU combinations have been observed.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Additive&lt;/strong>: appearance of single AUs does NOT change. E.g.&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-19%2017.36.39.png" alt="截屏2021-02-19 17.36.39" style="zoom:80%;" />
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Nonadditive&lt;/strong>: appearance of single AUs does change. E.g.&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-19%2017.37.05.png" alt="截屏2021-02-19 17.37.05" style="zoom:80%;" />
&lt;/li>
&lt;/ul>
&lt;h4 id="individual-differences-in-subjects">Individual Differences in Subjects&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>Variations in appearance&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Face shape&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Texture&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Color&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Facial and scalp hair&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>due to sex, ethnic background, and age differences&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Variations in expressiveness&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h4 id="transitions-among-expressions">Transitions Among Expressions&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>Simplifying assumption: &lt;strong>expressions are singular and begin and end with a neutral position&lt;/strong>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Transitions from action units or combination of actions to another may involve NO intervening neutral state.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Parsing the stream of behavior is an essential requirement of a robust facial analysis system, and training data are needed that include dynamic combinations of action units, which may be either additive or nonadditive.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h4 id="intensity-of-facial-expression">Intensity of Facial Expression&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>Facial actions can vary in intensity&lt;/p>
&lt;/li>
&lt;li>
&lt;p>FACS coding uses 5-point intensity scale to describe intensity variation of action units&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Some related action units function as sets to represent intensity variation.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>E.g. in the eye region, action units 41, 42, and 43 or 45 can represent intensity variation from slightly drooped to closed eyes.&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-100" >&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%e6%88%aa%e5%b1%8f2021-02-19%2017.43.17.png" alt="截屏2021-02-19 17.43.17" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h4 id="relation-to-other-facial-behavior-or-nonfacial-behavior">Relation to other Facial Behavior or Nonfacial Behavior&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>Facial expression is one of several channels of nonverbal communication.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>The message values of various modes may differ depending on context.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>For robustness, should be integrated with&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Gesture&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Prosody&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Speech&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h2 id="different-datasets-and-systems">Different datasets and systems&lt;/h2>
&lt;h3 id="using-geometric-features--ann-2001--early-work">Using geometric features + ANN (2001 / early work)&lt;/h3>
&lt;p>&lt;strong>Recognizing Action Units for Facial Expression Analysis&lt;/strong>&lt;sup id="fnref:1">&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref">1&lt;/a>&lt;/sup>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>An &lt;strong>Automatic Facial Analysis (AFA)&lt;/strong> system to analyze facial expressions based on both &lt;strong>permanent facial features (brows, eyes, mouth)&lt;/strong> and &lt;strong>transient facial features (deepening of facial furrows)&lt;/strong> in nearly frontal-view image sequences.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>A group of action units (neutral expression, six upper face AUs and 10 lower face AUs) is recognized, whether they occur alone or in combination.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h4 id="cohn-kanade-au-coded-facial-expression-database">Cohn-Kanade AU-Coded Facial Expression Database&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>100 subjects from varying ethnic backgrounds.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>23 different facial expressions (single action units and combinations of action units)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Frontal faces, small head motion&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Variations in lighting&lt;/p>
&lt;ul>
&lt;li>ambient lighting&lt;/li>
&lt;li>single-high-intensity lamp&lt;/li>
&lt;li>dual high-intensity lamps with reflective umbrellas&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Coded with FACS and assigned emotion-specified labels&lt;/strong> (happy, surprise, anger, disgust, fear, sadness)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Example&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-100" >&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%e6%88%aa%e5%b1%8f2021-02-19%2021.40.16.png" alt="截屏2021-02-19 21.40.16" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h4 id="feature-based-automatic-facial-action-analysis-afa-system">Feature-based Automatic Facial Action Analysis (AFA) System&lt;/h4>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-19%2021.41.28.png" alt="截屏2021-02-19 21.41.28" style="zoom:80%;" />
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Feature detection &amp;amp; feature location&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Region of the face and location of individual face features detected automatically in the initial frame using a neural-network-based approach&lt;/li>
&lt;li>Contours of face features and components adjusted manually in the initial frame&lt;/li>
&lt;li>Face features are then tracked automatically
&lt;ul>
&lt;li>&lt;strong>permanent features&lt;/strong> (e.g., brows, eyes, lips)&lt;/li>
&lt;li>&lt;strong>transient features&lt;/strong> (lines and furrows)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Feature extraction&lt;/strong>: Group facial features into separate collections of feature parameters&lt;/p>
&lt;ul>
&lt;li>15 normalized upper face parameters&lt;/li>
&lt;li>9 normalized lower face parameters&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Parameters fed to two neural-network-based classifiers&lt;/strong>&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h4 id="facial-feature-extraction">Facial Feature Extraction&lt;/h4>
&lt;p>Multistate Facial Component Models of a Frontal Face&lt;/p>
&lt;ul>
&lt;li>Permanent components/features
&lt;ul>
&lt;li>Lip&lt;/li>
&lt;li>Eye&lt;/li>
&lt;li>Brow&lt;/li>
&lt;li>Cheek&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Transient components/features
&lt;ul>
&lt;li>&lt;strong>Furrows&lt;/strong> and &lt;strong>wrinkles&lt;/strong> appear perpendicular to the direction of the motion of the activated muscles&lt;/li>
&lt;li>Classification
&lt;ul>
&lt;li>present (appear, deepen or lengthen)&lt;/li>
&lt;li>absent&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Detection
&lt;ul>
&lt;li>Canny edge detector&lt;/li>
&lt;li>Nasal root / crow’s-feet wrinkles&lt;/li>
&lt;li>Nasolabial furrows&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
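&lt;p>The transient-feature detection above uses a Canny edge detector. As a rough illustration (not the paper’s implementation), the numpy sketch below substitutes a simpler Sobel-plus-threshold edge test, omitting Canny’s smoothing, non-maximum suppression, and hysteresis:&lt;/p>

```python
import numpy as np

def sobel_edges(img, thresh=0.3):
    """Simplified stand-in for the Canny detector used for furrow detection:
    Sobel gradients plus a single magnitude threshold."""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    ky = kx.T
    pad = np.pad(img.astype(float), 1, mode="edge")
    gx = np.zeros(img.shape)
    gy = np.zeros(img.shape)
    h, w = img.shape
    for i in range(h):
        for j in range(w):
            win = pad[i:i + 3, j:j + 3]
            gx[i, j] = (win * kx).sum()
            gy[i, j] = (win * ky).sum()
    mag = np.hypot(gx, gy)
    return mag > thresh * mag.max()

# A synthetic "furrow": a dark line on a light background
img = np.ones((32, 32))
img[16, 4:28] = 0.0
edges = sobel_edges(img)
print(edges.sum() > 0)  # a present furrow produces edge responses
```

A furrow would then be classified as present if enough edge pixels appear in the expected region (e.g. the nasal root), and absent otherwise.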
&lt;h4 id="facial-feature-representation">Facial Feature Representation&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>Face coordinate system&lt;/p>
&lt;ul>
&lt;li>$x$-axis: the line through the inner corners of the eyes&lt;/li>
&lt;li>$y$-axis: perpendicular to the $x$-axis&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Group facial features&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>upper face&lt;/strong> features: 15 parameters&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Example&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-100" >&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%e6%88%aa%e5%b1%8f2021-02-19%2021.54.56.png" alt="截屏2021-02-19 21.54.56" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>lower face&lt;/strong> features: 9 parameters&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Example&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-19%2021.55.15.png" alt="截屏2021-02-19 21.55.15" style="zoom:67%;" />
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
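&lt;p>A minimal numpy sketch of this face coordinate system (the helper name and landmark values are illustrative): the origin is placed between the inner eye corners, the $x$-axis runs along the eye line, and all feature points are re-expressed in that frame, which normalizes away in-plane head rotation and translation.&lt;/p>

```python
import numpy as np

def to_face_frame(points, inner_left, inner_right):
    """Express landmark points in the face coordinate system:
    origin = midpoint of the inner eye corners, x-axis along the eye line,
    y-axis perpendicular to it."""
    origin = (inner_left + inner_right) / 2.0
    x_axis = inner_right - inner_left
    x_axis = x_axis / np.linalg.norm(x_axis)
    y_axis = np.array([-x_axis[1], x_axis[0]])  # perpendicular to x
    R = np.stack([x_axis, y_axis])              # rows = new basis vectors
    return (points - origin) @ R.T

left = np.array([40.0, 50.0])
right = np.array([80.0, 50.0])
landmarks = np.array([[60.0, 80.0]])  # e.g. a mouth-corner point
print(to_face_frame(landmarks, left, right))  # centered on x, 30 below eyes
```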
&lt;h4 id="au-recognition-by-neural-networks">AU Recognition by Neural Networks&lt;/h4>
&lt;ul>
&lt;li>Three layer neural networks (one hidden layer)&lt;/li>
&lt;li>Standard back-propagation method
&lt;ul>
&lt;li>Separate networks for upper- / lower face&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-19%2021.56.35.png" alt="截屏2021-02-19 21.56.35" style="zoom:80%;" />
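&lt;p>A numpy sketch of such a three-layer network (one hidden layer) trained with standard backpropagation. The layer sizes echo the 15 upper-face parameters and 7 outputs, but the data, hidden size, and learning rate are made up for illustration, not taken from the paper:&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in: 15 upper-face parameters in, one hidden layer, 7 sigmoid
# outputs (one score per AU); sizes and data are illustrative only.
n_in, n_hid, n_out = 15, 12, 7
W1 = rng.normal(0, 0.5, (n_in, n_hid))
W2 = rng.normal(0, 0.5, (n_hid, n_out))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = rng.normal(size=(64, n_in))      # fake feature-parameter vectors
Y = (X[:, :n_out] > 0) * 1.0         # fake multi-label AU targets

lr = 1.0
for _ in range(500):                 # standard full-batch backpropagation
    H = sigmoid(X @ W1)              # hidden activations
    P = sigmoid(H @ W2)              # per-AU output probabilities
    dP = (P - Y) / len(X)            # cross-entropy gradient w.r.t. logits
    dH = (dP @ W2.T) * H * (1 - H)   # backpropagate through the hidden layer
    W2 -= lr * H.T @ dP
    W1 -= lr * X.T @ dH

loss = -np.mean(Y * np.log(P + 1e-9) + (1 - Y) * np.log(1 - P + 1e-9))
print(loss)  # drops well below the ~0.69 chance-level cross-entropy
```

Separate networks of this shape would be trained for the upper and lower face, as in the figure above.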
&lt;h3 id="using-appearance-based-features--svm-2006">Using appearance-based features + SVM (2006)&lt;/h3>
&lt;p>&lt;strong>Automatic Recognition of Facial Actions in Spontaneous Expression&lt;/strong>&lt;sup id="fnref:2">&lt;a href="#fn:2" class="footnote-ref" role="doc-noteref">2&lt;/a>&lt;/sup>&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-100" >&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%e6%88%aa%e5%b1%8f2021-02-19%2022.40.45.png" alt="截屏2021-02-19 22.40.45" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h4 id="ru-facs-data-set">RU-FACS data set&lt;/h4>
&lt;ul>
&lt;li>Contains spontaneous expressions&lt;/li>
&lt;li>100 subjects&lt;/li>
&lt;/ul>
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-100" >&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%e6%88%aa%e5%b1%8f2021-02-19%2022.41.42.png" alt="截屏2021-02-19 22.41.42" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h3 id="using-deep-features-cnn--fusion-2013">Using Deep features (CNN) + fusion (2013)&lt;/h3>
&lt;h4 id="emotion-recognition-in-the-wild-challenge-emotiw">&lt;strong>Emotion Recognition in the Wild Challenge (EmotiW)&lt;/strong>&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>🎯 Goal: Move to more realistic, out-of-the-lab data&lt;/p>
&lt;/li>
&lt;li>
&lt;p>AFEW Dataset (Acted Facial Expressions in the Wild)&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Extracted from movies&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Annotated with six basic emotions&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Movie clips from 330 subjects, age range: 1-70&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Semi-automatic annotation pipeline&lt;/p>
&lt;ul>
&lt;li>Recommender system + manual annotation&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-100" >&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%e6%88%aa%e5%b1%8f2021-02-19%2022.45.20.png" alt="截屏2021-02-19 22.45.20" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h4 id="2013-winner">2013 Winner&lt;/h4>
&lt;p>&lt;strong>Combining Modality Specific Deep Neural Networks for Emotion Recognition in Video&lt;/strong>&lt;sup id="fnref:3">&lt;a href="#fn:3" class="footnote-ref" role="doc-noteref">3&lt;/a>&lt;/sup>&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-19%2022.47.35.png" alt="截屏2021-02-19 22.47.35" style="zoom:80%;" />
&lt;h5 id="convolutional-network">&lt;strong>Convolutional Network&lt;/strong>&lt;/h5>
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-100" >&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%e6%88%aa%e5%b1%8f2021-02-19%2022.49.28.png" alt="截屏2021-02-19 22.49.28" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;ul>
&lt;li>Inputs are images of size 40x40, cropped randomly&lt;/li>
&lt;li>Four layers: 3 convolutions followed by max or average pooling, and a fully-connected layer&lt;/li>
&lt;/ul>
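&lt;p>A PyTorch sketch of a network with this shape. Only the 40x40 input, the three conv-plus-pooling stages, and the fully-connected output follow the description; the channel counts, kernel sizes, and per-stage pooling choices are assumptions:&lt;/p>

```python
import torch
import torch.nn as nn

# Sketch of the EmotiW-style net: 3 conv stages with pooling, then a
# fully-connected layer producing 7 emotion scores.
net = nn.Sequential(
    nn.Conv2d(1, 16, 5), nn.ReLU(), nn.MaxPool2d(2),   # 40 -> 36 -> 18
    nn.Conv2d(16, 32, 5), nn.ReLU(), nn.MaxPool2d(2),  # 18 -> 14 -> 7
    nn.Conv2d(32, 48, 3), nn.ReLU(), nn.AvgPool2d(2),  # 7 -> 5 -> 2
    nn.Flatten(),
    nn.Linear(48 * 2 * 2, 7),                          # 7 emotion classes
)

x = torch.randn(1, 1, 40, 40)   # one randomly cropped 40x40 grayscale patch
out = net(x)
print(out.shape)  # one 7-dim score vector per frame
```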
&lt;h5 id="representing-video-sequence">Representing video sequence&lt;/h5>
&lt;ul>
&lt;li>CNN gives 7-dim output per frame&lt;/li>
&lt;li>Multiple frames are averaged into 10 vectors describing the sequence
&lt;ul>
&lt;li>For shorter sequences, frames / vectors get expanded (duplicated)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Results in 70-dim feature vector (10*7)&lt;/li>
&lt;li>Classification with SVM&lt;/li>
&lt;/ul>
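&lt;p>The sequence representation above can be sketched as follows (the function name and exact chunking/duplication details are illustrative):&lt;/p>

```python
import numpy as np

def sequence_descriptor(frame_probs, n_chunks=10):
    """Average per-frame 7-dim CNN outputs into n_chunks vectors and
    concatenate them into one fixed-length descriptor (10 * 7 = 70-dim)."""
    n = len(frame_probs)
    if n < n_chunks:  # shorter sequences: duplicate frames until long enough
        reps = int(np.ceil(n_chunks / n))
        frame_probs = np.repeat(frame_probs, reps, axis=0)
    chunks = np.array_split(frame_probs, n_chunks)
    return np.concatenate([c.mean(axis=0) for c in chunks])

probs = np.random.rand(37, 7)   # e.g. 37 frames, 7 class probabilities each
desc = sequence_descriptor(probs)
print(desc.shape)  # a 70-dim vector, ready for the SVM
```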
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-100" >&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%e6%88%aa%e5%b1%8f2021-02-19%2022.56.28.png" alt="截屏2021-02-19 22.56.28" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h5 id="other-features">Other Features&lt;/h5>
&lt;ul>
&lt;li>&amp;ldquo;Bag of Mouth&amp;rdquo;&lt;/li>
&lt;li>Audio-features&lt;/li>
&lt;/ul>
&lt;h4 id="typical-pipline">Typical Pipline&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>Face detection and alignment&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Extract various features and different representations&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Build multiple classifiers&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Fusion of results&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h4 id="other-applications">Other Applications&lt;/h4>
&lt;ul>
&lt;li>Pain Analysis&lt;/li>
&lt;li>Analysis of psychological disorders&lt;/li>
&lt;li>Workload / stress analysis&lt;/li>
&lt;li>Adaptive user interfaces&lt;/li>
&lt;li>Advertisement&lt;/li>
&lt;/ul>
&lt;div class="footnotes" role="doc-endnotes">
&lt;hr>
&lt;ol>
&lt;li id="fn:1">
&lt;p>Y. . -I. Tian, T. Kanade and J. F. Cohn, &amp;ldquo;Recognizing action units for facial expression analysis,&amp;rdquo; in IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 2, pp. 97-115, Feb. 2001, doi: 10.1109/34.908962.&amp;#160;&lt;a href="#fnref:1" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:2">
&lt;p>Littlewort, Gwen &amp;amp; Frank, Mark &amp;amp; Lainscsek, Claudia &amp;amp; Fasel, Ian &amp;amp; Movellan, Javier. (2006). Automatic Recognition of Facial Actions in Spontaneous Expressions. Journal of Multimedia. 1. 10.4304/jmm.1.6.22-35.&amp;#160;&lt;a href="#fnref:2" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:3">
&lt;p>Kahou, Samira Ebrahimi &amp;amp; Pal, Christopher &amp;amp; Bouthillier, Xavier &amp;amp; Froumenty, Pierre &amp;amp; Gulcehre, Caglar &amp;amp; Memisevic, Roland &amp;amp; Vincent, Pascal &amp;amp; Courville, Aaron &amp;amp; Bengio, Y. &amp;amp; Ferrari, Raul &amp;amp; Mirza, Mehdi &amp;amp; Jean, Sébastien &amp;amp; Carrier, Pierre-Luc &amp;amp; Dauphin, Yann &amp;amp; Boulanger-Lewandowski, Nicolas &amp;amp; Aggarwal, Abhishek &amp;amp; Zumer, Jeremie &amp;amp; Lamblin, Pascal &amp;amp; Raymond, Jean-Philippe &amp;amp; Wu, Zhenzhou. (2013). Combining modality specific deep neural networks for emotion recognition in video. ICMI 2013 - Proceedings of the 2013 ACM International Conference on Multimodal Interaction. 543-550. 10.1145/2522848.2531745.&amp;#160;&lt;a href="#fnref:3" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;/ol>
&lt;/div></description></item><item><title>Face</title><link>https://haobin-tan.netlify.app/docs/ai/computer-vision/face/</link><pubDate>Sat, 19 Dec 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/computer-vision/face/</guid><description/></item><item><title>Modern Face Recognition Overview</title><link>https://haobin-tan.netlify.app/docs/ai/computer-vision/face/modern-face-recognition/</link><pubDate>Sat, 19 Dec 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/computer-vision/face/modern-face-recognition/</guid><description>&lt;p>Face recognition is a series of several related problems:&lt;/p>
&lt;ol>
&lt;li>Face detection: Look at a picture and find all the faces in it&lt;/li>
&lt;li>Focus on each face and be able to understand that even if a face is turned in a weird direction or in bad lighting, it is still the same person.&lt;/li>
&lt;li>Be able to pick out unique features of the face that you can use to tell it apart from other people (like how big the eyes are, how long the face is, etc.)&lt;/li>
&lt;li>Compare the unique features of that face to all the people you already know to determine the person’s name.&lt;/li>
&lt;/ol>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/1*WxBM1lB5WzDjrDXYfi9gtw.gif" alt="Image for post">&lt;/p>
&lt;h2 id="step-1-face-detection">Step 1: Face detection&lt;/h2>
&lt;p>&lt;strong>Face detection = locate the faces in a photograph&lt;/strong>&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/1*izQuwClzcsJoCw5ybQC01Q.png" alt="Image for post">&lt;/p>
&lt;p>One of the methods for face detection is called &lt;strong>Histogram of Oriented Gradients (HOG)&lt;/strong>&lt;sup id="fnref:1">&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref">1&lt;/a>&lt;/sup> invented in 2005.&lt;/p>
&lt;p>To find faces in an image, we’ll start by making our image black and white because we don’t need color data to find faces:&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/1*osGdB2BNMThhk1rTwo07JA.jpeg" alt="Image for post" style="zoom:50%;" />
&lt;p>Then we’ll look at every single pixel in our image one at a time. For every single pixel, we want to &lt;strong>look at the pixels directly surrounding it&lt;/strong>:&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/1*RZS05e_5XXQdofdRx1GvPA.gif" alt="1*RZS05e_5XXQdofdRx1GvPA">&lt;/p>
&lt;p>Our goal is to figure out how dark the current pixel is compared to the pixels directly surrounding it. Then we want to draw an arrow showing in which direction the image is getting darker:&lt;/p>
&lt;figure>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/1*WF54tQnH1Hgpoqk-Vtf9Lg-20210204222605775.gif">&lt;figcaption>
&lt;h4>Looking at just this one pixel and the pixels around it. The image is getting darker towards the upper right.&lt;/h4>
&lt;/figcaption>
&lt;/figure>
&lt;p>If we repeat that process for every single pixel in the image, we will end up with every pixel being replaced by an arrow. These arrows are called &lt;strong>&lt;mark>gradients&lt;/mark>&lt;/strong> and they show the flow &lt;strong>from light to dark&lt;/strong> across the entire image:&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/1*oTdaElx_M-_z9c_iAwwqcw-20210204222934275.gif" alt="Image for post" style="zoom: 50%;" />
&lt;div class="flex px-4 py-3 mb-6 rounded-md bg-primary-100 dark:bg-primary-900">
&lt;span class="pr-3 pt-1 text-primary-600 dark:text-primary-300">
&lt;svg height="24" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24">&lt;path fill="none" stroke="currentColor" stroke-linecap="round" stroke-linejoin="round" stroke-width="1.5" d="m11.25 11.25l.041-.02a.75.75 0 0 1 1.063.852l-.708 2.836a.75.75 0 0 0 1.063.853l.041-.021M21 12a9 9 0 1 1-18 0a9 9 0 0 1 18 0m-9-3.75h.008v.008H12z"/>&lt;/svg>
&lt;/span>
&lt;span class="dark:text-neutral-300">&lt;p>Q: Why should we replace the pixels with gradients?&lt;/p>
&lt;p>A: If we analyze pixels directly, really dark images and really light images of the same person will have totally different pixel values. But by only considering the &lt;em>direction&lt;/em> in which brightness changes, both really dark images and really bright images will end up with exactly the &lt;em>same&lt;/em> representation. That makes the problem a lot easier to solve! &amp;#x1f44f;&lt;/p>
&lt;/span>
&lt;/div>
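&lt;p>This per-pixel step can be sketched with numpy: the arrow at each pixel is simply the direction of steepest &lt;em>decrease&lt;/em> in brightness, and, as the Q&amp;amp;A above notes, it is unchanged when the whole image gets darker:&lt;/p>

```python
import numpy as np

# A tiny image that gets darker from left to right
img = np.tile(np.linspace(1.0, 0.0, 8), (8, 1))

gy, gx = np.gradient(img)                   # brightness change along rows, cols
darker = np.degrees(np.arctan2(-gy, -gx))   # arrow toward decreasing brightness
print(darker[4, 4])  # ~0 degrees: the image gets darker toward the right

gy2, gx2 = np.gradient(0.5 * img)           # a uniformly darker copy
print(np.allclose(np.degrees(np.arctan2(-gy2, -gx2)), darker))  # same arrows
```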
&lt;p>But saving the gradient for every single pixel gives us way too much detail. It would be better if we could just see the basic flow of lightness/darkness at a higher level so we could see the basic pattern of the image. To do this:&lt;/p>
&lt;ol>
&lt;li>Break up the image into small squares of 16x16 pixels each&lt;/li>
&lt;li>In each square, count up how many gradients point in each major direction (how many point up, point up-right, point right, etc…).&lt;/li>
&lt;li>Replace that square in the image with the arrow directions that were the &lt;strong>strongest&lt;/strong>.&lt;/li>
&lt;/ol>
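&lt;p>A numpy sketch of those three steps (the 16x16 cell size comes from the text; the bin count and function name are illustrative, and real HOG keeps the full histogram rather than just the strongest direction):&lt;/p>

```python
import numpy as np

def dominant_directions(img, cell=16, n_bins=8):
    """Split the image into cell x cell squares, histogram the gradient
    directions in each square (weighted by strength), and keep the
    strongest direction per square."""
    gy, gx = np.gradient(img.astype(float))
    ang = np.arctan2(gy, gx)                # gradient direction per pixel
    mag = np.hypot(gx, gy)                  # gradient strength per pixel
    bins = ((ang + np.pi) / (2 * np.pi) * n_bins).astype(int) % n_bins
    h, w = img.shape
    out = np.zeros((h // cell, w // cell), dtype=int)
    for i in range(h // cell):
        for j in range(w // cell):
            b = bins[i*cell:(i+1)*cell, j*cell:(j+1)*cell].ravel()
            m = mag[i*cell:(i+1)*cell, j*cell:(j+1)*cell].ravel()
            hist = np.bincount(b, weights=m, minlength=n_bins)
            out[i, j] = hist.argmax()       # strongest direction wins
    return out

img = np.random.rand(64, 64)
print(dominant_directions(img).shape)  # one dominant arrow per 16x16 square
```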
&lt;p>The end result is we turn the original image into a very simple representation that captures the basic structure of a face in a simple way:&lt;/p>
&lt;figure>&lt;img src="https://miro.medium.com/max/700/1*uHisafuUw0FOsoZA992Jdg.gif">&lt;figcaption>
&lt;h4>The original image is turned into a HOG representation that captures the major features of the image regardless of image brightness.&lt;/h4>
&lt;/figcaption>
&lt;/figure>
&lt;p>To find faces in this HOG image, all we have to do is &lt;strong>find the part of our image that looks the most similar to a known HOG pattern&lt;/strong> that was extracted from a bunch of other training faces:&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/1*6xgev0r-qn4oR88FrW6fiA-20210204223929488.png" alt="1*6xgev0r-qn4oR88FrW6fiA" style="zoom:67%;" />
&lt;h2 id="step-2-posing-and-projecting-faces">Step 2: Posing and Projecting Faces&lt;/h2>
&lt;p>After isolating the faces in our image, we have to deal with the problem that faces turned different directions look totally different to a computer:&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/1*x-rg0aSpKOer1JF-TejYUg.png" alt="1*x-rg0aSpKOer1JF-TejYUg">&lt;/p>
&lt;p>To account for this, we will try to warp each picture so that &lt;strong>the eyes and lips are always in the same place in the image&lt;/strong>. This will make it a lot easier for us to compare faces in the next steps.&lt;/p>
&lt;p>To do this, we are going to use an algorithm called &lt;strong>face landmark estimation&lt;/strong> &lt;sup id="fnref:2">&lt;a href="#fn:2" class="footnote-ref" role="doc-noteref">2&lt;/a>&lt;/sup>. The basic idea is we will come up with 68 specific points (called &lt;em>landmarks&lt;/em>) that exist on every face — the top of the chin, the outside edge of each eye, the inner edge of each eyebrow, etc. Then we will train a machine learning algorithm to be able to find these 68 specific points on any face:&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/1*AbEg31EgkbXSQehuNJBlWg.png" alt="1*AbEg31EgkbXSQehuNJBlWg">&lt;/p>
&lt;p>Result of locating the 68 face landmarks on our test image:&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/1*xBJ4H2lbCMfzIfMrOm9BEQ-20210204224628597.jpeg" alt="1*xBJ4H2lbCMfzIfMrOm9BEQ" style="zoom:50%;" />
&lt;p>Now that we know where the eyes and mouth are, we&amp;rsquo;ll simply rotate, scale, and shear the image so that the eyes and mouth are centered as well as possible. We are only going to use basic image transformations like rotation and scaling that preserve parallel lines (called &lt;a href="https://en.wikipedia.org/wiki/Affine_transformation">affine transformations&lt;/a>):&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/1*igEzGcFn-tjZb94j15tCNA.png" alt="Image for post">&lt;/p>
&lt;p>Now, no matter how the face is turned, we are able to center the eyes and mouth in roughly the same position in the image. This will make our next step a lot more accurate. &amp;#x1f44f;&lt;/p>
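&lt;p>A numpy sketch of the warping step: fit the affine transform (by least squares) that maps detected landmarks onto canonical template positions. The three landmark coordinates below are made up for illustration; a real system would use all 68 landmarks:&lt;/p>

```python
import numpy as np

def fit_affine(src, dst):
    """Least-squares affine transform (rotation/scale/shear + translation)
    mapping landmark points src onto template points dst."""
    A = np.hstack([src, np.ones((len(src), 1))])   # rows of [x, y, 1]
    M, _, _, _ = np.linalg.lstsq(A, dst, rcond=None)
    return M  # apply with: np.hstack([pts, ones]) @ M

# Hypothetical landmarks (left eye, right eye, mouth center) on a tilted
# face, and the canonical positions we want them warped to.
src = np.array([[30.0, 40.0], [70.0, 30.0], [55.0, 75.0]])
dst = np.array([[30.0, 35.0], [70.0, 35.0], [50.0, 70.0]])

M = fit_affine(src, dst)
warped = np.hstack([src, np.ones((3, 1))]) @ M
print(np.allclose(warped, dst))  # the three landmarks line up exactly
```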
&lt;h2 id="step-3-encoding-faces">Step 3: Encoding Faces&lt;/h2>
&lt;p>The simplest approach to face recognition is to directly compare the unknown face we found in Step 2 with all the pictures we have of people that have already been tagged. When we find a previously tagged face that looks very similar to our unknown face, it must be the same person.&lt;/p>
&lt;p>What we need is a way to &lt;strong>extract a few basic measurements from each face&lt;/strong>. Then we could measure our unknown face the same way and find the known face with the closest measurements.&lt;/p>
&lt;h3 id="how-to-measure-a-face">How to measure a face?&lt;/h3>
&lt;p>The solution is to train a deep convolutional neural network which can generate 128 measurements (a.k.a. &lt;strong>Embedding&lt;/strong>) for each face &lt;sup id="fnref:3">&lt;a href="#fn:3" class="footnote-ref" role="doc-noteref">3&lt;/a>&lt;/sup>.&lt;/p>
&lt;p>The training process works by looking at 3 face images at a time:&lt;/p>
&lt;ol>
&lt;li>Load a training face image of a known person&lt;/li>
&lt;li>Load another picture of the &lt;strong>same&lt;/strong> known person&lt;/li>
&lt;li>Load a picture of a totally &lt;strong>different&lt;/strong> person&lt;/li>
&lt;/ol>
&lt;p>&lt;strong>Then the algorithm looks at the measurements it is currently generating for each of those three images. It then tweaks the neural network slightly so that it makes sure the measurements it generates for #1 and #2 are slightly closer while making sure the measurements for #2 and #3 are slightly further apart.&lt;/strong>&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/1*n1R8VMyDRw3RNO3JULYBpQ.png" alt="1*n1R8VMyDRw3RNO3JULYBpQ">&lt;/p>
&lt;p>After repeating this step millions of times for millions of images of thousands of different people, the neural network learns to reliably generate 128 measurements for each person. Any ten different pictures of the same person should give roughly the same measurements.&lt;/p>
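&lt;p>This tweak is exactly the triplet loss from the FaceNet paper cited above. A numpy sketch of the quantity being minimized (the margin value here is illustrative):&lt;/p>

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Pull the two pictures of the same person together, push the
    different person away by at least the margin."""
    d_pos = np.sum((anchor - positive) ** 2)   # same-person distance
    d_neg = np.sum((anchor - negative) ** 2)   # different-person distance
    return max(d_pos - d_neg + margin, 0.0)

rng = np.random.default_rng(1)
a = rng.normal(size=128)              # 128 measurements: person A, picture 1
p = a + 0.05 * rng.normal(size=128)   # person A, picture 2 (close by)
n = rng.normal(size=128)              # a different person (far away)

print(triplet_loss(a, p, n))  # 0.0: this triplet already satisfies the margin
```

During training, the network weights are nudged down the gradient of this loss, which is what makes measurements for #1 and #2 move closer and the different person move further apart.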
&lt;h3 id="encoding-face-image">Encoding face image&lt;/h3>
&lt;p>Once the network has been trained, it can generate measurements for any face, even ones it has never seen before. All we need to do ourselves is run our face images through their pre-trained network to get the 128 measurements for each face. Here are the measurements for our test image:&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/1*6kMMqLt4UBCrN7HtqNHMKw-20210204225909575.png" alt="Image for post">&lt;/p>
&lt;p>We don&amp;rsquo;t need to care which parts of the face these 128 numbers are measuring exactly. All we care about is that the network generates nearly the same numbers when looking at two different pictures of the same person.&lt;/p>
&lt;h2 id="step-4-finding-the-persons-name-from-the-encoding">Step 4: Finding the person’s name from the encoding&lt;/h2>
&lt;p>This last step is actually the easiest step in the whole process. All we have to do is find the person in our database of known people who has the &lt;em>closest&lt;/em> measurements to our test image.&lt;/p>
&lt;p>We can do that by using any basic machine learning classification algorithm (e.g. SVM). All we need to do is train a classifier that can take in the measurements from a new test image and tells which known person is the closest match.&lt;/p>
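&lt;p>A scikit-learn sketch of this last step. The embeddings below are synthetic stand-ins (clusters around random centers); in reality each 128-dim vector would come from the pre-trained encoding network:&lt;/p>

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Stand-in embeddings: ~20 pictures each of 3 known people
centers = rng.normal(size=(3, 128))
X = np.vstack([c + 0.1 * rng.normal(size=(20, 128)) for c in centers])
y = np.repeat(["Will Ferrell", "Chad Smith", "Jimmy Fallon"], 20)

clf = SVC(kernel="linear").fit(X, y)

# A new picture of the second person: encode it, then ask the classifier
test_embedding = centers[1] + 0.1 * rng.normal(size=128)
print(clf.predict([test_embedding])[0])
```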
&lt;h3 id="example">Example&lt;/h3>
&lt;p>Train a classifier with the embeddings of about 20 pictures each of Will Ferrell, Chad Smith, and Jimmy Fallon:&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/1*G6jxtXUxDYGY_orEPNzG9Q.jpeg" alt="1*G6jxtXUxDYGY_orEPNzG9Q">&lt;/p>
&lt;p>Then run the classifier on every frame of the famous youtube video of &lt;a href="https://www.youtube.com/watch?v=EsWHyBOk2iQ">Will Ferrell and Chad Smith pretending to be each other&lt;/a> on the Jimmy Fallon show:&lt;/p>
&lt;img src="https://miro.medium.com/max/800/1*woPojJbd6lT7CFZ9lHRVDw.gif" alt="Image for post" style="zoom:67%;" />
&lt;h2 id="i-classfab-fa-githubi-open-source-face-recognition-library">&lt;i class="fab fa-github">&lt;/i> Open Source Face Recognition library&lt;/h2>
&lt;ul>
&lt;li>&lt;strong>&lt;a href="https://github.com/ageitgey/face_recognition">face_recognition&lt;/a>&lt;/strong>: Recognize and manipulate faces from Python or from the command line with the world&amp;rsquo;s simplest face recognition library.&lt;/li>
&lt;li>&lt;strong>&lt;a href="https://github.com/timesler/facenet-pytorch">facenet-pytorch&lt;/a>&lt;/strong>: Face Recognition Using Pytorch&lt;/li>
&lt;/ul>
&lt;h2 id="reference">Reference&lt;/h2>
&lt;ul>
&lt;li>&lt;a href="https://medium.com/@ageitgey/machine-learning-is-fun-part-4-modern-face-recognition-with-deep-learning-c3cffc121d78">Machine Learning is Fun! Part 4: Modern Face Recognition with Deep Learning&lt;/a>&lt;/li>
&lt;/ul>
&lt;div class="footnotes" role="doc-endnotes">
&lt;hr>
&lt;ol>
&lt;li id="fn:1">
&lt;p>&lt;a href="http://lear.inrialpes.fr/people/triggs/pubs/Dalal-cvpr05.pdf">Histograms of Oriented Gradients for Human Detection&lt;/a>&amp;#160;&lt;a href="#fnref:1" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:2">
&lt;p>&lt;a href="http://www.csc.kth.se/~vahidk/papers/KazemiCVPR14.pdf">One Millisecond Face Alignment with an Ensemble of Regression Trees&lt;/a>&amp;#160;&lt;a href="#fnref:2" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:3">
&lt;p>&lt;a href="https://www.cv-foundation.org/openaccess/content_cvpr_2015/app/1A_089.pdf">FaceNet: A Unified Embedding for Face Recognition and Clustering&lt;/a>&amp;#160;&lt;a href="#fnref:3" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;/ol>
&lt;/div></description></item><item><title>Eigenface</title><link>https://haobin-tan.netlify.app/docs/ai/computer-vision/face/eigenface/</link><pubDate>Sat, 19 Dec 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/computer-vision/face/eigenface/</guid><description>&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-07%2016.22.02-20210209112428930.png" alt="截屏2021-02-07 16.22.02">&lt;/p>
&lt;h2 id="google-colab-notebook">Google Colab Notebook&lt;/h2>
&lt;p>&lt;a href="https://colab.research.google.com/drive/1ikyS3qAz1hehQyAKXFcquUthboo75Gai?usp=sharing">Open in Google Colab&lt;/a>&lt;/p></description></item></channel></rss>