Face Recognition: Deep Learning

DeepFace 1

Main idea

Train a deep neural network (7 layers, more than 120 million parameters) on 4 million identity-labeled face images, operating directly on RGB pixels.


  • Use 6 fiducial points for 2D warp

  • Then 67 points for 3D model

  • Frontalize the face for input to NN



  • The network output is fed into a $k$-way softmax, which produces a probability distribution over the class labels.


  • 🎯 Goal of training: maximize the probability of the correct class
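As a minimal sketch of the softmax output and the training goal, the snippet below turns a vector of class scores into a probability distribution and computes the negative log-probability of the correct class; minimizing that value maximizes the probability of the correct class. The scores and the choice of $k = 4$ are made up for illustration, and the function names are hypothetical.

```python
import numpy as np

def softmax(logits):
    """Convert a vector of k class scores into a probability distribution."""
    shifted = logits - np.max(logits)  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

def cross_entropy(logits, correct_class):
    """Negative log-probability of the correct class (lower is better)."""
    return -np.log(softmax(logits)[correct_class])

# Hypothetical scores for k = 4 identities; identity 2 is the true label.
logits = np.array([1.0, 0.5, 3.0, -1.0])
probs = softmax(logits)
loss = cross_entropy(logits, correct_class=2)
```

The loss shrinks towards zero as the score of the correct identity grows relative to the others.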



FaceNet

Main idea

  • Map images to a compact Euclidean space, where distances correspond to face similarity
  • Find $f(x) \in \mathbb{R}^d$ for image $x$, so that
    • $d^2(f(x_1), f(x_2)) \rightarrow \text{small}$, if $x_1$ and $x_2$ show the same identity
    • $d^2(f(x_1), f(x_3)) \rightarrow \text{large}$, if $x_1$ and $x_3$ show different identities
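To make the distance criterion concrete, here is a small sketch with hand-picked 2-D embeddings (real embeddings have many more dimensions). The embedding values, the helper names, and the decision threshold of 1.0 are all assumptions for illustration.

```python
import numpy as np

def squared_distance(u, v):
    """Squared Euclidean distance d^2 between two embedding vectors."""
    return float(np.sum((u - v) ** 2))

def same_identity(u, v, threshold=1.0):
    """Verification by thresholding the squared distance (threshold is hypothetical)."""
    return squared_distance(u, v) < threshold

# Made-up embeddings: x1 and x2 show the same person, x3 shows someone else.
f_x1 = np.array([0.9, 0.1])
f_x2 = np.array([0.8, 0.2])
f_x3 = np.array([-0.7, 0.6])

d12 = squared_distance(f_x1, f_x2)  # small: same identity
d13 = squared_distance(f_x1, f_x3)  # large: different identities
```

Once such an embedding is learned, verification reduces to comparing a distance against a threshold.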

System architecture


  • CNN: optimized embedding
  • Triplet-based loss function: training

Triplet loss

Every image triplet (anchor, positive, negative) should satisfy:

$$ \begin{array}{c} \left\|f\left(x_{i}^{a}\right)-f\left(x_{i}^{p}\right)\right\|_{2}^{2}+\alpha<\left\|f\left(x_{i}^{a}\right)-f\left(x_{i}^{n}\right)\right\|_{2}^{2} \\ \forall\left(f\left(x_{i}^{a}\right), f\left(x_{i}^{p}\right), f\left(x_{i}^{n}\right)\right) \in \mathcal{T} \end{array} $$

where

  • $x_i^a$: Anchor image

  • $x_i^p$: Positive image (same identity as the anchor)

  • $x_i^n$: Negative image (different identity from the anchor)

  • $\mathcal{T}$: Set of all possible triplets in the training set

  • $\alpha$: Margin enforced between positive and negative pairs

Total loss function to be minimized, where $[z]_{+} = \max(z, 0)$:

$$ L=\sum_{i=1}^{N}\left[\left\|f\left(x_{i}^{a}\right)-f\left(x_{i}^{p}\right)\right\|_{2}^{2}-\left\|f\left(x_{i}^{a}\right)-f\left(x_{i}^{n}\right)\right\|_{2}^{2}+\alpha\right]_{+} $$
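The loss can be evaluated on a batch of embeddings as in the sketch below. Each term is clamped at zero (the hinge), so triplets that already satisfy the margin contribute nothing. The function name, the toy embeddings, and the margin value `alpha=0.2` are assumptions for illustration.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, alpha=0.2):
    """Sum of hinged triplet terms over N triplets; inputs have shape (N, d)."""
    d_ap = np.sum((anchor - positive) ** 2, axis=1)  # squared anchor-positive distances
    d_an = np.sum((anchor - negative) ** 2, axis=1)  # squared anchor-negative distances
    return float(np.sum(np.maximum(d_ap - d_an + alpha, 0.0)))

# Two toy triplets in a 2-D embedding space.
anchor = np.array([[0.0, 0.0], [0.0, 0.0]])
positive = np.array([[0.1, 0.0], [0.5, 0.0]])
negative = np.array([[1.0, 0.0], [0.6, 0.0]])

loss = triplet_loss(anchor, positive, negative)
```

In this example the first triplet already satisfies the margin and adds zero; only the second triplet, whose negative is barely farther away than its positive, contributes to the loss.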

Triplet selection

  • Online generation: triplets are formed within each mini-batch

  • Use all anchor-positive pairs of the mini-batch, but select only semi-hard negatives

    $\rightarrow$ Select $x_i^n$ such that

    $$ \left\|f\left(x_{i}^{a}\right)-f\left(x_{i}^{p}\right)\right\|_{2}^{2}<\left\|f\left(x_{i}^{a}\right)-f\left(x_{i}^{n}\right)\right\|_{2}^{2} $$
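A possible way to pick semi-hard negatives for one anchor-positive pair is sketched below. In addition to the condition above, the sketch assumes an upper bound of $\alpha$ beyond the anchor-positive distance, since negatives farther away than that already satisfy the margin and produce zero loss. The function name, the toy embeddings, and `alpha=0.2` are hypothetical.

```python
import numpy as np

def semi_hard_negatives(anchor, positive, candidates, alpha=0.2):
    """Indices of candidate negatives that are farther from the anchor than
    the positive, but still within the margin alpha (assumed upper bound)."""
    d_ap = np.sum((anchor - positive) ** 2)
    d_an = np.sum((candidates - anchor) ** 2, axis=1)
    mask = (d_an > d_ap) & (d_an < d_ap + alpha)
    return np.flatnonzero(mask)

anchor = np.array([0.0, 0.0])
positive = np.array([0.5, 0.0])      # squared distance 0.25
candidates = np.array([
    [0.3, 0.0],                      # 0.09: closer than the positive (hard)
    [0.6, 0.0],                      # 0.36: semi-hard
    [1.0, 0.0],                      # 1.00: beyond the margin (easy)
])

idx = semi_hard_negatives(anchor, positive, candidates)
```

Here only the second candidate qualifies.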


Results

  • LFW: 99.63% $\pm$ 0.09 verification accuracy
  • YouTube Faces DB: 95.12% $\pm$ 0.39 verification accuracy

Deep Face Recognition 3

Key Questions

  • Can large scale datasets be built with minimal human intervention? Yes!

  • Can we propose a convolutional neural network which can compete with that of internet giants like Google and Facebook? Yes!

Dataset Collection

  1. Candidate list generation: Finding names of celebrities

    • Tap the knowledge on the web
    • 5000 identities
  2. Manual verification of celebrities: Finding Popular Celebrities

    • Collect representative images for each celebrity

    • 200 images/identity

    • Remove people with low representation on Google.

    • Remove overlap with public benchmarks

    • 2622 celebrities for the final dataset

  3. Rank image sets

    • Download 2000 images per identity
    • Search by appending the keyword “actor” to the name
    • Train a classifier using the data obtained in the previous step
    • Rank the 2000 images and select the top 1000
    • Approx. 2.6 million images of 2622 celebrities
  4. Near duplicate removal

    • VLAD descriptor based near duplicate removal
  5. Manual filtering

    • Curating the dataset further using manual checks
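The ranking step (step 3) and the resulting dataset size can be illustrated with a short sketch. The classifier scores here are random stand-ins, and `select_top_k` is a hypothetical helper; the actual classifier and features are not specified in these notes.

```python
import numpy as np

def select_top_k(scores, k):
    """Return the indices of the k highest-scoring images, best first."""
    order = np.argsort(scores)[::-1]
    return order[:k]

rng = np.random.default_rng(0)
scores = rng.random(2000)            # stand-in classifier scores for one identity
keep = select_top_k(scores, 1000)    # keep the top 1000 of 2000 images

total_images = 2622 * len(keep)      # 2,622,000, i.e. approx. 2.6 million
```

Near-duplicate removal and manual filtering (steps 4 and 5) then prune this set further.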

Convolutional Neural Network

  • The “Very Deep” Architecture

    • 3 x 3 Convolution Kernels (Very small)

    • Conv. Stride 1 px.

    • ReLU non-linearity

    • No local contrast normalisation

    • 3 Fully connected layers

  • Training

    • Random Gaussian Initialization

    • Stochastic gradient descent with backpropagation

    • Batch Size: 256

    • Incremental FC layer training

    • Learning Task Specific Embedding

      • Learning the embedding by minimizing the triplet loss

        $$ \sum_{(a, p, n) \in T} \max \left\{0, \alpha-\left\|\mathbf{x}_{a}-\mathbf{x}_{n}\right\|_{2}^{2}+\left\|\mathbf{x}_{a}-\mathbf{x}_{p}\right\|_{2}^{2}\right\} $$
      • Learning a projection from 4096 to 1024 dimensions

      • Online triplet formation at the beginning of each iteration

      • Fine-tuned on the target datasets

      • Only the projection layer is learned
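The embedding step can be sketched as a single linear projection applied to fixed 4096-dimensional CNN features, scored with the triplet loss above. This is only an illustration under stated assumptions: the features are random stand-ins, the L2 normalisation before projection and the margin `alpha=0.2` are assumptions, and the function names are hypothetical.

```python
import numpy as np

D, L = 4096, 1024  # from the text: 4096-d CNN features projected to 1024-d

def project(features, W):
    """L2-normalise the CNN features (assumption) and apply the projection W (L x D)."""
    normed = features / np.linalg.norm(features, axis=1, keepdims=True)
    return normed @ W.T

def triplet_hinge(xa, xp, xn, alpha=0.2):
    """Sum over triplets of max{0, alpha - ||xa - xn||^2 + ||xa - xp||^2}."""
    d_ap = np.sum((xa - xp) ** 2, axis=1)
    d_an = np.sum((xa - xn) ** 2, axis=1)
    return float(np.sum(np.maximum(0.0, alpha - d_an + d_ap)))

rng = np.random.default_rng(0)
W = rng.normal(scale=0.01, size=(L, D))  # the only trainable parameters in this stage
feats = rng.normal(size=(3, 8, D))       # stand-in features for 8 (a, p, n) triplets
xa, xp, xn = (project(f, W) for f in feats)
loss = triplet_hinge(xa, xp, xn)
```

Because the CNN features stay fixed, a gradient step on this loss would update only `W`, which matches the note that only the projection layer is learned.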

