<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Face | Haobin Tan</title><link>https://haobin-tan.netlify.app/tags/face/</link><atom:link href="https://haobin-tan.netlify.app/tags/face/index.xml" rel="self" type="application/rss+xml"/><description>Face</description><generator>Hugo Blox Builder (https://hugoblox.com)</generator><language>en-us</language><lastBuildDate>Thu, 18 Feb 2021 00:00:00 +0000</lastBuildDate><image><url>https://haobin-tan.netlify.app/media/icon_hu7d15bc7db65c8eaf7a4f66f5447d0b42_15095_512x512_fill_lanczos_center_3.png</url><title>Face</title><link>https://haobin-tan.netlify.app/tags/face/</link></image><item><title>Face Detection: Color-Based</title><link>https://haobin-tan.netlify.app/docs/ai/computer-vision/cv-lecture/03-face-detection-color/</link><pubDate>Fri, 06 Nov 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/computer-vision/cv-lecture/03-face-detection-color/</guid><description>&lt;h2 id="tldr">TL;DR&lt;/h2>
&lt;ul>
&lt;li>
&lt;p>Different color spaces and classifiers can be used&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Models: histograms, Gaussian Models, Mixture of Gaussians Model&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Histogram-backprojection / Histogram matching&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Bayes classifier&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Discriminative Classifiers (ANN, SVM)&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Bayesian classifier and ANN seem to work well&lt;/p>
&lt;ul>
&lt;li>Sufficient training data is needed for modeling the pdf, in particular for the Bayesian approach (positive &amp;amp; negative pdfs are learned)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Advantages: Fast, rotation &amp;amp; scale invariant, robust against occlusions&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Disadvantages:&lt;/p>
&lt;ul>
&lt;li>Affected by illumination&lt;/li>
&lt;li>Cannot distinguish head and hands&lt;/li>
&lt;li>Skin-colored objects in the background problematic&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Metric: ROC curve used to compare classification results / methods&lt;/p>
&lt;/li>
&lt;/ul>
&lt;hr>
&lt;h2 id="color-based-face-detection-overview">Color-based face detection overview&lt;/h2>
&lt;p>💡 &lt;strong>Idea: human skin has a consistent color that is distinct from that of many other objects&lt;/strong>&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2020-11-10%2014.57.37.png" alt="截屏2020-11-10 14.57.37">&lt;/p>
&lt;p>Possible approach:&lt;/p>
&lt;ol>
&lt;li>Find skin-colored pixels&lt;/li>
&lt;li>Group skin-colored pixels (and apply some heuristics) to find the face&lt;/li>
&lt;/ol>
&lt;h2 id="color">Color&lt;/h2>
&lt;ul>
&lt;li>&lt;strong>Grayscale&lt;/strong> Image: Each pixel represented by &lt;strong>one&lt;/strong> number (typically integer between 0 and 255)&lt;/li>
&lt;li>&lt;strong>Color&lt;/strong> image: Pixels represented by &lt;strong>three&lt;/strong> numbers&lt;/li>
&lt;/ul>
&lt;p>Different representations exist &amp;ndash;&amp;gt; &amp;ldquo;Color Spaces&amp;rdquo;&lt;/p>
&lt;h3 id="color-spaces">Color spaces&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>RGB&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>most widely used&lt;/p>
&lt;/li>
&lt;li>
&lt;p>specifies colors in terms of the primary colors &lt;strong>red (R), green (G), and blue (B)&lt;/strong>&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2020-11-10%2015.00.08-20201110184617048.png" alt="截屏2020-11-10 15.00.08">&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>HSV/HSI&lt;/strong>: &lt;strong>hue (H)&lt;/strong>, &lt;strong>saturation (S)&lt;/strong> and &lt;strong>value(V)/intensity (I)&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Closely related to human perception (hue, colorfulness and brightness)&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2020-11-10%2017.27.38.png" alt="截屏2020-11-10 17.27.38">&lt;/p>
&lt;ul>
&lt;li>Hue: &amp;ldquo;color&amp;rdquo;&lt;/li>
&lt;li>Saturation: how &amp;ldquo;pure&amp;rdquo; the color is&lt;/li>
&lt;li>Value: &amp;ldquo;lightness&amp;rdquo;&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Class Y spaces&lt;/strong>: YCbCr (Digital Video), YIQ (NTSC), YUV (PAL)&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Y channel contains brightness, other two channels store chrominance (U=B-Y, V=R-Y)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Conversion from RGB to Yxx is a linear transformation&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2020-11-10%2018.18.27.png" alt="截屏2020-11-10 18.18.27">&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Perceptually uniform spaces&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Perceived color difference is proportional to the difference in color values&lt;/li>
&lt;li>Euclidean distance can be used for color comparison&lt;/li>
&lt;/ul>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2020-11-10%2018.19.07.png" alt="截屏2020-11-10 18.19.07">&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Chromatic Color Spaces&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Two color channels containing chrominance (color) information&lt;/p>
&lt;ul>
&lt;li>HS (taken from HSV)&lt;/li>
&lt;li>UV (taken from YUV)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Normalized rg from RGB:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>r = R / (R+G+B)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>g = G / (R+G+B)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>b = B / (R+G+B)&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Sometimes it is argued that chromatic skin color models are more robust&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
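&lt;p>The normalized-rg conversion above can be sketched in NumPy (function name and array shapes are illustrative):&lt;/p>

```python
import numpy as np

def rgb_to_normalized_rg(image):
    """Convert an HxWx3 RGB image to normalized rg chromaticity.

    r = R/(R+G+B), g = G/(R+G+B); b = B/(R+G+B) is redundant
    since r + g + b = 1.
    """
    rgb = image.astype(np.float64)
    total = rgb.sum(axis=2, keepdims=True)
    total[total == 0] = 1.0  # avoid division by zero on black pixels
    normalized = rgb / total
    return normalized[..., 0], normalized[..., 1]
```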
&lt;h4 id="problems">Problems&lt;/h4>
&lt;ul>
&lt;li>Reflected color depends on spectrum of the light source (and properties of the object / surface)&lt;/li>
&lt;li>If the light source / illumination changes, the reflected color signal changes!!! 🤪&lt;/li>
&lt;/ul>
&lt;h2 id="how-to-model-skin-color">How to model skin color?&lt;/h2>
&lt;ul>
&lt;li>
&lt;p>&lt;a href="#histogram-as-skin-color-model">Non-parametric models: typically histograms&lt;/a>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;a href="#parametric-models">Parametric models&lt;/a>&lt;/p>
&lt;ul>
&lt;li>Gaussian Model&lt;/li>
&lt;li>Gaussian Mixture Model&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Or just learn decision boundaries between classes (&lt;a href="#discriminative-models--classifiers">discriminative model&lt;/a>)&lt;/p>
&lt;ul>
&lt;li>ANN, SVM, &amp;hellip;&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="histogram-as-skin-color-model">Histogram as skin color model&lt;/h3>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2020-11-10%2018.34.57.png" alt="截屏2020-11-10 18.34.57">&lt;/p>
&lt;ul>
&lt;li>👍 Advantages: Works very well in practice&lt;/li>
&lt;li>👎 Disadvantages
&lt;ul>
&lt;li>Memory size quickly gets high&lt;/li>
&lt;li>A large number of labelled skin and non-skin samples is needed!&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h4 id="histogram-backprojection">Histogram Backprojection&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>The simplest (and fastest) way to utilize histogram information&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Each pixel in the backprojection is set to the value of the (skin-color) histogram bin indexed by the color of the respective pixel&lt;/p>
&lt;ul>
&lt;li>A color $x$ is considered as skin color if $H\_{+}(x) > \theta$&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>E.g.&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-07-22%2022.20.33.png" alt="截屏2021-07-22 22.20.33">&lt;/p>
&lt;/li>
&lt;/ul>
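&lt;p>A minimal backprojection sketch, assuming a pre-computed normalized skin-color histogram over a two-channel chromaticity image (names and bin count are illustrative):&lt;/p>

```python
import numpy as np

def backproject(image, skin_hist, bins=32, theta=0.1):
    """Histogram backprojection: each pixel gets the value of the
    skin-color histogram bin indexed by its color; pixels whose
    bin value exceeds theta are classified as skin.

    image: HxWx2 chromaticity image with values in [0, 1)
    skin_hist: bins x bins normalized skin-color histogram
    """
    idx = np.clip((image * bins).astype(int), 0, bins - 1)
    bp = skin_hist[idx[..., 0], idx[..., 1]]  # look up H_+(x) per pixel
    return bp, bp > theta                     # backprojection and skin mask
```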
&lt;h4 id="histogram-matching">Histogram Matching&lt;/h4>
&lt;ul>
&lt;li>Backprojection
&lt;ul>
&lt;li>is good when the color distribution of the target is monomodal.&lt;/li>
&lt;li>is not optimal when the target is multi-colored! &amp;#x1f622;&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>🔧 Solution: Build a histogram of the image within the search window, and compare it to the target histogram.
&lt;ul>
&lt;li>distance metrics for histograms, e.g.:
&lt;ul>
&lt;li>
&lt;p>Bhattacharyya distance&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Histogram intersection&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Earth mover&amp;rsquo;s distance, &amp;hellip;&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
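&lt;p>The first two metrics can be sketched for normalized histograms (function names are illustrative):&lt;/p>

```python
import numpy as np

def bhattacharyya(h1, h2):
    """Bhattacharyya distance between two normalized histograms
    (0 for identical distributions)."""
    bc = np.sum(np.sqrt(h1 * h2))  # Bhattacharyya coefficient
    return np.sqrt(max(0.0, 1.0 - bc))

def intersection(h1, h2):
    """Histogram intersection: 1.0 for identical normalized histograms."""
    return np.sum(np.minimum(h1, h2))
```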
&lt;h4 id="histogram-backprojection-vs-matching">Histogram Backprojection vs. Matching&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>Histogram Backprojection&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Compares color of a single pixel with color model&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Fast and simple&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Can only cope well with mono-modal distributions&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Sufficient for skin-color classification&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Histogram Matching / Intersection&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Compares color histogram of image patch with color model&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Better performance&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Can cope with multi-modal distributions&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Computationally expensive&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="parametric-models">Parametric models&lt;/h3>
&lt;h4 id="gaussian-density-models">Gaussian Density Models&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>Gaussian Densities&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Assume that the distribution of skin colors p(x) has a parametric functional form&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Most common function: Gaussian function $\mathrm{G}(\mathbf{x} ; \mu, \mathbf{C})$
&lt;/p>
$$
p(x | \text{skin})=G(x ; \mu, C)=\frac{1}{(2 \pi)^{d / 2}|C|^{1 / 2} }\exp \left\\{-1 / 2(x-\mu)^{\top} C^{-1}(x-\mu)\right\\}
$$
&lt;ul>
&lt;li>Mean $\mu$ and covariance matrix $C$ are estimated from a training set of skin colors $S = \{x\_1, x\_2, \ldots, x\_N\}$:
&lt;ul>
&lt;li>$\mu = E\{x\}$&lt;/li>
&lt;li>$C = E\{(\boldsymbol{x}-\mu)(\boldsymbol{x}-\mu)^T\}$&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>A color is considered as skin color if&lt;/p>
&lt;ul>
&lt;li>$p(x|\text{skin}) > \theta$&lt;/li>
&lt;li>$p(x|\text{skin}) > p(x|\text{non-skin})$&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
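&lt;p>A sketch of fitting and evaluating the Gaussian skin-color model with NumPy (names are illustrative; assumes the skin samples are given as an N x d array):&lt;/p>

```python
import numpy as np

def fit_gaussian(samples):
    """Estimate mean and covariance from skin-color samples (N x d)."""
    mu = samples.mean(axis=0)
    C = np.cov(samples, rowvar=False)
    return mu, C

def gaussian_density(x, mu, C):
    """Evaluate p(x|skin) = G(x; mu, C) for a single color vector x."""
    d = mu.shape[0]
    diff = x - mu
    norm = 1.0 / ((2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(C)))
    return norm * np.exp(-0.5 * diff @ np.linalg.inv(C) @ diff)
```

A color is then classified as skin if this density exceeds a threshold $\theta$.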
&lt;h4 id="mixture-of-gaussian-models">Mixture of Gaussian Models&lt;/h4>
$$
p(x)=\sum\_{i=1}^{K} \pi\_{i} G\left(x, \mu\_{i}, C\_{i}\right)
$$
&lt;ul>
&lt;li>
&lt;p>Parameter set $\Phi$ can be estimated using the &lt;strong>EM&lt;/strong> algorithm&lt;/p>
&lt;ul>
&lt;li>Iteratively changes parameters so as to maximize the log-likelihood of the training set:
$$
L=\log \prod\_{i=1}^{N} p\left(x\_{i} \mid \Phi\right)
$$&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>A color is considered as skin color if&lt;/p>
&lt;ul>
&lt;li>$p(x|\text{skin}) > \theta$&lt;/li>
&lt;li>$p(x|\text{skin}) > p(x|\text{non-skin})$&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
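&lt;p>A sketch of the mixture model, assuming scikit-learn&amp;rsquo;s EM implementation is available (names and the choice of K are illustrative):&lt;/p>

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_skin_gmm(skin_pixels, K=3, seed=0):
    """Fit p(x|skin) as a mixture of K Gaussians via EM."""
    gmm = GaussianMixture(n_components=K, covariance_type='full',
                          random_state=seed)
    gmm.fit(skin_pixels)
    return gmm

def is_skin(gmm, x, theta):
    """score_samples returns log p(x); compare exp(log p(x)) to theta."""
    return np.exp(gmm.score_samples(np.atleast_2d(x)))[0] > theta
```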
&lt;h4 id="bayes-classifier">Bayes Classifier&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>Skin Classification using &lt;strong>Bayes Decision Rule&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Minimum cost decision rule&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Classify pixel to skin class if $P(\text{Skin} | x)>P(\text{Non-Skin} | x)$&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Decision Rule:
&lt;/p>
$$
\frac{p(\mathbf{x} \mid \text {Skin})}{p(\mathbf{x} \mid \text {Non-Skin})} \geq \frac{P(\text {Non-Skin})}{P(\text {Skin})}
$$
&lt;/li>
&lt;li>
&lt;p>The class-conditional densities $p(x|\omega)$ can be estimated from the corresponding histograms:
&lt;/p>
$$
p\left(x \mid \omega\_{i}\right)=h\_{i}(x) / \sum\_{x} h\_{i}(x)
$$
&lt;ul>
&lt;li>$h\_i(x)$: count of pixels from class $\omega\_{i}$ that have value $x$&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
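&lt;p>The decision rule above can be sketched directly on the class histograms (names and the prior P(Skin) are illustrative):&lt;/p>

```python
import numpy as np

def bayes_skin_mask(indices, h_skin, h_nonskin, p_skin=0.3):
    """Bayes decision rule on color histograms.

    indices: integer color-bin index per pixel (HxW)
    h_skin, h_nonskin: raw pixel counts h_i(x) per color bin
    Classify as skin if p(x|Skin)/p(x|Non-Skin) >= P(Non-Skin)/P(Skin).
    """
    eps = 1e-12
    p_x_skin = h_skin / max(h_skin.sum(), 1)           # p(x | skin)
    p_x_nonskin = h_nonskin / max(h_nonskin.sum(), 1)  # p(x | non-skin)
    ratio = p_x_skin[indices] / (p_x_nonskin[indices] + eps)
    return ratio >= (1 - p_skin) / p_skin
```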
&lt;h3 id="discriminative-models--classifiers">Discriminative Models / Classifiers&lt;/h3>
&lt;ul>
&lt;li>Artificial Neural Networks&lt;/li>
&lt;li>Support Vector Machine&lt;/li>
&lt;/ul>
&lt;h2 id="performance-measures">Performance Measures&lt;/h2>
&lt;h3 id="for-classification">For classification&lt;/h3>
&lt;p>When comparing recognition hypotheses with ground-truth annotations, one has to consider four cases:&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/confusion-matrix.png" alt="Measuring Performance: The Confusion Matrix – Glass Box" style="zoom: 40%;" />
&lt;blockquote>
&lt;p>More see: &lt;a href="https://haobin-tan.netlify.app/docs/ai/machine-learning/ml-fundamentals/evaluation/">Evaluation&lt;/a>&lt;/p>
&lt;/blockquote>
&lt;h4 id="roc-receiver-operating-characteristic">ROC (Receiver Operating Characteristic)&lt;/h4>
&lt;ul>
&lt;li>Used for the task of classification&lt;/li>
&lt;li>Measures the trade-off between true positive rate and false positive rate&lt;/li>
&lt;/ul>
$$
\begin{array}{l}
\text { true positive rate }=\frac{\mathrm{TP}}{\mathrm{Pos}}=\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}} \\\\
\text { false positive rate }=\frac{\mathrm{FP}}{\mathrm{Neg}}=\frac{\mathrm{FP}}{\mathrm{FP}+\mathrm{TN}}
\end{array}
$$
&lt;ul>
&lt;li>
&lt;p>Each prediction hypothesis generally has an associated probability value or score&lt;/p>
&lt;/li>
&lt;li>
&lt;p>The performance values can therefore be plotted into a graph for each possible score as a threshold&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Example:&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2020-11-12%2023.27.18.png" alt="截屏2020-11-12 23.27.18">&lt;/p>
&lt;/li>
&lt;/ul>
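&lt;p>The true/false positive rates above can be computed for every candidate threshold as follows (a sketch; names are illustrative):&lt;/p>

```python
import numpy as np

def roc_points(scores, labels):
    """Compute (FPR, TPR) pairs by sweeping every score as a threshold.

    scores: classifier score per sample; labels: 1 = positive, 0 = negative.
    """
    pos = labels.sum()
    neg = labels.size - pos
    points = []
    for t in np.unique(scores):
        predicted = scores >= t
        tp = np.logical_and(predicted, labels == 1).sum()
        fp = np.logical_and(predicted, labels == 0).sum()
        points.append((fp / neg, tp / pos))  # (false pos. rate, true pos. rate)
    return points
```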
&lt;h3 id="skin-color-analysis-and-comparison">Skin-color: Analysis and Comparison&lt;/h3>
&lt;p>Conclusions &lt;sup id="fnref:1">&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref">1&lt;/a>&lt;/sup>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Bayesian approach and MLP worked best&lt;/p>
&lt;ul>
&lt;li>Bayesian approach needs much more memory&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Approach is largely unaffected by choice of color space, but&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Results degraded when only chrominance channels were used&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h2 id="from-skin-colored-pixels-to-faces">From Skin-Colored Pixels to Faces&lt;/h2>
&lt;ul>
&lt;li>
&lt;p>Skin-colored pixels need to be grouped into object representations&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-11-13%2014.56.21.png" alt="截屏2020-11-13 14.56.21" style="zoom:80%;" />
&lt;/li>
&lt;li>
&lt;p>🔴 Problems:&lt;/p>
&lt;ul>
&lt;li>skin-colored background,&lt;/li>
&lt;li>further skin-colored body parts (hands, arms, &amp;hellip;),&lt;/li>
&lt;li>Noise, &amp;hellip;&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="perceptual-grouping">Perceptual Grouping&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Morphological Operators&lt;/strong>: Operators performing an action on shapes, where both the input and the output are binary images.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Threshold each pixel&amp;rsquo;s skin affiliation &amp;ndash;&amp;gt; Binary Image&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2020-11-13%2014.58.11.png" alt="截屏2020-11-13 14.58.11">&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Morphological Erosion&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;em>Remove&lt;/em> pixels from edges of objects&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Set pixel value to &lt;strong>min&lt;/strong> value of surrounding pixels&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2020-11-13%2015.00.53.png" alt="截屏2020-11-13 15.00.53">&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Morphological Dilation&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;em>Add&lt;/em> pixels to edges of objects&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Set pixel value to &lt;strong>max&lt;/strong> value of surrounding pixels&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2020-11-13%2015.41.11.png" alt="截屏2020-11-13 15.41.11">&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Morphological Opening&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Apply erosion, then dilation&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2020-11-13%2015.42.38.png" alt="截屏2020-11-13 15.42.38">&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Goal:&lt;/p>
&lt;ul>
&lt;li>Smooth outline&lt;/li>
&lt;li>Open small bridges&lt;/li>
&lt;li>Eliminate outliers&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Morphological Closing&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Apply dilation, then erosion&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2020-11-13%2015.45.25.png" alt="截屏2020-11-13 15.45.25">&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Goal:&lt;/p>
&lt;ul>
&lt;li>Smooth inner edges&lt;/li>
&lt;li>Connect small distances&lt;/li>
&lt;li>Fill unwanted holes&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Apply morphological closing, then morphological opening&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Resulting image is reduced to connected regions of skin color (blobs)&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2020-11-13%2015.59.57.png" alt="截屏2020-11-13 15.59.57">&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
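&lt;p>A pure-NumPy sketch of the min/max formulation of erosion and dilation (with a 3x3 neighborhood), combined into closing followed by opening as above (names are illustrative):&lt;/p>

```python
import numpy as np

def erode(mask):
    """Erosion: each pixel becomes the min of its 3x3 neighborhood
    (zero padding, so objects touching the border shrink too)."""
    padded = np.pad(mask, 1, constant_values=0)
    stack = [padded[i:i + mask.shape[0], j:j + mask.shape[1]]
             for i in range(3) for j in range(3)]
    return np.min(stack, axis=0)

def dilate(mask):
    """Dilation: each pixel becomes the max of its 3x3 neighborhood."""
    padded = np.pad(mask, 1, constant_values=0)
    stack = [padded[i:i + mask.shape[0], j:j + mask.shape[1]]
             for i in range(3) for j in range(3)]
    return np.max(stack, axis=0)

def clean_skin_mask(mask):
    """Closing (dilate, then erode) followed by opening (erode, then dilate)."""
    closed = erode(dilate(mask))
    return dilate(erode(closed))
```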
&lt;h3 id="from-skin-blobs-to-faces">From Skin Blobs To Faces&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>Goal: align bounding box around face candidate&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2020-11-13%2016.01.23.png" alt="截屏2020-11-13 16.01.23">&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Important for:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Face Recognition&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Head Pose Estimation&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Different approaches:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Choose cluster with biggest size&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Ellipse fitting (approximate face region by ellipse)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Heuristics to distinguish between different skin clusters&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Use temporal information (tracking)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Facial Feature Detection&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&amp;hellip;&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;div class="footnotes" role="doc-endnotes">
&lt;hr>
&lt;ol>
&lt;li id="fn:1">
&lt;p>S. L. Phung, A. Bouzerdoum and D. Chai, &amp;ldquo;Skin segmentation using color pixel classification: analysis and comparison,&amp;rdquo; in IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 1, pp. 148-154, Jan. 2005, doi: 10.1109/TPAMI.2005.17.&amp;#160;&lt;a href="#fnref:1" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;/ol>
&lt;/div></description></item><item><title>Face Detection: Neural-Network-Based</title><link>https://haobin-tan.netlify.app/docs/ai/computer-vision/cv-lecture/04-face-detection-ann/</link><pubDate>Fri, 13 Nov 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/computer-vision/cv-lecture/04-face-detection-ann/</guid><description>&lt;h2 id="motivation">Motivation&lt;/h2>
&lt;ul>
&lt;li>Idea: Use a search-window to scan over an image&lt;/li>
&lt;li>Train a classifier to decide whether the search window contains a face or not&lt;/li>
&lt;/ul>
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-100" >&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%e6%88%aa%e5%b1%8f2020-11-13%2016.16.57.png" alt="截屏2020-11-13 16.16.57" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h2 id="detection">Detection&lt;/h2>
&lt;h3 id="simple-neuron-model">Simple neuron model&lt;/h3>
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-100" >&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%e6%88%aa%e5%b1%8f2020-11-13%2016.20.47.png" alt="截屏2020-11-13 16.20.47" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h3 id="topologies">Topologies&lt;/h3>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-11-13%2016.21.15.png" alt="截屏2020-11-13 16.21.15" style="zoom:67%;" />
&lt;h3 id="parameters">Parameters&lt;/h3>
&lt;p>Adjustable Parameters are&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Connection weights (to be learned)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Activation function (fixed)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Number of layers (fixed)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Number of neurons per layer (fixed)&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h3 id="training">Training&lt;/h3>
&lt;p>Backpropagation with gradient descent&lt;/p>
&lt;h2 id="neural-network-based-face-detection1">Neural Network Based Face Detection&lt;sup id="fnref:1">&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref">1&lt;/a>&lt;/sup>&lt;/h2>
&lt;ul>
&lt;li>Idea: Use an artificial neural network to detect upright frontal faces
&lt;ul>
&lt;li>
&lt;p>Network receives as input a 20x20 pixel region of an image&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Output ranges from -1 (no face present) to +1 (face present)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>The neural network &amp;ldquo;face filter&amp;rdquo; is applied at every location in the image&lt;/p>
&lt;/li>
&lt;li>
&lt;p>To detect faces of different sizes, the input image is repeatedly scaled down&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="network-topology">Network Topology&lt;/h3>
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-100" >&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%e6%88%aa%e5%b1%8f2020-11-13%2016.28.33.png" alt="截屏2020-11-13 16.28.33" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;ul>
&lt;li>20x20 pixel input retina&lt;/li>
&lt;li>4 types of receptive hidden fields&lt;/li>
&lt;li>One real-valued output&lt;/li>
&lt;/ul>
&lt;h3 id="system-overview">System Overview&lt;/h3>
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-100" >&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%e6%88%aa%e5%b1%8f2020-11-13%2016.29.19.png" alt="截屏2020-11-13 16.29.19" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h3 id="network-training">Network Training&lt;/h3>
&lt;h4 id="training-set">Training Set&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>1050 normalized face images&lt;/p>
&lt;/li>
&lt;li>
&lt;p>15 additional face images generated from each original face image by rotation and scaling&lt;/p>
&lt;/li>
&lt;li>
&lt;p>1000 randomly chosen non-face images&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h4 id="preprocessing">Preprocessing&lt;/h4>
&lt;ul>
&lt;li>correct for different lighting conditions (overall brightness, shadows)&lt;/li>
&lt;li>rescale images to fixed size&lt;/li>
&lt;/ul>
&lt;h4 id="histogram-equalization">Histogram equalization&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>Defines a mapping of gray levels $p$ into gray levels $q$ such that the distribution of $q$ is close to being uniform&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Stretches contrast (expands the range of gray levels)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Transforms different input images so that they have similar intensity distributions (thus reducing the effect of different illumination)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Example&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-11-13%2016.32.18.png" alt="截屏2020-11-13 16.32.18" style="zoom:67%;" />
&lt;/li>
&lt;li>
&lt;p>Algorithm&lt;/p>
&lt;ul>
&lt;li>
&lt;p>The probability of an occurrence of a pixel of level $i$ in the image:
&lt;/p>
$$
p\left(x\_{i}\right)=\frac{n\_{i}}{n}, \qquad i \in 0, \ldots, L-1
$$
&lt;ul>
&lt;li>$L$: number of gray levels&lt;/li>
&lt;li>$n\_i$: number of occurrences of gray level $i$&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Define $c$ as the cumulative distribution function:
&lt;/p>
$$
c(i)=\sum\_{j=0}^{i} p\left(x\_{j}\right)
$$
&lt;/li>
&lt;li>
&lt;p>Create a transformation of the form
&lt;/p>
$$
y\_i = T(x\_i) = c(i), \qquad y\_i \in [0, 1]
$$
&lt;p>
This produces a level $y$ for each level $x$ in the original image, such that the cumulative probability function of $y$ is linearized across the value range. Finally, rescale to the output gray-level range:
&lt;/p>
$$
y\_{i}^{\prime}=y\_{i} \cdot(\max -\min )+\min
$$
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
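&lt;p>The algorithm above translates directly into NumPy (a sketch; names are illustrative):&lt;/p>

```python
import numpy as np

def equalize(image, L=256):
    """Histogram equalization following the steps above:
    p(x_i) = n_i / n, c(i) = cumulative sum of p,
    y'_i = c(i) * (max - min) + min.
    """
    hist = np.bincount(image.ravel(), minlength=L)
    p = hist / image.size              # p(x_i) = n_i / n
    c = np.cumsum(p)                   # cumulative distribution c(i)
    lo, hi = image.min(), image.max()
    mapping = c * (hi - lo) + lo       # rescale to the gray-level range
    return mapping[image].astype(np.uint8)
```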
&lt;h4 id="training-procedure">Training Procedure&lt;/h4>
&lt;ol>
&lt;li>
&lt;p>Randomly choose 1000 non-face images&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Train network to produce 1 for faces, -1 for non-faces&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Run network on images containing no faces. Collect subimages in which the network incorrectly identifies a face (output &amp;gt; 0)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Select up to 250 of these &amp;ldquo;false positives&amp;rdquo; at random and add them to the training set as negative examples&lt;/p>
&lt;/li>
&lt;/ol>
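&lt;p>The bootstrapping procedure above can be sketched as a loop; &lt;code>train&lt;/code> and &lt;code>predict&lt;/code> are hypothetical stand-ins for the network training and filtering steps:&lt;/p>

```python
import random

def bootstrap_negatives(train, predict, faces, nonface_pool, rounds=3, cap=250):
    """Bootstrap training sketch: repeatedly train, collect false
    positives on face-free images, and add up to `cap` of them to the
    negative training set. (The pool may contain already-used negatives;
    a real implementation would mine fresh scenery images each round.)
    """
    negatives = random.sample(nonface_pool, min(1000, len(nonface_pool)))
    for _ in range(rounds):
        model = train(faces, negatives)
        false_pos = [x for x in nonface_pool if predict(model, x) > 0]
        random.shuffle(false_pos)
        negatives.extend(false_pos[:cap])
    return model, negatives
```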
&lt;h3 id="neural-network-based-face-filter">Neural Network Based Face Filter&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>Output of ANN defines a filter for faces&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Search&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Scan input image with search window, apply ANN to search window&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Input image needs to be rescaled in order to detect faces of different sizes&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Output needs to be post-processed&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Noise removal&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Merging overlapping detections&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Speed up can be achieved&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Increase step size&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Make ANN more flexible to translation&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Hierarchical, pyramidal search&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="localization-and-ground-truth">Localization and Ground-Truth&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>For localization, the test data is usually annotated with ground-truth bounding boxes&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Comparing hypotheses to Ground-Truth&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Overlap
&lt;/p>
$$
O = \frac{|\text{GT} \cap \text{DET}|}{|\text{GT} \cup \text{DET}|}
$$
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-100" >&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%e6%88%aa%e5%b1%8f2020-11-13%2016.43.11.png" alt="截屏2020-11-13 16.43.11" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;blockquote>
&lt;p>Also called &lt;strong>Intersection over Union (IoU)&lt;/strong>&lt;/p>
&lt;/blockquote>
&lt;/li>
&lt;li>
&lt;p>Often used as threshold: Overlap&amp;gt;50%&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
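&lt;p>For axis-aligned boxes, the overlap can be computed as follows (a sketch; the corner-coordinate box format is an assumption):&lt;/p>

```python
def iou(boxA, boxB):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(boxA[0], boxB[0]), max(boxA[1], boxB[1])
    ix2, iy2 = min(boxA[2], boxB[2]), min(boxA[3], boxB[3])
    iw, ih = max(0, ix2 - ix1), max(0, iy2 - iy1)
    inter = iw * ih
    union = ((boxA[2] - boxA[0]) * (boxA[3] - boxA[1])
             + (boxB[2] - boxB[0]) * (boxB[3] - boxB[1]) - inter)
    return inter / union if union > 0 else 0.0
```

A detection then counts as correct if, e.g., its IoU with the ground-truth box exceeds 0.5.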
&lt;div class="footnotes" role="doc-endnotes">
&lt;hr>
&lt;ol>
&lt;li id="fn:1">
&lt;p>&lt;em>Neural Network Based Face Detection, by Henry A. Rowley, Shumeet Baluja, and Takeo Kanade. IEEE Transactions on Pattern Analysis and Machine Intelligence, volume 20, number 1, pages 23-38, January 1998.&lt;/em>&amp;#160;&lt;a href="#fnref:1" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;/ol>
&lt;/div></description></item><item><title>Face Recognition: Traditional Approaches</title><link>https://haobin-tan.netlify.app/docs/ai/computer-vision/cv-lecture/05-face_recognition-traditional/</link><pubDate>Thu, 04 Feb 2021 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/computer-vision/cv-lecture/05-face_recognition-traditional/</guid><description>&lt;h2 id="face-recognition-for-human-computer-interaction-hci">Face Recognition for Human-Computer Interaction (HCI)&lt;/h2>
&lt;h3 id="main-problem">Main Problem&lt;/h3>
&lt;blockquote>
&lt;p>The variations between the images of the same face due to illumination and viewing direction are almost always larger than image variations due to change in face identity.&lt;/p>
&lt;p>&amp;ndash; Moses, Adini, Ullman, ECCV‘94&lt;/p>
&lt;/blockquote>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-04%2023.57.03.png" alt="截屏2021-02-04 23.57.03">&lt;/p>
&lt;h3 id="closed-set-vs-open-set-identification">Closed Set vs. Open Set Identification&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Closed-Set Identification&lt;/strong>
&lt;ul>
&lt;li>The system reports which person from the gallery is shown on the test image: Who is he?&lt;/li>
&lt;li>Performance metric: Correct identification rate&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Open-Set Identification&lt;/strong>
&lt;ul>
&lt;li>The system first decides whether the person on the test image is known or unknown. If he is a known person, who is he?&lt;/li>
&lt;li>Performance metric
&lt;ul>
&lt;li>&lt;strong>False accept&lt;/strong>: An invalid identity is accepted as one of the individuals in the database.&lt;/li>
&lt;li>&lt;strong>False reject&lt;/strong>: An individual is rejected even though he/she is present in the database.&lt;/li>
&lt;li>&lt;strong>False classify&lt;/strong>: An individual in the database is correctly accepted but misclassified as one of the other individuals in the training data&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="authenticationverification">Authentication/Verification&lt;/h3>
&lt;p>A person claims to be a particular member. The system decides whether the test image and the training image show the same person: Is he who he claims to be?&lt;/p>
&lt;p>Performance metric:&lt;/p>
&lt;ul>
&lt;li>False Reject Rate (FRR): Rate of rejecting a valid identity&lt;/li>
&lt;li>False Accept Rate (FAR): Rate of incorrectly accepting an invalid identity.&lt;/li>
&lt;/ul>
&lt;h2 id="feature-based-geometrical-approaches">Feature-based (Geometrical) approaches&lt;/h2>
&lt;p>&amp;ldquo;Face Recognition: Features versus Templates&amp;rdquo; &lt;sup id="fnref:1">&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref">1&lt;/a>&lt;/sup>&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-05 00.12.27.png" alt="截屏2021-02-05 00.12.27" style="zoom:67%;" />
&lt;ul>
&lt;li>Eyebrow thickness and vertical position at the eye center position&lt;/li>
&lt;li>A coarse description of the left eyebrow‘s arches&lt;/li>
&lt;li>Nose vertical position and width&lt;/li>
&lt;li>Mouth vertical position, width, and height of upper and lower lips&lt;/li>
&lt;li>Eleven radii describing the chin shape&lt;/li>
&lt;li>Face width at nose position&lt;/li>
&lt;li>Face width halfway between nose tip and eyes&lt;/li>
&lt;/ul>
&lt;h3 id="classification">Classification&lt;/h3>
&lt;p>&lt;strong>Nearest neighbor classifier&lt;/strong> with &lt;strong>Mahalanobis distance&lt;/strong> as the distance metric:
&lt;/p>
$$
\Delta_{j}(x)=\left(x-m_{j}\right)^{T} \Sigma^{-1}\left(x-m_{j}\right)
$$
&lt;ul>
&lt;li>$x$: input face image&lt;/li>
&lt;li>$m\_j$: average vector representing the $j$-th person&lt;/li>
&lt;li>$\Sigma$: Covariance matrix&lt;/li>
&lt;/ul>
&lt;p>Different people are characterized only by their average feature vector.&lt;/p>
&lt;p>The covariance matrix $\Sigma$ is shared by all classes and is estimated using all the examples in the training set.&lt;/p>
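&lt;p>This classifier can be sketched as follows (names are illustrative; assumes per-person mean feature vectors and a shared covariance matrix):&lt;/p>

```python
import numpy as np

def classify(x, means, cov):
    """Nearest-neighbor classification with the Mahalanobis distance
    Delta_j(x) = (x - m_j)^T Sigma^{-1} (x - m_j); returns the index of
    the person whose mean feature vector is closest."""
    inv = np.linalg.inv(cov)
    dists = [float((x - m) @ inv @ (x - m)) for m in means]
    return int(np.argmin(dists))
```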
&lt;h2 id="appearance-based-approaches">Appearance-based approaches&lt;/h2>
&lt;p>Can be either&lt;/p>
&lt;ul>
&lt;li>&lt;strong>&lt;a href="#holistic-appearance-based-approaches">holistic&lt;/a>&lt;/strong> (process the whole face as the input), or&lt;/li>
&lt;li>&lt;a href="#local-appearance-based-approach">&lt;strong>local / fiducial&lt;/strong>&lt;/a> (process facial features, such as eyes, mouth, etc. seperately)&lt;/li>
&lt;/ul>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-08%2010.28.36.png" alt="截屏2021-02-08 10.28.36">&lt;/p>
&lt;p>Processing steps: align faces with facial landmarks&lt;/p>
&lt;ul>
&lt;li>Use manually labeled or automatically detected eye centers&lt;/li>
&lt;li>Normalize face images to a common coordinate system, removing translation, rotation and scaling factors&lt;/li>
&lt;li>Crop off unnecessary background&lt;/li>
&lt;/ul>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-08%2010.30.38.png" alt="截屏2021-02-08 10.30.38">&lt;/p>
&lt;h2 id="holistic-appearance-based-approaches">Holistic appearance-based approaches&lt;/h2>
&lt;h3 id="eigenfaces">Eigenfaces&lt;/h3>
&lt;h4 id="-idea">💡 Idea&lt;/h4>
&lt;p>A face image defines a point in the high dimensional image space.&lt;/p>
&lt;p>Different face images share a number of similarities with each other&lt;/p>
&lt;ul>
&lt;li>They can be described by a relatively low dimensional subspace&lt;/li>
&lt;li>Project the face images into an appropriately chosen subspace and perform classification by similarity computation (distance, angle)
&lt;ul>
&lt;li>The dimensionality reduction procedure used here is called the &lt;mark>&lt;strong>Karhunen-Loève transformation&lt;/strong>&lt;/mark> or &lt;mark>&lt;strong>principal component analysis (PCA)&lt;/strong>&lt;/mark>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h4 id="objective">Objective&lt;/h4>
&lt;p>Find the vectors that best account for the distribution of face images within the entire image space&lt;/p>
&lt;h4 id="pca">PCA&lt;/h4>
&lt;blockquote>
&lt;p>For more details see: &lt;a href="https://haobin-tan.netlify.app/docs/ai/machine-learning/unsupervised/pca/">Principal Component Analysis (PCA)&lt;/a>&lt;/p>
&lt;/blockquote>
&lt;ul>
&lt;li>Find direction vectors so as to minimize the average projection error&lt;/li>
&lt;li>Project on the linear subspace spanned by these vectors&lt;/li>
&lt;li>Use covariance matrix to find these direction vectors&lt;/li>
&lt;li>Project on the largest K direction vectors to reduce dimensionality&lt;/li>
&lt;/ul>
&lt;p>PCA for eigenfaces:&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-09%2011.11.40.png" alt="截屏2021-02-09 11.11.40">
&lt;/p>
$$
\begin{array}{l}
Y=\left[y\_{1}, y\_{2}, y\_{3}, \ldots, y\_{K}\right] \\\\
m=\frac{1}{K}\sum y \\\\
C=(Y-m)(Y-m)^{T} \\\\
D=U^{T} C U \\\\
\Omega=U^{\top}(y-m)
\end{array}
$$
&lt;p>
where&lt;/p>
&lt;ul>
&lt;li>
&lt;p>$y$: Face image&lt;/p>
&lt;/li>
&lt;li>
&lt;p>$Y$: Face matrix&lt;/p>
&lt;/li>
&lt;li>
&lt;p>$m$: Mean face&lt;/p>
&lt;/li>
&lt;li>
&lt;p>$C$: Covariance matrix&lt;/p>
&lt;/li>
&lt;li>
&lt;p>$D$: Eigenvalues&lt;/p>
&lt;/li>
&lt;li>
&lt;p>$U$: Eigenvectors&lt;/p>
&lt;/li>
&lt;li>
&lt;p>$\Omega$: Representation coefficients&lt;/p>
&lt;/li>
&lt;/ul>
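&lt;p>The PCA steps above can be sketched as follows. Since $d \gg K$ for images, a standard trick (not spelled out in the notes) is to eigendecompose the small $K \times K$ matrix $(Y-m)^T(Y-m)$ and map its eigenvectors back to image space; the helper names are illustrative:&lt;/p>

```python
import numpy as np

def compute_eigenfaces(Y, M):
    """Y: (d, K) matrix whose columns are vectorized face images.
    Returns the mean face m and the M eigenvectors (eigenfaces) of
    C = (Y - m)(Y - m)^T with the largest eigenvalues."""
    m = Y.mean(axis=1, keepdims=True)
    A = Y - m
    # Eigenvectors of A^T A (K x K) give those of A A^T (d x d) cheaply:
    # if A^T A v = lam v, then A A^T (A v) = lam (A v)
    vals, vecs = np.linalg.eigh(A.T @ A)
    order = np.argsort(vals)[::-1][:M]
    U = A @ vecs[:, order]           # map back to image space
    U /= np.linalg.norm(U, axis=0)   # normalize eigenfaces
    return m, U

def project(y, m, U):
    """Representation coefficients Omega = U^T (y - m)."""
    return U.T @ (y - m)
```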
&lt;h4 id="training">Training&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>Acquire initial set of face images (training set):
&lt;/p>
$$
Y = [y\_1, y\_2, \dots, y\_K]
$$
&lt;/li>
&lt;li>
&lt;p>Calculate the eigenfaces/eigenvectors from the training set, keeping only the $M$ images/vectors corresponding to the highest eigenvalues
&lt;/p>
$$
U = (u\_1, u\_2, \dots, u\_M)
$$
&lt;/li>
&lt;li>
&lt;p>Calculate representation of each known individual $k$ in face space
&lt;/p>
$$
\Omega\_k = U^T(y\_k - m)
$$
&lt;/li>
&lt;/ul>
&lt;h4 id="testing">Testing&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>Project input new image &lt;em>y&lt;/em> into face space
&lt;/p>
$$
\Omega = U^T(y - m)
$$
&lt;/li>
&lt;li>
&lt;p>Find most likely candidate class $k$ by distance computation
&lt;/p>
$$
\epsilon\_k = \\|\Omega - \Omega\_k\\| \quad \text{for all } \Omega\_k
$$
&lt;/li>
&lt;/ul>
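&lt;p>Training and testing then reduce to projecting faces and comparing coefficient vectors. A hypothetical sketch, assuming a mean face &lt;code>m&lt;/code> and eigenface matrix &lt;code>U&lt;/code> have already been computed:&lt;/p>

```python
import numpy as np

def recognize(y, m, U, known_omegas):
    """Project a probe face into face space and return the index of the
    nearest known representation Omega_k plus its distance epsilon_k."""
    omega = U.T @ (y - m)
    eps = [np.linalg.norm(omega - ok) for ok in known_omegas]
    k = int(np.argmin(eps))
    return k, eps[k]
```

In practice the distance is also thresholded ($\epsilon\_k &lt; \theta\_{\epsilon}$) to reject unknown individuals.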
&lt;h4 id="projections-onto-the-face-space">&lt;strong>Projections onto the face space&lt;/strong>&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>Principal components are called “&lt;strong>eigenfaces&lt;/strong>” and they span the “face space”.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Images can be reconstructed by their projections in face space:&lt;/p>
&lt;/li>
&lt;/ul>
$$
Y\_f = \sum\_{i=1}^{M} \omega\_i u\_i
$$
&lt;p>Appearance of faces in face-space does not change a lot&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Difference of mean-adjusted image $(Y-m)$ and projection $Y\_f$ gives a measure of &lt;em>„faceness“&lt;/em>&lt;/p>
&lt;ul>
&lt;li>Distance from face space can be used to detect faces&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Different cases of projections onto face space&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-09%2011.38.26.png" alt="截屏2021-02-09 11.38.26" style="zoom:80%;" />
&lt;ul>
&lt;li>
&lt;p>Case 1: Projection of a &lt;em>known&lt;/em> individual&lt;/p>
&lt;p>$\rightarrow$ Near face space ($\epsilon &lt; \theta\_{\delta}$) and near known face $\Omega\_k$ ($\epsilon\_k &lt; \theta\_{\epsilon}$)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Case 2: Projection of an &lt;em>unknown&lt;/em> individual&lt;/p>
&lt;p>$\rightarrow$ Near face space, far from reference vectors&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Case 3 and 4: not a face (far from face space)&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
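&lt;p>The distance-from-face-space test above can be sketched directly: reconstruct the mean-adjusted image from its projection and measure the residual (a minimal illustration, assuming orthonormal eigenfaces &lt;code>U&lt;/code>):&lt;/p>

```python
import numpy as np

def distance_from_face_space(y, m, U):
    """Residual between the mean-adjusted image and its face-space
    reconstruction Y_f; a large residual suggests 'not a face'."""
    phi = y - m
    y_f = U @ (U.T @ phi)  # reconstruction from the coefficients omega_i
    return np.linalg.norm(phi - y_f)
```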
&lt;h4 id="pca-for-face-matching-and-recognition">PCA for face matching and recognition&lt;/h4>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-09%2011.53.26.png" alt="截屏2021-02-09 11.53.26">&lt;/p>
&lt;ul>
&lt;li>Projects all faces onto a &lt;strong>universal&lt;/strong> eigenspace to “encode” via principal components&lt;/li>
&lt;li>Uses inverse-distance as a similarity measure $S(p,g)$ for matching &amp;amp; recognition&lt;/li>
&lt;/ul>
&lt;h4 id="problems-and-shortcomings">Problems and shortcomings&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>Eigenfaces do NOT distinguish between shape and appearance&lt;/p>
&lt;/li>
&lt;li>
&lt;p>PCA does NOT use class information&lt;/p>
&lt;ul>
&lt;li>PCA projections are optimal for reconstruction from a low dimensional basis, they may not be optimal from a discrimination standpoint&lt;/li>
&lt;/ul>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-09%2011.58.01.png" alt="截屏2021-02-09 11.58.01" style="zoom:67%;" />
&lt;/li>
&lt;/ul>
&lt;h3 id="fisherface">Fisherface&lt;/h3>
&lt;h4 id="linear-discriminant-analysis-lda">Linear Discriminant Analysis (LDA)&lt;/h4>
&lt;blockquote>
&lt;p>For more details about LDA, see: &lt;a href="https://haobin-tan.netlify.app/docs/ai/machine-learning/non-parametric/lda-summary/">LDA Summary&lt;/a>&lt;/p>
&lt;/blockquote>
&lt;ul>
&lt;li>A.k.a. &lt;strong>Fisher&amp;rsquo;s Linear Discriminant&lt;/strong>&lt;/li>
&lt;li>Preserves separability of classes&lt;/li>
&lt;li>Maximizes ratio of projected between-classes to projected within-class scatter&lt;/li>
&lt;/ul>
$$
W\_{\mathrm{fld}}=\arg \underset{W}{\max } \frac{\left|W^{T} S\_{B} W\right|}{\left|W^{T} S\_{W} W\right|}
$$
&lt;p>Where&lt;/p>
&lt;ul>
&lt;li>$S\_{B}=\sum\_{i=1}^{c}\left|X\_{i}\right|\left(\mu\_{i}-\mu\right)\left(\mu\_{i}-\mu\right)^{T}$: Between-class scatter
&lt;ul>
&lt;li>$c$: Number of classes&lt;/li>
&lt;li>$\mu\_i$: mean of class $X\_i$&lt;/li>
&lt;li>$|X\_i|$: number of samples of $X\_i$&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>$S\_{W}=\sum\_{i=1}^{c} \sum\_{x\_{k} \in X\_{i}}\left(x\_{k}-\mu\_{i}\right)\left(x\_{k}-\mu\_{i}\right)^{T}$: Within-class scatter&lt;/li>
&lt;/ul>
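&lt;p>The scatter matrices and the projection $W\_{\mathrm{fld}}$ can be computed as below; a small regularizer on $S\_W$ is added for numerical stability (an implementation choice, not part of the definition):&lt;/p>

```python
import numpy as np

def fisher_directions(X, labels, n_dirs):
    """X: (N, d) samples, labels: (N,) class ids.
    Builds S_B and S_W as defined above and returns the n_dirs
    directions maximizing |W^T S_B W| / |W^T S_W W|."""
    mu = X.mean(axis=0)
    d = X.shape[1]
    S_B = np.zeros((d, d))
    S_W = np.zeros((d, d))
    for c in np.unique(labels):
        Xc = X[labels == c]
        mu_c = Xc.mean(axis=0)
        diff = (mu_c - mu)[:, None]
        S_B += len(Xc) * diff @ diff.T
        S_W += (Xc - mu_c).T @ (Xc - mu_c)
    # Generalized eigenproblem S_W^{-1} S_B w = lambda w
    vals, vecs = np.linalg.eig(np.linalg.inv(S_W + 1e-6 * np.eye(d)) @ S_B)
    order = np.argsort(vals.real)[::-1][:n_dirs]
    return vecs[:, order].real
```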
&lt;p>&lt;strong>LDA vs. PCA&lt;/strong>&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-09%2012.25.35.png" alt="截屏2021-02-09 12.25.35" style="zoom:67%;" />
&lt;h4 id="lda-for-fisherfaces">LDA for Fisherfaces&lt;/h4>
&lt;p>Fisher’s Linear Discriminant&lt;/p>
&lt;ul>
&lt;li>projects away the within-class variation (lighting, expressions) found in training set&lt;/li>
&lt;li>preserves the separability of the classes.&lt;/li>
&lt;/ul>
&lt;h2 id="local-appearance-based-approach">Local appearance-based approach&lt;/h2>
&lt;p>Local vs Holistic approaches:&lt;/p>
&lt;ul>
&lt;li>Local variations in facial appearance (different expression, occlusion, lighting)
&lt;ul>
&lt;li>lead to modifications on the entire representation in the holistic approaches&lt;/li>
&lt;li>while in local approaches ONLY the corresponding local region is affected&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Face images have locally varying statistics (high frequency at the edges, low frequency in smooth regions). These varying statistics are easier to represent linearly using a local representation.&lt;/li>
&lt;li>Local approaches facilitate the weighting of each local region in terms of their effect on face recognition.&lt;/li>
&lt;/ul>
&lt;h3 id="modular-eigen-spaces">Modular Eigen Spaces&lt;/h3>
&lt;p>Classification using fiducial regions instead of using entire face &lt;sup id="fnref:2">&lt;a href="#fn:2" class="footnote-ref" role="doc-noteref">2&lt;/a>&lt;/sup>.&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-09%2012.59.14.png" alt="截屏2021-02-09 12.59.14">&lt;/p>
&lt;h3 id="local-pca-modular-pca">Local PCA (Modular PCA)&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>Face images are divided into $N$ smaller sub-images&lt;/p>
&lt;/li>
&lt;li>
&lt;p>PCA is applied on each of these sub-images&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-09%2013.01.08-20210723110742829.png" alt="截屏2021-02-09 13.01.08">&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Performed &lt;strong>better&lt;/strong> than global PCA on large variations of illumination and expression&lt;/p>
&lt;/li>
&lt;li>
&lt;p>No improvements under variation of pose&lt;/p>
&lt;/li>
&lt;/ul>
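&lt;p>A hypothetical sketch of the sub-image split: each image is cut into a grid of blocks and PCA (via SVD) is run per block independently. The grid size and component count are illustrative parameters:&lt;/p>

```python
import numpy as np

def modular_pca(images, grid=(2, 2), n_components=5):
    """images: (K, H, W) stack of aligned faces.
    Splits each image into grid blocks and fits a separate PCA basis
    per block; returns {(row, col): (mean, basis)}."""
    K, H, W = images.shape
    bh, bw = H // grid[0], W // grid[1]
    models = {}
    for r in range(grid[0]):
        for c in range(grid[1]):
            block = images[:, r*bh:(r+1)*bh, c*bw:(c+1)*bw].reshape(K, -1)
            m = block.mean(axis=0)
            # PCA of the centered block matrix via SVD
            _, _, Vt = np.linalg.svd(block - m, full_matrices=False)
            models[(r, c)] = (m, Vt[:n_components])
    return models
```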
&lt;h3 id="local-feature-based">Local Feature based&lt;/h3>
&lt;p>🎯 Objective: To mitigate the effect of expression, illumination, and occlusion variations by performing local analysis and by fusing the outputs of extracted local features at the feature or at the decision level.&lt;/p>
&lt;h4 id="gabor-filters">Gabor Filters&lt;/h4>
&lt;h4 id="elastic-bunch-graphs-ebg">Elastic Bunch Graphs (EBG)&lt;/h4>
&lt;h4 id="local-binary-pattern-lbp-histogram">Local Binary Pattern (LBP) Histogram&lt;/h4>
&lt;div class="footnotes" role="doc-endnotes">
&lt;hr>
&lt;ol>
&lt;li id="fn:1">
&lt;p>&lt;a href="http://cbcl.mit.edu/people/poggio/journals/brunelli-poggio-IEEE-PAMI-1993.pdf">http://cbcl.mit.edu/people/poggio/journals/brunelli-poggio-IEEE-PAMI-1993.pdf&lt;/a>&amp;#160;&lt;a href="#fnref:1" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:2">
&lt;p>Pentland, Moghaddam and Starner, &amp;ldquo;View-based and modular eigenspaces for face recognition,&amp;rdquo; &lt;em>1994 Proceedings of IEEE Conference on Computer Vision and Pattern Recognition&lt;/em>, Seattle, WA, USA, 1994, pp. 84-91, doi: 10.1109/CVPR.1994.323814.&amp;#160;&lt;a href="#fnref:2" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;/ol>
&lt;/div></description></item><item><title>Face Recognition: Features</title><link>https://haobin-tan.netlify.app/docs/ai/computer-vision/cv-lecture/06-face_recognition-features/</link><pubDate>Mon, 15 Feb 2021 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/computer-vision/cv-lecture/06-face_recognition-features/</guid><description>&lt;h2 id="local-appearance-based-face-recognition">Local Appearance-based Face Recognition&lt;/h2>
&lt;p>🎯 Objective: To mitigate the effect of expression, illumination, and occlusion variations by performing local analysis and by fusing the outputs of extracted local features at the feature or at the decision level.&lt;/p>
&lt;p>Some popular facial descriptions achieving good results&lt;/p>
&lt;ul>
&lt;li>Local binary Pattern Histogram (LBPH)&lt;/li>
&lt;li>Gabor Feature&lt;/li>
&lt;li>Discrete Cosine Transform (DCT)&lt;/li>
&lt;li>SIFT&lt;/li>
&lt;li>etc.&lt;/li>
&lt;/ul>
&lt;h3 id="local-binary-pattern-histogram-lbph1">Local binary Pattern Histogram (LBPH)&lt;sup id="fnref:1">&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref">1&lt;/a>&lt;/sup>&lt;/h3>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-16%2011.10.39.png" alt="截屏2021-02-16 11.10.39" style="zoom:80%;" />
&lt;ul>
&lt;li>
&lt;p>Divide image into cells&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Compare each pixel to each of its neighbors&lt;/p>
&lt;ul>
&lt;li>Where the neighbor&amp;rsquo;s value is greater than or equal to the threshold value (here, the center pixel&amp;rsquo;s value), write &amp;ldquo;1&amp;rdquo;&lt;/li>
&lt;li>Otherwise, write &amp;ldquo;0&amp;rdquo;&lt;/li>
&lt;/ul>
&lt;p>$\rightarrow$ gives a binary number&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Convert binary into decimal&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Compute the histogram over the cell&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Use the histogram for classification&lt;/p>
&lt;ul>
&lt;li>SVM&lt;/li>
&lt;li>Histogram-distances&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
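&lt;p>The steps above can be sketched for a single 3×3 neighborhood and one cell. The clockwise neighbor ordering is one common convention; other orderings give equally valid (but different) codes:&lt;/p>

```python
import numpy as np

def lbp_code(patch):
    """LBP code of the center pixel of a 3x3 patch: threshold the 8
    neighbors against the center and read the bits as a binary number."""
    center = patch[1, 1]
    # clockwise, starting at the top-left neighbor
    neighbors = [patch[0, 0], patch[0, 1], patch[0, 2], patch[1, 2],
                 patch[2, 2], patch[2, 1], patch[2, 0], patch[1, 0]]
    bits = [1 if n >= center else 0 for n in neighbors]
    return sum(b << i for i, b in enumerate(reversed(bits)))

def lbp_histogram(cell):
    """Histogram of LBP codes over one cell (the local descriptor)."""
    H, W = cell.shape
    codes = [lbp_code(cell[r-1:r+2, c-1:c+2])
             for r in range(1, H - 1) for c in range(1, W - 1)]
    return np.bincount(codes, minlength=256)
```

Per-cell histograms are then concatenated into the face descriptor and compared with histogram distances or fed to an SVM.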
&lt;blockquote>
&lt;p>Tutorials and explanation:&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://towardsdatascience.com/face-recognition-how-lbph-works-90ec258c3d6b">Face Recognition: Understanding LBPH Algorithm&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://www.youtube.com/watch?v=h-z9-bMtd7w">how is the LBP |Local Binary Pattern| values calculated? ~ xRay Pixy&lt;/a>&lt;/li>
&lt;/ul>
&lt;/blockquote>
&lt;h3 id="high-dim-dense-local-feature-extraction">&lt;strong>High dim. dense local Feature Extraction&lt;/strong>&lt;/h3>
&lt;ul>
&lt;li>Computing features densely (e.g. on overlapping patches at many scales in the image)&lt;/li>
&lt;li>Problem: very high dimensionality!&lt;/li>
&lt;li>Solution: Encode into a compact form
&lt;ul>
&lt;li>Bag of Visual Word (BoVW) model&lt;/li>
&lt;li>Fisher encoding&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h4 id="fisher-vector-encoding">Fisher Vector Encoding&lt;/h4>
&lt;ul>
&lt;li>Aggregates feature vectors into a compact representation&lt;/li>
&lt;li>Fitting a parametric generative model (e.g. Gaussian Mixture Model)&lt;/li>
&lt;li>Encode derivative of the likelihood of model w.r.t its parameters&lt;/li>
&lt;/ul>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-16%2011.38.19.png" alt="截屏2021-02-16 11.38.19">&lt;/p>
&lt;h2 id="face-recognition-across-pose-alignment">Face recognition across pose (Alignment)&lt;/h2>
&lt;p>Problem&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Different view-point / head orientation&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-16%2011.40.44.png" alt="截屏2021-02-16 11.40.44" style="zoom:80%;" />
&lt;/li>
&lt;li>
&lt;p>Recognition results degrade when images of different head orientations have to be matched 😭&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Major directions to address the face recognition across pose problem&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Geometric pose normalization (image affine warps)&lt;/li>
&lt;li>2D specific pose models, image rendering at pixel or feature level (2D+3D approaches)&lt;/li>
&lt;li>3D face Model fitting&lt;/li>
&lt;/ul>
&lt;h3 id="pose-normalization">Pose Normalization&lt;/h3>
&lt;h4 id="-idea">💡 &lt;strong>Idea&lt;/strong>&lt;/h4>
&lt;ul>
&lt;li>Find several facial features (mesh)&lt;/li>
&lt;li>Use complete mesh to normalize face&lt;/li>
&lt;/ul>
&lt;p>Here we will use &lt;strong>2D Active Appearance Models&lt;/strong>&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-16%2011.51.52.png" alt="截屏2021-02-16 11.51.52" style="zoom:80%;" />
&lt;ul>
&lt;li>
&lt;p>A texture and shape-based parametric model&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Efficient fitting algorithm: &lt;strong>Inverse compositional (IC)&lt;/strong> algorithm&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h4 id="model-and-fitting">Model and fitting&lt;/h4>
&lt;p>Independent shape and appearance model
&lt;/p>
$$
\begin{array}{c}
\text{shape:} \quad s=\left(x\_{1}, y\_{1}, x\_{2}, y\_{2}, \cdots, x\_{v}, y\_{v}\right)^{T}=s\_{0}+\sum\_{i=1}^{n} p\_{i} s\_{i} \\\\
\text{appearance:} \quad A(x)=A\_{0}(x)+\sum\_{i=1}^{m} \lambda\_{i} A\_{i}(x) \quad \forall x \in s\_{0}
\end{array}
$$
&lt;p>
Fitting goal:
&lt;/p>
$$
\arg \min \_{p, \lambda} \sum\_{x \in s\_{0}}\left[A\_{0}(x)+\sum\_{i=1}^{m} \lambda\_{i} A\_{i}(x)-I(W(x ; p))\right]^{2}
$$
&lt;p>
Fitting examples&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Fitted mesh&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-16%2012.02.54.png" alt="截屏2021-02-16 12.02.54">&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Mismatched mesh&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-16%2012.03.27.png" alt="截屏2021-02-16 12.03.27">&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>The fitted model can be used to warp the image to a frontal pose (e.g. using a piecewise affine transformation of mesh triangles)&lt;/p>
&lt;figure>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-16%2012.13.08.png">&lt;figcaption>
&lt;h4>Faces with different poses from the FERET database and their pose-aligned images&lt;/h4>
&lt;/figcaption>
&lt;/figure>
&lt;h4 id="results">Results&lt;/h4>
&lt;ul>
&lt;li>Much better results under pose variations compared to simple affine transform&lt;/li>
&lt;li>Different warping functions can be used
&lt;ul>
&lt;li>Piecewise affine transformation worked best&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Approach works well with local-DCT-based approach
&lt;ul>
&lt;li>but not so well with holistic approaches, such as Eigenfaces (PCA) 🤪&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h2 id="face-recogntion-using-3d-models2">Face Recogntion using 3D Models&lt;sup id="fnref:2">&lt;a href="#fn:2" class="footnote-ref" role="doc-noteref">2&lt;/a>&lt;/sup>&lt;/h2>
&lt;ul>
&lt;li>A method for face recognition across variations in pose and illumination.&lt;/li>
&lt;li>Simulates the process of image formation in 3D space.&lt;/li>
&lt;li>Estimates 3D shape and texture of faces from single images by fitting a statistical morphable model of 3D faces to images.&lt;/li>
&lt;li>Faces are represented by model parameters for 3D shape and texture.&lt;/li>
&lt;/ul>
&lt;h4 id="model-based-recognition">Model-based Recognition&lt;/h4>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-16%2012.19.23.png" alt="截屏2021-02-16 12.19.23" style="zoom:67%;" />
&lt;h4 id="face-vectors">Face vectors&lt;/h4>
&lt;p>The morphable face model is based on a vector space representation of faces that is constructed such that &lt;strong>any combination of shape and texture vectors $S\_i$ and $T\_i$ describes a realistic human face&lt;/strong>:
&lt;/p>
$$
S=\sum\_{i=1}^{m} a\_{i} S\_{i} \quad T=\sum\_{i=1}^{m} b\_{i} T\_{i}
$$
&lt;p>
The definition of shape and texture vectors is based on a reference face $\mathbf{I}\_0$.&lt;/p>
&lt;p>The location of the vertices of the mesh in Cartesian coordinates is $(x\_k, y\_k, z\_k)$ with colors $(R\_k, G\_k, B\_k)$&lt;/p>
&lt;p>Reference shape and texture vectors are defined by:
&lt;/p>
$$
\begin{array}{l}
S\_{0}=\left(x\_{1}, y\_{1}, z\_{1}, x\_{2}, \ldots, x\_{n}, y\_{n}, z\_{n}\right)^{T} \\\\
T\_{0}=\left(R\_{1}, G\_{1}, B\_{1}, R\_{2}, \ldots, R\_{n}, G\_{n}, B\_{n}\right)^{T}
\end{array}
$$
&lt;p>
To encode a novel scan $\mathbf{I}$, the flow field from $\mathbf{I}\_0$ to $\mathbf{I}$ is computed.&lt;/p>
&lt;h4 id="pca">PCA&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>PCA is performed on the set of shape and texture vectors separately.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Eigenvectors form an orthogonal basis:
&lt;/p>
$$
\mathbf{S}=\overline{\mathbf{s}}+\sum\_{i=1}^{m-1} \alpha\_{i} \cdot \mathbf{s}\_{i}, \quad \mathbf{T}=\overline{\mathbf{t}}+\sum\_{i=1}^{m-1} \beta\_{i} \cdot \mathbf{t}\_{i}
$$
&lt;/li>
&lt;li>
&lt;p>Example&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-16%2020.36.08.png" alt="截屏2021-02-16 20.36.08" style="zoom:67%;" />
&lt;/li>
&lt;/ul>
&lt;h3 id="model-based-image-analysis">Model-based Image Analysis&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>🎯 Goal: find shape and texture coefficients describing a 3D face model such that rendering produces an image $\mathbf{I}\_{\text{model}}$ that is as similar as possible to $\mathbf{I}\_{\text{input}}$&lt;/p>
&lt;/li>
&lt;li>
&lt;p>For initialization 7 facial feature points, such as the corners of the eyes or tip of the nose, should be labelled manually&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-16%2020.38.43.png" alt="截屏2021-02-16 20.38.43">&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Model fitting: Minimize
&lt;/p>
$$
E\_{I}=\sum\_{x, y}\left\\|\mathbf{I}\_{\text {input }}(x, y)-\mathbf{I}\_{\text {model }}(x, y)\right\\|^{2}
$$
&lt;ul>
&lt;li>Shape, texture, transformation, and illumination are optimized for the entire face and refined for each segment.&lt;/li>
&lt;li>Complex iterative optimization procedure&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h2 id="databases">Databases&lt;/h2>
&lt;ul>
&lt;li>Necessary to develop and improve algorithms&lt;/li>
&lt;li>Provide common testbeds and benchmarks which allow for comparing different approaches&lt;/li>
&lt;li>Different databases focus on different problems&lt;/li>
&lt;/ul>
&lt;p>Well-known databases for face recognition&lt;/p>
&lt;ul>
&lt;li>FERET&lt;/li>
&lt;li>FRVT&lt;/li>
&lt;li>FRGC&lt;/li>
&lt;li>CMU-PIE&lt;/li>
&lt;li>BANCA&lt;/li>
&lt;li>XM2VTS&lt;/li>
&lt;li>&amp;hellip;&lt;/li>
&lt;/ul>
&lt;h3 id="observations">Observations&lt;/h3>
&lt;ul>
&lt;li>One 3-D image is &lt;em>more powerful&lt;/em> for face recognition than one 2-D image.&lt;/li>
&lt;li>One high resolution 2-D image is &lt;em>more powerful&lt;/em> for face recognition than one 3-D image.&lt;/li>
&lt;li>Using 4 or 5 well-chosen 2-D face images is &lt;em>more powerful&lt;/em> for face recognition than one 3-D face image or multi-modal 3D+2D face.&lt;/li>
&lt;/ul>
&lt;h4 id="wild-face-datasets">Wild Face Datasets&lt;/h4>
&lt;h4 id="labeled-faces-in-the-wild-dataset-lfw">&lt;strong>Labeled Faces In the Wild Dataset (LFW)&lt;/strong>&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>Face Verification: Given a pair of images specify whether they belong to the same person&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-16%2020.44.55.png" alt="截屏2021-02-16 20.44.55" style="zoom:80%;" />
&lt;/li>
&lt;li>
&lt;p>13K images, 5.7K people&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Standard benchmark in the community&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Several test protocols depending upon availability of training data within and outside the dataset.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h4 id="youtube-faces-dataset-ytf">&lt;strong>YouTube Faces Dataset (YTF)&lt;/strong>&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>Video Face Verification: Given a pair of videos specify whether they belong to the same person&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-16%2020.46.03.png" alt="截屏2021-02-16 20.46.03" style="zoom:80%;" />
&lt;/li>
&lt;li>
&lt;p>3425 videos, 1595 people&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Standard benchmark in the community&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Wide pose, expression and illumination variation&lt;/p>
&lt;/li>
&lt;/ul>
&lt;div class="footnotes" role="doc-endnotes">
&lt;hr>
&lt;ol>
&lt;li id="fn:1">
&lt;p>T. Ahonen, A. Hadid and M. Pietikainen, &amp;ldquo;Face Description with Local Binary Patterns: Application to Face Recognition,&amp;rdquo; in IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 12, pp. 2037-2041, Dec. 2006, doi: 10.1109/TPAMI.2006.244.&amp;#160;&lt;a href="#fnref:1" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:2">
&lt;p>V. Blanz and T. Vetter, &amp;ldquo;Face recognition based on fitting a 3D morphable model,&amp;rdquo; in IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 25, no. 9, pp. 1063-1074, Sept. 2003, doi: 10.1109/TPAMI.2003.1227983.&amp;#160;&lt;a href="#fnref:2" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;/ol>
&lt;/div></description></item><item><title>Face Recognition: Deep Learning</title><link>https://haobin-tan.netlify.app/docs/ai/computer-vision/cv-lecture/07-face_recognition-deep_learning/</link><pubDate>Mon, 15 Feb 2021 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/computer-vision/cv-lecture/07-face_recognition-deep_learning/</guid><description>&lt;h2 id="deepface-1">DeepFace &lt;sup id="fnref:1">&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref">1&lt;/a>&lt;/sup>&lt;/h2>
&lt;h3 id="main-idea">Main idea&lt;/h3>
&lt;p>Learn a deep (7-layer) NN (20 million parameters) on 4 million identity-labeled face images, directly on RGB pixels.&lt;/p>
&lt;h3 id="alignment">Alignment&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>Use 6 fiducial points for 2D warp&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Then 67 points for 3D model&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Frontalize the face for input to NN&lt;/p>
&lt;/li>
&lt;/ul>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-16%2020.51.01.png" alt="截屏2021-02-16 20.51.01" style="zoom:67%;" />
&lt;h3 id="representation">Representation&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>Output is fed in $k$-way softmax, that generates probability distribution over class labels.&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-16%2020.52.06.png" alt="截屏2021-02-16 20.52.06">&lt;/p>
&lt;/li>
&lt;li>
&lt;p>🎯 Goal of training: &lt;strong>maximize the probability of the correct class&lt;/strong>&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h2 id="facenet2">FaceNet&lt;sup id="fnref:2">&lt;a href="#fn:2" class="footnote-ref" role="doc-noteref">2&lt;/a>&lt;/sup>&lt;/h2>
&lt;h4 id="idea">💡Idea&lt;/h4>
&lt;ul>
&lt;li>Map images to a compact Euclidean space, where &lt;strong>distances correspond to face similarity&lt;/strong>&lt;/li>
&lt;li>Find $f(x) \in \mathbb{R}^d$ for image $x$, so that
&lt;ul>
&lt;li>$d^2(f(x\_1), f(x\_2)) \rightarrow \text{small}$, if $x\_1, x\_2 \in \text{same identity}$&lt;/li>
&lt;li>$d^2(f(x\_1), f(x\_3)) \rightarrow \text{large}$, if $x\_1, x\_3 \in \text{different identities}$&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="system-architecture">System architecture&lt;/h3>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-16%2021.04.45.png" alt="截屏2021-02-16 21.04.45">&lt;/p>
&lt;ul>
&lt;li>CNN: optimized embedding&lt;/li>
&lt;li>Triplet-based loss function: training&lt;/li>
&lt;/ul>
&lt;h3 id="triplet-loss">Triplet loss&lt;/h3>
&lt;p>Image triplets:&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-16%2021.06.14.png" alt="截屏2021-02-16 21.06.14" style="zoom:67%;" />
$$
\begin{array}{c}
\left\\|f\left(x\_{i}^{a}\right)-f\left(x\_{i}^{p}\right)\right\\|\_{2}^{2}+\alpha&lt;\left\\|f\left(x\_{i}^{a}\right)-f\left(x\_{i}^{n}\right)\right\\|\_{2}^{2} \\\\
\forall\left(f\left(x\_{i}^{a}\right), f\left(x\_{i}^{p}\right), f\left(x\_{i}^{n}\right)\right) \in \mathcal{T}
\end{array}
$$
&lt;p>where&lt;/p>
&lt;ul>
&lt;li>
&lt;p>$x\_i^a$: Anchor image&lt;/p>
&lt;/li>
&lt;li>
&lt;p>$x\_i^p$: Positive image&lt;/p>
&lt;/li>
&lt;li>
&lt;p>$x\_i^n$: Negative image&lt;/p>
&lt;/li>
&lt;li>
&lt;p>$\mathcal{T}$: Set of all possible triplets in the training set&lt;/p>
&lt;/li>
&lt;li>
&lt;p>$\alpha$: Margin between positive and negative pairs&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>Total Loss function to be minimized:
&lt;/p>
$$
L=\sum\_{i}^{N}\left[\left\\|f\left(x\_{i}^{a}\right)-f\left(x\_{i}^{p}\right)\right\\|\_{2}^{2}-\left\\|f\left(x\_{i}^{a}\right)-f\left(x\_{i}^{n}\right)\right\\|\_{2}^{2}+\alpha\right]\_{+}
$$
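&lt;p>A minimal NumPy sketch of this loss; the hinge $[\cdot]\_{+}$ means triplets that already satisfy the margin contribute zero (the function name and default margin are illustrative):&lt;/p>

```python
import numpy as np

def triplet_loss(anchor, positive, negative, alpha=0.2):
    """Per-triplet loss ||f(a)-f(p)||^2 - ||f(a)-f(n)||^2 + alpha,
    clipped at zero, summed over the batch (embeddings on last axis)."""
    d_pos = np.sum((anchor - positive) ** 2, axis=-1)
    d_neg = np.sum((anchor - negative) ** 2, axis=-1)
    return np.maximum(d_pos - d_neg + alpha, 0.0).sum()
```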
&lt;h3 id="triplet-selection">Triplet selection&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>Online Generation&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Select only the &lt;strong>semi-hard negatives&lt;/strong> and using all anchor-positive pairs of mini-batch&lt;/p>
&lt;p>$\rightarrow$ Select $x\_i^n$ such that
&lt;/p>
$$
\left\\|f\left(x\_{i}^{a}\right)-f\left(x\_{i}^{p}\right)\right\\|\_{2}^{2}&lt;\left\\|f\left(x\_{i}^{a}\right)-f\left(x\_{i}^{n}\right)\right\\|\_{2}^{2}
$$
&lt;/li>
&lt;/ul>
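&lt;p>One common reading of semi-hard mining, sketched within a mini-batch: a negative is semi-hard if it is farther than the positive but still inside the margin, so it yields a positive loss without being pathologically hard. The margin upper bound is an assumption of this sketch, not stated in the notes:&lt;/p>

```python
import numpy as np

def semi_hard_negatives(emb, labels, anchor_idx, pos_idx, alpha=0.2):
    """Indices of semi-hard negatives for one anchor-positive pair:
    d_pos < d(a, n) < d_pos + alpha, restricted to other identities."""
    a = emb[anchor_idx]
    d_pos = np.sum((a - emb[pos_idx]) ** 2)
    d = np.sum((emb - a) ** 2, axis=1)
    mask = (labels != labels[anchor_idx]) & (d > d_pos) & (d < d_pos + alpha)
    return np.nonzero(mask)[0]
```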
&lt;h3 id="results">Results&lt;/h3>
&lt;ul>
&lt;li>LFW: 99.63% $\pm$ 0.09&lt;/li>
&lt;li>YouTube Faces DB: 95.12% $\pm$ 0.39&lt;/li>
&lt;/ul>
&lt;h2 id="deep-face-recognition-3">Deep Face Recognition &lt;sup id="fnref:3">&lt;a href="#fn:3" class="footnote-ref" role="doc-noteref">3&lt;/a>&lt;/sup>&lt;/h2>
&lt;p>&lt;strong>Key Questions&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Can large scale datasets be built with minimal human intervention? &lt;a href="#dataset-collection">Yes&lt;/a>!&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Can we propose a convolutional neural network which can compete with that of internet giants like Google and Facebook? &lt;a href="#convolutional-neural-network">Yes&lt;/a>!&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h3 id="dataset-collection">Dataset Collection&lt;/h3>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Candidate list generation&lt;/strong>: &lt;strong>Finding names of celebrities&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Tap the knowledge on the web&lt;/li>
&lt;li>5000 identities&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Manual verification of celebrities: Finding Popular Celebrities&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Collect representative images for each celebrity&lt;/p>
&lt;/li>
&lt;li>
&lt;p>200 images/identity&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Remove people with low representation on Google.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Remove overlap with public benchmarks&lt;/p>
&lt;/li>
&lt;li>
&lt;p>2622 celebrities for the final dataset&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Rank image sets&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>2000 images per identity&lt;/li>
&lt;li>Searching by appending keyword “actor”&lt;/li>
&lt;li>Learning a classifier using data obtained in the previous step.&lt;/li>
&lt;li>Ranking 2000 images and selecting top 1000 images&lt;/li>
&lt;li>Approx. 2.6 Million images of 2622 celebrities&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Near duplicate removal&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>VLAD descriptor based near duplicate removal&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Manual filtering&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Curating the dataset further using manual checks&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol>
&lt;h3 id="convolutional-neural-network">Convolutional Neural Network&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>The “Very Deep” Architecture&lt;/p>
&lt;ul>
&lt;li>
&lt;p>3 x 3 Convolution Kernels (Very small)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Conv. Stride 1 px.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Relu non-linearity&lt;/p>
&lt;/li>
&lt;li>
&lt;p>No local contrast normalisation&lt;/p>
&lt;/li>
&lt;li>
&lt;p>3 Fully connected layers&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Training&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Random Gaussian Initialization&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Stochastic Gradient Descent with back prop.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Batch Size: 256&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Incremental FC layer training&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Learning Task Specific Embedding&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Learning embedding by minimizing triplet loss
&lt;/p>
$$
\sum\_{(a, p, n) \in T} \max \left\\{0, \alpha-\left\\|\mathbf{x}\_{a}-\mathbf{x}\_{n}\right\\|\_{2}^{2}+\left\\|\mathbf{x}\_{a}-\mathbf{x}\_{p}\right\\|\_{2}^{2}\right\\}
$$
&lt;/li>
&lt;li>
&lt;p>Learning a projection from 4096 to 1024 dimensions&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Online triplet formation at the beginning of each iteration&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Fine tuned on target datasets&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Only the projection layers learnt&lt;/strong>&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;div class="footnotes" role="doc-endnotes">
&lt;hr>
&lt;ol>
&lt;li id="fn:1">
&lt;p>Y. Taigman, M. Yang, M. Ranzato and L. Wolf, &amp;ldquo;DeepFace: Closing the Gap to Human-Level Performance in Face Verification,&amp;rdquo; 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 2014, pp. 1701-1708, doi: 10.1109/CVPR.2014.220.&amp;#160;&lt;a href="#fnref:1" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:2">
&lt;p>Schroff, Florian &amp;amp; Kalenichenko, Dmitry &amp;amp; Philbin, James. (2015). FaceNet: A unified embedding for face recognition and clustering. 815-823. 10.1109/CVPR.2015.7298682.&amp;#160;&lt;a href="#fnref:2" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:3">
&lt;p>Omkar M. Parkhi, Andrea Vedaldi and Andrew Zisserman. Deep Face Recognition. In Xianghua Xie, Mark W. Jones, and Gary K. L. Tam, editors, Proceedings of the British Machine Vision Conference (BMVC), pages 41.1-41.12. BMVA Press, September 2015.&amp;#160;&lt;a href="#fnref:3" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;/ol>
&lt;/div></description></item><item><title>Facial Feature Detection</title><link>https://haobin-tan.netlify.app/docs/ai/computer-vision/cv-lecture/08-facial-features-detection/</link><pubDate>Thu, 18 Feb 2021 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/computer-vision/cv-lecture/08-facial-features-detection/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;h3 id="what-are-facial-features">What are facial features?&lt;/h3>
&lt;p>Facial features are the &lt;strong>salient parts of a face region which carry meaningful information&lt;/strong>.&lt;/p>
&lt;ul>
&lt;li>E.g. eye, eyebrow, nose, mouth&lt;/li>
&lt;li>A.k.a &lt;mark>&lt;strong>facial landmarks&lt;/strong>&lt;/mark>&lt;/li>
&lt;/ul>
&lt;h3 id="what-is-facial-feature-detection">What is facial feature detection?&lt;/h3>
&lt;p>Facial feature detection is defined as methods of &lt;strong>locating the specific areas of a face&lt;/strong>.&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-18%2023.13.42.png" alt="截屏2021-02-18 23.13.42" style="zoom:80%;" />
&lt;h3 id="applications-of-facial-feature-detection">Applications of facial feature detection&lt;/h3>
&lt;ul>
&lt;li>Face recognition&lt;/li>
&lt;li>Model-based head pose estimation&lt;/li>
&lt;li>Eye gaze tracking&lt;/li>
&lt;li>Facial expression recognition&lt;/li>
&lt;li>Age modeling&lt;/li>
&lt;/ul>
&lt;h3 id="problems-in-facial-feature-detection">Problems in facial feature detection&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Identity variations&lt;/strong>&lt;/p>
&lt;p>Each person has unique facial parts.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Expression variations&lt;/strong>&lt;/p>
&lt;p>Some facial features change their state (e.g. eye blinks).&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Head rotations&lt;/strong>&lt;/p>
&lt;p>If a head orientation changes, the visual appearance also changes.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Scale variations&lt;/strong>&lt;/p>
&lt;p>Changes in resolution and distance to the camera affect appearance.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Lighting conditions&lt;/strong>&lt;/p>
&lt;p>Light has non-linear effects on the pixel values of an image.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Occlusions&lt;/strong>&lt;/p>
&lt;p>Hair or glasses might hide facial features.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h2 id="older-approaches-from-face-detection">Older approaches (from face detection)&lt;/h2>
&lt;ul>
&lt;li>Integral projections + geometric constraints&lt;/li>
&lt;li>Haar-Filter Cascades&lt;/li>
&lt;li>PCA-based methods (Modular Eigenspace)&lt;/li>
&lt;li>Morphable 3D Model&lt;/li>
&lt;/ul>
&lt;h2 id="statistical-appearance-models">Statistical appearance models&lt;/h2>
&lt;ul>
&lt;li>💡 Idea: make use of prior-knowledge, i.e. models, to reduce the complexity of the task&lt;/li>
&lt;li>Needs to be able to deal with variability $\rightarrow$ &lt;strong>deformable models&lt;/strong>&lt;/li>
&lt;li>Use statistical models of shape and texture to find facial landmark points&lt;/li>
&lt;li>Good models should
&lt;ul>
&lt;li>Capture the various characteristics of the object to be detected&lt;/li>
&lt;li>Be a compact representation in order to avoid heavy calculation&lt;/li>
&lt;li>Be robust against noise&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="basic-idea">Basic idea&lt;/h3>
&lt;ol>
&lt;li>&lt;strong>Training&lt;/strong> stage: construction of models&lt;/li>
&lt;li>&lt;strong>Test&lt;/strong> stage: Search the region of interest (ROI)&lt;/li>
&lt;/ol>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-19%2011.30.55.png" alt="截屏2021-02-19 11.30.55" style="zoom:80%;" />
&lt;h3 id="appearance-models">Appearance models&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>Represent both &lt;strong>texture&lt;/strong> and &lt;strong>shape&lt;/strong>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Statistical model learned from training data&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Modeling shape variability&lt;/p>
&lt;ul>
&lt;li>Landmark points&lt;/li>
&lt;/ul>
$$
x=\left[x\_{1}, y\_{1}, x\_{2}, y\_{2}, \ldots, x\_{n}, y\_{n}\right]^{T}
$$
&lt;ul>
&lt;li>
&lt;p>Model
&lt;/p>
$$
x \approx \bar{x}+P\_{s} b\_{s}
$$
&lt;ul>
&lt;li>$\bar{x}$: Mean vector&lt;/li>
&lt;li>$P\_s$: Eigenvectors of covariance matrix&lt;/li>
&lt;li>$b\_s = P\_s^T(x - \bar{x})$&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Modeling intensity variability:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Gray values
&lt;/p>
$$
h=\left[g\_{1}, g\_{2}, \ldots, g\_{k}\right]^{T}
$$
&lt;/li>
&lt;li>
&lt;p>Model
&lt;/p>
$$
h \approx \bar{h} + P\_ib\_i
$$
&lt;ul>
&lt;li>$\bar{h}$: Mean vector&lt;/li>
&lt;li>$P\_i$: Eigenvectors of covariance matrix&lt;/li>
&lt;li>$b\_i = P\_i^T(h - \bar{h})$&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="training-of-appearance-models">Training of appearance models&lt;/h3>
&lt;h4 id="1-construct-a-shape-model-with-principal-component-analysis-pca">1. Construct a shape model with Principal component analysis (PCA)&lt;/h4>
&lt;p>A shape is represented with manually labeled points.&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-19%2011.40.18.png" alt="截屏2021-02-19 11.40.18">&lt;/p>
&lt;p>The shape model approximates the shape of an object.&lt;/p>
&lt;h5 id="procrustes-analysis">&lt;strong>Procrustes Analysis&lt;/strong>&lt;/h5>
&lt;p>Align all the shapes together to remove translation, rotation and scaling&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-19%2011.51.46.png" alt="截屏2021-02-19 11.51.46">&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-19%2011.52.05.png" alt="截屏2021-02-19 11.52.05">&lt;/p>
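&lt;p>The alignment of one shape pair can be sketched with SciPy (illustrative only; building the model iterates alignment over all training shapes):&lt;/p>

```python
import numpy as np
from scipy.spatial import procrustes

# Two toy 4-point shapes: shape_b is shape_a rotated 90 degrees,
# scaled by 3 and translated, i.e. identical up to a similarity transform.
shape_a = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
theta = np.pi / 2
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
shape_b = 3.0 * shape_a.dot(R.T) + np.array([5.0, -2.0])

# procrustes standardizes both shapes and removes translation,
# scaling and rotation; disparity measures the remaining difference.
m1, m2, disparity = procrustes(shape_a, shape_b)
print(disparity)  # close to 0: the shapes differ only by a similarity transform
```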
&lt;p>&lt;strong>PCA&lt;/strong>&lt;/p>
&lt;p>The positions of labeled points are
&lt;/p>
$$
x = \bar{x}+P\_{s} b\_{s}
$$
&lt;ul>
&lt;li>$\bar{x}$: Mean shape&lt;/li>
&lt;li>$P\_s$: Orthogonal modes of variation obtained by PCA&lt;/li>
&lt;li>$b\_s$: Shape parameters in the projected space&lt;/li>
&lt;/ul>
&lt;p>The shapes are represented with fewer parameters ($\operatorname{Dim}(x) > \operatorname{Dim}(b\_s)$)&lt;/p>
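&lt;p>The PCA shape model can be sketched in NumPy (a minimal version; &lt;code>shapes&lt;/code> is assumed to be an array of already-aligned landmark vectors):&lt;/p>

```python
import numpy as np

def build_shape_model(shapes, n_modes=5):
    """PCA shape model; shapes is (num_samples, 2n) of aligned landmarks."""
    x_bar = shapes.mean(axis=0)                  # mean shape
    cov = np.cov(shapes - x_bar, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)       # eigh returns ascending order
    order = np.argsort(eigvals)[::-1][:n_modes]  # keep the largest modes
    return x_bar, eigvecs[:, order]              # x_bar and P_s

def project(x, x_bar, P_s):
    """Shape parameters b_s = P_s^T (x - x_bar)."""
    return P_s.T.dot(x - x_bar)

def reconstruct(b_s, x_bar, P_s):
    """Approximate shape x = x_bar + P_s b_s."""
    return x_bar + P_s.dot(b_s)
```

&lt;p>Keeping all modes makes the reconstruction exact; truncating to a few modes gives the compact representation described above.&lt;/p>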
&lt;p>Generating plausible shapes:&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-19%2012.38.11.png" alt="截屏2021-02-19 12.38.11" style="zoom:80%;" />
&lt;h4 id="2-construct-a-texture-model-which-represents-grey-scale-or-color-values-at-each-point">2. Construct a texture model which represents grey-scale (or color) values at each point&lt;/h4>
&lt;p>Warp the image so that the labeled points fit on the mean shape&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-19%2012.39.55.png" alt="截屏2021-02-19 12.39.55" style="zoom:80%;" />
&lt;p>Then normalize the intensity on the &lt;em>shape-free&lt;/em> patch.&lt;/p>
&lt;h5 id="texture-warping">Texture warping&lt;/h5>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-19%2012.58.42.png" alt="截屏2021-02-19 12.58.42" />
&lt;h4 id="texture-model">Texture model&lt;/h4>
&lt;p>The pixel values on the shape-free patch
&lt;/p>
$$
g = \bar{g} + P\_g b\_g
$$
&lt;ul>
&lt;li>$\bar{g}$ : Mean of normalized pixel values&lt;/li>
&lt;li>$P\_g$ : Orthogonal modes of variation obtained by PCA&lt;/li>
&lt;li>$b\_g$: Texture parameters in the projected space&lt;/li>
&lt;/ul>
&lt;p>The pixel values (appearance) are represented with fewer parameters ($\operatorname{Dim}(g) > \operatorname{Dim}(b\_g)$)&lt;/p>
&lt;h4 id="3-model-the-correlation-between-shapes-and-grey-level-models">3. Model the correlation between shapes and grey-level models&lt;/h4>
&lt;p>The concatenated vector is
&lt;/p>
$$
b=\left(\begin{array}{c}
W\_{s} b\_{s} \\\\
b\_{g}
\end{array}\right)
$$
&lt;p>
Apply PCA:
&lt;/p>
$$
b=P\_{c} c=\left(\begin{array}{l}
P\_{c s} \\\\
P\_{c g}
\end{array}\right)c
$$
&lt;p>
Now the parameter $\mathbf{c}$ can control both shape and grey-level models&lt;/p>
&lt;ul>
&lt;li>
&lt;p>The shape model
&lt;/p>
$$
x=\bar{x}+P\_{s} W\_{s}^{-1} P\_{c s} c
$$
&lt;/li>
&lt;li>
&lt;p>The grey-level model
&lt;/p>
$$
g=\bar{g}+P\_{g} P\_{c g} c
$$
&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Examples of synthesized faces&lt;/strong>&lt;/p>
&lt;p>Various objects can be synthesized by controlling the parameter $\mathbf{c}$&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-19%2013.07.53.png" alt="截屏2021-02-19 13.07.53" style="zoom:80%;" />
&lt;h3 id="dataset-for-building-model">Dataset for Building Model&lt;/h3>
&lt;p>IMM data set from Danish Technical University&lt;/p>
&lt;ul>
&lt;li>
&lt;p>240 images of size 640*480; 40 individuals (36 male, 4 female).&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Each subject has 6 shots with different poses, expressions, and illumination.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Each image is labeled with 58 landmarks; 3 closed and 4 open point-paths.&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-19%2013.09.03.png" alt="截屏2021-02-19 13.09.03">&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h3 id="image-interpretation-with-models">Image Interpretation with Models&lt;/h3>
&lt;ul>
&lt;li>🎯 &lt;strong>Goal: find the set of parameters which best match the model to the image&lt;/strong>
&lt;ul>
&lt;li>Optimize some cost function&lt;/li>
&lt;li>Difficult optimization problem&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Set of parameters
&lt;ul>
&lt;li>Defines shape, position, appearance&lt;/li>
&lt;li>Can be used for further processing
&lt;ul>
&lt;li>
&lt;p>Position of landmarks&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Face recognition&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Facial expression recognition&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Pose estimation&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Problem: Optimizing the model fit
&lt;ul>
&lt;li>&lt;a href="#active-shape-models-asm">Active Shape Models&lt;/a>&lt;/li>
&lt;li>&lt;a href="#active-appearance-models-aam">Active Appearance Models&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="active-shape-models-asm">Active Shape Models (ASM)&lt;/h3>
&lt;p>Given a rough starting position, create an instance $\mathbf{X}$ of the model using&lt;/p>
&lt;ul>
&lt;li>shape parameters $b$&lt;/li>
&lt;li>translation $T=(X\_t,Y\_t)$&lt;/li>
&lt;li>scale $s$&lt;/li>
&lt;li>rotation $\theta$&lt;/li>
&lt;/ul>
&lt;p>Iterative approach:&lt;/p>
&lt;ol>
&lt;li>Examine region of the image around $\mathbf{X}\_i$ to find the best nearby match for the point $\mathbf{X}\_i^\prime$&lt;/li>
&lt;li>Update parameters $(b, T, s, \theta)$ to best fit the new points $\mathbf{X}$ (constrain the model parameters to be within three standard deviations)&lt;/li>
&lt;li>Repeat until convergence&lt;/li>
&lt;/ol>
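&lt;p>One iteration of this loop can be sketched as follows (highly simplified: &lt;code>find_best_nearby&lt;/code> stands in for the local image search, and the similarity-transform update $(T, s, \theta)$ is omitted):&lt;/p>

```python
import numpy as np

def asm_iteration(X, x_bar, P_s, eigvals, find_best_nearby):
    """One ASM update: local search, model projection, parameter clamping."""
    # 1. For each current point, find the best nearby match in the image.
    points = X.reshape(-1, 2)
    X_new = np.array([find_best_nearby(p) for p in points]).ravel()
    # 2. Project into the shape model and constrain the parameters
    #    to be within three standard deviations of the mean.
    b = P_s.T.dot(X_new - x_bar)
    limit = 3.0 * np.sqrt(eigvals)
    b = np.clip(b, -limit, limit)
    # 3. Regenerate a plausible shape from the clamped parameters.
    return x_bar + P_s.dot(b)
```

&lt;p>Here &lt;code>eigvals&lt;/code> are the variances of the retained shape modes; the clamp is what keeps every intermediate shape plausible.&lt;/p>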
&lt;p>In practice: &lt;strong>search along profile normals&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>The optimal parameters are searched from &lt;strong>multi-resolution&lt;/strong> images hierarchically (faster algorithm)&lt;/p>
&lt;ol>
&lt;li>Search for the object in a coarse image&lt;/li>
&lt;li>Refine the location in a series of higher resolution images.&lt;/li>
&lt;/ol>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-19%2013.31.48.png" alt="截屏2021-02-19 13.31.48">&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Example of search&lt;/strong>&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-19%2013.32.21.png" alt="截屏2021-02-19 13.32.21" style="zoom:80%;" />
&lt;h4 id="disadvantages">Disadvantages&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>Uses mainly shape constraints for search&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Does not take advantage of texture across the target&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h3 id="active-appearance-models-aam">Active Appearance Models (AAM)&lt;/h3>
&lt;ul>
&lt;li>Optimize parameters, so as to minimize the difference of a synthesized image and the target image&lt;/li>
&lt;li>Solved using a gradient-descent approach&lt;/li>
&lt;/ul>
&lt;h4 id="fitting-aams">Fitting AAMs&lt;/h4>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-19%2015.51.19.png" alt="截屏2021-02-19 15.51.19" style="zoom:80%;" />
&lt;p>Learning linear relation matrix $\mathbf{R}$ using multi-variate linear regression&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Generate training set by perturbing model parameters for training images&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Include small displacements in position, scale, and orientation&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Record perturbation and image difference&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Experimentally, optimal perturbation around 0.5 standard deviations for each parameter&lt;/p>
&lt;/li>
&lt;/ul>
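&lt;p>The regression step can be sketched with an ordinary least-squares fit (illustrative; &lt;code>perturbations&lt;/code> and &lt;code>image_diffs&lt;/code> stand in for the training matrices collected as described above):&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical training data, generated as described above: known parameter
# perturbations delta_c and the image (texture) differences delta_g they cause.
M = rng.standard_normal((4, 10))               # stand-in forward model
perturbations = rng.standard_normal((200, 4))  # delta_c for each trial
image_diffs = perturbations.dot(M)             # recorded delta_g

# Multi-variate linear regression for R such that delta_c = delta_g . R
R, _, _, _ = np.linalg.lstsq(image_diffs, perturbations, rcond=None)

# At test time an observed image difference predicts the parameter update.
delta_c = np.array([0.5, -1.0, 0.2, 0.0])
print(delta_c.dot(M).dot(R))  # recovers approximately [0.5, -1.0, 0.2, 0.0]
```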
&lt;h3 id="asm-vs-aam">ASM vs. AAM&lt;/h3>
&lt;p>&lt;strong>ASM&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Seeks to match a set of model points to an image, constrained by a statistical model of shape&lt;/li>
&lt;li>Matches model points using an &lt;strong>iterative&lt;/strong> technique (variant of EM-algorithm)&lt;/li>
&lt;li>A search is made around the current position of each point to find a nearby point which best matches texture for the landmark&lt;/li>
&lt;li>Parameters of the shape model are then updated to move the model points closer to the new points in the image&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>AAM&lt;/strong>: matches both position of model points and representation of texture of the object to an image&lt;/p>
&lt;ul>
&lt;li>Uses the difference between current synthesized image and target image to update parameters&lt;/li>
&lt;li>Typically, fewer landmark points are needed&lt;/li>
&lt;/ul>
&lt;h3 id="summary-of-asm-and-aam">Summary of ASM and AAM&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>Statistical appearance models provide a compact representation&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Can model variations such as different identities, facial expression, appearances, etc.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Labeled training images are needed (very time-consuming) 🤪&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Original formulation of ASM and AAM is computationally expensive (i.e. slow) 🤪&lt;/p>
&lt;/li>
&lt;li>
&lt;p>But, efficient extensions and speed-ups exist!&lt;/p>
&lt;ul>
&lt;li>Multi-resolution search&lt;/li>
&lt;li>Constrained AAM search&lt;/li>
&lt;li>Inverse compositional AAMs (CMU)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Usage&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Facial fiducial point detection&lt;/strong>&lt;/li>
&lt;li>Face recognition, pose estimation&lt;/li>
&lt;li>Facial expression analysis&lt;/li>
&lt;li>Audio-visual speech recognition&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h2 id="more-modern-approaches-conditional-random-forests-for-real-time-facial-feature-detection1">More Modern Approaches: &lt;strong>Conditional Random Forests&lt;/strong> For Real Time Facial Feature Detection&lt;sup id="fnref:1">&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref">1&lt;/a>&lt;/sup>&lt;/h2>
&lt;h3 id="basics">Basics&lt;/h3>
&lt;h4 id="regression-tree">Regression tree&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>Basically like a classification decision tree&lt;/p>
&lt;/li>
&lt;li>
&lt;p>In the nodes: decisions are comparisons of numbers&lt;/p>
&lt;/li>
&lt;li>
&lt;p>In the leaves: numbers or multidimensional vectors of numbers&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Example&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-19%2016.02.51.png" alt="截屏2021-02-19 16.02.51">&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h4 id="random-regression-forests">Random regression forests&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>Set of random regression trees&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Random&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Different trees trained on random subset of training data&lt;/p>
&lt;/li>
&lt;li>
&lt;p>After training, predictions for unseen samples can be made by averaging the predictions from all the individual regression trees&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-19%2016.03.52.png" alt="截屏2021-02-19 16.03.52">&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
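&lt;p>The averaging behaviour can be demonstrated with scikit-learn&amp;rsquo;s &lt;code>RandomForestRegressor&lt;/code> (a generic regressor on toy data, not the patch-based forest used in the paper):&lt;/p>

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 10.0, size=(500, 1))
y = np.sin(X).ravel()  # noiseless toy regression target

# Each tree sees a bootstrap subset of the data; the forest prediction
# is the average of the individual tree predictions.
forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

x_query = np.array([[3.0]])
tree_preds = [tree.predict(x_query)[0] for tree in forest.estimators_]
print(np.mean(tree_preds), forest.predict(x_query)[0])  # the two agree
```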
&lt;h3 id="basic-idea-1">Basic idea&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>Train a different set of trees for each head pose.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>The leaf nodes accumulate votes for the different facial fiducial points&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-19%2016.21.08.png" alt="截屏2021-02-19 16.21.08">&lt;/p>
&lt;h3 id="regression-forests-training">Regression forests training&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>Each tree is trained on a randomly selected set of images.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Extract patches in each image&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Training goal: accumulate probability for a feature point $C\_n$ given a patch $P$ at the leaf node&lt;/p>
&lt;ul>
&lt;li>Each patch is represented by appearance features $I$, and displacement vectors $D$ (offsets) to each of the facial fiducial feature points. I.e. $P = \\{I, D\\}$&lt;/li>
&lt;li>A simple patch comparison is used as Tree-node splitting criterion&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="regression-forests-testing">Regression forests testing&lt;/h3>
&lt;ul>
&lt;li>Given: a random face image&lt;/li>
&lt;li>Extract a dense set of patches from the image&lt;/li>
&lt;li>Feed all patches to all trees in the forest&lt;/li>
&lt;li>Get for each patch $P\_i$ a corresponding set of leaves&lt;/li>
&lt;li>A density estimator for the location of the facial feature points is calculated&lt;/li>
&lt;li>Run meanshift to find all locations&lt;/li>
&lt;/ul>
&lt;h3 id="conditional-regression-forest">Conditional Regression Forest&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>Conditional regression trees work similarly.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>For &lt;strong>training&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Compute a probability for a concrete head pose&lt;/p>
&lt;/li>
&lt;li>
&lt;p>For each head pose divide the training set in disjoint subsets according to the pose&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Train a regression forest for each subset&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>For &lt;strong>testing&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Estimate the probabilities for each head pose&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Select trees from different regression forests&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Estimate the density function for all facial feature points.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Finalize the exact position by clustering over all candidate votes for a given facial feature point (e.g., by meanshift).&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
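&lt;p>The final clustering step can be sketched with scikit-learn&amp;rsquo;s &lt;code>MeanShift&lt;/code> (illustrative; &lt;code>votes&lt;/code> is a hypothetical set of 2-D location votes for one feature point):&lt;/p>

```python
import numpy as np
from sklearn.cluster import MeanShift

rng = np.random.default_rng(0)
# Hypothetical 2-D location votes for one facial feature point: most votes
# fall near the true location (60, 40), plus a few scattered outliers.
votes = np.vstack([
    rng.normal(loc=[60.0, 40.0], scale=1.5, size=(200, 2)),
    rng.uniform(0.0, 100.0, size=(20, 2)),
])

ms = MeanShift(bandwidth=5.0).fit(votes)
# scikit-learn orders cluster centers by descending support, so the first
# center is the strongest mode of the vote density.
print(ms.cluster_centers_[0])  # close to [60, 40]
```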
&lt;h3 id="experiments-and-results">Experiments and results&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>Training set:&lt;/p>
&lt;ul>
&lt;li>13233 face images from LFW Database&lt;/li>
&lt;li>10 annotated facial feature points per face image&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Training&lt;/p>
&lt;ul>
&lt;li>Maximum tree depth = 20&lt;/li>
&lt;li>2500 splitting candidates and 25 thresholds per split&lt;/li>
&lt;li>1500 images to train each tree&lt;/li>
&lt;li>200 patches per image (20 * 20 pixels).&lt;/li>
&lt;li>For head pose, two different subsets with 3 and 5 head poses are generated (accuracy 72.5%)&lt;/li>
&lt;li>Required time for face detection and head pose estimation is 33 ms.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Results&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/result.png" alt="result">&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h2 id="cnn-based-models">CNN based models&lt;/h2>
&lt;p>&lt;strong>Stacked Hourglass Network&lt;/strong> &lt;sup id="fnref:2">&lt;a href="#fn:2" class="footnote-ref" role="doc-noteref">2&lt;/a>&lt;/sup>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Fully-convolutional neural network&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Repeated down- and upsampling + shortcut connections&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Based on RGB face image, produce one heatmap for each landmark&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Heatmaps are transformed into numerical coordinates using DSNT&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-19%2016.51.43.png" alt="截屏2021-02-19 16.51.43">&lt;/p>
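&lt;p>The DSNT step is essentially a differentiable expectation over a normalized heatmap; a minimal NumPy sketch (not the authors&amp;rsquo; implementation):&lt;/p>

```python
import numpy as np

def dsnt(heatmap):
    """Expected (x, y) coordinate under the normalized heatmap, in [-1, 1]."""
    h, w = heatmap.shape
    p = heatmap / heatmap.sum()                # normalize into a probability map
    xs = (2.0 * np.arange(w) + 1.0) / w - 1.0  # pixel-center x coordinates
    ys = (2.0 * np.arange(h) + 1.0) / h - 1.0  # pixel-center y coordinates
    x = (p.sum(axis=0) * xs).sum()             # E[x]
    y = (p.sum(axis=1) * ys).sum()             # E[y]
    return x, y

# A heatmap with a single peak maps to that peak's normalized position.
hm = np.zeros((8, 8))
hm[2, 5] = 1.0
x, y = dsnt(hm)
print(x, y)  # 0.375 -0.375
```

&lt;p>Because the output is an expectation, gradients flow back through the heatmap, which is what makes end-to-end training possible.&lt;/p>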
&lt;div class="footnotes" role="doc-endnotes">
&lt;hr>
&lt;ol>
&lt;li id="fn:1">
&lt;p>M. Dantone, J. Gall, G. Fanelli and L. Van Gool, &amp;ldquo;Real-time facial feature detection using conditional regression forests,&amp;rdquo; 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 2012, pp. 2578-2585, doi: 10.1109/CVPR.2012.6247976.&amp;#160;&lt;a href="#fnref:1" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:2">
&lt;p>Newell, A., Yang, K., &amp;amp; Deng, J. (2016). Stacked hourglass networks for human pose estimation. &lt;em>Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)&lt;/em>, &lt;em>9912 LNCS&lt;/em>, 483–499. &lt;a href="https://doi.org/10.1007/978-3-319-46484-8_29">https://doi.org/10.1007/978-3-319-46484-8_29&lt;/a>&amp;#160;&lt;a href="#fnref:2" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;/ol>
&lt;/div></description></item><item><title>Facial Expression Recognition</title><link>https://haobin-tan.netlify.app/docs/ai/computer-vision/cv-lecture/09-facial-expression-recognition/</link><pubDate>Thu, 18 Feb 2021 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/computer-vision/cv-lecture/09-facial-expression-recognition/</guid><description>&lt;h2 id="what-is-facial-expression-analysis">What is facial expression analysis?&lt;/h2>
&lt;h3 id="what-is-facial-expression">What is Facial Expression?&lt;/h3>
&lt;p>Facial expressions are the &lt;strong>facial changes in response to a person&amp;rsquo;s internal emotional states, intentions, or social communications.&lt;/strong>&lt;/p>
&lt;h3 id="role-of-facial-expressions">Role of facial expressions&lt;/h3>
&lt;ul>
&lt;li>Almost the &lt;strong>most powerful, natural, and immediate way&lt;/strong> (for human beings) to communicate emotions and intentions&lt;/li>
&lt;li>Face can express emotion &lt;strong>sooner&lt;/strong> than people verbalize or realize feelings&lt;/li>
&lt;li>Faces and facial expressions are an &lt;strong>important aspect&lt;/strong> in interpersonal communication and man-machine interfaces&lt;/li>
&lt;/ul>
&lt;h3 id="facial-expressions">Facial Expressions&lt;/h3>
&lt;ul>
&lt;li>Facial expression(s):
&lt;ul>
&lt;li>
&lt;p>nonverbal communication&lt;/p>
&lt;/li>
&lt;li>
&lt;p>voluntary / involuntary&lt;/p>
&lt;/li>
&lt;li>
&lt;p>results from one or more motions or positions of the muscles of the face&lt;/p>
&lt;/li>
&lt;li>
&lt;p>closely associated with our emotions&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>The fact: Most people&amp;rsquo;s success rate at reading emotions from facial expression is &lt;strong>only a little over 50 percent&lt;/strong>.&lt;/li>
&lt;/ul>
&lt;h4 id="facial-expression-analysis-vs-emotion-analysis">Facial expression analysis vs. Emotion analysis&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>Emotion analysis requires &lt;strong>higher level knowledge&lt;/strong>, such as context information.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Besides emotions, facial expressions can also express intention, cognitive processes, physical effort, etc.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h4 id="emotions-conveyed-by-facial-expressions">Emotions conveyed by Facial Expressions&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>Six basic emotions (assumed to be innate)&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-100" >&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%e6%88%aa%e5%b1%8f2021-02-19%2017.20.59.png" alt="截屏2021-02-19 17.20.59" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h3 id="basic-structure-of-facial-expression-analysis-systems">Basic structure of facial expression analysis systems&lt;/h3>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-19%2017.21.36.png" alt="截屏2021-02-19 17.21.36" style="zoom:80%;" />
&lt;h2 id="levels-of-description">Levels of description&lt;/h2>
&lt;h3 id="emotions">Emotions&lt;/h3>
&lt;h4 id="discrete-classes">Discrete classes&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>Six basic emotions&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-100" >&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%e6%88%aa%e5%b1%8f2021-02-19%2017.27.26.png" alt="截屏2021-02-19 17.27.26" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Positive, neutral, negative&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h4 id="continuous-valued-dimensions">Continuous valued dimensions&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>Emotions as a continuum along 2/3 dimension&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Circumplex model by Russel&lt;/p>
&lt;ul>
&lt;li>Valence: unpleasant - pleasant&lt;/li>
&lt;li>Arousal: low – high activation&lt;/li>
&lt;/ul>
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-100" >&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%e6%88%aa%e5%b1%8f2021-02-19%2017.28.58.png" alt="截屏2021-02-19 17.28.58" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h3 id="facial-action-units-aus">Facial Action Units (AUs)&lt;/h3>
&lt;h4 id="facial-action-coding-system-facs">Facial Action Coding System (FACS)&lt;/h4>
&lt;ul>
&lt;li>A human-observer based system designed to &lt;strong>detect subtle changes in facial features&lt;/strong>&lt;/li>
&lt;li>Viewing videotaped facial behavior in &lt;em>slow&lt;/em> motion, trained observers can manually FACS-code all possible facial displays&lt;/li>
&lt;li>These facial displays are referred to as &lt;strong>action units (AU)&lt;/strong> and may occur individually or in combinations.&lt;/li>
&lt;/ul>
&lt;h4 id="action-units-aus">Action Units (AUs)&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>There are 44 AUs&lt;/p>
&lt;/li>
&lt;li>
&lt;p>30 AUs related to contractions of special facial muscles&lt;/p>
&lt;ul>
&lt;li>
&lt;p>12 AUs for upper face&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-19%2017.32.01.png" alt="截屏2021-02-19 17.32.01" style="zoom:80%;" />
&lt;/li>
&lt;li>
&lt;p>18 AUs for lower face&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-19%2017.32.29.png" alt="截屏2021-02-19 17.32.29" style="zoom:80%;" />
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Anatomic basis of the remaining 14 is unspecified $\rightarrow$ referred to in Facial Action Coding System (FACS) as miscellaneous actions&lt;/p>
&lt;/li>
&lt;li>
&lt;p>For action units that vary in intensity, a 5-point ordinal scale is used to measure the degree of muscle contraction&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h4 id="combination-of-aus">Combination of AUs&lt;/h4>
&lt;p>More than 7000 different AU combinations have been observed.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Additive&lt;/strong>: appearance of single AUs does NOT change. E.g.&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-19%2017.36.39.png" alt="截屏2021-02-19 17.36.39" style="zoom:80%;" />
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Nonadditive&lt;/strong>: appearance of single AUs does change. E.g.&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-19%2017.37.05.png" alt="截屏2021-02-19 17.37.05" style="zoom:80%;" />
&lt;/li>
&lt;/ul>
&lt;h4 id="individual-differences-in-subjects">Individual Differences in Subjects&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>Variations in appearance&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Face shape&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Texture&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Color&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Facial and scalp hair&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>due to sex, ethnic background, and age differences&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Variations in expressiveness&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h4 id="transitions-among-expressions">Transitions Among Expressions&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>Simplifying assumption: &lt;strong>expressions are singular and begin and end with a neutral position&lt;/strong>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Transitions from action units or combination of actions to another may involve NO intervening neutral state.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Parsing the stream of behavior is an essential requirement of a robust facial analysis system, and training data are needed that include dynamic combinations of action units, which may be either additive or nonadditive.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h4 id="intensity-of-facial-expression">Intensity of Facial Expression&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>Facial actions can vary in intensity&lt;/p>
&lt;/li>
&lt;li>
&lt;p>FACS coding uses 5-point intensity scale to describe intensity variation of action units&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Some related action units function as sets to represent intensity variation.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>E.g. in the eye region, action units 41, 42, and 43 or 45 can represent intensity variation from slightly drooped to closed eyes.&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-100" >&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%e6%88%aa%e5%b1%8f2021-02-19%2017.43.17.png" alt="截屏2021-02-19 17.43.17" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h4 id="relation-to-other-facial-behavior-or-nonfacial-behavior">Relation to other Facial Behavior or Nonfacial Behavior&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>Facial expression is one of several channels of nonverbal communication.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>The message values of various modes may differ depending on context.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>For robustness, should be integrated with&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Gesture&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Prosody&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Speech&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h2 id="different-datasets-and-systems">Different datasets and systems&lt;/h2>
&lt;h3 id="using-geometric-features--ann-2001--early-work">Using geometric features + ANN (2001 / early work)&lt;/h3>
&lt;p>&lt;strong>Recognizing Action Units for Facial Expression Analysis&lt;/strong>&lt;sup id="fnref:1">&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref">1&lt;/a>&lt;/sup>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>An &lt;strong>Automatic Facial Analysis (AFA)&lt;/strong> system to analyze facial expressions based on both &lt;strong>permanent facial features (brows, eyes, mouth)&lt;/strong> and &lt;strong>transient facial features (deepening of facial furrows)&lt;/strong> in nearly frontal-view image sequences.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>A group of action units (neutral expression, six upper face AUs and 10 lower face AUs) is recognized, whether they occur alone or in combination.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h4 id="cohn-kanade-au-coded-facial-expression-database">Cohn-Kanade AU-Coded Facial Expression Database&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>100 subjects from varying ethnic backgrounds.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>23 different facial expressions (single action units and combinations of action units)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Frontal faces, small head motion&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Variations in lighting&lt;/p>
&lt;ul>
&lt;li>ambient lighting&lt;/li>
&lt;li>single-high-intensity lamp&lt;/li>
&lt;li>dual high-intensity lamps with reflective umbrellas&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Coded with FACS and assigned emotion-specified labels&lt;/strong> (happy, surprise, anger, disgust, fear, sadness)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Example&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-100" >&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%e6%88%aa%e5%b1%8f2021-02-19%2021.40.16.png" alt="截屏2021-02-19 21.40.16" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h4 id="feature-based-automatic-facial-action-analysis-afa-system">Feature-based Automatic Facial Action Analysis (AFA) System&lt;/h4>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-19%2021.41.28.png" alt="截屏2021-02-19 21.41.28" style="zoom:80%;" />
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Feature detection &amp;amp; feature location&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Region of the face and location of individual face features detected automatically in the initial frame using a neural-network-based approach&lt;/li>
&lt;li>Contours of face features and components adjusted manually in the initial frame&lt;/li>
&lt;li>Face features are then tracked automatically
&lt;ul>
&lt;li>&lt;strong>permanent features&lt;/strong> (e.g., brows, eyes, lips)&lt;/li>
&lt;li>&lt;strong>transient features&lt;/strong> (lines and furrows)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Feature extraction&lt;/strong>: Group facial features into separate collections of feature parameters&lt;/p>
&lt;ul>
&lt;li>15 normalized upper face parameters&lt;/li>
&lt;li>9 normalized lower face parameters&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Parameters fed to two neural-network-based classifiers&lt;/strong>&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h4 id="facial-feature-extraction">Facial Feature Extraction&lt;/h4>
&lt;p>Multistate Facial Component Models of a Frontal Face&lt;/p>
&lt;ul>
&lt;li>Permanent components/features
&lt;ul>
&lt;li>Lip&lt;/li>
&lt;li>Eye&lt;/li>
&lt;li>Brow&lt;/li>
&lt;li>Cheek&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Transient components/features
&lt;ul>
&lt;li>&lt;strong>Furrows&lt;/strong> and &lt;strong>wrinkles&lt;/strong> appear perpendicular to the direction of the motion of the activated muscles&lt;/li>
&lt;li>Classification
&lt;ul>
&lt;li>present (appear, deepen or lengthen)&lt;/li>
&lt;li>absent&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Detection
&lt;ul>
&lt;li>Canny edge detector&lt;/li>
&lt;li>Nasal root / crow’s-feet wrinkles&lt;/li>
&lt;li>Nasolabial furrows&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
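&lt;p>The transient-feature detection above uses a Canny edge detector. As a rough illustration (not the paper’s implementation), the numpy sketch below substitutes a simpler Sobel-plus-threshold edge test, omitting Canny’s smoothing, non-maximum suppression, and hysteresis:&lt;/p>

```python
import numpy as np

def sobel_edges(img, thresh=0.3):
    """Simplified stand-in for the Canny detector used for furrow detection:
    Sobel gradients plus a single magnitude threshold."""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    ky = kx.T
    pad = np.pad(img.astype(float), 1, mode="edge")
    gx = np.zeros(img.shape)
    gy = np.zeros(img.shape)
    h, w = img.shape
    for i in range(h):
        for j in range(w):
            win = pad[i:i + 3, j:j + 3]
            gx[i, j] = (win * kx).sum()
            gy[i, j] = (win * ky).sum()
    mag = np.hypot(gx, gy)
    return mag > thresh * mag.max()

# A synthetic "furrow": a dark line on a light background
img = np.ones((32, 32))
img[16, 4:28] = 0.0
edges = sobel_edges(img)
print(edges.sum() > 0)  # a present furrow produces edge responses
```

A furrow would then be classified as present if enough edge pixels appear in the expected region (e.g. the nasal root), and absent otherwise.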
&lt;h4 id="facial-feature-representation">Facial Feature Representation&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>Face coordinate system&lt;/p>
&lt;ul>
&lt;li>$x$-axis: the line through the inner corners of the eyes&lt;/li>
&lt;li>$y$-axis: perpendicular to the $x$-axis&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Group facial features&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>upper face&lt;/strong> features: 15 parameters&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Example&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-100" >&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%e6%88%aa%e5%b1%8f2021-02-19%2021.54.56.png" alt="截屏2021-02-19 21.54.56" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>lower face&lt;/strong> features: 9 parameters&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Example&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-19%2021.55.15.png" alt="截屏2021-02-19 21.55.15" style="zoom:67%;" />
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
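&lt;p>A minimal numpy sketch of this face coordinate system (the helper name and landmark values are illustrative): the origin is placed between the inner eye corners, the $x$-axis runs along the eye line, and all feature points are re-expressed in that frame, which normalizes away in-plane head rotation and translation.&lt;/p>

```python
import numpy as np

def to_face_frame(points, inner_left, inner_right):
    """Express landmark points in the face coordinate system:
    origin = midpoint of the inner eye corners, x-axis along the eye line,
    y-axis perpendicular to it."""
    origin = (inner_left + inner_right) / 2.0
    x_axis = inner_right - inner_left
    x_axis = x_axis / np.linalg.norm(x_axis)
    y_axis = np.array([-x_axis[1], x_axis[0]])  # perpendicular to x
    R = np.stack([x_axis, y_axis])              # rows = new basis vectors
    return (points - origin) @ R.T

left = np.array([40.0, 50.0])
right = np.array([80.0, 50.0])
landmarks = np.array([[60.0, 80.0]])  # e.g. a mouth-corner point
print(to_face_frame(landmarks, left, right))  # centered on x, 30 below eyes
```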
&lt;h4 id="au-recognition-by-neural-networks">AU Recognition by Neural Networks&lt;/h4>
&lt;ul>
&lt;li>Three layer neural networks (one hidden layer)&lt;/li>
&lt;li>Standard back-propagation method
&lt;ul>
&lt;li>Separate networks for upper- / lower face&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-19%2021.56.35.png" alt="截屏2021-02-19 21.56.35" style="zoom:80%;" />
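&lt;p>A numpy sketch of such a three-layer network (one hidden layer) trained with standard backpropagation. The layer sizes echo the 15 upper-face parameters and 7 outputs, but the data, hidden size, and learning rate are made up for illustration, not taken from the paper:&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in: 15 upper-face parameters in, one hidden layer, 7 sigmoid
# outputs (one score per AU); sizes and data are illustrative only.
n_in, n_hid, n_out = 15, 12, 7
W1 = rng.normal(0, 0.5, (n_in, n_hid))
W2 = rng.normal(0, 0.5, (n_hid, n_out))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = rng.normal(size=(64, n_in))      # fake feature-parameter vectors
Y = (X[:, :n_out] > 0) * 1.0         # fake multi-label AU targets

lr = 1.0
for _ in range(500):                 # standard full-batch backpropagation
    H = sigmoid(X @ W1)              # hidden activations
    P = sigmoid(H @ W2)              # per-AU output probabilities
    dP = (P - Y) / len(X)            # cross-entropy gradient w.r.t. logits
    dH = (dP @ W2.T) * H * (1 - H)   # backpropagate through the hidden layer
    W2 -= lr * H.T @ dP
    W1 -= lr * X.T @ dH

loss = -np.mean(Y * np.log(P + 1e-9) + (1 - Y) * np.log(1 - P + 1e-9))
print(loss)  # drops well below the ~0.69 chance-level cross-entropy
```

Separate networks of this shape would be trained for the upper and lower face, as in the figure above.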
&lt;h3 id="using-appearance-based-features--svm-2006">Using appearance-based features + SVM (2006)&lt;/h3>
&lt;p>&lt;strong>Automatic Recognition of Facial Actions in Spontaneous Expression&lt;/strong>&lt;sup id="fnref:2">&lt;a href="#fn:2" class="footnote-ref" role="doc-noteref">2&lt;/a>&lt;/sup>&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-100" >&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%e6%88%aa%e5%b1%8f2021-02-19%2022.40.45.png" alt="截屏2021-02-19 22.40.45" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h4 id="ru-facs-data-set">RU-FACS data set&lt;/h4>
&lt;ul>
&lt;li>Contains spontaneous expressions&lt;/li>
&lt;li>100 subjects&lt;/li>
&lt;/ul>
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-100" >&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%e6%88%aa%e5%b1%8f2021-02-19%2022.41.42.png" alt="截屏2021-02-19 22.41.42" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h3 id="using-deep-features-cnn--fusion-2013">Using Deep features (CNN) + fusion (2013)&lt;/h3>
&lt;h4 id="emotion-recognition-in-the-wild-challenge-emotiw">&lt;strong>Emotion Recognition in the Wild Challenge (EmotiW)&lt;/strong>&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>🎯 Goal: Move to more realistic, out-of-the-lab data&lt;/p>
&lt;/li>
&lt;li>
&lt;p>AFEW Dataset (Acted Facial Expressions in the Wild)&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Extracted from movies&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Annotated with six basic emotions&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Movie clips from 330 subjects, age range: 1-70&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Semi-automatic annotation pipeline&lt;/p>
&lt;ul>
&lt;li>Recommender system + manual annotation&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-100" >&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%e6%88%aa%e5%b1%8f2021-02-19%2022.45.20.png" alt="截屏2021-02-19 22.45.20" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h4 id="2013-winner">2013 Winner&lt;/h4>
&lt;p>&lt;strong>Combining Modality Specific Deep Neural Networks for Emotion Recognition in Video&lt;/strong>&lt;sup id="fnref:3">&lt;a href="#fn:3" class="footnote-ref" role="doc-noteref">3&lt;/a>&lt;/sup>&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-19%2022.47.35.png" alt="截屏2021-02-19 22.47.35" style="zoom:80%;" />
&lt;h5 id="convolutional-network">&lt;strong>Convolutional Network&lt;/strong>&lt;/h5>
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-100" >&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%e6%88%aa%e5%b1%8f2021-02-19%2022.49.28.png" alt="截屏2021-02-19 22.49.28" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;ul>
&lt;li>Inputs are images of size 40x40, cropped randomly&lt;/li>
&lt;li>Four layers: 3 convolutions followed by max or average pooling, and a fully-connected layer&lt;/li>
&lt;/ul>
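&lt;p>A PyTorch sketch of a network with this shape. Only the 40x40 input, the three conv-plus-pooling stages, and the fully-connected output follow the description; the channel counts, kernel sizes, and per-stage pooling choices are assumptions:&lt;/p>

```python
import torch
import torch.nn as nn

# Sketch of the EmotiW-style net: 3 conv stages with pooling, then a
# fully-connected layer producing 7 emotion scores.
net = nn.Sequential(
    nn.Conv2d(1, 16, 5), nn.ReLU(), nn.MaxPool2d(2),   # 40 -> 36 -> 18
    nn.Conv2d(16, 32, 5), nn.ReLU(), nn.MaxPool2d(2),  # 18 -> 14 -> 7
    nn.Conv2d(32, 48, 3), nn.ReLU(), nn.AvgPool2d(2),  # 7 -> 5 -> 2
    nn.Flatten(),
    nn.Linear(48 * 2 * 2, 7),                          # 7 emotion classes
)

x = torch.randn(1, 1, 40, 40)   # one randomly cropped 40x40 grayscale patch
out = net(x)
print(out.shape)  # one 7-dim score vector per frame
```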
&lt;h5 id="representing-video-sequence">Representing video sequence&lt;/h5>
&lt;ul>
&lt;li>CNN gives 7-dim output per frame&lt;/li>
&lt;li>Multiple frames are averaged into 10 vectors describing the sequence
&lt;ul>
&lt;li>For shorter sequences, frames / vectors get expanded (duplicated)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Results in 70-dim feature vector (10*7)&lt;/li>
&lt;li>Classification with SVM&lt;/li>
&lt;/ul>
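&lt;p>The sequence representation above can be sketched as follows (the function name and exact chunking/duplication details are illustrative):&lt;/p>

```python
import numpy as np

def sequence_descriptor(frame_probs, n_chunks=10):
    """Average per-frame 7-dim CNN outputs into n_chunks vectors and
    concatenate them into one fixed-length descriptor (10 * 7 = 70-dim)."""
    n = len(frame_probs)
    if n < n_chunks:  # shorter sequences: duplicate frames until long enough
        reps = int(np.ceil(n_chunks / n))
        frame_probs = np.repeat(frame_probs, reps, axis=0)
    chunks = np.array_split(frame_probs, n_chunks)
    return np.concatenate([c.mean(axis=0) for c in chunks])

probs = np.random.rand(37, 7)   # e.g. 37 frames, 7 class probabilities each
desc = sequence_descriptor(probs)
print(desc.shape)  # a 70-dim vector, ready for the SVM
```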
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-100" >&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%e6%88%aa%e5%b1%8f2021-02-19%2022.56.28.png" alt="截屏2021-02-19 22.56.28" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h5 id="other-features">Other Features&lt;/h5>
&lt;ul>
&lt;li>&amp;ldquo;Bag of Mouth&amp;rdquo;&lt;/li>
&lt;li>Audio-features&lt;/li>
&lt;/ul>
&lt;h4 id="typical-pipline">Typical Pipline&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>Face detection and alignment&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Extract various features and different representations&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Build multiple classifiers&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Fusion of results&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h4 id="other-applications">Other Applications&lt;/h4>
&lt;ul>
&lt;li>Pain Analysis&lt;/li>
&lt;li>Analysis of psychological disorders&lt;/li>
&lt;li>Workload / stress analysis&lt;/li>
&lt;li>Adaptive user interfaces&lt;/li>
&lt;li>Advertisement&lt;/li>
&lt;/ul>
&lt;div class="footnotes" role="doc-endnotes">
&lt;hr>
&lt;ol>
&lt;li id="fn:1">
&lt;p>Y. . -I. Tian, T. Kanade and J. F. Cohn, &amp;ldquo;Recognizing action units for facial expression analysis,&amp;rdquo; in IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 2, pp. 97-115, Feb. 2001, doi: 10.1109/34.908962.&amp;#160;&lt;a href="#fnref:1" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:2">
&lt;p>Littlewort, Gwen &amp;amp; Frank, Mark &amp;amp; Lainscsek, Claudia &amp;amp; Fasel, Ian &amp;amp; Movellan, Javier. (2006). Automatic Recognition of Facial Actions in Spontaneous Expressions. Journal of Multimedia. 1. 10.4304/jmm.1.6.22-35.&amp;#160;&lt;a href="#fnref:2" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:3">
&lt;p>Kahou, Samira Ebrahimi &amp;amp; Pal, Christopher &amp;amp; Bouthillier, Xavier &amp;amp; Froumenty, Pierre &amp;amp; Gulcehre, Caglar &amp;amp; Memisevic, Roland &amp;amp; Vincent, Pascal &amp;amp; Courville, Aaron &amp;amp; Bengio, Y. &amp;amp; Ferrari, Raul &amp;amp; Mirza, Mehdi &amp;amp; Jean, Sébastien &amp;amp; Carrier, Pierre-Luc &amp;amp; Dauphin, Yann &amp;amp; Boulanger-Lewandowski, Nicolas &amp;amp; Aggarwal, Abhishek &amp;amp; Zumer, Jeremie &amp;amp; Lamblin, Pascal &amp;amp; Raymond, Jean-Philippe &amp;amp; Wu, Zhenzhou. (2013). Combining modality specific deep neural networks for emotion recognition in video. ICMI 2013 - Proceedings of the 2013 ACM International Conference on Multimodal Interaction. 543-550. 10.1145/2522848.2531745.&amp;#160;&lt;a href="#fnref:3" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;/ol>
&lt;/div></description></item><item><title>Face</title><link>https://haobin-tan.netlify.app/docs/ai/computer-vision/face/</link><pubDate>Sat, 19 Dec 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/computer-vision/face/</guid><description/></item><item><title>Modern Face Recognition Overview</title><link>https://haobin-tan.netlify.app/docs/ai/computer-vision/face/modern-face-recognition/</link><pubDate>Sat, 19 Dec 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/computer-vision/face/modern-face-recognition/</guid><description>&lt;p>Face recognition is a series of several related problems:&lt;/p>
&lt;ol>
&lt;li>Face detection: Look at a picture and find all the faces in it&lt;/li>
&lt;li>Focus on each face and be able to understand that even if a face is turned in a weird direction or in bad lighting, it is still the same person.&lt;/li>
&lt;li>Be able to pick out unique features of the face that you can use to tell it apart from other people (like how big the eyes are, how long the face is, etc.)&lt;/li>
&lt;li>Compare the unique features of that face to all the people you already know to determine the person’s name.&lt;/li>
&lt;/ol>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/1*WxBM1lB5WzDjrDXYfi9gtw.gif" alt="Image for post">&lt;/p>
&lt;h2 id="step-1-face-detection">Step 1: Face detection&lt;/h2>
&lt;p>&lt;strong>Face detection = locate the faces in a photograph&lt;/strong>&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/1*izQuwClzcsJoCw5ybQC01Q.png" alt="Image for post">&lt;/p>
&lt;p>One of the methods for face detection is called &lt;strong>Histogram of Oriented Gradients (HOG)&lt;/strong>&lt;sup id="fnref:1">&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref">1&lt;/a>&lt;/sup> invented in 2005.&lt;/p>
&lt;p>To find faces in an image, we’ll start by making our image black and white because we don’t need color data to find faces:&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/1*osGdB2BNMThhk1rTwo07JA.jpeg" alt="Image for post" style="zoom:50%;" />
&lt;p>Then we’ll look at every single pixel in our image one at a time. For every single pixel, we want to &lt;strong>look at the pixels directly surrounding it&lt;/strong>:&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/1*RZS05e_5XXQdofdRx1GvPA.gif" alt="1*RZS05e_5XXQdofdRx1GvPA">&lt;/p>
&lt;p>Our goal is to figure out how dark the current pixel is compared to the pixels directly surrounding it. Then we want to draw an arrow showing in which direction the image is getting darker:&lt;/p>
&lt;figure>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/1*WF54tQnH1Hgpoqk-Vtf9Lg-20210204222605775.gif">&lt;figcaption>
&lt;h4>Looking at just this one pixel and the pixels around it. The image is getting darker towards the upper right.&lt;/h4>
&lt;/figcaption>
&lt;/figure>
&lt;p>If we repeat that process for every single pixel in the image, we will end up with every pixel being replaced by an arrow. These arrows are called &lt;strong>&lt;mark>gradients&lt;/mark>&lt;/strong> and they show the flow &lt;strong>from light to dark&lt;/strong> across the entire image:&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/1*oTdaElx_M-_z9c_iAwwqcw-20210204222934275.gif" alt="Image for post" style="zoom: 50%;" />
&lt;div class="flex px-4 py-3 mb-6 rounded-md bg-primary-100 dark:bg-primary-900">
&lt;span class="pr-3 pt-1 text-primary-600 dark:text-primary-300">
&lt;svg height="24" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24">&lt;path fill="none" stroke="currentColor" stroke-linecap="round" stroke-linejoin="round" stroke-width="1.5" d="m11.25 11.25l.041-.02a.75.75 0 0 1 1.063.852l-.708 2.836a.75.75 0 0 0 1.063.853l.041-.021M21 12a9 9 0 1 1-18 0a9 9 0 0 1 18 0m-9-3.75h.008v.008H12z"/>&lt;/svg>
&lt;/span>
&lt;span class="dark:text-neutral-300">&lt;p>Q: Why should we replace the pixels with gradients?&lt;/p>
&lt;p>A: If we analyze pixels directly, really dark images and really light images of the same person will have totally different pixel values. But by only considering the &lt;em>direction&lt;/em> in which brightness changes, both really dark images and really bright images will end up with exactly the &lt;em>same&lt;/em> representation. That makes the problem a lot easier to solve! &amp;#x1f44f;&lt;/p>
&lt;/span>
&lt;/div>
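&lt;p>This per-pixel step can be sketched with numpy: the arrow at each pixel is simply the direction of steepest &lt;em>decrease&lt;/em> in brightness, and, as the Q&amp;amp;A above notes, it is unchanged when the whole image gets darker:&lt;/p>

```python
import numpy as np

# A tiny image that gets darker from left to right
img = np.tile(np.linspace(1.0, 0.0, 8), (8, 1))

gy, gx = np.gradient(img)                   # brightness change along rows, cols
darker = np.degrees(np.arctan2(-gy, -gx))   # arrow toward decreasing brightness
print(darker[4, 4])  # ~0 degrees: the image gets darker toward the right

gy2, gx2 = np.gradient(0.5 * img)           # a uniformly darker copy
print(np.allclose(np.degrees(np.arctan2(-gy2, -gx2)), darker))  # same arrows
```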
&lt;p>But saving the gradient for every single pixel gives us way too much detail. It would be better if we could just see the basic flow of lightness/darkness at a higher level so we could see the basic pattern of the image. To do this:&lt;/p>
&lt;ol>
&lt;li>Break up the image into small squares of 16x16 pixels each&lt;/li>
&lt;li>In each square, count up how many gradients point in each major direction (how many point up, point up-right, point right, etc…).&lt;/li>
&lt;li>Replace that square in the image with the arrow directions that were the &lt;strong>strongest&lt;/strong>.&lt;/li>
&lt;/ol>
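&lt;p>A numpy sketch of those three steps (the 16x16 cell size comes from the text; the bin count and function name are illustrative, and real HOG keeps the full histogram rather than just the strongest direction):&lt;/p>

```python
import numpy as np

def dominant_directions(img, cell=16, n_bins=8):
    """Split the image into cell x cell squares, histogram the gradient
    directions in each square (weighted by strength), and keep the
    strongest direction per square."""
    gy, gx = np.gradient(img.astype(float))
    ang = np.arctan2(gy, gx)                # gradient direction per pixel
    mag = np.hypot(gx, gy)                  # gradient strength per pixel
    bins = ((ang + np.pi) / (2 * np.pi) * n_bins).astype(int) % n_bins
    h, w = img.shape
    out = np.zeros((h // cell, w // cell), dtype=int)
    for i in range(h // cell):
        for j in range(w // cell):
            b = bins[i*cell:(i+1)*cell, j*cell:(j+1)*cell].ravel()
            m = mag[i*cell:(i+1)*cell, j*cell:(j+1)*cell].ravel()
            hist = np.bincount(b, weights=m, minlength=n_bins)
            out[i, j] = hist.argmax()       # strongest direction wins
    return out

img = np.random.rand(64, 64)
print(dominant_directions(img).shape)  # one dominant arrow per 16x16 square
```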
&lt;p>The end result is we turn the original image into a very simple representation that captures the basic structure of a face in a simple way:&lt;/p>
&lt;figure>&lt;img src="https://miro.medium.com/max/700/1*uHisafuUw0FOsoZA992Jdg.gif">&lt;figcaption>
&lt;h4>The original image is turned into a HOG representation that captures the major features of the image regardless of image brightness.&lt;/h4>
&lt;/figcaption>
&lt;/figure>
&lt;p>To find faces in this HOG image, all we have to do is &lt;strong>find the part of our image that looks the most similar to a known HOG pattern&lt;/strong> that was extracted from a bunch of other training faces:&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/1*6xgev0r-qn4oR88FrW6fiA-20210204223929488.png" alt="1*6xgev0r-qn4oR88FrW6fiA" style="zoom:67%;" />
&lt;h2 id="step-2-posing-and-projecting-faces">Step 2: Posing and Projecting Faces&lt;/h2>
&lt;p>After isolating the faces in our image, we have to deal with the problem that faces turned different directions look totally different to a computer:&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/1*x-rg0aSpKOer1JF-TejYUg.png" alt="1*x-rg0aSpKOer1JF-TejYUg">&lt;/p>
&lt;p>To account for this, we will try to warp each picture so that &lt;strong>the eyes and lips are always in the same place in the image&lt;/strong>. This will make it a lot easier for us to compare faces in the next steps.&lt;/p>
&lt;p>To do this, we are going to use an algorithm called &lt;strong>face landmark estimation&lt;/strong> &lt;sup id="fnref:2">&lt;a href="#fn:2" class="footnote-ref" role="doc-noteref">2&lt;/a>&lt;/sup>. The basic idea is we will come up with 68 specific points (called &lt;em>landmarks&lt;/em>) that exist on every face — the top of the chin, the outside edge of each eye, the inner edge of each eyebrow, etc. Then we will train a machine learning algorithm to be able to find these 68 specific points on any face:&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/1*AbEg31EgkbXSQehuNJBlWg.png" alt="1*AbEg31EgkbXSQehuNJBlWg">&lt;/p>
&lt;p>Result of locating the 68 face landmarks on our test image:&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/1*xBJ4H2lbCMfzIfMrOm9BEQ-20210204224628597.jpeg" alt="1*xBJ4H2lbCMfzIfMrOm9BEQ" style="zoom:50%;" />
&lt;p>Now that we know where the eyes and mouth are, we&amp;rsquo;ll simply rotate, scale, and shear the image so that the eyes and mouth are centered as well as possible. We are only going to use basic image transformations like rotation and scaling that preserve parallel lines (called &lt;a href="https://en.wikipedia.org/wiki/Affine_transformation">affine transformations&lt;/a>):&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/1*igEzGcFn-tjZb94j15tCNA.png" alt="Image for post">&lt;/p>
&lt;p>Now, no matter how the face is turned, we are able to center the eyes and mouth in roughly the same position in the image. This will make our next step a lot more accurate. &amp;#x1f44f;&lt;/p>
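&lt;p>A numpy sketch of the warping step: fit the affine transform (by least squares) that maps detected landmarks onto canonical template positions. The three landmark coordinates below are made up for illustration; a real system would use all 68 landmarks:&lt;/p>

```python
import numpy as np

def fit_affine(src, dst):
    """Least-squares affine transform (rotation/scale/shear + translation)
    mapping landmark points src onto template points dst."""
    A = np.hstack([src, np.ones((len(src), 1))])   # rows of [x, y, 1]
    M, _, _, _ = np.linalg.lstsq(A, dst, rcond=None)
    return M  # apply with: np.hstack([pts, ones]) @ M

# Hypothetical landmarks (left eye, right eye, mouth center) on a tilted
# face, and the canonical positions we want them warped to.
src = np.array([[30.0, 40.0], [70.0, 30.0], [55.0, 75.0]])
dst = np.array([[30.0, 35.0], [70.0, 35.0], [50.0, 70.0]])

M = fit_affine(src, dst)
warped = np.hstack([src, np.ones((3, 1))]) @ M
print(np.allclose(warped, dst))  # the three landmarks line up exactly
```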
&lt;h2 id="step-3-encoding-faces">Step 3: Encoding Faces&lt;/h2>
&lt;p>The simplest approach to face recognition is to directly compare the unknown face we found in Step 2 with all the pictures we have of people that have already been tagged. When we find a previously tagged face that looks very similar to our unknown face, it must be the same person.&lt;/p>
&lt;p>What we need is a way to &lt;strong>extract a few basic measurements from each face&lt;/strong>. Then we could measure our unknown face the same way and find the known face with the closest measurements.&lt;/p>
&lt;h3 id="how-to-measure-a-face">How to measure a face?&lt;/h3>
&lt;p>The solution is to train a deep convolutional neural network which can generate 128 measurements (a.k.a. &lt;strong>Embedding&lt;/strong>) for each face &lt;sup id="fnref:3">&lt;a href="#fn:3" class="footnote-ref" role="doc-noteref">3&lt;/a>&lt;/sup>.&lt;/p>
&lt;p>The training process works by looking at 3 face images at a time:&lt;/p>
&lt;ol>
&lt;li>Load a training face image of a known person&lt;/li>
&lt;li>Load another picture of the &lt;strong>same&lt;/strong> known person&lt;/li>
&lt;li>Load a picture of a totally &lt;strong>different&lt;/strong> person&lt;/li>
&lt;/ol>
&lt;p>&lt;strong>Then the algorithm looks at the measurements it is currently generating for each of those three images. It then tweaks the neural network slightly so that it makes sure the measurements it generates for #1 and #2 are slightly closer while making sure the measurements for #2 and #3 are slightly further apart.&lt;/strong>&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/1*n1R8VMyDRw3RNO3JULYBpQ.png" alt="1*n1R8VMyDRw3RNO3JULYBpQ">&lt;/p>
&lt;p>After repeating this step millions of times for millions of images of thousands of different people, the neural network learns to reliably generate 128 measurements for each person. Any ten different pictures of the same person should give roughly the same measurements.&lt;/p>
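&lt;p>This tweak is exactly the triplet loss from the FaceNet paper cited above. A numpy sketch of the quantity being minimized (the margin value here is illustrative):&lt;/p>

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Pull the two pictures of the same person together, push the
    different person away by at least the margin."""
    d_pos = np.sum((anchor - positive) ** 2)   # same-person distance
    d_neg = np.sum((anchor - negative) ** 2)   # different-person distance
    return max(d_pos - d_neg + margin, 0.0)

rng = np.random.default_rng(1)
a = rng.normal(size=128)              # 128 measurements: person A, picture 1
p = a + 0.05 * rng.normal(size=128)   # person A, picture 2 (close by)
n = rng.normal(size=128)              # a different person (far away)

print(triplet_loss(a, p, n))  # 0.0: this triplet already satisfies the margin
```

During training, the network weights are nudged down the gradient of this loss, which is what makes measurements for #1 and #2 move closer and the different person move further apart.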
&lt;h3 id="encoding-face-image">Encoding face image&lt;/h3>
&lt;p>Once the network has been trained, it can generate measurements for any face, even ones it has never seen before. All we need to do ourselves is run our face images through their pre-trained network to get the 128 measurements for each face. Here are the measurements for our test image:&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/1*6kMMqLt4UBCrN7HtqNHMKw-20210204225909575.png" alt="Image for post">&lt;/p>
&lt;p>We don&amp;rsquo;t need to care which parts of the face these 128 numbers are measuring exactly. All we care about is that the network generates nearly the same numbers when looking at two different pictures of the same person.&lt;/p>
&lt;h2 id="step-4-finding-the-persons-name-from-the-encoding">Step 4: Finding the person’s name from the encoding&lt;/h2>
&lt;p>This last step is actually the easiest step in the whole process. All we have to do is find the person in our database of known people who has the &lt;em>closest&lt;/em> measurements to our test image.&lt;/p>
&lt;p>We can do that by using any basic machine learning classification algorithm (e.g. SVM). All we need to do is train a classifier that can take in the measurements from a new test image and tells which known person is the closest match.&lt;/p>
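&lt;p>A scikit-learn sketch of this last step. The embeddings below are synthetic stand-ins (clusters around random centers); in reality each 128-dim vector would come from the pre-trained encoding network:&lt;/p>

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Stand-in embeddings: ~20 pictures each of 3 known people
centers = rng.normal(size=(3, 128))
X = np.vstack([c + 0.1 * rng.normal(size=(20, 128)) for c in centers])
y = np.repeat(["Will Ferrell", "Chad Smith", "Jimmy Fallon"], 20)

clf = SVC(kernel="linear").fit(X, y)

# A new picture of the second person: encode it, then ask the classifier
test_embedding = centers[1] + 0.1 * rng.normal(size=128)
print(clf.predict([test_embedding])[0])
```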
&lt;h3 id="example">Example&lt;/h3>
&lt;p>Train a classifier with the embeddings of about 20 pictures each of Will Ferrell, Chad Smith, and Jimmy Fallon:&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/1*G6jxtXUxDYGY_orEPNzG9Q.jpeg" alt="1*G6jxtXUxDYGY_orEPNzG9Q">&lt;/p>
&lt;p>Then run the classifier on every frame of the famous youtube video of &lt;a href="https://www.youtube.com/watch?v=EsWHyBOk2iQ">Will Ferrell and Chad Smith pretending to be each other&lt;/a> on the Jimmy Fallon show:&lt;/p>
&lt;img src="https://miro.medium.com/max/800/1*woPojJbd6lT7CFZ9lHRVDw.gif" alt="Image for post" style="zoom:67%;" />
&lt;h2 id="i-classfab-fa-githubi-open-source-face-recognition-library">&lt;i class="fab fa-github">&lt;/i> Open Source Face Recognition library&lt;/h2>
&lt;ul>
&lt;li>&lt;strong>&lt;a href="https://github.com/ageitgey/face_recognition">face_recognition&lt;/a>&lt;/strong>: Recognize and manipulate faces from Python or from the command line with the world&amp;rsquo;s simplest face recognition library.&lt;/li>
&lt;li>&lt;strong>&lt;a href="https://github.com/timesler/facenet-pytorch">facenet-pytorch&lt;/a>&lt;/strong>: Face Recognition Using Pytorch&lt;/li>
&lt;/ul>
&lt;h2 id="reference">Reference&lt;/h2>
&lt;ul>
&lt;li>&lt;a href="https://medium.com/@ageitgey/machine-learning-is-fun-part-4-modern-face-recognition-with-deep-learning-c3cffc121d78">Machine Learning is Fun! Part 4: Modern Face Recognition with Deep Learning&lt;/a>&lt;/li>
&lt;/ul>
&lt;div class="footnotes" role="doc-endnotes">
&lt;hr>
&lt;ol>
&lt;li id="fn:1">
&lt;p>&lt;a href="http://lear.inrialpes.fr/people/triggs/pubs/Dalal-cvpr05.pdf">Histograms of Oriented Gradients for Human Detection&lt;/a>&amp;#160;&lt;a href="#fnref:1" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:2">
&lt;p>&lt;a href="http://www.csc.kth.se/~vahidk/papers/KazemiCVPR14.pdf">One Millisecond Face Alignment with an Ensemble of Regression Trees&lt;/a>&amp;#160;&lt;a href="#fnref:2" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:3">
&lt;p>&lt;a href="https://www.cv-foundation.org/openaccess/content_cvpr_2015/app/1A_089.pdf">FaceNet: A Unified Embedding for Face Recognition and Clustering&lt;/a>&amp;#160;&lt;a href="#fnref:3" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;/ol>
&lt;/div></description></item><item><title>Eigenface</title><link>https://haobin-tan.netlify.app/docs/ai/computer-vision/face/eigenface/</link><pubDate>Sat, 19 Dec 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/computer-vision/face/eigenface/</guid><description>&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-07%2016.22.02-20210209112428930.png" alt="截屏2021-02-07 16.22.02">&lt;/p>
&lt;h2 id="google-colab-notebook">Google Colab Notebook&lt;/h2>
&lt;p>&lt;a href="https://colab.research.google.com/drive/1ikyS3qAz1hehQyAKXFcquUthboo75Gai?usp=sharing">Open in Google Colab&lt;/a>&lt;/p></description></item></channel></rss>