<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Lecture | Haobin Tan</title><link>https://haobin-tan.netlify.app/tags/lecture/</link><atom:link href="https://haobin-tan.netlify.app/tags/lecture/index.xml" rel="self" type="application/rss+xml"/><description>Lecture</description><generator>Hugo Blox Builder (https://hugoblox.com)</generator><language>en-us</language><lastBuildDate>Tue, 20 Jul 2021 00:00:00 +0000</lastBuildDate><image><url>https://haobin-tan.netlify.app/media/icon_hu7d15bc7db65c8eaf7a4f66f5447d0b42_15095_512x512_fill_lanczos_center_3.png</url><title>Lecture</title><link>https://haobin-tan.netlify.app/tags/lecture/</link></image><item><title>Computer Vision for Human-Computer Interaction</title><link>https://haobin-tan.netlify.app/docs/ai/computer-vision/cv-lecture/</link><pubDate>Fri, 06 Nov 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/computer-vision/cv-lecture/</guid><description>&lt;style type="text/css">
.tg {border-collapse:collapse;border-spacing:0;}
.tg td{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
overflow:hidden;padding:10px 5px;word-break:normal;}
.tg th{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
font-weight:normal;overflow:hidden;padding:10px 5px;word-break:normal;}
.tg .tg-fymr{border-color:inherit;font-weight:bold;text-align:left;vertical-align:top}
.tg .tg-0pky{border-color:inherit;text-align:left;vertical-align:top}
&lt;/style>
&lt;table class="tg">
&lt;thead>
&lt;tr>
&lt;th class="tg-fymr">Name&lt;/th>
&lt;th class="tg-0pky">Computer Vision for Human-Computer Interaction&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td class="tg-fymr">Semester&lt;/td>
&lt;td class="tg-0pky">WS 20/21&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td class="tg-fymr">Language&lt;/td>
&lt;td class="tg-0pky">English, German&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td class="tg-fymr">Lecturer(s)&lt;/td>
&lt;td class="tg-0pky">&lt;a href="https://cvhci.anthropomatik.kit.edu/people_596.php" target="_blank" rel="noopener noreferrer">Prof. Dr.-Ing. Rainer Stiefelhagen&lt;/a>&lt;br>&lt;a href="https://cvhci.anthropomatik.kit.edu/people_713.php" target="_blank" rel="noopener noreferrer">Dr.-Ing. Muhammad Saquib Sarfraz&lt;/a>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td class="tg-fymr">Credits&lt;/td>
&lt;td class="tg-0pky">6&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td class="tg-fymr">Homepages&lt;/td>
&lt;td class="tg-0pky">&lt;a href="https://cvhci.anthropomatik.kit.edu/600_1979.php">&lt;span style="color:#905">https://cvhci.anthropomatik.kit.edu/600_1979.php&lt;/span>&lt;/a>&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table></description></item><item><title>Pattern Recognition</title><link>https://haobin-tan.netlify.app/docs/ai/computer-vision/cv-lecture/02-pattern-recognition/</link><pubDate>Fri, 06 Nov 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/computer-vision/cv-lecture/02-pattern-recognition/</guid><description>&lt;h2 id="why-pattern-recognition-and-what-is-it">Why pattern recognition and what is it?&lt;/h2>
&lt;h3 id="what-is-machine-learning">What is machine learning?&lt;/h3>
&lt;ul>
&lt;li>Motivation: &lt;span style="color:red">Some problems are very hard to solve by writing a computer program by hand&lt;/span>&lt;/li>
&lt;li>Learn common patterns based on either
&lt;ul>
&lt;li>
&lt;p>a priori knowledge or&lt;/p>
&lt;/li>
&lt;li>
&lt;p>statistical information&lt;/p>
&lt;ul>
&lt;li>Important for the adaptability to different tasks/domains&lt;/li>
&lt;li>Try to mimic human learning / better understand human learning&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Machine learning is concerned with developing generic algorithms that are able to solve problems by &lt;strong>learning from example data&lt;/strong>&lt;/li>
&lt;/ul>
&lt;h2 id="classifiers">Classifiers&lt;/h2>
&lt;ul>
&lt;li>
&lt;p>Given an input pattern $\mathbf{x}$, assign it to a class $\omega\_i$&lt;/p>
&lt;ul>
&lt;li>&lt;em>Example: Given an image, assign label “face” or “non-face”&lt;/em>&lt;/li>
&lt;li>$\mathbf{x}$: can be an image, a video, or (more commonly) any feature vector that can be extracted from them&lt;/li>
&lt;li>$\omega\_i$: desired (discrete) class label
&lt;ul>
&lt;li>If the “class label” is a real number or a vector &amp;ndash;&amp;gt; Regression task&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>ML: Use example patterns with given class labels to automatically learn&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-11-07%2012.50.02.png" alt="截屏2020-11-07 12.50.02" style="zoom:80%;" />
&lt;/li>
&lt;li>
&lt;p>Example&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2020-11-07%2012.50.58.png" alt="截屏2020-11-07 12.50.58">&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Classification process&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-11-07%2012.52.02.png" alt="截屏2020-11-07 12.52.02" style="zoom:80%;" />
&lt;/li>
&lt;/ul>
&lt;h3 id="bayes-classification">Bayes Classification&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>Given a feature vector $\mathbf{x}$, want to know which class $\omega\_i$ is most likely&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Use Bayes’ rule: Decide for the class $\omega\_i$ with maximum posterior probability&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2020-11-07%2012.53.45.png" alt="截屏2020-11-07 12.53.45">&lt;/p>
&lt;/li>
&lt;li>
&lt;p>🔴 Problem: $p(x|\omega\_i)$ (and to a lesser degree $P(\omega\_i)$) is usually &lt;span style="color:red">unknown and often hard to estimate from data&lt;/span>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Priors&lt;/strong> describe what we know about the classes &lt;em>before&lt;/em> observing anything&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Can be used to model prior knowledge&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Sometimes easy to estimate (counting)&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Example&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-11-07%2012.56.58.png" alt="截屏2020-11-07 12.56.58" style="zoom:80%;" />
&lt;/li>
&lt;/ul>
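&lt;p>The decision rule above can be sketched in a few lines; this is a minimal 1-D sketch in which the Gaussian class-conditionals and the priors are made-up toy numbers, not values from the lecture:&lt;/p>

```python
import math

def gaussian_pdf(x, mu, sigma):
    """1-D Gaussian density."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def bayes_classify(x, classes):
    """Decide for the class with maximum posterior P(w_i|x).
    The evidence p(x) is the same for all classes, so comparing
    the numerators p(x|w_i) * P(w_i) is sufficient."""
    return max(classes, key=lambda c: gaussian_pdf(x, c["mu"], c["sigma"]) * c["prior"])

# toy "face" vs. "non-face" model (hand-picked parameters, for illustration only)
classes = [
    {"name": "face", "mu": 2.0, "sigma": 1.0, "prior": 0.3},
    {"name": "non-face", "mu": -1.0, "sigma": 1.0, "prior": 0.7},
]
print(bayes_classify(2.5, classes)["name"])  # face
```

&lt;p>Note how the prior enters the decision: with a small $P(\omega\_i)$, a pattern must lie clearly nearer that class mean before the posterior wins.&lt;/p>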
&lt;h3 id="gaussian-mixture-models">Gaussian Mixture Models&lt;/h3>
&lt;h4 id="gaussian-classification">Gaussian classification&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>Assumption:
&lt;/p>
$$
\mathrm{p}\left(\mathbf{x} | \omega_{\mathrm{i}}\right) \sim \mathrm{N}(\boldsymbol{\mu}, \mathbf{\Sigma})= \frac{1}{(2 \pi)^{d / 2}|\Sigma|^{1 / 2}} \exp \left[-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{\top} \boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right]
$$
&lt;ul>
&lt;li>👆 This makes estimation easier
&lt;ul>
&lt;li>Only $\boldsymbol{\mu}, \mathbf{\Sigma}$ need to be estimated&lt;/li>
&lt;li>To reduce parameters, the covariance matrix can be restricted
&lt;ul>
&lt;li>
&lt;p>Diagonal matrix &amp;ndash;&amp;gt; Dimensions uncorrelated&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Multiple of unit matrix &amp;ndash;&amp;gt; Dimensions uncorrelated with same variance&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>🔴 Problem: if the assumption(s) do not hold, the model does not represent reality well &amp;#x1f622;&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Estimation of $\boldsymbol{\mu}, \mathbf{\Sigma}$ with Maximum (Log-)Likelihood&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Use the parameters that best explain the data (highest likelihood):
&lt;/p>
$$
\begin{aligned}
\operatorname{Lik}(\boldsymbol{\mu}, \mathbf{\Sigma}) &amp;= p(\text{data}|\boldsymbol{\mu}, \mathbf{\Sigma}) \\\\
&amp;= p(\mathbf{x}\_0, \mathbf{x}\_1, \dots, \mathbf{x}\_n|\boldsymbol{\mu}, \mathbf{\Sigma}) \\\\
&amp;= p\left(\mathbf{x}\_{0} | \boldsymbol{\mu}, \mathbf{\Sigma}\right) \cdot p\left(\mathbf{x}\_{1} | \boldsymbol{\mu}, \mathbf{\Sigma}\right) \cdot \ldots \cdot p\left(\mathbf{x}\_{\mathrm{n}} | \boldsymbol{\mu}, \mathbf{\Sigma}\right)
\end{aligned}
$$
$$
\operatorname{LogLik}(\boldsymbol{\mu}, \mathbf{\Sigma}) = \log(\operatorname{Lik}(\boldsymbol{\mu}, \mathbf{\Sigma})) = \sum\_{i=0}^n \log p(\mathbf{x}\_i | \boldsymbol{\mu}, \mathbf{\Sigma})
$$
&lt;p>&amp;ndash;&amp;gt; Maximize $\log(\operatorname{Lik}(\boldsymbol{\mu}, \mathbf{\Sigma}))$ over $\boldsymbol{\mu}, \mathbf{\Sigma}$&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
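&lt;p>For a single Gaussian, the maximum of the log-likelihood has a closed form: $\boldsymbol{\mu}$ is the sample mean and $\mathbf{\Sigma}$ the sample covariance (with normalization $1/n$, not $1/(n-1)$). A 1-D sketch:&lt;/p>

```python
def gaussian_mle(samples):
    """Closed-form maximum-likelihood estimates of a 1-D Gaussian."""
    n = len(samples)
    mu = sum(samples) / n
    # the ML estimate divides by n (not n-1); it maximizes the log-likelihood
    var = sum((x - mu) ** 2 for x in samples) / n
    return mu, var

mu, var = gaussian_mle([1.0, 2.0, 3.0, 4.0])
print(mu, var)  # 2.5 1.25
```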
&lt;h4 id="gaussian-mixture-models-gmms">Gaussian Mixture Models (GMMs)&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>Approximate true density function using a &lt;strong>weighted sum&lt;/strong> of several Gaussians
&lt;/p>
$$
\mathrm{p}(\mathbf{x})=\sum\_{i} \mathrm{w}\_{i} \frac{1}{(2 \pi)^{\mathrm{d}/2}|\mathbf{\Sigma}\_i|^{1 / 2}} \exp \left[-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}\_i)^{\top} \boldsymbol{\Sigma}\_i^{-1}(\mathbf{x}-\boldsymbol{\mu}\_i)\right] \qquad \text{with } \sum\_i w\_i = 1
$$
&lt;/li>
&lt;li>
&lt;p>Any density can be approximated this way with arbitrary precision&lt;/p>
&lt;ul>
&lt;li>But might need many Gaussians&lt;/li>
&lt;li>Difficult to estimate many parameters&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Use &lt;strong>Expectation Maximization (EM) Algorithm&lt;/strong> to estimate parameters of the Gaussians as well as the weights&lt;/p>
&lt;ol>
&lt;li>
&lt;p>Initialize parameters of GMM randomly&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Repeat until convergence&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Expectation (E)&lt;/strong> step:&lt;/p>
&lt;p>Compute the probability $p\_{ij}$ that data point $i$ belongs to Gaussian $j$&lt;/p>
&lt;ul>
&lt;li>Take the value of each Gaussian at point $i$ and normalize so they sum up to one&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Maximization (M)&lt;/strong> step:&lt;/p>
&lt;p>Compute new GMM parameters using soft assignments $p\_{ij}$&lt;/p>
&lt;ul>
&lt;li>Maximum Likelihood with data weighted according to $p\_{ij}$&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol>
&lt;/li>
&lt;/ul>
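&lt;p>The two steps above can be sketched for a 1-D mixture; this minimal version uses a fixed number of iterations instead of a proper convergence test, and a crude deterministic initialization instead of a random one:&lt;/p>

```python
import math

def em_gmm_1d(data, k=2, iters=50):
    """EM for a 1-D Gaussian mixture: alternate soft assignments (E step)
    and weighted maximum-likelihood updates (M step)."""
    # crude initialization: spread the means over the data range, unit variance
    lo, hi = min(data), max(data)
    mus = [lo + (i + 0.5) * (hi - lo) / k for i in range(k)]
    variances = [1.0] * k
    weights = [1.0 / k] * k
    for _ in range(iters):
        # E step: p_ij = responsibility of Gaussian j for data point i
        resp = []
        for x in data:
            vals = [w * math.exp(-0.5 * (x - m) ** 2 / v) / math.sqrt(2 * math.pi * v)
                    for w, m, v in zip(weights, mus, variances)]
            s = sum(vals)
            resp.append([v / s for v in vals])
        # M step: weighted ML estimates using the soft counts
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(data)
            mus[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            variances[j] = sum(r[j] * (x - mus[j]) ** 2 for r, x in zip(resp, data)) / nj
    return weights, mus, variances

data = [0.1, -0.2, 0.0, 5.1, 4.9, 5.0]
w, m, v = em_gmm_1d(data)
print(sorted(m))  # two means, one near 0 and one near 5
```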
&lt;div class="flex px-4 py-3 mb-6 rounded-md bg-primary-100 dark:bg-primary-900">
&lt;span class="pr-3 pt-1 text-primary-600 dark:text-primary-300">
&lt;svg height="24" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24">&lt;path fill="none" stroke="currentColor" stroke-linecap="round" stroke-linejoin="round" stroke-width="1.5" d="m11.25 11.25l.041-.02a.75.75 0 0 1 1.063.852l-.708 2.836a.75.75 0 0 0 1.063.853l.041-.021M21 12a9 9 0 1 1-18 0a9 9 0 0 1 18 0m-9-3.75h.008v.008H12z"/>&lt;/svg>
&lt;/span>
&lt;span class="dark:text-neutral-300">&lt;h4 id="parametric-vs-non-parametric">parametric vs. non-parametric&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Parametric&lt;/strong> classifiers&lt;/p>
&lt;ul>
&lt;li>assume a specific form of probability distribution with some parameters&lt;/li>
&lt;li>only the parameters need to be estimated&lt;/li>
&lt;li>👍 Advantage: Need less training data because fewer parameters have to be estimated&lt;/li>
&lt;li>👎 Disadvantage: Only work well if the model fits the data&lt;/li>
&lt;li>Examples: Gaussian and GMMs&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Non-parametric&lt;/strong> classifiers&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Do NOT assume a specific form of probability distribution&lt;/p>
&lt;/li>
&lt;li>
&lt;p>👍 Advantage: Work well for all types of distributions&lt;/p>
&lt;/li>
&lt;li>
&lt;p>👎 Disadvantage: Need more data to correctly estimate the distribution&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Examples: Parzen windows, k-nearest neighbors&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/span>
&lt;/div>
&lt;div class="flex px-4 py-3 mb-6 rounded-md bg-primary-100 dark:bg-primary-900">
&lt;span class="pr-3 pt-1 text-primary-600 dark:text-primary-300">
&lt;svg height="24" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24">&lt;path fill="none" stroke="currentColor" stroke-linecap="round" stroke-linejoin="round" stroke-width="1.5" d="m11.25 11.25l.041-.02a.75.75 0 0 1 1.063.852l-.708 2.836a.75.75 0 0 0 1.063.853l.041-.021M21 12a9 9 0 1 1-18 0a9 9 0 0 1 18 0m-9-3.75h.008v.008H12z"/>&lt;/svg>
&lt;/span>
&lt;span class="dark:text-neutral-300">&lt;h4 id="generative-vs-discriminative">generative vs. discriminative&lt;/h4>
&lt;ul>
&lt;li>A method that models $P(\omega\_i)$ and $p(\mathbf{x}|\omega\_i)$ &lt;em>explicitly&lt;/em> is called a &lt;strong>generative&lt;/strong> model
&lt;ul>
&lt;li>$p(\mathbf{x}|\omega\_i)$ allows to generate new samples of class $\omega\_i$&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>The other common approach uses &lt;strong>discriminative&lt;/strong> models
&lt;ul>
&lt;li>directly model $p(\omega\_i|\mathbf{x})$ or just output a decision $\omega\_i$ given an input pattern $\mathbf{x}$&lt;/li>
&lt;li>easier to train because they solve a simpler problem &amp;#x1f44f;&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/span>
&lt;/div>
&lt;h3 id="linear-discriminant-functions">Linear Discriminant Functions&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>Separate two classes $\omega\_1, \omega\_2$ with a linear hyperplane
&lt;/p>
$$
y(x)=w^{T} x+w_{0}
$$
&lt;p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-11-07%2016.45.44.png" alt="截屏2020-11-07 16.45.44" style="zoom:80%;" />&lt;/p>
&lt;ul>
&lt;li>Decide $\omega\_1$ if $y(x) > 0$ else $\omega\_2$&lt;/li>
&lt;li>$w$: normal vector of the hyperplane&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Example:&lt;/p>
&lt;ul>
&lt;li>Perceptron (see: &lt;a href="https://haobin-tan.netlify.app/docs/ai/deep-learning/nn-basics/perceptron/">Perceptron&lt;/a>)&lt;/li>
&lt;li>Linear SVM&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
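&lt;p>The decision rule $y(x)=w^{T} x+w\_{0}$ in code; the hyperplane parameters here are hypothetical, as if produced by a perceptron or a linear SVM:&lt;/p>

```python
def linear_decision(x, w, w0):
    """Linear discriminant y(x) = w^T x + w0; decide omega_1 if y(x) > 0."""
    y = sum(wi * xi for wi, xi in zip(w, x)) + w0
    return "omega_1" if y > 0 else "omega_2"

# hypothetical hyperplane: normal vector w = (1, 1), offset w0 = -1
print(linear_decision((2.0, 0.5), (1.0, 1.0), -1.0))  # omega_1, since y = 1.5
```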
&lt;h3 id="support-vector-machines">Support Vector Machines&lt;/h3>
&lt;p>See: &lt;a href="https://haobin-tan.netlify.app/docs/ai/machine-learning/classification/svm/support-vector-machine/">SVM&lt;/a>&lt;/p>
&lt;h4 id="linear-svms">Linear SVMs&lt;/h4>
&lt;ul>
&lt;li>If the input space is already high-dimensional, linear SVMs can often perform well too&lt;/li>
&lt;li>👍 Advantages:
&lt;ul>
&lt;li>Speed: Only one scalar product for classification&lt;/li>
&lt;li>Memory: Only one vector $w$ needs to be stored&lt;/li>
&lt;li>Training: Training is much faster&lt;/li>
&lt;li>Model selection: Only one parameter to optimize&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="k-nearest-neighbours">K-nearest Neighbours&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>💡 Look at the $k$ closest training samples and assign the most frequent label among them&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Model consists of all training samples&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Pro: No information is lost&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Con: A lot of data to manage&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Naïve implementation: compute distance to each training sample every time&lt;/p>
&lt;ul>
&lt;li>Distance metric is needed (Important design parameter!)
&lt;ul>
&lt;li>$L\_1$, $L\_2$, $L\_{\infty}$, Mahalanobis, &amp;hellip; or&lt;/li>
&lt;li>Problem-specific distances&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>kNN is often a good classifier, but:&lt;/p>
&lt;ul>
&lt;li>Needs enough data&lt;/li>
&lt;li>Scalability issues&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
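&lt;p>The naïve implementation described above, with squared Euclidean ($L\_2$) distance as the metric and toy training samples:&lt;/p>

```python
from collections import Counter

def knn_classify(x, train, k=3):
    """Naive k-NN: compute the distance to every training sample each time,
    then vote among the k closest labels. Metric: squared L2 distance."""
    def dist2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    nearest = sorted(train, key=lambda s: dist2(x, s[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

train = [((0.0, 0.0), "A"), ((0.1, 0.2), "A"), ((0.3, 0.1), "A"),
         ((5.0, 5.0), "B"), ((5.2, 4.9), "B")]
print(knn_classify((0.2, 0.2), train))  # A
```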
&lt;div class="flex px-4 py-3 mb-6 rounded-md bg-primary-100 dark:bg-primary-900">
&lt;span class="pr-3 pt-1 text-primary-600 dark:text-primary-300">
&lt;svg height="24" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24">&lt;path fill="none" stroke="currentColor" stroke-linecap="round" stroke-linejoin="round" stroke-width="1.5" d="m11.25 11.25l.041-.02a.75.75 0 0 1 1.063.852l-.708 2.836a.75.75 0 0 0 1.063.853l.041-.021M21 12a9 9 0 1 1-18 0a9 9 0 0 1 18 0m-9-3.75h.008v.008H12z"/>&lt;/svg>
&lt;/span>
&lt;span class="dark:text-neutral-300">More see: &lt;a href="https://haobin-tan.netlify.app/docs/ai/machine-learning/classification/k-nearest-neighbor/">k-NN&lt;/a>&lt;/span>
&lt;/div>
&lt;h2 id="clustering">Clustering&lt;/h2>
&lt;ul>
&lt;li>New problem setting
&lt;ul>
&lt;li>Only data points are given, NO class labels&lt;/li>
&lt;li>Find structures in given data&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Generally no single correct solution possible&lt;/li>
&lt;/ul>
&lt;h3 id="k-means">K-means&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>Algorithm&lt;/p>
&lt;ol>
&lt;li>
&lt;p>Randomly initialize k cluster centers&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Repeat until convergence:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Assign all data points to closest cluster center&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Compute new cluster center as mean of assigned data points&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol>
&lt;/li>
&lt;li>
&lt;p>👍 Pros: Simple and efficient&lt;/p>
&lt;/li>
&lt;li>
&lt;p>👎 Cons:&lt;/p>
&lt;ul>
&lt;li>$k$ needs to be known in advance&lt;/li>
&lt;li>Results depend on initialization&lt;/li>
&lt;li>Does not work well for clusters that are not hyperspherical (round) or clusters that overlap&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Very similar to the EM algorithm&lt;/p>
&lt;ul>
&lt;li>Uses hard assignments instead of probabilistic assignments (EM)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
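&lt;p>A minimal 1-D sketch of the algorithm; note that the result indeed depends on the (here fixed) initialization:&lt;/p>

```python
def kmeans(data, centers, iters=20):
    """k-means with hard assignments: assign each point to the closest
    center, then recompute each center as the mean of its points."""
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for x in data:
            j = min(range(len(centers)), key=lambda i: (x - centers[i]) ** 2)
            clusters[j].append(x)
        # keep the old center if a cluster happens to be empty
        centers = [sum(c) / len(c) if c else m for c, m in zip(clusters, centers)]
    return centers

data = [0.0, 0.2, 0.1, 5.0, 5.1, 4.9]
print([round(c, 1) for c in kmeans(data, [0.0, 1.0])])  # [0.1, 5.0]
```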
&lt;h3 id="agglomerative-hierarchical-clustering">Agglomerative Hierarchical Clustering&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>Algorithm&lt;/p>
&lt;ol>
&lt;li>
&lt;p>Start with one cluster for each data point&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Repeat&lt;/p>
&lt;ul>
&lt;li>Merge two closest clusters&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol>
&lt;/li>
&lt;li>
&lt;p>Several possibilities to measure cluster distance&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Min: minimal distance between elements&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Max: maximal distance between elements&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Avg: average distance between elements&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Mean: distance between cluster means&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Result is a tree called a &lt;strong>dendrogram&lt;/strong>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Example:&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-11-07%2017.02.40.png" alt="截屏2020-11-07 17.02.40" style="zoom:67%;" />
&lt;/li>
&lt;/ul>
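&lt;p>A sketch of the algorithm using the “Min” criterion (single linkage) on 1-D points; it stops at a target number of clusters and returns them rather than building the full dendrogram:&lt;/p>

```python
def agglomerative(points, target_clusters):
    """Agglomerative clustering, 'Min' criterion (single linkage):
    start with one cluster per point, repeatedly merge the two closest."""
    clusters = [[p] for p in points]
    def min_dist(a, b):
        return min(abs(x - y) for x in a for y in b)
    while len(clusters) > target_clusters:
        # find the pair of clusters with the smallest inter-cluster distance
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min_dist(clusters[i], clusters[j])
                if best is None or best[0] > d:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters.pop(j)
    return [sorted(c) for c in clusters]

print(agglomerative([0.0, 0.3, 5.0, 5.2, 9.0], 2))  # [[0.0, 0.3], [5.0, 5.2, 9.0]]
```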
&lt;h2 id="curse-of-dimensionality">Curse of dimensionality&lt;/h2>
&lt;ul>
&lt;li>
&lt;p>In computer vision, the extracted feature vectors are often &lt;strong>high&lt;/strong>-dimensional&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Many intuitions about linear algebra are no longer valid in high-dimensional spaces 🤪&lt;/p>
&lt;ul>
&lt;li>Classifiers often work better in low-dimensional spaces&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>These problems are called “&lt;strong>curse of dimensionality&lt;/strong>” &amp;#x1f47f;&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h3 id="example">Example&lt;/h3>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-11-07%2017.05.38.png" alt="截屏2020-11-07 17.05.38" style="zoom:80%;" />
&lt;h3 id="dimensionality-reduction">Dimensionality reduction&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>PCA: Leave out dimensions and minimize the reconstruction error made&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-11-07%2017.06.38.png" alt="截屏2020-11-07 17.06.38" style="zoom:80%;" />
&lt;/li>
&lt;li>
&lt;p>LDA: Maximize class separability&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-11-07%2017.06.58.png" alt="截屏2020-11-07 17.06.58" style="zoom:67%;" />&lt;/li>
&lt;/ul></description></item><item><title>Face Detection: Color-Based</title><link>https://haobin-tan.netlify.app/docs/ai/computer-vision/cv-lecture/03-face-detection-color/</link><pubDate>Fri, 06 Nov 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/computer-vision/cv-lecture/03-face-detection-color/</guid><description>&lt;h2 id="tldr">TL;DR&lt;/h2>
&lt;ul>
&lt;li>
&lt;p>Different color spaces and classifiers can be used&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Models: histograms, Gaussian Models, Mixture of Gaussians Model&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Histogram-backprojection / Histogram matching&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Bayes classifier&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Discriminative Classifiers (ANN, SVM)&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Bayesian classifier and ANN seem to work well&lt;/p>
&lt;ul>
&lt;li>Sufficient training data is needed for modeling the pdf, in particular for Bayesian approach (positive &amp;amp; negative pdfs learned)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Advantages: Fast, rotation &amp;amp; scale invariant, robust against occlusions&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Disadvantages:&lt;/p>
&lt;ul>
&lt;li>Affected by illumination&lt;/li>
&lt;li>Cannot distinguish head and hands&lt;/li>
&lt;li>Skin-colored objects in the background problematic&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Metric: ROC curve used to compare classification results / methods&lt;/p>
&lt;/li>
&lt;/ul>
&lt;hr>
&lt;h2 id="color-based-face-detection-overview">Color-based face detection overview&lt;/h2>
&lt;p>💡 &lt;strong>Idea: human skin has consistent color, which is distinct from many objects&lt;/strong>&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2020-11-10%2014.57.37.png" alt="截屏2020-11-10 14.57.37">&lt;/p>
&lt;p>Possible approach:&lt;/p>
&lt;ol>
&lt;li>Find skin colored pixels&lt;/li>
&lt;li>Group skin colored pixels&lt;/li>
&lt;/ol>
&lt;ul>
&lt;li>(and apply some heuristics) to find the face&lt;/li>
&lt;/ul>
&lt;h2 id="color">Color&lt;/h2>
&lt;ul>
&lt;li>&lt;strong>Grayscale&lt;/strong> Image: Each pixel represented by &lt;strong>one&lt;/strong> number (typically integer between 0 and 255)&lt;/li>
&lt;li>&lt;strong>Color&lt;/strong> image: Pixels represented by &lt;strong>three&lt;/strong> numbers&lt;/li>
&lt;/ul>
&lt;p>Different representations exist &amp;ndash;&amp;gt; “Color Spaces”&lt;/p>
&lt;h3 id="color-spaces">Color spaces&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>RGB&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>most widely used&lt;/p>
&lt;/li>
&lt;li>
&lt;p>specifies colors in terms of the primary colors &lt;strong>red (R), green (G), and blue (B)&lt;/strong>&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2020-11-10%2015.00.08-20201110184617048.png" alt="截屏2020-11-10 15.00.08">&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>HSV/HSI&lt;/strong>: &lt;strong>hue (H)&lt;/strong>, &lt;strong>saturation (S)&lt;/strong> and &lt;strong>value(V)/intensity (I)&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Closely related to human perception (hue, colorfulness and brightness)&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2020-11-10%2017.27.38.png" alt="截屏2020-11-10 17.27.38">&lt;/p>
&lt;ul>
&lt;li>Hue: &amp;ldquo;color&amp;rdquo;&lt;/li>
&lt;li>Saturation: how &amp;ldquo;pure&amp;rdquo; the color is&lt;/li>
&lt;li>Value: &amp;ldquo;lightness&amp;rdquo;&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Class Y spaces&lt;/strong>: YCbCr (Digital Video), YIQ (NTSC), YUV (PAL)&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Y channel contains brightness, other two channels store chrominance (U=B-Y, V=R-Y)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Conversion from RGB to Yxx is a linear transformation&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2020-11-10%2018.18.27.png" alt="截屏2020-11-10 18.18.27">&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Perceptually uniform spaces&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Perceived color difference is proportional to the difference in color values&lt;/li>
&lt;li>Euclidean distance can be used for color comparison&lt;/li>
&lt;/ul>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2020-11-10%2018.19.07.png" alt="截屏2020-11-10 18.19.07">&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Chromatic Color Spaces&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Two color channels containing chrominance (color) information&lt;/p>
&lt;ul>
&lt;li>HS (taken from HSV)&lt;/li>
&lt;li>UV (taken from YUV)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Normalized rg from RGB:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>r = R / (R+G+B)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>g = G / (R+G+B)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>b = B / (R+G+B)&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Sometimes it is argued that chromatic skin color models are more robust&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h4 id="problems">Problems&lt;/h4>
&lt;ul>
&lt;li>Reflected color depends on spectrum of the light source (and properties of the object / surface)&lt;/li>
&lt;li>If the light source / illumination changes, the reflected color signal changes!!! 🤪&lt;/li>
&lt;/ul>
&lt;h2 id="how-to-model-skin-color">How to model skin color?&lt;/h2>
&lt;ul>
&lt;li>
&lt;p>&lt;a href="#histogram-as-skin-color-model">Non-parametric models: typically histograms&lt;/a>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;a href="#parametric-models">Parametric models&lt;/a>&lt;/p>
&lt;ul>
&lt;li>Gaussian Model&lt;/li>
&lt;li>Gaussian Mixture Model&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Or just learn decision boundaries between classes (&lt;a href="#discriminative-models--classifiers">discriminative model&lt;/a>)&lt;/p>
&lt;ul>
&lt;li>ANN, SVM, &amp;hellip;&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="histogram-as-skin-color-model">Histogram as skin color model&lt;/h3>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2020-11-10%2018.34.57.png" alt="截屏2020-11-10 18.34.57">&lt;/p>
&lt;ul>
&lt;li>👍 Advantages: Works very well in practice&lt;/li>
&lt;li>👎 Disadvantages
&lt;ul>
&lt;li>Memory size quickly gets high&lt;/li>
&lt;li>A large number of labelled skin and non-skin samples is needed!&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h4 id="histogram-backprojection">Histogram Backprojection&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>The simplest (and fastest) way to utilize histogram information&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Each pixel in the backprojection is set to the value of the (skin-color) histogram bin indexed by the color of the respective pixel&lt;/p>
&lt;ul>
&lt;li>A color $x$ is considered as skin color if $H\_{+}(x) > \theta$&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>E.g.&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-07-22%2022.20.33.png" alt="截屏2021-07-22 22.20.33">&lt;/p>
&lt;/li>
&lt;/ul>
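&lt;p>A minimal sketch of backprojection over quantized color bins; the skin-color histogram values here are made up for illustration:&lt;/p>

```python
def backproject(image, skin_hist, threshold=0.2):
    """Histogram backprojection: each pixel gets the value of the skin-color
    histogram bin indexed by its color; thresholding yields a binary skin mask."""
    return [[1 if skin_hist.get(px, 0.0) > threshold else 0 for px in row]
            for row in image]

# toy normalized skin-color histogram over quantized color bins (made up)
skin_hist = {10: 0.5, 11: 0.3, 12: 0.15, 200: 0.05}
image = [[10, 11, 200],
         [12, 10, 255]]
print(backproject(image, skin_hist))  # [[1, 1, 0], [0, 1, 0]]
```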
&lt;h4 id="histogram-matching">Histogram Matching&lt;/h4>
&lt;ul>
&lt;li>Backprojection
&lt;ul>
&lt;li>works well when the color distribution of the target is monomodal&lt;/li>
&lt;li>is not optimal when the target is multi-colored &amp;#x1f622;&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>🔧 Solution: Build a histogram of the image within the search window, and compare it to the target histogram.
&lt;ul>
&lt;li>distance metrics for histograms, e.g.:
&lt;ul>
&lt;li>
&lt;p>Bhattacharyya distance&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Histogram intersection&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Earth mover’s distance, &amp;hellip;&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
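&lt;p>Histogram intersection, one of the similarity measures listed above, sketched for normalized histograms (the bin values are hypothetical):&lt;/p>

```python
def histogram_intersection(h1, h2):
    """Similarity of two normalized histograms: sum of bin-wise minima.
    1.0 means identical histograms, 0.0 means no overlap at all."""
    bins = set(h1).union(h2)
    return sum(min(h1.get(b, 0.0), h2.get(b, 0.0)) for b in bins)

target = {0: 0.5, 1: 0.3, 2: 0.2}  # hypothetical target histogram
window = {0: 0.4, 1: 0.4, 2: 0.2}  # histogram of the current search window
print(round(histogram_intersection(target, window), 2))  # 0.9
```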
&lt;h4 id="histogram-backprojection-vs-matching">Histogram Backprojection vs. Matching&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>Histogram Backprojection&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Compares color of a single pixel with color model&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Fast and simple&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Can only cope well with mono-modal distributions&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Sufficient for skin-color classification&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Histogram Matching / Intersection&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Compares color histogram of image patch with color model&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Better performance&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Can cope with multi-modal distributions&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Computationally expensive&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="parametric-models">Parametric models&lt;/h3>
&lt;h4 id="gaussian-density-models">Gaussian Density Models&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>Gaussian Densities&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Assume that the distribution of skin colors p(x) has a parametric functional form&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Most common function: Gaussian function $\mathrm{G}(\mathbf{x} ; \mu, \mathbf{C})$
&lt;/p>
$$
p(x | \text{skin})=G(x ; \mu, C)=\frac{1}{(2 \pi)^{d / 2}|C|^{1 / 2} }\exp \left\\{-1 / 2(x-\mu)^{\top} C^{-1}(x-\mu)\right\\}
$$
&lt;ul>
&lt;li>Mean $\mu$ and covariance matrix $C$ are estimated from a training set of skin colors $S = \{x\_1, x\_2, \ldots, x\_N\}$:
&lt;ul>
&lt;li>$\mu = E\{x\}$&lt;/li>
&lt;li>$C = E\{(\boldsymbol{x}-\mu)(\boldsymbol{x}-\mu)^T\}$&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>A color is considered as skin color if&lt;/p>
&lt;ul>
&lt;li>$p(x|\text{skin}) > \theta$&lt;/li>
&lt;li>$p(x|\text{skin}) > p(x|\text{non-skin})$&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
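&lt;p>A sketch of the Gaussian model in a 2-D chromatic space (e.g. normalized rg); the mean, covariance, and threshold $\theta$ are made-up values, not a trained model:&lt;/p>

```python
import math

def skin_likelihood(x, mu, cov):
    """2-D Gaussian skin-color likelihood p(x|skin) = G(x; mu, C).
    cov is a 2x2 covariance matrix [[c00, c01], [c10, c11]]."""
    det = cov[0][0] * cov[1][1] - cov[0][1] * cov[1][0]
    inv = [[cov[1][1] / det, -cov[0][1] / det],
           [-cov[1][0] / det, cov[0][0] / det]]
    d = [x[0] - mu[0], x[1] - mu[1]]
    # squared Mahalanobis distance d^T C^-1 d
    md2 = (d[0] * (inv[0][0] * d[0] + inv[0][1] * d[1])
           + d[1] * (inv[1][0] * d[0] + inv[1][1] * d[1]))
    return math.exp(-0.5 * md2) / (2 * math.pi * math.sqrt(det))

# hypothetical skin model in normalized rg coordinates
mu = (0.45, 0.30)
cov = [[0.002, 0.0], [0.0, 0.002]]
theta = 1.0
print(skin_likelihood((0.44, 0.31), mu, cov) > theta)  # True: close to the mean
```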
&lt;h4 id="mixture-of-gaussian-models">Mixture of Gaussian Models&lt;/h4>
$$
p(x)=\sum\_{i=1}^{K} \pi\_{i} G\left(x, \mu\_{i}, C\_{i}\right)
$$
&lt;ul>
&lt;li>
&lt;p>Parameter set $\Phi$ can be estimated using the &lt;strong>EM&lt;/strong> algorithm&lt;/p>
&lt;ul>
&lt;li>Iteratively changes parameters so as to maximize the log-likelihood of the training set:
$$
L=\log \prod\_{i=1}^{N} p\left(x\_{i} \mid \Phi\right)
$$&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>A color is considered as skin color if&lt;/p>
&lt;ul>
&lt;li>$p(x|\text{skin}) > \theta$&lt;/li>
&lt;li>$p(x|\text{skin}) > p(x|\text{non-skin})$&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h4 id="bayes-classifier">Bayes Classifier&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>Skin Classification using &lt;strong>Bayes Decision Rule&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Minimum cost decision rule&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Classify pixel to skin class if $P(\text{Skin} | x)>P(\text{Non-Skin} | x)$&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Decision Rule:
&lt;/p>
$$
\frac{p(\mathbf{x} \mid \text {Skin})}{p(\mathbf{x} \mid \text {Non-Skin})} \geq \frac{P(\text {Non-Skin})}{P(\text {Skin})}
$$
&lt;/li>
&lt;li>
&lt;p>The class-conditional densities $p(x|\omega\_i)$ can be estimated from the corresponding histograms:
&lt;/p>
$$
p\left(x \mid \omega\_{i}\right)=h\_{i}(x) / \sum\_{x} h\_{i}(x)
$$
&lt;ul>
&lt;li>$h\_i(x)$: count of pixels from class $\omega\_{i}$ that have value $x$&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="discriminative-models--classifiers">Discriminative Models / Classifiers&lt;/h3>
&lt;ul>
&lt;li>Artificial Neural Networks&lt;/li>
&lt;li>Support Vector Machine&lt;/li>
&lt;/ul>
&lt;h2 id="performance-measures">Performance Measures&lt;/h2>
&lt;h3 id="for-classification">For classification&lt;/h3>
&lt;p>When comparing recognition hypotheses with ground-truth annotations, one has to consider four cases:&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/confusion-matrix.png" alt="Measuring Performance: The Confusion Matrix – Glass Box" style="zoom: 40%;" />
&lt;blockquote>
&lt;p>More see: &lt;a href="https://haobin-tan.netlify.app/docs/ai/machine-learning/ml-fundamentals/evaluation/">Evaluation&lt;/a>&lt;/p>
&lt;/blockquote>
&lt;h4 id="roc-receiver-operating-characteristic">ROC (Receiver Operating Characteristic)&lt;/h4>
&lt;ul>
&lt;li>Used for the task of classification&lt;/li>
&lt;li>Measures the trade-off between true positive rate and false positive rate&lt;/li>
&lt;/ul>
$$
\begin{array}{l}
\text { true positive rate }=\frac{\mathrm{TP}}{\mathrm{Pos}}=\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}} \\\\
\text { false positive rate }=\frac{\mathrm{FP}}{\mathrm{Neg}}=\frac{\mathrm{FP}}{\mathrm{FP}+\mathrm{TN}}
\end{array}
$$
&lt;ul>
&lt;li>
&lt;p>Each prediction hypothesis generally has an associated probability value or score&lt;/p>
&lt;/li>
&lt;li>
&lt;p>The performance values can therefore be plotted into a graph, using each possible score as a threshold&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Example:&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2020-11-12%2023.27.18.png" alt="截屏2020-11-12 23.27.18">&lt;/p>
&lt;/li>
&lt;/ul>
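&lt;p>Computing the ROC points by sweeping every score as a threshold, on toy scores and labels:&lt;/p>

```python
def roc_points(scores, labels):
    """Sweep each distinct score as a threshold and compute (FPR, TPR) pairs.
    labels: 1 = positive, 0 = negative; predict positive if score >= threshold."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = []
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / neg, tp / pos))
    return points

scores = [0.9, 0.8, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0]
print(roc_points(scores, labels))  # the curve ends at (1.0, 1.0)
```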
&lt;h3 id="skin-color-analysis-and-comparison">Skin-color: Analysis and Comparison&lt;/h3>
&lt;p>Conclusions &lt;sup id="fnref:1">&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref">1&lt;/a>&lt;/sup>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Bayesian approach and MLP worked best&lt;/p>
&lt;ul>
&lt;li>Bayesian approach needs much more memory&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Approach is largely unaffected by choice of color space, but&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Results degraded when only chrominance channels were used&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h2 id="from-skin-colored-pixels-to-faces">From Skin-Colored Pixels to Faces&lt;/h2>
&lt;ul>
&lt;li>
&lt;p>Skin-colored pixels need to be grouped into object representations&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-11-13%2014.56.21.png" alt="截屏2020-11-13 14.56.21" style="zoom:80%;" />
&lt;/li>
&lt;li>
&lt;p>🔴 Problems:&lt;/p>
&lt;ul>
&lt;li>skin-colored background,&lt;/li>
&lt;li>further skin-colored body parts (hands, arms, &amp;hellip;),&lt;/li>
&lt;li>Noise, &amp;hellip;&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="perceptual-grouping">Perceptual Grouping&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Morphological Operators&lt;/strong>: Operators performing an action on shapes where the input and output is a binary image.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Threshold each pixel&amp;rsquo;s skin affiliation &amp;ndash;&amp;gt; Binary Image&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2020-11-13%2014.58.11.png" alt="截屏2020-11-13 14.58.11">&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Morphological Erosion&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;em>Remove&lt;/em> pixels from edges of objects&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Set pixel value to &lt;strong>min&lt;/strong> value of surrounding pixels&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2020-11-13%2015.00.53.png" alt="截屏2020-11-13 15.00.53">&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Morphological Dilatation&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;em>Add&lt;/em> pixels to edges of objects&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Set pixel value to &lt;strong>max&lt;/strong> value of surrounding pixels&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2020-11-13%2015.41.11.png" alt="截屏2020-11-13 15.41.11">&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Morphological Opening&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Apply erosion, then dilatation&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2020-11-13%2015.42.38.png" alt="截屏2020-11-13 15.42.38">&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Goal:&lt;/p>
&lt;ul>
&lt;li>Smooth outline&lt;/li>
&lt;li>Open small bridges&lt;/li>
&lt;li>Eliminate outliers&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Morphological Closing&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Apply dilatation, then erosion&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2020-11-13%2015.45.25.png" alt="截屏2020-11-13 15.45.25">&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Goal:&lt;/p>
&lt;ul>
&lt;li>Smooth inner edges&lt;/li>
&lt;li>Connect small distances&lt;/li>
&lt;li>Fill unwanted holes&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Apply morphological closing then morphological opening&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Resulting image is reduced to connected regions of skin color (blobs)&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2020-11-13%2015.59.57.png" alt="截屏2020-11-13 15.59.57">&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
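&lt;p>The four operators can be sketched as min/max filters over a 3x3 neighborhood in plain NumPy (a simplified version; real implementations allow arbitrary structuring elements):&lt;/p>

```python
import numpy as np

def erode(img):
    """Morphological erosion: set each pixel to the min of its 3x3 neighborhood."""
    padded = np.pad(img, 1, mode="edge")
    out = padded[1:-1, 1:-1].copy()
    for dy in (0, 1, 2):
        for dx in (0, 1, 2):
            out = np.minimum(out, padded[dy:dy + img.shape[0], dx:dx + img.shape[1]])
    return out

def dilate(img):
    """Morphological dilatation: set each pixel to the max of its 3x3 neighborhood."""
    padded = np.pad(img, 1, mode="edge")
    out = padded[1:-1, 1:-1].copy()
    for dy in (0, 1, 2):
        for dx in (0, 1, 2):
            out = np.maximum(out, padded[dy:dy + img.shape[0], dx:dx + img.shape[1]])
    return out

def opening(img):
    """Erosion, then dilatation: eliminates outliers and opens small bridges."""
    return dilate(erode(img))

def closing(img):
    """Dilatation, then erosion: fills small holes and connects small gaps."""
    return erode(dilate(img))
```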
&lt;h3 id="from-skin-blobs-to-faces">From Skin Blobs To Faces&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>Goal: align bounding box around face candidate&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2020-11-13%2016.01.23.png" alt="截屏2020-11-13 16.01.23">&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Important for:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Face Recognition&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Head Pose Estimation&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Different approaches:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Choose cluster with biggest size&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Ellipse fitting (approximate face region by ellipse)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Heuristics to distinguish between different skin clusters&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Use temporal information (tracking)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Facial Feature Detection&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&amp;hellip;&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
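&lt;p>The &amp;ldquo;choose cluster with biggest size&amp;rdquo; approach can be sketched as flood-fill connected-component labeling on the binary skin mask (4-connectivity, for illustration only):&lt;/p>

```python
import numpy as np

def biggest_blob(binary):
    """Label 4-connected components by flood fill; return a mask of the largest one."""
    h, w = binary.shape
    labels = np.zeros((h, w), dtype=int)
    n_blobs = 0
    for sy in range(h):
        for sx in range(w):
            if binary[sy, sx] and labels[sy, sx] == 0:
                n_blobs += 1
                labels[sy, sx] = n_blobs
                stack = [(sy, sx)]
                while stack:
                    y, x = stack.pop()
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                        if ny in range(h) and nx in range(w) \
                                and binary[ny, nx] and labels[ny, nx] == 0:
                            labels[ny, nx] = n_blobs
                            stack.append((ny, nx))
    if n_blobs == 0:
        return np.zeros_like(binary)
    sizes = [(labels == i).sum() for i in range(1, n_blobs + 1)]
    return labels == 1 + int(np.argmax(sizes))
```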
&lt;div class="footnotes" role="doc-endnotes">
&lt;hr>
&lt;ol>
&lt;li id="fn:1">
&lt;p>S. L. Phung, A. Bouzerdoum and D. Chai, &amp;ldquo;Skin segmentation using color pixel classification: analysis and comparison,&amp;rdquo; in IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 1, pp. 148-154, Jan. 2005, doi: 10.1109/TPAMI.2005.17.&amp;#160;&lt;a href="#fnref:1" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;/ol>
&lt;/div></description></item><item><title>Face Detection: Neural-Network-Based</title><link>https://haobin-tan.netlify.app/docs/ai/computer-vision/cv-lecture/04-face-detection-ann/</link><pubDate>Fri, 13 Nov 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/computer-vision/cv-lecture/04-face-detection-ann/</guid><description>&lt;h2 id="motivation">Motivation&lt;/h2>
&lt;ul>
&lt;li>Idea: Use a search-window to scan over an image&lt;/li>
&lt;li>Train a classifier to decide whether the search window contains a face or not&lt;/li>
&lt;/ul>
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-100" >&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%e6%88%aa%e5%b1%8f2020-11-13%2016.16.57.png" alt="截屏2020-11-13 16.16.57" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h2 id="detection">Detection&lt;/h2>
&lt;h3 id="simple-neuron-model">Simple neuron model&lt;/h3>
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-100" >&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%e6%88%aa%e5%b1%8f2020-11-13%2016.20.47.png" alt="截屏2020-11-13 16.20.47" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h3 id="topologies">Topologies&lt;/h3>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-11-13%2016.21.15.png" alt="截屏2020-11-13 16.21.15" style="zoom:67%;" />
&lt;h3 id="parameters">Parameters&lt;/h3>
&lt;p>Adjustable Parameters are&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Connection weights (to be learned)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Activation function (fixed)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Number of layers (fixed)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Number of neurons per layer (fixed)&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h3 id="training">Training&lt;/h3>
&lt;p>Backpropagation with gradient descent&lt;/p>
&lt;h2 id="neural-network-based-face-detection1">Neural Network Based Face Detection&lt;sup id="fnref:1">&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref">1&lt;/a>&lt;/sup>&lt;/h2>
&lt;ul>
&lt;li>Idea: Use an artificial neural network to detect upright frontal faces
&lt;ul>
&lt;li>
&lt;p>Network receives as input a 20x20 pixel region of an image&lt;/p>
&lt;/li>
&lt;li>
&lt;p>output ranges from -1 (no face present) to +1 (face present)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>the neural network „face-filter“ is applied at every location in the image&lt;/p>
&lt;/li>
&lt;li>
&lt;p>to detect faces with different sizes, the input image is repeatedly scaled down&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="network-topology">Network Topology&lt;/h3>
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-100" >&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%e6%88%aa%e5%b1%8f2020-11-13%2016.28.33.png" alt="截屏2020-11-13 16.28.33" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;ul>
&lt;li>20x20 pixel input retina&lt;/li>
&lt;li>4 types of receptive hidden fields&lt;/li>
&lt;li>One real-valued output&lt;/li>
&lt;/ul>
&lt;h3 id="system-overview">System Overview&lt;/h3>
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-100" >&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%e6%88%aa%e5%b1%8f2020-11-13%2016.29.19.png" alt="截屏2020-11-13 16.29.19" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h3 id="network-training">Network Training&lt;/h3>
&lt;h4 id="training-set">Training Set&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>1050 normalized face images&lt;/p>
&lt;/li>
&lt;li>
&lt;p>15 face images generated from each original face image by rotating and scaling&lt;/p>
&lt;/li>
&lt;li>
&lt;p>1000 randomly chosen non-face images&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h4 id="preprocessing">Preprocessing&lt;/h4>
&lt;ul>
&lt;li>correct for different lighting conditions (overall brightness, shadows)&lt;/li>
&lt;li>rescale images to fixed size&lt;/li>
&lt;/ul>
&lt;h4 id="histogram-equalization">Histogram equalization&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>Defines a mapping of gray levels $p$ into gray levels $q$ such that the distribution of $q$ is close to being uniform&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Stretches contrast (expands the range of gray levels)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Transforms different input images so that they have similar intensity distributions (thus reducing the effect of different illumination)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Example&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-11-13%2016.32.18.png" alt="截屏2020-11-13 16.32.18" style="zoom:67%;" />
&lt;/li>
&lt;li>
&lt;p>Algorithm&lt;/p>
&lt;ul>
&lt;li>
&lt;p>The probability of an occurrence of a pixel of level $i$ in the image:
&lt;/p>
$$
p\left(x\_{i}\right)=\frac{n\_{i}}{n}, \qquad i \in 0, \ldots, L-1
$$
&lt;ul>
&lt;li>$L$: number of gray levels&lt;/li>
&lt;li>$n\_i$: number of occurences of gray level $i$&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Define $c$ as the cumulative distribution function:
&lt;/p>
$$
c(i)=\sum\_{j=0}^{i} p\left(x\_{j}\right)
$$
&lt;/li>
&lt;li>
&lt;p>A transformation of the form
&lt;/p>
$$
y\_i = T(x\_i) = c(i), \qquad y\_i \in [0, 1]
$$
&lt;p>
will produce a level $y$ for each level $x$ in the original image, such that the cumulative probability function of $y$ is linearized across the value range. Finally, the values are rescaled to the output gray-level range:
&lt;/p>
$$
y\_{i}^{\prime}=y\_{i} \cdot(\max -\min )+\min
$$
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
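&lt;p>The algorithm above translates almost directly into NumPy (a minimal sketch for single-channel images):&lt;/p>

```python
import numpy as np

def equalize(img, levels=256):
    """Histogram equalization via the cumulative distribution c(i)."""
    n = img.size
    p = np.bincount(img.ravel(), minlength=levels) / n      # p(x_i) = n_i / n
    c = np.cumsum(p)                                        # c(i)
    y = c[img]                                              # y_i = T(x_i) = c(i), in [0, 1]
    lo, hi = 0, levels - 1
    return np.round(y * (hi - lo) + lo).astype(img.dtype)   # y'_i = y_i * (max - min) + min
```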
&lt;h4 id="training-procedure">Training Procedure&lt;/h4>
&lt;ol>
&lt;li>
&lt;p>Randomly choose 1000 non-face images&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Train network to produce 1 for faces, -1 for non-faces&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Run network on images containing no faces. Collect subimages in which network incorrectly identifies a face (output &amp;gt; 0)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Select up to 250 of these „false positives“ at random and add them to the training set as negative examples&lt;/p>
&lt;/li>
&lt;/ol>
&lt;h3 id="neural-network-based-face-filter">Neural Network Based Face Filter&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>Output of ANN defines a filter for faces&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Search&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Scan input image with search window, apply ANN to search window&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Input image needs to be rescaled in order to detect faces with different size&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Output needs to be post-processed&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Noise removal&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Merging overlapping detections&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Speed up can be achieved&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Increase step size&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Make ANN more flexible to translation&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Hierarchical, pyramidal search&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="localization-and-ground-truth">Localization and Ground-Truth&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>For localization, the test data is mostly annotated with ground-truth bounding boxes&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Comparing hypotheses to Ground-Truth&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Overlap
&lt;/p>
$$
O = \frac{\text{GT } \cap \text{ DET}}{\text{GT } \cup \text{ DET}}
$$
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-100" >&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%e6%88%aa%e5%b1%8f2020-11-13%2016.43.11.png" alt="截屏2020-11-13 16.43.11" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;blockquote>
&lt;p>Also called &lt;strong>Intersection over Union (IoU)&lt;/strong>&lt;/p>
&lt;/blockquote>
&lt;/li>
&lt;li>
&lt;p>Often used as threshold: Overlap &amp;gt; 50%&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
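&lt;p>For axis-aligned boxes given as corner coordinates $(x\_1, y\_1, x\_2, y\_2)$, the overlap can be computed as:&lt;/p>

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width/height of the intersection rectangle (0 if the boxes do not overlap)
    iw = max(0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union else 0.0
```

&lt;p>E.g. iou((0, 0, 2, 2), (1, 1, 3, 3)) gives $1/7$, well below the 50% threshold.&lt;/p>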
&lt;div class="footnotes" role="doc-endnotes">
&lt;hr>
&lt;ol>
&lt;li id="fn:1">
&lt;p>&lt;em>Neural Network Based Face Detection, by Henry A. Rowley, Shumeet Baluja, and Takeo Kanade. IEEE Transactions on Pattern Analysis and Machine Intelligence, volume 20, number 1, pages 23-38, January 1998.&lt;/em>&amp;#160;&lt;a href="#fnref:1" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;/ol>
&lt;/div></description></item><item><title>Face Recognition: Traditional Approaches</title><link>https://haobin-tan.netlify.app/docs/ai/computer-vision/cv-lecture/05-face_recognition-traditional/</link><pubDate>Thu, 04 Feb 2021 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/computer-vision/cv-lecture/05-face_recognition-traditional/</guid><description>&lt;h2 id="face-recognition-for-human-computer-interaction-hci">Face Recognition for Human-Computer Interaction (HCI)&lt;/h2>
&lt;h3 id="main-problem">Main Problem&lt;/h3>
&lt;blockquote>
&lt;p>The variations between the images of the same face due to illumination and viewing direction are almost always larger than image variations due to change in face identity.&lt;/p>
&lt;p>&amp;ndash; Moses, Adini, Ullman, ECCV‘94&lt;/p>
&lt;/blockquote>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-04%2023.57.03.png" alt="截屏2021-02-04 23.57.03">&lt;/p>
&lt;h3 id="closed-set-vs-open-set-identification">Closed Set vs. Open Set Identification&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Closed-Set Identification&lt;/strong>
&lt;ul>
&lt;li>The system reports which person from the gallery is shown on the test image: Who is this person?&lt;/li>
&lt;li>Performance metric: Correct identification rate&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Open-Set Identification&lt;/strong>
&lt;ul>
&lt;li>The system first decides whether the person in the test image is known or unknown. If the person is known, the system reports who it is.&lt;/li>
&lt;li>Performance metric
&lt;ul>
&lt;li>&lt;strong>False accept&lt;/strong>: The invalid identity is accepted as one of the individuals in the database.&lt;/li>
&lt;li>&lt;strong>False reject&lt;/strong>: An individual is rejected even though he/she is present in the database.&lt;/li>
&lt;li>&lt;strong>False classify&lt;/strong>: An individual in the database is correctly accepted but misclassified as one of the other individuals in the training data&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="authenticationverification">Authentication/Verification&lt;/h3>
&lt;p>A person claims to be a particular member. The system decides whether the test image and the training image show the same person: Is the person who they claim to be?&lt;/p>
&lt;p>Performance metric:&lt;/p>
&lt;ul>
&lt;li>False Reject Rate (FRR): Rate of rejecting a valid identity&lt;/li>
&lt;li>False Accept Rate (FAR): Rate of incorrectly accepting an invalid identity.&lt;/li>
&lt;/ul>
&lt;h2 id="feature-based-geometrical-approaches">Feature-based (Geometrical) approaches&lt;/h2>
&lt;p>&amp;ldquo;Face Recognition: Features versus Templates&amp;rdquo; &lt;sup id="fnref:1">&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref">1&lt;/a>&lt;/sup>&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-05 00.12.27.png" alt="截屏2021-02-05 00.12.27" style="zoom:67%;" />
&lt;ul>
&lt;li>Eyebrow thickness and vertical position at the eye center position&lt;/li>
&lt;li>A coarse description of the left eyebrow‘s arches&lt;/li>
&lt;li>Nose vertical position and width&lt;/li>
&lt;li>Mouth vertical position, width, height upper and lower lips&lt;/li>
&lt;li>Eleven radii describing the chin shape&lt;/li>
&lt;li>Face width at nose position&lt;/li>
&lt;li>Face width halfway between nose tip and eyes&lt;/li>
&lt;/ul>
&lt;h3 id="classification">Classification&lt;/h3>
&lt;p>&lt;strong>Nearest neighbor classifier&lt;/strong> with &lt;strong>Mahalanobis distance&lt;/strong> as the distance metric:
&lt;/p>
$$
\Delta_{j}(x)=\left(x-m_{j}\right)^{T} \Sigma^{-1}\left(x-m_{j}\right)
$$
&lt;ul>
&lt;li>$x$: input face image&lt;/li>
&lt;li>$m\_j$: average vector representing the $j$-th person&lt;/li>
&lt;li>$\Sigma$: Covariance matrix&lt;/li>
&lt;/ul>
&lt;p>Different people are characterized only by their average feature vector.&lt;/p>
&lt;p>The distribution is common and estimated by using all the examples in the training set.&lt;/p>
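&lt;p>A minimal sketch of this nearest-neighbor classifier (the means and covariance below are toy values, not from the paper):&lt;/p>

```python
import numpy as np

def classify(x, means, cov):
    """Return the index j minimizing the Mahalanobis distance (x - m_j)^T Sigma^-1 (x - m_j)."""
    inv = np.linalg.inv(cov)
    dists = [(x - m) @ inv @ (x - m) for m in means]
    return int(np.argmin(dists))

# Toy example: two persons' average feature vectors, shared identity covariance
means = [np.array([0.0, 0.0]), np.array([10.0, 0.0])]
cov = np.eye(2)
```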
&lt;h2 id="appearance-based-approaches">Appearance-based approaches&lt;/h2>
&lt;p>Can be either&lt;/p>
&lt;ul>
&lt;li>&lt;strong>&lt;a href="#holistic-appearance-based-approaches">holistic&lt;/a>&lt;/strong> (process the whole face as the input), or&lt;/li>
&lt;li>&lt;a href="#local-appearance-based-approach">&lt;strong>local / fiducial&lt;/strong>&lt;/a> (process facial features, such as eyes, mouth, etc. seperately)&lt;/li>
&lt;/ul>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-08%2010.28.36.png" alt="截屏2021-02-08 10.28.36">&lt;/p>
&lt;p>Processing steps: align faces with facial landmarks&lt;/p>
&lt;ul>
&lt;li>Use manually labeled or automatically detected eye centers&lt;/li>
&lt;li>Normalize face images to a common coordinate system, removing translation, rotation, and scaling factors&lt;/li>
&lt;li>Crop off unnecessary background&lt;/li>
&lt;/ul>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-08%2010.30.38.png" alt="截屏2021-02-08 10.30.38">&lt;/p>
&lt;h2 id="holistic-appearance-based-approaches">Holistic appearance-based approaches&lt;/h2>
&lt;h3 id="eigenfaces">Eigenfaces&lt;/h3>
&lt;h4 id="-idea">💡 Idea&lt;/h4>
&lt;p>A face image defines a point in the high dimensional image space.&lt;/p>
&lt;p>Different face images share a number of similarities with each other&lt;/p>
&lt;ul>
&lt;li>They can be described by a relatively low dimensional subspace&lt;/li>
&lt;li>Project the face images into an appropriately chosen subspace and perform classification by similarity computation (distance, angle)
&lt;ul>
&lt;li>Dimensionality reduction procedure used here is called &lt;mark>&lt;strong>Karhunen-Loève transformation&lt;/strong>&lt;/mark> or &lt;mark>&lt;strong>principal component analysis (PCA)&lt;/strong>&lt;/mark>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h4 id="objective">Objective&lt;/h4>
&lt;p>Find the vectors that best account for the distribution of face images within the entire image space&lt;/p>
&lt;h4 id="pca">PCA&lt;/h4>
&lt;blockquote>
&lt;p>For more details see: &lt;a href="https://haobin-tan.netlify.app/docs/ai/machine-learning/unsupervised/pca/">Principle Component Analysis (PCA)&lt;/a>&lt;/p>
&lt;/blockquote>
&lt;ul>
&lt;li>Find direction vectors so as to minimize the average projection error&lt;/li>
&lt;li>Project on the linear subspace spanned by these vectors&lt;/li>
&lt;li>Use covariance matrix to find these direction vectors&lt;/li>
&lt;li>Project on the largest K direction vectors to reduce dimensionality&lt;/li>
&lt;/ul>
&lt;p>PCA for eigenfaces:&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-09%2011.11.40.png" alt="截屏2021-02-09 11.11.40">
&lt;/p>
$$
\begin{array}{l}
Y=\left[y\_{1}, y\_{2}, y\_{3}, \ldots, y\_{K}\right] \\\\
m=\frac{1}{K}\sum y \\\\
C=(Y-m)(Y-m)^{T} \\\\
D=U^{T} C U \\\\
\Omega=U^{\top}(y-m)
\end{array}
$$
&lt;p>
where&lt;/p>
&lt;ul>
&lt;li>
&lt;p>$y$: Face image&lt;/p>
&lt;/li>
&lt;li>
&lt;p>$Y$: Face matrix&lt;/p>
&lt;/li>
&lt;li>
&lt;p>$m$: Mean face&lt;/p>
&lt;/li>
&lt;li>
&lt;p>$C$: Covariance matrix&lt;/p>
&lt;/li>
&lt;li>
&lt;p>$D$: Eigenvalues&lt;/p>
&lt;/li>
&lt;li>
&lt;p>$U$: Eigenvectors&lt;/p>
&lt;/li>
&lt;li>
&lt;p>$\Omega$: Representation coefficients&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h4 id="training">Training&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>Acquire initial set of face images (training set):
&lt;/p>
$$
Y = [y\_1, y\_2, \dots, y\_K]
$$
&lt;/li>
&lt;li>
&lt;p>Calculate the eigenfaces/eigenvectors from the training set, keeping only the $M$ images/vectors corresponding to the highest eigenvalues
&lt;/p>
$$
U = (u\_1, u\_2, \dots, u\_M)
$$
&lt;/li>
&lt;li>
&lt;p>Calculate representation of each known individual $k$ in face space
&lt;/p>
$$
\Omega\_k = U^T(y\_k - m)
$$
&lt;/li>
&lt;/ul>
&lt;h4 id="testing">Testing&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>Project input new image &lt;em>y&lt;/em> into face space
&lt;/p>
$$
\Omega = U^T(y - m)
$$
&lt;/li>
&lt;li>
&lt;p>Find most likely candidate class $k$ by distance computation
&lt;/p>
$$
\epsilon\_k = \\|\Omega - \Omega\_k\\| \quad \text{for all } \Omega\_k
$$
&lt;/li>
&lt;/ul>
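&lt;p>The training and testing steps can be sketched end-to-end in NumPy; random vectors stand in for the flattened face images:&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(0)
K, d, M = 6, 16, 3                       # 6 "faces" of dimension 16, keep 3 eigenfaces
Y = rng.normal(size=(d, K))              # columns are face images y_k (toy data)

# Training
m = Y.mean(axis=1, keepdims=True)        # mean face
C = (Y - m) @ (Y - m).T                  # covariance (scatter) matrix
eigvals, eigvecs = np.linalg.eigh(C)     # eigenvalues in ascending order
U = eigvecs[:, ::-1][:, :M]              # top-M eigenvectors = eigenfaces
Omega = U.T @ (Y - m)                    # representation of each known individual

# Testing: project a probe image and find the nearest known face
y = Y[:, [2]]                            # probe = image of person 2
omega = U.T @ (y - m)
k = int(np.argmin(np.linalg.norm(Omega - omega, axis=0)))
```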
&lt;h4 id="projections-onto-the-face-space">&lt;strong>Projections onto the face space&lt;/strong>&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>Principal components are called “&lt;strong>eigenfaces&lt;/strong>” and they span the “face space”.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Images can be reconstructed by their projections in face space:&lt;/p>
&lt;/li>
&lt;/ul>
$$
Y\_f = \sum\_{i=1}^{M} \omega\_i u\_i
$$
&lt;p>The appearance of faces in face space does not change a lot&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Difference of mean-adjusted image $(Y-m)$ and projection $Y\_f$ gives a measure of &lt;em>„faceness“&lt;/em>&lt;/p>
&lt;ul>
&lt;li>Distance from face space can be used to detect faces&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Different cases of projections onto face space&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-09%2011.38.26.png" alt="截屏2021-02-09 11.38.26" style="zoom:80%;" />
&lt;ul>
&lt;li>
&lt;p>Case 1: Projection of a &lt;em>known&lt;/em> individual&lt;/p>
&lt;p>$\rightarrow$ Near face space ($\epsilon &lt; \theta\_{\delta}$) and near known face $\Omega\_k$ ($\epsilon\_k &lt; \theta\_{\epsilon}$)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Case 2: Projection of an &lt;em>unkown&lt;/em> individual&lt;/p>
&lt;p>$\rightarrow$ Near face space, far from reference vectors&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Case 3 and 4: not a face (far from face space)&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h4 id="pca-for-face-matching-and-recognition">PCA for face matching and recognition&lt;/h4>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-09%2011.53.26.png" alt="截屏2021-02-09 11.53.26">&lt;/p>
&lt;ul>
&lt;li>Projects all faces onto a &lt;strong>universal&lt;/strong> eigenspace to “encode” via principal components&lt;/li>
&lt;li>Uses inverse-distance as a similarity measure $S(p,g)$ for matching &amp;amp; recognition&lt;/li>
&lt;/ul>
&lt;h4 id="problems-and-shortcomings">Problems and shortcomings&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>Eigenfaces do NOT distinguish between shape and appearance&lt;/p>
&lt;/li>
&lt;li>
&lt;p>PCA does NOT use class information&lt;/p>
&lt;ul>
&lt;li>PCA projections are optimal for reconstruction from a low-dimensional basis, but they may not be optimal from a discrimination standpoint&lt;/li>
&lt;/ul>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-09%2011.58.01.png" alt="截屏2021-02-09 11.58.01" style="zoom:67%;" />
&lt;/li>
&lt;/ul>
&lt;h3 id="fisherface">Fisherface&lt;/h3>
&lt;h4 id="linear-discriminant-analysis-lda">Linear Discriminant Analysis (LDA)&lt;/h4>
&lt;blockquote>
&lt;p>For more details about LDA, see: &lt;a href="https://haobin-tan.netlify.app/docs/ai/machine-learning/non-parametric/lda-summary/">LDA Summary&lt;/a>&lt;/p>
&lt;/blockquote>
&lt;ul>
&lt;li>A.k.a. &lt;strong>Fisher’s Linear Discriminant&lt;/strong>&lt;/li>
&lt;li>Preserves separability of classes&lt;/li>
&lt;li>Maximizes ratio of projected between-classes to projected within-class scatter&lt;/li>
&lt;/ul>
$$
W\_{\mathrm{fld}}=\arg \underset{W}{\max } \frac{\left|W^{T} S\_{B} W\right|}{\left|W^{T} S\_{W} W\right|}
$$
&lt;p>Where&lt;/p>
&lt;ul>
&lt;li>$S\_{B}=\sum\_{i=1}^{c}\left|x\_{i}\right|\left(\mu\_{i}-\mu\right)\left(\mu\_{i}-\mu\right)^{T}$: Between-class scatter
&lt;ul>
&lt;li>$c$: Number of classes&lt;/li>
&lt;li>$\mu\_i$: mean of class $X\_i$&lt;/li>
&lt;li>$|X\_i|$: number of samples of $X\_i$&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>$S\_{W}=\sum\_{i=1}^{c} \sum\_{x\_{k} \in X\_{i}}\left(x\_{k}-\mu\_{i}\right)\left(x\_{k}-\mu\_{i}\right)^{T}$: Within-class scatter&lt;/li>
&lt;/ul>
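&lt;p>A sketch of building the scatter matrices and finding the Fisher directions as eigenvectors of $S\_W^{-1} S\_B$ (toy 2D data; this assumes $S\_W$ is invertible, which is why Fisherfaces applies PCA before LDA):&lt;/p>

```python
import numpy as np

def fisher_directions(X, y, n_dirs=1):
    """Maximize |W^T S_B W| / |W^T S_W W| via the eigenvectors of S_W^{-1} S_B."""
    mu = X.mean(axis=0)
    d = X.shape[1]
    S_B, S_W = np.zeros((d, d)), np.zeros((d, d))
    for c in np.unique(y):
        Xc = X[y == c]
        diff = (Xc.mean(axis=0) - mu)[:, None]
        S_B += len(Xc) * diff @ diff.T                             # between-class scatter
        S_W += (Xc - Xc.mean(axis=0)).T @ (Xc - Xc.mean(axis=0))   # within-class scatter
    eigvals, eigvecs = np.linalg.eig(np.linalg.inv(S_W) @ S_B)
    order = np.argsort(eigvals.real)[::-1]                         # largest ratio first
    return eigvecs.real[:, order[:n_dirs]]
```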
&lt;p>&lt;strong>LDA vs. PCA&lt;/strong>&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-09%2012.25.35.png" alt="截屏2021-02-09 12.25.35" style="zoom:67%;" />
&lt;h4 id="lda-for-fisherfaces">LDA for Fisherfaces&lt;/h4>
&lt;p>Fisher’s Linear Discriminant&lt;/p>
&lt;ul>
&lt;li>projects away the within-class variation (lighting, expressions) found in training set&lt;/li>
&lt;li>preserves the separability of the classes.&lt;/li>
&lt;/ul>
&lt;h2 id="local-appearance-based-approach">Local appearance-based approach&lt;/h2>
&lt;p>Local vs Holistic approaches:&lt;/p>
&lt;ul>
&lt;li>Local variations in facial appearance (different expression, occlusion, lighting)
&lt;ul>
&lt;li>lead to modifications of the entire representation in holistic approaches,&lt;/li>
&lt;li>while in local approaches ONLY the corresponding local region is affected&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Face images exhibit different local statistics (high frequency at edges, low frequency in smooth regions). These varying statistics are easier to represent linearly with a local representation.&lt;/li>
&lt;li>Local approaches facilitate the weighting of each local region in terms of their effect on face recognition.&lt;/li>
&lt;/ul>
&lt;h3 id="modular-eigen-spaces">Modular Eigen Spaces&lt;/h3>
&lt;p>Classification using fiducial regions instead of using entire face &lt;sup id="fnref:2">&lt;a href="#fn:2" class="footnote-ref" role="doc-noteref">2&lt;/a>&lt;/sup>.&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-09%2012.59.14.png" alt="截屏2021-02-09 12.59.14">&lt;/p>
&lt;h3 id="local-pca-modular-pca">Local PCA (Modular PCA)&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>Face images are divided into $N$ smaller sub-images&lt;/p>
&lt;/li>
&lt;li>
&lt;p>PCA is applied on each of these sub-images&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-09%2013.01.08-20210723110742829.png" alt="截屏2021-02-09 13.01.08">&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Performed &lt;strong>better&lt;/strong> than global PCA on large variations of illumination and expression&lt;/p>
&lt;/li>
&lt;li>
&lt;p>No improvement under variation of pose&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h3 id="local-feature-based">Local Feature based&lt;/h3>
&lt;p>🎯 Objective: To mitigate the effect of expression, illumination, and occlusion variations by performing local analysis and by fusing the outputs of extracted local features at the feature or at the decision level.&lt;/p>
&lt;h4 id="gabor-filters">Gabor Filters&lt;/h4>
&lt;h4 id="elastic-bunch-graphs-ebg">Elastic Bunch Graphs (EBG)&lt;/h4>
&lt;h4 id="local-binary-pattern-lbp-histogram">Local Binary Pattern (LBP) Histogram&lt;/h4>
&lt;div class="footnotes" role="doc-endnotes">
&lt;hr>
&lt;ol>
&lt;li id="fn:1">
&lt;p>&lt;a href="http://cbcl.mit.edu/people/poggio/journals/brunelli-poggio-IEEE-PAMI-1993.pdf">http://cbcl.mit.edu/people/poggio/journals/brunelli-poggio-IEEE-PAMI-1993.pdf&lt;/a>&amp;#160;&lt;a href="#fnref:1" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:2">
&lt;p>Pentland, Moghaddam and Starner, &amp;ldquo;View-based and modular eigenspaces for face recognition,&amp;rdquo; &lt;em>1994 Proceedings of IEEE Conference on Computer Vision and Pattern Recognition&lt;/em>, Seattle, WA, USA, 1994, pp. 84-91, doi: 10.1109/CVPR.1994.323814.&amp;#160;&lt;a href="#fnref:2" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;/ol>
&lt;/div></description></item><item><title>Face Recognition: Features</title><link>https://haobin-tan.netlify.app/docs/ai/computer-vision/cv-lecture/06-face_recognition-features/</link><pubDate>Mon, 15 Feb 2021 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/computer-vision/cv-lecture/06-face_recognition-features/</guid><description>&lt;h2 id="local-appearance-based-face-recognition">Local Appearance-based Face Recognition&lt;/h2>
&lt;p>🎯 Objective: To mitigate the effect of expression, illumination, and occlusion variations by performing local analysis and by fusing the outputs of extracted local features at the feature or at the decision level.&lt;/p>
&lt;p>Some popular facial descriptions achieving good results&lt;/p>
&lt;ul>
&lt;li>Local Binary Pattern Histogram (LBPH)&lt;/li>
&lt;li>Gabor Feature&lt;/li>
&lt;li>Discrete Cosine Transform (DCT)&lt;/li>
&lt;li>SIFT&lt;/li>
&lt;li>etc.&lt;/li>
&lt;/ul>
&lt;h3 id="local-binary-pattern-histogram-lbph1">Local binary Pattern Histogram (LBPH)&lt;sup id="fnref:1">&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref">1&lt;/a>&lt;/sup>&lt;/h3>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-16%2011.10.39.png" alt="截屏2021-02-16 11.10.39" style="zoom:80%;" />
&lt;ul>
&lt;li>
&lt;p>Divide image into cells&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Compare each pixel to each of its neighbors&lt;/p>
&lt;ul>
&lt;li>Where a neighbor&amp;rsquo;s value is greater than or equal to the threshold value (here, the center pixel&amp;rsquo;s value), write &amp;ldquo;1&amp;rdquo;&lt;/li>
&lt;li>Otherwise, write &amp;ldquo;0&amp;rdquo;&lt;/li>
&lt;/ul>
&lt;p>$\rightarrow$ gives a binary number&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Convert binary into decimal&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Compute the histogram over the cell&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Use the histogram for classification&lt;/p>
&lt;ul>
&lt;li>SVM&lt;/li>
&lt;li>Histogram-distances&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
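&lt;p>The per-pixel code can be sketched for a single 3x3 neighborhood (using the common &amp;ldquo;greater than or equal to the center&amp;rdquo; comparison):&lt;/p>

```python
import numpy as np

def lbp_value(cell):
    """LBP code of the center pixel of a 3x3 cell: threshold the 8 neighbors
    against the center, read the bits clockwise, convert binary to decimal."""
    c = cell[1, 1]
    neighbors = [cell[0, 0], cell[0, 1], cell[0, 2], cell[1, 2],
                 cell[2, 2], cell[2, 1], cell[2, 0], cell[1, 0]]
    bits = ["1" if n >= c else "0" for n in neighbors]
    return int("".join(bits), 2)
```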
&lt;blockquote>
&lt;p>Tutorials and explanation:&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://towardsdatascience.com/face-recognition-how-lbph-works-90ec258c3d6b">Face Recognition: Understanding LBPH Algorithm&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://www.youtube.com/watch?v=h-z9-bMtd7w">how is the LBP |Local Binary Pattern| values calculated? ~ xRay Pixy&lt;/a>&lt;/li>
&lt;/ul>
&lt;/blockquote>
&lt;h3 id="high-dim-dense-local-feature-extraction">&lt;strong>High dim. dense local Feature Extraction&lt;/strong>&lt;/h3>
&lt;ul>
&lt;li>Computing features densely (e.g. on overlapping patches in many scales in the image)&lt;/li>
&lt;li>Problem: very very high dimensionality!!!&lt;/li>
&lt;li>Solution: Encode into a compact form
&lt;ul>
&lt;li>Bag of Visual Word (BoVW) model&lt;/li>
&lt;li>Fisher encoding&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h4 id="fisher-vector-encoding">Fisher Vector Encoding&lt;/h4>
&lt;ul>
&lt;li>Aggregates feature vectors into a compact representation&lt;/li>
&lt;li>Fitting a parametric generative model (e.g. Gaussian Mixture Model)&lt;/li>
&lt;li>Encode the derivative of the model&amp;rsquo;s log-likelihood w.r.t. its parameters&lt;/li>
&lt;/ul>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-16%2011.38.19.png" alt="截屏2021-02-16 11.38.19">&lt;/p>
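&lt;p>A minimal sketch of the most common Fisher-vector block, the gradient with respect to the GMM component means. It assumes the diagonal-covariance GMM (weights, means, standard deviations) has already been fitted to a descriptor corpus; full Fisher vectors typically also append variance gradients and apply power/L2 normalization, which are omitted here:&lt;/p>

```python
import numpy as np

def fisher_vector_means(X, weights, means, sigmas):
    """Gradient of the GMM log-likelihood w.r.t. the component means.
    X: (N, D) local descriptors; weights: (K,); means, sigmas: (K, D)."""
    N = X.shape[0]
    # Soft assignments (responsibilities) gamma_nk of descriptors to components.
    log_p = (np.log(weights)[None, :]
             - 0.5 * np.sum(np.log(2 * np.pi * sigmas**2)[None, :, :]
                            + ((X[:, None, :] - means[None, :, :])
                               / sigmas[None, :, :])**2, axis=2))
    gamma = np.exp(log_p - log_p.max(axis=1, keepdims=True))
    gamma /= gamma.sum(axis=1, keepdims=True)
    # Aggregate normalised residuals per component into one (K*D)-dim vector.
    fv = (gamma[:, :, None] * (X[:, None, :] - means[None, :, :])
          / sigmas[None, :, :]).sum(axis=0)
    fv /= (N * np.sqrt(weights)[:, None])
    return fv.ravel()
```

Note the output dimension is fixed (K·D) regardless of how many descriptors were extracted, which is exactly the compact encoding the slide describes.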
&lt;h2 id="face-recognition-across-pose-alignment">Face recognition across pose (Alignment)&lt;/h2>
&lt;p>Problem&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Different view-point / head orientation&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-16%2011.40.44.png" alt="截屏2021-02-16 11.40.44" style="zoom:80%;" />
&lt;/li>
&lt;li>
&lt;p>Recognition results degrade when images of different head orientations have to be matched 😭&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Major directions to address the face recognition across pose problem&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Geometric pose normalization (image affine warps)&lt;/li>
&lt;li>2D specific pose models, image rendering at pixel or feature level (2D+3D approaches)&lt;/li>
&lt;li>3D face Model fitting&lt;/li>
&lt;/ul>
&lt;h3 id="pose-normalization">Pose Normalization&lt;/h3>
&lt;h4 id="-idea">💡 &lt;strong>Idea&lt;/strong>&lt;/h4>
&lt;ul>
&lt;li>Find several facial features (mesh)&lt;/li>
&lt;li>Use complete mesh to normalize face&lt;/li>
&lt;/ul>
&lt;p>Here we will use &lt;strong>2D Active Appearance Models&lt;/strong>&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-16%2011.51.52.png" alt="截屏2021-02-16 11.51.52" style="zoom:80%;" />
&lt;ul>
&lt;li>
&lt;p>A texture and shape-based parametric model&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Efficient fitting algorithm: &lt;strong>Inverse compositional (IC)&lt;/strong> algorithm&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h4 id="model-and-fitting">Model and fitting&lt;/h4>
&lt;p>Independent shape and appearance model
&lt;/p>
$$
\begin{array}{c}
\text{shape:} \quad s=\left(x\_{1}, y\_{1}, x\_{2}, y\_{2}, \cdots, x\_{v}, y\_{v}\right)^{T}=s\_{0}+\sum\_{i=1}^{n} p\_{i} s\_{i} \\\\
\text{appearance:} \quad A(x)=A\_{0}(x)+\sum\_{i=1}^{m} \lambda\_{i} A\_{i}(x) \quad \forall x \in s\_{0}
\end{array}
$$
&lt;p>
Fitting goal:
&lt;/p>
$$
\arg \min \_{p, \lambda} \sum\_{x \in s\_{0}}\left[A\_{0}(x)+\sum\_{i=1}^{m} \lambda\_{i} A\_{i}(x)-I(W(x ; p))\right]^{2}
$$
&lt;p>
Fitting examples&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Fitted mesh&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-16%2012.02.54.png" alt="截屏2021-02-16 12.02.54">&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Mismatched mesh&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-16%2012.03.27.png" alt="截屏2021-02-16 12.03.27">&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>The fitted model can be used to warp the image to a frontal pose (e.g. using a piecewise affine transformation of the mesh triangles)&lt;/p>
&lt;figure>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-16%2012.13.08.png">&lt;figcaption>
&lt;h4>Faces with different poses from the FERET database and their pose-aligned images&lt;/h4>
&lt;/figcaption>
&lt;/figure>
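&lt;p>In a piecewise affine warp, each mesh triangle is mapped to its counterpart in the mean shape, and the 6-parameter affine transform of a triangle is fixed exactly by its three point correspondences. A minimal sketch of solving for one triangle&amp;rsquo;s 2×3 affine matrix (a full warp would additionally rasterize each triangle and sample the source image):&lt;/p>

```python
import numpy as np

def affine_from_triangle(src, dst):
    """Solve for the 2x3 affine matrix M with  dst = M @ [x, y, 1]^T,
    given three source/destination point pairs (one mesh triangle).
    Six unknowns, six equations: an exact solution."""
    A = np.hstack([np.asarray(src, float), np.ones((3, 1))])  # rows [x, y, 1]
    # Solve A @ M.T = dst for the 2x3 parameter matrix M.
    return np.linalg.solve(A, np.asarray(dst, float)).T
```

Any pixel inside the source triangle is then mapped by `M @ [x, y, 1]`.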
&lt;h4 id="results">Results&lt;/h4>
&lt;ul>
&lt;li>Much better results under pose variations compared to simple affine transform&lt;/li>
&lt;li>Different warping functions can be used
&lt;ul>
&lt;li>Piecewise affine transformation worked best&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Approach works well with local-DCT-based approach
&lt;ul>
&lt;li>but not so well with holistic approaches, such as Eigenfaces (PCA) 🤪&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h2 id="face-recogntion-using-3d-models2">Face Recogntion using 3D Models&lt;sup id="fnref:2">&lt;a href="#fn:2" class="footnote-ref" role="doc-noteref">2&lt;/a>&lt;/sup>&lt;/h2>
&lt;ul>
&lt;li>A method for face recognition across variations in pose and illumination.&lt;/li>
&lt;li>Simulates the process of image formation in 3D space.&lt;/li>
&lt;li>Estimates 3D shape and texture of faces from single images by fitting a statistical morphable model of 3D faces to images.&lt;/li>
&lt;li>Faces are represented by model parameters for 3D shape and texture.&lt;/li>
&lt;/ul>
&lt;h4 id="model-based-recognition">Model-based Recognition&lt;/h4>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-16%2012.19.23.png" alt="截屏2021-02-16 12.19.23" style="zoom:67%;" />
&lt;h4 id="face-vectors">Face vectors&lt;/h4>
&lt;p>The morphable face model is based on a vector space representation of faces that is constructed such that &lt;strong>any combination of shape and texture vectors $S\_i$ and $T\_i$ describes a realistic human face&lt;/strong>:
&lt;/p>
$$
S=\sum\_{i=1}^{m} a\_{i} S\_{i} \quad T=\sum\_{i=1}^{m} b\_{i} T\_{i}
$$
&lt;p>
The definition of shape and texture vectors is based on a reference face $\mathbf{I}\_0$.&lt;/p>
&lt;p>The location of the vertices of the mesh in Cartesian coordinates is $(x\_k, y\_k, z\_k)$ with colors $(R\_k, G\_k, B\_k)$&lt;/p>
&lt;p>Reference shape and texture vectors are defined by:
&lt;/p>
$$
\begin{array}{l}
S\_{0}=\left(x\_{1}, y\_{1}, z\_{1}, x\_{2}, \ldots, x\_{n}, y\_{n}, z\_{n}\right)^{T} \\\\
T\_{0}=\left(R\_{1}, G\_{1}, B\_{1}, R\_{2}, \ldots, R\_{n}, G\_{n}, B\_{n}\right)^{T}
\end{array}
$$
&lt;p>
To encode a novel scan $\mathbf{I}$, the flow field from $\mathbf{I}\_0$ to $\mathbf{I}$ is computed.&lt;/p>
&lt;h4 id="pca">PCA&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>PCA is performed on the set of shape and texture vectors separately.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Eigenvectors form an orthogonal basis:
&lt;/p>
$$
\mathbf{S}=\overline{\mathbf{s}}+\sum\_{i=1}^{m-1} \alpha\_{i} \cdot \mathbf{s}\_{i}, \quad \mathbf{T}=\overline{\mathbf{t}}+\sum\_{i=1}^{m-1} \beta\_{i} \cdot \mathbf{t}\_{i}
$$
&lt;/li>
&lt;li>
&lt;p>Example&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-16%2020.36.08.png" alt="截屏2021-02-16 20.36.08" style="zoom:67%;" />
&lt;/li>
&lt;/ul>
&lt;h3 id="model-based-image-analysis">Model-based Image Analysis&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>🎯 Goal: find shape and texture coefficients describing a 3D face model such that rendering produces an image $\mathbf{I}\_{\text{model}}$ that is as similar as possible to $\mathbf{I}\_{\text{input}}$&lt;/p>
&lt;/li>
&lt;li>
&lt;p>For initialization, 7 facial feature points, such as the corners of the eyes or the tip of the nose, should be labelled manually&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-16%2020.38.43.png" alt="截屏2021-02-16 20.38.43">&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Model fitting: Minimize
&lt;/p>
$$
E\_{I}=\sum\_{x, y}\left\\|\mathbf{I}\_{\text {input }}(x, y)-\mathbf{I}\_{\text {model }}(x, y)\right\\|^{2}
$$
&lt;ul>
&lt;li>Shape, texture, transformation, and illumination are optimized for the entire face and refined for each segment.&lt;/li>
&lt;li>Complex iterative optimization procedure&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
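&lt;p>As a toy illustration, the cost $E\_I$ itself is just a sum of squared per-pixel differences; the hard (and omitted) part of the method is rendering $\mathbf{I}\_{\text{model}}$ from the current shape, texture, camera, and illumination parameters inside the iterative optimization:&lt;/p>

```python
import numpy as np

def image_error(I_input, I_model):
    """E_I: sum of squared per-pixel colour differences between the
    input image and the current model rendering."""
    return np.sum((np.asarray(I_input, float) - np.asarray(I_model, float))**2)
```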
&lt;h2 id="databases">Databases&lt;/h2>
&lt;ul>
&lt;li>Necessary to develop and improve algorithms&lt;/li>
&lt;li>Provide common testbeds and benchmarks which allow for comparing different approaches&lt;/li>
&lt;li>Different databases focus on different problems&lt;/li>
&lt;/ul>
&lt;p>Well-known databases for face recognition&lt;/p>
&lt;ul>
&lt;li>FERET&lt;/li>
&lt;li>FRVT&lt;/li>
&lt;li>FRGC&lt;/li>
&lt;li>CMU-PIE&lt;/li>
&lt;li>BANCA&lt;/li>
&lt;li>XM2VTS&lt;/li>
&lt;li>&amp;hellip;&lt;/li>
&lt;/ul>
&lt;h3 id="observations">Observations&lt;/h3>
&lt;ul>
&lt;li>One 3-D image is &lt;em>more powerful&lt;/em> for face recognition than one 2-D image.&lt;/li>
&lt;li>One high resolution 2-D image is &lt;em>more powerful&lt;/em> for face recognition than one 3-D image.&lt;/li>
&lt;li>Using 4 or 5 well-chosen 2-D face images is &lt;em>more powerful&lt;/em> for face recognition than one 3-D face image or multi-modal 3D+2D face.&lt;/li>
&lt;/ul>
&lt;h4 id="wild-face-datasets">Wild Face Datasets&lt;/h4>
&lt;h4 id="labeled-faces-in-the-wild-dataset-lfw">&lt;strong>Labeled Faces In the Wild Dataset (LFW)&lt;/strong>&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>Face Verification: Given a pair of images, specify whether they belong to the same person&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-16%2020.44.55.png" alt="截屏2021-02-16 20.44.55" style="zoom:80%;" />
&lt;/li>
&lt;li>
&lt;p>13K images, 5.7K people&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Standard benchmark in the community&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Several test protocols depending upon availability of training data within and outside the dataset.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h4 id="youtube-faces-dataset-ytf">&lt;strong>YouTube Faces Dataset (YTF)&lt;/strong>&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>Video Face Verification: Given a pair of videos, specify whether they belong to the same person&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-16%2020.46.03.png" alt="截屏2021-02-16 20.46.03" style="zoom:80%;" />
&lt;/li>
&lt;li>
&lt;p>3425 videos, 1595 people&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Standard benchmark in the community&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Wide pose, expression and illumination variation&lt;/p>
&lt;/li>
&lt;/ul>
&lt;div class="footnotes" role="doc-endnotes">
&lt;hr>
&lt;ol>
&lt;li id="fn:1">
&lt;p>T. Ahonen, A. Hadid and M. Pietikainen, &amp;ldquo;Face Description with Local Binary Patterns: Application to Face Recognition,&amp;rdquo; in IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 12, pp. 2037-2041, Dec. 2006, doi: 10.1109/TPAMI.2006.244.&amp;#160;&lt;a href="#fnref:1" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:2">
&lt;p>V. Blanz and T. Vetter, &amp;ldquo;Face recognition based on fitting a 3D morphable model,&amp;rdquo; in IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 25, no. 9, pp. 1063-1074, Sept. 2003, doi: 10.1109/TPAMI.2003.1227983.&amp;#160;&lt;a href="#fnref:2" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;/ol>
&lt;/div></description></item><item><title>Face Recognition: Deep Learning</title><link>https://haobin-tan.netlify.app/docs/ai/computer-vision/cv-lecture/07-face_recognition-deep_learning/</link><pubDate>Mon, 15 Feb 2021 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/computer-vision/cv-lecture/07-face_recognition-deep_learning/</guid><description>&lt;h2 id="deepface-1">DeepFace &lt;sup id="fnref:1">&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref">1&lt;/a>&lt;/sup>&lt;/h2>
&lt;h3 id="main-idea">Main idea&lt;/h3>
&lt;p>Learn a deep NN (7 layers, 20 million parameters) directly on the RGB pixels of 4 million identity-labeled face images.&lt;/p>
&lt;h3 id="alignment">Alignment&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>Use 6 fiducial points for 2D warp&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Then 67 points for 3D model&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Frontalize the face for input to NN&lt;/p>
&lt;/li>
&lt;/ul>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-16%2020.51.01.png" alt="截屏2021-02-16 20.51.01" style="zoom:67%;" />
&lt;h3 id="representation">Representation&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>The output is fed into a $k$-way softmax, which generates a probability distribution over class labels.&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-16%2020.52.06.png" alt="截屏2021-02-16 20.52.06">&lt;/p>
&lt;/li>
&lt;li>
&lt;p>🎯 Goal of training: &lt;strong>maximize the probability of the correct class&lt;/strong>&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h2 id="facenet2">FaceNet&lt;sup id="fnref:2">&lt;a href="#fn:2" class="footnote-ref" role="doc-noteref">2&lt;/a>&lt;/sup>&lt;/h2>
&lt;h4 id="idea">💡Idea&lt;/h4>
&lt;ul>
&lt;li>Map images to a compact Euclidean space, where &lt;strong>distances correspond to face similarity&lt;/strong>&lt;/li>
&lt;li>Find $f(x)\ \in \mathbb{R}^d$ for image $x$, so that
&lt;ul>
&lt;li>$d^2(f(x\_1), f(x\_2)) \rightarrow \text{small}$, if $x\_1, x\_2 \in \text{same identity}$&lt;/li>
&lt;li>$d^2(f(x\_1), f(x\_3)) \rightarrow \text{large}$, if $x\_1, x\_3 \in \text{different identities}$&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="system-architecture">System architecture&lt;/h3>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-16%2021.04.45.png" alt="截屏2021-02-16 21.04.45">&lt;/p>
&lt;ul>
&lt;li>CNN: optimized embedding&lt;/li>
&lt;li>Triplet-based loss function: training&lt;/li>
&lt;/ul>
&lt;h3 id="triplet-loss">Triplet loss&lt;/h3>
&lt;p>Image triplets:&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-16%2021.06.14.png" alt="截屏2021-02-16 21.06.14" style="zoom:67%;" />
$$
\begin{array}{c}
\left\\|f\left(x\_{i}^{a}\right)-f\left(x\_{i}^{p}\right)\right\\|\_{2}^{2}+\alpha&lt;\left\\|f\left(x\_{i}^{a}\right)-f\left(x\_{i}^{n}\right)\right\\|\_{2}^{2} \\\\
\forall\left(f\left(x\_{i}^{a}\right), f\left(x\_{i}^{p}\right), f\left(x\_{i}^{n}\right)\right) \in \mathcal{T}
\end{array}
$$
where
&lt;ul>
&lt;li>
&lt;p>$x\_i^a$: Anchor image&lt;/p>
&lt;/li>
&lt;li>
&lt;p>$x\_i^p$: Positive image&lt;/p>
&lt;/li>
&lt;li>
&lt;p>$x\_i^n$: Negative image&lt;/p>
&lt;/li>
&lt;li>
&lt;p>$\mathcal{T}$: Set of all possible triplets in the training set&lt;/p>
&lt;/li>
&lt;li>
&lt;p>$\alpha$: Margin between positive and negative pairs&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>Total Loss function to be minimized:
&lt;/p>
$$
L=\sum\_{i=1}^{N}\left[\left\\|f\left(x\_{i}^{a}\right)-f\left(x\_{i}^{p}\right)\right\\|\_{2}^{2}-\left\\|f\left(x\_{i}^{a}\right)-f\left(x\_{i}^{n}\right)\right\\|\_{2}^{2}+\alpha\right]\_{+}
$$
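&lt;p>A minimal NumPy sketch of the triplet loss over a batch of precomputed embeddings (in practice this is implemented inside a deep-learning framework so gradients flow back into the CNN):&lt;/p>

```python
import numpy as np

def triplet_loss(f_a, f_p, f_n, alpha=0.2):
    """Triplet loss over a batch of embeddings.
    f_a, f_p, f_n: (N, d) anchor / positive / negative embeddings."""
    d_ap = np.sum((f_a - f_p)**2, axis=1)   # squared anchor-positive distance
    d_an = np.sum((f_a - f_n)**2, axis=1)   # squared anchor-negative distance
    # Hinge: only triplets that violate the margin contribute.
    return np.sum(np.maximum(0.0, d_ap - d_an + alpha))
```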
&lt;h3 id="triplet-selection">Triplet selection&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>Online Generation&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Select only the &lt;strong>semi-hard negatives&lt;/strong> and use all anchor-positive pairs of a mini-batch&lt;/p>
&lt;p>$\rightarrow$ Select $x\_i^n$ such that
&lt;/p>
$$
\left\\|f\left(x\_{i}^{a}\right)-f\left(x\_{i}^{p}\right)\right\\|\_{2}^{2}&lt;\left\\|f\left(x\_{i}^{a}\right)-f\left(x\_{i}^{n}\right)\right\\|\_{2}^{2}
$$
&lt;/li>
&lt;/ul>
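&lt;p>A literal reading of the selection criterion above, sketched for one anchor-positive pair: among the candidate negatives of the mini-batch, keep those farther from the anchor than the positive, and pick the closest of them (the hardest negative that still satisfies the constraint). The fallback for when no candidate qualifies is an assumption made here for robustness:&lt;/p>

```python
import numpy as np

def pick_semi_hard_negative(f_a, f_p, f_negs):
    """Return the index of a semi-hard negative for one (anchor, positive)
    pair.  f_a, f_p: (d,) embeddings; f_negs: (M, d) candidate negatives."""
    d_ap = np.sum((f_a - f_p)**2)
    d_an = np.sum((f_a - f_negs)**2, axis=1)
    valid = np.where(d_an > d_ap)[0]        # criterion: d_ap < d_an
    if valid.size:
        return int(valid[np.argmin(d_an[valid])])
    return int(np.argmin(d_an))             # fallback: hardest negative
```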
&lt;h3 id="results">Results&lt;/h3>
&lt;ul>
&lt;li>LFW: 99.63% $\pm$ 0.09&lt;/li>
&lt;li>YouTube Faces DB: 95.12% $\pm$ 0.39&lt;/li>
&lt;/ul>
&lt;h2 id="deep-face-recognition-3">Deep Face Recognition &lt;sup id="fnref:3">&lt;a href="#fn:3" class="footnote-ref" role="doc-noteref">3&lt;/a>&lt;/sup>&lt;/h2>
&lt;p>&lt;strong>Key Questions&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Can large scale datasets be built with minimal human intervention? &lt;a href="#dataset-collection">Yes&lt;/a>!&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Can we propose a convolutional neural network which can compete with that of internet giants like Google and Facebook? &lt;a href="#convolutional-neural-network">Yes&lt;/a>!&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h3 id="dataset-collection">Dataset Collection&lt;/h3>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Candidate list generation&lt;/strong>: &lt;strong>Finding names of celebrities&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Tap the knowledge on the web&lt;/li>
&lt;li>5000 identities&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Manual verification of celebrities: Finding Popular Celebrities&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Collect representative images for each celebrity&lt;/p>
&lt;/li>
&lt;li>
&lt;p>200 images/identity&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Remove people with low representation on Google.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Remove overlap with public benchmarks&lt;/p>
&lt;/li>
&lt;li>
&lt;p>2622 celebrities for the final dataset&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Rank image sets&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>2000 images per identity&lt;/li>
&lt;li>Searching by appending keyword “actor”&lt;/li>
&lt;li>Learning a classifier using the data obtained in the previous step.&lt;/li>
&lt;li>Ranking 2000 images and selecting top 1000 images&lt;/li>
&lt;li>Approx. 2.6 Million images of 2622 celebrities&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Near duplicate removal&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>VLAD descriptor based near duplicate removal&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Manual filtering&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Curating the dataset further using manual checks&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol>
&lt;h3 id="convolutional-neural-network">Convolutional Neural Network&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>The “Very Deep” Architecture&lt;/p>
&lt;ul>
&lt;li>
&lt;p>3 x 3 Convolution Kernels (Very small)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Conv. Stride 1 px.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>ReLU non-linearity&lt;/p>
&lt;/li>
&lt;li>
&lt;p>No local contrast normalisation&lt;/p>
&lt;/li>
&lt;li>
&lt;p>3 Fully connected layers&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Training&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Random Gaussian Initialization&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Stochastic Gradient Descent with back prop.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Batch Size: 256&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Incremental FC layer training&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Learning Task Specific Embedding&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Learning embedding by minimizing triplet loss
&lt;/p>
$$
\sum\_{(a, p, n) \in T} \max \left\\{0, \alpha-\left\\|\mathbf{x}\_{a}-\mathbf{x}\_{n}\right\\|\_{2}^{2}+\left\\|\mathbf{x}\_{a}-\mathbf{x}\_{p}\right\\|\_{2}^{2}\right\\}
$$
&lt;/li>
&lt;li>
&lt;p>Learning a projection from 4096 to 1024 dimensions&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Online triplet formation at the beginning of each iteration&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Fine tuned on target datasets&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Only the projection layers are learnt&lt;/strong>&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;div class="footnotes" role="doc-endnotes">
&lt;hr>
&lt;ol>
&lt;li id="fn:1">
&lt;p>Y. Taigman, M. Yang, M. Ranzato and L. Wolf, &amp;ldquo;DeepFace: Closing the Gap to Human-Level Performance in Face Verification,&amp;rdquo; 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 2014, pp. 1701-1708, doi: 10.1109/CVPR.2014.220.&amp;#160;&lt;a href="#fnref:1" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:2">
&lt;p>Schroff, Florian &amp;amp; Kalenichenko, Dmitry &amp;amp; Philbin, James. (2015). FaceNet: A unified embedding for face recognition and clustering. 815-823. 10.1109/CVPR.2015.7298682.&amp;#160;&lt;a href="#fnref:2" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:3">
&lt;p>Omkar M. Parkhi, Andrea Vedaldi and Andrew Zisserman. Deep Face Recognition. In Xianghua Xie, Mark W. Jones, and Gary K. L. Tam, editors, Proceedings of the British Machine Vision Conference (BMVC), pages 41.1-41.12. BMVA Press, September 2015.&amp;#160;&lt;a href="#fnref:3" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;/ol>
&lt;/div></description></item><item><title>Facial Feature Detection</title><link>https://haobin-tan.netlify.app/docs/ai/computer-vision/cv-lecture/08-facial-features-detection/</link><pubDate>Thu, 18 Feb 2021 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/computer-vision/cv-lecture/08-facial-features-detection/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;h3 id="what-are-facial-features">What are facial features?&lt;/h3>
&lt;p>Facial features are the &lt;strong>salient parts of a face region which carry meaningful information&lt;/strong>.&lt;/p>
&lt;ul>
&lt;li>E.g. eye, eyebrow, nose, mouth&lt;/li>
&lt;li>A.k.a &lt;mark>&lt;strong>facial landmarks&lt;/strong>&lt;/mark>&lt;/li>
&lt;/ul>
&lt;h3 id="what-is-facial-feature-detection">What is facial feature detection?&lt;/h3>
&lt;p>Facial feature detection is defined as methods of &lt;strong>locating the specific areas of a face&lt;/strong>.&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-18%2023.13.42.png" alt="截屏2021-02-18 23.13.42" style="zoom:80%;" />
&lt;h3 id="applications-of-facial-feature-detection">Applications of facial feature detection&lt;/h3>
&lt;ul>
&lt;li>Face recognition&lt;/li>
&lt;li>Model-based head pose estimation&lt;/li>
&lt;li>Eye gaze tracking&lt;/li>
&lt;li>Facial expression recognition&lt;/li>
&lt;li>Age modeling&lt;/li>
&lt;/ul>
&lt;h3 id="problems-in-facial-feature-detection">Problems in facial feature detection&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Identity variations&lt;/strong>&lt;/p>
&lt;p>Each person has unique facial parts&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Expression variations&lt;/strong>&lt;/p>
&lt;p>Some facial features change their state (e.g. eye blinks).&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Head rotations&lt;/strong>&lt;/p>
&lt;p>If a head orientation changes, the visual appearance also changes.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Scale variations&lt;/strong>&lt;/p>
&lt;p>Changes in resolution and distance to the camera affect appearance.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Lighting conditions&lt;/strong>&lt;/p>
&lt;p>Light has non-linear effects on the pixel values of an image.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Occlusions&lt;/strong>&lt;/p>
&lt;p>Hair or glasses might hide facial features.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h2 id="older-approaches-from-face-detection">Older approaches (from face detection)&lt;/h2>
&lt;ul>
&lt;li>Integral projections + geometric constraints&lt;/li>
&lt;li>Haar-Filter Cascades&lt;/li>
&lt;li>PCA-based methods (Modular Eigenspace)&lt;/li>
&lt;li>Morphable 3D Model&lt;/li>
&lt;/ul>
&lt;h2 id="statistical-appearance-models">Statistical appearance models&lt;/h2>
&lt;ul>
&lt;li>💡 Idea: make use of prior-knowledge, i.e. models, to reduce the complexity of the task&lt;/li>
&lt;li>Needs to be able to deal with variability $\rightarrow$ &lt;strong>deformable models&lt;/strong>&lt;/li>
&lt;li>Use statistical models of shape and texture to find facial landmark points&lt;/li>
&lt;li>Good models should
&lt;ul>
&lt;li>Capture the various characteristics of the object to be detected&lt;/li>
&lt;li>Be a compact representation in order to avoid heavy calculation&lt;/li>
&lt;li>Be robust against noise&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="basic-idea">Basic idea&lt;/h3>
&lt;ol>
&lt;li>&lt;strong>Training&lt;/strong> stage: construction of models&lt;/li>
&lt;li>&lt;strong>Test&lt;/strong> stage: Search the region of interest (ROI)&lt;/li>
&lt;/ol>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-19%2011.30.55.png" alt="截屏2021-02-19 11.30.55" style="zoom:80%;" />
&lt;h3 id="appearance-models">Appearance models&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>Represent both &lt;strong>texture&lt;/strong> and &lt;strong>shape&lt;/strong>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Statistical model learned from training data&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Modeling shape variability&lt;/p>
&lt;ul>
&lt;li>Landmark points&lt;/li>
&lt;/ul>
$$
x=\left[x\_{1}, y\_{1}, x\_{2}, y\_{2}, \ldots, x\_{n}, y\_{n}\right]^{T}
$$
&lt;ul>
&lt;li>
&lt;p>Model
&lt;/p>
$$
x \approx \bar{x}+P\_{s} b\_{s}
$$
&lt;ul>
&lt;li>$\bar{x}$: Mean vector&lt;/li>
&lt;li>$P\_s$: Eigenvectors of covariance matrix&lt;/li>
&lt;li>$b\_s = P\_s^T(x - \bar{x})$&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Modeling intensity variability:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Gray values
&lt;/p>
$$
h=\left[g\_{1}, g\_{2}, \ldots, g\_{k}\right]^{T}
$$
&lt;/li>
&lt;li>
&lt;p>Model
&lt;/p>
$$
h \approx \bar{h} + P\_ib\_i
$$
&lt;ul>
&lt;li>$\bar{h}$: Mean vector&lt;/li>
&lt;li>$P\_i$: Eigenvectors of covariance matrix&lt;/li>
&lt;li>$b\_i = P\_i^T(h - \bar{h})$&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="training-of-appearance-models">Training of appearance models&lt;/h3>
&lt;h4 id="1-construct-a-shape-model-with-principal-component-analysis-pca">1. Construct a shape model with Principal component analysis (PCA)&lt;/h4>
&lt;p>A shape is represented with manually labeled points.&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-19%2011.40.18.png" alt="截屏2021-02-19 11.40.18">&lt;/p>
&lt;p>The shape model approximates the shape of an object.&lt;/p>
&lt;h5 id="procrustes-analysis">&lt;strong>Procrustes Analysis&lt;/strong>&lt;/h5>
&lt;p>Align all the shapes together to remove translation, rotation, and scaling&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-19%2011.51.46.png" alt="截屏2021-02-19 11.51.46">&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-19%2011.52.05.png" alt="截屏2021-02-19 11.52.05">&lt;/p>
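&lt;p>The alignment of one shape to a reference can be sketched as ordinary Procrustes analysis: subtract the centroid (translation), normalize the norm (scale), and recover the optimal rotation from an SVD. A minimal NumPy sketch (the full algorithm iterates this against a re-estimated mean shape):&lt;/p>

```python
import numpy as np

def procrustes_align(shape, reference):
    """Align one shape (n, 2) to a reference shape by removing
    translation, scale, and rotation."""
    s = np.asarray(shape, float)
    r = np.asarray(reference, float)
    s = s - s.mean(axis=0)          # remove translation
    r = r - r.mean(axis=0)
    s /= np.linalg.norm(s)          # remove scale
    r /= np.linalg.norm(r)
    # Optimal rotation from the SVD of the cross-covariance
    # (ignoring the rare reflection case for brevity).
    u, _, vt = np.linalg.svd(s.T @ r)
    return s @ (u @ vt)
```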
&lt;p>&lt;strong>PCA&lt;/strong>&lt;/p>
&lt;p>The positions of labeled points are
&lt;/p>
$$
x = \bar{x}+P\_{s} b\_{s}
$$
&lt;ul>
&lt;li>$\bar{x}$: Mean shape&lt;/li>
&lt;li>$P\_s$: Orthogonal modes of variation obtained by PCA&lt;/li>
&lt;li>$b\_s$: Shape parameters in the projected space&lt;/li>
&lt;/ul>
&lt;p>The shapes are represented with fewer parameters ($\operatorname{Dim}(x) > \operatorname{Dim}(b\_s)$)&lt;/p>
&lt;p>Generating plausible shapes:&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-19%2012.38.11.png" alt="截屏2021-02-19 12.38.11" style="zoom:80%;" />
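&lt;p>Building the linear shape model $x \approx \bar{x}+P\_{s} b\_{s}$ from aligned training shapes can be sketched directly with an eigendecomposition of the covariance matrix; the 98% variance threshold below is an arbitrary choice for illustration:&lt;/p>

```python
import numpy as np

def build_shape_model(shapes, var_kept=0.98):
    """Fit the model  x = mean + P @ b  from training shapes.
    shapes: (N, 2n) stacked landmark vectors (already aligned)."""
    mean = shapes.mean(axis=0)
    centered = shapes - mean
    cov = centered.T @ centered / (shapes.shape[0] - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)          # ascending order
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]
    # Keep just enough modes to explain `var_kept` of the variance.
    k = int(np.searchsorted(np.cumsum(eigvals) / eigvals.sum(), var_kept)) + 1
    return mean, eigvecs[:, :k], eigvals[:k]

def project(x, mean, P):
    """b = P^T (x - mean): shape parameters of a new shape."""
    return P.T @ (x - mean)

def reconstruct(b, mean, P):
    """x = mean + P @ b: shape from parameters."""
    return mean + P @ b
```

Because few modes carry most of the variance, $\operatorname{Dim}(b\_s)$ is much smaller than $\operatorname{Dim}(x)$.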
&lt;h4 id="2-construct-a-texture-model-which-represents-grey-scale-or-color-values-at-each-point">2. Construct a texture model which represents grey-scale (or color) values at each point&lt;/h4>
&lt;p>Warp the image so that the labeled points fit on the mean shape&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-19%2012.39.55.png" alt="截屏2021-02-19 12.39.55" style="zoom:80%;" />
&lt;p>Then normalize the intensity on the &lt;em>shape-free&lt;/em> patch.&lt;/p>
&lt;h5 id="texture-warping">Texture warping&lt;/h5>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-19%2012.58.42.png" alt="截屏2021-02-19 12.58.42" />
&lt;h4 id="texture-model">Texture model&lt;/h4>
&lt;p>The pixel values on the shape-free patch
&lt;/p>
$$
g = \bar{g} + P\_g b\_g
$$
&lt;ul>
&lt;li>$\bar{g}$ : Mean of normalized pixel values&lt;/li>
&lt;li>$P\_g$ : Orthogonal modes of variation obtained by PCA&lt;/li>
&lt;li>$b\_g$: Texture parameters in the projected space&lt;/li>
&lt;/ul>
&lt;p>The pixel values (appearance) are represented with fewer parameters ($\operatorname{Dim}(g) > \operatorname{Dim}(b\_g)$)&lt;/p>
&lt;h4 id="3-model-the-correlation-between-shapes-and-grey-level-models">3. Model the correlation between shapes and grey-level models&lt;/h4>
&lt;p>The concatenated vector is
&lt;/p>
$$
b=\left(\begin{array}{c}
W\_{s} b\_{s} \\\\
b\_{g}
\end{array}\right)
$$
&lt;p>
Apply PCA:
&lt;/p>
$$
b=P\_{c} c=\left(\begin{array}{l}
P\_{c s} \\\\
P\_{c g}
\end{array}\right)c
$$
&lt;p>
Now the parameter $\mathbf{c}$ can control both shape and grey-level models&lt;/p>
&lt;ul>
&lt;li>
&lt;p>The shape model
&lt;/p>
$$
x=\bar{x}+P\_{s} W\_{s}^{-1} P\_{c s} c
$$
&lt;/li>
&lt;li>
&lt;p>The grey-level model
&lt;/p>
$$
g=\bar{g}+P\_{g} P\_{c g} c
$$
&lt;/li>
&lt;/ul>
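&lt;p>A minimal sketch of this third step: stack the weighted shape parameters and texture parameters of every training sample, run a further PCA on the concatenation, and obtain one combined parameter vector $c$ per face. Array shapes and the use of a full-rank SVD are illustrative assumptions:&lt;/p>

```python
import numpy as np

def combine_models(B_s, B_g, W_s):
    """Combined PCA over concatenated (W_s b_s ; b_g) vectors.
    B_s: (N, k_s) shape params, B_g: (N, k_g) texture params,
    W_s: (k_s, k_s) weight matrix balancing shape vs. texture units."""
    B = np.hstack([B_s @ W_s.T, B_g])               # rows are the b vectors
    mean = B.mean(axis=0)
    _, _, vt = np.linalg.svd(B - mean, full_matrices=False)
    P_c = vt.T                                      # combined eigenvectors
    C = (B - mean) @ P_c                            # parameters c per sample
    return P_c, C
```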
&lt;p>&lt;strong>Examples of synthesized faces&lt;/strong>&lt;/p>
&lt;p>Various objects can be synthesized by controlling the parameter $\mathbf{c}$&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-19%2013.07.53.png" alt="截屏2021-02-19 13.07.53" style="zoom:80%;" />
&lt;h3 id="dataset-for-building-model">Dataset for Building Model&lt;/h3>
&lt;p>IMM data set from Danish Technical University&lt;/p>
&lt;ul>
&lt;li>
&lt;p>240 images of size 640×480; 40 individuals, with 36 males and 4 females.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Each Subject 6 shots, with different pose, expressions and illuminations.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Each image is labeled with 58 landmarks; 3 closed and 4 open point-paths.&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-19%2013.09.03.png" alt="截屏2021-02-19 13.09.03">&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h3 id="image-interpretation-with-models">Image Interpretation with Models&lt;/h3>
&lt;ul>
&lt;li>🎯 &lt;strong>Goal: find the set of parameters which best match the model to the image&lt;/strong>
&lt;ul>
&lt;li>Optimize some cost function&lt;/li>
&lt;li>Difficult optimization problem&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Set of parameters
&lt;ul>
&lt;li>Defines shape, position, appearance&lt;/li>
&lt;li>Can be used for further processing
&lt;ul>
&lt;li>
&lt;p>Position of landmarks&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Face recognition&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Facial expression recognition&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Pose estimation&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Problem: Optimizing the model fit
&lt;ul>
&lt;li>&lt;a href="#active-shape-models-asm">Active Shape Models&lt;/a>&lt;/li>
&lt;li>&lt;a href="#active-appearance-models-aam">Active Appearance Models&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="active-shape-models-asm">Active Shape Models (ASM)&lt;/h3>
&lt;p>Given a rough starting position, create an instance $\mathbf{X}$ of the model using&lt;/p>
&lt;ul>
&lt;li>shape parameters $b$&lt;/li>
&lt;li>translation $T=(X\_t,Y\_t)$&lt;/li>
&lt;li>scale $s$&lt;/li>
&lt;li>rotation $\theta$&lt;/li>
&lt;/ul>
&lt;p>Iterative approach:&lt;/p>
&lt;ol>
&lt;li>Examine the region of the image around each point $\mathbf{X}\_i$ to find the best nearby match $\mathbf{X}\_i^\prime$&lt;/li>
&lt;li>Update parameters $(b, T, s, \theta)$ to best fit the new points $\mathbf{X}$ (constrain the model parameters to be within three standard deviations)&lt;/li>
&lt;li>Repeat until convergence&lt;/li>
&lt;/ol>
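&lt;p>The iteration above can be sketched in a heavily simplified form: shape parameters only (no pose parameters $T, s, \theta$), with the profile-normal image search abstracted into a &lt;code>find_best_match&lt;/code> callback and unit-variance modes assumed for the $\pm 3$ standard-deviation clamp:&lt;/p>

```python
import numpy as np

def asm_fit(x0, mean, P, find_best_match, n_iter=20, limit=3.0):
    """Simplified ASM loop: (1) move each point to its best nearby match,
    (2) project onto the shape model and clamp b to +-3 std devs,
    (3) repeat.  x0, mean: (2n,) shape vectors; P: (2n, k) modes."""
    x = x0.copy()
    std = np.ones(P.shape[1])   # assumption: unit-variance modes
    for _ in range(n_iter):
        targets = find_best_match(x)               # step 1: image search
        b = P.T @ (targets - mean)                 # step 2: fit the model
        b = np.clip(b, -limit * std, limit * std)  # plausibility constraint
        x = mean + P @ b
    return x
```

The clamp is what keeps the fitted shape a plausible instance of the statistical model even when individual point matches are bad.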
&lt;p>In practice: &lt;strong>search along profile normals&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>The optimal parameters are searched from &lt;strong>multi-resolution&lt;/strong> images hierarchically (faster algorithm)&lt;/p>
&lt;ol>
&lt;li>Search for the object in a coarse image&lt;/li>
&lt;li>Refine the location in a series of higher resolution images.&lt;/li>
&lt;/ol>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-19%2013.31.48.png" alt="截屏2021-02-19 13.31.48">&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Example of search&lt;/strong>&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-19%2013.32.21.png" alt="截屏2021-02-19 13.32.21" style="zoom:80%;" />
&lt;h4 id="disadvantages">Disadvantages&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>Uses mainly shape constraints for search&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Does not take advantage of texture across the target&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h3 id="active-appearance-models-aam">Active Appearance Models (AAM)&lt;/h3>
&lt;ul>
&lt;li>Optimize parameters, so as to minimize the difference of a synthesized image and the target image&lt;/li>
&lt;li>Solved using a gradient-descent approach&lt;/li>
&lt;/ul>
&lt;h4 id="fitting-aams">Fitting AAMs&lt;/h4>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-19%2015.51.19.png" alt="截屏2021-02-19 15.51.19" style="zoom:80%;" />
&lt;p>Learning linear relation matrix $\mathbf{R}$ using multi-variate linear regression&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Generate training set by perturbing model parameters for training images&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Include small displacements in position, scale, and orientation&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Record perturbation and image difference&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Experimentally, optimal perturbation around 0.5 standard deviations for each parameter&lt;/p>
&lt;/li>
&lt;/ul>
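&lt;p>Once the perturbation/image-difference training pairs have been generated (the rendering step is not shown here), learning $\mathbf{R}$ is a single multi-variate least-squares fit. A minimal sketch under that assumption:&lt;/p>

```python
import numpy as np

def learn_update_matrix(perturbations, residuals):
    """Fit R such that  delta_params = R @ delta_image  in the
    least-squares sense.
    perturbations: (N, p) known parameter displacements,
    residuals:     (N, m) the image differences they produced."""
    # Solve residuals @ R.T = perturbations for R.T by least squares.
    R_T, *_ = np.linalg.lstsq(residuals, perturbations, rcond=None)
    return R_T.T   # (p, m)
```

At test time, each AAM iteration multiplies the current image difference by $\mathbf{R}$ to predict a parameter update.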
&lt;h3 id="asm-vs-aam">ASM vs. AAM&lt;/h3>
&lt;p>&lt;strong>ASM&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Seeks to match a set of model points to an image, constrained by a statistical model of shape&lt;/li>
&lt;li>Matches model points using an &lt;strong>iterative&lt;/strong> technique (a variant of the EM algorithm)&lt;/li>
&lt;li>A search is made around the current position of each point to find a nearby point which best matches texture for the landmark&lt;/li>
&lt;li>Parameters of the shape model are then updated to move the model points closer to the new points in the image&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>AAM&lt;/strong>: matches both position of model points and representation of texture of the object to an image&lt;/p>
&lt;ul>
&lt;li>Uses the difference between current synthesized image and target image to update parameters&lt;/li>
&lt;li>Typically, fewer landmark points are needed&lt;/li>
&lt;/ul>
&lt;h3 id="summary-of-asm-and-aam">Summary of ASM and AAM&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>Statistical appearance models provide a compact representation&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Can model variations such as different identities, facial expression, appearances, etc.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Labeled training images are needed (very time-consuming) 🤪&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Original formulation of ASM and AAM is computationally expensive (i.e. slow) 🤪&lt;/p>
&lt;/li>
&lt;li>
&lt;p>But, efficient extensions and speed-ups exist!&lt;/p>
&lt;ul>
&lt;li>Multi-resolution search&lt;/li>
&lt;li>Constrained AAM search&lt;/li>
&lt;li>Inverse compositional AAMs (CMU)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Usage&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Facial fiducial point detection&lt;/strong>&lt;/li>
&lt;li>Face recognition, pose estimation&lt;/li>
&lt;li>Facial expression analysis&lt;/li>
&lt;li>Audio-visual speech recognition&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h2 id="more-modern-approaches-conditional-random-forests-for-real-time-facial-feature-detection1">More Modern Approaches: &lt;strong>Conditional Random Forests&lt;/strong> For Real Time Facial Feature Detection&lt;sup id="fnref:1">&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref">1&lt;/a>&lt;/sup>&lt;/h2>
&lt;h3 id="basics">Basics&lt;/h3>
&lt;h4 id="regression-tree">Regression tree&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>Basically like a classification decision tree&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Inner nodes contain decisions that are comparisons of numbers&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Leaves contain numbers or multi-dimensional vectors of numbers&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Example&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-19%2016.02.51.png" alt="截屏2021-02-19 16.02.51">&lt;/p>
&lt;/li>
&lt;/ul>
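The node/leaf structure above fits in a few lines; this is an illustrative sketch in which inner nodes are `(feature, threshold, left, right)` tuples and leaves are the stored numbers or vectors.

```python
def predict(tree, x):
    """Walk a regression tree down to a leaf for sample x.
    Inner nodes: (feature_index, threshold, left_subtree, right_subtree).
    Leaves: anything that is not a tuple (a number or a vector)."""
    if not isinstance(tree, tuple):
        return tree                       # leaf: return the stored value
    feature, threshold, left, right = tree
    child = left if x[feature] <= threshold else right
    return predict(child, x)

# Hypothetical toy tree: split on feature 0, then on feature 1
tree = (0, 5.0, [1.0, 2.0], (1, 3.0, [0.0, 0.0], [9.0, 9.0]))
```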
&lt;h4 id="random-regression-forests">Random regression forests&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>Set of random regression trees&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Random&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Different trees are trained on random subsets of the training data&lt;/p>
&lt;/li>
&lt;li>
&lt;p>After training, predictions for unseen samples can be made by averaging the predictions from all the individual regression trees&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-19%2016.03.52.png" alt="截屏2021-02-19 16.03.52">&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
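Both ingredients, random subsets for training and averaging for prediction, can be sketched as follows; `train_tree` is a hypothetical placeholder for any regression-tree learner that returns a callable.

```python
import numpy as np

def train_forest(X, y, train_tree, n_trees=10, subset_frac=0.5, rng=None):
    """Train each tree of the forest on a random subset of the training data."""
    rng = np.random.default_rng(rng)
    n = len(X)
    trees = []
    for _ in range(n_trees):
        idx = rng.choice(n, size=max(1, int(subset_frac * n)), replace=False)
        trees.append(train_tree(X[idx], y[idx]))
    return trees

def forest_predict(trees, x):
    """For an unseen sample, average the predictions of all individual trees."""
    return np.mean([tree(x) for tree in trees], axis=0)
```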
&lt;h3 id="basic-idea-1">Basic idea&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>Train a different set of trees for each head pose.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>The leaf nodes accumulate votes for the different facial fiducial points&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-19%2016.21.08.png" alt="截屏2021-02-19 16.21.08">&lt;/p>
&lt;h3 id="regression-forests-training">Regression forests training&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>Each tree is trained on a randomly selected set of images.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Extract patches in each image&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Training goal: accumulate probability for a feature point $C\_n$ given a patch $P$ at the leaf node&lt;/p>
&lt;ul>
&lt;li>Each patch is represented by appearance features $I$, and displacement vectors $D$ (offsets) to each of the facial fiducial feature point. I.e. $P = \\{I, D\\}$&lt;/li>
&lt;li>A simple patch comparison is used as the tree-node splitting criterion&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="regression-forests-testing">Regression forests testing&lt;/h3>
&lt;ul>
&lt;li>Given: a random face image&lt;/li>
&lt;li>Extract a dense set of patches from the image&lt;/li>
&lt;li>Feed all patches to all trees in the forest&lt;/li>
&lt;li>Get for each patch $P\_i$ a corresponding set of leaves&lt;/li>
&lt;li>A density estimator for the locations of the facial feature points is calculated&lt;/li>
&lt;li>Run meanshift to find all locations&lt;/li>
&lt;/ul>
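The final meanshift step can be sketched as a fixed-point iteration over the 2-D vote cloud with a Gaussian kernel; the bandwidth here is a hypothetical free parameter, not a value from the paper.

```python
import numpy as np

def meanshift_mode(votes, start, bandwidth=5.0, n_iter=30):
    """Shift a point to a density mode of the vote cloud: repeatedly replace it
    with the Gaussian-weighted mean of all votes until it stops moving."""
    votes = np.asarray(votes, dtype=float)
    x = np.asarray(start, dtype=float)
    for _ in range(n_iter):
        d2 = np.sum((votes - x) ** 2, axis=1)
        w = np.exp(-d2 / (2.0 * bandwidth ** 2))   # Gaussian kernel weights
        x_new = (w[:, None] * votes).sum(axis=0) / w.sum()
        if np.linalg.norm(x_new - x) < 1e-6:
            break
        x = x_new
    return x
```

Because distant votes get near-zero weight, outlier votes (e.g. from patches belonging to a different feature) barely affect the located mode.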
&lt;h3 id="conditional-regression-forest">Conditional Regression Forest&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>A conditional regression forest works similarly.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>For &lt;strong>training&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Compute a probability for a concrete head pose&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Divide the training set into disjoint subsets according to head pose&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Train a regression forest for each subset&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>For &lt;strong>testing&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Estimate the probabilities for each head pose&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Select trees from different regression forests&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Estimate the density function for all facial feature points.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Finalize the exact position by clustering all candidate votes for a given facial feature point (e.g., with meanshift)&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
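The test-time tree selection can be sketched as drawing trees from the pose-specific forests in proportion to the estimated head-pose probabilities; the rounding scheme below is an illustrative choice, not the paper's exact procedure.

```python
import numpy as np

def select_trees(pose_forests, pose_probs, n_trees, rng=None):
    """Compose a forest for a test image: take from each pose-specific forest a
    number of trees proportional to that pose's estimated probability."""
    rng = np.random.default_rng(rng)
    probs = np.asarray(pose_probs, dtype=float)
    counts = np.round(probs / probs.sum() * n_trees).astype(int)
    selected = []
    for forest, k in zip(pose_forests, counts):
        k = min(k, len(forest))
        idx = rng.choice(len(forest), size=k, replace=False)
        selected.extend(forest[i] for i in idx)
    return selected
```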
&lt;h3 id="experiments-and-results">Experiments and results&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>Training set:&lt;/p>
&lt;ul>
&lt;li>13233 face images from LFW Database&lt;/li>
&lt;li>10 annotated facial feature points per face image&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Training&lt;/p>
&lt;ul>
&lt;li>Maximum tree depth = 20&lt;/li>
&lt;li>2500 splitting candidates and 25 thresholds per split&lt;/li>
&lt;li>1500 images to train each tree&lt;/li>
&lt;li>200 patches per image (20 * 20 pixels).&lt;/li>
&lt;li>For head pose, two different partitions with 3 and 5 head poses are generated (pose estimation accuracy: 72.5%)&lt;/li>
&lt;li>Required time for face detection and head pose estimation is 33 ms.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Results&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/result.png" alt="result">&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h2 id="cnn-based-models">CNN based models&lt;/h2>
&lt;p>&lt;strong>Stacked Hourglass Network&lt;/strong> &lt;sup id="fnref:2">&lt;a href="#fn:2" class="footnote-ref" role="doc-noteref">2&lt;/a>&lt;/sup>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Fully-convolutional neural network&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Repeated down- and upsampling + shortcut connections&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Based on RGB face image, produce one heatmap for each landmark&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Heatmaps are transformed into numerical coordinates using DSNT&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-19%2016.51.43.png" alt="截屏2021-02-19 16.51.43">&lt;/p>
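The DSNT step amounts to taking the expected coordinate under the heatmap once it is normalized to a probability distribution; a minimal numpy sketch with coordinates scaled to $[-1, 1]$ (the scaling convention used by DSNT):

```python
import numpy as np

def dsnt(heatmap):
    """Differentiable spatial-to-numerical transform: normalize the heatmap and
    return the expected (x, y) landmark coordinate in [-1, 1] per axis."""
    h, w = heatmap.shape
    p = heatmap / heatmap.sum()             # probability distribution over pixels
    # Pixel-center coordinate grids in [-1, 1]
    xs = (2.0 * np.arange(w) + 1.0) / w - 1.0
    ys = (2.0 * np.arange(h) + 1.0) / h - 1.0
    x = (p.sum(axis=0) * xs).sum()          # expectation over columns
    y = (p.sum(axis=1) * ys).sum()          # expectation over rows
    return x, y
```

Unlike a hard argmax, this expectation is differentiable in the heatmap values, so coordinate-level losses can be backpropagated through it into the network.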
&lt;div class="footnotes" role="doc-endnotes">
&lt;hr>
&lt;ol>
&lt;li id="fn:1">
&lt;p>M. Dantone, J. Gall, G. Fanelli and L. Van Gool, &amp;ldquo;Real-time facial feature detection using conditional regression forests,&amp;rdquo; 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 2012, pp. 2578-2585, doi: 10.1109/CVPR.2012.6247976.&amp;#160;&lt;a href="#fnref:1" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:2">
&lt;p>Newell, A., Yang, K., &amp;amp; Deng, J. (2016). Stacked hourglass networks for human pose estimation. &lt;em>Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)&lt;/em>, &lt;em>9912 LNCS&lt;/em>, 483–499. &lt;a href="https://doi.org/10.1007/978-3-319-46484-8_29">https://doi.org/10.1007/978-3-319-46484-8_29&lt;/a>&amp;#160;&lt;a href="#fnref:2" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;/ol>
&lt;/div></description></item><item><title>Facial Expression Recognition</title><link>https://haobin-tan.netlify.app/docs/ai/computer-vision/cv-lecture/09-facial-expression-recognition/</link><pubDate>Thu, 18 Feb 2021 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/computer-vision/cv-lecture/09-facial-expression-recognition/</guid><description>&lt;h2 id="what-is-facial-expression-analysis">What is facial expression analysis?&lt;/h2>
&lt;h3 id="what-is-facial-expression">What is Facial Expression?&lt;/h3>
&lt;p>Facial expressions are the &lt;strong>facial changes in response to a person&amp;rsquo;s internal emotional states, intentions, or social communications.&lt;/strong>&lt;/p>
&lt;h3 id="role-of-facial-expressions">Role of facial expressions&lt;/h3>
&lt;ul>
&lt;li>One of the &lt;strong>most powerful, natural, and immediate ways&lt;/strong> (for human beings) to communicate emotions and intentions&lt;/li>
&lt;li>Face can express emotion &lt;strong>sooner&lt;/strong> than people verbalize or realize feelings&lt;/li>
&lt;li>Faces and facial expressions are an &lt;strong>important aspect&lt;/strong> in interpersonal communication and man-machine interfaces&lt;/li>
&lt;/ul>
&lt;h3 id="facial-expressions">Facial Expressions&lt;/h3>
&lt;ul>
&lt;li>Facial expression(s):
&lt;ul>
&lt;li>
&lt;p>nonverbal communication&lt;/p>
&lt;/li>
&lt;li>
&lt;p>voluntary / involuntary&lt;/p>
&lt;/li>
&lt;li>
&lt;p>results from one or more motions or positions of the muscles of the face&lt;/p>
&lt;/li>
&lt;li>
&lt;p>closely associated with our emotions&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>The fact: Most people&amp;rsquo;s success rate at reading emotions from facial expression is &lt;strong>only a little over 50 percent&lt;/strong>.&lt;/li>
&lt;/ul>
&lt;h4 id="facial-expression-analysis-vs-emotion-analysis">Facial expression analysis vs. Emotion analysis&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>Emotion analysis requires &lt;strong>higher level knowledge&lt;/strong>, such as context information.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Besides emotions, facial expressions can also express intention, cognitive processes, physical effort, etc.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h4 id="emotions-conveyed-by-facial-expressions">Emotions conveyed by Facial Expressions&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>Six basic emotions (assumed to be innate)&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-100" >&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%e6%88%aa%e5%b1%8f2021-02-19%2017.20.59.png" alt="截屏2021-02-19 17.20.59" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h3 id="basic-structure-of-facial-expression-analysis-systems">Basic structure of facial expression analysis systems&lt;/h3>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-19%2017.21.36.png" alt="截屏2021-02-19 17.21.36" style="zoom:80%;" />
&lt;h2 id="levels-of-description">Levels of description&lt;/h2>
&lt;h3 id="emotions">Emotions&lt;/h3>
&lt;h4 id="discrete-classes">Discrete classes&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>Six basic emotions&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-100" >&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%e6%88%aa%e5%b1%8f2021-02-19%2017.27.26.png" alt="截屏2021-02-19 17.27.26" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Positive, neutral, negative&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h4 id="continuous-valued-dimensions">Continuous valued dimensions&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>Emotions as a continuum along 2–3 dimensions&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Circumplex model by Russell&lt;/p>
&lt;ul>
&lt;li>Valence: unpleasant - pleasant&lt;/li>
&lt;li>Arousal: low – high activation&lt;/li>
&lt;/ul>
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-100" >&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%e6%88%aa%e5%b1%8f2021-02-19%2017.28.58.png" alt="截屏2021-02-19 17.28.58" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h3 id="facial-action-units-aus">Facial Action Units (AUs)&lt;/h3>
&lt;h4 id="facial-action-coding-system-facs">Facial Action Coding System (FACS)&lt;/h4>
&lt;ul>
&lt;li>A human-observer based system designed to &lt;strong>detect subtle changes in facial features&lt;/strong>&lt;/li>
&lt;li>Viewing videotaped facial behavior in &lt;em>slow&lt;/em> motion, a trained observer can manually FACS-code all possible facial displays&lt;/li>
&lt;li>These facial displays are referred to as &lt;strong>action units (AU)&lt;/strong> and may occur individually or in combinations.&lt;/li>
&lt;/ul>
&lt;h4 id="action-units-aus">Action Units (AUs)&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>There are 44 AUs&lt;/p>
&lt;/li>
&lt;li>
&lt;p>30 AUs related to contractions of special facial muscles&lt;/p>
&lt;ul>
&lt;li>
&lt;p>12 AUs for upper face&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-19%2017.32.01.png" alt="截屏2021-02-19 17.32.01" style="zoom:80%;" />
&lt;/li>
&lt;li>
&lt;p>18 AUs for lower face&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-19%2017.32.29.png" alt="截屏2021-02-19 17.32.29" style="zoom:80%;" />
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Anatomic basis of the remaining 14 is unspecified $\rightarrow$ referred to in Facial Action Coding System (FACS) as miscellaneous actions&lt;/p>
&lt;/li>
&lt;li>
&lt;p>For action units that vary in intensity, a 5-point ordinal scale is used to measure the degree of muscle contraction&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h4 id="combination-of-aus">Combination of AUs&lt;/h4>
&lt;p>More than 7000 different AU combinations have been observed.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Additive&lt;/strong>: appearance of single AUs does NOT change. E.g.&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-19%2017.36.39.png" alt="截屏2021-02-19 17.36.39" style="zoom:80%;" />
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Nonadditive&lt;/strong>: appearance of single AUs does change. E.g.&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-19%2017.37.05.png" alt="截屏2021-02-19 17.37.05" style="zoom:80%;" />
&lt;/li>
&lt;/ul>
&lt;h4 id="individual-differences-in-subjects">Individual Differences in Subjects&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>Variations in appearance&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Face shape,&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Texture&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Color&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Facial and scalp hair&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>due to sex, ethnic background, and age differences&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Variations in expressiveness&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h4 id="transitions-among-expressions">Transitions Among Expressions&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>Simplifying assumption: &lt;strong>expressions are singular and begin and end with a neutral position&lt;/strong>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Transitions from one action unit or combination of action units to another may involve NO intervening neutral state.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Parsing the stream of behavior is an essential requirement of a robust facial analysis system, and training data are needed that include dynamic combinations of action units, which may be either additive or nonadditive.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h4 id="intensity-of-facial-expression">Intensity of Facial Expression&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>Facial actions can vary in intensity&lt;/p>
&lt;/li>
&lt;li>
&lt;p>FACS coding uses 5-point intensity scale to describe intensity variation of action units&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Some related action units function as sets to represent intensity variation.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>E.g. in the eye region, action units 41, 42, and 43 or 45 can represent intensity variation from slightly drooped to closed eyes.&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-100" >&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%e6%88%aa%e5%b1%8f2021-02-19%2017.43.17.png" alt="截屏2021-02-19 17.43.17" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h4 id="relation-to-other-facial-behavior-or-nonfacial-behavior">Relation to other Facial Behavior or Nonfacial Behavior&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>Facial expression is one of several channels of nonverbal communication.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>The message values of various modes may differ depending on context.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>For robustness, should be integrated with&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Gesture&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Prosody&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Speech&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h2 id="different-datasets-and-systems">Different datasets and systems&lt;/h2>
&lt;h3 id="using-geometric-features--ann-2001--early-work">Using geometric features + ANN (2001 / early work)&lt;/h3>
&lt;p>&lt;strong>Recognizing Action Units for Facial Expression Analysis&lt;/strong>&lt;sup id="fnref:1">&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref">1&lt;/a>&lt;/sup>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>An &lt;strong>Automatic Facial Analysis (AFA)&lt;/strong> system to analyze facial expressions based on both &lt;strong>permanent facial features (brows, eyes, mouth)&lt;/strong> and &lt;strong>transient facial features (depending on facial furrows)&lt;/strong> in nearly frontal-view image sequences.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>A group of action units (neutral expression, six upper-face AUs, and ten lower-face AUs) is recognized, whether they occur alone or in combination.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h4 id="cohn-kanade-au-coded-facial-expression-database">Cohn-Kanade AU-Coded Facial Expression Database&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>100 subjects from varying ethnic backgrounds.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>23 different facial expressions (single action units and combinations of action units)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Frontal faces, small head motion&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Variations in lighting&lt;/p>
&lt;ul>
&lt;li>ambient lighting&lt;/li>
&lt;li>single-high-intensity lamp&lt;/li>
&lt;li>dual high-intensity lamps with reflective umbrellas&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Coded with FACS and assigned emotion-specified labels&lt;/strong> (happy, surprise, anger, disgust, fear, sadness)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Example&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-100" >&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%e6%88%aa%e5%b1%8f2021-02-19%2021.40.16.png" alt="截屏2021-02-19 21.40.16" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h4 id="feature-based-automatic-facial-action-analysis-afa-system">Feature-based Automatic Facial Action Analysis (AFA) System&lt;/h4>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-19%2021.41.28.png" alt="截屏2021-02-19 21.41.28" style="zoom:80%;" />
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Feature detection &amp;amp; feature location&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Region of the face and locations of individual face features are detected automatically in the initial frame using a neural-network-based approach&lt;/li>
&lt;li>Contours of face features and components adjusted manually in the initial frame&lt;/li>
&lt;li>Face features are then tracked automatically
&lt;ul>
&lt;li>&lt;strong>permanent features&lt;/strong> (e.g., brows, eyes, lips)&lt;/li>
&lt;li>&lt;strong>transient features&lt;/strong> (lines and furrows)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Feature extraction&lt;/strong>: Group facial features into separate collections of feature parameters&lt;/p>
&lt;ul>
&lt;li>15 normalized upper face parameters&lt;/li>
&lt;li>9 normalized lower face parameters&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Parameters fed to two neural-network-based classifiers&lt;/strong>&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h4 id="facial-feature-extraction">Facial Feature Extraction&lt;/h4>
&lt;p>Multistate Facial Component Models of a Frontal Face&lt;/p>
&lt;ul>
&lt;li>Permanent components/features
&lt;ul>
&lt;li>Lip&lt;/li>
&lt;li>Eye&lt;/li>
&lt;li>Brow&lt;/li>
&lt;li>Cheek&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Transient component/features
&lt;ul>
&lt;li>&lt;strong>Furrows&lt;/strong> and &lt;strong>wrinkles&lt;/strong> appear perpendicular to the direction of the motion of the activated muscles&lt;/li>
&lt;li>Classification
&lt;ul>
&lt;li>present (appear, deepen or lengthen)&lt;/li>
&lt;li>absent&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Detection
&lt;ul>
&lt;li>Canny edge detector&lt;/li>
&lt;li>Nasal root / crow’s-feet wrinkles&lt;/li>
&lt;li>Nasolabial furrows&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
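A sketch of the transient-feature decision, under the assumption that presence is judged by comparing Canny edge density inside a furrow region against the neutral frame (the threshold ratio here is a hypothetical parameter, not a value from the paper):

```python
import numpy as np

def furrow_present(edges_now, edges_neutral, region, ratio=1.5):
    """Classify a transient feature (furrow/wrinkle) as present if the edge
    density inside `region` grew noticeably relative to the neutral frame.
    `edges_now` / `edges_neutral` are binary Canny edge maps; `region` is a
    boolean mask over the nasal root, crow's-feet, or nasolabial area."""
    d_now = edges_now[region].mean()
    d_neutral = edges_neutral[region].mean()
    return d_now > ratio * max(d_neutral, 1e-6)
```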
&lt;h4 id="facial-feature-representation">Facial Feature Representation&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>Face coordinate system&lt;/p>
&lt;ul>
&lt;li>$x = $ line between inner corners of eyes&lt;/li>
&lt;li>$y = $ perpendicular to x&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Group facial features&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>upper face&lt;/strong> features: 15 parameters&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Example&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-100" >&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%e6%88%aa%e5%b1%8f2021-02-19%2021.54.56.png" alt="截屏2021-02-19 21.54.56" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>lower face&lt;/strong> features: 9 parameters&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Example&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-19%2021.55.15.png" alt="截屏2021-02-19 21.55.15" style="zoom:67%;" />
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
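The face coordinate system above can be constructed directly from the two inner eye corners; a sketch (the `(x, y)` point format and the midpoint origin are illustrative assumptions):

```python
import numpy as np

def face_coordinate_transform(left_eye_inner, right_eye_inner):
    """Build a map from image points into the face frame: the x-axis runs along
    the line between the inner eye corners, the y-axis is perpendicular to it,
    and the origin is placed at the midpoint between the two corners."""
    l = np.asarray(left_eye_inner, dtype=float)
    r = np.asarray(right_eye_inner, dtype=float)
    x_axis = (r - l) / np.linalg.norm(r - l)
    y_axis = np.array([-x_axis[1], x_axis[0]])   # perpendicular to the eye line
    origin = (l + r) / 2.0
    R = np.stack([x_axis, y_axis])               # rows are the new axes
    return lambda p: R @ (np.asarray(p, dtype=float) - origin)
```

Expressing the feature parameters in this frame makes them invariant to in-plane rotation and translation of the face.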
&lt;h4 id="au-recognition-by-neural-networks">AU Recognition by Neural Networks&lt;/h4>
&lt;ul>
&lt;li>Three layer neural networks (one hidden layer)&lt;/li>
&lt;li>Standard back-propagation method
&lt;ul>
&lt;li>Separate networks for upper- / lower face&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-19%2021.56.35.png" alt="截屏2021-02-19 21.56.35" style="zoom:80%;" />
&lt;h3 id="using-appearance-based-features--svm-2006">Using appearance-based features + SVM (2006)&lt;/h3>
&lt;p>&lt;strong>Automatic Recognition of Facial Actions in Spontaneous Expression&lt;/strong>&lt;sup id="fnref:2">&lt;a href="#fn:2" class="footnote-ref" role="doc-noteref">2&lt;/a>&lt;/sup>&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-100" >&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%e6%88%aa%e5%b1%8f2021-02-19%2022.40.45.png" alt="截屏2021-02-19 22.40.45" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h4 id="ru-facs-data-set">RU-FACS data set&lt;/h4>
&lt;ul>
&lt;li>Contains spontaneous expressions&lt;/li>
&lt;li>100 subjects&lt;/li>
&lt;/ul>
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-100" >&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%e6%88%aa%e5%b1%8f2021-02-19%2022.41.42.png" alt="截屏2021-02-19 22.41.42" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h3 id="using-deep-features-cnn--fusion-2013">Using Deep features (CNN) + fusion (2013)&lt;/h3>
&lt;h4 id="emotion-recognition-in-the-wild-challenge-emotiw">&lt;strong>Emotion Recognition in the Wild Challenge (EmotiW)&lt;/strong>&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>🎯 Goal: Move to more realistic, out-of-the-lab data&lt;/p>
&lt;/li>
&lt;li>
&lt;p>AFEW Dataset (Acted Facial Expressions in the Wild)&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Extracted from movies&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Annotated with six basic emotions&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Movie clips from 330 subjects, age range: 1-70&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Semi-automatic annotation pipeline&lt;/p>
&lt;ul>
&lt;li>Recommender system + manual annotation&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-100" >&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%e6%88%aa%e5%b1%8f2021-02-19%2022.45.20.png" alt="截屏2021-02-19 22.45.20" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h4 id="2013-winner">2013 Winner&lt;/h4>
&lt;p>&lt;strong>Combining Modality Specific Deep Neural Networks for Emotion Recognition in Video&lt;/strong>&lt;sup id="fnref:3">&lt;a href="#fn:3" class="footnote-ref" role="doc-noteref">3&lt;/a>&lt;/sup>&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-19%2022.47.35.png" alt="截屏2021-02-19 22.47.35" style="zoom:80%;" />
&lt;h5 id="convolutional-network">&lt;strong>Convolutional Network&lt;/strong>&lt;/h5>
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-100" >&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%e6%88%aa%e5%b1%8f2021-02-19%2022.49.28.png" alt="截屏2021-02-19 22.49.28" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;ul>
&lt;li>Inputs are images of size 40x40, cropped randomly&lt;/li>
&lt;li>Four layers, 3 convolutions followed by max or average pooling and a fully-connected layer&lt;/li>
&lt;/ul>
&lt;h5 id="representing-video-sequence">Representing video sequence&lt;/h5>
&lt;ul>
&lt;li>CNN gives 7-dim output per frame&lt;/li>
&lt;li>Multiple frames are averaged into 10 vectors describing the sequence
&lt;ul>
&lt;li>For shorter sequences, frames / vectors get expanded (duplicated)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Results in 70-dim feature vector (10*7)&lt;/li>
&lt;li>Classification with SVM&lt;/li>
&lt;/ul>
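The aggregation steps above can be sketched as follows, assuming per-frame 7-dimensional probability vectors from the CNN; the exact duplication scheme for short sequences is an illustrative choice.

```python
import numpy as np

def sequence_descriptor(frame_probs, n_segments=10):
    """Average per-frame CNN outputs (7 emotion probabilities per frame) over
    n_segments temporal bins and concatenate them into one fixed-length vector
    (10 * 7 = 70 dimensions). Shorter sequences are expanded by duplication."""
    frame_probs = np.asarray(frame_probs, dtype=float)   # (n_frames, 7)
    n = len(frame_probs)
    if n < n_segments:
        # Duplicate frames so that every temporal bin is non-empty
        idx = np.repeat(np.arange(n), int(np.ceil(n_segments / n)))
        frame_probs = frame_probs[idx]
        n = len(frame_probs)
    bins = np.array_split(np.arange(n), n_segments)
    segments = [frame_probs[b].mean(axis=0) for b in bins]
    return np.concatenate(segments)                      # shape (n_segments * 7,)
```

The resulting 70-dimensional descriptor has a fixed length regardless of clip duration, which is what allows a standard SVM to classify the sequence.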
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-100" >&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%e6%88%aa%e5%b1%8f2021-02-19%2022.56.28.png" alt="截屏2021-02-19 22.56.28" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h5 id="other-features">Other Features&lt;/h5>
&lt;ul>
&lt;li>&amp;ldquo;Bag of Mouth&amp;rdquo;&lt;/li>
&lt;li>Audio-features&lt;/li>
&lt;/ul>
&lt;h4 id="typical-pipline">Typical Pipline&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>Face detection and alignment&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Extract various features and different representations&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Build multiple classifiers&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Fusion of results&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h4 id="other-applications">Other Applications&lt;/h4>
&lt;ul>
&lt;li>Pain Analysis&lt;/li>
&lt;li>Analysis of psychological disorders&lt;/li>
&lt;li>Workload / stress analysis&lt;/li>
&lt;li>Adaptive user interfaces&lt;/li>
&lt;li>Advertisement&lt;/li>
&lt;/ul>
&lt;div class="footnotes" role="doc-endnotes">
&lt;hr>
&lt;ol>
&lt;li id="fn:1">
&lt;p>Y. . -I. Tian, T. Kanade and J. F. Cohn, &amp;ldquo;Recognizing action units for facial expression analysis,&amp;rdquo; in IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 2, pp. 97-115, Feb. 2001, doi: 10.1109/34.908962.&amp;#160;&lt;a href="#fnref:1" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:2">
&lt;p>Littlewort, Gwen &amp;amp; Frank, Mark &amp;amp; Lainscsek, Claudia &amp;amp; Fasel, Ian &amp;amp; Movellan, Javier. (2006). Automatic Recognition of Facial Actions in Spontaneous Expressions. Journal of Multimedia. 1. 10.4304/jmm.1.6.22-35.&amp;#160;&lt;a href="#fnref:2" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:3">
&lt;p>Kahou, Samira Ebrahimi &amp;amp; Pal, Christopher &amp;amp; Bouthillier, Xavier &amp;amp; Froumenty, Pierre &amp;amp; Gulcehre, Caglar &amp;amp; Memisevic, Roland &amp;amp; Vincent, Pascal &amp;amp; Courville, Aaron &amp;amp; Bengio, Y. &amp;amp; Ferrari, Raul &amp;amp; Mirza, Mehdi &amp;amp; Jean, Sébastien &amp;amp; Carrier, Pierre-Luc &amp;amp; Dauphin, Yann &amp;amp; Boulanger-Lewandowski, Nicolas &amp;amp; Aggarwal, Abhishek &amp;amp; Zumer, Jeremie &amp;amp; Lamblin, Pascal &amp;amp; Raymond, Jean-Philippe &amp;amp; Wu, Zhenzhou. (2013). Combining modality specific deep neural networks for emotion recognition in video. ICMI 2013 - Proceedings of the 2013 ACM International Conference on Multimodal Interaction. 543-550. 10.1145/2522848.2531745.&amp;#160;&lt;a href="#fnref:3" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;/ol>
&lt;/div></description></item><item><title>People Detection: Global Approaches</title><link>https://haobin-tan.netlify.app/docs/ai/computer-vision/cv-lecture/10-person_detection-holistic_models/</link><pubDate>Thu, 18 Feb 2021 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/computer-vision/cv-lecture/10-person_detection-holistic_models/</guid><description>&lt;h2 id="motivation">Motivation&lt;/h2>
&lt;h3 id="why-people-detection">Why people detection?&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>Person Re-Identification&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Person Tracking&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Security (e.g. Border Control)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Automotive (e.g. Collision Prevention)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Interaction (e.g. Xbox Kinect)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Medical (e.g. Patient Monitoring)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Commercial (e.g. Customer Counting)&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h3 id="why-is-people-detection-difficult">Why is people detection difficult?&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Clothing&lt;/strong>&lt;/p>
&lt;p>Large variety of clothing styles causes greater appearance variety&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Accessories&lt;/strong>
Occlusions by accessories. E.g. backpack, umbrella, handbag, &amp;hellip;&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Articulation&lt;/strong>
Faces are mostly rigid. Persons can take on many different poses&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Clutter&lt;/strong>
People frequently overlap each other in images (crowds)&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h2 id="categories">Categories&lt;/h2>
&lt;h3 id="still-image-vs-video">Still image vs. video&lt;/h3>
&lt;p>&lt;strong>Still image based&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Mostly based on gray-value information from visual images&lt;/li>
&lt;li>Other possible cues: color, infra-red, radar, stereo&lt;/li>
&lt;li>👍 Advantage: Applicable in wider variety of applications&lt;/li>
&lt;li>👎 Disadvantages
&lt;ul>
&lt;li>Often more difficult (only a single frame)&lt;/li>
&lt;li>Performs worse than video-based techniques&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Video based&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Background modeling&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Temporal information (speed, position in earlier frames)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Optical flow&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Can be (re-)initialized by still image approach&lt;/p>
&lt;/li>
&lt;li>
&lt;p>👎 Disadvantage: Hard to apply in unconstrained scenarios&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h3 id="global-vs-parts">Global vs. parts&lt;/h3>
&lt;p>&lt;strong>Global approaches&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Holistic model, e.g. one feature for whole person&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-19%2023.32.50.png" alt="截屏2021-02-19 23.32.50">&lt;/p>
&lt;/li>
&lt;li>
&lt;p>👍 Advantages&lt;/p>
&lt;ul>
&lt;li>typically simple model&lt;/li>
&lt;li>work well for low resolutions&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>👎 Disadvantages&lt;/p>
&lt;ul>
&lt;li>problems with occlusions&lt;/li>
&lt;li>problems with articulations&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Part-based approaches&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Model body sub-parts separately&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-19%2023.33.55.png" alt="截屏2021-02-19 23.33.55">&lt;/p>
&lt;/li>
&lt;li>
&lt;p>👍 Advantages&lt;/p>
&lt;ul>
&lt;li>deal better with moving body parts (poses)&lt;/li>
&lt;li>able to handle occlusions, overlaps&lt;/li>
&lt;li>sharing of training data&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>👎 Disadvantages&lt;/p>
&lt;ul>
&lt;li>require more complex reasoning&lt;/li>
&lt;li>problems with low resolutions&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="discriminative-vs-generative">discriminative vs. generative&lt;/h3>
&lt;p>&lt;strong>Generative model&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Models how data (i.e. person images) is generated&lt;/p>
&lt;/li>
&lt;li>
&lt;p>👍 Advantages&lt;/p>
&lt;ul>
&lt;li>possibly interpretable, i.e. know why reject/accept&lt;/li>
&lt;li>models the object class/can draw samples&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>👎 Disadvantages&lt;/p>
&lt;ul>
&lt;li>models variability that is unimportant to the classification task&lt;/li>
&lt;li>often hard to build good model with few parameters&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Discriminative model&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Can only discriminate, for given data, whether it is a person or not&lt;/p>
&lt;/li>
&lt;li>
&lt;p>👍 Advantages&lt;/p>
&lt;ul>
&lt;li>appealing when infeasible to model data itself&lt;/li>
&lt;li>currently often excel in practice&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>👎 Disadvantages&lt;/p>
&lt;ul>
&lt;li>often can’t provide uncertainty in predictions&lt;/li>
&lt;li>non-interpretable&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h2 id="typical-components-of-global-approaches">Typical components of global approaches&lt;/h2>
&lt;h3 id="detection-via-classification-binary-classifier">Detection via classification (binary classifier)&lt;/h3>
&lt;p>&lt;strong>Sliding window&lt;/strong>: Scan window at different &lt;strong>positions and scales&lt;/strong>&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-20%2013.49.47.png" alt="截屏2021-02-20 13.49.47">&lt;/p>
&lt;h3 id="gradient-based">Gradient based&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>Popular and successful in the vision community&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Avoid hard decisions (compared to edge based features)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Examples&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Histogram of Oriented Gradients (HOG)&lt;/strong>&lt;/li>
&lt;li>&lt;strong>Scale-Invariant Feature Transform (SIFT)&lt;/strong>&lt;/li>
&lt;li>&lt;strong>Gradient Location and Orientation Histogram (GLOH)&lt;/strong>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Computing gradients&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Centered
&lt;/p>
$$
f^{\prime}(x)=\lim \_{h \rightarrow 0} \frac{f(x+h)-f(x-h)}{2 h}
$$
&lt;/li>
&lt;li>
&lt;p>Gradient &lt;strong>magnitude&lt;/strong>
&lt;/p>
$$
s = \sqrt{s\_x^2 + s\_y^2}
$$
&lt;/li>
&lt;li>
&lt;p>Gradient &lt;strong>orientation&lt;/strong>
&lt;/p>
$$
\theta=\arctan \left(\frac{s\_{y}}{s\_{x}}\right)
$$
&lt;p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-20%2013.55.54.png" alt="截屏2021-02-20 13.55.54">&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Gradient in image&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Image: discrete, 2-dimensional signal&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Use filter mask to compute gradient&lt;/p>
&lt;ul>
&lt;li>
&lt;p>$x$-direction:&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-20%2014.50.04.png" alt="截屏2021-02-20 14.50.04" style="zoom: 67%;" />
&lt;/li>
&lt;li>
&lt;p>$y$-direction&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-20%2014.50.15.png" alt="截屏2021-02-20 14.50.15" style="zoom:67%;" />
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
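&lt;p>The gradient formulas above, applied to a discrete image with the centered filter masks (a small illustrative sketch on a synthetic intensity ramp):&lt;/p>

```python
import numpy as np

# Toy image whose brightness increases linearly along x.
img = np.tile(np.arange(8.0), (8, 1))

# Centered differences via the [-1 0 1] masks (border pixels left at 0).
s_x = np.zeros_like(img)
s_y = np.zeros_like(img)
s_x[:, 1:-1] = img[:, 2:] - img[:, :-2]   # x-direction
s_y[1:-1, :] = img[2:, :] - img[:-2, :]   # y-direction

magnitude = np.sqrt(s_x**2 + s_y**2)      # s = sqrt(s_x^2 + s_y^2)
orientation = np.arctan2(s_y, s_x)        # theta = arctan(s_y / s_x)

print(magnitude[4, 4], orientation[4, 4])  # 2.0 0.0 (pure horizontal gradient)
```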
&lt;h3 id="edge-based">Edge based&lt;/h3>
&lt;h3 id="wavelet-based">Wavelet based&lt;/h3>
&lt;h2 id="hog-people-detector-1">HOG people detector &lt;sup id="fnref:1">&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref">1&lt;/a>&lt;/sup>&lt;/h2>
&lt;blockquote>
&lt;p>More see: &lt;a href="https://haobin-tan.netlify.app/docs/ai/computer-vision/object-detection/hog/">Histogram of Oriented Gradients (HOG)&lt;/a>&lt;/p>
&lt;/blockquote>
&lt;ul>
&lt;li>Gradient-based feature descriptor developed for people detection&lt;/li>
&lt;li>Global descriptor for the complete body&lt;/li>
&lt;li>High-dimensional (typically ~4000 dimensions)&lt;/li>
&lt;li>Very promising results on challenging data sets&lt;/li>
&lt;/ul>
&lt;h3 id="phases">Phases&lt;/h3>
&lt;h4 id="learning-phase">Learning Phase&lt;/h4>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-20%2017.13.56.png" alt="截屏2021-02-20 17.13.56" style="zoom:80%;" />
&lt;h4 id="detection-phase">Detection Phase&lt;/h4>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-20%2017.14.25.png" alt="截屏2021-02-20 17.14.25" style="zoom:80%;" />
&lt;h3 id="how-hog-descriptor-works">How HOG descriptor works?&lt;/h3>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-20%2017.19.38.png" alt="截屏2021-02-20 17.19.38" style="zoom:80%;" />
&lt;ol>
&lt;li>Compute gradients on an image region of 64x128 pixels&lt;/li>
&lt;li>Compute gradient orientation histograms on &lt;em>cells&lt;/em> of 8x8 pixels (in total 8x16 cells).
typical histogram size: 9 bins&lt;/li>
&lt;li>Normalize histograms within overlapping &lt;em>blocks&lt;/em> of 2x2 cells (in total 7x15 blocks)
block descriptor size: 4x9 = 36&lt;/li>
&lt;li>Concatenate block descriptors $\rightarrow$ 7 x 15 x 4 x 9 = 3780 dimensional feature vector&lt;/li>
&lt;/ol>
&lt;h4 id="1-gradients">1. Gradients&lt;/h4>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-20%2017.21.44.png" alt="截屏2021-02-20 17.21.44" style="zoom:80%;" />
&lt;ul>
&lt;li>
&lt;p>Convolution with [-1 0 1] filters (x and y direction)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Compute gradient magnitude and direction&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Per pixel: color channel with greatest magnitude is used for final gradient (color is used!)&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-20%2017.22.45.png" alt="截屏2021-02-20 17.22.45">&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h4 id="2-cell-histograms">2. &lt;strong>Cell histograms&lt;/strong>&lt;/h4>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-20%2017.23.45.png" alt="截屏2021-02-20 17.23.45" style="zoom:67%;" />
&lt;ul>
&lt;li>9 bins for gradient orientations (0-180 degrees)&lt;/li>
&lt;li>Filled with magnitudes&lt;/li>
&lt;li>Interpolated trilinearly
&lt;ul>
&lt;li>bilinearly into spatial cells&lt;/li>
&lt;li>linearly into orientation bins&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h4 id="3-blocks">3. Blocks&lt;/h4>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-20%2017.25.24.png" alt="截屏2021-02-20 17.25.24" style="zoom:80%;" />
&lt;ul>
&lt;li>Overlapping blocks of 2x2 cells&lt;/li>
&lt;li>Cell histograms are concatenated and then normalized&lt;/li>
&lt;li>Normalization
&lt;ul>
&lt;li>different norms possible (L2, L2hys etc.)&lt;/li>
&lt;li>add a normalization epsilon to avoid division by zero&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h4 id="4-the-final-hog-descriptor">4. &lt;strong>The final HOG descriptor&lt;/strong>&lt;/h4>
&lt;p>Concatenation of block descriptors&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-20%2017.32.07.png" alt="截屏2021-02-20 17.32.07">&lt;/p>
&lt;p>Visualization&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-20%2017.32.24.png" alt="截屏2021-02-20 17.32.24" style="zoom:80%;" />
&lt;h3 id="from-feature-to-detector">From feature to detector&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>Simple linear SVM on top of the HOG Features&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Fast (one inner product per evaluation window)&lt;/p>
&lt;p>for an entire image it’s a vector-matrix multiplication&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Gaussian kernel SVM&lt;/p>
&lt;ul>
&lt;li>
&lt;p>slightly better classification accuracy&lt;/p>
&lt;/li>
&lt;li>
&lt;p>but considerable increase in computation time&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
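&lt;p>Why the linear SVM is fast: each window costs one inner product, and all windows of an image stacked row-wise cost a single matrix-vector multiplication. The weight vector and features here are random placeholders, not a trained detector:&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=3780)                 # placeholder SVM weight vector
b = -1.0                                  # placeholder bias
feats = rng.normal(size=(500, 3780))      # HOG features of 500 windows

# One window: a single inner product.
score_0 = feats[0] @ w + b
# All windows of an image at once: one matrix-vector multiplication.
scores = feats @ w + b
detections = np.flatnonzero(scores > 0)   # windows classified as "person"
```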
&lt;h2 id="silhouette-matching-2">Silhouette Matching &lt;sup id="fnref:2">&lt;a href="#fn:2" class="footnote-ref" role="doc-noteref">2&lt;/a>&lt;/sup>&lt;/h2>
&lt;h3 id="idea">Idea&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>🎯 &lt;strong>Goal: align known object shapes with image&lt;/strong>&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-20%2017.38.46.png" alt="截屏2021-02-20 17.38.46">&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Requirements for an alignment algorithm&lt;/p>
&lt;ul>
&lt;li>
&lt;p>high detection rate&lt;/p>
&lt;/li>
&lt;li>
&lt;p>few false positives&lt;/p>
&lt;/li>
&lt;li>
&lt;p>robustness&lt;/p>
&lt;/li>
&lt;li>
&lt;p>computationally inexpensive&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="computational-complexity">Computational complexity&lt;/h3>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-20%2017.41.11.png" alt="截屏2021-02-20 17.41.11" style="zoom:67%;" />
&lt;p>Complexity is &lt;strong>O(#positions * #templates * #contourpixels * sizeof(searchregion))&lt;/strong>&lt;/p>
&lt;h3 id="distance-transform">Distance transform&lt;/h3>
&lt;p>Used to compare/align two (typically binary) shapes&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-20%2017.44.59.png" alt="截屏2021-02-20 17.44.59">&lt;/p>
&lt;ol>
&lt;li>
&lt;p>Compute the distance from each pixel to the nearest edge pixel&lt;/p>
&lt;ul>
&lt;li>here the Euclidean distances are approximated by the &lt;strong>2-3 distance&lt;/strong>&lt;/li>
&lt;/ul>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-20%2017.45.23.png" alt="截屏2021-02-20 17.45.23" style="zoom:80%;" />
&lt;/li>
&lt;li>
&lt;p>Overlay second shape over distance transform&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-20%2017.45.42.png" alt="截屏2021-02-20 17.45.42">&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Accumulate distances along shape 2&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Find best matching position by an exhaustive search&lt;/p>
&lt;/li>
&lt;/ol>
&lt;p>However:&lt;/p>
&lt;ul>
&lt;li>2-3 distance is not symmetric&lt;/li>
&lt;li>2-3 distance has to be normalized w.r.t. the length of the shapes&lt;/li>
&lt;/ul>
&lt;h3 id="chamfer-matching">Chamfer matching&lt;/h3>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-20%2017.47.50.png" alt="截屏2021-02-20 17.47.50">&lt;/p>
&lt;h4 id="efficient-implementation">&lt;strong>Efficient Implementation&lt;/strong>&lt;/h4>
&lt;p>The distance transform can be efficiently computed by two scans over the complete image&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Forward-Scan&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>starts in the upper-left corner and moves from left to right, top to bottom&lt;/p>
&lt;/li>
&lt;li>
&lt;p>uses the following mask&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-20%2017.50.24.png" alt="截屏2021-02-20 17.50.24">&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Backward-Scan&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>starts in the lower-right corner and moves from right to left, bottom to top&lt;/p>
&lt;/li>
&lt;li>
&lt;p>uses the following mask&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-20%2017.50.50.png" alt="截屏2021-02-20 17.50.50">&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
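&lt;p>The two-scan distance transform can be sketched as follows (naive Python loops for clarity; edge pixels get distance 0, axial neighbors add 2, diagonal neighbors add 3, matching the 2-3 masks above):&lt;/p>

```python
import numpy as np

def chamfer_dt(edges):
    """Two-scan 2-3 distance transform over a boolean edge map."""
    INF = 10**6
    d = np.where(edges, 0, INF).astype(np.int64)
    H, W = d.shape
    # Forward scan: upper-left corner, left to right, top to bottom.
    for y in range(H):
        for x in range(W):
            if y > 0:
                d[y, x] = min(d[y, x], d[y - 1, x] + 2)
                if x > 0:
                    d[y, x] = min(d[y, x], d[y - 1, x - 1] + 3)
                if x < W - 1:
                    d[y, x] = min(d[y, x], d[y - 1, x + 1] + 3)
            if x > 0:
                d[y, x] = min(d[y, x], d[y, x - 1] + 2)
    # Backward scan: lower-right corner, right to left, bottom to top.
    for y in range(H - 1, -1, -1):
        for x in range(W - 1, -1, -1):
            if y < H - 1:
                d[y, x] = min(d[y, x], d[y + 1, x] + 2)
                if x < W - 1:
                    d[y, x] = min(d[y, x], d[y + 1, x + 1] + 3)
                if x > 0:
                    d[y, x] = min(d[y, x], d[y + 1, x - 1] + 3)
            if x < W - 1:
                d[y, x] = min(d[y, x], d[y, x + 1] + 2)
    return d

edges = np.zeros((5, 5), dtype=bool)
edges[2, 2] = True
dt = chamfer_dt(edges)
print(dt[2, 2], dt[2, 3], dt[1, 1])  # 0 2 3
```

The chamfer matching score of a template is then the (length-normalized) sum of `dt` values along the overlaid template contour.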
&lt;p>Advantages&lt;/p>
&lt;ul>
&lt;li>Fast&lt;/li>
&lt;li>Good performance on uncluttered images (with few background structures)&lt;/li>
&lt;/ul>
&lt;p>Disadvantages&lt;/p>
&lt;ul>
&lt;li>Bad performance for cluttered images&lt;/li>
&lt;li>Needs a huge number of people silhouettes&lt;/li>
&lt;/ul>
&lt;h3 id="template-hierarchy">Template Hierarchy&lt;/h3>
&lt;ul>
&lt;li>Reduce the number of silhouettes to consider&lt;/li>
&lt;li>The shapes are clustered by similarity&lt;/li>
&lt;/ul>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-20%2017.52.49.png" alt="截屏2021-02-20 17.52.49">&lt;/p>
&lt;h3 id="coarse-to-fine-search">Coarse-To-Fine Search&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>Goal: Reduce search effort by discarding unlikely regions with minimal computational effort&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Idea:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>subsample the image and search first at a coarse scale&lt;/p>
&lt;/li>
&lt;li>
&lt;p>only consider regions with a low distance when searching for a match on finer scales&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Need to find reasonable thresholds&lt;/p>
&lt;/li>
&lt;/ul>
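&lt;p>A toy sketch of the coarse-to-fine idea (the distance maps, subsampling factor, and threshold are made-up illustration values): only coarse positions below the threshold are refined on the finer scale.&lt;/p>

```python
import numpy as np

def coarse_to_fine(dist_coarse, dist_fine, factor=2, thresh=5.0):
    """Keep coarse-scale positions with low chamfer distance, then search
    only the corresponding fine-scale neighborhoods."""
    candidates = np.argwhere(dist_coarse < thresh)   # promising coarse cells
    refined = []
    for cy, cx in candidates:
        y0, x0 = cy * factor, cx * factor            # map to fine scale
        patch = dist_fine[y0:y0 + factor, x0:x0 + factor]
        dy, dx = np.unravel_index(patch.argmin(), patch.shape)
        refined.append((y0 + dy, x0 + dx, patch.min()))
    return refined

dist_coarse = np.full((4, 4), 10.0); dist_coarse[1, 2] = 1.0
dist_fine = np.full((8, 8), 10.0); dist_fine[3, 5] = 0.5
matches = coarse_to_fine(dist_coarse, dist_fine)
```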
&lt;div class="footnotes" role="doc-endnotes">
&lt;hr>
&lt;ol>
&lt;li id="fn:1">
&lt;p>N. Dalal and B. Triggs, &amp;ldquo;Histograms of oriented gradients for human detection,&amp;rdquo; 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), San Diego, CA, USA, 2005, pp. 886-893 vol. 1, doi: 10.1109/CVPR.2005.177.&amp;#160;&lt;a href="#fnref:1" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:2">
&lt;p>D. M. Gavrila and V. Philomin, &amp;ldquo;Real-time object detection for &amp;ldquo;smart&amp;rdquo; vehicles,&amp;rdquo; Proceedings of the Seventh IEEE International Conference on Computer Vision, Kerkyra, Greece, 1999, pp. 87-93 vol.1, doi: 10.1109/ICCV.1999.791202.&amp;#160;&lt;a href="#fnref:2" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;/ol>
&lt;/div></description></item><item><title>People Detection: Part-based Approaches</title><link>https://haobin-tan.netlify.app/docs/ai/computer-vision/cv-lecture/11-person_detection-part_based/</link><pubDate>Wed, 07 Jul 2021 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/computer-vision/cv-lecture/11-person_detection-part_based/</guid><description>&lt;h2 id="motivation">Motivation&lt;/h2>
&lt;ul>
&lt;li>Model body-parts separately&lt;/li>
&lt;li>Break down an object’s overall variability into more manageable pieces&lt;/li>
&lt;li>Pieces can be classified by less complex classifiers&lt;/li>
&lt;li>Apply prior knowledge by (manually) splitting the global object into meaningful parts&lt;/li>
&lt;li>Advantages
&lt;ul>
&lt;li>deal better with moving body parts (poses)&lt;/li>
&lt;li>able to handle occlusions, overlaps&lt;/li>
&lt;li>sharing of training data&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Disadvantages
&lt;ul>
&lt;li>require more complex reasoning&lt;/li>
&lt;li>problems with low resolutions&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="part-based-models">Part-based models&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>Two main components&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>parts&lt;/strong> (2D image fragments)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>structure&lt;/strong> (configuration of parts) $\rightarrow$ often also &lt;em>part-combination method&lt;/em>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Fixed spatial layout&lt;/p>
&lt;ul>
&lt;li>Local parts are modeled to have a mostly fixed position and orientation with respect to the object or detection window center&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Flexible Spatial Layout&lt;/p>
&lt;ul>
&lt;li>local parts are allowed to shift in location and scale&lt;/li>
&lt;li>can better handle deformations or articulation changes&lt;/li>
&lt;li>well suited for non-rigid objects&lt;/li>
&lt;li>spatial relations are often modeled probabilistically&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h2 id="the-mohan-people-detector-1">The Mohan People Detector &lt;sup id="fnref:1">&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref">1&lt;/a>&lt;/sup>&lt;/h2>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-07-13%2021.17.24.png" alt="截屏2021-07-13 21.17.24">&lt;/p>
&lt;ul>
&lt;li>4 parts
&lt;ul>
&lt;li>face and shoulder&lt;/li>
&lt;li>legs&lt;/li>
&lt;li>right arm&lt;/li>
&lt;li>left arm&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Fixed layout
&lt;ul>
&lt;li>Body parts are not always at the exact same position&lt;/li>
&lt;li>Allow local shifts: in position and in scale&lt;/li>
&lt;li>Best location has to be found for each detection window&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Combination: Classifier (SVM)&lt;/li>
&lt;li>Detection
&lt;ul>
&lt;li>sliding window approach&lt;/li>
&lt;li>64x128 pixels&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h2 id="the-implicit-shape-model-ism-2">The Implicit Shape Model (ISM) &lt;sup id="fnref:2">&lt;a href="#fn:2" class="footnote-ref" role="doc-noteref">2&lt;/a>&lt;/sup>&lt;/h2>
&lt;p>&lt;strong>💡 Main ideas&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Automatically learn a large number of local parts that occur on the object (referred to as visual vocabulary, bag of words or codebook)&lt;/li>
&lt;li>Learn a star-topology structural model
&lt;ul>
&lt;li>features are considered independent given the object’s center&lt;/li>
&lt;li>likely relative positions are learned from data&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>5 steps&lt;/strong>&lt;/p>
&lt;ol>
&lt;li>&lt;a href="#part-detectionlocalization">Part detection/localization&lt;/a>&lt;/li>
&lt;li>&lt;a href="#part-description">Part description&lt;/a>&lt;/li>
&lt;li>&lt;a href="#learning-part-appearances">Learning part appearance&lt;/a>&lt;/li>
&lt;li>&lt;a href="#learning-the-spatial-layout-of-parts">Learning theh spatial layout of parts&lt;/a>&lt;/li>
&lt;li>&lt;a href="#combination-of-part-detections">Combination of part detections&lt;/a>&lt;/li>
&lt;/ol>
&lt;h3 id="part-detectionlocalization">Part Detection/Localization&lt;/h3>
&lt;p>A good part decomposition needs to be&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Repeatable&lt;/p>
&lt;p>We should be able to find the part despite articulation or image transformations (e.g. invariance to rotation, perspective, lighting)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Distinctive&lt;/p>
&lt;ul>
&lt;li>A part should not be easily confused with other parts; the region should contain an “interesting” structure&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Compact&lt;/p>
&lt;p>No lengthy or strangely shaped parts&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Efficient&lt;/p>
&lt;p>Computationally inexpensive to detect or represent&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Cover&lt;/p>
&lt;p>Parts need to sufficiently cover the object&lt;/p>
&lt;/li>
&lt;/ul>
&lt;blockquote>
&lt;h4 id="local-features">Local features&lt;/h4>
&lt;p>Two components of local features:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>key- or interest-points&lt;/strong> (&lt;em>&amp;ldquo;Where is it?&amp;rdquo;&lt;/em>)
&lt;ul>
&lt;li>specify repeatable points on the object&lt;/li>
&lt;li>consist of x-, y-position and scale&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>local (keypoint) descriptors&lt;/strong> (&lt;em>&amp;ldquo;What does it look like?&amp;rdquo;&lt;/em>)
&lt;ul>
&lt;li>describe the area around an interest point&lt;/li>
&lt;li>i.e. define the feature representation of an interest point&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;p>General approach&lt;/p>
&lt;ul>
&lt;li>Find keypoints using keypoint detector&lt;/li>
&lt;li>Define region around keypoint&lt;/li>
&lt;li>Normalize region&lt;/li>
&lt;li>Compute local descriptor&lt;/li>
&lt;li>Compare descriptors&lt;/li>
&lt;/ul>
&lt;/blockquote>
&lt;h4 id="keypoint-detectors">Keypoint detectors&lt;/h4>
&lt;p>Find reproducible, scale invariant local keypoints in an image&lt;/p>
&lt;p>Keypoint Localization&lt;/p>
&lt;ul>
&lt;li>Goals
&lt;ul>
&lt;li>repeatable detection&lt;/li>
&lt;li>precise localization&lt;/li>
&lt;li>interesting content&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Idea: Look for two-dimensional signal changes&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Hessian Detector&lt;/strong>&lt;/p>
&lt;p>&lt;strong>Search for strong second derivatives in two orthogonal directions&lt;/strong> (Hessian determinant)
&lt;/p>
$$
\operatorname{Hessian}(I)=\left[\begin{array}{ll}
I\_{x x} &amp; I\_{x y} \\\\
I\_{x y} &amp; I\_{y y}
\end{array}\right]
$$
$$
\operatorname{det}(\operatorname{Hessian}(I))=I\_{x x} I\_{y y}-I\_{x y}^{2}
$$
&lt;p>Second Partial Derivative Test: If $\det(H) > 0$, we have a local minimum or maximum.&lt;/p>
&lt;p>Example:&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-07-13%2022.29.41.png" alt="截屏2021-07-13 22.29.41">&lt;/p>
&lt;p>Responses:&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-07-13%2022.37.39.png" alt="截屏2021-07-13 22.37.39" style="zoom:67%;" />
&lt;p>&lt;strong>Handle scale&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Scale Space&lt;/p>
&lt;p>Not only detect a distinctive position, but also a characteristic scale around an interest point&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-07-13%2022.41.21.png" alt="截屏2021-07-13 22.41.21">&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Scale Invariance&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Same operator responses, if the patch contains the same image up to a scale factor&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-07-13%2022.44.08.png" alt="截屏2021-07-13 22.44.08">&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Automatic Scale Selection: Function responses for increasing scale (scale signature)&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Laplacian-of-Gaussian (LoG)&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-07-13%2022.47.07.png" alt="截屏2021-07-13 22.47.07" style="zoom:67%;" />
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="part-description">Part Description&lt;/h3>
&lt;p>Distinctly describe local keypoints and achieve orientation invariance&lt;/p>
&lt;h4 id="local-descriptors">Local Descriptors&lt;/h4>
&lt;ul>
&lt;li>Goal: Describe (local) region around a keypoint&lt;/li>
&lt;li>Most available descriptors focus on &lt;em>edge/gradient&lt;/em> information
&lt;ul>
&lt;li>Capture boundary and texture information&lt;/li>
&lt;li>Color is still used relatively seldom&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Orientation Invariance&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Compute orientation histogram&lt;/li>
&lt;li>Select dominant orientation&lt;/li>
&lt;li>Normalize: rotate to fixed orientation&lt;/li>
&lt;/ul>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-07-13%2022.51.44.png" alt="截屏2021-07-13 22.51.44" style="zoom:67%;" />
&lt;ul>
&lt;li>
&lt;p>&lt;strong>The SIFT descriptor&lt;/strong>: Histogram of gradient orientations&lt;/p>
&lt;ul>
&lt;li>
&lt;p>captures important texture information&lt;/p>
&lt;/li>
&lt;li>
&lt;p>robust to small translations / affine deformations&lt;/p>
&lt;/li>
&lt;li>
&lt;p>How it works (similar to HOG):&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-07-13%2022.53.18.png" alt="截屏2021-07-13 22.53.18" style="zoom:80%;" />
&lt;ul>
&lt;li>region rescaled to a grid of 16x16 pixels (8x8 in image)&lt;/li>
&lt;li>4x4 regions (2x2 in image) = 16 histograms (concatenated)&lt;/li>
&lt;li>histograms: 8 orientation bins, gradients weighted by gradient magnitude&lt;/li>
&lt;li>final descriptor has 128 dimensions and is normalized to compensate for illumination differences&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;blockquote>
&lt;p>A brief introduction: &lt;a href="https://youtu.be/4AvTMVD9ig0">SIFT - 5 Minutes with Cyrill&lt;/a>&lt;/p>
&lt;p>A nice explanation: (source: &lt;a href="https://gilscvblog.com/2013/08/18/a-short-introduction-to-descriptors/">https://gilscvblog.com/2013/08/18/a-short-introduction-to-descriptors/&lt;/a>)&lt;/p>
&lt;p>SIFT was presented in 1999 by David Lowe and includes both a keypoint detector and descriptor. SIFT is computed as follows:&lt;/p>
&lt;ol>
&lt;li>First, detect keypoints using the SIFT detector, which also detects scale and orientation of the keypoint.&lt;/li>
&lt;li>Next, for a given keypoint, warp the region around it to canonical orientation and scale and resize the region to 16X16 pixels.&lt;/li>
&lt;/ol>
&lt;p>&lt;a href="https://gilscvblog.files.wordpress.com/2013/08/figure3.jpg">&lt;img src="https://gilscvblog.files.wordpress.com/2013/08/figure3.jpg?w=600&amp;amp;h=192" alt="SIFT - warping the region around the keypoint">&lt;/a>&lt;/p>
&lt;ol start="3">
&lt;li>
&lt;p>Compute the gradient for each pixel (orientation and magnitude).&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Divide the pixels into 16 squares of 4x4 pixels each.&lt;/p>
&lt;/li>
&lt;/ol>
&lt;p>&lt;a href="https://gilscvblog.files.wordpress.com/2013/08/figure4.jpg">&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/figure4.jpg" alt="SIFT - dividing to squares and calculating orientation">&lt;/a>&lt;/p>
&lt;ol start="5">
&lt;li>For each square, compute gradient direction histogram over 8 directions&lt;/li>
&lt;/ol>
&lt;p>&lt;a href="https://gilscvblog.files.wordpress.com/2013/08/figure5.jpg">&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/figure5.jpg" alt="SIFT - calculating histograms of gradient orientation">&lt;/a>&lt;/p>
&lt;ol start="6">
&lt;li>Concatenate the histograms to obtain a 128 (16*8) dimensional feature vector:&lt;/li>
&lt;/ol>
&lt;p>&lt;a href="https://gilscvblog.files.wordpress.com/2013/08/figure6.jpg">&lt;img src="https://gilscvblog.files.wordpress.com/2013/08/figure6.jpg?w=600&amp;amp;h=50" alt="SIFT - concatenating histograms from different squares">&lt;/a>&lt;/p>
&lt;p>SIFT descriptor illustration:&lt;/p>
&lt;p>&lt;a href="https://gilscvblog.files.wordpress.com/2013/08/figure7.jpg">&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/figure7.jpg" alt="SIFT descriptors illustration">&lt;/a>&lt;/p>
&lt;p>SIFT is invariant to illumination changes, as gradients are invariant to light intensity shift. It’s also somewhat invariant to rotation, as histograms do not contain any geometric information.&lt;/p>
&lt;/blockquote>
&lt;p>&lt;strong>Shape Context Descriptor&lt;/strong>&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-07-13%2022.58.08.png" alt="截屏2021-07-13 22.58.08">&lt;/p>
&lt;h4 id="what-local-features-should-i-use">&lt;strong>What Local Features Should I Use?&lt;/strong>&lt;/h4>
&lt;ul>
&lt;li>Best choice often application dependent
&lt;ul>
&lt;li>Harris-/Hessian-Laplace/DoG work well for many natural categories&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>More features are better
&lt;ul>
&lt;li>combining several detectors often helps&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="learning-part-appearances">Learning Part Appearances&lt;/h3>
&lt;h4 id="visual-vocabulary">Visual Vocabulary&lt;/h4>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-07-13%2023.12.03.png" alt="截屏2021-07-13 23.12.03">&lt;/p>
&lt;ol>
&lt;li>Detect keypoints on all person training examples&lt;/li>
&lt;li>Compute local descriptors for all keypoints&lt;/li>
&lt;/ol>
&lt;p>$\rightarrow$ Result: Large set of local image descriptors that all occur on people&lt;/p>
&lt;p>Group visually similar local descriptors&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-07-13%2023.14.16.png" alt="截屏2021-07-13 23.14.16">&lt;/p>
&lt;ul>
&lt;li>
&lt;p>similar local descriptors = parts that are reoccurring&lt;/p>
&lt;/li>
&lt;li>
&lt;p>parts that occur only rarely are discarded (they could result from noise or background structures)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>result: descriptor groups representing human body parts&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Grouping Algorithms / Clustering&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Partitional Clustering
&lt;ul>
&lt;li>K-Means&lt;/li>
&lt;li>Gaussian Mixture Clustering (EM)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Hierarchical or Agglomerative Clustering
&lt;ul>
&lt;li>Single-Link (minimum)&lt;/li>
&lt;li>Group-Average&lt;/li>
&lt;li>Ward’s method (minimum variance)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
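&lt;p>A minimal k-means sketch for building the visual vocabulary: the cluster centers play the role of visual words. The descriptors here are synthetic, and the farthest-point initialization is an assumption chosen to keep the small example deterministic:&lt;/p>

```python
import numpy as np

def kmeans(X, k=2, iters=10):
    """Toy k-means: cluster local descriptors X (n x d) into k visual words."""
    # farthest-point initialization (deterministic for this example)
    centers = [X[0]]
    for _ in range(k - 1):
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[d.argmax()])
    centers = np.array(centers)
    for _ in range(iters):
        # assign every descriptor to its nearest center ...
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # ... and move each center to the mean of its group
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers, labels

# Two well-separated synthetic descriptor groups -> two visual words.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.1, (30, 8)), rng.normal(5, 0.1, (30, 8))])
centers, labels = kmeans(X, k=2)
```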
&lt;h3 id="learning-the-spatial-layout-of-parts">Learning the Spatial Layout of Parts&lt;/h3>
&lt;p>&lt;strong>Spatial Occurrence (Star-Model)&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Record spatial occurrence&lt;/p>
&lt;ul>
&lt;li>
&lt;p>match vocabulary entries to training images&lt;/p>
&lt;/li>
&lt;li>
&lt;p>record occurrence distributions with respect to object center (location $(x, y)$ and scale)&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-07-13%2023.18.32.png" alt="截屏2021-07-13 23.18.32">&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Generalized Hough Transform&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>For every feature, store possible “occurrences”&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-07-13%2023.19.41.png" alt="截屏2021-07-13 23.19.41">&lt;/p>
&lt;/li>
&lt;li>
&lt;p>For new image, let the matched features vote for possible object positions&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-07-13%2023.20.24.png" alt="截屏2021-07-13 23.20.24">&lt;/p>
&lt;/li>
&lt;/ul>
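&lt;p>The voting step can be illustrated with a toy generalized Hough transform (a hypothetical two-word vocabulary with made-up offsets; real ISM votes also include scale and are weighted):&lt;/p>

```python
from collections import defaultdict

# Each vocabulary entry stores learned offsets from the part to the object center.
occurrences = {
    "head": [(0, -40)],   # object center lies 40 px below a head match
    "foot": [(0, 40)],    # ... and 40 px above a foot match
}
matches = [("head", (50, 90)), ("foot", (50, 10)), ("head", (120, 95))]

votes = defaultdict(int)
for word, (x, y) in matches:
    for dx, dy in occurrences[word]:
        votes[(x + dx, y + dy)] += 1   # each match votes for a center hypothesis

center, count = max(votes.items(), key=lambda kv: kv[1])
print(center, count)  # (50, 50) 2  -> head and foot agree on one person
```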
&lt;h3 id="combination-of-part-detections">Combination of Part Detections&lt;/h3>
&lt;p>ISM Detection Procedure:&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-07-13%2023.21.20.png" alt="截屏2021-07-13 23.21.20">&lt;/p>
&lt;div class="footnotes" role="doc-endnotes">
&lt;hr>
&lt;ol>
&lt;li id="fn:1">
&lt;p>A. Mohan, C. Papageorgiou and T. Poggio, &amp;ldquo;Example-based object detection in images by components,&amp;rdquo; in &lt;em>IEEE Transactions on Pattern Analysis and Machine Intelligence&lt;/em>, vol. 23, no. 4, pp. 349-361, April 2001, doi: 10.1109/34.917571.&amp;#160;&lt;a href="#fnref:1" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:2">
&lt;p>Leibe, B. &amp;amp; Leonardis, Ales &amp;amp; Schiele, B.. (2004). Combined object categorization and segmentation with an implicit shape model. Proc. 8th Eur. Conf. Comput. Vis. (ECCV). 2.&amp;#160;&lt;a href="#fnref:2" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;/ol>
&lt;/div></description></item><item><title>People Detection: Deep Learning Approaches</title><link>https://haobin-tan.netlify.app/docs/ai/computer-vision/cv-lecture/12-person_detection-deep_learning/</link><pubDate>Fri, 16 Jul 2021 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/computer-vision/cv-lecture/12-person_detection-deep_learning/</guid><description>&lt;h2 id="deep-learning-for-object-detection">Deep Learning for Object Detection&lt;/h2>
&lt;ul>
&lt;li>People detection is a special case of object detection (one of the most challenging object classes to detect)&lt;/li>
&lt;li>Recently, most detectors are trained for the more challenging task of multi-object detection
&lt;ul>
&lt;li>Goal: Given an image, detect all instances of, say, 1000 different object classes&lt;/li>
&lt;li>“Person” always one of the classes&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Speed&lt;/strong> is an issue
&lt;ul>
&lt;li>&lt;strong>Sliding Window&lt;/strong>: Look at each position, each scale&lt;/li>
&lt;li>&lt;strong>Cascades&lt;/strong> look at each position too
&lt;ul>
&lt;li>They just take a shorter look at most positions/scales&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Region Proposals&lt;/strong>: Avoid useless positions/scales from the beginning&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="region-proposals">Region Proposals&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>💡Idea&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Identify image regions that are likely to contain an object&lt;/strong>&lt;/li>
&lt;li>&lt;strong>Don’t care about the object class in the regions at this point&lt;/strong>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Characterization of a general object&lt;/p>
&lt;ul>
&lt;li>Find “blobby” regions&lt;/li>
&lt;li>Find connected regions that are somehow distinct from their surroundings&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Requirements&lt;/p>
&lt;ul>
&lt;li>FAST!!!&lt;/li>
&lt;li>High recall&lt;/li>
&lt;li>Can tolerate a relatively high number of false positives&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>2 main categories&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Grouping methods&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Generate proposals based on hierarchically grouping meaningful image regions&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Often better localization&lt;/p>
&lt;/li>
&lt;li>
&lt;p>E.g. Selective search&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-07-16%2022.54.32-20210719164944153.png" alt="截屏2021-07-16 22.54.32">&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Window scoring methods&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Generate a large amount of windows&lt;/li>
&lt;li>Use a quickly computed cue to discard unlikely windows (“objectness” measure)&lt;/li>
&lt;li>Often faster&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
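&lt;p>As a rough illustration of the window-scoring idea (not any specific published method), the following sketch slides fixed-size windows over an image and ranks them with a cheap, made-up &amp;ldquo;objectness&amp;rdquo; cue, here simply the summed gradient magnitude inside the window:&lt;/p>

```python
import numpy as np

def score_windows(image, win=32, stride=16, top_k=5):
    """Toy window-scoring proposal method: slide fixed-size windows
    over the image and rank them by a cheap 'objectness' cue
    (here the summed gradient magnitude inside each window)."""
    gy, gx = np.gradient(image.astype(float))
    mag = np.abs(gx) + np.abs(gy)
    proposals = []
    h, w = image.shape
    for y in range(0, h - win + 1, stride):
        for x in range(0, w - win + 1, stride):
            score = mag[y:y + win, x:x + win].sum()
            proposals.append((score, (x, y, win, win)))
    proposals.sort(key=lambda p: p[0], reverse=True)
    return [box for _, box in proposals[:top_k]]  # highest-scoring windows first

# Usage: a blank image with one textured square; the top proposal
# should land on the textured region.
np.random.seed(0)
img = np.zeros((96, 96))
img[40:72, 40:72] = np.random.rand(32, 32)
print(score_windows(img)[0])
```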
&lt;div class="flex px-4 py-3 mb-6 rounded-md bg-primary-100 dark:bg-primary-900">
&lt;span class="pr-3 pt-1 text-primary-600 dark:text-primary-300">
&lt;svg height="24" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24">&lt;path fill="none" stroke="currentColor" stroke-linecap="round" stroke-linejoin="round" stroke-width="1.5" d="m11.25 11.25l.041-.02a.75.75 0 0 1 1.063.852l-.708 2.836a.75.75 0 0 0 1.063.853l.041-.021M21 12a9 9 0 1 1-18 0a9 9 0 0 1 18 0m-9-3.75h.008v.008H12z"/>&lt;/svg>
&lt;/span>
&lt;span class="dark:text-neutral-300">For more details and comparison, see: &lt;a href="https://haobin-tan.netlify.app/docs/ai/computer-vision/object-detection/overview-region-based-detectors/">Overview of Region-based Object Detectors&lt;/a>&lt;/span>
&lt;/div>
&lt;h2 id="r-cnn-1">R-CNN &lt;sup id="fnref:1">&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref">1&lt;/a>&lt;/sup>&lt;/h2>
&lt;h3 id="idea-and-structure">Idea and structure&lt;/h3>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-07-19%2016.46.22.png" alt="截屏2021-07-19 16.46.22">&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-07-19%2016.50.11.png" alt="截屏2021-07-19 16.50.11">&lt;/p>
&lt;h3 id="training">Training&lt;/h3>
&lt;ol>
&lt;li>
&lt;p>Train AlexNet on ImageNet (1000 classes)&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-07-19%2016.53.30.png" alt="截屏2021-07-19 16.53.30">&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Re-initialize the last layers to a different dimension (depending on the number of classes of the new classifier) and train the new model&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-07-19%2016.54.29.png" alt="截屏2021-07-19 16.54.29">&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Train a classifier&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Binary&lt;/strong> SVMs (e.g. is human? yes/no) for each object class $\rightarrow$ $C$ SVMs in our case&lt;/p>
&lt;/li>
&lt;li>
&lt;p>The outputs of pool5 of the retrained AlexNet are used as features&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-07-19%2016.58.56.png" alt="截屏2021-07-19 16.58.56">&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Improve the region proposals&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Use a regression model to improve the estimated location of the object&lt;/p>
&lt;ul>
&lt;li>Input: features of proposed region (pool5)&lt;/li>
&lt;li>Output: x, y, width, height of the estimated region&lt;/li>
&lt;/ul>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-07-19%2016.59.53.png" alt="截屏2021-07-19 16.59.53">&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol>
&lt;h3 id="downsides">Downsides&lt;/h3>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Speed&lt;/strong>: Need to forward-pass &lt;strong>EACH&lt;/strong> region proposal through entire CNN!!!&lt;/p>
&lt;/li>
&lt;li>
&lt;p>SVM &amp;amp; BBox regressor are trained after CNN is fixed&lt;/p>
&lt;ul>
&lt;li>No simultaneous update/adaptation of CNN features possible&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Complexity: multi-stage approach&lt;/p>
&lt;/li>
&lt;/ol>
&lt;p>Improvement:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>For 1: Can we make (part of) the CNN run only once for all proposals?&lt;/p>
&lt;/li>
&lt;li>
&lt;p>For 2&amp;amp;3: Can we make the CNN perform these steps?&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h2 id="fast-r-cnn-2">Fast R-CNN &lt;sup id="fnref:2">&lt;a href="#fn:2" class="footnote-ref" role="doc-noteref">2&lt;/a>&lt;/sup>&lt;/h2>
&lt;h3 id="overview">Overview&lt;/h3>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-07-19%2021.02.20.png" alt="截屏2021-07-19 21.02.20">&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-07-19%2021.03.23.png" alt="截屏2021-07-19 21.03.23" style="zoom:80%;" />
&lt;h3 id="roi-pooling">ROI pooling&lt;/h3>
&lt;ul>
&lt;li>Conv layers don’t care about input size, FC layers do&lt;/li>
&lt;li>&lt;strong>ROI pooling&lt;/strong>: warp the variable-size ROIs into a predefined fixed-size shape.&lt;/li>
&lt;/ul>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-07-19%2021.14.14.png" alt="截屏2021-07-19 21.14.14" style="zoom:67%;" />
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/1*5V5mycIRNu-mK-rPywL57w-20210719211433781.gif" alt="Image for post" style="zoom:67%;" />
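&lt;p>A minimal NumPy sketch of the ROI max-pooling operation (single channel, integer bin edges; real implementations pool every channel and handle sub-pixel ROI boundaries):&lt;/p>

```python
import numpy as np

def roi_pool(feature_map, roi, out_size=2):
    """Minimal ROI max-pooling sketch: split a variable-size ROI of a
    2D feature map into an out_size x out_size grid of bins and take
    the max of each bin, yielding a fixed-size output for the FC layers."""
    x0, y0, x1, y1 = roi  # ROI corners in feature-map coordinates
    region = feature_map[y0:y1, x0:x1]
    h, w = region.shape
    # Bin edges (bins become uneven when the ROI size is not divisible)
    ys = np.linspace(0, h, out_size + 1).astype(int)
    xs = np.linspace(0, w, out_size + 1).astype(int)
    out = np.empty((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            out[i, j] = region[ys[i]:ys[i + 1], xs[j]:xs[j + 1]].max()
    return out

fmap = np.arange(64, dtype=float).reshape(8, 8)
print(roi_pool(fmap, (0, 0, 5, 7)))  # 7x5 ROI -> fixed 2x2 output
```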
&lt;h3 id="end-to-end-training">End-to-end training&lt;/h3>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-07-19%2021.16.14.png" alt="截屏2021-07-19 21.16.14" style="zoom:70%;" />
&lt;ul>
&lt;li>Instead of SVM &amp;amp; Regressor just add corresponding losses and train the system for both (multitask)&lt;/li>
&lt;li>Gradients can backprop. into feature layers through ROI pooling layers (just as with normal maxpool layers)&lt;/li>
&lt;li>End-to-end brings slight improvement 👏&lt;/li>
&lt;li>Softmax (integrated) loss slightly but consistently outperforms external classifier 👏&lt;/li>
&lt;/ul>
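&lt;p>The multitask objective can be sketched as softmax cross-entropy plus a smooth-L1 box-regression term that is only active for non-background ROIs. The class scores and box values below are illustrative numbers, not real network outputs:&lt;/p>

```python
import numpy as np

def smooth_l1(x):
    """Smooth L1 (Huber-like) loss used for box regression in Fast R-CNN."""
    x = np.abs(x)
    return np.where(x < 1, 0.5 * x ** 2, x - 0.5).sum()

def multitask_loss(class_scores, true_class, box_pred, box_target, lam=1.0):
    """Sketch of the Fast R-CNN multitask loss for one ROI:
    softmax cross-entropy over classes + smooth-L1 box regression,
    where the regression term only counts for non-background ROIs."""
    # Softmax cross-entropy for the classification head
    scores = class_scores - class_scores.max()          # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum())
    cls_loss = -log_probs[true_class]
    # Box regression is skipped if the ROI is background (class 0)
    loc_loss = smooth_l1(box_pred - box_target) if true_class > 0 else 0.0
    return cls_loss + lam * loc_loss

scores = np.array([0.2, 2.0, 0.1])                      # 3 classes, 0 = background
loss = multitask_loss(scores, 1, np.array([0.1, 0.2, 0.0, 0.0]), np.zeros(4))
print(loss)
```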
&lt;h3 id="fast-r-cnn-vs-r-cnn">&lt;strong>Fast R-CNN vs R-CNN&lt;/strong>&lt;/h3>
&lt;p>Speed:&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-07-19%2021.21.53.png" alt="截屏2021-07-19 21.21.53" style="zoom:67%;" />
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-07-19%2021.23.51.png" alt="截屏2021-07-19 21.23.51" style="zoom:67%;" />
&lt;h3 id="downsides-1">Downsides&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>The majority of runtime is spent on computing region proposals&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Model is also not fully end-to-end: proposals come from “outside”&lt;/p>
&lt;p>(Can we include them in the CNN as well? 🤔)&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h2 id="faster-r-cnn3">Faster R-CNN&lt;sup id="fnref:3">&lt;a href="#fn:3" class="footnote-ref" role="doc-noteref">3&lt;/a>&lt;/sup>&lt;/h2>
&lt;h3 id="overview-1">Overview&lt;/h3>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/1*JQfhkHK6V8NRuh-97Pg4lQ-20210719212516440.png" alt="Image for post" style="zoom:67%;" />
&lt;h3 id="region-proposal-network-rpn">&lt;strong>Region Proposal Network (RPN)&lt;/strong>&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>Input: Feature map from larger conv network of size $C \times W \times H$&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Output&lt;/p>
&lt;ul>
&lt;li>List of $p$ proposals&lt;/li>
&lt;li>&amp;ldquo;Objectness&amp;rdquo; score of size $p \times 6$
&lt;ul>
&lt;li>$p \times 4$ coordinates (top-left and bottom-right $(x,y)$ coordinates) for bounding box&lt;/li>
&lt;li>$p \times 2$ for objectness (with vs. without object) per location&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>General approach:&lt;/p>
&lt;ul>
&lt;li>Take a mini net (RPN) and slide it over the feature map (stepsize 1)&lt;/li>
&lt;li>At each position evaluate $k$ different window sizes for objectness&lt;/li>
&lt;li>Results in approx. $W \times H \times k$ windows/proposals&lt;/li>
&lt;/ul>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/1*PszFnq3rqa_CAhBrI94Eeg-20210719214347181.png" alt="Image for post">&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Fully convolutional network&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Anchors&lt;/strong>: tackle the scale problem of the feature map&lt;/p>
&lt;ul>
&lt;li>Initial reference boxes consisting of aspect ratio and scale, centered at sliding window&lt;/li>
&lt;li>3 scales and 3 aspect ratios = 9 anchors&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Layers&lt;/p>
&lt;ul>
&lt;li>reg layer: regression of the reference anchor&lt;/li>
&lt;li>cls layer: object/no object score&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
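&lt;p>The 9 reference anchors per sliding-window position can be generated as follows (a sketch using the scales and aspect ratios from the Faster R-CNN paper; each anchor keeps an area of roughly scale², while the ratio redistributes it between width and height):&lt;/p>

```python
import numpy as np

def make_anchors(center, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Generate the 9 Faster R-CNN reference anchors (3 scales x 3 aspect
    ratios) centered at one sliding-window position. Each anchor preserves
    an area of scale**2; the ratio trades width against height."""
    cx, cy = center
    anchors = []
    for s in scales:
        for r in ratios:
            w = s * np.sqrt(1.0 / r)   # wider for small ratios
            h = s * np.sqrt(r)         # taller for large ratios
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors

print(len(make_anchors((0, 0))))  # 9 anchors per position
```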
&lt;h4 id="loss">Loss&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>Need a label for each anchor to train the objectness classification&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Labelling anchors&lt;/p>
&lt;ul>
&lt;li>Positive: highest IoU with groundtruth &lt;em>or&lt;/em> IoU &amp;gt; 0.7 (can be more than one)
&lt;ul>
&lt;li>Also store the association between anchor and groundtruth box&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Negative: others, if their IoU &amp;lt; 0.3&lt;/li>
&lt;li>Other anchors do not contribute to training&lt;/li>
&lt;/ul>
&lt;p>$\rightarrow$ Convert to classification problem&lt;/p>
&lt;/li>
&lt;li>
&lt;p>RPN multitask loss:&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-07-23%2023.39.45.png" alt="截屏2021-07-23 23.39.45">&lt;/p>
&lt;ul>
&lt;li>$N\_{cls}$: Batch size (256)&lt;/li>
&lt;li>$N\_{reg}$: number of window positions ($\approx$ 2400)&lt;/li>
&lt;li>$\lambda = 10$&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
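&lt;p>The anchor-labelling rule above boils down to an IoU computation. A plain-Python sketch (simplified: the &amp;ldquo;highest IoU per ground-truth box&amp;rdquo; clause is omitted for brevity):&lt;/p>

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x0, y0, x1, y1)."""
    ix0, iy0 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix1, iy1 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def label_anchor(anchor, gt_boxes, pos_thr=0.7, neg_thr=0.3):
    """Label one anchor for RPN training: positive (1) if its IoU with
    some ground-truth box exceeds pos_thr, negative (0) if all IoUs
    fall below neg_thr, otherwise ignored (None)."""
    best = max(iou(anchor, gt) for gt in gt_boxes)
    if best > pos_thr:
        return 1
    if best < neg_thr:
        return 0
    return None

gt = [(10, 10, 50, 50)]
print(label_anchor((12, 12, 52, 52), gt))     # heavy overlap -> positive
print(label_anchor((100, 100, 140, 140), gt)) # no overlap -> negative
```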
&lt;h3 id="training-1">Training&lt;/h3>
&lt;h4 id="as-in-paper">As in paper&lt;/h4>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-07-19%2021.54.06.png" alt="截屏2021-07-19 21.54.06" style="zoom:67%;" />
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-07-19%2021.54.33.png" alt="截屏2021-07-19 21.54.33" style="zoom:67%;" />
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-07-19%2021.54.59.png" alt="截屏2021-07-19 21.54.59" style="zoom:67%;" />
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-07-19%2021.55.21.png" alt="截屏2021-07-19 21.55.21" style="zoom:67%;" />
&lt;h3 id="jointly">Jointly&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>Train everything in one go&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Combination of four losses&lt;/p>
&lt;ul>
&lt;li>objectness classification&lt;/li>
&lt;li>anchor regression&lt;/li>
&lt;li>object class classification&lt;/li>
&lt;li>detection regression&lt;/li>
&lt;/ul>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/fast_rcnn_loss.png" alt="fast_rcnn_loss" style="zoom: 67%;" />
&lt;/li>
&lt;/ul>
&lt;blockquote>
&lt;p>Why two regression losses?&lt;/p>
&lt;p>Anchor regression directly impacts the features used for detection; detection regression merely improves the final localization.&lt;/p>
&lt;/blockquote>
&lt;h3 id="comparison-between-all-the-r-cnns">Comparison between all the R-CNNs&lt;/h3>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-07-19%2021.58.56.png" alt="截屏2021-07-19 21.58.56" style="zoom:80%;" />
&lt;h2 id="ssd-detector-4">SSD Detector &lt;sup id="fnref:4">&lt;a href="#fn:4" class="footnote-ref" role="doc-noteref">4&lt;/a>&lt;/sup>&lt;/h2>
&lt;h3 id="motivation">Motivation&lt;/h3>
&lt;p>Thus far, deep multiclass detectors rely on variants of three steps:&lt;/p>
&lt;ul>
&lt;li>generate bounding boxes (proposals)&lt;/li>
&lt;li>resample pixels/features in boxes to uniform size&lt;/li>
&lt;li>apply high quality classifier&lt;/li>
&lt;/ul>
&lt;p>Can we avoid / speed up any of those steps to increase overall speed?&lt;/p>
&lt;h3 id="overview-2">Overview&lt;/h3>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-07-19%2023.05.14.png" alt="截屏2021-07-19 23.05.14">&lt;/p>
&lt;ul>
&lt;li>💡&lt;strong>Core Idea: Use a set of fixed default boxes at each position in a feature map (similar to anchors)&lt;/strong>&lt;/li>
&lt;li>Classify object class and box regression for each default box&lt;/li>
&lt;li>Apply boxes at different layers in the ConvNet
&lt;ul>
&lt;li>Use layers of different sizes&lt;/li>
&lt;li>Avoids the need for rescaling&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="structure">Structure&lt;/h3>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-07-19%2023.05.56.png" alt="截屏2021-07-19 23.05.56">&lt;/p>
&lt;ul>
&lt;li>Detectors at various stages with varying numbers of default boxes&lt;/li>
&lt;li>Resulting number of detections is fixed&lt;/li>
&lt;li>Reduced by non-maximum suppression&lt;/li>
&lt;/ul>
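&lt;p>Non-maximum suppression, which reduces the fixed set of detections to the final ones, can be sketched as a greedy procedure:&lt;/p>

```python
def nms(boxes, scores, iou_thr=0.5):
    """Greedy non-maximum suppression: repeatedly keep the
    highest-scoring box and drop all remaining boxes that overlap it
    by more than iou_thr (boxes given as (x0, y0, x1, y1))."""
    def iou(a, b):
        ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
        ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
        area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
        return inter / (area(a) + area(b) - inter)

    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_thr]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # the overlapping duplicate of box 0 is suppressed
```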
&lt;div class="footnotes" role="doc-endnotes">
&lt;hr>
&lt;ol>
&lt;li id="fn:1">
&lt;p>Girshick, R., Donahue, J., Darrell, T., &amp;amp; Malik, J. (2014). Rich feature hierarchies for accurate object detection and semantic segmentation. &lt;em>Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition&lt;/em>, 580–587. &lt;a href="https://doi.org/10.1109/CVPR.2014.81">https://doi.org/10.1109/CVPR.2014.81&lt;/a>&amp;#160;&lt;a href="#fnref:1" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:2">
&lt;p>Girshick, R. (2015). Fast R-CNN. &lt;em>Proceedings of the IEEE International Conference on Computer Vision&lt;/em>, &lt;em>2015 International Conference on Computer Vision&lt;/em>, &lt;em>ICCV 2015&lt;/em>, 1440–1448. &lt;a href="https://doi.org/10.1109/ICCV.2015.169">https://doi.org/10.1109/ICCV.2015.169&lt;/a>&amp;#160;&lt;a href="#fnref:2" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:3">
&lt;p>Ren, S., He, K., Girshick, R., &amp;amp; Sun, J. (2017). Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. &lt;em>IEEE Transactions on Pattern Analysis and Machine Intelligence&lt;/em>, &lt;em>39&lt;/em>(6), 1137–1149. &lt;a href="https://doi.org/10.1109/TPAMI.2016.2577031">https://doi.org/10.1109/TPAMI.2016.2577031&lt;/a>&amp;#160;&lt;a href="#fnref:3" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:4">
&lt;p>Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu: “SSD: Single Shot MultiBox Detector”, 2016; &lt;a href="http://arxiv.org/abs/1512.02325">arXiv:1512.02325&lt;/a>.&amp;#160;&lt;a href="#fnref:4" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;/ol>
&lt;/div></description></item><item><title>Tracking</title><link>https://haobin-tan.netlify.app/docs/ai/computer-vision/cv-lecture/13-tracking/</link><pubDate>Mon, 19 Jul 2021 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/computer-vision/cv-lecture/13-tracking/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;h3 id="tracking-vs-detection">Tracking Vs. Detection&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Detection&lt;/strong>: Find an object in a &lt;strong>single image&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Face, person, body part, facial landmarks, &amp;hellip;&lt;/li>
&lt;li>No assumption about dynamics, temporal consistency made&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Tracking&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>determine a target&amp;rsquo;s locations (and/or rotation, deformation, pose, &amp;hellip;) &lt;strong>over a sequence of images&lt;/strong>&lt;/p>
&lt;p>i.e.: determine the target&amp;rsquo;s &lt;strong>state&lt;/strong> (location and/or rotation, deformation, pose, &amp;hellip;) &lt;strong>over a sequence&lt;/strong> of &lt;strong>observations&lt;/strong> derived from images&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Provides object positions (etc.) in each frame&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="motivation">Motivation&lt;/h3>
&lt;ul>
&lt;li>Use more than one image to analyse the scene&lt;/li>
&lt;li>Use a-priori knowledge to improve analysis
&lt;ul>
&lt;li>system dynamics, imaging / measurement process, &amp;hellip;&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="target-types">Target types&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Single objects&lt;/strong>: face, person, &amp;hellip;&lt;/li>
&lt;li>&lt;strong>Multiple objects&lt;/strong>: group of people, head and hands, &amp;hellip;&lt;/li>
&lt;li>&lt;strong>Articulated body&lt;/strong>: full body, hand&lt;/li>
&lt;/ul>
&lt;h3 id="sensor-setup">Sensor setup&lt;/h3>
&lt;ul>
&lt;li>Single camera&lt;/li>
&lt;li>Multiple cameras&lt;/li>
&lt;li>Active cameras&lt;/li>
&lt;li>Cameras + microphones&lt;/li>
&lt;/ul>
&lt;h3 id="observations-used-for-tracking">observations used for tracking&lt;/h3>
&lt;ul>
&lt;li>Templates&lt;/li>
&lt;li>Color&lt;/li>
&lt;li>Foreground-Background segmentation&lt;/li>
&lt;li>Edges&lt;/li>
&lt;li>Dense Disparity&lt;/li>
&lt;li>Optical flow&lt;/li>
&lt;li>Detectors (body, body parts)&lt;/li>
&lt;/ul>
&lt;h2 id="tracking-as-state-estimation">&lt;strong>Tracking as State Estimation&lt;/strong>&lt;/h2>
&lt;ul>
&lt;li>Want to predict state of the system (position, pose, &amp;hellip;)
&lt;ul>
&lt;li>But state cannot directly be measured&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Only certain observations (measurements) can be made
&lt;ul>
&lt;li>But observations are noisy! (due to measurement errors)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;p>What is the most likely state $x$ of the system at a given time, given a sequence of observations $Z\_t$ ?
&lt;/p>
$$
\arg \max \_{x\_{t}} p\left(x\_{t} \mid Z\_{t}\right)
$$
&lt;ul>
&lt;li>
&lt;p>$x\_t$: state of the system at time $t$&lt;/p>
&lt;/li>
&lt;li>
&lt;p>$z\_t$: Observation / measurement about certain aspects of the system at time $t$&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Observations up to time $t$: $z\_{1:t}$ or $Z\_t$&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h3 id="bayes-filter">Bayes Filter&lt;/h3>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-07-19%2023.53.57.png" alt="截屏2021-07-19 23.53.57">&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Assume state $x$ to be Markov process
&lt;/p>
$$
p\left(x\_{t} \mid x\_{t-1}, x\_{t-2}, . ., x\_{0}\right)=p\left(x\_{t} \mid x\_{t-1}\right)
$$
&lt;/li>
&lt;li>
&lt;p>States $x$ generate observations $z$
&lt;/p>
$$
p\left(z\_{t} \mid x\_{t}, x\_{t-1}, . ., x\_{0}\right)=p\left(z\_{t} \mid x\_{t}\right)
$$
&lt;/li>
&lt;li>
&lt;p>Want to estimate most likely state $x\_t$ given sequence $Z\_t$:
&lt;/p>
$$
\arg \max \_{x\_{t}} p\left(x\_{t} \mid Z\_{t}\right)
$$
&lt;/li>
&lt;li>
&lt;p>Can be estimated &lt;strong>recursively&lt;/strong>&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-07-20%2010.01.52.png" alt="截屏2021-07-20 10.01.52">&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Need:&lt;/p>
&lt;ul>
&lt;li>Process model: $p(x\_t | x\_{t-1})$&lt;/li>
&lt;li>Measurement model: $p(z\_t | x\_t)$&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
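&lt;p>On a discrete state space, the recursive estimate can be sketched in a few lines; the transition matrix and likelihood below are made-up toy numbers:&lt;/p>

```python
import numpy as np

def bayes_filter_step(belief, transition, likelihood):
    """One step of the recursive Bayes filter on a discrete state space.
    belief[j]        = p(x_{t-1} = j | Z_{t-1})
    transition[i, j] = process model p(x_t = i | x_{t-1} = j)
    likelihood[i]    = measurement model p(z_t | x_t = i)
    Returns p(x_t | Z_t)."""
    predicted = transition @ belief          # prediction with the process model
    posterior = likelihood * predicted       # correction with the new observation
    return posterior / posterior.sum()       # normalize

# Toy example: 3 positions, the target tends to move one cell to the right
belief = np.array([1.0, 0.0, 0.0])
transition = np.array([[0.2, 0.0, 0.0],
                       [0.8, 0.2, 0.0],
                       [0.0, 0.8, 1.0]])
likelihood = np.array([0.1, 0.8, 0.1])       # sensor strongly suggests cell 1
print(bayes_filter_step(belief, transition, likelihood))
```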
&lt;blockquote>
&lt;p>Helpful resource:&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://www.youtube.com/watch?v=qDvd5lu80bA&amp;amp;ab_channel=Udacity">Bayes Filters&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://zhuanlan.zhihu.com/p/37028239">概率机器人——贝叶斯滤波&lt;/a>&lt;/li>
&lt;/ul>
&lt;/blockquote>
&lt;h3 id="kalman-filter">Kalman filter&lt;/h3>
&lt;ul>
&lt;li>An instance of a Bayes filter&lt;/li>
&lt;li>Assumes
&lt;ul>
&lt;li>&lt;em>Linear&lt;/em> state propagation and measurement model&lt;/li>
&lt;li>&lt;em>Gaussian&lt;/em> process and measurement noise&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;p>The process to be estimated:
&lt;/p>
$$
\begin{array}{ll}
x\_{k}=A x\_{k-1}+w\_{k-1} &amp; \quad p(w) \sim N(0, Q) \\\\
z\_{k}=H x\_{k}+v\_{k} &amp; \quad p(v) \sim N(0, R)
\end{array}
$$
&lt;ul>
&lt;li>$x\_k$: state at time $k$&lt;/li>
&lt;li>$A$: transition matrix&lt;/li>
&lt;li>$z\_k$: observation at time $k$&lt;/li>
&lt;li>$H$: measurement matrix&lt;/li>
&lt;li>$p(w) \sim N(0, Q)$: process noise&lt;/li>
&lt;li>$p(v) \sim N(0, R)$: measurement noise&lt;/li>
&lt;/ul>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-07-20%2010.16.25.png" alt="截屏2021-07-20 10.16.25">&lt;/p>
&lt;p>Note:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>The simple Kalman Filter is NOT applicable when the process to be estimated is NOT linear or the measurement relationship to the process is NOT linear.&lt;/p>
&lt;p>$\rightarrow$ The &lt;strong>Extended Kalman Filter (EKF)&lt;/strong> linearizes about the current mean and covariance&lt;/p>
&lt;/li>
&lt;/ul>
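&lt;p>A minimal NumPy sketch of one Kalman predict/correct cycle in the notation above, applied to a toy constant-velocity tracking model (all matrices below are illustrative choices):&lt;/p>

```python
import numpy as np

def kalman_step(x, P, z, A, H, Q, R):
    """One predict/correct cycle of the (linear) Kalman filter for the
    process model x_k = A x_{k-1} + w, z_k = H x_k + v with
    w ~ N(0, Q), v ~ N(0, R). x is the state estimate, P its covariance."""
    # Predict
    x_pred = A @ x
    P_pred = A @ P @ A.T + Q
    # Correct
    S = H @ P_pred @ H.T + R                     # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)          # Kalman gain
    x_new = x_pred + K @ (z - H @ x_pred)
    P_new = (np.eye(len(x)) - K @ H) @ P_pred
    return x_new, P_new

# Constant-velocity model: state = (position, velocity), measure position only
A = np.array([[1.0, 1.0], [0.0, 1.0]])
H = np.array([[1.0, 0.0]])
Q = 0.01 * np.eye(2)
R = np.array([[0.5]])
x, P = np.zeros(2), np.eye(2)
for z in [1.0, 2.0, 3.0, 4.0]:                   # noise-free track moving at speed 1
    x, P = kalman_step(x, P, np.array([z]), A, H, Q, R)
print(x)  # estimate approaches position 4, velocity 1
```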
&lt;h3 id="paticle-filter">Paticle Filter&lt;/h3>
&lt;blockquote>
&lt;p>Helpful resources:&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://www.youtube.com/watch?v=_LjBba2hnfk&amp;amp;ab_channel=CyrillStachniss">Particle Filters Basic Idea&lt;/a>&lt;/li>
&lt;/ul>
&lt;/blockquote>
&lt;ul>
&lt;li>The Kalman Filter often fails when the measurement density is &lt;em>multimodal / non-Gaussian.&lt;/em>&lt;/li>
&lt;li>A &lt;strong>Particle Filter&lt;/strong> represents and propagates arbitrary probability distributions. They are represented by a &lt;em>set of weighted samples&lt;/em>.
&lt;ul>
&lt;li>The Particle Filtering is a &lt;em>numerical&lt;/em> technique (unlike the Kalman filter which is analytical).&lt;/li>
&lt;li>Like a Kalman Filter, a Particle Filter incorporates a &lt;em>dynamic model&lt;/em> describing system dynamics&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h4 id="bayesian-tracking">Bayesian Tracking&lt;/h4>
&lt;p>Bayes rule applied to tracking
&lt;/p>
$$
\arg \max \_{x\_{t}} p\left(x\_{t} \mid Z\_{t}\right)=\arg \max \_{x\_{t}} p\left(z\_{t} \mid x\_{t}\right) p\left(x\_{t} \mid Z\_{t-1}\right)
$$
$$
p\left(x\_{t} \mid Z\_{t-1}\right)=\int\_{x\_{t-1}} p\left(x\_{t} \mid x\_{t-1}\right) p\left(x\_{t-1} \mid Z\_{t-1}\right) d x\_{t-1}
$$
&lt;p>Simplifying assumption (Markov):
&lt;/p>
$$
p\left(x\_{t} \mid X\_{t-1}\right)=p\left(x\_{t} \mid x\_{t-1}\right)
$$
&lt;p>
where&lt;/p>
&lt;ul>
&lt;li>$x\_t$: state at time $t$&lt;/li>
&lt;li>$z\_t$: observation at time $t$&lt;/li>
&lt;li>$X\_t$: history of states up to the time $t$&lt;/li>
&lt;li>$Z\_t$: history of observations up to $t$&lt;/li>
&lt;/ul>
&lt;h4 id="observation-and-motion-model">Observation and Motion Model&lt;/h4>
&lt;ul>
&lt;li>$p(z\_t | x\_t)$: The likelihood that $z\_t$ is observed, given that the true state of the system is represented by $x\_t$&lt;/li>
&lt;li>$p(x\_{t} | x\_{t-1})$: The likelihood that the state of the system is $x\_t$ when the previous state was $x\_{t-1}$&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Factored Sampling&lt;/strong>&lt;/p>
&lt;p>Probability density function is represented by weighted samples (&amp;ldquo;particles&amp;rdquo;)&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-07-20%2016.05.42.png" alt="截屏2021-07-20 16.05.42">&lt;/p>
&lt;h4 id="particle-filter-pf">&lt;strong>Particle Filter (PF)&lt;/strong>&lt;/h4>
&lt;p>For a PF tracker, you need&lt;/p>
&lt;ul>
&lt;li>
&lt;p>a set of $N$ weighted samples (particles) at time $k$
&lt;/p>
$$
\left\\{\left(s\_{k}^{(i)}, \pi\_{k}^{(i)}\right) \mid i=1 \dots N\right\\}
$$
&lt;/li>
&lt;li>
&lt;p>the motion model
&lt;/p>
$$
s\_{k}^{(i)} \leftarrow s\_{k-1}^{(i)}
$$
&lt;/li>
&lt;li>
&lt;p>the observation model
&lt;/p>
$$
\pi\_{k}^{(i)} \leftarrow s\_{k}^{(i)}
$$
&lt;/li>
&lt;/ul>
&lt;h4 id="the-condensation-algorithm">&lt;strong>The Condensation Algorithm&lt;/strong>&lt;/h4>
&lt;p>A popular instance of a particle filter in Computer Vision&lt;/p>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Select&lt;/strong>&lt;/p>
&lt;p>Randomly select $N$ new samples $S\_{k}^{(i)}$ from the old sample set $S\_{k-1}^{(i)}$ according to their weights $\pi\_{k-1}^{(i)}$&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Predict&lt;/strong>&lt;/p>
&lt;p>Propagate the samples using the motion model&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Measure&lt;/strong>&lt;/p>
&lt;p>Calculate weights for the new samples using the observation model
&lt;/p>
$$
\pi\_{k}^{(i)}=p\left(z\_{k} \mid x\_{k}=s\_{k}^{(i)}\right)
$$
&lt;/li>
&lt;/ol>
&lt;p>Illustration:&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-07-20%2016.16.46.png" alt="截屏2021-07-20 16.16.46">&lt;/p>
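&lt;p>The three Condensation steps map almost directly to code. A toy 1D sketch (the motion and observation models below are made up for illustration):&lt;/p>

```python
import math
import random

def condensation_step(particles, weights, motion, observation_model):
    """One Select-Predict-Measure cycle of the Condensation algorithm.
    particles:            list of states s_{k-1}^(i)
    weights:              their weights pi_{k-1}^(i) (normalized)
    motion(s):            samples a new state from p(x_k | x_{k-1} = s)
    observation_model(s): evaluates p(z_k | x_k = s) for the current frame"""
    n = len(particles)
    # 1. Select: resample N particles according to the old weights
    selected = random.choices(particles, weights=weights, k=n)
    # 2. Predict: propagate each sample with the motion model
    predicted = [motion(s) for s in selected]
    # 3. Measure: reweight with the observation likelihood, then normalize
    new_weights = [observation_model(s) for s in predicted]
    total = sum(new_weights)
    return predicted, [w / total for w in new_weights]

# Toy 1D example: the (hypothetical) target sits at position 5
random.seed(0)
motion = lambda s: s + random.gauss(0, 0.5)          # random-walk dynamics
obs = lambda s: math.exp(-0.5 * (s - 5.0) ** 2)      # likelihood peaks at 5
particles = [random.uniform(0, 10) for _ in range(200)]
weights = [1.0 / 200] * 200
for _ in range(10):
    particles, weights = condensation_step(particles, weights, motion, obs)
best = particles[max(range(200), key=lambda i: weights[i])]
print(best)  # the strongest particle lies near the target
```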
&lt;p>How to get the target position?&lt;/p>
&lt;ul>
&lt;li>Cluster the particle set and search for the highest mode&lt;/li>
&lt;li>Just take the strongest particle&lt;/li>
&lt;/ul>
&lt;p>How many particles are needed?&lt;/p>
&lt;ul>
&lt;li>Depends strongly on the dimension of the state space!&lt;/li>
&lt;li>Tracking 1 object in the image plane typically requires 50-500 particles&lt;/li>
&lt;/ul>
&lt;h4 id="problem">Problem&lt;/h4>
&lt;p>&lt;strong>The Dimensionality Problem&lt;/strong>&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-07-20%2016.18.25.png" alt="截屏2021-07-20 16.18.25" style="zoom:67%;" />
&lt;h2 id="examples">Examples&lt;/h2>
&lt;h3 id="tracking-one-face-with-a-particle-filter">Tracking one Face with a Particle Filter&lt;/h3>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-07-20%2016.28.40.png" alt="截屏2021-07-20 16.28.40" style="zoom:67%;" />
&lt;ul>
&lt;li>
&lt;p>State: ($x$, $y$, scale)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Observations: skin color&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Procedure:&lt;/p>
&lt;ol>
&lt;li>
&lt;p>Select and predict samples&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Measurement step&lt;/p>
&lt;ul>
&lt;li>
&lt;p>For each particle&lt;/p>
&lt;ul>
&lt;li>Count supporting skin pixels in box defined by ($x$, $y$, scale)&lt;/li>
&lt;li>Particle weights determined based on skin color support&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Particle with &lt;em>maximum&lt;/em> weight chosen as best solution&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol>
&lt;/li>
&lt;/ul>
&lt;h3 id="tracking-multiple-objects">Tracking multiple objects&lt;/h3>
&lt;p>Two different approaches:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>A dedicated tracker for each of the objects&lt;/strong>
&lt;ul>
&lt;li>Start with one tracker; once an object is tracked, initialize another tracker to search for further objects&lt;/li>
&lt;li>&lt;span style="color:green">Typically fast and well parallelizable&lt;/span>&lt;/li>
&lt;li>&lt;span style="color:red">Optimal global assignment / tracking difficult to find, Information has to be shared across trackers to find a good assignment&lt;/span>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>A single tracker in a joint state space&lt;/strong>
&lt;ul>
&lt;li>&lt;span style="color:green">Easier to find optimal assignment&lt;/span>&lt;/li>
&lt;li>&lt;span style="color:red">Number of objects has to be known in advance&lt;/span>&lt;/li>
&lt;li>&lt;span style="color:red">State space becomes high dimensional (curse of dimensionality)&lt;/span>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="face-and-head-pose-tracking">Face and Head Pose Tracking&lt;/h3>
&lt;ul>
&lt;li>Particle filter: Head-pose estimation integrated in the tracker&lt;/li>
&lt;li>Observation model
&lt;ul>
&lt;li>Use bank of face detectors for different poses&lt;/li>
&lt;li>Update particle weights with score of matching detector, i.e. the detector with closest angle to hypothesis&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Dynamical model: Gaussian noise, no explicit velocity model&lt;/li>
&lt;li>Occlusion handling
&lt;ul>
&lt;li>Set particle weight to zero, if it is too close to another track’s center&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul></description></item><item><title>Tracking 2</title><link>https://haobin-tan.netlify.app/docs/ai/computer-vision/cv-lecture/14-tracking2/</link><pubDate>Tue, 20 Jul 2021 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/computer-vision/cv-lecture/14-tracking2/</guid><description>&lt;h2 id="multi-camera-systems">Multi-Camera Systems&lt;/h2>
&lt;h3 id="type-of-multi-camera-systems">Type of multi-camera systems&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Stereo-camera system&lt;/strong> (narrow baseline)&lt;/p>
&lt;p>​
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-100" >&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%e6%88%aa%e5%b1%8f2021-07-20%2017.13.58-20210720171721568.png" alt="截屏2021-07-20 17.13.58" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;ul>
&lt;li>Close distance and equal orientation&lt;/li>
&lt;li>An object’s appearance is almost the same in both cameras&lt;/li>
&lt;li>Allows for calculation of a dense disparity map&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Wide-baseline multi-camera system&lt;/strong>&lt;/p>
&lt;p>​
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-100" >&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%e6%88%aa%e5%b1%8f2021-07-20%2017.15.59.png" alt="截屏2021-07-20 17.15.59" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Arbitrary distance and orientation, overlapping field of view&lt;/p>
&lt;/li>
&lt;li>
&lt;p>An object’s appearance is different in each of the cameras&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-100" >&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%e6%88%aa%e5%b1%8f2021-07-20%2017.16.15.png" alt="截屏2021-07-20 17.16.15" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Allows for 3D localization of objects in the joint field of view&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Multi-camera network&lt;/strong>&lt;/p>
&lt;p>​ &lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-07-20%2017.16.55.png" alt="截屏2021-07-20 17.16.55" style="zoom:67%;" />&lt;/p>
&lt;ul>
&lt;li>Non-overlapping field of view&lt;/li>
&lt;li>An object’s appearance differs strongly from one camera to another&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="3d-to-2d-projection-pinhole-camera-model">3D to 2D projection: Pinhole Camera Model&lt;/h3>
&lt;p>Summary:&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-100" >&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%e6%88%aa%e5%b1%8f2021-07-24%2018.49.45.png" alt="截屏2021-07-24 18.49.45" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-100" >&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%e6%88%aa%e5%b1%8f2021-07-20%2017.19.45.png" alt="截屏2021-07-20 17.19.45" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
$$
z^{\prime} = -f
$$
$$
\frac{y^{\prime}}{-f}=\frac{y}{z} \Rightarrow y^{\prime}=\frac{-f y}{z}
$$
$$
\frac{x^{\prime}}{-f}=\frac{x}{z} \Rightarrow x^{\prime}=\frac{-f x}{z}
$$
&lt;p>Pixel coordinates $(u, v)$ of the projected points on &lt;strong>image plane&lt;/strong>&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-100" >&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%e6%88%aa%e5%b1%8f2021-07-20%2018.24.21.png" alt="截屏2021-07-20 18.24.21" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
$$
\begin{array}{l}
\boldsymbol{u}=\boldsymbol{k}\_{u} \boldsymbol{x}^{\prime}+\boldsymbol{u}\_{\mathrm{0}} \\\\
\boldsymbol{v}=-\boldsymbol{k}\_{v} \boldsymbol{y}^{\prime}+\boldsymbol{v}\_{\mathrm{0}}
\end{array}
$$
&lt;p>
where $k\_u$ and $k\_v$ are &lt;strong>scaling factors&lt;/strong> which denote the ratio between world and pixel coordinates.&lt;/p>
&lt;p>In matrix formulation:
&lt;/p>
$$
\left(\begin{array}{l}
u \\\\
v
\end{array}\right)=\left(\begin{array}{cc}
k\_{u} &amp; 0 \\\\
0 &amp; -k\_{v}
\end{array}\right)\left(\begin{array}{l}
x^{\prime} \\\\
y^{\prime}
\end{array}\right)+\left(\begin{array}{l}
u\_{0} \\\\
v\_{0}
\end{array}\right)
$$
&lt;p>
&lt;strong>Perspective Projection&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>internal camera parameters
&lt;/p>
$$
\begin{array}{l}
\alpha\_{u}=k\_{u} f \\\\
\alpha\_{v}=-k\_{v} f \\\\
u\_{0} \\\\
v\_{0}
\end{array}
$$
&lt;ul>
&lt;li>have to be known to perform the projection&lt;/li>
&lt;li>they depend on the camera only&lt;/li>
&lt;li>Perform calibration to estimate&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
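The projection chain above (perspective division onto the image plane, then the pixel mapping) can be sketched numerically. The point, focal length, and scaling factors below are illustrative values, not from the lecture:

```python
def project_point(p_cam, f, k_u, k_v, u0, v0):
    """Project a point given in camera coordinates to pixel coordinates
    using the pinhole model with the image plane at z' = -f."""
    x, y, z = p_cam
    # Perspective division onto the image plane
    x_img = -f * x / z
    y_img = -f * y / z
    # Scale to pixels and shift by the principal point (u0, v0)
    u = k_u * x_img + u0
    v = -k_v * y_img + v0
    return u, v

def project_point_alpha(p_cam, alpha_u, alpha_v, u0, v0):
    """Same projection expressed with the internal parameters
    alpha_u = k_u * f and alpha_v = -k_v * f."""
    x, y, z = p_cam
    return (-alpha_u * x / z + u0, -alpha_v * y / z + v0)

# Both formulations agree; e.g. a point (1, 2, 10) with f = 1,
# k_u = k_v = 100, principal point (320, 240) lands near (310, 260).
```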
&lt;h4 id="calibration">Calibration&lt;/h4>
&lt;p>&lt;strong>Intrinsics parameters&lt;/strong>: describe the optical properties of each camera (“the camera model”)&lt;/p>
&lt;ul>
&lt;li>$f$: focal length&lt;/li>
&lt;li>$c\_x, c\_y$: the principal point (&amp;ldquo;optical center&amp;rdquo;), sometimes also denoted as $u\_0, v\_0$&lt;/li>
&lt;li>$K\_1, \dots, K\_n$: distortion parameters (radial and tangential)&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Extrinsic parameters&lt;/strong>: describe the location of each camera with respect to a global coordinate system&lt;/p>
&lt;ul>
&lt;li>$\mathbf{T}$: translation vector&lt;/li>
&lt;li>$\mathbf{R}$: $3 \times 3$ rotation matrix&lt;/li>
&lt;/ul>
&lt;p>Transformation of world coordinate of point $p^* = (x, y, z)$ to camera coordinate $p$:
&lt;/p>
$$
p = \mathbf{R} (x, y, z)^T + \mathbf{T}
$$
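As a minimal sketch, the world-to-camera transform is a single matrix-vector product plus a translation; the rotation and translation values here are illustrative:

```python
import numpy as np

def world_to_camera(p_world, R, T):
    """p = R (x, y, z)^T + T: world coordinates into the camera frame."""
    return R @ np.asarray(p_world, dtype=float) + T

# Illustrative extrinsics: 90-degree rotation about the z-axis, unit shift in x
R = np.array([[0.0, -1.0, 0.0],
              [1.0,  0.0, 0.0],
              [0.0,  0.0, 1.0]])
T = np.array([1.0, 0.0, 0.0])
p = world_to_camera((1.0, 0.0, 0.0), R, T)  # rotated to (0, 1, 0), shifted to (1, 1, 0)
```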
&lt;p>
Calibration steps&lt;/p>
&lt;ol>
&lt;li>For each camera: A calibration target with a known geometry is captured from multiple views&lt;/li>
&lt;li>The corner points are extracted (semi-)automatically&lt;/li>
&lt;li>The locations of the corner points are used to estimate the intrinsics iteratively&lt;/li>
&lt;li>Once the intrinsics are known, a fixed calibration target is captured from all of the cameras to estimate the extrinsics&lt;/li>

&lt;/ol>
&lt;h3 id="triangulation">Triangulation&lt;/h3>
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-100" >&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%e6%88%aa%e5%b1%8f2021-07-20%2019.14.22.png" alt="截屏2021-07-20 19.14.22" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
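In practice the lines of view do not intersect exactly, so the 3D location is taken as the least-squares point closest to all viewing rays. A minimal sketch of this midpoint method (camera centers and ray directions below are illustrative):

```python
import numpy as np

def triangulate_midpoint(centers, directions):
    """Least-squares 3D point closest to a set of camera rays.

    Each ray is a camera center c_i plus a direction d_i; minimizing the
    summed squared distances of X to all rays gives the normal equations
    sum_i (I - d_i d_i^T) X = sum_i (I - d_i d_i^T) c_i.
    """
    A = np.zeros((3, 3))
    b = np.zeros(3)
    for c, d in zip(centers, directions):
        d = d / np.linalg.norm(d)
        P = np.eye(3) - np.outer(d, d)  # projector orthogonal to the ray
        A += P
        b += P @ c
    return np.linalg.solve(A, b)

# Two rays that both pass through (0, 0, 5)
centers = [np.array([-1.0, 0.0, 0.0]), np.array([1.0, 0.0, 0.0])]
dirs = [np.array([1.0, 0.0, 5.0]), np.array([-1.0, 0.0, 5.0])]
X = triangulate_midpoint(centers, dirs)
```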
&lt;ul>
&lt;li>Assumption: the object location is known in multiple views&lt;/li>
&lt;li>Ideally: The intersection of the lines-of-view determines the 3D location&lt;/li>
&lt;li>Practically: least-squares approximation&lt;/li>
&lt;/ul></description></item><item><title>Body Pose</title><link>https://haobin-tan.netlify.app/docs/ai/computer-vision/cv-lecture/15-pose/</link><pubDate>Wed, 07 Jul 2021 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/computer-vision/cv-lecture/15-pose/</guid><description>&lt;h2 id="kinect">Kinect&lt;/h2>
&lt;h3 id="what-is-kinect">What is Kinect?&lt;/h3>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-07-24%2021.16.14.png" alt="截屏2021-07-24 21.16.14" style="zoom:80%;" />
&lt;ul>
&lt;li>Fusion of two groundbreaking new technologies
&lt;ul>
&lt;li>A cheap and fast &lt;strong>RGB-D sensor&lt;/strong>&lt;/li>
&lt;li>A reliable Skeleton Tracking&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="structured-light">Structured light&lt;/h3>
&lt;ul>
&lt;li>Kinect uses Structured Light to simulate a stereo camera system&lt;/li>
&lt;li>Kinect provides a unique texture for every point of the image, therefore only block matching is required&lt;/li>
&lt;/ul>
&lt;h2 id="pose-recognition-for-user-interaction">Pose Recognition for User Interaction&lt;/h2>
&lt;p>A few constraints:&lt;/p>
&lt;ul>
&lt;li>Extremely low latency.&lt;/li>
&lt;li>Low computational power.&lt;/li>
&lt;li>High recognition rate, without false positives.&lt;/li>
&lt;li>No personalized training step.&lt;/li>
&lt;li>Few people at once.&lt;/li>
&lt;li>Complex poses will be common.&lt;/li>
&lt;/ul>
&lt;h2 id="pose-recognition1">Pose Recognition&lt;sup id="fnref:1">&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref">1&lt;/a>&lt;/sup>&lt;/h2>
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-100" >&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%e6%88%aa%e5%b1%8f2021-07-07%2023.38.56.png" alt="截屏2021-07-07 23.38.56" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h3 id="1st-step-pixel-classification">1st step: Pixel classification&lt;/h3>
&lt;p>&lt;strong>Speed&lt;/strong> is the key&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Uses only one disparity image.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>It classifies each pixel independently.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>The process of feature extraction is &lt;em>simultaneous&lt;/em> to the classification.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Simplest possible feature: difference between two pixels.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Classification is done through &lt;strong>Random Decision Forests&lt;/strong>.&lt;/p>
&lt;ul>
&lt;li>Learning
&lt;ul>
&lt;li>Randomly choose a set of thresholds and features for splits.&lt;/li>
&lt;li>Pick the threshold and feature that provide the &lt;em>largest&lt;/em> information gain.&lt;/li>
&lt;li>Recurse until a certain accuracy or maximum depth is reached.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-07-07%2023.41.56.png" alt="截屏2021-07-07 23.41.56" style="zoom:80%;" />
&lt;/li>
&lt;li>
&lt;p>Everything has an optimal GPU implementation.&lt;/p>
&lt;/li>
&lt;/ul>
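The "difference between two pixels" feature can be sketched as follows, in the spirit of the depth-comparison features of Shotton et al.: the probe offsets are scaled by the inverse depth at the pixel so the feature is depth invariant. Returning a large constant for probes that fall outside the image is an assumption of this sketch:

```python
import numpy as np

def depth_difference_feature(depth, px, offset_u, offset_v):
    """Compare the depth at two offsets around pixel px; offsets are
    scaled by 1 / depth(px) for depth invariance."""
    y, x = px
    d = depth[y, x]

    def probe(offset):
        oy = int(round(y + offset[0] / d))
        ox = int(round(x + offset[1] / d))
        h, w = depth.shape
        if 0 <= oy < h and 0 <= ox < w:
            return depth[oy, ox]
        return 1e6  # large constant for probes falling outside the image

    return probe(offset_u) - probe(offset_v)

# On a flat depth map the feature is zero for any pair of offsets
flat = np.full((10, 10), 2.0)
f = depth_difference_feature(flat, (5, 5), (0, 2), (0, -2))  # 0.0
```

The decision forest then thresholds many such features to classify the pixel into a body part.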
&lt;h3 id="training">Training&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>Key: using a huge amount of training data&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Synthetic Training DB&lt;/strong>&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-07-07%2023.45.46.png" alt="截屏2021-07-07 23.45.46" style="zoom: 67%;" />
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-07-07%2023.46.11.png" alt="截屏2021-07-07 23.46.11" style="zoom: 67%;" />
&lt;/li>
&lt;li>
&lt;p>Pixel classification results&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-07-07%2023.47.02.png" alt="截屏2021-07-07 23.47.02" style="zoom:67%;" />
&lt;/li>
&lt;/ul>
&lt;h3 id="joint-estimation">Joint estimation&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>Use mean shift clustering on the pixels with &lt;strong>Gaussian kernel&lt;/strong> to infer the center of clusters.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Clustering is done in 3D space, but every pixel is &lt;em>weighted&lt;/em> by its world surface area to get depth invariance.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Finally the &lt;em>sum&lt;/em> of the weighted pixels is used as a &lt;strong>confidence measure&lt;/strong>.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Results&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-07-07%2023.49.21.png" alt="截屏2021-07-07 23.49.21" style="zoom:67%;" />
&lt;/li>
&lt;/ul>
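The clustering step can be sketched as weighted mean shift with a Gaussian kernel. The bandwidth and data below are illustrative; in the original method the weights are the pixels' world surface areas, and the final sum of kernel weights serves as the confidence measure:

```python
import numpy as np

def mean_shift(points, weights, start, bandwidth=0.1, iters=20):
    """Weighted mean-shift mode seeking with a Gaussian kernel.
    Returns the mode estimate and the final sum of kernel weights,
    which can serve as a confidence measure."""
    points = np.asarray(points, dtype=float)
    m = np.asarray(start, dtype=float)
    for _ in range(iters):
        d2 = np.sum((points - m) ** 2, axis=1)       # squared distances to m
        k = weights * np.exp(-d2 / (2 * bandwidth ** 2))
        m = (k[:, None] * points).sum(axis=0) / k.sum()
    return m, k.sum()

# Three nearly coincident 3D votes: the mode lands at their center
pts = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0], [1.02, 1.0, 1.0]]
mode, conf = mean_shift(pts, np.ones(3), start=[0.9, 1.0, 1.0])
```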
&lt;h3 id="criticism-">Criticism 👎&lt;/h3>
&lt;ul>
&lt;li>Not open source&lt;/li>
&lt;li>Biased towards upper body frontal poses&lt;/li>
&lt;li>Very difficult to improve or adapt.&lt;/li>
&lt;/ul>
&lt;h2 id="pose-estimation-without-kinect-convolutional-pose-machines-2">Pose Estimation without Kinect: Convolutional Pose Machines &lt;sup id="fnref:2">&lt;a href="#fn:2" class="footnote-ref" role="doc-noteref">2&lt;/a>&lt;/sup>&lt;/h2>
&lt;h3 id="pose-machine">Pose Machine&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>Unconstrained 2D-pose estimation on real world RGB images.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Outputs confidence maps for every joint of the skeleton.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Works in multiple stages refining the confidence maps.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>💡 &lt;strong>Idea:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Local image evidence is weak (first stage confidence maps)&lt;/strong>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Part context can be a strong cue (confidence maps of other body joints)&lt;/strong>&lt;/p>
&lt;p>&lt;strong>➔ Use confidence maps of all body joints of the previous stage to refine current results&lt;/strong>&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-100" >&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%e6%88%aa%e5%b1%8f2021-07-08%2000.02.49.png" alt="截屏2021-07-08 00.02.49" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h4 id="example">Example&lt;/h4>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-07-11%2012.38.10.png" alt="截屏2021-07-11 12.38.10" style="zoom:67%;" />
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-07-11%2012.38.27.png" alt="截屏2021-07-11 12.38.27" style="zoom:67%;" />
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-07-11%2012.38.40.png" alt="截屏2021-07-11 12.38.40" style="zoom:67%;" />
&lt;h4 id="details">Details&lt;/h4>
&lt;p>We denote the pixel location of the $p$-th anatomical landmark (referred to as a &lt;strong>part&lt;/strong>) as $Y\_{p} \in \mathcal{Z} \subset \mathbb{R}^{2}$&lt;/p>
&lt;ul>
&lt;li>$\mathcal{Z}$: set of all $(u, v)$ locations in an image&lt;/li>
&lt;/ul>
&lt;p>🎯 &lt;strong>Goal: to predict the image locations $Y = (Y\_1, \dots, Y\_P)$ for all $P$ parts.&lt;/strong>&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-100" >&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%e6%88%aa%e5%b1%8f2021-07-11%2015.22.48.png" alt="截屏2021-07-11 15.22.48" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>In each stage $t \in \{1 \dots T\}$, the classifiers $g\_t$ predict beliefs for assigning a location to each part $Y\_{p}=z, \forall z \in \mathcal{Z}$, based on&lt;/p>
&lt;ul>
&lt;li>features extracted from the image at the location $z$ denoted by $\mathbf{x}\_{z} \in \mathbb{R}^{d}$ and&lt;/li>
&lt;li>contextual information from the preceding classifier in the neighborhood around each $Y\_p$ in stage $t$.&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>First stage&lt;/strong>&lt;/p>
&lt;p>A classifier in the first stage $t = 1$ produces the following belief values:
&lt;/p>
$$
g\_1(\mathbf{x}\_z) \rightarrow \\{b\_1^p(Y\_p=z)\\}\_{p \in \\{0 \dots P\\}}
$$
&lt;ul>
&lt;li>$b\_{1}^{p}\left(Y\_{p}=z\right)$: score predicted by the classifier $g\_1$ for assigning the $p$-th part in the first stage at image location $z$.&lt;/li>
&lt;/ul>
&lt;p>We represent all the beliefs of part $p$ evaluated at every location $z = (u, v)^T$ in the image as $\mathbf{b}\_{t}^{p} \in \mathbb{R}^{w \times h}$:
&lt;/p>
$$
\mathbf{b}\_{t}^{p}[u, v]=b\_{t}^{p}\left(Y\_{p}=z\right)
$$
&lt;ul>
&lt;li>$w, h$: width and height of the image, respectively&lt;/li>
&lt;/ul>
&lt;p>For convenience, we denote the collection of belief maps for all the parts as $\mathbf{b}\_{t} \in \mathbb{R}^{w \times h \times(P+1)}$ ($+1$ for background)&lt;/p>
&lt;p>&lt;strong>Subsequent stages&lt;/strong>&lt;/p>
&lt;p>The classifier predicts a belief for assigning a location to each part $Y\_{p}=z, \forall z \in \mathcal{Z}$, based on&lt;/p>
&lt;ul>
&lt;li>features of the image data $\mathbf{x}\_{z}^{t} \in \mathbb{R}^{d}$ and&lt;/li>
&lt;li>contextual information from the preceding classifier in the neighborhood around each $Y\_p$&lt;/li>
&lt;/ul>
$$
g\_{t}\left(\mathbf{x}\_{z}^{t}, \psi\_{t}\left(z, \mathbf{b}\_{t-1}\right)\right) \rightarrow \left\\{b\_{t}^{p}\left(Y\_{p}=z\right)\right\\}\_{p \in \\{0 \ldots P\\}}
$$
&lt;ul>
&lt;li>$\psi\_{t>1}(\cdot)$: mapping from the beliefs $b\_{t−1}$ to context features.&lt;/li>
&lt;/ul>
&lt;p>In each stage, the computed beliefs provide an increasingly refined estimate for the location of each part.&lt;/p>
&lt;h3 id="confidence-maps-generation">Confidence maps generation&lt;/h3>
&lt;p>&lt;strong>Fully Convolutional Network (FCN)&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Does not have Fully Connected Layers.&lt;/li>
&lt;li>The same network can be applied to arbitrary image sizes.&lt;/li>
&lt;li>Similar to a sliding window approach, but more efficient&lt;/li>
&lt;/ul>
&lt;h3 id="cpm">CPM&lt;/h3>
&lt;p>The prediction and image feature computation modules of a pose machine can be replaced by a deep convolutional architecture allowing for both image and contextual feature representations to be learned directly from data.&lt;/p>
&lt;p>Advantage of convolutional architectures: completely differentiable $\rightarrow$ enabling end-to-end joint training of all stages 👍&lt;/p>
&lt;figure>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-07-08%2012.34.37.png"
alt="CPM structure">&lt;figcaption>
&lt;p>CPM structure&lt;/p>
&lt;/figcaption>
&lt;/figure>
&lt;h3 id="learning-in-cpm">Learning in CPM&lt;/h3>
&lt;p>Potential problem of a network with a large number of layers: &lt;strong>vanishing gradient&lt;/strong>&lt;/p>
&lt;p>Solution&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Define a loss function at the output of each stage $t$ that minimizes the $l\_2$ distance between the predicted ($b\_{t}^{p}$) and ideal ($b\_{*}^{p}\left(Y\_{p}=z\right)$) belief maps for each part.&lt;/p>
&lt;ul>
&lt;li>The ideal belief map for a part $p$, $b\_{*}^{p}\left(Y\_{p}=z\right)$, are created by putting Gaussian peaks at ground truth locations of each body part $p$.&lt;/li>
&lt;/ul>
&lt;p>Cost function we aim to minimize at the output of each stage at each level:
&lt;/p>
$$
f\_{t}=\sum\_{p=1}^{P+1} \sum\_{z \in \mathcal{Z}}\left\|b\_{t}^{p}(z)-b\_{*}^{p}(z)\right\|\_{2}^{2} .
$$
&lt;ul>
&lt;li>$P$: all body parts&lt;/li>
&lt;li>$\mathcal{Z}$: set of all image locations in a belief map&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>The overall objective for the full architecture is obtained by adding the losses at each stage:
&lt;/p>
$$
\mathcal{F}=\sum\_{t=1}^{T} f\_{t}
$$
&lt;/li>
&lt;/ul>
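A sketch of the training targets and per-stage loss described above; the Gaussian width `sigma` is an assumed hyperparameter:

```python
import numpy as np

def ideal_belief_map(w, h, gt_uv, sigma=1.5):
    """Ideal belief map b*_p for one part: a Gaussian peak at the
    part's ground-truth image location (gu, gv)."""
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    gu, gv = gt_uv
    return np.exp(-((u - gu) ** 2 + (v - gv) ** 2) / (2 * sigma ** 2))

def stage_loss(pred_maps, ideal_maps):
    """f_t: summed squared differences between predicted and ideal
    belief maps over all parts and image locations."""
    return float(np.sum((np.asarray(pred_maps) - np.asarray(ideal_maps)) ** 2))
```

The overall objective is then the sum of these stage losses, which is what makes the intermediate supervision counteract vanishing gradients.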
&lt;div class="footnotes" role="doc-endnotes">
&lt;hr>
&lt;ol>
&lt;li id="fn:1">
&lt;p>J. Shotton et al., &amp;ldquo;Real-time human pose recognition in parts from single depth images,&amp;rdquo; CVPR 2011, 2011, pp. 1297-1304, doi: 10.1109/CVPR.2011.5995316.&amp;#160;&lt;a href="#fnref:1" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:2">
&lt;p>Shih-En Wei, Varun Ramakrishna, Takeo Kanade, and Yaser Sheikh. Convolutional pose machines. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4724–4732, 2016&amp;#160;&lt;a href="#fnref:2" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;/ol>
&lt;/div></description></item><item><title>Gesture Recognition</title><link>https://haobin-tan.netlify.app/docs/ai/computer-vision/cv-lecture/16-gesture_recognition/</link><pubDate>Tue, 20 Jul 2021 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/computer-vision/cv-lecture/16-gesture_recognition/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;h3 id="gesture">Gesture&lt;/h3>
&lt;ul>
&lt;li>a movement usually of the body or limbs that expresses or emphasizes an idea, sentiment, or attitude&lt;/li>
&lt;li>the use of motions of the limbs or body as a means of expression&lt;/li>
&lt;/ul>
&lt;h3 id="automatic-gesture-recognition">Automatic Gesture Recognition&lt;/h3>
&lt;ul>
&lt;li>A gesture recognition system generates a &lt;em>semantic description&lt;/em> for certain body motions&lt;/li>
&lt;li>Gesture recognition exploits the power of &lt;em>non-verbal communication,&lt;/em> which is very common in human-human interaction&lt;/li>
&lt;li>Gesture recognition is often built on top of a &lt;em>human motion tracker&lt;/em>&lt;/li>
&lt;/ul>
&lt;h3 id="applications">Applications&lt;/h3>
&lt;ul>
&lt;li>Multimodal Interaction
&lt;ul>
&lt;li>Gestures + Speech recognition&lt;/li>
&lt;li>Gestures + gaze&lt;/li>
&lt;li>Human-Robot Interaction&lt;/li>
&lt;li>Interaction with Smart Environments&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Understanding Human Interaction&lt;/li>
&lt;/ul>
&lt;h3 id="types-of-gestures">Types of Gestures&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>Hand &amp;amp; arm gestures&lt;/p>
&lt;ul>
&lt;li>Pointing Gestures&lt;/li>
&lt;li>Sign Language&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Head gestures&lt;/p>
&lt;ul>
&lt;li>Nodding, head shaking, turning, pointing&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Body gestures&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-100" >&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%e6%88%aa%e5%b1%8f2021-07-20%2022.36.21.png" alt="截屏2021-07-20 22.36.21" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h2 id="automatic-gesture-recognition-1">Automatic Gesture Recognition&lt;/h2>
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-100" >&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%e6%88%aa%e5%b1%8f2021-07-20%2022.37.29.png" alt="截屏2021-07-20 22.37.29" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;ul>
&lt;li>Feature Acquisition
&lt;ul>
&lt;li>Appearances: Markers, color, motion, shape, segmentation, stereo, local descriptors, space-time interest points, &amp;hellip;&lt;/li>
&lt;li>Model based: body- or hand-models&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Classifiers
&lt;ul>
&lt;li>SVM, ANN, HMMs, Adaboost, Dec. Trees, Deep Learning &amp;hellip;&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="hidden-markov-models-hmms-for-gesture-recognition">Hidden Markov Models (HMMs) for Gesture Recognition&lt;/h3>
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-100" >&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%e6%88%aa%e5%b1%8f2021-07-20%2022.40.46.png" alt="截屏2021-07-20 22.40.46" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&amp;ldquo;&lt;strong>hidden&lt;/strong>&amp;rdquo;: conclusions are drawn from the observations WITHOUT knowing the &lt;em>hidden&lt;/em> sequence of states&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Markov assumption&lt;/strong> (1st order): the next state depends ONLY on the current state (not on the complete state history)&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>A Hidden Markov Model is a five-tuple
&lt;/p>
$$
(S, \pi, \mathbf{A}, B, V)
$$
&lt;ul>
&lt;li>$S = \\{s\_1, s\_2, \dots, s\_n\\}$: set of &lt;strong>states&lt;/strong>&lt;/li>
&lt;li>$\pi$: the &lt;strong>initial probability&lt;/strong> distribution
&lt;ul>
&lt;li>$\pi(s\_i)$ = probability of $s\_i$ being the first state of a state sequence&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>$\mathbf{A} = (a\_{ij})$: the matrix of &lt;strong>state transition probabilities&lt;/strong>
&lt;ul>
&lt;li>$a\_{ij}$: probability of state $s\_j$ following $s\_i$&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>$B = \\{b\_1, b\_2, \dots, b\_n\\}$: the set of &lt;strong>emission probability&lt;/strong> distributions/densities
&lt;ul>
&lt;li>$b\_i(x)$: probability of observing $x$ when the system is in state $s\_i$&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>$V$: the observable &lt;strong>feature space&lt;/strong>
&lt;ul>
&lt;li>Can be discrete ($V = \\{x\_1, x\_2, \dots, x\_v\\}$) or continuous ($V = \mathbb{R}^d$)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h4 id="properties-of-hmms">&lt;strong>Properties of HMMs&lt;/strong>&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>For the initial probabilities:
&lt;/p>
$$
\sum\_i \pi(s\_i) = 1
$$
&lt;ul>
&lt;li>Often simplified by
$$
\pi(s\_1) = 1, \quad \pi(s\_i) = 0 \text{ for } i > 1
$$&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>For state transition probabilities:
&lt;/p>
$$
\forall i: \sum\_j a\_{ij} = 1
$$
&lt;ul>
&lt;li>Often: $a\_{ij} = 0$ for most $j$ except for a few states&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>When $V = \\{x\_1, x\_2, \dots, x\_v\\}$ then $b\_i$ are discrete probability distributions, the HMMs are called &lt;strong>discrete HMMs&lt;/strong>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>When $V = \mathbb{R}^d$ then $b\_i$ are continuous probability density functions, the HMMs are called &lt;strong>continuous (density) HMMs&lt;/strong>&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h4 id="hmm-topologies">&lt;strong>HMM Topologies&lt;/strong>&lt;/h4>
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-100" >&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%e6%88%aa%e5%b1%8f2021-07-20%2023.06.32.png" alt="截屏2021-07-20 23.06.32" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h4 id="the-observation-model">&lt;strong>The Observation Model&lt;/strong>&lt;/h4>
&lt;p>Most popular: &lt;strong>Gaussian mixture models&lt;/strong>
&lt;/p>
$$
P\left(x\_{t} \mid s\_{j}\right)=\sum\_{k=1}^{n\_{j}} c\_{j k} \cdot \frac{1}{\sqrt{(2 \pi)^{n}\left|\Sigma\_{j k}\right|}} e^{-\frac{1}{2}\left(x\_{t}-\mu\_{j k}\right)^{\mathrm{T}} \Sigma\_{j k}^{-1}\left(x\_{t}-\mu\_{j k}\right)}
$$
&lt;ul>
&lt;li>$n\_j$: number of Gaussians (in state $j$)&lt;/li>
&lt;li>$c\_{jk}$: mixture weight for $k$-th Gaussian (in state $j$)&lt;/li>
&lt;li>$\mu\_{jk}$: means of $k$-th Gaussian (in state $j$)&lt;/li>
&lt;li>$\Sigma\_{jk}$: covariance matrix of $k$-th Gaussian (in state $j$)&lt;/li>
&lt;/ul>
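The mixture density above can be evaluated directly; a minimal sketch for one state, with illustrative mixture parameters:

```python
import numpy as np

def gmm_emission(x, c, mu, Sigma):
    """P(x | s_j) under a Gaussian mixture: c are the mixture weights,
    mu the component means, Sigma the covariance matrices."""
    x = np.asarray(x, dtype=float)
    d = x.shape[0]
    p = 0.0
    for ck, mk, Sk in zip(c, mu, Sigma):
        diff = x - mk
        norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(Sk))
        p += ck * np.exp(-0.5 * diff @ np.linalg.solve(Sk, diff)) / norm
    return p

# A single standard 2D Gaussian evaluated at its mean gives 1 / (2*pi)
p = gmm_emission([0.0, 0.0], [1.0], [np.zeros(2)], [np.eye(2)])
```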
&lt;h4 id="three-main-tasks-with-hmms">&lt;strong>Three Main Tasks with HMMs&lt;/strong>&lt;/h4>
&lt;p>Given an HMM $\lambda$ and an observation $x\_1, x\_2, \dots, x\_T$&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>The evaluation problem&lt;/strong>&lt;/p>
&lt;p>compute the probability of the observation $p(x\_1, x\_2, \dots, x\_T | \lambda)$&lt;/p>
&lt;p>$\rightarrow$ &amp;ldquo;Forward Algorithm&amp;rdquo;&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>The decoding problem&lt;/strong>&lt;/p>
&lt;p>compute the most likely state sequence $s\_{q1}, s\_{q2}, \dots, s\_{qT}$, i.e.
&lt;/p>
$$
\operatorname{argmax}\_{q\_1, \ldots, q\_T} p\left(q\_{1}, \ldots, q\_{T} \mid x\_{1}, x\_{2}, \ldots, x\_{T}, \lambda\right)
$$
&lt;p>
$\rightarrow$ &amp;ldquo;Viterbi-Algorithm&amp;rdquo;&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>The learning/optimization problem&lt;/strong>&lt;/p>
&lt;p>Find an HMM $\lambda^\prime$ s.t. $p\left(x\_{1}, x\_{2}, \ldots, x\_{T} \mid \lambda^{\prime}\right)>p\left(x\_{1}, x\_{2}, \ldots, x\_{T} \mid \lambda\right)$&lt;/p>
&lt;p>$\rightarrow$ &amp;ldquo;Baum-Welch-Algo&amp;rdquo;, &amp;ldquo;Viterbi-Learning&amp;rdquo;&lt;/p>
&lt;/li>
&lt;/ul>
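The forward algorithm for the evaluation problem fits in a few lines for a discrete HMM; the toy two-state model below is illustrative:

```python
import numpy as np

def forward(pi, A, B, obs):
    """Compute p(x_1, ..., x_T | lambda) for a discrete HMM.
    pi: initial state probabilities (n,), A: transition matrix (n, n),
    B: emission probabilities (n, |V|), obs: observation index sequence."""
    pi, A, B = map(np.asarray, (pi, A, B))
    alpha = pi * B[:, obs[0]]          # initialize with the first observation
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]  # propagate states, absorb observation
    return alpha.sum()

# Two states, two symbols: state 0 always emits symbol 0, state 1 symbol 1
pi = [1.0, 0.0]
A = [[0.5, 0.5], [0.0, 1.0]]
B = [[1.0, 0.0], [0.0, 1.0]]
p = forward(pi, A, B, [0, 1])  # the 0 -> 1 transition has probability 0.5
```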
&lt;h3 id="sign-language-recognition">Sign Language Recognition&lt;/h3>
&lt;ul>
&lt;li>American Sign Language (ASL)
&lt;ul>
&lt;li>6000 gestures describe persons, places and things&lt;/li>
&lt;li>Exact meaning and strong rules of context and grammar for each&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Sign recognition
&lt;ul>
&lt;li>HMM ideal for complex and structured hand gestures of ASL&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h4 id="feature-extraction">Feature extraction&lt;/h4>
&lt;ul>
&lt;li>Camera located in either a 1st-person or a 2nd-person view&lt;/li>
&lt;li>Segment hand blobs by a skin color model&lt;/li>
&lt;/ul>
&lt;h4 id="hmm-for-american-sign-language">&lt;strong>HMM for American Sign Language&lt;/strong>&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>Four-State HMM for each word&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-100" >&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%e6%88%aa%e5%b1%8f2021-07-20%2023.39.28.png" alt="截屏2021-07-20 23.39.28" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Training&lt;/p>
&lt;ul>
&lt;li>Automatic segmentation of sentences in five portions&lt;/li>
&lt;li>Initial estimates by iterative Viterbi-alignment&lt;/li>
&lt;li>Then Baum-Welch re-estimation&lt;/li>
&lt;li>No context used&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Recognition&lt;/p>
&lt;ul>
&lt;li>With and without part-of-speech grammar&lt;/li>
&lt;li>All features / only relative features used&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h4 id="asl-results">ASL Results&lt;/h4>
&lt;p>&lt;strong>Desk-based&lt;/strong>&lt;/p>
&lt;p>348 training and 94 testing sentences without contexts&lt;/p>
&lt;p>Accuracy:
&lt;/p>
$$
Acc = \frac{N-D-S-I}{N}
$$
&lt;ul>
&lt;li>$N$: #Words&lt;/li>
&lt;li>$D$: #Deletions&lt;/li>
&lt;li>$S$: #Substitutions&lt;/li>
&lt;li>$I$: #Insertions&lt;/li>
&lt;/ul>
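The accuracy measure is a one-liner; the counts below are illustrative, not the reported ASL results:

```python
def word_accuracy(n_words, deletions, substitutions, insertions):
    """Acc = (N - D - S - I) / N."""
    return (n_words - deletions - substitutions - insertions) / n_words

acc = word_accuracy(100, 5, 10, 5)  # (100 - 20) / 100 = 0.8
```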
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-100" >&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%e6%88%aa%e5%b1%8f2021-07-20%2023.42.19.png" alt="截屏2021-07-20 23.42.19" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>&lt;strong>Wearable-based&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>400 training sentences and 100 for testing&lt;/li>
&lt;li>Test 5-word sentences&lt;/li>
&lt;li>Restricted and unrestricted similar!&lt;/li>
&lt;/ul>
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-100" >&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%e6%88%aa%e5%b1%8f2021-07-20%2023.43.41.png" alt="截屏2021-07-20 23.43.41" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h3 id="pointing-gesture-recognition">Pointing Gesture Recognition&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>Pointing gestures&lt;/p>
&lt;ul>
&lt;li>
&lt;p>are used to specify objects and locations&lt;/p>
&lt;/li>
&lt;li>
&lt;p>can be needful to resolve ambiguities in verbal statements&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Definition: Pointing gesture = movement of the arm towards a pointing target&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Tasks&lt;/p>
&lt;ul>
&lt;li>Detect occurrence of human pointing gestures in natural arm movements&lt;/li>
&lt;li>Extract the 3D pointing direction&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="interaction-in-a-smart-room">Interaction in a Smart Room&lt;/h3></description></item><item><title>Action &amp; Activity Recognition</title><link>https://haobin-tan.netlify.app/docs/ai/computer-vision/cv-lecture/17-action-activity-recognition/</link><pubDate>Tue, 20 Jul 2021 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/computer-vision/cv-lecture/17-action-activity-recognition/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;h3 id="motivation">Motivation&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>Gain a higher level understanding of the scene, e.g.&lt;/p>
&lt;ul>
&lt;li>What are these persons doing (walking, sitting, working, hiding)?&lt;/li>
&lt;li>How are they doing it?&lt;/li>
&lt;li>What is going on in the scene (meeting, party, telephone conversation, etc&amp;hellip;)?&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Applications&lt;/p>
&lt;ul>
&lt;li>video indexing/analysis,&lt;/li>
&lt;li>smart-rooms,&lt;/li>
&lt;li>patient monitoring,&lt;/li>
&lt;li>surveillance,&lt;/li>
&lt;li>robots etc.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="actions-activities">Actions, Activities&lt;/h3>
&lt;h4 id="event">Event&lt;/h4>
&lt;ul>
&lt;li>“a thing that happens or takes place”&lt;/li>
&lt;li>Examples
&lt;ul>
&lt;li>Gestures&lt;/li>
&lt;li>Actions (running, drinking, standing up, etc.)&lt;/li>
&lt;li>Activities (preparing a meal, playing a game, etc.)&lt;/li>
&lt;li>Nature event (fire, storm, earthquake, etc.)&lt;/li>
&lt;li>&amp;hellip;&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h4 id="human-actions">Human actions&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>Def 1: &lt;strong>Physical body motion&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>E.g.: Walking, boxing, clapping, bending, &amp;hellip;&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Def 2: &lt;strong>Interaction with environment on specific purpose&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>E.g.&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-07-21%2009.47.42.png" alt="截屏2021-07-21 09.47.42">&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h4 id="activities">Activities&lt;/h4>
&lt;ul>
&lt;li>Complex sequence of action,&lt;/li>
&lt;li>Possibly performed by multiple humans,&lt;/li>
&lt;li>Typically longer temporal duration&lt;/li>
&lt;li>Examples
&lt;ul>
&lt;li>Preparing a meal&lt;/li>
&lt;li>Having a meeting&lt;/li>
&lt;li>Shaking hands&lt;/li>
&lt;li>Football team scoring a goal&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h4 id="actions--activity-hierarchy">Actions / Activity Hierarchy&lt;/h4>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-07-21%2009.50.08.png" alt="截屏2021-07-21 09.50.08" style="zoom:67%;" />
&lt;p>Example: Small groups (meetings)&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Individual actions&lt;/strong>: Speaking, writing, listening, walking, standing up, sitting down, “fidgeting”,&amp;hellip;&lt;/li>
&lt;li>&lt;strong>Group activities&lt;/strong>: Meeting start, end, discussion, presentation, monologue, dialogue, white board, note-taking&lt;/li>
&lt;li>Often audio-visual cues&lt;/li>
&lt;/ul>
&lt;h3 id="approaches">Approaches&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Time series classification&lt;/strong> problem similar to speech/gesture recognition&lt;/p>
&lt;ul>
&lt;li>Typical classifiers:
&lt;ul>
&lt;li>HMMs and variants (e.g. Coupled HMMs, Layered HMMs)&lt;/li>
&lt;li>Dynamic Bayesian Networks (DBN)&lt;/li>
&lt;li>Recurrent neural networks&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Classification&lt;/strong> problem similar to object recognition/detection&lt;/p>
&lt;ul>
&lt;li>Typical classifiers:
&lt;ul>
&lt;li>Template matching&lt;/li>
&lt;li>Boosting&lt;/li>
&lt;li>Bag-of-Words SVMs&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Deep Learning approaches:
&lt;ul>
&lt;li>2D CNN (e.g. Two-Stream CNN, Temporal Segment Network)&lt;/li>
&lt;li>3D CNN (e.g. C3D, I3D)&lt;/li>
&lt;li>LSTM on top of 2D/3D CNN&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h2 id="recognition-with-local-feature-descriptors">Recognition with local feature descriptors&lt;/h2>
&lt;ul>
&lt;li>Try to model both Space and Time
&lt;ul>
&lt;li>Combine spatial and motion descriptors to model an action&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Action == Space-time objects
&lt;ul>
&lt;li>Transfer object detectors to action recognition&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="space-time-features--boosting">Space-Time Features + Boosting&lt;/h3>
&lt;h4 id="-idea">💡 Idea&lt;/h4>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-07-21%2010.57.41.png" alt="截屏2021-07-21 10.57.41">&lt;/p>
&lt;ul>
&lt;li>Extract many features describing the relevant content of an image sequence
&lt;ul>
&lt;li>&lt;strong>Histogram of oriented gradients (HOG)&lt;/strong> to describe appearance&lt;/li>
&lt;li>&lt;strong>Histogram of oriented flow (HOF)&lt;/strong> to describe motion in video&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Use &lt;strong>Boosting&lt;/strong> to select and combine good features for classification&lt;/li>
&lt;/ul>
&lt;h4 id="action-features">Action features&lt;/h4>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-07-21%2011.03.34.png" alt="截屏2021-07-21 11.03.34">&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Action volume = space-time cuboid region around the head (duration of action)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Encoded with block-histogram features $f\_{\theta}(\cdot)$
&lt;/p>
$$
\theta=(x, y, t, d x, d y, d t, \beta, \varphi)
$$
&lt;ul>
&lt;li>Location: $(x, y, t)$&lt;/li>
&lt;li>Space-time extent: $(d x, d y, d t)$&lt;/li>
&lt;li>Type of block: $\beta \in \\{\text{Plain, Temp-2, Spat-4}\\}$&lt;/li>
&lt;li>Type of histogram: $\varphi$
&lt;ul>
&lt;li>Histogram of optical flow (HOF)&lt;/li>
&lt;li>Histogram of oriented gradient (HOG)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Example&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-07-21%2011.07.18.png" alt="截屏2021-07-21 11.07.18" style="zoom:80%;" />
&lt;/li>
&lt;/ul>
&lt;h5 id="histogram-features">Histogram features&lt;/h5>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-07-21%2011.12.17.png" alt="截屏2021-07-21 11.12.17" style="zoom:80%;" />
&lt;ul>
&lt;li>(simplified) Histogram of oriented gradient (HOG)
&lt;ul>
&lt;li>Apply gradient operator to each frame within the sequence (e.g. Sobel)&lt;/li>
&lt;li>Bin gradients discretized in 4 orientations to block-histogram&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Histogram of optical flow (HOF)
&lt;ul>
&lt;li>Calculate optical flow (OF) between frames&lt;/li>
&lt;li>Bin OF vectors discretized in 4 direction bins (+1 bin for no motion) to block-histogram&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Normalized action cuboid has size 14x14x8 with units corresponding to 5x5x5 pixels&lt;/li>
&lt;/ul>
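&lt;p>The simplified block histograms can be sketched in NumPy. This is an illustrative reduction, not the lecture's implementation: it uses plain finite differences instead of a Sobel filter and unweighted orientation counts.&lt;/p>

```python
import numpy as np

def orientation_histogram(frame, n_bins=4):
    """Bin image gradient orientations into n_bins (simplified HOG).

    Illustration only: plain finite differences instead of Sobel,
    and unweighted counts (real HOG weights bins by gradient magnitude).
    """
    gy, gx = np.gradient(frame.astype(float))
    angles = np.arctan2(gy, gx) % np.pi                 # orientations in [0, pi)
    bins = (angles / np.pi * n_bins).astype(int).clip(0, n_bins - 1)
    hist = np.bincount(bins.ravel(), minlength=n_bins).astype(float)
    return hist / hist.sum()                            # normalized block histogram

# A horizontal intensity ramp: all gradients point in x-direction (bin 0)
frame = np.tile(np.arange(16.0), (16, 1))
h = orientation_histogram(frame)
```

&lt;p>An HOF block histogram would be built the same way from optical-flow vectors, with one extra bin for &amp;ldquo;no motion&amp;rdquo;.&lt;/p>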
&lt;h4 id="action-learning">Action Learning&lt;/h4>
&lt;ul>
&lt;li>Use &lt;strong>boosting&lt;/strong> method (e.g. AdaBoost) to classify features within an action volume&lt;/li>
&lt;li>Features: Block-histogram features&lt;/li>
&lt;/ul>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-07-21%2010.57.41-20210721152454899.png" alt="截屏2021-07-21 10.57.41">&lt;/p>
&lt;h5 id="boosting">Boosting&lt;/h5>
&lt;ul>
&lt;li>
&lt;p>A &lt;strong>weak classifier&lt;/strong> &lt;em>h&lt;/em> is a classifier with accuracy only slightly better than chance&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Boosting combines a number of weak classifiers so that the ensemble is arbitrarily accurate&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-07-21%2011.22.36.png" alt="截屏2021-07-21 11.22.36">&lt;/p>
&lt;ul>
&lt;li>Allows the use of simple (weak) classifiers without loss of accuracy&lt;/li>
&lt;li>Selects features and trains the classifier&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
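&lt;p>The boosting idea can be sketched with a minimal AdaBoost using one-dimensional threshold stumps as weak classifiers. The toy data and round count are made up for illustration; the lecture's classifier operates on block-histogram features instead.&lt;/p>

```python
import numpy as np

def adaboost_stumps(X, y, rounds=5):
    """Minimal AdaBoost: repeatedly pick the weighted-error-minimizing
    threshold stump, then upweight misclassified samples."""
    n = len(X)
    w = np.full(n, 1.0 / n)                  # sample weights
    ensemble = []                            # (alpha, threshold, polarity)
    for _ in range(rounds):
        best = None
        for t in X:                          # candidate thresholds
            for pol in (1, -1):
                pred = np.where(X >= t, pol, -pol)
                err = w[pred != y].sum()
                if best is None or err < best[0]:
                    best = (err, t, pol)
        err, t, pol = best
        err = min(max(err, 1e-10), 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)    # weak-classifier weight
        pred = np.where(X >= t, pol, -pol)
        w *= np.exp(-alpha * y * pred)           # reweight samples
        w /= w.sum()
        ensemble.append((alpha, t, pol))
    return ensemble

def predict(ensemble, X):
    score = sum(a * np.where(X >= t, p, -p) for a, t, p in ensemble)
    return np.sign(score)

X = np.array([1.0, 2.0, 3.0, 6.0, 7.0, 8.0])     # toy 1-D features
y = np.array([-1, -1, -1, 1, 1, 1])
model = adaboost_stumps(X, y)
```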
&lt;h3 id="space-time-interest-points-stip--bag-of-words-bow">Space-Time Interest Points (STIP) + Bag-of-Words (BoW)&lt;/h3>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-07-21%2011.25.50.png" alt="截屏2021-07-21 11.25.50">&lt;/p>
&lt;p>Inspired by &lt;strong>Bag-of-Words (BoW)&lt;/strong> model for object classification&lt;/p>
&lt;h4 id="bag-of-words-bow-model">&lt;strong>Bag-of-Words (BoW) model&lt;/strong>&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>&amp;ldquo;Visual Word&amp;rdquo; vocabulary learning&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Cluster local features&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Visual Words = Cluster Means&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>BoW feature calculation&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Assign each local feature to the most similar visual word&lt;/p>
&lt;/li>
&lt;li>
&lt;p>BoW feature = Histogram of visual word occurrences within a region&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>The histogram can be used to classify objects (e.g. with an SVM)&lt;/p>
&lt;/li>
&lt;/ul>
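&lt;p>The BoW feature calculation can be sketched as follows. The two-word vocabulary is hard-coded here for brevity; in the full pipeline it would come from clustering the training descriptors (e.g. with k-means).&lt;/p>

```python
import numpy as np

def bow_histogram(descriptors, vocabulary):
    """Assign each local descriptor to its nearest visual word
    (a vocabulary row) and count occurrences."""
    # Pairwise squared distances: (n_descriptors, n_words)
    d = ((descriptors[:, None, :] - vocabulary[None, :, :]) ** 2).sum(-1)
    words = d.argmin(axis=1)                 # nearest visual word per descriptor
    hist = np.bincount(words, minlength=len(vocabulary)).astype(float)
    return hist / hist.sum()                 # normalized BoW feature

vocab = np.array([[0.0, 0.0], [10.0, 10.0]])             # two "visual words"
desc = np.array([[0.1, 0.2], [9.8, 10.1], [0.3, -0.1]])  # local features
h = bow_histogram(desc, vocab)               # two hits for word 0, one for word 1
```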
&lt;blockquote>
&lt;p>Bag of Visual Words (&lt;a href="http://vision.stanford.edu/teaching/cs231a_autumn1112/lecture/lecture15_bow_part-based_cs231a_marked.pdf">Stanford CS231 slides&lt;/a>)&lt;/p>
&lt;ol>
&lt;li>
&lt;p>Feature detection and representation&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-07-26%2000.30.18.png" alt="截屏2021-07-26 00.30.18" style="zoom:67%;" />
&lt;/li>
&lt;li>
&lt;p>Codewords dictionary formation&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-07-26%2000.30.38.png" alt="截屏2021-07-26 00.30.38" style="zoom:67%;" />
&lt;/li>
&lt;li>
&lt;p>Bag of word representation&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-07-26%2000.31.41.png" alt="截屏2021-07-26 00.31.41" style="zoom:67%;" />
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-07-26%2000.31.55.png" alt="截屏2021-07-26 00.31.55" style="zoom:67%;" />
&lt;/li>
&lt;/ol>
&lt;/blockquote>
&lt;h4 id="space-time-features-detector">&lt;strong>Space-Time Features: Detector&lt;/strong>&lt;/h4>
&lt;p>&lt;strong>Space-Time Interest Points (STIP)&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Space-Time Extension of Harris Operator
&lt;ul>
&lt;li>Add dimensionality of time to the second moment matrix&lt;/li>
&lt;li>Look for maxima in extended Harris corner function H&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Detection depends on spatio-temporal scale&lt;/li>
&lt;li>Extract features at multiple levels of spatio-temporal scales (dense scale sampling)&lt;/li>
&lt;/ul>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-07-21%2012.45.51.png" alt="截屏2021-07-21 12.45.51">&lt;/p>
&lt;h4 id="space-time-features-descriptor">&lt;strong>Space-Time Features: Descriptor&lt;/strong>&lt;/h4>
&lt;p>Compute histogram descriptors of space-time volumes in neighborhood of detected points&lt;/p>
&lt;ul>
&lt;li>Compute a 4-bin HOG for each cube in 3x3x2 space-time grid&lt;/li>
&lt;li>Compute a 5-bin HOF for each cube in 3x3x2 space-time grid&lt;/li>
&lt;/ul>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-07-21%2012.47.19.png" alt="截屏2021-07-21 12.47.19">&lt;/p>
&lt;h4 id="action-classification">&lt;strong>Action classification&lt;/strong>&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Spatio-temporal Bag-of-Words (BoW)&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Build Visual vocabulary of local feature representations using k-means clustering&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Assign each feature in a video to nearest vocabulary word&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Compute histogram of visual word occurrences over the space-time volume of a video sequence&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>SVM classification&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Combine different feature types using multichannel $\chi^{2}$ Kernel&lt;/li>
&lt;li>One-against-all approach in case of multi-class classification&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
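&lt;p>A sketch of the multichannel $\chi^{2}$ kernel. Combining the per-channel kernels by multiplication is one common variant and is assumed here; the histogram values are invented for illustration.&lt;/p>

```python
import numpy as np

def chi2_kernel(H1, H2, gamma=1.0):
    """Chi-square kernel between histogram sets H1 (n, d) and H2 (m, d):
    K(x, y) = exp(-gamma * sum_i (x_i - y_i)^2 / (x_i + y_i))."""
    diff = H1[:, None, :] - H2[None, :, :]
    summ = H1[:, None, :] + H2[None, :, :]
    safe = np.where(summ > 0, summ, 1.0)                # avoid division by zero
    chi2 = np.where(summ > 0, diff ** 2 / safe, 0.0).sum(-1)
    return np.exp(-gamma * chi2)

def multichannel_kernel(channels_a, channels_b):
    """Combine per-channel chi2 kernels (e.g. one for HOG, one for HOF)
    by elementwise multiplication."""
    K = np.ones((len(channels_a[0]), len(channels_b[0])))
    for Ha, Hb in zip(channels_a, channels_b):
        K *= chi2_kernel(Ha, Hb)
    return K

hog = np.array([[0.5, 0.5], [1.0, 0.0]])    # BoW histograms, HOG channel
hof = np.array([[0.2, 0.8], [0.6, 0.4]])    # BoW histograms, HOF channel
K = multichannel_kernel([hog, hof], [hog, hof])
```

&lt;p>The resulting kernel matrix can be passed to a kernel SVM; one-against-all then handles the multi-class case.&lt;/p>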
&lt;h3 id="dense-trajectories-3">Dense Trajectories &lt;sup id="fnref:1">&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref">1&lt;/a>&lt;/sup>&lt;/h3>
&lt;ul>
&lt;li>Dense sampling improves results over sparse interest points for image classification&lt;/li>
&lt;li>The 2D space domain and 1D time domain in videos have very different characteristics $\rightarrow$ use them both&lt;/li>
&lt;/ul>
&lt;h4 id="feature-trajectories">&lt;strong>Feature trajectories&lt;/strong>&lt;/h4>
&lt;ul>
&lt;li>Efficient for representing videos
&lt;ul>
&lt;li>Extracted using KLT tracker or matching SIFT descriptors between frames&lt;/li>
&lt;li>However, their quantity and quality are generally not sufficient 🤪&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>State of the art: videos are now described by &lt;strong>dense&lt;/strong> trajectories&lt;/li>
&lt;/ul>
&lt;h4 id="dense-trajectories">&lt;strong>Dense Trajectories&lt;/strong>&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>Obtain trajectories by &lt;strong>optical flow&lt;/strong> tracking on densely sampled points&lt;/p>
&lt;ul>
&lt;li>Sampling
&lt;ul>
&lt;li>Sample feature points every 5th pixel&lt;/li>
&lt;li>Remove untrackable points (structure / Eigenvalue analysis)&lt;/li>
&lt;li>Sample points on eight different scales&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Tracking
&lt;ul>
&lt;li>Tracking by median filtering in the OF-Field&lt;/li>
&lt;li>Trajectory length is fixed (e.g. 15 frames)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Feature tracking&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Points of subsequent frames are concatenated to form a trajectory&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Trajectories are limited to $L$ frames in order to avoid drift from their initial location&lt;/p>
&lt;/li>
&lt;li>
&lt;p>The shape of a trajectory of length $L$ is described by the sequence
&lt;/p>
$$
S=\left(\Delta P\_{t}, \ldots, \Delta P\_{t+L-1}\right)
$$
&lt;/li>
&lt;li>
&lt;p>The resulting vector is normalized by
&lt;/p>
$$
\begin{array}{c}
\Delta P\_{t}=\left(P\_{t+1}-P\_{t}\right)=\left(x\_{t+1}-x\_{t}, y\_{t+1}-y\_{t}\right) \\\\
S^{\prime}=\frac{\left(\Delta P\_{t}, \ldots, \Delta P\_{t+L-1}\right)}{\sum\_{j=t}^{t+L-1}\left\|\Delta P\_{j}\right\|}
\end{array}
$$
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
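&lt;p>The normalized shape descriptor $S^{\prime}$ from the formulas above can be computed directly; the sample track is made up:&lt;/p>

```python
import numpy as np

def trajectory_shape(points):
    """Normalized trajectory shape descriptor S':
    displacements Delta P_t divided by the sum of their magnitudes."""
    P = np.asarray(points, dtype=float)
    dP = P[1:] - P[:-1]                       # Delta P_t = P_{t+1} - P_t
    norm = np.linalg.norm(dP, axis=1).sum()   # sum_j ||Delta P_j||
    return dP / norm

pts = [(0, 0), (1, 0), (2, 0), (2, 1)]        # tracked point over L=3 steps
S = trajectory_shape(pts)                     # shape (3, 2)
```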
&lt;h4 id="trajectory-descriptors">&lt;strong>Trajectory descriptors&lt;/strong>&lt;/h4>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-07-21%2015.29.28.png" alt="截屏2021-07-21 15.29.28">&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Histogram of Oriented Gradient (HOG)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Histogram of Optical Flow (HOF)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>HOGHOF&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Motion Boundary Histogram (MBH)&lt;/p>
&lt;ul>
&lt;li>Take local gradients of x-y flow components and compute HOG as in static images&lt;/li>
&lt;/ul>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-07-21%2015.31.09.png" alt="截屏2021-07-21 15.31.09">&lt;/p>
&lt;/li>
&lt;/ul>
&lt;div class="footnotes" role="doc-endnotes">
&lt;hr>
&lt;ol>
&lt;li id="fn:1">
&lt;p>Wang, Heng, et al. &amp;ldquo;Dense trajectories and motion boundary descriptors for action recognition.&amp;rdquo; International journal of computer vision 103.1 (2013): 60-79.&amp;#160;&lt;a href="#fnref:1" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;/ol>
&lt;/div></description></item><item><title>Action &amp; Activity Recognition 2</title><link>https://haobin-tan.netlify.app/docs/ai/computer-vision/cv-lecture/18-action-activity-recognition-2/</link><pubDate>Tue, 20 Jul 2021 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/computer-vision/cv-lecture/18-action-activity-recognition-2/</guid><description>&lt;p>&lt;strong>What is action recognition?&lt;/strong>&lt;/p>
&lt;p>Given an input video/image, perform some appropriate processing, and output the “action label”&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-07-21%2021.09.22.png" alt="截屏2021-07-21 21.09.22">&lt;/p>
&lt;h2 id="cnns-for-action--activity-recognition-1">CNNs for Action / Activity Recognition &lt;sup id="fnref:1">&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref">1&lt;/a>&lt;/sup>&lt;/h2>
&lt;p>Why CNN?&lt;/p>
&lt;ul>
&lt;li>Convolutional neural networks report the best performance in static image classification.&lt;/li>
&lt;li>They automatically learn to extract generic features that transfer well across data sets.&lt;/li>
&lt;/ul>
&lt;h3 id="strategies-for-temporal-fusion">Strategies for temporal fusion&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Single Frame CNN (baseline)&lt;/strong>&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-07-21%2021.43.07.png" alt="截屏2021-07-21 21.43.07" style="zoom:50%;" />
&lt;ul>
&lt;li>Network sees one frame at a time&lt;/li>
&lt;li>No temporal information&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Late Fusion CNN&lt;/strong>&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-07-21%2021.43.23.png" alt="截屏2021-07-21 21.43.23" style="zoom:50%;" />
&lt;ul>
&lt;li>Network sees two frames separated by F = 15 frames&lt;/li>
&lt;li>Both frames go into separate pathways&lt;/li>
&lt;li>Only the last layers have access to temporal information&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Early Fusion CNN&lt;/strong>&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-07-21%2021.44.19.png" alt="截屏2021-07-21 21.44.19" style="zoom:50%;" />
&lt;ul>
&lt;li>Modify the convolutional filters in the first layer to incorporate temporal information.
&lt;ul>
&lt;li>Filters of $11 \times 11 \times 3 \times T$ , where $T$ is the temporal context ($T=10$)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Slow Fusion CNN&lt;/strong>&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-07-21%2021.45.52.png" alt="截屏2021-07-21 21.45.52" style="zoom:50%;" />
&lt;ul>
&lt;li>Layers higher in the hierarchy have access to larger temporal context&lt;/li>
&lt;li>Learn motion patterns at different scales&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
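&lt;p>The four fusion strategies differ mainly in how frames are arranged at the network input; the layouts can be sketched as array shapes (frame gap $F$ and temporal context $T$ follow the numbers above, the video tensor itself is random):&lt;/p>

```python
import numpy as np

T, F, H, W = 10, 15, 170, 170
video = np.random.rand(40, 3, H, W)      # 40 RGB frames, layout (t, c, h, w)

# Single frame: one frame at a time, no temporal information
single = video[0]                        # (3, H, W)

# Late fusion: two frames F = 15 apart, fed into separate pathways
late_a, late_b = video[0], video[F]      # each (3, H, W)

# Early fusion: T consecutive frames stacked along the channel axis,
# so first-layer filters of size 11 x 11 x 3T see temporal context
early = video[:T].reshape(3 * T, H, W)   # (30, H, W)
```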
&lt;h3 id="multiresolution-cnn">Multiresolution CNN&lt;/h3>
&lt;p>Faster training by reducing input size from $170 \times 170$ to $89 \times 89$&lt;/p>
&lt;p>💡 Idea: takes advantage of the &lt;strong>camera bias&lt;/strong> present in many online videos, since the object of interest often occupies the center region.&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-07-21%2021.48.32.png" alt="截屏2021-07-21 21.48.32">&lt;/p>
&lt;ul>
&lt;li>The &lt;strong>context stream&lt;/strong> receives the downsampled frames at half the original spatial resolution (89 × 89 pixels)&lt;/li>
&lt;li>The &lt;strong>fovea stream&lt;/strong> receives the center 89 × 89 region at the original resolution&lt;/li>
&lt;/ul>
&lt;p>$\rightarrow$ The total input dimensionality is halved.&lt;/p>
&lt;h3 id="evaluation">Evaluation&lt;/h3>
&lt;p>Dataset: Sports-1M (1 Million videos, 487 sport activities classes)&lt;/p>
&lt;h2 id="encoding-image-and-optical-flow-separately-two-stream-cnns-2">Encoding image and optical flow separately (two-stream CNNs) &lt;sup id="fnref:2">&lt;a href="#fn:2" class="footnote-ref" role="doc-noteref">2&lt;/a>&lt;/sup>&lt;/h2>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-07-21%2022.35.36.png" alt="截屏2021-07-21 22.35.36">&lt;/p>
&lt;h2 id="3d-convolutions-for-action-recognition-c3d">3D convolutions for action recognition (C3D)&lt;/h2>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-07-21%2022.53.06.png" alt="截屏2021-07-21 22.53.06">&lt;/p>
&lt;p>Notations:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>video clips $\in c \times l \times h \times w$&lt;/p>
&lt;ul>
&lt;li>$c$: #channels&lt;/li>
&lt;li>$l$: length in number of frames&lt;/li>
&lt;li>$h, w$: height and width of the frame&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>3D convolution and pooling $\in d \times k \times k$&lt;/p>
&lt;ul>
&lt;li>$d$: kernel temporal depth&lt;/li>
&lt;li>$k$: kernel spatial size&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;p>C3D: 3 x 3 x 3 convolutions with stride 1 in space and time&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-07-21%2022.57.53.png" alt="截屏2021-07-21 22.57.53">&lt;/p>
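&lt;p>A naive single-channel 3D convolution makes the $d \times k \times k$ kernels concrete (C3D uses $3 \times 3 \times 3$ with stride 1); real implementations are vectorized and multi-channel:&lt;/p>

```python
import numpy as np

def conv3d_valid(volume, kernel):
    """Single-channel 3D convolution, stride 1, no padding."""
    D, Kh, Kw = kernel.shape
    L, H, W = volume.shape
    out = np.zeros((L - D + 1, H - Kh + 1, W - Kw + 1))
    for t in range(out.shape[0]):            # slide over time ...
        for i in range(out.shape[1]):        # ... and space
            for j in range(out.shape[2]):
                out[t, i, j] = (volume[t:t+D, i:i+Kh, j:j+Kw] * kernel).sum()
    return out

clip = np.random.rand(16, 8, 8)              # l x h x w, one channel
k = np.ones((3, 3, 3)) / 27                  # 3x3x3 averaging kernel
out = conv3d_valid(clip, k)                  # (14, 6, 6)
```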
&lt;h2 id="recurrent-convolutional-networks--cnn-rnn-3">Recurrent Convolutional Networks / CNN-RNN &lt;sup id="fnref:3">&lt;a href="#fn:3" class="footnote-ref" role="doc-noteref">3&lt;/a>&lt;/sup>&lt;/h2>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-07-21%2023.04.00.png" alt="截屏2021-07-21 23.04.00">&lt;/p>
&lt;h3 id="lrcn">LRCN&lt;/h3>
&lt;ul>
&lt;li>Task-specific instantiation&lt;/li>
&lt;li>Activity recognition (average frame representations)&lt;/li>
&lt;li>Image captioning (feed image info to each RNN)&lt;/li>
&lt;li>Video description (sequence-to-sequence models)&lt;/li>
&lt;/ul>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-07-21%2023.06.41.png" alt="截屏2021-07-21 23.06.41">&lt;/p>
&lt;h2 id="comparison-of-architectures">Comparison of architectures&lt;/h2>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Type of convolutional operators and layers&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>2D kernels (image-based) vs.&lt;/li>
&lt;li>3D kernels (video-based)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Input streams&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>RGB (spatial stream), usually used in single-stream networks&lt;/li>
&lt;li>Precomputed optical flow (temporal stream)&lt;/li>
&lt;li>Further streams possible (e.g. depth, human bounding boxes)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Fusion strategy across multiple frames&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Feature aggregation over time&lt;/li>
&lt;li>Recurrent layers, such as LSTM&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>$\rightarrow$ Modern architectures are usually a combination of the above!&lt;/strong>&lt;/p>
&lt;p>&lt;strong>Fair comparison of the architectures is difficult!&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>different pre-training of models, some are trained from scratch&lt;/li>
&lt;li>Activity recognition datasets have been too small for analysis of deep learning approaches $\rightarrow$ pre-training matters even more&lt;/li>
&lt;/ul>
&lt;h2 id="evolution-of-activity-recognition-datasets">Evolution of Activity Recognition Datasets&lt;/h2>
&lt;ul>
&lt;li>Construction of large-scale video datasets much harder than for images 🤪&lt;/li>
&lt;li>Common datasets too tiny for proper research of deep methods&lt;/li>
&lt;/ul>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-07-21%2023.15.57.png" alt="截屏2021-07-21 23.15.57" style="zoom:67%;" />
&lt;h2 id="evaluation-of-action-recognition-architectures-4">Evaluation of Action Recognition Architectures &lt;sup id="fnref:4">&lt;a href="#fn:4" class="footnote-ref" role="doc-noteref">4&lt;/a>&lt;/sup>&lt;/h2>
&lt;p>Contributions&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Release of the &lt;em>&lt;strong>Kinetics&lt;/strong>&lt;/em> dataset - a first large-scale dataset for Activity Recognition&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Benchmarking of three „classic“ architectures for activity recognition&lt;/p>
&lt;ul>
&lt;li>Note: fair comparison is still quite difficult, since models still differ in their modalities and pre-training basis&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>New Architecture: &lt;strong>I3D&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>3D CNN based Inception-V1 CNN (Google LeNet)&lt;/li>
&lt;li>&amp;ldquo;Inflation&amp;rdquo; of trained 2-D filters into the 3-D model&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="evaluation-of-3-classic-architectures">Evaluation of 3 &amp;ldquo;classic&amp;rdquo; architectures&lt;/h3>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-07-21%2023.25.40.png" alt="截屏2021-07-21 23.25.40">&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>ConvNet + LSTM (9M Parameters)&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Underlying CNN for feature extraction: Inception-V1&lt;/li>
&lt;li>LSTM with 512 hidden units (after the last AvgPool layer) + FC layer&lt;/li>
&lt;li>Estimating the action from the resulting prediction &lt;strong>Sequence&lt;/strong>:
&lt;ul>
&lt;li>Training: &lt;strong>output at each time-step used for loss&lt;/strong> calculation&lt;/li>
&lt;li>Testing: &lt;strong>output of the last frame used for final prediction&lt;/strong>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Pre-trained on ImageNet&lt;/strong>&lt;/li>
&lt;li>Preprocessing Steps: down-sampling from 25 to 5 fps&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>3D - ConvNet (79M Parameters)&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Spatio-temporal filters, C3D architecture&lt;/li>
&lt;li>High number of parameters $\rightarrow$ harder to train 🤪&lt;/li>
&lt;li>CNN Input: 16-frame snippets&lt;/li>
&lt;li>Classification: score averaging over each snippet in the video&lt;/li>
&lt;li>&lt;strong>Trained from scratch&lt;/strong>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Two Stream CNN (12 M Parameters)&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Underlying CNN for feature extraction: Inception-V1&lt;/li>
&lt;li>Spatial (RGB) and Temporal (Optical Flow) streams trained separately&lt;/li>
&lt;li>Prediction by score averaging&lt;/li>
&lt;li>CNN Pre-trained on ImageNet&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;p>Evaluation&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Two-Stream are still the clear winners&lt;/strong>&lt;/li>
&lt;li>3D-CNNs show poor performance and a very high number of parameters
&lt;ul>
&lt;li>Note: this is the only architecture trained from scratch&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="inflated-3d-cnn-i3d">Inflated 3D CNN (I3D)&lt;/h3>
&lt;p>💡 Idea: transfer the knowledge from the image recognition tasks in 3-D CNNs&lt;/p>
&lt;p>&lt;strong>I3D Architecture&lt;/strong>&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-07-21%2023.45.31.png" alt="截屏2021-07-21 23.45.31">&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Inception-V1 architecture extended to 3D&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Filters and pooling kernels inflated with the time dimension ($N \times N \rightarrow N \times N \times N$)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>👍 Advantage: Pre-training on Image-Net possible (Learned weights of 2-D filters repeated N times along the time dimension)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Note: the 3-D extension is not fully symmetric with respect to pooling (the time dimension is treated differently from the spatial dimensions)&lt;/p>
&lt;ul>
&lt;li>First two max-pooling layers &lt;strong>do not perform temporal pooling&lt;/strong>&lt;/li>
&lt;li>Late max-pooling layers use symmetric 3x3x3 kernels&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Evaluation&lt;/p>
&lt;ul>
&lt;li>I3D outperforms image-based approaches on each of the streams&lt;/li>
&lt;li>Combination of RGB input and optical flow still very useful&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
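&lt;p>The filter inflation can be sketched as follows: repeat the 2-D weights $N$ times along time and rescale by $1/N$, so that a &amp;ldquo;boring video&amp;rdquo; of identical frames produces the same response as the single image.&lt;/p>

```python
import numpy as np

def inflate_2d_filter(w2d, n_t):
    """Inflate an N x N filter into an n_t x N x N one by repeating it
    along time and rescaling, preserving responses on repeated frames."""
    return np.repeat(w2d[None, :, :], n_t, axis=0) / n_t

w2d = np.random.rand(3, 3)                    # trained 2-D filter
w3d = inflate_2d_filter(w2d, 3)               # (3, 3, 3)

patch = np.random.rand(3, 3)
resp_2d = (w2d * patch).sum()                              # image response
resp_3d = (w3d * np.repeat(patch[None], 3, axis=0)).sum()  # boring-video response
```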
&lt;h3 id="the-role-of-pre-training">The role of pre-training&lt;/h3>
&lt;p>&lt;strong>Pre-training on a video dataset (additionally to the Image-Net pre-training)&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Pre-training on MiniKinetics&lt;/li>
&lt;li>For 3D ConvNets, using additional data for pre-training is crucial&lt;/li>
&lt;li>For 2D ConvNets, the difference seems to be smaller&lt;/li>
&lt;/ul>
&lt;p>$\rightarrow$ Pre-training is crucial&lt;/p>
&lt;p>$\rightarrow$ I3D is the new State-of-The art model&lt;/p>
&lt;div class="footnotes" role="doc-endnotes">
&lt;hr>
&lt;ol>
&lt;li id="fn:1">
&lt;p>Karpathy, Andrej, et al. &amp;ldquo;Large-scale video classification with convolutional neural networks.&amp;rdquo; Computer Vision and Pattern Recognition (CVPR), 2014&amp;#160;&lt;a href="#fnref:1" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:2">
&lt;p>K. Simonyan, and A. Zisserman. Two-Stream Convolutional Networks for Action Recognition in Videos. In &lt;em>NIPS&lt;/em> 2015.&amp;#160;&lt;a href="#fnref:2" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:3">
&lt;p>J. Donahue, et al. Long-term Recurrent Convolutional Networks for Visual Recognition and Description. In &lt;em>CVPR&lt;/em> 2015.&amp;#160;&lt;a href="#fnref:3" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:4">
&lt;p>Carreira, J., &amp;amp; Zisserman, A. (2017). Quo Vadis, action recognition? A new model and the kinetics dataset. &lt;em>Proceedings - 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017&lt;/em>, &lt;em>2017&lt;/em>-&lt;em>January&lt;/em>, 4724–4733. &lt;a href="https://doi.org/10.1109/CVPR.2017.502">https://doi.org/10.1109/CVPR.2017.502&lt;/a>&amp;#160;&lt;a href="#fnref:4" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;/ol>
&lt;/div></description></item><item><title>Clean Code</title><link>https://haobin-tan.netlify.app/docs/cs/software-engineering/implementing-high-quality-systems/clean-code/</link><pubDate>Fri, 06 Nov 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/cs/software-engineering/implementing-high-quality-systems/clean-code/</guid><description>&lt;h2 id="motivation">Motivation&lt;/h2>
&lt;p>&lt;strong>Readability&lt;/strong> of code is important&lt;/p>
&lt;ul>
&lt;li>Code is much more often read than written&lt;/li>
&lt;li>You write code for the next human to read it, not for the compiler/interpreter/computer!&lt;/li>
&lt;/ul>
&lt;h2 id="object-oriented-design-ood">Object-Oriented Design (OOD)&lt;/h2>
&lt;blockquote>
&lt;p>A design strategy to build a system “made up of interacting objects that maintain their own local state and provide operations on that state information.”&lt;/p>
&lt;p>[Sommerville]&lt;/p>
&lt;/blockquote>
&lt;p>&lt;strong>SOLID&lt;/strong> principles: Five principles of good OO design&lt;/p>
&lt;ul>
&lt;li>&lt;strong>S&lt;/strong>ingle Responsibility Principle (SRP)&lt;/li>
&lt;li>&lt;strong>O&lt;/strong>pen Closed Principle (OCP)&lt;/li>
&lt;li>&lt;strong>L&lt;/strong>iskov Substitution Principle (LSP)&lt;/li>
&lt;li>&lt;strong>I&lt;/strong>nterface Segregation Principle (ISP)&lt;/li>
&lt;li>&lt;strong>D&lt;/strong>ependency Inversion Principle (DIP)&lt;/li>
&lt;/ul>
&lt;h3 id="single-responsibility-principle-srp">Single Responsibility Principle (SRP)&lt;/h3>
&lt;blockquote>
&lt;p>“There should never be more than one reason for a class to change.“
— R. Martin&lt;/p>
&lt;/blockquote>
&lt;ul>
&lt;li>
&lt;p>Each responsibility deals with &lt;strong>one core concern&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>It may also deal with further (cross-cutting) concerns&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Bad smell: Big class (~ &amp;gt;200 LOC, &amp;gt;15 methods/fields)&lt;/p>
&lt;ul>
&lt;li>Useful refactoring: Extract class&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Benefits:&lt;/p>
&lt;ul>
&lt;li>Code is easier to understand&lt;/li>
&lt;li>Adding/modifying functionality should affect few classes&lt;/li>
&lt;li>Risk of breaking code is minimised&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h4 id="insertion-command-query-separation">Insertion: Command-Query-Separation&lt;/h4>
&lt;ul>
&lt;li>Separate commands (actions) from simple queries (requests)&lt;/li>
&lt;li>Reason
&lt;ul>
&lt;li>Commands are expected to have side effects on an object’s state&lt;/li>
&lt;li>Queries should not change the state of an object&lt;/li>
&lt;li>Appropriate designs are simpler to understand and easier to test&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
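&lt;p>The command-query split can be illustrated with a small hypothetical class (not from the lecture; Python is used here for brevity):&lt;/p>

```python
class BankAccount:
    """Commands mutate state and return nothing; queries report
    state without side effects."""

    def __init__(self, balance=0):
        self._balance = balance

    def deposit(self, amount):
        """Command: changes the object's state, returns nothing."""
        self._balance += amount

    def balance(self):
        """Query: no side effects, so it is trivial to test and
        safe to call any number of times."""
        return self._balance

acct = BankAccount()
acct.deposit(100)
```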
&lt;h3 id="open-closed-principle-ocp">Open Closed Principle (OCP)&lt;/h3>
&lt;blockquote>
&lt;p>“Software entities (classes, modules, functions, etc.) should be open for extension, but closed for modification.”
— R. Martin, paraphrasing B. Meyer&lt;/p>
&lt;/blockquote>
&lt;ul>
&lt;li>
&lt;p>💡 Idea: Modify behaviour by adding new code, NOT by changing old code&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Strongly related to the “Information Hiding Principle”&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Example: Drawing a list of shapes using a switch statement&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-java" data-lang="java">&lt;span class="line">&lt;span class="cl">&lt;span class="k">for&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">Shape&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">shape&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">ShapeList&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="k">switch&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">shape&lt;/span>&lt;span class="p">.&lt;/span>&lt;span class="na">getType&lt;/span>&lt;span class="p">())&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="p">{&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="k">case&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">SQUARE&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">square&lt;/span>&lt;span class="p">.&lt;/span>&lt;span class="na">draw&lt;/span>&lt;span class="p">();&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">break&lt;/span>&lt;span class="p">;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="k">case&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">CIRCLE&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">circle&lt;/span>&lt;span class="p">.&lt;/span>&lt;span class="na">draw&lt;/span>&lt;span class="p">();&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">break&lt;/span>&lt;span class="p">;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="p">}&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Needs to be modified for new shapes 🤪&lt;/p>
&lt;p>Solution: use abstractions to keep the function open for extension&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-java" data-lang="java">&lt;span class="line">&lt;span class="cl">&lt;span class="k">for&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">Shape&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">shape&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">ShapeList&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="n">shape&lt;/span>&lt;span class="p">.&lt;/span>&lt;span class="na">draw&lt;/span>&lt;span class="p">();&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;/li>
&lt;/ul>
&lt;h3 id="liskov-substitution-principle-lsp">Liskov Substitution Principle (LSP)&lt;/h3>
&lt;blockquote>
&lt;p>“Functions that use pointers or references to base classes must be able to &lt;strong>use&lt;/strong> objects of &lt;strong>derived classes without knowing&lt;/strong> it.”&lt;/p>
&lt;p>&lt;em>— R. Martin&lt;/em>&lt;/p>
&lt;/blockquote>
&lt;h4 id="example">Example&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>Square &lt;strong>is-a&lt;/strong> Rectangle? Only in a mathematical sense!&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Square &lt;strong>can-NOT-substitute&lt;/strong> Rectangle, because it offers limited behaviour (&lt;code>setWidth&lt;/code> and &lt;code>setHeight&lt;/code> are dependent)&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2020-11-15%2014.38.09.png" alt="截屏2020-11-15 14.38.09">&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>LSP is related to B. Meyer‘s &lt;strong>Design by Contract&lt;/strong> (DbC):&lt;/p>
&lt;blockquote>
&lt;p>&lt;em>“When redefining a routine [in a derivative], you may only replace its&lt;/em> &lt;strong>precondition&lt;/strong> &lt;em>by a&lt;/em> &lt;strong>weaker&lt;/strong> &lt;em>one, and its&lt;/em> &lt;strong>postcondition&lt;/strong> &lt;em>by a&lt;/em> &lt;strong>stronger&lt;/strong> &lt;em>one.”&lt;/em>&lt;/p>
&lt;p>&lt;em>— B. Meyer&lt;/em>&lt;/p>
&lt;/blockquote>
&lt;ul>
&lt;li>In our case, rectangle&amp;rsquo;s &lt;code>setWidth&lt;/code> postcondition: &lt;code>width = w&lt;/code> and &lt;code>height = h&lt;/code>&lt;/li>
&lt;li>Square&amp;rsquo;s &lt;code>setWidth&lt;/code> postcondition: &lt;code>width = w&lt;/code> and &lt;code>height = w&lt;/code>&lt;/li>
&lt;li>Only weaker preconditions and stronger postconditions are allowed, as only they preserve substitutability. It is not allowed to change conditions to &lt;em>arbitrarily different&lt;/em> ones&lt;/li>
&lt;/ul>
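&lt;p>&lt;em>Illustrative sketch (not from the lecture; class and method names are made up):&lt;/em> a client written against &lt;code>Rectangle&lt;/code>’s contract observes different behaviour when handed a &lt;code>Square&lt;/code>, because the coupled setters change &lt;code>setWidth&lt;/code>’s postcondition:&lt;/p>

```java
// Sketch of the Square/Rectangle LSP violation discussed above.
class Rectangle {
    protected int width, height;
    public void setWidth(int w)  { width = w; }   // postcondition: width = w, height unchanged
    public void setHeight(int h) { height = h; }
    public int area() { return width * height; }
}

class Square extends Rectangle {
    // Square couples both setters to keep its invariant,
    // thereby altering Rectangle's postconditions.
    @Override public void setWidth(int w)  { width = w; height = w; }
    @Override public void setHeight(int h) { width = h; height = h; }
}

class LspDemo {
    // A client relying on Rectangle's contract:
    static int clientArea(Rectangle r) {
        r.setWidth(5);
        r.setHeight(4);
        return r.area(); // expects 5 * 4 = 20 for any Rectangle
    }
}
```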
&lt;p>Possible solution according to Liskov:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Square/Rectangle &lt;strong>can-substitute&lt;/strong> Shape,&lt;/p>
&lt;/li>
&lt;li>
&lt;p>if Shape collects&lt;/p>
&lt;ul>
&lt;li>
&lt;p>less specific behaviour&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2020-11-15%2014.43.10.png" alt="截屏2020-11-15 14.43.10">&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Alternative: Drop &lt;code>height = h&lt;/code> from Rectangle’s postcondition&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="interface-segregation-principle-isp">Interface Segregation Principle (ISP)&lt;/h3>
&lt;blockquote>
&lt;p>“Clients should not be forced to depend upon interfaces that they do not use.”&lt;/p>
&lt;p>&lt;em>— R. Martin&lt;/em>&lt;/p>
&lt;/blockquote>
&lt;p>Interfaces should be kept as lean as possible&lt;/p>
&lt;ul>
&lt;li>&lt;strong>High cohesion&lt;/strong>: Interfaces should only be concerned with single concepts&lt;/li>
&lt;li>&lt;strong>Interface pollution&lt;/strong>: Interfaces should not depend on other interfaces just because a subclass requires them&lt;/li>
&lt;li>Interfaces should be separated if used by different clients&lt;/li>
&lt;li>Refactorings: Extract interface/superclass&lt;/li>
&lt;/ul>
&lt;p>Example:&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-11-15%2014.48.06.png" alt="截屏2020-11-15 14.48.06" style="zoom:80%;" />
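&lt;p>&lt;em>A minimal sketch (illustrative names, not from the lecture):&lt;/em> instead of one fat device interface, each concept gets its own lean interface, and a client depends only on the one it actually uses:&lt;/p>

```java
// Segregated interfaces: one concept per interface.
interface Printer { String print(String doc); }
interface Scanner { String scan(); }

// A multi-function device may implement both concepts...
class MultiFunctionDevice implements Printer, Scanner {
    public String print(String doc) { return "printed:" + doc; }
    public String scan() { return "scanned"; }
}

// ...while this client is only coupled to Printer,
// so it is not forced to depend on scanning.
class PrintClient {
    static String run(Printer p) { return p.print("report"); }
}
```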
&lt;h3 id="dependency-inversion-principle-dip">Dependency Inversion Principle (DIP)&lt;/h3>
&lt;blockquote>
&lt;p>“&lt;strong>A.&lt;/strong> High level modules should not depend upon low level modules. Both should &lt;strong>depend upon abstractions&lt;/strong>.&lt;/p>
&lt;p>&lt;strong>B.&lt;/strong> Abstractions should not depend upon details. Details should depend upon abstractions.”&lt;/p>
&lt;p>&lt;em>— R. Martin&lt;/em>&lt;/p>
&lt;/blockquote>
&lt;p>Example:&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2020-11-15%2014.50.48.png" alt="截屏2020-11-15 14.50.48">&lt;/p>
&lt;p>Better design:&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2020-11-15%2014.51.23.png" alt="截屏2020-11-15 14.51.23">&lt;/p>
&lt;h4 id="why-inversion">&lt;strong>Why “Inversion”?&lt;/strong>&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>An interface has been used to &lt;em>invert&lt;/em> the dependency between packages&lt;/p>
&lt;/li>
&lt;li>
&lt;p>But in general: Add abstract concept that both classes A and B depend on&lt;/p>
&lt;/li>
&lt;/ul>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-11-15%2014.53.02.png" alt="截屏2020-11-15 14.53.02" style="zoom:80%;" />
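&lt;p>&lt;em>A minimal sketch of the inverted dependency (class names are made up for this example):&lt;/em> the high-level module and the low-level detail both depend on an abstraction that sits between them:&lt;/p>

```java
// The abstraction both sides depend on.
interface MessageSink { void write(String msg); }

// Low-level detail: one concrete sink among many possible ones.
class MemorySink implements MessageSink {
    final StringBuilder out = new StringBuilder();
    public void write(String msg) { out.append(msg); }
}

// High-level module: knows only the MessageSink abstraction,
// the concrete sink is injected from outside.
class Reporter {
    private final MessageSink sink;
    Reporter(MessageSink sink) { this.sink = sink; }
    void report(String s) { sink.write("[report] " + s); }
}
```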
&lt;h2 id="more-principles">More Principles&lt;/h2>
&lt;h3 id="law-of-demeter-dont-talk-to-strangers">Law of Demeter (don’t talk to strangers)&lt;/h3>
&lt;blockquote>
&lt;p>A module should not know about the innards of the objects it manipulates.&lt;/p>
&lt;/blockquote>
&lt;ul>
&lt;li>
&lt;p>Corresponds to the bad smell “Message Chains”:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-java" data-lang="java">&lt;span class="line">&lt;span class="cl">&lt;span class="n">value&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">getClassA&lt;/span>&lt;span class="p">().&lt;/span>&lt;span class="na">getClassB&lt;/span>&lt;span class="p">().&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="p">...&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="p">.&lt;/span>&lt;span class="na">getNeededValue&lt;/span>&lt;span class="p">();&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>👆 Ties code to a particular class structure, which is likely to break. 😢&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Rule: A method &lt;code>m&lt;/code> of a class &lt;code>C&lt;/code> should only call the methods of&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;code>C&lt;/code>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>An object created by &lt;code>m&lt;/code>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>An object passed as an argument to &lt;code>m&lt;/code>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>An object held in an instance variable of &lt;code>C&lt;/code>&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;p>Example:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Violation&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-java" data-lang="java">&lt;span class="line">&lt;span class="cl">&lt;span class="kd">class&lt;/span> &lt;span class="nc">Motor&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="p">{&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="kd">public&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="kt">void&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="nf">startEngine&lt;/span>&lt;span class="p">()&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="p">{&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="c1">// start the motor&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="p">}&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="p">}&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-java" data-lang="java">&lt;span class="line">&lt;span class="cl">&lt;span class="kd">class&lt;/span> &lt;span class="nc">Car&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="p">{&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="kd">public&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">Motor&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">motor&lt;/span>&lt;span class="p">;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="kd">public&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="nf">Car&lt;/span>&lt;span class="p">()&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="p">{&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="n">motor&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">new&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">Motor&lt;/span>&lt;span class="p">();&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="p">}&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="p">}&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-java" data-lang="java">&lt;span class="line">&lt;span class="cl">&lt;span class="kd">class&lt;/span> &lt;span class="nc">Driver&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="p">{&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="kd">public&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="kt">void&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="nf">drive&lt;/span>&lt;span class="p">()&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="p">{&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="n">Car&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">myCar&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">new&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">Car&lt;/span>&lt;span class="p">();&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="n">myCar&lt;/span>&lt;span class="p">.&lt;/span>&lt;span class="na">motor&lt;/span>&lt;span class="p">.&lt;/span>&lt;span class="na">startEngine&lt;/span>&lt;span class="p">();&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="c1">// violation!!!&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="p">}&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="p">}&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;/li>
&lt;li>
&lt;p>Solution&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-java" data-lang="java">&lt;span class="line">&lt;span class="cl">&lt;span class="kd">class&lt;/span> &lt;span class="nc">Car&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="p">{&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="kd">private&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">Motor&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">motor&lt;/span>&lt;span class="p">;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="kd">public&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="nf">Car&lt;/span>&lt;span class="p">()&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="p">{&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="n">motor&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">new&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">Motor&lt;/span>&lt;span class="p">();&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="p">}&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="kd">public&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="kt">void&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="nf">getReadyToDrive&lt;/span>&lt;span class="p">()&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="p">{&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="k">this&lt;/span>&lt;span class="p">.&lt;/span>&lt;span class="na">motor&lt;/span>&lt;span class="p">.&lt;/span>&lt;span class="na">startEngine&lt;/span>&lt;span class="p">()&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="p">}&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="p">}&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-java" data-lang="java">&lt;span class="line">&lt;span class="cl">&lt;span class="kd">class&lt;/span> &lt;span class="nc">Driver&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="p">{&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="kd">public&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="kt">void&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="nf">drive&lt;/span>&lt;span class="p">()&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="p">{&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="n">Car&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">myCar&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">new&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">Car&lt;/span>&lt;span class="p">();&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="n">myCar&lt;/span>&lt;span class="p">.&lt;/span>&lt;span class="na">getReadyToDrive&lt;/span>&lt;span class="p">();&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="p">}&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="p">}&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;/li>
&lt;/ul>
&lt;h3 id="boy-scout-rule">Boy Scout Rule&lt;/h3>
&lt;blockquote>
&lt;p>„Leave the campground cleaner than you found it!“
&lt;em>— The Boy Scouts of America&lt;/em>&lt;/p>
&lt;/blockquote>
&lt;ul>
&lt;li>
&lt;p>Code degrades as time passes&lt;/p>
&lt;/li>
&lt;li>
&lt;p>We seldom start with a greenfield&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;em>Being honest:&lt;/em>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;em>To the code&lt;/em>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;em>To your colleagues&lt;/em>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;em>To yourself about the code&lt;/em>&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;em>Refactor your code before checking it in&lt;/em>&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h3 id="principle-of-least-surprise">Principle of Least Surprise&lt;/h3>
&lt;blockquote>
&lt;p>Any function or class should implement the behaviours that another programmer could reasonably expect&lt;/p>
&lt;p>Also called &lt;a href="https://en.wikipedia.org/wiki/Principle_of_least_astonishment">&lt;strong>principle of least astonishment&lt;/strong> (&lt;strong>POLA&lt;/strong>)&lt;/a>&lt;/p>
&lt;p>&amp;ldquo;If a necessary feature has a high astonishment factor, it may be necessary to redesign the feature.&amp;rdquo;&lt;/p>
&lt;/blockquote>
&lt;ul>
&lt;li>If &lt;strong>obvious behaviour&lt;/strong> remains unimplemented, readers and users&amp;hellip;
&lt;ul>
&lt;li>no longer depend on their &lt;strong>intuition&lt;/strong> about function names&lt;/li>
&lt;li>fall back on reading internals&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="coding-conventions">Coding Conventions&lt;/h3>
&lt;h4 id="naming">Naming&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>Standardised (with respect to a project or team)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Meaningful, i.e. clear for everyone&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Intention-revealing&lt;/strong>:&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-11-15%2015.36.36.png" alt="截屏2020-11-15 15.36.36" style="zoom:67%;" />
&lt;/li>
&lt;li>
&lt;p>Make meaningful distinction and avoid disinformation&lt;/p>
&lt;ul>
&lt;li>Hints on &lt;strong>context&lt;/strong>&lt;/li>
&lt;li>Hints on &lt;strong>types&lt;/strong>&lt;/li>
&lt;li>Certain &lt;strong>prefixes&lt;/strong>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Avoid &lt;strong>noninformation&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Except for well-accepted cases (e.g. &lt;code>i&lt;/code> as a loop counter)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
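&lt;p>&lt;em>A small sketch of an intention-revealing rename (the domain and names are invented for illustration):&lt;/em> both methods compute the same thing, but only one tells the reader what and why:&lt;/p>

```java
class NamingDemo {
    // Unclear: what do d and t stand for? A reader must guess.
    static int f(int d, int t) { return d - t; }

    // Intention-revealing version of the same computation:
    // the names carry the context, no comment needed.
    static int remainingVacationDays(int totalVacationDays, int daysTaken) {
        return totalVacationDays - daysTaken;
    }
}
```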
&lt;h4 id="commenting">Commenting&lt;/h4>
&lt;blockquote>
&lt;p>“Don’t comment bad code—rewrite it.“&lt;/p>
&lt;p>&lt;em>— B. W. Kernighan, P. J. Plaugher&lt;/em>&lt;/p>
&lt;/blockquote>
&lt;p>Good comments are&lt;/p>
&lt;ul>
&lt;li>&lt;strong>explaining&lt;/strong>
&lt;ul>
&lt;li>Legal issues&lt;/li>
&lt;li>Performance issues&lt;/li>
&lt;li>Train of thought&lt;/li>
&lt;li>Intent&lt;/li>
&lt;li>Algorithms&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Good comments are &lt;strong>warning&lt;/strong>
&lt;ul>
&lt;li>Of consequences&lt;/li>
&lt;li>Over importance&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Good comments are &lt;strong>informative&lt;/strong>
&lt;ul>
&lt;li>Open issues, to-dos&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;p>Whenever possible, use &lt;strong>well-named code&lt;/strong> to tell what is done&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Intermediate variables explaining steps&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Extra methods encapsulating expressions&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2020-11-15%2015.44.54.png" alt="截屏2020-11-15 15.44.54">&lt;/p>
&lt;/li>
&lt;/ul>
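&lt;p>&lt;em>An illustrative sketch (classes and thresholds invented for the example):&lt;/em> an extra method encapsulating the expression replaces the comment that would otherwise explain it:&lt;/p>

```java
class Employee {
    int age;
    int yearsOfService;
    Employee(int age, int yearsOfService) {
        this.age = age;
        this.yearsOfService = yearsOfService;
    }
}

class PensionCheck {
    // Before: callers wrote
    //   if (e.age > 65 && e.yearsOfService > 10)  // check pension eligibility
    // After: the method name says what the condition means.
    static boolean isEligibleForPension(Employee e) {
        return e.age > 65 && e.yearsOfService > 10;
    }
}
```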
&lt;h4 id="formatting">&lt;strong>Formatting&lt;/strong>&lt;/h4>
&lt;ul>
&lt;li>Visually representing levels of cohesion&lt;/li>
&lt;li>&lt;strong>Vertical&lt;/strong> openness between concepts,
&lt;ul>
&lt;li>e.g. declarations&lt;/li>
&lt;li>e.g. add blank lines after imports or after a method is finished&lt;/li>
&lt;li>lines that are &lt;strong>related&lt;/strong> should be written &lt;strong>densely together&lt;/strong>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Horizontal&lt;/strong> openness
&lt;ul>
&lt;li>to accentuate operators / operator precedence&lt;/li>
&lt;li>to separate parameters&lt;/li>
&lt;li>use spaces to emphasize elements and indent to make scopes visible&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="dont-repeat-yourself-dry">Don’t repeat yourself (DRY)&lt;/h3>
&lt;blockquote>
&lt;p>Do not duplicate pieces of code!&lt;/p>
&lt;/blockquote>
&lt;ul>
&lt;li>Copy &amp;amp; paste decreases&amp;hellip;
&lt;ul>
&lt;li>&lt;strong>Maintainability&lt;/strong>: Losing track of copies&lt;/li>
&lt;li>&lt;strong>Understandability&lt;/strong>
&lt;ul>
&lt;li>Code is less compact&lt;/li>
&lt;li>An identical concept needs to be understood multiple times&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Evolvability&lt;/strong>
&lt;ul>
&lt;li>Need to find and modify all copies when removing bugs or changing behaviour&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Duplicated code fosters errors and inconsistencies&lt;/li>
&lt;/ul>
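&lt;p>&lt;em>A minimal sketch (the price-formatting example is invented):&lt;/em> instead of pasting the same formatting expression at several call sites, the concept lives in one helper, so a bug fix or behaviour change happens in exactly one place:&lt;/p>

```java
class Prices {
    // Before refactoring, this expression was copy-pasted at several call sites:
    //   String label = name + ": " + (cents / 100) + "." + (cents % 100) + " EUR";
    // (note the latent bug: 105 cents would render as "1.5" instead of "1.05")
    static String formatPrice(String name, int cents) {
        return name + ": " + (cents / 100) + "."
                + String.format("%02d", cents % 100) + " EUR";
    }
}
```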
&lt;h3 id="keep-it-simple-stupid-kiss">Keep it simple, stupid (KISS)&lt;/h3>
&lt;blockquote>
&lt;p>“Make everything as &lt;strong>simple&lt;/strong> as possible, but not simpler”&lt;/p>
&lt;p>— &lt;em>Albert Einstein&lt;/em>&lt;/p>
&lt;/blockquote>
&lt;ul>
&lt;li>
&lt;p>Good code is easy to understand by anybody&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Good code addresses the problem adequately&lt;/p>
&lt;/li>
&lt;li>
&lt;p>For example, if an &lt;code>IEnumerable&lt;/code> is suitable, do not use an &lt;code>ICollection&lt;/code> or even an &lt;code>IList&lt;/code>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Techniques which help ensure that your code is understandable by others:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Code reviews&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Pair programming&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="you-aint-gonna-need-it-yagni">You ain’t gonna need it (YAGNI)&lt;/h3>
&lt;blockquote>
&lt;p>Only implement required features!&lt;/p>
&lt;/blockquote>
&lt;ul>
&lt;li>
&lt;p>Featurism is &lt;strong>costly&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>unrequested features need to be tested, documented&lt;/p>
&lt;/li>
&lt;li>
&lt;p>over-engineered systems sacrifice maintainability, as they are overly complex (KISS)&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Beware of &lt;strong>optimisations&lt;/strong>!&lt;/p>
&lt;ul>
&lt;li>Often merely treat symptoms&lt;/li>
&lt;li>Too costly to be done prematurely&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="single-level-of-abstraction-sla">Single Level of Abstraction (SLA)&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>Newspaper metaphor:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Good newspaper articles are well-ordered&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Navigation with details increasing:&lt;/p>
&lt;ul>
&lt;li>headline (very high abstraction)&lt;/li>
&lt;li>text with synopsis (high abstraction)&lt;/li>
&lt;li>rest (details)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Statements within a function&lt;/strong> should be at the same abstraction level&lt;/p>
&lt;ul>
&lt;li>if not, extract expressions/statements of higher detail into an own method&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Functions in a class&lt;/strong>: The abstraction level should decrease depth-first when reading from top to bottom&lt;/p>
&lt;/li>
&lt;/ul>
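&lt;p>&lt;em>A small sketch of the newspaper metaphor in code (the order-processing domain is invented):&lt;/em> the top-level method reads like a headline, and each statement stays at the same abstraction level, with the details extracted one level down:&lt;/p>

```java
class OrderProcessor {
    // "Headline" level: every statement is equally abstract.
    static String process(int itemCents, int quantity) {
        int total = totalCents(itemCents, quantity);
        return receiptLine(total);
    }

    // Detail level: the arithmetic lives here, not in process().
    private static int totalCents(int itemCents, int quantity) {
        return itemCents * quantity;
    }

    // Detail level: the formatting lives here.
    private static String receiptLine(int totalCents) {
        return "total=" + totalCents;
    }
}
```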
&lt;h2 id="refactoring">Refactoring&lt;/h2>
&lt;blockquote>
&lt;p>&lt;em>If it stinks, change it.&lt;/em>&lt;/p>
&lt;/blockquote>
&lt;ul>
&lt;li>
&lt;p>Methods tend to grow during development&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Bad odour (smell) of a long method arises&lt;/p>
&lt;/li>
&lt;li>
&lt;p>What to do? Extract cohesive parts into new methods&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2020-11-15%2016.15.51.png" alt="截屏2020-11-15 16.15.51">&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h3 id="what-is-refactoring">What is Refactoring?&lt;/h3>
&lt;blockquote>
&lt;p>A „&lt;strong>disciplined technique for restructuring&lt;/strong> an existing body of code, altering its internal structure &lt;strong>without changing its external behavior&lt;/strong>.“&lt;/p>
&lt;p>&lt;em>— M. Fowler&lt;/em>&lt;/p>
&lt;/blockquote>
&lt;h3 id="the-first-rule-in-refactoring">&lt;strong>The First Rule in Refactoring&lt;/strong>&lt;/h3>
&lt;p>&lt;strong>Refactor with tests only!&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Good tests help to prevent introducing bugs into the program through refactoring&lt;/li>
&lt;/ul>
&lt;h3 id="bad-smells">Bad Smells&lt;/h3>
&lt;p>Bad code smells: symptoms of deeper problems&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Long method&lt;/strong>: having code blocks led by comments&lt;/p>
&lt;ul>
&lt;li>👨‍⚕️Cure: Extract Method: extract commented block&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Duplicated code&lt;/strong>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Feature envy&lt;/strong>: class A excessively calls another class B’s methods&lt;/p>
&lt;ul>
&lt;li>👨‍⚕️Cure: parts of A’s methods want to be in class B
&lt;ol>
&lt;li>Extract Method: extract code block calling class B&lt;/li>
&lt;li>Move Method: move extracted part to class B&lt;/li>
&lt;/ol>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Data class&lt;/strong>: class merely holds data (&amp;ldquo;dumb data holder&amp;rdquo;)&lt;/p>
&lt;ul>
&lt;li>👨‍⚕️Cure: enforce information hiding principle, collect functionality
&lt;ol>
&lt;li>Encapsulate field: getter/setter instead of public access&lt;/li>
&lt;li>Remove setting method: only for read-only values&lt;/li>
&lt;li>Move method: collect functionality implemented elsewhere
&lt;ul>
&lt;li>think about responsibilities of the class&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Large/God class&lt;/strong>: class tries to do too much&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Inappropriate intimacy&lt;/strong>: class has dependencies on implementation details of another class&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&amp;hellip;&lt;/p>
&lt;/li>
&lt;/ul>
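&lt;p>&lt;em>The extract-then-move cure for feature envy can be sketched as follows (illustrative classes, not from the lecture):&lt;/em> the distance computation envied &lt;code>Point&lt;/code>’s data, so after Extract Method and Move Method it lives next to that data:&lt;/p>

```java
class Point {
    final double x, y;
    Point(double x, double y) { this.x = x; this.y = y; }

    // After Extract Method + Move Method:
    // the computation sits with the data it uses.
    double distanceTo(Point other) {
        double dx = x - other.x;
        double dy = y - other.y;
        return Math.sqrt(dx * dx + dy * dy);
    }
}

class Route {
    // Before: Route computed sqrt((a.x-b.x)^2 + (a.y-b.y)^2) inline,
    // reaching into Point's fields (feature envy).
    static double legLength(Point a, Point b) {
        return a.distanceTo(b);
    }
}
```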
&lt;blockquote>
&lt;p>For the full catalog, see: &lt;a href="https://www.refactoring.com/catalog/index.html">https://www.refactoring.com/catalog/index.html&lt;/a>&lt;/p>
&lt;/blockquote>
&lt;h3 id="when-to-refactor">&lt;strong>When to Refactor?&lt;/strong>&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>It is not that simple to find out &lt;strong>when to refactor&lt;/strong>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>So-called “&lt;strong>bad smells&lt;/strong>” in code may give a good indication when refactoring is worthwhile&lt;/p>
&lt;/li>
&lt;li>
&lt;p>More general guidelines&lt;/p>
&lt;ul>
&lt;li>when you find yourself looking up details frequently
&lt;ul>
&lt;li>&lt;em>what was the order of the method parameters again?&lt;/em>&lt;/li>
&lt;li>&lt;em>where was this method again and what does it do?&lt;/em>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>when you feel the need to write a &lt;strong>comment&lt;/strong>
&lt;ul>
&lt;li>&lt;strong>try to refactor the code so that the comment becomes superfluous&lt;/strong>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="limitations">Limitations&lt;/h3>
&lt;ul>
&lt;li>May influence performance negatively 🤪
&lt;ul>
&lt;li>However, it is recommended to do the refactoring first&lt;/li>
&lt;li>and the performance tuning on the cleaner code afterwards&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h2 id="appendix">Appendix&lt;/h2>
&lt;h3 id="separation-of-concerns-soc">Separation of Concerns (SoC)&lt;/h3>
&lt;blockquote>
&lt;p>Each module should be focused on a single concern.&lt;/p>
&lt;/blockquote>
&lt;ul>
&lt;li>👍 Benefits
&lt;ul>
&lt;li>Loose coupling, high cohesion&lt;/li>
&lt;li>Better testability: each test stays focused on one module&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Some concerns may crosscut a system‘s core concerns
&lt;ul>
&lt;li>Typical crosscutting concerns:
&lt;ul>
&lt;li>Tracing/Logging&lt;/li>
&lt;li>Security&lt;/li>
&lt;li>Transactionality&lt;/li>
&lt;li>Caching&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Aspect Oriented Programming (AOP) provides adequate concepts&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="order-of-implementation">Order of Implementation&lt;/h3>
&lt;p>For the implementation (and unit testing later) always try to &lt;strong>move from the least-coupled to the most-coupled classes&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>avoids unnecessary creation of “stubs”&lt;/li>
&lt;/ul>
&lt;p>Example&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2020-11-15%2016.47.49.png" alt="截屏2020-11-15 16.47.49">&lt;/p>
&lt;h3 id="use-a-version-control-system">Use a Version Control System&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>History&lt;/strong> of commented changes&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Shared working in a team, even on the same artefacts; branching and merging&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Tagging&lt;/strong> versions as pre-release etc.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Reverting&lt;/strong> to previous revisions&lt;/p>
&lt;ul>
&lt;li>
&lt;p>reduces fears of breaking code&lt;/p>
&lt;/li>
&lt;li>
&lt;p>encourages a programmer‘s willingness to refactor code&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="test-first">Test First&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>Test-Driven Development&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Clean tests should follow the &lt;strong>F.I.R.S.T.&lt;/strong> rules&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>F&lt;/strong>ast: to run them frequently&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>I&lt;/strong>ndependent: A failing test does not influence others&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>R&lt;/strong>epeatable: in any environment, so there is no excuse for failing tests&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>S&lt;/strong>elf-Validating: Tests either pass or fail automatically&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>T&lt;/strong>imely: Tests are written right before production code&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Tests should follow same standards as production code&lt;/p>
&lt;ul>
&lt;li>and be executed in a continuous manner
&lt;ul>
&lt;li>so-called continuous integration&lt;/li>
&lt;li>reduces fear of breaking code&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="static-code-analyses">Static Code Analyses&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>Classes of metrics&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Duplication (detection of DRY violation)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Unit tests (test coverage should be &amp;gt; 90%)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Complexity (avg. LoC per class)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Potential bugs&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Coding rules&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Comments&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Architecture &amp;amp; design&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h3 id="code-reviews">Code Reviews&lt;/h3>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Explaining&lt;/strong> your code to others helps&amp;hellip;&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Detecting errors&lt;/strong> and unclean passages&lt;/li>
&lt;li>&lt;strong>Spreading knowledge&lt;/strong> through a team,&lt;/li>
&lt;li>esp. to less experienced colleagues
&lt;ul>
&lt;li>
&lt;p>about design principles&lt;/p>
&lt;/li>
&lt;li>
&lt;p>about further aspects of the system under development&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Refactoring&lt;/strong> helps to &lt;strong>instantly apply&lt;/strong> suggestions, so follow-up ideas can be given in one session&lt;/p>
&lt;ul>
&lt;li>Works only in small groups with few opinions&lt;/li>
&lt;li>In larger groups, &lt;strong>design reviews&lt;/strong> are better suited&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul></description></item></channel></rss>