<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Neural Network Basics | Haobin Tan</title><link>https://haobin-tan.netlify.app/tags/nerual-network-basics/</link><atom:link href="https://haobin-tan.netlify.app/tags/nerual-network-basics/index.xml" rel="self" type="application/rss+xml"/><description>Neural Network Basics</description><generator>Hugo Blox Builder (https://hugoblox.com)</generator><language>en-us</language><lastBuildDate>Tue, 08 Sep 2020 00:00:00 +0000</lastBuildDate><image><url>https://haobin-tan.netlify.app/media/icon_hu7d15bc7db65c8eaf7a4f66f5447d0b42_15095_512x512_fill_lanczos_center_3.png</url><title>Neural Network Basics</title><link>https://haobin-tan.netlify.app/tags/nerual-network-basics/</link></image><item><title>Neural Network Basics</title><link>https://haobin-tan.netlify.app/docs/ai/deep-learning/nn-basics/</link><pubDate>Fri, 31 Jul 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/deep-learning/nn-basics/</guid><description/></item><item><title>Perceptron</title><link>https://haobin-tan.netlify.app/docs/ai/deep-learning/nn-basics/perceptron/</link><pubDate>Tue, 01 Sep 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/deep-learning/nn-basics/perceptron/</guid><description>&lt;h2 id="structure">Structure&lt;/h2>
&lt;p>A perceptron is&lt;/p>
&lt;ul>
&lt;li>a single-layer neural network&lt;/li>
&lt;li>used for supervised learning of binary classifiers&lt;/li>
&lt;/ul>
&lt;figure>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/perceptron.png"
alt="Perceptron">&lt;figcaption>
&lt;p>Perceptron&lt;/p>
&lt;/figcaption>
&lt;/figure>
$$
g(x) = \underbrace{\sum\_{i=1}^n w\_i x\_i}\_{\text{linear separator}} + \underbrace{w\_0}\_{\text{offset/bias}}
$$
&lt;p>Decision rule for classification:
&lt;/p>
$$
\hat{y} = \begin{cases} 1 &amp;\text{if } g(x) > 0 \\\\ -1 &amp;\text{else}\end{cases}
$$
&lt;h2 id="update-rule">Update Rule&lt;/h2>
&lt;p>$w=w+y x$ if prediction is wrong&lt;/p>
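&lt;p>As a minimal sketch (not from the original post), the update rule can be implemented in NumPy, folding the bias $w\_0$ into the weight vector via a leading 1-column:&lt;/p>

```python
import numpy as np

def perceptron_train(X, y, epochs=10):
    """Perceptron learning: update w by +/- x on each mistake.

    X: (n_samples, n_features) inputs with a leading 1-column for the bias.
    y: labels in {-1, +1}.
    """
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            g = w @ x_i
            y_hat = 1 if g > 0 else -1
            if y_hat != y_i:  # wrong prediction: w = w + y * x
                w += y_i * x_i
    return w
```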
&lt;ul>
&lt;li>
&lt;p>If label $y=1$ but predict $\hat{y}=-1$: $w = w + x$&lt;/p>
&lt;/li>
&lt;li>
&lt;p>If label $y=-1$ but predict $\hat{y}=1$: $w = w - x$&lt;/p>
&lt;/li>
&lt;/ul></description></item><item><title>👍 Activation Functions</title><link>https://haobin-tan.netlify.app/docs/ai/deep-learning/nn-basics/activation-functions/</link><pubDate>Mon, 17 Aug 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/deep-learning/nn-basics/activation-functions/</guid><description>&lt;p>Activation functions should be&lt;/p>
&lt;ul>
&lt;li>&lt;strong>non-linear&lt;/strong>&lt;/li>
&lt;li>&lt;strong>differentiable&lt;/strong> (since training uses backpropagation)&lt;/li>
&lt;/ul>
&lt;div class="flex px-4 py-3 mb-6 rounded-md bg-primary-100 dark:bg-primary-900">
&lt;span class="pr-3 pt-1 text-primary-600 dark:text-primary-300">
&lt;svg height="24" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24">&lt;path fill="none" stroke="currentColor" stroke-linecap="round" stroke-linejoin="round" stroke-width="1.5" d="m11.25 11.25l.041-.02a.75.75 0 0 1 1.063.852l-.708 2.836a.75.75 0 0 0 1.063.853l.041-.021M21 12a9 9 0 1 1-18 0a9 9 0 0 1 18 0m-9-3.75h.008v.008H12z"/>&lt;/svg>
&lt;/span>
&lt;span class="dark:text-neutral-300">&lt;p>Q: Why can’t the mapping between layers be linear?&lt;/p>
&lt;p>A: A composition of linear functions is still linear, so the whole network would collapse to a single linear model.&lt;/p>
&lt;/span>
&lt;/div>
&lt;h2 id="sigmoid">Sigmoid&lt;/h2>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-08-17%2011.07.47.png" alt="截屏2020-08-17 11.07.47" style="zoom:50%;" />
$$
\sigma(x)=\frac{1}{1+\exp (-x)}
$$
&lt;ul>
&lt;li>
&lt;p>Squashes numbers to range $[0,1]$&lt;/p>
&lt;/li>
&lt;li>
&lt;p>✅ &lt;span style="color:green">Historically popular since they have nice interpretation as a saturating “firing rate” of a neuron&lt;/span>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>⛔️ &lt;span style="color:red">Problems&lt;/span>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Vanishing gradients&lt;/strong>: the gradient is almost zero once the output saturates near $0$ or $1$&lt;/li>
&lt;li>Sigmoid outputs are &lt;strong>not zero-centered&lt;/strong>, which makes gradient updates less well-behaved&lt;/li>
&lt;li>$\exp()$ is somewhat expensive to compute&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Derivative
&lt;/p>
$$
\frac{d}{dx} \sigma(x) = \sigma(x)(1 - \sigma(x))
$$
&lt;p>
(See: &lt;a href="https://math.stackexchange.com/questions/78575/derivative-of-sigmoid-function-sigma-x-frac11e-x">Derivative of Sigmoid Function&lt;/a>)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Python implementation&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="k">def&lt;/span> &lt;span class="nf">sigmoid&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">x&lt;/span>&lt;span class="p">):&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">return&lt;/span> &lt;span class="mi">1&lt;/span> &lt;span class="o">/&lt;/span> &lt;span class="p">(&lt;/span>&lt;span class="mi">1&lt;/span> &lt;span class="o">+&lt;/span> &lt;span class="n">np&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">exp&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="n">x&lt;/span>&lt;span class="p">))&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;ul>
&lt;li>
&lt;p>Derivative (here &lt;code>y&lt;/code> is the sigmoid output $\sigma(x)$)&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="k">def&lt;/span> &lt;span class="nf">dsigmoid&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">y&lt;/span>&lt;span class="p">):&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">return&lt;/span> &lt;span class="n">y&lt;/span> &lt;span class="o">*&lt;/span> &lt;span class="p">(&lt;/span>&lt;span class="mi">1&lt;/span> &lt;span class="o">-&lt;/span> &lt;span class="n">y&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h2 id="tanh">Tanh&lt;/h2>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-08-17%2011.08.03.png" alt="截屏2020-08-17 11.08.03" style="zoom:50%;" />
&lt;ul>
&lt;li>
&lt;p>Squashes numbers to range $[-1,1]$&lt;/p>
&lt;/li>
&lt;li>
&lt;p>✅ &lt;span style="color:green">zero centered (nice)&lt;/span> &amp;#x1f44f;&lt;/p>
&lt;/li>
&lt;li>
&lt;p>⛔️ &lt;span style="color:red">&lt;strong>Vanishing gradients&lt;/strong>: still kills gradients when saturated&lt;/span>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Derivative (here $y = \tanh(x)$):
&lt;/p>
$$
\frac{d}{dx}\tanh(x) = 1 - [\tanh(x)]^2
$$
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="k">def&lt;/span> &lt;span class="nf">dtanh&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">y&lt;/span>&lt;span class="p">):&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">return&lt;/span> &lt;span class="mi">1&lt;/span> &lt;span class="o">-&lt;/span> &lt;span class="n">y&lt;/span> &lt;span class="o">*&lt;/span> &lt;span class="n">y&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;/li>
&lt;/ul>
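&lt;p>As with sigmoid, the derivative is expressed in terms of the output $y = \tanh(x)$; a quick numerical sanity check (a sketch, not from the original post):&lt;/p>

```python
import numpy as np

def tanh(x):
    return np.tanh(x)

def dtanh(y):
    # derivative expressed in terms of the output y = tanh(x)
    return 1 - y * y

# sanity check against a numerical derivative
x = 0.7
eps = 1e-6
numeric = (tanh(x + eps) - tanh(x - eps)) / (2 * eps)
assert np.isclose(dtanh(tanh(x)), numeric)
```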
&lt;h2 id="rectified-linear-unit-relu">Rectified Linear Unit (ReLU)&lt;/h2>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-08-17%2011.08.14.png" alt="截屏2020-08-17 11.08.14" style="zoom: 67%;" />
$$
f(x) = \max(0, x)
$$
&lt;ul>
&lt;li>
&lt;p>✅ &lt;span style="color:green">Advantages&lt;/span>&lt;/p>
&lt;ul>
&lt;li>Does not saturate (in $[0, \infty)$)&lt;/li>
&lt;li>Very computationally efficient&lt;/li>
&lt;li>Converges much faster than sigmoid/tanh in practice&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>⛔️ &lt;span style="color:red">Problems&lt;/span>&lt;/p>
&lt;ul>
&lt;li>Not zero-centered output&lt;/li>
&lt;li>No gradient for $x &lt; 0$ (dying ReLU)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Python implementation&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="kn">import&lt;/span> &lt;span class="nn">numpy&lt;/span> &lt;span class="k">as&lt;/span> &lt;span class="nn">np&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">def&lt;/span> &lt;span class="nf">ReLU&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">x&lt;/span>&lt;span class="p">):&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">return&lt;/span> &lt;span class="n">np&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">maximum&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">x&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;/li>
&lt;/ul>
&lt;h2 id="leaky-relu">Leaky ReLU&lt;/h2>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-08-17%2011.40.32.png" alt="截屏2020-08-17 11.40.32" style="zoom:50%;" />
$$
f(x) = \max(0.1x, x)
$$
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Parametric Rectifier (PReLu)&lt;/strong>
&lt;/p>
$$
f(x) = \max(\alpha x, x)
$$
&lt;ul>
&lt;li>Also learn $\alpha$&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>✅ &lt;span style="color:green">Advantages&lt;/span>&lt;/p>
&lt;ul>
&lt;li>Does not saturate&lt;/li>
&lt;li>Computationally efficient&lt;/li>
&lt;li>Converges much faster than sigmoid/tanh in practice!&lt;/li>
&lt;li>will not “die”&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Python implementation&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="kn">import&lt;/span> &lt;span class="nn">numpy&lt;/span> &lt;span class="k">as&lt;/span> &lt;span class="nn">np&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">def&lt;/span> &lt;span class="nf">ReLU&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">x&lt;/span>&lt;span class="p">):&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">return&lt;/span> &lt;span class="n">np&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">maximum&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="mf">0.1&lt;/span> &lt;span class="o">*&lt;/span> &lt;span class="n">x&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">x&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;/li>
&lt;/ul>
&lt;h2 id="exponential-linear-units-elu">Exponential Linear Units (ELU)&lt;/h2>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-08-17 11.08.41.png" alt="截屏2020-08-17 11.08.41" style="zoom: 50%;" />
$$
f(x) = \begin{cases} x &amp;\text{if }x > 0 \\\\
\alpha(\exp (x)-1) &amp; \text {if }x \leq 0\end{cases}
$$
&lt;ul>
&lt;li>✅ &lt;span style="color:green">Advantages&lt;/span>
&lt;ul>
&lt;li>All benefits of ReLU&lt;/li>
&lt;li>Closer to zero mean outputs&lt;/li>
&lt;li>Negative saturation regime compared with Leaky ReLU (adds some robustness to noise)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>⛔️ &lt;span style="color:red">Problems&lt;/span>
&lt;ul>
&lt;li>Computation requires $\exp()$&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
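&lt;p>A minimal NumPy sketch of ELU (not from the original post; &lt;code>alpha&lt;/code> defaults to $1$):&lt;/p>

```python
import numpy as np

def elu(x, alpha=1.0):
    # linear for x > 0, saturates towards -alpha for very negative x
    return np.where(x > 0, x, alpha * (np.exp(x) - 1))
```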
&lt;h2 id="maxout">Maxout&lt;/h2>
$$
f(x) = \max \left(w\_{1}^{T} x+b\_{1}, w\_{2}^{T} x+b\_{2}\right)
$$
&lt;ul>
&lt;li>Generalizes ReLU and Leaky ReLU
&lt;ul>
&lt;li>ReLU is Maxout with $w\_1 =0$ and $b\_1 = 0$&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>✅ &lt;span style="color:green">Fixes the dying ReLU problem&lt;/span>&lt;/li>
&lt;li>⛔️ &lt;span style="color:red">Doubles the number of parameters&lt;/span>&lt;/li>
&lt;/ul>
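&lt;p>A direct NumPy sketch of the two-piece Maxout unit (parameter names are illustrative, not from the original post):&lt;/p>

```python
import numpy as np

def maxout(x, W1, b1, W2, b2):
    # elementwise max over two affine maps; W1, W2 have shape (out_dim, in_dim)
    return np.maximum(W1 @ x + b1, W2 @ x + b2)
```

&lt;p>With &lt;code>W1 = 0&lt;/code> and &lt;code>b1 = 0&lt;/code> this reduces to ReLU applied to the second affine map, as noted above.&lt;/p>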
&lt;h2 id="softmax">Softmax&lt;/h2>
&lt;ul>
&lt;li>Softmax: probability that input $x$ belongs to class $c\_k$. First, compute a score (logit) for each class:
$$
o\_k = \theta\_k^Tx \qquad \forall k = 1, \dots, j
$$&lt;/li>
&lt;/ul>
$$
p\left(y=c\_{k} \mid x ; \boldsymbol{\theta}\right)= p\left(c\_{k} = 1 \mid x ; \boldsymbol{\theta}\right) = \frac{e^{o\_k}}{\sum\_{j} e^{o\_j}}
$$
&lt;ul>
&lt;li>Derivative of the cross-entropy loss w.r.t. the logit $o\_j$:
$$
\frac{\partial L}{\partial o\_{j}} = p\left(\hat{y}\_{j}\right) - y\_{j}
$$&lt;/li>
&lt;/ul>
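&lt;p>A numerically stable softmax sketch (subtracting the maximum logit does not change the result, since the factor cancels in the ratio):&lt;/p>

```python
import numpy as np

def softmax(o):
    # subtract the max logit for numerical stability
    e = np.exp(o - np.max(o))
    return e / e.sum()
```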
&lt;h2 id="advice-in-practice">Advice in Practice&lt;/h2>
&lt;ul>
&lt;li>Use &lt;span style="color:green">&lt;strong>ReLU&lt;/strong>&lt;/span>
&lt;ul>
&lt;li>Be careful with your learning rates / initialization&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Try out &lt;span style="color:green">Leaky ReLU / ELU / Maxout&lt;/span>&lt;/li>
&lt;li>Try out &lt;span style="color:orange">tanh&lt;/span> but don’t expect much&lt;/li>
&lt;li>&lt;span style="color:red">Don’t use sigmoid&lt;/span>&lt;/li>
&lt;/ul>
&lt;h2 id="summary-and-overview">Summary and Overview&lt;/h2>
&lt;p>See: &lt;a href="https://en.wikipedia.org/wiki/Activation_function">Wiki-Activation Function&lt;/a>&lt;/p></description></item><item><title>👍 Loss Functions</title><link>https://haobin-tan.netlify.app/docs/ai/deep-learning/nn-basics/loss-function/</link><pubDate>Mon, 17 Aug 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/deep-learning/nn-basics/loss-function/</guid><description>&lt;ul>
&lt;li>Quantifies what it means to have a “good” model&lt;/li>
&lt;li>Different types of loss functions for different tasks, such as:
&lt;ul>
&lt;li>Classification&lt;/li>
&lt;li>Regression&lt;/li>
&lt;li>Metric Learning&lt;/li>
&lt;li>Reinforcement Learning&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h2 id="classification">Classification&lt;/h2>
&lt;ul>
&lt;li>
&lt;p>Classification: Predicting a discrete class label&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Negative log-likelihood loss (per sample $x$) / Cross-Entropy loss&lt;/strong>
&lt;/p>
$$
L(\boldsymbol{x}, y)=-\sum\_{j} y\_{j} \log p\left(c\_{j} \mid \boldsymbol{x}\right)
$$
&lt;ul>
&lt;li>Used in various multiclass classification methods for NN training&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Hinge Loss&lt;/strong>: used in Support Vector Machines (SVMs)
&lt;/p>
$$
L(x, y)=\sum\_{j} \max \left(0, 1-\hat{y}\_{j} y\_{j}\right)
$$
&lt;/li>
&lt;/ul>
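&lt;p>A minimal sketch of the per-sample cross-entropy loss above (assuming &lt;code>p&lt;/code> holds predicted class probabilities and &lt;code>y&lt;/code> a one-hot label vector):&lt;/p>

```python
import numpy as np

def cross_entropy(p, y):
    # p: predicted class probabilities, y: one-hot ground-truth vector
    return -np.sum(y * np.log(p))
```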
&lt;h2 id="regression">Regression&lt;/h2>
&lt;ul>
&lt;li>
&lt;p>Regression: Predicting one or more continuous quantities $y\_1, \dots, y\_n$&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Goal: Minimize the distance between the predicted values $\hat{y}\_j$ and the true values $y\_j$&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>L1-Loss (Mean Absolute Error, MAE)&lt;/strong>
&lt;/p>
$$
L(\hat{y}, y)=\sum\_{j}\left|\hat{y}\_{j}-y\_{j}\right|
$$
&lt;/li>
&lt;li>
&lt;p>&lt;strong>L2-Loss (Mean Square Error, MSE)&lt;/strong>
&lt;/p>
$$
L(\hat{y}, y)=\sum\_{j}\left(\hat{y}\_{j}-y\_{j}\right)^2
$$
&lt;/li>
&lt;/ul>
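&lt;p>Minimal sketches of the two regression losses (using the absolute error for L1):&lt;/p>

```python
import numpy as np

def l1_loss(y_hat, y):
    # sum of absolute errors
    return np.sum(np.abs(y_hat - y))

def l2_loss(y_hat, y):
    # sum of squared errors
    return np.sum((y_hat - y) ** 2)
```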
&lt;h2 id="metric-learning--similarity-learning">Metric Learning / Similarity Learning&lt;/h2>
&lt;ul>
&lt;li>
&lt;p>A model for measuring the distance (or similarity) between objects&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Triplet Loss&lt;/strong>&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-100" >&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%e6%88%aa%e5%b1%8f2020-08-17%2012.20.34.png" alt="截屏2020-08-17 12.20.34" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
$$
\sum\_{(a, p, n) \in T} \max \left\\{0, \alpha-\left\|\mathbf{x}\_{a}-\mathbf{x}\_{n}\right\|\_{2}^{2}+\left\|\mathbf{x}\_{a}-\mathbf{x}\_{p}\right\|\_{2}^{2}\right\\}
$$
&lt;/li>
&lt;/ul></description></item><item><title>Multilayer Perceptron and Backpropagation</title><link>https://haobin-tan.netlify.app/docs/ai/deep-learning/nn-basics/back-prop/</link><pubDate>Mon, 17 Aug 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/deep-learning/nn-basics/back-prop/</guid><description>&lt;h2 id="multi-layer-perceptron-mlp">Multi-Layer Perceptron (MLP)&lt;/h2>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-08-17%2017.57.47.png" alt="截屏2020-08-17 17.57.47" style="zoom: 50%;" />
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Input layer $I \in R^{D\_{I} \times N}$&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>How we initially represent the features&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Mini-batch processing with $N$ inputs&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Weight matrices&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Input to Hidden: $W\_{H} \in R^{D\_{H} \times D\_{I}}$&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Hidden to Output: $W\_{O} \in R^{D\_{O} \times D\_{H}}$&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Hidden layer(s) $H \in R^{D\_{H} \times N}$&lt;/strong>
&lt;/p>
$$
H = W\_{H}I + b\_H
$$
$$
\widehat{H}\_j = f(H\_j)
$$
&lt;ul>
&lt;li>$f$: non-linear activation function&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Output layer $O \in R^{D\_{O} \times N}$&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>The value of the target function that the network approximates
$$
O = W\_O \widehat{H} + b\_O
$$&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
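&lt;p>The forward pass above can be sketched in NumPy (an illustrative sketch; weight shapes are chosen as $W\_H \in R^{D\_H \times D\_I}$ and $W\_O \in R^{D\_O \times D\_H}$ so that the matrix products are well-defined, with a sigmoid as $f$):&lt;/p>

```python
import numpy as np

def mlp_forward(I, W_H, b_H, W_O, b_O):
    """Two-layer MLP forward pass (sketch).

    I:   (D_I, N) mini-batch of inputs
    W_H: (D_H, D_I), b_H: (D_H, 1)
    W_O: (D_O, D_H), b_O: (D_O, 1)
    """
    H = W_H @ I + b_H             # hidden pre-activation, shape (D_H, N)
    H_hat = 1 / (1 + np.exp(-H))  # non-linear activation f (sigmoid here)
    O = W_O @ H_hat + b_O         # output layer, shape (D_O, N)
    return H_hat, O
```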
&lt;h2 id="backpropagation">Backpropagation&lt;/h2>
&lt;ul>
&lt;li>
&lt;p>Loss function
&lt;/p>
$$
L = \frac{1}{2}(O - Y)^2
$$
&lt;ul>
&lt;li>
&lt;p>Achieve minimum $L$: &lt;strong>Stochastic Gradient Descent&lt;/strong>&lt;/p>
&lt;ol>
&lt;li>
&lt;p>Calculate $\frac{\delta L}{\delta w}$ for each parameter $w$&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Update the parameters
&lt;/p>
$$
w:=w-\alpha \frac{\delta L}{\delta w}
$$
&lt;/li>
&lt;/ol>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Compute the gradients: &lt;strong>Backpropagation (Backprop)&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Output layer&lt;/li>
&lt;/ul>
$$
\begin{aligned}
\frac{\delta L}{\delta O} &amp;=(O-Y) \\\\ \\\\
\frac{\delta L}{\delta \widehat{H}} &amp;=W\_{O}^{T} \color{red}{\frac{\delta L}{\delta O}} \\\\ \\\\
\frac{\delta L}{\delta W\_{O}} &amp;= {\color{red}{\frac{\delta L}{\delta O} }}\widehat{H}^{T} \\\\ \\\\
\frac{\delta L}{\delta b\_{O}} &amp;= \color{red}{\frac{\delta L}{\delta O}}
\end{aligned}
$$
&lt;p>(&lt;span style="color:red">Red&lt;/span>: terms previously computed)&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Hidden layer (assuming $\widehat{H}=\operatorname{sigmoid}(H)$ )
&lt;/p>
$$
\frac{\delta L}{\delta H}={\color{red}{\frac{\delta L}{\delta \hat{H}}}} \odot \widehat{H} \odot(1-\widehat{H})
$$
&lt;p>
($\odot$: element-wise multiplication)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Input layer&lt;/p>
&lt;/li>
&lt;/ul>
$$
\begin{array}{l}
\frac{\delta L}{\delta W\_{H}}={\color{red}{\frac{\delta L}{\delta H}}} I^{T} \\\\ \\\\
\frac{\delta L}{\delta b\_{H}}={\color{red}{\frac{\delta L}{\delta H}}}
\end{array}
$$
&lt;/li>
&lt;/ul>
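&lt;p>The gradients above translate almost line-by-line into NumPy (a sketch assuming $W\_H \in R^{D\_H \times D\_I}$ and $W\_O \in R^{D\_O \times D\_H}$; bias gradients are summed over the mini-batch):&lt;/p>

```python
import numpy as np

def mlp_backward(I, H_hat, O, Y, W_O):
    """Gradients of L = 0.5 * sum((O - Y)**2) for the two-layer MLP.

    Assumes a sigmoid hidden activation, W_H: (D_H, D_I), W_O: (D_O, D_H).
    """
    dO = O - Y                            # dL/dO
    dW_O = dO @ H_hat.T                   # dL/dW_O
    db_O = dO.sum(axis=1, keepdims=True)  # dL/db_O, summed over the batch
    dH_hat = W_O.T @ dO                   # dL/dH_hat
    dH = dH_hat * H_hat * (1 - H_hat)     # sigmoid derivative, elementwise
    dW_H = dH @ I.T                       # dL/dW_H
    db_H = dH.sum(axis=1, keepdims=True)  # dL/db_H
    return dW_H, db_H, dW_O, db_O
```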
&lt;h2 id="gradients-for-vectorized-operations">Gradients for vectorized operations&lt;/h2>
&lt;p>When dealing with matrix and vector operations, we must pay closer attention to dimensions and transpose operations.&lt;/p>
&lt;p>&lt;strong>Matrix-Matrix multiply gradient&lt;/strong>. Possibly the most tricky operation is the matrix-matrix multiplication (which generalizes all matrix-vector and vector-vector) multiply operations:&lt;/p>
&lt;p>(Example from &lt;a href="https://cs231n.github.io/optimization-2/#staged">cs231n&lt;/a>)&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># forward pass&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">W&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">np&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">random&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">randn&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="mi">5&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">10&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">X&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">np&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">random&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">randn&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="mi">10&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">3&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">D&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">W&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">dot&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">X&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># now suppose we had the gradient on D from above in the circuit&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">dD&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">np&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">random&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">randn&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="o">*&lt;/span>&lt;span class="n">D&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">shape&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="c1"># same shape as D&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">dW&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">dD&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">dot&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">X&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">T&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="c1">#.T gives the transpose of the matrix&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">dX&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">W&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">T&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">dot&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">dD&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;div class="flex px-4 py-3 mb-6 rounded-md bg-primary-100 dark:bg-primary-900">
&lt;span class="pr-3 pt-1 text-primary-600 dark:text-primary-300">
&lt;svg height="24" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24">&lt;path fill="none" stroke="currentColor" stroke-linecap="round" stroke-linejoin="round" stroke-width="1.5" d="m11.25 11.25l.041-.02a.75.75 0 0 1 1.063.852l-.708 2.836a.75.75 0 0 0 1.063.853l.041-.021M21 12a9 9 0 1 1-18 0a9 9 0 0 1 18 0m-9-3.75h.008v.008H12z"/>&lt;/svg>
&lt;/span>
&lt;span class="dark:text-neutral-300">&lt;p>&lt;strong>Tip: use dimension analysis!&lt;/strong>&lt;/p>
&lt;p>&lt;em>Note that you do not need to remember the expressions for &lt;code>dW&lt;/code> and &lt;code>dX&lt;/code> because they are easy to re-derive based on dimensions.&lt;/em>&lt;/p>
&lt;p>&lt;em>For instance, we know that the gradient on the weights &lt;code>dW&lt;/code> must be of the same size as &lt;code>W&lt;/code> after it is computed, and that it must depend on matrix multiplication of &lt;code>X&lt;/code> and &lt;code>dD&lt;/code> (as is the case when both &lt;code>X,W&lt;/code> are single numbers and not matrices). There is always exactly one way of achieving this so that the dimensions work out.&lt;/em>&lt;/p>
&lt;p>&lt;em>For example, &lt;code>X&lt;/code> is of size [10 x 3] and &lt;code>dD&lt;/code> of size [5 x 3], so if we want &lt;code>dW&lt;/code> and &lt;code>W&lt;/code> has shape [5 x 10], then the only way of achieving this is with &lt;code>dD.dot(X.T)&lt;/code>, as shown above.&lt;/em>&lt;/p>
&lt;/span>
&lt;/div>
&lt;p>For discussion of math details see: &lt;a href="https://math.stackexchange.com/questions/1866757/not-understanding-derivative-of-a-matrix-matrix-product">Not understanding derivative of a matrix-matrix product.&lt;/a>&lt;/p></description></item><item><title>Softmax and Its Derivative</title><link>https://haobin-tan.netlify.app/docs/ai/deep-learning/nn-basics/softmax/</link><pubDate>Tue, 08 Sep 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/deep-learning/nn-basics/softmax/</guid><description>&lt;figure>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/softmax.png"
alt="Softmax">&lt;figcaption>
&lt;p>Softmax&lt;/p>
&lt;/figcaption>
&lt;/figure>
&lt;p>We use the softmax activation function to predict the probability assigned to each of $n$ classes. For example, the probability of assigning the input sample to the $j$-th class is:
&lt;/p>
$$
p\_j = \operatorname{softmax}(z\_j) = \frac{e^{z\_j}}{\sum\_{k=1}^n e^{z\_k}}
$$
&lt;p>
Furthermore, we use one-hot encoding to represent the ground truth $y$, which means
&lt;/p>
$$
\sum\_{k=1}^n y\_k = 1
$$
&lt;p>
Loss function (Cross-Entropy):
&lt;/p>
$$
\begin{aligned}
L &amp;= -\sum\_{k=1}^n y\_k \log(p\_k) \\\\
&amp;= - \left(y\_j \log(p\_j) + \sum\_{k \neq j}y\_k \log(p\_k)\right)
\end{aligned}
$$
&lt;p>
Gradient w.r.t $z\_j$:&lt;/p>
$$
\begin{aligned}
\frac{\partial}{\partial z\_j}L
&amp;= \frac{\partial}{\partial z\_j} \left(-\sum\_{k=1}^n y\_k \log(p\_k)\right) \\\\
&amp;= -\frac{\partial}{\partial z\_j} \left(y\_j \log(p\_j) + \sum\_{k \neq j}y\_k \log(p\_k)\right) \\\\
&amp;= -\left(\frac{\partial}{\partial z\_j} y\_j \log(p\_j) + \frac{\partial}{\partial z\_j}\sum\_{k \neq j}y\_k \log(p\_k)\right)
\end{aligned}
$$
&lt;ul>
&lt;li>
&lt;p>$k=j$
&lt;/p>
$$
\begin{aligned}
\frac{\partial}{\partial z\_j} y\_j \log(p\_j)
&amp;= \frac{y\_j}{p\_j} \cdot \left(\frac{\partial}{\partial z\_j} p\_j\right) \\\\
&amp;= \frac{y\_j}{p\_j} \cdot \left(\frac{\partial}{\partial z\_j} \frac{e^{z\_j}}{\sum\_{k=1}^n e^{z\_k}}\right) \\\\
&amp;= \frac{y\_j}{p\_j} \cdot \frac{(\frac{\partial}{\partial z\_j}e^{z\_j})\sum\_{k=1}^n e^{z\_k} - e^{z\_j}(\frac{\partial}{\partial z\_j}\sum\_{k=1}^n e^{z\_k}) }{(\sum\_{k=1}^n e^{z\_k})^2} \\\\
&amp;= \frac{y\_j}{p\_j} \cdot \frac{e^{z\_j}\sum\_{k=1}^n e^{z\_k} - e^{z\_j}e^{z\_j}}{(\sum\_{k=1}^n e^{z\_k})^2} \\\\
&amp;= \frac{y\_j}{p\_j} \cdot \frac{e^{z\_j}(\sum\_{k=1}^n e^{z\_k} - e^{z\_j})}{(\sum\_{k=1}^n e^{z\_k})^2} \\\\
&amp;= \frac{y\_j}{p\_j} \cdot \underbrace{\frac{e^{z\_j}}{\sum\_{k=1}^n e^{z\_k}}}\_{=p\_j} \cdot \frac{\sum\_{k=1}^n e^{z\_k} - e^{z\_j}}{\sum\_{k=1}^n e^{z\_k}}\\\\
&amp;= \frac{y\_j}{p\_j} \cdot p\_j \cdot \underbrace{\left( \frac{\sum\_{k=1}^n e^{z\_k} }{\sum\_{k=1}^n e^{z\_k}} - \frac{e^{z\_j} }{\sum\_{k=1}^n e^{z\_k}}\right)}\_{=1 - p\_j} \\\\
&amp;= y\_j(1-p\_j)
\end{aligned}
$$
&lt;/li>
&lt;li>
&lt;p>$\forall k \neq j$
&lt;/p>
$$
\begin{aligned}
\frac{\partial}{\partial z\_j}\sum\_{k \neq j}y\_k \log(p\_k)
&amp;= \sum\_{k \neq j} \frac{\partial}{\partial z\_j}y\_k \log(p\_k) \\\\
&amp;= \sum\_{k \neq j} \frac{y\_k}{p\_k} \cdot \frac{(\overbrace{\frac{\partial}{\partial z\_j} e^{z\_k}}^{=0})(\sum\_i e^{z\_i}) - e^{z\_k}(\overbrace{\frac{\partial}{\partial z\_j}\sum\_i e^{z\_i}}^{=e^{z\_j}})}{(\sum\_{k=1}^n e^{z\_k})^2} \\\\
&amp;= \sum\_{k \neq j} \frac{y\_k}{p\_k} \frac{-e^{z\_k} e^{z\_j}}{(\sum\_{k=1}^n e^{z\_k})^2} \\\\
&amp;= -\sum\_{k \neq j} \frac{y\_k}{p\_k} \underbrace{\frac{e^{z\_k}}{\sum\_{k=1}^n e^{z\_k}}}\_{=p\_k} \underbrace{\frac{e^{z\_j}}{\sum\_{k=1}^n e^{z\_k}}}\_{=p\_j} \\\\
&amp;= -\sum\_{k \neq j}y\_kp\_j
\end{aligned}
$$
&lt;/li>
&lt;/ul>
&lt;p>Therefore,
&lt;/p>
$$
\begin{aligned}
\frac{\partial}{\partial z\_j}L
&amp;= -\left(\frac{\partial}{\partial z\_j} y\_j \log(p\_j) + \frac{\partial}{\partial z\_j}\sum\_{k \neq j}y\_k \log(p\_k)\right) \\\\
&amp;= -\left(y\_j(1-p\_j) - \sum\_{k \neq j}y\_kp\_j\right) \\\\
&amp;= -\left(y\_j-y\_jp\_j - \sum\_{k \neq j}y\_kp\_j\right) \\\\
&amp;= -\left(y\_j- (y\_jp\_j + \sum\_{k \neq j}y\_kp\_j)\right) \\\\
&amp;= -\left(y\_j- \sum\_{k=1}^ny\_kp\_j\right)\\\\
&amp;\overset{\sum\_{k=1}^{n} y\_k = 1}{=} -\left(y\_j- p\_j\right) \\\\
&amp;= p\_j - y\_j
\end{aligned}
$$
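&lt;p>The result $\frac{\partial L}{\partial z\_j} = p\_j - y\_j$ can be verified numerically (a self-contained sketch, not from the original post):&lt;/p>

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # max-subtraction for numerical stability
    return e / e.sum()

def cross_entropy(z, y):
    # cross-entropy of the softmax probabilities against one-hot y
    return -np.sum(y * np.log(softmax(z)))

# compare the analytic gradient p - y against central differences
z = np.array([0.5, -1.0, 2.0])
y = np.array([0.0, 1.0, 0.0])  # one-hot ground truth
analytic = softmax(z) - y
eps = 1e-6
for j in range(len(z)):
    zp, zm = z.copy(), z.copy()
    zp[j] += eps
    zm[j] -= eps
    numeric = (cross_entropy(zp, y) - cross_entropy(zm, y)) / (2 * eps)
    assert np.isclose(numeric, analytic[j])
```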
&lt;h2 id="useful-resources">Useful resources&lt;/h2>
&lt;ul>
&lt;li>
&lt;p>&lt;a href="https://towardsdatascience.com/derivative-of-the-softmax-function-and-the-categorical-cross-entropy-loss-ffceefc081d1">Derivative of the Softmax Function and the Categorical Cross-Entropy Loss&lt;/a>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;a href="https://eli.thegreenplace.net/2016/the-softmax-function-and-its-derivative/">The Softmax function and its derivative&lt;/a>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;a href="https://gombru.github.io/2018/05/23/cross_entropy_loss/">Understanding Categorical Cross-Entropy Loss, Binary Cross-Entropy Loss, Softmax Loss, Logistic Loss, Focal Loss and all those confusing names&lt;/a>&lt;/p>
&lt;/li>
&lt;/ul></description></item><item><title>Generalization</title><link>https://haobin-tan.netlify.app/docs/ai/deep-learning/nn-basics/generalization/</link><pubDate>Mon, 17 Aug 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/deep-learning/nn-basics/generalization/</guid><description>&lt;p>&lt;strong>Generalization: Ability to Apply what was learned during Training to &lt;em>new (Test)&lt;/em> Data&lt;/strong>&lt;/p>
&lt;h2 id="reasons-for-bad-generalization">Reasons for bad generalization&lt;/h2>
&lt;ul>
&lt;li>Overfitting/Overtraining (trained too long)&lt;/li>
&lt;li>Too little training material&lt;/li>
&lt;li>Too many parameters (weights) or an inappropriate network architecture&lt;/li>
&lt;/ul>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-08-17%2012.59.51.png" alt="截屏2020-08-17 12.59.51" style="zoom:67%;" />
&lt;h2 id="prevent-overfitting">Prevent Overfitting&lt;/h2>
&lt;ul>
&lt;li>Obviously the best approach: Collect More Data! &amp;#x1f4aa;&lt;/li>
&lt;li>If Data is Limited
&lt;ul>
&lt;li>Simplest Method for Best Generalization: &lt;strong>Early Stopping&lt;/strong>&lt;/li>
&lt;li>Optimize Parameters/Architecture
&lt;ul>
&lt;li>Architectural Learning&lt;/li>
&lt;li>Choose Best Architecture by Repeated Experimentation on Cross Validation Set&lt;/li>
&lt;li>Reduce Architecture Starting from Large&lt;/li>
&lt;li>&lt;a href="#constructive-methods">Grow Architecture Starting from Small&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="destructive-methods">Destructive Methods&lt;/h3>
&lt;p>&lt;strong>Reduce&lt;/strong> Complexity of Network through &lt;strong>Regularization&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Weight Decay&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Weight Elimination&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;a href="#optimal-brain-damage">Optimal Brain Damage&lt;/a>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Optimal Brain Surgeon&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h4 id="optimal-brain-damage">Optimal Brain Damage&lt;/h4>
&lt;ul>
&lt;li>💡Idea: Certain connections are removed from the network to reduce complexity and to avoid overfitting&lt;/li>
&lt;li>Remove those connections that have the &lt;strong>least&lt;/strong> effect on the Error (MSE, ..), i.e. are the least important.
&lt;ul>
&lt;li>But this is time consuming (difficult) 🤪&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="constructive-methods">Constructive Methods&lt;/h3>
&lt;p>Iteratively &lt;strong>increasing/growing&lt;/strong> a network (constructive), starting from a very small one&lt;/p>
&lt;ul>
&lt;li>&lt;a href="#cascade-correlation">Cascade Correlation&lt;/a>&lt;/li>
&lt;li>Meiosis Networks&lt;/li>
&lt;li>ASO (Automatic Structure Optimization)&lt;/li>
&lt;/ul>
&lt;h4 id="cascade-correlation">Cascade Correlation&lt;/h4>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-08-17%2013.22.42.png" alt="截屏2020-08-17 13.22.42" style="zoom:50%;" />
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-08-17%2013.23.01.png" alt="截屏2020-08-17 13.23.01" style="zoom:50%;" />
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-08-17%2013.23.22.png" alt="截屏2020-08-17 13.23.22" style="zoom:50%;" />
&lt;ul>
&lt;li>
&lt;p>Adding a hidden unit&lt;/p>
&lt;ul>
&lt;li>Input connections from all input units and from all already existing hidden units&lt;/li>
&lt;li>First only these connections are adapted&lt;/li>
&lt;li>Maximize the correlation between the activation of the candidate units and the residual error of the net&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Not necessary to determine the number of hidden units empirically&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Can produce deep networks without dramatic slowdown (bottom up, constructive learning)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>At each point only one layer of connections is trained&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Learning is fast&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Learning is incremental&lt;/p>
&lt;/li>
&lt;/ul>
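&lt;p>The candidate score that Cascade Correlation maximizes can be sketched as the summed covariance magnitude between the candidate's activations and the net's residual errors. A minimal numpy version (function name and array layout are assumptions for illustration):&lt;/p>

```python
import numpy as np

def candidate_correlation(v, e):
    """Score of a candidate hidden unit (sketch).

    v: activations of the candidate over P training patterns, shape (P,)
    e: residual errors of the net, shape (P, O) for O output units.
    Returns the summed magnitude of covariance between v and each
    output's residual error, which the candidate is trained to maximize.
    """
    vc = v - v.mean()            # center activations
    ec = e - e.mean(axis=0)      # center residual errors per output
    return np.abs(vc @ ec).sum()
```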
&lt;h3 id="dropout">Dropout&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>Popular and very effective method for generalization&lt;/p>
&lt;/li>
&lt;li>
&lt;p>💡Idea&lt;/p>
&lt;ul>
&lt;li>&lt;em>Randomly&lt;/em> drop out (zero) hidden units and input features during training&lt;/li>
&lt;li>Prevents feature co-adaptation&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Illustration&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-08-17%2013.27.32.png" alt="截屏2020-08-17 13.27.32" style="zoom:67%;" />
&lt;/li>
&lt;li>
&lt;p>Dropout training &amp;amp; test&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-08-17%2013.28.32.png" alt="截屏2020-08-17 13.28.32" style="zoom:67%;" />
&lt;/li>
&lt;/ul></description></item><item><title>Dropout</title><link>https://haobin-tan.netlify.app/docs/ai/deep-learning/nn-basics/dropout/</link><pubDate>Sun, 16 Aug 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/deep-learning/nn-basics/dropout/</guid><description>&lt;h2 id="model-overfitting">Model Overfitting&lt;/h2>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-08-23%2022.00.46.png" alt="截屏2020-08-23 22.00.46" style="zoom:50%;" />
&lt;p>In order to give more &amp;ldquo;capacity&amp;rdquo; to capture different features, we give neural nets a lot of neurons. But this can cause overfitting.&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-08-23%2021.59.37.png" alt="截屏2020-08-23 21.59.37" style="zoom: 50%;" />
&lt;p>Reason: Co-adaptation&lt;/p>
&lt;ul>
&lt;li>Neurons become dependent on others&lt;/li>
&lt;li>Example: neuron $H\_i$ captures a particular feature $X$ which, however, is very frequently seen together with certain other inputs.
&lt;ul>
&lt;li>If $H\_i$ receives bad inputs (only part of that combination), there is a chance that the feature is ignored 🤪&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Solution: Dropout!&lt;/strong> &amp;#x1f4aa;&lt;/p>
&lt;h2 id="dropout">Dropout&lt;/h2>
&lt;p>With dropout the layer inputs become more sparse, forcing the network weights to become more robust.&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-100" >&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%e6%88%aa%e5%b1%8f2020-08-23%2022.06.16.png" alt="截屏2020-08-23 22.06.16" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>Dropping a neuron means that all inputs and outputs of this neuron are disabled for the current iteration.&lt;/p>
&lt;h3 id="training">Training&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>Given&lt;/p>
&lt;ul>
&lt;li>input $X \in \mathbb{R}^D$&lt;/li>
&lt;li>weights $W$&lt;/li>
&lt;li>survival rate $p$
&lt;ul>
&lt;li>Usually $p=0.5$&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Sample mask $M \in \{0, 1\}^D$ with $M\_i \sim \operatorname{Bernoulli}(p)$&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Dropped input:
&lt;/p>
$$
\hat{X} = X \circ M
$$
&lt;/li>
&lt;li>
&lt;p>Perform backward pass and mask the gradients:
&lt;/p>
$$
\frac{\delta L}{\delta X}=\frac{\delta L}{\delta \hat{X}} \circ M
$$
&lt;/li>
&lt;/ul>
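&lt;p>The training-time steps above can be sketched directly in numpy, following the formulas $\hat{X} = X \circ M$ and the masked gradient (function names are illustrative):&lt;/p>

```python
import numpy as np

def dropout_forward(x, p, rng):
    """Sample mask M_i ~ Bernoulli(p) and zero out the dropped inputs."""
    mask = rng.binomial(1, p, size=x.shape)
    return x * mask, mask

def dropout_backward(dout, mask):
    """Gradients flow only through the units that survived the forward pass."""
    return dout * mask
```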
&lt;h3 id="evaluationtestinginference">Evaluation/Testing/Inference&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>ALL input neurons $X$ are presented WITHOUT masking&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Because each neuron appears with probability $p$ in training&lt;/p>
&lt;p>$\to$ So we have to scale $X$ with $p$ at test time (or scale $\hat{X}$ with $\frac{1}{p}$ during training) to match its expectation&lt;/p>
&lt;/li>
&lt;/ul>
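&lt;p>A sketch of the &amp;ldquo;inverted dropout&amp;rdquo; variant, which scales by $\frac{1}{p}$ during training so the expectation already matches and no rescaling is needed at test time (illustrative only):&lt;/p>

```python
import numpy as np

def inverted_dropout(x, p, rng):
    """Mask inputs and scale survivors by 1/p during training,
    so that E[output] = x and the test-time pass needs no rescaling."""
    mask = rng.binomial(1, p, size=x.shape)
    return x * mask / p
```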
&lt;h2 id="why-dropout-works">Why Dropout works?&lt;/h2>
&lt;ul>
&lt;li>Intuition: Dropout prevents the network from becoming too dependent on a small number of neurons, and forces every neuron to be able to operate independently.&lt;/li>
&lt;li>Each “dropped” instance is a different network configuration&lt;/li>
&lt;li>$2^n$ different networks sharing weights&lt;/li>
&lt;li>The inference process can be understood as an &lt;strong>ensemble of $2^n$ different configurations&lt;/strong>&lt;/li>
&lt;li>This interpretation is in line with &lt;em>Bayesian Neural Networks&lt;/em>&lt;/li>
&lt;/ul>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-08-23%2022.20.36.png" alt="截屏2020-08-23 22.20.36" style="zoom: 50%;" /></description></item><item><title>👍 Data Augmentation</title><link>https://haobin-tan.netlify.app/docs/ai/deep-learning/nn-basics/data-augmentation/</link><pubDate>Sun, 16 Aug 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/deep-learning/nn-basics/data-augmentation/</guid><description>&lt;h2 id="motivation">Motivation&lt;/h2>
&lt;p>Overfitting happens because of having &lt;strong>too few examples&lt;/strong> to train on, resulting in a model that has poor generalization performance &amp;#x1f622;. If we had infinite training data, we wouldn’t overfit because we would see every possible instance.&lt;/p>
&lt;p>However, in most machine learning applications, especially in image classification tasks, obtaining new training data is not easy. Therefore we need to make do with the training set at hand. &amp;#x1f4aa;&lt;/p>
&lt;p>&lt;strong>Data augmentation is a way to generate more training data from our current set. It enriches or “augments” the training data by generating new examples via random transformation of existing ones. This way we artificially boost the size of the training set, reducing overfitting. So data augmentation can also be considered as a regularization technique.&lt;/strong>&lt;/p>
&lt;p>Data augmentation is performed dynamically during training. The generated images need to be realistic, and the transformations should produce examples the network can still learn from; simply adding noise won’t help. Common transformations are&lt;/p>
&lt;ul>
&lt;li>rotation&lt;/li>
&lt;li>shifting&lt;/li>
&lt;li>resizing&lt;/li>
&lt;li>exposure adjustment&lt;/li>
&lt;li>contrast change&lt;/li>
&lt;li>etc.&lt;/li>
&lt;/ul>
&lt;p>This way we can generate a lot of new samples from a single training example.&lt;/p>
&lt;p>&lt;strong>Notice that data augmentation is ONLY performed on the training data, we don’t touch the validation or test set.&lt;/strong>&lt;/p>
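&lt;p>A minimal numpy sketch of a few of the transformations listed above (flips and shifting); real pipelines would typically use an image library, and the specific parameters here are illustrative:&lt;/p>

```python
import numpy as np

def augment(img, rng):
    """Apply random flips and a small horizontal shift to an (H, W) image."""
    if rng.random() > 0.5:
        img = np.fliplr(img)               # horizontal flip
    if rng.random() > 0.5:
        img = np.flipud(img)               # vertical flip
    shift = int(rng.integers(-2, 3))       # shift by up to 2 pixels
    img = np.roll(img, shift, axis=1)
    return img
```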
&lt;h2 id="popular-augmentation-techniques">Popular Augmentation Techniques&lt;/h2>
&lt;h3 id="flip">Flip&lt;/h3>
&lt;figure>&lt;img src="https://nanonets.com/blog/content/images/2018/11/1_-beH1nNqlm_Wj-0PcWUKTw.jpeg"
alt="Left: original image. Middle: image flipped horizontally. Right: image flipped vertically">&lt;figcaption>
&lt;p>Left: original image. Middle: image flipped horizontally. Right: image flipped vertically&lt;/p>
&lt;/figcaption>
&lt;/figure>
&lt;h3 id="rotation">Rotation&lt;/h3>
&lt;figure>&lt;img src="https://cdn-images-1.medium.com/max/720/1*i_F6aNKj3yggkcNXQxYA4A.jpeg"
alt="Example of square images rotated at right angles. From left to right: The images are rotated by 90 degrees clockwise with respect to the previous one.">&lt;figcaption>
&lt;p>Example of square images rotated at right angles. From left to right: The images are rotated by 90 degrees clockwise with respect to the previous one.&lt;/p>
&lt;/figcaption>
&lt;/figure>
&lt;p>Note: image dimensions may not be preserved after rotation&lt;/p>
&lt;ul>
&lt;li>If image is a square, rotating it at right angles will preserve the image size.&lt;/li>
&lt;li>If image is a rectangle, rotating it by 180 degrees would preserve the size.&lt;/li>
&lt;/ul>
&lt;h3 id="scale">Scale&lt;/h3>
&lt;figure>&lt;img src="https://cdn-images-1.medium.com/max/720/1*INLTn7GWM-m69GUwFzPOaQ.jpeg"
alt="Left: original image. Middle: image scaled outward by 10%. Right: image scaled outward by 20%">&lt;figcaption>
&lt;p>Left: original image. Middle: image scaled outward by 10%. Right: image scaled outward by 20%&lt;/p>
&lt;/figcaption>
&lt;/figure>
&lt;p>The image can be scaled outward or inward. While scaling outward, the final image size will be larger than the original image size. Most image frameworks cut out a section from the new image, with size equal to the original image.&lt;/p>
&lt;h3 id="crop">Crop&lt;/h3>
&lt;figure>&lt;img src="https://cdn-images-1.medium.com/max/720/1*ypuimiaLtg_9KaQwltrxJQ.jpeg"
alt="Left: original image. Middle: a square section cropped from the top-left. Right: a square section cropped from the bottom-right. The cropped sections were resized to the original image size.">&lt;figcaption>
&lt;p>Left: original image. Middle: a square section cropped from the top-left. Right: a square section cropped from the bottom-right. The cropped sections were resized to the original image size.&lt;/p>
&lt;/figcaption>
&lt;/figure>
&lt;p>Random cropping&lt;/p>
&lt;ol>
&lt;li>Randomly sample a section from the original image&lt;/li>
&lt;li>Resize this section to the original image size&lt;/li>
&lt;/ol>
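&lt;p>The two steps of random cropping can be sketched in numpy with a nearest-neighbor resize back to the original size (function name and interpolation choice are assumptions for illustration):&lt;/p>

```python
import numpy as np

def random_crop(img, crop_h, crop_w, rng):
    """1. Sample a random crop_h x crop_w section of img.
    2. Resize it back to the original size (nearest neighbor)."""
    h, w = img.shape[:2]
    top = int(rng.integers(0, h - crop_h + 1))
    left = int(rng.integers(0, w - crop_w + 1))
    crop = img[top:top + crop_h, left:left + crop_w]
    rows = np.arange(h) * crop_h // h      # nearest-neighbor row indices
    cols = np.arange(w) * crop_w // w
    return crop[np.ix_(rows, cols)]
```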
&lt;h3 id="translation">Translation&lt;/h3>
&lt;figure>&lt;img src="https://cdn-images-1.medium.com/max/720/1*L07HTRw7zuHGT4oYEMlDig.jpeg"
alt="Left: original image. Middle: the image translated to the right. Right: the image translated upwards.">&lt;figcaption>
&lt;p>Left: original image. Middle: the image translated to the right. Right: the image translated upwards.&lt;/p>
&lt;/figcaption>
&lt;/figure>
&lt;p>&lt;strong>Translation = moving the image along the X or Y direction (or both)&lt;/strong>&lt;/p>
&lt;p>This method of augmentation is very useful because most objects can be located almost anywhere in the image. This forces your convolutional neural network to look everywhere.&lt;/p>
&lt;h3 id="gaussian-noise">Gaussian Noise&lt;/h3>
&lt;figure>&lt;img src="https://cdn-images-1.medium.com/max/720/1*cx24OpSNOwgg7ULUHKiGnA.png"
alt="Left: original image. Middle: image with added Gaussian noise. Right: image with added salt and pepper noise.">&lt;figcaption>
&lt;p>Left: original image. Middle: image with added Gaussian noise. Right: image with added salt and pepper noise.&lt;/p>
&lt;/figcaption>
&lt;/figure>
&lt;p>One reason for overfitting is that the neural network tries to learn high-frequency features (patterns that occur a lot) that may not be useful.&lt;/p>
&lt;p>&lt;strong>Gaussian noise&lt;/strong>, which has zero mean, essentially has data points in all frequencies, effectively distorting the high frequency features. This also means that lower frequency components (usually, your intended data) are also distorted, but your neural network can learn to look past that. Adding just the right amount of noise can enhance the learning capability.&lt;/p>
&lt;p>A toned down version of this is the &lt;strong>salt and pepper noise&lt;/strong>, which presents itself as random black and white pixels spread through the image. This is similar to the effect produced by adding Gaussian noise to an image, but may have a lower information distortion level.&lt;/p>
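&lt;p>Both kinds of noise can be sketched in a few lines of numpy, assuming images with pixel values in $[0, 1]$ (parameter values are illustrative):&lt;/p>

```python
import numpy as np

def add_gaussian_noise(img, sigma, rng):
    """Zero-mean Gaussian noise, clipped back to the valid [0, 1] range."""
    return np.clip(img + rng.normal(0.0, sigma, img.shape), 0.0, 1.0)

def add_salt_pepper(img, amount, rng):
    """Set a random fraction (amount) of pixels to pure white or black."""
    noise = rng.random(img.shape)
    out = img.copy()
    out[noise > 1.0 - amount / 2.0] = 1.0          # salt
    out[(1.0 - noise) > 1.0 - amount / 2.0] = 0.0  # pepper
    return out
```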
&lt;h2 id="reference">Reference&lt;/h2>
&lt;ul>
&lt;li>&lt;a href="https://towardsdatascience.com/applied-deep-learning-part-4-convolutional-neural-networks-584bc134c1e2#9722">Applied Deep Learning - Part 4: Convolutional Neural Networks&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://nanonets.com/blog/data-augmentation-how-to-use-deep-learning-when-you-have-limited-data-part-2/">Data Augmentation | How to use Deep Learning when you have Limited Data — Part 2&lt;/a>&lt;/li>
&lt;/ul></description></item></channel></rss>