<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Efficient Training | Haobin Tan</title><link>https://haobin-tan.netlify.app/tags/efficient-training/</link><atom:link href="https://haobin-tan.netlify.app/tags/efficient-training/index.xml" rel="self" type="application/rss+xml"/><description>Efficient Training</description><generator>Hugo Blox Builder (https://hugoblox.com)</generator><language>en-us</language><lastBuildDate>Sun, 16 Aug 2020 00:00:00 +0000</lastBuildDate><image><url>https://haobin-tan.netlify.app/media/icon_hu7d15bc7db65c8eaf7a4f66f5447d0b42_15095_512x512_fill_lanczos_center_3.png</url><title>Efficient Training</title><link>https://haobin-tan.netlify.app/tags/efficient-training/</link></image><item><title>Efficient Training</title><link>https://haobin-tan.netlify.app/docs/ai/deep-learning/efficient-training/</link><pubDate>Sun, 16 Aug 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/deep-learning/efficient-training/</guid><description/></item><item><title>👍 Optimization Algorithms</title><link>https://haobin-tan.netlify.app/docs/ai/deep-learning/efficient-training/optimization-algo/</link><pubDate>Fri, 31 Jul 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/deep-learning/efficient-training/optimization-algo/</guid><description> &lt;!-- TODO: Add links to upload pdfs when staticref shortcode is available (Check the old blog) -->
&lt;h2 id="resource">Resource&lt;/h2>
&lt;ul>
&lt;li>&lt;a href="http://colah.github.io/posts/2015-08-Backprop/">Calculus on Computational Graphs: Backpropagation&lt;/a>&lt;/li>
&lt;/ul></description></item><item><title>👍 Batch Normalization</title><link>https://haobin-tan.netlify.app/docs/ai/deep-learning/efficient-training/batch-normalization/</link><pubDate>Sun, 16 Aug 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/deep-learning/efficient-training/batch-normalization/</guid><description>&lt;h2 id="tldr">TL;DR&lt;/h2>
&lt;ul>
&lt;li>
&lt;p>Problem: During training, updating a lower layer changes the input distribution for the next layer → next layer constantly needs to adapt to changing inputs&lt;/p>
&lt;/li>
&lt;li>
&lt;p>💡Idea: mean/variance normalization step between layers&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Faster / more effective training&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Gradients less dependent on scale of parameters&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Allow higher learning rates&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Combat saturation problem&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>How:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Mean over the batch dimension
&lt;/p>
$$
\mu=\frac{1}{M} \sum\_{i=1}^{M} X\_{i,:}
$$
&lt;/li>
&lt;li>
&lt;p>Variance of the mini-batch
&lt;/p>
$$
\sigma^{2}=\frac{1}{M} \sum\_{i=1}^{M}\left(X\_{i}-\mu\right)^{2}
$$
&lt;/li>
&lt;li>
&lt;p>Normalization
&lt;/p>
$$
\hat{X}=\frac{X-\mu}{\sqrt{\sigma^{2}+\epsilon}}
$$
&lt;/li>
&lt;li>
&lt;p>Scale and shift ($\gamma$ and $\beta$ are network parameters)
&lt;/p>
$$
X^{N}=\gamma \circ \hat{X}+\beta
$$
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h2 id="motivation-feature-scaling">Motivation: Feature scaling&lt;/h2>
&lt;p>Make different features have the same scaling (normalizing the data)&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-05-20%2015.33.11.png" alt="截屏2020-05-20 15.33.11" style="zoom:67%;" />
&lt;ul>
&lt;li>$x_i^r$: the $i$-th feature of the $r$-th input sample/instance&lt;/li>
&lt;/ul>
&lt;p>In general, gradient descent converges &lt;strong>much faster&lt;/strong> with feature scaling than without it.&lt;/p>
&lt;p>Illustration:&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-05-20%2015.35.33.png" alt="截屏2020-05-20 15.35.33" style="zoom: 33%;" />
&lt;h3 id="in-hidden-layer">In hidden layer&lt;/h3>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2020-05-20%2015.38.08.png" alt="截屏2020-05-20 15.38.08">&lt;/p>
&lt;p>From the point of view of Layer 2, its input is $a^1$, the output of Layer 1. As feature scaling helps a lot in training (gradient descent converges much faster), can we also apply feature scaling to $a^1$ and the other hidden layers&amp;rsquo; outputs (such as $a^2$)?&lt;/p>
&lt;h2 id="internal-covariate-shift">Internal Covariate Shift&lt;/h2>
&lt;p>In &lt;a href="https://arxiv.org/abs/1502.03167">Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift&lt;/a>, the author&amp;rsquo;s definition is:&lt;/p>
&lt;blockquote>
&lt;p>We define Internal Covariate Shift as the change in the distribution of network activations due to the change in network parameters during training.&lt;/p>
&lt;/blockquote>
&lt;p>In neural networks, the output of the first layer feeds into the second layer, the output of the second layer feeds into the third, and so on. When the parameters of a layer change, so does the distribution of inputs to subsequent layers.&lt;/p>
&lt;p>These shifts in input distributions can be problematic for neural networks, especially deep neural networks that could have a large number of layers.&lt;/p>
&lt;p>A common solution is to use a small learning rate, but training is then slower. 😢&lt;/p>
&lt;h2 id="batch-nomalization-bn">Batch Nomalization (BN)&lt;/h2>
&lt;p>💪 &lt;strong>Aim: solve internal covariate shift&lt;/strong>&lt;/p>
&lt;h3 id="batch">Batch&lt;/h3>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-05-20%2015.55.17.png" alt="截屏2020-05-20 15.55.17" style="zoom: 40%;" />
&lt;h3 id="batch-normalization">Batch normalization&lt;/h3>
&lt;p>Usually we apply BN to the input of the activation function (i.e., &lt;strong>before&lt;/strong> the activation function)&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/dropin.jpg" alt="How does Batch Normalization Help Optimization? – gradient science" style="zoom: 15%;" />
&lt;p>Take the first hidden layer as example:&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-05-20%2016.14.48.png" alt="截屏2020-05-20 16.14.48" style="zoom: 33%;" />
&lt;ul>
&lt;li>Compute mean&lt;/li>
&lt;/ul>
$$
\mu = \frac{1}{N}\sum_{i=1}^{N}z^i
$$
&lt;ul>
&lt;li>Compute standard deviation&lt;/li>
&lt;/ul>
$$
\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(z^i-\mu)^2}
$$
&lt;ul>
&lt;li>
&lt;p>Normalize $z^i$ (Division is &lt;strong>element-wise&lt;/strong>)
&lt;/p>
$$
\tilde{z}^i = \frac{z^i - \mu}{\sigma}
$$
&lt;ul>
&lt;li>
&lt;p>Now $\tilde{z}^i$ has zero mean and unit variance.&lt;/p>
&lt;ul>
&lt;li>Good for activation functions which could saturate (such as sigmoid, tanh, etc.)
&lt;table>
&lt;thead>
&lt;tr>
&lt;th style="text-align:center">Sigmoid witout BN&lt;/th>
&lt;th style="text-align:center">Sigmoid with BN&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td style="text-align:center">&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/0*0CJqMLXgnZo1VqhS.jpeg" alt="img" />&lt;/td>
&lt;td style="text-align:center">&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/0*tPSfbtV7ILH0IN-I.jpeg" alt="img" style="zoom:55%;" />&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Scale and shift
&lt;/p>
$$
\hat{z}^{i}=\gamma \odot \tilde{z}^{i}+\beta
$$
&lt;ul>
&lt;li>In practice, restricting the activations of each layer to be strictly zero mean and unit variance can limit the expressive power of the network.
&lt;ul>
&lt;li>E.g., some activation functions don&amp;rsquo;t require the input to be zero mean and unit variance&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Scaling and shifting allow the network to learn input-independent parameters $\gamma$ and $\beta$ that can set the mean and variance to any values the network desires&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
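&lt;p>The steps above can be sketched as a training-time forward pass in NumPy (a minimal illustration, not a full implementation: the batch values, $\gamma = 1$, $\beta = 0$, and $\epsilon = 10^{-5}$ are assumptions chosen for the demo; $\epsilon$ is the small constant from the TL;DR that keeps the division numerically stable).&lt;/p>

```python
import numpy as np

def batch_norm_forward(Z, gamma, beta, eps=1e-5):
    """Training-time batch normalization for a mini-batch Z of shape (N, D)."""
    mu = Z.mean(axis=0)                         # mean over the batch dimension
    sigma2 = Z.var(axis=0)                      # variance of the mini-batch
    Z_tilde = (Z - mu) / np.sqrt(sigma2 + eps)  # normalize (element-wise)
    return gamma * Z_tilde + beta               # scale and shift

rng = np.random.default_rng(0)
Z = rng.normal(loc=5.0, scale=3.0, size=(32, 4))  # a batch far from zero mean
out = batch_norm_forward(Z, gamma=np.ones(4), beta=np.zeros(4))
# With gamma=1 and beta=0 the output has (approximately) zero mean and
# unit variance per feature.
```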
&lt;p>&lt;strong>Note&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Ideally, $\mu$ and $\sigma$ should be computed using the whole training dataset&lt;/p>
&lt;/li>
&lt;li>
&lt;p>But this is expensive and infeasible&lt;/p>
&lt;ul>
&lt;li>The size of training dataset is &lt;strong>enormous&lt;/strong>&lt;/li>
&lt;li>When $W^1$ gets updated, the output of the hidden layer will change, we have to compute $\mu$ and $\sigma$ again&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>In practice, we can apply BN on batch of data, instead of the whole training dataset&lt;/p>
&lt;ul>
&lt;li>
&lt;p>But the batch size cannot be too small&lt;/p>
&lt;/li>
&lt;li>
&lt;p>If we apply BN on a small batch, it is difficult to estimate the mean ($\mu$) and the standard deviation ($\sigma$) of the WHOLE training dataset&lt;/p>
&lt;p>$\rightarrow$ The performance of BN will be bad!&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="bn-in-testing">BN in Testing&lt;/h3>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-05-20%2016.31.25.png" alt="截屏2020-05-20 16.31.25" style="zoom:80%;" />
&lt;p>Problem: We do NOT have &lt;strong>batch&lt;/strong> at testing stage. How can we estimate $\mu$ and $\sigma$?&lt;/p>
&lt;p>Ideal solution: Compute $\mu$ and $\sigma$ using the whole training set&lt;/p>
&lt;ul>
&lt;li>But it is difficult in practice
&lt;ul>
&lt;li>Training set too large&lt;/li>
&lt;li>Training could be online training&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Practical solution: Compute the moving average of $\mu$ and $\sigma$ of the batches during training&lt;/strong>&lt;/p>
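&lt;p>This moving average can be sketched as follows (a minimal illustration: the momentum value 0.9 and the synthetic batches are assumptions for the demo, not prescriptions).&lt;/p>

```python
import numpy as np

def update_running_stats(run_mu, run_var, batch_mu, batch_var, momentum=0.9):
    # Exponential moving average of the batch statistics, updated once per batch.
    run_mu = momentum * run_mu + (1.0 - momentum) * batch_mu
    run_var = momentum * run_var + (1.0 - momentum) * batch_var
    return run_mu, run_var

# Training: accumulate statistics batch by batch.
run_mu, run_var = np.zeros(4), np.ones(4)
for i in range(100):
    batch = np.random.default_rng(i).normal(loc=2.0, scale=1.5, size=(32, 4))
    run_mu, run_var = update_running_stats(
        run_mu, run_var, batch.mean(axis=0), batch.var(axis=0))

# Testing: normalize a single sample with the accumulated statistics --
# no batch is needed at this stage.
x = np.array([2.0, 2.0, 2.0, 2.0])
x_hat = (x - run_mu) / np.sqrt(run_var + 1e-5)
```

Deep learning frameworks keep exactly this kind of bookkeeping internally; for example, PyTorch&amp;rsquo;s BatchNorm layers expose it as the <code>running_mean</code> and <code>running_var</code> buffers.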
&lt;h3 id="-benefit-of-bn">👍 Benefit of BN&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>Reduces training time, and makes very deep nets trainable&lt;/p>
&lt;ul>
&lt;li>Less covariate shift, so we can use larger learning rates&lt;/li>
&lt;li>Less exploding/vanishing gradients
&lt;ul>
&lt;li>Especially effective for sigmoid, tanh, etc.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Learning is less affected by parameter initialization&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-05-20%2016.49.22.png" alt="截屏2020-05-20 16.49.22" style="zoom:40%;" />
&lt;/li>
&lt;li>
&lt;p>Reduces the demand for regularization and helps prevent overfitting&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h2 id="layer-normalization">Layer Normalization&lt;/h2>
&lt;p>Batch Normalization works well, but is not always applicable&lt;/p>
&lt;ul>
&lt;li>
&lt;p>When a large mini-batch is not feasible&lt;/p>
&lt;/li>
&lt;li>
&lt;p>It is difficult to apply to Recurrent Neural Networks&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>Alternative: &lt;strong>Layer Normalization&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Perform
&lt;/p>
$$
\begin{aligned}
\mu &amp;= \frac{1}{M} \sum\_{i=1}^{M} X\_{i,:} \\\\
\sigma^{2} &amp;= \frac{1}{M} \sum\_{i=1}^{M}\left(X\_{i}-\mu\right)^{2}
\end{aligned}
$$
&lt;p>
over the &lt;em>feature dimension&lt;/em>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Not as effective as Batch Normalization, but still widely used because of its better efficiency.&lt;/p>
&lt;/li>
&lt;/ul>
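&lt;p>The difference between the two techniques is only the axis the statistics are computed over, which a short NumPy sketch makes concrete (shapes and values are illustrative).&lt;/p>

```python
import numpy as np

X = np.random.default_rng(1).normal(size=(8, 16))  # (batch, features)

# Batch Normalization: statistics over the batch axis (axis=0),
# i.e. one mean/variance per feature. Needs a reasonably large batch.
bn = (X - X.mean(axis=0)) / np.sqrt(X.var(axis=0) + 1e-5)

# Layer Normalization: statistics over the feature axis (axis=1),
# i.e. one mean/variance per sample. Independent of the batch size,
# which is why it also works for RNNs and small batches.
ln = (X - X.mean(axis=1, keepdims=True)) / np.sqrt(X.var(axis=1, keepdims=True) + 1e-5)
```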
&lt;h2 id="normalization-techniques-comparison">Normalization Techniques Comparison&lt;/h2>
&lt;figure>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/normalization.png"
alt="Normalization techniques. (Source: In-layer normalization techniques for training very deep neural networks)">&lt;figcaption>
&lt;p>Normalization techniques. (Source: &lt;a href="https://theaisummer.com/normalization/">In-layer normalization techniques for training very deep neural networks&lt;/a>)&lt;/p>
&lt;/figcaption>
&lt;/figure>
&lt;h2 id="reference">Reference&lt;/h2>
&lt;ul>
&lt;li>
&lt;p>&lt;a href="https://arxiv.org/abs/1502.03167">Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift&lt;/a>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>👍 &lt;a href="https://www.youtube.com/watch?v=BZh1ltr5Rkg">Batch Normalization&lt;/a>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>👍 &lt;a href="https://www.youtube.com/watch?v=-5hESl-Lj-4">Why need batch normalization?&lt;/a>&lt;/p>
&lt;/li>
&lt;/ul></description></item></channel></rss>