<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Transformer | Haobin Tan</title><link>https://haobin-tan.netlify.app/tags/transformer/</link><atom:link href="https://haobin-tan.netlify.app/tags/transformer/index.xml" rel="self" type="application/rss+xml"/><description>Transformer</description><generator>Hugo Blox Builder (https://hugoblox.com)</generator><language>en-us</language><lastBuildDate>Tue, 13 Apr 2021 00:00:00 +0000</lastBuildDate><image><url>https://haobin-tan.netlify.app/media/icon_hu7d15bc7db65c8eaf7a4f66f5447d0b42_15095_512x512_fill_lanczos_center_3.png</url><title>Transformer</title><link>https://haobin-tan.netlify.app/tags/transformer/</link></image><item><title>Visual Transformer</title><link>https://haobin-tan.netlify.app/docs/ai/computer-vision/cv-transformer/</link><pubDate>Sun, 11 Apr 2021 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/computer-vision/cv-transformer/</guid><description/></item><item><title>Attention Mechanism</title><link>https://haobin-tan.netlify.app/docs/ai/computer-vision/cv-transformer/attention-recap/</link><pubDate>Sun, 11 Apr 2021 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/computer-vision/cv-transformer/attention-recap/</guid><description>&lt;h2 id="human-attention">Human Attention&lt;/h2>
&lt;p>The visual attention mechanism is a signal-processing mechanism of the brain that is specific to human vision. By quickly scanning the global image, human vision identifies a target area to focus on, generally referred to as the focus of attention, and then devotes more attentional resources to this area to obtain more detailed information about the target while suppressing other, less useful information.&lt;/p>
&lt;figure>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/20171210213743273.jpeg"
alt="This figure demonstrates how humans efficiently allocate their limited attentional resources when presented with an image. Red areas indicate the targets to which the visual system is more attentive. It is clear that people devote more attention to the face, the title of the text, and the first sentence of the article.">&lt;figcaption>
&lt;p>This figure demonstrates how humans efficiently allocate their limited attentional resources when presented with an image. Red areas indicate the targets to which the visual system is more attentive. It is clear that people devote more attention to the face, the title of the text, and the first sentence of the article.&lt;/p>
&lt;/figcaption>
&lt;/figure>
&lt;h2 id="understanding-attention-mechanism">Understanding Attention Mechanism&lt;/h2>
&lt;p>The attention mechanism in deep learning is essentially similar to the human selective visual attention mechanism. Its main goal is also to &lt;strong>select the information that is more critical to the current task goal from a large amount of information.&lt;/strong>&lt;/p>
&lt;p>The general process of the attention mechanism can be modelled as follows:&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-04-11%2011.07.26.png" alt="General process of the attention mechanism">&lt;/p>
&lt;ul>
&lt;li>The source contains a number of &lt;code>&amp;lt;key, value&amp;gt;&lt;/code> pairs.&lt;/li>
&lt;li>Given a &lt;code>query&lt;/code> of target, the weight coefficient of each &lt;code>value&lt;/code> is obtained by calculating the similarity/relevance between its corresponding &lt;code>key&lt;/code> and the &lt;code>query&lt;/code>.&lt;/li>
&lt;li>The &lt;code>value&lt;/code>s are then weighted and summed up, which gives the final attention value.&lt;/li>
&lt;/ul>
&lt;p>So the attention mechanism can be formulated as follows (assuming the length of the source is $N$):
&lt;/p>
$$
\operatorname{Attention}(\text{Query}, \text{Source}) = \sum\_{i=1}^{N} \operatorname{Similarity}(\text{Query}, \text{Key}\_i) \cdot \text{Value}\_i
$$
&lt;h3 id="understanding-attention-mechanism-as-soft-adressing">Understanding Attention Mechanism as &amp;ldquo;Soft Adressing&amp;rdquo;&lt;/h3>
&lt;p>We can also understand attention mechanism as &amp;ldquo;&lt;strong>soft adressing&lt;/strong>&amp;rdquo;&lt;/p>
&lt;ul>
&lt;li>A number of key(&lt;em>address&lt;/em>)-value(&lt;em>content&lt;/em>) pairs are stored in source(&lt;em>memory&lt;/em>)&lt;/li>
&lt;li>Given a query, soft addressing is performed by comparing the similarity/relevance between query and keys.
&lt;ul>
&lt;li>With general (hard) addressing, only ONE value is retrieved from memory for a given query.&lt;/li>
&lt;li>In soft addressing, values may be taken from many addresses. The importance of each value is determined by the similarity/relevance between its address and the given query: the higher the relevance, the more important the value.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>All retrieved values are then weighted and summed up to obtain the final attention value.&lt;/li>
&lt;/ul>
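&lt;p>As a minimal NumPy sketch of this contrast (the keys, values, and query below are made-up illustrative data):&lt;/p>

```python
import numpy as np

# Illustrative memory of key-value pairs (made-up data).
keys = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])   # addresses
values = np.array([[10.0], [20.0], [30.0]])              # contents
query = np.array([1.0, 0.2])

scores = keys @ query            # similarity of the query to each address

# Hard (general) addressing: retrieve exactly ONE value.
hard_value = values[np.argmax(scores)]

# Soft addressing: every value contributes, weighted by relevance.
weights = np.exp(scores) / np.exp(scores).sum()
soft_value = (weights[:, None] * values).sum(axis=0)
```

&lt;p>With hard addressing only the best-matching content is returned; with soft addressing the result is a relevance-weighted blend of all stored values.&lt;/p>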
&lt;p>Example:&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-04-11%2012.59.27.png" alt="Soft addressing example (part 1)">&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-04-11%2012.59.45.png" alt="Soft addressing example (part 2)">&lt;/p>
&lt;h2 id="computation-of-attention">Computation of Attention&lt;/h2>
&lt;p>The computation of attention can be described as follows:&lt;/p>
&lt;figure>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/attention-Computation.png"
alt="Computation of attention">&lt;figcaption>
&lt;p>Computation of attention&lt;/p>
&lt;/figcaption>
&lt;/figure>
&lt;ol>
&lt;li>
&lt;p>Compute similarity/relevance score between query $Q$ and each key $K\_i$ using one of the following methods&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Dot product
&lt;/p>
$$
s\_i = \operatorname{Similarity}(Q, K\_i) = Q \cdot K\_i
$$
&lt;/li>
&lt;li>
&lt;p>Cosine similarity
&lt;/p>
$$
s\_i = \operatorname{Similarity}(Q, K\_i) = \frac{Q \cdot K\_i}{\\|Q\\| \cdot \\|K\_i\\|}
$$
&lt;/li>
&lt;li>
&lt;p>MLP
&lt;/p>
$$
s\_i = \operatorname{Similarity}(Q, K\_i) = \operatorname{MLP}(Q, K\_i)
$$
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Apply $\operatorname{softmax}$ to obtain the weight of each value $V\_i$
&lt;/p>
$$
a\_i = \frac{e^{s\_i}}{\sum\_{j=1}^{N} e^{s\_j}} \in [0, 1]
$$
$$
\sum\_{i} a\_i = 1
$$
&lt;/li>
&lt;li>
&lt;p>Weighted sum
&lt;/p>
$$
\operatorname{Attention}(Q, \text{Source}) = \sum\_{i=1}^{N} a\_i V\_i
$$
&lt;/li>
&lt;/ol>
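&lt;p>The three steps above can be sketched as follows (a minimal NumPy version using the dot-product similarity; the toy data is illustrative):&lt;/p>

```python
import numpy as np

def attention(query, keys, values):
    """Dot-product attention following the three steps above."""
    # 1. Similarity score between the query and each key.
    scores = keys @ query
    # 2. Softmax turns the scores into weights in [0, 1] that sum to 1.
    weights = np.exp(scores - scores.max())
    weights = weights / weights.sum()
    # 3. Weighted sum of the values.
    return weights @ values

# Toy example: the query matches the first key most strongly.
keys = np.eye(3)
values = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
out = attention(np.array([5.0, 0.0, 0.0]), keys, values)
```

&lt;p>Since the query is most similar to the first key, the output is close to the first value vector.&lt;/p>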
&lt;h2 id="self-attention">Self-Attention&lt;/h2>
&lt;p>In the general encoder-decoder architecture, source (input) and target (output) are &lt;em>different&lt;/em>. For example, in a French-English translation task, the source is the input French sentence and the target is the output translated English sentence. In this case, the query comes from the target, and the attention mechanism is applied between the query and the elements in the source.&lt;/p>
&lt;figure>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/bahdanau-fig3.png"
alt="Alignment matrix of &amp;amp;lsquo;L’accord sur l’Espace économique européen a été signé en août 1992&amp;amp;rsquo; (French) and its English translation &amp;amp;lsquo;The agreement on the European Economic Area was signed in August 1992&amp;amp;rsquo;. In this case, the source is the French sentence and the target is the English sentence. (Source: Bahdanau et al., 2015)">&lt;figcaption>
&lt;p>Alignment matrix of &amp;lsquo;&lt;em>L’accord sur l’Espace économique européen a été signé en août 1992&lt;/em>&amp;rsquo; (French) and its English translation &amp;lsquo;&lt;em>The agreement on the European Economic Area was signed in August 1992&lt;/em>&amp;rsquo;. In this case, the source is the French sentence and the target is the English sentence. (Source: &lt;a href="https://arxiv.org/pdf/1409.0473.pdf">Bahdanau et al., 2015&lt;/a>)&lt;/p>
&lt;/figcaption>
&lt;/figure>
&lt;p>A special case is that source and target are the SAME. This is called &lt;strong>self-attention&lt;/strong>, also known as &lt;strong>intra-attention&lt;/strong>. It is an attention mechanism relating different positions of a &lt;em>single&lt;/em> sequence in order to compute a representation of the &lt;em>same&lt;/em> sequence. In other words, $Q=K=V$.&lt;/p>
&lt;p>Self-attention answers the question: &amp;ldquo;Looking at a word in a sentence, how much attention should be paid to each of the other words in this sentence?&amp;rdquo;&lt;/p>
&lt;p>Example: In &lt;a href="https://arxiv.org/pdf/1601.06733.pdf">this paper&lt;/a>, self-attention is applied to machine reading. The self-attention mechanism enables the model to learn the correlation between the current word and the previous parts of the sentence.&lt;/p>
&lt;figure>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/cheng2016-fig1.png"
alt="The current word is in red and the size of the blue shade indicates the activation level. (Source: Cheng et al., 2016)">&lt;figcaption>
&lt;p>The current word is in red and the size of the blue shade indicates the activation level. (Source: &lt;a href="https://arxiv.org/pdf/1601.06733.pdf">Cheng et al., 2016&lt;/a>)&lt;/p>
&lt;/figcaption>
&lt;/figure>
&lt;h3 id="self-attention-in-detail">Self-Attention in Detail&lt;/h3>
&lt;p>Assume that we have four input vectors $\boldsymbol{a_{1}}, \ldots, \boldsymbol{a_{4}}$. As an example, we take the first output, $b_{1}$, to see how self-attention works under the hood.&lt;/p>
&lt;figure>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-04-13%2021.52.15.png"
alt="Self-attention computation">&lt;figcaption>
&lt;p>Self-attention computation&lt;/p>
&lt;/figcaption>
&lt;/figure>
&lt;ol>
&lt;li>
&lt;p>For every input vector $\boldsymbol{a}\_i$, we apply three &lt;strong>learnable&lt;/strong> linear transformations (i.e., three matrices $W^q, W^k, W^v$) to obtain the query, key, and value vectors
&lt;/p>
$$
\begin{array}{l}
\forall a\_i: &amp; \boldsymbol{q\_i} = W^q \boldsymbol{a\_i} \quad \text{(Query)}\\\\
&amp; \boldsymbol{k\_i} = W^k \boldsymbol{a\_i} \quad \text{(Key)}\\\\
&amp; \boldsymbol{v\_i} = W^v \boldsymbol{a\_i} \quad \text{(Value)}
\end{array}
$$
&lt;/li>
&lt;li>
&lt;p>Looking at the first input vector $\boldsymbol{a_{1}}$, we compute attention scores $\alpha\_{1, i}$ as the dot product of $\boldsymbol{q_{1}}$ and $\boldsymbol{k_{i}}$.
&lt;/p>
$$
\alpha\_{1, i} = \boldsymbol{q_1} \cdot \boldsymbol{k_i}
$$
&lt;p>
$\alpha\_{1, i}$ can also be considered as the similarity/relevance between $\boldsymbol{q_1}$ and $\boldsymbol{k_i}$&lt;/p>
&lt;/li>
&lt;li>
&lt;p>We normalize the attention scores by applying the softmax function.
&lt;/p>
$$
\alpha\_{1, i}^\prime = \frac{\exp(\alpha\_{1, i})}{\sum\_j \exp(\alpha\_{1, j})}
$$
&lt;p>
$\alpha\_{1, i}^\prime$ is the weight of attention that should be paid to the vector $\boldsymbol{a}\_i$ when looking at the vector $\boldsymbol{a}\_1$&lt;/p>
&lt;/li>
&lt;li>
&lt;p>We extract information based on the normalized attention scores by performing a weighted sum of $\boldsymbol{v\_i}$
&lt;/p>
$$
\boldsymbol{b\_1} = \sum\_{i} \alpha\_{1, i}^\prime \boldsymbol{v}\_i
$$
&lt;/li>
&lt;/ol>
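&lt;p>The four steps above can be sketched in NumPy like this (the projection matrices are random placeholders standing in for learned weights):&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
a = rng.normal(size=(4, d))    # four input vectors a_1 ... a_4 (rows)

# Step 1: three learnable projections (random placeholders here).
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
q = a @ Wq.T    # queries
k = a @ Wk.T    # keys
v = a @ Wv.T    # values

# Step 2: attention scores of a_1 against every key (dot products).
alpha = np.array([q[0] @ k[i] for i in range(4)])

# Step 3: softmax normalization.
alpha_exp = np.exp(alpha - alpha.max())
alpha_prime = alpha_exp / alpha_exp.sum()

# Step 4: weighted sum of the value vectors gives b_1.
b1 = sum(alpha_prime[i] * v[i] for i in range(4))
```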
&lt;h3 id="matrix-representation">Matrix Representation&lt;/h3>
&lt;p>The illustration above shows the details of self-attention computation. In practice, we use matrix multiplication to make the computation more efficient.&lt;/p>
&lt;ol>
&lt;li>
&lt;p>Get query, key, value vectors&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-04-13%2022.22.53.png" alt="Computing query, key, and value vectors in matrix form">&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Compute attention scores&lt;/p>
&lt;ul>
&lt;li>For $\alpha_{1, i}$:&lt;/li>
&lt;/ul>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-04-13%2022.32.37.png" alt="Computing the attention scores for the first query">&lt;/p>
&lt;ul>
&lt;li>
&lt;p>For all attention scores:&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-04-13%2022.38.18.png" alt="Computing all attention scores in matrix form">&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Apply softmax:&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-04-13%2022.41.27.png" alt="Applying softmax to the attention score matrix">&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Compute weighted sum of $\boldsymbol{v}\_i$&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-04-13%2022.45.37.png" alt="Weighted sum of the value vectors in matrix form">&lt;/p>
&lt;/li>
&lt;/ol>
&lt;p>In summary:&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-04-13%2023.41.39.png" alt="Summary of self-attention in matrix form" style="zoom:67%;" />
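&lt;p>The four matrix steps can be written compactly as a single function (a NumPy sketch with randomly initialized weight matrices standing in for learned parameters):&lt;/p>

```python
import numpy as np

def self_attention(A, Wq, Wk, Wv):
    """Matrix form: compute all outputs b_1 ... b_n at once."""
    Q, K, V = A @ Wq.T, A @ Wk.T, A @ Wv.T                  # step 1
    scores = Q @ K.T                                        # step 2: all pairwise scores
    scores = scores - scores.max(axis=1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)  # step 3: row-wise softmax
    return weights @ V                                      # step 4: weighted sums

rng = np.random.default_rng(1)
A = rng.normal(size=(4, 4))                                 # one input vector per row
B = self_attention(A, *(rng.normal(size=(4, 4)) for _ in range(3)))
```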
&lt;h3 id="multi-head-self-attention">Multi-head Self-Attention&lt;/h3>
&lt;p>We take 2 heads as example:&lt;/p>
&lt;figure>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-04-14%2011.52.24.png"
alt="Multi-head Self-attention (2 heads as example)">&lt;figcaption>
&lt;p>Multi-head Self-attention (2 heads as example)&lt;/p>
&lt;/figcaption>
&lt;/figure>
&lt;p>Different heads represent different types of relevance. Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions.&lt;/p>
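&lt;p>A sketch with 2 heads: each head attends within its own slice of the feature dimension, and the concatenated head outputs are mixed by an output projection $W^o$ (all weight matrices below are random placeholders for learned parameters):&lt;/p>

```python
import numpy as np

def multi_head_self_attention(A, Wq, Wk, Wv, Wo, num_heads=2):
    """Split the feature dimension into num_heads subspaces, attend per head, concatenate."""
    n, d = A.shape
    dh = d // num_heads
    Q, K, V = A @ Wq.T, A @ Wk.T, A @ Wv.T
    heads = []
    for h in range(num_heads):
        s = slice(h * dh, (h + 1) * dh)          # this head's subspace
        scores = Q[:, s] @ K[:, s].T
        scores = scores - scores.max(axis=1, keepdims=True)
        w = np.exp(scores)
        w = w / w.sum(axis=1, keepdims=True)     # row-wise softmax per head
        heads.append(w @ V[:, s])
    return np.concatenate(heads, axis=1) @ Wo.T  # mix heads via output projection

rng = np.random.default_rng(2)
A = rng.normal(size=(4, 8))
mats = [rng.normal(size=(8, 8)) for _ in range(4)]
out = multi_head_self_attention(A, *mats)
```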
&lt;h3 id="self-attention-for-image">Self-Attention for Image&lt;/h3>
&lt;p>The required input of self-attention is a vector set. An image can also be considered as a vector set.&lt;/p>
&lt;figure>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-04-13%2023.28.50.png"
alt="This image can be considered as a vector set containing $5 \times 10$ vectors of shape $(1 \times 1 \times 3).$">&lt;figcaption>
&lt;p>This image can be considered as a vector set containing $5 \times 10$ vectors of shape $(1 \times 1 \times 3).$&lt;/p>
&lt;/figcaption>
&lt;/figure>
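&lt;p>Turning an image into such a vector set is a single reshape (NumPy sketch with a made-up 5 × 10 RGB image):&lt;/p>

```python
import numpy as np

# A made-up 5 x 10 RGB image: height 5, width 10, 3 channels.
image = np.arange(5 * 10 * 3, dtype=float).reshape(5, 10, 3)

# Flatten the spatial grid: each pixel becomes one 3-d vector,
# so the image is a set of 50 vectors that self-attention can consume.
vector_set = image.reshape(-1, 3)
```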
&lt;h3 id="self-attention-vs-cnn">Self-Attention vs. CNN&lt;/h3>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-04-13%2023.32.39.png" alt="Comparison of self-attention and CNN receptive fields" style="zoom:67%;" />
&lt;ul>
&lt;li>
&lt;p>CNN: self-attention that can only attend in a &lt;strong>fixed&lt;/strong> local receptive field&lt;/p>
&lt;p>$\Rightarrow$ CNN is &lt;strong>simplified&lt;/strong> self-attention&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Self-attention: a CNN with a learnable receptive field&lt;/p>
&lt;ul>
&lt;li>Can consider information of the whole image $\Rightarrow$ More flexible than CNN&lt;/li>
&lt;li>The receptive field is &lt;strong>learnable&lt;/strong>.&lt;/li>
&lt;/ul>
&lt;p>$\Rightarrow$ Self-attention is a more complex version of CNN&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>For a more detailed and mathematical explanation, see paper &lt;a href="https://arxiv.org/abs/1911.03584">On the Relationship between Self-Attention and Convolutional Layers&lt;/a>, as well as their &lt;a href="http://jbcordonnier.com/posts/attention-cnn/">blog post&lt;/a> and &lt;a href="https://epfml.github.io/attention-cnn/">visualization&lt;/a>.&lt;/p>
&lt;h2 id="reference">Reference&lt;/h2>
&lt;ul>
&lt;li>&lt;a href="https://blog.csdn.net/malefactor/article/details/78767781">深度学习中的注意力机制(2017版)&lt;/a> - explains attention mechanism intuitively and detailedly&lt;/li>
&lt;li>&lt;a href="https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html#self-attention">Attention? Attention!&lt;/a> - Summary of attention mechanisms&lt;/li>
&lt;li>📹 &lt;a href="https://www.youtube.com/watch?v=hYdO9CscNes">[Machine Learning 2021] Self-attention (Part 1)&lt;/a>&lt;/li>
&lt;li>📹 &lt;a href="https://www.youtube.com/watch?v=gmsMY5kc-zw">[Machine Learning 2021] Self-attention (Part 2)&lt;/a>&lt;/li>
&lt;/ul></description></item><item><title>Transformer</title><link>https://haobin-tan.netlify.app/docs/ai/computer-vision/cv-transformer/transformer-recap/</link><pubDate>Sun, 11 Apr 2021 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/computer-vision/cv-transformer/transformer-recap/</guid><description>&lt;figure>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/transformer-stey_by_step%20%285%29.png"
alt="Transformer architecture">&lt;figcaption>
&lt;p>Transformer architecture&lt;/p>
&lt;/figcaption>
&lt;/figure>
&lt;h2 id="reference">Reference&lt;/h2>
&lt;ul>
&lt;li>
&lt;p>&lt;a href="http://ai.googleblog.com/2017/08/transformer-novel-neural-network.html">Transformer: A Novel Neural Network Architecture for Language Understanding&lt;/a> - An introduction of Transformer from Google AI Blog&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Tutorials&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;a href="https://jalammar.github.io/illustrated-transformer/">The Illustrated Transformer&lt;/a> - Detailed explanation with tons of illustrations 👍🔥&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;a href="https://www.youtube.com/watch?v=4Bdc55j80l8">Illustrated Guide to Transformers Neural Network: A step by step explanation&lt;/a> - Step-by-step video explanation 📹👍🔥&lt;/p>
&lt;div style="position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;">
&lt;iframe allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen="allowfullscreen" loading="eager" referrerpolicy="strict-origin-when-cross-origin" src="https://www.youtube.com/embed/4Bdc55j80l8?autoplay=0&amp;controls=1&amp;end=0&amp;loop=0&amp;mute=0&amp;start=0" style="position: absolute; top: 0; left: 0; width: 100%; height: 100%; border:0;" title="YouTube video"
>&lt;/iframe>
&lt;/div>
&lt;/li>
&lt;li>
&lt;p>&lt;a href="https://www.youtube.com/watch?v=S27pHKBEp30">LSTM is dead. Long Live Transformers!&lt;/a> - Briefly explanation of Transformer&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;a href="https://www.youtube.com/watch?v=iDulhoQ2pro">Attention Is All You Need&lt;/a> - Video explanation about the paper &amp;ldquo;Attention is All You Need&amp;rdquo;&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Visualization&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://peltarion.com/blog/data-science/self-attention-video">Getting meaning from text: self-attention step-by-step video&lt;/a> (&lt;a href="https://www.youtube.com/watch?v=-9vVhYEXeyQ&amp;amp;t=4s">video&lt;/a>)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Implementation&lt;/p>
&lt;ul>
&lt;li>&lt;a href="http://nlp.seas.harvard.edu/2018/04/03/attention.html">The Annotated Transformer&lt;/a> - A guide annotation the paper with PyTorch implementation 👍🔥&lt;/li>
&lt;li>&lt;a href="https://github.com/jadore801120/attention-is-all-you-need-pytorch">attention-is-all-you-need-pytorch&lt;/a> - PyTorch implementation&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul></description></item><item><title>Visual Transformer</title><link>https://haobin-tan.netlify.app/docs/ai/computer-vision/cv-transformer/transformer_in_cv/</link><pubDate>Tue, 13 Apr 2021 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/computer-vision/cv-transformer/transformer_in_cv/</guid><description/></item></channel></rss>