<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Seq2Seq | Haobin Tan</title><link>https://haobin-tan.netlify.app/tags/seq2seq/</link><atom:link href="https://haobin-tan.netlify.app/tags/seq2seq/index.xml" rel="self" type="application/rss+xml"/><description>Seq2Seq</description><generator>Hugo Blox Builder (https://hugoblox.com)</generator><language>en-us</language><lastBuildDate>Sun, 23 Aug 2020 00:00:00 +0000</lastBuildDate><image><url>https://haobin-tan.netlify.app/media/icon_hu7d15bc7db65c8eaf7a4f66f5447d0b42_15095_512x512_fill_lanczos_center_3.png</url><title>Seq2Seq</title><link>https://haobin-tan.netlify.app/tags/seq2seq/</link></image><item><title>Encoder-Decoder Models</title><link>https://haobin-tan.netlify.app/docs/ai/deep-learning/encoder-decoder/</link><pubDate>Sun, 16 Aug 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/deep-learning/encoder-decoder/</guid><description/></item><item><title>Sequence to Sequence</title><link>https://haobin-tan.netlify.app/docs/ai/deep-learning/encoder-decoder/seq2seq/</link><pubDate>Sun, 16 Aug 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/deep-learning/encoder-decoder/seq2seq/</guid><description>&lt;h2 id="language-modeling">Language Modeling&lt;/h2>
&lt;p>A &lt;strong>language model&lt;/strong> is a model that computes the probability of a sequence
&lt;/p>
$$
\begin{aligned}
P(W) &amp;= P(W\_1 W\_2 \dots W\_n) \\\\
&amp;= P\left(W\_{1}\right) P\left(W_{2} \mid W\_{1}\right) P\left(W\_{3} \mid W\_{1} W\_{2}\right) \ldots P\left(W\_{n} \mid W\_{1 \ldots n-1}\right)
\end{aligned}
$$
&lt;p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-08-22%2010.19.22.png" alt="截屏2020-08-22 10.19.22" style="zoom: 40%;" />&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Softmax layer&lt;/p>
&lt;ul>
&lt;li>
&lt;p>After a linear mapping of the hidden layer $H$, a &amp;ldquo;score&amp;rdquo; vector $O = [O\_1, O\_2, \dots, O\_{n-1}, O\_n]$ is obtained&lt;/p>
&lt;/li>
&lt;li>
&lt;p>The softmax function normalizes $O$ to get probabilities
&lt;/p>
$$
P\_{i}=\frac{\exp \left(O\_{i}\right)}{\sum\_{j} \exp \left(O\_{j}\right)}
$$
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Cross-Entropy Loss&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Use one-hot-vector $Y = [0,0,0,0,0,\dots,1,0,0]$ as the label to train the model&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Cross-entropy loss: measures the difference between the predicted distribution and the label
&lt;/p>
$$
L\_{CE} = -\sum\_i Y\_i \log(P\_i)
$$
&lt;/li>
&lt;li>
&lt;p>When $Y$ is a one-hot vector:
&lt;/p>
$$
L\_{CE} = -\log P\_{j} \qquad (j: \text{the index of the correct word})
$$
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
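&lt;p>The two formulas above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the score values and the three-word vocabulary are made up, and the max-subtraction inside &lt;code>softmax&lt;/code> is a common numerical-stability trick that the notes do not mention.&lt;/p>

```python
import numpy as np

def softmax(scores):
    """Normalize a score vector O into probabilities P."""
    shifted = scores - np.max(scores)  # subtract the max for numerical stability
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()

def cross_entropy(probs, label_index):
    """Cross-entropy against a one-hot label: -log P_j for the correct index j."""
    return -np.log(probs[label_index])

O = np.array([2.0, 1.0, 0.1])   # hypothetical scores for a 3-word vocabulary
P = softmax(O)
Y = np.array([1.0, 0.0, 0.0])   # one-hot label: the correct word has index 0

full_loss = -np.sum(Y * np.log(P))   # general form: -sum_i Y_i log P_i
short_loss = cross_entropy(P, 0)     # one-hot shortcut: -log P_j
```

&lt;p>Both &lt;code>full_loss&lt;/code> and &lt;code>short_loss&lt;/code> compute the same number, because every term with $Y\_i = 0$ drops out of the sum.&lt;/p>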
&lt;h3 id="training">Training&lt;/h3>
&lt;p>Force the model to “fit” known text sequences (teacher forcing / memorization)&lt;/p>
&lt;ul>
&lt;li>$W=w\_{1} w\_{2} w\_{3} \dots w\_{n-2} w\_{n-1} w\_{n}$&lt;/li>
&lt;li>Input: $w\_{1} w\_{2} w\_{3} \dots w\_{n-2} w\_{n-1}$&lt;/li>
&lt;li>Output: $w\_{2} w\_{3} \dots w\_{n-2} w\_{n-1} w\_{n}$&lt;/li>
&lt;/ul>
&lt;p>At generation time, the model uses its own output at time step $t-1$ as the input for time step $t$ (&lt;strong>autoregressive&lt;/strong>, or &lt;strong>sampling mode&lt;/strong>)&lt;/p>
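&lt;p>As a small illustration of the shifted input/output pairs used in training, the sketch below splits one sequence into its input and target halves. The helper name &lt;code>make_training_pair&lt;/code> and the example string are hypothetical.&lt;/p>

```python
def make_training_pair(tokens):
    """Split w_1 ... w_n into input w_1 ... w_{n-1} and target w_2 ... w_n."""
    return tokens[:-1], tokens[1:]

sequence = list("hello")            # hypothetical character sequence
inputs, targets = make_training_pair(sequence)

# At every position the target is simply the next token of the known text
for x, y in zip(inputs, targets):
    print(f"input {x!r} -> target {y!r}")
```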
&lt;h3 id="rnn-language-modeling">RNN Language Modeling&lt;/h3>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-08-22%2010.37.53.png" alt="截屏2020-08-22 10.37.53" style="zoom: 40%;" />
&lt;h4 id="basic-step-wise-operation">Basic step-wise operation&lt;/h4>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-08-22%2011.29.51.png" alt="截屏2020-08-22 11.29.51" style="zoom:50%;" />
&lt;ul>
&lt;li>
&lt;p>Input: $x\_t = [0, 0, 0, 1, \dots ]$&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Label: $y\_t = [0, 0, 1, 0, \dots]$&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Input to word embeddings:
&lt;/p>
$$
e\_t = x\_t \cdot W\_{emb}^T
$$
&lt;/li>
&lt;li>
&lt;p>Hidden layer:
&lt;/p>
$$
h\_t = \operatorname{LSTM}(e\_t, (h\_{t-1}, c\_{t-1}))
$$
&lt;/li>
&lt;li>
&lt;p>Output:
&lt;/p>
$$
o\_t = h\_t \cdot W\_{out}^T
$$
&lt;/li>
&lt;li>
&lt;p>Softmax:
&lt;/p>
$$
p\_t = \operatorname{softmax}(o\_t)
$$
&lt;/li>
&lt;li>
&lt;p>Loss: Cross Entropy
&lt;/p>
$$
L\_t = -\sum\_i y\_{t\_{i}} \log p\_{t\_{i}}
$$
&lt;p>
($y\_{t\_{i}}$: the $i$-th element of $y\_t$)&lt;/p>
&lt;blockquote>
$$
\frac{d L\_t}{d o\_t} = p\_t - y\_t
$$
&lt;/blockquote>
&lt;/li>
&lt;/ul>
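&lt;p>The step-wise operation can be traced end to end with a toy example. The sketch below is only illustrative: all sizes and weights are random placeholders, the four LSTM gates are stacked into one hypothetical matrix &lt;code>W\_lstm&lt;/code>, and the gate layout follows the standard LSTM equations rather than anything specified in these notes.&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(0)
V, E, H = 5, 4, 3                      # vocabulary, embedding and hidden sizes (illustrative)

W_emb = rng.normal(size=(E, V))        # so that e_t = x_t . W_emb^T has size E
W_lstm = rng.normal(size=(4 * H, E + H)) * 0.1   # all four gates stacked (assumption)
b_lstm = np.zeros(4 * H)
W_out = rng.normal(size=(V, H))        # so that o_t = h_t . W_out^T has size V

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(e_t, h_prev, c_prev):
    """Standard LSTM cell: input, forget and output gates plus a candidate."""
    z = W_lstm @ np.concatenate([e_t, h_prev]) + b_lstm
    i, f, o, g = np.split(z, 4)
    c_t = sigmoid(f) * c_prev + sigmoid(i) * np.tanh(g)
    h_t = sigmoid(o) * np.tanh(c_t)
    return h_t, c_t

x_t = np.eye(V)[3]                     # one-hot input  [0, 0, 0, 1, 0]
y_t = np.eye(V)[2]                     # one-hot label  [0, 0, 1, 0, 0]

e_t = x_t @ W_emb.T                    # embedding lookup
h_t, c_t = lstm_step(e_t, np.zeros(H), np.zeros(H))
o_t = h_t @ W_out.T                    # scores over the vocabulary
p_t = np.exp(o_t - o_t.max())
p_t = p_t / p_t.sum()                  # softmax
L_t = -np.sum(y_t * np.log(p_t))       # cross-entropy loss
grad_o = p_t - y_t                     # dL_t / do_t
```

&lt;p>Multiplying the one-hot vector by &lt;code>W\_emb.T&lt;/code> just selects one column of &lt;code>W\_emb&lt;/code>, which is why real implementations use a table lookup instead of a matrix product.&lt;/p>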
&lt;h4 id="output-layer">Output layer&lt;/h4>
&lt;p>Because of the softmax, the output layer is a &lt;strong>distribution over the vocabulary&lt;/strong> (the probability of each item given the context)&lt;/p>
&lt;p>&lt;strong>&amp;ldquo;Teacher-Forcing&amp;rdquo; method&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>We do not explicitly tell the model to generate “all”&lt;/li>
&lt;li>But applying the cross-entropy loss forces the distribution to favour “all”&lt;/li>
&lt;/ul>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-08-22%2011.33.44.png" alt="截屏2020-08-22 11.33.44" style="zoom: 67%;" />
&lt;h4 id="backpropagation-in-the-model">Backpropagation in the model&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>Gradient of the loss function
&lt;/p>
$$
\frac{d L\_t}{d o\_t} = p\_t - y\_t
$$
&lt;/li>
&lt;li>
&lt;p>$y\_t$ in a character-level language model is a one-hot vector&lt;/p>
&lt;p>$\to$ The gradient is positive everywhere except at the label position, where it is negative&lt;/p>
&lt;blockquote>
&lt;p>$p\_{t\_i} \in (0, 1)$&lt;/p>
&lt;p>Assume the label is in the $i$-th position.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>In non-label position:
&lt;/p>
$$
\forall j \neq i: \qquad y\_{t\_j} = 0 \\\\
\Rightarrow p\_{t\_j} - y\_{t\_j} = p\_{t\_j} - 0 = p\_{t\_j} > 0
$$
&lt;/li>
&lt;li>
&lt;p>In label position:
&lt;/p>
$$
y\_{t\_i} = 1 \Rightarrow p\_{t\_i} - y\_{t\_i} = p\_{t\_i} - 1 &lt; 0
$$
&lt;/li>
&lt;/ul>
&lt;/blockquote>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-08-22%2011.38.24.png" alt="截屏2020-08-22 11.38.24" style="zoom: 67%;" />
&lt;/li>
&lt;li>
&lt;p>Gradient at the hidden layer:&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-08-22%2011.45.49.png" alt="截屏2020-08-22 11.45.49" style="zoom: 50%;" />
&lt;ul>
&lt;li>The hidden layer $h\_t$ receives gradients from two sources:
&lt;ul>
&lt;li>the loss at the current time step&lt;/li>
&lt;li>the gradient carried back from future time steps&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Similarly, the memory cell $c\_t$:
&lt;ul>
&lt;li>receives the gradient from $h\_t$&lt;/li>
&lt;li>sums that gradient with the gradient coming from the future&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
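&lt;p>The output-layer gradient $p\_t - y\_t$ is easy to verify numerically. The following sketch compares it against central finite differences on a made-up score vector; the vocabulary size, label position and step size &lt;code>eps&lt;/code> are arbitrary choices for illustration.&lt;/p>

```python
import numpy as np

def softmax(o):
    e = np.exp(o - o.max())
    return e / e.sum()

def loss(o, y):
    """Cross-entropy of softmax(o) against the one-hot label y."""
    return -np.sum(y * np.log(softmax(o)))

rng = np.random.default_rng(1)
o_t = rng.normal(size=6)              # hypothetical scores over 6 characters
y_t = np.eye(6)[4]                    # label at position 4

analytic = softmax(o_t) - y_t         # p_t - y_t

# Central finite differences, one coordinate at a time
eps = 1e-6
numeric = np.zeros_like(o_t)
for k in range(o_t.size):
    step = np.zeros_like(o_t)
    step[k] = eps
    numeric[k] = (loss(o_t + step, y_t) - loss(o_t - step, y_t)) / (2 * eps)

max_abs_diff = np.max(np.abs(analytic - numeric))
```

&lt;p>The analytic gradient is negative only at the label position and positive everywhere else, matching the sign argument above.&lt;/p>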
&lt;h3 id="generate-a-new-sequence">Generate a New Sequence&lt;/h3>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-08-22%2011.47.49.png" alt="截屏2020-08-22 11.47.49" style="zoom: 67%;" />
&lt;ul>
&lt;li>
&lt;p>Start from a memory state of the RNN (often initialized as zeros)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Run the network through a seed&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Generate the probability distribution given the seed and the memory state&lt;/p>
&lt;/li>
&lt;li>
&lt;p>“Sample” a new token from the distribution&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Use the new token as the seed and carry the new memory over to keep generating&lt;/p>
&lt;/li>
&lt;/ul>
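&lt;p>The generation loop above might look roughly like this. To keep the sketch short it uses an untrained plain tanh RNN instead of an LSTM, so the output is random gibberish; the vocabulary, the weight names and the &lt;code>generate&lt;/code> helper are all hypothetical.&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = list("abcde")                  # hypothetical character vocabulary
V, H = len(vocab), 8
W_xh = rng.normal(size=(H, V)) * 0.5   # untrained, random weights
W_hh = rng.normal(size=(H, H)) * 0.5
W_hy = rng.normal(size=(V, H)) * 0.5

def step(token_id, h):
    """One recurrent step: returns the next-token distribution and new memory."""
    h = np.tanh(W_xh[:, token_id] + W_hh @ h)
    scores = W_hy @ h
    p = np.exp(scores - scores.max())
    return p / p.sum(), h

def generate(seed, length):
    h = np.zeros(H)                    # memory state initialized as zeros
    for ch in seed[:-1]:               # run the network through the seed
        _, h = step(vocab.index(ch), h)
    token_id = vocab.index(seed[-1])
    out = []
    for _ in range(length):
        p, h = step(token_id, h)       # distribution given seed and memory
        token_id = int(rng.choice(V, p=p))   # sample a new token
        out.append(vocab[token_id])    # the new token becomes the next input
    return "".join(out)

sample = generate("ab", 10)
```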
&lt;h2 id="sequence-to-sequence-models">Sequence-to-Sequence Models&lt;/h2>
&lt;p>💡 &lt;strong>Main idea: from $P(W)$ to $P(W \mid C)$ with $C$ being a context&lt;/strong>&lt;/p>
&lt;h3 id="controlling-generation-with-rnn">Controlling Generation with RNN&lt;/h3>
&lt;p>Hints:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>The generated sequence depends on&lt;/p>
&lt;ul>
&lt;li>
&lt;p>The first input(s)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>The first recurrent hidden state&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-08-22%2012.11.10.png" alt="截屏2020-08-22 12.11.10" style="zoom: 60%;" />
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>The final hidden state of the network after rolling contains the compressed information about the sequence&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-08-22%2012.11.19.png" alt="截屏2020-08-22 12.11.19" style="zoom:60%;" />
&lt;/li>
&lt;/ul>
&lt;h3 id="sequence-to-sequence-problem">Sequence-to-Sequence problem&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>Given: sequence $X$ of variable length (&lt;strong>source sequence&lt;/strong>)
&lt;/p>
$$
X = (X\_1, X\_2, \dots, X\_m)
$$
&lt;/li>
&lt;li>
&lt;p>Task: generate a new sequence $Y$ that has the same content
&lt;/p>
$$
Y = (Y\_1, Y\_2, \dots, Y\_n)
$$
&lt;/li>
&lt;li>
&lt;p>Training:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Given the dataset $𝐷$ containing pairs of parallel sentences $(𝑋, 𝑌)$&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Training objective: maximize the log-likelihood of the reference output given the source
&lt;/p>
$$
\log P\left(Y^{\*} \mid X^{\*}\right) \qquad \forall \left(\mathrm{Y}^{\*}, \mathrm{X}^{\*}\right) \in D
$$
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
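&lt;p>Assuming the conditional probability factorizes step by step with the chain rule, just like the language model at the top of this page, the objective for one sentence pair is a sum of per-step log-probabilities. The sketch below uses made-up decoder outputs; &lt;code>sequence\_log\_likelihood&lt;/code> is a hypothetical helper name.&lt;/p>

```python
import numpy as np

def sequence_log_likelihood(step_probs, target_ids):
    """Sum of log P(y_t | y_1..y_{t-1}, X) over the target sequence."""
    picked = step_probs[np.arange(len(target_ids)), target_ids]
    return float(np.sum(np.log(picked)))

# Hypothetical decoder outputs: one distribution over a 3-word vocabulary per step
step_probs = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.2, 0.6],
])
target_ids = np.array([0, 1, 2])       # reference translation Y*

log_p = sequence_log_likelihood(step_probs, target_ids)
loss = -log_p                          # minimizing this maximizes log P(Y* | X*)
```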
&lt;h3 id="encoder-decoder-model">Encoder-Decoder model&lt;/h3>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-08-22%2012.20.13.png" alt="截屏2020-08-22 12.20.13" style="zoom: 70%;" />
&lt;h4 id="encoder">Encoder&lt;/h4>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-08-22%2012.20.49.png" alt="截屏2020-08-22 12.20.49" style="zoom: 70%;" />
&lt;ul>
&lt;li>
&lt;p>Transforms the input into neural representation&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Input&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Discrete&lt;/strong> variables: require using embedding to be “compatible” with neural networks&lt;/li>
&lt;li>&lt;strong>Continuous&lt;/strong> variables: can be “raw features” of the input
&lt;ul>
&lt;li>Speech signals&lt;/li>
&lt;li>Image pixels&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>The encoder represents the $X$ part&lt;/p>
&lt;/li>
&lt;li>
&lt;p>The generative chain does not include any generation from the encoder&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h4 id="decoder">Decoder&lt;/h4>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-08-22%2012.21.56.png" alt="截屏2020-08-22 12.21.56" style="zoom:60%;" />
&lt;ul>
&lt;li>The encoder gives the decoder a representation of the source input&lt;/li>
&lt;li>The decoder would try to “decode” the information from that representation&lt;/li>
&lt;li>Key idea: &lt;strong>the encoder representation is the $H\_0$ of the decoder network&lt;/strong>&lt;/li>
&lt;li>The operation is identical to the character-based language model
&lt;ul>
&lt;li>Back-propagation through time provides the gradient w.r.t any hidden state
&lt;ul>
&lt;li>For the decoder: the BPTT is identical to the language model&lt;/li>
&lt;li>For the encoder: at each step, the encoder hidden state has only one source of gradient&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
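&lt;p>The key idea, using the encoder representation as the decoder's $H\_0$, can be sketched with two tiny RNNs. Everything here is an illustrative assumption: the weights are random and untrained, plain tanh cells stand in for LSTMs, and decoding is greedy so the result is deterministic.&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(0)
V_src, V_tgt, H = 6, 5, 4              # illustrative vocabulary and hidden sizes
enc_Wx = rng.normal(size=(H, V_src)) * 0.5
enc_Wh = rng.normal(size=(H, H)) * 0.5
dec_Wx = rng.normal(size=(H, V_tgt)) * 0.5
dec_Wh = rng.normal(size=(H, H)) * 0.5
dec_Wo = rng.normal(size=(V_tgt, H)) * 0.5

def encode(source_ids):
    """Roll the encoder over the source; the final state summarizes it."""
    h = np.zeros(H)
    for i in source_ids:
        h = np.tanh(enc_Wx[:, i] + enc_Wh @ h)
    return h

def decode_greedy(h0, start_id, max_len):
    """Language-model style decoding, started from the encoder state h0."""
    h, token, out = h0, start_id, []
    for _ in range(max_len):
        h = np.tanh(dec_Wx[:, token] + dec_Wh @ h)
        token = int(np.argmax(dec_Wo @ h))   # greedy choice, for determinism
        out.append(token)
    return out

H0 = encode([1, 4, 2])                 # encoder representation of the source
translation = decode_greedy(H0, start_id=0, max_len=5)
```

&lt;p>The decoder is an ordinary language model; the only thing that ties it to the source sentence is the initial state &lt;code>H0&lt;/code>.&lt;/p>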
&lt;h3 id="problem-with-encoder-decoder-model">Problem with Encoder-Decoder Model&lt;/h3>
&lt;p>The model worked well to translate short sentences. &lt;span style="color:red">But long sentences (more than 10 words) suffered greatly&lt;/span> &amp;#x1f622;&lt;/p>
&lt;p>🔴 Main problem:&lt;/p>
&lt;ul>
&lt;li>Long-range dependency&lt;/li>
&lt;li>Gradient starving&lt;/li>
&lt;/ul>
&lt;p>A pretty &amp;ldquo;hacky&amp;rdquo; trick: &lt;em>reversing&lt;/em> the source sentence, so that the first words of the source end up closer to the first words of the target. 😈&lt;/p>
&lt;h3 id="observation-for-solving-this-problem">Observation for solving this problem&lt;/h3>
&lt;p>Each word in the source sentence can be aligned with some words in the target sentence. (“&lt;strong>alignment&lt;/strong>” in Machine Translation)&lt;/p>
&lt;p>We can try to &lt;strong>establish a connection between the aligned words&lt;/strong>.&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-08-22%2012.37.16.png" alt="截屏2020-08-22 12.37.16" style="zoom:67%;" />
&lt;h2 id="seq2seq-with-attention">Seq2Seq with Attention&lt;/h2>
&lt;p>As mentioned, we want to find the alignment between decoder and encoder states. However, our decoder only looks at the final, compressed encoder state to find the information.&lt;/p>
&lt;p>Therefore we have to modify the decoder to do better! &amp;#x1f4aa;&lt;/p>
&lt;div class="flex px-4 py-3 mb-6 rounded-md bg-primary-100 dark:bg-primary-900">
&lt;span class="pr-3 pt-1 text-primary-600 dark:text-primary-300">
&lt;svg height="24" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24">&lt;path fill="none" stroke="currentColor" stroke-linecap="round" stroke-linejoin="round" stroke-width="1.5" d="m11.25 11.25l.041-.02a.75.75 0 0 1 1.063.852l-.708 2.836a.75.75 0 0 0 1.063.853l.041-.021M21 12a9 9 0 1 1-18 0a9 9 0 0 1 18 0m-9-3.75h.008v.008H12z"/>&lt;/svg>
&lt;/span>
&lt;span class="dark:text-neutral-300">&lt;p>We will use superscript $e$ and $d$ to distinguish Encoder and Decoder.&lt;/p>
&lt;p>E.g.:&lt;/p>
&lt;ul>
&lt;li>$H\_j^e$: $j$-th hidden state of the encoder&lt;/li>
&lt;li>$H\_j^d$: $j$-th hidden state of the decoder&lt;/li>
&lt;/ul>
&lt;/span>
&lt;/div>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-08-22%2016.43.12.png" alt="截屏2020-08-22 16.43.12" style="zoom:67%;" />
&lt;ol>
&lt;li>Run the encoder LSTM through the input sentence (read the sentence and encode it into states)&lt;/li>
&lt;li>The LSTM operation gives us some assumption about $H\_j^e$&lt;/li>
&lt;li>The state $H\_j^e$ contains information about
&lt;ul>
&lt;li>the word $W\_j$ (because of input gate)&lt;/li>
&lt;li>some information about the surrounding words (because of memory)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol>
&lt;p>Now we start generating the translation with the decoder.&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-08-22%2016.43.19.png" alt="截屏2020-08-22 16.43.19" style="zoom:67%;" />
&lt;ol start="4">
&lt;li>
&lt;p>The LSTM consumes the &lt;code>EOS&lt;/code> token (always) and the hidden states copied over to get the first hidden state $H\_0$&lt;/p>
&lt;/li>
&lt;li>
&lt;p>From $H\_0$ we have to generate the first word, and we need to look back to the encoder.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Here we ask the question:&lt;/p>
&lt;p>&lt;em>&lt;strong>“Which word is responsible to generate the first word?”&lt;/strong>&lt;/em>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>However, it&amp;rsquo;s not easy to answer this question&lt;/p>
&lt;ul>
&lt;li>First we don’t know 😭&lt;/li>
&lt;li>Second there might be more than one relevant word (like when we translate phrases or compound words) 🤪&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol>
&lt;p>The best way to find out is: to check all of the words! &amp;#x1f4aa;&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-08-22%2016.43.32.png" alt="截屏2020-08-22 16.43.32" style="zoom:67%;" />
&lt;ol start="6">
&lt;li>
&lt;p>$H\_0$ has to connect all $H\_i^e$ in the encoder side for &lt;strong>querying&lt;/strong>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Each “connection” will return a score $\alpha\_i$ indicating how relevant $H\_i^e$ is to generate the translation from $H\_0$&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Generate $\alpha_i^0$&lt;/p>
&lt;ol>
&lt;li>A feed forward neural network with nonlinearity&lt;/li>
&lt;/ol>
$$
\alpha\_{i}^{0}=\mathrm{W}\_{2} \cdot \tanh \left(\mathrm{W}\_{1} \cdot\left[\mathrm{H}\_{0}, \mathrm{H}\_{i}^{\mathrm{e}}\right]+b\_{1}\right)
$$
&lt;ol start="2">
&lt;li>Use Softmax to get probabilities
$$
\alpha \leftarrow \operatorname{softmax}(\alpha)
$$&lt;/li>
&lt;/ol>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>The higher $\alpha\_i$ is, the more relevant the state $H\_i^e$ is&lt;/p>
&lt;/li>
&lt;/ol>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-08-22%2016.43.42.png" alt="截屏2020-08-22 16.43.42" style="zoom:67%;" />
&lt;p>After $H\_0$ asks “Everyone” in the encoder, it needs to sum up the information
&lt;/p>
$$
C\_0 = \sum\_i \alpha\_i^0 H\_i^e
$$
&lt;p>
($C\_0$ is the summarization of the information in the encoder that is the most relevant to $H\_0$ to generate the first word in the decoder)&lt;/p>
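&lt;p>The querying, scoring and summing steps can be put together in a short sketch. The sizes and weights are random placeholders, and the hypothetical names &lt;code>W1&lt;/code>, &lt;code>b1&lt;/code> and &lt;code>W2&lt;/code> mirror the feed-forward scorer in the formula above.&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(0)
H, A, m = 4, 5, 3                      # hidden size, scorer size, source length (illustrative)
W1 = rng.normal(size=(A, 2 * H))
b1 = np.zeros(A)
W2 = rng.normal(size=A)

H_enc = rng.normal(size=(m, H))        # encoder states H_1^e ... H_m^e
H0 = rng.normal(size=H)                # first decoder state

# Score every encoder state against H0 with the feed-forward scorer
raw = np.array([W2 @ np.tanh(W1 @ np.concatenate([H0, h_e]) + b1) for h_e in H_enc])

# Softmax over the source positions
alpha = np.exp(raw - raw.max())
alpha = alpha / alpha.sum()

# Context vector: weighted sum of the encoder states
C0 = alpha @ H_enc
```

&lt;p>Because &lt;code>alpha&lt;/code> sums to one, &lt;code>C0&lt;/code> is a weighted average of the encoder states and has the same size as a single state.&lt;/p>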
&lt;ol start="9">
&lt;li>Now we can answer the question “Which word is responsible to generate the first word?”&lt;/li>
&lt;/ol>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-08-22%2016.43.49.png" alt="截屏2020-08-22 16.43.49" style="zoom:67%;" />
&lt;p>The answer is: &lt;strong>the words with highest $\alpha$ coefficients&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Combine $H\_0$ with $C\_0$ to obtain $\hat{H}\_0$&lt;/li>
&lt;li>Generate the softmax output $P\_0$ from $\hat{H}\_0$&lt;/li>
&lt;li>Go to the next step&lt;/li>
&lt;/ul>
&lt;p>In general, at time step $t$:&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-08-22%2017.02.36.png" alt="截屏2020-08-22 17.02.36" style="zoom:67%;" />
&lt;ul>
&lt;li>
&lt;p>Decoder LSTM generates the hidden state $H\_t$ from the memory $H\_{t-1}$&lt;/p>
&lt;/li>
&lt;li>
&lt;p>$H\_t$ “pays attention” to the encoder states to know which source information is relevant&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Generate $\alpha\_i^t$ from each $H\_i^e$&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Weighted sum for &amp;ldquo;context vector&amp;rdquo; $C\_t$&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Combine $C\_t$ and $H\_t$, then generate the output distribution $P\_t$&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Use a feed-forward neural network
&lt;/p>
$$
\hat{H}\_t = \operatorname{tanh}(W\cdot [C\_t, H\_t])
$$
&lt;/li>
&lt;li>
&lt;p>Or use an RNN
&lt;/p>
$$
\hat{H}\_t = \operatorname{RNN}(C\_t, H\_t)
$$
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
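&lt;p>For the feed-forward variant of the last step, a minimal sketch could look as follows. The context vector and hidden state are random stand-ins, and the output projection &lt;code>W\_out&lt;/code> followed by a softmax is an assumption carried over from the language model section.&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(0)
H, V = 4, 6                            # hidden and vocabulary sizes (illustrative)
W = rng.normal(size=(H, 2 * H))        # combines [C_t, H_t]
W_out = rng.normal(size=(V, H))        # hypothetical output projection

C_t = rng.normal(size=H)               # context vector from attention
H_t = rng.normal(size=H)               # decoder hidden state

H_hat = np.tanh(W @ np.concatenate([C_t, H_t]))   # feed-forward combination
scores = W_out @ H_hat
P_t = np.exp(scores - scores.max())
P_t = P_t / P_t.sum()                  # output distribution over the vocabulary
```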
&lt;h3 id="training-1">Training&lt;/h3>
&lt;ul>
&lt;li>Very similar to basic Encoder-Decoder&lt;/li>
&lt;li>Since the scoring neural network is continuous, we can use backpropagation&lt;/li>
&lt;li>No more gradient starving on the encoder side&lt;/li>
&lt;/ul>
&lt;h2 id="pratical-suggestions-for-training">Practical Suggestions for Training&lt;/h2>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Loss is too high and generation is garbage&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Try to match your implementation with the LSTM equations&lt;/li>
&lt;li>Check the gradients with a numerical gradient check (gradcheck)&lt;/li>
&lt;li>Note that gradcheck may still report some weights that fail the relative error check; this is acceptable if only a few (1 or 2% of the weights) fail&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Loss looks decreasing and the generated text looks correct but not readable&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>In the sampling process, you can encourage the model to generate with higher certainty by using “argmax” (taking the char with highest probability)&lt;/li>
&lt;li>But “always using argmax” will look terrible&lt;/li>
&lt;li>A mixture of argmax and sampling can be used to ensure spelling correctness and exploration&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Large network size is normally needed for character based models&lt;/p>
&lt;/li>
&lt;/ul></description></item><item><title>👍 Attention</title><link>https://haobin-tan.netlify.app/docs/ai/deep-learning/encoder-decoder/attention/</link><pubDate>Sun, 16 Aug 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/deep-learning/encoder-decoder/attention/</guid><description>&lt;h2 id="core-idea">Core Idea&lt;/h2>
&lt;p>The main assumption in sequence modelling networks such as RNNs, LSTMs and GRUs is that &lt;strong>the current state holds information for the whole of the input&lt;/strong> seen so far. Hence the final state of an RNN after reading the whole input sequence should contain complete information about that sequence. But this seems to be too strong a condition and too much to ask.&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/1*1JcHGUU7rFgtXC_mydUA_Q.jpeg" alt="Image for post" style="zoom: 33%;" />
&lt;p>The attention mechanism relaxes this assumption and proposes that &lt;strong>we should look at the hidden states corresponding to the whole input sequence in order to make any prediction.&lt;/strong>&lt;/p>
&lt;h2 id="details">Details&lt;/h2>
&lt;p>The architecture of attention mechanism:&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/1*e5665dfyxLDgZzKmrZ8Y0Q.png" alt="Image for post" style="zoom: 40%;" />
&lt;p>The network is shown in a state:&lt;/p>
&lt;ul>
&lt;li>the encoder (lower part of the figure) has computed the hidden states $h\_j$ corresponding to each input $X\_j$&lt;/li>
&lt;li>the decoder (top part of the figure) has run for $t-1$ steps and is now going to produce output for time step $t$.&lt;/li>
&lt;/ul>
&lt;p>The whole process can be divided into four steps:&lt;/p>
&lt;ol>
&lt;li>&lt;a href="#encoding">Encoding&lt;/a>&lt;/li>
&lt;li>&lt;a href="#computing-attention-weightsalignment">Computing Attention Weights/Alignment&lt;/a>&lt;/li>
&lt;li>&lt;a href="#computing-context-vector">Computing Context Vector&lt;/a>&lt;/li>
&lt;li>&lt;a href="#decodingtranslation">Decoding/Translation&lt;/a>&lt;/li>
&lt;/ol>
&lt;h3 id="encoding">Encoding&lt;/h3>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/1*_hL6bQGbYGSJ4E-PgF4UfA.png" alt="Image for post" style="zoom:33%;" />
&lt;ul>
&lt;li>
&lt;p>$(X\_1, X\_2, \dots, X\_T)$: Input sequence&lt;/p>
&lt;ul>
&lt;li>$T$: Length of sequence&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>$(\overrightarrow{h}\_{1}, \overrightarrow{h}\_{2}, \dots, \overrightarrow{h}\_{T})$: Hidden state of the forward RNN&lt;/p>
&lt;/li>
&lt;li>
&lt;p>$(\overleftarrow{h}\_{1}, \overleftarrow{h}\_{2}, \ldots \overleftarrow{h}\_{T})$: Hidden state of the backward RNN&lt;/p>
&lt;/li>
&lt;li>
&lt;p>The hidden state for the $j$-th input $h\_j$ is the &lt;em>concatenation&lt;/em> of $j$-th hidden states of forward and backward RNNs.&lt;/p>
$$
h\_{j}=\left[\overrightarrow{h}\_{j} ; \overleftarrow{h}\_{j}\right], \quad \forall j \in[1, T]
$$
&lt;/li>
&lt;/ul>
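&lt;p>A bidirectional encoder can be sketched with two independent RNNs, one reading the input left to right and one right to left. The plain tanh cell, the sizes and the helper &lt;code>run\_rnn&lt;/code> are illustrative assumptions; the notes do not fix a particular cell type here.&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(0)
T, D, H = 5, 3, 4                      # sequence length, input size, hidden size (illustrative)
X = rng.normal(size=(T, D))            # inputs X_1 ... X_T

def run_rnn(inputs, W_x, W_h):
    """Simple tanh RNN; returns one hidden state per input position."""
    h, states = np.zeros(W_h.shape[0]), []
    for x in inputs:
        h = np.tanh(W_x @ x + W_h @ h)
        states.append(h)
    return np.stack(states)

fwd_params = (rng.normal(size=(H, D)), rng.normal(size=(H, H)) * 0.5)
bwd_params = (rng.normal(size=(H, D)), rng.normal(size=(H, H)) * 0.5)

h_fwd = run_rnn(X, *fwd_params)                  # reads X_1 -> X_T
h_bwd = run_rnn(X[::-1], *bwd_params)[::-1]      # reads X_T -> X_1, re-aligned to positions

h = np.concatenate([h_fwd, h_bwd], axis=1)       # h_j = [fwd_j ; bwd_j]
```

&lt;p>After the second reversal, row $j$ of &lt;code>h\_bwd&lt;/code> lines up with input $X\_j$, so each &lt;code>h[j]&lt;/code> summarizes the words both before and after position $j$.&lt;/p>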
&lt;h3 id="computing-attention-weightsalignment">Computing Attention Weights/Alignment&lt;/h3>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/1*jiJmd9ako4eBBkEf0igTHA.png" alt="Image for post" style="zoom:33%;" />
&lt;p>At each time step $t$ of the decoder, the amount of attention to be paid to the hidden encoder unit $h\_j$ is denoted by $\alpha_{tj}$ and calculated as a function of both $h\_j$ and previous hidden state of decoder $s\_{t-1}$:
&lt;/p>
$$
\begin{array}{l}
e\_{t j}=\boldsymbol{a}\left(h\_{j}, s\_{t-1}\right), \forall j \in[1, T] \\\\ \\\\
\alpha_{t j}=\frac{\displaystyle \exp \left(e\_{t j}\right)}{\displaystyle \sum_{k=1}^{T} \exp \left(e\_{t k}\right)}
\end{array}
$$
&lt;ul>
&lt;li>$\boldsymbol{a}(\cdot)$: parametrized as a feedforward neural network that runs for all $j$ at the decoding time step $t$&lt;/li>
&lt;li>$\alpha\_{tj} \in [0, 1]$&lt;/li>
&lt;li>$\displaystyle \sum\_j \alpha\_{tj} = 1$&lt;/li>
&lt;li>$\alpha\_{tj}$ can be visualized as the attention paid by the decoder at time step $t$ to the hidden encoder unit $h\_j$&lt;/li>
&lt;/ul>
&lt;h3 id="computing-context-vector">Computing Context Vector&lt;/h3>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/1*Y78e7OLg9A4LAg3_4bRUGA.png" alt="Image for post" style="zoom:33%;" />
&lt;p>Now we compute the context vector. The context vector is simply a linear combination of the hidden states $h\_j$ weighted by the attention values $\alpha_{tj}$ that we&amp;rsquo;ve computed in the preceding step:
&lt;/p>
$$
c\_t = \sum\_{j=1}^T \alpha\_{tj}h\_j
$$
&lt;p>
From the equation we can see that $\alpha_{tj}$ determines how much $h\_j$ affects the context $c\_t$. The higher the value, the higher the impact of $h\_j$ on the context for time $t$.&lt;/p>
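&lt;p>The alignment and context steps can be computed for all encoder positions at once. The sketch below assumes an additive scorer of the form $v\_a^\top \tanh(W\_a s\_{t-1} + U\_a h\_j)$ for $\boldsymbol{a}(\cdot)$, which is one common parametrization; the text above only requires a feed-forward network, and all sizes and weights here are made up.&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(0)
T, enc_dim, dec_dim, A = 6, 8, 5, 7    # illustrative sizes
h = rng.normal(size=(T, enc_dim))      # encoder states h_1 ... h_T
s_prev = rng.normal(size=dec_dim)      # previous decoder state s_{t-1}

W_a = rng.normal(size=(A, dec_dim))    # assumed additive scorer parameters
U_a = rng.normal(size=(A, enc_dim))
v_a = rng.normal(size=A)

e_t = np.tanh(h @ U_a.T + W_a @ s_prev) @ v_a    # e_{tj} for all j at once
alpha_t = np.exp(e_t - e_t.max())
alpha_t = alpha_t / alpha_t.sum()                # attention weights alpha_{tj}
c_t = alpha_t @ h                                # context vector c_t
```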
&lt;h3 id="decodingtranslation">Decoding/Translation&lt;/h3>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/1*uIiUT02LY8aa5Qj4rUZ8Uw.png" alt="Image for post" style="zoom:33%;" />
&lt;p>Compute the new hidden state $s\_t$ using&lt;/p>
&lt;ul>
&lt;li>the context vector $c\_t$&lt;/li>
&lt;li>the previous hidden state of the decoder $s\_{t-1}$&lt;/li>
&lt;li>the previous output $y\_{t-1}$&lt;/li>
&lt;/ul>
$$
s\_{t}=f\left(s\_{t-1}, y\_{t-1}, c\_{t}\right)
$$
&lt;p>The output at time step $t$ is
&lt;/p>
$$
p\left(y\_{t} \mid y\_{1}, y\_{2}, \ldots y\_{t-1}, x\right)=g\left(y\_{t-1}, s\_{t}, c\_{t}\right)
$$
&lt;div class="flex px-4 py-3 mb-6 rounded-md bg-primary-100 dark:bg-primary-900">
&lt;span class="pr-3 pt-1 text-primary-600 dark:text-primary-300">
&lt;svg height="24" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24">&lt;path fill="none" stroke="currentColor" stroke-linecap="round" stroke-linejoin="round" stroke-width="1.5" d="m11.25 11.25l.041-.02a.75.75 0 0 1 1.063.852l-.708 2.836a.75.75 0 0 0 1.063.853l.041-.021M21 12a9 9 0 1 1-18 0a9 9 0 0 1 18 0m-9-3.75h.008v.008H12z"/>&lt;/svg>
&lt;/span>
&lt;span class="dark:text-neutral-300">In the &lt;a href="https://arxiv.org/pdf/1409.0473.pdf">paper&lt;/a>, the authors use a GRU cell for $f$ and a similar function for $g$.&lt;/span>
&lt;/div>
&lt;h2 id="reference">Reference&lt;/h2>
&lt;ul>
&lt;li>&lt;a href="https://medium.com/@shashank7.iitd/understanding-attention-mechanism-35ff53fc328e">Understanding Attention Mechanism&lt;/a>&lt;/li>
&lt;li>&lt;a href="http://www.wildml.com/2016/01/attention-and-memory-in-deep-learning-and-nlp/">Attention and Memory in Deep Learning and NLP&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/">Visualizing A Neural Machine Translation Model (Mechanics of Seq2seq Models With Attention)&lt;/a> 👍&lt;/li>
&lt;/ul></description></item><item><title>👍 Transformer</title><link>https://haobin-tan.netlify.app/docs/ai/deep-learning/encoder-decoder/transformer/</link><pubDate>Sun, 23 Aug 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/deep-learning/encoder-decoder/transformer/</guid><description>&lt;h2 id="tldr">TL;DR&lt;/h2>
&lt;figure>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/transformer_improved.png"
alt="Transformer">&lt;figcaption>
&lt;p>Transformer&lt;/p>
&lt;/figcaption>
&lt;/figure>
&lt;h2 id="high-level-look">High-Level Look&lt;/h2>
&lt;p>Let’s begin by looking at the model as a single black box. In a machine translation application, it would take a sentence in one language, and output its translation in another.&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/the_transformer_3.png" alt="img">&lt;/p>
&lt;p>The transformer consists of&lt;/p>
&lt;ul>
&lt;li>an encoding component&lt;/li>
&lt;li>a decoding component&lt;/li>
&lt;li>connections between them&lt;/li>
&lt;/ul>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/The_transformer_encoders_decoders.png" alt="img">&lt;/p>
&lt;p>Let&amp;rsquo;s take a deeper look:&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/The_transformer_encoder_decoder_stack.png" alt="img">&lt;/p>
&lt;ul>
&lt;li>
&lt;p>The encoding component is a &lt;strong>stack of encoders&lt;/strong> (the paper stacks six of them on top of each other – there’s nothing magical about the number six, one can definitely experiment with other arrangements).&lt;/p>
&lt;/li>
&lt;li>
&lt;p>The decoding component is a &lt;strong>stack of the same number of decoders&lt;/strong>.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h3 id="encoder">Encoder&lt;/h3>
&lt;p>The encoders are all &lt;strong>identical&lt;/strong> in structure (yet they do NOT share weights). Each one is composed of two sub-layers:&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/Transformer_encoder.png" alt="img">&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Self-attention layer&lt;/strong>: helps the encoder to look at other words in the input sentence as it encodes a specific word.&lt;/li>
&lt;li>&lt;strong>Feed Forward Neural Network (FFNN)&lt;/strong>: The exact same feed-forward network is &lt;em>independently&lt;/em> applied to each position.&lt;/li>
&lt;/ul>
&lt;h3 id="decoder">Decoder&lt;/h3>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/Transformer_decoder.png" alt="img">&lt;/p>
&lt;p>The decoder has both of those layers, but between them is an &lt;strong>attention layer&lt;/strong> that helps the decoder focus on relevant parts of the input sentence (similar to what attention does in &lt;a href="https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/">seq2seq models&lt;/a>).&lt;/p>
&lt;h2 id="encoding-component">Encoding Component&lt;/h2>
&lt;h3 id="how-tensorsvectors-flow">How tensors/vectors flow&lt;/h3>
&lt;p>As is the case in NLP applications in general, we begin by turning each input word into a vector using an &lt;a href="https://medium.com/deeper-learning/glossary-of-deep-learning-word-embedding-f90c3cec34ca">embedding algorithm&lt;/a>.&lt;/p>
&lt;figure>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/embeddings.png"
alt="Each word is embedded into a vector of size 512. We&amp;amp;rsquo;ll represent those vectors with these simple boxes.">&lt;figcaption>
&lt;p>Each word is embedded into a vector of size 512. We&amp;rsquo;ll represent those vectors with these simple boxes.&lt;/p>
&lt;/figcaption>
&lt;/figure>
&lt;p>Note that the embedding ONLY happens in the &lt;strong>bottom-most&lt;/strong> encoder.&lt;/p>
&lt;p>The abstraction that is common to &lt;strong>all the encoders is that they receive a list of vectors, each of size 512&lt;/strong> (the size of this list is a &lt;em>hyperparameter&lt;/em> we can set; basically it would be the length of the longest sentence in our training dataset).&lt;/p>
&lt;ul>
&lt;li>In bottom encoder: word embeddings&lt;/li>
&lt;li>In other encoders: output of the encoder that’s directly below&lt;/li>
&lt;/ul>
&lt;p>After embedding the words in our input sequence, each of them flows through each of the two layers of the encoder.&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/encoder_with_tensors.png" alt="img" style="zoom:80%;" />
&lt;ul>
&lt;li>The word in each position flows through its own path in the encoder. There are dependencies between these paths in the self-attention layer.&lt;/li>
&lt;li>The feed-forward layer does not have those dependencies, thus the various paths can be executed in &lt;strong>parallel&lt;/strong> while flowing through the feed-forward layer. &amp;#x1f44f;&lt;/li>
&lt;/ul>
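&lt;p>The independence of the feed-forward paths can be checked directly: applying the network to all positions in one call gives the same result as applying it to each position separately. In this sketch the inner size 2048 and the ReLU nonlinearity are taken from the original Transformer paper and are assumptions relative to the text above; the weights are random.&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_model, d_ff = 3, 512, 2048        # d_ff = 2048 is assumed, not stated above
W1, b1 = rng.normal(size=(d_model, d_ff)) * 0.02, np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)) * 0.02, np.zeros(d_model)

def ffn(x):
    """Two linear layers with a ReLU in between, applied along the last axis."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

Z = rng.normal(size=(n, d_model))      # one vector per position (self-attention output)

all_at_once = ffn(Z)                            # every position in one call
one_by_one = np.stack([ffn(z) for z in Z])      # each position separately
```

&lt;p>Since no position reads another position's vector inside &lt;code>ffn&lt;/code>, the per-position calls could run in any order or fully in parallel.&lt;/p>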
&lt;p>In summary, an encoder&lt;/p>
&lt;ol>
&lt;li>receives a list of vectors as input&lt;/li>
&lt;li>processes this list by passing these vectors into a ‘self-attention’ layer, then into a feed-forward neural network&lt;/li>
&lt;li>sends out the output upwards to the next encoder.&lt;/li>
&lt;/ol>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/encoder_with_tensors_2.png" alt="img">&lt;/p>
&lt;h3 id="self-attention">Self-Attention&lt;/h3>
&lt;p>Say the following sentence is an input sentence we want to translate:&lt;/p>
&lt;p>“&lt;code>The animal didn't cross the street because it was too tired&lt;/code>”&lt;/p>
&lt;p>What does “it” in this sentence refer to? Is it referring to the street or to the animal? It’s a simple question to a human, but not as simple to an algorithm.&lt;/p>
&lt;p>As the model processes each word (each position in the input sequence), &lt;strong>self attention allows it to look at other positions in the input sequence for clues that can help lead to a better encoding for this word&lt;/strong>. Therefore, when the model is processing the word “it”, self-attention allows it to associate “it” with “animal”.&lt;/p>
&lt;p>We can think of how maintaining a hidden state allows an RNN to incorporate its representation of previous words/vectors it has processed with the current one it’s processing. &lt;strong>Self-attention is the method the Transformer uses to bake the “understanding” of other relevant words into the one we’re currently processing.&lt;/strong>&lt;/p>
&lt;figure>&lt;img src="https://jalammar.github.io/images/t/transformer_self-attention_visualization.png"
alt="When encoding the word &amp;amp;lsquo;it&amp;amp;rsquo; in encoder #5 (the top encoder in the stack), part of the attention mechanism was focusing on &amp;amp;lsquo;The Animal&amp;amp;rsquo;, and baked a part of its representation into the encoding of &amp;amp;lsquo;it&amp;amp;rsquo;. (Visualization source: Tensor2Tensor notebook)">&lt;figcaption>
&lt;p>When encoding the word &amp;lsquo;it&amp;rsquo; in encoder #5 (the top encoder in the stack), part of the attention mechanism was focusing on &amp;lsquo;The Animal&amp;rsquo;, and baked a part of its representation into the encoding of &amp;lsquo;it&amp;rsquo;. (Visualization source: &lt;a href="https://colab.research.google.com/github/tensorflow/tensor2tensor/blob/master/tensor2tensor/notebooks/hello_t2t.ipynb">Tensor2Tensor notebook&lt;/a>)&lt;/p>
&lt;/figcaption>
&lt;/figure>
&lt;h4 id="self-attention-in-detail">Self-Attention in Detail&lt;/h4>
&lt;p>Calculate self-attention using vectors:&lt;/p>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Create three vectors&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Query vector&lt;/strong>&lt;/li>
&lt;li>&lt;strong>Key vector&lt;/strong>&lt;/li>
&lt;li>&lt;strong>Value vector&lt;/strong>&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>from each of the encoder’s input vectors&lt;/strong> by multiplying the embedding by three matrices that we trained during the training process.&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2020-08-23%2012.23.41.png" alt="截屏2020-08-23 12.23.41">&lt;/p>
&lt;p>These new vectors are smaller in dimension (64) than the embedding vector (512). &lt;em>They don’t have to be smaller; this is an architecture choice that keeps the computation of multi-headed attention (mostly) constant.&lt;/em>&lt;/p>
&lt;/li>
&lt;/ol>
&lt;div class="flex px-4 py-3 mb-6 rounded-md bg-primary-100 dark:bg-primary-900">
&lt;span class="pr-3 pt-1 text-primary-600 dark:text-primary-300">
&lt;svg height="24" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24">&lt;path fill="none" stroke="currentColor" stroke-linecap="round" stroke-linejoin="round" stroke-width="1.5" d="m11.25 11.25l.041-.02a.75.75 0 0 1 1.063.852l-.708 2.836a.75.75 0 0 0 1.063.853l.041-.021M21 12a9 9 0 1 1-18 0a9 9 0 0 1 18 0m-9-3.75h.008v.008H12z"/>&lt;/svg>
&lt;/span>
&lt;span class="dark:text-neutral-300">In the following steps we will keep using the first word &amp;ldquo;Thinking&amp;rdquo; as example.&lt;/span>
&lt;/div>
&lt;ol start="2">
&lt;li>
&lt;p>&lt;strong>Calculate a score&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>The score determines how much focus to place on other parts of the input sentence as we encode a word at a certain position.&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Say we’re calculating the self-attention for the first word in this example, “Thinking”. We need to score each word of the input sentence against this word.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>The score is calculated by taking the dot product of the &lt;span style="color:purple">query vector&lt;/span> with the &lt;span style="color:orange">key vector&lt;/span> of the respective word we’re scoring.&lt;/p>
&lt;ul>
&lt;li>So if we’re processing the self-attention for the word in position #1, the first score would be the dot product of &lt;span style="color:purple">q1&lt;/span> and &lt;span style="color:orange">k1&lt;/span>. The second score would be the dot product of &lt;span style="color:purple">q1&lt;/span> and &lt;span style="color:orange">k2&lt;/span>.&lt;/li>
&lt;/ul>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/transformer_self_attention_score.png" alt="img">&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Divide the scores by the square root of the dimension of the key vectors&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>In the paper, the dimension of the key vectors is 64, so the scores are divided by 8.&lt;/p>
&lt;p>(There could be other possible values here, but this is the default)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>This leads to having more stable gradients. &amp;#x1f44f;&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Pass the result through a softmax operation&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Softmax normalizes the scores so they’re all positive and add up to 1.&lt;/li>
&lt;li>The softmax score determines &lt;strong>how much each word will be expressed at this position&lt;/strong>.
&lt;ul>
&lt;li>The word at this position will usually have the highest softmax score,&lt;/li>
&lt;li>but sometimes it’s useful to attend to another word that is relevant to the current word.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/self-attention_softmax.png" alt="img" style="zoom:80%;" />
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Multiply each &lt;span style="color:#5CBCE9">value vector&lt;/span> by the softmax score&lt;/strong> (in preparation to sum them up)&lt;/p>
&lt;ul>
&lt;li>Keep intact the values of the word(s) we want to focus on&lt;/li>
&lt;li>Drown out irrelevant words &lt;em>(by multiplying them by tiny numbers like 0.001, for example)&lt;/em>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Sum up the weighted value vectors&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>This produces the output of the self-attention layer at this position (for the first word).&lt;/li>
&lt;li>The resulting vector is one we can send along to the feed-forward neural network.&lt;/li>
&lt;/ul>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/self-attention-output.png" alt="img">&lt;/p>
&lt;/li>
&lt;/ol>
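&lt;p>Steps 2 to 6 above can be sketched for a single position in plain Python. This is only a minimal illustration: the query, key, and value vectors are assumed to be already computed (step 1), and the toy numbers and the helper names &lt;code>dot&lt;/code>, &lt;code>softmax&lt;/code>, and &lt;code>attend&lt;/code> are hypothetical, not taken from the paper or the figures.&lt;/p>

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Dot product of two equally long vectors."""
    return sum(x * y for x, y in zip(a, b))

def attend(query, keys, values):
    """Self-attention output for one position (steps 2 to 6)."""
    d_k = len(keys[0])
    # Steps 2-3: score every position against the query, scale by sqrt(d_k).
    scores = [dot(query, k) / math.sqrt(d_k) for k in keys]
    # Step 4: softmax turns the scores into positive weights that sum to 1.
    weights = softmax(scores)
    # Steps 5-6: weight each value vector and sum the weighted vectors.
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output

# Hypothetical toy vectors for a two-word input (key dimension 4, value dimension 2).
q1 = [1.0, 0.0, 1.0, 0.0]
keys = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0]]
weights, z1 = attend(q1, keys, values)
print(weights, z1)
```

&lt;p>With these toy numbers the first word receives the larger weight, so the output vector &lt;code>z1&lt;/code> lies closer to the first value vector than to the second.&lt;/p>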
&lt;h4 id="matrix-calculation-of-self-attention">Matrix Calculation of Self-Attention&lt;/h4>
&lt;p>In the actual implementation, the above calculation is done in matrix form for faster processing.&lt;/p>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Calculate the Query, Key, and Value matrices&lt;/strong>&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2020-08-23%2013.03.34.png" alt="截屏2020-08-23 13.03.34">&lt;/p>
&lt;ul>
&lt;li>Pack our embeddings into a matrix &lt;span style="color:#70BF41">X&lt;/span>&lt;/li>
&lt;li>Multiply it by the weight matrices we’ve trained (&lt;span style="color:#B36AE2">$W^Q$&lt;/span>, &lt;span style="color:#F39019">$W^K$&lt;/span>, &lt;span style="color:#5CBCE9">$W^V$&lt;/span>)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Calculate the outputs of the self-attention layer&lt;/strong>&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/self-attention-matrix-calculation-2.png" alt="img">&lt;/p>
&lt;/li>
&lt;/ol>
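&lt;p>The two matrix steps amount to $Z = \mathrm{softmax}(QK^T / \sqrt{d_k})\,V$. A minimal sketch with plain Python lists follows; the tiny matrices and the helper names are hypothetical, and a real implementation would use a tensor library instead of nested lists.&lt;/p>

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def softmax_rows(m):
    """Apply a numerically stable softmax to every row."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(x - peak) for x in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def self_attention(x, w_q, w_k, w_v):
    # Step 1: project the packed embeddings X into Q, K and V.
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])
    # Step 2: scaled scores, row-wise softmax, weighted sum of the values.
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, transpose(k))]
    return matmul(softmax_rows(scores), v)

# Hypothetical toy sizes: 2 words, embedding size 3, projection size 2.
X = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
W_Q = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
W_K = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
W_V = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]
Z = self_attention(X, W_Q, W_K, W_V)
print(Z)  # one output row per input word
```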
&lt;h3 id="multi-headed-mechanism">&amp;ldquo;Multi-headed&amp;rdquo; Mechanism&lt;/h3>
&lt;p>The paper further refined the self-attention layer by adding a mechanism called &lt;strong>“multi-headed” attention&lt;/strong>. This improves the performance of the attention layer in two ways:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Expands the model’s ability to focus on different positions&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Gives the attention layer multiple “representation subspaces”&lt;/p>
&lt;ul>
&lt;li>
&lt;p>With multi-headed attention we have not only one, but multiple sets of Query/Key/Value weight matrices&lt;/p>
&lt;p>&lt;em>(the Transformer uses eight attention heads, so we end up with eight sets for each encoder/decoder)&lt;/em>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Each of these sets is randomly initialized.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>After training, each set is used to project the input embeddings (or vectors from lower encoders/decoders) into a different representation subspace.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;figure>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/transformer_attention_heads_qkv.png"
alt="With multi-headed attention, we maintain separate $Q$/$K$/$V$ weight matrices for each head resulting in different $Q$/$K$/$V$ matrices. As we did before, we multiply $X$ by the $W^Q$/$W^K$/$W^V$ matrices to produce $Q$/$K$/$V$ matrices.">&lt;figcaption>
&lt;p>With multi-headed attention, we maintain separate $Q$/$K$/$V$ weight matrices for each head resulting in different $Q$/$K$/$V$ matrices. As we did before, we multiply $X$ by the $W^Q$/$W^K$/$W^V$ matrices to produce $Q$/$K$/$V$ matrices.&lt;/p>
&lt;/figcaption>
&lt;/figure>
&lt;/li>
&lt;/ul>
&lt;p>If we do the same self-attention calculation as above, just eight different times with different weight matrices, we end up with eight different $Z$ matrices&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/transformer_attention_heads_z.png" alt="img">&lt;/p>
&lt;p>Since the feed-forward layer is expecting a single matrix (a vector for each word), we concatenate the matrices and then multiply the result by an additional weight matrix $W^O$.&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/transformer_attention_heads_weight_matrix_o.png" alt="img">&lt;/p>
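&lt;p>The concatenate-then-project step can be sketched as follows. The per-head $Z$ matrices are assumed to be already computed, and the sizes here (2 heads, 2 words, head size 2, output size 3) are hypothetical toy values; with the sizes quoted above there would be eight heads of size 64, concatenated to 512 values per word.&lt;/p>

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def concat_heads(heads):
    """Concatenate the per-head Z matrices along the feature axis."""
    rows = []
    for per_word in zip(*heads):  # one row from every head, for the same word
        merged = []
        for row in per_word:
            merged.extend(row)
        rows.append(merged)
    return rows

# Hypothetical outputs of two attention heads for a two-word input.
Z_heads = [
    [[1.0, 2.0], [3.0, 4.0]],  # head 0
    [[5.0, 6.0], [7.0, 8.0]],  # head 1
]
# Hypothetical W^O: maps the 4 concatenated features to 3 output features.
W_O = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 1.0],
]
Z = matmul(concat_heads(Z_heads), W_O)
print(Z)  # one row per word, ready for the feed-forward layer
```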
&lt;h4 id="summarize-them-into-a-figure">Summarize them into a figure&lt;/h4>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/transformer_multi-headed_self-attention-recap.png" alt="img">&lt;/p>
&lt;h4 id="example">Example&lt;/h4>
&lt;p>Let’s revisit our example from before to see where the different attention heads are focusing as we encode the word “it” in our example sentence:&lt;/p>
&lt;figure>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/transformer_self-attention_visualization_2.png"
alt="As we encode the word &amp;amp;lsquo;it&amp;amp;rsquo;, one attention head is focusing most on &amp;amp;rsquo;the animal&amp;amp;rsquo;, while another is focusing on &amp;amp;rsquo;tired&amp;amp;rsquo; &amp;amp;ndash; in a sense, the model&amp;amp;rsquo;s representation of the word &amp;amp;lsquo;it&amp;amp;rsquo; bakes in some of the representation of both &amp;amp;lsquo;animal&amp;amp;rsquo; and &amp;amp;rsquo;tired&amp;amp;rsquo;. (Visualization source: Tensor2Tensor notebook)">&lt;figcaption>
&lt;p>As we encode the word &amp;lsquo;it&amp;rsquo;, one attention head is focusing most on &amp;rsquo;the animal&amp;rsquo;, while another is focusing on &amp;rsquo;tired&amp;rsquo; &amp;ndash; in a sense, the model&amp;rsquo;s representation of the word &amp;lsquo;it&amp;rsquo; bakes in some of the representation of both &amp;lsquo;animal&amp;rsquo; and &amp;rsquo;tired&amp;rsquo;. (Visualization source: &lt;a href="https://colab.research.google.com/github/tensorflow/tensor2tensor/blob/master/tensor2tensor/notebooks/hello_t2t.ipynb">Tensor2Tensor notebook&lt;/a>)&lt;/p>
&lt;/figcaption>
&lt;/figure>
&lt;h3 id="representing-the-order-of-the-sequence-using-positional-encoding">Representing The Order of The Sequence Using Positional Encoding&lt;/h3>
&lt;p>In order to represent the order of the words in the input sequence, the transformer adds a vector to each input embedding.&lt;/p>
&lt;ul>
&lt;li>These vectors follow a specific pattern (fixed sinusoids in the paper) that the model learns to use, which helps it determine the position of each word, or the distance between different words in the sequence.&lt;/li>
&lt;li>💡 Intuition: adding these values to the embeddings provides meaningful distances between the embedding vectors once they’re projected into $Q$/$K$/$V$ vectors and during dot-product attention.&lt;/li>
&lt;/ul>
&lt;figure>&lt;img src="https://jalammar.github.io/images/t/transformer_positional_encoding_vectors.png"
alt="To give the model a sense of the order of the words, we add positional encoding vectors &amp;amp;ndash; the values of which follow a specific pattern.">&lt;figcaption>
&lt;p>To give the model a sense of the order of the words, we add positional encoding vectors &amp;ndash; the values of which follow a specific pattern.&lt;/p>
&lt;/figcaption>
&lt;/figure>
&lt;p>For instance, if we assumed the embedding has a dimensionality of 4, the actual positional encodings would look like this:&lt;/p>
&lt;figure>&lt;img src="https://jalammar.github.io/images/t/transformer_positional_encoding_example.png"
alt="A real example of positional encoding with a toy embedding size of 4">&lt;figcaption>
&lt;p>A real example of positional encoding with a toy embedding size of 4&lt;/p>
&lt;/figcaption>
&lt;/figure>
&lt;p>What might this pattern look like?&lt;/p>
&lt;p>In the following figure, each row corresponds to the positional encoding of one position in the sequence.&lt;/p>
&lt;ul>
&lt;li>The first row would be the vector we’d add to the embedding of the first word in an input sequence.&lt;/li>
&lt;li>Each row contains 512 values, each between -1 and 1.&lt;/li>
&lt;/ul>
&lt;figure>&lt;img src="https://jalammar.github.io/images/t/transformer_positional_encoding_large_example.png"
alt="A real example of positional encoding for 20 words (rows) with an embedding size of 512 (columns). You can see that it appears split in half down the center. That&amp;amp;rsquo;s because the values of the left half are generated by one function (which uses sine), and the right half is generated by another function (which uses cosine). They&amp;amp;rsquo;re then concatenated to form each of the positional encoding vectors.">&lt;figcaption>
&lt;p>A real example of positional encoding for 20 words (rows) with an embedding size of 512 (columns). You can see that it appears split in half down the center. That&amp;rsquo;s because the values of the left half are generated by one function (which uses sine), and the right half is generated by another function (which uses cosine). They&amp;rsquo;re then concatenated to form each of the positional encoding vectors.&lt;/p>
&lt;/figcaption>
&lt;/figure>
&lt;p>&lt;strong>July 2020 Update:&lt;/strong> The positional encoding shown above is from the Tensor2Tensor implementation of the Transformer. The method shown in the paper is slightly different in that it doesn’t directly concatenate, but interweaves the two signals. The following figure shows what that looks like.&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/attention-is-all-you-need-positional-encoding.png" alt="img">&lt;/p>
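&lt;p>A minimal sketch of the interleaved variant, following the formulas given in the paper: $PE(pos, 2i) = \sin(pos / 10000^{2i/d_{model}})$ and $PE(pos, 2i+1) = \cos(pos / 10000^{2i/d_{model}})$. The function name &lt;code>positional_encoding&lt;/code> is an assumption made for this illustration.&lt;/p>

```python
import math

def positional_encoding(num_positions, d_model):
    """Interleaved sinusoidal encoding: sine on even, cosine on odd dimensions."""
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(d_model):
            # Dimensions 2k and 2k+1 share the same frequency.
            angle = pos / (10000 ** ((i - i % 2) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

# 20 positions with an embedding size of 512, as in the figures above.
pe = positional_encoding(20, 512)
print(pe[0][:4])  # position 0 alternates sin(0) and cos(0)
```

&lt;p>Each row of &lt;code>pe&lt;/code> is the vector that would be added to the embedding at that position.&lt;/p>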
&lt;h3 id="the-residuals">The Residuals&lt;/h3>
&lt;p>Each sub-layer (self-attention, FFNN) in each encoder has a residual connection around it, and is followed by a &lt;a href="https://arxiv.org/abs/1607.06450">layer-normalization&lt;/a> step.&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/transformer_resideual_layer_norm.png" alt="img" style="zoom:80%;" />
&lt;p>Visualizing the vectors and the layer-norm operation associated with self-attention:&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/transformer_resideual_layer_norm_2.png" alt="img" style="zoom:80%;" />
&lt;p>&lt;strong>This goes for the sub-layers of the decoder as well.&lt;/strong>&lt;/p>
&lt;p>If we’re to think of a Transformer of 2 stacked encoders and decoders, it would look something like this:&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/transformer_resideual_layer_norm_3.png" alt="img">&lt;/p>
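&lt;p>The “add and normalize” step computes $\mathrm{LayerNorm}(x + \mathrm{Sublayer}(x))$ for each position. The sketch below is a simplification under stated assumptions: it omits the learned gain and bias of layer normalization, and the sub-layer is a hypothetical stand-in function rather than real self-attention or a real feed-forward network.&lt;/p>

```python
import math

def layer_norm(vec, eps=1e-5):
    """Normalize one vector to zero mean and (approximately) unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]

def add_and_norm(x, sublayer):
    """Residual connection around a sub-layer, followed by layer normalization."""
    y = sublayer(x)
    return layer_norm([a + b for a, b in zip(x, y)])

# Hypothetical stand-in for a sub-layer: scales its input by 0.5.
out = add_and_norm([1.0, 2.0, 3.0, 4.0], lambda v: [0.5 * a for a in v])
print(out)
```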
&lt;h2 id="decoding-component">Decoding Component&lt;/h2>
&lt;figure>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/transformer_decoding_1.gif"
alt="After finishing the encoding phase, we begin the decoding phase. Each step in the decoding phase outputs an element from the output sequence (the English translation sentence in this case).">&lt;figcaption>
&lt;p>After finishing the encoding phase, we begin the decoding phase. Each step in the decoding phase outputs an element from the output sequence (the English translation sentence in this case).&lt;/p>
&lt;/figcaption>
&lt;/figure>
&lt;ul>
&lt;li>The encoder starts by processing the input sequence.&lt;/li>
&lt;li>The output of the top encoder is then transformed into a set of attention vectors $K$ and $V$.&lt;/li>
&lt;li>These are used by each decoder in its “encoder-decoder attention” layer, which helps the decoder focus on appropriate places in the input sequence.&lt;/li>
&lt;/ul>
&lt;p>The following steps repeat the process until a special symbol &lt;code>&amp;lt;eos&amp;gt;&lt;/code> is reached, indicating the transformer decoder has completed its output. &lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/transformer_decoding_2-20200823133024688.gif" alt="img">&lt;/p>
&lt;ul>
&lt;li>The output of each step is fed to the bottom decoder in the next time step.&lt;/li>
&lt;li>The decoders bubble up their decoding results just like the encoders did.
&lt;ul>
&lt;li>Just like we did with the encoder inputs, we embed and add positional encoding to those decoder inputs to indicate the position of each word.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;p>Note that the self-attention layers in the decoder operate in a slightly different way than those in the encoder: &lt;strong>In the decoder, the self-attention layer is ONLY allowed to attend to earlier positions in the output sequence.&lt;/strong>&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-04-10%2014.50.13.png" alt="截屏2021-04-10 14.50.13">&lt;/p>
&lt;ul>
&lt;li>
&lt;p>This can be done by &lt;strong>masking future positions&lt;/strong> (setting them to &lt;code>-inf&lt;/code>) before the softmax step in the self-attention calculation.&lt;/p>
&lt;figure>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-04-10%2014.51.48.png"
alt="The attention weights for &amp;amp;lsquo;am&amp;amp;rsquo; have values for itself and all words before it, but are zero for the word &amp;amp;lsquo;fine&amp;amp;rsquo;. This essentially tells the model to put NO focus on the word &amp;amp;lsquo;fine&amp;amp;rsquo;.">&lt;figcaption>
&lt;p>The attention weights for &amp;lsquo;am&amp;rsquo; have values for itself and all words before it, but are zero for the word &amp;lsquo;fine&amp;rsquo;. This essentially tells the model to put NO focus on the word &amp;lsquo;fine&amp;rsquo;.&lt;/p>
&lt;/figcaption>
&lt;/figure>
&lt;/li>
&lt;/ul>
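&lt;p>The masking step can be sketched as follows. The raw scores are hypothetical (all zeros, for three output words such as “I am fine”), and the helper names are assumptions made for this illustration.&lt;/p>

```python
import math

def causal_mask(scores):
    """Replace scores[i][j] with -inf wherever j is a future position (j > i)."""
    return [[-math.inf if j > i else s for j, s in enumerate(row)]
            for i, row in enumerate(scores)]

def softmax_rows(m):
    """Row-wise softmax; exp(-inf) evaluates to 0.0, so masked cells get no weight."""
    out = []
    for row in m:
        peak = max(row)  # the diagonal is never masked, so peak is finite
        exps = [math.exp(x - peak) for x in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

# Hypothetical raw scores for a three-word output sequence.
scores = [[0.0, 0.0, 0.0] for _ in range(3)]
weights = softmax_rows(causal_mask(scores))
for row in weights:
    print(row)
```

&lt;p>The second row puts weight only on the first two positions, which matches the figure: no focus on the word that comes later.&lt;/p>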
&lt;h2 id="the-final-linear-and-softmax-layer">The Final Linear and Softmax Layer&lt;/h2>
&lt;p>The final Linear layer + Softmax layer: &lt;strong>Turn a vector of floats (the output of the decoder) into a word&lt;/strong>&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/transformer_decoder_output_softmax.png" alt="img">&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Linear layer&lt;/strong>: a simple fully connected neural network that projects the vector produced by the stack of decoders, into a much, much larger vector called a &lt;strong>logits vector&lt;/strong>
&lt;ul>
&lt;li>Let’s assume that our model knows 10,000 unique English words (our model’s “&lt;strong>output vocabulary&lt;/strong>”) that it’s learned from its training dataset. This would make the logits vector 10,000 cells wide – each cell corresponding to the score of a unique word.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Softmax layer&lt;/strong>
&lt;ul>
&lt;li>Turns those scores into probabilities (all positive, all add up to 1.0).&lt;/li>
&lt;li>The cell with the &lt;strong>highest probability&lt;/strong> is chosen, and the word associated with it is produced as the output for this time step.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
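&lt;p>A minimal sketch of these two layers, using a hypothetical three-word vocabulary and hand-picked weights instead of the 10,000-word example. One weight vector per vocabulary word plays the role of the fully connected layer.&lt;/p>

```python
import math

def linear(hidden, weights, bias):
    """Fully connected layer: one logit per vocabulary word."""
    return [sum(h * w for h, w in zip(hidden, row)) + b
            for row, b in zip(weights, bias)]

def softmax(logits):
    """Turn logits into probabilities (all positive, summing to 1)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical toy setup: decoder output of size 2, vocabulary of 3 words.
vocab = ["i", "am", "student"]
hidden = [1.0, -1.0]
weights = [[0.5, 0.5], [2.0, 0.0], [0.0, 1.0]]
bias = [0.0, 0.0, 0.0]

logits = linear(hidden, weights, bias)
probs = softmax(logits)
word = vocab[probs.index(max(probs))]  # pick the cell with the highest probability
print(logits, word)
```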
&lt;h2 id="training">Training&lt;/h2>
&lt;h3 id="word-representation">Word Representation&lt;/h3>
&lt;p>During training, an untrained model would go through the exact same forward pass. Since we are training it on a labeled training dataset, we can compare its output with the actual correct output.&lt;/p>
&lt;p>To visualize this, let’s assume our output vocabulary only contains six words: “a”, “am”, “i”, “thanks”, “student”, and “&lt;code>&amp;lt;eos&amp;gt;&lt;/code>” (short for ‘end of sentence’).&lt;/p>
&lt;figure>&lt;img src="https://jalammar.github.io/images/t/vocabulary.png"
alt="The output vocabulary of our model is created in the preprocessing phase before we even begin training.">&lt;figcaption>
&lt;p>The output vocabulary of our model is created in the preprocessing phase before we even begin training.&lt;/p>
&lt;/figcaption>
&lt;/figure>
&lt;p>After defining the output vocabulary, we use &lt;strong>one-hot encoding&lt;/strong> to represent each word in our vocabulary. E.g., we can represent the word “am” using the following vector:&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/one-hot-vocabulary-example.png" alt="img">&lt;/p>
&lt;h3 id="the-loss-function">The Loss Function&lt;/h3>
&lt;p>Say it’s our &lt;strong>first&lt;/strong> step in the training phase, and we’re training it on a simple example – translating “merci” into “thanks”.&lt;/p>
&lt;p>We want the output to be a probability distribution indicating the word “thanks”. But since this model is not yet trained, that’s unlikely to happen just yet.&lt;/p>
&lt;figure>&lt;img src="https://jalammar.github.io/images/t/transformer_logits_output_and_label.png"
alt="Since the model&amp;amp;rsquo;s parameters (weights) are all initialized randomly, the (untrained) model produces a probability distribution with arbitrary values for each cell/word. We can compare it with the actual output, then tweak all the model&amp;amp;rsquo;s weights using backpropagation to make the output closer to the desired output.">&lt;figcaption>
&lt;p>Since the model&amp;rsquo;s parameters (weights) are all initialized randomly, the (untrained) model produces a probability distribution with arbitrary values for each cell/word. We can compare it with the actual output, then tweak all the model&amp;rsquo;s weights using backpropagation to make the output closer to the desired output.&lt;/p>
&lt;/figcaption>
&lt;/figure>
&lt;p>How do we compare two probability distributions? As a first intuition, &lt;strong>simply subtract one from the other&lt;/strong>. (For more details, look at &lt;a href="https://colah.github.io/posts/2015-09-Visual-Information/">cross-entropy&lt;/a> and &lt;a href="https://www.countbayesie.com/blog/2017/5/9/kullback-leibler-divergence-explained">Kullback–Leibler divergence&lt;/a>.)&lt;/p>
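&lt;p>For the one-word example (target “thanks”), a cross-entropy comparison could be sketched like this. The two predicted distributions are made-up numbers, and the string &lt;code>"eos"&lt;/code> stands in for the end-of-sentence symbol of the six-word toy vocabulary.&lt;/p>

```python
import math

def one_hot(word, vocab):
    """One-hot target vector: 1.0 at the word's cell, 0.0 elsewhere."""
    return [1.0 if w == word else 0.0 for w in vocab]

def cross_entropy(target, predicted):
    """-sum(target * log(predicted)); cells with target 0 contribute nothing."""
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0)

vocab = ["a", "am", "i", "thanks", "student", "eos"]
target = one_hot("thanks", vocab)

# Hypothetical model outputs before and after training.
untrained = [0.2, 0.2, 0.1, 0.2, 0.2, 0.1]
trained = [0.01, 0.01, 0.01, 0.95, 0.01, 0.01]

print(cross_entropy(target, untrained))  # larger loss
print(cross_entropy(target, trained))    # smaller loss
```

&lt;p>Training adjusts the weights so that this loss shrinks, which pushes the probability of the correct word towards 1.&lt;/p>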
&lt;p>Note that the example above is an oversimplified example.&lt;/p>
&lt;p>In practice, we’ll use a sentence longer than one word. For example&lt;/p>
&lt;ul>
&lt;li>input: “je suis étudiant” and&lt;/li>
&lt;li>expected output: “i am a student”.&lt;/li>
&lt;/ul>
&lt;p>What this really means, is that we want our model to successively output probability distributions where:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Each probability distribution is represented by a vector of width vocab_size (6 in our toy example, but more realistically a number like 30,000 or 50,000)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>The first probability distribution has the highest probability at the cell associated with the word “i”&lt;/p>
&lt;/li>
&lt;li>
&lt;p>The second probability distribution has the highest probability at the cell associated with the word “am”&lt;/p>
&lt;/li>
&lt;li>
&lt;p>And so on, until the fifth output distribution indicates the &lt;code>&amp;lt;eos&amp;gt;&lt;/code> (end of sentence) symbol, which also has a cell associated with it in the vocabulary.&lt;/p>
&lt;figure>&lt;img src="https://jalammar.github.io/images/t/output_target_probability_distributions.png"
alt="The targeted probability distributions we&amp;amp;rsquo;ll train our model against in the training example for one sample sentence.">&lt;figcaption>
&lt;p>The targeted probability distributions we&amp;rsquo;ll train our model against in the training example for one sample sentence.&lt;/p>
&lt;/figcaption>
&lt;/figure>
&lt;/li>
&lt;/ul>
&lt;p>After training the model for enough time on a large enough dataset, we would hope the produced probability distributions would look like this:&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/output_trained_model_probability_distributions.png" alt="img">&lt;/p>
&lt;p>Now, because the model produces the outputs one at a time, we can assume that the model is selecting the word with the highest probability from that probability distribution and throwing away the rest (&lt;strong>&amp;ldquo;greedy decoding&amp;rdquo;&lt;/strong>).&lt;/p>
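&lt;p>Greedy decoding can be sketched as a simple loop. The “model” here is a hypothetical lookup table of hand-written distributions rather than a real decoder, and the string &lt;code>"eos"&lt;/code> stands in for the end-of-sentence symbol.&lt;/p>

```python
def greedy_decode(next_distribution, vocab, eos, max_steps=10):
    """Pick the most probable word at every step until the end symbol appears."""
    output = []
    for _ in range(max_steps):
        probs = next_distribution(output)
        word = vocab[probs.index(max(probs))]  # keep the best word, drop the rest
        if word == eos:
            break
        output.append(word)
    return output

vocab = ["a", "am", "i", "thanks", "student", "eos"]

# Hypothetical stand-in for a trained model: one distribution per decoding step.
table = [
    [0.01, 0.02, 0.93, 0.01, 0.02, 0.01],  # step 1: "i"
    [0.01, 0.93, 0.02, 0.01, 0.02, 0.01],  # step 2: "am"
    [0.93, 0.01, 0.02, 0.01, 0.02, 0.01],  # step 3: "a"
    [0.01, 0.01, 0.02, 0.01, 0.94, 0.01],  # step 4: "student"
    [0.01, 0.01, 0.02, 0.01, 0.02, 0.93],  # step 5: end of sentence
]

result = greedy_decode(lambda words: table[len(words)], vocab, "eos")
print(result)
```

&lt;p>In a real decoder, &lt;code>next_distribution&lt;/code> would run the decoder stack on the words produced so far; the lookup table only mimics that interface.&lt;/p>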
&lt;h2 id="reference">Reference&lt;/h2>
&lt;ul>
&lt;li>&lt;a href="https://jalammar.github.io/illustrated-transformer/">The Illustrated Transformer&lt;/a> - great explanation with a number of illustrations 👍🔥&lt;/li>
&lt;li>Paper: &lt;a href="https://arxiv.org/abs/1706.03762">Attention is All You Need&lt;/a>&lt;/li>
&lt;li>PyTorch implementation: &lt;a href="http://nlp.seas.harvard.edu/2018/04/03/attention.html">guide annotating the paper with PyTorch implementation&lt;/a>&lt;/li>
&lt;/ul></description></item></channel></rss>