Sequence to Sequence

Language Modeling

A language model is a model that computes the probability of a word sequence:

$$
\begin{aligned}
P(W) &= P(W_1 W_2 \dots W_n) \\
&= P(W_1)\, P(W_2 \mid W_1)\, P(W_3 \mid W_1 W_2) \cdots P(W_n \mid W_{1 \dots n-1})
\end{aligned}
$$
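As a small worked illustration (the example sentence is not from the notes), a three-word sequence factorizes as:

$$P(\text{the cat sat}) = P(\text{the}) \cdot P(\text{cat} \mid \text{the}) \cdot P(\text{sat} \mid \text{the, cat})$$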

(Screenshot 2020-08-22 10.19.22)

  • Softmax layer

    • After linearly mapping the hidden layer $H$, a “score” vector $O = [O_1, O_2, \dots, O_{n-1}, O_n]$ is obtained

    • The softmax function normalizes $O$ to get probabilities:

      $$P_i = \frac{\exp(O_i)}{\sum_j \exp(O_j)}$$
  • Cross-Entropy Loss

    • Use a one-hot vector $Y = [0,0,0,0,0,\dots,1,0,0]$ as the label to train the model

    • Cross-entropy loss: measures the difference between the predicted distribution and the label

      $$L_{CE} = -\sum_i Y_i \log(P_i)$$
    • When $Y$ is a one-hot vector:

      $$L_{CE} = -\log P_j \quad (j: \text{the index of the correct word})$$
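A minimal NumPy sketch of these two steps; the score values and the label index are made up for illustration:

```python
import numpy as np

def softmax(o):
    # Subtract the max for numerical stability; the result sums to 1
    e = np.exp(o - np.max(o))
    return e / e.sum()

scores = np.array([1.0, 0.5, 3.0, -1.0])   # illustrative score vector O
label_index = 2                            # position of the 1 in the one-hot label Y

p = softmax(scores)
loss = -np.log(p[label_index])             # L_CE = -log P_j for a one-hot label
print(p, loss)
```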

Training

Force the model to “fit” known text sequences (teacher forcing / memorization)

  • $W = w_1 w_2 w_3 \dots w_{N-2} w_{N-1} w_N$
  • Input: $w_1 w_2 w_3 \dots w_{N-2} w_{N-1}$
  • Output: $w_2 w_3 \dots w_{N-2} w_{N-1} w_N$
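A minimal sketch of how these shifted input/output pairs can be built from a token sequence (the token ids are made up):

```python
# W = w_1 ... w_N as a list of token ids (illustrative values)
tokens = [5, 12, 7, 3, 9]

inputs  = tokens[:-1]   # w_1 ... w_{N-1}
targets = tokens[1:]    # w_2 ... w_N: the "next word" at every position
```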

During generation, the model uses its own output at time step $t-1$ as the input for time step $t$ (autoregressive, or sampling mode).

RNN Language Modeling

(Screenshot 2020-08-22 10.37.53)

Basic step-wise operation

(Screenshot 2020-08-22 11.29.51)
  • Input: $x_t = [0, 0, 0, 1, \dots]$

  • Label: $y_t = [0, 0, 1, 0, \dots]$

  • Input to word embeddings:

    $$e_t = x_t \cdot W_{emb}^T$$
  • Hidden layer:

    $$h_t = \operatorname{LSTM}(e_t, (h_{t-1}, c_{t-1}))$$
  • Output:

    $$o_t = h_t \cdot W_{out}^T$$
  • Softmax:

    $$p_t = \operatorname{softmax}(o_t)$$
  • Loss: Cross Entropy

    $$L_t = -\sum_i y_{t_i} \log p_{t_i}$$

    ($y_{t_i}$: the $i$-th element of $y_t$)

    > $\frac{d L_t}{d o_t} = p_t - y_t$
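A minimal PyTorch sketch of one such step; the vocabulary, embedding, and hidden sizes are illustrative, and this is only a sketch of the equations above, not a full training setup:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

V, E, H = 100, 32, 64                   # vocab, embedding, hidden sizes (illustrative)
W_emb = nn.Embedding(V, E)              # word embeddings
lstm  = nn.LSTMCell(E, H)               # one recurrent step
W_out = nn.Linear(H, V)                 # output projection

x_t = torch.tensor([3])                 # index form of the one-hot input x_t
y_t = torch.tensor([2])                 # index of the correct next token
h_prev, c_prev = torch.zeros(1, H), torch.zeros(1, H)

e_t = W_emb(x_t)                        # e_t = x_t . W_emb^T
h_t, c_t = lstm(e_t, (h_prev, c_prev))  # h_t = LSTM(e_t, (h_{t-1}, c_{t-1}))
o_t = W_out(h_t)                        # o_t = h_t . W_out^T
p_t = F.softmax(o_t, dim=-1)            # p_t = softmax(o_t)
loss = F.cross_entropy(o_t, y_t)        # L_t = -sum_i y_{t_i} log p_{t_i}
```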

Output layer

Because of the softmax, the output layer is a distribution over the vocabulary (the probability of each vocabulary item given the context).

“Teacher-Forcing” method

  • We do not explicitly tell the model to generate “all”
  • But applying the cross-entropy loss forces the output distribution to favour “all”
(Screenshot 2020-08-22 11.33.44)

Backpropagation in the model

  • Gradient of the loss function

    $$\frac{d L_t}{d o_t} = p_t - y_t$$
  • $y_t$ in the character-level language model is a one-hot vector

    $\to$ The gradient is positive in every non-label position and negative in the label position (a small numerical check follows at the end of this section)

    $p_{t_i} \in (0, 1)$

    Assume the label is in the $i$-th position.

    • In non-label position:

      > $\forall j \neq i: \quad y_{t_j} = 0$
      > $\Rightarrow p_{t_j} - y_{t_j} = p_{t_j} - 0 = p_{t_j} > 0$
    • In label position:

      > $y_{t_i} = 1 \Rightarrow p_{t_i} - y_{t_i} = p_{t_i} - 1 < 0$
    (Screenshot 2020-08-22 11.38.24)
  • Gradient at the hidden layer:

    (Screenshot 2020-08-22 11.45.49)

    • The hidden layer $h_t$ receives two sources of gradients:
      • the gradient coming from the loss of the current state
      • the gradient carried over from the future
    • Similarly, the memory cell $c_t$:
      • receives the gradient from $h_t$
      • sums that gradient with the gradient coming from the future
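As referenced above, a small autograd sketch that numerically checks the $\frac{d L_t}{d o_t} = p_t - y_t$ identity (the sizes and the label position are made up):

```python
import torch
import torch.nn.functional as F

o_t = torch.randn(5, requires_grad=True)          # illustrative logits
label = torch.tensor(2)                           # index of the correct token

loss = F.cross_entropy(o_t.unsqueeze(0), label.unsqueeze(0))
loss.backward()

p_t = F.softmax(o_t, dim=-1)
y_t = F.one_hot(label, num_classes=5).float()
print(torch.allclose(o_t.grad, p_t - y_t))        # True: dL/do_t equals p_t - y_t
```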

Generate a New Sequence

(Screenshot 2020-08-22 11.47.49)
  • Start from a memory state of the RNN (often initialized as zeros)

  • Run the network through a seed

  • Generate the probability distribution given the seed and the memory state

  • “Sample” a new token from the distribution

  • Use the new token as the seed and carry the new memory over to keep generating
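A minimal sampling-loop sketch, assuming a trained `W_emb`, `lstm` (an `nn.LSTMCell`), and `W_out` as in the step-wise sketch above; all names are illustrative:

```python
import torch
import torch.nn.functional as F

def generate(seed_ids, n_steps, W_emb, lstm, W_out, hidden_size):
    h = torch.zeros(1, hidden_size)            # start from a zero memory state
    c = torch.zeros(1, hidden_size)
    for idx in seed_ids:                       # run the network through the seed
        e = W_emb(torch.tensor([idx]))
        h, c = lstm(e, (h, c))
    out = []
    for _ in range(n_steps):
        p = F.softmax(W_out(h), dim=-1)        # distribution given seed + memory
        token = torch.multinomial(p, 1).item() # "sample" a new token
        out.append(token)
        e = W_emb(torch.tensor([token]))       # use the new token as the seed
        h, c = lstm(e, (h, c))                 # carry the new memory over
    return out
```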

Sequence-to-Sequence Models

💡 Main idea: from $P(W)$ to $P(W \mid C)$, with $C$ being a context

Controlling Generation with RNN

Hints:

  • The generated sequence depends on

    • The first input(s)

    • The first recurrent hidden state

      (Screenshot 2020-08-22 12.11.10)
  • The final hidden state of the network, after rolling over the whole sequence, contains compressed information about the sequence

    (Screenshot 2020-08-22 12.11.19)

Sequence-to-Sequence problem

  • Given: a sequence $X$ of variable length (the source sequence)

    $$X = (X_1, X_2, \dots, X_m)$$
  • Task: generate a new sequence $Y$ that has the same content

    $$Y = (Y_1, Y_2, \dots, Y_n)$$
  • Training:

    • Given a dataset $D$ containing pairs of parallel sentences $(X, Y)$

    • Training objective

      $$\log P(Y^{*} \mid X^{*}) \qquad \forall\, (Y^{*}, X^{*}) \in D$$
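Under the usual autoregressive factorization (the same chain rule as in the language-model section above), this objective decomposes per target token:

$$\log P(Y^{*} \mid X^{*}) = \sum_{t=1}^{n} \log P\left(Y_t^{*} \mid Y_{1 \dots t-1}^{*}, X^{*}\right)$$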

Encoder-Decoder model

(Screenshot 2020-08-22 12.20.13)

Encoder

(Screenshot 2020-08-22 12.20.49)
  • Transforms the input into a neural representation

  • Input

    • Discrete variables: require an embedding to be “compatible” with neural networks
    • Continuous variables: can be “raw features” of the input
      • Speech signals
      • Image pixels
  • The encoder represents the $X$ part

  • The generative chain does not include any generation from the encoder
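A minimal PyTorch sketch of such an encoder for discrete inputs; the class and size names are illustrative, not from the notes:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, emb_size, hidden_size):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_size)     # discrete tokens -> continuous vectors
        self.lstm = nn.LSTM(emb_size, hidden_size, batch_first=True)

    def forward(self, x):                                 # x: (batch, src_len) token ids
        states, (h_n, c_n) = self.lstm(self.emb(x))
        # states: one hidden state per source position; (h_n, c_n): final compressed summary
        return states, (h_n, c_n)
```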

Decoder

(Screenshot 2020-08-22 12.21.56)
  • The encoder gives the decoder a representation of the source input
  • The decoder would try to “decode” the information from that representation
  • Key idea: the encoder representation is the $H_0$ of the decoder network
  • The operation is identical to the character-based language model
    • Backpropagation through time provides the gradient w.r.t. any hidden state
      • For the decoder: BPTT is identical to the language-model case
      • For the encoder: at each step, the encoder hidden state has only one source of gradient
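A minimal decoder sketch in the same style, assuming the encoder sketch above and a shared hidden size; the encoder's final state $(h_n, c_n)$ plays the role of $H_0$, and `y_in` holds the teacher-forced (shifted) target tokens:

```python
import torch
import torch.nn as nn

class Decoder(nn.Module):
    def __init__(self, vocab_size, emb_size, hidden_size):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_size)
        self.lstm = nn.LSTM(emb_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, y_in, enc_state):
        # enc_state = (h_n, c_n) from the encoder acts as H_0 of the decoder
        h, _ = self.lstm(self.emb(y_in), enc_state)
        return self.out(h)          # logits over the target vocabulary at every step
```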

Problem with Encoder-Decoder Model

The model worked well for translating short sentences, but long sentences (more than 10 words) suffered greatly 😢

🔴 Main problem:

  • Long-range dependency
  • Gradient starving

A funny-trick solution: reverse the source sentence so that the sentence-initial words end up closer together (pretty “hacky” 😈).

Observation for solving this problem

Each word in the source sentence can be aligned with some words in the target sentence. (“alignment” in Machine Translation)

We can try to establish a connection between the aligned words.

(Screenshot 2020-08-22 12.37.16)

Seq2Seq with Attention

As mentioned, we want to find the alignment between the decoder and the encoder. However, our decoder only looks at the final, compressed encoder state to find the information.

Therefore we have to modify the decoder to do better! 💪

We will use superscripts $e$ and $d$ to distinguish the encoder and the decoder.

E.g.:

  • $H_j^e$: the $j$-th hidden state of the encoder
  • $H_j^d$: the $j$-th hidden state of the decoder
(Screenshot 2020-08-22 16.43.12)
  1. Run the encoder LSTM through the input sentence (read the sentence and encode it into states)
  2. The LSTM operation gives us some assumptions about $H_j^e$
  3. The state $H_j^e$ contains information about
    • the word $W_j$ (because of the input gate)
    • some information about the surrounding words (because of the memory)

Now we start generating the translation with the decoder.

(Screenshot 2020-08-22 16.43.19)
  1. The LSTM (always) consumes the EOS token, and the hidden states are copied over from the encoder to get the first hidden state $H_0$

  2. From $H_0$ we have to generate the first word, and we need to look back to the encoder.

    • Here we ask the question:

      “Which word is responsible for generating the first word?”

    • However, it’s not easy to answer this question

      • First we don’t know 😭
      • Second there might be more than one relevant word (like when we translate phrases or compound words) 🤪

The best way to find out is: to check all of the words! 💪

(Screenshot 2020-08-22 16.43.32)
  1. $H_0$ has to connect to all $H_i^e$ on the encoder side for querying

  2. Each “connection” will return a score $\alpha_i$ indicating how relevant $H_i^e$ is to generate the translation from $H_0$

    • Generating $\alpha_i^0$:

      1. A feed-forward neural network with a nonlinearity:

         $$\alpha_i^0 = W_2 \cdot \tanh\left(W_1 \cdot [H_0, H_i^e] + b_1\right)$$
      2. Use softmax to get probabilities: $\alpha \leftarrow \operatorname{softmax}(\alpha)$
  3. The higher $\alpha_i$ is, the more relevant the state $H_i^e$ is

(Screenshot 2020-08-22 16.43.42)

After $H_0$ asks “everyone” in the encoder, it needs to sum up the information:

$$C_0 = \sum_i \alpha_i^0 H_i^e$$

($C_0$ is the summary of the information in the encoder that is most relevant to $H_0$ for generating the first word of the decoder)

  1. Now we can answer the question: “Which word is responsible for generating the first word?”
(Screenshot 2020-08-22 16.43.49)

The answer is: the words with the highest $\alpha$ coefficients

  • Combine $H_0$ with $C_0$
  • Generate the softmax output $P_0$ from $\hat{H}_0$
  • Go to the next step

In general, at time step $t$:

(Screenshot 2020-08-22 17.02.36)
  • The decoder LSTM generates the hidden state $H_t$ from the memory $H_{t-1}$

  • $H_t$ “pays attention” to the encoder states to know which source information is relevant

  • Generate $\alpha_i^t$ from each $H_i^e$

  • Weighted sum for the “context vector” $C_t$

  • Combine $C_t$ and $H_t$, then generate $P_t$ for the output distribution

    • Use a feed-forward neural network

      $$\hat{H}_t = \tanh(W \cdot [C_t, H_t])$$
    • Or use an RNN

      $$\hat{H}_t = \operatorname{RNN}(C_t, H_t)$$
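A sketch of one attention step following the additive-scoring formulation above; the shapes, layer names, and the feed-forward combination are illustrative choices:

```python
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    def __init__(self, hidden_size):
        super().__init__()
        self.W1 = nn.Linear(2 * hidden_size, hidden_size)    # W_1 over [H_t, H_i^e]
        self.W2 = nn.Linear(hidden_size, 1)                   # W_2 -> scalar score alpha_i^t
        self.combine = nn.Linear(2 * hidden_size, hidden_size)

    def forward(self, h_t, enc_states):
        # h_t: (batch, hidden), enc_states: (batch, src_len, hidden)
        h_rep = h_t.unsqueeze(1).expand_as(enc_states)        # repeat H_t for each source position
        scores = self.W2(torch.tanh(self.W1(torch.cat([h_rep, enc_states], dim=-1))))
        alpha = torch.softmax(scores, dim=1)                  # alpha_i^t over source positions
        c_t = (alpha * enc_states).sum(dim=1)                 # weighted sum: context vector C_t
        h_hat = torch.tanh(self.combine(torch.cat([c_t, h_t], dim=-1)))  # combine C_t and H_t
        return h_hat, alpha.squeeze(-1)
```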

Training

  • Very similar to the basic encoder-decoder model
  • Since the scoring neural network is continuous, we can use backpropagation
  • No more gradient starvation on the encoder side

Practical Suggestions for Training

  • The loss is too high and the generated text is garbage

    • Try to match your implementation with the LSTM equations
    • Check the gradients by gradcheck
    • Note that gradcheck may still report some weights failing the relative-error check; this is acceptable if only a few (1–2% of the weights) fail
  • The loss is decreasing and the generated text looks correct but is not readable

    • In the sampling process, you can encourage the model to generate with higher certainty by using “argmax” (taking the character with the highest probability)
    • But always using argmax will look terrible
    • A mixture of argmax and sampling can be used to ensure spelling correctness while keeping exploration (a small sketch of such a mixture follows below)
  • A large network size is normally needed for character-based models
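A small sketch of the argmax/sampling mixture mentioned above; the mixing probability is an illustrative choice:

```python
import torch
import torch.nn.functional as F

def pick_next(logits, greedy_prob=0.5):
    # With probability greedy_prob take the argmax, otherwise sample from the distribution
    p = F.softmax(logits, dim=-1)
    if torch.rand(1).item() < greedy_prob:
        return torch.argmax(p, dim=-1).item()
    return torch.multinomial(p, 1).item()
```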