Backpropagation Through Time (BPTT)

Recurrent neural networks (RNNs) have attracted considerable attention for sequential tasks. However, compared to general feedforward neural networks, RNNs are somewhat harder to train because of their feedback loops.

In this article, we dive into the basics, especially how error backpropagation is used to compute gradients with respect to the model parameters. Furthermore, we go into detail on how the error backpropagation algorithm is applied to long short-term memory (LSTM) networks by unfolding the memory unit.

BPTT in RNN

Left: recursive description. Right: unrolled RNN in a time sequential manner

  • x_t\mathbf{x}\_t: current observation/input

  • h_t\mathbf{h}\_t: hidden state

    • dependent on:

      • current observation x_t\mathbf{x}\_t
      • previous hidden state h_tβˆ’1\mathbf{h}\_{t-1}
    • Representation:

      h_t=f(h_tβˆ’1,x_t) \mathbf{h}\_{t}=f\left(\mathbf{h}\_{t-1}, \mathbf{x}\_{t}\right)
      • ff: nonlinear mapping
  • z_tz\_t: output/prediction at time step tt

Suppose we have the following RNN:

$$
\begin{array}{l}
\mathbf{h}_{t}=\tanh\left(W_{hh} \mathbf{h}_{t-1}+W_{xh} \mathbf{x}_{t}+\mathbf{b}_{h}\right) \\
\alpha_{t}=W_{hz} \mathbf{h}_{t}+\mathbf{b}_{z} \\
z_{t}=\operatorname{softmax}\left(\alpha_{t}\right)
\end{array}
$$

Reminder:

$$
\tanh(x)=\frac{\sinh(x)}{\cosh(x)}=\frac{e^{x}-e^{-x}}{e^{x}+e^{-x}}=\frac{e^{2x}-1}{e^{2x}+1}
$$

Since each sequence can have a different length, we also assume the parameters at each time step are shared across the whole sequence (otherwise the gradients would be hard to compute). In addition, sharing the weights across all sequence lengths helps the model generalize well.

For sequential labeling, we can use maximum likelihood to estimate the model parameters. In other words, we can minimize the negative log-likelihood as the objective function ($\to$ cross-entropy):

$$
\mathcal{L}(\mathbf{x}, \mathbf{y})=-\sum_{t} y_{t} \log z_{t}
$$
  β€’ For simplicity, in the following we will use $\mathcal{L}$ as the objective function
  β€’ At time step $t+1$: $\mathcal{L}(t+1)=-y_{t+1}\log z_{t+1}$

Derivation

$W_{hz}$ and $b_z$

Based on the RNN above, taking the derivative with respect to $\alpha_t$ gives (see Derivative of softmax below)

$$
\frac{\partial \mathcal{L}}{\partial \alpha_{t}}=-\left(y_{t}-z_{t}\right)
$$

Note that the weight $W_{hz}$ is shared across the whole time sequence, so we can differentiate with respect to it at each time step and sum the results:

$$
\frac{\partial \mathcal{L}}{\partial W_{hz}}=\sum_{t} \frac{\partial \mathcal{L}}{\partial z_{t}} \frac{\partial z_{t}}{\partial W_{hz}}
$$

Similarly, we can get the gradient w.r.t. the bias $b_z$:

$$
\frac{\partial \mathcal{L}}{\partial b_{z}}=\sum_{t} \frac{\partial \mathcal{L}}{\partial z_{t}} \frac{\partial z_{t}}{\partial b_{z}}
$$
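A hedged sketch of how these two sums can be accumulated in code, reusing the hidden states `hs` and predictions `zs` from the forward sketch above (one-hot targets `ys` assumed):

```python
def output_grads(ys, hs, zs, W_hz, b_z):
    """Accumulate dL/dW_hz and dL/db_z over all time steps."""
    dW_hz, db_z = np.zeros_like(W_hz), np.zeros_like(b_z)
    for t in range(len(ys)):
        dalpha = zs[t] - ys[t]              # dL/dalpha_t = -(y_t - z_t)
        dW_hz += np.outer(dalpha, hs[t])    # chain rule through alpha_t = W_hz h_t + b_z
        db_z  += dalpha
    return dW_hz, db_z
```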

$W_{hh}$

Consider the transition from time step $t$ to $t+1$ in the figure above:

$$
\frac{\partial \mathcal{L}(t+1)}{\partial W_{hh}}=\frac{\partial \mathcal{L}(t+1)}{\partial z_{t+1}} \frac{\partial z_{t+1}}{\partial \mathbf{h}_{t+1}} \frac{\partial \mathbf{h}_{t+1}}{\partial W_{hh}}
$$

Because the hidden state $\mathbf{h}_{t+1}$ partially depends on $\mathbf{h}_t$, we can use backpropagation to compute the above partial derivative. Furthermore, $W_{hh}$ is shared across the whole time sequence. Therefore, at the transition $(t-1) \to t$, we can get the partial derivative w.r.t. $W_{hh}$:

$$
\frac{\partial \mathcal{L}(t+1)}{\partial W_{hh}}=\frac{\partial \mathcal{L}(t+1)}{\partial z_{t+1}} \frac{\partial z_{t+1}}{\partial \mathbf{h}_{t+1}} \frac{\partial \mathbf{h}_{t+1}}{\partial \mathbf{h}_{t}} \frac{\partial \mathbf{h}_{t}}{\partial W_{hh}}
$$

Thus, at time step $t+1$, we can compute the gradient w.r.t. $z_{t+1}$ and then use backpropagation through time (BPTT) from $t$ down to $0$ to calculate the gradient w.r.t. $W_{hh}$ (shown as the red chain in the figure above). In other words, if we only consider the output $z_{t+1}$ at time step $t+1$, we obtain the following gradient w.r.t. $W_{hh}$:

$$
\frac{\partial \mathcal{L}(t+1)}{\partial W_{hh}}=\sum_{k=1}^{t+1} \frac{\partial \mathcal{L}(t+1)}{\partial z_{t+1}} \frac{\partial z_{t+1}}{\partial \mathbf{h}_{t+1}} \frac{\partial \mathbf{h}_{t+1}}{\partial \mathbf{h}_{k}} \frac{\partial \mathbf{h}_{k}}{\partial W_{hh}}
$$
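The factor $\frac{\partial \mathbf{h}_{t+1}}{\partial \mathbf{h}_{k}}$ is itself a product of one-step Jacobians: for the tanh RNN above, each step contributes $\operatorname{diag}\left(1-\mathbf{h}_{s+1}^{2}\right) W_{hh}$. A small illustrative helper (`hidden_jacobian` is an assumed name, not from the original):

```python
def hidden_jacobian(hs, W_hh, k, t1):
    """Jacobian dh_{t1}/dh_k as a product of one-step Jacobians,
    where dh_{s+1}/dh_s = diag(1 - h_{s+1}^2) @ W_hh (tanh derivative)."""
    J = np.eye(W_hh.shape[0])
    for s in range(k, t1):
        J = np.diag(1.0 - hs[s + 1] ** 2) @ W_hh @ J
    return J
```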

Example: $t = 2$

Computational graph for $W_{hh}$ with three time steps (inputs $\mathbf{x}_t$, hidden states $\mathbf{h}_t$, outputs $z_t$, ground truth $y_t$; e.g. $\mathbf{h}_{3}=\tanh\left(W_{hh} \mathbf{h}_{2}+W_{xh} \mathbf{x}_{3}+\mathbf{b}_{h}\right)$). Unrolling the sum for $t = 2$ gives

$$
\frac{\partial \mathcal{L}_3}{\partial W_{hh}}=\frac{\partial \mathcal{L}_3}{\partial z_{3}} \frac{\partial z_{3}}{\partial \mathbf{h}_{3}}\left(\frac{\partial \mathbf{h}_{3}}{\partial W_{hh}}+\frac{\partial \mathbf{h}_{3}}{\partial \mathbf{h}_{2}} \frac{\partial \mathbf{h}_{2}}{\partial W_{hh}}+\frac{\partial \mathbf{h}_{3}}{\partial \mathbf{h}_{2}} \frac{\partial \mathbf{h}_{2}}{\partial \mathbf{h}_{1}} \frac{\partial \mathbf{h}_{1}}{\partial W_{hh}}\right)
$$

Aggregating the gradients w.r.t. $W_{hh}$ over the whole time sequence with backpropagation, we finally obtain the following gradient w.r.t. $W_{hh}$:

$$
\frac{\partial \mathcal{L}}{\partial W_{hh}}=\sum_{t} \sum_{k=1}^{t+1} \frac{\partial \mathcal{L}(t+1)}{\partial z_{t+1}} \frac{\partial z_{t+1}}{\partial \mathbf{h}_{t+1}} \frac{\partial \mathbf{h}_{t+1}}{\partial \mathbf{h}_{k}} \frac{\partial \mathbf{h}_{k}}{\partial W_{hh}}
$$
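In practice this double sum is computed with a single backward sweep that carries a running error vector, rather than forming the sum term by term. A minimal sketch under the same naming assumptions as the earlier snippets:

```python
def bptt_hidden_grads(xs, ys, hs, zs, W_hh, W_hz):
    """Accumulate dL/dW_hh with one backward sweep through time.
    The running vector dh carries the error flowing into h_t from all
    later losses, so the double sum never has to be formed explicitly."""
    dW_hh = np.zeros_like(W_hh)
    dh = np.zeros(W_hh.shape[0])
    for t in reversed(range(len(ys))):
        dh += W_hz.T @ (zs[t] - ys[t])         # error arriving at h_t from L(t)
        dh_raw = (1.0 - hs[t] ** 2) * dh       # backprop through tanh
        dW_hh += np.outer(dh_raw, hs[t - 1])   # this step's term (h_{-1} = 0)
        dh = W_hh.T @ dh_raw                   # pass the error back to h_{t-1}
    return dW_hh
```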

$W_{xh}$

Similar to $W_{hh}$, we consider time step $t+1$ (only the contribution from $\mathbf{x}_{t+1}$) and calculate the gradient w.r.t. $W_{xh}$:

$$
\frac{\partial \mathcal{L}(t+1)}{\partial W_{xh}}=\frac{\partial \mathcal{L}(t+1)}{\partial \mathbf{h}_{t+1}} \frac{\partial \mathbf{h}_{t+1}}{\partial W_{xh}}
$$

Because $\mathbf{h}_t$ and $\mathbf{x}_t$ both contribute to $\mathbf{h}_{t+1}$, we need to backpropagate to $\mathbf{h}_t$ as well. Including the contribution from time step $t$, we further get

$$
\begin{aligned}
\frac{\partial \mathcal{L}(t+1)}{\partial W_{xh}} &=\frac{\partial \mathcal{L}(t+1)}{\partial \mathbf{h}_{t+1}} \frac{\partial \mathbf{h}_{t+1}}{\partial W_{xh}}+\frac{\partial \mathcal{L}(t+1)}{\partial \mathbf{h}_{t}} \frac{\partial \mathbf{h}_{t}}{\partial W_{xh}} \\
&=\frac{\partial \mathcal{L}(t+1)}{\partial \mathbf{h}_{t+1}} \frac{\partial \mathbf{h}_{t+1}}{\partial W_{xh}}+\frac{\partial \mathcal{L}(t+1)}{\partial \mathbf{h}_{t+1}} \frac{\partial \mathbf{h}_{t+1}}{\partial \mathbf{h}_{t}} \frac{\partial \mathbf{h}_{t}}{\partial W_{xh}}
\end{aligned}
$$

Thus, summing up all contributions from $t$ down to $0$ via backpropagation, we obtain the gradient at time step $t+1$:

$$
\frac{\partial \mathcal{L}(t+1)}{\partial W_{xh}}=\sum_{k=1}^{t+1} \frac{\partial \mathcal{L}(t+1)}{\partial \mathbf{h}_{t+1}} \frac{\partial \mathbf{h}_{t+1}}{\partial \mathbf{h}_{k}} \frac{\partial \mathbf{h}_{k}}{\partial W_{xh}}
$$

Example: $t = 2$

Computational graph for $W_{xh}$

Further, we can take the derivative w.r.t. $W_{xh}$ over the whole sequence:

$$
\frac{\partial \mathcal{L}}{\partial W_{xh}}=\sum_{t} \sum_{k=1}^{t+1} \frac{\partial \mathcal{L}(t+1)}{\partial z_{t+1}} \frac{\partial z_{t+1}}{\partial \mathbf{h}_{t+1}} \frac{\partial \mathbf{h}_{t+1}}{\partial \mathbf{h}_{k}} \frac{\partial \mathbf{h}_{k}}{\partial W_{xh}}
$$
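In the backward sweep sketched earlier, the same running error also yields this double sum for $W_{xh}$; a single extra accumulation per step suffices (illustrative extension of the assumed `bptt_hidden_grads`):

```python
# inside the reversed-time loop of bptt_hidden_grads, after dh_raw is computed:
dW_xh += np.outer(dh_raw, xs[t])   # accumulates the double sum for W_xh
```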

Gradient vanishing or exploding problems

Notice that $\frac{\partial \mathbf{h}_{t+1}}{\partial \mathbf{h}_{k}}$ in the equation above is a product of matrices over the sequence, and RNNs need to backpropagate gradients over long sequences (a numerical illustration follows the list below):

  • With small values in the matrix multiplication

    Gradient value will shrink layer over layer, and eventually vanish after a few time steps. Thus, the states that are far away from the current time step does not contribute to the parameters’ gradient computing (or parameters that RNNs is learning)!

    β†’\to Gradient vanishing

  • With large values in the matrix multiplication

    Gradient value will get larger layer over layer, and eventually become extremly large!

    β†’\to Gradient exploding
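A small numerical illustration of this (an assumed setup, not from the original): ignoring the tanh factor, whose derivative is at most 1, the norm of the repeated product behaves roughly like the largest singular value of $W_{hh}$ raised to the number of steps:

```python
rng = np.random.default_rng(0)
H = 16
for scale in (0.5, 1.5):                  # contracting vs. expanding W_hh
    W_hh = scale * rng.standard_normal((H, H)) / np.sqrt(H)
    J = np.eye(H)
    for _ in range(50):                   # 50 backprop steps
        J = W_hh.T @ J                    # tanh factor (<= 1) ignored
    print(scale, np.linalg.norm(J))       # vanishes for 0.5, explodes for 1.5
```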

BPTT in LSTM

The representation of LSTM here follows the one in A Gentle Tutorial of Recurrent Neural Network with Error Backpropagation.

For more details about LSTM, see: LSTM Summary

Unit structure of LSTM

How LSTM works

Given sequence data $\left\{\mathbf{x}_{1}, \dots, \mathbf{x}_{T}\right\}$, we have

  • Forget gate

    f_t=Οƒ(W_xfx_t+W_hfh_tβˆ’1+b_f) \mathbf{f}\_{t}=\sigma\left(W\_{x f} \mathbf{x}\_{t}+W\_{h f} \mathbf{h}\_{t-1}+b\_{f}\right)
  • Input gate

    i_t=Οƒ(W_xix_t+Whih_tβˆ’1+b_i) \mathbf{i}\_{t}=\sigma\left(W\_{x i} \mathbf{x}\_{t}+W_{h i} \mathbf{h}\_{t-1}+b\_{i}\right)
  • Candidate of new cell state

    g_t=tanh⁑(W_xcx_t+Whch_tβˆ’1+b_c) \mathbf{g}\_{t}=\tanh \left(W\_{x c} \mathbf{x}\_{t}+W_{h c} \mathbf{h}\_{t-1}+b\_{c}\right)
  • New cell state

    c_t=f_tβŠ™c_tβˆ’1+i_tβŠ™g_t \mathbf{c}\_{t}=\mathbf{f}\_{t} \odot \mathbf{c}\_{t-1}+\mathbf{i}\_{t} \odot \mathbf{g}\_{t}

    Note

    βŠ™\odot is a pointwise/elementwise multiplication.

    [>x_1>x_2>]βŠ™[>y_1>y_2>]=[>x_1y_1>x_2y_2>]\left[\begin{array}{l} > x\_{1} \\\\ > x\_{2} > \end{array}\right] \odot\left[\begin{array}{l} > y\_{1} \\\\ > y\_{2} > \end{array}\right]=\left[\begin{array}{l} > x\_{1} y\_1 \\\\ > x\_{2} y\_{2} > \end{array}\right]

  • Output gate

    o_t=Οƒ(W_xox_t+W_hoh_tβˆ’1+b_o) \mathbf{o}\_{t}=\sigma\left(W\_{x o} \mathbf{x}\_{t}+W\_{h o} \mathbf{h}\_{t-1}+b\_{o}\right)
  • New hidden state (and output)

    h_t=o_tβŠ™tanh⁑(c_t),z_t=softmax⁑(W_hzh_t+b_z) \mathbf{h}\_{t}=\mathbf{o}\_{t} \odot \tanh \left(\mathbf{c}\_{t}\right), \quad z\_{t}=\operatorname{softmax}\left(W\_{h z} \mathbf{h}\_{t}+b\_{z}\right)
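A minimal numpy sketch of one forward step implementing these equations (the parameter dictionary `p` and its key names are assumptions for illustration):

```python
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM step following the equations above (p: dict of weights/biases)."""
    f = sigmoid(p["W_xf"] @ x + p["W_hf"] @ h_prev + p["b_f"])   # forget gate
    i = sigmoid(p["W_xi"] @ x + p["W_hi"] @ h_prev + p["b_i"])   # input gate
    g = np.tanh(p["W_xc"] @ x + p["W_hc"] @ h_prev + p["b_c"])   # candidate cell
    o = sigmoid(p["W_xo"] @ x + p["W_ho"] @ h_prev + p["b_o"])   # output gate
    c = f * c_prev + i * g            # new cell state (elementwise)
    h = o * np.tanh(c)                # new hidden state
    return h, c, (f, i, g, o)
```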

Derivatives

At time step $t$:

  • h_t=o_t∘tanh⁑(c_t)β‡’\mathbf{h}\_{t}=\mathbf{o}\_{t} \circ \tanh \left(\mathbf{c}\_{t}\right)\Rightarrow
    • do_t=tanh⁑(c_t)dh_td \mathbf{o}\_{t}=\tanh \left(\mathbf{c}\_{t}\right) d \mathbf{h}\_{t}
    • dc_t=(1βˆ’tanh⁑(c_t)2)o_tdh_td \mathbf{c}\_{t}=\left(1-\tanh \left(\mathbf{c}\_{t}\right)^{2}\right) \mathbf{o}\_{t} d \mathbf{h}\_{t} (see: Derivation of tanhtanh)
  • c_t=f_t∘c_tβˆ’1+i_t∘g_tβ‡’\mathbf{c}\_{t}=\mathbf{f}\_{t} \circ \mathbf{c}\_{t-1}+\mathbf{i}\_{t} \circ \mathbf{g}\_{t} \Rightarrow
    • di_t=g_tdc_td \mathbf{i}\_{t}=\mathbf{g}\_{t} d \mathbf{c}\_{t}
    • dg_t=i_tdc_td \mathbf{g}\_{t}=\mathbf{i}\_{t} d \mathbf{c}\_{t}
    • df_t=c_tβˆ’1dc_td \mathbf{f}\_{t}=\mathbf{c}\_{t-1} d \mathbf{c}\_{t}
    • dc_tβˆ’1+=f_t∘dc_td \mathbf{c}\_{t-1}+=\mathbf{f}\_{t} \circ d \mathbf{c}\_{t} (derivation see: Error propagation)

Furthermore, we backpropagate through the activation functions over the whole sequence (since the weights $W_{xo}, W_{xi}, W_{xf}, W_{xc}$ are shared across the whole sequence, we take the same summation over $t$ as in RNNs):

$$
\begin{aligned}
dW_{xo} &=\sum_{t} \mathbf{o}_{t}\left(1-\mathbf{o}_{t}\right) \mathbf{x}_{t}\, d\mathbf{o}_{t} \\
dW_{xi} &=\sum_{t} \mathbf{i}_{t}\left(1-\mathbf{i}_{t}\right) \mathbf{x}_{t}\, d\mathbf{i}_{t} \\
dW_{xf} &=\sum_{t} \mathbf{f}_{t}\left(1-\mathbf{f}_{t}\right) \mathbf{x}_{t}\, d\mathbf{f}_{t} \\
dW_{xc} &=\sum_{t}\left(1-\mathbf{g}_{t}^{2}\right) \mathbf{x}_{t}\, d\mathbf{g}_{t}
\end{aligned}
$$

Similarly, we have

$$
\begin{aligned}
dW_{ho} &=\sum_{t} \mathbf{o}_{t}\left(1-\mathbf{o}_{t}\right) \mathbf{h}_{t-1}\, d\mathbf{o}_{t} \\
dW_{hi} &=\sum_{t} \mathbf{i}_{t}\left(1-\mathbf{i}_{t}\right) \mathbf{h}_{t-1}\, d\mathbf{i}_{t} \\
dW_{hf} &=\sum_{t} \mathbf{f}_{t}\left(1-\mathbf{f}_{t}\right) \mathbf{h}_{t-1}\, d\mathbf{f}_{t} \\
dW_{hc} &=\sum_{t}\left(1-\mathbf{g}_{t}^{2}\right) \mathbf{h}_{t-1}\, d\mathbf{g}_{t}
\end{aligned}
$$

Since $\mathbf{h}_{t-1}$, the hidden state at time step $t-1$, is used in the forget gate, input gate, candidate cell state, and output gate, we have:

$$
\begin{aligned}
d\mathbf{h}_{t-1} =\ &\mathbf{o}_{t}\left(1-\mathbf{o}_{t}\right) W_{ho}\, d\mathbf{o}_{t}+\mathbf{i}_{t}\left(1-\mathbf{i}_{t}\right) W_{hi}\, d\mathbf{i}_{t} \\
&+\mathbf{f}_{t}\left(1-\mathbf{f}_{t}\right) W_{hf}\, d\mathbf{f}_{t}+\left(1-\mathbf{g}_{t}^{2}\right) W_{hc}\, d\mathbf{g}_{t}
\end{aligned}
$$

In addition, $d\mathbf{h}_{t-1}$ picks up a contribution from the objective function at time step $t-1$:

$$
d\mathbf{h}_{t-1}=d\mathbf{h}_{t-1}+W_{hz}\, dz_{t-1}
$$
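Putting these derivatives together, a single backward step might look like the sketch below (the `cache` tuple layout and the `p`/`grads` dictionaries are assumptions; in vector form, backpropagating through the $W_{h\cdot}$ matrices uses the transpose):

```python
def lstm_step_backward(dh, dc, cache, p, grads):
    """Backprop one LSTM step using the derivatives above.
    dh, dc: errors arriving at h_t and c_t; returns errors for h_{t-1}, c_{t-1}."""
    x, h_prev, c_prev, f, i, g, o, c = cache
    do = np.tanh(c) * dh                         # d o_t
    dc = dc + (1.0 - np.tanh(c) ** 2) * o * dh   # d c_t, plus error from h_t
    di, dg, df = g * dc, i * dc, c_prev * dc     # gate errors
    dc_prev = f * dc                             # d c_{t-1} += f_t (.) d c_t
    # fold in the sigmoid/tanh derivatives once, then reuse below
    do_raw = o * (1 - o) * do
    di_raw = i * (1 - i) * di
    df_raw = f * (1 - f) * df
    dg_raw = (1 - g ** 2) * dg
    for name, d in (("o", do_raw), ("i", di_raw), ("f", df_raw), ("c", dg_raw)):
        grads["W_x" + name] += np.outer(d, x)       # e.g. dW_xo += ... x_t
        grads["W_h" + name] += np.outer(d, h_prev)  # e.g. dW_ho += ... h_{t-1}
    dh_prev = (p["W_ho"].T @ do_raw + p["W_hi"].T @ di_raw
               + p["W_hf"].T @ df_raw + p["W_hc"].T @ dg_raw)
    return dh_prev, dc_prev
```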

Error backpropagation

Suppose we have the least-squares objective function

$$
\mathcal{L}(\mathbf{x}, \theta)=\min \sum_{t} \frac{1}{2}\left(y_{t}-z_{t}\right)^{2}
$$

where $\boldsymbol{\theta}=\left\{W_{hz}, W_{xo}, W_{xi}, W_{xf}, W_{xc}, W_{ho}, W_{hi}, W_{hf}, W_{hc}\right\}$, with biases ignored. For brevity, we use the following notation:

$$
\mathcal{L}(t)=\frac{1}{2}\left(y_{t}-z_{t}\right)^{2}
$$

At time step $T$, we take the derivative w.r.t. $\mathbf{c}_T$:

$$
\frac{\partial \mathcal{L}(T)}{\partial \mathbf{c}_{T}}=\frac{\partial \mathcal{L}(T)}{\partial \mathbf{h}_{T}} \frac{\partial \mathbf{h}_{T}}{\partial \mathbf{c}_{T}}
$$

At time step $T-1$, we take the derivative of $\mathcal{L}(T-1)$ w.r.t. $\mathbf{c}_{T-1}$ as

$$
\frac{\partial \mathcal{L}(T-1)}{\partial \mathbf{c}_{T-1}}=\frac{\partial \mathcal{L}(T-1)}{\partial \mathbf{h}_{T-1}} \frac{\partial \mathbf{h}_{T-1}}{\partial \mathbf{c}_{T-1}}
$$

However, according to the following unfolded unit structure,

Unfolded unit of LSTM, in order to make it easy to understand error backpropagation

the error is backpropagated not only via $\mathcal{L}(T-1)$ but also from $\mathbf{c}_T$. Therefore, the gradient w.r.t. $\mathbf{c}_{T-1}$ should be

$$
\begin{aligned}
\frac{\partial \mathcal{L}(T-1)}{\partial \mathbf{c}_{T-1}} &= \frac{\partial \mathcal{L}(T-1)}{\partial \mathbf{c}_{T-1}}+\frac{\partial \mathcal{L}(T)}{\partial \mathbf{c}_{T-1}} \\
&=\frac{\partial \mathcal{L}(T-1)}{\partial \mathbf{h}_{T-1}} \frac{\partial \mathbf{h}_{T-1}}{\partial \mathbf{c}_{T-1}} + \underbrace{\frac{\partial \mathcal{L}(T)}{\partial \mathbf{h}_{T}} \frac{\partial \mathbf{h}_{T}}{\partial \mathbf{c}_{T}}}_{=d\mathbf{c}_{T}} \underbrace{\frac{\partial \mathbf{c}_{T}}{\partial \mathbf{c}_{T-1}}}_{=\mathbf{f}_{T}}
\end{aligned}
$$

$$
\Rightarrow \qquad d\mathbf{c}_{T-1}=d\mathbf{c}_{T-1}+\mathbf{f}_{T} \odot d\mathbf{c}_{T}
$$

Parameter learning

  1. Forward

    Use the equations in How LSTM works to update the states, as in a feedforward neural network, from time step $1$ to $T$.

  2. Compute loss

  3. Backward

    Backpropagate the error from $T$ to $1$ using the equations in Derivatives. Then use the gradient $d\boldsymbol{\theta}$ to update the parameters $\boldsymbol{\theta}=\left\{W_{hz}, W_{xo}, W_{xi}, W_{xf}, W_{xc}, W_{ho}, W_{hi}, W_{hf}, W_{hc}\right\}$. For example, with SGD we have:

    $$\boldsymbol{\theta}=\boldsymbol{\theta}-\eta\, d\boldsymbol{\theta}$$

    where $\eta$ is the learning rate.
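A sketch of one training step following these three stages, assuming hypothetical `forward` and `backward` helpers that wrap the equations above (names and dictionary layout are illustrative):

```python
def train_step(xs, ys, params, eta=0.1):
    """One SGD step: forward, loss, backward, parameter update (sketch)."""
    states = forward(xs, params)                  # 1. forward pass over t = 1..T
    loss = sum(0.5 * np.sum((y - z) ** 2)         # 2. least-squares loss
               for y, z in zip(ys, states["z"]))
    grads = backward(xs, ys, states, params)      # 3. BPTT from T down to 1
    for name in params:
        params[name] -= eta * grads[name]         # SGD update theta -= eta * dtheta
    return loss
```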

Derivative of softmax

$z=\operatorname{softmax}\left(W_{hz} \mathbf{h}+\mathbf{b}_{z}\right)$ predicts the probability assigned to each of $K$ classes.

Furthermore, we can use 1-of-$K$ (one-hot) encoding to represent the ground truth $y$, and a probability vector $z=\left[p\left(\hat{y}_{1}\right), \ldots, p\left(\hat{y}_{K}\right)\right]$ to represent the prediction. We can then consider the gradient in each dimension and generalize to the vector case for the objective function (cross-entropy loss):

$$
\mathcal{L}\left(W_{hz}, \mathbf{b}_{z}\right)=-y \log z
$$

Let

αj(Θ)=W_hz(:,j)h_t \alpha_{j}(\Theta)=W\_{h z}(:, j) \mathbf{h}\_{t}

Then

$$
p\left(\hat{y}_{j} \mid \mathbf{h}_{t} ; \Theta\right)=\frac{\exp \left(\alpha_{j}(\Theta)\right)}{\sum_{k} \exp \left(\alpha_{k}(\Theta)\right)}
$$

$\forall k = j$:

$$
\begin{aligned}
&\frac{\partial}{\partial \alpha_{j}}\, y_{j} \log p\left(\hat{y}_{j} \mid \mathbf{h}_{t} ; \Theta\right) \\
=\ &y_{j}\left(\frac{\partial \log p\left(\hat{y}_{j} \mid \mathbf{h}_{t} ; \Theta\right)}{\partial p\left(\hat{y}_{j} \mid \mathbf{h}_{t} ; \Theta\right)}\right)\left(\frac{\partial}{\partial \alpha_{j}} p\left(\hat{y}_{j} \mid \mathbf{h}_{t} ; \Theta\right)\right) \\
=\ &y_{j} \cdot \frac{1}{p\left(\hat{y}_{j} \mid \mathbf{h}_{t} ; \Theta\right)} \cdot\left(\frac{\partial}{\partial \alpha_{j}} \frac{\exp \left(\alpha_{j}(\Theta)\right)}{\sum_{k} \exp \left(\alpha_{k}(\Theta)\right)}\right) \\
=\ &\frac{y_{j}}{p\left(\hat{y}_{j}\right)} \cdot \frac{\exp \left(\alpha_{j}(\Theta)\right) \sum_{k} \exp \left(\alpha_{k}(\Theta)\right)-\exp \left(\alpha_{j}(\Theta)\right) \exp \left(\alpha_{j}(\Theta)\right)}{\left[\sum_{k} \exp \left(\alpha_{k}(\Theta)\right)\right]^{2}} \\
=\ &\frac{y_{j}}{p\left(\hat{y}_{j}\right)} \underbrace{\frac{\exp \left(\alpha_{j}(\Theta)\right)}{\sum_{k} \exp \left(\alpha_{k}(\Theta)\right)}}_{=p\left(\hat{y}_{j}\right)} \cdot \frac{\sum_{k} \exp \left(\alpha_{k}(\Theta)\right)-\exp \left(\alpha_{j}(\Theta)\right)}{\sum_{k} \exp \left(\alpha_{k}(\Theta)\right)} \\
=\ &y_{j}\left(\frac{\sum_{k} \exp \left(\alpha_{k}(\Theta)\right)}{\sum_{k} \exp \left(\alpha_{k}(\Theta)\right)}-\underbrace{\frac{\exp \left(\alpha_{j}(\Theta)\right)}{\sum_{k} \exp \left(\alpha_{k}(\Theta)\right)}}_{=p\left(\hat{y}_{j}\right)}\right) \\
=\ &y_{j}\left(1-p\left(\hat{y}_{j}\right)\right)
\end{aligned}
$$

$\forall k \neq j$:

$$
\begin{aligned}
&\frac{\partial}{\partial \alpha_{j}}\, y_{k} \log p\left(\hat{y}_{k} \mid \mathbf{h}_{t} ; \Theta\right) \\
=\ &\frac{y_{k}}{p\left(\hat{y}_{k}\right)} \cdot \frac{-\exp \left(\alpha_{k}(\Theta)\right) \exp \left(\alpha_{j}(\Theta)\right)}{\left[\sum_{s} \exp \left(\alpha_{s}(\Theta)\right)\right]^{2}} \\
=\ &-y_{k}\, p\left(\hat{y}_{j}\right)
\end{aligned}
$$

Summing over all classes, we obtain the following gradient w.r.t. $\alpha_j(\Theta)$:

$$
\begin{aligned}
\frac{\partial}{\partial \alpha_{j}} \sum_{k} y_{k} \log p\left(\hat{y}_{k} \mid \mathbf{h}_{t} ; \Theta\right) &=\frac{\partial\, y_{j} \log p\left(\hat{y}_{j} \mid \mathbf{h}_{t} ; \Theta\right)}{\partial \alpha_{j}}+\sum_{k \neq j} \frac{\partial\, y_{k} \log p\left(\hat{y}_{k} \mid \mathbf{h}_{t} ; \Theta\right)}{\partial \alpha_{j}} \\
&=y_{j}\left(1-p\left(\hat{y}_{j}\right)\right)-\sum_{k \neq j} y_{k}\, p\left(\hat{y}_{j}\right) \\
&=y_{j}-p\left(\hat{y}_{j}\right)\left(y_{j}+\sum_{k \neq j} y_{k}\right) \\
&=y_{j}-p\left(\hat{y}_{j}\right)
\end{aligned}
$$

$$
\Rightarrow \quad \frac{\partial \mathcal{L}}{\partial \alpha_{j}} = \frac{\partial}{\partial \alpha_{j}}\left(-\sum_{k} y_{k} \log p\left(\hat{y}_{k} \mid \mathbf{h}_{t} ; \Theta\right)\right) = -\left(y_{j}-p\left(\hat{y}_{j}\right)\right) = p\left(\hat{y}_{j}\right)-y_{j}
$$
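As a quick sanity check (illustrative, not from the original), we can confirm $\frac{\partial \mathcal{L}}{\partial \alpha_{j}}=p\left(\hat{y}_{j}\right)-y_{j}$ numerically with finite differences:

```python
rng = np.random.default_rng(1)
alpha = rng.standard_normal(5)
y = np.eye(5)[2]                               # one-hot ground truth
p = np.exp(alpha) / np.exp(alpha).sum()        # softmax probabilities
analytic = p - y                               # p(y_hat_j) - y_j
loss = lambda a: -y @ np.log(np.exp(a) / np.exp(a).sum())  # cross-entropy
eps, numeric = 1e-6, np.zeros(5)
for j in range(5):
    a1, a2 = alpha.copy(), alpha.copy()
    a1[j] += eps; a2[j] -= eps
    numeric[j] = (loss(a1) - loss(a2)) / (2 * eps)  # central difference
print(np.allclose(analytic, numeric, atol=1e-5))    # True
```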


Derivative of tanh

$$
\begin{aligned}
\frac{\partial \tanh (x)}{\partial x} &= \frac{\partial}{\partial x} \frac{\sinh (x)}{\cosh (x)} \\
&= \frac{\frac{\partial \sinh (x)}{\partial x} \cosh (x)-\sinh (x) \frac{\partial \cosh (x)}{\partial x}}{(\cosh (x))^{2}} \\
&= \frac{[\cosh (x)]^{2}-[\sinh (x)]^{2}}{(\cosh (x))^{2}} \\
&= 1-[\tanh (x)]^{2}
\end{aligned}
$$

Derivative of Sigmoid

Sigmoid:

$$
\sigma(x)=\frac{1}{1+e^{-x}}
$$

Derivative:

$$
\begin{aligned}
\frac{d}{dx} \sigma(x) &=\frac{d}{dx}\left[\frac{1}{1+e^{-x}}\right] \\
&=\frac{d}{dx}\left(1+e^{-x}\right)^{-1} \\
&=-\left(1+e^{-x}\right)^{-2}\left(-e^{-x}\right) \\
&=\frac{e^{-x}}{\left(1+e^{-x}\right)^{2}} \\
&=\frac{1}{1+e^{-x}} \cdot \frac{e^{-x}}{1+e^{-x}} \\
&=\frac{1}{1+e^{-x}} \cdot \frac{\left(1+e^{-x}\right)-1}{1+e^{-x}} \\
&=\frac{1}{1+e^{-x}} \cdot\left(\frac{1+e^{-x}}{1+e^{-x}}-\frac{1}{1+e^{-x}}\right) \\
&=\frac{1}{1+e^{-x}} \cdot\left(1-\frac{1}{1+e^{-x}}\right) \\
&=\sigma(x) \cdot(1-\sigma(x))
\end{aligned}
$$
