Recurrent Neural Networks

For a detailed explanation and summary, see: RNN Summary

Overview

  • Specifically designed to capture long-range dependencies in sequential data
  • 💡 Main idea: connecting the hidden states together within a layer

Simple RNNs

Elman Networks

  • The output of the hidden layer is used as input for the next time step
  • They use a copy mechanism: the previous state of $H$ is copied into the next iteration of the net
    • They do NOT have true recurrence; there is no backpropagation through time
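A minimal sketch of the Elman-style copy step (toy dimensions assumed): the previous hidden state is stored in "context units" and fed back in as extra input, but the copy itself is treated as a constant, so no gradient flows through time.

```python
import numpy as np

# Hypothetical toy dimensions; the point is only the copy mechanism itself.
rng = np.random.default_rng(1)
W_X = rng.normal(size=(4, 2))   # input -> hidden weights
W_C = rng.normal(size=(4, 4))   # context -> hidden weights
b = np.zeros(4)

context = np.zeros(4)  # context units: a frozen copy of the last hidden state
for X_t in rng.normal(size=(3, 2)):
    H = np.tanh(W_X @ X_t + W_C @ context + b)
    context = H.copy()  # copy mechanism: store H for the next step;
                        # gradients are NOT propagated through this copy
```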

Jordan Nets

  • Same copy mechanism as Elman Networks, but the copy is taken from the output layer

RNN Structure

$$ H\_{t}=f\_{act}\left(W^{H} H\_{t-1}+W^{X} X\_{t}+b\right) $$
  • $f\_{act}$: activation function (sigmoid, tanh, ReLU … )
  • Inside the bracket: a linear “voting” combination between the memory $H\_{t-1}$ and input $X\_t$
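A minimal NumPy sketch of this recurrence (toy dimensions and tanh activation assumed):

```python
import numpy as np

def rnn_step(H_prev, X_t, W_H, W_X, b, f_act=np.tanh):
    """One recurrence step: H_t = f_act(W^H H_{t-1} + W^X X_t + b)."""
    return f_act(W_H @ H_prev + W_X @ X_t + b)

# Hypothetical toy sizes: hidden size 3, input size 2
rng = np.random.default_rng(0)
W_H = rng.normal(size=(3, 3))
W_X = rng.normal(size=(3, 2))
b = np.zeros(3)

H = np.zeros(3)                      # initial hidden state H_0
for X_t in rng.normal(size=(5, 2)):  # a length-5 input sequence
    H = rnn_step(H, X_t, W_H, W_X, b)
```

The same weight matrices $W^H, W^X$ are reused at every time step; only the state $H$ changes.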

Learning in RNNs: BPTT

One-to-One RNN

  • Identical to an MLP

  • $W^H$ is non-existent or an identity function

  • Backprop (obviously) like an MLP

    • $\frac{\delta L}{\delta O}$ is directly computed from $L(O, Y)$
    $$ \begin{array}{l} \frac{\delta L}{\delta H\_{1}}=\frac{\delta L}{\delta O} \frac{\delta O}{\delta H\_{1}}=W^{O} \color{red}{\frac{\delta L}{\delta O}} \\\\ \frac{\delta L}{\delta W^{O}}={\color{red}{\frac{\delta L}{\delta O}}} \frac{\delta O}{\delta W^{O}}={\color{red}{\frac{\delta L}{\delta O}}} H^{T} \end{array} $$

    ​ (Red terms are precomputed values in backpropagation)
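The formula $\frac{\delta L}{\delta W^O} = \frac{\delta L}{\delta O} H^{T}$ can be checked numerically; a small sketch with assumed toy sizes, a linear output layer $O = W^O H$, and a squared-error loss:

```python
import numpy as np

# Numerical check (toy sizes assumed) of dL/dW^O = (dL/dO) H^T
# for O = W^O H and L = 0.5 * ||O - Y||^2.
rng = np.random.default_rng(2)
H, Y = rng.normal(size=4), rng.normal(size=3)
W_O = rng.normal(size=(3, 4))

dL_dO = W_O @ H - Y                     # dL/dO, the "precomputed" red term
grad = np.outer(dL_dO, H)               # dL/dW^O = (dL/dO) H^T

# Finite-difference check of one entry of W^O
eps, (i, j) = 1e-6, (1, 2)
W_pert = W_O.copy(); W_pert[i, j] += eps
L  = 0.5 * np.sum((W_O    @ H - Y) ** 2)
L2 = 0.5 * np.sum((W_pert @ H - Y) ** 2)
print(abs((L2 - L) / eps - grad[i, j]) < 1e-4)  # True
```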

Many-to-One RNN

  • Often used for sequence-level classification

  • For example:

    • Movie review sentiment analysis
    • Predicting blood type from DNA
  • Input: sequence $I \in \mathbb{R}^{T \times D\_I}$

  • Output: vector $O \in \mathbb{R}^{D\_O}$

  • Backprop

    • The gradient signal is backpropagated through time by the recurrent connections
    • Shared weights: the weight matrices are duplicated over the time dimension
    • In each time step, the shared weights $\theta = (W^H, W^I, W^O)$ define a multi-layer perceptron.


    • The gradients are summed up over time $$ \begin{aligned} \frac{\delta L}{\delta W^{I}} &=\frac{\delta L}{\delta W\_{1}^{I}}+\frac{\delta L}{\delta W\_{2}^{I}}+\cdots+\frac{\delta L}{\delta W\_{T}^{I}} \\\\ \\\\ \frac{\delta L}{\delta W^{H}} &=\frac{\delta L}{\delta W\_{1}^{H}}+\frac{\delta L}{\delta W\_{2}^{H}}+\cdots+\frac{\delta L}{\delta W\_{T-1}^{H}} \end{aligned} $$
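A hedged sketch of many-to-one BPTT (toy sizes and tanh activation assumed; the incoming gradient $\frac{\delta L}{\delta H\_T}$ is a stand-in value): the forward pass stores every $H\_t$, and the backward pass walks time in reverse, summing the per-step gradients of the shared weights.

```python
import numpy as np

rng = np.random.default_rng(3)
T, D_I, D_H = 5, 2, 3                      # hypothetical toy sizes
X = rng.normal(size=(T, D_I))
W_I, W_H = rng.normal(size=(D_H, D_I)), rng.normal(size=(D_H, D_H))

Hs = [np.zeros(D_H)]                       # H_0
for t in range(T):                         # forward: store all hidden states
    Hs.append(np.tanh(W_H @ Hs[-1] + W_I @ X[t]))

dH = rng.normal(size=D_H)                  # stand-in for dL/dH_T from the loss
dW_I, dW_H = np.zeros_like(W_I), np.zeros_like(W_H)
for t in range(T, 0, -1):                  # backward through time
    dA = dH * (1 - Hs[t] ** 2)             # through tanh: f'(a) = 1 - tanh(a)^2
    dW_I += np.outer(dA, X[t - 1])         # dL/dW^I_t, summed over t
    dW_H += np.outer(dA, Hs[t - 1])        # dL/dW^H_t, summed over t
    dH = W_H.T @ dA                        # send the signal one step back
```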

Many-to-Many RNN

  • The inputs and outputs have the same number of elements

  • Example

    • Auto-regressive models
    • Character generation model
    • Pixel models
  • Backprop

    • Loss function

      $$ L = L\_1 + L\_2 + \cdots + L\_T $$
    • $H\_t$ cannot change $O\_{k}$ for $k < t$:
      $$ \frac{\delta L}{\delta H\_t} = \sum\_{i \geq t}\frac{\delta L\_i}{\delta H\_t} $$

      (Because we cannot change the past, the terms with $i < t$ can be omitted)


    • After getting the $\frac{\delta L}{\delta H\_t}$ terms, we can compute the necessary gradients for $\frac{\delta L}{\delta W^{O}}, \frac{\delta L}{\delta W^{H}}$ and $\frac{\delta L}{\delta W^{I}}$

    • For the shared weights $W^O$, apply the same principle as in the Many-to-One RNN.
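The accumulation rule above can be sketched directly (stand-in hidden states and local gradients assumed): walking backwards, each step adds its local term $\frac{\delta L\_t}{\delta H\_t}$ to the signal arriving from the future, so only terms with $i \geq t$ contribute.

```python
import numpy as np

rng = np.random.default_rng(4)
T, D_H = 4, 3                                 # hypothetical toy sizes
Hs = np.tanh(rng.normal(size=(T + 1, D_H)))   # stand-in hidden states H_0..H_T
dL_local = rng.normal(size=(T + 1, D_H))      # stand-in local terms dL_t/dH_t
W_H = rng.normal(size=(D_H, D_H))

dH_future = np.zeros(D_H)                     # no loss terms arrive after step T
dH = {}
for t in range(T, 0, -1):
    dH[t] = dL_local[t] + dH_future           # local loss + all future losses
    dH_future = W_H.T @ (dH[t] * (1 - Hs[t] ** 2))  # pass back through tanh
```

At the last step, $\frac{\delta L}{\delta H\_T}$ equals the local term alone, since no future losses exist.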

One-to-Many RNN

  • Less common than Many-to-One RNNs

  • For example:

    • Image description generation
    • Music generation
  • Input: vector $I \in \mathbb{R}^{D\_{I}}$

  • Output: sequence $O \in \mathbb{R}^{T \times D\_{O}}$
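A minimal one-to-many sketch (hypothetical sizes): a single input vector $I$ seeds the hidden state, and the recurrence alone then produces a sequence of $T$ outputs.

```python
import numpy as np

rng = np.random.default_rng(5)
T, D_I, D_H, D_O = 4, 2, 3, 2      # hypothetical toy sizes
I = rng.normal(size=D_I)
W_I = rng.normal(size=(D_H, D_I))
W_H = rng.normal(size=(D_H, D_H))
W_O = rng.normal(size=(D_O, D_H))

H = np.tanh(W_I @ I)               # the single input enters only at the first step
O = []
for _ in range(T):
    O.append(W_O @ H)              # emit one output vector per time step
    H = np.tanh(W_H @ H)           # the recurrence carries the state forward
O = np.stack(O)                    # O in R^{T x D_O}
```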

Problem of BPTT

$$ \begin{array}{l} \frac{\delta L}{\delta H\_{114}}=\frac{\delta L}{\delta H\_{115}} \frac{\delta H\_{115}}{\delta H\_{114}} \\\\ \\\\ \frac{\delta L}{\delta H\_{113}}=\frac{\delta L}{\delta H\_{115}} \frac{\delta H\_{115}}{\delta H\_{114}} \frac{\delta H\_{114}}{\delta H\_{113}}=W^{O} f^{\prime}\left(H\_{115}\right) W^{H} f^{\prime}\left(H\_{114}\right) W^{H} \end{array} $$

If we want to send signals back to step 50:

$$ \begin{aligned} \frac{\delta L}{\delta H\_{50}} &=\frac{\delta L}{\delta H\_{115}} \frac{\delta H\_{115}}{\delta H\_{114}} \ldots \frac{\delta H\_{51}}{\delta H\_{50}} \\\\ \\\\ &=W^{O} f^{\prime}\left(H\_{115}\right) W^{H} f^{\prime}\left(H\_{114}\right) W^{H} f^{\prime}\left(H\_{113}\right) W^{H} \ldots \end{aligned} $$

65 repetitions of the factor $f^{\prime}\left(H\_{t}\right) W^{H}$!!! 😱😱😱

Assume the network is univariate, i.e., $W^H$ has a single element $w$, and $f$ is either ReLU or linear:

  • If $w = 1.1$: $\frac{\delta L}{\delta H\_{50}}=117$

  • If $w = 0.9$: $\frac{\delta L}{\delta H\_{50}}=0.005$
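In the univariate case the chain just multiplies by $w$ repeatedly, so over $n$ steps the signal scales by $w^n$. A quick check with $n = 50$ (which reproduces the numbers quoted above):

```python
# Repeated multiplication by w: slightly above 1 explodes, slightly below vanishes.
for w in (1.1, 0.9):
    print(w, round(w ** 50, 3))   # 1.1 -> 117.391 (explodes), 0.9 -> 0.005 (vanishes)
```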

In general

  • If the maximum eigenvalue of $W^H$ is $> 1$: $\frac{\delta L}{\delta H\_{T}}$ is likely to explode (exploding gradients)

    More see: Exploding Gradient Problem

  • If the maximum eigenvalue of $W^H$ is $< 1$: $\frac{\delta L}{\delta H\_{T}}$ is likely to vanish (vanishing gradients)

    • Small gradients make the RNN unable to learn the features it needs to solve the problem

Gradient clipping

  • A simple remedy for exploding gradients

  • Rescale gradients so that their norm is at most a chosen threshold $\mu$.

    • Clip the gradient $g = \frac{\delta L}{\delta w}$: whenever
    $$ \|g\|>\mu: \quad g \leftarrow \mu \frac{g}{\|g\|} \qquad \text{(biased gradients)} $$
  • With gradient clipping, a pre-determined threshold is introduced; gradient norms that exceed it are scaled down to match the threshold. This ensures that no gradient update has a norm greater than the threshold.

    More see: Gradient Clipping
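The clipping rule above amounts to a few lines; a minimal sketch (function name `clip_by_norm` is my own):

```python
import numpy as np

def clip_by_norm(g, mu):
    """If ||g|| > mu, rescale g to have norm mu; otherwise leave it unchanged."""
    norm = np.linalg.norm(g)
    return g * (mu / norm) if norm > mu else g

g = np.array([3.0, 4.0])            # ||g|| = 5
print(clip_by_norm(g, 1.0))         # -> [0.6 0.8], rescaled to norm 1
print(clip_by_norm(g, 10.0))        # -> [3. 4.], already below the threshold
```

Note that only the magnitude changes; the direction of the gradient is preserved (which is why the result is called "biased").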

BPTT Example