Recurrent Neural Networks

For a detailed explanation and summary, see: RNN Summary

Overview

  • Specifically designed to capture long-range dependencies in sequential data
  • 💡 Main idea: connect the hidden states of a layer together across time steps

Simple RNNs

Elman Networks

Screenshot 2020-08-21 00.09.01
  • The output of the hidden layer is used as input for the next time step
  • They use a copy mechanism: the previous hidden state $H$ is copied and fed into the next iteration of the network
    • They do NOT have true recurrence; there is no backpropagation through time

Jordan Nets

Screenshot 2020-08-21 00.15.46
  • Same copy mechanism as Elman networks, but the state is copied from the output layer

RNN Structure

Screenshot 2020-08-21 00.19.34

$$ H_{t}=f_{act}\left(W^{H} H_{t-1}+W^{X} X_{t}+b\right) $$
  • $f_{act}$: activation function (sigmoid, tanh, ReLU … )
  • Inside the brackets: a linear “voting” combination of the memory $H_{t-1}$ and the current input $X_t$
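As a concrete reference, a minimal NumPy sketch of this update rule (the dimensions, the function name `rnn_step`, and the use of tanh are illustrative assumptions, not part of the notes):

```python
import numpy as np

def rnn_step(H_prev, X_t, W_H, W_X, b, f_act=np.tanh):
    """One recurrent update: H_t = f_act(W^H H_{t-1} + W^X X_t + b)."""
    return f_act(W_H @ H_prev + W_X @ X_t + b)

# Toy dimensions: hidden size 4, input size 3, sequence length 5
rng = np.random.default_rng(0)
W_H = 0.1 * rng.normal(size=(4, 4))
W_X = 0.1 * rng.normal(size=(4, 3))
b = np.zeros(4)

H = np.zeros(4)                       # initial hidden state H_0
for X_t in rng.normal(size=(5, 3)):   # unroll over the sequence
    H = rnn_step(H, X_t, W_H, W_X, b)
print(H.shape)                        # (4,)
```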

Learning in RNNs: BPTT

One-to-One RNN

Screenshot 2020-08-21 09.30.48
  • Identical to an MLP

  • $W^H$ is non-existent or an identity function

  • Backprop works exactly like in an MLP

    • $\frac{\delta L}{\delta O}$ is directly computed from $L(O, Y)$

    $$ \begin{array}{l} \frac{\delta L}{\delta H_{1}}=\frac{\delta L}{\delta O} \frac{\delta O}{\delta H_{1}}=W^{O} \color{red}{\frac{\delta L}{\delta O}} \\ \frac{\delta L}{\delta W^{O}}={\color{red}{\frac{\delta L}{\delta O}}} \frac{\delta O}{\delta W^{O}}={\color{red}{\frac{\delta L}{\delta O}}} H^{T} \end{array} $$

    (Red terms are precomputed values in backpropagation)
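A minimal NumPy check of these two gradients, assuming a linear output layer $O = W^O H_1$ and a squared-error loss (the shapes and the loss choice are illustrative; note the transposes needed for the matrix shapes to line up):

```python
import numpy as np

rng = np.random.default_rng(0)
H1 = rng.normal(size=(4, 1))      # hidden state (column vector)
W_O = rng.normal(size=(2, 4))     # output weights
Y = rng.normal(size=(2, 1))       # target

O = W_O @ H1                      # output layer (linear)
L = 0.5 * np.sum((O - Y) ** 2)    # squared-error loss

dL_dO = O - Y                     # the precomputed "red" term
dL_dH1 = W_O.T @ dL_dO            # dL/dH_1 = (W^O)^T dL/dO
dL_dWO = dL_dO @ H1.T             # dL/dW^O = dL/dO * H^T
```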

Many-to-One RNN

Screenshot 2020-08-21 09.58.38
  • Often used for sequence-level classification

  • For example:

    • Movie review sentiment analysis
    • Predicting blood type from DNA
  • Input: sequence $I \in \mathbb{R}^{T \times D_I}$

  • Output: vector $O \in \mathbb{R}^{D_O}$

  • Backprop

    • The gradient signal is backpropagated through time by the recurrent connections
    • Shared weights: the weight matrices are duplicated over the time dimension
    • In each time step, the weights $\theta = (W^H, W^I, W^O)$ define a multi-layer perceptron.

    Screenshot 2020-08-21 10.01.30

    Screenshot 2020-08-21 10.08.30

    • The gradients are summed up over time $$ \begin{aligned} \frac{\delta L}{\delta W^{I}} &=\frac{\delta L}{\delta W_{1}^{I}}+\frac{\delta L}{\delta W_{2}^{I}}+\cdots+\frac{\delta L}{\delta W_{T}^{I}} \\ \\ \frac{\delta L}{\delta W^{H}} &=\frac{\delta L}{\delta W_{1}^{H}}+\frac{\delta L}{\delta W_{2}^{H}}+\cdots+\frac{\delta L}{\delta W_{T-1}^{H}} \end{aligned} $$
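A compact BPTT sketch for this setting, assuming tanh hidden units, a linear output read from the last hidden state, and a squared-error loss (these choices and the shapes are assumptions for illustration). The gradient contributions of the shared weights are accumulated over the time steps, as in the sums above:

```python
import numpy as np

def bptt_many_to_one(X, Y, W_H, W_I, W_O):
    """Many-to-one BPTT: X has shape (T, D_I, 1), Y has shape (D_O, 1)."""
    T = X.shape[0]
    H = [np.zeros((W_H.shape[0], 1))]        # H[0] is the initial hidden state
    for t in range(T):                       # forward unrolling through time
        H.append(np.tanh(W_H @ H[t] + W_I @ X[t]))
    O = W_O @ H[T]                           # single output after the last step
    dL_dO = O - Y                            # squared-error loss gradient

    dW_O = dL_dO @ H[T].T
    dW_H = np.zeros_like(W_H)
    dW_I = np.zeros_like(W_I)
    dH = W_O.T @ dL_dO                       # dL/dH_T
    for t in range(T, 0, -1):                # backward pass through time
        dZ = dH * (1.0 - H[t] ** 2)          # through the tanh nonlinearity
        dW_H += dZ @ H[t - 1].T              # shared weights: sum gradients over time
        dW_I += dZ @ X[t - 1].T
        dH = W_H.T @ dZ                      # send the signal one step further back
    return dW_O, dW_H, dW_I
```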

Many-to-Many RNN

Screenshot 2020-08-21 10.14.35
  • The inputs and outputs have the same number of elements

  • Example

    • Auto-regressive models
    • Character generation model
    • Pixel models
  • Backprop

    Screenshot 2020-08-21 10.26.26
    • Loss function $$ L = L_1 + L_2 + \cdots + L_T $$

    • $H_t$ cannot change $O_{k<t}$, so we have $$ \frac{\delta L}{\delta H_t} = \sum_{i \geq t}\frac{\delta L_i}{\delta H_t} $$ (we cannot change the past, so the terms with $i < t$ can be omitted)

      Screenshot 2020-08-21 10.29.12

    • After getting the $\frac{\delta L}{\delta H_t}$ terms, we can compute the necessary gradients for $\frac{\delta L}{\delta W^{O}}, \frac{\delta L}{\delta W^{H}}$ and $\frac{\delta L}{\delta W^{I}}$

    • For the shared weights $W^O$, apply the same principle as in the Many-to-One RNN.
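A sketch of the backward pass for this case (same assumptions as in the Many-to-One sketch: tanh hidden units, linear outputs, squared-error loss per step). The key line is the one that adds the local gradient at step $t$ to the gradient flowing back from the future steps:

```python
import numpy as np

def bptt_many_to_many(X, Y, W_H, W_I, W_O):
    """Many-to-many BPTT: X has shape (T, D_I, 1), Y has shape (T, D_O, 1)."""
    T = X.shape[0]
    H = [np.zeros((W_H.shape[0], 1))]
    O = []
    for t in range(T):                        # forward pass: one output per step
        H.append(np.tanh(W_H @ H[t] + W_I @ X[t]))
        O.append(W_O @ H[t + 1])

    dW_O = np.zeros_like(W_O)
    dW_H = np.zeros_like(W_H)
    dW_I = np.zeros_like(W_I)
    dH_future = np.zeros_like(H[0])           # gradient arriving from steps i > t
    for t in range(T, 0, -1):
        dL_dOt = O[t - 1] - Y[t - 1]          # local loss term L_t
        dW_O += dL_dOt @ H[t].T               # shared W^O: summed over time
        dH = W_O.T @ dL_dOt + dH_future       # dL/dH_t = local term + future terms
        dZ = dH * (1.0 - H[t] ** 2)
        dW_H += dZ @ H[t - 1].T
        dW_I += dZ @ X[t - 1].T
        dH_future = W_H.T @ dZ
    return dW_O, dW_H, dW_I
```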

One-to-Many RNN

Screenshot 2020-08-21 11.46.58
  • Less common than the Many-to-One setting

  • For example:

    • Image description generation
    • Music generation
  • Input: vector $I \in \mathbb{R}^{D_{I}}$

  • Output: sequence $O \in \mathbb{R}^{T \times D_{O}}$
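A minimal forward-pass sketch for this setting, under the assumption that the input vector only seeds the first hidden state and the recurrence then runs on its own (feeding the previous output back in as input is a common alternative not shown here):

```python
import numpy as np

def one_to_many_forward(I, T, W_H, W_I, W_O):
    """Unroll a one-to-many RNN: one input vector I produces T outputs."""
    H = np.tanh(W_I @ I)             # the input vector seeds the first hidden state
    outputs = []
    for _ in range(T):
        outputs.append(W_O @ H)      # emit one output per time step
        H = np.tanh(W_H @ H)         # the state evolves without further input
    return np.stack(outputs)         # shape (T, D_O, 1) for a column-vector I
```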

Problem of BPTT

Screenshot 2020-08-21 11.51.23

$$ \begin{array}{l} \frac{\delta L}{\delta H_{114}}=\frac{\delta L}{\delta H_{115}} \frac{\delta H_{115}}{\delta H_{114}} \\ \frac{\delta L}{\delta H_{113}}=\frac{\delta L}{\delta H_{115}} \frac{\delta H_{115}}{\delta H_{114}} \frac{\delta H_{114}}{\delta H_{113}}=W^{O} f^{\prime}\left(H_{115}\right) W^{H} f^{\prime}\left(H_{114}\right) W^{H} \end{array} $$

If we want to send signals back to step 50:

$$ \begin{aligned} \frac{\delta L}{\delta H_{50}} &=\frac{\delta L}{\delta H_{115}} \frac{\delta H_{115}}{\delta H_{114}} \cdots \frac{\delta H_{51}}{\delta H_{50}} \\ &=W^{O} f^{\prime}\left(H_{115}\right) W^{H} f^{\prime}\left(H_{114}\right) W^{H} f^{\prime}\left(H_{113}\right) W^{H} \cdots \end{aligned} $$

A factor of the form $f^{\prime}\left(H_{t}\right) W^{H}$ is repeated 65 times!!! 😱😱😱

Assuming the network is univariate, so that $W^H$ has a single element $w$, and $f$ is either ReLU or linear:

  • If $w = 1.1$: $\frac{\delta L}{\delta H_{50}}=117$

  • If $w = 0.9$: $\frac{\delta L}{\delta H_{50}}=0.005$
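A quick numeric check of this exponential behaviour: with a linear (or active ReLU) $f$, each extra backprop step multiplies the gradient by another factor of $w$, so the magnitude is roughly $w^{n}$ after $n$ steps (the quoted values 117 and 0.005 correspond to $w^{50}$):

```python
# Repeatedly multiplying by the same scalar factor w either explodes or vanishes.
for w in (1.1, 0.9):
    for n in (50, 65):
        print(f"w = {w}, n = {n}: w**n = {w ** n:.3g}")
# w = 1.1, n = 50: w**n = 117      -> gradient explosion
# w = 1.1, n = 65: w**n = 490
# w = 0.9, n = 50: w**n = 0.00515  -> gradient vanishing
# w = 0.9, n = 65: w**n = 0.00106
```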

In general

  • if the maximum eigenvalue of $W^H > 1$: $\frac{\delta L}{\delta H_{T}}$ is likely to explode (Gradient explosion)

    More see: Exploding Gradient Problem

  • if the maximum eigenvalue of $W^H < 1$: $\frac{\delta L}{\delta H_{T}}$ is likely to vanish (Gradient vanishing)

    • small gradients make the RNN unable to learn the important features to solve the problem

Gradient clipping

  • Simple solution for gradient explosion

  • Rescale gradients so that their norm is at most a particular value.

    • Clip the gradient $g = \frac{\delta L}{\delta w}$ as follows:

    $$ \text{if } |g|>\mu: \quad g \leftarrow \mu \frac{g}{|g|} \qquad \text{(biased gradients)} $$

Screenshot 2020-08-21 12.05.02
  • With gradient clipping, a pre-determined gradient threshold is introduced, and gradient norms that exceed this threshold are scaled down to match it. This prevents any gradient from having a norm greater than the threshold; that is, the gradients are clipped.

    More see: Gradient Clipping
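A minimal NumPy sketch of this rescaling rule (the threshold value and the toy gradient are made up for illustration):

```python
import numpy as np

def clip_gradient(g, mu):
    """Rescale g so that its norm is at most mu."""
    norm = np.linalg.norm(g)
    if norm > mu:
        g = mu * g / norm
    return g

g = np.array([3.0, 4.0])              # |g| = 5
print(clip_gradient(g, mu=1.0))       # [0.6 0.8]  -> norm clipped to 1
```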

BPTT Example

Next