Long Short-Term Memory (LSTM)
Motivation

- Memory cell
- In a plain RNN, inputs are “committed” into the hidden state 𝐻; later inputs gradually “erase” earlier ones
- The LSTM adds a memory “cell” 𝐶 for long-term storage
- The cell is also read and written at each step, but it is less affected by new inputs than 𝐻
LSTM Operations

- Forget gate
- Input gate
- Candidate content
- Output gate
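A minimal NumPy sketch of a single LSTM step, tying the four operations together (the weight names W_f, W_i, W_c, W_o, b_* and the shapes are generic placeholders, not taken from these notes):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM step: forget, write (input gate + candidate content), output."""
    z = np.concatenate([x, h_prev])                 # current input + previous hidden state

    f = sigmoid(p["W_f"] @ z + p["b_f"])            # forget gate: what to erase from the cell C
    i = sigmoid(p["W_i"] @ z + p["b_i"])            # input gate: how much new content to write
    c_tilde = np.tanh(p["W_c"] @ z + p["b_c"])      # candidate content
    o = sigmoid(p["W_o"] @ z + p["b_o"])            # output gate: how much of C to read out

    c = f * c_prev + i * c_tilde                    # new cell: forget old content, add new content
    h = o * np.tanh(c)                              # new hidden state H: gated read of the cell
    return h, c

# Tiny usage example with random weights
rng = np.random.default_rng(0)
input_dim, hidden_dim = 4, 3
p = {w: rng.normal(size=(hidden_dim, input_dim + hidden_dim)) for w in ["W_f", "W_i", "W_c", "W_o"]}
p.update({b: np.zeros(hidden_dim) for b in ["b_f", "b_i", "b_c", "b_o"]})
h, c = lstm_step(rng.normal(size=input_dim), np.zeros(hidden_dim), np.zeros(hidden_dim), p)
```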
Forget

Forget: remove information from the cell 𝐶
What to forget depends on:
- the current input
- the previous hidden state
Forget gate: controls what should be forgotten
Content to forget:
- gate values near 0: the corresponding content stored in 𝐶 is forgotten
- gate values near 1: the content is kept
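A standard formulation of the forget gate (the weight names $W_f$, $U_f$, $b_f$ are generic):

$$
f_t = \sigma\!\left(W_f x_t + U_f H_{t-1} + b_f\right), \qquad f_t \odot C_{t-1}
$$

Each entry of $f_t$ lies in $(0, 1)$: entries near 0 erase the corresponding entries of $C_{t-1}$, entries near 1 keep them.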
Write

Write: add new information to the cell 𝐶
What to write depends on:
- the current input
- the previous hidden state
Input gate: controls what should be added
Content to write (candidate content):
- computed from the current input and the previous hidden state, scaled by the input gate, and added to 𝐶
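A standard formulation of the write step (again with generic weight names): the input gate $i_t$ scales the candidate content $\tilde{C}_t$ before it is added to the cell:

$$
i_t = \sigma\!\left(W_i x_t + U_i H_{t-1} + b_i\right), \qquad
\tilde{C}_t = \tanh\!\left(W_C x_t + U_C H_{t-1} + b_C\right), \qquad
C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t
$$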
Output

Output: read information from the cell 𝐶 (to store in the current hidden state 𝐻)
How much to output depends on:
- the current input
- the previous hidden state
Output gate: controls what should be output
New state: the output gate applied to the (squashed) cell content gives the new hidden state 𝐻
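A standard formulation of the output step: the output gate $o_t$ controls how much of the squashed cell content becomes the new hidden state:

$$
o_t = \sigma\!\left(W_o x_t + U_o H_{t-1} + b_o\right), \qquad
H_t = o_t \odot \tanh(C_t)
$$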
LSTM Gradients
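
The original derivation is not reproduced here; a brief sketch of the usual argument: along the direct cell-to-cell path, the update $C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$ gives

$$
\frac{\partial C_t}{\partial C_{t-1}} = \mathrm{diag}(f_t) \quad \text{(along the cell path, treating the gates as fixed)}
$$

so the gradient is rescaled elementwise by gate values rather than repeatedly multiplied by the same recurrent weight matrix, which makes vanishing/exploding gradients easier to control than in a vanilla RNN.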

Truncated Backpropagation
What happens if the sequence is really long (e.g. character sequences, DNA sequences, video frame sequences, …)? Back-propagation through time becomes prohibitively expensive for large sequence lengths.
Solution: Truncated Backpropagation

- Divide the sequence into segments and truncate back-propagation between segments
- However, the hidden and cell states are carried over between segments, so some information about the past is retained rather than being reset (see the sketch below)
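A minimal PyTorch sketch of this scheme (the model, dimensions, and loss are illustrative placeholders, not from these notes): the hidden and cell states are carried across segment boundaries but detached, so back-propagation is truncated at each boundary while the memory of the past is kept.

```python
import torch
import torch.nn as nn

# Illustrative sizes: a long sequence split into short segments
input_size, hidden_size, seq_len, segment_len, batch = 8, 16, 1000, 50, 4

lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
head = nn.Linear(hidden_size, 1)
optimizer = torch.optim.Adam(list(lstm.parameters()) + list(head.parameters()), lr=1e-3)

x = torch.randn(batch, seq_len, input_size)   # a long input sequence (dummy data)
y = torch.randn(batch, seq_len, 1)            # dummy targets

h = torch.zeros(1, batch, hidden_size)        # (num_layers, batch, hidden)
c = torch.zeros(1, batch, hidden_size)

for start in range(0, seq_len, segment_len):
    x_seg = x[:, start:start + segment_len]
    y_seg = y[:, start:start + segment_len]

    # Carry the states across segments but cut the graph: gradients do not flow
    # past the segment boundary, while the state values still summarize the past.
    h, c = h.detach(), c.detach()

    out, (h, c) = lstm(x_seg, (h, c))
    loss = nn.functional.mse_loss(head(out), y_seg)

    optimizer.zero_grad()
    loss.backward()      # back-propagate only within this segment
    optimizer.step()
```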
TDNN vs. LSTM
| | TDNN | LSTM |
|---|---|---|
| Weights are shared over time? | Yes | Yes |
| Handles variable-length sequences? | Can flexibly adapt to variable-length sequences without changing the structure | Yes (the recurrence runs for as many steps as the input has) |
| Gradient vanishing or exploding problem? | No | Yes |
| Parallelism? | Can be parallelized over time steps, i.e. 𝑂(1) sequential depth (assuming matrix multiplication cost is O(1) thanks to GPUs…) | Sequential computation (the hidden state at step 𝑡 cannot be computed before the hidden state at step 𝑡−1) |

