TL;DR Transformer: High-Level Look Let's begin by looking at the model as a single black box. In a machine translation application, it would take a sentence in one language and output its translation in another.
2020-08-23
Core Idea The main assumption in sequence modelling networks such as RNNs, LSTMs and GRUs is that the current state holds information for the whole input seen so far. Hence the final state of an RNN, after reading the whole input sequence, should contain complete information about that sequence.
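This folding of a sequence into a single state can be sketched with a minimal vanilla RNN cell; the dimensions and random weights below are illustrative assumptions, not a trained model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions for illustration.
input_dim, hidden_dim, seq_len = 4, 8, 5

# Random (untrained) weights of a single-layer vanilla RNN.
W_xh = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
b_h = np.zeros(hidden_dim)

def rnn_encode(inputs):
    """Fold a sequence into a single hidden state, step by step."""
    h = np.zeros(hidden_dim)
    for x in inputs:
        # Each step mixes the previous state with the current input,
        # so h acts as a running summary of everything read so far.
        h = np.tanh(W_xh @ x + W_hh @ h + b_h)
    return h

sequence = rng.normal(size=(seq_len, input_dim))
final_state = rnn_encode(sequence)
print(final_state.shape)  # the whole sequence compressed into one vector
```

Whatever the sequence length, the summary is a single fixed-size vector, which is exactly the bottleneck that attention-based models later relax.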
2020-08-16
Language Modeling A language model computes the probability of a word sequence by factoring it with the chain rule:

$$
\begin{aligned}
P(W) &= P(W_1 W_2 \dots W_n) \\
&= P(W_1)\, P(W_2 \mid W_1)\, P(W_3 \mid W_1 W_2) \cdots P(W_n \mid W_{1 \dots n-1})
\end{aligned}
$$

Each conditional probability is typically produced by a softmax layer over the vocabulary.
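The chain rule above can be sketched in code. This is a toy model, not a real LM: the vocabulary, the random weight matrix standing in for a network's output layer, and the bigram-style conditioning (each word conditioned only on the previous one, rather than the full history) are all simplifying assumptions:

```python
import numpy as np

# Toy vocabulary for illustration only.
vocab = ["<s>", "the", "cat", "sat", "</s>"]
V = len(vocab)
rng = np.random.default_rng(0)

# Hypothetical logits: row i gives the scores for the next word
# after word i, standing in for a real model's output layer.
W = rng.normal(size=(V, V))

def softmax(z):
    """Softmax layer: turn logits into a probability distribution."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def sequence_log_prob(words):
    """Chain rule: log P(W) = sum_i log P(w_i | history)."""
    log_p = 0.0
    prev = vocab.index("<s>")
    for w in words:
        probs = softmax(W[prev])   # distribution over the vocabulary
        i = vocab.index(w)
        log_p += np.log(probs[i])  # multiplying probabilities = adding logs
        prev = i
    return log_p

lp = sequence_log_prob(["the", "cat", "sat", "</s>"])
print(lp)  # log-probability; always negative for probabilities < 1
```

Working in log space avoids the numerical underflow that multiplying many small probabilities would cause.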
2020-08-16