Summary (TL;DR)

Language models

  • offer a way to assign a probability to a sentence or other sequence of words, and to predict a word from preceding words.
    • $P(w|h)$: Probability of word $w$ given history $h$

n-gram model

  • estimate words from a fixed window of previous words

    $$ P\left(w_{n} | w_{1}^{n-1}\right) \approx P\left(w_{n} | w_{n-N+1}^{n-1}\right) $$
  • n-gram probabilities can be estimated by counting in a corpus and normalizing (MLE)

    $$ P\left(w_{n} | w_{n-N+1}^{n-1}\right)=\frac{C\left(w_{n-N+1}^{n-1} w_{n}\right)}{C\left(w_{n-N+1}^{n-1}\right)} $$
  • Evaluation

    • Extrinsically in some task (expensive!)

    • Intrinsically using perplexity

      • perplexity of a test set according to a language model: the inverse probability the model assigns to the test set, normalized by the number of words (equivalently, the geometric mean of the inverse per-word probabilities); see the sketch after this list. $$ \begin{array}{ll} \operatorname{PP}(W) &=P\left(w_{1} w_{2} \ldots w_{N}\right)^{-\frac{1}{N}} \\\\ &=\sqrt[N]{\frac{1}{P\left(w_{1} w_{2} \ldots w_{N}\right)}} \\\\ &\overset{\text{chain rule}}{=} \sqrt[N]{\displaystyle\prod_{i=1}^{N} \frac{1}{P\left(w_{i} | w_{1} \ldots w_{i-1}\right)}} \end{array} $$
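
For concreteness, here is a minimal Python sketch of both ideas above: bigram probabilities estimated by counting and normalizing (the MLE), and perplexity computed on a small test set. The toy corpus, the `<s>`/`</s>` boundary markers, and the function names are illustrative assumptions, not part of the original text.

```python
import math
from collections import Counter

# Toy training corpus (illustrative); <s> and </s> mark sentence boundaries.
train = [
    "<s> i am sam </s>".split(),
    "<s> sam i am </s>".split(),
    "<s> i do not like green eggs and ham </s>".split(),
]

# Count unigrams and bigrams in a single pass over the corpus.
unigram_counts = Counter(w for sent in train for w in sent)
bigram_counts = Counter(bg for sent in train for bg in zip(sent, sent[1:]))

def p_mle(w, prev):
    """MLE bigram probability: C(prev w) / C(prev)."""
    return bigram_counts[(prev, w)] / unigram_counts[prev]

def perplexity(test_sentences):
    """PP(W) = exp(-(1/N) * sum_i log P(w_i | w_{i-1})), i.e. the Nth root of
    the inverse test-set probability under the bigram model."""
    log_prob, n = 0.0, 0
    for sent in test_sentences:
        for prev, w in zip(sent, sent[1:]):
            log_prob += math.log(p_mle(w, prev))  # an unseen bigram would crash here -> motivates smoothing
            n += 1
    return math.exp(-log_prob / n)

print(p_mle("am", "i"))                           # C(i am) / C(i) = 2/3
print(perplexity(["<s> i am sam </s>".split()]))  # ≈ 1.73 on this toy test sentence
```

Working in log space avoids numerical underflow when the product in the perplexity formula runs over many words.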

Smoothing

provides a more sophisticated way to estimate the probability of n-grams, in particular assigning non-zero probability to n-grams that never occur in the training corpus

  • Laplace (add-one) smoothing (a sketch covering both add-one and add-k follows this list)

    $$ P_{\text {Laplace}}\left(w_{i}\right)=\frac{c_{i}+1}{N+V} $$
    • $c_i$: count of word $w_i$ in the training corpus; $N$: total number of word tokens; $V$: number of words in the vocabulary
  • Add-k smoothing

    $$ P_{\mathrm{Add}-\mathrm{k}}^{*}\left(w_{n} | w_{n-1}\right)=\frac{C\left(w_{n-1} w_{n}\right)+k}{C\left(w_{n-1}\right)+k V} $$
  • Backoff or interpolation

    • Rely on lower-order n-gram counts: backoff uses them only when the higher-order count is zero, while interpolation always mixes the estimates of all orders (see the interpolation sketch after this list)
  • Kneser-Ney smoothing

    • makes use of the probability of a word being a novel continuation, estimated from the number of distinct contexts the word follows rather than its raw frequency (see the continuation-probability sketch after this list)
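
A minimal sketch of add-k smoothed bigram estimation; Laplace (add-one) smoothing is the special case $k=1$. The counts and the vocabulary size used here are illustrative assumptions, standing in for counts built from a real training corpus.

```python
from collections import Counter

def p_add_k(w, prev, bigram_counts, unigram_counts, vocab_size, k=1.0):
    """Add-k smoothed bigram probability: (C(prev w) + k) / (C(prev) + k * V)."""
    return (bigram_counts[(prev, w)] + k) / (unigram_counts[prev] + k * vocab_size)

# Illustrative counts; in practice these come from a training corpus.
unigram_counts = Counter({"i": 3, "am": 2, "sam": 2})
bigram_counts = Counter({("i", "am"): 2, ("am", "sam"): 1})
V = 12  # assumed vocabulary size

print(p_add_k("am", "i", bigram_counts, unigram_counts, V, k=1.0))   # (2 + 1) / (3 + 12) = 0.2
print(p_add_k("ham", "i", bigram_counts, unigram_counts, V, k=0.5))  # unseen bigram still gets probability mass
```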
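
A minimal sketch of simple linear interpolation between bigram and unigram MLE estimates. The fixed weight $\lambda = 0.7$ and the toy counts are illustrative assumptions; in practice the weights are tuned on held-out data, and backoff would instead fall back to the unigram estimate only when the bigram count is zero.

```python
from collections import Counter

def p_interp(w, prev, bigram_counts, unigram_counts, total_tokens, lam=0.7):
    """Linear interpolation: lam * P_MLE(w | prev) + (1 - lam) * P_MLE(w)."""
    p_bigram = bigram_counts[(prev, w)] / unigram_counts[prev] if unigram_counts[prev] else 0.0
    p_unigram = unigram_counts[w] / total_tokens
    return lam * p_bigram + (1 - lam) * p_unigram

# Illustrative counts; lam would normally be chosen to maximize held-out likelihood.
unigram_counts = Counter({"i": 3, "am": 2, "sam": 2})
bigram_counts = Counter({("i", "am"): 2})
total_tokens = sum(unigram_counts.values())

print(p_interp("am", "i", bigram_counts, unigram_counts, total_tokens))   # both orders contribute
print(p_interp("sam", "i", bigram_counts, unigram_counts, total_tokens))  # unseen bigram, unigram term keeps it nonzero
```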
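
A minimal sketch of the continuation probability that Kneser-Ney builds on: how likely a word is to appear in a new context, estimated from the number of distinct bigram types it completes rather than its raw frequency. The toy bigram counts are illustrative assumptions, and the full Kneser-Ney estimate (absolute discounting interpolated with this continuation distribution) is not shown.

```python
from collections import Counter

def p_continuation(w, bigram_counts):
    """P_CONTINUATION(w): number of distinct word types that precede w,
    divided by the total number of distinct bigram types."""
    preceding_types = len({prev for (prev, nxt) in bigram_counts if nxt == w})
    return preceding_types / len(bigram_counts)

# Toy counts: "francisco" is frequent but only ever follows "san", so it is a
# poor novel continuation; "glasses" follows many different words.
bigram_counts = Counter({
    ("san", "francisco"): 50,
    ("the", "glasses"): 2,
    ("his", "glasses"): 1,
    ("reading", "glasses"): 1,
})

print(p_continuation("francisco", bigram_counts))  # 1 / 4
print(p_continuation("glasses", bigram_counts))    # 3 / 4
```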