Summary (TL;DR)

Language models

  • offer a way to assign a probability to a sentence or other sequence of words, and to predict a word from preceding words.
    • P(w|h): Probability of word w given history h

n-gram model

  • estimate the probability of a word from a fixed window of previous words

    P\left(w_{n} | w_{1}^{n-1}\right) \approx P\left(w_{n} | w_{n-N+1}^{n-1}\right)
  • n-gram probabilities can be estimated by counting in a corpus and normalizing (maximum likelihood estimation, MLE); see the sketch after this list

    P\left(w_{n} | w_{n-N+1}^{n-1}\right)=\frac{C\left(w_{n-N+1}^{n-1} w_{n}\right)}{C\left(w_{n-N+1}^{n-1}\right)}
  • Evaluation

    • Extrinsically in some task (expensive!)

    • Intrinsically using perplexity

      • perplexity of a test set according to a language model: the N-th root of the inverse probability of the test set (equivalently, the geometric mean of the inverse conditional word probabilities); lower is better. See the sketch after this list.

        \begin{array}{ll} \operatorname{PP}(W) &=P\left(w_{1} w_{2} \ldots w_{N}\right)^{-\frac{1}{N}} \\\\ &=\sqrt[N]{\frac{1}{P\left(w_{1} w_{2} \ldots w_{N}\right)}} \\\\ &\overset{\text{chain rule}}{=} \sqrt[N]{\displaystyle\prod_{i=1}^{N} \frac{1}{P\left(w_{i} | w_{1} \ldots w_{i-1}\right)}} \end{array}
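
To make the counting-and-normalizing (MLE) step and the perplexity formula concrete, here is a minimal Python sketch. The toy three-sentence corpus, the choice of bigrams (N = 2), and all function names are illustrative assumptions, not part of the original notes.

```python
from collections import Counter

# Toy training corpus (illustrative); <s>/</s> mark sentence boundaries.
train = [
    "<s> i am sam </s>".split(),
    "<s> sam i am </s>".split(),
    "<s> i do not like green eggs and ham </s>".split(),
]

# MLE: count each bigram and its history, then normalize.
bigram_counts = Counter()
history_counts = Counter()
for sent in train:
    for w1, w2 in zip(sent, sent[1:]):
        bigram_counts[(w1, w2)] += 1
        history_counts[w1] += 1

def p_mle(w, prev):
    """P(w | prev) = C(prev w) / C(prev)."""
    return bigram_counts[(prev, w)] / history_counts[prev]

def perplexity(test_sentences):
    """N-th root of the inverse probability of the test set,
    where N is the number of bigram predictions made."""
    inv_prob, n = 1.0, 0
    for sent in test_sentences:
        for w1, w2 in zip(sent, sent[1:]):
            inv_prob *= 1.0 / p_mle(w2, w1)  # unseen bigrams break this: motivates smoothing
            n += 1
    return inv_prob ** (1.0 / n)

print(p_mle("am", "i"))                           # 2/3 on this toy corpus
print(perplexity(["<s> i am sam </s>".split()]))  # about 1.73 here; lower is better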

Smoothing

provides more sophisticated estimates of n-gram probabilities than raw MLE, reserving probability mass for n-grams that never occur in the training corpus; the sketches after this list illustrate add-k smoothing, interpolation, and the Kneser-Ney continuation count

  • Laplace (Add-one) smoothing

    P_{\text{Laplace}}\left(w_{i}\right)=\frac{c_{i}+1}{N+V}
    • V: Number of words in the vocabulary; N: Number of word tokens in the corpus
  • Add-k smoothing

    P_{\mathrm{Add}-\mathrm{k}}^{*}\left(w_{n} | w_{n-1}\right)=\frac{C\left(w_{n-1} w_{n}\right)+k}{C\left(w_{n-1}\right)+k V}
  • Backoff or interpolation

    • Rely on lower-order n-gram estimates: backoff falls back to a lower-order n-gram only when the higher-order count is zero, while interpolation always mixes estimates from all orders
  • Kneser-Ney smoothing

    • makes use of the probability of a word appearing as a novel continuation, estimated from the number of distinct contexts the word follows rather than from its raw frequency
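
A minimal sketch of the add-k formula above (Laplace smoothing when k = 1); the toy counts and function names are assumptions made for illustration.

```python
from collections import Counter

# Toy bigram and history counts (illustrative assumption).
bigram_counts = Counter({("i", "am"): 2, ("i", "do"): 1, ("am", "sam"): 1})
unigram_counts = Counter({"i": 3, "am": 2, "sam": 2, "do": 1})
V = len(unigram_counts)  # vocabulary size

def p_add_k(w, prev, k=1.0):
    """Add-k smoothed bigram: (C(prev w) + k) / (C(prev) + k*V).
    k = 1 gives Laplace (add-one) smoothing."""
    return (bigram_counts[(prev, w)] + k) / (unigram_counts[prev] + k * V)

print(p_add_k("am", "i"))          # seen bigram, k = 1
print(p_add_k("sam", "i", k=0.5))  # unseen bigram still gets nonzero probability
```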
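
And a small sketch of the interpolation and Kneser-Ney continuation ideas, again with assumed toy counts and names. The interpolation weight lam is fixed by hand here (in practice it is tuned on held-out data), and full Kneser-Ney additionally applies absolute discounting, which this sketch omits.

```python
from collections import Counter

# Toy counts (illustrative assumption).
bigram_counts = Counter({("i", "am"): 2, ("i", "do"): 1, ("am", "sam"): 1,
                         ("sam", "i"): 1, ("do", "not"): 1})
unigram_counts = Counter({"i": 3, "am": 2, "sam": 2, "do": 1, "not": 1})
total_tokens = sum(unigram_counts.values())

def p_interp(w, prev, lam=0.7):
    """Linear interpolation of bigram and unigram MLE estimates:
    P(w | prev) = lam * P_bigram(w | prev) + (1 - lam) * P_unigram(w)."""
    p_bigram = (bigram_counts[(prev, w)] / unigram_counts[prev]) if unigram_counts[prev] else 0.0
    p_unigram = unigram_counts[w] / total_tokens
    return lam * p_bigram + (1 - lam) * p_unigram

def p_continuation(w):
    """Kneser-Ney continuation probability: the fraction of distinct bigram
    types that w completes, i.e. how readily w appears as a novel continuation,
    rather than its raw frequency."""
    return sum(1 for (_, w2) in bigram_counts if w2 == w) / len(bigram_counts)

print(p_interp("am", "i"))     # mixes bigram and unigram evidence
print(p_continuation("am"))    # based on distinct preceding contexts
```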