Summary (TL;DR)

Language models

  • offer a way to assign a probability to a sentence or other sequence of words, and to predict a word from preceding words.
    • P(w|h): Probability of word w given history h

n-gram model

  • estimate the probability of a word from a fixed window of previous words

    P\left(w_{n} | w_{1}^{n-1}\right) \approx P\left(w_{n} | w_{n-N+1}^{n-1}\right)
  • n-gram probabilities can be estimated by counting in a corpus and normalizing (maximum likelihood estimation, MLE); see the sketch after this list

    P\left(w_{n} | w_{n-N+1}^{n-1}\right)=\frac{C\left(w_{n-N+1}^{n-1} w_{n}\right)}{C\left(w_{n-N+1}^{n-1}\right)}
  • Evaluation

    • Extrinsically in some task (expensive!)

    • Intrinsically using perplexity

      • perplexity of a test set according to a language model: the N-th root of the inverse probability of the test set (equivalently, the geometric mean of the inverse conditional word probabilities); lower is better. See the sketch after this list.

        \begin{array}{ll} \operatorname{PP}(W) &=P\left(w_{1} w_{2} \ldots w_{N}\right)^{-\frac{1}{N}} \\\\ &=\sqrt[N]{\frac{1}{P\left(w_{1} w_{2} \ldots w_{N}\right)}} \\\\ &\overset{\text{chain rule}}{=} \sqrt[N]{\displaystyle\prod_{i=1}^{N} \frac{1}{P\left(w_{i} | w_{1} \ldots w_{i-1}\right)}} \end{array}
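
To make the counting-and-normalizing (MLE) step and the perplexity formula concrete, here is a minimal Python sketch. The toy three-sentence corpus, the choice of bigrams (N = 2), and all function names are illustrative assumptions, not part of the original notes.

```python
from collections import Counter

# Toy training corpus (illustrative); <s>/</s> mark sentence boundaries.
train = [
    "<s> i am sam </s>".split(),
    "<s> sam i am </s>".split(),
    "<s> i do not like green eggs and ham </s>".split(),
]

# MLE: count each bigram and its history, then normalize.
bigram_counts = Counter()
history_counts = Counter()
for sent in train:
    for w1, w2 in zip(sent, sent[1:]):
        bigram_counts[(w1, w2)] += 1
        history_counts[w1] += 1

def p_mle(w, prev):
    """P(w | prev) = C(prev w) / C(prev)."""
    return bigram_counts[(prev, w)] / history_counts[prev]

def perplexity(test_sentences):
    """N-th root of the inverse probability of the test set,
    where N is the number of bigram predictions made."""
    inv_prob, n = 1.0, 0
    for sent in test_sentences:
        for w1, w2 in zip(sent, sent[1:]):
            inv_prob *= 1.0 / p_mle(w2, w1)  # unseen bigrams break this: motivates smoothing
            n += 1
    return inv_prob ** (1.0 / n)

print(p_mle("am", "i"))                           # 2/3 on this toy corpus
print(perplexity(["<s> i am sam </s>".split()]))  # about 1.73 here; lower is better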

Smoothing

provides more sophisticated estimates of n-gram probabilities than raw MLE, reserving probability mass for n-grams that never occur in the training corpus; the sketches after this list illustrate add-k smoothing, interpolation, and the Kneser-Ney continuation count

  • Laplace (Add-one) smoothing

    P_{\text{Laplace}}\left(w_{i}\right)=\frac{c_{i}+1}{N+V}
    • V: Number of words in the vocabulary; N: Number of word tokens in the corpus
  • Add-k smoothing

    P_{\mathrm{Add}-\mathrm{k}}^{*}\left(w_{n} | w_{n-1}\right)=\frac{C\left(w_{n-1} w_{n}\right)+k}{C\left(w_{n-1}\right)+k V}
  • Backoff or interpolation

    • Rely on lower-order n-gram estimates: backoff falls back to a lower-order n-gram only when the higher-order count is zero, while interpolation always mixes estimates from all orders
  • Kneser-Ney smoothing

    • makes use of the probability of a word appearing as a novel continuation, estimated from the number of distinct contexts the word follows rather than from its raw frequency
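
A minimal sketch of the add-k formula above (Laplace smoothing when k = 1); the toy counts and function names are assumptions made for illustration.

```python
from collections import Counter

# Toy bigram and history counts (illustrative assumption).
bigram_counts = Counter({("i", "am"): 2, ("i", "do"): 1, ("am", "sam"): 1})
unigram_counts = Counter({"i": 3, "am": 2, "sam": 2, "do": 1})
V = len(unigram_counts)  # vocabulary size

def p_add_k(w, prev, k=1.0):
    """Add-k smoothed bigram: (C(prev w) + k) / (C(prev) + k*V).
    k = 1 gives Laplace (add-one) smoothing."""
    return (bigram_counts[(prev, w)] + k) / (unigram_counts[prev] + k * V)

print(p_add_k("am", "i"))          # seen bigram, k = 1
print(p_add_k("sam", "i", k=0.5))  # unseen bigram still gets nonzero probability
```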
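
And a small sketch of the interpolation and Kneser-Ney continuation ideas, again with assumed toy counts and names. The interpolation weight lam is fixed by hand here (in practice it is tuned on held-out data), and full Kneser-Ney additionally applies absolute discounting, which this sketch omits.

```python
from collections import Counter

# Toy counts (illustrative assumption).
bigram_counts = Counter({("i", "am"): 2, ("i", "do"): 1, ("am", "sam"): 1,
                         ("sam", "i"): 1, ("do", "not"): 1})
unigram_counts = Counter({"i": 3, "am": 2, "sam": 2, "do": 1, "not": 1})
total_tokens = sum(unigram_counts.values())

def p_interp(w, prev, lam=0.7):
    """Linear interpolation of bigram and unigram MLE estimates:
    P(w | prev) = lam * P_bigram(w | prev) + (1 - lam) * P_unigram(w)."""
    p_bigram = (bigram_counts[(prev, w)] / unigram_counts[prev]) if unigram_counts[prev] else 0.0
    p_unigram = unigram_counts[w] / total_tokens
    return lam * p_bigram + (1 - lam) * p_unigram

def p_continuation(w):
    """Kneser-Ney continuation probability: the fraction of distinct bigram
    types that w completes, i.e. how readily w appears as a novel continuation,
    rather than its raw frequency."""
    return sum(1 for (_, w2) in bigram_counts if w2 == w) / len(bigram_counts)

print(p_interp("am", "i"))     # mixes bigram and unigram evidence
print(p_continuation("am"))    # based on distinct preceding contexts
```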