Summary (TL;DR)
Language models
- offer a way to assign a probability to a sentence or other sequence of words, and to predict a word from preceding words.
- $P(w \mid h)$: probability of a word $w$ given its history $h$ of preceding words
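- written out with the chain rule (the standard decomposition), the probability of a whole sequence is a product of word-given-history probabilities:

$$
P(w_1 w_2 \dots w_N) = \prod_{i=1}^{N} P(w_i \mid w_1 \dots w_{i-1})
$$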
n-gram model
- estimates the probability of a word from a fixed window of previous words
- n-gram probabilities can be estimated by counting in a corpus and normalizing (maximum likelihood estimation, MLE), as sketched below
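- a minimal sketch of MLE bigram estimation, assuming a toy corpus and `<s>`/`</s>` boundary markers (both illustrative):

```python
from collections import Counter, defaultdict

def train_bigram_mle(sentences):
    """Estimate P(w_i | w_{i-1}) by counting bigrams and normalizing (MLE)."""
    bigram_counts = defaultdict(Counter)
    for sent in sentences:
        tokens = ["<s>"] + sent.split() + ["</s>"]   # sentence-boundary padding
        for prev, curr in zip(tokens, tokens[1:]):
            bigram_counts[prev][curr] += 1
    # Normalize each row: P(curr | prev) = C(prev curr) / C(prev)
    return {prev: {curr: c / sum(row.values()) for curr, c in row.items()}
            for prev, row in bigram_counts.items()}

# Toy corpus (illustrative only)
probs = train_bigram_mle(["I am Sam", "Sam I am", "I do not like green eggs and ham"])
print(probs["I"])   # {'am': 0.666..., 'do': 0.333...}
```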
Evaluation
- extrinsically, by measuring performance in some downstream task (expensive!)
- intrinsically, using perplexity
- perplexity of a test set according to a language model: the geometric mean of the inverse test set probability computed by the model.
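- written out for a test set $W = w_1 w_2 \dots w_N$, with $N$ the number of test-set tokens:

$$
\mathrm{PP}(W) = P(w_1 w_2 \dots w_N)^{-\frac{1}{N}} = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_1 \dots w_{i-1})}}
$$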
Smoothing
- provides a more sophisticated way to estimate the probability of n-grams, especially those never seen in training
Laplace (add-one) smoothing
- $V$: the number of words in the vocabulary
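- for bigrams, pretending every bigram was seen once more gives the standard add-one form:

$$
P_{\mathrm{Laplace}}(w_i \mid w_{i-1}) = \frac{C(w_{i-1} w_i) + 1}{C(w_{i-1}) + V}
$$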
Add-k smoothing
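- the same form with a fractional count $k$ (e.g. $0 < k < 1$) in place of 1:

$$
P_{\mathrm{Add\text{-}}k}(w_i \mid w_{i-1}) = \frac{C(w_{i-1} w_i) + k}{C(w_{i-1}) + kV}
$$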
Backoff or interpolation
- rely on lower-order n-gram counts when higher-order counts are sparse or zero
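- in simple linear interpolation, for example, trigram, bigram, and unigram estimates are mixed with weights that sum to one (the weights are typically tuned on held-out data):

$$
\hat{P}(w_i \mid w_{i-2} w_{i-1}) = \lambda_1 P(w_i \mid w_{i-2} w_{i-1}) + \lambda_2 P(w_i \mid w_{i-1}) + \lambda_3 P(w_i), \qquad \lambda_1 + \lambda_2 + \lambda_3 = 1
$$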
Kneser-Ney smoothing
- makes use of the probability of a word being a novel continuation, i.e., how many distinct contexts the word has appeared in
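- one standard way to write the continuation probability is as the fraction of distinct bigram types that the word completes:

$$
P_{\mathrm{CONTINUATION}}(w) = \frac{\left|\{\, v : C(v\,w) > 0 \,\}\right|}{\left|\{\, (u', w') : C(u'\,w') > 0 \,\}\right|}
$$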