Smoothing

To keep a language model from assigning zero probability to these unseen events, we’ll have to shave off a bit of probability mass from some more frequent events and give it to the events we’ve never seen.

Laplace smoothing (Add-1 smoothing)

πŸ’‘ Idea: add one to all the bigram counts, before we normalize them into probabilities.

  • does not perform well enough to be used in modern n-gram models πŸ€ͺ, but
  • usefully introduces many of the concepts
  • gives a useful baseline
  • a practical smoothing algorithm for other tasks like text classification

Unsmoothed maximum likelihood estimate of the unigram probability of the word $w_i$: its count $c_i$ normalized by the total number of word tokens $N$

$$P(w_i) = \frac{c_i}{N}$$

Laplace smoothed:

  • Merely adds one to each count
  • Since there are VV words in the vocabulary and each one was incremented, we also need to adjust the denominator to take into account the extra VV observations.
PLaplace(wi)=ci+1N+V P_{\text {Laplace}}\left(w_{i}\right)=\frac{c_{i}+1}{N+V}

Adjusted count $c^*$:

$$c_i^* = (c_i + 1)\frac{N}{N+V}$$
  • easier to compare directly with the MLE counts and can be turned into a probability like an MLE count by normalizing by NN

    cIβˆ—N=(ci+1)NN+Vβ‹…1N=ci+1N+V=PLaplace(wi)\frac{c_I^*}{N} = \left(c_{i}+1\right) \frac{N}{N+V} \cdot \frac{1}{N} =\frac{c_{i}+1}{N+V} = P_{\text {Laplace}}\left(w_{i}\right)
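To make the formulas above concrete, here is a minimal Python sketch (the toy corpus, the extra word "dog", and the function name are invented for the example, not taken from the notes):

```python
from collections import Counter

def unigram_probs(tokens, vocab):
    """MLE and add-one (Laplace) unigram estimates for every word in vocab."""
    counts = Counter(tokens)
    N = len(tokens)   # total number of word tokens
    V = len(vocab)    # vocabulary size
    mle = {w: counts[w] / N for w in vocab}
    laplace = {w: (counts[w] + 1) / (N + V) for w in vocab}
    # adjusted count c* = (c + 1) * N / (N + V); dividing it by N gives P_Laplace back
    adjusted = {w: (counts[w] + 1) * N / (N + V) for w in vocab}
    return mle, laplace, adjusted

toy = "the cat sat on the mat".split()
vocab = set(toy) | {"dog"}   # "dog" is unseen, so its MLE probability is 0
mle, laplace, adjusted = unigram_probs(toy, vocab)
print(mle["dog"], laplace["dog"])   # 0.0 vs. a small non-zero probability
```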

Another aspect of smoothing

A related way to view smoothing is as discounting (lowering) some non-zero counts in order to get the probability mass that will be assigned to the zero counts.

Thus, instead of referring to the discounted counts $c^*$, we might describe a smoothing algorithm in terms of a relative discount $d_c$:

$$d_c = \frac{c^*}{c}$$

(the ratio of the discounted counts to the original counts)

Example

Smooth the Berkeley Restaurant Project bigrams

  • Original(unsmoothed) bigram counts and probabilities

  • Add-one smoothed counts and probabilities


Computation:

  • Recall: normal bigram probabilities are computed by normalizing each row of counts by the unigram count

    $$P(w_n|w_{n-1}) = \frac{C(w_{n-1}w_n)}{C(w_{n-1})}$$
  • add-one smoothed bigram: augment the unigram count by the number of total word types in the vocabulary

    $$P^*_{\text{Laplace}}(w_n|w_{n-1}) = \frac{C(w_{n-1}w_n) + 1}{\sum_{w}\left(C(w_{n-1}w) + 1\right)} = \frac{C(w_{n-1}w_n) + 1}{C(w_{n-1}) + V}$$
  • It is often convenient to reconstruct the count matrix so we can see how much a smoothing algorithm has changed the original counts.

    $$c^*(w_{n-1}w_n) = \frac{\left[C(w_{n-1}w_n) + 1\right] \times C(w_{n-1})}{C(w_{n-1}) + V}$$


Add-one smoothing has made a very big change to the counts.

  • C(want to)C(\text{want to}) changed from 609 to 238
  • P(to∣want)P(to|want) decreases from .66 in the unsmoothed case to .26 in the smoothed case
  • The discount dd (the ratio between new and old counts) shows us how strikingly the counts for each prefix word have been reduced
    • the discount for the bigram want to is .39
    • the discount for Chinese food is .10

The sharp change in counts and probabilities occurs because too much probability mass is moved to all the zeros.
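The Berkeley Restaurant counts themselves are not reproduced here, but the following sketch shows how add-one bigram probabilities, reconstructed counts $c^*$, and discounts $d = c^*/c$ could be computed on a toy corpus; the sentences, the boundary markers, and the decision to count `<s>` and `</s>` in $V$ are assumptions made for the example.

```python
from collections import Counter

def add_one_bigram_model(sentences):
    """Add-one bigram probabilities, reconstructed counts c*, and discounts d = c*/c."""
    uni, bi = Counter(), Counter()
    for sent in sentences:
        toks = ["<s>"] + sent.split() + ["</s>"]
        uni.update(toks)
        bi.update(zip(toks, toks[1:]))
    V = len(uni)   # vocabulary size (here <s> and </s> are counted too, for simplicity)

    def prob(prev, word):
        return (bi[(prev, word)] + 1) / (uni[prev] + V)

    def reconstructed(prev, word):
        return (bi[(prev, word)] + 1) * uni[prev] / (uni[prev] + V)

    def discount(prev, word):
        c = bi[(prev, word)]
        return reconstructed(prev, word) / c if c > 0 else None

    return prob, reconstructed, discount

prob, c_star, d = add_one_bigram_model(["i want to eat", "i want chinese food"])
print(prob("want", "to"), c_star("want", "to"), d("want", "to"))
```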

Add-k smoothing

Instead of adding 1 to each count, we add a fractional count $k$

$$P^*_{\text{Add-}k}(w_n|w_{n-1}) = \frac{C(w_{n-1}w_n) + k}{C(w_{n-1}) + kV}$$
  β€’ $k$: can be chosen by optimizing on a devset (validation set)


  • useful for some tasks (including text classification)
  • still doesn’t work well for language modeling, generating counts with poor variances and often inappropriate discounts πŸ€ͺ

Backoff and interpolation

Sometimes using less context is a good thing, helping to generalize more for contexts that the model hasn’t learned much about.

Backoff

πŸ’‘ β€œBack off” to a lower-order n-gram if we have zero evidence for a higher-order n-gram

  • If the n-gram we need has zero counts, we approximate it by backing off to the (n-1)-gram. We continue backing off until we reach a history that has some counts.
    • we use the trigram if the evidence is sufficient, otherwise we use the bigram, otherwise the unigram.

Katz backoff

  • Rely on a discounted probability Pβˆ—P^* if we’ve seen this n-gram before (i.e., if we have non-zero counts)

    • We have to discount the higher-order n-grams to save some probability mass for the lower order n-grams

      if the higher-order n-grams aren’t discounted and we just used the undiscounted MLE probability, then as soon as we replaced an n-gram which has zero probability with a lower-order n-gram, we would be adding probability mass, and the total probability assigned to all possible strings by the language model would be greater than 1!

  • Otherwise, we recursively back off to the Katz probability for the shorter-history (n-1)-gram.

β‡’ The probability for a backoff n-gram $P_{\text{BO}}$ is

$$P_{\text{BO}}(w_n|w_{n-N+1}^{n-1}) = \begin{cases} P^*(w_n|w_{n-N+1}^{n-1}), & \text{if } C(w_{n-N+1}^{n}) > 0 \\ \alpha(w_{n-N+1}^{n-1})\, P_{\text{BO}}(w_n|w_{n-N+2}^{n-1}), & \text{otherwise} \end{cases}$$

  • Pβˆ—P^*: discounted probability
  • Ξ±\alpha: a function to distribute the discounted probability mass to the lower order n-grams

Interpolation

πŸ’‘ Mix the probability estimates from all the n-gram estimators, weighing and combining the trigram, bigram, and unigram counts.

In simple linear interpolation, we combine different order n-grams by linearly interpolating all the models. I.e., we estimate the trigram probability $P(w_n|w_{n-2}w_{n-1})$ by mixing together the unigram, bigram, and trigram probabilities, each weighted by a $\lambda$

$$\hat{P}(w_n|w_{n-2}w_{n-1}) = \lambda_1 P(w_n|w_{n-2}w_{n-1}) + \lambda_2 P(w_n|w_{n-1}) + \lambda_3 P(w_n)$$

s.t.

$$\sum_{i} \lambda_{i} = 1$$
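A minimal sketch of simple linear interpolation with fixed weights; the Ξ» values, corpus, and padding scheme are assumptions for the example (in practice the Ξ»s would be set on a held-out set, as described below).

```python
from collections import Counter

def collect_counts(sentences):
    """Unigram, bigram, and trigram counts with sentence-boundary padding."""
    uni, bi, tri = Counter(), Counter(), Counter()
    for sent in sentences:
        toks = ["<s>", "<s>"] + sent.split() + ["</s>"]
        uni.update(toks)
        bi.update(zip(toks, toks[1:]))
        tri.update(zip(toks, toks[1:], toks[2:]))
    return uni, bi, tri

def interp_prob(w, u, v, uni, bi, tri, lambdas=(0.5, 0.3, 0.2)):
    """P_hat(w | u v) = l1*P(w|u v) + l2*P(w|v) + l3*P(w), from MLE component estimates."""
    l1, l2, l3 = lambdas            # assumed to sum to 1
    n_tokens = sum(uni.values())
    p_uni = uni[w] / n_tokens
    p_bi = bi[(v, w)] / uni[v] if uni[v] else 0.0
    p_tri = tri[(u, v, w)] / bi[(u, v)] if bi[(u, v)] else 0.0
    return l1 * p_tri + l2 * p_bi + l3 * p_uni

uni, bi, tri = collect_counts(["i want to eat", "i want chinese food"])
print(interp_prob("to", "i", "want", uni, bi, tri))   # P_hat(to | i want)
```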

In a slightly more sophisticated version of linear interpolation, each Ξ»\lambda weight is computed by conditioning on the context.

  • If we have particularly accurate counts for a particular bigram, we assume that the counts of the trigrams based on this bigram will be more trustworthy, so we can make the λλs for those trigrams higher and thus give that trigram more weight in the interpolation.
P^(wn∣wnβˆ’2wnβˆ’1)=Ξ»1(wnβˆ’2nβˆ’1)P(wn∣wnβˆ’2wnβˆ’1)+Ξ»2(wnβˆ’2nβˆ’1)P(wn∣wnβˆ’1)+Ξ»3(wnβˆ’2nβˆ’1)P(wn) \begin{array}{ll} \hat{P}\left(w_{n} | w_{n-2} w_{n-1}\right)=& \lambda_{1}\left(w_{n-2}^{n-1}\right) P\left(w_{n} | w_{n-2} w_{n-1}\right) \\\\ &+\lambda_{2}\left(w_{n-2}^{n-1}\right) P\left(w_{n} | w_{n-1}\right) \\\\ &+\lambda_{3}\left(w_{n-2}^{n-1}\right) P\left(w_{n}\right) \end{array}

How to set the $\lambda$s?

Learn from a held-out corpus

  • Held-out corpus: an additional training corpus that we use to set hyperparameters like these λλ values, by choosing the λλ values that maximize the likelihood of the held-out corpus.
  • We fix the n-gram probabilities and then search for the λλ values that give us the highest probability of the held-out set
    • Common method: EM algorithm

Kneser-Ney Smoothing πŸ‘

One of the most commonly used and best performing n-gram smoothing methods πŸ‘

Based on absolute discounting

  • subtracting a fixed (absolute) discount dd from each count.
  • πŸ’‘ Intuition:
    • since we have good estimates already for the very high counts, a small discount d won’t affect them much
    • It will mainly modify the smaller counts, for which we don’t necessarily trust the estimate anyway

(Table: bigram counts in a training set vs. their average counts in a held-out set, for counts 0–9)

Except for the held-out counts for 0 and 1, all the other bigram counts in the held-out set could be estimated pretty well by just subtracting 0.75 from the count in the training set! In practice this discount is actually a good one for bigrams with counts 2 through 9.

The equation for interpolated absolute discounting applied to bigrams:

$$P_{\text{AbsoluteDiscounting}}(w_i|w_{i-1}) = \frac{C(w_{i-1}w_i) - d}{\sum_{v} C(w_{i-1}v)} + \lambda(w_{i-1}) P(w_i)$$
  β€’ First term: discounted bigram
    β€’ We could just set all the $d$ values to .75, or we could keep a separate discount value of 0.5 for the bigrams with counts of 1.
  β€’ Second term: the unigram, with an interpolation weight $\lambda$ (a code sketch of this interpolated form follows below)

Kneser-Ney discounting augments absolute discounting with a more sophisticated way to handle the lower-order unigram distribution.

β€œSophisticated” here means: instead of $P(w)$, which answers the question β€œHow likely is $w$?”, we’d like to create a unigram model that we might call $P_{\text{CONTINUATION}}$, which answers the question β€œHow likely is $w$ to appear as a novel continuation?”

How can we estimate this probability of seeing the word $w$ as a novel continuation, in a new unseen context?

πŸ’‘ The Kneser-Ney intuition: base our $P_{\text{CONTINUATION}}$ on the number of different contexts word $w$ has appeared in (the number of bigram types it completes).

  • Every bigram type was a novel continuation the first time it was seen.
  • We hypothesize that words that have appeared in more contexts in the past are more likely to appear in some new context as well.

So the continuation probability of a word $w$ is taken to be proportional to the number of different bigram types it completes:

$$P_{\text{CONTINUATION}}(w) \propto |\{v : C(vw) > 0\}|$$

To turn this count into a probability, we normalize by the total number of word bigram types:

$$P_{\text{CONTINUATION}}(w) = \frac{|\{v : C(vw) > 0\}|}{|\{(u', w') : C(u'w') > 0\}|}$$

An equivalent formulation based on a different metaphor is to use the number of word types seen to precede $w$, normalized by the number of words preceding all words:

$$P_{\text{CONTINUATION}}(w) = \frac{|\{v : C(vw) > 0\}|}{\sum_{w'} |\{v : C(vw') > 0\}|}$$
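A small sketch of the continuation probability on a toy corpus, using the second formulation (word types seen to precede each $w$); the corpus and function name are invented for the example.

```python
from collections import defaultdict

def continuation_probs(sentences):
    """P_CONTINUATION(w): fraction of all bigram types whose second word is w."""
    preceding_types = defaultdict(set)   # w -> {v : C(v w) > 0}
    for sent in sentences:
        toks = ["<s>"] + sent.split() + ["</s>"]
        for v, w in zip(toks, toks[1:]):
            preceding_types[w].add(v)
    total_bigram_types = sum(len(vs) for vs in preceding_types.values())
    return {w: len(vs) / total_bigram_types for w, vs in preceding_types.items()}

p_cont = continuation_probs(["i want to eat", "i want chinese food"])
print(p_cont["food"], p_cont["</s>"])   # one vs. two distinct preceding word types
```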

The final equation for Interpolated Kneser-Ney smoothing for bigrams is:

$$P_{\text{KN}}(w_i|w_{i-1}) = \frac{\max\left(C(w_{i-1}w_i) - d,\ 0\right)}{C(w_{i-1})} + \lambda(w_{i-1})\, P_{\text{CONTINUATION}}(w_i)$$
  β€’ $\lambda$: normalizing constant that is used to distribute the probability mass we have discounted

    $$\lambda(w_{i-1}) = \frac{d}{\sum_{v} C(w_{i-1}v)}\, |\{w : C(w_{i-1}w) > 0\}|$$
    β€’ First term: normalized discount
    β€’ Second term: the number of word types that can follow $w_{i-1}$, or, equivalently, the number of word types that we discounted (i.e., the number of times we applied the normalized discount)

The general recursive formulation is

$$P_{\text{KN}}(w_i|w_{i-n+1}^{i-1}) = \frac{\max\left(c_{KN}(w_{i-n+1}^{i}) - d,\ 0\right)}{\sum_{v} c_{KN}(w_{i-n+1}^{i-1}v)} + \lambda(w_{i-n+1}^{i-1})\, P_{\text{KN}}(w_i|w_{i-n+2}^{i-1})$$
  • CKNC_{KN}: depends on whether we are counting the highest-order n-gram being interpolated or one of the lower-order n-grams image-20200803111708917

    • continuationcount⁑(β‹…)\operatorname{continuationcount}(\cdot): the number of unique single word contexts for β‹…\cdot

At the termination of the recursion, unigrams are interpolated with the uniform distribution

$$P_{\text{KN}}(w) = \frac{\max\left(c_{KN}(w) - d,\ 0\right)}{\sum_{w'} c_{KN}(w')} + \lambda(\varepsilon)\, \frac{1}{V}$$
  • Ξ΅\varepsilon: empty string