Train Naive Bayes Classifiers

Maximum Likelihood Estimate (MLE)

In the naive Bayes calculation we have to learn the probabilities $P(c)$ and $P(w_i|c)$. We estimate them with the Maximum Likelihood Estimate (MLE), i.e., we simply use the frequencies in the data (a short code sketch of both estimates follows the list below).

  • $P(c)$: the document prior

    • "What percentage of the documents in our training set are in each class $c$?"

      $$\hat{P}(c)=\frac{N_{c}}{N_{doc}}$$

    • $N_c$: the number of documents in our training data with class $c$
    • $N_{doc}$: the total number of documents
  • $P(w_i|c)$: the word likelihood

    • "The fraction of times the word $w_i$ appears among all words in all documents of topic $c$"

      • We first concatenate all documents with category $c$ into one big "category $c$" text.

      • Then we use the frequency of $w_i$ in this concatenated document to give a maximum likelihood estimate of the probability:

        $$\hat{P}(w_i \mid c)=\frac{\operatorname{count}(w_i, c)}{\sum_{w \in V} \operatorname{count}(w, c)}$$

        • $V$: the vocabulary, i.e., the union of all the word types in all classes, not just the words in one class $c$

    • To avoid zero probabilities in the likelihood term for any class, we use Laplace (add-one) smoothing:

      $$\hat{P}(w_i \mid c)=\frac{\operatorname{count}(w_i, c)+1}{\sum_{w \in V}(\operatorname{count}(w, c)+1)}=\frac{\operatorname{count}(w_i, c)+1}{\left(\sum_{w \in V} \operatorname{count}(w, c)\right)+|V|}$$

      Why are zero probabilities a problem?

      Imagine we are trying to estimate the likelihood of the word "fantastic" given class positive, but suppose there are no training documents that both contain the word "fantastic" and are classified as positive. Perhaps the word "fantastic" happens to occur (sarcastically?) in the class negative. In such a case the unsmoothed probability for this feature will be zero:

      $$\hat{P}(\text{"fantastic"} \mid \text{positive})=\frac{\operatorname{count}(\text{"fantastic"}, \text{positive})}{\sum_{w \in V} \operatorname{count}(w, \text{positive})}=0$$

      But since naive Bayes naively multiplies all the feature likelihoods together, zero probabilities in the likelihood term for any class will cause the probability of the class to be zero, no matter the other evidence!
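
Both estimates boil down to counting. Below is a minimal sketch, in Python, of how $\hat{P}(c)$ and the add-one smoothed $\hat{P}(w_i|c)$ could be computed; the function name `train_naive_bayes` and the input format (a list of `(tokens, label)` pairs) are assumptions of mine, not something fixed by the text.

```python
from collections import Counter, defaultdict

def train_naive_bayes(documents):
    """documents: list of (tokens, label) pairs, e.g. (["just", "plain", "boring"], "-")."""
    n_doc = len(documents)                                 # N_doc
    n_c = Counter(label for _, label in documents)         # N_c for each class c
    count_wc = defaultdict(Counter)                        # count(w, c)
    vocab = set()                                          # V: union of word types over all classes
    for tokens, label in documents:
        count_wc[label].update(tokens)
        vocab.update(tokens)

    prior = {c: n_c[c] / n_doc for c in n_c}               # P(c) = N_c / N_doc

    likelihood = {}
    for c in n_c:
        total = sum(count_wc[c].values())                  # sum_w count(w, c)
        # Laplace (add-one) smoothing: (count(w, c) + 1) / (total + |V|)
        likelihood[c] = {w: (count_wc[c][w] + 1) / (total + len(vocab))
                         for w in vocab}
    return prior, likelihood, vocab
```

Keeping a smoothed entry for every word in $V$ for every class, rather than only for the words actually seen with that class, is exactly what guarantees that no likelihood factor is ever zero.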

In addition to $P(c)$ and $P(w_i|c)$, we should also deal with:

  • Unknown words
    • Words that occur in our test data but are NOT in our vocabulary at all, because they did not occur in any training document in any class
    • 🔧 Solution: ignore them
      • Remove them from the test document and do not include any probability for them at all (as sketched in the code after this list)
  • Stop words
    • Very frequent words like "the" and "a"
    • Solution:
      • Method 1:
        1. Sort the vocabulary by frequency in the training set.
        2. Define the top 10–100 vocabulary entries as stop words.
      • Method 2:
        1. Use one of the many pre-defined stop word lists available online.
        2. Then remove every instance of these stop words from both training and test documents, as if they had never occurred.
    • In most text classification applications, however, using a stop word list does NOT improve performance 🤪, so it is more common to make use of the entire vocabulary and not use a stop word list.
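
At test time, handling unknown words (and, optionally, stop words) is just a filter applied before the naive Bayes product. Here is a sketch that reuses the `prior`, `likelihood`, and `vocab` returned by the hypothetical `train_naive_bayes` above; the `stop_words` argument is likewise my own addition for illustration.

```python
def classify(tokens, prior, likelihood, vocab, stop_words=frozenset()):
    """Pick the class c maximizing P(c) * product of P(w | c) over the kept test words."""
    # Unknown words (not in V) are simply ignored; if a stop word list is used,
    # stop words are dropped as well, as if they had never occurred.
    kept = [w for w in tokens if w in vocab and w not in stop_words]

    best_class, best_score = None, -1.0
    for c in prior:
        score = prior[c]
        for w in kept:
            score *= likelihood[c][w]      # never zero, thanks to add-one smoothing
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```

In practice the product is usually replaced by a sum of log probabilities to avoid numerical underflow on long documents; plain multiplication is used here only to mirror the formulas in this section.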

Example

We'll use a sentiment analysis domain with the two classes positive (+) and negative (-), and take the following miniature training and test documents simplified from actual movie reviews.

  • Training set:
    • (-) just plain boring
    • (-) entirely predictable and lacks energy
    • (-) no surprises and very few laughs
    • (+) very powerful
    • (+) the most fun film of the summer
  • Test set:
    • (?) predictable with no fun

  • The prior $P(c)$: three of the five training documents are negative and two are positive, so

    $$P(-)=\frac{3}{5} \qquad P(+)=\frac{2}{5}$$

The word "with" doesn't occur in the training set, so we drop it completely.

The remaining three words are "predictable", "no", and "fun". Their likelihoods from the training set are, with Laplace smoothing:

$$\begin{aligned}
P(\text{"predictable"} \mid -) &= \frac{1+1}{14+20} &\qquad P(\text{"predictable"} \mid +) &= \frac{0+1}{9+20} \\
P(\text{"no"} \mid -) &= \frac{1+1}{14+20} &\qquad P(\text{"no"} \mid +) &= \frac{0+1}{9+20} \\
P(\text{"fun"} \mid -) &= \frac{0+1}{14+20} &\qquad P(\text{"fun"} \mid +) &= \frac{1+1}{9+20}
\end{aligned}$$
Here the vocabulary $V$ is the union of the word types in both classes:

$$V = \{\text{just, plain, boring, entirely, predictable, and, lacks, energy, no, surprises, very, few, laughs, powerful, the, most, fun, film, of, summer}\}$$

$$\Rightarrow |V| = 20$$

The negative training documents contain 14 word tokens in total and the positive documents contain 9, which gives the denominators above.

For example, the word "predictable" occurs once in the negative (-) training documents, so with Laplace smoothing:

$$P(\text{"predictable"} \mid -) = \frac{1+1}{14+20}$$

For the test sentence $S =$ "predictable with no fun", after removing the word "with":

$$\begin{aligned}
P(-)\,P(S \mid -) &= \frac{3}{5} \times \frac{2 \times 2 \times 1}{34^{3}} = 6.1 \times 10^{-5} \\
P(+)\,P(S \mid +) &= \frac{2}{5} \times \frac{1 \times 1 \times 2}{29^{3}} = 3.2 \times 10^{-5}
\end{aligned}$$

$$P(-)\,P(S \mid -) > P(+)\,P(S \mid +)$$

$\Rightarrow$ The model predicts the class negative for the test sentence $S$.
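
As a sanity check, the two unnormalized scores above can be reproduced with exact fractions; the counts 14, 9, and 20 are the ones derived in this example.

```python
from fractions import Fraction as F

NEG_TOKENS, POS_TOKENS, V_SIZE = 14, 9, 20   # word tokens per class and |V|

# "predictable", "no", "fun" with add-one smoothing, times the class prior
p_neg = F(3, 5) * F(2, NEG_TOKENS + V_SIZE) * F(2, NEG_TOKENS + V_SIZE) * F(1, NEG_TOKENS + V_SIZE)
p_pos = F(2, 5) * F(1, POS_TOKENS + V_SIZE) * F(1, POS_TOKENS + V_SIZE) * F(2, POS_TOKENS + V_SIZE)

print(float(p_neg))   # about 6.1e-05
print(float(p_pos))   # about 3.3e-05 (quoted as 3.2e-05 above)
print(p_neg > p_pos)  # True, so the model predicts negative
```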