Naive Bayes Classifiers

Notation

Classifier for text classification

  • Input: $d$ (“document”)

  • Output: $c$ (“class”)

  • Training set: $N$ documents that have each been hand-labeled with a class: $(d_1, c_1), \dots, (d_N, c_N)$

  • 🎯 Goal: learn a classifier that maps a new document $d$ to its correct class $c \in C$

  • Type:

    • Generative:
      • Build a model of how a class could generate some input data
      • Given an observation, return the class most likely to have generated the observation.
      • E.g.: Naive Bayes
    • Discriminative:
      • Learn what features from the input are most useful to discriminate between the different possible classes
      • Often more accurate and hence more commonly used
      • E.g.: Logistic Regression

Bag-of-words Representation

We represent a text document as if it were a bag of words:

  • an unordered set of words
  • word position is ignored
  • only each word’s frequency in the document is kept
[Figure: a movie review excerpt and its bag-of-words representation as word counts]

In the example in the figure, instead of representing the word order in all the phrases like β€œI love this movie” and β€œI would recommend it”, we simply note that the word I occurred 5 times in the entire excerpt, the word it 6 times, the words love, recommend, and movie once, and so on.
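
As a minimal sketch of this representation, the snippet below reduces a document to word counts with Python’s collections.Counter. The tokenization (lowercasing, keeping runs of letters and apostrophes) and the toy excerpt are assumptions for illustration, not something specified in the notes.

```python
import re
from collections import Counter

def bag_of_words(document: str) -> Counter:
    """Reduce a document to unordered word counts: a bag of words."""
    # Assumed tokenization: lowercase, then keep runs of letters/apostrophes.
    tokens = re.findall(r"[a-z']+", document.lower())
    return Counter(tokens)

# Toy excerpt (not the one in the figure): word order is discarded,
# only per-word frequencies remain.
excerpt = "I love this movie. I would recommend it. It is fun; it is great; I laughed, I cried."
print(bag_of_words(excerpt).most_common(3))  # -> [('i', 4), ('it', 3), ('is', 2)]
```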

Naive Bayes

  • Probabilistic classifier

    • for a document $d$, out of all classes $c \in C$, the classifier returns the class $\hat{c}$ which has the maximum a posteriori probability (MAP) given the document:

      $$\hat{c} = \underset{c \in C}{\operatorname{argmax}}\; P(c \mid d) = \underset{c \in C}{\operatorname{argmax}}\; \frac{P(d \mid c)\, P(c)}{P(d)} = \underset{c \in C}{\operatorname{argmax}}\; P(d \mid c)\, P(c)$$

      (by Bayes’ rule; the denominator $P(d)$ is the same for every class, so it can be dropped)

If we represent a document $d$ as a set of features $f_1, f_2, \dots, f_n$, then

$$\hat{c}=\underset{c \in C}{\operatorname{argmax}} \quad \overbrace{P\left(f_{1}, f_{2}, \ldots, f_{n} \mid c\right)}^{\text{likelihood}} \cdot \overbrace{P(c)}^{\text{prior}}$$

To make this directly computable, Naive Bayes classifiers make two simplifying assumptions:

  • Bag-of-words assumption

    • we assume position does NOT matter, and that the word “love” has the SAME effect on classification whether it occurs as the 1st, 20th, or last word in the document.
    • Thus we assume that the features $f_1, f_2, \dots, f_n$ only encode word identity and not position.
  • Naive Bayes assumption

    • The conditional independence assumption that the probabilities $P(f_i \mid c)$ are independent given the class $c$ and hence can be ‘naively’ multiplied:
      $$P\left(f_{1}, f_{2}, \ldots, f_{n} \mid c\right)=P\left(f_{1} \mid c\right) \cdot P\left(f_{2} \mid c\right) \cdot \ldots \cdot P\left(f_{n} \mid c\right)$$

Based on these two assumptions, the final equation for the class chosen by a naive Bayes classifier is thus

$$c_{NB}=\underset{c \in C}{\operatorname{argmax}}\; P(c) \prod_{f \in F} P(f \mid c)$$

To apply the naive Bayes classifier to text, we need to consider word positions by simply walking an index through every word position in the test document:

$$\text{positions} \leftarrow \text{all word positions in the test document}$$

$$c_{NB}=\underset{c \in C}{\operatorname{argmax}}\; P(c) \prod_{i \in \text{positions}} P\left(w_{i} \mid c\right)$$
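
As a concrete illustration with made-up numbers (a hypothetical two-class sentiment task, not taken from the figure above): let $C = \{+, -\}$, $P(+) = P(-) = 0.5$, the test document be “fun and great”, and suppose $P(\text{fun} \mid +)=0.05$, $P(\text{and} \mid +)=0.1$, $P(\text{great} \mid +)=0.1$, while $P(\text{fun} \mid -)=0.01$, $P(\text{and} \mid -)=0.1$, $P(\text{great} \mid -)=0.005$. Then

$$P(+)\prod_{i} P(w_i \mid +) = 0.5 \cdot 0.05 \cdot 0.1 \cdot 0.1 = 2.5\times 10^{-4}, \qquad P(-)\prod_{i} P(w_i \mid -) = 0.5 \cdot 0.01 \cdot 0.1 \cdot 0.005 = 2.5\times 10^{-6},$$

so $c_{NB} = +$. Even for three words the products are already tiny, which is why the computation below is moved to log space.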

To avoid underflow and increase speed, Naive Bayes calculations, like calculations for language modeling, are done in log space:

$$c_{NB}=\underset{c \in C}{\operatorname{argmax}}\; \log P(c)+\sum_{i \in \text{positions}} \log P\left(w_{i} \mid c\right)$$
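
A minimal sketch of this log-space decision rule, assuming the log priors and per-class log likelihoods have already been estimated from the training set. The dictionary layout (log_prior, log_likelihood) and the choice to skip words unseen with a class are assumptions of this example, not part of the notes.

```python
import math

def classify_nb(doc_tokens, classes, log_prior, log_likelihood):
    """Return argmax over c of: log P(c) + sum over positions of log P(w_i | c).

    log_prior:      dict mapping class -> log P(c)
    log_likelihood: dict mapping (word, class) -> log P(word | c)
    """
    best_class, best_score = None, -math.inf
    for c in classes:
        score = log_prior[c]
        for w in doc_tokens:  # walk an index through every word position
            # Words never seen with this class contribute 0 here (an
            # assumption of this sketch, equivalent to skipping them).
            score += log_likelihood.get((w, c), 0.0)
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```

Because the log function is monotonic, the class with the highest summed log score is the same class that would win the raw product above, but the sums stay in a numerically safe range.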