Naive Bayes Classifiers

Notation

Classifier for text classification

  • Input: $d$ (“document”)

  • Output: $c$ (“class”)

  • Training set: $N$ documents that have each been hand-labeled with a class $(d_1, c_1), \dots, (d_N, c_N)$

  • 🎯 Goal: to learn a classifier that is capable of mapping from a new document $d$ to its correct class $c\in C$

  • Type:

    • Generative:
      • Build a model of how a class could generate some input data
      • Given an observation, return the class most likely to have generated the observation.
      • E.g.: Naive Bayes
    • Discriminative
      • Learn what features from the input are most useful to discriminate between the different possible classes
      • Often more accurate and hence more commonly used
      • E.g.: Logistic Regression

Bag-of-words Representation

We represent a text document as if it were a bag of words:

  • unordered set of words
  • position ignored
  • keeping only their frequency in the document
[Figure: a movie-review excerpt and its bag-of-words representation (each word mapped to its count)]

In the example in the figure, instead of representing the word order of phrases like “I love this movie” and “I would recommend it”, we simply note that the word I occurred 5 times in the entire excerpt, the word it 6 times, the words love, recommend, and movie once each, and so on.
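For concreteness, here is a minimal Python sketch of building a bag-of-words representation; the sample text and the naive whitespace tokenization are illustrative assumptions, not the excerpt shown in the figure.

```python
from collections import Counter

# Illustrative text (not the excerpt from the figure); real tokenization
# would be more careful than lowercasing and splitting on whitespace.
text = "I love this movie I would recommend it it is great fun"
words = text.lower().split()

# Bag-of-words: word order and position are discarded,
# only the frequency of each word in the document is kept.
bag_of_words = Counter(words)
print(bag_of_words)  # e.g. Counter({'i': 2, 'it': 2, 'love': 1, ...})
```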

Naive Bayes

  • Probabilistic classifier

    • for a document $d$, out of all classes $c \in C$, the classifier returns the class $\hat{c}$ which has the maximum a posteriori (MAP) probability given the document

      By Bayes’ rule, and dropping the denominator $P(d)$ because it is the same for every class: $$ \hat{c}=\underset{c \in C}{\operatorname{argmax}} P(c | d)=\underset{c \in C}{\operatorname{argmax}} \frac{P(d | c) P(c)}{P(d)}=\underset{c \in C}{\operatorname{argmax}} P(d | c) P(c) $$

If we represent a document $d$ as a set of features $f_1, f_2, \dots, f_n$, then

$$ \hat{c}=\underset{c \in C}{\operatorname{argmax}} \quad\overbrace{P\left(f_{1}, f_{2}, \ldots, f_{n} | c\right)}^{\text { likelihood }} \cdot \overbrace{P(c)}^{\text { prior }} $$

To make the likelihood $P\left(f_{1}, f_{2}, \ldots, f_{n} | c\right)$ tractable to compute directly, Naive Bayes classifiers make two simplifying assumptions:

  • Bag-of-words assumption

    • we assume position does NOT matter, and that the word “love” has the SAME effect on classification whether it occurs as the 1st, 20th, or last word in the document.
    • Thus we assume that the features $f_1, f_2, \dots, f_n$ only encode word identity and not position.
  • Naive Bayes assumption

    • The conditional independence assumption that the probabilities $P(f_i|c)$ are independent given the class $c$ and hence can be ‘naively’ multiplied: $$ P\left(f_{1}, f_{2}, \ldots, f_{n} | c\right)=P\left(f_{1} | c\right) \cdot P\left(f_{2} | c\right) \cdot \ldots \cdot P\left(f_{n} | c\right) $$

Based on these two assumptions, the final equation for the class chosen by a naive Bayes classifier is thus

$$ c_{N B}=\underset{c \in C}{\operatorname{argmax}} P(c) \prod_{f \in F} P(f | c) $$

To apply the naive Bayes classifier to text, we need to consider word positions, by simply walking an index through every word position in the document:

$$ \text{positions} \leftarrow \text{all word positions in test document} $$ $$ c_{NB}=\underset{c \in C}{\operatorname{argmax}} P(c) \prod_{i \in \text {positions}} P\left(w_{i} | c\right) $$
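As a small worked illustration with made-up numbers (not taken from the figure or text above): suppose there are two classes $+$ and $-$ with priors $P(+)=P(-)=0.5$, the test document is “fun great great”, and the estimated likelihoods are $P(\text{fun} | +)=0.1$, $P(\text{great} | +)=0.2$, $P(\text{fun} | -)=0.05$, $P(\text{great} | -)=0.05$. Then

$$ P(+) \prod_{i} P\left(w_{i} | +\right) = 0.5 \times 0.1 \times 0.2 \times 0.2 = 0.002 $$

$$ P(-) \prod_{i} P\left(w_{i} | -\right) = 0.5 \times 0.05 \times 0.05 \times 0.05 = 0.0000625 $$

so the classifier returns $c_{NB} = +$.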

To avoid underflow and increase speed, Naive Bayes calculations, like calculations for language modeling, are done in log space:

$$ c_{NB}=\underset{c \in C}{\operatorname{argmax}}\left[\log P(c)+\sum_{i \in \text{positions}} \log P\left(w_{i} | c\right)\right] $$
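Below is a minimal Python sketch of this log-space decision rule. The parameter names (`log_prior`, `log_likelihood`) are my own, and the sketch assumes these log probabilities have already been estimated from the training set (e.g. with smoothing); it is an illustration under those assumptions, not a full implementation.

```python
import math

def naive_bayes_classify(doc_words, classes, log_prior, log_likelihood):
    """Return argmax_c [ log P(c) + sum over positions i of log P(w_i | c) ].

    log_prior[c]         -- log P(c), estimated from the training data
    log_likelihood[c][w] -- log P(w | c), estimated from the training data
    Words with no estimated likelihood are skipped here for simplicity;
    in practice they are handled by the vocabulary/smoothing scheme.
    """
    best_class, best_score = None, -math.inf
    for c in classes:
        score = log_prior[c]
        for w in doc_words:               # one term per word position
            if w in log_likelihood[c]:
                score += log_likelihood[c][w]
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Toy usage with hand-set (made-up) parameters:
classes = ["+", "-"]
log_prior = {"+": math.log(0.5), "-": math.log(0.5)}
log_likelihood = {
    "+": {"fun": math.log(0.1), "great": math.log(0.2)},
    "-": {"fun": math.log(0.05), "great": math.log(0.05)},
}
print(naive_bayes_classify("fun great great".split(), classes, log_prior, log_likelihood))  # -> "+"
```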