NLP

The Cross-Entropy Loss Function

Motivation: We need a loss function that expresses, for an observation $x$, how close the classifier output ($\hat{y}=\sigma(w \cdot x+b)$) is to the correct output ($y$, which is $0$ or $1$):

$$ L(\hat{y}, y)= \text{how much } \hat{y} \text{ differs from the true } y $$

This loss function should prefer parameter settings under which the correct class labels of the training examples are more likely.
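The standard choice for this is binary cross-entropy, which is small when $\hat{y}$ is close to the true label and grows without bound as $\hat{y}$ moves to the wrong side. A minimal sketch (the probabilities are invented for illustration):

```python
import math

def binary_cross_entropy(y_hat: float, y: int) -> float:
    """L(y_hat, y) = -[y * log(y_hat) + (1 - y) * log(1 - y_hat)].

    Small when y_hat is close to the true label y, large otherwise.
    """
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

# A confident correct prediction costs little; a confident wrong one costs a lot.
print(binary_cross_entropy(0.9, 1))  # ~0.105
print(binary_cross_entropy(0.1, 1))  # ~2.303
```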

2020-08-03

Sigmoid

Sigmoid to Logistic Regression: Consider a single input observation $x = [x_1, x_2, \dots, x_n]$. The classifier output $y$ can be:

- $1$: the observation is a member of the class
- $0$: the observation is NOT a member of the class

We want to know the probability $P(y=1|x)$ that this observation is a member of the class.
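Logistic regression computes this probability as $P(y=1|x)=\sigma(w \cdot x + b)$. A minimal sketch, with hypothetical weights $w$, bias $b$, and features $x$ (the numbers are invented for illustration):

```python
import math

def sigmoid(z: float) -> float:
    """Squash a real-valued score into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_prob(w, x, b):
    """P(y=1|x) = sigmoid(w . x + b) for a single observation x."""
    z = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
    return sigmoid(z)

p = predict_prob(w=[0.5, -1.2], x=[3.0, 2.0], b=0.1)  # ~0.31
y_hat = 1 if p > 0.5 else 0                           # decision: not in the class
```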

Generative and Discriminative Classifiers

The most important difference between naive Bayes and logistic regression is that logistic regression is a discriminative classifier while naive Bayes is a generative classifier. Consider a visual metaphor: imagine we’re trying to distinguish dog images from cat images.

Logistic Regression

Evaluation

Two classes. Gold labels: the human-defined labels for each document that we are trying to match.

Confusion matrix: To evaluate any system for detecting things, we start by building a contingency table (confusion matrix):
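The four cells of the two-class contingency table can be tallied directly by comparing system output to the gold labels; a sketch, with tiny invented label lists:

```python
def confusion_counts(gold, pred, positive="pos"):
    """Tally tp/fp/fn/tn for a two-class detector against gold labels."""
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    tn = sum(1 for g, p in pairs if g != positive and p != positive)
    return tp, fp, fn, tn

gold = ["pos", "pos", "neg", "neg", "pos"]
pred = ["pos", "neg", "neg", "pos", "pos"]
tp, fp, fn, tn = confusion_counts(gold, pred)
precision = tp / (tp + fp)  # 2 / 3
recall = tp / (tp + fn)     # 2 / 3
```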

Optimizing for Sentiment Analysis

While standard naive Bayes text classification can work well for sentiment analysis, some small changes are generally employed that improve performance.

💪 Binary multinomial naive Bayes (binary NB): First, for sentiment classification and a number of other text classification tasks, whether a word occurs or not seems to matter more than its frequency.
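One way to realize binary NB is to clip each word's count within a document to at most 1 before accumulating per-class counts; a minimal sketch:

```python
from collections import Counter

def binarized_counts(docs):
    """Count each word at most once per document (binary multinomial NB)."""
    counts = Counter()
    for tokens in docs:
        counts.update(set(tokens))  # duplicates within a doc collapse to one
    return counts

docs = [["great", "great", "great", "movie"], ["great", "plot"]]
counts = binarized_counts(docs)
# "great" appears in two documents, so its binarized count is 2, not 4.
```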

Train Naive Bayes Classifiers

Maximum Likelihood Estimate (MLE): For the Naive Bayes calculation we have to learn the probabilities $P(c)$ and $P(w_i|c)$. We use the Maximum Likelihood Estimate (MLE) to estimate them: we simply use the relative frequencies in the data.
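Under MLE these estimates are just relative frequencies: $\hat{P}(c) = N_c/N$ and $\hat{P}(w_i|c) = \mathrm{count}(w_i, c) / \sum_{w} \mathrm{count}(w, c)$. A sketch on an invented two-document corpus:

```python
from collections import Counter, defaultdict

def train_nb_mle(labeled_docs):
    """Estimate P(c) and P(w|c) as relative frequencies in the data (MLE)."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    for tokens, c in labeled_docs:
        class_counts[c] += 1
        word_counts[c].update(tokens)
    n_docs = sum(class_counts.values())
    priors = {c: n / n_docs for c, n in class_counts.items()}
    likelihoods = {}
    for c, wc in word_counts.items():
        total = sum(wc.values())
        likelihoods[c] = {w: n / total for w, n in wc.items()}
    return priors, likelihoods

docs = [(["good", "good", "film"], "pos"), (["bad", "film"], "neg")]
priors, likelihoods = train_nb_mle(docs)
# priors["pos"] == 0.5, likelihoods["pos"]["good"] == 2/3
```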

Naive Bayes Classifiers

Notation for a classifier for text classification:

- Input: $d$ (“document”)
- Output: $c$ (“class”)
- Training set: $N$ documents that have each been hand-labeled with a class: $(d_1, c_1), \dots, (d_N, c_N)$

🎯 Goal: to learn a classifier that is capable of mapping from a new document $d$ to its correct class $c \in C$
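Once $P(c)$ and $P(w_i|c)$ are available, the classifier maps a new document to $\hat{c} = \operatorname{argmax}_{c \in C} \log P(c) + \sum_i \log P(w_i|c)$. A sketch with invented probability tables (words absent from a class's table are skipped here, where a real system would use smoothing instead):

```python
import math

def classify(tokens, priors, likelihoods):
    """Return argmax_c [log P(c) + sum_i log P(w_i|c)] for document tokens."""
    best_class, best_score = None, float("-inf")
    for c, prior in priors.items():
        score = math.log(prior)
        for w in tokens:
            if w in likelihoods[c]:   # skip unseen words (no smoothing here)
                score += math.log(likelihoods[c][w])
        if score > best_score:
            best_class, best_score = c, score
    return best_class

priors = {"pos": 0.5, "neg": 0.5}
likelihoods = {"pos": {"good": 0.5, "bad": 0.1, "film": 0.4},
               "neg": {"good": 0.1, "bad": 0.5, "film": 0.4}}
classify(["good", "film"], priors, likelihoods)  # -> "pos"
```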

Sentiment Classification

Summary (TL;DR)

Language models offer a way to assign a probability to a sentence or other sequence of words, and to predict a word from preceding words.

- $P(w|h)$: probability of word $w$ given history $h$
- An n-gram model estimates words from a fixed window of $N-1$ previous words: $$ P\left(w_{n} | w_{1}^{n-1}\right) \approx P\left(w_{n} | w_{n-N+1}^{n-1}\right) $$
- n-gram probabilities can be estimated by counting in a corpus and normalizing (MLE): $$ P\left(w_{n} | w_{n-N+1}^{n-1}\right)=\frac{C\left(w_{n-N+1}^{n-1} w_{n}\right)}{C\left(w_{n-N+1}^{n-1}\right)} $$
- Evaluation
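The MLE estimate above reduces to counting and normalizing; a bigram ($N=2$) sketch on a toy corpus:

```python
from collections import Counter

def bigram_mle(tokens):
    """Return p(prev, w) = C(prev w) / C(prev), estimated from one corpus.

    Assumes prev actually occurs in the corpus (no smoothing here).
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return lambda prev, w: bigrams[(prev, w)] / unigrams[prev]

tokens = "<s> i am sam </s> <s> sam i am </s>".split()
p = bigram_mle(tokens)
p("i", "am")    # C(i am) / C(i)   = 2/2 = 1.0
p("<s>", "i")   # C(<s> i) / C(<s>) = 1/2 = 0.5
```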
