Logistic Regression

Logistic Regression in NLP

In natural language processing, logistic regression is the baseline supervised machine learning algorithm for classification, and it also has a very close relationship with neural networks. Generative and Discriminative Classifiers: the most important difference between naive Bayes and logistic regression is that logistic regression is a discriminative classifier while naive Bayes is a generative classifier.

2020-08-03

Logistic Regression: Summary

Supervised classification. Input: $x = (x_1, x_2, \dots, x_n)^T$. Output: $y \in \{0, 1\}$. Parameters: weights $w = (w_1, w_2, \dots, w_n)^T$ and bias $b$. Prediction: $$ z = w \cdot x + b \\ P(y=1|x)=\sigma(z) = \frac{1}{1+e^{-z}}\\ y=\left\{\begin{array}{ll} 1 & \text { if } P(y=1 | x)>0.5 \\ 0 & \text { otherwise } \end{array}\right. $$

2020-08-03
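
A minimal sketch of this prediction rule in Python (the function name and example values are illustrative, not from the post):

```python
import math

def predict(w, x, b):
    """Binary logistic regression: compute sigma(w . x + b) and threshold at 0.5."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b   # z = w . x + b
    p = 1.0 / (1.0 + math.exp(-z))                 # P(y=1|x) = sigma(z)
    return (1 if p > 0.5 else 0), p

# Example: z = 2.0*1.0 + (-1.0)*0.5 + (-0.5) = 1.0, sigma(1.0) ~ 0.731 -> class 1
label, p = predict(w=[2.0, -1.0], x=[1.0, 0.5], b=-0.5)
```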

Multinomial Logistic Regression

Motivation More than two classes? Use multinomial logistic regression (also called softmax regression, or the maxent classifier). The target $y$ is a variable that ranges over more than two classes; we want to know the probability of $y$ being in each potential class $c \in C$, i.e. $p(y=c|x)$.

2020-08-03
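
A sketch of the softmax that turns per-class scores into $p(y=c|x)$ (the max-subtraction for numerical stability is a standard detail not spelled out in the excerpt):

```python
import math

def softmax(scores):
    """Map per-class scores z_c to probabilities p(y=c|x) that sum to 1."""
    m = max(scores)                          # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Three candidate classes: the largest score gets the largest probability.
probs = softmax([2.0, 1.0, 0.1])
```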

Regularization

Overfitting 🔴 Problem with learning weights that make the model perfectly match the training data: If a feature is perfectly predictive of the outcome because it happens to only occur in one class, it will be assigned a very high weight.

2020-08-03
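
One common remedy (my illustration; the excerpt only states the problem) is to add a penalty on large weights to the objective, e.g. an L2 term:

```python
def l2_regularized_loss(data_loss, w, alpha):
    """Penalized objective: data loss plus alpha * sum_j w_j^2.

    A feature that happens to occur in only one class can no longer earn
    an arbitrarily large weight, because the penalty grows with w_j^2.
    """
    return data_loss + alpha * sum(wj * wj for wj in w)

# The same data loss costs more as the weights grow:
small = l2_regularized_loss(1.0, [0.5, 0.5], alpha=0.1)   # 1.0 + 0.1 * 0.5 = 1.05
large = l2_regularized_loss(1.0, [5.0, 5.0], alpha=0.1)   # 1.0 + 0.1 * 50  = 6.0
```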

Gradient Descent

Overview 🎯 Goal with gradient descent: find the optimal weights that minimize the loss function we’ve defined for the model. From now on, we’ll explicitly represent the fact that the loss function $L$ is parameterized by the weights $\theta$ (in the case of logistic regression $\theta=(w, b)$): $$ \hat{\theta}=\underset{\theta}{\operatorname{argmin}} \frac{1}{m} \sum\_{i=1}^{m} L_{C E}\left(y^{(i)}, x^{(i)} ; \theta\right) $$ Gradient descent finds a minimum of a function by figuring out in which direction (in the space of the parameters $\theta$) the function’s slope is rising the most steeply, and moving in the opposite direction.

2020-08-03
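
A toy sketch of that idea on a one-parameter function, minimizing $f(\theta)=\theta^2$ whose gradient is $2\theta$ (the learning rate and starting point are arbitrary):

```python
theta = 5.0   # initial parameter
eta = 0.1     # learning rate (step size)
for _ in range(100):
    grad = 2 * theta            # slope of f(theta) = theta^2 at the current theta
    theta = theta - eta * grad  # move opposite the direction of steepest ascent
# theta has been driven toward the minimizer theta = 0
```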

Learning in Logistic Regression

Logistic regression is an instance of supervised classification in which we know the correct label $y$ (either 0 or 1) for each observation $x$. The system produces/predicts $\hat{y}$, the estimate for the true $y$.

2020-08-03

The Cross-Entropy Loss Function

Motivation We need a loss function that expresses, for an observation $x$, how close the classifier output ($\hat{y}=\sigma(w \cdot x+b)$) is to the correct output ($y$, which is $0$ or $1$): $$ L(\hat{y}, y)= \text{How much } \hat{y} \text{ differs from the true } y $$ This loss function should prefer the correct class labels of the training examples to be more likely.

2020-08-03
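
For binary logistic regression this loss is the cross-entropy $L_{CE}(\hat{y}, y) = -[y \log \hat{y} + (1-y)\log(1-\hat{y})]$; a sketch (the clipping constant is mine, to keep $\log 0$ out of reach):

```python
import math

def cross_entropy_loss(y_hat, y, eps=1e-12):
    """L_CE(y_hat, y) = -(y*log(y_hat) + (1-y)*log(1-y_hat)) for y in {0, 1}."""
    y_hat = min(max(y_hat, eps), 1.0 - eps)   # clip so log() never sees 0
    return -(y * math.log(y_hat) + (1 - y) * math.log(1.0 - y_hat))

# The loss is small when y_hat agrees with y and grows as it disagrees.
good = cross_entropy_loss(0.9, 1)   # -log(0.9) ~ 0.105
bad = cross_entropy_loss(0.1, 1)    # -log(0.1) ~ 2.303
```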

Sigmoid

Sigmoid to Logistic Regression Consider a single input observation $x = [x_1, x_2, \dots, x_n]$. The classifier output $y$ can be $1$ (the observation is a member of the class) or $0$ (the observation is NOT a member of the class). We want to know the probability $P(y=1|x)$ that this observation is a member of the class.

2020-08-03

Generative and Discriminative Classifiers

The most important difference between naive Bayes and logistic regression is that logistic regression is a discriminative classifier while naive Bayes is a generative classifier. Consider a visual metaphor: imagine we’re trying to distinguish dog images from cat images.

2020-08-03

Logistic Regression: Probabilistic view

Class label: $$ y_i \in \\{0, 1\\} $$ Conditional probability distribution of the class label is $$ \begin{aligned} p(y=1|\boldsymbol{x}) &= \sigma(\boldsymbol{w}^T\boldsymbol{x}+b) \\\\ p(y=0|\boldsymbol{x}) &= 1 - \sigma(\boldsymbol{w}^T\boldsymbol{x}+b) \end{aligned} $$ with

2020-07-13
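
The two conditionals above form a proper distribution over the label (a Bernoulli); a sketch computing both (names illustrative):

```python
import math

def class_distribution(w, x, b):
    """Return (p(y=1|x), p(y=0|x)) = (sigma(w^T x + b), 1 - sigma(w^T x + b))."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    p1 = 1.0 / (1.0 + math.exp(-z))
    return p1, 1.0 - p1

# With z = 0 the model is maximally uncertain: both classes get probability 0.5.
p1, p0 = class_distribution(w=[0.0, 0.0], x=[1.0, 1.0], b=0.0)
```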