Multinomial Logistic Regression

Motivation

More than two classes?

Use multinomial logistic regression (also called softmax regression or the maxent classifier). The target $y$ is a variable that ranges over more than two classes; we want to know the probability of $y$ being in each potential class $c \in C$, i.e. $p(y=c|x)$.

We use the softmax function to compute $p(y=c|x)$:

  • Takes a vector $z=[z_1, z_2,\dots, z_k]$ of $k$ arbitrary values
  • Maps them to a probability distribution
    • Each value $\in (0, 1)$
    • All the values summing to $1$

For a vector $z$ of dimensionality $k$, the softmax is:

$$ \operatorname{softmax}\left(z_{i}\right)=\frac{e^{z_{i}}}{\sum_{j=1}^{k} e^{z_{j}}} \qquad 1 \leq i \leq k $$

The softmax of an input vector $z=[z_1, z_2,\dots, z_k]$ is thus:

$$ \operatorname{softmax}(z)=\left[\frac{e^{z_{1}}}{\sum_{i=1}^{k} e^{z_{i}}}, \frac{e^{z_{2}}}{\sum_{i=1}^{k} e^{z_{i}}}, \ldots, \frac{e^{z_{k}}}{\sum_{i=1}^{k} e^{z_{i}}}\right] $$
  • The denominator $\sum_{j=1}^{k} e^{z_{j}}$ is used to normalize all the values into probabilities.
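
As a concrete illustration, here is a minimal NumPy sketch of this computation (subtracting $\max(z)$ before exponentiating is a standard numerical-stability trick and does not change the result):

```python
import numpy as np

def softmax(z):
    """Map a vector of k arbitrary scores to a probability distribution."""
    z = np.asarray(z, dtype=float)
    exp_z = np.exp(z - z.max())    # shift by max(z) for numerical stability
    return exp_z / exp_z.sum()     # normalize so the values sum to 1

scores = [0.6, 1.1, -1.5, 1.2, 3.2, -1.1]
probs = softmax(scores)
print(probs, probs.sum())          # each value in (0, 1); the sum is 1.0
```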

Like the sigmoid, the input to the softmax will be the dot product between a weight vector $w$ and an input vector $x$ (plus a bias). But now we’ll need a separate weight vector $w_c$ and bias $b_c$ for each of the $K$ classes:

$$ p(y=c | x)=\frac{e^{w_{c} \cdot x+b_{c}}}{\displaystyle\sum_{j=1}^{K} e^{w_{j} \cdot x+b_{j}}} $$
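
A minimal sketch of this, assuming a weight matrix `W` whose rows are the per-class weight vectors $w_c$ and a bias vector `b` (the names and toy sizes here are illustrative, not from the original):

```python
import numpy as np

def class_probabilities(W, b, x):
    """p(y=c|x) for every class c, given per-class weights W[c] and biases b[c]."""
    z = W @ x + b                  # one score w_c . x + b_c per class
    exp_z = np.exp(z - z.max())    # same stability trick as above
    return exp_z / exp_z.sum()     # softmax over the class scores

# Toy example: 3 classes, 4 input features
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))
b = np.zeros(3)
x = np.array([1.0, 0.0, 2.0, 0.5])
print(class_probabilities(W, b, x))   # a distribution over the 3 classes
```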

Features in Multinomial Logistic Regression

For multiclass classification, a feature is a function of both:

  • observation $x$
  • candidate output class $c$

$\Rightarrow$ When discussing features, we will use the notation $f_i(c, x)$: feature $i$ for a particular class $c$ and a given observation $x$.

Example

Suppose we are doing text classification, and instead of binary classification our task is to assign one of the 3 classes +, −, or 0 (neutral) to a document. Now a feature related to exclamation marks might have a negative weight for 0 documents, and a positive weight for + or − documents:

[Figure: example features and their weights for the three classes +, −, and 0]
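
A minimal sketch of what such class-conditioned features could look like in code, using the exclamation-mark example (the specific feature definitions and class labels here are illustrative):

```python
def f1(c, x):
    # Fires when the candidate class is '-' and the document contains '!'
    return 1 if (c == "-" and "!" in x) else 0

def f2(c, x):
    # Fires when the candidate class is '0' and the document contains '!'
    return 1 if (c == "0" and "!" in x) else 0

doc = "perhaps the worst movie ever made!"
print(f1("-", doc), f1("+", doc), f2("0", doc))   # 1 0 1
```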

Learning in Multinomial Logistic Regression

The cross-entropy loss for a single example $x$ is a sum over the $K$ output classes, where the indicator function picks out the term for the true class; it therefore reduces to the negative log probability the model assigns to the correct class:

$$ \begin{aligned} L_{CE}(\hat{y}, y) &=-\sum_{k=1}^{K} \mathbb{1}\{y=k\} \log p(y=k | x) \\ &=-\sum_{k=1}^{K} \mathbb{1}\{y=k\} \log \frac{e^{w_{k} \cdot x+b_{k}}}{\sum_{j=1}^{K} e^{w_{j} \cdot x+b_{j}}} \end{aligned} $$
  • $\mathbb{1}\{\cdot\}$: evaluates to $1$ if the condition inside the braces is true and to $0$ otherwise.
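
A minimal sketch of this loss for one example, reusing the hypothetical `class_probabilities` helper from above; because of the indicator, only the true class’s term survives:

```python
import numpy as np

def cross_entropy_loss(W, b, x, y):
    """Negative log probability that the model assigns to the true class index y."""
    probs = class_probabilities(W, b, x)   # softmax over the scores w_k . x + b_k
    return -np.log(probs[y])               # the indicator keeps only the k = y term
```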

The gradient with respect to the weight vector $w_k$ for class $k$ is:

$$ \begin{aligned} \frac{\partial L_{CE}}{\partial w_{k}} &=-\left(\mathbb{1}\{y=k\}-p(y=k | x)\right) x \\ &=-\left(\mathbb{1}\{y=k\}-\frac{e^{w_{k} \cdot x+b_{k}}}{\sum_{j=1}^{K} e^{w_{j} \cdot x+b_{j}}}\right) x \end{aligned} $$
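
A sketch of the corresponding gradient computation, again assuming the helpers above: for each class $k$, the error term $\mathbb{1}\{y=k\}-p(y=k|x)$ multiplies the input $x$, so the per-class gradients can be stacked with an outer product.

```python
import numpy as np

def gradients(W, b, x, y):
    """Gradients of the cross-entropy loss w.r.t. each w_k and b_k."""
    probs = class_probabilities(W, b, x)
    indicator = np.zeros_like(probs)
    indicator[y] = 1.0                    # the indicator 1{y = k}
    error = -(indicator - probs)          # shape (K,)
    grad_W = np.outer(error, x)           # row k is dL_CE/dw_k
    grad_b = error                        # dL_CE/db_k
    return grad_W, grad_b

# One (illustrative) stochastic gradient descent step with learning rate eta:
# grad_W, grad_b = gradients(W, b, x, y)
# W -= eta * grad_W
# b -= eta * grad_b
```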