Multinomial Logistic Regression
Motivation
More than two classes?
Use multinomial logistic regression (also called softmax regression, or the maxent classifier). The target $y$ is a variable that ranges over more than two classes; we want to know the probability of $y$ being in each potential class $c \in C$, i.e., $p(y=c|x)$.
We use the softmax function to compute $p(y=c|x)$:
- Takes a vector $z=[z_1, z_2,\dots, z_k]$ of $k$ arbitrary values
- Maps them to a probability distribution
- Each value $\in (0, 1)$
- All the values summing to $1$
For a vector $z$ of dimensionality $k$, the softmax is:
$$ \operatorname{softmax}\left(z_{i}\right)=\frac{e^{z_{i}}}{\sum_{j=1}^{k} e^{z_{j}}} \qquad 1 \leq i \leq k $$

The softmax of an input vector $z=[z_1, z_2,\dots, z_k]$ is thus:
$$ \operatorname{softmax}(z)=\left[\frac{e^{z_{1}}}{\sum_{i=1}^{k} e^{z_{i}}}, \frac{e^{z_{2}}}{\sum_{i=1}^{k} e^{z_{i}}}, \ldots, \frac{e^{z_{k}}}{\sum_{i=1}^{k} e^{z_{i}}}\right] $$

- The denominator $\sum_{j=1}^{k} e^{z_{j}}$ is used to normalize all the values into probabilities.
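As a concrete sketch, here is the softmax definition in NumPy (subtracting the maximum is a standard numerical-stability trick, not part of the formula above):

```python
import numpy as np

def softmax(z):
    """Map a vector of k arbitrary scores to a probability distribution.

    Subtracting max(z) does not change the result (it cancels in the ratio)
    but avoids overflow in exp() for large scores.
    """
    z = np.asarray(z, dtype=float)
    exp_z = np.exp(z - z.max())
    return exp_z / exp_z.sum()

z = [0.6, 1.1, -1.5, 1.2, 3.2, -1.1]
print(softmax(z))          # each value is in (0, 1)
print(softmax(z).sum())    # the values sum to 1
```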
Like the sigmoid, the input to the softmax will be the dot product between a weight vector $w$ and an input vector $x$ (plus a bias). But now we’ll need separate weight vectors (and bias) for each of the $K$ classes.
$$ p(y=c | x)=\frac{e^{w_{c} \cdot x+b_{c}}}{\displaystyle\sum_{j=1}^{K} e^{w_{j} \cdot x+b_{j}}} $$

Features in Multinomial Logistic Regression
For multiclass classification, input features are:
- observation $x$
- candidate output class $c$
$\Rightarrow$ When we are discussing features we will use the notation $f_i(c, x)$: feature $i$ for a particular class $c$ and a given observation $x$.
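As a hypothetical illustration of this notation (the feature, class name, and strings below are made up for the example, not taken from the text):

```python
# A class-specific feature f_1(c, x): it fires only when the document
# contains "!" AND the candidate class is "+".
def f1(c, x):
    return 1 if ("!" in x and c == "+") else 0

print(f1("+", "great movie !"))  # 1
print(f1("-", "great movie !"))  # 0 (same observation, different class)
```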
Example
Suppose we are doing text classification, and instead of binary classification our task is to assign one of the 3 classes +, −, or 0 (neutral) to a document. Now a feature related to exclamation marks might have a negative weight for 0 documents, and a positive weight for + or − documents; the sketch below illustrates this with hypothetical weights.
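The sketch puts the pieces together for the 3-class case: one weight vector and one bias per class, fed through the softmax as in the formula for $p(y=c|x)$ above. All weight and bias values are invented purely for illustration.

```python
import numpy as np

def softmax(z):
    exp_z = np.exp(z - z.max())
    return exp_z / exp_z.sum()

classes = ["+", "-", "0"]

# Hypothetical weights: one weight vector (row) and one bias per class.
# Single feature here: the count of "!" in the document.
W = np.array([[ 2.0],    # "+" : exclamations push toward positive
              [ 1.5],    # "-" : ... and toward negative
              [-1.0]])   # "0" : ... but away from neutral
b = np.array([0.1, 0.1, 0.2])

x = np.array([2.0])               # a document with two exclamation marks
probs = softmax(W @ x + b)        # p(y = c | x) for every class c
for c, p in zip(classes, probs):
    print(c, round(float(p), 3))
```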
Learning in Multinomial Logistic Regression
The cross-entropy loss for a single example $x$ is a sum over the $K$ output classes: each term is the indicator that the true class is $k$, multiplied by the log probability the model assigns to class $k$. Since only one indicator is nonzero, the loss reduces to the negative log probability of the correct class:
$$ \begin{aligned} L_{CE}(\hat{y}, y) &=-\sum_{k=1}^{K} \mathbb{1}\{y=k\} \log p(y=k | x) \\ &=-\sum_{k=1}^{K} \mathbb{1}\{y=k\} \log \frac{e^{w_{k} \cdot x+b_{k}}}{\sum_{j=1}^{K} e^{w_{j} \cdot x+b_{j}}} \end{aligned} $$

- $\mathbb{1}\{\cdot\}$: evaluates to $1$ if the condition in the brackets is true and to $0$ otherwise.
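A sketch of this loss computation on one example, with invented weights (3 classes, 2 features):

```python
import numpy as np

def softmax(z):
    exp_z = np.exp(z - z.max())
    return exp_z / exp_z.sum()

# Hypothetical 3-class setup: W has one row of weights per class.
W = np.array([[ 0.5, -1.0],
              [-0.2,  0.3],
              [ 0.1,  0.1]])
b = np.array([0.0, 0.0, 0.0])
x = np.array([1.0, 2.0])
y = 1                               # index of the true class

probs = softmax(W @ x + b)          # p(y = k | x) for every k
one_hot = np.eye(len(probs))[y]     # the indicator 1{y = k} as a vector
loss = -np.sum(one_hot * np.log(probs))

# Because only one indicator is 1, this equals -log p(y = true class | x):
assert np.isclose(loss, -np.log(probs[y]))
print(loss)
```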
Gradient: for each class $k$, the gradient with respect to the weight vector $w_k$ is the difference between the indicator $\mathbb{1}\{y=k\}$ and the probability the model outputs for class $k$, scaled by the input vector $x$:
$$ \begin{aligned} \frac{\partial L_{CE}}{\partial w_{k}} &=-\left(\mathbb{1}\{y=k\}-p(y=k | x)\right) x \\ &=-\left(\mathbb{1}\{y=k\}-\frac{e^{w_{k} \cdot x+b_{k}}}{\sum_{j=1}^{K} e^{w_{j} \cdot x+b_{j}}}\right) x \end{aligned} $$
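And a matching sketch of the gradient under the same invented setup as before; the bias gradient and the gradient-descent update step are added here for completeness (they follow the same pattern but are not spelled out above):

```python
import numpy as np

def softmax(z):
    exp_z = np.exp(z - z.max())
    return exp_z / exp_z.sum()

# Same hypothetical setup as above: 3 classes, 2 features, one example.
W = np.array([[ 0.5, -1.0],
              [-0.2,  0.3],
              [ 0.1,  0.1]])
b = np.zeros(3)
x = np.array([1.0, 2.0])
y = 1

probs = softmax(W @ x + b)
one_hot = np.eye(3)[y]

# dL/dw_k = -(1{y=k} - p(y=k|x)) x, stacked over all k into a matrix;
# dL/db_k is the same bracketed term without the factor of x.
grad_W = -np.outer(one_hot - probs, x)
grad_b = -(one_hot - probs)

# One step of (stochastic) gradient descent with learning rate eta.
eta = 0.1
W -= eta * grad_W
b -= eta * grad_b
```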