Multinomial Logistic Regression


Motivation

More than two classes?

Use multinomial logistic regression (also called softmax regression, or maxent classifier). The target $y$ is a variable that ranges over more than two classes; we want to know the probability of $y$ being in each potential class $c \in C$: $p(y=c|x)$.

We use the softmax function to compute $p(y=c|x)$:

  • Takes a vector $z=[z_1, z_2, \dots, z_k]$ of $k$ arbitrary values
  • Maps them to a probability distribution
    • Each value $\in (0, 1)$
    • All the values summing to $1$

For a vector $z$ of dimensionality $k$, the softmax is:

$$\operatorname{softmax}\left(z_{i}\right)=\frac{e^{z_{i}}}{\sum_{j=1}^{k} e^{z_{j}}} \qquad 1 \leq i \leq k$$

The softmax of an input vector $z=[z_1, z_2, \dots, z_k]$ is thus:

$$\operatorname{softmax}(z)=\left[\frac{e^{z_{1}}}{\sum_{i=1}^{k} e^{z_{i}}}, \frac{e^{z_{2}}}{\sum_{i=1}^{k} e^{z_{i}}}, \ldots, \frac{e^{z_{k}}}{\sum_{i=1}^{k} e^{z_{i}}}\right]$$

  • The denominator $\sum_{j=1}^{k} e^{z_{j}}$ is used to normalize all the values into probabilities (see the numeric sketch below).
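
As a quick numeric illustration, here is a minimal NumPy sketch of the softmax; the values in `z` are made up for the example:

```python
import numpy as np

def softmax(z):
    # Shift by the max for numerical stability; the shift cancels out
    # between numerator and denominator, so the result is unchanged.
    exp_z = np.exp(z - np.max(z))
    return exp_z / exp_z.sum()

z = np.array([0.6, 1.1, -1.5, 1.2, 3.2, -1.1])  # arbitrary example scores
probs = softmax(z)
print(probs)        # every entry lies in (0, 1)
print(probs.sum())  # the entries sum to 1.0
```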

Like the sigmoid, the input to the softmax will be the dot product between a weight vector $w$ and an input vector $x$ (plus a bias). But now we'll need a separate weight vector (and bias) for each of the $K$ classes.

$$p(y=c | x)=\frac{e^{w_{c} \cdot x+b_{c}}}{\displaystyle\sum_{j=1}^{k} e^{w_{j} \cdot x+b_{j}}}$$
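
To make the per-class weights concrete, here is a small sketch that stacks the class weight vectors into a matrix and applies the softmax; the matrix `W`, bias `b`, and input `x` are invented for illustration:

```python
import numpy as np

def softmax(z):
    exp_z = np.exp(z - np.max(z))
    return exp_z / exp_z.sum()

# Invented example: 3 classes, 4 input features.
W = np.array([[ 0.5, -0.2,  1.0,  0.3],   # w_1: weight vector for class 1
              [-0.4,  0.9,  0.1,  0.0],   # w_2
              [ 0.2,  0.2, -0.7,  0.5]])  # w_3
b = np.array([0.1, -0.1, 0.0])            # one bias per class
x = np.array([1.0, 0.0, 2.0, 1.0])        # input feature vector

z = W @ x + b        # z_c = w_c . x + b_c for every class c at once
p = softmax(z)       # p[c] = p(y = c | x)
print(p, p.sum())
```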

Features in Multinomial Logistic Regression

For multiclass classification, input features are functions of both:

  • the observation $x$
  • the candidate output class $c$

$\Rightarrow$ When we are discussing features we will use the notation $f_i(c, x)$: feature $i$ for a particular class $c$ for a given observation $x$.

Example

Suppose we are doing text classification, and instead of binary classification our task is to assign one of the 3 classes +, −, or 0 (neutral) to a document. Now a feature related to exclamation marks might have a negative weight for 0 documents, and a positive weight for + or − documents:

[Figure: example feature with per-class weights for the +, −, and 0 classes]
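
As a sketch of what such class-dependent features could look like in code, here is a hypothetical exclamation-mark cue with invented per-class weights (positive for + and −, negative for the neutral class 0):

```python
# Hypothetical feature template f_i(c, x): the cue itself only inspects the
# document x; the dependence on the class c comes from giving the cue a
# different weight for each candidate class.
def f_exclaim(c, x):
    return 1 if "!" in x else 0

# Invented weights for this cue, one per class.
w_exclaim = {"+": 2.0, "-": 1.8, "0": -1.0}

doc = "loved the acting, hated the plot!"
for c in ("+", "-", "0"):
    print(c, w_exclaim[c] * f_exclaim(c, doc))
```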

Learning in Multinomial Logistic Regression

The loss function for a single example $x$ is the negative sum of the log probabilities of the $K$ output classes, each weighted by an indicator of whether that class is the true one:

$$
\begin{aligned}
L_{CE}(\hat{y}, y) &= -\sum_{k=1}^{K} \mathbb{1}\{y=k\} \log p(y=k | x) \\
&= -\sum_{k=1}^{K} \mathbb{1}\{y=k\} \log \frac{e^{w_{k} \cdot x+b_{k}}}{\sum_{j=1}^{K} e^{w_{j} \cdot x+b_{j}}}
\end{aligned}
$$
  • $\mathbb{1}\{\}$: evaluates to $1$ if the condition in the brackets is true and to $0$ otherwise.
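
A minimal worked sketch of this loss for one made-up example: because of the indicator, the sum over the $K$ classes collapses to the negative log probability that the model assigns to the true class.

```python
import numpy as np

def softmax(z):
    exp_z = np.exp(z - np.max(z))
    return exp_z / exp_z.sum()

# Made-up scores z_k = w_k . x + b_k for K = 3 classes; the true class is y = 1.
z = np.array([2.0, 0.5, -1.0])
y = 1

p = softmax(z)
# The indicator 1{y = k} is 1 only for k = y, so only that term survives.
loss = -np.log(p[y])
print(loss)
```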

Gradient:

$$
\begin{aligned}
\frac{\partial L_{CE}}{\partial w_{k}} &= -\left(\mathbb{1}\{y=k\} - p(y=k | x)\right) x \\
&= -\left(\mathbb{1}\{y=k\} - \frac{e^{w_{k} \cdot x+b_{k}}}{\sum_{j=1}^{K} e^{w_{j} \cdot x+b_{j}}}\right) x
\end{aligned}
$$
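
A sketch of this gradient in NumPy, with made-up shapes and data and a single vanilla gradient-descent step; for each class $k$, the gradient is the difference between the indicator for the true class and the estimated probability, multiplied by the input $x$:

```python
import numpy as np

def softmax(z):
    exp_z = np.exp(z - np.max(z))
    return exp_z / exp_z.sum()

# Made-up example: 3 classes, 4 features, true class y = 1.
W = np.zeros((3, 4))                 # one weight vector w_k per row
b = np.zeros(3)
x = np.array([1.0, 0.0, 2.0, 1.0])
y = 1

p = softmax(W @ x + b)               # p[k] = p(y = k | x)
one_hot = np.eye(3)[y]               # 1{y = k} for every k

grad_W = -np.outer(one_hot - p, x)   # row k is -(1{y = k} - p[k]) * x
grad_b = -(one_hot - p)              # -(1{y = k} - p[k])

# One illustrative gradient-descent step with an invented learning rate.
lr = 0.1
W -= lr * grad_W
b -= lr * grad_b
print(W, b)
```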