Use multinomial logistic regression (also called softmax regression, or the maxent classifier). The target y is a variable that ranges over more than two classes; we want to know the probability of y being in each potential class $c \in C$: $p(y=c \mid x)$.
We use the softmax function to compute $p(y=c \mid x)$:
Takes a vector $z = [z_1, z_2, \ldots, z_k]$ of $k$ arbitrary values
Maps them to a probability distribution
Each value $\in (0, 1)$
All the values summing to 1
For a vector z of dimensionality k, the softmax is:
$\mathrm{softmax}(z_i) = \dfrac{e^{z_i}}{\sum_{j=1}^{k} e^{z_j}} \quad 1 \le i \le k$
The softmax of an input vector $z = [z_1, z_2, \ldots, z_k]$ is thus:

$\mathrm{softmax}(z) = \left[\dfrac{e^{z_1}}{\sum_{j=1}^{k} e^{z_j}}, \dfrac{e^{z_2}}{\sum_{j=1}^{k} e^{z_j}}, \ldots, \dfrac{e^{z_k}}{\sum_{j=1}^{k} e^{z_j}}\right]$
The denominator $\sum_{j=1}^{k} e^{z_j}$ normalizes all the values into probabilities.
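As a concrete illustration, here is a minimal NumPy sketch of the softmax (the function name and the max-subtraction trick for numerical stability are additions for this sketch, not from the text):

```python
import numpy as np

def softmax(z):
    """Map a vector of k arbitrary scores to a probability distribution."""
    # Subtracting max(z) before exponentiating avoids overflow
    # and does not change the result.
    exp_z = np.exp(z - np.max(z))
    return exp_z / exp_z.sum()

z = np.array([0.6, 1.1, -1.5, 1.2, 3.2, -1.1])
print(softmax(z))        # ~[0.055 0.090 0.007 0.100 0.738 0.010]
print(softmax(z).sum())  # 1.0
```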
Like the sigmoid, the input to the softmax will be the dot product between a weight vector w and an input vector x (plus a bias). But now we'll need separate weight vectors (and a bias) for each of the K classes.
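Putting the pieces together, the probability of class c is the softmax over the K per-class scores, $p(y=c \mid x) = \dfrac{e^{w_c \cdot x + b_c}}{\sum_{j=1}^{K} e^{w_j \cdot x + b_j}}$. A minimal sketch of this computation (the function name, the stacking of the per-class weight vectors into a matrix W, and the example shapes are assumptions for illustration):

```python
import numpy as np

def predict_proba(W, b, x):
    """p(y=c|x) for every class c.

    W: (K, d) matrix, one weight row w_c per class
    b: (K,)  vector, one bias b_c per class
    x: (d,)  input feature vector
    """
    z = W @ x + b                  # K per-class scores w_c . x + b_c
    exp_z = np.exp(z - np.max(z))  # numerically stable softmax
    return exp_z / exp_z.sum()

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 5))        # 3 classes, 5 features
b = np.zeros(3)
x = rng.normal(size=5)
probs = predict_proba(W, b, x)
print(probs, probs.sum())          # a distribution over the 3 classes
```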
For multiclass classification, input features are functions of both:
the observation x
the candidate output class c
When we are discussing features we will use the notation $f_i(c, x)$: feature i for a particular class c for a given observation x
Example
Suppose we are doing text classification, and instead of binary classification our task is to assign one of the 3 classes +, −, or 0 (neutral) to a document. Now a feature related to exclamation marks might have a negative weight for 0 documents, and a positive weight for + or − documents.
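A small sketch of such class-specific features (the feature definitions and the weights below are invented for illustration, not taken from the text):

```python
def f1(c, x):
    """Feature 1: document contains '!' AND candidate class is '+'."""
    return 1.0 if "!" in x and c == "+" else 0.0

def f2(c, x):
    """Feature 2: document contains '!' AND candidate class is '0'."""
    return 1.0 if "!" in x and c == "0" else 0.0

# Hypothetical learned weights: exclamation marks suggest an opinionated
# document, so the feature paired with the neutral class 0 gets a
# negative weight while the one paired with + gets a positive weight.
weights = {"f1": 2.3, "f2": -3.5}

doc = "Great deal!!!"
for c in ("+", "-", "0"):
    print(c, f1(c, doc), f2(c, doc))  # + fires f1, 0 fires f2, - neither
```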
Learning in Multinomial Logistic Regression
The loss function for a single example x is the negative sum over the K output classes of the log output probabilities, each weighted by the indicator $\mathbb{1}\{y = k\}$ (1 when k is the correct class, 0 otherwise):

$L_{CE}(\hat{y}, y) = -\sum_{k=1}^{K} \mathbb{1}\{y = k\} \log p(y = k \mid x)$

Since the indicator zeroes out every term but one, this reduces to the negative log probability that the model assigns to the correct class.
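A minimal sketch of this cross-entropy loss (the function name is an assumption; the probability vector would come from the softmax computation above):

```python
import numpy as np

def cross_entropy_loss(probs, true_class):
    """Negative log probability assigned to the correct class.

    probs: (K,) softmax output, p(y=k|x) for each class k
    true_class: index of the correct class
    Only the correct class's term survives the sum over K classes,
    since the indicator 1{y=k} is 0 for every other class.
    """
    return -np.log(probs[true_class])

probs = np.array([0.10, 0.74, 0.16])
print(cross_entropy_loss(probs, 1))  # -log(0.74) ~ 0.301
```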