Softmax and Its Derivative

Softmax
We use the softmax activation function to predict the probability assigned to each of $n$ classes. For example, the probability of assigning an input sample to the $j$-th class is:
$$
p_j = \operatorname{softmax}(z_j) = \frac{e^{z_j}}{\sum_{k=1}^n e^{z_k}}
$$
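As a quick aside, this formula can be sketched in a few lines of NumPy (the `softmax` helper and the example logits are ours, not part of the derivation); subtracting the maximum before exponentiating is a common numerical-stability trick that leaves the result unchanged:

```python
import numpy as np

def softmax(z):
    """Softmax of a logit vector z; the max-shift avoids overflow in exp."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])   # example logits for n = 3 classes
p = softmax(z)
print(p)        # approximately [0.659, 0.242, 0.099]
print(p.sum())  # 1.0 -- a valid probability distribution
```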
Furthermore, we use one-hot encoding to represent the ground truth $y$, which means

$$
\sum_{k=1}^n y_k = 1
$$

Loss function (cross-entropy):
$$
\begin{aligned}
L &= -\sum_{k=1}^n y_k \log(p_k) \\
&= -\left(y_j \log(p_j) + \sum_{k \neq j} y_k \log(p_k)\right)
\end{aligned}
$$
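For concreteness, a minimal NumPy sketch of this loss with an assumed one-hot label (variable names are illustrative only):

```python
import numpy as np

z = np.array([2.0, 1.0, 0.1])
p = np.exp(z) / np.exp(z).sum()   # softmax probabilities p_k
y = np.array([1.0, 0.0, 0.0])     # one-hot ground truth: true class is j = 0

L = -np.sum(y * np.log(p))        # cross-entropy: -sum_k y_k log(p_k)
print(L, -np.log(p[0]))           # equal, since only y_0 is nonzero
```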
Gradient w.r.t. $z_j$:

$$
\begin{aligned}
\frac{\partial}{\partial z_j}L
&= \frac{\partial}{\partial z_j} \left(-\sum_{k=1}^n y_k \log(p_k)\right) \\
&= -\frac{\partial}{\partial z_j} \left(y_j \log(p_j) + \sum_{k \neq j} y_k \log(p_k)\right) \\
&= -\left(\frac{\partial}{\partial z_j} y_j \log(p_j) + \frac{\partial}{\partial z_j} \sum_{k \neq j} y_k \log(p_k)\right)
\end{aligned}
$$

Here we need the derivatives of softmax itself: $\frac{\partial p_j}{\partial z_j} = p_j(1 - p_j)$ and $\frac{\partial p_k}{\partial z_j} = -p_k p_j$ for $k \neq j$, so $\frac{\partial}{\partial z_j}\log(p_j) = 1 - p_j$ and $\frac{\partial}{\partial z_j}\log(p_k) = -p_j$.
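These two identities can be checked numerically; the sketch below (with an assumed `softmax` helper) compares a central-difference Jacobian against the closed form $\operatorname{diag}(p) - p\,p^\top$, whose diagonal entries are $p_j(1-p_j)$ and whose off-diagonal entries are $-p_k p_j$:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])
p = softmax(z)
n, eps = len(z), 1e-6

# Numerical Jacobian J[k, j] = d p_k / d z_j via central differences.
J = np.zeros((n, n))
for j in range(n):
    dz = np.zeros(n)
    dz[j] = eps
    J[:, j] = (softmax(z + dz) - softmax(z - dz)) / (2 * eps)

# Analytic Jacobian: diag(p) - outer(p, p).
print(np.allclose(J, np.diag(p) - np.outer(p, p), atol=1e-8))  # True
```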
Therefore,

$$
\begin{aligned}
\frac{\partial}{\partial z_j}L
&= -\left(\frac{\partial}{\partial z_j} y_j \log(p_j) + \frac{\partial}{\partial z_j} \sum_{k \neq j} y_k \log(p_k)\right) \\
&= -\left(y_j(1-p_j) - \sum_{k \neq j} y_k p_j\right) \\
&= -\left(y_j - y_j p_j - \sum_{k \neq j} y_k p_j\right) \\
&= -\left(y_j - \left(y_j p_j + \sum_{k \neq j} y_k p_j\right)\right) \\
&= -\left(y_j - \sum_{k=1}^n y_k p_j\right) \\
&\overset{\sum_{k=1}^{n} y_k = 1}{=} -\left(y_j - p_j\right) \\
&= p_j - y_j
\end{aligned}
$$
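The result $\frac{\partial L}{\partial z_j} = p_j - y_j$ is easy to sanity-check with finite differences; this sketch (helper names are again ours) compares $p - y$ against a numerical gradient of the loss:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def loss(z, y):
    return -np.sum(y * np.log(softmax(z)))  # cross-entropy

z = np.array([2.0, 1.0, 0.1])
y = np.array([0.0, 1.0, 0.0])   # one-hot label, true class j = 1
eps = 1e-6

# Numerical gradient dL/dz_j via central differences.
num_grad = np.zeros_like(z)
for j in range(len(z)):
    dz = np.zeros_like(z)
    dz[j] = eps
    num_grad[j] = (loss(z + dz, y) - loss(z - dz, y)) / (2 * eps)

print(softmax(z) - y)  # analytic gradient p - y
print(num_grad)        # matches to high precision
```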
Useful resources