Softmax and Its Derivative

Softmax

We use the softmax activation function to predict the probabilities assigned to $n$ classes. Given the logits $z$, the probability of assigning the input sample to the $j$-th class is: $$ p_j = \operatorname{softmax}(z)_j = \frac{e^{z_j}}{\sum_{k=1}^n e^{z_k}} $$ Furthermore, we use one-hot encoding to represent the ground truth $y$, which means $y_k \in \{0, 1\}$ and $$ \sum_{k=1}^n y_k = 1 $$ Loss function (Cross-Entropy): $$ \begin{aligned} L &= -\sum_{k=1}^n y_k \log(p_k) \\ &= - \left(y_j \log(p_j) + \sum_{k \neq j}y_k \log(p_k)\right) \end{aligned} $$ Gradient w.r.t. $z_j$:
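
As a quick sanity check, here is a minimal NumPy sketch of these two definitions (the helper names and the example logits are illustrative, not part of the derivation):

```python
import numpy as np

def softmax(z):
    """Softmax over a vector of logits. Subtracting the max is the usual
    numerical-stability trick and does not change the result."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def cross_entropy(p, y):
    """L = -sum_k y_k * log(p_k) for a one-hot target y."""
    return -np.sum(y * np.log(p))

z = np.array([2.0, 1.0, 0.1])   # example logits for n = 3 classes
y = np.array([0.0, 1.0, 0.0])   # one-hot ground truth (class j = 1)
p = softmax(z)
print(p, p.sum())               # probabilities sum to 1
print(cross_entropy(p, y))      # equals -log(p_1) here
```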

$$ \begin{aligned} \frac{\partial}{\partial z_j}L &= \frac{\partial}{\partial z_j} \left(-\sum_{k=1}^n y_k \log(p_k)\right) \\ &= -\frac{\partial}{\partial z_j} \left(y_j \log(p_j) + \sum_{k \neq j}y_k \log(p_k)\right) \\ &= -\left(\frac{\partial}{\partial z_j} y_j \log(p_j) + \frac{\partial}{\partial z_j}\sum_{k \neq j}y_k \log(p_k)\right) \end{aligned} $$

  • $k=j$ $$ \begin{aligned} \frac{\partial}{\partial z_j} y_j \log(p_j) &= \frac{y_j}{p_j} \cdot \left(\frac{\partial}{\partial z_j} p_j\right) \\ &= \frac{y_j}{p_j} \cdot \left(\frac{\partial}{\partial z_j} \frac{e^{z_j}}{\sum_{k=1}^n e^{z_k}}\right) \\ &= \frac{y_j}{p_j} \cdot \frac{(\frac{\partial}{\partial z_j}e^{z_j})\sum_{k=1}^n e^{z_k} - e^{z_j}(\frac{\partial}{\partial z_j}\sum_{k=1}^n e^{z_k}) }{(\sum_{k=1}^n e^{z_k})^2} \\ &= \frac{y_j}{p_j} \cdot \frac{e^{z_j}\sum_{k=1}^n e^{z_k} - e^{z_j}e^{z_j}}{(\sum_{k=1}^n e^{z_k})^2} \\ &= \frac{y_j}{p_j} \cdot \frac{e^{z_j}(\sum_{k=1}^n e^{z_k} - e^{z_j})}{(\sum_{k=1}^n e^{z_k})^2} \\ &= \frac{y_j}{p_j} \cdot \underbrace{\frac{e^{z_j}}{\sum_{k=1}^n e^{z_k}}}_{=p_j} \cdot \frac{\sum_{k=1}^n e^{z_k} - e^{z_j}}{\sum_{k=1}^n e^{z_k}}\\ &= \frac{y_j}{p_j} \cdot p_j \cdot \underbrace{\left( \frac{\sum_{k=1}^n e^{z_k} }{\sum_{k=1}^n e^{z_k}} - \frac{e^{z_j} }{\sum_{k=1}^n e^{z_k}}\right)}_{=1 - p_j} \\ &= y_j(1-p_j) \end{aligned} $$

  • $\forall k \neq j$ $$ \begin{aligned} \frac{\partial}{\partial z_j}\sum_{k \neq j}y_k \log(p_k) &= \sum_{k \neq j} \frac{\partial}{\partial z_j}y_k \log(p_k) \\ &= \sum_{k \neq j} \frac{y_k}{p_k} \cdot \frac{(\overbrace{\frac{\partial}{\partial z_j} e^{z_k}}^{=0})(\sum_{i=1}^n e^{z_i}) - e^{z_k}(\overbrace{\frac{\partial}{\partial z_j}\sum_{i=1}^n e^{z_i}}^{=e^{z_j}})}{(\sum_{i=1}^n e^{z_i})^2} \\ &= \sum_{k \neq j} \frac{y_k}{p_k} \frac{-e^{z_k} e^{z_j}}{(\sum_{i=1}^n e^{z_i})^2} \\ &= -\sum_{k \neq j} \frac{y_k}{p_k} \underbrace{\frac{e^{z_k}}{\sum_{i=1}^n e^{z_i}}}_{=p_k} \underbrace{\frac{e^{z_j}}{\sum_{i=1}^n e^{z_i}}}_{=p_j} \\ &= -\sum_{k \neq j}y_k p_j \end{aligned} $$
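
Both cases reduce to the two softmax Jacobian entries, $\frac{\partial p_j}{\partial z_j} = p_j(1 - p_j)$ and $\frac{\partial p_k}{\partial z_j} = -p_k p_j$ for $k \neq j$. A short finite-difference sketch (illustrative code, with arbitrary example logits and indices $j = 0$, $k = 2$) confirms both numerically:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def d_p_d_zj(z, out_idx, j, eps=1e-6):
    """Central finite difference of p[out_idx] with respect to z[j]."""
    z_plus, z_minus = z.copy(), z.copy()
    z_plus[j] += eps
    z_minus[j] -= eps
    return (softmax(z_plus)[out_idx] - softmax(z_minus)[out_idx]) / (2 * eps)

z = np.array([2.0, 1.0, 0.1])
p = softmax(z)
j, k = 0, 2                      # k != j

# diagonal case: dp_j/dz_j = p_j (1 - p_j)
print(d_p_d_zj(z, j, j), p[j] * (1 - p[j]))

# off-diagonal case: dp_k/dz_j = -p_k p_j
print(d_p_d_zj(z, k, j), -p[k] * p[j])
```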

Therefore, $$ \begin{aligned} \frac{\partial}{\partial z_j}L &= -\left(\frac{\partial}{\partial z_j} y_j \log(p_j) + \frac{\partial}{\partial z_j}\sum_{k \neq j}y_k \log(p_k)\right) \\ &= -\left(y_j(1-p_j) - \sum_{k \neq j}y_kp_j\right) \\ &= -\left(y_j-y_jp_j - \sum_{k \neq j}y_kp_j\right) \\ &= -\left(y_j- (y_jp_j + \sum_{k \neq j}y_kp_j)\right) \\ &= -\left(y_j- \sum_{k=1}^ny_kp_j\right)\\ &\overset{\sum_{k=1}^{n} y_k = 1}{=} -\left(y_j- p_j\right) \\ &= p_j - y_j \end{aligned} $$
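
So the whole gradient of the cross-entropy loss w.r.t. the logits is simply $p - y$. The sketch below (illustrative code, not from the derivation itself) checks this closed form against central finite differences for every component:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def loss(z, y):
    return -np.sum(y * np.log(softmax(z)))

z = np.array([2.0, 1.0, 0.1])
y = np.array([0.0, 1.0, 0.0])
eps = 1e-6

# numerical gradient: central finite differences of L w.r.t. each z_j
numeric = np.zeros_like(z)
for j in range(len(z)):
    z_plus, z_minus = z.copy(), z.copy()
    z_plus[j] += eps
    z_minus[j] -= eps
    numeric[j] = (loss(z_plus, y) - loss(z_minus, y)) / (2 * eps)

analytic = softmax(z) - y        # the compact result p - y
print(numeric)
print(analytic)                  # matches the numerical gradient
```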

