Math: Softmax

Softmax and Its Derivative

Softmax

We use the softmax activation function to predict the probabilities assigned to $n$ classes. For example, the probability of assigning an input sample to the $j$-th class is:

$$
p_j = \operatorname{softmax}(z_j) = \frac{e^{z_j}}{\sum_{k=1}^n e^{z_k}}
$$
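The formula above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function name `softmax` and the example logits are assumptions, and subtracting the maximum logit is a common numerical-stability trick that is not part of the formula itself (the extra factor cancels in the ratio).

```python
import math

def softmax(z):
    """Return softmax probabilities p_j = e^{z_j} / sum_k e^{z_k} for a list of logits."""
    # Subtracting the largest logit does not change the result,
    # because e^{-max} cancels between numerator and denominator,
    # but it keeps math.exp from overflowing on large inputs.
    z_max = max(z)
    exps = [math.exp(z_k - z_max) for z_k in z]
    total = sum(exps)
    return [e / total for e in exps]

# Example logits (made up for illustration).
probs = softmax([1.0, 2.0, 3.0])
print(probs)       # three probabilities, largest for the largest logit
print(sum(probs))  # approximately 1.0
```

The outputs are non-negative and sum to 1, so they can be read as class probabilities.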

Furthermore, we use one-hot encoding to represent the ground truth $y$: exactly one entry is 1 and all others are 0, which means

$$
\sum_{k=1}^n y_k = 1
$$

Loss function (Cross-Entropy):

$$
\begin{aligned}
L &= -\sum_{k=1}^n y_k \log(p_k) \\
&= -\left(y_j \log(p_j) + \sum_{k \neq j} y_k \log(p_k)\right)
\end{aligned}
$$
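As a small illustration of the loss, the sketch below evaluates the cross-entropy for one sample. The function name `cross_entropy` and the example vectors are assumptions made for this example. With a one-hot label, only the term of the true class survives, so the loss reduces to the negative log-probability of that class.

```python
import math

def cross_entropy(y, p):
    """Cross-entropy L = -sum_k y_k * log(p_k) for a single sample."""
    # Terms with y_k == 0 contribute nothing, so they are skipped;
    # this also avoids evaluating log(0) for classes with zero probability.
    return -sum(y_k * math.log(p_k) for y_k, p_k in zip(y, p) if y_k != 0)

y = [0.0, 1.0, 0.0]          # one-hot label: the true class is index 1
p = [0.2, 0.7, 0.1]          # example predicted probabilities (made up)
loss = cross_entropy(y, p)   # equals -log(0.7) for this label
print(loss)
```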

Gradient w.r.t. $z_j$:

$$
\begin{aligned}
\frac{\partial}{\partial z_j}L &= \frac{\partial}{\partial z_j} \left(-\sum_{k=1}^n y_k \log(p_k)\right) \\
&= -\frac{\partial}{\partial z_j} \left(y_j \log(p_j) + \sum_{k \neq j} y_k \log(p_k)\right) \\
&= -\left(\frac{\partial}{\partial z_j} y_j \log(p_j) + \frac{\partial}{\partial z_j}\sum_{k \neq j} y_k \log(p_k)\right)
\end{aligned}
$$

  • $k = j$

    $$
    \begin{aligned}
    \frac{\partial}{\partial z_j} y_j \log(p_j) &= \frac{y_j}{p_j} \cdot \left(\frac{\partial}{\partial z_j} p_j\right) \\
    &= \frac{y_j}{p_j} \cdot \left(\frac{\partial}{\partial z_j} \frac{e^{z_j}}{\sum_{k=1}^n e^{z_k}}\right) \\
    &= \frac{y_j}{p_j} \cdot \frac{\left(\frac{\partial}{\partial z_j}e^{z_j}\right)\sum_{k=1}^n e^{z_k} - e^{z_j}\left(\frac{\partial}{\partial z_j}\sum_{k=1}^n e^{z_k}\right)}{\left(\sum_{k=1}^n e^{z_k}\right)^2} \\
    &= \frac{y_j}{p_j} \cdot \frac{e^{z_j}\sum_{k=1}^n e^{z_k} - e^{z_j}e^{z_j}}{\left(\sum_{k=1}^n e^{z_k}\right)^2} \\
    &= \frac{y_j}{p_j} \cdot \frac{e^{z_j}\left(\sum_{k=1}^n e^{z_k} - e^{z_j}\right)}{\left(\sum_{k=1}^n e^{z_k}\right)^2} \\
    &= \frac{y_j}{p_j} \cdot \underbrace{\frac{e^{z_j}}{\sum_{k=1}^n e^{z_k}}}_{=p_j} \cdot \frac{\sum_{k=1}^n e^{z_k} - e^{z_j}}{\sum_{k=1}^n e^{z_k}} \\
    &= \frac{y_j}{p_j} \cdot p_j \cdot \underbrace{\left(\frac{\sum_{k=1}^n e^{z_k}}{\sum_{k=1}^n e^{z_k}} - \frac{e^{z_j}}{\sum_{k=1}^n e^{z_k}}\right)}_{=1 - p_j} \\
    &= y_j(1-p_j)
    \end{aligned}
    $$
  • $\forall k \neq j$

    $$
    \begin{aligned}
    \frac{\partial}{\partial z_j}\sum_{k \neq j} y_k \log(p_k) &= \sum_{k \neq j} \frac{\partial}{\partial z_j} y_k \log(p_k) \\
    &= \sum_{k \neq j} \frac{y_k}{p_k} \cdot \frac{\left(\overbrace{\frac{\partial}{\partial z_j} e^{z_k}}^{=0}\right)\left(\sum_{i=1}^n e^{z_i}\right) - e^{z_k}\left(\overbrace{\frac{\partial}{\partial z_j}\sum_{i=1}^n e^{z_i}}^{=e^{z_j}}\right)}{\left(\sum_{i=1}^n e^{z_i}\right)^2} \\
    &= \sum_{k \neq j} \frac{y_k}{p_k} \cdot \frac{-e^{z_k} e^{z_j}}{\left(\sum_{i=1}^n e^{z_i}\right)^2} \\
    &= -\sum_{k \neq j} \frac{y_k}{p_k} \underbrace{\frac{e^{z_k}}{\sum_{i=1}^n e^{z_i}}}_{=p_k} \underbrace{\frac{e^{z_j}}{\sum_{i=1}^n e^{z_i}}}_{=p_j} \\
    &= -\sum_{k \neq j} y_k p_j
    \end{aligned}
    $$
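Both cases rest on the partial derivatives of softmax itself: $\partial p_j / \partial z_j = p_j(1 - p_j)$ and, for $k \neq j$, $\partial p_k / \partial z_j = -p_k p_j$. The sketch below compares these closed forms with a central finite-difference estimate as a sanity check. The helper names (`softmax_jacobian`, `numeric_jacobian`), the step size `eps`, and the test logits are all assumptions chosen for illustration.

```python
import math

def softmax(z):
    """Softmax of a list of logits (max subtracted for numerical stability)."""
    z_max = max(z)
    exps = [math.exp(v - z_max) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_jacobian(z):
    """Analytic Jacobian J[k][j] = d p_k / d z_j = p_k * (delta_kj - p_j)."""
    p = softmax(z)
    n = len(z)
    return [[p[k] * ((1.0 if k == j else 0.0) - p[j]) for j in range(n)]
            for k in range(n)]

def numeric_jacobian(z, eps=1e-6):
    """Central finite-difference estimate of the same Jacobian."""
    n = len(z)
    jac = [[0.0] * n for _ in range(n)]
    for j in range(n):
        z_plus = list(z)
        z_minus = list(z)
        z_plus[j] += eps
        z_minus[j] -= eps
        p_plus = softmax(z_plus)
        p_minus = softmax(z_minus)
        for k in range(n):
            jac[k][j] = (p_plus[k] - p_minus[k]) / (2 * eps)
    return jac

z = [0.5, -1.0, 2.0]  # arbitrary example logits
analytic = softmax_jacobian(z)
numeric = numeric_jacobian(z)
max_diff = max(abs(analytic[k][j] - numeric[k][j])
               for k in range(len(z)) for j in range(len(z)))
print(max_diff)  # should be tiny if the closed forms are right
```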

Therefore,

$$
\begin{aligned}
\frac{\partial}{\partial z_j}L &= -\left(\frac{\partial}{\partial z_j} y_j \log(p_j) + \frac{\partial}{\partial z_j}\sum_{k \neq j} y_k \log(p_k)\right) \\
&= -\left(y_j(1-p_j) - \sum_{k \neq j} y_k p_j\right) \\
&= -\left(y_j - y_j p_j - \sum_{k \neq j} y_k p_j\right) \\
&= -\left(y_j - \left(y_j p_j + \sum_{k \neq j} y_k p_j\right)\right) \\
&= -\left(y_j - \sum_{k=1}^n y_k p_j\right) \\
&\overset{\sum_{k=1}^{n} y_k = 1}{=} -\left(y_j - p_j\right) \\
&= p_j - y_j
\end{aligned}
$$
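The result $\partial L / \partial z_j = p_j - y_j$ is easy to check numerically. The following sketch compares it with a central finite-difference gradient of the cross-entropy loss. The function names, the step size `eps`, and the example logits and label are assumptions for illustration only.

```python
import math

def softmax(z):
    """Softmax of a list of logits (max subtracted for numerical stability)."""
    z_max = max(z)
    exps = [math.exp(v - z_max) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def loss(z, y):
    """Cross-entropy of softmax(z) against the one-hot label y."""
    p = softmax(z)
    return -sum(y_k * math.log(p_k) for y_k, p_k in zip(y, p) if y_k != 0)

def analytic_grad(z, y):
    """Gradient from the derivation: dL/dz_j = p_j - y_j."""
    p = softmax(z)
    return [p_j - y_j for p_j, y_j in zip(p, y)]

def numeric_grad(z, y, eps=1e-6):
    """Central finite-difference estimate of dL/dz_j for every j."""
    grad = []
    for j in range(len(z)):
        z_plus = list(z)
        z_minus = list(z)
        z_plus[j] += eps
        z_minus[j] -= eps
        grad.append((loss(z_plus, y) - loss(z_minus, y)) / (2 * eps))
    return grad

z = [0.5, -1.0, 2.0]   # arbitrary example logits
y = [0.0, 0.0, 1.0]    # one-hot label: the true class is index 2
g_analytic = analytic_grad(z, y)
g_numeric = numeric_grad(z, y)
max_diff = max(abs(a - b) for a, b in zip(g_analytic, g_numeric))
print(g_analytic)
print(max_diff)  # should be tiny if p - y is the correct gradient
```

The gradient is negative for the true class (its logit should increase) and positive for every other class, and its entries sum to zero because both $p$ and $y$ sum to 1.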

Useful resources