$W_{hh}$
Aggregating the gradients w.r.t. $W_{hh}$ over the whole time sequence with backpropagation, we finally obtain the following gradient w.r.t. $W_{hh}$:
$$
\frac{\partial \mathcal{L}}{\partial W_{hh}}=\sum_{t} \sum_{k=1}^{t+1} \frac{\partial \mathcal{L}(t+1)}{\partial z_{t+1}} \frac{\partial z_{t+1}}{\partial \mathbf{h}_{t+1}} \frac{\partial \mathbf{h}_{t+1}}{\partial \mathbf{h}_{k}} \frac{\partial \mathbf{h}_{k}}{\partial W_{hh}}
$$

$W_{xh}$

Similar to $W_{hh}$, we consider the time step $t+1$ (only the contribution from $\mathbf{x}_{t+1}$) and calculate the gradient w.r.t. $W_{xh}$:
$$
\frac{\partial \mathcal{L}(t+1)}{\partial W_{xh}}=\frac{\partial \mathcal{L}(t+1)}{\partial \mathbf{h}_{t+1}} \frac{\partial \mathbf{h}_{t+1}}{\partial W_{xh}}
$$

Because $\mathbf{h}_{t}$ and $\mathbf{x}_{t}$ both contribute to $\mathbf{h}_{t+1}$, we need to backpropagate to $\mathbf{h}_{t}$ as well. If we consider the contribution from the time step $t$, we further get
$$
\begin{aligned}
\frac{\partial \mathcal{L}(t+1)}{\partial W_{xh}} &=\frac{\partial \mathcal{L}(t+1)}{\partial \mathbf{h}_{t+1}} \frac{\partial \mathbf{h}_{t+1}}{\partial W_{xh}}+\frac{\partial \mathcal{L}(t+1)}{\partial \mathbf{h}_{t}} \frac{\partial \mathbf{h}_{t}}{\partial W_{xh}} \\
&= \frac{\partial \mathcal{L}(t+1)}{\partial \mathbf{h}_{t+1}} \frac{\partial \mathbf{h}_{t+1}}{\partial W_{xh}}+\frac{\partial \mathcal{L}(t+1)}{\partial \mathbf{h}_{t+1}} \frac{\partial \mathbf{h}_{t+1}}{\partial \mathbf{h}_{t}} \frac{\partial \mathbf{h}_{t}}{\partial W_{xh}}
\end{aligned}
$$

Thus, summing up all contributions from the time step $t+1$ back to the beginning of the sequence via backpropagation, we obtain the gradient at the time step $t+1$:
$$
\frac{\partial \mathcal{L}(t+1)}{\partial W_{xh}}=\sum_{k=1}^{t+1} \frac{\partial \mathcal{L}(t+1)}{\partial \mathbf{h}_{t+1}} \frac{\partial \mathbf{h}_{t+1}}{\partial \mathbf{h}_{k}} \frac{\partial \mathbf{h}_{k}}{\partial W_{xh}}
$$

Example: $t=2$
Computational graph for W_xh
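To make the summation over $k$ concrete, here is a minimal NumPy sketch. It is an assumption-laden illustration, not code from the original post: it assumes the usual vanilla-RNN recurrence $\mathbf{h}_k=\tanh(W_{xh}\mathbf{x}_k+W_{hh}\mathbf{h}_{k-1})$, and all names and sizes are made up. It accumulates $\partial \mathcal{L}(t+1)/\partial W_{xh}$ (and, along the way, $\partial \mathcal{L}(t+1)/\partial W_{hh}$) by walking back from $k=t+1$ to $k=1$, which is exactly the sum above.

```python
import numpy as np

rng = np.random.default_rng(0)
H, D, T = 4, 3, 3                         # hidden size, input size, t + 1 = 3
W_xh = rng.normal(size=(H, D)) * 0.1
W_hh = rng.normal(size=(H, H)) * 0.1
xs = [rng.normal(size=D) for _ in range(T)]

# Forward pass: h_k = tanh(W_xh x_k + W_hh h_{k-1}), with h_0 = 0.
hs = [np.zeros(H)]
for x in xs:
    hs.append(np.tanh(W_xh @ x + W_hh @ hs[-1]))

dL_dh = rng.normal(size=H)                # stand-in for dL(t+1)/dh_{t+1} from the loss
dW_xh, dW_hh = np.zeros_like(W_xh), np.zeros_like(W_hh)

# Backward pass: delta carries dL(t+1)/dh_k while k runs from t+1 down to 1.
delta = dL_dh
for k in range(T, 0, -1):
    dpre = (1.0 - hs[k] ** 2) * delta     # through tanh at step k
    dW_xh += np.outer(dpre, xs[k - 1])    # the dh_k/dW_xh term of the sum
    dW_hh += np.outer(dpre, hs[k - 1])    # the dh_k/dW_hh term of the sum
    delta = W_hh.T @ dpre                 # dh_k/dh_{k-1}: move one step back
```

Summing these per-step quantities over every $t$ then gives the whole-sequence gradients below.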
Further, we can take the derivative w.r.t. $W_{xh}$ over the whole sequence as
$$
\frac{\partial \mathcal{L}}{\partial W_{xh}}=\sum_{t} \sum_{k=1}^{t+1} \frac{\partial \mathcal{L}(t+1)}{\partial z_{t+1}} \frac{\partial z_{t+1}}{\partial \mathbf{h}_{t+1}} \frac{\partial \mathbf{h}_{t+1}}{\partial \mathbf{h}_{k}} \frac{\partial \mathbf{h}_{k}}{\partial W_{xh}}
$$

Gradient vanishing or exploding problems

Notice that $\frac{\partial \mathbf{h}_{t+1}}{\partial \mathbf{h}_{k}}$ in the equation above involves repeated matrix multiplication over the sequence, and RNNs need to backpropagate gradients over a long sequence.
With small values in the matrix multiplication

The gradient will shrink layer by layer and eventually vanish after a few time steps. Thus, states that are far away from the current time step do not contribute to the parameters' gradients (i.e., to what the RNN is learning)!

$\to$ Gradient vanishing

With large values in the matrix multiplication

The gradient will grow layer by layer and eventually become extremely large!

$\to$ Gradient exploding
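A quick numerical illustration of both cases (purely illustrative; the random matrix, its scaling, and the 50-step horizon are assumptions, not from the original text):

```python
import numpy as np

rng = np.random.default_rng(0)
J = rng.normal(size=(8, 8)) / np.sqrt(8)    # random matrix of roughly unit scale
g = rng.normal(size=8)                      # some upstream gradient

for scale, label in [(0.5, "small values"), (1.5, "large values")]:
    v = g.copy()
    for _ in range(50):                     # 50 time steps of backpropagation
        v = (scale * J).T @ v               # repeated Jacobian-like multiplication
    print(f"{label}: ||gradient|| after 50 steps = {np.linalg.norm(v):.2e}")
# The first norm collapses toward zero (vanishing), the second blows up (exploding).
```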
BPTT in LSTM

Unit structure of LSTM
How LSTM works

Given a sequence of data $\left\{\mathbf{x}_{1}, \dots, \mathbf{x}_{T}\right\}$, we have

$$
\begin{aligned}
\mathbf{i}_{t} &=\sigma\left(W_{xi} \mathbf{x}_{t}+W_{hi} \mathbf{h}_{t-1}\right) \\
\mathbf{f}_{t} &=\sigma\left(W_{xf} \mathbf{x}_{t}+W_{hf} \mathbf{h}_{t-1}\right) \\
\mathbf{o}_{t} &=\sigma\left(W_{xo} \mathbf{x}_{t}+W_{ho} \mathbf{h}_{t-1}\right) \\
\mathbf{g}_{t} &=\tanh\left(W_{xc} \mathbf{x}_{t}+W_{hc} \mathbf{h}_{t-1}\right) \\
\mathbf{c}_{t} &=\mathbf{f}_{t} \circ \mathbf{c}_{t-1}+\mathbf{i}_{t} \circ \mathbf{g}_{t} \\
\mathbf{h}_{t} &=\mathbf{o}_{t} \circ \tanh\left(\mathbf{c}_{t}\right) \\
z_{t} &=\operatorname{softmax}\left(W_{hz} \mathbf{h}_{t}+\mathbf{b}_{z}\right)
\end{aligned}
$$
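As a companion to these update equations, here is a minimal NumPy forward step. It is a sketch under the assumptions used throughout this post (sigmoid gates, $\tanh$ candidate, gate biases ignored, a bias only on the softmax output); the function and dictionary names are illustrative, not from the original.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def lstm_forward_step(x_t, h_prev, c_prev, W):
    """One LSTM step following the update equations above (W is a dict of weights)."""
    i_t = sigmoid(W["W_xi"] @ x_t + W["W_hi"] @ h_prev)   # input gate
    f_t = sigmoid(W["W_xf"] @ x_t + W["W_hf"] @ h_prev)   # forget gate
    o_t = sigmoid(W["W_xo"] @ x_t + W["W_ho"] @ h_prev)   # output gate
    g_t = np.tanh(W["W_xc"] @ x_t + W["W_hc"] @ h_prev)   # candidate cell state
    c_t = f_t * c_prev + i_t * g_t                        # new cell state
    h_t = o_t * np.tanh(c_t)                              # new hidden state
    z_t = softmax(W["W_hz"] @ h_t + W["b_z"])             # class probabilities
    cache = (x_t, h_prev, c_prev, i_t, f_t, o_t, g_t, c_t)
    return h_t, c_t, z_t, cache
```

The returned `cache` holds exactly the quantities that the backward pass in Derivatives needs.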
Derivatives

At the time step $t$:
$\mathbf{h}_{t}=\mathbf{o}_{t} \circ \tanh\left(\mathbf{c}_{t}\right) \Rightarrow$

$$
\begin{aligned}
d \mathbf{o}_{t} &=\tanh\left(\mathbf{c}_{t}\right) d \mathbf{h}_{t} \\
d \mathbf{c}_{t} &=\left(1-\tanh\left(\mathbf{c}_{t}\right)^{2}\right) \mathbf{o}_{t}\, d \mathbf{h}_{t}
\end{aligned}
$$

(see: Derivative of tanh)

$\mathbf{c}_{t}=\mathbf{f}_{t} \circ \mathbf{c}_{t-1}+\mathbf{i}_{t} \circ \mathbf{g}_{t} \Rightarrow$

$$
\begin{aligned}
d \mathbf{i}_{t} &=\mathbf{g}_{t}\, d \mathbf{c}_{t} \\
d \mathbf{g}_{t} &=\mathbf{i}_{t}\, d \mathbf{c}_{t} \\
d \mathbf{f}_{t} &=\mathbf{c}_{t-1}\, d \mathbf{c}_{t} \\
d \mathbf{c}_{t-1} &\mathrel{+}= \mathbf{f}_{t} \circ d \mathbf{c}_{t}
\end{aligned}
$$

(for the derivation, see Error backpropagation)

What's more, we backpropagate through the activation functions over the whole sequence (as the weights $W_{xo}, W_{xi}, W_{xf}, W_{xc}$ are shared across the whole sequence, we need to take the same summation over $t$ as in RNNs):
$$
\begin{aligned}
d W_{xo} &=\sum_{t} \mathbf{o}_{t}\left(1-\mathbf{o}_{t}\right) \mathbf{x}_{t}\, d \mathbf{o}_{t} \\
d W_{xi} &=\sum_{t} \mathbf{i}_{t}\left(1-\mathbf{i}_{t}\right) \mathbf{x}_{t}\, d \mathbf{i}_{t} \\
d W_{xf} &=\sum_{t} \mathbf{f}_{t}\left(1-\mathbf{f}_{t}\right) \mathbf{x}_{t}\, d \mathbf{f}_{t} \\
d W_{xc} &=\sum_{t}\left(1-\mathbf{g}_{t}^{2}\right) \mathbf{x}_{t}\, d \mathbf{g}_{t}
\end{aligned}
$$

Similarly, we have
$$
\begin{aligned}
d W_{ho} &=\sum_{t} \mathbf{o}_{t}\left(1-\mathbf{o}_{t}\right) \mathbf{h}_{t-1}\, d \mathbf{o}_{t} \\
d W_{hi} &=\sum_{t} \mathbf{i}_{t}\left(1-\mathbf{i}_{t}\right) \mathbf{h}_{t-1}\, d \mathbf{i}_{t} \\
d W_{hf} &=\sum_{t} \mathbf{f}_{t}\left(1-\mathbf{f}_{t}\right) \mathbf{h}_{t-1}\, d \mathbf{f}_{t} \\
d W_{hc} &=\sum_{t}\left(1-\mathbf{g}_{t}^{2}\right) \mathbf{h}_{t-1}\, d \mathbf{g}_{t}
\end{aligned}
$$

Since $\mathbf{h}_{t-1}$, the hidden state at time step $t-1$, is used in the forget gate, the input gate, the candidate for the new cell state, and the output gate, we have
$$
\begin{aligned}
d \mathbf{h}_{t-1} = &\ \mathbf{o}_{t}\left(1-\mathbf{o}_{t}\right) W_{ho}\, d \mathbf{o}_{t}+\mathbf{i}_{t}\left(1-\mathbf{i}_{t}\right) W_{hi}\, d \mathbf{i}_{t} \\
&+\mathbf{f}_{t}\left(1-\mathbf{f}_{t}\right) W_{hf}\, d \mathbf{f}_{t}+\left(1-\mathbf{g}_{t}^{2}\right) W_{hc}\, d \mathbf{g}_{t}
\end{aligned}
$$

In addition, $d \mathbf{h}_{t-1}$ also receives a contribution from the objective function at time step $t-1$:
$$
d \mathbf{h}_{t-1}=d \mathbf{h}_{t-1}+W_{hz}\, d z_{t-1}
$$
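The per-step derivatives above translate almost line by line into code. The sketch below shares the hypothetical names and assumptions of the forward-step sketch earlier (sigmoid gates, $\tanh$ candidate), adds the explicit outer products and transposes that the element-wise notation leaves implicit, and assumes the loss term $W_{hz}\, d z_t$ has already been added into `dh_t` before the call. Its `grads` are the per-step terms of the sums over $t$.

```python
import numpy as np

def lstm_step_backward(dh_t, dc_t, cache, W):
    """One step of the LSTM backward pass, following the derivative equations above.

    dh_t  : dL/dh_t (loss term at step t plus gradient flowing back from step t+1)
    dc_t  : dL/dc_t accumulated from step t+1 (the f_{t+1} ∘ dc_{t+1} term)
    cache : values saved by the forward step
    W     : dict of weight matrices (W_xo, W_xi, ..., W_hc)
    """
    x_t, h_prev, c_prev, i_t, f_t, o_t, g_t, c_t = cache
    tanh_c = np.tanh(c_t)

    do_t = tanh_c * dh_t                          # d o_t = tanh(c_t) d h_t
    dc_t = dc_t + (1.0 - tanh_c ** 2) * o_t * dh_t

    di_t = g_t * dc_t                             # d i_t = g_t d c_t
    dg_t = i_t * dc_t                             # d g_t = i_t d c_t
    df_t = c_prev * dc_t                          # d f_t = c_{t-1} d c_t
    dc_prev = f_t * dc_t                          # d c_{t-1} += f_t ∘ d c_t

    # Through the gate nonlinearities (sigmoid and tanh derivatives).
    da_o = o_t * (1.0 - o_t) * do_t
    da_i = i_t * (1.0 - i_t) * di_t
    da_f = f_t * (1.0 - f_t) * df_t
    da_g = (1.0 - g_t ** 2) * dg_t

    grads = {                                     # per-step terms of the sums over t
        "W_xo": np.outer(da_o, x_t), "W_ho": np.outer(da_o, h_prev),
        "W_xi": np.outer(da_i, x_t), "W_hi": np.outer(da_i, h_prev),
        "W_xf": np.outer(da_f, x_t), "W_hf": np.outer(da_f, h_prev),
        "W_xc": np.outer(da_g, x_t), "W_hc": np.outer(da_g, h_prev),
    }
    # d h_{t-1}: contributions through all four gates.
    dh_prev = (W["W_ho"].T @ da_o + W["W_hi"].T @ da_i +
               W["W_hf"].T @ da_f + W["W_hc"].T @ da_g)
    return dh_prev, dc_prev, grads
```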
Error backpropagation

Suppose we have the least-squares objective function
$$
\mathcal{L}(\mathbf{x}, \theta)=\min \sum_{t} \frac{1}{2}\left(y_{t}-z_{t}\right)^{2}
$$

where $\boldsymbol{\theta}=\left\{W_{hz}, W_{xo}, W_{xi}, W_{xf}, W_{xc}, W_{ho}, W_{hi}, W_{hf}, W_{hc}\right\}$, with biases ignored. For the sake of brevity, we use the following notation:
$$
\mathcal{L}(t)=\frac{1}{2}\left(y_{t}-z_{t}\right)^{2}
$$

At the time step $T$, we take the derivative w.r.t. $\mathbf{c}_{T}$:
$$
\frac{\partial \mathcal{L}(T)}{\partial \mathbf{c}_{T}}=\frac{\partial \mathcal{L}(T)}{\partial \mathbf{h}_{T}} \frac{\partial \mathbf{h}_{T}}{\partial \mathbf{c}_{T}}
$$

At the time step $T-1$, we take the derivative of $\mathcal{L}(T-1)$ w.r.t. $\mathbf{c}_{T-1}$ as
$$
\frac{\partial \mathcal{L}(T-1)}{\partial \mathbf{c}_{T-1}}=\frac{\partial \mathcal{L}(T-1)}{\partial \mathbf{h}_{T-1}} \frac{\partial \mathbf{h}_{T-1}}{\partial \mathbf{c}_{T-1}}
$$

However, according to the following unfolded unit structure,
Unfolded LSTM unit, which makes the error backpropagation easier to follow
the error is backpropagated not only via $\mathcal{L}(T-1)$, but also from $\mathbf{c}_{T}$. Therefore, the gradient w.r.t. $\mathbf{c}_{T-1}$ should be
$$
\begin{aligned}
\frac{\partial \mathcal{L}(T-1)}{\partial \mathbf{c}_{T-1}} &= \frac{\partial \mathcal{L}(T-1)}{\partial \mathbf{c}_{T-1}}+\frac{\partial \mathcal{L}(T)}{\partial \mathbf{c}_{T-1}} \\
&=\frac{\partial \mathcal{L}(T-1)}{\partial \mathbf{h}_{T-1}} \frac{\partial \mathbf{h}_{T-1}}{\partial \mathbf{c}_{T-1}} + \underbrace{\frac{\partial \mathcal{L}(T)}{\partial \mathbf{h}_{T}} \frac{\partial \mathbf{h}_{T}}{\partial \mathbf{c}_{T}}}_{=d\mathbf{c}_{T}}\; \underbrace{\frac{\partial \mathbf{c}_{T}}{\partial \mathbf{c}_{T-1}}}_{=\mathbf{f}_{T}}
\end{aligned}
$$
$$
\Rightarrow \quad d \mathbf{c}_{T-1}=d \mathbf{c}_{T-1}+\mathbf{f}_{T} \circ d \mathbf{c}_{T}
$$

Parameters learning

Forward
Use the equations in How LSTM works to update the states, as in a feedforward neural network, from the time step $1$ to $T$.
Compute loss
Backward
Backpropagate the error from $T$ to $1$ using the equations in Derivatives. Then use the gradient $d\boldsymbol{\theta}$ to update the parameters $\boldsymbol{\theta}=\left\{W_{hz}, W_{xo}, W_{xi}, W_{xf}, W_{xc}, W_{ho}, W_{hi}, W_{hf}, W_{hc}\right\}$. For example, if we use SGD, we have:
$$
\boldsymbol{\theta}=\boldsymbol{\theta}-\eta\, d \boldsymbol{\theta}
$$

where $\eta$ is the learning rate.
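In code, the update itself is a one-line loop over the shared weight matrices. A minimal sketch (the helper name is hypothetical; `theta` and `grads` are assumed to be dicts keyed by the weight names listed above, with `grads` already summed over $t$ by the backward pass):

```python
def sgd_step(theta, grads, eta=0.1):
    """In-place SGD update: θ ← θ − η dθ for every shared weight matrix."""
    for name in theta:
        theta[name] -= eta * grads[name]
    return theta
```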
Derivative of softmax

$z=\operatorname{softmax}\left(W_{hz} \mathbf{h}+\mathbf{b}_{z}\right)$: predicts the probability assigned to each of the $K$ classes.
Furthermore, we can use 1-of-$K$ (one-hot) encoding to represent the ground truth $y$, while $z=\left[p\left(\hat{y}_{1}\right), \ldots, p\left(\hat{y}_{K}\right)\right]$ is a probability vector. Then we can consider the gradient in each dimension and generalize it to the vector case of the objective function (cross-entropy loss):
$$
\mathcal{L}\left(W_{hz}, \mathbf{b}_{z}\right)=-y \log z
$$

Let
$$
\alpha_{j}(\Theta)=W_{hz}(:, j)\, \mathbf{h}_{t}
$$

Then
$$
p\left(\hat{y}_{j} \mid \mathbf{h}_{t} ; \Theta\right)=\frac{\exp \left(\alpha_{j}(\Theta)\right)}{\sum_{k} \exp \left(\alpha_{k}(\Theta)\right)}
$$

$\forall k = j$:
$$
\begin{aligned}
&\frac{\partial}{\partial \alpha_{j}}\, y_{j} \log p\left(\hat{y}_{j} \mid \mathbf{h}_{t} ; \Theta\right) \\
=&\ y_{j} \left(\frac{\partial}{\partial p\left(\hat{y}_{j} \mid \mathbf{h}_{t} ; \Theta\right)} \log p\left(\hat{y}_{j} \mid \mathbf{h}_{t} ; \Theta\right)\right) \left(\frac{\partial}{\partial \alpha_{j}} p\left(\hat{y}_{j} \mid \mathbf{h}_{t} ; \Theta\right)\right) \\
=&\ y_{j} \cdot \frac{1}{p\left(\hat{y}_{j} \mid \mathbf{h}_{t} ; \Theta\right)} \cdot \left(\frac{\partial}{\partial \alpha_{j}} \frac{\exp \left(\alpha_{j}(\Theta)\right)}{\sum_{k} \exp \left(\alpha_{k}(\Theta)\right)}\right) \\
=&\ \frac{y_{j}}{p\left(\hat{y}_{j}\right)} \frac{\exp \left(\alpha_{j}(\Theta)\right) \sum_{k} \exp \left(\alpha_{k}(\Theta)\right)-\exp \left(\alpha_{j}(\Theta)\right) \exp \left(\alpha_{j}(\Theta)\right)}{\left[\sum_{k} \exp \left(\alpha_{k}(\Theta)\right)\right]^{2}} \\
=&\ \frac{y_{j}}{p\left(\hat{y}_{j}\right)} \underbrace{\frac{\exp \left(\alpha_{j}(\Theta)\right)}{\sum_{k} \exp \left(\alpha_{k}(\Theta)\right)}}_{=p\left(\hat{y}_{j}\right)} \frac{\sum_{k} \exp \left(\alpha_{k}(\Theta)\right) - \exp \left(\alpha_{j}(\Theta)\right)}{\sum_{k} \exp \left(\alpha_{k}(\Theta)\right)} \\
=&\ y_{j} \left(\frac{\sum_{k} \exp \left(\alpha_{k}(\Theta)\right)}{\sum_{k} \exp \left(\alpha_{k}(\Theta)\right)} - \underbrace{\frac{\exp \left(\alpha_{j}(\Theta)\right)}{\sum_{k} \exp \left(\alpha_{k}(\Theta)\right)}}_{=p\left(\hat{y}_{j}\right)}\right) \\
=&\ y_{j}\left(1-p\left(\hat{y}_{j}\right)\right)
\end{aligned}
$$

$\forall k \neq j$:
$$
\begin{aligned}
&\frac{\partial}{\partial \alpha_{j}}\, y_{k} \log p\left(\hat{y}_{k} \mid \mathbf{h}_{t} ; \Theta\right) \\
=&\ \frac{y_{k}}{p\left(\hat{y}_{k}\right)} \frac{-\exp \left(\alpha_{k}(\Theta)\right) \exp \left(\alpha_{j}(\Theta)\right)}{\left[\sum_{s} \exp \left(\alpha_{s}(\Theta)\right)\right]^{2}} \\
=&\ -y_{k}\, p\left(\hat{y}_{j}\right)
\end{aligned}
$$

We can then obtain the following gradient w.r.t. $\alpha_{j}(\Theta)$:
$$
\begin{aligned}
\frac{\partial \log p(\hat{\mathbf{y}})}{\partial \alpha_{j}} &=\sum_{j} \frac{\partial\, y_{j} \log p\left(\hat{y}_{j} \mid \mathbf{h}_{t} ; \Theta\right)}{\partial \alpha_{j}} \\
&=\frac{\partial\, y_{j} \log p\left(\hat{y}_{j} \mid \mathbf{h}_{t} ; \Theta\right)}{\partial \alpha_{j}}+\sum_{k \neq j} \frac{\partial\, y_{k} \log p\left(\hat{y}_{k} \mid \mathbf{h}_{t} ; \Theta\right)}{\partial \alpha_{j}} \\
&=\left(y_{j}-y_{j}\, p\left(\hat{y}_{j}\right)\right)-\left(\sum_{k \neq j} y_{k}\, p\left(\hat{y}_{j}\right)\right) \\
&=y_{j}-p\left(\hat{y}_{j}\right)\left(y_{j}+\sum_{k \neq j} y_{k}\right) \\
&=y_{j}-p\left(\hat{y}_{j}\right)
\end{aligned}
$$

where the last step uses $\sum_{k} y_{k}=1$ for a one-hot $y$.
$$
\begin{aligned}
\Rightarrow \frac{\partial \mathcal{L}}{\partial \alpha_{j}} &= \frac{\partial}{\partial \alpha_{j}} \left(-\sum_{j} y_{j} \log p\left(\hat{y}_{j} \mid \mathbf{h}_{t} ; \Theta\right)\right) \\
&= -\left(y_{j}-p\left(\hat{y}_{j}\right)\right) \\
&= p\left(\hat{y}_{j}\right) - y_{j}
\end{aligned}
$$
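A small numerical check of this result, $\partial \mathcal{L}/\partial \alpha_{j} = p(\hat{y}_{j}) - y_{j}$ (illustrative only; it compares a finite-difference gradient of the cross-entropy loss against the closed form):

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

rng = np.random.default_rng(0)
alpha = rng.normal(size=5)                 # pre-softmax scores α_j
y = np.eye(5)[2]                           # one-hot ground truth
loss = lambda a: -np.sum(y * np.log(softmax(a)))

analytic = softmax(alpha) - y              # p(ŷ_j) − y_j
eps = 1e-6
numeric = np.array([
    (loss(alpha + eps * np.eye(5)[j]) - loss(alpha - eps * np.eye(5)[j])) / (2 * eps)
    for j in range(5)
])
print(np.max(np.abs(analytic - numeric)))  # tiny: the two gradients agree
```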
See also:
Derivative of tanh

$$
\begin{aligned}
\frac{\partial \tanh (x)}{\partial x}=& \frac{\partial \frac{\sinh (x)}{\cosh (x)}}{\partial x} \\
=& \frac{\frac{\partial \sinh (x)}{\partial x} \cosh (x)-\sinh (x) \frac{\partial \cosh (x)}{\partial x}}{(\cosh (x))^{2}} \\
=& \frac{[\cosh (x)]^{2}-[\sinh (x)]^{2}}{(\cosh (x))^{2}} \\
=& 1-[\tanh (x)]^{2}
\end{aligned}
$$

Derivative of Sigmoid

Sigmoid:
$$
\sigma(x)=\frac{1}{1+e^{-x}}
$$

Derivative:
$$
\begin{aligned}
\frac{d}{d x} \sigma(x) &=\frac{d}{d x}\left[\frac{1}{1+e^{-x}}\right] \\
&=\frac{d}{d x}\left(1+e^{-x}\right)^{-1} \\
&=-\left(1+e^{-x}\right)^{-2}\left(-e^{-x}\right) \\
&=\frac{e^{-x}}{\left(1+e^{-x}\right)^{2}} \\
&=\frac{1}{1+e^{-x}} \cdot \frac{e^{-x}}{1+e^{-x}} \\
&=\frac{1}{1+e^{-x}} \cdot \frac{\left(1+e^{-x}\right)-1}{1+e^{-x}} \\
&=\frac{1}{1+e^{-x}} \cdot\left(\frac{1+e^{-x}}{1+e^{-x}}-\frac{1}{1+e^{-x}}\right) \\
&=\frac{1}{1+e^{-x}} \cdot\left(1-\frac{1}{1+e^{-x}}\right) \\
&=\sigma(x) \cdot(1-\sigma(x))
\end{aligned}
$$
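Both derivative identities are easy to sanity-check numerically (an illustrative snippet, not part of the original derivation):

```python
import numpy as np

x = np.linspace(-3.0, 3.0, 101)
eps = 1e-6

# d tanh(x)/dx = 1 - tanh(x)^2
num_tanh = (np.tanh(x + eps) - np.tanh(x - eps)) / (2 * eps)
print(np.max(np.abs(num_tanh - (1.0 - np.tanh(x) ** 2))))    # tiny: identity holds

# dσ(x)/dx = σ(x)(1 - σ(x))
sigma = lambda a: 1.0 / (1.0 + np.exp(-a))
num_sig = (sigma(x + eps) - sigma(x - eps)) / (2 * eps)
print(np.max(np.abs(num_sig - sigma(x) * (1.0 - sigma(x))))) # tiny: identity holds
```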
Reference