👍 Activation Functions
Activation functions should be
- non-linear
- differentiable (since we train with backpropagation)
Q: Why can’t the mapping between layers be linear?
A: A composition of linear functions is still linear, so the whole network collapses to a single linear model (linear regression).
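For example, stacking two linear layers (weights $W\_1, W\_2$, biases $b\_1, b\_2$) gives just another linear map:
$$ W\_2\left(W\_1 x + b\_1\right) + b\_2 = \left(W\_2 W\_1\right) x + \left(W\_2 b\_1 + b\_2\right) $$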
Sigmoid
$$ \sigma(x)=\frac{1}{1+\exp (-x)} $$
Squashes numbers to range $[0,1]$
✅ Historically popular since it has a nice interpretation as the saturating “firing rate” of a neuron
⛔️ Problems
- Vanishing gradients: the gradient is almost zero at either tail, where the output saturates at $0$ or $1$
- Sigmoid outputs are not zero-centered (important for initialization)
- $\exp()$ is somewhat computationally expensive
Derivative
$$ \frac{d}{dx} \sigma(x) = \sigma(x)(1 - \sigma(x)) $$
Python implementation
```python
import numpy as np
def sigmoid(x): return 1 / (1 + np.exp(-x))
```
Derivative
```python
def dsigmoid(y): return y * (1 - y)  # y = sigmoid(x), reuses the forward output
```
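A quick numerical check (a minimal sketch reusing the two helpers above) showing how the gradient vanishes at the tails:

```python
import numpy as np

def sigmoid(x): return 1 / (1 + np.exp(-x))
def dsigmoid(y): return y * (1 - y)

x = np.array([-10.0, 0.0, 10.0])
y = sigmoid(x)
print(dsigmoid(y))  # ~[4.5e-05, 0.25, 4.5e-05]: near-zero gradient at saturated inputs
```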
Tanh
Squashes numbers to range $[-1,1]$
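For reference, tanh has a closed form and is just a rescaled, shifted sigmoid:
$$ \tanh(x) = \frac{\exp(x) - \exp(-x)}{\exp(x) + \exp(-x)} = 2\sigma(2x) - 1 $$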
✅ zero centered (nice) 👏
⛔️ Vanishing gradients: still kills gradients when saturated
Derivative:
$$ \frac{d}{dx}\tanh(x) = 1 - [\tanh(x)]^2 $$
```python
def dtanh(y): return 1 - y * y  # y = tanh(x), reuses the forward output
```
Rectified Linear Unit (ReLU)
$$ f(x) = \max(0, x) $$
✅ Advantages
- Does not saturate (in the positive region $[0, \infty)$)
- Very computationally efficient
- Converges much faster than sigmoid/tanh in practice
⛔️ Problems
- Not zero-centered output
- No gradient for $x < 0$ (dying ReLU)
Python implementation
```python
import numpy as np
def ReLU(x): return np.maximum(0, x)
```
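A matching (sub)gradient helper in the same style as dsigmoid/dtanh above (a sketch; the gradient at exactly $x = 0$ is taken as $0$ by convention):

```python
import numpy as np
def dReLU(x): return (np.asarray(x) > 0).astype(float)  # 1 where x > 0, else 0
```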
Leaky ReLU
$$ f(x) = \max(0.1x, x) $$
Parametric Rectifier (PReLU)
$$ f(x) = \max(\alpha x, x) $$
- $\alpha$ is learned along with the other network parameters
✅ Advantages
- Does not saturate
- Computationally efficient
- Converges much faster than sigmoid/tanh in practice!
- Will not “die”
Python implementation
```python
import numpy as np
def leaky_ReLU(x): return np.maximum(0.1 * x, x)
```
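A minimal NumPy sketch of PReLU and the gradient of its output with respect to $\alpha$ (the function names are illustrative, and $\alpha$ is treated as a single scalar shared across units):

```python
import numpy as np

def prelu(x, alpha):
    # Leaky ReLU with a learnable negative slope alpha
    return np.where(x > 0, x, alpha * x)

def dprelu_dalpha(x):
    # Gradient of the output w.r.t. alpha: x on the negative side, 0 elsewhere
    return np.where(x > 0, 0.0, x)
```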
Exponential Linear Units (ELU)
$$ f(x) = \begin{cases} x &\text{if }x > 0 \\\\ \alpha(\exp (x)-1) & \text {if }x \leq 0\end{cases} $$
- ✅ Advantages
- All benefits of ReLU
- Closer to zero mean outputs
- Negative saturation regime (unlike Leaky ReLU) adds some robustness to noise
- ⛔️ Problems
- Computation requires $\exp()$
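A NumPy sketch of ELU and its derivative (assuming the common default $\alpha = 1$):

```python
import numpy as np

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1))

def delu(x, alpha=1.0):
    # d/dx of alpha * (exp(x) - 1) is alpha * exp(x); 1 on the positive side
    return np.where(x > 0, 1.0, alpha * np.exp(x))
```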
Maxout
$$ f(x) = \max \left(w\_{1}^{T} x+b\_{1}, w\_{2}^{T} x+b\_{2}\right) $$
- Generalizes ReLU and Leaky ReLU
- ReLU is Maxout with $w\_1 =0$ and $b\_1 = 0$
- ✅ Fixes the dying ReLU problem
- ⛔️ Doubles the number of parameters
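A single two-piece maxout unit in NumPy (a sketch; w1, b1, w2, b2 are the two learned affine maps from the formula above):

```python
import numpy as np

def maxout(x, w1, b1, w2, b2):
    # Max of two affine functions of the same input
    return np.maximum(w1 @ x + b1, w2 @ x + b2)

x  = np.array([1.0, -2.0])
w1 = np.zeros(2); b1 = 0.0        # with w1 = 0, b1 = 0 this reduces to ReLU(w2 @ x + b2)
w2 = np.array([0.5, 0.3]); b2 = 0.0
print(maxout(x, w1, b1, w2, b2))  # max(0, -0.1) = 0.0
```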
Softmax
- Softmax: probability that input $x$ belongs to class $c\_k$, computed from the class scores $o\_k = \theta\_k^{T} x$, $k = 1, \dots, K$: $$ p\left(\hat{y}\_{k}\right)=\frac{\exp \left(o\_{k}\right)}{\sum\_{i=1}^{K} \exp \left(o\_{i}\right)} $$
- Derivative of the log-likelihood with respect to the scores (one-hot target $\mathbf{y}$): $$ \frac{\partial \log p(\hat{\mathbf{y}})}{\partial o\_{j}} =y\_{j}-p\left(\hat{y}\_{j}\right) $$
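A numerically stable NumPy softmax (a sketch; subtracting the maximum score leaves the result unchanged but avoids overflow in $\exp()$):

```python
import numpy as np

def softmax(o):
    # o: vector of class scores o_k = theta_k^T x
    z = o - np.max(o)       # shift for numerical stability
    e = np.exp(z)
    return e / np.sum(e)

scores = np.array([2.0, 1.0, 0.1])
print(softmax(scores))      # ~[0.659, 0.242, 0.099], sums to 1
```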
Advice in Practice
- Use ReLU
- Be careful with your learning rates / initialization
- Try out Leaky ReLU / ELU / Maxout
- Try out tanh but don’t expect much
- Don’t use sigmoid