Generalization: Dropout

Dropout

Model Overfitting


To give neural nets more “capacity” to capture different features, we equip them with many neurons. But this can cause overfitting.


Reason: Co-adaptation

  • Neurons become dependent on others
  • Imagine: neuron $H_i$ captures a particular feature $X$ that, however, almost always appears together with a certain combination of other inputs.
    • If $H_i$ receives bad inputs (only part of that combination), there is a chance that the feature is ignored 🤪

Solution: Dropout! 💪

Dropout

With dropout, the layer inputs become sparser, forcing the network weights to become more robust.


Dropping a neuron = all inputs to and outputs from this neuron are disabled for the current iteration.

Training

  • Given

    • input $X \in \mathbb{R}^D$
    • weights $W$
    • survival rate $p$
      • Usually $p = 0.5$
  • Sample mask $M \in \{0, 1\}^D$ with $M_i \sim \operatorname{Bernoulli}(p)$

  • Dropped input:

    $\hat{X} = X \circ M$
  • Perform the backward pass and mask the gradients (a code sketch follows this list):

    $\frac{\partial L}{\partial X} = \frac{\partial L}{\partial \hat{X}} \circ M$
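
A minimal NumPy sketch of these two steps (the names `dropout_forward` / `dropout_backward` are illustrative, not from any particular library):

```python
import numpy as np

def dropout_forward(X, p=0.5):
    """Sample a mask M_i ~ Bernoulli(p) and drop inputs: X_hat = X ∘ M."""
    M = (np.random.rand(*X.shape) < p).astype(X.dtype)
    X_hat = X * M
    return X_hat, M  # cache the mask for the backward pass

def dropout_backward(dL_dXhat, M):
    """Mask the gradients: dL/dX = dL/dX_hat ∘ M."""
    return dL_dXhat * M
```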

Evaluation/Testing/Inference

  • ALL input neurons $X$ are presented WITHOUT masking

  • Because each neuron appears with probability $p$ in training

    $\to$ So we have to scale $X$ by $p$ at test time (or, equivalently, scale $\hat{X}$ by $\frac{1}{p}$ during training) to match its expectation
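
A sketch of the “inverted dropout” variant of this scaling, assuming (as above) that $p$ is the survival rate; rescaling by $1/p$ at training time means the test-time forward pass needs no extra factor:

```python
import numpy as np

def inverted_dropout_train(X, p=0.5):
    """Training: drop inputs and rescale survivors by 1/p,
    so E[output] = X and no scaling is needed at test time."""
    M = (np.random.rand(*X.shape) < p).astype(X.dtype)
    return X * M / p

def inverted_dropout_test(X):
    """Inference: all neurons are kept, nothing is masked or rescaled."""
    return X
```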

Why does Dropout work?

  • Intuition: Dropout prevents the network from becoming too dependent on a small number of neurons and forces every neuron to be able to operate more independently.
  • Each “dropped” instance is a different network configuration
  • $2^n$ different networks sharing weights (for a network with $n$ neurons)
  • The inference process can be understood as an ensemble of these $2^n$ different configurations
  • This interpretation is in line with Bayesian Neural Networks
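
One way to make the ensemble view concrete is to keep dropout active at test time and average the predictions over several sampled masks (Monte Carlo dropout); `forward_fn` below is a hypothetical forward pass for the rest of the network:

```python
import numpy as np

def mc_dropout_predict(forward_fn, X, p=0.5, n_samples=100):
    """Approximate the 2^n-network ensemble by averaging predictions
    over randomly sampled dropout masks (Monte Carlo dropout)."""
    preds = []
    for _ in range(n_samples):
        M = (np.random.rand(*X.shape) < p).astype(X.dtype)
        preds.append(forward_fn(X * M / p))  # one sampled sub-network
    return np.mean(preds, axis=0)
```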