Dropout

Sun, 16 Aug 2020 00:00:00 +0000

Model Overfitting

To give a neural network more “capacity” to capture different features, we give it a lot of neurons. But this can cause overfitting.

Reason: Co-adaptation

Neurons become dependent on others
Imagine: neuron $H\_i$ captures a particular feature $X$ that, however, is very frequently seen together with certain other inputs.
- If $H\_i$ receives bad inputs (only part of that combination), there is a chance that the feature is ignored 🤪

Solution: Dropout! 💪

Dropout

With dropout the layer inputs become more sparse, forcing the network weights to become more robust.

Dropping out a neuron = all inputs to and outputs from this neuron are disabled for the current iteration.

Training

Given
- input $X \in \mathbb{R}^D$
- weights $W$
- survival rate $p$
  - Usually $p=0.5$

Sample a mask $M \in \{0, 1\}^D$ with $M\_i \sim \operatorname{Bernoulli}(p)$

Dropped input:
$$ \hat{X} = X \circ M $$
Perform the forward pass with $\hat{X}$, then perform the backward pass and mask the gradients with the same $M$:
$$ \frac{\partial L}{\partial X}=\frac{\partial L}{\partial \hat{X}} \circ M $$
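
The training steps above can be sketched in NumPy as follows. This is only a minimal illustration: the function names `dropout_forward` and `dropout_backward` are hypothetical, and the upstream gradient is made up rather than computed from a real loss.

```python
import numpy as np

def dropout_forward(x, p, rng):
    """Apply dropout with survival rate p; return the dropped input and the mask."""
    # M_i ~ Bernoulli(p): 1 keeps the unit, 0 drops it
    mask = (rng.random(x.shape) < p).astype(x.dtype)
    return x * mask, mask

def dropout_backward(grad_x_hat, mask):
    """Mask the upstream gradient: dropped units receive zero gradient."""
    return grad_x_hat * mask

rng = np.random.default_rng(0)
x = np.arange(1.0, 9.0)              # toy input with D = 8
x_hat, mask = dropout_forward(x, p=0.5, rng=rng)
grad_x_hat = np.ones_like(x)         # made-up upstream gradient dL/dX_hat
grad_x = dropout_backward(grad_x_hat, mask)
```

The same mask is reused in the backward pass, so a unit that was dropped in the forward pass contributes nothing to the weight update of this iteration.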

Evaluation/Testing/Inference

ALL input neurons $X$ are presented WITHOUT masking.
During training, however, each neuron was present only with probability $p$, so $\mathbb{E}[\hat{X}] = pX$.

$\to$ So we have to scale $X$ with $p$ at test time (or scale $\hat{X}$ with $\frac{1}{p}$ during training) to match the expectation
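
A small numerical check can make the expectation argument concrete. The sketch below assumes $p$ is the survival rate, as defined in the training section, so the training-time factor is $1/p$ (often called "inverted dropout"). The values of `p` and `x` are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.8                                   # survival rate (arbitrary choice)
x = np.array([1.0, 2.0, 3.0, 4.0])

# Draw many masks to estimate the expected dropped input E[X_hat]
masks = (rng.random((100_000, x.size)) < p).astype(float)
mean_dropped = (x * masks).mean(axis=0)   # close to p * x

# Option 1: plain dropout in training, scale the input by p at test time
test_time_input = p * x

# Option 2: scale by 1/p in training, leave the test input untouched
mean_inverted = (x * masks / p).mean(axis=0)   # close to x
```

Libraries that parameterize dropout by the drop probability $q = 1 - p$ write the same training-time factor as $\frac{1}{1-q}$.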

Why does Dropout work?

Intuition: Dropout prevents the network from becoming too dependent on a small number of neurons, and forces every neuron to be able to operate independently.
- Each “dropped” instance is a different network configuration
- With $n$ droppable neurons, there are $2^n$ different networks sharing weights
- The inference process can be understood as an ensemble of these $2^n$ configurations
- This interpretation is in line with Bayesian Neural Networks
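
The ensemble view can be checked by brute force for a tiny case. The sketch below assumes a single linear layer with no nonlinearity, for which averaging over all $2^n$ masks equals one pass with inputs scaled by $p$ exactly. For deep nonlinear networks the scaled pass is only an approximation of the ensemble. All sizes and values are arbitrary.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n, p = 4, 0.5                       # n droppable inputs, survival rate p
x = rng.normal(size=n)
W = rng.normal(size=(3, n))         # a single linear layer, no nonlinearity

# Average the outputs of all 2^n sub-networks, weighted by mask probability
ensemble = np.zeros(3)
for bits in itertools.product([0, 1], repeat=n):
    m = np.array(bits, dtype=float)
    kept = m.sum()
    prob = p ** kept * (1 - p) ** (n - kept)
    ensemble += prob * (W @ (x * m))

# One pass through the full network with inputs scaled by p
scaled = W @ (p * x)
```

Here `ensemble` and `scaled` agree up to floating-point error, which is why a single scaled forward pass can stand in for the whole ensemble at inference time.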

👍 Data Augmentation


Motivation

Overfitting often happens when there are too few examples to train on, resulting in a model with poor generalization performance 😢. If we had infinite training data, we wouldn’t overfit because we would see every possible instance.

However, in most machine learning applications, especially in image classification tasks, obtaining new training data is not easy. Therefore we need to make do with the training set at hand. 💪

Data augmentation is a way to generate more training data from our current set. It enriches or “augments” the training data by generating new examples via random transformation of existing ones. This way we artificially boost the size of the training set, reducing overfitting. So data augmentation can also be considered as a regularization technique.

Data augmentation is done dynamically at training time. We need to generate realistic images, and the transformations should be learnable; simply adding noise won’t help. Common transformations are

- rotation
- shifting
- resizing
- exposure adjustment
- contrast change
- etc.

This way we can generate a lot of new samples from a single training example.

Notice that data augmentation is ONLY performed on the training data; we don’t touch the validation or test set.
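
As a rough sketch of what "dynamic" augmentation means, the snippet below draws a fresh random variant of a training image each time it is requested. The function name `random_augment`, the choice of transformations, and the parameter ranges are assumptions for illustration, not a recommended recipe.

```python
import numpy as np

def random_augment(image, rng):
    """Return a randomly transformed copy of an H x W image with values in [0, 1]."""
    out = image.copy()
    if rng.random() < 0.5:
        out = out[:, ::-1]                  # random horizontal flip
    out = out * rng.uniform(0.8, 1.2)       # random exposure change
    return np.clip(out, 0.0, 1.0)

rng = np.random.default_rng(0)
train_image = rng.random((4, 6))            # stand-in for one training image
original = train_image.copy()

# A fresh variant is drawn every epoch; the stored image is never modified
variants = [random_augment(train_image, rng) for _ in range(3)]
```

Validation and test images would bypass `random_augment` entirely, so evaluation always sees the unmodified data.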

Popular Augmentation Techniques

Flip

Left: original image. Middle: image flipped horizontally. Right: image flipped vertically
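
A minimal NumPy sketch of the two flips, using a toy integer array in place of a real image:

```python
import numpy as np

image = np.arange(12).reshape(3, 4)    # toy 3 x 4 "image"
flipped_h = np.fliplr(image)           # horizontal flip: mirror left-right
flipped_v = np.flipud(image)           # vertical flip: mirror top-bottom
```

Whether a flip produces a realistic image depends on the data. This is a judgment call per dataset rather than a fixed rule.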

Rotation

Example of square images rotated at right angles. From left to right: The images are rotated by 90 degrees clockwise with respect to the previous one.

Note: image dimensions may not be preserved after rotation

- If the image is square, rotating it at right angles preserves the image size.
- If the image is a rectangle, rotating it by 180 degrees preserves the size.
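
The size behaviour in the note can be checked with `np.rot90`, which rotates counterclockwise for positive `k`, so `k=-1` gives a clockwise quarter turn. The array shapes are arbitrary.

```python
import numpy as np

square = np.zeros((4, 4))
rect = np.zeros((3, 5))                  # height 3, width 5

rot90_square = np.rot90(square, k=-1)    # 90 degrees clockwise, still 4 x 4
rot90_rect = np.rot90(rect, k=-1)        # 90 degrees clockwise, now 5 x 3
rot180_rect = np.rot90(rect, k=2)        # 180 degrees, still 3 x 5
```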

Scale

Left: original image. Middle: image scaled outward by 10%. Right: image scaled outward by 20%

The image can be scaled outward or inward. When scaling outward, the resulting image is larger than the original. Most image frameworks then cut out a section of the new image whose size equals the original image size.
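
Outward scaling followed by a cut-out can be sketched as below. This assumes nearest-neighbour resizing and a centred cut-out, both chosen here for simplicity. Real frameworks typically offer smoother interpolation. The helper names are hypothetical.

```python
import numpy as np

def resize_nearest(image, new_h, new_w):
    """Nearest-neighbour resize of an H x W array."""
    h, w = image.shape[:2]
    rows = np.arange(new_h) * h // new_h     # source row for each output row
    cols = np.arange(new_w) * w // new_w     # source column for each output column
    return image[rows][:, cols]

def scale_outward(image, factor):
    """Enlarge by `factor` (assumed >= 1), then cut out the centre at the original size."""
    h, w = image.shape[:2]
    new_h, new_w = int(round(h * factor)), int(round(w * factor))
    enlarged = resize_nearest(image, new_h, new_w)
    top, left = (new_h - h) // 2, (new_w - w) // 2
    return enlarged[top:top + h, left:left + w]

image = np.arange(100).reshape(10, 10)
scaled = scale_outward(image, 1.2)           # enlarged by 20%, cut back to 10 x 10
```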

Crop

Left: original image. Middle: a square section cropped from the top-left. Right: a square section cropped from the bottom-right. The cropped sections were resized to the original image size.

Random cropping

- Randomly sample a section from the original image
- Resize this section to the original image size
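
The two steps of random cropping might look like this in NumPy. The function name `random_crop_resize` is hypothetical, and nearest-neighbour resizing is an assumption made to keep the sketch dependency-free.

```python
import numpy as np

def random_crop_resize(image, crop_h, crop_w, rng):
    """Sample a crop_h x crop_w section, then resize it back to the original size."""
    h, w = image.shape[:2]
    # Step 1: randomly sample the top-left corner of the section
    top = int(rng.integers(0, h - crop_h + 1))
    left = int(rng.integers(0, w - crop_w + 1))
    crop = image[top:top + crop_h, left:left + crop_w]
    # Step 2: nearest-neighbour resize of the section back to h x w
    rows = np.arange(h) * crop_h // h
    cols = np.arange(w) * crop_w // w
    return crop[rows][:, cols]

rng = np.random.default_rng(0)
image = np.arange(100).reshape(10, 10)
cropped = random_crop_resize(image, 6, 6, rng)
```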

Translation

Left: original image. Middle: the image translated to the right. Right: the image translated upwards.

Translation = moving the image along the X or Y direction (or both)

This method of augmentation is very useful because most objects can be located almost anywhere in the image. This forces your convolutional neural network to look everywhere.
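
A translation with constant padding can be sketched as follows. The function name `translate` and the zero fill value are assumptions. Other fill strategies (edge replication, reflection) are equally plausible.

```python
import numpy as np

def translate(image, dx, dy, fill=0):
    """Shift by dx pixels to the right and dy pixels down.

    Vacated pixels are set to `fill`. Assumes |dx| < width and |dy| < height.
    """
    h, w = image.shape[:2]
    out = np.full_like(image, fill)
    src_rows = slice(max(0, -dy), min(h, h - dy))
    dst_rows = slice(max(0, dy), min(h, h + dy))
    src_cols = slice(max(0, -dx), min(w, w - dx))
    dst_cols = slice(max(0, dx), min(w, w + dx))
    out[dst_rows, dst_cols] = image[src_rows, src_cols]
    return out

image = np.arange(1, 13).reshape(3, 4)
right = translate(image, dx=1, dy=0)     # one pixel to the right
up = translate(image, dx=0, dy=-1)       # one pixel upwards
```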

Gaussian Noise

Left: original image. Middle: image with added Gaussian noise. Right: image with added salt and pepper noise.

One reason for overfitting is that a neural network tries to learn high-frequency features (patterns that occur a lot) that may not be useful.

Gaussian noise, which has zero mean, essentially has data points in all frequencies, effectively distorting the high frequency features. This also means that lower frequency components (usually, your intended data) are also distorted, but your neural network can learn to look past that. Adding just the right amount of noise can enhance the learning capability.

A toned-down version of this is salt and pepper noise, which presents itself as random black and white pixels spread through the image. Its effect is similar to that of adding Gaussian noise, but it may distort less of the information in the image.
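
Both kinds of noise can be sketched in a few lines. The noise strength `sigma` and the corrupted fraction `amount` are arbitrary values picked for illustration. In practice they would need tuning to find "just the right amount".

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((32, 32))                 # stand-in image with values in [0, 1]

# Zero-mean Gaussian noise; sigma controls how strong the distortion is
sigma = 0.1
gaussian = np.clip(image + rng.normal(0.0, sigma, image.shape), 0.0, 1.0)

# Salt and pepper noise: a small fraction of pixels set to black or white
amount = 0.05
salt_pepper = image.copy()
u = rng.random(image.shape)
salt_pepper[u < amount / 2] = 0.0            # pepper (black pixels)
salt_pepper[u > 1 - amount / 2] = 1.0        # salt (white pixels)
```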

Generalization | Haobin Tan
