CNN History

LeNet (1998)

  • Image followed by multiple convolutional / pooling layers

    • Build up hierarchical filter structures

    • Subsampling / pooling increases robustness

  • Fully connected layers towards the end

    • Bring all the extracted features together for the final decision
  • Output layer of 10 units, one for each digit class
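
A minimal PyTorch sketch of such a LeNet-style network (the layer sizes follow the classic LeNet-5 description; the tanh activations and average pooling are assumptions, not the lecture's reference code):

```python
import torch
import torch.nn as nn

# Minimal LeNet-style network: conv/pool feature extractor followed by
# fully connected layers and a 10-way output (one unit per digit class).
class LeNet(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5),   # 32x32x1 -> 28x28x6
            nn.Tanh(),
            nn.AvgPool2d(2),                  # subsampling -> 14x14x6
            nn.Conv2d(6, 16, kernel_size=5),  # -> 10x10x16
            nn.Tanh(),
            nn.AvgPool2d(2),                  # -> 5x5x16
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 5 * 5, 120),
            nn.Tanh(),
            nn.Linear(120, 84),
            nn.Tanh(),
            nn.Linear(84, num_classes),       # output layer: 10 digit classes
        )

    def forward(self, x):
        return self.classifier(self.features(x))

logits = LeNet()(torch.randn(1, 1, 32, 32))   # expects 32x32 grayscale input
print(logits.shape)                           # torch.Size([1, 10])
```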

ImageNet Dataset (2009)

Standard benchmark for vision:

  • 1.2 M images
  • 1000 classes
  • > 500 images per class

ImageNet Large Scale Visual Recognition Challenge (ILSVRC)

  • ILSVRC Classification Task
    • 1000 object classes
    • 1.2 million training images (732-1300 per class)
    • 50 thousand validation images (50 per class)
    • 100 thousand test images (100 per class)

AlexNet (2012)

  • Multiple convolutional layers

  • A couple of fully connected (dense) layers

  • Final classification using a “softmax” layer

  • Train end-to-end via backpropagation

  • Details

    • first use of ReLU
    • used Norm layers (not common anymore)
    • heavy data augmentation
    • dropout 0.5
    • batch size 128
    • SGD Momentum 0.9
    • Learning rate 1e-2, reduced by a factor of 10 manually when val accuracy plateaus
    • L2 weight decay 5e-4
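
As a rough illustration, these training details map onto a PyTorch setup roughly as follows. The torchvision AlexNet and the ReduceLROnPlateau scheduler are stand-ins (the original authors dropped the learning rate by hand), so this is a sketch, not the original training code:

```python
import torch
import torchvision

# Hypothetical setup: torchvision's AlexNet stands in for the original model
# (it already contains the 0.5 dropout in its classifier head).
model = torchvision.models.alexnet(num_classes=1000)

# SGD with momentum 0.9, initial LR 1e-2, L2 weight decay 5e-4 (as listed above).
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2,
                            momentum=0.9, weight_decay=5e-4)

# The LR was reduced by 10x manually when validation accuracy plateaued;
# ReduceLROnPlateau only approximates that manual schedule.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="max", factor=0.1, patience=2)

criterion = torch.nn.CrossEntropyLoss()

def train_step(images, labels):
    # One step of end-to-end training via backpropagation
    # (batch size 128 would come from the DataLoader).
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()

# After each epoch: scheduler.step(val_accuracy)
```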

VGG Net (2014)

Small filters, Deeper networks

  • 8 layers (AlexNet) -> 16-19 layers (VGG-16 / VGG-19)
  • Only 3x3 CONV stride 1, pad 1
  • and 2x2 MAX POOL stride 2
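
A small sketch of the resulting building block, assuming PyTorch: every block uses only 3x3 convolutions (stride 1, pad 1) and ends in a 2x2 max pool with stride 2. The channel counts below follow VGG-16; the helper name vgg_block is just for illustration:

```python
import torch.nn as nn

def vgg_block(in_channels, out_channels, num_convs):
    # A VGG-style block: only 3x3 convolutions (stride 1, padding 1),
    # followed by a 2x2 max pool with stride 2 that halves the spatial size.
    layers = []
    for _ in range(num_convs):
        layers += [nn.Conv2d(in_channels, out_channels, kernel_size=3,
                             stride=1, padding=1),
                   nn.ReLU(inplace=True)]
        in_channels = out_channels
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
    return nn.Sequential(*layers)

# VGG-16's convolutional part stacks five such blocks:
features = nn.Sequential(
    vgg_block(3, 64, 2),
    vgg_block(64, 128, 2),
    vgg_block(128, 256, 3),
    vgg_block(256, 512, 3),
    vgg_block(512, 512, 3),
)
```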

ResNet (2015)

Residual blocks

How can we train such deep networks?

Solution: Use network layers to fit a residual mapping instead of directly trying to fit the desired underlying mapping.

  • Use layers to fit the residual $F(x) = H(x) - x$ instead of fitting $H(x)$ directly (see the sketch below)
  • If the block's weights are pushed towards zero, $F(x) = 0$ and the block simply passes on the identity $H(x) = x$
  • I.e. adding more layers does not hurt 👍
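
A minimal sketch of a basic residual block in PyTorch (identity shortcut only; the channel-doubling, stride-2 variant additionally needs a 1x1 projection on the shortcut and is omitted here):

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    # Residual block: two 3x3 convolutions compute F(x); the skip connection
    # adds x back, so the block outputs H(x) = F(x) + x.
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        residual = self.relu(self.bn1(self.conv1(x)))
        residual = self.bn2(self.conv2(residual))   # F(x)
        return self.relu(residual + x)              # H(x) = F(x) + x

x = torch.randn(1, 64, 56, 56)
print(BasicBlock(64)(x).shape)   # same shape as the input: [1, 64, 56, 56]
```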

ResNet Architecture

  • Stack residual blocks
  • Every residual block has two 3x3 conv layers
  • Periodically, double # of filters and downsample spatially using stride 2 (/2 in each dimension)
  • Additional conv layer at the beginning
  • No hidden FC layers at the end (only a final FC-1000 layer that outputs the class scores)
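
Assuming torchvision is available, its ResNet-18 follows exactly this recipe and can be inspected to see the stage structure (a sketch for orientation, not the lecture's own code):

```python
import torchvision

# ResNet-18: an initial conv layer, four stages of residual blocks with
# 64 -> 128 -> 256 -> 512 filters (stride-2 downsampling between stages),
# global average pooling, and a single FC-1000 output layer.
model = torchvision.models.resnet18(num_classes=1000)
for name, module in model.named_children():
    print(name, "->", module.__class__.__name__)
# conv1 / bn1 / relu / maxpool, then layer1..layer4, avgpool, fc
```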

Training ResNet in practice

  • Batch Normalization after every CONV layer (not covered)
  • Xavier/2 initialization from He et al. (a.k.a. He initialization)
  • SGD + Momentum (0.9)
  • Learning rate: 0.1, divided by 10 when validation error plateaus
  • Mini-batch size 256
  • Weight decay of 1e-5
  • No dropout used
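
A sketch of these settings in PyTorch: the "Xavier/2" scheme from He et al. corresponds to torch's kaiming_normal_ initializer, and ReduceLROnPlateau stands in for the manual divide-by-10 schedule; the choice of ResNet-34 is arbitrary here:

```python
import torch
import torch.nn as nn
import torchvision

model = torchvision.models.resnet34(num_classes=1000)

# He ("Xavier/2") initialization for conv layers; BatchNorm starts at
# gamma = 1, beta = 0.
for m in model.modules():
    if isinstance(m, nn.Conv2d):
        nn.init.kaiming_normal_(m.weight, mode="fan_out", nonlinearity="relu")
    elif isinstance(m, nn.BatchNorm2d):
        nn.init.ones_(m.weight)
        nn.init.zeros_(m.bias)

# SGD + momentum 0.9, LR 0.1 divided by 10 on validation-error plateaus,
# mini-batch size 256 (set in the DataLoader), weight decay 1e-5, no dropout.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=1e-5)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.1)
```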

Transfer Learning

ImageNet has 1.2 million images! Typically, we do not have that many. Can we also use these methods with fewer images?

Yes! With transfer learning!

  • Features (conv layers) are generic and can be reused!

How?

  • Train on a huge dataset (e.g. ImageNet)
  • Freeze the conv layers and adapt only the last (FC) layers, as sketched below
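
A minimal sketch with torchvision (the 5-class target task is an assumption, and the weights argument requires a recent torchvision version; older versions use pretrained=True instead):

```python
import torch
import torch.nn as nn
import torchvision

# Transfer learning: start from a ResNet-18 pretrained on ImageNet,
# freeze the convolutional features, and train only a new FC head.
model = torchvision.models.resnet18(weights="IMAGENET1K_V1")

for param in model.parameters():
    param.requires_grad = False          # freeze all pretrained layers

num_classes = 5                          # assumed size of the small target dataset
model.fc = nn.Linear(model.fc.in_features, num_classes)  # new, trainable head

# Only the new head's parameters are passed to the optimizer.
optimizer = torch.optim.SGD(model.fc.parameters(), lr=1e-3, momentum=0.9)
```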

Practical Advice

  • Very little data, very similar dataset:

    • Train a linear classifier on the top-layer features
  • Very little data, very different dataset:

    • You’re in trouble… Try a linear classifier on features from different stages and pray 🤪
  • A lot of data, very similar dataset:

    • Finetune a few layers (see the sketch after this list)
  • A lot of data, very different dataset:

    • Finetune a larger number of layers
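
For example, the "a lot of data, very similar dataset" case could look like this in PyTorch. The stage name layer4 is torchvision's ResNet naming and the 10-class head is an assumption; unfreezing more stages covers the "very different dataset" case:

```python
import torch
import torch.nn as nn
import torchvision

# Fine-tune only the last residual stage plus a new classification head.
model = torchvision.models.resnet18(weights="IMAGENET1K_V1")
model.fc = nn.Linear(model.fc.in_features, 10)   # assumed 10 target classes

for name, param in model.named_parameters():
    # Keep early, generic features frozen; unfreeze the last residual stage
    # ("layer4") and the new classifier ("fc").
    param.requires_grad = name.startswith(("layer4", "fc"))

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=1e-3, momentum=0.9)
```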

Example: Image Captioning

[Figure: image captioning example]

Example: Face Recognition

  • Siamese Networks (FaceNet)

  • Distance $$ d\left(x_{1}, x_{2}\right)=\left\|f\left(x_{1}\right)-f\left(x_{2}\right)\right\|_{2}^{2} $$

    • If $d(x_1, x_2)$ small: same person
    • Otherwise different person
  • Training: Triplet loss $$ L(A, P, N)=\max \left(\|f(A)-f(P)\|^{2}-\|f(A)-f(N)\|^{2}+\alpha, 0\right) $$
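
A minimal sketch of the distance and the triplet loss in PyTorch (the margin $\alpha = 0.2$ and the random embeddings are assumptions; PyTorch also ships nn.TripletMarginLoss, which uses a non-squared distance):

```python
import torch
import torch.nn.functional as F

def squared_distance(f1, f2):
    # d(x1, x2) = ||f(x1) - f(x2)||_2^2 on the embedding vectors.
    return ((f1 - f2) ** 2).sum(dim=-1)

def triplet_loss(f_anchor, f_positive, f_negative, alpha=0.2):
    # L(A, P, N) = max(||f(A)-f(P)||^2 - ||f(A)-f(N)||^2 + alpha, 0)
    return torch.clamp(
        squared_distance(f_anchor, f_positive)
        - squared_distance(f_anchor, f_negative) + alpha,
        min=0.0,
    ).mean()

# Toy usage: random unit-norm embeddings stand in for f(x) from the shared network.
emb = lambda: F.normalize(torch.randn(8, 128), dim=-1)
print(triplet_loss(emb(), emb(), emb()))
```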
