Auto-Encoder
Supervised vs. Unsupervised Learning

- Supervised learning
    - Given data pairs $(x, y)$
    - Estimate the posterior $P(y \mid x)$
- Unsupervised learning
    - Concerned with the (unseen) structure of the data
    - Try to estimate (implicitly or explicitly) the data distribution $P(x)$
Auto-Encoder structure
In supervised learning, the hidden layers encapsulate the features useful for classification. Even if there are no labels and no output layer, it is still possible to learn features in the hidden layer! 💪
Linear auto-encoder

- Similar to linear compression methods (such as PCA)
- Tries to find linear surfaces that most data points can lie on
- Not very useful for complicated data 🤪
Non-linear auto-encoder

When $\sigma$ is non-linear, the activation function also prevents the network from simply copying over the data
Goal: find optimized weights that minimize the reconstruction loss $L = \lVert \hat{x} - x \rVert^2$
- Optimized with Stochastic Gradient Descent (SGD)
- Gradients computed with Backpropagation
General auto-encoder structure

Two components in general
- Encoder: maps the input $x$ to the hidden code $h$
- Decoder: reconstructs $\hat{x}$ from $h$
(The exact architectures depend on the input data type)
Encoder and Decoder often have similar/reversed architectures
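Below is a minimal sketch of this structure in PyTorch, assuming a flattened vector input and a one-layer encoder/decoder; the dimensions, activation, and learning rate are illustrative choices, not prescribed by these notes.

```python
import torch
import torch.nn as nn

# Minimal non-linear auto-encoder: encoder maps x -> h, decoder reconstructs x_hat from h.
# The dimensions (784 -> 32) are illustrative.
class AutoEncoder(nn.Module):
    def __init__(self, input_dim=784, hidden_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.Sigmoid())
        self.decoder = nn.Sequential(nn.Linear(hidden_dim, input_dim), nn.Sigmoid())

    def forward(self, x):
        h = self.encoder(x)          # hidden code
        return self.decoder(h)       # reconstruction x_hat

model = AutoEncoder()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.MSELoss()             # reconstruction loss ||x_hat - x||^2

x = torch.rand(64, 784)              # dummy batch standing in for real data
for _ in range(100):
    x_hat = model(x)
    loss = criterion(x_hat, x)
    optimizer.zero_grad()
    loss.backward()                  # gradients via backpropagation
    optimizer.step()                 # SGD update
```

The decoder mirrors the encoder, and the whole model is trained end-to-end on the reconstruction loss with SGD and backpropagation.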
Why Auto-Encoders?
With auto-encoders we can do
- Compression & Reconstruction
- MLP training assistance
- Feature learning
- Representation learning
- Sampling different variations of the inputs
There are many types and variations of auto-encoders
- Different architectures for different data types
- Different loss functions for different learning purposes
Compression and Reconstruction

- For example, a flattened image gives an input vector $x \in \mathbb{R}^{d}$ with one value per pixel
- Common hidden layer sizes are much smaller than $d$
Sending the code $h$ takes less bandwidth than sending $x$
Sender uses the encoder to compress $x$ into $h$
Receiver uses the decoder to reconstruct $\hat{x}$ from $h$
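As a usage sketch (reusing the hypothetical AutoEncoder model trained above), the sender only transmits the much smaller code $h$:

```python
x = torch.rand(1, 784)               # original data at the sender

# Sender: compress x into the smaller code h and transmit h.
h = model.encoder(x)                 # shape (1, 32) instead of (1, 784)

# Receiver: reconstruct an approximation of x from h.
x_reconstructed = model.decoder(h)
```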
With corrupted inputs

Deliberately corrupt the inputs
Train auto-encoders to regenerate the inputs before corruption (see the sketch after the corruption list below)
A bottleneck ($|h| < |x|$) is NOT required (there is no risk of learning an identity function)
Benefits from a network with large capacity
Different ways of corruption
- Images
    - Adding noise filters
    - Downscaling
    - Shifting
    - …
- Speech
    - Simulating background noise
    - Creating a high-articulation effect
    - …
- Text
    - Masking words/characters
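A minimal denoising sketch (reusing the hypothetical AutoEncoder class from above), with additive Gaussian noise standing in for any of the corruption methods listed:

```python
# Denoising variant: corrupt the input, but compute the loss against the clean input.
# Gaussian noise is just one of the corruption options listed above.
model = AutoEncoder(input_dim=784, hidden_dim=1024)   # large capacity, no bottleneck needed
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.MSELoss()

x_clean = torch.rand(64, 784)
for _ in range(100):
    x_corrupted = (x_clean + 0.3 * torch.randn_like(x_clean)).clamp(0, 1)
    x_hat = model(x_corrupted)        # reconstruct from the corrupted version
    loss = criterion(x_hat, x_clean)  # ...but target the clean input
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```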
Applications
Deep Learning Super Sampling
- Use neural auto-encoders to generate HD frames from SD frames
Denoising Speech from Microphones
Unsupervised Pretraining

Normal training regime
- Initialize the network with random weights $W$
- Forward pass to compute the output $\hat{y}$
- Get the loss function $L(\hat{y}, y)$
- Backward pass and update the weights to minimize $L$
Pretraining regime
- Find a way to obtain pretrained weights $W$ 💪
- These come from optimizing auxiliary objective functions before the main (supervised) training
Layer-wise pretraining
Pretraining first layer

Initialize $W_1$ to encode and $W_1'$ to decode
1. Forward pass: $h_1 = \sigma(W_1 x)$, then $\hat{x} = \sigma(W_1' h_1)$
2. Reconstruction loss: $L = \lVert \hat{x} - x \rVert^2$
3. Backward pass
    - Compute the gradients $\partial L / \partial W_1$ and $\partial L / \partial W_1'$
4. Update $W_1$ and $W_1'$ with SGD

Repeat 1 to 4 until convergence
Pretraining next layers

- Use $W_1$ from the previous pretraining step
Initialize $W_2$ to encode and $W_2'$ to decode
1. Forward pass: $h_1 = \sigma(W_1 x)$, $h_2 = \sigma(W_2 h_1)$, then $\hat{h}_1 = \sigma(W_2' h_2)$
2. Reconstruction loss: $L = \lVert \hat{h}_1 - h_1 \rVert^2$
3. Backward pass
    - Compute the gradients $\partial L / \partial W_2$ and $\partial L / \partial W_2'$
4. Update $W_2$ and $W_2'$ with SGD and keep $W_1$ the same
Hidden layers pretraining in general

- Each layer $i$ is pretrained as an AE to reconstruct the input of that layer (i.e. $h_{i-1}$, with $h_0 = x$)
- The backward pass is stopped at that input to prevent changing the previous weights; ONLY $W_i$ and $W_i'$ are updated
- The cost of pretraining each AE increases with depth (since the forward pass requires all previously pretrained layers)
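A rough sketch of this greedy, layer-wise procedure in PyTorch; the layer widths, activation, and number of iterations are illustrative assumptions:

```python
import torch
import torch.nn as nn

layer_sizes = [784, 512, 256, 128]     # illustrative MLP layer widths
encoders = []                          # pretrained layers collected so far
x = torch.rand(256, 784)               # dummy data standing in for the training set

for d_in, d_out in zip(layer_sizes[:-1], layer_sizes[1:]):
    # Input of the current layer = output of the already-pretrained stack.
    with torch.no_grad():              # previous weights are frozen (backward pass stops here)
        h_prev = x
        for enc in encoders:
            h_prev = torch.sigmoid(enc(h_prev))

    W = nn.Linear(d_in, d_out)         # W_i  (encoder of this layer's AE)
    W_dec = nn.Linear(d_out, d_in)     # W_i' (decoder, discarded after pretraining)
    opt = torch.optim.SGD(list(W.parameters()) + list(W_dec.parameters()), lr=0.1)

    for _ in range(100):               # repeat forward / loss / backward / update
        h = torch.sigmoid(W(h_prev))
        h_prev_hat = torch.sigmoid(W_dec(h))
        loss = nn.functional.mse_loss(h_prev_hat, h_prev)
        opt.zero_grad()
        loss.backward()
        opt.step()

    encoders.append(W)                 # keep only the encoder weights for the final MLP
```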
Finetuning
- Start the network with the pretrained weights $W_1, W_2, \ldots$
- Go back to supervised training:
    - Forward pass to compute the output $\hat{y}$
    - Get the loss function $L(\hat{y}, y)$
    - Backward pass and update the weights to minimize $L$
This process is called finetuning because the weights are NOT randomly initialized, but carried over from an external process
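A possible finetuning sketch, assuming the `encoders` list produced by the layer-wise pretraining sketch above; the classifier head and hyperparameters are illustrative:

```python
import torch
import torch.nn as nn

# Stack the pretrained encoder layers, add a randomly initialized output layer,
# and continue with ordinary supervised training.
n_classes = 10
mlp = nn.Sequential(
    *[nn.Sequential(enc, nn.Sigmoid()) for enc in encoders],  # pretrained weights
    nn.Linear(encoders[-1].out_features, n_classes),          # new classification head
)
optimizer = torch.optim.SGD(mlp.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()

x, y = torch.rand(64, 784), torch.randint(0, n_classes, (64,))
logits = mlp(x)                       # forward pass
loss = criterion(logits, y)           # supervised loss L(y_hat, y)
optimizer.zero_grad()
loss.backward()                       # gradients now flow through ALL weights
optimizer.step()
```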
How does “unsupervised pretraining” help?
According to Why Does Unsupervised Pre-training Help Deep Learning? (Erhan et al., 2010):
- Pretraining helps networks with 5 hidden layers converge
- It lowers the classification error rate
- It creates a better starting point for the non-convex optimization process
Restricted Boltzmann Machine

Structure
- Visible units $v$ (the input data points)
- Hidden units $h$
Given the input $v$, we can generate the probabilities of the hidden units being on (1) or off (0): $P(h_j = 1 \mid v) = \sigma\big(b_j + \sum_i v_i w_{ij}\big)$
Given the hidden units, we can generate the probabilities of the visible units being on or off: $P(v_i = 1 \mid h) = \sigma\big(a_i + \sum_j w_{ij} h_j\big)$
Energy function of a visible-hidden configuration: $E(v, h) = -\sum_i a_i v_i - \sum_j b_j h_j - \sum_{i,j} v_i w_{ij} h_j$
- Train the network to minimize the energy (of the training data)
- Use the Contrastive Divergence algorithm
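A rough sketch of one CD-1 update for a binary RBM; the sizes, learning rate, and dummy batch are illustrative assumptions:

```python
import torch

# One step of Contrastive Divergence (CD-1) for a binary RBM.
n_visible, n_hidden, lr = 784, 128, 0.01
W = torch.randn(n_visible, n_hidden) * 0.01
a = torch.zeros(n_visible)            # visible biases
b = torch.zeros(n_hidden)             # hidden biases

v0 = torch.rand(64, n_visible).bernoulli()       # dummy batch of binary visible units

# Positive phase: probabilities of hidden units being on, given the data.
h0_prob = torch.sigmoid(v0 @ W + b)
h0 = h0_prob.bernoulli()

# Negative phase: reconstruct the visible units, then the hidden units again.
v1_prob = torch.sigmoid(h0 @ W.t() + a)
h1_prob = torch.sigmoid(v1_prob @ W + b)

# Update pushes the energy of the data down and of the reconstructions up.
W += lr * (v0.t() @ h0_prob - v1_prob.t() @ h1_prob) / v0.shape[0]
a += lr * (v0 - v1_prob).mean(0)
b += lr * (h0_prob - h1_prob).mean(0)
```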
Layer-wise pretraining with RBM
- Train the first RBM on the input data, then train the next RBM on the hidden activations of the previous one, and so on (the same greedy, layer-wise idea as the AE pretraining above)
Finetuning RBM: Deep Belief Network
- The end result is called a Deep Belief Network
- Use the pretrained weights $W_1, W_2, \ldots$ to convert the network into a typical MLP
- Go back to supervised training:
    - Forward pass to compute the output $\hat{y}$
    - Get the loss function $L(\hat{y}, y)$
    - Backward pass and update the weights to minimize $L$
RBM Pretraining application in Speech
Speech Recognition = looking for the most probable transcription $w$ given an audio signal $X$:

$\hat{w} = \arg\max_{w} P(w \mid X) = \arg\max_{w} P(X \mid w)\, P(w)$, where $P(X \mid w)$ comes from the acoustic model
💡 We can use (deep) neural networks to replace the non-neural generative models (Gaussian Mixture Models) in the Acoustic Models
Variational Auto-Encoder
💡 Main idea: Enforce the hidden units to follow a unit Gaussian distribution (or another known distribution)

- In an AE we don’t know the “distribution” of the (hidden) code
- Knowing the distribution in advance makes sampling easier

To get the Gaussian restriction, the encoder outputs the parameters of a Gaussian instead of a fixed code
- Each Gaussian is represented by its mean $\mu$ and variance $\sigma^2$
Why do we sample?
- The hidden layer’s neurons are then “arranged” according to the Gaussian distribution
We want to enforce the hidden layer to follow a known distribution, for example $\mathcal{N}(0, I)$, so we add a loss term to do so: $L_{KL} = D_{KL}\big(\mathcal{N}(\mu, \sigma^2) \,\|\, \mathcal{N}(0, I)\big)$
Variational methods allow us to take a sample of the distribution being estimated, then get a “noisy” gradient for SGD
Convergence can be achieved in practice
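A minimal VAE sketch with the reparameterization trick and the closed-form KL term described above; the architecture sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, input_dim=784, latent_dim=20):
        super().__init__()
        self.enc = nn.Linear(input_dim, 400)
        self.mu = nn.Linear(400, latent_dim)       # mean of q(z|x)
        self.logvar = nn.Linear(400, latent_dim)   # log-variance of q(z|x)
        self.dec = nn.Sequential(nn.Linear(latent_dim, 400), nn.ReLU(),
                                 nn.Linear(400, input_dim), nn.Sigmoid())

    def forward(self, x):
        h = torch.relu(self.enc(x))
        mu, logvar = self.mu(h), self.logvar(h)
        eps = torch.randn_like(mu)                  # sample noise ~ N(0, I)
        z = mu + eps * torch.exp(0.5 * logvar)      # reparameterization: z ~ N(mu, sigma^2)
        return self.dec(z), mu, logvar

model = VAE()
x = torch.rand(64, 784)
x_hat, mu, logvar = model(x)
recon = nn.functional.mse_loss(x_hat, x, reduction="sum")
# KL(N(mu, sigma^2) || N(0, I)) in closed form:
kl = 0.5 * torch.sum(mu.pow(2) + logvar.exp() - logvar - 1)
loss = recon + kl                                   # noisy estimate, optimized with SGD
```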
Structure Prediction
Beyond auto-encoder
- Auto-Encoder
    - Given the object: reconstruct the object
    - $P(x)$ is (implicitly) estimated via reconstructing the inputs
- Structure prediction
    - Given a part of the object: predict the remaining part
    - $P(x)$ is estimated by factorizing the inputs, e.g. $P(x) = \prod_i P(x_i \mid x_{<i})$
Example
Pixel Models
Assumption (biased): the pixels are generated from left to right and from top to bottom.
(I.e. the content of each pixel depends only on the pixels to its left and on the rows above it.)
We can estimate a probabilistic function to learn how to generate pixels
- Image with $n$ pixels: $P(x) = \prod_{i=1}^{n} P(x_i \mid x_1, \ldots, x_{i-1})$
Closer look:

$P(x) = P(x_1)\, P(x_2 \mid x_1)\, P(x_3 \mid x_1, x_2) \cdots P(x_n \mid x_1, \ldots, x_{n-1})$

- But this is quite difficult
    - The number of input pixels (the context size) is a variable
    - There are many pixels in an image
- We can model such context dependency using many types of neural networks:
    - Recurrent neural networks
    - Convolutional neural networks
    - Transformers / self-attention networks
(Neural) Language Models
A common model/application in natural language processing and generation (e.g. chatbots, translation, question answering)
Similar to the pixel models, we can assume the words are generated from left to right: $P(w_1, \ldots, w_n) = \prod_i P(w_i \mid w_1, \ldots, w_{i-1})$
Each term can be estimated using a neural network of the form $P(w_i \mid \text{context})$, with the context being a series of previous words
- Input: the context words
- Output: a classification with $|V|$ classes (the vocabulary size)
- Most classes will have near-zero probabilities given the context
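A tiny fixed-context neural language model sketch; the vocabulary size, context length, and layer sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn

# Predict the next word from a fixed-size context window.
vocab_size, context_len, emb_dim = 10_000, 4, 64

model = nn.Sequential(
    nn.Embedding(vocab_size, emb_dim),
    nn.Flatten(),                                # concatenate the context embeddings
    nn.Linear(context_len * emb_dim, 256), nn.ReLU(),
    nn.Linear(256, vocab_size),                  # one logit per word in the vocabulary
)

context = torch.randint(0, vocab_size, (1, context_len))    # previous words of the context
probs = torch.softmax(model(context), dim=-1)               # P(w_i | context)
next_word = torch.argmax(probs, dim=-1)                     # most probable next word
```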
Summary
- Structure prediction is
    - An explicit and flexible way to estimate the likelihood of data that can be factorized (with a bias)
    - The motivation for developing many flexible techniques
        - Such as sequence-to-sequence models and attention models
- The bias is often the weakness 🤪