<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Unsupervised Learning | Haobin Tan</title><link>https://haobin-tan.netlify.app/tags/unsupervised-learning/</link><atom:link href="https://haobin-tan.netlify.app/tags/unsupervised-learning/index.xml" rel="self" type="application/rss+xml"/><description>Unsupervised Learning</description><generator>Hugo Blox Builder (https://hugoblox.com)</generator><language>en-us</language><lastBuildDate>Sat, 07 Nov 2020 00:00:00 +0000</lastBuildDate><image><url>https://haobin-tan.netlify.app/media/icon_hu7d15bc7db65c8eaf7a4f66f5447d0b42_15095_512x512_fill_lanczos_center_3.png</url><title>Unsupervised Learning</title><link>https://haobin-tan.netlify.app/tags/unsupervised-learning/</link></image><item><title>Unsupervised Learning</title><link>https://haobin-tan.netlify.app/docs/ai/deep-learning/unsupervised-learning/</link><pubDate>Sun, 16 Aug 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/deep-learning/unsupervised-learning/</guid><description/></item><item><title>Auto Encoder</title><link>https://haobin-tan.netlify.app/docs/ai/deep-learning/unsupervised-learning/auto-encoder/</link><pubDate>Sun, 16 Aug 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/deep-learning/unsupervised-learning/auto-encoder/</guid><description>&lt;h2 id="supervised-vs-unsupervised-learning">Supervised vs. Unsupervised Learning&lt;/h2>
&lt;figure>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2020-08-17%2018.42.22.png"
alt="Supervised vs. unsupervised">&lt;figcaption>
&lt;p>Supervised vs. unsupervised&lt;/p>
&lt;/figcaption>
&lt;/figure>
&lt;ul>
&lt;li>&lt;strong>Supervised learning&lt;/strong>
&lt;ul>
&lt;li>Given data $(X, Y)$&lt;/li>
&lt;li>Estimate the posterior $P(Y|X)$&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Unsupervised learning&lt;/strong>
&lt;ul>
&lt;li>Concerned with the (unseen) &lt;strong>structure&lt;/strong> of the data&lt;/li>
&lt;li>Try to estimate (implicitly or explicitly) the data distribution $P(X)$&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h2 id="auto-encoder-structure">Auto-Encoder structure&lt;/h2>
&lt;p>In supervised learning, the hidden layers encapsulate features useful for classification. Even if there are no labels and no output layer, it is still possible to learn features in the hidden layer! &amp;#x1f4aa;&lt;/p>
&lt;h3 id="linear-auto-encoder">Linear auto-encoder&lt;/h3>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-08-17%2018.56.10.png" alt="截屏2020-08-17 18.56.10" style="zoom:80%;" />
$$
\begin{array}{l}
H=W\_{I} I+b\_{I} \\\\
\tilde{I}=W\_{O} H+b\_{O}
\end{array}
$$
&lt;ul>
&lt;li>Similar to linear compression method (such as PCA)&lt;/li>
&lt;li>Trying to find linear surfaces that most data points can lie on&lt;/li>
&lt;li>&lt;span style="color:red">Not very useful for complicated data&lt;/span> 🤪&lt;/li>
&lt;/ul>
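&lt;p>The two affine maps above can be sketched in a few lines of numpy (a minimal illustration; the sizes and random weights are made up, not from the notes):&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(0)

D_I, D_H = 8, 3                 # illustrative input / hidden sizes
I = rng.normal(size=D_I)        # one input vector

# Encoder and decoder are plain affine maps (no activation function)
W_I = rng.normal(size=(D_H, D_I)); b_I = np.zeros(D_H)
W_O = rng.normal(size=(D_I, D_H)); b_O = np.zeros(D_I)

H = W_I @ I + b_I               # code:           H = W_I I + b_I
I_tilde = W_O @ H + b_O         # reconstruction: I~ = W_O H + b_O
```

&lt;p>Because both maps are linear, training such an AE with a squared-error loss ends up finding a linear subspace of the data, which is why it behaves much like PCA.&lt;/p>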
&lt;h3 id="non-linear-auto-encoder">Non-linear auto-encoder&lt;/h3>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-08-17%2018.59.24.png" alt="截屏2020-08-17 18.59.24" style="zoom:80%;" />
$$
\begin{array}{l}
H=f(W\_{I} I+b\_{I}) \\\\
\tilde{I}=W\_{O} H+b\_{O}
\end{array}
$$
&lt;ul>
&lt;li>
&lt;p>When $D\_H > D\_I$, the activation function also prevents the network from simply copying the data over&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Goal: find optimized weights to minimize
&lt;/p>
$$
L=\frac{1}{2}(\tilde{I}-I)^{2}
$$
&lt;ul>
&lt;li>Optimized with &lt;em>Stochastic Gradient Descent (SGD)&lt;/em>&lt;/li>
&lt;li>Gradients computed with &lt;em>Backpropagation&lt;/em>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
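&lt;p>The SGD-plus-backpropagation optimization above can be sketched on a single toy sample (illustrative numpy; the sizes and learning rate are made up, and sigmoid stands in for $f$):&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

D_I, D_H = 6, 4                     # made-up sizes
I = rng.normal(size=D_I)            # a single toy input
W_I = rng.normal(scale=0.1, size=(D_H, D_I)); b_I = np.zeros(D_H)
W_O = rng.normal(scale=0.1, size=(D_I, D_H)); b_O = np.zeros(D_I)

def forward(I):
    H = sigmoid(W_I @ I + b_I)      # H = f(W_I I + b_I)
    return H, W_O @ H + b_O         # linear decoder

_, I_t = forward(I)
loss0 = 0.5 * np.sum((I_t - I) ** 2)

lr = 0.1
for _ in range(200):                # plain SGD on the one sample
    H, I_t = forward(I)
    err = I_t - I                   # dL/dI_t for L = 0.5 (I_t - I)^2
    dW_O = np.outer(err, H); db_O = err         # decoder gradients
    dz = (W_O.T @ err) * H * (1 - H)            # sigmoid'(z) = H (1 - H)
    dW_I = np.outer(dz, I); db_I = dz           # encoder gradients
    W_O -= lr * dW_O; b_O -= lr * db_O
    W_I -= lr * dW_I; b_I -= lr * db_I

_, I_t = forward(I)
loss = 0.5 * np.sum((I_t - I) ** 2)
```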
&lt;h3 id="general-auto-encoder-structure">General auto-encoder structure&lt;/h3>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2020-08-17%2019.12.15.png" alt="截屏2020-08-17 19.12.15">&lt;/p>
&lt;ul>
&lt;li>
&lt;p>2 components in general&lt;/p>
&lt;ul>
&lt;li>Encoder: maps input $I$ to hidden $H$&lt;/li>
&lt;li>Decoder: &lt;strong>reconstructs&lt;/strong> $\tilde{I}$ from $H$&lt;/li>
&lt;/ul>
&lt;p>($f$ and $f^*$ depend on input data type)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Encoder and Decoder often have similar/reversed architectures&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h2 id="why-auto-encoders">Why Auto-Encoders?&lt;/h2>
&lt;p>With auto-encoders we can do&lt;/p>
&lt;ul>
&lt;li>&lt;a href="#compression-and-reconstruction">Compression &amp;amp; Reconstruction&lt;/a>&lt;/li>
&lt;li>&lt;a href="#unsupervised-pretraining">MLP training assistance&lt;/a>&lt;/li>
&lt;li>&lt;a href="#restricted-boltzmann-machine">Feature learning&lt;/a>&lt;/li>
&lt;li>Representation learning&lt;/li>
&lt;li>&lt;a href="#variational-auto-encoder">Sampling different variations of the inputs&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>There are many types and variations of auto-encoders&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Different architectures for different data types&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Different loss functions for different learning purposes&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h3 id="compression-and-reconstruction">Compression and Reconstruction&lt;/h3>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-08-17%2018.59.24.png" alt="截屏2020-08-17 18.59.24" style="zoom:80%;" />
&lt;ul>
&lt;li>
&lt;p>$D\_H &lt; D\_I$&lt;/p>
&lt;ul>
&lt;li>For example a flattened image: $D\_I = 1920 \times 1080 \times 3$&lt;/li>
&lt;li>Common hidden layer sizes: $512$ or $1024$&lt;/li>
&lt;/ul>
&lt;p>$\to$ Sending $H$ takes less bandwidth than $I$&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Sender uses $W\_I$ and $b\_I$ to compress $I$ into $H$&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Receiver uses $W\_O$ and $b\_O$ to reconstruct $\tilde{I}$&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>With &lt;strong>corrupted inputs&lt;/strong>&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-08-17%2021.23.56.png" alt="截屏2020-08-17 21.23.56" style="zoom: 50%;" />
&lt;ul>
&lt;li>
&lt;p>Deliberately corrupt inputs&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Train auto-encoders to regenerate the inputs before corruption&lt;/p>
&lt;/li>
&lt;li>
&lt;p>$D\_H &lt; D\_I$ NOT required (no risk of learning an identity function)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Benefit from a network with large capacity&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Different ways of corruption&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Images&lt;/strong>
&lt;ul>
&lt;li>Adding noise filters&lt;/li>
&lt;li>downscaling&lt;/li>
&lt;li>shifting&lt;/li>
&lt;li>&amp;hellip;&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Speech&lt;/strong>
&lt;ul>
&lt;li>simulating background noise&lt;/li>
&lt;li>Creating high-articulation effect&lt;/li>
&lt;li>&amp;hellip;&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Text&lt;/strong>: masking words/characters&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Application&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Deep Learning super sampling&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>use neural auto-encoders to generate HD frames from SD frames&lt;/li>
&lt;/ul>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2020-08-17%2021.30.49.png" alt="截屏2020-08-17 21.30.49">&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Denoising Speech from Microphones&lt;/strong>&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
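&lt;p>The corruption schemes listed above (adding noise, masking parts of the input) can be sketched as follows; the target of the denoising AE is always the &lt;em>clean&lt;/em> input (illustrative numpy; shapes and noise levels are made up):&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(0)
I_clean = rng.uniform(size=(16,))            # stand-in for a flattened image

# Corruption 1: additive Gaussian noise
I_noisy = I_clean + rng.normal(scale=0.1, size=I_clean.shape)

# Corruption 2: masking -- drop about 25% of the entries
mask = rng.random(I_clean.shape) > 0.25
I_masked = I_clean * mask

# A denoising AE is fed the corrupted version, but its reconstruction
# loss is computed against the clean input:
#   L = 0.5 * (AE(I_noisy) - I_clean)**2
```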
&lt;h3 id="unsupervised-pretraining">Unsupervised Pretraining&lt;/h3>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-08-17%2021.33.04.png" alt="截屏2020-08-17 21.33.04" style="zoom:50%;" />
&lt;p>Normal training regime&lt;/p>
&lt;ol>
&lt;li>Initialize the networks with random $W\_1, W\_2, W\_3$&lt;/li>
&lt;li>Forward pass to compute output $O$&lt;/li>
&lt;li>Get the loss function $L(O, Y)$&lt;/li>
&lt;li>Backward pass and update weights to minimize $L$&lt;/li>
&lt;/ol>
&lt;p>&lt;strong>Pretraining regime&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Find a way to have $W\_1, W\_2, W\_3$ &lt;strong>pretrained&lt;/strong> &amp;#x1f4aa;&lt;/li>
&lt;li>The weights are first used to optimize auxiliary objectives (here: reconstruction losses) before the supervised training starts&lt;/li>
&lt;/ul>
&lt;h4 id="layer-wise-pretraining">Layer-wise pretraining&lt;/h4>
&lt;h5 id="pretraining-first-layer">&lt;strong>Pretraining first layer&lt;/strong>&lt;/h5>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-08-17%2021.37.12.png" alt="截屏2020-08-17 21.37.12" style="zoom: 50%;" />
&lt;ol>
&lt;li>
&lt;p>Initialize $W\_1$ to encode, $W\_1^*$ to decode&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Forward pass&lt;/p>
&lt;ul>
&lt;li>
&lt;p>$I \to H\_1 \to I^*$&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Reconstruction loss:
&lt;/p>
$$
L = \frac{1}{2}(I^* - I)^2
$$
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Backward pass&lt;/p>
&lt;ul>
&lt;li>Compute gradients $\frac{\partial L}{\partial W_{1}}$ and $\frac{\partial L}{\partial W_{1}^*}$&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Update $W\_1$, $W\_1^*$ with SGD&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Repeat 1 to 4 until convergence&lt;/p>
&lt;/li>
&lt;/ol>
&lt;h5 id="pretraining-next-layers">Pretraining next layers&lt;/h5>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-08-17%2021.41.39.png" alt="截屏2020-08-17 21.41.39" style="zoom: 50%;" />
&lt;ul>
&lt;li>&lt;strong>Use $W\_1$ from previous pretraining&lt;/strong>&lt;/li>
&lt;/ul>
&lt;ol>
&lt;li>
&lt;p>Initialize $W\_2$ to encode, $W\_2^*$ to decode&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Forward pass&lt;/p>
&lt;ul>
&lt;li>
&lt;p>$I \to H\_1 \to H\_2 \to I^*$&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Reconstruction loss:
&lt;/p>
$$
L = \frac{1}{2}(H\_1^* - H\_1)^2
$$
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Backward pass&lt;/p>
&lt;ul>
&lt;li>Compute gradients $\frac{\partial L}{\partial W_{2}}$ and $\frac{\partial L}{\partial W_{2}^*}$&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Update $W\_2$, $W\_2^*$ with SGD and &lt;strong>keep $W\_1$ the same&lt;/strong>&lt;/p>
&lt;/li>
&lt;/ol>
&lt;h5 id="hidden-layers-pretraining-in-general">Hidden layers pretraining in general&lt;/h5>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-08-17%2021.46.02.png" alt="截屏2020-08-17 21.46.02" style="zoom:50%;" />
&lt;ul>
&lt;li>Each layer $H\_n$ is pretrained as an AE to reconstruct the input of that layer (i.e. $H\_{n-1}$)&lt;/li>
&lt;li>The backward pass is stopped at the input to prevent changing previous weights $W\_1, \dots, W\_{n-1}$ and ONLY update $W\_n, W\_n^*$&lt;/li>
&lt;li>&lt;span style="color:red">Complexity of each AE increases over depth&lt;/span> (since the forward pass requires all previously pretrained layers)&lt;/li>
&lt;/ul>
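&lt;p>The whole layer-wise procedure can be sketched as a loop: pretrain one AE per layer, freeze its encoder, and feed its codes to the next stage (a toy numpy sketch; the dataset, sizes, learning rate, and epoch count are all made up):&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

X = rng.normal(size=(100, 20))      # toy dataset
sizes = [20, 12, 8, 4]              # input size + three hidden layers
encoders = []                       # frozen (W_n, b_n) after each stage

data = X
for d_in, d_out in zip(sizes[:-1], sizes[1:]):
    # Pretrain one AE stage: encode data -> H, decode H -> reconstruction
    W = rng.normal(scale=0.1, size=(d_out, d_in)); b = np.zeros(d_out)
    W_s = rng.normal(scale=0.1, size=(d_in, d_out)); b_s = np.zeros(d_in)
    for _ in range(20):             # a few SGD passes over the set
        for x in data:
            h = sigmoid(W @ x + b)
            err = (W_s @ h + b_s) - x
            dh = (W_s.T @ err) * h * (1 - h)
            W_s -= 0.05 * np.outer(err, h); b_s -= 0.05 * err
            W -= 0.05 * np.outer(dh, x); b -= 0.05 * dh
    encoders.append((W, b))
    # Freeze this encoder; the next stage only ever sees its codes,
    # so the backward pass never reaches the earlier weights
    data = sigmoid(data @ W.T + b)
```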
&lt;h5 id="finetuning">Finetuning&lt;/h5>
&lt;ul>
&lt;li>Start the networks with &lt;strong>pretrained&lt;/strong> $W\_1, W\_2, W\_3$&lt;/li>
&lt;li>Go back to supervised training:
&lt;ol>
&lt;li>Forward pass to compute output $O$&lt;/li>
&lt;li>Get the loss function $L(O, Y)$&lt;/li>
&lt;li>Backward pass and update weights to minimize $L$&lt;/li>
&lt;/ol>
&lt;/li>
&lt;/ul>
&lt;blockquote>
&lt;p>This process is called &lt;strong>finetuning&lt;/strong> because the weights are NOT randomly initialized, but &lt;strong>carried over from an external process&lt;/strong>&lt;/p>
&lt;/blockquote>
&lt;h4 id="what-does-unsupervised-pretraining-help">What does “unsupervised pretraining” help?&lt;/h4>
&lt;p>According to &lt;a href="https://dl.acm.org/doi/10.5555/1756006.1756025">Why Does Unsupervised Pre-training Help Deep Learning?&lt;/a>&lt;/p>
&lt;ul>
&lt;li>Pretraining helps to make networks with 5 hidden layers converge&lt;/li>
&lt;li>Lower classification error rate&lt;/li>
&lt;li>Create a better starting point for the non-convex optimization process&lt;/li>
&lt;/ul>
&lt;h3 id="restricted-boltzmann-machine">Restricted Boltzmann Machine&lt;/h3>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-08-17%2022.35.40.png" alt="截屏2020-08-17 22.35.40" style="zoom: 50%;" />
&lt;ul>
&lt;li>
&lt;p>Structure&lt;/p>
&lt;ul>
&lt;li>Visible units (input data points $V$)&lt;/li>
&lt;li>Hidden units ($H$)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Given input $V$, we can generate the probabilities of hidden units being &lt;em>On (1) / Off (0)&lt;/em>
&lt;/p>
$$
p\left(h\_{j}=1 \mid V\right)=\sigma\left(b\_{j}+\sum\_{i=1}^{m} W\_{i j} v\_{i}\right)
$$
&lt;/li>
&lt;li>
&lt;p>Given the hidden units, we can generate the probabilities of visible units being &lt;em>On/Off&lt;/em>
&lt;/p>
$$
p\left(v\_{i}=1 \mid H\right)=\sigma\left(a\_{i}+\sum\_{j=1}^{F} W\_{i j} h\_{j}\right)
$$
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Energy function&lt;/strong> of a visible-hidden system
&lt;/p>
$$
E(V, H)=-\sum\_{i=1}^{m} \sum\_{j=1}^{F} W\_{i j} h\_{j} v\_{i}-\sum\_{i=1}^{m} v\_{i} a\_{i}-\sum\_{j=1}^{F} h\_{j} b\_{j}
$$
&lt;ul>
&lt;li>Train the network to minimize the energy function&lt;/li>
&lt;li>Use &lt;em>&lt;strong>Contrastive Divergence&lt;/strong>&lt;/em> algorithm&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
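&lt;p>The two conditionals and the energy function above can be sketched directly (illustrative numpy; the unit counts and random weights are made up):&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

m, F = 6, 4                              # visible / hidden unit counts
W = rng.normal(scale=0.5, size=(m, F))   # W[i, j] connects v_i and h_j
a = np.zeros(m)                          # visible biases
b = np.zeros(F)                          # hidden biases

V = rng.integers(0, 2, size=m)           # a binary visible vector

# p(h_j = 1 | V) = sigmoid(b_j + sum_i W_ij v_i)
p_h = sigmoid(b + V @ W)
H = (rng.random(F) < p_h).astype(int)    # sample the hidden units

# p(v_i = 1 | H) = sigmoid(a_i + sum_j W_ij h_j)
p_v = sigmoid(a + W @ H)

# Energy of the (V, H) configuration
E = -(V @ W @ H) - V @ a - H @ b
```

&lt;p>Contrastive Divergence training alternates exactly these two sampling steps to obtain its weight updates.&lt;/p>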
&lt;h4 id="layer-wise-pretraining-with-rbm">Layer-wise pretraining with RBM&lt;/h4>
&lt;div class="flex px-4 py-3 mb-6 rounded-md bg-primary-100 dark:bg-primary-900">
&lt;span class="pr-3 pt-1 text-primary-600 dark:text-primary-300">
&lt;svg height="24" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24">&lt;path fill="none" stroke="currentColor" stroke-linecap="round" stroke-linejoin="round" stroke-width="1.5" d="m11.25 11.25l.041-.02a.75.75 0 0 1 1.063.852l-.708 2.836a.75.75 0 0 0 1.063.853l.041-.021M21 12a9 9 0 1 1-18 0a9 9 0 0 1 18 0m-9-3.75h.008v.008H12z"/>&lt;/svg>
&lt;/span>
&lt;span class="dark:text-neutral-300">See also: &lt;a href="https://haobin-tan.netlify.app/docs/ai/deep-learning/unsupervised-learning/rbm/">Restricted Boltzman Machine&lt;/a>&lt;/span>
&lt;/div>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-08-17%2022.43.06.png" alt="截屏2020-08-17 22.43.06" style="zoom:67%;" />
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-08-17%2022.43.51.png" alt="截屏2020-08-17 22.43.51" style="zoom:50%;" />
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-08-17%2022.44.12.png" alt="截屏2020-08-17 22.44.12" style="zoom:50%;" />
&lt;h4 id="finetuning-rbm-deep-belief-network">Finetuning RBM: Deep Belief Network&lt;/h4>
&lt;ul>
&lt;li>The end result is called a &lt;strong>Deep Belief Network&lt;/strong>&lt;/li>
&lt;li>Use &lt;strong>pretrained&lt;/strong> $W\_1, W\_2, W\_3$ to convert the network into a typical MLP&lt;/li>
&lt;li>Go back to supervised training:
&lt;ol>
&lt;li>Forward pass to compute output $O$&lt;/li>
&lt;li>Get the loss function $L(O, Y)$&lt;/li>
&lt;li>Backward pass and update weights to minimize $L$&lt;/li>
&lt;/ol>
&lt;/li>
&lt;/ul>
&lt;h4 id="rbm-pretraining-application-in-speech">RBM Pretraining application in Speech&lt;/h4>
&lt;p>&lt;strong>Speech Recognition = Looking for the most probable transcription given an audio&lt;/strong>&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-08-17%2022.52.35.png" alt="截屏2020-08-17 22.52.35" style="zoom:40%;" />
&lt;p>💡 We can use (deep) neural networks to replace the non-neural generative models (Gaussian Mixture Models) in the Acoustic Models&lt;/p>
&lt;h2 id="variational-auto-encoder">Variational Auto-Encoder&lt;/h2>
&lt;p>&lt;strong>💡 Main idea: Enforce the hidden units to follow a unit Gaussian distribution (or another known distribution)&lt;/strong>&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-08-17%2023.16.51.png" alt="截屏2020-08-17 23.16.51" style="zoom:50%;" />
&lt;ul>
&lt;li>In AE we didn’t know the “distribution” of the (hidden) code&lt;/li>
&lt;li>Knowing the distribution in advance will make sampling easier&lt;/li>
&lt;/ul>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-08-17%2023.18.50.png" alt="截屏2020-08-17 23.18.50" style="zoom: 50%;" />
&lt;ul>
&lt;li>
&lt;p>Get the Gaussian restriction&lt;/p>
&lt;ul>
&lt;li>Each Gaussian is represented by mean $\mu$ and variance $\sigma$&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Why do we sample?&lt;/p>
&lt;ul>
&lt;li>The hidden layers’ neurons are then “arranged” in the Gaussian distribution&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>We want to enforce the hidden layer to follow a known distribution, for example $N(0, I)$, so we can add a loss term to do so:
&lt;/p>
$$
L=\frac{1}{2}(O-I)^{2}+\mathrm{KL}(\mathrm{N}(\mu, \sigma) \Vert \mathrm{N}(0, I))
$$
&lt;/li>
&lt;li>
&lt;p>Variational methods allow us to take a sample of the distribution being estimated, then get a “noisy” gradient for SGD&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Convergence can be achieved in practice&lt;/p>
&lt;/li>
&lt;/ul>
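&lt;p>For a diagonal Gaussian the KL term above has a closed form, and the sampling is done with the reparameterization trick so the "noisy" gradient can flow through $\mu$ and $\sigma$ (a sketch; the encoder outputs below are made up):&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Suppose the encoder produced these per-dimension parameters
mu = np.array([0.5, -0.2, 0.0])
log_var = np.array([-0.1, 0.3, 0.0])
sigma = np.exp(0.5 * log_var)

# Reparameterization trick: H = mu + sigma * eps with eps ~ N(0, I),
# so mu and sigma stay differentiable while H is still a sample
eps = rng.normal(size=mu.shape)
H = mu + sigma * eps

# Closed-form KL(N(mu, sigma) || N(0, I)) for a diagonal Gaussian
kl = 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)
```

&lt;p>The KL term is zero exactly when $\mu = 0$ and $\sigma = 1$, i.e. when the code already matches $N(0, I)$.&lt;/p>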
&lt;h2 id="structure-prediction">Structure Prediction&lt;/h2>
&lt;ul>
&lt;li>
&lt;p>Beyond auto-encoder&lt;/p>
&lt;ul>
&lt;li>Auto-Encoder
&lt;ul>
&lt;li>Given the object: reconstruct the object&lt;/li>
&lt;li>$P(X)$ is (implicitly) estimated via reconstructing the inputs&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Structure prediction
&lt;ul>
&lt;li>Given a part of the object: predict the remaining&lt;/li>
&lt;li>$P(X)$ is estimated by &lt;strong>factorizing&lt;/strong> the inputs&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Example&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-08-17%2023.27.32.png" alt="截屏2020-08-17 23.27.32" style="zoom: 40%;" />
&lt;/li>
&lt;/ul>
&lt;h3 id="pixel-models">Pixel Models&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>Assumption (biased): The pixels are generated from left to right, from top to bottom.&lt;/p>
&lt;p>(I.e. the content of each pixel depends only on the pixels to its left and in the rows above it)&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-08-17%2023.52.17.png" alt="截屏2020-08-17 23.52.17" style="zoom: 50%;" />
&lt;/li>
&lt;li>
&lt;p>We can estimate a probabilistic function to learn how to generate pixels&lt;/p>
&lt;ul>
&lt;li>Image $X = \\{x\_1, x\_2, \dots, x\_n\\}$ with $n$ pixels
$$
P(X)=\prod\_{i=1}^{n} p\left(x\_{i} \mid x\_{1}, \ldots x\_{i-1}\right)
$$&lt;/li>
&lt;/ul>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-08-17%2023.55.50.png" alt="截屏2020-08-17 23.55.50" style="zoom:50%;" />
&lt;p>Closer look:&lt;/p>
&lt;/li>
&lt;/ul>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-08-17%2023.57.43.png" alt="截屏2020-08-17 23.57.43" style="zoom:50%;" />
&lt;ul>
&lt;li>But this is quite difficult
&lt;ul>
&lt;li>The number of input pixels is a variable&lt;/li>
&lt;li>There are many pixels in an image&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>We can model such context dependency using many types of neural networks:
&lt;ul>
&lt;li>
&lt;p>Recurrent neural networks&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Convolutional neural networks&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Transformers / self-attention NNs&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
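&lt;p>The factorization $P(X)=\prod\_{i} p(x\_i \mid x\_1, \ldots, x\_{i-1})$ can be sketched with a deliberately trivial stand-in for the neural conditional; here a Laplace-smoothed running frequency plays the role of the network (purely illustrative):&lt;/p>

```python
import numpy as np

# Chain-rule factorization P(X) = prod_i p(x_i | x_1 .. x_{i-1}) over a
# toy "image" of binary pixels, flattened left-to-right, top-to-bottom.
def pixel_log_prob(pixels):
    log_p = 0.0
    for i, x in enumerate(pixels):
        prev = pixels[:i]
        # Hypothetical conditional model: smoothed fraction of 1s so far
        p_one = (np.sum(prev) + 1.0) / (len(prev) + 2.0)
        log_p += np.log(p_one if x == 1 else 1.0 - p_one)
    return log_p

X = [1, 1, 0, 1, 0, 0, 1, 1]
lp = pixel_log_prob(X)
```

&lt;p>Because every conditional is a proper distribution, the product sums to one over all possible images, so $P(X)$ is a genuine (if biased) likelihood.&lt;/p>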
&lt;h3 id="neural-language-models">(Neural) Language Models&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>A common model/application in natural language processing and generation (E.g. chatbots, translation, question answering)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Similar to the pixel models, we can assume the words are generated &lt;strong>from left to right&lt;/strong>
&lt;/p>
$$
P( \text{the end of our life} )=P( \text{the} ) \times P( \text{end} \mid \text{the} ) \times P(\text{of} \mid \text{the end} ) \times P( \text{our} \mid \text{the end of} ) \times P(\text{life} \mid \text{the end of our})
$$
&lt;/li>
&lt;li>
&lt;p>Each term can be estimated using neural networks under the form $P(x|context)$ with context being a series of words&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-08-18%2000.18.49.png" alt="截屏2020-08-18 00.18.49" style="zoom: 50%;" />
&lt;ul>
&lt;li>Input: context&lt;/li>
&lt;li>Output: classification with $V$ classes (the vocabulary size)
&lt;ul>
&lt;li>Most classes will have near 0 probabilities given the context&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
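&lt;p>The chain-rule product above can be sketched with a toy lookup table standing in for the neural $P(x \mid context)$ softmax over the vocabulary (all probabilities below are made up for illustration):&lt;/p>

```python
# Hypothetical conditional probabilities; a real language model would
# produce each of these with a V-class softmax over the vocabulary.
cond_prob = {
    ("the",): 0.05,
    ("end", "the"): 0.10,
    ("of", "the end"): 0.40,
    ("our", "the end of"): 0.02,
    ("life", "the end of our"): 0.15,
}

def sentence_prob(words):
    """P(sentence) = product of P(word | all previous words)."""
    p = 1.0
    for i, w in enumerate(words):
        context = " ".join(words[:i])
        key = (w,) if i == 0 else (w, context)
        p *= cond_prob[key]
    return p

p = sentence_prob(["the", "end", "of", "our", "life"])
```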
&lt;h3 id="summary">Summary&lt;/h3>
&lt;ul>
&lt;li>Structure prediction is
&lt;ul>
&lt;li>An explicit and flexible way to estimate the likelihood of data that can be factorized (at the cost of a bias)&lt;/li>
&lt;li>Motivation to develop a lot of flexible techniques
&lt;ul>
&lt;li>Such as sequence to sequence models, attention models&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>The bias is often the weakness 🤪&lt;/li>
&lt;/ul>
&lt;h2 id="reference">Reference&lt;/h2>
&lt;ul>
&lt;li>
&lt;p>&lt;a href="https://zhuanlan.zhihu.com/p/24813602">Auto-Encoder intuition&lt;/a>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;a href="https://www.edureka.co/blog/autoencoders-tutorial/">Autoencoders Tutorial : A Beginner’s Guide to Autoencoders&lt;/a>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;a href="https://wiki.pathmind.com/restricted-boltzmann-machine">A Beginner&amp;rsquo;s Guide to Restricted Boltzmann Machines (RBMs)&lt;/a>&lt;/p>
&lt;/li>
&lt;/ul></description></item><item><title>Hopfield Nets</title><link>https://haobin-tan.netlify.app/docs/ai/deep-learning/unsupervised-learning/hopfield-net/</link><pubDate>Tue, 18 Aug 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/deep-learning/unsupervised-learning/hopfield-net/</guid><description>&lt;h2 id="binary-hopfield-nets">&lt;strong>Binary Hopfield Nets&lt;/strong>&lt;/h2>
&lt;h3 id="basic-structure-binary-unit">Basic Structure: Binary Unit&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>Single layer of processing units&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Each unit $i$ has an activity value or “state” $u\_i$&lt;/p>
&lt;ul>
&lt;li>Binary: $-1$ or $1$&lt;/li>
&lt;li>Denoted as $+$ and $-$ respectively&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Example&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-08-18%2016.52.56.png" alt="截屏2020-08-18 16.52.56" style="zoom: 67%;" />
&lt;/li>
&lt;/ul>
&lt;h3 id="connections">Connections&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>Processing units fully interconnected&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Weights from unit $j$ to unit $i$: $T\_{ij}$&lt;/p>
&lt;/li>
&lt;li>
&lt;p>No unit has a connection with itself
&lt;/p>
$$
\forall i : \qquad T\_{ii} = 0
$$
&lt;/li>
&lt;li>
&lt;p>Weights between a pair of units are &lt;strong>symmetric&lt;/strong>
&lt;/p>
$$
T\_{ji} = T\_{ij}
$$
&lt;ul>
&lt;li>Symmetric weights guarantee that the network converges (relaxes into a stable state)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Example&lt;/p>
&lt;p>&lt;svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" version="1.1" width="501px" viewBox="-0.5 -0.5 501 403" content="&amp;lt;mxfile host=&amp;quot;app.diagrams.net&amp;quot; modified=&amp;quot;2020-08-18T20:20:59.064Z&amp;quot; agent=&amp;quot;5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36&amp;quot; etag=&amp;quot;icQIK4C8tDW4pi8ZyByC&amp;quot; version=&amp;quot;13.6.2&amp;quot; type=&amp;quot;device&amp;quot;&amp;gt;&amp;lt;diagram id=&amp;quot;5oDFmgM-eGvQfvP3TnwN&amp;quot; name=&amp;quot;Page-1&amp;quot;&amp;gt;7Zldb5swFIZ/DdJ20QlsyMdl89H1ots0Zdqaq8oCB7waHDkmgf36mWAgfGXtlgmS5iqc18bY57HfYykanPrRR47W3ifmYKoB3Yk0ONMAMEwwkD+JEqfKcGikgsuJozoVwoL8wkrUlRoSB29KHQVjVJB1WbRZEGBblDTEOduVu60YLX91jVxcExY2onX1B3GEl6oAwnHRcI+J66lPQ6irmfso663G2HjIYbtU2veBcw1OOWMiffKjKaZJ9rLEpAPdtbTmM+M4EC954WHMzOD569P8G7K/hPefw5nn3VhqbiLOVowdmQAVMi485rIA0XmhTjgLAwcno+oyKvo8MLaWoiHFn1iIWNFEoWBS8oRPVaucMI8f1fv7YJkEH6wsnEWHjbNYRSsWCDUoSFud24SxjAMW4FS5I5Sq/hvB2XPODUolXW2yxNYkZrRYyG18JHPZbkTcxeJIP5CjlocEMx/LFcn3OKZIkG15HkjtVjfvV/CUDwrpK/AOusVbEF2WgDbj/RPOKv5e4IVd4h3X8P47vIiIx4Pn5cFzwS0J4kOI9dN8ibDNLmGrcbeIhupLmjV99/1J6vo+lvY/MaT2vn7oKZUFM0nyziMCL9Zon46drNll+mizTqvoikTJLnoxhi3mAkfHQdQTp16AA1Uz1SXByOJdUXINVd1076Da5uLJkz26GudfniVwFmdJP8pXZfFkZpp55ivM1Lj4DdBp5QQtZgpyM705Sysdgt5Z6bAbK+3RiYDnYImw5UTAMz8R1cuFaXZ+IsyWVJtnf5Or2k8Pkm0035tbNrNcuignMs3TlFHGCzNZSSepSIgSN5ChLVOFpT5JEklsRG9Vg08ch7aRKxteldUJyFjVO/aoTmbUAMb8b1yaS/Bb4wKtMhcIuubSXAhaHeliyVh6mQwYdk2muW68PTID2Dcy1tXLkuo/7JuXDa5cGmp/7myn5yLD4h+tfdvBH4Nw/hs=&amp;lt;/diagram&amp;gt;&amp;lt;/mxfile&amp;gt;" onclick="(function(svg){var src=window.event.target||window.event.srcElement;while (src!=null&amp;amp;&amp;amp;src.nodeName.toLowerCase()!='a'){src=src.parentNode;}if(src==null){if(svg.wnd!=null&amp;amp;&amp;amp;!svg.wnd.closed){svg.wnd.focus();}else{var 
r=function(evt){if(evt.data=='ready'&amp;amp;&amp;amp;evt.source==svg.wnd){svg.wnd.postMessage(decodeURIComponent(svg.getAttribute('content')),'*');window.removeEventListener('message',r);}};window.addEventListener('message',r);svg.wnd=window.open('https://viewer.diagrams.net/?client=1&amp;amp;edit=_blank');}}})(this);" style="cursor:pointer;max-width:100%;max-height:403px;">&lt;defs>&lt;style xmlns="http://www.w3.org/1999/xhtml" type="text/css">.MathJax_Preview {color: #888}
#MathJax_Message {position: fixed; left: 1em; bottom: 1.5em; background-color: #E6E6E6; border: 1px solid #959595; margin: 0px; padding: 2px 8px; z-index: 102; color: black; font-size: 80%; width: auto; white-space: nowrap}
#MathJax_MSIE_Frame {position: absolute; top: 0; left: 0; width: 0px; z-index: 101; border: 0px; margin: 0px; padding: 0px}
.MathJax_Error {color: #CC0000; font-style: italic}
&lt;/style>&lt;style xmlns="http://www.w3.org/1999/xhtml" type="text/css">.MathJax_Hover_Frame {border-radius: .25em; -webkit-border-radius: .25em; -moz-border-radius: .25em; -khtml-border-radius: .25em; box-shadow: 0px 0px 15px #83A; -webkit-box-shadow: 0px 0px 15px #83A; -moz-box-shadow: 0px 0px 15px #83A; -khtml-box-shadow: 0px 0px 15px #83A; border: 1px solid #A6D ! important; display: inline-block; position: absolute}
.MathJax_Menu_Button .MathJax_Hover_Arrow {position: absolute; cursor: pointer; display: inline-block; border: 2px solid #AAA; border-radius: 4px; -webkit-border-radius: 4px; -moz-border-radius: 4px; -khtml-border-radius: 4px; font-family: &amp;lsquo;Courier New&amp;rsquo;,Courier; font-size: 9px; color: #F0F0F0}
.MathJax_Menu_Button .MathJax_Hover_Arrow span {display: block; background-color: #AAA; border: 1px solid; border-radius: 3px; line-height: 0; padding: 4px}
.MathJax_Hover_Arrow:hover {color: white!important; border: 2px solid #CCC!important}
.MathJax_Hover_Arrow:hover span {background-color: #CCC!important}
&lt;/style>&lt;style xmlns="http://www.w3.org/1999/xhtml" type="text/css">.MathJax_SVG_Display {text-align: center; margin: 1em 0em; position: relative; display: block!important; text-indent: 0; max-width: none; max-height: none; min-width: 0; min-height: 0; width: 100%}
.MathJax_SVG .MJX-monospace {font-family: monospace}
.MathJax_SVG .MJX-sans-serif {font-family: sans-serif}
#MathJax_SVG_Tooltip {background-color: InfoBackground; color: InfoText; border: 1px solid black; box-shadow: 2px 2px 5px #AAAAAA; -webkit-box-shadow: 2px 2px 5px #AAAAAA; -moz-box-shadow: 2px 2px 5px #AAAAAA; -khtml-box-shadow: 2px 2px 5px #AAAAAA; padding: 3px 4px; z-index: 401; position: absolute; left: 0; top: 0; width: auto; height: auto; display: none}
.MathJax_SVG {display: inline; font-style: normal; font-weight: normal; line-height: normal; font-size: 100%; font-size-adjust: none; text-indent: 0; text-align: left; text-transform: none; letter-spacing: normal; word-spacing: normal; word-wrap: normal; white-space: nowrap; float: none; direction: ltr; max-width: none; max-height: none; min-width: 0; min-height: 0; border: 0; padding: 0; margin: 0}
.MathJax_SVG * {transition: none; -webkit-transition: none; -moz-transition: none; -ms-transition: none; -o-transition: none}
.MathJax_SVG &amp;gt; div {display: inline-block}
.mjx-svg-href {fill: blue; stroke: blue}
.MathJax_SVG_Processing {visibility: hidden; position: absolute; top: 0; left: 0; width: 0; height: 0; overflow: hidden; display: block!important}
.MathJax_SVG_Processed {display: none!important}
.MathJax_SVG_test {font-style: normal; font-weight: normal; font-size: 100%; font-size-adjust: none; text-indent: 0; text-transform: none; letter-spacing: normal; word-spacing: normal; overflow: hidden; height: 1px}
.MathJax_SVG_test.mjx-test-display {display: table!important}
.MathJax_SVG_test.mjx-test-inline {display: inline!important; margin-right: -1px}
.MathJax_SVG_test.mjx-test-default {display: block!important; clear: both}
.MathJax_SVG_ex_box {display: inline-block!important; position: absolute; overflow: hidden; min-height: 0; max-height: none; padding: 0; border: 0; margin: 0; width: 1px; height: 60ex}
.mjx-test-inline .MathJax_SVG_left_box {display: inline-block; width: 0; float: left}
.mjx-test-inline .MathJax_SVG_right_box {display: inline-block; width: 0; float: right}
.mjx-test-display .MathJax_SVG_right_box {display: table-cell!important; width: 10000em!important; min-width: 0; max-width: none; padding: 0; border: 0; margin: 0}
.MathJax_SVG .noError {vertical-align: ; font-size: 90%; text-align: left; color: black; padding: 1px 3px; border: 1px solid}
&lt;/style>&lt;/defs>&lt;g>&lt;path d="M 130 61 L 370 61" fill="none" stroke="#000000" stroke-width="3" stroke-miterlimit="10" pointer-events="stroke"/>&lt;path d="M 70 121 L 70 281" fill="none" stroke="#000000" stroke-width="3" stroke-miterlimit="10" pointer-events="stroke"/>&lt;path d="M 112.43 103.43 L 387.57 298.57" fill="none" stroke="#000000" stroke-width="3" stroke-miterlimit="10" pointer-events="stroke"/>&lt;ellipse cx="70" cy="61" rx="60" ry="60" fill="#ffffff" stroke="#000000" stroke-width="3" pointer-events="all"/>&lt;g transform="translate(-0.5 -0.5)">&lt;switch>&lt;foreignObject style="overflow: visible; text-align: left;" pointer-events="none" width="100%" height="100%" requiredFeatures="http://www.w3.org/TR/SVG11/feature#Extensibility">&lt;div xmlns="http://www.w3.org/1999/xhtml" style="display: flex; align-items: unsafe center; justify-content: unsafe center; width: 118px; height: 1px; padding-top: 61px; margin-left: 11px;">&lt;div style="box-sizing: border-box; font-size: 0; text-align: center; ">&lt;div style="display: inline-block; font-size: 20px; font-family: Helvetica; color: #000000; line-height: 1.2; pointer-events: all; white-space: normal; word-wrap: normal; ">&lt;span class="MathJax_Preview" style="">&lt;/span>&lt;span class="MathJax_SVG" id="MathJax-Element-16-Frame" tabindex="0" style="font-size: 100%; display: inline-block;">&lt;svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" width="8.479ex" height="2.21ex" viewBox="0 -743.6 3650.5 951.6" role="img" focusable="false" style="vertical-align: -0.483ex;">&lt;g stroke="currentColor" fill="currentColor" stroke-width="0" transform="matrix(1 0 0 -1 0 0)">&lt;path stroke-width="1" d="M52 648Q52 670 65 683H76Q118 680 181 680Q299 680 320 683H330Q336 677 336 674T334 656Q329 641 325 637H304Q282 635 274 635Q245 630 242 620Q242 618 271 369T301 118L374 235Q447 352 520 471T595 594Q599 601 599 609Q599 633 555 637Q537 637 537 648Q537 649 539 661Q542 675 545 679T558 
683Q560 683 570 683T604 682T668 681Q737 681 755 683H762Q769 676 769 672Q769 655 760 640Q757 637 743 637Q730 636 719 635T698 630T682 623T670 615T660 608T652 599T645 592L452 282Q272 -9 266 -16Q263 -18 259 -21L241 -22H234Q216 -22 216 -15Q213 -9 177 305Q139 623 138 626Q133 637 76 637H59Q52 642 52 648Z"/>&lt;g transform="translate(583,-150)">&lt;path stroke-width="1" transform="scale(0.707)" d="M213 578L200 573Q186 568 160 563T102 556H83V602H102Q149 604 189 617T245 641T273 663Q275 666 285 666Q294 666 302 660V361L303 61Q310 54 315 52T339 48T401 46H427V0H416Q395 3 257 3Q121 3 100 0H88V46H114Q136 46 152 46T177 47T193 50T201 52T207 57T213 61V578Z"/>&lt;/g>&lt;g transform="translate(1315,0)">&lt;path stroke-width="1" d="M56 347Q56 360 70 367H707Q722 359 722 347Q722 336 708 328L390 327H72Q56 332 56 347ZM56 153Q56 168 72 173H708Q722 163 722 153Q722 140 707 133H70Q56 140 56 153Z"/>&lt;/g>&lt;g transform="translate(2371,0)">&lt;path stroke-width="1" d="M56 237T56 250T70 270H369V420L370 570Q380 583 389 583Q402 583 409 568V270H707Q722 262 722 250T707 230H409V-68Q401 -82 391 -82H389H387Q375 -82 369 -68V230H70Q56 237 56 250Z"/>&lt;/g>&lt;g transform="translate(3149,0)">&lt;path stroke-width="1" d="M213 578L200 573Q186 568 160 563T102 556H83V602H102Q149 604 189 617T245 641T273 663Q275 666 285 666Q294 666 302 660V361L303 61Q310 54 315 52T339 48T401 46H427V0H416Q395 3 257 3Q121 3 100 0H88V46H114Q136 46 152 46T177 47T193 50T201 52T207 57T213 61V578Z"/>&lt;/g>&lt;/g>&lt;/svg>&lt;/span>&lt;script type="math/tex" id="MathJax-Element-16">V_1 = +1&lt;/script>&lt;/div>&lt;/div>&lt;/div>&lt;/foreignObject>&lt;text x="70" y="67" fill="#000000" font-family="Helvetica" font-size="20px" text-anchor="middle">V_1 = +1&lt;/text>&lt;/switch>&lt;/g>&lt;path d="M 430 121 L 430 281" fill="none" stroke="#000000" stroke-width="3" stroke-miterlimit="10" pointer-events="stroke"/>&lt;path d="M 387.57 103.43 L 112.43 298.57" fill="none" stroke="#000000" stroke-width="3" stroke-miterlimit="10" 
pointer-events="stroke"/>&lt;ellipse cx="430" cy="61" rx="60" ry="60" fill="#ffffff" stroke="#000000" stroke-width="3" pointer-events="all"/>&lt;g transform="translate(-0.5 -0.5)">&lt;switch>&lt;foreignObject style="overflow: visible; text-align: left;" pointer-events="none" width="100%" height="100%" requiredFeatures="http://www.w3.org/TR/SVG11/feature#Extensibility">&lt;div xmlns="http://www.w3.org/1999/xhtml" style="display: flex; align-items: unsafe center; justify-content: unsafe center; width: 118px; height: 1px; padding-top: 61px; margin-left: 371px;">&lt;div style="box-sizing: border-box; font-size: 0; text-align: center; ">&lt;div style="display: inline-block; font-size: 20px; font-family: Helvetica; color: #000000; line-height: 1.2; pointer-events: all; white-space: normal; word-wrap: normal; ">&lt;span class="MathJax_Preview" style="">&lt;/span>&lt;span class="MathJax_SVG" id="MathJax-Element-4-Frame" tabindex="0" style="font-size: 100%; display: inline-block;">&lt;svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" width="8.479ex" height="2.21ex" viewBox="0 -743.6 3650.5 951.6" role="img" focusable="false" style="vertical-align: -0.483ex;">&lt;g stroke="currentColor" fill="currentColor" stroke-width="0" transform="matrix(1 0 0 -1 0 0)">&lt;path stroke-width="1" d="M52 648Q52 670 65 683H76Q118 680 181 680Q299 680 320 683H330Q336 677 336 674T334 656Q329 641 325 637H304Q282 635 274 635Q245 630 242 620Q242 618 271 369T301 118L374 235Q447 352 520 471T595 594Q599 601 599 609Q599 633 555 637Q537 637 537 648Q537 649 539 661Q542 675 545 679T558 683Q560 683 570 683T604 682T668 681Q737 681 755 683H762Q769 676 769 672Q769 655 760 640Q757 637 743 637Q730 636 719 635T698 630T682 623T670 615T660 608T652 599T645 592L452 282Q272 -9 266 -16Q263 -18 259 -21L241 -22H234Q216 -22 216 -15Q213 -9 177 305Q139 623 138 626Q133 637 76 637H59Q52 642 52 648Z"/>&lt;g transform="translate(583,-150)">&lt;path stroke-width="1" transform="scale(0.707)" 
d="M109 429Q82 429 66 447T50 491Q50 562 103 614T235 666Q326 666 387 610T449 465Q449 422 429 383T381 315T301 241Q265 210 201 149L142 93L218 92Q375 92 385 97Q392 99 409 186V189H449V186Q448 183 436 95T421 3V0H50V19V31Q50 38 56 46T86 81Q115 113 136 137Q145 147 170 174T204 211T233 244T261 278T284 308T305 340T320 369T333 401T340 431T343 464Q343 527 309 573T212 619Q179 619 154 602T119 569T109 550Q109 549 114 549Q132 549 151 535T170 489Q170 464 154 447T109 429Z"/>&lt;/g>&lt;g transform="translate(1315,0)">&lt;path stroke-width="1" d="M56 347Q56 360 70 367H707Q722 359 722 347Q722 336 708 328L390 327H72Q56 332 56 347ZM56 153Q56 168 72 173H708Q722 163 722 153Q722 140 707 133H70Q56 140 56 153Z"/>&lt;/g>&lt;g transform="translate(2371,0)">&lt;path stroke-width="1" d="M84 237T84 250T98 270H679Q694 262 694 250T679 230H98Q84 237 84 250Z"/>&lt;/g>&lt;g transform="translate(3149,0)">&lt;path stroke-width="1" d="M213 578L200 573Q186 568 160 563T102 556H83V602H102Q149 604 189 617T245 641T273 663Q275 666 285 666Q294 666 302 660V361L303 61Q310 54 315 52T339 48T401 46H427V0H416Q395 3 257 3Q121 3 100 0H88V46H114Q136 46 152 46T177 47T193 50T201 52T207 57T213 61V578Z"/>&lt;/g>&lt;/g>&lt;/svg>&lt;/span>&lt;script type="math/tex" id="MathJax-Element-4">V_2 = -1&lt;/script>&lt;/div>&lt;/div>&lt;/div>&lt;/foreignObject>&lt;text x="430" y="67" fill="#000000" font-family="Helvetica" font-size="20px" text-anchor="middle">V_2 = -1&lt;/text>&lt;/switch>&lt;/g>&lt;path d="M 130 341 L 370 341" fill="none" stroke="#000000" stroke-width="3" stroke-miterlimit="10" pointer-events="stroke"/>&lt;ellipse cx="70" cy="341" rx="60" ry="60" fill="#ffffff" stroke="#000000" stroke-width="3" pointer-events="all"/>&lt;g transform="translate(-0.5 -0.5)">&lt;switch>&lt;foreignObject style="overflow: visible; text-align: left;" pointer-events="none" width="100%" height="100%" requiredFeatures="http://www.w3.org/TR/SVG11/feature#Extensibility">&lt;div xmlns="http://www.w3.org/1999/xhtml" style="display: flex; 
align-items: unsafe center; justify-content: unsafe center; width: 118px; height: 1px; padding-top: 341px; margin-left: 11px;">&lt;div style="box-sizing: border-box; font-size: 0; text-align: center; ">&lt;div style="display: inline-block; font-size: 20px; font-family: Helvetica; color: #000000; line-height: 1.2; pointer-events: all; white-space: normal; word-wrap: normal; ">&lt;span class="MathJax_Preview" style="">&lt;/span>&lt;span class="MathJax_SVG" id="MathJax-Element-8-Frame" tabindex="0" style="font-size: 100%; display: inline-block;">&lt;svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" width="8.479ex" height="2.306ex" viewBox="0 -743.6 3650.5 992.8" role="img" focusable="false" style="vertical-align: -0.579ex;">&lt;g stroke="currentColor" fill="currentColor" stroke-width="0" transform="matrix(1 0 0 -1 0 0)">&lt;path stroke-width="1" d="M52 648Q52 670 65 683H76Q118 680 181 680Q299 680 320 683H330Q336 677 336 674T334 656Q329 641 325 637H304Q282 635 274 635Q245 630 242 620Q242 618 271 369T301 118L374 235Q447 352 520 471T595 594Q599 601 599 609Q599 633 555 637Q537 637 537 648Q537 649 539 661Q542 675 545 679T558 683Q560 683 570 683T604 682T668 681Q737 681 755 683H762Q769 676 769 672Q769 655 760 640Q757 637 743 637Q730 636 719 635T698 630T682 623T670 615T660 608T652 599T645 592L452 282Q272 -9 266 -16Q263 -18 259 -21L241 -22H234Q216 -22 216 -15Q213 -9 177 305Q139 623 138 626Q133 637 76 637H59Q52 642 52 648Z"/>&lt;g transform="translate(583,-150)">&lt;path stroke-width="1" transform="scale(0.707)" d="M127 463Q100 463 85 480T69 524Q69 579 117 622T233 665Q268 665 277 664Q351 652 390 611T430 522Q430 470 396 421T302 350L299 348Q299 347 308 345T337 336T375 315Q457 262 457 175Q457 96 395 37T238 -22Q158 -22 100 21T42 130Q42 158 60 175T105 193Q133 193 151 175T169 130Q169 119 166 110T159 94T148 82T136 74T126 70T118 67L114 66Q165 21 238 21Q293 21 321 74Q338 107 338 175V195Q338 290 274 322Q259 328 213 329L171 330L168 332Q166 335 166 348Q166 
366 174 366Q202 366 232 371Q266 376 294 413T322 525V533Q322 590 287 612Q265 626 240 626Q208 626 181 615T143 592T132 580H135Q138 579 143 578T153 573T165 566T175 555T183 540T186 520Q186 498 172 481T127 463Z"/>&lt;/g>&lt;g transform="translate(1315,0)">&lt;path stroke-width="1" d="M56 347Q56 360 70 367H707Q722 359 722 347Q722 336 708 328L390 327H72Q56 332 56 347ZM56 153Q56 168 72 173H708Q722 163 722 153Q722 140 707 133H70Q56 140 56 153Z"/>&lt;/g>&lt;g transform="translate(2371,0)">&lt;path stroke-width="1" d="M84 237T84 250T98 270H679Q694 262 694 250T679 230H98Q84 237 84 250Z"/>&lt;/g>&lt;g transform="translate(3149,0)">&lt;path stroke-width="1" d="M213 578L200 573Q186 568 160 563T102 556H83V602H102Q149 604 189 617T245 641T273 663Q275 666 285 666Q294 666 302 660V361L303 61Q310 54 315 52T339 48T401 46H427V0H416Q395 3 257 3Q121 3 100 0H88V46H114Q136 46 152 46T177 47T193 50T201 52T207 57T213 61V578Z"/>&lt;/g>&lt;/g>&lt;/svg>&lt;/span>&lt;script type="math/tex" id="MathJax-Element-8">V_3 = -1&lt;/script>&lt;/div>&lt;/div>&lt;/div>&lt;/foreignObject>&lt;text x="70" y="347" fill="#000000" font-family="Helvetica" font-size="20px" text-anchor="middle">V_3 = -1&lt;/text>&lt;/switch>&lt;/g>&lt;ellipse cx="430" cy="341" rx="60" ry="60" fill="#ffffff" stroke="#000000" stroke-width="3" pointer-events="all"/>&lt;g transform="translate(-0.5 -0.5)">&lt;switch>&lt;foreignObject style="overflow: visible; text-align: left;" pointer-events="none" width="100%" height="100%" requiredFeatures="http://www.w3.org/TR/SVG11/feature#Extensibility">&lt;div xmlns="http://www.w3.org/1999/xhtml" style="display: flex; align-items: unsafe center; justify-content: unsafe center; width: 118px; height: 1px; padding-top: 341px; margin-left: 371px;">&lt;div style="box-sizing: border-box; font-size: 0; text-align: center; ">&lt;div style="display: inline-block; font-size: 20px; font-family: Helvetica; color: #000000; line-height: 1.2; pointer-events: all; white-space: normal; word-wrap: normal; ">&lt;span 
class="MathJax_Preview" style="">&lt;/span>&lt;span class="MathJax_SVG" id="MathJax-Element-17-Frame" tabindex="0" style="font-size: 100%; display: inline-block;">&lt;svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" width="8.479ex" height="2.21ex" viewBox="0 -743.6 3650.5 951.6" role="img" focusable="false" style="vertical-align: -0.483ex;">&lt;g stroke="currentColor" fill="currentColor" stroke-width="0" transform="matrix(1 0 0 -1 0 0)">&lt;path stroke-width="1" d="M52 648Q52 670 65 683H76Q118 680 181 680Q299 680 320 683H330Q336 677 336 674T334 656Q329 641 325 637H304Q282 635 274 635Q245 630 242 620Q242 618 271 369T301 118L374 235Q447 352 520 471T595 594Q599 601 599 609Q599 633 555 637Q537 637 537 648Q537 649 539 661Q542 675 545 679T558 683Q560 683 570 683T604 682T668 681Q737 681 755 683H762Q769 676 769 672Q769 655 760 640Q757 637 743 637Q730 636 719 635T698 630T682 623T670 615T660 608T652 599T645 592L452 282Q272 -9 266 -16Q263 -18 259 -21L241 -22H234Q216 -22 216 -15Q213 -9 177 305Q139 623 138 626Q133 637 76 637H59Q52 642 52 648Z"/>&lt;g transform="translate(583,-150)">&lt;path stroke-width="1" transform="scale(0.707)" d="M462 0Q444 3 333 3Q217 3 199 0H190V46H221Q241 46 248 46T265 48T279 53T286 61Q287 63 287 115V165H28V211L179 442Q332 674 334 675Q336 677 355 677H373L379 671V211H471V165H379V114Q379 73 379 66T385 54Q393 47 442 46H471V0H462ZM293 211V545L74 212L183 211H293Z"/>&lt;/g>&lt;g transform="translate(1315,0)">&lt;path stroke-width="1" d="M56 347Q56 360 70 367H707Q722 359 722 347Q722 336 708 328L390 327H72Q56 332 56 347ZM56 153Q56 168 72 173H708Q722 163 722 153Q722 140 707 133H70Q56 140 56 153Z"/>&lt;/g>&lt;g transform="translate(2371,0)">&lt;path stroke-width="1" d="M56 237T56 250T70 270H369V420L370 570Q380 583 389 583Q402 583 409 568V270H707Q722 262 722 250T707 230H409V-68Q401 -82 391 -82H389H387Q375 -82 369 -68V230H70Q56 237 56 250Z"/>&lt;/g>&lt;g transform="translate(3149,0)">&lt;path stroke-width="1" d="M213 578L200 573Q186 
568 160 563T102 556H83V602H102Q149 604 189 617T245 641T273 663Q275 666 285 666Q294 666 302 660V361L303 61Q310 54 315 52T339 48T401 46H427V0H416Q395 3 257 3Q121 3 100 0H88V46H114Q136 46 152 46T177 47T193 50T201 52T207 57T213 61V578Z"/>&lt;/g>&lt;/g>&lt;/svg>&lt;/span>&lt;script type="math/tex" id="MathJax-Element-17">V_4 = +1&lt;/script>&lt;/div>&lt;/div>&lt;/div>&lt;/foreignObject>&lt;text x="430" y="347" fill="#000000" font-family="Helvetica" font-size="20px" text-anchor="middle">V_4 = +1&lt;/text>&lt;/switch>&lt;/g>&lt;rect x="210" y="21" width="80" height="40" fill="none" stroke="none" pointer-events="all"/>&lt;g transform="translate(-0.5 -0.5)">&lt;switch>&lt;foreignObject style="overflow: visible; text-align: left;" pointer-events="none" width="100%" height="100%" requiredFeatures="http://www.w3.org/TR/SVG11/feature#Extensibility">&lt;div xmlns="http://www.w3.org/1999/xhtml" style="display: flex; align-items: unsafe center; justify-content: unsafe center; width: 78px; height: 1px; padding-top: 41px; margin-left: 211px;">&lt;div style="box-sizing: border-box; font-size: 0; text-align: center; ">&lt;div style="display: inline-block; font-size: 20px; font-family: Helvetica; color: #000000; line-height: 1.2; pointer-events: all; white-space: normal; word-wrap: normal; ">&lt;span class="MathJax_Preview" style="">&lt;/span>&lt;span class="MathJax_SVG" id="MathJax-Element-7-Frame" tabindex="0" style="font-size: 100%; display: inline-block;">&lt;svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" width="2.971ex" height="2.115ex" viewBox="0 -743.6 1279 910.4" role="img" focusable="false" style="vertical-align: -0.387ex;">&lt;g stroke="currentColor" fill="currentColor" stroke-width="0" transform="matrix(1 0 0 -1 0 0)">&lt;path stroke-width="1" d="M84 237T84 250T98 270H679Q694 262 694 250T679 230H98Q84 237 84 250Z"/>&lt;g transform="translate(778,0)">&lt;path stroke-width="1" d="M213 578L200 573Q186 568 160 563T102 556H83V602H102Q149 604 189 
617T245 641T273 663Q275 666 285 666Q294 666 302 660V361L303 61Q310 54 315 52T339 48T401 46H427V0H416Q395 3 257 3Q121 3 100 0H88V46H114Q136 46 152 46T177 47T193 50T201 52T207 57T213 61V578Z"/>&lt;/g>&lt;/g>&lt;/svg>&lt;/span>&lt;script type="math/tex" id="MathJax-Element-7">-1&lt;/script>&lt;/div>&lt;/div>&lt;/div>&lt;/foreignObject>&lt;text x="250" y="47" fill="#000000" font-family="Helvetica" font-size="20px" text-anchor="middle">-1&lt;/text>&lt;/switch>&lt;/g>&lt;rect x="0" y="161" width="80" height="40" fill="none" stroke="none" pointer-events="all"/>&lt;g transform="translate(-0.5 -0.5)">&lt;switch>&lt;foreignObject style="overflow: visible; text-align: left;" pointer-events="none" width="100%" height="100%" requiredFeatures="http://www.w3.org/TR/SVG11/feature#Extensibility">&lt;div xmlns="http://www.w3.org/1999/xhtml" style="display: flex; align-items: unsafe center; justify-content: unsafe center; width: 78px; height: 1px; padding-top: 181px; margin-left: 1px;">&lt;div style="box-sizing: border-box; font-size: 0; text-align: center; ">&lt;div style="display: inline-block; font-size: 20px; font-family: Helvetica; color: #000000; line-height: 1.2; pointer-events: all; white-space: normal; word-wrap: normal; ">&lt;span class="MathJax_Preview" style="">&lt;/span>&lt;span class="MathJax_SVG" id="MathJax-Element-10-Frame" tabindex="0" style="font-size: 100%; display: inline-block;">&lt;svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" width="2.971ex" height="2.115ex" viewBox="0 -743.6 1279 910.4" role="img" focusable="false" style="vertical-align: -0.387ex;">&lt;g stroke="currentColor" fill="currentColor" stroke-width="0" transform="matrix(1 0 0 -1 0 0)">&lt;path stroke-width="1" d="M84 237T84 250T98 270H679Q694 262 694 250T679 230H98Q84 237 84 250Z"/>&lt;g transform="translate(778,0)">&lt;path stroke-width="1" d="M213 578L200 573Q186 568 160 563T102 556H83V602H102Q149 604 189 617T245 641T273 663Q275 666 285 666Q294 666 302 
660V361L303 61Q310 54 315 52T339 48T401 46H427V0H416Q395 3 257 3Q121 3 100 0H88V46H114Q136 46 152 46T177 47T193 50T201 52T207 57T213 61V578Z"/>&lt;/g>&lt;/g>&lt;/svg>&lt;/span>&lt;script type="math/tex" id="MathJax-Element-10">-1&lt;/script>&lt;/div>&lt;/div>&lt;/div>&lt;/foreignObject>&lt;text x="40" y="187" fill="#000000" font-family="Helvetica" font-size="20px" text-anchor="middle">-1&lt;/text>&lt;/switch>&lt;/g>&lt;rect x="150" y="111" width="80" height="40" fill="none" stroke="none" pointer-events="all"/>&lt;g transform="translate(-0.5 -0.5)">&lt;switch>&lt;foreignObject style="overflow: visible; text-align: left;" pointer-events="none" width="100%" height="100%" requiredFeatures="http://www.w3.org/TR/SVG11/feature#Extensibility">&lt;div xmlns="http://www.w3.org/1999/xhtml" style="display: flex; align-items: unsafe center; justify-content: unsafe center; width: 78px; height: 1px; padding-top: 131px; margin-left: 151px;">&lt;div style="box-sizing: border-box; font-size: 0; text-align: center; ">&lt;div style="display: inline-block; font-size: 20px; font-family: Helvetica; color: #000000; line-height: 1.2; pointer-events: all; white-space: normal; word-wrap: normal; ">&lt;span class="MathJax_Preview" style="">&lt;/span>&lt;span class="MathJax_SVG" id="MathJax-Element-12-Frame" tabindex="0" style="font-size: 100%; display: inline-block;">&lt;svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" width="2.971ex" height="2.115ex" viewBox="0 -743.6 1279 910.4" role="img" focusable="false" style="vertical-align: -0.387ex;">&lt;g stroke="currentColor" fill="currentColor" stroke-width="0" transform="matrix(1 0 0 -1 0 0)">&lt;path stroke-width="1" d="M56 237T56 250T70 270H369V420L370 570Q380 583 389 583Q402 583 409 568V270H707Q722 262 722 250T707 230H409V-68Q401 -82 391 -82H389H387Q375 -82 369 -68V230H70Q56 237 56 250Z"/>&lt;g transform="translate(778,0)">&lt;path stroke-width="1" d="M213 578L200 573Q186 568 160 563T102 556H83V602H102Q149 604 
189 617T245 641T273 663Q275 666 285 666Q294 666 302 660V361L303 61Q310 54 315 52T339 48T401 46H427V0H416Q395 3 257 3Q121 3 100 0H88V46H114Q136 46 152 46T177 47T193 50T201 52T207 57T213 61V578Z"/>&lt;/g>&lt;/g>&lt;/svg>&lt;/span>&lt;script type="math/tex" id="MathJax-Element-12">+1&lt;/script>&lt;/div>&lt;/div>&lt;/div>&lt;/foreignObject>&lt;text x="190" y="137" fill="#000000" font-family="Helvetica" font-size="20px" text-anchor="middle">+1&lt;/text>&lt;/switch>&lt;/g>&lt;rect x="280" y="111" width="80" height="40" fill="none" stroke="none" pointer-events="all"/>&lt;g transform="translate(-0.5 -0.5)">&lt;switch>&lt;foreignObject style="overflow: visible; text-align: left;" pointer-events="none" width="100%" height="100%" requiredFeatures="http://www.w3.org/TR/SVG11/feature#Extensibility">&lt;div xmlns="http://www.w3.org/1999/xhtml" style="display: flex; align-items: unsafe center; justify-content: unsafe center; width: 78px; height: 1px; padding-top: 131px; margin-left: 281px;">&lt;div style="box-sizing: border-box; font-size: 0; text-align: center; ">&lt;div style="display: inline-block; font-size: 20px; font-family: Helvetica; color: #000000; line-height: 1.2; pointer-events: all; white-space: normal; word-wrap: normal; ">&lt;span class="MathJax_Preview" style="">&lt;/span>&lt;span class="MathJax_SVG" id="MathJax-Element-13-Frame" tabindex="0" style="font-size: 100%; display: inline-block;">&lt;svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" width="2.971ex" height="2.115ex" viewBox="0 -743.6 1279 910.4" role="img" focusable="false" style="vertical-align: -0.387ex;">&lt;g stroke="currentColor" fill="currentColor" stroke-width="0" transform="matrix(1 0 0 -1 0 0)">&lt;path stroke-width="1" d="M56 237T56 250T70 270H369V420L370 570Q380 583 389 583Q402 583 409 568V270H707Q722 262 722 250T707 230H409V-68Q401 -82 391 -82H389H387Q375 -82 369 -68V230H70Q56 237 56 250Z"/>&lt;g transform="translate(778,0)">&lt;path stroke-width="1" d="M213 
578L200 573Q186 568 160 563T102 556H83V602H102Q149 604 189 617T245 641T273 663Q275 666 285 666Q294 666 302 660V361L303 61Q310 54 315 52T339 48T401 46H427V0H416Q395 3 257 3Q121 3 100 0H88V46H114Q136 46 152 46T177 47T193 50T201 52T207 57T213 61V578Z"/>&lt;/g>&lt;/g>&lt;/svg>&lt;/span>&lt;script type="math/tex" id="MathJax-Element-13">+1&lt;/script>&lt;/div>&lt;/div>&lt;/div>&lt;/foreignObject>&lt;text x="320" y="137" fill="#000000" font-family="Helvetica" font-size="20px" text-anchor="middle">+1&lt;/text>&lt;/switch>&lt;/g>&lt;rect x="420" y="161" width="80" height="40" fill="none" stroke="none" pointer-events="all"/>&lt;g transform="translate(-0.5 -0.5)">&lt;switch>&lt;foreignObject style="overflow: visible; text-align: left;" pointer-events="none" width="100%" height="100%" requiredFeatures="http://www.w3.org/TR/SVG11/feature#Extensibility">&lt;div xmlns="http://www.w3.org/1999/xhtml" style="display: flex; align-items: unsafe center; justify-content: unsafe center; width: 78px; height: 1px; padding-top: 181px; margin-left: 421px;">&lt;div style="box-sizing: border-box; font-size: 0; text-align: center; ">&lt;div style="display: inline-block; font-size: 20px; font-family: Helvetica; color: #000000; line-height: 1.2; pointer-events: all; white-space: normal; word-wrap: normal; ">&lt;span class="MathJax_Preview" style="">&lt;/span>&lt;span class="MathJax_SVG" id="MathJax-Element-14-Frame" tabindex="0" style="font-size: 100%; display: inline-block;">&lt;svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" width="2.971ex" height="2.115ex" viewBox="0 -743.6 1279 910.4" role="img" focusable="false" style="vertical-align: -0.387ex;">&lt;g stroke="currentColor" fill="currentColor" stroke-width="0" transform="matrix(1 0 0 -1 0 0)">&lt;path stroke-width="1" d="M84 237T84 250T98 270H679Q694 262 694 250T679 230H98Q84 237 84 250Z"/>&lt;g transform="translate(778,0)">&lt;path stroke-width="1" d="M213 578L200 573Q186 568 160 563T102 556H83V602H102Q149 
604 189 617T245 641T273 663Q275 666 285 666Q294 666 302 660V361L303 61Q310 54 315 52T339 48T401 46H427V0H416Q395 3 257 3Q121 3 100 0H88V46H114Q136 46 152 46T177 47T193 50T201 52T207 57T213 61V578Z"/>&lt;/g>&lt;/g>&lt;/svg>&lt;/span>&lt;script type="math/tex" id="MathJax-Element-14">-1&lt;/script>&lt;/div>&lt;/div>&lt;/div>&lt;/foreignObject>&lt;text x="460" y="187" fill="#000000" font-family="Helvetica" font-size="20px" text-anchor="middle">-1&lt;/text>&lt;/switch>&lt;/g>&lt;rect x="210" y="341" width="80" height="40" fill="none" stroke="none" pointer-events="all"/>&lt;g transform="translate(-0.5 -0.5)">&lt;switch>&lt;foreignObject style="overflow: visible; text-align: left;" pointer-events="none" width="100%" height="100%" requiredFeatures="http://www.w3.org/TR/SVG11/feature#Extensibility">&lt;div xmlns="http://www.w3.org/1999/xhtml" style="display: flex; align-items: unsafe center; justify-content: unsafe center; width: 78px; height: 1px; padding-top: 361px; margin-left: 211px;">&lt;div style="box-sizing: border-box; font-size: 0; text-align: center; ">&lt;div style="display: inline-block; font-size: 20px; font-family: Helvetica; color: #000000; line-height: 1.2; pointer-events: all; white-space: normal; word-wrap: normal; ">&lt;span class="MathJax_Preview" style="">&lt;/span>&lt;span class="MathJax_SVG" id="MathJax-Element-15-Frame" tabindex="0" style="font-size: 100%; display: inline-block;">&lt;svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" width="2.971ex" height="2.115ex" viewBox="0 -743.6 1279 910.4" role="img" focusable="false" style="vertical-align: -0.387ex;">&lt;g stroke="currentColor" fill="currentColor" stroke-width="0" transform="matrix(1 0 0 -1 0 0)">&lt;path stroke-width="1" d="M84 237T84 250T98 270H679Q694 262 694 250T679 230H98Q84 237 84 250Z"/>&lt;g transform="translate(778,0)">&lt;path stroke-width="1" d="M213 578L200 573Q186 568 160 563T102 556H83V602H102Q149 604 189 617T245 641T273 663Q275 666 285 666Q294 666 
302 660V361L303 61Q310 54 315 52T339 48T401 46H427V0H416Q395 3 257 3Q121 3 100 0H88V46H114Q136 46 152 46T177 47T193 50T201 52T207 57T213 61V578Z"/>&lt;/g>&lt;/g>&lt;/svg>&lt;/span>&lt;script type="math/tex" id="MathJax-Element-15">-1&lt;/script>&lt;/div>&lt;/div>&lt;/div>&lt;/foreignObject>&lt;text x="250" y="367" fill="#000000" font-family="Helvetica" font-size="20px" text-anchor="middle">-1&lt;/text>&lt;/switch>&lt;/g>&lt;/g>&lt;switch>&lt;g requiredFeatures="http://www.w3.org/TR/SVG11/feature#Extensibility"/>&lt;a transform="translate(0,-5)" xlink:href="https://desk.draw.io/support/solutions/articles/16000042487" target="_blank">&lt;text text-anchor="middle" font-size="10px" x="50%" y="100%">Viewer does not support full SVG 1.1&lt;/text>&lt;/a>&lt;/switch>&lt;/svg>&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>State vector (the states of all units):
&lt;/p>
$$
U = (+1, -1, -1, +1)^T
$$
&lt;p>
Weight matrix:&lt;/p>
$$
T=\left(\begin{array}{cccc}
T\_{11} &amp; T\_{12} &amp; T\_{13} &amp; T\_{14} \\\\
T\_{21} &amp; T\_{22} &amp; T\_{23} &amp; T\_{24} \\\\
T\_{31} &amp; T\_{32} &amp; T\_{33} &amp; T\_{34} \\\\
T\_{41} &amp; T\_{42} &amp; T\_{43} &amp; T\_{44}
\end{array}\right)
= \left(\begin{array}{cccc}
0 &amp; -1 &amp; -1 &amp; +1 \\\\
-1 &amp; 0 &amp; +1 &amp; -1 \\\\
-1 &amp; +1 &amp; 0 &amp; -1 \\\\
+1 &amp; -1 &amp; -1 &amp; 0
\end{array}\right)
$$
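&lt;p>The weight matrix above can be derived from the stored pattern $U$ with the Hebbian outer-product rule $T = UU^T - I$ (a minimal NumPy sketch, not part of the original notes):&lt;/p>

```python
import numpy as np

# Stored pattern of unit states
U = np.array([+1, -1, -1, +1])

# Hebbian outer-product rule with a zeroed diagonal:
# T_ij = u_i * u_j for i != j, and T_ii = 0
T = np.outer(U, U) - np.eye(len(U), dtype=int)

print(T)
```

The printed matrix matches the weight matrix $T$ given above entry by entry.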
&lt;h3 id="update-binary-unit">Update Binary Unit&lt;/h3>
$$
u\_i = \operatorname{sign}(\sum\_{j} T\_{ji} u\_j) = \begin{cases}
+1 &amp; \text{if }\sum\_{j} T\_{ji} u\_j \geq 0 \\\\
-1 &amp; \text {otherwise }
\end{cases}
$$
&lt;ol>
&lt;li>Evaluate the sum of the weighted inputs&lt;/li>
&lt;li>Set the state to $+1$ if the sum is greater than or equal to $0$, else to $-1$&lt;/li>
&lt;/ol>
&lt;h3 id="update-procedure">Update Procedure&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>Network state is initialized in the beginning&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Update&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Asynchronous&lt;/strong>: Update one unit at a time&lt;/li>
&lt;li>&lt;strong>Synchronous&lt;/strong>: Update all nodes in parallel&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Continue updating until the network state does not change anymore&lt;/p>
&lt;/li>
&lt;/ul>
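&lt;p>The procedure above can be sketched in a few lines of Python (asynchronous variant; the weights are those of the four-unit example, the rest is illustrative):&lt;/p>

```python
import numpy as np

# Weight matrix of the four-unit example (symmetric, T_ii = 0)
T = np.outer([1, -1, -1, 1], [1, -1, -1, 1]) - np.eye(4, dtype=int)

def run_until_stable(u, T):
    u = list(u)
    while True:
        changed = False
        for i in range(len(u)):  # asynchronous: one unit at a time
            # T_ii = 0, so the j = i term contributes nothing
            s = sum(T[j][i] * u[j] for j in range(len(u)))
            new = 1 if s >= 0 else -1
            if new != u[i]:
                u[i] = new
                changed = True
        if not changed:  # state does not change anymore: stable
            return u

# A corrupted copy of the stored pattern settles back into it
print(run_until_stable([1, -1, -1, -1], T))  # [1, -1, -1, 1]
```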
&lt;h4 id="example">Example&lt;/h4>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-08-18%2017.19.17.png" alt="截屏2020-08-18 17.19.17" style="zoom: 67%;" />
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-08-18%2017.23.26.png" alt="截屏2020-08-18 17.23.26" style="zoom:67%;" />
&lt;blockquote>
$$
u\_4 = \operatorname{sign}(+1 \cdot (-1) + (-1) \cdot 1 + (-1) \cdot 1) = \operatorname{sign}(-3) = -1
$$
&lt;p>So the new state of unit 4 is $-1$&lt;/p>
&lt;/blockquote>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-08-18%2017.23.29.png" alt="截屏2020-08-18 17.23.29" style="zoom:67%;" />
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-08-18%2017.23.32.png" alt="截屏2020-08-18 17.23.32" style="zoom:67%;" />
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-08-18%2017.23.34.png" alt="截屏2020-08-18 17.23.34" style="zoom:67%;" />
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-08-18 17.23.37.png" alt="截屏2020-08-18 17.23.37" style="zoom:67%;" />
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-08-18%2017.23.39.png" alt="截屏2020-08-18 17.23.39" style="zoom:67%;" />
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-08-18%2017.23.42.png" alt="截屏2020-08-18 17.23.42" style="zoom:67%;" />
&lt;h4 id="order-of-updating">Order of updating&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>Updates can be performed sequentially&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Random order (Hopfield networks)&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Same average update rate&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Advantages in implementation&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Advantages in function (equiprobable stable states)&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Randomized asynchronous&lt;/strong> updating is a closer match to biological neural nets&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h3 id="energy-function">Energy function&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>Assign a numerical value to each possible state of the system (&lt;strong>Lyapunov Function&lt;/strong>)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Corresponds to the “energy” of the net
&lt;/p>
$$
\begin{aligned}
E &amp;= -\frac{1}{2} \sum\_{j} \sum\_{i \neq j} u\_{i} T\_{j i} u\_{j} \\\\
&amp;= -\frac{1}{2}U^T TU
\end{aligned}
$$
&lt;/li>
&lt;/ul>
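&lt;p>As a quick numerical check (illustrative sketch), the stored pattern of the four-unit example sits at lower energy than a corrupted copy of it:&lt;/p>

```python
import numpy as np

U = np.array([1, -1, -1, 1])
T = np.outer(U, U) - np.eye(4, dtype=int)

def energy(u, T):
    # E = -1/2 * u^T T u  (the diagonal of T is zero, so i = j terms vanish)
    return -0.5 * u @ T @ u

print(energy(U, T))                           # -6.0: the stored pattern (a minimum)
print(energy(np.array([1, -1, -1, -1]), T))   # higher energy after flipping one unit
```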
&lt;h4 id="proof-on-convergence">Proof on Convergence&lt;/h4>
&lt;p>&lt;strong>Each updating step leads to lower or same energy in the net.&lt;/strong>&lt;/p>
&lt;p>Suppose only one unit $j$ is updated at a time. The energy contribution of unit $j$ is
&lt;/p>
$$
E\_{j}=-\frac{1}{2} \sum\_{i \neq j} u\_{i}T\_{j i} u\_{j}
$$
&lt;p>
Given a change in state, the difference in Energy $E$ is
&lt;/p>
$$
\begin{aligned}
\Delta E\_{j}&amp;=E\_{j\_{n e w}}-E\_{j\_{o l d}} \\\\
&amp;=-\frac{1}{2} \Delta u\_{j} \sum\_{i \neq j} T\_{j i} u\_{i}
\end{aligned}
$$
$$
\Delta u\_{j}=u\_{j\_{n e w}}-u\_{j\_{o l d}}
$$
&lt;ul>
&lt;li>
&lt;p>Change from $-1$ to $1$:
&lt;/p>
$$
\Delta u\_{j}=2, \quad \sum\_{i \neq j} T\_{j i} u\_{i} \geq 0 \Rightarrow \Delta E\_{j} \leq 0
$$
&lt;/li>
&lt;li>
&lt;p>Change from $1$ to $-1$:
&lt;/p>
$$
\Delta u\_{j}=-2, \quad \sum\_{i \neq j} T\_{j i} u\_{i}&lt;0 \Rightarrow \Delta E\_{j}&lt;0
$$
&lt;/li>
&lt;/ul>
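&lt;p>The claim can also be verified exhaustively for the four-unit example: over all $2^4$ states, updating any single unit never increases the energy (illustrative sketch):&lt;/p>

```python
from itertools import product

# Weight matrix of the four-unit example
T = [[0, -1, -1, 1],
     [-1, 0, 1, -1],
     [-1, 1, 0, -1],
     [1, -1, -1, 0]]

def energy(u):
    # E = -1/2 * sum over ordered pairs (i, j), i != j, of u_i T_ji u_j
    total = sum(u[i] * T[j][i] * u[j]
                for j in range(4) for i in range(4) if i != j)
    return -0.5 * total

# Exhaustive check of the convergence argument
for u in product([-1, 1], repeat=4):
    for j in range(4):
        s = sum(T[i][j] * u[i] for i in range(4))
        new = list(u)
        new[j] = 1 if s >= 0 else -1
        assert energy(u) >= energy(new)  # each update lowers or keeps the energy
print("every single-unit update is non-increasing in energy")
```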
&lt;h4 id="stable-states">Stable States&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>Stable states are minima of the energy function&lt;/p>
&lt;ul>
&lt;li>Can be global or local minima&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Analogous to finding a minimum in a mountainous terrain&lt;/p>
&lt;/li>
&lt;/ul>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-08-18%2022.36.03.png" alt="截屏2020-08-18 22.36.03" style="zoom: 67%;" />
&lt;h2 id="applications">Applications&lt;/h2>
&lt;h3 id="associative-memory">Associative memory&lt;/h3>
&lt;h3 id="optimization">Optimization&lt;/h3>
&lt;h2 id="limitations">Limitations&lt;/h2>
&lt;h3 id="found-stable-state-memory-is-not-guaranteed-the-most-similar-pattern-to-the-input-pattern">Found stable state (memory) is not guaranteed the most similar pattern to the input pattern&lt;/h3>
&lt;p>Not all memories are remembered with the same emphasis (the attractor regions are not all the same size)&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-08-18%2022.39.28.png" alt="截屏2020-08-18 22.39.28" style="zoom: 67%;" />
&lt;h3 id="spurious-states">Spurious States&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>Retrieval States&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Reversed States&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Mixture States: Any linear combination of an odd number of patterns&lt;/p>
&lt;/li>
&lt;li>
&lt;p>“Spinglass” states: Stable states that are not linear combinations of stored patterns (occur when too many patterns are stored)&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h3 id="efficiency">Efficiency&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>In a net of $N$ units, patterns of length $N$ can be stored&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Assuming uncorrelated patterns, the capacity $C$ of a Hopfield net is
&lt;/p>
$$
C \approx 0.15N
$$
&lt;ul>
&lt;li>Tighter bound
$$
\frac{N}{4 \ln N}&lt;C&lt;\frac{N}{2 \ln N}
$$&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
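&lt;p>For concreteness (a quick sketch), both estimates can be evaluated for a net of $N = 100$ units:&lt;/p>

```python
from math import log

N = 100
print(0.15 * N)          # rule-of-thumb capacity: 15.0 patterns
print(N / (4 * log(N)))  # lower bound, roughly 5.4
print(N / (2 * log(N)))  # upper bound, roughly 10.9
```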
&lt;h2 id="reference">Reference&lt;/h2>
&lt;ul>
&lt;li>&lt;a href="https://towardsdatascience.com/hopfield-networks-are-useless-heres-why-you-should-learn-them-f0930ebeadcd">Hopfield Networks are useless. Here’s why you should learn them.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://www.youtube.com/watch?v=HoWJzeAT9uc">Working with a Hopfield neural network model&lt;/a>&lt;/li>
&lt;/ul></description></item><item><title>Boltzmann Machine</title><link>https://haobin-tan.netlify.app/docs/ai/deep-learning/unsupervised-learning/boltzmann-machine/</link><pubDate>Tue, 18 Aug 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/deep-learning/unsupervised-learning/boltzmann-machine/</guid><description>&lt;h2 id="boltzmann-machine">&lt;strong>Boltzmann Machine&lt;/strong>&lt;/h2>
&lt;ul>
&lt;li>Stochastic recurrent neural network&lt;/li>
&lt;li>Introduced by Hinton and Sejnowski&lt;/li>
&lt;li>Learn internal representations&lt;/li>
&lt;li>&lt;span style="color:red">Problem: unconstrained connectivity&lt;/span>&lt;/li>
&lt;/ul>
&lt;h3 id="representation">Representation&lt;/h3>
&lt;p>Model can be represented by Graph:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Undirected graph&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Nodes: &lt;a href="#states">States&lt;/a>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;a href="#connections">Edges: Dependencies between states&lt;/a>&lt;/p>
&lt;/li>
&lt;/ul>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-08-18%2022.51.29.png" alt="截屏2020-08-18 22.51.29" style="zoom:50%;" />
&lt;h3 id="states">States&lt;/h3>
&lt;p>Types:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Visible states&lt;/strong>
&lt;ul>
&lt;li>Represent observed data&lt;/li>
&lt;li>Can be input/output data&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Hidden states&lt;/strong>
&lt;ul>
&lt;li>Latent variable we want to learn&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Bias states&lt;/strong>
&lt;ul>
&lt;li>Always one to encode the bias&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;p>Binary states&lt;/p>
&lt;ul>
&lt;li>unit value $\in \\{0, 1\\}$&lt;/li>
&lt;/ul>
&lt;p>Stochastic&lt;/p>
&lt;ul>
&lt;li>
&lt;p>The decision whether a state is active or not is stochastic&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Depends on the input
&lt;/p>
$$
z\_{i}=b\_{i}+\sum\_{j} s\_{j} w\_{i j}
$$
&lt;ul>
&lt;li>$b\_i$: Bias&lt;/li>
&lt;li>$s\_j$: State $j$&lt;/li>
&lt;li>$w\_{ij}$: Weight between state $j$ and state $i$&lt;/li>
&lt;/ul>
$$
p\left(s\_{i}=1\right)=\frac{1}{1+e^{-z\_{i}}}
$$
&lt;/li>
&lt;/ul>
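&lt;p>The two formulas above translate directly into code (a minimal sketch; the bias, states, and weights are made up for illustration):&lt;/p>

```python
import math
import random

def total_input(b_i, states, weights_i):
    # z_i = b_i + sum_j s_j * w_ij
    return b_i + sum(s * w for s, w in zip(states, weights_i))

def p_active(z_i):
    # p(s_i = 1) = 1 / (1 + e^{-z_i}), the logistic sigmoid
    return 1.0 / (1.0 + math.exp(-z_i))

def sample_state(z_i, rng=random):
    # The unit turns on stochastically with probability p(s_i = 1)
    return 1 if p_active(z_i) > rng.random() else 0

z = total_input(0.0, [1, 0, 1], [2.0, -1.0, 0.5])
print(p_active(z))  # sigmoid(2.5), about 0.92
```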
&lt;h3 id="connections">Connections&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>Graph can be fully connected (no restrictions)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Undirected:
&lt;/p>
$$
w\_{ij} = w\_{ji}
$$
&lt;/li>
&lt;li>
&lt;p>No self connections:
&lt;/p>
$$
w\_{ii} = 0
$$
&lt;/li>
&lt;/ul>
&lt;h3 id="energy">Energy&lt;/h3>
&lt;p>Energy of the network
&lt;/p>
$$
\begin{aligned}
E &amp;= -\frac{1}{2}S^TWS - b^TS \\\\
&amp;= -\sum\_{i&lt;j} w\_{i j} s\_{i} s\_{j}-\sum\_{i} b\_{i} s\_{i}
\end{aligned}
$$
&lt;p>
Probability of input vector $v$
&lt;/p>
$$
p(v)= \frac{e^{-E(v)}}{\displaystyle \sum\_{u} e^{-E(u)}}
$$
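&lt;p>For a tiny network, the energy and the Boltzmann probability $p(v)$ can be computed exactly by enumerating all binary states. A sketch (assuming symmetric $W$ with zero diagonal; the $1/2$ factor counts each pair once; only feasible for small networks, since the normalizer sums over all $2^n$ states):&lt;/p>

```python
import numpy as np
from itertools import product

def energy(s, W, b):
    # E = -1/2 s^T W s - b^T s  (W symmetric, w_ii = 0, each pair counted once)
    return -0.5 * s @ W @ s - b @ s

def state_prob(v, W, b):
    # p(v) = exp(-E(v)) / sum_u exp(-E(u)), enumerating all binary states u
    n = len(v)
    Z = sum(np.exp(-energy(np.array(u), W, b)) for u in product([0, 1], repeat=n))
    return np.exp(-energy(np.array(v), W, b)) / Z
```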
&lt;p>
Updating the nodes&lt;/p>
&lt;ul>
&lt;li>
&lt;p>decreases the energy of the network on average&lt;/p>
&lt;/li>
&lt;li>
&lt;p>reaches a local minimum (equilibrium)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>The stochastic update process helps avoid local minima
&lt;/p>
$$
\begin{array}{c}
p\left(s\_{i}=1\right)=\frac{1}{1+e^{-z\_{i}}} \\\\
z\_{i}=\Delta E\_{i}=E\_{s\_i=0}-E\_{s\_i=1}
\end{array}
$$
&lt;/li>
&lt;/ul>
&lt;h3 id="simulated-annealing">Simulated Annealing&lt;/h3>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2020-08-18%2023.53.50.png" alt="截屏2020-08-18 23.53.50">&lt;/p>
&lt;p>Use a temperature $T$ to allow for more state changes in the beginning&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Start with high temperature&lt;/p>
&lt;/li>
&lt;li>
&lt;p>“&lt;strong>anneal&lt;/strong>” by slowly lowering T&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Can escape from local minima &amp;#x1f44f;&lt;/p>
&lt;/li>
&lt;/ul>
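&lt;p>A minimal sketch of this idea: with a temperature $T$, the flip probability becomes $\sigma(z\_i / T)$, so high $T$ makes updates nearly random and low $T$ nearly deterministic. The geometric decay schedule below is an illustrative assumption, not from the lecture:&lt;/p>

```python
import numpy as np

def flip_prob(z_i, T):
    # Temperature-scaled activation: sigma(z_i / T)
    return 1.0 / (1.0 + np.exp(-z_i / T))

def geometric_schedule(T0=10.0, decay=0.9, steps=50):
    # "Anneal" by slowly lowering T
    return [T0 * decay ** k for k in range(steps)]
```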
&lt;h3 id="search-problem">Search Problem&lt;/h3>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-08-18%2023.54.58.png" alt="截屏2020-08-18 23.54.58" style="zoom:67%;" />
&lt;ul>
&lt;li>
&lt;p>Input is set and fixed (clamped)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Annealing is done&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Answer is presented at the output&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Hidden units add extra representational power&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h3 id="learning-problem">Learning problem&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>Situations&lt;/p>
&lt;ul>
&lt;li>Present data vectors to the network&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Problem&lt;/p>
&lt;ul>
&lt;li>Learn weights that generate these data with high probability&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Approach&lt;/p>
&lt;ul>
&lt;li>Perform small updates on the weights&lt;/li>
&lt;li>Each time, solve the search problem&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="pros--cons">Pros &amp;amp; Cons&lt;/h3>
&lt;p>✅ Pros&lt;/p>
&lt;ul>
&lt;li>Boltzmann machine with enough hidden units can compute any function&lt;/li>
&lt;/ul>
&lt;p>⛔️ Cons&lt;/p>
&lt;ul>
&lt;li>Training is very slow and computationally expensive &amp;#x1f622;&lt;/li>
&lt;/ul>
&lt;h2 id="restricted-boltzmann-machine">&lt;strong>Restricted Boltzmann Machine&lt;/strong>&lt;/h2>
&lt;div class="flex px-4 py-3 mb-6 rounded-md bg-primary-100 dark:bg-primary-900">
&lt;span class="pr-3 pt-1 text-primary-600 dark:text-primary-300">
&lt;svg height="24" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24">&lt;path fill="none" stroke="currentColor" stroke-linecap="round" stroke-linejoin="round" stroke-width="1.5" d="m11.25 11.25l.041-.02a.75.75 0 0 1 1.063.852l-.708 2.836a.75.75 0 0 0 1.063.853l.041-.021M21 12a9 9 0 1 1-18 0a9 9 0 0 1 18 0m-9-3.75h.008v.008H12z"/>&lt;/svg>
&lt;/span>
&lt;span class="dark:text-neutral-300">See also: &lt;a href="https://haobin-tan.netlify.app/docs/ai/deep-learning/unsupervised-learning/rbm/">Restricted Boltzmann Machine&lt;/a>&lt;/span>
&lt;/div>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-08-18%2023.58.36.png" alt="截屏2020-08-18 23.58.36" style="zoom:67%;" />
&lt;ul>
&lt;li>
&lt;p>Boltzmann machine with restriction&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Graph must be &lt;strong>bipartite&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Set of visible units&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Set of hidden units&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>✅ Advantage&lt;/p>
&lt;ul>
&lt;li>No connection between hidden units&lt;/li>
&lt;li>Efficient training&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="energy-1">Energy&lt;/h3>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/v2-ede70fdae3090088792aab8607b3c2db_720w.jpg" alt="img" style="zoom:67%;" />
&lt;p>Energy:
&lt;/p>
$$
\begin{aligned}
E(v, h)
&amp;= -a^{\mathrm{T}} v-b^{\mathrm{T}} h-v^{\mathrm{T}} W h \\\\
&amp;= -\sum\_{i} a\_{i} v\_{i}-\sum\_{j} b\_{j} h\_{j}-\sum\_{i} \sum\_{j} v\_{i} w\_{i j} h\_{j}
\end{aligned}
$$
&lt;p>
Probability of hidden unit:
&lt;/p>
$$
p\left(h\_{j}=1 \mid v\right)=\sigma\left(b\_{j}+\sum\_{i=1}^{m} W\_{i j} v\_{i}\right)
$$
&lt;p>
Probability of input vector:
&lt;/p>
$$
p\left(v\_{i}=1 \mid h\right)=\sigma\left(a\_{i}+\sum\_{j=1}^{F} W\_{i j} h\_{j}\right)
$$
&lt;blockquote>
$$
\sigma(x)=\frac{1}{1+e^{-x}}
$$
&lt;/blockquote>
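&lt;p>Because the graph is bipartite, both conditionals factorize over units and can be computed in one matrix operation each. A minimal sketch (with $a$, $b$, $W$ as in the energy formula; binary units; names are illustrative):&lt;/p>

```python
import numpy as np

def sigma(x):
    return 1.0 / (1.0 + np.exp(-x))

def p_h_given_v(v, W, b):
    # p(h_j = 1 | v) = sigma(b_j + sum_i W_ij v_i), for all j at once
    return sigma(b + v @ W)

def p_v_given_h(h, W, a):
    # p(v_i = 1 | h) = sigma(a_i + sum_j W_ij h_j), for all i at once
    return sigma(a + W @ h)
```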
&lt;p>Free Energy:
&lt;/p>
$$
\begin{array}{l}
e^{-F(v)}=\sum\_{h} e^{-E(v, h)} \\\\
F(v)=-\sum\_{i=1}^{m} v\_{i} a\_{i}-\sum\_{j=1}^{F} \log \left(1+e^{z\_{j}}\right) \\\\
z\_{j}=b\_{j}+\sum\_{i=1}^{m} W\_{i j} v\_{i}
\end{array}
$$
$$</description></item><item><title>Restricted Boltzmann Machines (RBMs)</title><link>https://haobin-tan.netlify.app/docs/ai/deep-learning/unsupervised-learning/rbm/</link><pubDate>Sun, 16 Aug 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/deep-learning/unsupervised-learning/rbm/</guid><description>&lt;h2 id="definition">Definition&lt;/h2>
&lt;p>Invented by Geoffrey Hinton, a Restricted Boltzmann machine is an algorithm useful for&lt;/p>
&lt;ul>
&lt;li>dimensionality reduction&lt;/li>
&lt;li>classification&lt;/li>
&lt;li>regression&lt;/li>
&lt;li>collaborative filtering&lt;/li>
&lt;li>feature learning&lt;/li>
&lt;li>topic modeling&lt;/li>
&lt;/ul>
&lt;p>Given their relative simplicity and historical importance, restricted Boltzmann machines are the first neural network we’ll tackle.&lt;/p>
&lt;div class="flex px-4 py-3 mb-6 rounded-md bg-yellow-100 dark:bg-yellow-900">
&lt;span class="pr-3 pt-1 text-red-400">
&lt;svg height="24" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24">&lt;path fill="none" stroke="currentColor" stroke-linecap="round" stroke-linejoin="round" stroke-width="1.5" d="M12 9v3.75m-9.303 3.376c-.866 1.5.217 3.374 1.948 3.374h14.71c1.73 0 2.813-1.874 1.948-3.374L13.949 3.378c-.866-1.5-3.032-1.5-3.898 0zM12 15.75h.007v.008H12z"/>&lt;/svg>
&lt;/span>
&lt;span class="dark:text-neutral-300">&lt;em>While RBMs are occasionally used, most practitioners in the machine-learning community have deprecated them in favor of &lt;a href="https://wiki.pathmind.com/generative-adversarial-network-gan">generative adversarial networks or variational autoencoders&lt;/a>. RBMs are the Model T’s of neural networks – interesting for historical reasons, but surpassed by more up-to-date models.&lt;/em>&lt;/span>
&lt;/div>
&lt;h2 id="structure">Structure&lt;/h2>
&lt;p>RBMs are shallow, two-layer neural nets that constitute the building blocks of &lt;em>deep-belief networks&lt;/em>.&lt;/p>
&lt;ul>
&lt;li>The first layer of the RBM is called the &lt;strong>visible&lt;/strong>, or &lt;strong>input&lt;/strong>, layer.&lt;/li>
&lt;li>The second is the &lt;strong>hidden&lt;/strong> layer.&lt;/li>
&lt;/ul>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/two_layer_RBM.png" alt="two_layer_RBM" style="zoom: 70%;" />
&lt;p>Each circle in the graph above represents a neuron-like unit called a &lt;strong>node&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Nodes are simply where calculations take place&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Nodes are connected to each other across layers, but NO two nodes of the SAME layer are linked&lt;/p>
&lt;p>$\to$ NO intra-layer communication (&lt;em>restriction&lt;/em> in a restricted Boltzmann machine)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Each node is a locus of computation that processes input, and begins by making &lt;a href="https://wiki.pathmind.com/glossary#stochasticgradientdescent">stochastic&lt;/a> decisions about whether to transmit that input or not&lt;/p>
&lt;blockquote>
&lt;p>&lt;em>Stochastic&lt;/em> means “randomly determined”, and in this case, the coefficients that modify inputs are randomly initialized.&lt;/p>
&lt;/blockquote>
&lt;/li>
&lt;/ul>
&lt;p>Each visible node takes a low-level feature from an item in the dataset to be learned.&lt;/p>
&lt;ul>
&lt;li>E.g., from a dataset of grayscale images, each visible node would receive one pixel-value for each pixel in one image. &lt;em>(MNIST images have 784 pixels, so neural nets processing them must have 784 input nodes on the visible layer.)&lt;/em>&lt;/li>
&lt;/ul>
&lt;h3 id="forward-pass">Forward pass&lt;/h3>
&lt;h4 id="one-input-path">One input path&lt;/h4>
&lt;p>Now let’s follow that single pixel value, &lt;em>x&lt;/em>, through the two-layer net. At node 1 of the hidden layer,&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/input_path_RBM.png" alt="input path RBM" style="zoom:80%;" />
&lt;ul>
&lt;li>x is multiplied by a &lt;em>weight&lt;/em> and added to a so-called &lt;em>bias&lt;/em>&lt;/li>
&lt;li>The result of those two operations is fed into an &lt;em>activation function&lt;/em>, which produces the node’s output&lt;/li>
&lt;/ul>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl">activation f((weight w * input x) + bias b ) = output a
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h4 id="weighted-inputs-combine">Weighted inputs combine&lt;/h4>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/weighted_input_RBM.png" alt="weighted_input_RBM" style="zoom:80%;" />
&lt;ul>
&lt;li>Each x is multiplied by a separate weight&lt;/li>
&lt;li>The products are summed and added to a bias&lt;/li>
&lt;li>The result is passed through an activation function to produce the node’s output.&lt;/li>
&lt;/ul>
&lt;p>Because inputs from all visible nodes are being passed to all hidden nodes, an RBM can be defined as a &lt;strong>symmetrical bipartite graph&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Symmetrical: each visible node is connected with each hidden node&lt;/li>
&lt;li>Bipartite: it has two parts, or layers, and the &lt;em>graph&lt;/em> is a mathematical term for a web of nodes&lt;/li>
&lt;/ul>
&lt;h4 id="multiple-inputs">Multiple inputs&lt;/h4>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/multiple_inputs_RBM.png" alt="multiple_inputs_RBM" style="zoom:80%;" />
&lt;ul>
&lt;li>At each hidden node, each input x is multiplied by its respective weight w.
&lt;ul>
&lt;li>12 weights altogether (4 input nodes x 3 hidden nodes)&lt;/li>
&lt;li>The weights between two layers will always form a matrix
&lt;ul>
&lt;li>#rows = #input nodes&lt;/li>
&lt;li>#columns = #output nodes&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Each hidden node
&lt;ul>
&lt;li>receives the four inputs multiplied by their respective weights&lt;/li>
&lt;li>The sum of those products is again added to a bias (which forces at least some activations to happen)&lt;/li>
&lt;li>The result is passed through the activation algorithm producing one output for each hidden node&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
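&lt;p>The layer computation described above can be sketched directly: 4 input nodes, 3 hidden nodes, so the weight matrix is 4 × 3 (rows = input nodes, columns = output nodes, 12 weights altogether). The sigmoid is an assumed activation; values here are random placeholders:&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.random(4)        # one input vector (4 visible nodes)
W = rng.random((4, 3))   # rows = input nodes, columns = output nodes
b = np.zeros(3)          # one bias per hidden node

hidden = 1.0 / (1.0 + np.exp(-(x @ W + b)))  # one output per hidden node
```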
&lt;h4 id="multiple-hidden-layers">Multiple hidden layers&lt;/h4>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/multiple_hidden_layers_RBM.png" alt="multiple_hidden_layers_RBM" style="zoom:80%;" />
&lt;p>If these two layers were part of a deeper neural network, the outputs of hidden layer no. 1 would be passed as inputs to hidden layer no. 2, and from there through as many hidden layers as you like until they reach a final classifying layer.&lt;/p>
&lt;p>(For simple feed-forward movements, the RBM nodes function as an &lt;em>autoencoder&lt;/em> and nothing more.)&lt;/p>
&lt;h2 id="reconstructions">Reconstructions&lt;/h2>
&lt;p>In this section, we’ll focus on how they learn to &lt;strong>reconstruct data by themselves&lt;/strong> in an unsupervised fashion, making several forward and backward passes between the visible layer and hidden layer no. 1 without involving a deeper network.&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/reconstruction_RBM.png" alt="reconstruction_RBM" style="zoom:80%;" />
&lt;ul>
&lt;li>The activations of hidden layer no. 1 become the input in a backward pass.&lt;/li>
&lt;li>They are multiplied by the same weights, one per internode edge, just as x was weight-adjusted on the forward pass.&lt;/li>
&lt;li>The sum of those products is added to a visible-layer bias at each visible node&lt;/li>
&lt;li>The output of those operations is a &lt;strong>reconstruction&lt;/strong>; i.e. an approximation of the original input.&lt;/li>
&lt;/ul>
&lt;p>We can think of reconstruction error as the difference between the values of &lt;code>r&lt;/code> and the input values, and that error is then backpropagated against the RBM’s weights, again and again, in an iterative learning process until an error minimum is reached.&lt;/p>
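&lt;p>The forward and backward passes with shared weights, and the reconstruction error, can be sketched as follows (a toy illustration of the passes described above, not a full training procedure; names are hypothetical):&lt;/p>

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def reconstruct(x, W, b_hidden, b_visible):
    h = sigmoid(x @ W + b_hidden)      # forward pass: hidden activations
    r = sigmoid(h @ W.T + b_visible)   # backward pass: same weights, visible biases
    return r

def reconstruction_error(x, W, b_hidden, b_visible):
    r = reconstruct(x, W, b_hidden, b_visible)
    return np.mean((x - r) ** 2)
```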
&lt;h3 id="kullback-leibler-divergence">Kullback Leibler Divergence&lt;/h3>
&lt;p>On its forward pass, an RBM uses inputs to make predictions about node activations, or the &lt;a href="https://en.wikipedia.org/wiki/Bayes'_theorem">probability of output given a weighted x&lt;/a>: &lt;code>p(a|x; w)&lt;/code>.&lt;/p>
&lt;p>On its backward pass, an RBM is attempting to estimate the probability of inputs &lt;code>x&lt;/code> given activations &lt;code>a&lt;/code>, which are weighted with the &lt;em>same&lt;/em> coefficients as those used on the forward pass: &lt;code>p(x|a; w)&lt;/code>&lt;/p>
&lt;p>Together, those two estimates will lead us to the joint probability distribution of inputs &lt;em>x&lt;/em> and activations &lt;em>a&lt;/em>, or &lt;code>p(x, a)&lt;/code>.&lt;/p>
&lt;p>Reconstruction is making guesses about the probability distribution of the original input; i.e. the values of many varied points at once. And this is known as &lt;a href="http://cs229.stanford.edu/notes/cs229-notes2.pdf">generative learning&lt;/a>.&lt;/p>
&lt;p>Imagine that both the input data and the reconstructions are normal curves of different shapes, which only partially overlap. To measure the distance between its estimated probability distribution and the ground-truth distribution of the input, RBMs use &lt;strong>&lt;a href="https://www.quora.com/What-is-a-good-laymans-explanation-for-the-Kullback-Leibler-Divergence">Kullback Leibler Divergence&lt;/a>&lt;/strong>.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>KL-Divergence measures the non-overlapping, or diverging, areas under the two curves&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/KL_divergence_RBM.png" alt="KL_divergence_RBM" style="zoom:67%;" />
&lt;/li>
&lt;/ul>
&lt;p>An RBM’s optimization algorithm attempts to &lt;em>minimize&lt;/em> those areas so that the shared weights, when multiplied by activations of hidden layer one, produce a close approximation of the original input. By iteratively adjusting the weights according to the error they produce, an RBM learns to approximate the original data.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>The learning process looks like two probability distributions converging, step by step.&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/KLD_update_RBM.png" alt="KLD_update_RBM" style="zoom:67%;" />
&lt;/li>
&lt;/ul>
&lt;h2 id="probabilistic-view">Probabilistic View&lt;/h2>
&lt;p>For example, image datasets have unique probability distributions for their pixel values, depending on the kind of images in the set.&lt;/p>
&lt;p>Assuming an RBM that was only fed images of elephants and dogs, and which had only two output nodes, one for each animal.&lt;/p>
&lt;ul>
&lt;li>The question the RBM is asking itself on the forward pass is: Given these pixels, should my weights send a stronger signal to the elephant node or the dog node?&lt;/li>
&lt;li>The question the RBM asks on the backward pass is: Given an elephant, which distribution of pixels should I expect?&lt;/li>
&lt;/ul>
&lt;p>That’s joint probability: the simultaneous probability of &lt;em>x&lt;/em> given &lt;em>a&lt;/em> and of &lt;em>a&lt;/em> given &lt;em>x&lt;/em>, expressed as the &lt;strong>shared weights&lt;/strong> between the two layers of the RBM.&lt;/p>
&lt;p>The process of learning reconstructions is, in a sense, &lt;strong>learning which groups of pixels tend to co-occur for a given set of images.&lt;/strong> The activations produced by nodes of hidden layers deep in the network represent significant co-occurrences; e.g. “nonlinear gray tube + big, floppy ears + wrinkles” might be one.&lt;/p>
&lt;h2 id="reference">Reference&lt;/h2>
&lt;ul>
&lt;li>
&lt;p>&lt;a href="https://wiki.pathmind.com/restricted-boltzmann-machine">A Beginner&amp;rsquo;s Guide to Restricted Boltzmann Machines (RBMs)&lt;/a>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;a href="http://deeplearning.net/tutorial/rbm.html">Restricted Boltzmann Machines (RBM)&lt;/a>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;a href="https://www.quora.com/What-is-a-good-laymans-explanation-for-the-Kullback-Leibler-divergence">What is a good layman&amp;rsquo;s explanation for the Kullback-Leibler divergence?&lt;/a>&lt;/p>
&lt;/li>
&lt;/ul></description></item><item><title>Unsupervised Learning</title><link>https://haobin-tan.netlify.app/docs/ai/machine-learning/unsupervised/</link><pubDate>Mon, 07 Sep 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/machine-learning/unsupervised/</guid><description/></item><item><title>Gaussian Mixture Model</title><link>https://haobin-tan.netlify.app/docs/ai/machine-learning/unsupervised/gaussian-mixture-model/</link><pubDate>Sat, 07 Nov 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/machine-learning/unsupervised/gaussian-mixture-model/</guid><description>&lt;h2 id="gaussian-distribution">Gaussian Distribution&lt;/h2>
&lt;p>&lt;strong>Univariate&lt;/strong>: The Probability Density Function (PDF) is:
&lt;/p>
$$
P(x | \theta)=\frac{1}{\sqrt{2 \pi \sigma^{2}}} \exp \left(-\frac{(x-\mu)^{2}}{2 \sigma^{2}}\right)
$$
&lt;ul>
&lt;li>$\mu$: mean&lt;/li>
&lt;li>$\sigma$: standard deviation&lt;/li>
&lt;/ul>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/gaussians.png" alt="gaussian mixture models">&lt;/p>
&lt;p>&lt;strong>Multivariate&lt;/strong>: The Probability Density Function (PDF) is:
&lt;/p>
$$
P(x | \theta)=\frac{1}{(2 \pi)^{\frac{D}{2}}|\Sigma|^{\frac{1}{2}}} \exp \left(-\frac{(x-\mu)^{T} \Sigma^{-1}(x-\mu)}{2}\right)
$$
&lt;ul>
&lt;li>$\mu$: mean&lt;/li>
&lt;li>$\Sigma$: covariance&lt;/li>
&lt;li>$D$: dimension of data&lt;/li>
&lt;/ul>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/gaussians-3d-300x224.png" alt="gaussian mixture models">&lt;/p>
&lt;h3 id="learning">Learning&lt;/h3>
&lt;p>For univariate Gaussian model, we can use Maximum Likelihood Estimation (MLE) to estimate parameter $\theta$ :
&lt;/p>
$$
\theta= \underset{\theta}{\operatorname{argmax}} L(\theta)
$$
&lt;p>
Assuming data are i.i.d, we have:
&lt;/p>
$$
L(\theta)=\prod\_{j=1}^{N} P\left(x\_{j} | \theta\right)
$$
&lt;p>
For numerical stability, we usually use Maximum Log-Likelihood:
&lt;/p>
$$
\begin{align} \theta &amp;= \underset{\theta}{\operatorname{argmax}} L(\theta) \\\\
&amp;= \underset{\theta}{\operatorname{argmax}} \log(L(\theta)) \\\\
&amp;= \underset{\theta}{\operatorname{argmax}} \sum\_{j=1}^{N} \log P\left(x\_{j} | \theta\right)\end{align}
$$
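&lt;p>For the univariate Gaussian, maximizing the log-likelihood above has a closed form: $\mu$ is the sample mean and $\sigma^2$ the (biased) sample variance. A quick sketch:&lt;/p>

```python
import numpy as np

def gaussian_mle(x):
    mu = np.mean(x)
    sigma2 = np.mean((x - mu) ** 2)  # biased MLE variance (divides by N)
    return mu, sigma2

def log_likelihood(x, mu, sigma2):
    # sum_j log P(x_j | theta) for the univariate Gaussian PDF
    return np.sum(-0.5 * np.log(2 * np.pi * sigma2) - (x - mu) ** 2 / (2 * sigma2))
```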
&lt;h2 id="gaussian-mixture-model">Gaussian Mixture Model&lt;/h2>
&lt;p>A Gaussian mixture model is a probabilistic model that assumes all the data points are generated from a mixture of a finite number of Gaussian distributions with unknown parameters. One can think of mixture models as generalizing k-means clustering to incorporate information about the covariance structure of the data as well as the centers of the latent Gaussians.&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/mYN2Q9VqZH-gaussian-mixture-example.png" alt="A Gaussian mixture of three normal distributions.">&lt;/p>
&lt;p>Define:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>$x\_j$: the $j$-th observed data, $j=1, 2,\dots, N$&lt;/p>
&lt;/li>
&lt;li>
&lt;p>$K$: number of Gaussian model components&lt;/p>
&lt;/li>
&lt;li>
&lt;p>$\alpha\_k$: probability that the observed data belongs to the $k$-th model component&lt;/p>
&lt;ul>
&lt;li>$\alpha\_k \geq 0$&lt;/li>
&lt;li>$\displaystyle \sum\_{k=1}^{K}\alpha\_k=1$&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>$\phi(x|\theta\_k)$: probability density function of the $k$-th model component&lt;/p>
&lt;ul>
&lt;li>$\theta\_k = (\mu\_k, \sigma\_k^2)$&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>$\gamma\_{jk}$: probability that the $j$-th observed data belongs to the $k$-th model component&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Probability density function of Gaussian mixture model:
&lt;/p>
$$
P(x | \theta)=\sum\_{k=1}^{K} \alpha\_{k} \phi\left(x | \theta\_{k}\right)
$$
&lt;p>
For this model, parameter is $\theta=\left(\tilde{\mu}\_{k}, \tilde{\sigma}\_{k}, \tilde{\alpha}\_{k}\right)$.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h2 id="expectation-maximum-em">Expectation-Maximum (EM)&lt;/h2>
&lt;blockquote>
&lt;p>&lt;em>Expectation-Maximization (EM) is a statistical algorithm for finding the right model parameters. We typically use EM when the data has missing values, or in other words, when the data is incomplete.&lt;/em>&lt;/p>
&lt;/blockquote>
&lt;p>These missing variables are called &lt;strong>latent variables&lt;/strong>.&lt;/p>
&lt;ul>
&lt;li>&lt;em>NEVER&lt;/em> observed&lt;/li>
&lt;li>We do &lt;em>NOT&lt;/em> know the correct values in advance&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Since we do not have the values for the latent variables, Expectation-Maximization tries to use the existing data to determine the optimum values for these variables and then finds the model parameters.&lt;/strong> Based on these model parameters, we go back and update the values for the latent variable, and so on.&lt;/p>
&lt;p>The Expectation-Maximization algorithm has two steps:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>E-step:&lt;/strong> In this step, the available data is used to estimate (guess) the values of the missing variables&lt;/li>
&lt;li>&lt;strong>M-step:&lt;/strong> Based on the estimated values generated in the E-step, the complete data is used to update the parameters&lt;/li>
&lt;/ul>
&lt;h3 id="em-in-gaussian-mixture-model">EM in Gaussian Mixture Model&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>Initialize the parameters ($K$ Gaussian distributions with means $\mu\_1, \mu\_2,\dots,\mu\_K$ and covariances $\Sigma\_1, \Sigma\_2, \dots, \Sigma\_K$)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Repeat&lt;/p>
&lt;ul>
&lt;li>&lt;strong>E-step&lt;/strong>: For each point $x\_j$, calculate the probability that it belongs to cluster/distribution $k$&lt;/li>
&lt;/ul>
$$
\begin{align}
\gamma\_{j k} &amp;= \frac{\text{Probability that } x\_j \text{ belongs to cluster } k}{\text{Sum of probabilities that } x\_j \text{ belongs to clusters } 1, 2, \dots, K} \\\\
&amp;= \frac{\alpha\_{k} \phi\left(x\_{j} | \theta\_{k}\right)}{\sum\_{l=1}^{K} \alpha\_{l} \phi\left(x\_{j} | \theta\_{l}\right)}\qquad j=1,2, \ldots, N ; k=1,2, \ldots, K
\end{align}
$$
&lt;p>The value will be high when the point is assigned to the right cluster and lower otherwise&lt;/p>
&lt;ul>
&lt;li>&lt;strong>M-step&lt;/strong>: update parameters&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
$$
\alpha\_k = \frac{\text{Number of points assigned to cluster } k}{\text{Total number of points}} = \frac{\sum\_{j=1}^{N} \gamma\_{j k}}{N} \qquad k=1,2, \ldots, K
$$
$$
\mu\_{k}=\frac{\sum\_{j}^{N}\left(\gamma\_{j k} x\_{j}\right)}{\sum\_{j}^{N} \gamma\_{j k}}\qquad k=1,2, \ldots, K
$$
$$
\Sigma\_{k}=\frac{\sum\_{j}^{N} \gamma\_{j k}\left(x\_{j}-\mu\_{k}\right)\left(x\_{j}-\mu\_{k}\right)^{T}}{\sum\_{j}^{N} \gamma\_{j k}} \qquad k=1,2, \ldots, K
$$
&lt;p>until convergence ($\left\|\theta\_{i+1}-\theta\_{i}\right\|&lt;\varepsilon$)&lt;/p>
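&lt;p>One E/M iteration above can be sketched for a one-dimensional mixture (a minimal illustration with hypothetical names; for practical use, scikit-learn's &lt;code>GaussianMixture&lt;/code> implements this):&lt;/p>

```python
import numpy as np

def gaussian_pdf(x, mu, sigma2):
    return np.exp(-(x - mu) ** 2 / (2.0 * sigma2)) / np.sqrt(2.0 * np.pi * sigma2)

def em_step(x, alpha, mu, sigma2):
    # E-step: responsibilities gamma[j, k]
    dens = alpha * gaussian_pdf(x[:, None], mu, sigma2)   # shape (N, K)
    gamma = dens / dens.sum(axis=1, keepdims=True)
    # M-step: closed-form parameter updates
    Nk = gamma.sum(axis=0)
    alpha_new = Nk / len(x)
    mu_new = (gamma * x[:, None]).sum(axis=0) / Nk
    sigma2_new = (gamma * (x[:, None] - mu_new) ** 2).sum(axis=0) / Nk
    return alpha_new, mu_new, sigma2_new
```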
&lt;p>Visualization:&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/ek1bu6ogj2-em_clustering_of_old_faithful_data.gif" alt="The EM algorithm updating the parameters of a two-component bivariate Gaussian mixture model.">&lt;/p>
&lt;h2 id="reference">Reference&lt;/h2>
&lt;ul>
&lt;li>&lt;a href="https://zhuanlan.zhihu.com/p/30483076">https://zhuanlan.zhihu.com/p/30483076&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://www.analyticsvidhya.com/blog/2019/10/gaussian-mixture-models-clustering/">https://www.analyticsvidhya.com/blog/2019/10/gaussian-mixture-models-clustering/&lt;/a>&lt;/li>
&lt;li>&lt;a href="http://blog.pluskid.org/?p=39">http://blog.pluskid.org/?p=39&lt;/a> 👍&lt;/li>
&lt;/ul></description></item><item><title>Principle Components Analysis (PCA)</title><link>https://haobin-tan.netlify.app/docs/ai/machine-learning/unsupervised/pca/</link><pubDate>Sat, 07 Nov 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/machine-learning/unsupervised/pca/</guid><description>&lt;h2 id="tldr">TL;DR&lt;/h2>
&lt;p>The usual procedure to calculate the $d$-dimensional principal component analysis consists of the following steps:&lt;/p>
&lt;ol start="0">
&lt;li>
&lt;p>Calculate&lt;/p>
&lt;ul>
&lt;li>
&lt;p>average
&lt;/p>
$$
\bar{m}=\frac{1}{N}\sum\_{i=1}^{N} m\_{i} \in \mathbb{R}^{d}
$$
&lt;/li>
&lt;li>
&lt;p>data matrix
&lt;/p>
$$
\mathbf{M}=\left(m\_{1}-\bar{m}, \ldots, m\_{N}-\bar{m}\right) \in \mathbb{R}^{d \times \mathrm{N}}
$$
&lt;/li>
&lt;li>
&lt;p>scatter matrix (covariance matrix)
&lt;/p>
$$
\mathbf{S}=\mathbf{M M}^{\mathrm{T}} \in \mathbb{R}^{d \times d}
$$
&lt;/li>
&lt;/ul>
&lt;p>of all feature vectors $m\_{1}, \ldots, m\_{N}$&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Calculate the normalized ($\\|\cdot\\|=1$) eigenvectors $\mathbf{e}\_1, \dots, \mathbf{e}\_d$ of $\mathbf{S}$ and sort them such that the corresponding eigenvalues $\lambda\_1, \dots, \lambda\_d$ are decreasing, i.e. $\lambda\_1 \geq \lambda\_2 \geq \dots \geq \lambda\_d$&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Construct a matrix
&lt;/p>
$$
\mathbf{A}:=\left(e\_{1}, \ldots, e\_{d^{\prime}}\right) \in \mathbb{R}^{d \times d^{\prime}}
$$
&lt;p>
with the first $d^{\prime}$ eigenvectors as its columns&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Transform each feature vector $m\_i$ into a new feature vector
&lt;/p>
$$
\mathrm{m}\_{\mathrm{i}}^{\prime}=\mathrm{A}^{\mathrm{T}}\left(\mathrm{m}\_{\mathrm{i}}-\overline{\mathrm{m}}\right) \quad \text { for } i=1, \ldots, N
$$
&lt;p>
of smaller dimension $d^{\prime}$&lt;/p>
&lt;/li>
&lt;/ol>
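&lt;p>The four steps above can be sketched with NumPy (the columns of &lt;code>points&lt;/code> are the feature vectors $m\_i$; the function name and arguments are illustrative):&lt;/p>

```python
import numpy as np

def pca(points, d_prime):
    # Step 0: average, data matrix, scatter matrix
    m_bar = points.mean(axis=1, keepdims=True)   # shape (d, 1)
    M = points - m_bar                           # d x N data matrix
    S = M @ M.T                                  # d x d scatter matrix
    # Step 1: eigenpairs, sorted by decreasing eigenvalue
    lam, E = np.linalg.eigh(S)                   # eigh: S is symmetric
    order = np.argsort(lam)[::-1]
    # Step 2: matrix A with the first d' eigenvectors as columns
    A = E[:, order[:d_prime]]
    # Step 3: transform each feature vector to dimension d'
    return A.T @ M                               # d' x N
```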
&lt;h2 id="dimensionality-reduction">Dimensionality reduction&lt;/h2>
&lt;ul>
&lt;li>
&lt;p>Goal: represent instances with fewer variables&lt;/p>
&lt;ul>
&lt;li>Try to preserve as much structure in the data as possible&lt;/li>
&lt;li>Discriminative: only structure that affects class separability&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Feature selection&lt;/p>
&lt;ul>
&lt;li>Pick a subset of the original dimensions&lt;/li>
&lt;li>Discriminative: pick good class &amp;ldquo;predictors&amp;rdquo;&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Feature extraction&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Construct a new set of dimensions
&lt;/p>
$$
E\_{i} = f(X\_1, \dots, X\_d)
$$
&lt;ul>
&lt;li>$X\_1, \dots, X\_d$: features&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>(Linear) combinations of original&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h2 id="direction-of-greatest-variance">Direction of greatest variance&lt;/h2>
&lt;ul>
&lt;li>Define a set of principal components
&lt;ul>
&lt;li>1st: direction of the &lt;strong>greatest variability&lt;/strong> in the data (i.e. Data points are spread out as far as possible)&lt;/li>
&lt;li>2nd: &lt;em>perpendicular&lt;/em> to 1st, greatest variability of what&amp;rsquo;s left&lt;/li>
&lt;li>&amp;hellip;and so on until $d$ (original dimensionality)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>First $m \ll d$ components become $m$ dimensions
&lt;ul>
&lt;li>Change coordinates of every data point to these dimensions&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-06%2023.51.17.png" alt="截屏2021-02-06 23.51.17">&lt;/p>
&lt;div class="flex px-4 py-3 mb-6 rounded-md bg-primary-100 dark:bg-primary-900">
&lt;span class="pr-3 pt-1 text-primary-600 dark:text-primary-300">
&lt;svg height="24" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24">&lt;path fill="none" stroke="currentColor" stroke-linecap="round" stroke-linejoin="round" stroke-width="1.5" d="m11.25 11.25l.041-.02a.75.75 0 0 1 1.063.852l-.708 2.836a.75.75 0 0 0 1.063.853l.041-.021M21 12a9 9 0 1 1-18 0a9 9 0 0 1 18 0m-9-3.75h.008v.008H12z"/>&lt;/svg>
&lt;/span>
&lt;span class="dark:text-neutral-300">&lt;p>Q: Why greatest variability?&lt;/p>
&lt;p>A: If you pick the dimension with the highest variance, that will preserve the distances as much as possible&lt;/p>
&lt;/span>
&lt;/div>
&lt;h2 id="how-to-pca">How to PCA?&lt;/h2>
&lt;ol>
&lt;li>
&lt;p>&amp;ldquo;Center&amp;rdquo; the data at zero (subtract mean from each attribute)
&lt;/p>
$$
x\_{i, a} = x\_{i, a} - \mu\_{a}
$$
&lt;/li>
&lt;li>
&lt;p>Compute covariance matrix $\Sigma$&lt;/p>
&lt;blockquote>
&lt;p>The &lt;strong>covariance&lt;/strong> between two attributes is an indication of whether they change together (positive correlation) or in opposite directions (negative correlation).&lt;/p>
&lt;p>For example, $cov(x\_1, x\_2) = 0.8 > 0 \Rightarrow$ When $x\_1$ increases/decreases, $x\_2$ also increases/decreases.&lt;/p>
&lt;/blockquote>
$$
cov(b, a) = \frac{1}{n} \sum\_{i=1}^{n} x\_{ib} x\_{ia}
$$
&lt;/li>
&lt;li>
&lt;p>We want vectors $\mathbf{e}$ that are not rotated (only scaled) by the covariance matrix $\Sigma$:
&lt;/p>
$$
\Sigma \mathbf{e} = \lambda \mathbf{e}
$$
&lt;p>
$\Rightarrow$ $\mathbf{e}$ are eigenvectors of $\Sigma$, and $\lambda$ are corresponding eigenvalues&lt;/p>
&lt;p>&lt;strong>Principle components = eigenvectors with largest eigenvalues&lt;/strong>&lt;/p>
&lt;/li>
&lt;/ol>
&lt;h3 id="finding-principle-components">Finding principle components&lt;/h3>
&lt;ol>
&lt;li>
&lt;p>Find eigenvalues by solving &lt;a href="https://en.wikipedia.org/wiki/Characteristic_polynomial">Characteristic Polynomial&lt;/a>
&lt;/p>
$$
\operatorname{det}(\Sigma - \lambda \mathbf{I}) = 0
$$
&lt;ul>
&lt;li>$\mathbf{I}$: Identity matrix&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Find $i$-th eigenvector by solving
&lt;/p>
$$
\Sigma \mathbf{e}\_i = \lambda\_i \mathbf{e}\_i
$$
&lt;p>
and we want $\mathbf{e}\_{i}$ to have unit length ($\\|\mathbf{e}\_{i}\\| = 1$)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>The eigenvector with the largest eigenvalue is the first principle component, the eigenvector with the second largest eigenvalue is the second principle component, and so on.&lt;/p>
&lt;/li>
&lt;/ol>
&lt;details>
&lt;summary>Example&lt;/summary>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-07%2000.21.08.png" alt="截屏2021-02-07 00.21.08" style="zoom:67%;" />
&lt;/details>
&lt;h3 id="projecting-to-new-dimension">Projecting to new dimension&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>We pick the $m&lt;d$ eigenvectors $\mathbf{e}\_1, \dots, \mathbf{e}\_m$ with the largest eigenvalues. Now $\mathbf{e}\_1, \dots, \mathbf{e}\_m$ are the new dimension vectors&lt;/p>
&lt;/li>
&lt;li>
&lt;p>For instance $\mathbf{x} = \{x\_1, \dots, x\_d\}$ (original coordinates), we want new coordinates $\mathbf{x}^{\prime} = \{x^{\prime}\_1, \dots, x^{\prime}\_m\}$&lt;/p>
&lt;ul>
&lt;li>&amp;ldquo;Center&amp;rdquo; the instance (subtract the mean): $\mathbf{x} - \mathbf{\mu}$&lt;/li>
&lt;li>&amp;ldquo;Project&amp;rdquo; to each dimension: $(\mathbf{x} - \mathbf{\mu})^T \mathbf{e}\_j$ for $j=1, \dots, m$&lt;/li>
&lt;/ul>
&lt;details>
&lt;summary>Example&lt;/summary>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/PCA.png" alt="PCA" style="zoom:80%;" />
&lt;/details>
&lt;/li>
&lt;/ul>
&lt;h2 id="go-deeper-in-details">Go deeper in details&lt;/h2>
&lt;h3 id="why-eigenvectors--greatest-variance">Why eigenvectors = greatest variance?&lt;/h3>
&lt;div style="position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;">
&lt;iframe allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen="allowfullscreen" loading="eager" referrerpolicy="strict-origin-when-cross-origin" src="https://www.youtube.com/embed/cIE2MDxyf80?autoplay=0&amp;controls=1&amp;end=0&amp;loop=0&amp;mute=0&amp;start=0" style="position: absolute; top: 0; left: 0; width: 100%; height: 100%; border:0;" title="YouTube video"
>&lt;/iframe>
&lt;/div>
&lt;h3 id="why-eigenvalue--variance-along-eigenvector">Why eigenvalue = variance along eigenvector?&lt;/h3>
&lt;div style="position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;">
&lt;iframe allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen="allowfullscreen" loading="eager" referrerpolicy="strict-origin-when-cross-origin" src="https://www.youtube.com/embed/tL0wFZ9aJP8?autoplay=0&amp;controls=1&amp;end=0&amp;loop=0&amp;mute=0&amp;start=0" style="position: absolute; top: 0; left: 0; width: 100%; height: 100%; border:0;" title="YouTube video"
>&lt;/iframe>
&lt;/div>
&lt;h3 id="how-many-dimensions-should-we-reduce-to">How many dimensions should we reduce to?&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>Now we have eigenvectors $\mathbf{e}\_1, \dots, \mathbf{e}\_d$ and we want a new dimensionality $m \ll d$&lt;/p>
&lt;/li>
&lt;li>
&lt;p>We pick $\mathbf{e}\_i$ that &amp;ldquo;explain&amp;rdquo; the most variance:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Sort eigenvectors s.t. $\lambda\_1 \geq \dots \geq \lambda\_d$&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Pick the first $m$ eigenvectors which explain 90% of the total variance (typical threshold values: 0.9 or 0.95)&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-07%2013.06.46.png" alt="截屏2021-02-07 13.06.46">&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Or we can use a scree plot&lt;/p>
&lt;/li>
&lt;/ul>
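&lt;p>One way to sketch this selection rule, assuming a made-up set of already-sorted eigenvalues and a 0.9 threshold:&lt;/p>

```python
import numpy as np

# Hypothetical eigenvalues, already sorted in decreasing order
eigenvalues = np.array([4.0, 2.5, 1.0, 0.3, 0.2])

# Fraction of total variance explained by the first m eigenvectors
explained = np.cumsum(eigenvalues) / eigenvalues.sum()

# Smallest m whose cumulative explained variance reaches the threshold
m = int(np.searchsorted(explained, 0.90)) + 1  # here m == 3
```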
&lt;h2 id="pca-in-a-nutshell">PCA in a nutshell&lt;/h2>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-07%2013.09.32.png" alt="截屏2021-02-07 13.09.32">&lt;/p>
&lt;h2 id="pca-example-eigenfaces">PCA example: Eigenfaces&lt;/h2>
&lt;p>Perform PCA on bitmap images of human faces:&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-07%2016.22.02.png" alt="截屏2021-02-07 16.22.02">&lt;/p>
&lt;p>Below are the eigenvectors after performing PCA on the dataset:&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-07%2016.25.01.png" alt="截屏2021-02-07 16.25.01">&lt;/p>
&lt;p>Then we can project a new face onto the space of eigenfaces, and represent the vector of the new face as a linear combination of principal components.&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-07%2016.24.28.png" alt="截屏2021-02-07 16.24.28">&lt;/p>
&lt;p>As we use more and more eigenvectors in this decomposition, we end up with a face that looks more and more like the original guy:&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-07%2016.33.28.png" alt="截屏2021-02-07 16.33.28">&lt;/p>
&lt;details>
&lt;summary>Why is eigenface neat and interesting?&lt;/summary>
&lt;ul>
&lt;li>This is neat because by taking just the first few eigenvectors you can get a pretty close representation of the face. Suppose this corresponds to maybe 20 eigenvectors. &lt;strong>This means you&amp;rsquo;re using only 20 numbers to represent a face bitmap that looks kind of like the original guy!&lt;/strong> Could you use only 20 pixels to represent him nearly as well? No, there&amp;rsquo;s no way!&lt;/li>
&lt;li>You&amp;rsquo;re effectively picking 20 numbers/mixture coefficients/coordinates. One really nice use of this is &lt;strong>massive compression&lt;/strong> of the data. If everyone has access to the same eigenvectors, all they need to send to each other are the projection coordinates, and then they can transmit arbitrary faces between them. This is a massive reduction in the size of the data.&lt;/li>
&lt;li>Your classifier or regression system now operates in a low-dimensional space, so it has plenty of redundancy to grab onto and can learn a better hyperplane. &amp;#x1f44f;&lt;/li>
&lt;/ul>
&lt;/details>
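&lt;p>The compression idea can be sketched as follows; random vectors stand in for real face bitmaps here, so only the mechanics (20 coefficients per face, reconstruction from those coefficients) carry over:&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-in for 100 flattened face bitmaps of 64 pixels each
faces = rng.normal(size=(100, 64))
mu = faces.mean(axis=0)

# Eigenvectors of the covariance matrix play the role of eigenfaces
eigenvalues, E = np.linalg.eigh(np.cov(faces, rowvar=False))
E = E[:, np.argsort(eigenvalues)[::-1]]

k = 20
coeffs = (faces[0] - mu) @ E[:, :k]  # 20 numbers describe one face
approx = mu + E[:, :k] @ coeffs      # reconstruction from those 20 numbers
```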
&lt;h3 id="application-of-eigenface">Application of eigenface&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>Face similarity&lt;/p>
&lt;ul>
&lt;li>in the reduced space&lt;/li>
&lt;li>insensitive to lighting, expression, orientation&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Projecting new &amp;ldquo;faces&amp;rdquo;&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-07%2016.49.58.png" alt="截屏2021-02-07 16.49.58">&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h2 id="pratical-issues-of-pca">Pratical issues of PCA&lt;/h2>
&lt;ul>
&lt;li>
&lt;p>PCA is based on the covariance matrix, and covariance is extremely sensitive to large values&lt;/p>
&lt;ul>
&lt;li>
&lt;p>E.g. multiply some dimension by 1000. Then this dimension dominates the covariance and becomes a principal component.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Solution: normalize each dimension to zero mean and unit variance
&lt;/p>
$$
x^{\prime} = \frac{x - \text{mean}}{\text{standard deviation}}
$$
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>PCA assumes the underlying subspace is linear.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>PCA can sometimes hurt the performance of classification&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Because PCA doesn&amp;rsquo;t see the labels&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Solution: &lt;a href="https://haobin-tan.netlify.app/docs/ai/machine-learning/non-parametric/lda-summary/">Linear Discriminant Analysis (LDA)&lt;/a>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Picks a new dimension that gives&lt;/p>
&lt;ul>
&lt;li>maximum separation between means of projected classes&lt;/li>
&lt;li>minimum variance within each projected class&lt;/li>
&lt;/ul>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-07%2017.23.36.png" alt="截屏2021-02-07 17.23.36">&lt;/p>
&lt;/li>
&lt;li>
&lt;p>But this relies on some assumptions of the data and does not always work. 🤪&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
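&lt;p>The normalization step from the first point above can be sketched as (the data values are made up; note the second dimension has a much larger scale):&lt;/p>

```python
import numpy as np

# Hypothetical data where the second dimension would dominate the covariance
X = np.array([[1.0, 2000.0], [2.0, 1000.0], [3.0, 4000.0]])

# Normalize each dimension to zero mean and unit variance before PCA
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
```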
&lt;h2 id="reference">Reference&lt;/h2>
&lt;ul>
&lt;li>&lt;a href="https://www.youtube.com/watch?v=IbE0tbjy6JQ&amp;amp;list=PLBv09BD7ez_5_yapAg86Od6JeeypkS4YM&amp;amp;index=1">Principal Component Analysis&lt;/a>: a great series of video tutorials explaining PCA clearly 👍&lt;/li>
&lt;/ul></description></item></channel></rss>