Machine Learning

Principal Component Analysis (PCA)

TL;DR The usual procedure to compute the $d$-dimensional principal component analysis consists of the following steps: calculate the average $$ \bar{m}=\frac{1}{N}\sum\_{i=1}^{N} m\_{i} \in \mathbb{R}^{d} $$ the centered data matrix $$ \mathbf{M}=\left(m\_{1}-\bar{m}, \ldots, m\_{N}-\bar{m}\right) \in \mathbb{R}^{d \times N} $$ and the scatter matrix (the covariance matrix up to a scale factor) $$ \mathbf{S}=\mathbf{M M}^{\mathrm{T}} \in \mathbb{R}^{d \times d} $$ of all feature vectors $m\_{1}, \ldots, m\_{N}$.
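The steps above can be sketched in NumPy (the data values here are made up for illustration):

```python
import numpy as np

# Hypothetical data: N = 5 feature vectors in d = 3 dimensions,
# stored as the columns m_1, ..., m_N.
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 5))

m_bar = X.mean(axis=1, keepdims=True)  # average vector, shape (d, 1)
M = X - m_bar                          # centered data matrix, d x N
S = M @ M.T                            # scatter matrix, d x d

# The eigenvectors of S with the largest eigenvalues are the
# principal components; eigh is used because S is symmetric.
eigvals, eigvecs = np.linalg.eigh(S)
components = eigvecs[:, ::-1]          # sorted by decreasing eigenvalue
```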

2020-11-07

Gaussian Mixture Model

Gaussian Distribution. Univariate: the Probability Density Function (PDF) is $$ P(x | \theta)=\frac{1}{\sqrt{2 \pi \sigma^{2}}} \exp \left(-\frac{(x-\mu)^{2}}{2 \sigma^{2}}\right) $$ where $\mu$ is the mean and $\sigma$ is the standard deviation. Multivariate: the PDF is $$ P(x | \theta)=\frac{1}{(2 \pi)^{\frac{D}{2}}|\Sigma|^{\frac{1}{2}}} \exp \left(-\frac{(x-\mu)^{T} \Sigma^{-1}(x-\mu)}{2}\right) $$ where $\mu$ is the mean, $\Sigma$ the covariance matrix, and $D$ the dimension of the data. Learning: for the univariate Gaussian model, we can use Maximum Likelihood Estimation (MLE) to estimate the parameter $\theta$: $$ \theta= \underset{\theta}{\operatorname{argmax}}\, L(\theta) $$ Assuming the data are i.i.d.
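For the univariate case, the MLE has a closed form (sample mean and the $1/n$ variance); a minimal sketch with a hypothetical `mle_gaussian` helper:

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Univariate Gaussian density P(x | mu, sigma)."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

def mle_gaussian(data):
    """Closed-form MLE for a univariate Gaussian, assuming i.i.d. samples."""
    n = len(data)
    mu = sum(data) / n
    var = sum((x - mu) ** 2 for x in data) / n  # MLE uses 1/n, not 1/(n-1)
    return mu, math.sqrt(var)
```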

2020-11-07

Linear Discriminant Analysis (LDA)

Linear Discriminant Analysis (LDA), also called Fisher’s Linear Discriminant, reduces dimensionality (like PCA) but focuses on maximizing separability among known categories. 💡 Idea: create a new axis, then project the data onto this new axis in a way that maximizes the separation of the two categories. How does it work?
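For two classes, Fisher's discriminant direction can be computed in closed form as $S_W^{-1}(\mu_1 - \mu_2)$; a sketch on made-up toy data:

```python
import numpy as np

# Hypothetical two-class data; rows are samples, columns are features.
X1 = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0]])
X2 = np.array([[6.0, 5.0], [7.0, 8.0], [8.0, 7.0]])

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
# Within-class scatter: sum of the per-class scatter matrices.
S_W = (X1 - mu1).T @ (X1 - mu1) + (X2 - mu2).T @ (X2 - mu2)
# Fisher's direction maximizes between-class scatter relative to
# within-class scatter along the projection axis.
w = np.linalg.solve(S_W, mu1 - mu2)

# Projecting each class onto w separates the two categories.
p1, p2 = X1 @ w, X2 @ w
```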

2020-11-07

Linear Discriminant Functions

No assumption about the underlying distributions -> non-parametric. Linear decision surfaces. Begins with supervised training (the class of each training sample is given). Linear Discriminant Functions and Decision Surfaces: a discriminant function that is a linear combination of the components of $x$ can be written as $$ g(\mathbf{x})=\mathbf{w}^{T} \mathbf{x}+w\_{0} $$ where $\mathbf{x}$ is the feature vector, $\mathbf{w}$ the weight vector, and $w\_0$ the bias or threshold weight. The two-category case. Decision rule:
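A minimal sketch of $g(\mathbf{x})$ and the usual two-category rule (decide class 1 if $g(\mathbf{x}) > 0$, class 2 otherwise); the function names are placeholders:

```python
def g(x, w, w0):
    """Linear discriminant g(x) = w^T x + w0."""
    return sum(wi * xi for wi, xi in zip(w, x)) + w0

def classify(x, w, w0):
    """Two-category rule: class 1 if g(x) > 0, else class 2."""
    return 1 if g(x, w, w0) > 0 else 2
```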

2020-11-07

AdaBoost

Adaptive Boosting: correct the predecessor by paying a bit more attention to the training instances that the predecessor underfit. This results in new predictors focusing more and more on the hard cases.
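The reweighting step can be sketched as follows (one standard formulation; the helper name is hypothetical):

```python
import math

def adaboost_reweight(weights, errors, epsilon):
    """One AdaBoost round: boost the weights of misclassified samples.

    `errors` holds booleans (True = the predecessor got it wrong);
    `epsilon` is the predecessor's weighted error rate.
    """
    alpha = 0.5 * math.log((1 - epsilon) / epsilon)  # predictor weight
    new = [w * math.exp(alpha if wrong else -alpha)
           for w, wrong in zip(weights, errors)]
    total = sum(new)
    return [w / total for w in new]  # renormalize to sum to 1
```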

2020-11-07

Bagging and Pasting

TL;DR Bootstrap Aggregating (Bagging): sampling with replacement. Pasting: sampling without replacement. Explanation: ensemble methods work best when the predictors are as independent from one another as possible. One way to get a diverse set of classifiers is to use the same training algorithm for every predictor, but train each one on a different random subset of the training set.
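The two sampling schemes side by side, as a minimal sketch:

```python
import random

def bootstrap_sample(data, k, rng):
    """Bagging: draw k items WITH replacement (duplicates possible)."""
    return [rng.choice(data) for _ in range(k)]

def pasting_sample(data, k, rng):
    """Pasting: draw k items WITHOUT replacement (no duplicates)."""
    return rng.sample(data, k)
```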

2020-11-07

Boosting

Boosting refers to any Ensemble method that can combine several weak learners into a strong learner. 💡 General idea: train predictors sequentially, each trying to correct its predecessor. Popular boosting methods:

2020-11-07

Ensemble Learners

Why ensemble learners? Lower error: each learner (model) has its own bias; if we put them together, the biases tend to be reduced (they fight against each other in some sort of way). Less overfitting. Tastes great.

2020-11-07

Random Forest

Train a group of Decision Tree classifiers (generally via the bagging method, or sometimes pasting), each on a different random subset of the training set. To make predictions, just obtain the predictions of all individual trees, then predict the class that gets the most votes.
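The final voting step can be sketched as a simple majority vote over the individual trees' outputs (the function name is a placeholder):

```python
from collections import Counter

def forest_predict(tree_predictions):
    """Predict the class that gets the most votes among the trees."""
    votes = Counter(tree_predictions)
    return votes.most_common(1)[0][0]
```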

2020-11-07

Voting Classifier

Suppose we have trained a few classifiers, each one achieving about 80% accuracy. A very simple way to create an even better classifier is to aggregate the predictions of each classifier and predict the class that gets the most votes.
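Assuming the classifiers' errors were fully independent (an idealization), the accuracy of a hard-voting ensemble can be computed from the binomial distribution; for three 80%-accurate classifiers, a majority vote is right whenever at least two of them are:

```python
from math import comb

def majority_vote_accuracy(p, n):
    """Probability that a strict majority of n independent classifiers,
    each correct with probability p, votes for the right class."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1, n + 1))
```

With `p = 0.8` and `n = 3` this gives 0.896, already better than any single classifier; in practice the gain is smaller because the classifiers' errors are correlated.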

2020-11-07