<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Machine Learning | Haobin Tan</title><link>https://haobin-tan.netlify.app/tags/machine-learning/</link><atom:link href="https://haobin-tan.netlify.app/tags/machine-learning/index.xml" rel="self" type="application/rss+xml"/><description>Machine Learning</description><generator>Hugo Blox Builder (https://hugoblox.com)</generator><language>en-us</language><lastBuildDate>Sat, 07 Nov 2020 00:00:00 +0000</lastBuildDate><image><url>https://haobin-tan.netlify.app/media/icon_hu7d15bc7db65c8eaf7a4f66f5447d0b42_15095_512x512_fill_lanczos_center_3.png</url><title>Machine Learning</title><link>https://haobin-tan.netlify.app/tags/machine-learning/</link></image><item><title>Machine Learning Fundamentals</title><link>https://haobin-tan.netlify.app/docs/ai/machine-learning/ml-fundamentals/</link><pubDate>Mon, 07 Sep 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/machine-learning/ml-fundamentals/</guid><description/></item><item><title>Math Basics</title><link>https://haobin-tan.netlify.app/docs/ai/machine-learning/ml-fundamentals/math-basics/</link><pubDate>Mon, 17 Aug 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/machine-learning/ml-fundamentals/math-basics/</guid><description>&lt;h2 id="linear-algebra">Linear Algebra&lt;/h2>
&lt;h3 id="vectors">Vectors&lt;/h3>
&lt;p>&lt;strong>Vector&lt;/strong>: multi-dimensional quantity&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Each dimension contains different information (e.g.: Age, Weight, Height&amp;hellip;)&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/Vectors.png" alt="Vectors" style="zoom:70%;" />
&lt;/li>
&lt;li>
&lt;p>represented as &lt;strong>bold symbols&lt;/strong>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>A vector $\boldsymbol{x}$ is always a &lt;strong>column&lt;/strong> vector
&lt;/p>
$$
\boldsymbol{x}=\left[\begin{array}{l}
{1} \\\\
{2} \\\\
{4}
\end{array}\right]
$$
&lt;/li>
&lt;li>
&lt;p>A transposed vector $\boldsymbol{x}^T$ is a &lt;strong>row&lt;/strong> vector
&lt;/p>
$$
\boldsymbol{x}^{T}=\left[\begin{array}{lll}
{1} &amp; {2} &amp; {4}
\end{array}\right]
$$
&lt;/li>
&lt;/ul>
&lt;h4 id="vector-operations">Vector Operations&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Multiplication by scalars&lt;/strong>
&lt;/p>
$$
2\left[\begin{array}{l}
{1} \\\\
{2}
\end{array}\right]=\left[\begin{array}{l}
{2} \\\\
{4}
\end{array}\right]
$$
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Addition of vectors&lt;/strong>
&lt;/p>
$$
\left[\begin{array}{l}{1} \\\\ {2} \end{array}\right]+\left[\begin{array}{l}{3} \\\\ {1}\end{array}\right]=\left[\begin{array}{l}{4} \\\\ {3} \end{array}\right]
$$
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Scalar (Inner) products&lt;/strong>: Sum the element-wise products
&lt;/p>
$$
\boldsymbol{v}=\left[\begin{array}{c}{1} \\\\ {2} \\\\ {4}\end{array}\right], \quad \boldsymbol{w}=\left[\begin{array}{l}{2} \\\\ {4} \\\\ {8}\end{array}\right]
$$
&lt;/li>
&lt;/ul>
$$
\langle\boldsymbol{v}, \boldsymbol{w}\rangle= 1 \cdot 2+2 \cdot 4+4 \cdot 8=42
$$
&lt;ul>
&lt;li>&lt;strong>Length of a vector&lt;/strong>: Square root of the inner product with itself
$$
\|\boldsymbol{v}\|=\langle\boldsymbol{v}, \boldsymbol{v}\rangle^{\frac{1}{2}}=\left(1^{2}+2^{2}+4^{2}\right)^{\frac{1}{2}}=\sqrt{21}
$$&lt;/li>
&lt;/ul>
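As a quick sanity check, the vector operations above can be reproduced with NumPy (NumPy is an assumption here; the original notes use only the math):

```python
import numpy as np

# The vectors from the examples above
v = np.array([1, 2, 4])
w = np.array([2, 4, 8])

# Multiplication by a scalar and addition of vectors
assert np.array_equal(2 * np.array([1, 2]), np.array([2, 4]))
assert np.array_equal(np.array([1, 2]) + np.array([3, 1]), np.array([4, 3]))

# Scalar (inner) product: sum of the element-wise products
inner = np.dot(v, w)            # 1*2 + 2*4 + 4*8 = 42

# Length of a vector: square root of the inner product with itself
length = np.sqrt(np.dot(v, v))  # sqrt(1 + 4 + 16) = sqrt(21)
```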
&lt;h3 id="matrices">Matrices&lt;/h3>
&lt;p>Matrix: rectangular array of numbers arranged in rows and columns&lt;/p>
&lt;ul>
&lt;li>
&lt;p>denoted with &lt;strong>bold upper-case letters&lt;/strong>
&lt;/p>
$$
\boldsymbol{X}=\left[\begin{array}{ll}{1} &amp; {3} \\\\ {2} &amp; {3} \\\\ {4} &amp; {7}\end{array}\right]
$$
&lt;/li>
&lt;li>
&lt;p>Dimension: $\\#rows \\times \\#columns$ (E.g.: 👆$X \in \mathbb{R}^{3 \times 2}$)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Vectors are special cases of matrices
&lt;/p>
$$
\boldsymbol{x}^{T}=\underbrace{\left[\begin{array}{ccc}{1} &amp; {2} &amp; {4}\end{array}\right]}_{1 \times 3 \text { matrix }}
$$
&lt;/li>
&lt;/ul>
&lt;h4 id="matrices-in-ml">Matrices in ML&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>A data set can be represented as a matrix, where the individual samples are vectors&lt;/p>
&lt;p>e.g.:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>&lt;/th>
&lt;th>Age&lt;/th>
&lt;th>Weight&lt;/th>
&lt;th>Height&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Joe&lt;/td>
&lt;td>37&lt;/td>
&lt;td>72&lt;/td>
&lt;td>175&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Mary&lt;/td>
&lt;td>10&lt;/td>
&lt;td>30&lt;/td>
&lt;td>61&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Carol&lt;/td>
&lt;td>25&lt;/td>
&lt;td>65&lt;/td>
&lt;td>121&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Brad&lt;/td>
&lt;td>66&lt;/td>
&lt;td>67&lt;/td>
&lt;td>175&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
$$
\text { Joe: } \boldsymbol{x}\_{1}=\left[\begin{array}{c}{37} \\\\ {72} \\\\ {175}\end{array}\right], \qquad \text { Mary: } \boldsymbol{x}\_{2}=\left[\begin{array}{c}{10} \\\\ {30} \\\\ {61}\end{array}\right]
$$
$$
\text { Carol: } \boldsymbol{x}\_{3}=\left[\begin{array}{c}{25} \\\\ {65} \\\\ {121}\end{array}\right], \qquad \text { Brad: } \boldsymbol{x}\_{4}=\left[\begin{array}{c}{66} \\\\ {67} \\\\ {175}\end{array}\right]
$$
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Most typical representation:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>row ~ data sample (e.g. Joe)&lt;/li>
&lt;li>column ~ data entry (e.g. age)&lt;/li>
&lt;/ul>
$$
\boldsymbol{X}=\left[\begin{array}{l}{\boldsymbol{x}\_{1}^{T}} \\\\ {\boldsymbol{x}\_{2}^{T}} \\\\ {\boldsymbol{x}\_{3}^{T}} \\\\ {\boldsymbol{x}\_{4}^{T}}\end{array}\right]=\left[\begin{array}{ccc}{37} &amp; {72} &amp; {175} \\\\ {10} &amp; {30} &amp; {61} \\\\ {25} &amp; {65} &amp; {121} \\\\ {66} &amp; {67} &amp; {175}\end{array}\right]
$$
&lt;/li>
&lt;/ul>
&lt;h4 id="matrice-operations">Matrice Operations&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Multiplication with scalar&lt;/strong>
&lt;/p>
$$
3 \boldsymbol{M}=3\left[\begin{array}{lll}{3} &amp; {4} &amp; {5} \\\\ {1} &amp; {0} &amp; {1}\end{array}\right]=\left[\begin{array}{ccc}{9} &amp; {12} &amp; {15} \\\\ {3} &amp; {0} &amp; {3}\end{array}\right]
$$
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Addition of matrices&lt;/strong>
&lt;/p>
$$
\boldsymbol{M} + \boldsymbol{N}=\left[\begin{array}{lll}{3} &amp; {4} &amp; {5} \\\\ {1} &amp; {0} &amp; {1}\end{array}\right]+\left[\begin{array}{lll}{1} &amp; {2} &amp; {1} \\\\ {3} &amp; {1} &amp; {1}\end{array}\right]=\left[\begin{array}{lll}{4} &amp; {6} &amp; {6} \\\\ {4} &amp; {1} &amp; {2}\end{array}\right]
$$
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Transposed&lt;/strong>
&lt;/p>
$$
\boldsymbol{M}=\left[\begin{array}{lll}{3} &amp; {4} &amp; {5} \\\\ {1} &amp; {0} &amp; {1}\end{array}\right], \boldsymbol{M}^{T}=\left[\begin{array}{ll}{3} &amp; {1} \\\\ {4} &amp; {0} \\\\ {5} &amp; {1}\end{array}\right]
$$
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Matrix-Vector product&lt;/strong> (the vector must have the &lt;strong>same&lt;/strong> dimensionality as the number of columns of the matrix)
&lt;/p>
$$
\underbrace{\left[\boldsymbol{w}\_{1}, \ldots, \boldsymbol{w}\_{n}\right]}_{\boldsymbol{W}} \underbrace{\left[\begin{array}{c}{v\_{1}} \\\\ {\vdots} \\\\ {v\_{n}}\end{array}\right]}\_{\boldsymbol{v}}=\underbrace{\left[\begin{array}{c}{v\_{1} \boldsymbol{w}\_{1}+\cdots+v\_{n} \boldsymbol{w}\_{n}}\end{array}\right]}\_{\boldsymbol{u}}
$$
&lt;p>
E.g.:
&lt;/p>
$$
\boldsymbol{u}=\boldsymbol{W} \boldsymbol{v}=\left[\begin{array}{ccc}{3} &amp; {4} &amp; {5} \\\\ {1} &amp; {0} &amp; {1}\end{array}\right]\left[\begin{array}{l}{1} \\\\ {0} \\\\ {2}\end{array}\right]=\left[\begin{array}{l}{3 \cdot 1+4 \cdot 0+5 \cdot 2} \\\\ {1 \cdot 1+0 \cdot 0+1 \cdot 2}\end{array}\right]=\left[\begin{array}{c}{13} \\\\ {3}\end{array}\right]
$$
&lt;p>
💡 &lt;em>Think as: We sum over the columns $\boldsymbol{w}_i$ of $\boldsymbol{W}$ weighted by $v_i$&lt;/em>&lt;/p>
&lt;/li>
&lt;/ul>
$$
u=v\_{1} w\_{1}+\cdots+v\_{n} w\_{n}=1\left[\begin{array}{l}{3} \\\\ {1}\end{array}\right]+0\left[\begin{array}{l}{4} \\\\ {0}\end{array}\right]+2\left[\begin{array}{l}{5} \\\\ {1}\end{array}\right]=\left[\begin{array}{c}{13} \\\\ {3}\end{array}\right]
$$
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Matrix-Matrix product&lt;/strong>
&lt;/p>
$$
\boldsymbol{U} = \boldsymbol{W} \boldsymbol{V}=\left[\begin{array}{lll}{3} &amp; {4} &amp; {5} \\\\ {1} &amp; {0} &amp; {1}\end{array}\right]\left[\begin{array}{ll}{1} &amp; {0} \\\\ {0} &amp; {3} \\\\ {2} &amp; {4}\end{array}\right]=\left[\begin{array}{ll}{3 \cdot 1+4 \cdot 0+5 \cdot 2} &amp; {3 \cdot 0+4 \cdot 3+5 \cdot 4} \\\\ {1 \cdot 1+0 \cdot 0+1 \cdot 2} &amp; {1 \cdot 0+0 \cdot 3+1 \cdot 4}\end{array}\right]=\left[\begin{array}{cc}{13} &amp; {32} \\\\ {3} &amp; {4}\end{array}\right]
$$
&lt;p>
💡 &lt;em>Think of it as: Each column $\boldsymbol{u}\_i = \boldsymbol{W} \boldsymbol{v}\_i$ can be computed by a matrix-vector product&lt;/em>
&lt;/p>
$$
\boldsymbol{W} \underbrace{\left[\boldsymbol{v}\_{1}, \ldots, \boldsymbol{v}\_{n}\right]}\_{\boldsymbol{V}}=[\underbrace{\boldsymbol{W} \boldsymbol{v}\_{1}}_{\boldsymbol{u}\_{1}}, \ldots, \underbrace{\boldsymbol{W} \boldsymbol{v}\_{n}}\_{\boldsymbol{u}\_{n}}]=\boldsymbol{U}
$$
&lt;ul>
&lt;li>
&lt;p>Non-commutative: $\boldsymbol{V} \boldsymbol{W} \neq \boldsymbol{W} \boldsymbol{V}$&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Associative: $\boldsymbol{V}(\boldsymbol{W} \boldsymbol{X})=(\boldsymbol{V} \boldsymbol{W}) \boldsymbol{X}$&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Transpose product:
&lt;/p>
$$
(\boldsymbol{V} \boldsymbol{W}) ^{T}=\boldsymbol{W}^{T} \boldsymbol{V}^{T}
$$
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Matrix inverse&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>scalar
&lt;/p>
$$
w \cdot w^{-1}=1
$$
&lt;/li>
&lt;li>
&lt;p>matrices
&lt;/p>
$$
\boldsymbol{W} \boldsymbol{W}^{-1}=\boldsymbol{I}, \quad \boldsymbol{W}^{-1} \boldsymbol{W}=\boldsymbol{I}
$$
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
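The matrix operations above can likewise be checked with a short NumPy sketch (NumPy is assumed; the example values are the ones from the notes):

```python
import numpy as np

M = np.array([[3, 4, 5],
              [1, 0, 1]])

# Matrix-vector product: the columns of M weighted by the entries of v
v = np.array([1, 0, 2])
u = M @ v                       # [13, 3]

# Matrix-matrix product: each column of U is M times a column of V
V = np.array([[1, 0],
              [0, 3],
              [2, 4]])
U = M @ V                       # [[13, 32], [3, 4]]

# Transpose of a product: (M V)^T = V^T M^T
assert np.array_equal((M @ V).T, V.T @ M.T)

# Matrix inverse (square, invertible matrices only): A A^-1 = I
A = np.array([[2.0, 1.0], [1.0, 1.0]])
assert np.allclose(A @ np.linalg.inv(A), np.eye(2))
```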
&lt;h4 id="important-special-cases">Important Special Cases&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Scalar (Inner) product:&lt;/strong>
&lt;/p>
$$
\langle\boldsymbol{w}, \boldsymbol{v}\rangle = \boldsymbol{w}^{T} \boldsymbol{v}=\left[w\_{1}, \ldots, w\_{n}\right]\left[\begin{array}{c}{v\_{1}} \\\\ {\vdots} \\\\ {v\_{n}}\end{array}\right]=w\_{1} v\_{1}+\cdots+w\_{n} v\_{n}
$$
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Compute row/column averages of matrix&lt;/strong>
&lt;/p>
$$
\boldsymbol{X}=\underbrace{\left[\begin{array}{ccc}{X\_{1,1}} &amp; {\dots} &amp; {X\_{1, m}} \\\\ {\vdots} &amp; {} &amp; {\vdots} \\\\ {X\_{n, 1}} &amp; {\dots} &amp; {X\_{n, m}}\end{array}\right]}\_{n \text { (samples) } \times m \text { (entries) }}
$$
&lt;ul>
&lt;li>
&lt;p>Vector of row averages (average over all entries per sample)
&lt;/p>
$$
\left[\begin{array}{cc}{\frac{1}{m} \sum\_{i=1}^{m} X\_{1, i}} \\\\ {\vdots} &amp; {} \\\\ {\frac{1}{m} \sum_{i=1}^{m} X\_{n, i}}\end{array}\right]=\boldsymbol{X}\left[\begin{array}{c}{\frac{1}{m}} \\\\ {\vdots} \\\\ {\frac{1}{m}}\end{array}\right]=\boldsymbol{X} \boldsymbol{a}, \quad \text { with } \boldsymbol{a}=\left[\begin{array}{c}{\frac{1}{m}} \\\\ {\vdots} \\\\ {\frac{1}{m}}\end{array}\right]
$$
&lt;/li>
&lt;li>
&lt;p>Vector of column averages (average over all samples per entry)
&lt;/p>
$$
\left[\frac{1}{n} \sum_{i=1}^{n} X\_{i, 1}, \ldots, \frac{1}{n} \sum\_{i=1}^{n} X\_{i, m}\right]=\left[\frac{1}{n}, \ldots, \frac{1}{n}\right] \boldsymbol{X}=\boldsymbol{b}^{T} \boldsymbol{X}, \text { with } \boldsymbol{b}=\left[\begin{array}{c}{\frac{1}{n}} \\\\ {\vdots} \\\\ {\frac{1}{n}}\end{array}\right]
$$
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
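The row/column averages above reduce to one matrix-vector product each; a minimal NumPy sketch using the sample matrix from the notes:

```python
import numpy as np

X = np.array([[37, 72, 175],
              [10, 30, 61],
              [25, 65, 121],
              [66, 67, 175]], dtype=float)
n, m = X.shape

# Row averages (average over all entries per sample): X a with a = [1/m, ..., 1/m]^T
a = np.full(m, 1 / m)
row_avg = X @ a

# Column averages (average over all samples per entry): b^T X with b = [1/n, ..., 1/n]^T
b = np.full(n, 1 / n)
col_avg = b @ X

# Same results as NumPy's built-in means
assert np.allclose(row_avg, X.mean(axis=1))
assert np.allclose(col_avg, X.mean(axis=0))
```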
&lt;hr>
&lt;h2 id="calculus">Calculus&lt;/h2>
&lt;ul>
&lt;li>
&lt;blockquote>
&lt;p>“The derivative of a function of a real variable measures &lt;strong>the sensitivity to change of a quantity&lt;/strong> (a function value or dependent variable) which is determined by another quantity (the independent variable)”&lt;/p>
&lt;/blockquote>
&lt;/li>
&lt;/ul>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>&lt;/th>
&lt;th>Scalar&lt;/th>
&lt;th>Vector&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Function&lt;/td>
&lt;td>$f(x)$&lt;/td>
&lt;td>$f(\boldsymbol{x})$&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Derivative&lt;/td>
&lt;td>$\frac{\partial f(x)}{\partial x}=g$&lt;/td>
&lt;td>$\frac{\partial f(\boldsymbol{x})}{\partial \boldsymbol{x}}=\left[\frac{\partial f(\boldsymbol{x})}{\partial x\_{1}}, \ldots, \frac{\partial f(\boldsymbol{x})}{\partial x\_{d}}\right]^{T} =: \nabla f(x)\quad$&lt;br />(👆 gradient of function $f$ at $\boldsymbol{x}$)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Min/Max&lt;/td>
&lt;td>$\frac{\partial f(x)}{\partial x}=0$&lt;/td>
&lt;td>$\frac{\partial f(\boldsymbol{x})}{\partial \boldsymbol{x}}=[0, \ldots, 0]^{T}=\mathbf{0}$&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h3 id="matrix-calculus">Matrix Calculus&lt;/h3>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>&lt;/th>
&lt;th>Scalar&lt;/th>
&lt;th>Vector&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Linear&lt;/td>
&lt;td>$\frac{\partial a x}{\partial x}=a$&lt;/td>
&lt;td>$\nabla\_{\boldsymbol{x}} \boldsymbol{A} \boldsymbol{x}=\boldsymbol{A}^{T}$&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Quadratic&lt;/td>
&lt;td>$\frac{\partial x^{2}}{\partial x}=2 x$&lt;/td>
&lt;td>$\begin{array}{l}{\nabla\_{\boldsymbol{x}} \boldsymbol{x}^{T} \boldsymbol{x}=2 \boldsymbol{x}} \\\\ {\nabla\_{\boldsymbol{x}} \boldsymbol{x}^{T} \boldsymbol{A} \boldsymbol{x}=2 \boldsymbol{A} \boldsymbol{x}}\end{array}$ (for symmetric $\boldsymbol{A}$)&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table></description></item><item><title>End-to-End Machine Learning Project</title><link>https://haobin-tan.netlify.app/docs/ai/machine-learning/ml-fundamentals/e2e-ml-project/</link><pubDate>Mon, 17 Aug 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/machine-learning/ml-fundamentals/e2e-ml-project/</guid><description>&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-100" >&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/e2e_ML_Project.png" alt="e2e_ML_Project" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h2 id="1-look-at-the-big-picture">1. Look at the big picture&lt;/h2>
&lt;h3 id="11-frame-the-problem">1.1 Frame the problem&lt;/h3>
&lt;p>Consider the business objective: How do we expect to use and benefit from this model?&lt;/p>
&lt;h3 id="12-select-a-performance-measure">1.2 Select a performance measure&lt;/h3>
&lt;h3 id="13-check-the-assumptions">1.3 Check the assumptions&lt;/h3>
&lt;p>List and verify the assumptions.&lt;/p>
&lt;h2 id="2-get-the-data">2. Get the data&lt;/h2>
&lt;h3 id="21-download-the-data">2.1 Download the data&lt;/h3>
&lt;p>Automate this process: Create a small function to handle downloading, extracting, and storing data.&lt;/p>
&lt;h3 id="22-take-a-quick-look-at-the-data">2.2 Take a quick look at the data&lt;/h3>
&lt;ul>
&lt;li>Use &lt;code>head()&lt;/code> to look at the top rows of the data&lt;/li>
&lt;li>Use &lt;code>info()&lt;/code> to get a quick description of the data
&lt;ul>
&lt;li>For categorical attributes, use &lt;code>value_counts()&lt;/code> to see categories and the #samples of each category&lt;/li>
&lt;li>For numerical attributes, use &lt;code>describe()&lt;/code> to get a summary of the numerical attributes.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="create-a-test-set">Create a test set&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>If the dataset is large enough, use &lt;strong>purely random sampling&lt;/strong>. (&lt;code>train_test_split&lt;/code>)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>If the test set needs to be representative of the overall data, use &lt;strong>stratified sampling&lt;/strong>.&lt;/p>
&lt;/li>
&lt;/ul>
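Both sampling strategies are available through scikit-learn's <code>train_test_split</code>; a minimal sketch on hypothetical toy data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 samples with a binary class used for stratification
X = np.arange(20).reshape(10, 2)
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

# Purely random sampling (fine when the dataset is large enough)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Stratified sampling: class proportions are preserved in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
```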
&lt;h2 id="3-discover-and-visualize-the-data-to-gain-insights">3. Discover and visualize the data to gain insights&lt;/h2>
&lt;ol>
&lt;li>Make sure to put the test set aside and only explore the training set&lt;/li>
&lt;li>If the training set is very large, sample an exploration set to make manipulations easy and fast&lt;/li>
&lt;/ol>
&lt;h3 id="31-visualizing-data">3.1 Visualizing data&lt;/h3>
&lt;h3 id="32-look-for-correlations">3.2 Look for correlations&lt;/h3>
&lt;p>Two ways:&lt;/p>
&lt;ul>
&lt;li>Compute the &lt;strong>standard correlation coefficient&lt;/strong> (also called &lt;strong>Pearson&amp;rsquo;s r&lt;/strong>) between every pair of attributes using the &lt;code>corr()&lt;/code> method.&lt;/li>
&lt;li>Or use the &lt;code>scatter_matrix&lt;/code> function from &lt;code>pandas.plotting&lt;/code>&lt;/li>
&lt;/ul>
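A minimal sketch of the <code>corr()</code> approach, on a hypothetical toy DataFrame standing in for the exploration set (the column names are made up for illustration):

```python
import pandas as pd

# Hypothetical exploration set
df = pd.DataFrame({
    "age":    [37, 10, 25, 66],
    "weight": [72, 30, 65, 67],
    "height": [175, 61, 121, 175],
})

# Pearson's r between every pair of numeric attributes
corr_matrix = df.corr()

# Correlations of every attribute with one target attribute, strongest first
print(corr_matrix["height"].sort_values(ascending=False))
```

For the visual alternative, `pandas.plotting.scatter_matrix(df)` draws one scatter plot per attribute pair.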
&lt;h3 id="33-experimenting-with-attribute-combinations">3.3 Experimenting with attribute combinations&lt;/h3>
&lt;h2 id="4-prepare-the-data-for-ml-algorithms">4. Prepare the data for ML algorithms&lt;/h2>
&lt;p>&lt;strong>Firstly, ensure a clean training set and separate the predictors and labels.&lt;/strong>&lt;/p>
&lt;h3 id="41-data-cleaning">4.1 Data cleaning&lt;/h3>
&lt;p>Handle missing features:&lt;/p>
&lt;ul>
&lt;li>Get rid of the corresponding samples (districts) -&amp;gt; use &lt;code>dropna()&lt;/code>&lt;/li>
&lt;li>Get rid of the whole attribute -&amp;gt; use &lt;code>drop()&lt;/code>&lt;/li>
&lt;li>Set the values to some value (zero, the mean, the median, etc.) -&amp;gt; use &lt;code>fillna()&lt;/code>&lt;/li>
&lt;/ul>
&lt;p>Or apply &lt;code>SimpleImputer&lt;/code> from Scikit-Learn to all the numerical attributes.&lt;/p>
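The three options and the <code>SimpleImputer</code> alternative can be sketched as follows (the DataFrame is hypothetical toy data):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"age": [37.0, np.nan, 25.0, 66.0],
                   "weight": [72.0, 30.0, np.nan, 67.0]})

# Option 1: get rid of the samples with missing entries
dropped_rows = df.dropna()

# Option 2: get rid of the whole attribute
dropped_col = df.drop(columns=["weight"])

# Option 3: set missing values to some value, e.g. the median
filled = df.fillna(df.median())

# Or apply Scikit-Learn's SimpleImputer to all numerical attributes
imputer = SimpleImputer(strategy="median")
imputed = imputer.fit_transform(df)
```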
&lt;h3 id="42-handle-text-and-categorical-attributes">4.2 Handle text and categorical attributes&lt;/h3>
&lt;p>Most ML algorithms prefer to work with numbers.
Transform text and categorical attributes into numerical attributes, e.g. using one-hot encoding.&lt;/p>
&lt;h3 id="43-custom-transformers">4.3 Custom transformers&lt;/h3>
&lt;p>The custom transformer should work seamlessly with Scikit-Learn functionalities (such as pipelines).
-&amp;gt; Create a class and implement three methods:&lt;/p>
&lt;ul>
&lt;li>&lt;code>fit()&lt;/code>&lt;/li>
&lt;li>&lt;code>transform()&lt;/code>&lt;/li>
&lt;li>&lt;code>fit_transform()&lt;/code> (can get it by simply adding &lt;code>TransformerMixin&lt;/code> as a base class)&lt;/li>
&lt;/ul>
&lt;p>If we also add &lt;code>BaseEstimator&lt;/code> as a base class, we get two extra methods that are useful for automatic hyperparameter tuning:&lt;/p>
&lt;ul>
&lt;li>&lt;code>get_params()&lt;/code>&lt;/li>
&lt;li>&lt;code>set_params()&lt;/code>&lt;/li>
&lt;/ul>
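A minimal custom transformer following this recipe; the class name and the ratio feature it adds are hypothetical, chosen only to illustrate the pattern:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class AddRatioFeature(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: appends the ratio of column i to column j."""

    def __init__(self, i=0, j=1):
        self.i = i
        self.j = j

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        ratio = X[:, self.i] / X[:, self.j]
        return np.c_[X, ratio]

X = np.array([[2.0, 4.0], [9.0, 3.0]])
X_new = AddRatioFeature().fit_transform(X)  # fit_transform comes from TransformerMixin
params = AddRatioFeature().get_params()     # get_params/set_params come from BaseEstimator
```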
&lt;h3 id="44-feature-scaling">4.4 Feature scaling&lt;/h3>
&lt;p>Common ways:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Min-max scaling (normalization)&lt;/strong>: Use &lt;code>MinMaxScaler&lt;/code>&lt;/li>
&lt;li>&lt;strong>Standardization&lt;/strong>
&lt;ul>
&lt;li>Use &lt;code>StandardScaler&lt;/code>&lt;/li>
&lt;li>Less affected by outliers&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
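A minimal sketch contrasting the two scalers on hypothetical data with one outlier:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [2.0], [4.0], [100.0]])  # 100 is an outlier

# Min-max scaling squeezes every value into [0, 1]
mm = MinMaxScaler().fit_transform(X)

# Standardization: zero mean, unit variance; less affected by outliers
std = StandardScaler().fit_transform(X)
```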
&lt;h3 id="45-transformation-pipelines">4.5 Transformation pipelines&lt;/h3>
&lt;p>Group sequences of transformations into one step.&lt;/p>
&lt;p>&lt;code>Pipeline&lt;/code> from &lt;code>scikit-learn&lt;/code>:&lt;/p>
&lt;ul>
&lt;li>a list of name/estimator pairs defining a sequence of steps&lt;/li>
&lt;li>all estimators but the last must be transformers (i.e. they must have a &lt;code>fit_transform()&lt;/code> method)&lt;/li>
&lt;li>names can be anything but must be unique and must not contain double underscores &amp;ldquo;__&amp;rdquo;&lt;/li>
&lt;/ul>
&lt;p>It is more convenient to use a &lt;strong>single&lt;/strong> transformer to handle both the categorical columns and the numerical columns.
-&amp;gt; Use &lt;code>ColumnTransformer&lt;/code>: it handles all columns, applies the appropriate transformations to each column, and also works great with Pandas DataFrames.&lt;/p>
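A minimal sketch of a numerical sub-pipeline combined with scikit-learn's <code>ColumnTransformer</code> (the toy DataFrame and column names are hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer

df = pd.DataFrame({"age": [37.0, np.nan, 25.0],
                   "city": ["a", "b", "a"]})

# Sequence of transformations for the numerical columns, grouped into one step
num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

# One single transformer handling numerical and categorical columns together
full_pipeline = ColumnTransformer([
    ("num", num_pipeline, ["age"]),
    ("cat", OneHotEncoder(), ["city"]),
])

prepared = full_pipeline.fit_transform(df)  # 1 scaled column + 2 one-hot columns
```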
&lt;h2 id="5-select-a-model-and-train-it">5. Select a model and train it&lt;/h2>
&lt;h3 id="51-train-and-evaluate-on-the-trainging-set">5.1 Train and evaluate on the trainging set&lt;/h3>
&lt;h3 id="52-better-evaluation-using-cross-validation">5.2 Better evaluation using Cross-Validation&lt;/h3>
&lt;h2 id="6-fine-tune-the-model">6. Fine-tune the model&lt;/h2>
&lt;h3 id="61-grid-search">6.1 Grid search&lt;/h3>
&lt;p>When exploring &lt;strong>relatively few&lt;/strong> combinations, use &lt;code>GridSearchCV&lt;/code>: Tell it which hyperparameters we want to experiment with, and what values to try out. Then it will evaluate all the possible combinations of hyperparameter values, using cross-validation.&lt;/p>
&lt;h3 id="62-randomized-search">6.2 Randomized search&lt;/h3>
&lt;p>When the hyperparameter search space is &lt;strong>large&lt;/strong>, use &lt;code>RandomizedSearchCV&lt;/code>. It evaluates a given number of random combinations by selecting a random value for each hyperparameter at every iteration.&lt;/p>
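A minimal <code>GridSearchCV</code> sketch; the model, parameter grid, and synthetic data are illustrative choices, not part of the original notes:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=60, n_features=4, random_state=42)

# Hyperparameters to experiment with and the values to try out
param_grid = {"n_estimators": [10, 30], "max_features": [2, 4]}

# Evaluates all 4 combinations with 3-fold cross-validation
grid_search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=3,
    scoring="neg_mean_squared_error",
)
grid_search.fit(X, y)
best = grid_search.best_params_
```

`RandomizedSearchCV` has an almost identical interface; it takes parameter distributions instead of a fixed grid and an `n_iter` budget.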
&lt;h3 id="63-ensemble-methods">6.3 Ensemble methods&lt;/h3>
&lt;p>Try to combine the models that perform best.&lt;/p>
&lt;h3 id="64-analyze-the-best-models-and-their-errors">6.4 Analyze the best models and their errors&lt;/h3>
&lt;p>Gain good insights on the problem by inspecting the best models.&lt;/p>
&lt;h3 id="65-evaluate-the-system-on-the-test-set">6.5 Evaluate the system on the test set&lt;/h3>
&lt;ol>
&lt;li>
&lt;p>Get the predictors and labels from test set&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Run full pipeline to transform the data&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Evaluate the final model on the test set&lt;/p>
&lt;/li>
&lt;/ol>
&lt;h2 id="7-present-the-solution">7. Present the solution&lt;/h2>
&lt;h2 id="8-launch-monitor-and-maintain-the-system">8. Launch, monitor, and maintain the system&lt;/h2>
&lt;ul>
&lt;li>Plug the production input data source into the system and write tests&lt;/li>
&lt;li>Write monitoring code to check system&amp;rsquo;s live performance at regular intervals and trigger callouts when it drops&lt;/li>
&lt;li>Evaluate the system&amp;rsquo;s input data quality&lt;/li>
&lt;li>Train the models on a regular basis using fresh data (automate this process as much as possible!)&lt;/li>
&lt;/ul></description></item><item><title>Evaluation</title><link>https://haobin-tan.netlify.app/docs/ai/machine-learning/ml-fundamentals/evaluation/</link><pubDate>Mon, 17 Aug 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/machine-learning/ml-fundamentals/evaluation/</guid><description>&lt;h2 id="tldr">TL;DR&lt;/h2>
&lt;figure>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/Confusion_Matrix_and_ROC.png"
alt="Confusion matrix, ROC, and AUC">&lt;figcaption>
&lt;p>Confusion matrix, ROC, and AUC&lt;/p>
&lt;/figcaption>
&lt;/figure>
&lt;h2 id="confuse-matrix">Confuse matrix&lt;/h2>
&lt;p>A confusion matrix tells you what your ML algorithm did right and what it did wrong.&lt;/p>
&lt;style type="text/css">
.tg {border-collapse:collapse;border-spacing:0;}
.tg td{font-family:Arial, sans-serif;font-size:14px;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;border-color:black;}
.tg th{font-family:Arial, sans-serif;font-size:14px;font-weight:normal;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;border-color:black;}
.tg .tg-cly1{text-align:left;vertical-align:middle}
.tg .tg-tab6{color:#77b300;text-align:left;vertical-align:top}
.tg .tg-viqs{color:#fe0000;text-align:left;vertical-align:top}
.tg .tg-0lax{text-align:left;vertical-align:top}
.tg .tg-hjor{font-weight:bold;color:#9698ed;text-align:center;vertical-align:middle}
.tg .tg-dsu0{color:#9698ed;text-align:left;vertical-align:top}
.tg .tg-0sd6{font-weight:bold;color:#3399ff;text-align:center;vertical-align:top}
.tg .tg-12v1{color:#3399ff;text-align:left;vertical-align:top}
&lt;/style>
&lt;table class="tg">
&lt;tr>
&lt;th class="tg-0lax" colspan="2" rowspan="2">&lt;/th>
&lt;th class="tg-hjor" colspan="2">Known Truth&lt;/th>
&lt;th class="tg-cly1" rowspan="2">&lt;/th>
&lt;/tr>
&lt;tr>
&lt;td class="tg-dsu0">Positive&lt;/td>
&lt;td class="tg-dsu0">Negative&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td class="tg-0sd6" rowspan="2">&lt;br>Prediction&lt;/td>
&lt;td class="tg-12v1">Positive&lt;/td>
&lt;td class="tg-tab6">True Positive (TP)&lt;/td>
&lt;td class="tg-viqs">False Positive (FP)&lt;/td>
&lt;td class="tg-0lax">Precision = TP / (TP+FP)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td class="tg-12v1">Negative&lt;/td>
&lt;td class="tg-viqs">False Negative (FN)&lt;/td>
&lt;td class="tg-tab6">True Negative (TN)&lt;/td>
&lt;td class="tg-0lax" rowspan="2">&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td class="tg-0lax" colspan="2">&lt;/td>
&lt;td class="tg-0lax">TPR = Sensitivity = Recall &lt;br> = TP / (TP + FN)&lt;/td>
&lt;td class="tg-0lax">Specificity = TN / (FP+TN) &lt;br> FPR = FP / (FP + TN) = 1 - Specificity &lt;/td>
&lt;/tr>
&lt;/table>
&lt;ul>
&lt;li>Row: Prediction&lt;/li>
&lt;li>Column: Known truth&lt;/li>
&lt;/ul>
&lt;p>Each cell:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Positive/negative: refers to the prediction&lt;/p>
&lt;/li>
&lt;li>
&lt;p>True/False: Whether the prediction matches the truth&lt;/p>
&lt;/li>
&lt;li>
&lt;p>The numbers along the diagonal (green) tell us how many times the samples were correctly classified&lt;/p>
&lt;/li>
&lt;li>
&lt;p>The numbers not on the diagonal (red) are samples the algorithm messed up.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h2 id="definition">Definition&lt;/h2>
&lt;h3 id="precision">&lt;strong>Precision&lt;/strong>&lt;/h3>
&lt;p>How many selected items are relevant?
&lt;/p>
$$
\text{ Precision } = \frac{TP}{TP + FP}
=\frac{\\# \text{ relevant item retrieved }}{\\# \text{ of items retrieved }}
$$
&lt;h3 id="recall--true-positive-rate-tpr--sensitivity">&lt;strong>Recall / True Positive Rate (TPR) / Sensitivity&lt;/strong>&lt;/h3>
&lt;p>How many relevant items are selected?
&lt;/p>
$$
\text { Recall } = \frac{TP}{TP + FN}
=\frac{\\# \text { relevant item retrieved }}{\\# \text { of relevant items in collection }}
$$
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/350px-Precisionrecall.svg.png" alt="img">&lt;/p>
&lt;details>
&lt;summary>Example&lt;/summary>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-09-15%2011.51.38.png" alt="截屏2020-09-15 11.51.38" style="zoom: 33%;" />
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-09-15%2011.51.43.png" alt="截屏2020-09-15 11.51.43" style="zoom:33%;" />
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-09-15%2011.51.46.png" alt="截屏2020-09-15 11.51.46" style="zoom:33%;" />
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-09-15%2011.51.49.png" alt="截屏2020-09-15 11.51.49" style="zoom:33%;" />
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-09-15%2011.51.52.png" alt="截屏2020-09-15 11.51.52" style="zoom:33%;" />
&lt;/details>
&lt;h3 id="f-score--f-measure">&lt;strong>F-score / F-measure&lt;/strong>&lt;/h3>
&lt;h4 id="f_1-score">$F\_1$ score&lt;/h4>
&lt;p>The traditional F-measure or balanced F-score (&lt;strong>$F\_1$ score&lt;/strong>) is the &lt;a href="https://en.wikipedia.org/wiki/Harmonic_mean#Harmonic_mean_of_two_numbers">harmonic mean&lt;/a> of precision and recall:
&lt;/p>
$$
F\_1=\frac{2 \cdot \text {precison} \cdot \text {recall}}{\text {precision}+\text {recall}} = \frac{2TP}{2TP + FP + FN}
$$
&lt;h4 id="f_beta-score">$F\_\beta$ score&lt;/h4>
&lt;p>$F\_\beta$ uses a positive real factor $\beta$, where $\beta$ is chosen such that &lt;strong>recall is considered $\beta$ times as important as precision&lt;/strong>
&lt;/p>
$$
F\_{\beta}=\left(1+\beta^{2}\right) \cdot \frac{\text { precision } \cdot \text { recall }}{\left(\beta^{2} \cdot \text { precision }\right)+\text { recall }}
$$
&lt;p>
Two commonly used values for $\beta$:&lt;/p>
&lt;ul>
&lt;li>$2$: weighs recall &lt;strong>higher&lt;/strong> than precision&lt;/li>
&lt;li>$0.5$: weighs recall &lt;strong>lower&lt;/strong> than precision&lt;/li>
&lt;/ul>
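The definitions above translate directly into code; a small sketch with hypothetical confusion-matrix counts (the numbers are made up for illustration):

```python
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f_beta(tp, fp, fn, beta=1.0):
    # Recall is considered beta times as important as precision
    p, r = precision(tp, fp), recall(tp, fn)
    return (1 + beta**2) * p * r / (beta**2 * p + r)

# Hypothetical counts
tp, fp, fn = 8, 2, 4

p = precision(tp, fp)           # 0.8
r = recall(tp, fn)              # 8/12
f1 = f_beta(tp, fp, fn)         # harmonic mean, equal to 2*TP/(2*TP + FP + FN)
f2 = f_beta(tp, fp, fn, beta=2.0)  # weighs recall higher than precision
```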
&lt;h3 id="specificity">Specificity&lt;/h3>
$$
\text{Specificity} = \frac{TN}{FP + TN}
$$
&lt;h3 id="false-positive-rate-fpr">False Positive Rate (FPR)&lt;/h3>
$$
\text{FPR} = \frac{FP}{FP + TN} \left(= 1- \frac{TN}{FP + TN} = 1- \text{Specificity}\right)
$$
&lt;h2 id="relation-between-sensitivity-specificity-fpr-and-threshold">Relation between Sensitivity, Specificity, FPR and Threshold&lt;/h2>
&lt;p>Assuming that the distributions of the actual positive and negative classes look like this:&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/evaluation-metrics-Page-1.png" alt="evaluation-metrics-Page-1" style="zoom:67%;" />
&lt;p>Assume we have already defined a threshold: whatever is greater than the threshold is predicted as positive, and whatever is smaller is predicted as negative.&lt;/p>
&lt;p>If we set a lower threshold, we&amp;rsquo;ll get the following diagram:&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/evaluation-metrics-2.png" alt="evaluation-metrics-2" style="zoom:67%;" />
&lt;p>We can notice that FP ⬆️ , and FN ⬇️ .&lt;/p>
&lt;p>Therefore, we have the relationship:&lt;/p>
&lt;ul>
&lt;li>Threshold ⬇️
&lt;ul>
&lt;li>FP ⬆️ , FN ⬇️&lt;/li>
&lt;li>$\text{Sensitivity} (= TPR) = \frac{TP}{TP + FN}$ ⬆️ , $\text{Specificity} = \frac{TN}{TN + FP}$ ⬇️&lt;/li>
&lt;li>$FPR (= 1 - \text{Specificity})$⬆️&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>And vice versa&lt;/li>
&lt;/ul>
&lt;h2 id="auc-roc-curve">AUC-ROC curve&lt;/h2>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/evaluation-metrics-ROC-AUC.png" alt="evaluation-metrics-ROC-AUC" style="zoom:80%;" />
&lt;p>AUC (&lt;strong>Area Under The Curve&lt;/strong>)-ROC (&lt;strong>Receiver Operating Characteristics&lt;/strong>) curve&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Performance measurement for the classification problems at various threshold settings.&lt;/p>
&lt;ul>
&lt;li>ROC is a curve of TPR against FPR at different thresholds&lt;/li>
&lt;li>AUC represents the degree or measure of separability&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Tells how much the model is capable of distinguishing between classes.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>The higher the AUC, the better the model is at predicting 0s as 0s and 1s as 1s&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h3 id="how-is-roc-plotted">How is ROC plotted?&lt;/h3>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="k">for&lt;/span> &lt;span class="n">threshold&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="n">thresholds&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="c1"># iterate over all thresholds&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">TPR&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">FPR&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">classify&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">threshold&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="c1"># calculate TPR and FPR based on threshold&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">plot_point&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">FPR&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">TPR&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="c1"># plot coordinate (FPR, TPR) in the diagram&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">connect_points&lt;/span>&lt;span class="p">()&lt;/span> &lt;span class="c1"># connect all plotted points to get ROC curve&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Example:&lt;/p>
&lt;p>Suppose that the probability of a series of samples being classified into positive classes has been derived and we sort them descendingly:&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-24%2022.05.59.png" alt="截屏2021-02-24 22.05.59" style="zoom: 50%;" />
&lt;ul>
&lt;li>Class: actual label of test sample&lt;/li>
&lt;li>Score: probability of classifying test sample as positive&lt;/li>
&lt;/ul>
&lt;p>Next, we use the &amp;ldquo;Score&amp;rdquo; value as the threshold (from high to low).&lt;/p>
&lt;ul>
&lt;li>
&lt;p>When the probability that the test sample is a positive sample is greater than or equal to this threshold, we consider it a positive sample, otherwise it is a negative sample.&lt;/p>
&lt;ul>
&lt;li>For example, for the 4th sample, the &amp;ldquo;Score&amp;rdquo; value is 0.6. So Samples 1, 2, 3, and 4 are considered positive, because their &amp;ldquo;Score&amp;rdquo; values are $\geq 0.6$. The remaining samples are classified as negative.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>By picking a different threshold each time, we obtain one (FPR, TPR) pair, i.e., one point on the ROC curve. In this way, we get 20 pairs in total and plot them in the diagram:&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/081955100088586.jpg" alt="img" style="zoom:80%;" />
&lt;/li>
&lt;/ul>
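&lt;p>The threshold sweep described above can be sketched as runnable Python. The toy labels and scores below are my own (not the 20 samples from the table), and &lt;code>tpr_fpr&lt;/code> is a hypothetical helper:&lt;/p>

```python
# Toy data (assumed for illustration): actual labels and P(positive),
# already sorted by score in descending order.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]

def tpr_fpr(threshold):
    # A sample is classified positive when its score is at least the threshold.
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    pos = sum(labels)
    neg = len(labels) - pos
    return tp / pos, fp / neg

# One (TPR, FPR) pair per threshold, i.e. one ROC point each.
roc_points = [tpr_fpr(t) for t in scores]
```

&lt;p>Connecting these points in the (FPR, TPR) plane yields the ROC curve.&lt;/p>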
&lt;h3 id="how-to-speculate-about-the-performance-of-the-model">How to speculate about the performance of the model?&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>An &lt;strong>excellent&lt;/strong> model has &lt;strong>AUC near 1&lt;/strong>, which means it has a good measure of separability.&lt;/p>
&lt;figure>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-24%2021.02.34.png"
alt="Ideal situation: two curves don’t overlap at all means model has an ideal measure of separability. It is perfectly able to distinguish between positive class and negative class.">&lt;figcaption>
&lt;p>Ideal situation: two curves don’t overlap at all means model has an ideal measure of separability. It is perfectly able to distinguish between positive class and negative class.&lt;/p>
&lt;/figcaption>
&lt;/figure>
&lt;/li>
&lt;li>
&lt;p>When $0.5 &lt; \text{AUC} &lt; 1$, there is a high chance that the classifier will be able to distinguish positive class values from negative class values, because it detects more true positives and true negatives than false negatives and false positives.&lt;/p>
&lt;figure>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-24%2021.09.30.png"
alt="When AUC is 0.7, it means there is a 70% chance that the model will be able to distinguish between positive class and negative class.">&lt;figcaption>
&lt;p>When AUC is 0.7, it means there is a 70% chance that the model will be able to distinguish between positive class and negative class.&lt;/p>
&lt;/figcaption>
&lt;/figure>
&lt;/li>
&lt;li>
&lt;p>When AUC is 0.5, it means the model has no class separation capacity whatsoever.&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-24%2021.05.28.png" alt="截屏2021-02-24 21.05.28">&lt;/p>
&lt;/li>
&lt;li>
&lt;p>A &lt;strong>poor&lt;/strong> model has &lt;strong>AUC near 0&lt;/strong>, which means it has the worst measure of separability.&lt;/p>
&lt;figure>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-24%2021.05.54.png"
alt="When AUC is approximately 0, the model is actually swapping the classes: it predicts the negative class as positive and vice versa.">&lt;figcaption>
&lt;p>When AUC is approximately 0, the model is actually swapping the classes: it predicts the negative class as positive and vice versa.&lt;/p>
&lt;/figcaption>
&lt;/figure>
&lt;/li>
&lt;/ul>
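&lt;p>As a numerical companion to these cases, the AUC of a piecewise-linear ROC curve can be approximated with the trapezoidal rule. This is a sketch with hand-picked (FPR, TPR) points, not output from a real model:&lt;/p>

```python
def auc(fpr, tpr):
    # Trapezoidal rule over ROC points sorted by increasing FPR.
    area = 0.0
    for i in range(1, len(fpr)):
        width = fpr[i] - fpr[i - 1]
        area += width * (tpr[i] + tpr[i - 1]) / 2.0
    return area

perfect = auc([0.0, 0.0, 1.0], [0.0, 1.0, 1.0])  # ideal separability: AUC = 1
random_ = auc([0.0, 1.0], [0.0, 1.0])            # diagonal: AUC = 0.5
```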
&lt;h2 id="-video-tutorials">🎥 Video tutorials&lt;/h2>
&lt;h3 id="the-confusion-matrix">The confusion matrix&lt;/h3>
&lt;div style="position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;">
&lt;iframe allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen="allowfullscreen" loading="eager" referrerpolicy="strict-origin-when-cross-origin" src="https://www.youtube.com/embed/Kdsp6soqA7o?autoplay=0&amp;controls=1&amp;end=0&amp;loop=0&amp;mute=0&amp;start=0" style="position: absolute; top: 0; left: 0; width: 100%; height: 100%; border:0;" title="YouTube video"
>&lt;/iframe>
&lt;/div>
&lt;h3 id="sensitivity-and-specificity">Sensitivity and specificity&lt;/h3>
&lt;div style="position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;">
&lt;iframe allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen="allowfullscreen" loading="eager" referrerpolicy="strict-origin-when-cross-origin" src="https://www.youtube.com/embed/vP06aMoz4v8?autoplay=0&amp;controls=1&amp;end=0&amp;loop=0&amp;mute=0&amp;start=0" style="position: absolute; top: 0; left: 0; width: 100%; height: 100%; border:0;" title="YouTube video"
>&lt;/iframe>
&lt;/div>
&lt;h3 id="roc-and-auc">ROC and AUC&lt;/h3>
&lt;div style="position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;">
&lt;iframe allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen="allowfullscreen" loading="eager" referrerpolicy="strict-origin-when-cross-origin" src="https://www.youtube.com/embed/4jRBRDbJemM?autoplay=0&amp;controls=1&amp;end=0&amp;loop=0&amp;mute=0&amp;start=0" style="position: absolute; top: 0; left: 0; width: 100%; height: 100%; border:0;" title="YouTube video"
>&lt;/iframe>
&lt;/div>
&lt;h2 id="reference">Reference&lt;/h2>
&lt;ul>
&lt;li>&lt;a href="https://towardsdatascience.com/understanding-auc-roc-curve-68b2303cc9c5">Understanding AUC - ROC Curve&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://deepai.org/machine-learning-glossary-and-terms/f-score">What is the F-score?&lt;/a>: very nice explanation with examples&lt;/li>
&lt;li>&lt;a href="http://www.cnblogs.com/dlml/p/4403482.html">机器学习之分类器性能指标之ROC曲线、AUC值&lt;/a>&lt;/li>
&lt;/ul></description></item><item><title>Overview of Machine Learning Algorithms</title><link>https://haobin-tan.netlify.app/docs/ai/machine-learning/ml-fundamentals/ml-algo-overview/</link><pubDate>Mon, 17 Aug 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/machine-learning/ml-fundamentals/ml-algo-overview/</guid><description>&lt;h2 id="supervisedunsupervised-learning">Supervised/Unsupervised Learning&lt;/h2>
&lt;h3 id="supervised-learning">Supervised learning&lt;/h3>
&lt;p>The training data you feed to the algorithm &lt;strong>includes&lt;/strong> the desired solutions, called &lt;strong>labels&lt;/strong>&lt;/p>
&lt;p>Typical task:&lt;/p>
&lt;ul>
&lt;li>Classification&lt;/li>
&lt;li>Regression&lt;/li>
&lt;/ul>
&lt;p>Important supervised learning algo:&lt;/p>
&lt;ul>
&lt;li>k-Nearest Neighbors&lt;/li>
&lt;li>Linear Regression&lt;/li>
&lt;li>Logistic Regression&lt;/li>
&lt;li>Support Vector Machine (SVM)&lt;/li>
&lt;li>Decision Trees and Random Forests&lt;/li>
&lt;li>Neural Networks&lt;/li>
&lt;/ul>
&lt;h3 id="unsupervised-learning">Unsupervised learning&lt;/h3>
&lt;p>Training data is &lt;strong>unlabeled&lt;/strong>.&lt;/p>
&lt;p>Important unsupervised learning algo:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Clustering&lt;/p>
&lt;ul>
&lt;li>K-Means&lt;/li>
&lt;li>DBSCAN&lt;/li>
&lt;li>Hierarchical Cluster Analysis (HCA)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Anomaly detection and novelty detection&lt;/p>
&lt;ul>
&lt;li>One-class SVM&lt;/li>
&lt;li>Isolation Forest&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Visualization and dimensionality reduction&lt;/p>
&lt;ul>
&lt;li>Principal Component Analysis (PCA)&lt;/li>
&lt;li>Kernel PCA&lt;/li>
&lt;li>Locally-Linear Embedding (LLE)&lt;/li>
&lt;li>t-distributed Stochastic Neighbor Embedding (t-SNE)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Association rule learning&lt;/p>
&lt;ul>
&lt;li>Apriori&lt;/li>
&lt;li>Eclat&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="semisupervised-learning-supervised--unsupervised">Semisupervised learning (supervised + unsupervised)&lt;/h3>
&lt;p>Deal with partially labeled training data, usually a lot of unlabeled data and a little bit of labeled data&lt;/p>
&lt;h3 id="reinforcement-learning">Reinforcement Learning&lt;/h3>
&lt;p>The learning system, called an &lt;strong>agent&lt;/strong> in this context, can observe the environment, select and perform actions, and get rewards in return or penalties in the form of negative rewards.&lt;/p>
&lt;p>It must then learn by itself what is the best strategy, called a &lt;strong>policy&lt;/strong>, to get the most reward over time.&lt;/p>
&lt;p>A policy defines what action the agent should choose when it is in a given situation.&lt;/p>
&lt;h2 id="batch-and-online-learning">Batch and Online Learning&lt;/h2>
&lt;p>This categorization depends on whether or not the system can learn incrementally from a stream of incoming data.&lt;/p>
&lt;h3 id="batch-learning">Batch Learning&lt;/h3>
&lt;p>The system must be trained using all the available data (i.e., it is incapable of learning incrementally)&lt;/p>
&lt;p>First the system is trained, and then it is launched into production and runs without learning anymore; it just applies what it has learned. This is called &lt;strong>offline learning&lt;/strong>.&lt;/p>
&lt;p>Want a batch learning system to know about new data?&lt;/p>
&lt;p>Need to train a new version of the system from scratch on the full dataset (not just the new data, but also the old data). Then stop the old system and replace it with the new one.&lt;/p>
&lt;h3 id="online-learning">Online Learning&lt;/h3>
&lt;p>Train the system &lt;strong>incrementally&lt;/strong> by feeding it data instances sequentially, either individually or by small groups called &lt;strong>mini-batches&lt;/strong>.&lt;/p>
&lt;p>Each learning step is fast and cheap, so the system can learn about new data on the fly, as it arrives.&lt;/p>
&lt;p>👍 Advantages:&lt;/p>
&lt;ul>
&lt;li>Great for systems that receive data as a continuous flow and need to adapt to change rapidly or autonomously&lt;/li>
&lt;li>Saves a huge amount of space (after learning from new data instances, the system no longer needs them and can discard them)&lt;/li>
&lt;/ul>
&lt;p>😠 Challenge: if bad data is fed to the system, the system&amp;rsquo;s performance will gradually decline.&lt;/p>
&lt;p>🔧 Solution:&lt;/p>
&lt;ul>
&lt;li>monitor the system closely&lt;/li>
&lt;li>promptly switch learning off if detect a drop in performance&lt;/li>
&lt;li>monitor the input data and react to abnormal data&lt;/li>
&lt;/ul>
&lt;h2 id="instance-based-vs-model-based-learning">Instance-Based Vs. Model-Based Learning&lt;/h2>
&lt;h3 id="instance-based-learning">Instance-based learning&lt;/h3>
&lt;p>The system learns the examples by heart, then generalizes to new cases by comparing them to the learned examples (or a subset of them), using a similarity measure&lt;/p>
&lt;h3 id="model-based-learning">Model-based learning&lt;/h3>
&lt;p>Build a model of these examples, then use that model to make predictions&lt;/p></description></item><item><title>Model Selection</title><link>https://haobin-tan.netlify.app/docs/ai/machine-learning/model-selection/</link><pubDate>Mon, 07 Sep 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/machine-learning/model-selection/</guid><description/></item><item><title>Objective Function</title><link>https://haobin-tan.netlify.app/docs/ai/machine-learning/model-selection/objective-function/</link><pubDate>Mon, 06 Jul 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/machine-learning/model-selection/objective-function/</guid><description>&lt;h2 id="how-does-the-objective-function-look-like">How does the objective function look like?&lt;/h2>
&lt;p>Objective function:&lt;/p>
$$
\operatorname{Obj}(\Theta)= \overbrace{L(\Theta)}^{\text {Training Loss}} + \underbrace{\Omega(\Theta)}_{\text{Regularization}}
$$
&lt;ul>
&lt;li>
&lt;p>Training loss: measures how well the model fits the training data
&lt;/p>
$$
L=\sum_{i=1}^{n} l\left(y_{i}, \hat{y}_{i}\right)
$$
&lt;ul>
&lt;li>Square loss:
$$
l(y_i, \hat{y}_i) = (y_i - \hat{y}_i)^2
$$&lt;/li>
&lt;li>Logistic loss:
$$
l(y_i, \hat{y}_i) = y_i \log(1 + e^{-\hat{y}_i}) + (1 - y_i) \log(1 + e^{\hat{y}_i})
$$&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Regularization: How complicated is the model?&lt;/p>
&lt;ul>
&lt;li>$L_2$ norm (Ridge): $\Omega(w) = \lambda \|w\|^2$&lt;/li>
&lt;li>$L_1$ norm (Lasso): $\Omega(w) = \lambda \|w\|$&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
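&lt;p>A minimal sketch of such an objective, assuming square loss plus an $L_2$ penalty (the function name and toy data below are mine):&lt;/p>

```python
import numpy as np

def objective(w, X, y, lam):
    # Obj(w) = training loss + regularization
    residual = y - X @ w
    train_loss = np.sum(residual ** 2)  # square loss L(w)
    penalty = lam * np.sum(w ** 2)      # L2 regularization Omega(w)
    return train_loss + penalty
```

&lt;p>With $\lambda = 0$ this reduces to ordinary least squares; a larger $\lambda$ trades training fit for a simpler model.&lt;/p>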
&lt;style type="text/css">
.tg {border-collapse:collapse;border-spacing:0;}
.tg td{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
overflow:hidden;padding:10px 5px;word-break:normal;}
.tg th{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
font-weight:normal;overflow:hidden;padding:10px 5px;word-break:normal;}
.tg .tg-0pky{border-color:inherit;text-align:left;vertical-align:top}
.tg .tg-fymr{border-color:inherit;font-weight:bold;text-align:left;vertical-align:top}
&lt;/style>
&lt;table class="tg">
&lt;thead>
&lt;tr>
&lt;th class="tg-0pky">&lt;/th>
&lt;th class="tg-fymr">Objective Function&lt;/th>
&lt;th class="tg-fymr">Linear model?&lt;/th>
&lt;th class="tg-fymr">Loss&lt;/th>
&lt;th class="tg-fymr">Regularization&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td class="tg-fymr">Ridge regression&lt;/td>
&lt;td class="tg-0pky">$\sum_{i=1}^{n}\left(y_{i}-w^{\top} x_{i}\right)^{2}+\lambda\|w\|^{2}$&lt;/td>
&lt;td class="tg-0pky">✅&lt;/td>
&lt;td class="tg-0pky">square&lt;/td>
&lt;td class="tg-0pky">$L_2$&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td class="tg-fymr">Lasso regression&lt;/td>
&lt;td class="tg-0pky">$\sum_{i=1}^{n}\left(y_{i}-w^{\top} x_{i}\right)^{2}+\lambda\|w\|$&lt;/td>
&lt;td class="tg-0pky">✅&lt;/td>
&lt;td class="tg-0pky">square&lt;/td>
&lt;td class="tg-0pky">$L_1$&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td class="tg-fymr">Logistic regression&lt;/td>
&lt;td class="tg-0pky">$\sum_{i=1}^{n}\left[y_{i} \cdot \ln \left(1+e^{-w^{\top} x_{i}}\right)+\left(1-y_{i}\right) \cdot \ln \left(1+e^{w^{\top} x_{i}}\right)\right]+\lambda\|w\|^{2}$&lt;/td>
&lt;td class="tg-0pky">✅&lt;/td>
&lt;td class="tg-0pky">logistic&lt;/td>
&lt;td class="tg-0pky">$L_2$&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h2 id="why-do-we-want-to-contain-two-component-in-the-objective">Why do we want to contain two component in the objective?&lt;/h2>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Optimizing training loss encourages predictive models&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;em>Fitting the training data well brings the model close to the training distribution, which is hopefully close to the underlying distribution&lt;/em>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Optimizing regularization encourages simple models&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;em>Simpler models tend to have smaller variance in future predictions, making predictions stable&lt;/em>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul></description></item><item><title>Regression</title><link>https://haobin-tan.netlify.app/docs/ai/machine-learning/regression/</link><pubDate>Mon, 07 Sep 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/machine-learning/regression/</guid><description/></item><item><title>Machine Learning (ML)</title><link>https://haobin-tan.netlify.app/docs/ai/machine-learning/</link><pubDate>Mon, 07 Sep 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/machine-learning/</guid><description>&lt;!-- TODO: Add `list_children` shortcode --></description></item><item><title>Classification</title><link>https://haobin-tan.netlify.app/docs/ai/machine-learning/classification/</link><pubDate>Mon, 07 Sep 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/machine-learning/classification/</guid><description/></item><item><title>Logistic Regression: Basics</title><link>https://haobin-tan.netlify.app/docs/ai/machine-learning/classification/logistic-regression/logistic-regression/</link><pubDate>Mon, 13 Jul 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/machine-learning/classification/logistic-regression/logistic-regression/</guid><description>&lt;p>💡 &lt;strong>Use regression algorithm for classification&lt;/strong>&lt;/p>
&lt;p>Logistic regression: &lt;strong>estimate the probability that an instance belongs to a particular class&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>If the estimated probability is &lt;strong>greater than 50%&lt;/strong>, then the model predicts that the instance belongs to that class (called the &lt;strong>positive&lt;/strong> class, labeled “1”),&lt;/li>
&lt;li>or else it predicts that it does not (i.e., it belongs to the &lt;strong>negative&lt;/strong> class, labeled “0”).&lt;/li>
&lt;/ul>
&lt;p>This makes it a &lt;strong>binary&lt;/strong> classifier.&lt;/p>
&lt;h2 id="logistic--sigmoid-function">Logistic / Sigmoid function&lt;/h2>
&lt;img src="https://upload.wikimedia.org/wikipedia/commons/5/53/Sigmoid-function-2.svg" style="zoom:60%; background-color:white">
&lt;p>$\sigma(t)=\frac{1}{1+\exp (-t)}$&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Bounded: $\sigma(t) \in (0, 1)$&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Symmetric: $1 - \sigma(t) = \sigma(-t)$&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Derivative: $\sigma^{\prime}(t)=\sigma(t)(1-\sigma(t))$&lt;/p>
&lt;/li>
&lt;/ul>
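&lt;p>These identities are easy to check numerically (a quick sketch; the variable names are mine):&lt;/p>

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

t = 1.7
symmetry = 1.0 - sigmoid(t)                       # should equal sigmoid(-t)
h = 1e-6
numeric_deriv = (sigmoid(t + h) - sigmoid(t - h)) / (2 * h)
identity_deriv = sigmoid(t) * (1.0 - sigmoid(t))  # sigma'(t) = sigma(t)(1 - sigma(t))
```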
&lt;h2 id="estimating-probabilities-and-making-prediction">Estimating probabilities and making prediction&lt;/h2>
&lt;ol>
&lt;li>
&lt;p>Computes a weighted sum of the input features (plus a bias term)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Outputs the logistic of this result&lt;/p>
&lt;p>$\hat{p}=h_{\theta}(\mathbf{x})=\sigma\left(\mathbf{x}^{\mathrm{T}} \boldsymbol{\theta}\right)$&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Prediction:&lt;/p>
$$
\hat{y} = \begin{cases} 0 &amp; \text{ if } \hat{p}&lt;0.5\left(\Leftrightarrow \mathbf{x}^{\mathrm{T}} \boldsymbol{\theta}&lt;0\right) \\\\
1 &amp; \text{ if }\hat{p} \geq 0.5\left(\Leftrightarrow \mathbf{x}^{\mathrm{T}} \boldsymbol{\theta} \geq 0\right)\end{cases}
$$
&lt;/li>
&lt;/ol>
&lt;h2 id="train-and-cost-function">Train and cost function&lt;/h2>
&lt;p>Objective of training: to set the parameter vector $\boldsymbol{\theta}$ so that the model estimates:&lt;/p>
&lt;ul>
&lt;li>high probabilities ($\geq 0.5$) for positive instances ($y=1$)&lt;/li>
&lt;li>low probabilities ($&lt; 0.5$) for negative instances ($y=0$)&lt;/li>
&lt;/ul>
&lt;h3 id="cost-function-of-a-single-training-instance">Cost function of a single training instance:&lt;/h3>
$$
c(\boldsymbol{\theta}) = \begin{cases} -\log (\hat{p}) &amp; \text{ if } y=1 \\\\
-\log (1-\hat{p}) &amp; \text{ if } y=0\end{cases}
$$
&lt;blockquote>
&lt;img src="https://miro.medium.com/max/1621/1*_NeTem-yeZ8Pr9cVUoi_HA.png" style="zoom:30%; background-color:white">
&lt;ul>
&lt;li>Actual label: $y=1$, misclassification: $\hat{y} = 0 \Leftrightarrow$ $\hat{p} = h_{\boldsymbol{\theta}}(\mathbf{x})$ close to 0 $\Leftrightarrow c(\boldsymbol{\theta})$ large&lt;/li>
&lt;li>Actual label: $y=0$, misclassification: $\hat{y} = 1 \Leftrightarrow$ $\hat{p} = h_{\boldsymbol{\theta}}(\mathbf{x})$ close to 1 $\Leftrightarrow c(\boldsymbol{\theta})$ large&lt;/li>
&lt;/ul>
&lt;/blockquote>
&lt;h3 id="the-cost-function-over-the-whole-training-set">The cost function over the whole training set&lt;/h3>
&lt;p>Simply the average cost over all training instances (Combining the expressions of two different cases above into one single expression):&lt;/p>
&lt;p>$\begin{aligned} J(\boldsymbol{\theta}) &amp;=-\frac{1}{m} \sum_{i=1}^{m}\left[y^{(i)} \log \left(\hat{p}^{(i)}\right)+\left(1-y^{(i)}\right) \log \left(1-\hat{p}^{(i)}\right)\right] \\\\ &amp;=\frac{1}{m} \sum_{i=1}^{m}\left[-y^{(i)} \log \left(\hat{p}^{(i)}\right)-\left(1-y^{(i)}\right) \log \left(1-\hat{p}^{(i)}\right)\right] \end{aligned}$&lt;/p>
&lt;blockquote>
&lt;ul>
&lt;li>$y^{(i)} =1:-y^{(i)} \log \left(\hat{p}^{(i)}\right)-\left(1-y^{(i)}\right) \log \left(1-\hat{p}^{(i)}\right)=-\log \left(\hat{p}^{(i)}\right)$&lt;/li>
&lt;li>$y^{(i)} =0:-y^{(i)} \log \left(\hat{p}^{(i)}\right)-\left(1-y^{(i)}\right) \log \left(1-\hat{p}^{(i)}\right)=-\log \left(1-\hat{p}^{(i)}\right)$
(Exactly the same as $c(\boldsymbol{\theta})$ for a single instance above 👏)&lt;/li>
&lt;/ul>
&lt;/blockquote>
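&lt;p>The averaged cost takes only a few lines of numpy. The labels and predicted probabilities below are toy values of my own:&lt;/p>

```python
import numpy as np

def cross_entropy_cost(y, p_hat):
    # J(theta) averaged over all instances, given labels y and
    # predicted probabilities p_hat.
    y, p_hat = np.asarray(y, dtype=float), np.asarray(p_hat, dtype=float)
    return -np.mean(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))

confident_right = cross_entropy_cost([1, 0], [0.9, 0.1])  # low cost
confident_wrong = cross_entropy_cost([1, 0], [0.1, 0.9])  # high cost
```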
&lt;h3 id="training">Training&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>No closed-form equation 🤪&lt;/p>
&lt;/li>
&lt;li>
&lt;p>But it is convex so Gradient Descent (or any other optimization algorithm) is guaranteed to find the global minimum&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Partial derivatives of the cost function with regards to the $j$-th model parameter $\theta_j$:&lt;/p>
$$
\frac{\partial}{\partial \theta_{j}} J(\boldsymbol{\theta})=\frac{1}{m} \displaystyle \sum_{i=1}^{m}\left(\sigma\left(\boldsymbol{\theta}^{T} \mathbf{x}^{(i)}\right)-y^{(i)}\right) x_{j}^{(i)}
$$
&lt;/li>
&lt;/ul></description></item><item><title>Logistic Regression: Probabilistic view</title><link>https://haobin-tan.netlify.app/docs/ai/machine-learning/classification/logistic-regression/logistic-regression-in-probabilistic-view/</link><pubDate>Mon, 13 Jul 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/machine-learning/classification/logistic-regression/logistic-regression-in-probabilistic-view/</guid><description>&lt;p>Class label:&lt;/p>
$$
y_i \in \\{0, 1\\}
$$
&lt;p>Conditional probability distribution of the class label is&lt;/p>
$$
\begin{aligned}
p(y=1|\boldsymbol{x}) &amp;= \sigma(\boldsymbol{w}^T\boldsymbol{x}+b) \\\\
p(y=0|\boldsymbol{x}) &amp;= 1 - \sigma(\boldsymbol{w}^T\boldsymbol{x}+b)
\end{aligned}
$$
&lt;p>
with&lt;/p>
$$
\sigma(x) = \frac{1}{1+\operatorname{exp}(-x)}
$$
&lt;p>This is a &lt;strong>conditional Bernoulli distribution&lt;/strong>. Therefore, the probability can be represented as&lt;/p>
$$
\begin{array}{ll}
p(y|\boldsymbol{x}) &amp;= p(y=1|\boldsymbol{x})^y p(y=0|\boldsymbol{x})^{1-y} \\\\
&amp; = \sigma(\boldsymbol{w}^T\boldsymbol{x}+b)^y (1 - \sigma(\boldsymbol{w}^T\boldsymbol{x}+b))^{1-y}
\end{array}
$$
&lt;p>The &lt;strong>conditional Bernoulli log-likelihood&lt;/strong> is (assuming training data is i.i.d)&lt;/p>
$$
\begin{aligned}
\operatorname{loglik}(\boldsymbol{w}, \mathcal{D})
&amp;= \log(\operatorname{lik}(\boldsymbol{w}, \mathcal{D})) \\\\
&amp;= \log(\displaystyle\prod_i p(y_i|\boldsymbol{x}_i)) \\\\
&amp;= \log\left(\displaystyle\prod_i \sigma(\boldsymbol{w}^T\boldsymbol{x}_i+b)^{y_i} \left(1 - \sigma(\boldsymbol{w}^T\boldsymbol{x}_i+b)\right)^{1-y_i}\right) \\\\
&amp;= \displaystyle\sum_i y_i\log\left(\sigma(\boldsymbol{w}^T\boldsymbol{x}_i+b)\right)+ (1-y_i)\log\left(1 - \sigma(\boldsymbol{w}^T\boldsymbol{x}_i+b)\right)
\end{aligned}
$$
&lt;p>Let&lt;/p>
$$
\tilde{\boldsymbol{w}}=\left(\begin{array}{c}b \\\\ \boldsymbol{w} \end{array}\right), \quad \tilde{\boldsymbol{x}_i}=\left(\begin{array}{c}1 \\\\ \boldsymbol{x}_i \end{array}\right)
$$
&lt;p>Then:&lt;/p>
$$
\operatorname{loglik}(\boldsymbol{w}, \mathcal{D}) = \operatorname{loglik}(\tilde{\boldsymbol{w}}, \mathcal{D}) = \displaystyle\sum_i y_i\log\left(\sigma(\tilde{\boldsymbol{w}}^T\tilde{\boldsymbol{x}_i})\right)+ (1-y_i)\log\left(1 - \sigma(\tilde{\boldsymbol{w}}^T\tilde{\boldsymbol{x}_i})\right)
$$
&lt;p>Our objective is to find the $\tilde{\boldsymbol{w}}^*$ that &lt;strong>maximize the log-likelihood&lt;/strong>, i.e.&lt;/p>
$$
\begin{array}{cl}
\tilde{\boldsymbol{w}}^* &amp;= \underset{\tilde{\boldsymbol{w}}}{\arg \max} \quad \operatorname{loglik}(\tilde{\boldsymbol{w}}, \mathcal{D}) \\\\
&amp;= \underset{\tilde{\boldsymbol{w}}}{\arg \min} \quad -\operatorname{loglik}(\tilde{\boldsymbol{w}}, \mathcal{D})\\\\
&amp;= \underset{\tilde{\boldsymbol{w}}}{\arg \min} \quad \underbrace{-\left(\displaystyle\sum_i y_i\log\left(\sigma(\tilde{\boldsymbol{w}}^T\tilde{\boldsymbol{x}_i})\right) + (1-y_i)\log\left(1 - \sigma(\tilde{\boldsymbol{w}}^T\tilde{\boldsymbol{x}_i})\right)\right)}_{\text{cross-entropy loss}}
\end{array}
$$
&lt;p>In other words, &lt;strong>maximizing the (log-)likelihood is the same as minimizing the cross entropy.&lt;/strong>&lt;/p></description></item><item><title>SVM: Basics</title><link>https://haobin-tan.netlify.app/docs/ai/machine-learning/classification/svm/support-vector-machine/</link><pubDate>Mon, 13 Jul 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/machine-learning/classification/svm/support-vector-machine/</guid><description>&lt;h2 id="-goal-of-svm">🎯 Goal of SVM&lt;/h2>
&lt;p>To find the optimal separating hyperplane which &lt;strong>maximizes the margin&lt;/strong> of the training data&lt;/p>
&lt;ul>
&lt;li>it &lt;strong>correctly&lt;/strong> classifies the training data&lt;/li>
&lt;li>it is the one which will generalize better with unseen data (as far as possible from data points from each category)&lt;/li>
&lt;/ul>
&lt;h2 id="svm-math-formulation">SVM math formulation&lt;/h2>
&lt;p>Assuming the data is linearly separable&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/image-20200304135136513.png" alt="image-20200304135136513" style="zoom:50%;" />
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Decision boundary&lt;/strong>: Hyperplane $\mathbf{w}^{T} \mathbf{x}+b=0$&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Support Vectors:&lt;/strong> Data points closest to the decision boundary (other samples can be ignored)&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Positive&lt;/strong> support vectors: $\mathbf{w}^{T} \mathbf{x}_{+}+b=+1$&lt;/li>
&lt;li>&lt;strong>Negative&lt;/strong> support vectors: $\mathbf{w}^{T} \mathbf{x}_{-}+b=-1$&lt;/li>
&lt;/ul>
&lt;blockquote>
&lt;p>Why do we use 1 and -1 as class labels?&lt;/p>
&lt;ul>
&lt;li>This makes the math manageable, because -1 and 1 are only different by the sign. We can write a single equation to describe the margin or how close a data point is to our separating hyperplane and not have to worry if the data is in the -1 or +1 class.&lt;/li>
&lt;li>If a point is far away from the separating plane on the positive side, then $w^Tx+b$ will be a large positive number, and $label*(w^Tx+b)$ will give us a large number. If it’s far from the negative side and has a negative label, $label*(w^Tx+b)$ will also give us a large positive number.&lt;/li>
&lt;/ul>
&lt;/blockquote>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Margin&lt;/strong> $\rho$ : distance between the support vectors and the decision boundary and should be &lt;strong>maximized&lt;/strong>
&lt;/p>
$$
\rho = \frac{\mathbf{w}^{T} \mathbf{x}\_{+}+b}{\|\mathbf{w}\|}-\frac{\mathbf{w}^{T} \mathbf{x}\_{-}+b}{\|\mathbf{w}\|}=\frac{2}{\|\mathbf{w}\|}
$$
&lt;/li>
&lt;/ul>
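&lt;p>A toy numerical check of these definitions (hand-picked hyperplane and support vectors, not taken from the figure):&lt;/p>

```python
import numpy as np

w = np.array([2.0, 0.0])  # normal vector of the decision boundary
b = 0.0
margin = 2.0 / np.linalg.norm(w)  # rho = 2 / ||w||

x_pos = np.array([0.5, 3.0])   # positive support vector: w.x + b = +1
x_neg = np.array([-0.5, 1.0])  # negative support vector: w.x + b = -1
```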
&lt;h3 id="svm-optimization-problem">SVM optimization problem&lt;/h3>
&lt;p>Requirement:&lt;/p>
&lt;ol>
&lt;li>Maximal margin&lt;/li>
&lt;li>Correct classification&lt;/li>
&lt;/ol>
&lt;p>Based on these requirements, we have:&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/image-20200713164553044.png" alt="image-20200713164553044" style="zoom:67%;" />
&lt;p>Reformulation:
&lt;/p>
$$
\begin{aligned}
\underset{\mathbf{w}}{\operatorname{argmin}} \quad &amp;\\|\mathbf{w}\\|^{2} \\\\ \text {s.t.} \quad &amp; y_{i}\left(\mathbf{w}^{T} \mathbf{x}\_{i}+b\right) \geq 1
\end{aligned}
$$
&lt;p>This is the &lt;strong>hard margin SVM&lt;/strong>.&lt;/p>
&lt;h3 id="soft-margin-svm">Soft margin SVM&lt;/h3>
&lt;h4 id="-idea">💡 Idea&lt;/h4>
&lt;p>&lt;strong>&amp;ldquo;Allow the classifier to make some mistakes&amp;rdquo;&lt;/strong> (Soft margin)&lt;/p>
&lt;p>➡️ &lt;strong>Trade-off between margin and classification accuracy&lt;/strong>&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/image-20200304141838595.png" alt="image-20200304141838595" style="zoom:50%;" />
&lt;ul>
&lt;li>
&lt;p>Slack-variables: ${\color {blue}{\xi_{i}}} \geq 0$&lt;/p>
&lt;/li>
&lt;li>
&lt;p>💡&lt;strong>Allows violating the margin conditions&lt;/strong>
&lt;/p>
$$
y_{i}\left(\mathbf{w}^{T} \mathbf{x}_{i}+b\right) \geq 1- \color{blue}{\xi_{i}}
$$
&lt;ul>
&lt;li>$0 \leq \xi\_{i} \leq 1$ : sample is between margin and decision boundary (&lt;span style="color:red">&lt;strong>margin violation&lt;/strong>&lt;/span>)&lt;/li>
&lt;li>$\xi\_{i} \geq 1$ : sample is on the wrong side of the decision boundary (&lt;span style="color:red">&lt;strong>misclassified&lt;/strong>&lt;/span>)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h4 id="soft-max-margin">Soft Max-Margin&lt;/h4>
&lt;p>Optimization problem
&lt;/p>
$$
\begin{array}{lll} \underset{\mathbf{w}}{\operatorname{argmin}} \quad &amp;\|\mathbf{w}\|^{2} + \color{blue}{C \sum_i^N \xi_i} \qquad \qquad &amp; \text{(Punish large slack variables)}\\\\
\text { s.t. } \quad &amp; y_{i}\left(\mathbf{w}^{T} \mathbf{x}_{i}+b\right) \geq 1 -\color{blue}{\xi_i}, \quad \xi_i \geq 0 \qquad \qquad &amp; \text{(Condition for soft-margin)}\end{array}
$$
&lt;ul>
&lt;li>$C$ : regularization parameter, determines how important $\xi$ should be
&lt;ul>
&lt;li>&lt;strong>Small&lt;/strong> $C$: Constraints have &lt;strong>little&lt;/strong> influence ➡️ &lt;strong>large&lt;/strong> margin&lt;/li>
&lt;li>&lt;strong>Large&lt;/strong> $C$: Constraints have &lt;strong>large&lt;/strong> influence ➡️ &lt;strong>small&lt;/strong> margin&lt;/li>
&lt;li>$C$ infinite: Constraints are enforced ➡️ &lt;strong>hard&lt;/strong> margin&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h4 id="soft-svm-optimization">Soft SVM Optimization&lt;/h4>
&lt;p>Reformulate into an unconstrained optimization problem&lt;/p>
&lt;ol>
&lt;li>Rewrite constraints: $\xi_{i} \geq 1-y_{i}\left(\mathbf{w}^{T} \mathbf{x}_{i}+b\right)=1-y_{i} f\left(\boldsymbol{x}_{i}\right)$&lt;/li>
&lt;li>Together with $\xi_{i} \geq 0 \Rightarrow \xi_{i}=\max \left(0,1-y_{i} f\left(\boldsymbol{x}_{i}\right)\right)$&lt;/li>
&lt;/ol>
&lt;p>&lt;strong>Unconstrained optimization&lt;/strong> (over $\mathbf{w}$):
&lt;/p>
$$
\underset{{\mathbf{w}}}{\operatorname{argmin}} \underbrace{\|\mathbf{w}\|^{2}}\_{\text {regularization }}+C \underbrace{\sum_{i=1}^{N} \max \left(0,1-y\_{i} f\left(\boldsymbol{x}\_{i}\right)\right)}_{\text {loss function }}
$$
&lt;p>
Points are in 3 categories:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>$y\_{i} f\left(\boldsymbol{x}\_{i}\right) > 1$ : Point &lt;strong>outside&lt;/strong> margin, &lt;strong>no contribution&lt;/strong> to loss&lt;/p>
&lt;/li>
&lt;li>
&lt;p>$y\_{i} f\left(\boldsymbol{x}\_{i}\right) = 1$: Point is &lt;strong>on&lt;/strong> the margin, &lt;strong>no contribution&lt;/strong> to loss as &lt;strong>in hard margin&lt;/strong>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>$y\_{i} f\left(\boldsymbol{x}\_{i}\right) &lt; 1$: &lt;span style="color:red">&lt;strong>Point violates the margin, contributes to loss&lt;/strong>&lt;/span>&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h4 id="loss-function">Loss function&lt;/h4>
&lt;p>SVMs use &amp;ldquo;hinge&amp;rdquo; loss (an approximation of the 0-1 loss)&lt;/p>
&lt;blockquote>
&lt;p>&lt;a href="https://en.wikipedia.org/wiki/Hinge_loss">Hinge loss&lt;/a>&lt;/p>
&lt;p>For an intended output $t=\pm 1$ and a classifier score $y$, the hinge loss of the prediction $y$ is defined as
&lt;/p>
$$
> \ell(y)=\max (0,1-t \cdot y)
> $$
&lt;p>
Note that $y$ should be the &amp;ldquo;raw&amp;rdquo; output of the classifier&amp;rsquo;s decision function, not the predicted class label. For instance, in linear SVMs, $y = \mathbf{w}\cdot \mathbf{x}+ b$, where $(\mathbf{w},b)$ are the parameters of the hyperplane and $\mathbf{x}$ is the input vector.&lt;/p>
&lt;/blockquote>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/image-20200304172146690.png" alt="image-20200304172146690" style="zoom:40%;" />
&lt;p>The loss function of SVM is &lt;strong>convex&lt;/strong>:&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/image-20200304172349088.png" alt="image-20200304172349088" style="zoom: 33%;" />
&lt;p>I.e.,&lt;/p>
&lt;ul>
&lt;li>There is only &lt;strong>one&lt;/strong> minimum&lt;/li>
&lt;li>We can find it with gradient descent&lt;/li>
&lt;li>&lt;strong>However:&lt;/strong> Hinge loss is &lt;strong>not differentiable!&lt;/strong> 🤪&lt;/li>
&lt;/ul>
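&lt;p>The hinge loss and the three point categories from the previous section can be sketched as (toy values of mine):&lt;/p>

```python
def hinge(y, f_x):
    # max(0, 1 - y * f(x))
    return max(0.0, 1.0 - y * f_x)

outside = hinge(1, 2.0)         # y*f(x) above 1: outside margin, zero loss
on_margin = hinge(1, 1.0)       # y*f(x) equal to 1: on the margin, zero loss
violating = hinge(1, 0.5)       # y*f(x) below 1: margin violation, positive loss
misclassified = hinge(1, -0.5)  # wrong side of the boundary: even larger loss
```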
&lt;h2 id="sub-gradients">Sub-gradients&lt;/h2>
&lt;p>For convex function $f: \mathbb{R}^d \to \mathbb{R}$ :
&lt;/p>
$$
f(\boldsymbol{z}) \geq f(\boldsymbol{x})+\nabla f(\boldsymbol{x})^{T}(\boldsymbol{z}-\boldsymbol{x})
$$
&lt;p>
(Linear approximation underestimates function)&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/image-20200304172748278.png" alt="image-20200304172748278" style="zoom:33%;" />
&lt;p>A &lt;strong>subgradient&lt;/strong> of a convex function $f$ at point $\boldsymbol{x}$ is any $\boldsymbol{g}$ such that
&lt;/p>
$$
f(\boldsymbol{z}) \geq f(\boldsymbol{x})+\boldsymbol{g}^{T}(\boldsymbol{z}-\boldsymbol{x})
$$
&lt;ul>
&lt;li>Always exists (even if $f$ is not differentiable)&lt;/li>
&lt;li>If $f$ is differentiable at $\boldsymbol{x}$, then: $\boldsymbol{g}=\nabla f(\boldsymbol{x})$&lt;/li>
&lt;/ul>
&lt;h3 id="example">Example&lt;/h3>
&lt;p>$f(x)=|x|$&lt;/p>
&lt;ul>
&lt;li>$x \neq 0$ : unique sub-gradient is $g= \operatorname{sign}(x)$&lt;/li>
&lt;li>$x =0$ : $g \in [-1, 1]$&lt;/li>
&lt;/ul>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/220px-Absolute_value.svg.png" alt="img">&lt;/p>
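&lt;p>The $f(x)=|x|$ example can be sketched directly (the function name is illustrative):&lt;/p>

```python
def abs_subgradient(x):
    """A sub-gradient of f(x) = |x|: sign(x) away from 0, any g in [-1, 1] at 0."""
    if x > 0:
        return 1.0
    if x < 0:
        return -1.0
    return 0.0  # any value in [-1, 1] is a valid sub-gradient at x = 0

print(abs_subgradient(3.0))   # 1.0
print(abs_subgradient(-2.0))  # -1.0
print(abs_subgradient(0.0))   # 0.0 (one valid choice among many)
```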
&lt;h3 id="sub-gradient-method">Sub-gradient Method&lt;/h3>
&lt;p>&lt;strong>Sub-gradient Descent&lt;/strong>&lt;/p>
&lt;ol>
&lt;li>Given &lt;strong>convex&lt;/strong> $f$, not necessarily differentiable&lt;/li>
&lt;li>Initialize $\boldsymbol{x}_0$&lt;/li>
&lt;li>Repeat: $\boldsymbol{x}\_{t+1}=\boldsymbol{x}\_{t}-\eta \boldsymbol{g}$, where $\boldsymbol{g}$ is any sub-gradient of $f$ at point $\boldsymbol{x}_{t}$&lt;/li>
&lt;/ol>
&lt;p>‼️ Notes:&lt;/p>
&lt;ul>
&lt;li>Sub-gradient steps do not necessarily decrease $f$ at every iteration (not a true descent method)&lt;/li>
&lt;li>Need to keep track of the best iterate $\boldsymbol{x}^*$&lt;/li>
&lt;/ul>
&lt;h4 id="sub-gradients-for-hinge-loss">Sub-gradients for hinge loss&lt;/h4>
$$
\mathcal{L}\left(\mathbf{x}\_{i}, y\_{i} ; \mathbf{w}\right)=\max \left(0,1-y\_{i} f\left(\mathbf{x}\_{i}\right)\right) \quad f\left(\mathbf{x}\_{i}\right)=\mathbf{w}^{\top} \mathbf{x}\_{i}+b
$$
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/image-20200304175930294.png" alt="image-20200304175930294" style="zoom:33%;" />
&lt;h4 id="sub-gradient-descent-for-svms">Sub-gradient descent for SVMs&lt;/h4>
&lt;p>Recall the &lt;strong>unconstrained optimization problem&lt;/strong> for SVMs:
&lt;/p>
$$
\underset{{\mathbf{w}}}{\operatorname{argmin}} \quad C \underbrace{\sum\_{i=1}^{N} \max \left(0,1-y\_{i} f\left(\boldsymbol{x}\_{i}\right)\right)}\_{\text {loss function }} + \underbrace{\|\mathbf{w}\|^{2}}\_{\text {regularization }}
$$
&lt;p>
At each iteration, pick random training sample $(\boldsymbol{x}_i, y_i)$&lt;/p>
&lt;ul>
&lt;li>
&lt;p>If $y_{i} f\left(\boldsymbol{x}_{i}\right)&lt;1$: ​
&lt;/p>
$$
\boldsymbol{w}\_{t+1}=\boldsymbol{w}\_{t}-\eta\left(2 \boldsymbol{w}\_{t}-C y\_{i} \boldsymbol{x}\_{i}\right)
$$
&lt;/li>
&lt;li>
&lt;p>Otherwise:
&lt;/p>
$$
\quad \boldsymbol{w}\_{t+1}=\boldsymbol{w}\_{t}-\eta 2 \boldsymbol{w}\_{t}
$$
&lt;/li>
&lt;/ul>
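&lt;p>The two cases above can be combined into a stochastic sub-gradient loop. The sketch below is illustrative: the function name and learning rate are assumptions, and a bias update (left implicit in the notes, which only show the $\mathbf{w}$ update) is included so the hyperplane need not pass through the origin:&lt;/p>

```python
import numpy as np

def svm_subgradient_descent(X, y, C=1.0, eta=0.01, epochs=200, seed=0):
    """Stochastic sub-gradient descent for the unconstrained SVM objective.

    X: (N, d) array of inputs, y: (N,) array of labels in {-1, +1}.
    """
    rng = np.random.default_rng(seed)
    N, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(epochs):
        for i in rng.permutation(N):  # pick training samples in random order
            if y[i] * (X[i] @ w + b) < 1:
                # margin violated: sub-gradient of regularizer + hinge loss
                w -= eta * (2 * w - C * y[i] * X[i])
                b += eta * C * y[i]
            else:
                # margin satisfied: only the regularizer contributes
                w -= eta * 2 * w
    return w, b

# Tiny linearly separable example
X = np.array([[2.0, 2.0], [1.0, 2.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = svm_subgradient_descent(X, y)
print(np.sign(X @ w + b))  # matches y on this toy data
```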
&lt;h2 id="application-of-svms">Application of SVMs&lt;/h2>
&lt;ul>
&lt;li>Pedestrian tracking&lt;/li>
&lt;li>Text (and hypertext) categorization&lt;/li>
&lt;li>Image classification&lt;/li>
&lt;li>Bioinformatics (protein classification, cancer classification)&lt;/li>
&lt;li>Hand-written character recognition&lt;/li>
&lt;/ul>
&lt;p>Yet, in the last 5-8 years, neural networks have outperformed SVMs on most applications.🤪☹️😭&lt;/p></description></item><item><title>SVM: Kernel Methods</title><link>https://haobin-tan.netlify.app/docs/ai/machine-learning/classification/svm/kernel-methods/</link><pubDate>Mon, 13 Jul 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/machine-learning/classification/svm/kernel-methods/</guid><description>&lt;h2 id="kernel-function">Kernel function&lt;/h2>
&lt;p>Given a mapping function $\phi: \mathcal{X} \rightarrow \mathcal{V}$, the function&lt;/p>
$$
\mathcal{K}: \mathcal{X} \times \mathcal{X} \rightarrow \mathbb{R}, \quad \mathcal{K}\left(\mathbf{x}, \mathbf{x}^{\prime}\right)=\left\langle\phi(\mathbf{x}), \phi\left(\mathbf{x}^{\prime}\right)\right\rangle_{\mathcal{V}}
$$
&lt;p>is called a &lt;strong>kernel function&lt;/strong>.&lt;/p>
&lt;p>&lt;em>&amp;ldquo;A kernel is a function that returns the result of a dot product performed in another space.&amp;rdquo;&lt;/em>&lt;/p>
&lt;h2 id="kernel-trick">Kernel trick&lt;/h2>
&lt;p>Applying the kernel trick simply means &lt;strong>replacing the dot product of two examples by a kernel function&lt;/strong>.&lt;/p>
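&lt;p>A small numeric check makes this concrete: for 2-D inputs, the degree-2 polynomial kernel equals an inner product under the explicit feature map $\phi(\boldsymbol{x}) = (x\_1^2, \sqrt{2} x\_1 x\_2, x\_2^2)$, yet the kernel never constructs $\phi(\boldsymbol{x})$ (illustrative sketch):&lt;/p>

```python
import numpy as np

def phi(x):
    """Explicit feature map matching the degree-2 polynomial kernel on 2-D inputs."""
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

def poly_kernel(x, xp):
    """Degree-2 polynomial kernel: the squared dot product of x and x'."""
    return (x @ xp) ** 2

x = np.array([1.0, 2.0])
xp = np.array([3.0, 4.0])

# Both compute the same inner product; the kernel skips building phi(x).
print(poly_kernel(x, xp))  # 121.0
print(phi(x) @ phi(xp))    # 121.0 up to floating-point rounding
```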
&lt;h3 id="typical-kernels">Typical kernels&lt;/h3>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Kernel Type&lt;/th>
&lt;th>Definition&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>&lt;strong>Linear kernel&lt;/strong>&lt;/td>
&lt;td>$k\left(\boldsymbol{x}, \boldsymbol{x}^{\prime}\right)=\left\langle\boldsymbol{x}, \boldsymbol{x}^{\prime}\right\rangle$&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Polynomial kernel&lt;/strong>&lt;/td>
&lt;td>$k\left(\boldsymbol{x}, \boldsymbol{x}^{\prime}\right)=\left\langle\boldsymbol{x}, \boldsymbol{x}^{\prime}\right\rangle^{d}$&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Gaussian / Radial Basis Function (RBF) kernel&lt;/strong>&lt;/td>
&lt;td>$k \left(\boldsymbol{x}, \boldsymbol{y}\right)=\exp \left(-\frac{\|\boldsymbol{x}-\boldsymbol{y}\|^{2}}{2 \sigma^{2}}\right)$&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
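&lt;p>The three kernels in the table can be written down directly (a minimal sketch; function and parameter names are illustrative):&lt;/p>

```python
import numpy as np

def linear_kernel(x, xp):
    """Linear kernel: plain dot product."""
    return x @ xp

def polynomial_kernel(x, xp, d=2):
    """Polynomial kernel of degree d."""
    return (x @ xp) ** d

def rbf_kernel(x, xp, sigma=1.0):
    """Gaussian / RBF kernel with bandwidth sigma."""
    return np.exp(-np.sum((x - xp) ** 2) / (2.0 * sigma**2))

x = np.array([1.0, 2.0])
xp = np.array([2.0, 1.0])
print(linear_kernel(x, xp))      # 4.0
print(polynomial_kernel(x, xp))  # 16.0
print(rbf_kernel(x, xp))         # exp(-1), about 0.3679
```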
&lt;h3 id="why-do-we-need-kernel-trick">Why do we need kernel trick?&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>Kernels can be used for all feature based algorithms that can be rewritten such that they contain &lt;strong>inner products&lt;/strong> of feature vectors&lt;/p>
&lt;ul>
&lt;li>This is true for almost all feature based algorithms (Linear regression, SVMs, &amp;hellip;)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Kernels can be used to map the data $\mathbf{x}$ into an infinite-dimensional feature space (i.e., a function space)&lt;/p>
&lt;ul>
&lt;li>&lt;strong>The feature vector never has to be represented explicitly&lt;/strong>&lt;/li>
&lt;li>&lt;strong>As long as we can evaluate the inner product of two feature vectors&lt;/strong>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;p>➡️ We can obtain a more powerful representation than standard linear feature models.&lt;/p>
&lt;p>&lt;svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" version="1.1" width="1049px" viewBox="-0.5 -0.5 1049 675" content="&amp;lt;mxfile host=&amp;quot;app.diagrams.net&amp;quot; modified=&amp;quot;2020-07-13T14:50:43.530Z&amp;quot; agent=&amp;quot;5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36&amp;quot; etag=&amp;quot;XPL9LKrkJbpFtgSQEWYO&amp;quot; version=&amp;quot;13.4.2&amp;quot; type=&amp;quot;device&amp;quot;&amp;gt;&amp;lt;diagram id=&amp;quot;Q3K544789h8GwBtGwhde&amp;quot; name=&amp;quot;Page-1&amp;quot;&amp;gt;7Vpdj5s4FP01kXYeZgQGDHmcJDOt9kNadVbd9qlywAFmALPGmST99WsHEz7sJKQTdhJ1K7XF1zaxz7n3XBt7ZE3T9QeK8ugPEuBkBIxgPbJmIwBsCPm/wrApDa5rloaQxkFpahie4u9YGg1pXcYBLloNGSEJi/O20SdZhn3WsiFKyardbEGS9q/mKMSK4clHiWr9Ow5YJK0mHNcVH3EcRvKnPeCWFSnaNS4NRYQCsipN28lZDyNrSglh5VO6nuJEYFfhUiLwuKd2NzCKM9anw4w+efnrX96vrx8/fV1Nku9P3uZWvuUVJUs5YTlYtqkQ4G/hYPPCZBXFDD/lyBc1K043t0UsTXjJ5I+LOEmmJCF0289aLBbA97m9YJS84EZNAOfQgbxGnUI1HkwZXjdMckofMEkxoxveRNZ60k+ke1ljWV7VZAFb2qIGT2bVEUkHCXevrjHkDxLGEyB1hoM0QNhbCEgXJGMyVjjrGoih7+H54jwQ27CDMVAxhkCDsWUNhTFQMXamv/C/Iup48I7cyZeRO+OGGwV8PmvWRrmNXkYy3IFemlAShxkv+hxLzO0TgWHMf+5eVqRxECT7aKVkmQVYzMroMAirshykJrRPJs3sBIZt2AppurgAQ1E2VohoA0Ioi0hIMpT8TkgumXnGjG0kTGjJSJs3jg7dfJH9t4WvonDnVMXZulk522iw10bP46PB/+xqKuEHh+KpIEvq4wMAWDJ7IRpidqCdzB84aCUmlWiKE8Ti13aeOjttJlTVi4+sctWatYfaOhmMWfcYtQqV9mw83kNl0w1cWdb1HJ5y0zk359uu95SiTaNBTuKMFY03/ykMDc0AHc1wOyuKI+256HT8rRxB7X27qfy4Q1r7pH/Ol3bFJuX/cfVfc/X/Fl+J/h90sjdIvqPmaZ3k20NJfrWG36MdEtWfLA3YPTXBu6w04F5SGgDXlQb6Un4hacBty4jT3Vcebm661vBJwO6fBJ5/7iTgGO+cBOA+quqt2ufr2qqdgSXX7myprX4sDbY7M613TdVHBf19crXbV7jNi0rW7r6Qy6N430oZwESE2pzyp1A83VxJRJ5JNx3otBfPQP1e4jpqRFpwqIi03zUij+603yciveuMSO/kiHz+PyI7EenYao7UReRw21ndYUlJkph+ixb4z5JUFbfFFph73gDAfF1XVsRKX0hQFvLuAKJUAJzNi7ysPE22wfSUV2j9jFPoTOl2ON/UlZrSofTL0sqRLcGQVZfvruc6iHI6X1+g6q7mWLOmg4P5q3cwgwy0Z7+srXe1pb62hGGqp4i/YZptz9YXy8xnMcmagdhDetx90vOyV1x0SqKRkJt9knDwwOeEo89KFN7qLv2jeXeOXH3
V0HxL1e7QxoNF81jxCPNOOOw6T2I/FumHcckuFoRyvd66x8VLr6GSuFs1noFE4IC7zhrC1dBoaWh0hqJR9FcWER2iQo5Y3n/+u8svaF69wTiIi+l2cYEaXDz4n+JiHMeliFAuHv0lTTYTivwXIenHhKR2vx6ycsgjKWFlYFmz2/HbtzyH3aP/7YzOIdztztD0cQ2V3mALD6AulH2S5ssSPpQkm1K3cFaIRCpuZYkUFvGx3QZxKsyimajAiC0p3gLiM9HqqjWteckDDOYRZue+jsYfdjrY+pQxmD+oZ7ZApK6Yq5YgPqckWPrsysk9Q8KC4AITluasZQhhPmnfcBYRH5/rQ4XRUWBL86VCq8CD3UEE6vYBvZI4KNoLxhTleZyF5X5f7AI+icFtr82WC3hD1W0/wpzsK1XinU81ldgaZKmp2f47Oif4gSMdXqyvDZdHqvXda+vhXw==&amp;lt;/diagram&amp;gt;&amp;lt;/mxfile&amp;gt;" onclick="(function(svg){var src=window.event.target||window.event.srcElement;while (src!=null&amp;amp;&amp;amp;src.nodeName.toLowerCase()!='a'){src=src.parentNode;}if(src==null){if(svg.wnd!=null&amp;amp;&amp;amp;!svg.wnd.closed){svg.wnd.focus();}else{var r=function(evt){if(evt.data=='ready'&amp;amp;&amp;amp;evt.source==svg.wnd){svg.wnd.postMessage(decodeURIComponent(svg.getAttribute('content')),'*');window.removeEventListener('message',r);}};window.addEventListener('message',r);svg.wnd=window.open('https://app.diagrams.net/?client=1&amp;amp;lightbox=1&amp;amp;edit=_blank');}}})(this);" style="cursor:pointer;max-width:100%;max-height:675px;">&lt;defs>&lt;style xmlns="http://www.w3.org/1999/xhtml" type="text/css">.MathJax_Preview {color: #888}
#MathJax_Message {position: fixed; left: 1em; bottom: 1.5em; background-color: #E6E6E6; border: 1px solid #959595; margin: 0px; padding: 2px 8px; z-index: 102; color: black; font-size: 80%; width: auto; white-space: nowrap}
#MathJax_MSIE_Frame {position: absolute; top: 0; left: 0; width: 0px; z-index: 101; border: 0px; margin: 0px; padding: 0px}
.MathJax_Error {color: #CC0000; font-style: italic}
&lt;/style>&lt;style xmlns="http://www.w3.org/1999/xhtml" type="text/css">.MathJax_Hover_Frame {border-radius: .25em; -webkit-border-radius: .25em; -moz-border-radius: .25em; -khtml-border-radius: .25em; box-shadow: 0px 0px 15px #83A; -webkit-box-shadow: 0px 0px 15px #83A; -moz-box-shadow: 0px 0px 15px #83A; -khtml-box-shadow: 0px 0px 15px #83A; border: 1px solid #A6D ! important; display: inline-block; position: absolute}
.MathJax_Menu_Button .MathJax_Hover_Arrow {position: absolute; cursor: pointer; display: inline-block; border: 2px solid #AAA; border-radius: 4px; -webkit-border-radius: 4px; -moz-border-radius: 4px; -khtml-border-radius: 4px; font-family: &amp;lsquo;Courier New&amp;rsquo;,Courier; font-size: 9px; color: #F0F0F0}
.MathJax_Menu_Button .MathJax_Hover_Arrow span {display: block; background-color: #AAA; border: 1px solid; border-radius: 3px; line-height: 0; padding: 4px}
.MathJax_Hover_Arrow:hover {color: white!important; border: 2px solid #CCC!important}
.MathJax_Hover_Arrow:hover span {background-color: #CCC!important}
&lt;/style>&lt;style xmlns="http://www.w3.org/1999/xhtml" type="text/css">.MathJax_SVG_Display {text-align: center; margin: 1em 0em; position: relative; display: block!important; text-indent: 0; max-width: none; max-height: none; min-width: 0; min-height: 0; width: 100%}
.MathJax_SVG .MJX-monospace {font-family: monospace}
.MathJax_SVG .MJX-sans-serif {font-family: sans-serif}
#MathJax_SVG_Tooltip {background-color: InfoBackground; color: InfoText; border: 1px solid black; box-shadow: 2px 2px 5px #AAAAAA; -webkit-box-shadow: 2px 2px 5px #AAAAAA; -moz-box-shadow: 2px 2px 5px #AAAAAA; -khtml-box-shadow: 2px 2px 5px #AAAAAA; padding: 3px 4px; z-index: 401; position: absolute; left: 0; top: 0; width: auto; height: auto; display: none}
.MathJax_SVG {display: inline; font-style: normal; font-weight: normal; line-height: normal; font-size: 100%; font-size-adjust: none; text-indent: 0; text-align: left; text-transform: none; letter-spacing: normal; word-spacing: normal; word-wrap: normal; white-space: nowrap; float: none; direction: ltr; max-width: none; max-height: none; min-width: 0; min-height: 0; border: 0; padding: 0; margin: 0}
.MathJax_SVG * {transition: none; -webkit-transition: none; -moz-transition: none; -ms-transition: none; -o-transition: none}
.MathJax_SVG &amp;gt; div {display: inline-block}
.mjx-svg-href {fill: blue; stroke: blue}
.MathJax_SVG_Processing {visibility: hidden; position: absolute; top: 0; left: 0; width: 0; height: 0; overflow: hidden; display: block!important}
.MathJax_SVG_Processed {display: none!important}
.MathJax_SVG_test {font-style: normal; font-weight: normal; font-size: 100%; font-size-adjust: none; text-indent: 0; text-transform: none; letter-spacing: normal; word-spacing: normal; overflow: hidden; height: 1px}
.MathJax_SVG_test.mjx-test-display {display: table!important}
.MathJax_SVG_test.mjx-test-inline {display: inline!important; margin-right: -1px}
.MathJax_SVG_test.mjx-test-default {display: block!important; clear: both}
.MathJax_SVG_ex_box {display: inline-block!important; position: absolute; overflow: hidden; min-height: 0; max-height: none; padding: 0; border: 0; margin: 0; width: 1px; height: 60ex}
.mjx-test-inline .MathJax_SVG_left_box {display: inline-block; width: 0; float: left}
.mjx-test-inline .MathJax_SVG_right_box {display: inline-block; width: 0; float: right}
.mjx-test-display .MathJax_SVG_right_box {display: table-cell!important; width: 10000em!important; min-width: 0; max-width: none; padding: 0; border: 0; margin: 0}
.MathJax_SVG .noError {vertical-align: ; font-size: 90%; text-align: left; color: black; padding: 1px 3px; border: 1px solid}
&lt;/style>&lt;/defs>&lt;g>&lt;ellipse cx="138" cy="434" rx="120" ry="90" fill="#fff2cc" stroke="#d6b656" pointer-events="all"/>&lt;ellipse cx="708" cy="439" rx="310" ry="165" fill="#dae8fc" stroke="#6c8ebf" pointer-events="all"/>&lt;rect x="118" y="358" width="40" height="20" fill="none" stroke="none" pointer-events="all"/>&lt;g transform="translate(-0.5 -0.5)">&lt;switch>&lt;foreignObject style="overflow: visible; text-align: left;" pointer-events="none" width="100%" height="100%" requiredFeatures="http://www.w3.org/TR/SVG11/feature#Extensibility">&lt;div xmlns="http://www.w3.org/1999/xhtml" style="display: flex; align-items: unsafe center; justify-content: unsafe center; width: 38px; height: 1px; padding-top: 368px; margin-left: 119px;">&lt;div style="box-sizing: border-box; font-size: 0; text-align: center; ">&lt;div style="display: inline-block; font-size: 26px; font-family: Helvetica; color: #000000; line-height: 1.2; pointer-events: all; font-weight: bold; white-space: normal; word-wrap: normal; ">&lt;span class="MathJax_Preview" style="">&lt;/span>&lt;span class="MathJax_SVG" id="MathJax-Element-1-Frame" tabindex="0" style="font-size: 100%; display: inline-block;">&lt;svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" width="1.875ex" height="1.848ex" viewBox="0 -730.1 807.5 795.5" role="img" focusable="false" style="vertical-align: -0.152ex;">&lt;g stroke="currentColor" fill="currentColor" stroke-width="0" transform="matrix(1 0 0 -1 0 0)">&lt;path stroke-width="1" d="M324 614Q291 576 250 573Q231 573 231 584Q231 589 232 592Q235 601 244 614T271 643T324 671T400 683H403Q462 683 481 610Q485 594 490 545T498 454L501 413Q504 413 551 442T648 509T705 561Q707 565 707 578Q707 610 682 614Q667 614 667 626Q667 641 695 662T755 683Q765 683 775 680T796 662T807 623Q807 596 792 572T713 499T530 376L505 361V356Q508 346 511 278T524 148T557 75Q569 69 580 69Q585 69 593 77Q624 108 660 110Q667 110 670 110T676 106T678 94Q668 59 624 30T510 0Q487 0 471 
9T445 32T430 71T422 117T417 173Q416 183 416 188Q413 214 411 244T407 286T405 299Q403 299 344 263T223 182T154 122Q152 118 152 105Q152 69 180 69Q183 69 187 66T191 60L192 58V56Q192 41 163 21T105 0Q94 0 84 3T63 21T52 60Q52 77 56 90T85 131T155 191Q197 223 259 263T362 327T402 352L391 489Q391 492 390 505T387 526T384 547T379 568T372 586T361 602T348 611Q346 612 341 613T333 614H324Z"/>&lt;/g>&lt;/svg>&lt;/span>&lt;script type="math/tex" id="MathJax-Element-1">\mathcal{X}&lt;/script>&lt;/div>&lt;/div>&lt;/div>&lt;/foreignObject>&lt;text x="138" y="376" fill="#000000" font-family="Helvetica" font-size="26px" text-anchor="middle" font-weight="bold">\ma&amp;hellip;&lt;/text>&lt;/switch>&lt;/g>&lt;path d="M 158 422.47 L 494.79 396.63" fill="none" stroke="#ff0000" stroke-width="2" stroke-miterlimit="10" pointer-events="stroke"/>&lt;path d="M 500.77 396.17 L 493.1 400.77 L 494.79 396.63 L 492.49 392.79 Z" fill="#ff0000" stroke="#ff0000" stroke-width="2" stroke-miterlimit="10" pointer-events="all"/>&lt;path d="M 118 424 L 58 424 L 58 171.5 L 319.76 171.5" fill="none" stroke="#4d9900" stroke-width="2" stroke-miterlimit="10" pointer-events="stroke"/>&lt;path d="M 325.76 171.5 L 317.76 175.5 L 319.76 171.5 L 317.76 167.5 Z" fill="#4d9900" stroke="#4d9900" stroke-width="2" stroke-miterlimit="10" pointer-events="all"/>&lt;rect x="118" y="404" width="40" height="40" fill="none" stroke="none" pointer-events="all"/>&lt;g transform="translate(-0.5 -0.5)">&lt;switch>&lt;foreignObject style="overflow: visible; text-align: left;" pointer-events="none" width="100%" height="100%" requiredFeatures="http://www.w3.org/TR/SVG11/feature#Extensibility">&lt;div xmlns="http://www.w3.org/1999/xhtml" style="display: flex; align-items: unsafe center; justify-content: unsafe center; width: 38px; height: 1px; padding-top: 424px; margin-left: 119px;">&lt;div style="box-sizing: border-box; font-size: 0; text-align: center; ">&lt;div style="display: inline-block; font-size: 20px; font-family: Helvetica; color: 
#000000; line-height: 1.2; pointer-events: all; white-space: normal; word-wrap: normal; ">&lt;span class="MathJax_Preview" style="">&lt;/span>&lt;span class="MathJax_SVG" id="MathJax-Element-2-Frame" tabindex="0" style="font-size: 100%; display: inline-block;">&lt;svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" width="2.331ex" height="1.636ex" viewBox="0 -496.4 1003.8 704.4" role="img" focusable="false" style="vertical-align: -0.483ex;">&lt;g stroke="currentColor" fill="currentColor" stroke-width="0" transform="matrix(1 0 0 -1 0 0)">&lt;path stroke-width="1" d="M74 282H63Q43 282 43 296Q43 298 45 307T56 332T76 365T110 401T159 433Q200 451 233 451H236Q273 451 282 450Q358 437 382 400L392 410Q434 452 483 452Q538 452 568 421T599 346Q599 303 573 280T517 256Q494 256 478 270T462 308Q462 343 488 367Q501 377 520 385Q520 386 516 389T502 396T480 400T462 398Q429 383 415 341Q354 116 354 80T405 44Q449 44 485 74T535 142Q539 156 542 159T562 162H568H579Q599 162 599 148Q599 135 586 111T550 60T485 12T397 -8Q313 -8 266 35L258 44Q215 -7 161 -7H156Q99 -7 71 25T43 95Q43 143 70 165T125 188Q148 188 164 174T180 136Q180 101 154 77Q141 67 122 59Q124 54 136 49T161 43Q183 43 200 61T226 103Q287 328 287 364T236 400Q200 400 164 377T107 302Q103 288 100 285T80 282H74Z"/>&lt;g transform="translate(659,-150)">&lt;path stroke-width="1" transform="scale(0.707)" d="M184 600Q184 624 203 642T247 661Q265 661 277 649T290 619Q290 596 270 577T226 557Q211 557 198 567T184 600ZM21 287Q21 295 30 318T54 369T98 420T158 442Q197 442 223 419T250 357Q250 340 236 301T196 196T154 83Q149 61 149 51Q149 26 166 26Q175 26 185 29T208 43T235 78T260 137Q263 149 265 151T282 153Q302 153 302 143Q302 135 293 112T268 61T223 11T161 -11Q129 -11 102 10T74 74Q74 91 79 106T122 220Q160 321 166 341T173 380Q173 404 156 404H154Q124 404 99 371T61 287Q60 286 59 284T58 281T56 279T53 278T49 278T41 278H27Q21 284 21 287Z"/>&lt;/g>&lt;/g>&lt;/svg>&lt;/span>&lt;script type="math/tex" 
id="MathJax-Element-2">\boldsymbol{x}_i&lt;/script>&lt;/div>&lt;/div>&lt;/div>&lt;/foreignObject>&lt;text x="138" y="430" fill="#000000" font-family="Helvetica" font-size="20px" text-anchor="middle">\bol&amp;hellip;&lt;/text>&lt;/switch>&lt;/g>&lt;path d="M 158 476.19 L 494.81 513.1" fill="none" stroke="#ff0000" stroke-width="2" stroke-miterlimit="10" pointer-events="stroke"/>&lt;path d="M 500.78 513.76 L 492.39 516.86 L 494.81 513.1 L 493.26 508.91 Z" fill="#ff0000" stroke="#ff0000" stroke-width="2" stroke-miterlimit="10" pointer-events="all"/>&lt;path d="M 118 474 L 8 474 L 8 126.5 L 319.76 126.5" fill="none" stroke="#4d9900" stroke-width="2" stroke-miterlimit="10" pointer-events="stroke"/>&lt;path d="M 325.76 126.5 L 317.76 130.5 L 319.76 126.5 L 317.76 122.5 Z" fill="#4d9900" stroke="#4d9900" stroke-width="2" stroke-miterlimit="10" pointer-events="all"/>&lt;rect x="118" y="454" width="40" height="40" fill="none" stroke="none" pointer-events="all"/>&lt;g transform="translate(-0.5 -0.5)">&lt;switch>&lt;foreignObject style="overflow: visible; text-align: left;" pointer-events="none" width="100%" height="100%" requiredFeatures="http://www.w3.org/TR/SVG11/feature#Extensibility">&lt;div xmlns="http://www.w3.org/1999/xhtml" style="display: flex; align-items: unsafe center; justify-content: unsafe center; width: 38px; height: 1px; padding-top: 474px; margin-left: 119px;">&lt;div style="box-sizing: border-box; font-size: 0; text-align: center; ">&lt;div style="display: inline-block; font-size: 20px; font-family: Helvetica; color: #000000; line-height: 1.2; pointer-events: all; white-space: normal; word-wrap: normal; ">&lt;span class="MathJax_Preview" style="">&lt;/span>&lt;span class="MathJax_SVG" id="MathJax-Element-3-Frame" tabindex="0" style="font-size: 100%; display: inline-block;">&lt;svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" width="2.441ex" height="2.019ex" viewBox="0 -496.4 1051.2 869.2" role="img" focusable="false" 
style="vertical-align: -0.866ex;">&lt;g stroke="currentColor" fill="currentColor" stroke-width="0" transform="matrix(1 0 0 -1 0 0)">&lt;path stroke-width="1" d="M74 282H63Q43 282 43 296Q43 298 45 307T56 332T76 365T110 401T159 433Q200 451 233 451H236Q273 451 282 450Q358 437 382 400L392 410Q434 452 483 452Q538 452 568 421T599 346Q599 303 573 280T517 256Q494 256 478 270T462 308Q462 343 488 367Q501 377 520 385Q520 386 516 389T502 396T480 400T462 398Q429 383 415 341Q354 116 354 80T405 44Q449 44 485 74T535 142Q539 156 542 159T562 162H568H579Q599 162 599 148Q599 135 586 111T550 60T485 12T397 -8Q313 -8 266 35L258 44Q215 -7 161 -7H156Q99 -7 71 25T43 95Q43 143 70 165T125 188Q148 188 164 174T180 136Q180 101 154 77Q141 67 122 59Q124 54 136 49T161 43Q183 43 200 61T226 103Q287 328 287 364T236 400Q200 400 164 377T107 302Q103 288 100 285T80 282H74Z"/>&lt;g transform="translate(659,-150)">&lt;path stroke-width="1" transform="scale(0.707)" d="M297 596Q297 627 318 644T361 661Q378 661 389 651T403 623Q403 595 384 576T340 557Q322 557 310 567T297 596ZM288 376Q288 405 262 405Q240 405 220 393T185 362T161 325T144 293L137 279Q135 278 121 278H107Q101 284 101 286T105 299Q126 348 164 391T252 441Q253 441 260 441T272 442Q296 441 316 432Q341 418 354 401T367 348V332L318 133Q267 -67 264 -75Q246 -125 194 -164T75 -204Q25 -204 7 -183T-12 -137Q-12 -110 7 -91T53 -71Q70 -71 82 -81T95 -112Q95 -148 63 -167Q69 -168 77 -168Q111 -168 139 -140T182 -74L193 -32Q204 11 219 72T251 197T278 308T289 365Q289 372 288 376Z"/>&lt;/g>&lt;/g>&lt;/svg>&lt;/span>&lt;script type="math/tex" id="MathJax-Element-3">\boldsymbol{x}_j&lt;/script>&lt;/div>&lt;/div>&lt;/div>&lt;/foreignObject>&lt;text x="138" y="480" fill="#000000" font-family="Helvetica" font-size="20px" text-anchor="middle">\bol&amp;hellip;&lt;/text>&lt;/switch>&lt;/g>&lt;rect x="678" y="284" width="40" height="20" fill="none" stroke="none" pointer-events="all"/>&lt;g transform="translate(-0.5 -0.5)">&lt;switch>&lt;foreignObject style="overflow: visible; text-align: 
left;" pointer-events="none" width="100%" height="100%" requiredFeatures="http://www.w3.org/TR/SVG11/feature#Extensibility">&lt;div xmlns="http://www.w3.org/1999/xhtml" style="display: flex; align-items: unsafe center; justify-content: unsafe center; width: 38px; height: 1px; padding-top: 294px; margin-left: 679px;">&lt;div style="box-sizing: border-box; font-size: 0; text-align: center; ">&lt;div style="display: inline-block; font-size: 26px; font-family: Helvetica; color: #000000; line-height: 1.2; pointer-events: all; white-space: normal; word-wrap: normal; ">&lt;span class="MathJax_Preview" style="">&lt;/span>&lt;span class="MathJax_SVG" id="MathJax-Element-4-Frame" tabindex="0" style="font-size: 100%; display: inline-block;">&lt;svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" width="1.529ex" height="1.921ex" viewBox="0 -730.1 658.5 827.1" role="img" focusable="false" style="vertical-align: -0.225ex;">&lt;g stroke="currentColor" fill="currentColor" stroke-width="0" transform="matrix(1 0 0 -1 0 0)">&lt;path stroke-width="1" d="M25 633Q25 647 47 665T100 683Q291 683 291 306Q291 264 288 213T282 132L279 102Q281 102 308 126T378 191T464 279T545 381T596 479Q600 490 600 502Q600 527 581 550T523 577Q505 577 505 601Q505 622 516 647T542 681Q546 683 558 683Q605 679 631 645T658 559Q658 423 487 215Q409 126 308 37T190 -52Q177 -52 177 -28Q177 -26 183 15T196 127T203 270Q203 356 192 421T165 523T126 583T83 613T41 620Q25 620 25 633Z"/>&lt;/g>&lt;/svg>&lt;/span>&lt;script type="math/tex" id="MathJax-Element-4">\mathcal{V}&lt;/script>&lt;/div>&lt;/div>&lt;/div>&lt;/foreignObject>&lt;text x="698" y="302" fill="#000000" font-family="Helvetica" font-size="26px" text-anchor="middle">\ma&amp;hellip;&lt;/text>&lt;/switch>&lt;/g>&lt;path d="M 578 401 L 779.84 427.91" fill="none" stroke="#ff0000" stroke-width="2" stroke-miterlimit="10" pointer-events="stroke"/>&lt;path d="M 785.78 428.7 L 777.33 431.61 L 779.84 427.91 L 778.38 423.68 Z" fill="#ff0000" 
stroke="#ff0000" stroke-width="2" stroke-miterlimit="10" pointer-events="all"/>&lt;rect x="503" y="378" width="75" height="36" fill="none" stroke="none" pointer-events="all"/>&lt;g transform="translate(-0.5 -0.5)">&lt;switch>&lt;foreignObject style="overflow: visible; text-align: left;" pointer-events="none" width="100%" height="100%" requiredFeatures="http://www.w3.org/TR/SVG11/feature#Extensibility">&lt;div xmlns="http://www.w3.org/1999/xhtml" style="display: flex; align-items: unsafe center; justify-content: unsafe center; width: 73px; height: 1px; padding-top: 396px; margin-left: 504px;">&lt;div style="box-sizing: border-box; font-size: 0; text-align: center; ">&lt;div style="display: inline-block; font-size: 20px; font-family: Helvetica; color: #000000; line-height: 1.2; pointer-events: all; white-space: normal; word-wrap: normal; ">&lt;span class="MathJax_Preview" style="">&lt;/span>&lt;span class="MathJax_SVG" id="MathJax-Element-5-Frame" tabindex="0" style="font-size: 100%; display: inline-block;">&lt;svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" width="5.526ex" height="2.689ex" viewBox="0 -826 2379.3 1157.6" role="img" focusable="false" style="vertical-align: -0.77ex;">&lt;g stroke="currentColor" fill="currentColor" stroke-width="0" transform="matrix(1 0 0 -1 0 0)">&lt;path stroke-width="1" d="M409 688Q413 694 421 694H429H442Q448 688 448 686Q448 679 418 563Q411 535 404 504T392 458L388 442Q388 441 397 441T429 435T477 418Q521 397 550 357T579 260T548 151T471 65T374 11T279 -10H275L251 -105Q245 -128 238 -160Q230 -192 227 -198T215 -205H209Q189 -205 189 -198Q189 -193 211 -103L234 -11Q234 -10 226 -10Q221 -10 206 -8T161 6T107 36T62 89T43 171Q43 231 76 284T157 370T254 422T342 441Q347 441 348 445L378 567Q409 686 409 688ZM122 150Q122 116 134 91T167 53T203 35T237 27H244L337 404Q333 404 326 403T297 395T255 379T211 350T170 304Q152 276 137 237Q122 191 122 150ZM500 282Q500 320 484 347T444 385T405 400T381 404H378L332 217L284 29Q284 27 285 
27Q293 27 317 33T357 47Q400 66 431 100T475 170T494 234T500 282Z"/>&lt;g transform="translate(596,0)">&lt;path stroke-width="1" d="M94 250Q94 319 104 381T127 488T164 576T202 643T244 695T277 729T302 750H315H319Q333 750 333 741Q333 738 316 720T275 667T226 581T184 443T167 250T184 58T225 -81T274 -167T316 -220T333 -241Q333 -250 318 -250H315H302L274 -226Q180 -141 137 -14T94 250Z"/>&lt;/g>&lt;g transform="translate(986,0)">&lt;path stroke-width="1" d="M74 282H63Q43 282 43 296Q43 298 45 307T56 332T76 365T110 401T159 433Q200 451 233 451H236Q273 451 282 450Q358 437 382 400L392 410Q434 452 483 452Q538 452 568 421T599 346Q599 303 573 280T517 256Q494 256 478 270T462 308Q462 343 488 367Q501 377 520 385Q520 386 516 389T502 396T480 400T462 398Q429 383 415 341Q354 116 354 80T405 44Q449 44 485 74T535 142Q539 156 542 159T562 162H568H579Q599 162 599 148Q599 135 586 111T550 60T485 12T397 -8Q313 -8 266 35L258 44Q215 -7 161 -7H156Q99 -7 71 25T43 95Q43 143 70 165T125 188Q148 188 164 174T180 136Q180 101 154 77Q141 67 122 59Q124 54 136 49T161 43Q183 43 200 61T226 103Q287 328 287 364T236 400Q200 400 164 377T107 302Q103 288 100 285T80 282H74Z"/>&lt;g transform="translate(659,-150)">&lt;path stroke-width="1" transform="scale(0.707)" d="M184 600Q184 624 203 642T247 661Q265 661 277 649T290 619Q290 596 270 577T226 557Q211 557 198 567T184 600ZM21 287Q21 295 30 318T54 369T98 420T158 442Q197 442 223 419T250 357Q250 340 236 301T196 196T154 83Q149 61 149 51Q149 26 166 26Q175 26 185 29T208 43T235 78T260 137Q263 149 265 151T282 153Q302 153 302 143Q302 135 293 112T268 61T223 11T161 -11Q129 -11 102 10T74 74Q74 91 79 106T122 220Q160 321 166 341T173 380Q173 404 156 404H154Q124 404 99 371T61 287Q60 286 59 284T58 281T56 279T53 278T49 278T41 278H27Q21 284 21 287Z"/>&lt;/g>&lt;/g>&lt;g transform="translate(1989,0)">&lt;path stroke-width="1" d="M60 749L64 750Q69 750 74 750H86L114 726Q208 641 251 514T294 250Q294 182 284 119T261 12T224 -76T186 -143T145 -194T113 -227T90 -246Q87 -249 86 -250H74Q66 -250 63 -250T58 
-247T55 -238Q56 -237 66 -225Q221 -64 221 250T66 725Q56 737 55 738Q55 746 60 749Z"/>&lt;/g>&lt;/g>&lt;/svg>&lt;/span>&lt;script type="math/tex" id="MathJax-Element-5">\phi(\boldsymbol{x}_i )&lt;/script>&lt;/div>&lt;/div>&lt;/div>&lt;/foreignObject>&lt;text x="541" y="402" fill="#000000" font-family="Helvetica" font-size="20px" text-anchor="middle">\phi(\bo&amp;hellip;&lt;/text>&lt;/switch>&lt;/g>&lt;path d="M 578 505.67 L 779.96 460.79" fill="none" stroke="#ff0000" stroke-width="2" stroke-miterlimit="10" pointer-events="stroke"/>&lt;path d="M 785.82 459.49 L 778.88 465.13 L 779.96 460.79 L 777.14 457.32 Z" fill="#ff0000" stroke="#ff0000" stroke-width="2" stroke-miterlimit="10" pointer-events="all"/>&lt;rect x="503" y="494" width="75" height="40" fill="none" stroke="none" pointer-events="all"/>&lt;g transform="translate(-0.5 -0.5)">&lt;switch>&lt;foreignObject style="overflow: visible; text-align: left;" pointer-events="none" width="100%" height="100%" requiredFeatures="http://www.w3.org/TR/SVG11/feature#Extensibility">&lt;div xmlns="http://www.w3.org/1999/xhtml" style="display: flex; align-items: unsafe center; justify-content: unsafe center; width: 73px; height: 1px; padding-top: 514px; margin-left: 504px;">&lt;div style="box-sizing: border-box; font-size: 0; text-align: center; ">&lt;div style="display: inline-block; font-size: 20px; font-family: Helvetica; color: #000000; line-height: 1.2; pointer-events: all; white-space: normal; word-wrap: normal; ">&lt;span class="MathJax_Preview" style="">&lt;/span>&lt;span class="MathJax_SVG" id="MathJax-Element-6-Frame" tabindex="0" style="font-size: 100%; display: inline-block;">&lt;svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" width="5.636ex" height="2.784ex" viewBox="0 -826 2426.7 1198.8" role="img" focusable="false" style="vertical-align: -0.866ex;">&lt;g stroke="currentColor" fill="currentColor" stroke-width="0" transform="matrix(1 0 0 -1 0 0)">&lt;path stroke-width="1" d="M409 
"/>&lt;/g>&lt;/svg>&lt;/span>&lt;/div>&lt;/div>&lt;/div>&lt;/foreignObject>&lt;/switch>&lt;/g>&lt;/g>&lt;/svg>&lt;/p>
&lt;p>&lt;em>[Figure: the kernel trick. The kernel function $k(\boldsymbol{x}_i, \boldsymbol{x}_j)$ evaluates the inner product $\langle \phi(\boldsymbol{x}_i), \phi(\boldsymbol{x}_j) \rangle_{\mathcal{V}}$ directly: (1) the explicit transformation $\phi$ is computationally expensive for high-dimensional feature vectors, whereas (2) computing the inner product via the kernel avoids the explicit mapping and is computationally cheaper.]&lt;/em>&lt;/p></description></item><item><title>SVM: Kernelized SVM</title><link>https://haobin-tan.netlify.app/docs/ai/machine-learning/classification/svm/kernelized-svm/</link><pubDate>Mon, 13 Jul 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/machine-learning/classification/svm/kernelized-svm/</guid><description>&lt;h2 id="svm-with-features">SVM (with features)&lt;/h2>
&lt;ul>
&lt;li>
&lt;p>Maximum margin principle&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Slack variables allow for margin violation
&lt;/p>
$$
\begin{array}{ll} \underset{\mathbf{w}}{\operatorname{argmin}} \quad &amp;\|\mathbf{w}\|^{2} + C \sum_i^N \xi_i \\\\ \text { s.t. } \quad &amp; y_{i}\left(\mathbf{w}^{T} \color{red}{\phi(\mathbf{x}_{i})} + b\right) \geq 1 -\xi_i, \quad \xi_i \geq 0\end{array}
$$
&lt;/li>
&lt;/ul>
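For a fixed boundary $(\boldsymbol{w}, b)$ the slack variables have the closed form $\xi_i = \max(0, 1 - y_i(\boldsymbol{w}^T\boldsymbol{x}_i + b))$, and the primal objective can be evaluated directly. A minimal numpy sketch on made-up toy data (the weights here are assumed for illustration, not optimized):

```python
import numpy as np

# Toy 2-D data with labels in {-1, +1}; w, b are an assumed (not optimized) boundary
X = np.array([[2.0, 2.0], [1.5, 0.5], [-1.0, -1.0], [-0.2, 0.1]])
y = np.array([1, 1, -1, -1])
w = np.array([1.0, 1.0])
b = 0.0
C = 1.0

margins = y * (X @ w + b)              # y_i (w^T x_i + b)
xi = np.maximum(0.0, 1.0 - margins)    # slack: margin violation, if any
objective = w @ w + C * xi.sum()       # ||w||^2 + C * sum_i xi_i

print(xi)         # points with margin >= 1 get xi = 0
print(objective)
```

Only the last point (margin 0.1 &lt; 1) needs a nonzero slack; the others satisfy the margin constraint with $\xi_i = 0$.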
&lt;h2 id="math-basics">Math basics&lt;/h2>
&lt;p>Solve the constrained optimization problem using the &lt;strong>Method of Lagrange Multipliers&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Primal optimization problem&lt;/strong>:&lt;/li>
&lt;/ul>
$$
\begin{array}{ll}
\underset{\boldsymbol{x}}{\min} \quad &amp; f(\boldsymbol{x}) \\\\
\text { s.t. } \quad &amp; h_{i}(\boldsymbol{x}) \geq b_{i}, \text { for } i=1 \ldots K
\end{array}
$$
&lt;ul>
&lt;li>&lt;strong>Lagrangian optimization&lt;/strong>:&lt;/li>
&lt;/ul>
$$
\begin{array}{ll}
\underset{\boldsymbol{x}}{\min} \underset{\boldsymbol{\lambda}}{\max} \quad &amp; L(\boldsymbol{x}, \boldsymbol{\lambda}) = f(\boldsymbol{x}) - \sum_{i=1}^K \lambda_i(h_i(\boldsymbol{x}) - b_i) \\\\
\text{ s.t. } &amp;\lambda_i\geq 0, \quad i = 1\dots K
\end{array}
$$
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Dual optimization problem&lt;/strong>
&lt;/p>
$$
\begin{aligned}
\boldsymbol{\lambda}^{*}=\underset{\boldsymbol{\lambda}}{\arg \max } g(\boldsymbol{\lambda}), \quad &amp; g(\boldsymbol{\lambda})=\min_{\boldsymbol{x}} L(\boldsymbol{x}, \boldsymbol{\lambda}) \\\\
\text { s.t. } \quad \lambda_{i} \geq 0, &amp; \text { for } i=1 \ldots K
\end{aligned}
$$
&lt;ul>
&lt;li>$g$ : &lt;strong>dual function&lt;/strong> of the optimization problem&lt;/li>
&lt;li>Essentially, the min and max in the definition of $L$ are swapped&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Slater&amp;rsquo;s condition:&lt;/strong> For a &lt;strong>convex&lt;/strong> objective and &lt;strong>convex&lt;/strong> constraints, &lt;strong>solving the dual is equivalent to solving the primal&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>I.e., optimal primal parameters can be obtained from optimal dual parameters
$$
\boldsymbol{x}^* = \underset{\boldsymbol{x}}{\operatorname{argmin}}L(\boldsymbol{x}, \boldsymbol{\lambda}^*)
$$&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
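The primal–dual relationship can be checked on a tiny convex problem where everything is computable by hand (this example is mine, not from the original notes): minimize $f(x)=x^2$ subject to $x \geq 1$. The dual function is $g(\lambda) = \min_x \left(x^2 - \lambda(x-1)\right) = -\lambda^2/4 + \lambda$, maximized at $\lambda^* = 2$, and $x^* = \operatorname{argmin}_x L(x, \lambda^*) = \lambda^*/2 = 1$ recovers the primal optimum, as Slater's condition promises. A short numerical sketch:

```python
import numpy as np

# Primal: min x^2  s.t.  x >= 1   (convex objective, convex constraint)
f = lambda x: x**2

def g(lam):
    # Dual function: min_x L(x, lam) with L = x^2 - lam*(x - 1).
    # Setting dL/dx = 2x - lam = 0 gives the inner minimizer x = lam/2.
    x = lam / 2.0
    return f(x) - lam * (x - 1.0)

lams = np.linspace(0.0, 5.0, 5001)       # grid search over lambda >= 0
lam_star = lams[np.argmax(g(lams))]      # dual optimum: lambda* = 2
x_star = lam_star / 2.0                  # primal optimum recovered: x* = 1

print(lam_star, x_star)                  # strong duality: g(lambda*) = f(x*) = 1
```

The maximized dual value equals the primal optimum, illustrating the "swap min and max" construction above.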
&lt;h2 id="dual-derivation-of-the-svm">Dual derivation of the SVM&lt;/h2>
&lt;ol>
&lt;li>
&lt;p>SVM optimization:
&lt;/p>
$$
\begin{array}{ll}
\underset{\boldsymbol{w}}{\operatorname{argmin}} \quad &amp;\|\boldsymbol{w}\|^2 \\\\
\text{ s.t. } \quad &amp;y_i(\boldsymbol{w}^T\phi(\boldsymbol{x}_i) + b) \geq 1
\end{array}
$$
&lt;/li>
&lt;li>
&lt;p>Lagrangian function:
&lt;/p>
$$
L(\boldsymbol{w}, \boldsymbol{\alpha})=\frac{1}{2} \boldsymbol{w}^{T} \boldsymbol{w}-\sum_{i} \alpha_{i}\left(y_{i}\left(\boldsymbol{w}^{T} \phi\left(\boldsymbol{x}_{i}\right)+b\right)-1\right)
$$
&lt;/li>
&lt;li>
&lt;p>Compute optimal $\boldsymbol{w}$
&lt;/p>
$$
\begin{align}
&amp;\frac{\partial L}{\partial \boldsymbol{w}} = \boldsymbol{w} - \sum_i \alpha_i y_i \phi(\boldsymbol{x}_i) \overset{!}{=} 0 \\\\
\Leftrightarrow \quad &amp; \color{CornflowerBlue}{\boldsymbol{w}^* = \sum_i \alpha_i y_i \phi(\boldsymbol{x}_i)}
\end{align}
$$
&lt;ul>
&lt;li>
&lt;p>Many of the $\alpha_i$ will be zero (the corresponding constraint is inactive, i.e. satisfied with strict inequality)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>If $\alpha_i \neq 0 \overset{\text{complementary slackness}}{\Rightarrow} y_{i}\left(\boldsymbol{w}^{T} \phi\left(\boldsymbol{x}_{i}\right)+b\right)-1 =0$&lt;/p>
&lt;p>$\Rightarrow \phi(\boldsymbol{x}_i)$ is a support vector&lt;/p>
&lt;/li>
&lt;li>
&lt;p>The optimal weight vector $\boldsymbol{w}$ is a &lt;strong>linear combination of the support vectors&lt;/strong>! 👏&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Optimality condition for $b$:
&lt;/p>
$$
\frac{\partial L}{\partial b} = - \sum_i \alpha_i y_i \overset{!}{=} 0 \quad \Rightarrow \sum_i \alpha_i y_i = 0
$$
&lt;ul>
&lt;li>We do not obtain a solution for $b$&lt;/li>
&lt;li>But an additional condition for $\alpha$&lt;/li>
&lt;/ul>
&lt;p>$b$ can be computed from $w$:&lt;/p>
&lt;p>If $\alpha_i > 0$, then $\boldsymbol{x}_i$ is on the margin due to the complementary slackness condition. I.e.:
&lt;/p>
$$
\begin{align}y_{i}\left(\boldsymbol{w}^{T} \phi\left(\boldsymbol{x}_{i}\right)+b\right)-1 &amp;= 0 \\\\y_{i}\left(\boldsymbol{w}^{T} \phi\left(\boldsymbol{x}_{i}\right)+b\right) &amp;= 1 \\\\ \underbrace{y_{i} y_{i}}_{=1}\left(\boldsymbol{w}^{T} \phi\left(\boldsymbol{x}_{i}\right)+b\right) &amp;= y_{i} \\\\ \Rightarrow b = y_{i} - \boldsymbol{w}^{T} \phi\left(\boldsymbol{x}_{i}\right)\end{align}
$$
&lt;/li>
&lt;/ol>
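The representer property $\boldsymbol{w}^* = \sum_i \alpha_i y_i \phi(\boldsymbol{x}_i)$ and the bias formula $b = y_k - \boldsymbol{w}^T \phi(\boldsymbol{x}_k)$ can be verified with an off-the-shelf solver. A sketch using scikit-learn's linear SVC (assumed available; its `dual_coef_` attribute stores $y_i \alpha_i$ for the support vectors, and the toy data are made up):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
# Two well-separated Gaussian blobs, labels in {-1, +1}
X = np.r_[rng.randn(20, 2) + [2, 2], rng.randn(20, 2) - [2, 2]]
y = np.r_[np.ones(20), -np.ones(20)]

C = 1.0
clf = SVC(kernel="linear", C=C).fit(X, y)

# dual_coef_ holds y_i * alpha_i, one entry per support vector
w_manual = (clf.dual_coef_ @ clf.support_vectors_).ravel()  # sum_i alpha_i y_i x_i
print(np.allclose(w_manual, clf.coef_.ravel()))             # True

# b = y_k - w^T x_k for a support vector on the margin (0 < alpha_k < C)
free = np.where(np.abs(clf.dual_coef_).ravel() < C - 1e-6)[0]
k = free[0]
b_manual = y[clf.support_[k]] - clf.support_vectors_[k] @ w_manual
print(b_manual, clf.intercept_[0])  # approximately equal (solver tolerance)
```

The optimal weight vector is exactly the linear combination of the support vectors, and any on-margin support vector yields the bias.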
&lt;h2 id="apply-kernel-tricks-for-svm">Apply kernel tricks for SVM&lt;/h2>
&lt;ul>
&lt;li>Lagrangian:&lt;/li>
&lt;/ul>
$$
L(\boldsymbol{w}, \boldsymbol{\alpha}) = {\color{red}{\frac{1}{2} \boldsymbol{w}^{T} \boldsymbol{w}}} - \sum_{i} \alpha_{i}\left({\color{green}{y_{i} (\boldsymbol{w}^{T} \phi\left(\boldsymbol{x}_{i}\right)}}+ b)-\color{CornflowerBlue}{1}\right), \quad \boldsymbol{w}^{*}=\sum_{i} \alpha_{i} y_{i} \phi\left(\boldsymbol{x}_{i}\right)
$$
&lt;ul>
&lt;li>Dual function (&lt;strong>Wolfe Dual Lagrangian function&lt;/strong>):&lt;/li>
&lt;/ul>
$$
\begin{aligned}
g(\boldsymbol{\alpha}) &amp;=L\left(\boldsymbol{w}^{*}, \boldsymbol{\alpha}\right) \\\\
&amp;=\color{red}{\frac{1}{2} \underbrace{\sum_{i} \sum_{j} \alpha_{i} \alpha_{j} y_{i} y_{j} \phi\left(\boldsymbol{x}_{i}\right)^{T} \phi\left(\boldsymbol{x}_{j}\right)}_{{\boldsymbol{w}^*}^T \boldsymbol{w}^*}} - \color{green}{\sum_{i} \alpha_{i} y_{i}(\underbrace{\sum_{j} \alpha_{j} y_{j} \phi\left(x_{j}\right)}_{\boldsymbol{w}^*})^{T} \phi\left(x_{i}\right)} + \color{CornflowerBlue}{\sum_{i} \alpha_{i}} \\\\
&amp;=\sum_{i} \alpha_{i}-\frac{1}{2} \sum_{i} \sum_{j} \alpha_{i} \alpha_{j} y_{i} y_{j} \underbrace{\phi\left(\boldsymbol{x}_{i}\right)^{T} \phi\left(\boldsymbol{x}_{j}\right)}_{\overset{}{=} \boldsymbol{k}(\boldsymbol{x}_i, \boldsymbol{x}_j)} \\\\
&amp;= \sum_{i} \alpha_{i}-\frac{1}{2} \sum_{i} \sum_{j} \alpha_{i} \alpha_{j} y_{i} y_{j} \boldsymbol{k}(\boldsymbol{x}_i, \boldsymbol{x}_j )
\end{aligned}
$$
&lt;ul>
&lt;li>&lt;strong>Wolfe dual optimization problem&lt;/strong>:&lt;/li>
&lt;/ul>
$$
\begin{array}{ll}
\underset{\boldsymbol{\alpha}}{\max} \quad &amp; \sum_{i} \alpha_{i}-\frac{1}{2} \sum_{i} \sum_{j} \alpha_{i} \alpha_{j} y_{i} y_{j} \boldsymbol{k}(\boldsymbol{x}_i, \boldsymbol{x}_j ) \\\\
\text{ s.t. } \quad &amp; \alpha_i \geq 0 \quad \forall i = 1, \dots, N \\\\
&amp; \sum_i \alpha_i y_i = 0
\end{array}
$$
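The Wolfe dual is a quadratic program in $\boldsymbol{\alpha}$ and can be solved with a generic solver. A sketch using `scipy.optimize.minimize` (SLSQP): we minimize the negated dual under the constraints above, and also include the soft-margin upper bound $C \geq \alpha_i$ from the slack-variable formulation so the QP stays bounded even for non-separable data. The toy data are made up; a real implementation would use a dedicated QP/SMO solver:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.RandomState(1)
X = np.r_[rng.randn(10, 2) + 1.5, rng.randn(10, 2) - 1.5]
y = np.r_[np.ones(10), -np.ones(10)]
N, C = len(y), 1.0

K = X @ X.T                       # linear kernel Gram matrix: k(x_i, x_j) = x_i^T x_j

def neg_dual(a):                  # minimize -g(alpha)  <=>  maximize the Wolfe dual
    return 0.5 * (a * y) @ K @ (a * y) - a.sum()

res = minimize(neg_dual, np.zeros(N), method="SLSQP",
               bounds=[(0.0, C)] * N,                               # 0 <= alpha_i <= C
               constraints={"type": "eq", "fun": lambda a: a @ y})  # sum_i alpha_i y_i = 0
alpha = res.x
print(res.success, int((alpha > 1e-6).sum()))   # converged; number of support vectors
```

Only a few $\alpha_i$ end up nonzero; those index the support vectors.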
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Compute primal from dual parameters&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Weight vector&lt;/strong>
&lt;/p>
$$
\boldsymbol{w}^{*}=\sum_{i} \alpha_{i} y_{i} \phi\left(\boldsymbol{x}_{i}\right)
\label{eq:weight vector}
$$
&lt;ul>
&lt;li>Cannot be represented explicitly (it is potentially infinite-dimensional). But don&amp;rsquo;t worry, we don&amp;rsquo;t need the explicit representation&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Bias&lt;/strong>: For any $k$ with $\alpha_k > 0$ :&lt;/p>
&lt;/li>
&lt;/ul>
$$
\begin{array}{ll}
b &amp;=y_{k}-\boldsymbol{w}^{T} \phi\left(\boldsymbol{x}_{k}\right) \\\\
&amp;=y_{k}-\sum_{i} y_{i} \alpha_{i} k\left(\boldsymbol{x}_{i}, \boldsymbol{x}_{k}\right)
\end{array}
$$
&lt;ul>
&lt;li>&lt;strong>Decision function&lt;/strong> (Again, we use the kernel trick and therefore we don&amp;rsquo;t need the explicit representation of the weight vector $\boldsymbol{w}^*$)&lt;/li>
&lt;/ul>
$$
\begin{aligned}f(\boldsymbol{x}) &amp;= (\boldsymbol{w}^{*})^{T} \boldsymbol{\phi}(\boldsymbol{x}) + b \\\\
&amp;\overset{}{=} \left(\sum_{i} \alpha_{i} y_{i} \phi\left(\boldsymbol{x}_{i}\right)\right)^{T} \boldsymbol{\phi}(\boldsymbol{x}) + b \\\\
&amp;= \sum_{i} \alpha_{i} y_{i} \boldsymbol{\phi}(\boldsymbol{x}_i)^{T} \boldsymbol{\phi}(\boldsymbol{x}) + b \\\\
&amp; \overset{}{=}\sum_i y_{i} \alpha_{i} k\left(\boldsymbol{x}_{i}, \boldsymbol{x}\right)+b\end{aligned}
$$
&lt;/li>
&lt;/ul>
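The decision function above needs only kernel evaluations against the support vectors, never the explicit $\boldsymbol{w}^*$. A sketch with an RBF kernel, cross-checked against scikit-learn's `decision_function` (assumed available; `dual_coef_` stores $y_i \alpha_i$ and `intercept_` stores $b$; the dataset choice is arbitrary):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_moons

X, y01 = make_moons(n_samples=100, noise=0.15, random_state=0)
y = 2 * y01 - 1                                   # labels in {-1, +1}

gamma = 0.5
clf = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X, y)

def rbf(A, B):
    # k(a, b) = exp(-gamma * ||a - b||^2)
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

# f(x) = sum_i alpha_i y_i k(x_i, x) + b, summing over support vectors only
K = rbf(clf.support_vectors_, X)
f_manual = clf.dual_coef_ @ K + clf.intercept_    # shape (1, n_samples)

print(np.allclose(f_manual.ravel(), clf.decision_function(X)))  # True
```

The manual kernel sum reproduces the library's decision values exactly, even though $\phi$ for the RBF kernel is infinite-dimensional.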
&lt;h2 id="relaxed-constraints-with-slack-variable">Relaxed constraints with slack variable&lt;/h2>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Primal optimization problem&lt;/strong>
&lt;/p>
$$
\begin{array}{ll} \underset{\mathbf{w}}{\operatorname{argmin}} \quad &amp;\|\mathbf{w}\|^{2} + \color{CornflowerBlue}{C \sum_i^N \xi_i} \\\\
\text { s.t. } \quad &amp; y_{i}\left(\mathbf{w}^{T} \phi(\mathbf{x}_{i}) + b\right) \geq 1 - \color{CornflowerBlue}{\xi_i}, \quad \color{CornflowerBlue}{\xi_i} \geq 0\end{array}
$$
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Dual optimization problem&lt;/strong>
&lt;/p>
$$
\begin{array}{ll}\underset{\boldsymbol{\alpha}}{\max} \quad &amp; \sum_{i} \alpha_{i}-\frac{1}{2} \sum_{i} \sum_{j} \alpha_{i} \alpha_{j} y_{i} y_{j} \boldsymbol{k}(\boldsymbol{x}_i, \boldsymbol{x}_j ) \\\\ \text{ s.t. } \quad &amp; \color{CornflowerBlue}{C \geq} \alpha_i \geq 0 \quad \forall i = 1, \dots, N \\\\ &amp; \sum_i \alpha_i y_i = 0\end{array}
$$
&lt;p>&lt;span style="color:CornflowerBlue">Add upper bound of &lt;/span> $\color{CornflowerBlue}{C}$ &lt;span style="color:CornflowerBlue">on&lt;/span> $\color{CornflowerBlue}{\alpha_i}$&lt;/p>
&lt;ul>
&lt;li>Without slack, $\alpha_i \to \infty$ when a constraint is violated (i.e., a point is misclassified)&lt;/li>
&lt;li>The upper bound $C$ limits the $\alpha_i$, so misclassifications are allowed&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul></description></item><item><title>Decision Trees</title><link>https://haobin-tan.netlify.app/docs/ai/machine-learning/decision-tree/</link><pubDate>Mon, 07 Sep 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/machine-learning/decision-tree/</guid><description/></item><item><title>Ensemble Learning</title><link>https://haobin-tan.netlify.app/docs/ai/machine-learning/ensemble-learning/</link><pubDate>Mon, 07 Sep 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/machine-learning/ensemble-learning/</guid><description/></item><item><title>Why ensemble learning?</title><link>https://haobin-tan.netlify.app/docs/ai/machine-learning/ensemble-learning/why-ensemble-learning/</link><pubDate>Sat, 07 Nov 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/machine-learning/ensemble-learning/why-ensemble-learning/</guid><description>&lt;p>&lt;strong>wisdom of the crowd&lt;/strong> : In many cases you will find that this aggregated answer is better than an expert’s answer.&lt;/p>
&lt;p>Similarly, if you aggregate the predictions of a group of predictors (such as classifiers or regressors), you will often get &lt;strong>better&lt;/strong> predictions than with the best individual predictor.&lt;/p>
&lt;p>A group of predictors is called an &lt;strong>ensemble&lt;/strong>;&lt;/p>
&lt;p>thus, this technique is called &lt;strong>Ensemble Learning&lt;/strong>,&lt;/p>
&lt;p>and an Ensemble Learning algorithm is called an &lt;strong>Ensemble method&lt;/strong>.&lt;/p>
&lt;p>Popular Ensemble methods:&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://haobin-tan.netlify.app/docs/ai/machine-learning/ensemble-learning/bagging-and-pasting/">Bagging and Pasting&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://haobin-tan.netlify.app/docs/ai/machine-learning/ensemble-learning/boosting/">Boosting&lt;/a>&lt;/li>
&lt;li>Stacking&lt;/li>
&lt;li>&lt;a href="https://haobin-tan.netlify.app/docs/ai/machine-learning/ensemble-learning/voting-classifier/">Voting Classifier&lt;/a>&lt;/li>
&lt;/ul></description></item><item><title>Voting Classifier</title><link>https://haobin-tan.netlify.app/docs/ai/machine-learning/ensemble-learning/voting-classifier/</link><pubDate>Sat, 07 Nov 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/machine-learning/ensemble-learning/voting-classifier/</guid><description>&lt;p>Suppose we have trained a few classifiers, each one achieving about 80% accuracy.&lt;/p>
&lt;p>A very simple way to create an even better classifier is to aggregate the predictions of each classifier and predict the class that gets the &lt;strong>most&lt;/strong> votes.&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/Voting_Classifier.png" alt="Voting_Classifier" style="zoom:67%;" />
&lt;p>This majority-vote classifier is called a &lt;strong>hard voting classifier&lt;/strong>.&lt;/p>
&lt;blockquote>
&lt;p>Surprisingly, this voting classifier often achieves a higher accuracy than the best classifier in the ensemble. In fact, even if each classifier is a weak learner (meaning it does only slightly better than random guessing), the ensemble can still be a strong learner (achieving high accuracy), provided there are a sufficient number of weak learners and they are sufficiently diverse. (Reason behind: the law of large numbers)&lt;/p>
&lt;/blockquote>
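&lt;p>A quick simulation, under the assumption of fully independent classifiers, illustrates the law-of-large-numbers effect: 1000 weak learners that are each right only 51% of the time yield a far more accurate majority vote (all numbers below are illustrative):&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(0)

n_classifiers = 1000  # number of weak learners in the ensemble
p_correct = 0.51      # each one is only slightly better than random guessing
n_trials = 5_000      # number of test instances

# 1 where a classifier votes correctly, 0 otherwise (independent votes)
correct_votes = rng.binomial(1, p_correct, size=(n_trials, n_classifiers))

# Hard voting: the ensemble is right when the majority of votes are right
majority_correct = correct_votes.sum(axis=1) > n_classifiers / 2
accuracy = majority_correct.mean()
print(accuracy)  # roughly 0.73, far above the individual 0.51
```

&lt;p>In practice real classifiers are trained on the same data and make correlated errors, so the gain is smaller than this idealized simulation suggests.&lt;/p>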
&lt;p>&lt;strong>Ensemble methods work best when the predictors are as independent from one another as possible.&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>One way to get diverse classifiers is to &lt;strong>train them using very different algorithms.&lt;/strong> This increases the chance that they will make very different types of errors, improving the ensemble’s accuracy.&lt;/li>
&lt;li>Another approach is to use the &lt;strong>same&lt;/strong> training algorithm for every predictor, but to train them on different random subsets of the training set. (See &lt;a href="https://haobin-tan.netlify.app/docs/ai/machine-learning/ensemble-learning/bagging-and-pasting/">Bagging and Pasting&lt;/a>)&lt;/li>
&lt;/ul></description></item><item><title>Random Forest</title><link>https://haobin-tan.netlify.app/docs/ai/machine-learning/ensemble-learning/random-forest/</link><pubDate>Sat, 07 Nov 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/machine-learning/ensemble-learning/random-forest/</guid><description>&lt;img src="https://i.stack.imgur.com/iY55n.jpg" style="zoom:80%; background-color:white">
&lt;p>Train a group of Decision Tree classifiers (generally via the bagging method (or sometimes pasting)), each on a different random subset of the training set&lt;/p>
&lt;p>To make predictions, just obtain the predictions of all individual trees, then predict the class that gets the &lt;strong>most&lt;/strong> votes.&lt;/p>
&lt;h2 id="why-is-random-forest-good">Why is Random Forest good?&lt;/h2>
&lt;p>The Random Forest algorithm &lt;strong>introduces extra randomness&lt;/strong> when growing trees; instead of searching for the very best feature when splitting a node, it searches for the best feature among a random subset of features. &lt;strong>This results in a greater tree diversity, which (once again) trades a higher bias for a lower variance, generally yielding an overall better model.&lt;/strong> 👏&lt;/p></description></item><item><title>Ensemble Learners</title><link>https://haobin-tan.netlify.app/docs/ai/machine-learning/ensemble-learning/ensemble-learners/</link><pubDate>Sat, 07 Nov 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/machine-learning/ensemble-learning/ensemble-learners/</guid><description>
&lt;div style="position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;">
&lt;iframe allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen="allowfullscreen" loading="eager" referrerpolicy="strict-origin-when-cross-origin" src="https://www.youtube.com/embed/Un9zObFjBH0?autoplay=0&amp;controls=1&amp;end=0&amp;loop=0&amp;mute=0&amp;start=0" style="position: absolute; top: 0; left: 0; width: 100%; height: 100%; border:0;" title="YouTube video"
>&lt;/iframe>
&lt;/div>
&lt;h2 id="why-emsemble-learners">Why emsemble learners?&lt;/h2>
&lt;p>Lower error&lt;/p>
&lt;ul>
&lt;li>Each learner (model) has its own bias. If we put them together, the biases tend to be reduced (they counteract each other to some extent)&lt;/li>
&lt;li>Less overfitting&lt;/li>
&lt;li>Tastes great&lt;/li>
&lt;/ul></description></item><item><title>Boosting</title><link>https://haobin-tan.netlify.app/docs/ai/machine-learning/ensemble-learning/boosting/</link><pubDate>Sat, 07 Nov 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/machine-learning/ensemble-learning/boosting/</guid><description>&lt;h1 id="boosting">Boosting&lt;/h1>
&lt;p>Refers to any Ensemble method that can &lt;strong>combine several weak learners into a strong learner&lt;/strong>&lt;/p>
&lt;p>💡 &lt;strong>General idea: train predictors sequentially, each trying to correct its predecessor.&lt;/strong>&lt;/p>
&lt;p>Popular boosting methods:&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://haobin-tan.netlify.app/docs/ai/machine-learning/ensemble-learning/adaboost/">AdaBoost&lt;/a>&lt;/li>
&lt;li>Gradient Boost&lt;/li>
&lt;/ul></description></item><item><title>Bagging and Pasting</title><link>https://haobin-tan.netlify.app/docs/ai/machine-learning/ensemble-learning/bagging-and-pasting/</link><pubDate>Sat, 07 Nov 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/machine-learning/ensemble-learning/bagging-and-pasting/</guid><description>&lt;h2 id="tldr">TL;DR&lt;/h2>
&lt;ul>
&lt;li>
&lt;p>Bootstrap Aggregating (Bagging): Sampling &lt;strong>with&lt;/strong> replacement&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-100" >&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/Boostrap_Aggregating.png" alt="Boostrap_Aggregating" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Pasting: Sampling &lt;strong>without&lt;/strong> replacement&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h2 id="explaination">Explaination&lt;/h2>
&lt;p>Ensemble methods work best when the predictors are as independent from one another as possible.&lt;/p>
&lt;p>One way to get a diverse set of classifiers: &lt;strong>use the same training algorithm for every predictor, but to train them on different random subsets of the training set&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Sampling &lt;strong>with&lt;/strong> replacement: &lt;strong>bootstrap aggregating (Bagging)&lt;/strong>&lt;/li>
&lt;li>Sampling &lt;strong>without&lt;/strong> replacement: &lt;strong>pasting&lt;/strong>&lt;/li>
&lt;/ul>
&lt;p>Once all predictors are trained, the ensemble can make a prediction for a new instance by simply aggregating the predictions of all predictors. The aggregation function is typically the &lt;strong>statistical mode&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>classification: the most frequent prediction (just like a hard voting classifier)&lt;/li>
&lt;li>regression: average&lt;/li>
&lt;/ul>
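&lt;p>A tiny sketch of both aggregation rules, with made-up predictions from five hypothetical predictors:&lt;/p>

```python
import numpy as np
from collections import Counter

# Hypothetical predictions from 5 predictors for one new instance
class_preds = ["cat", "dog", "cat", "cat", "dog"]
reg_preds = np.array([2.1, 1.9, 2.3, 2.0, 2.2])

# Classification: statistical mode (most frequent prediction)
mode_pred = Counter(class_preds).most_common(1)[0][0]
print(mode_pred)         # "cat"

# Regression: average of the individual predictions
print(reg_preds.mean())  # 2.1
```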
&lt;p>Each individual predictor has a higher bias than if it were trained on the original training set, but aggregation reduces both bias and variance. 👏&lt;/p>
&lt;p>Generally, the net result is that the ensemble has a &lt;strong>similar bias but a lower variance&lt;/strong> than a single predictor trained on the original training set.&lt;/p>
&lt;h2 id="advantages-of-bagging-and-pasting">Advantages of Bagging and Pasting&lt;/h2>
&lt;ul>
&lt;li>Predictors can all be trained in parallel, via different CPU cores or even different servers.&lt;/li>
&lt;li>Predictions can be made in parallel.&lt;/li>
&lt;/ul>
&lt;p>-&amp;gt; They scale very well 👍&lt;/p>
&lt;h2 id="bagging-vs-pasting">Bagging vs. Pasting&lt;/h2>
&lt;ul>
&lt;li>
&lt;p>Bootstrapping introduces a bit more diversity in the subsets that each predictor is trained on, so bagging ends up with a &lt;strong>slightly&lt;/strong> &lt;strong>higher bias&lt;/strong> than pasting, but this also means that predictors end up being &lt;strong>less correlated&lt;/strong> so the ensemble’s variance is reduced.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Overall, bagging often results in better models&lt;/strong>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>However, if you have spare time and CPU power, you can use cross-validation to evaluate both bagging and pasting and select the one that works best.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h2 id="out-of-bag-evaluation">Out-of-Bag Evaluation&lt;/h2>
&lt;p>With bagging, some instances may be sampled several times for any given predictor, while others may not be sampled at all. This means that only about 63% of the training instances are sampled on average for each predictor.&lt;/p>
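&lt;p>The 63% figure can be checked with a short simulation (the training-set size below is an arbitrary choice): the fraction of distinct instances in a bootstrap sample of size $n$ approaches $1 - e^{-1} \approx 0.632$ as $n$ grows.&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000  # training set size

# One bootstrap sample: draw n instances uniformly with replacement
sample = rng.integers(0, n, size=n)
frac_sampled = np.unique(sample).size / n

print(frac_sampled)      # about 0.632
print(1 - np.exp(-1.0))  # theoretical limit 1 - 1/e = 0.6321...
```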
&lt;p>The remaining 37% of the training instances that are not sampled are called &lt;strong>out-of-bag (oob) instances.&lt;/strong> Note that they are &lt;strong>not the same 37%&lt;/strong> for all predictors.&lt;/p>
&lt;p>Since a predictor never sees the oob instances during training, it can be evaluated on these instances, without the need for a separate validation set. You can evaluate the ensemble itself by averaging out the oob evaluations of each predictor.&lt;/p></description></item><item><title>AdaBoost</title><link>https://haobin-tan.netlify.app/docs/ai/machine-learning/ensemble-learning/adaboost/</link><pubDate>Sat, 07 Nov 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/machine-learning/ensemble-learning/adaboost/</guid><description>&lt;p>&lt;strong>Ada&lt;/strong>ptive &lt;strong>Boost&lt;/strong>ing:&lt;/p>
&lt;p>Correct its predecessor by paying a bit more attention to the training instance that the predecessor underfitted. This results in new predictors focusing more and more on the hard cases.&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/AdaBoost.png" alt="AdaBoost" style="zoom:80%;" />
&lt;h2 id="pseudocode">Pseudocode&lt;/h2>
&lt;ol>
&lt;li>
&lt;p>Assign each observation $i$ the initial weight $d\_{1,i}=\frac{1}{n}$ (equal weights)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>For $t=1:T$&lt;/p>
&lt;ol>
&lt;li>
&lt;p>Train the weak learning algorithm using data weighted by $d\_{t,i}$. This produces weak classifier $h\_t$&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Choose coefficient $\alpha\_t$ (tells us how good the classifier is at that round)&lt;/p>
&lt;/li>
&lt;/ol>
$$
\begin{aligned}
\operatorname{Error}\_{t} &amp;= \displaystyle\sum\_{i: h\_{t}\left(x\_{i}\right) \neq y\_{i}} d\_{t,i} \quad \text{(sum of weights of misclassified points)} \\\\
\alpha\_t &amp;= \frac{1}{2} \ln \left(\frac{1 - \operatorname{Error}\_{t}}{\operatorname{Error}\_{t}}\right)
\end{aligned}
$$
&lt;ol start="3">
&lt;li>
&lt;p>Update weights
&lt;/p>
$$
d\_{t+1, i}=\frac{d\_{t, i} \cdot \exp (-\alpha\_{t} y\_{i} h\_{t}\left(x\_{i}\right))}{Z\_{t}}
$$
&lt;ul>
&lt;li>
&lt;p>$Z\_t = \displaystyle \sum\_{i=1}^{n} d\_{t, i} \cdot \exp (-\alpha\_{t} y\_{i} h\_{t}\left(x\_{i}\right)) $: &lt;strong>normalization factor&lt;/strong> (ensures that the updated weights $d\_{t+1, i}$ sum to 1)&lt;/p>
&lt;blockquote>
&lt;ul>
&lt;li>If prediction $i$ is correct $\rightarrow y\_i h\_t(x\_i) = 1 \rightarrow $ Weight of observation $i$ will be decreased by $\exp(-\alpha\_t)$&lt;/li>
&lt;li>If prediction $i$ is incorrect $ \rightarrow y\_i h\_t(x\_i) = -1 \rightarrow $ Weight of observation $i$ will be increased by $\exp(\alpha\_t)$&lt;/li>
&lt;/ul>
&lt;/blockquote>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol>
&lt;/li>
&lt;li>
&lt;p>Output the final classifier&lt;/p>
&lt;p>$
H(x)=\operatorname{sign}\left(\sum\_{t=1}^{T} \alpha\_{t} h\_{t}\left(x\right)\right)
$&lt;/p>
&lt;/li>
&lt;/ol>
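&lt;p>The pseudocode above can be sketched in NumPy with decision stumps as weak learners. The toy 1-D dataset, the stump candidate set, and $T=5$ rounds are all illustrative assumptions:&lt;/p>

```python
import numpy as np

# Toy 1-D dataset: class +1 on the outside, class -1 in the middle
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1, 1, -1, -1, 1, 1])
n = len(x)

def stump(threshold, polarity):
    """Weak learner: predict `polarity` where x is at least `threshold`."""
    return lambda xs: np.where(xs >= threshold, polarity, -polarity)

# Candidate weak learners: thresholds halfway between points (plus the ends)
candidates = [stump(t, p) for t in np.arange(-0.5, 6.5) for p in (1, -1)]

d = np.full(n, 1.0 / n)  # step 1: equal initial weights d_{1,i} = 1/n
learners, alphas = [], []

for t in range(5):  # T = 5 rounds
    # 2.1: train = pick the stump with the lowest weighted error
    errors = [d[h(x) != y].sum() for h in candidates]
    best = int(np.argmin(errors))
    h = candidates[best]
    err = max(errors[best], 1e-10)  # guard against division by zero
    # 2.2: alpha_t = (1/2) * ln((1 - Error_t) / Error_t)
    alpha = 0.5 * np.log((1.0 - err) / err)
    # 2.3: re-weight, emphasizing misclassified points, then normalize by Z_t
    d = d * np.exp(-alpha * y * h(x))
    d = d / d.sum()
    learners.append(h)
    alphas.append(alpha)

def H(xs):
    """Final classifier: sign of the alpha-weighted vote of weak learners."""
    return np.sign(sum(a * h(xs) for a, h in zip(alphas, learners)))

print(H(x))  # recovers y on this toy set
```

&lt;p>No single stump can separate this "interval" pattern, but the weighted vote of a few stumps can.&lt;/p>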
&lt;h2 id="example">Example&lt;/h2>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/AdaBoost_Eg-00.png" alt="AdaBoost_Eg-00" style="zoom:50%;" />
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/AdaBoost_Eg-01.png" alt="AdaBoost_Eg-01" style="zoom:50%;" />
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/AdaBoost_Eg-02.png" alt="AdaBoost_Eg-02" style="zoom:50%;" />
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/AdaBoost_Eg-03.png" alt="AdaBoost_Eg-03" style="zoom:50%;" />
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/AdaBoost_Eg-04.png" alt="AdaBoost_Eg-04" style="zoom:50%;" />
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/AdaBoost_Eg-05.png" alt="AdaBoost_Eg-05" style="zoom:50%;" />
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/AdaBoost_Eg-06.png" alt="AdaBoost_Eg-06" style="zoom:50%;" />
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/AdaBoost_Eg-07.png" alt="AdaBoost_Eg-07" style="zoom:50%;" />
&lt;h2 id="tutorial">Tutorial&lt;/h2>
&lt;div style="position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;">
&lt;iframe allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen="allowfullscreen" loading="eager" referrerpolicy="strict-origin-when-cross-origin" src="https://www.youtube.com/embed/-DUxtdeCiB4?autoplay=0&amp;controls=1&amp;end=0&amp;loop=0&amp;mute=0&amp;start=0" style="position: absolute; top: 0; left: 0; width: 100%; height: 100%; border:0;" title="YouTube video"
>&lt;/iframe>
&lt;/div></description></item><item><title>Non-parametric Machine Learning Alogrithms</title><link>https://haobin-tan.netlify.app/docs/ai/machine-learning/non-parametric/</link><pubDate>Mon, 07 Sep 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/machine-learning/non-parametric/</guid><description/></item><item><title>Linear Discriminant Functions</title><link>https://haobin-tan.netlify.app/docs/ai/machine-learning/non-parametric/linear-discriminant-functions/</link><pubDate>Sat, 07 Nov 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/machine-learning/non-parametric/linear-discriminant-functions/</guid><description>&lt;ul>
&lt;li>No assumption about distributions -&amp;gt; &lt;strong>non-parametric&lt;/strong>&lt;/li>
&lt;li>Linear decision surfaces&lt;/li>
&lt;li>Begin by supervised training (given class of training data)&lt;/li>
&lt;/ul>
&lt;h2 id="linear-discriminant-functions-and-decision-surfaces">Linear Discriminant Functions and Decision Surfaces&lt;/h2>
&lt;p>A discriminant function that is a linear combination of the components of $x$ can be written as
&lt;/p>
$$
g(\mathbf{x})=\mathbf{w}^{T} \mathbf{x}+w\_{0}
$$
&lt;ul>
&lt;li>$\mathbf{x}$: feature vector&lt;/li>
&lt;li>$\mathbf{w}$: weight vector&lt;/li>
&lt;li>$w\_0$: bias or threshold weight&lt;/li>
&lt;/ul>
&lt;h3 id="the-two-category-case">The two category case&lt;/h3>
&lt;p>Decision rule:&lt;/p>
&lt;ul>
&lt;li>Decide $w\_1$ if $g(\mathbf{x}) > 0 \Leftrightarrow \mathbf{w}^{T} \mathbf{x}+w\_{0} > 0 \Leftrightarrow \mathbf{w}^{T} \mathbf{x}> -w\_{0}$&lt;/li>
&lt;li>Decide $w\_{2}$ if $g(\mathbf{x}) &lt; 0 \Leftrightarrow \mathbf{w}^{T} \mathbf{x}+w\_{0} &lt; 0 \Leftrightarrow \mathbf{w}^{T} \mathbf{x}&lt;-w\_{0}$&lt;/li>
&lt;li>$g(\mathbf{x}) = 0$: assign to either class or can be left undefined&lt;/li>
&lt;/ul>
&lt;p>The equation $g(\mathbf{x}) = 0$ defines the decision surface that separates points assigned to $w\_{1}$ from points assigned to $w\_{2}$. When $g(\mathbf{x})$ is linear, this decision surface is a &lt;strong>hyperplane&lt;/strong>.&lt;/p>
&lt;p>For arbitrary $\mathbf{x}\_1$ and $\mathbf{x}\_2$ on the decision surface, we have:
&lt;/p>
$$
\mathbf{w}^{\mathrm{T}} \mathbf{x}\_{1}+w\_{0}=\mathbf{w}^{\mathrm{T}} \mathbf{x}\_{2}+w\_{0}
$$
$$
\mathbf{w}^{\mathrm{T}}\left(\mathbf{x}\_{1}-\mathbf{x}\_{2}\right)=0
$$
&lt;p>$\Rightarrow \mathbf{w}$ is &lt;strong>normal&lt;/strong> to any vector lying in the hyperplane.&lt;/p>
&lt;p>In general, the hyperplane $H$ divides the feature space into two half-spaces:&lt;/p>
&lt;ul>
&lt;li>decision region $R\_1$ for $w\_1$&lt;/li>
&lt;li>decision region $R\_2$ for $w\_2$&lt;/li>
&lt;/ul>
&lt;p>Because $g(\mathbf{x}) > 0$ if $\mathbf{x}$ in $R\_1$, it follows that the normal vector $\mathbf{w}$ points into $R\_1$. Therefore, it is sometimes said that any $\mathbf{x}$ in $R\_1$ is on the &lt;em>positive&lt;/em> side of $H$, and any $\mathbf{x}$ in $R\_2$ is on the &lt;em>negative&lt;/em> side of $H$&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/image015.jpg" alt="img">&lt;/p>
&lt;p>The discriminant function $g(\mathbf{x})$ gives an algebraic measure of the distance from $\mathbf{x}$ to the hyperplane. We can write $\mathbf{x}$ as
&lt;/p>
$$
\mathbf{x}=\mathbf{x}\_{p}+r \frac{\mathbf{w}}{\|\mathbf{w}\|}
$$
&lt;ul>
&lt;li>$\mathbf{x}\_{p}$: normal projection of $\mathbf{x}$ onto $H$&lt;/li>
&lt;li>$r$: desired algebraic distance which is positive if $\mathbf{x}$ is on the positive side, else negative&lt;/li>
&lt;/ul>
&lt;p>As $\mathbf{x}\_p$ is on the hyperplane&lt;/p>
$$
\begin{array}{ll}
g\left(\mathbf{x}\_{p}\right)=0 \\\\
\mathbf{w}^{\mathrm{T}} \mathbf{x}\_{p}+w\_{0}=0 \\\\
\mathbf{w}^{\mathrm{T}}\left(\mathbf{x}-r \frac{\mathbf{w}}{\|\mathbf{w}\|}\right)+w\_{0}=0 \\\\
\mathbf{w}^{\mathrm{T}} \mathbf{x}-r \frac{\mathbf{w}^{\mathrm{T}} \mathbf{w}}{\|\mathbf{w}\|}+w\_{0}=0 \\\\
\mathbf{w}^{\mathrm{T}} \mathbf{x}-r\|\mathbf{w}\| + w\_0 = 0 \\\\
\underbrace{\mathbf{w}^{\mathrm{T}} \mathbf{x} + w\_0}\_{=g(\mathbf{x})} = r\|\mathbf{w}\| \\\\
\Rightarrow g(\mathbf{x}) = r\|\mathbf{w}\| \\\\
\Rightarrow r = \frac{g(\mathbf{x})}{\|\mathbf{w}\|}
\end{array}
$$
&lt;p>In particular, the distance from the origin to hyperplane $H$ is given by $\frac{w\_0}{\|\mathbf{w}\|}$&lt;/p>
&lt;ul>
&lt;li>$w\_0 > 0$: the origin is on the &lt;em>positive&lt;/em> side of $H$&lt;/li>
&lt;li>$w\_0 &lt; 0$: the origin is on the &lt;em>negative&lt;/em> side of $H$&lt;/li>
&lt;li>$w\_0 = 0$: $g(\mathbf{x})$ has the homogeneous form $\mathbf{w}^{\mathrm{T}} \mathbf{x}$ and the hyperplane passes through the origin&lt;/li>
&lt;/ul>
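&lt;p>A small numerical check of the derivation above, with a hypothetical hyperplane in 2-D (the values of $\mathbf{w}$, $w\_0$, and $\mathbf{x}$ are arbitrary):&lt;/p>

```python
import numpy as np

# Hypothetical hyperplane g(x) = w^T x + w0 in 2-D
w = np.array([3.0, 4.0])  # normal vector, with norm 5
w0 = -10.0                # bias / threshold weight

def g(x):
    return np.dot(w, x) + w0

x = np.array([4.0, 3.0])
r = g(x) / np.linalg.norm(w)         # signed distance from x to H
x_p = x - r * w / np.linalg.norm(w)  # normal projection of x onto H

print(r)       # 2.8  (positive: x is on the positive side)
print(g(x_p))  # 0.0  (x_p lies on the hyperplane)
```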
&lt;p>A linear discriminant function divides the feature space by a hyperplane decision surface:&lt;/p>
&lt;ul>
&lt;li>orientation: determined by the normal vector $\mathbf{w}$&lt;/li>
&lt;li>location: determined by the bias $w\_0$&lt;/li>
&lt;/ul>
&lt;h2 id="reference">Reference&lt;/h2>
&lt;ul>
&lt;li>&lt;a href="https://www.byclb.com/TR/Tutorials/neural_networks/ch9_1.htm">https://www.byclb.com/TR/Tutorials/neural_networks/ch9_1.htm&lt;/a>&lt;/li>
&lt;/ul></description></item><item><title>Linear Discriminant Analysis (LDA)</title><link>https://haobin-tan.netlify.app/docs/ai/machine-learning/non-parametric/lda-summary/</link><pubDate>Sat, 07 Nov 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/machine-learning/non-parametric/lda-summary/</guid><description>&lt;p>&lt;strong>Linear Discriminant Analysis (LDA)&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>also called &lt;strong>Fisher’s Linear Discriminant&lt;/strong>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>reduces dimension (like PCA)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>but focuses on &lt;strong>maximizing separability among known categories&lt;/strong>&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h2 id="-idea">💡 Idea&lt;/h2>
&lt;ol>
&lt;li>Create a new axis&lt;/li>
&lt;li>Project the data onto this new axis in a way to maximize the separation of two categories&lt;/li>
&lt;/ol>
&lt;h2 id="how-it-works">How it works?&lt;/h2>
&lt;h3 id="create-a-new-axis">Create a new axis&lt;/h3>
&lt;p>According to two criteria (considered simultaneously):&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Maximize the distance between means&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Minimize the variation $s^2$ (which LDA calls &amp;ldquo;scatter&amp;rdquo;) within each category&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-05-14%2015.11.22.png" alt="截屏2020-05-14 15.11.22" style="zoom:50%;" />
&lt;/li>
&lt;/ul>
&lt;p>We have:
&lt;/p>
$$
\frac{(\overbrace{\mu_1 - \mu_2}^{=: d})^2}{s_1^2 + s_2^2} \qquad\left(\frac{\text{"ideally large"}}{\text{"ideally small"}}\right)
$$
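&lt;p>This criterion can be evaluated for any candidate axis. A sketch with two synthetic 2-D categories (all data below are made up for illustration) shows that an axis aligned with the mean difference scores much higher than one orthogonal to it:&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(1)

# Two hypothetical 2-D categories, separated along the first feature
c1 = rng.normal([0.0, 0.0], [1.0, 0.3], size=(100, 2))
c2 = rng.normal([4.0, 0.0], [1.0, 0.3], size=(100, 2))

def fisher_score(axis, a, b):
    """(mu_1 - mu_2)^2 / (s_1^2 + s_2^2) for data projected onto `axis`."""
    axis = axis / np.linalg.norm(axis)
    p1, p2 = a @ axis, b @ axis
    return (p1.mean() - p2.mean()) ** 2 / (p1.var() + p2.var())

# The axis along the mean difference scores far higher than the other axis
print(fisher_score(np.array([1.0, 0.0]), c1, c2))  # large
print(fisher_score(np.array([0.0, 1.0]), c1, c2))  # near zero
```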
&lt;p>
&lt;strong>Why both distance and scatter are important?&lt;/strong>&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2020-05-14%2015.17.59.png" alt="截屏2020-05-14 15.17.59">&lt;/p>
&lt;h4 id="more-than-2-dimensions">More than 2 dimensions&lt;/h4>
&lt;p>The process is the &lt;strong>same&lt;/strong> 👏:&lt;/p>
&lt;p>Create an axis that maximizes the distance between the means for the two categories while minimizing the scatter&lt;/p>
&lt;h4 id="more-than-2-categories-eg-3-categories">More than 2 categories (e.g. 3 categories)&lt;/h4>
&lt;p>Little difference:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Measure the distances among the means&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Find the point that is &lt;strong>central&lt;/strong> to all of the data&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Then measure the distances between a point that is central in each category and the main central point&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-05-14%2015.26.35.png" alt="截屏2020-05-14 15.26.35" style="zoom:50%;" />
&lt;/li>
&lt;li>
&lt;p>Maximize the distance between each category and the central point while minimizing the scatter for each category&lt;/p>
&lt;/li>
&lt;/ul>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-05-14%2015.28.40.png" alt="截屏2020-05-14 15.28.40" style="zoom:50%;" />
&lt;/li>
&lt;li>
&lt;p>Create 2 axes to separate the data (because the 3 central points for each category define a plane)&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-05-14%2015.30.16.png" alt="截屏2020-05-14 15.30.16" style="zoom:50%;" />
&lt;/li>
&lt;/ul>
&lt;h2 id="lda-and-pca">LDA and PCA&lt;/h2>
&lt;h3 id="similarities">Similarities&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Both rank the new axes in order of importance&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>PC1 (the first new axis that PCA creates) accounts for the most variation in the data
&lt;ul>
&lt;li>PC2 (the second new axis) does the second best job&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>LD1 (the first new axis that LDA creates) accounts for the most variation between the categories
&lt;ul>
&lt;li>LD2 does the second best job&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Both can let you dig in and see which features are driving the new axes&lt;/strong>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Both try to reduce dimensions&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>PCA looks at the features with the most variation&lt;/li>
&lt;li>LDA tries to maximize the separation of known categories&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h2 id="reference">Reference&lt;/h2>
&lt;ul>
&lt;li>&lt;a href="https://www.youtube.com/watch?v=azXCzI57Yfc">https://www.youtube.com/watch?v=azXCzI57Yfc&lt;/a>&lt;/li>
&lt;/ul></description></item><item><title>Unsupervised Learning</title><link>https://haobin-tan.netlify.app/docs/ai/machine-learning/unsupervised/</link><pubDate>Mon, 07 Sep 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/machine-learning/unsupervised/</guid><description/></item><item><title>Gaussian Mixture Model</title><link>https://haobin-tan.netlify.app/docs/ai/machine-learning/unsupervised/gaussian-mixture-model/</link><pubDate>Sat, 07 Nov 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/machine-learning/unsupervised/gaussian-mixture-model/</guid><description>&lt;h2 id="gaussian-distribution">Gaussian Distribution&lt;/h2>
&lt;p>&lt;strong>Univariate&lt;/strong>: The Probability Density Function (PDF) is:
&lt;/p>
$$
P(x | \theta)=\frac{1}{\sqrt{2 \pi \sigma^{2}}} \exp \left(-\frac{(x-\mu)^{2}}{2 \sigma^{2}}\right)
$$
&lt;ul>
&lt;li>$\mu$: mean&lt;/li>
&lt;li>$\sigma$: standard deviation&lt;/li>
&lt;/ul>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/gaussians.png" alt="gaussian mixture models">&lt;/p>
&lt;p>&lt;strong>Multivariate&lt;/strong>: The Probability Density Function (PDF) is:
&lt;/p>
$$
P(x | \theta)=\frac{1}{(2 \pi)^{\frac{D}{2}}|\Sigma|^{\frac{1}{2}}} \exp \left(-\frac{(x-\mu)^{T} \Sigma^{-1}(x-\mu)}{2}\right)
$$
&lt;ul>
&lt;li>$\mu$: mean&lt;/li>
&lt;li>$\Sigma$: covariance&lt;/li>
&lt;li>$D$: dimension of data&lt;/li>
&lt;/ul>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/gaussians-3d-300x224.png" alt="gaussian mixture models">&lt;/p>
&lt;h3 id="learning">Learning&lt;/h3>
&lt;p>For univariate Gaussian model, we can use Maximum Likelihood Estimation (MLE) to estimate parameter $\theta$ :
&lt;/p>
$$
\theta= \underset{\theta}{\operatorname{argmax}} L(\theta)
$$
&lt;p>
Assuming data are i.i.d, we have:
&lt;/p>
$$
L(\theta)=\prod\_{j=1}^{N} P\left(x\_{j} | \theta\right)
$$
&lt;p>
For numerical stability, we usually use Maximum Log-Likelihood:
&lt;/p>
$$
\begin{align} \theta &amp;= \underset{\theta}{\operatorname{argmax}} L(\theta) \\\\
&amp;= \underset{\theta}{\operatorname{argmax}} \log(L(\theta)) \\\\
&amp;= \underset{\theta}{\operatorname{argmax}} \sum\_{j=1}^{N} \log P\left(x\_{j} | \theta\right)\end{align}
$$
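&lt;p>For a single univariate Gaussian, the MLE has a closed form (sample mean and the $1/N$ standard deviation). A sketch on synthetic data (the true parameters $\mu=2.0$, $\sigma=1.5$ are illustrative assumptions) confirms that these estimates maximize the log-likelihood:&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.5, size=10_000)

# MLE for a univariate Gaussian has a closed form:
mu_hat = data.mean()
sigma_hat = data.std()  # MLE uses 1/N, i.e. the biased estimator

def log_likelihood(x, mu, sigma):
    """Sum of log P(x_j | theta) for the univariate Gaussian PDF."""
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                  - (x - mu) ** 2 / (2 * sigma**2))

print(mu_hat, sigma_hat)  # close to the true 2.0 and 1.5
print(log_likelihood(data, mu_hat, sigma_hat))
print(log_likelihood(data, 2.5, 1.5))  # lower than at the MLE
```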
&lt;h2 id="gaussian-mixture-model">Gaussian Mixture Model&lt;/h2>
&lt;p>A Gaussian mixture model is a probabilistic model that assumes all the data points are generated from a mixture of a finite number of Gaussian distributions with unknown parameters. One can think of mixture models as generalizing k-means clustering to incorporate information about the covariance structure of the data as well as the centers of the latent Gaussians.&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/mYN2Q9VqZH-gaussian-mixture-example.png" alt="A Gaussian mixture of three normal distributions.">&lt;/p>
&lt;p>Define:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>$x\_j$: the $j$-th observed data, $j=1, 2,\dots, N$&lt;/p>
&lt;/li>
&lt;li>
&lt;p>$K$: number of Gaussian model components&lt;/p>
&lt;/li>
&lt;li>
&lt;p>$\alpha\_k$: probability that the observed data belongs to the $k$-th model component&lt;/p>
&lt;ul>
&lt;li>$\alpha\_k \geq 0$&lt;/li>
&lt;li>$\displaystyle \sum\_{k=1}^{K}\alpha\_k=1$&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>$\phi(x|\theta\_k)$: probability density function of the $k$-th model component&lt;/p>
&lt;ul>
&lt;li>$\theta\_k = (\mu\_k, \sigma\_k^2)$&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>$\gamma\_{jk}$: probability that the $j$-th observed data belongs to the $k$-th model component&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Probability density function of Gaussian mixture model:
&lt;/p>
$$
P(x | \theta)=\sum\_{k=1}^{K} \alpha\_{k} \phi\left(x | \theta\_{k}\right)
$$
&lt;p>
For this model, parameter is $\theta=\left(\tilde{\mu}\_{k}, \tilde{\sigma}\_{k}, \tilde{\alpha}\_{k}\right)$.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h2 id="expectation-maximum-em">Expectation-Maximum (EM)&lt;/h2>
&lt;blockquote>
&lt;p>&lt;em>Expectation-Maximization (EM) is a statistical algorithm for finding the right model parameters. We typically use EM when the data has missing values, or in other words, when the data is incomplete.&lt;/em>&lt;/p>
&lt;/blockquote>
&lt;p>These missing variables are called &lt;strong>latent variables&lt;/strong>.&lt;/p>
&lt;ul>
&lt;li>&lt;em>NEVER&lt;/em> observed&lt;/li>
&lt;li>We do &lt;em>NOT&lt;/em> know the correct values in advance&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Since we do not have the values for the latent variables, Expectation-Maximization tries to use the existing data to determine the optimum values for these variables and then finds the model parameters.&lt;/strong> Based on these model parameters, we go back and update the values for the latent variable, and so on.&lt;/p>
&lt;p>The Expectation-Maximization algorithm has two steps:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>E-step:&lt;/strong> In this step, the available data is used to estimate (guess) the values of the missing variables&lt;/li>
&lt;li>&lt;strong>M-step:&lt;/strong> Based on the estimated values generated in the E-step, the complete data is used to update the parameters&lt;/li>
&lt;/ul>
&lt;h3 id="em-in-gaussian-mixture-model">EM in Gaussian Mixture Model&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>Initialize the parameters ($K$ Gaussian distributions with means $\mu\_1, \mu\_2,\dots,\mu\_k$ and covariances $\Sigma\_1, \Sigma\_2, \dots, \Sigma\_k$)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Repeat&lt;/p>
&lt;ul>
&lt;li>&lt;strong>E-step&lt;/strong>: For each point $x\_j$, calculate the probability that it belongs to cluster/distribution $k$&lt;/li>
&lt;/ul>
$$
\begin{align}
\gamma\_{j k} &amp;= \frac{\text{Probability } x\_j \text{ belongs to cluster } k}{\text{Sum of probability } x\_j \text{ belongs to cluster } 1, 2, \dots, k} \\\\
&amp;= \frac{\alpha\_{k} \phi\left(x\_{j} | \theta\_{k}\right)}{\sum\_{k=1}^{K} \alpha\_{k} \phi\left(x\_{j} | \theta\_{k}\right)}\qquad j=1,2, \ldots, N ; k=1,2 \ldots, K
\end{align}
$$
&lt;p>The value will be high when the point is assigned to the right cluster and lower otherwise&lt;/p>
&lt;ul>
&lt;li>&lt;strong>M-step&lt;/strong>: update parameters&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
$$
\alpha\_k = \frac{\text{Number of points assigned to cluster } k}{\text{Total number of points}} = \frac{\sum\_{j=1}^{N} \gamma\_{j k}}{N} \qquad k=1,2, \ldots, K
$$
$$
\mu\_{k}=\frac{\sum\_{j}^{N}\left(\gamma\_{j k} x\_{j}\right)}{\sum\_{j}^{N} \gamma\_{j k}}\qquad k=1,2, \ldots, K
$$
$$
\Sigma\_{k}=\frac{\sum\_{j}^{N} \gamma\_{j k}\left(x\_{j}-\mu\_{k}\right)\left(x\_{j}-\mu\_{k}\right)^{T}}{\sum\_{j}^{N} \gamma\_{j k}} \qquad k=1,2, \ldots, K
$$
&lt;p>until convergence ($\left\|\theta\_{i+1}-\theta\_{i}\right\|&lt;\varepsilon$)&lt;/p>
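&lt;p>The E-step and M-step updates above can be sketched for a univariate two-component mixture. The synthetic data (a 30%/70% mixture of Gaussians at $-2$ and $3$) and the initial guesses are illustrative assumptions:&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-D data drawn from two Gaussians (30% / 70% mixture)
data = np.concatenate([rng.normal(-2.0, 0.5, 300), rng.normal(3.0, 1.0, 700)])
N, K = len(data), 2

# Initialize the parameters
alpha = np.full(K, 1.0 / K)
mu = np.array([-1.0, 1.0])
sigma = np.array([1.0, 1.0])

def pdf(x, m, s):
    """Univariate Gaussian PDF phi(x | theta_k)."""
    return np.exp(-((x - m) ** 2) / (2 * s**2)) / np.sqrt(2 * np.pi * s**2)

for _ in range(50):
    # E-step: responsibility gamma_{jk} of component k for point x_j
    weighted = alpha * pdf(data[:, None], mu, sigma)  # shape (N, K)
    gamma = weighted / weighted.sum(axis=1, keepdims=True)
    # M-step: re-estimate alpha_k, mu_k, sigma_k from the responsibilities
    Nk = gamma.sum(axis=0)
    alpha = Nk / N
    mu = (gamma * data[:, None]).sum(axis=0) / Nk
    sigma = np.sqrt((gamma * (data[:, None] - mu) ** 2).sum(axis=0) / Nk)

print(alpha, mu, sigma)  # close to (0.3, 0.7), (-2, 3), (0.5, 1.0)
```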
&lt;p>Visualization:&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/ek1bu6ogj2-em_clustering_of_old_faithful_data.gif" alt="The EM algorithm updating the parameters of a two-component bivariate Gaussian mixture model.">&lt;/p>
&lt;h2 id="reference">Reference&lt;/h2>
&lt;ul>
&lt;li>&lt;a href="https://zhuanlan.zhihu.com/p/30483076">https://zhuanlan.zhihu.com/p/30483076&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://www.analyticsvidhya.com/blog/2019/10/gaussian-mixture-models-clustering/">https://www.analyticsvidhya.com/blog/2019/10/gaussian-mixture-models-clustering/&lt;/a>&lt;/li>
&lt;li>&lt;a href="http://blog.pluskid.org/?p=39">http://blog.pluskid.org/?p=39&lt;/a> 👍&lt;/li>
&lt;/ul></description></item><item><title>Principal Component Analysis (PCA)</title><link>https://haobin-tan.netlify.app/docs/ai/machine-learning/unsupervised/pca/</link><pubDate>Sat, 07 Nov 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/machine-learning/unsupervised/pca/</guid><description>&lt;h2 id="tldr">TL;DR&lt;/h2>
&lt;p>The usual procedure to calculate the $d$-dimensional principal component analysis consists of the following steps:&lt;/p>
&lt;ol start="0">
&lt;li>
&lt;p>Calculate&lt;/p>
&lt;ul>
&lt;li>
&lt;p>average
&lt;/p>
$$
\bar{m}=\frac{1}{N}\sum\_{i=1}^{N} m\_{i} \in \mathbb{R}^{d}
$$
&lt;/li>
&lt;li>
&lt;p>data matrix
&lt;/p>
$$
\mathbf{M}=\left(m\_{1}-\bar{m}, \ldots, m\_{N}-\bar{m}\right) \in \mathbb{R}^{d \times \mathrm{N}}
$$
&lt;/li>
&lt;li>
&lt;p>scatter matrix (covariance matrix)
&lt;/p>
$$
\mathbf{S}=\mathbf{M M}^{\mathrm{T}} \in \mathbb{R}^{d \times d}
$$
&lt;/li>
&lt;/ul>
&lt;p>of all feature vectors $m\_{1}, \ldots, m\_{N}$&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Calculate the normalized ($\\|\cdot\\|=1$) eigenvectors $\mathbf{e}\_1, \dots, \mathbf{e}\_d$ of $\mathbf{S}$ and sort them such that the corresponding eigenvalues $\lambda\_1, \dots, \lambda\_d$ are decreasing, i.e. $\lambda\_1 \geq \lambda\_2 \geq \dots \geq \lambda\_d$&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Construct a matrix
&lt;/p>
$$
\mathbf{A}:=\left(e\_{1}, \ldots, e\_{d^{\prime}}\right) \in \mathbb{R}^{d \times d^{\prime}}
$$
&lt;p>
with the first $d^{\prime}$ eigenvectors as its columns&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Transform each feature vector $m\_i$ into a new feature vector
&lt;/p>
$$
\mathrm{m}\_{\mathrm{i}}^{\prime}=\mathrm{A}^{\mathrm{T}}\left(\mathrm{m}\_{\mathrm{i}}-\overline{\mathrm{m}}\right) \quad \text { for } i=1, \ldots, N
$$
&lt;p>
of smaller dimension $d^{\prime}$&lt;/p>
&lt;/li>
&lt;/ol>
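&lt;p>The steps above translate almost line-for-line into NumPy (a sketch under the convention used here that feature vectors are columns of a $d \times N$ array; function and variable names are illustrative):&lt;/p>

```python
import numpy as np

def pca(m, d_prime):
    """PCA following steps 0-3: average, data matrix M, scatter matrix
    S = M M^T, sorted unit eigenvectors, projection A^T (m_i - mean)."""
    m_bar = m.mean(axis=1, keepdims=True)   # step 0: average, in R^d
    M = m - m_bar                           # step 0: data matrix, d x N
    S = M @ M.T                             # step 0: scatter matrix, d x d
    lam, e = np.linalg.eigh(S)              # eigh: S is symmetric, unit columns
    order = np.argsort(lam)[::-1]           # step 1: sort eigenvalues decreasingly
    A = e[:, order[:d_prime]]               # step 2: first d' eigenvectors
    return A.T @ M, A, m_bar                # step 3: new d'-dimensional features
```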
&lt;h2 id="dimensionality-reduction">Dimensionality reduction&lt;/h2>
&lt;ul>
&lt;li>
&lt;p>Goal: represent instances with fewer variables&lt;/p>
&lt;ul>
&lt;li>Try to preserve as much structure in the data as possible&lt;/li>
&lt;li>Discriminative: only structure that affects class separability&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Feature selection&lt;/p>
&lt;ul>
&lt;li>Pick a subset of the original dimensions&lt;/li>
&lt;li>Discriminative: pick good class &amp;ldquo;predictors&amp;rdquo;&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Feature extraction&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Construct a new set of dimensions
&lt;/p>
$$
E\_{i} = f(X\_1 \dots X\_d)
$$
&lt;ul>
&lt;li>$X\_1, \dots, X\_d$: features&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>(Linear) combinations of original&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h2 id="direction-of-greatest-variance">Direction of greatest variance&lt;/h2>
&lt;ul>
&lt;li>Define a set of principal components
&lt;ul>
&lt;li>1st: direction of the &lt;strong>greatest variability&lt;/strong> in the data (i.e. Data points are spread out as far as possible)&lt;/li>
&lt;li>2nd: &lt;em>perpendicular&lt;/em> to 1st, greatest variability of what&amp;rsquo;s left&lt;/li>
&lt;li>&amp;hellip;and so on until $d$ (original dimensionality)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>First $m \ll d$ components become $m$ dimensions
&lt;ul>
&lt;li>Change coordinates of every data point to these dimensions&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-06%2023.51.17.png" alt="截屏2021-02-06 23.51.17">&lt;/p>
&lt;div class="flex px-4 py-3 mb-6 rounded-md bg-primary-100 dark:bg-primary-900">
&lt;span class="pr-3 pt-1 text-primary-600 dark:text-primary-300">
&lt;svg height="24" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24">&lt;path fill="none" stroke="currentColor" stroke-linecap="round" stroke-linejoin="round" stroke-width="1.5" d="m11.25 11.25l.041-.02a.75.75 0 0 1 1.063.852l-.708 2.836a.75.75 0 0 0 1.063.853l.041-.021M21 12a9 9 0 1 1-18 0a9 9 0 0 1 18 0m-9-3.75h.008v.008H12z"/>&lt;/svg>
&lt;/span>
&lt;span class="dark:text-neutral-300">&lt;p>Q: Why greatest variablility?&lt;/p>
&lt;p>A: If you pick the dimension with the highest variance, that will preserve the distances as much as possible&lt;/p>
&lt;/span>
&lt;/div>
&lt;h2 id="how-to-pca">How to PCA?&lt;/h2>
&lt;ol>
&lt;li>
&lt;p>&amp;ldquo;Center&amp;rdquo; the data at zero (subtract mean from each attribute)
&lt;/p>
$$
x\_{i, a} = x\_{i, a} - \mu\_{a}
$$
&lt;/li>
&lt;li>
&lt;p>Compute covariance matrix $\Sigma$&lt;/p>
&lt;blockquote>
&lt;p>The &lt;strong>covariance&lt;/strong> between two attributes is an indication of whether they change together (positive correlation) or in opposite directions (negative correlation).&lt;/p>
&lt;p>For example, $cov(x\_1, x\_2) = 0.8 > 0 \Rightarrow$ When $x\_1$ increases/decreases, $x\_2$ also increases/decreases.&lt;/p>
&lt;/blockquote>
$$
cov(b, a) = \frac{1}{n} \sum\_{i=1}^{n} x\_{ib} x\_{ia}
$$
&lt;/li>
&lt;li>
&lt;p>We want vectors $\mathbf{e}$ which aren&amp;rsquo;t turned by covariance matrix $\Sigma$:
&lt;/p>
$$
\Sigma \mathbf{e} = \lambda \mathbf{e}
$$
&lt;p>
$\Rightarrow$ $\mathbf{e}$ are eigenvectors of $\Sigma$, and $\lambda$ are corresponding eigenvalues&lt;/p>
&lt;p>&lt;strong>Principal components = eigenvectors with the largest eigenvalues&lt;/strong>&lt;/p>
&lt;/li>
&lt;/ol>
&lt;h3 id="finding-principle-components">Finding principle components&lt;/h3>
&lt;ol>
&lt;li>
&lt;p>Find eigenvalues by solving &lt;a href="https://en.wikipedia.org/wiki/Characteristic_polynomial">Characteristic Polynomial&lt;/a>
&lt;/p>
$$
\operatorname{det}(\Sigma - \lambda \mathbf{I}) = 0
$$
&lt;ul>
&lt;li>$\mathbf{I}$: Identity matrix&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Find $i$-th eigenvector by solving
&lt;/p>
$$
\Sigma \mathbf{e}\_i = \lambda\_i \mathbf{e}\_i
$$
&lt;p>
and we want $\mathbf{e}\_{i}$ to have unit length ($\\|\mathbf{e}\_{i}\\| = 1$)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>The eigenvector with the largest eigenvalue is the first principal component, the eigenvector with the second largest eigenvalue is the second principal component, and so on.&lt;/p>
&lt;/li>
&lt;/ol>
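&lt;p>These steps can be checked numerically on a toy $2 \times 2$ covariance matrix, with NumPy's symmetric eigensolver standing in for solving the characteristic polynomial by hand:&lt;/p>

```python
import numpy as np

Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])   # a toy covariance matrix

# Step 1: the eigenvalues are the roots of det(Sigma - lambda I) = 0.
# Step 2: eigh also returns unit-length eigenvectors as columns.
lam, e = np.linalg.eigh(Sigma)

for i in range(2):
    assert np.allclose(Sigma @ e[:, i], lam[i] * e[:, i])   # Sigma e = lambda e
    assert np.isclose(np.linalg.norm(e[:, i]), 1.0)         # unit length

# Step 3: the first principal component is the eigenvector
# belonging to the largest eigenvalue.
pc1 = e[:, np.argmax(lam)]
```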
&lt;details>
&lt;summary>Example&lt;/summary>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-07%2000.21.08.png" alt="截屏2021-02-07 00.21.08" style="zoom:67%;" />
&lt;/details>
&lt;h3 id="projecting-to-new-dimension">Projecting to new dimension&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>We pick the $m&lt;d$ eigenvectors $\mathbf{e}\_1, \dots, \mathbf{e}\_m$ with the biggest eigenvalues. Now $\mathbf{e}\_1, \dots, \mathbf{e}\_m$ are the new dimension vectors&lt;/p>
&lt;/li>
&lt;li>
&lt;p>For an instance $\mathbf{x} = \{x\_1, \dots, x\_d\}$ (original coordinates), we want new coordinates $\mathbf{x}^{\prime} = \{x^{\prime}\_1, \dots, x^{\prime}\_m\}$&lt;/p>
&lt;ul>
&lt;li>&amp;ldquo;Center&amp;rdquo; the instance (subtract the mean): $\mathbf{x} - \mathbf{\mu}$&lt;/li>
&lt;li>&amp;ldquo;Project&amp;rdquo; to each dimension: $(\mathbf{x} - \mathbf{\mu})^T \mathbf{e}\_j$ for $j=1, \dots, m$&lt;/li>
&lt;/ul>
&lt;details>
&lt;summary>Example&lt;/summary>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/PCA.png" alt="PCA" style="zoom:80%;" />
&lt;/details>
&lt;/li>
&lt;/ul>
&lt;h2 id="go-deeper-in-details">Go deeper in details&lt;/h2>
&lt;h3 id="why-eigenvectors--greatest-variance">Why eigenvectors = greatest variance?&lt;/h3>
&lt;div style="position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;">
&lt;iframe allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen="allowfullscreen" loading="eager" referrerpolicy="strict-origin-when-cross-origin" src="https://www.youtube.com/embed/cIE2MDxyf80?autoplay=0&amp;controls=1&amp;end=0&amp;loop=0&amp;mute=0&amp;start=0" style="position: absolute; top: 0; left: 0; width: 100%; height: 100%; border:0;" title="YouTube video"
>&lt;/iframe>
&lt;/div>
&lt;h3 id="why-eigenvalue--variance-along-eigenvector">Why eigenvalue = variance along eigenvector?&lt;/h3>
&lt;div style="position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;">
&lt;iframe allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen="allowfullscreen" loading="eager" referrerpolicy="strict-origin-when-cross-origin" src="https://www.youtube.com/embed/tL0wFZ9aJP8?autoplay=0&amp;controls=1&amp;end=0&amp;loop=0&amp;mute=0&amp;start=0" style="position: absolute; top: 0; left: 0; width: 100%; height: 100%; border:0;" title="YouTube video"
>&lt;/iframe>
&lt;/div>
&lt;h3 id="how-many-dimensions-should-we-reduce-to">How many dimensions should we reduce to?&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>Now we have eigenvectors $\mathbf{e}\_1, \dots, \mathbf{e}\_d$ and we want new dimension $m \ll d$&lt;/p>
&lt;/li>
&lt;li>
&lt;p>We pick $\mathbf{e}\_i$ that &amp;ldquo;explain&amp;rdquo; the most variance:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Sort eigenvectors s.t. $\lambda\_1 \geq \dots \geq \lambda\_d$&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Pick the first $m$ eigenvectors that explain 90% of the total variance (typical threshold values: 0.9 or 0.95)&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-07%2013.06.46.png" alt="截屏2021-02-07 13.06.46">&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Or we can use a scree plot&lt;/p>
&lt;/li>
&lt;/ul>
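&lt;p>The threshold rule is easy to make precise; here is a small helper (the name and interface are my own):&lt;/p>

```python
import numpy as np

def pick_m(eigenvalues, threshold=0.9):
    """Smallest m whose first m eigenvalues explain at least the given
    fraction of total variance (typical thresholds: 0.9 or 0.95)."""
    lam = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    explained = np.cumsum(lam) / lam.sum()
    return int(np.argmax(explained >= threshold)) + 1
```

&lt;p>For example, eigenvalues $5, 3, 1, 0.5, 0.5$ give $m = 3$, since the first three already carry 90% of the variance.&lt;/p>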
&lt;h2 id="pca-in-a-nutshell">PCA in a nutshell&lt;/h2>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-07%2013.09.32.png" alt="截屏2021-02-07 13.09.32">&lt;/p>
&lt;h2 id="pca-example-eigenfaces">PCA example: Eigenfaces&lt;/h2>
&lt;p>Perform PCA on bitmap images of human faces:&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-07%2016.22.02.png" alt="截屏2021-02-07 16.22.02">&lt;/p>
&lt;p>Belows are the eigenvectors after we perform PCA on the dataset:&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-07%2016.25.01.png" alt="截屏2021-02-07 16.25.01">&lt;/p>
&lt;p>Then we can project new face to space of eigen-faces, and represent vector of new face as a linear combination of principle components.&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-07%2016.24.28.png" alt="截屏2021-02-07 16.24.28">&lt;/p>
&lt;p>As we use more and more eigenvectors in this decomposition, we end up with a face that looks more and more like the original guy.&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-07%2016.33.28.png" alt="截屏2021-02-07 16.33.28">&lt;/p>
&lt;details>
&lt;summary>Why is eigenface neat and interesting?&lt;/summary>
&lt;ul>
&lt;li>This is neat because by taking the first few eigenvectors you can get a pretty close representation of the face. Suppose that this corresponds to maybe 20 eigenvectors. &lt;strong>This means you&amp;rsquo;re using only 20 numbers to represent a face bitmap which looks kind of like the original guy!&lt;/strong> Can you use only 20 pixels to represent him nearly? No, there&amp;rsquo;s no way!&lt;/li>
&lt;li>You&amp;rsquo;re effectively picking 20 numbers/mixture coefficients/coordinates. One really nice way to use this is you can use this for &lt;strong>massive compression&lt;/strong> of the data. If you communicate to others if they all have access to the same eigenvectors, all they need to send between each other are just the projection coordinates. Then they can transmit arbitrary faces between them. This is massive reduction in the size of data.&lt;/li>
&lt;li>Your classifier or regression system now operates in a low-dimensional space, so it has far less redundancy to cope with and can learn a better hyperplane. &amp;#x1f44f;&lt;/li>
&lt;/ul>
&lt;/details>
&lt;h3 id="application-of-eigenface">Application of eigenface&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>Face similarity&lt;/p>
&lt;ul>
&lt;li>in the reduced space&lt;/li>
&lt;li>insensitive to lighting, expression, orientation&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Projecting new &amp;ldquo;faces&amp;rdquo;&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-07%2016.49.58.png" alt="截屏2021-02-07 16.49.58">&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h2 id="pratical-issues-of-pca">Pratical issues of PCA&lt;/h2>
&lt;ul>
&lt;li>
&lt;p>PCA is based on the covariance matrix, and covariance is extremely sensitive to large values&lt;/p>
&lt;ul>
&lt;li>
&lt;p>E.g., multiply some dimension by 1000; this dimension then dominates the covariance and becomes a principal component.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Solution: normalize each dimension to zero mean and unit variance
&lt;/p>
$$
x^{\prime} = \frac{x - \text{mean}}{\text{standard deviation}}
$$
&lt;/li>
&lt;/ul>
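&lt;p>The normalization is one line in NumPy (a sketch assuming samples as rows and no zero-variance columns):&lt;/p>

```python
import numpy as np

def standardize(X):
    """Scale each dimension (column) to zero mean and unit variance so that
    no dimension dominates the covariance matrix by its scale alone."""
    return (X - X.mean(axis=0)) / X.std(axis=0)
```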
&lt;/li>
&lt;li>
&lt;p>PCA assumes underlying subspace is linear.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>PCA can sometimes hurt the performance of classification&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Because PCA doesn&amp;rsquo;t see the labels&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Solution: &lt;a href="https://haobin-tan.netlify.app/docs/ai/machine-learning/non-parametric/lda-summary/">Linear Discriminant Analysis (LDA)&lt;/a>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Picks a new dimension that gives&lt;/p>
&lt;ul>
&lt;li>maximum separation between the means of the projected classes&lt;/li>
&lt;li>minimum variance within each projected class&lt;/li>
&lt;/ul>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-07%2017.23.36.png" alt="截屏2021-02-07 17.23.36">&lt;/p>
&lt;/li>
&lt;li>
&lt;p>But this relies on some assumptions of the data and does not always work. 🤪&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h2 id="reference">Reference&lt;/h2>
&lt;ul>
&lt;li>&lt;a href="https://www.youtube.com/watch?v=IbE0tbjy6JQ&amp;amp;list=PLBv09BD7ez_5_yapAg86Od6JeeypkS4YM&amp;amp;index=1">Principle Component Analysis&lt;/a>: a great series of video tutorials explaining PCA clearly 👍&lt;/li>
&lt;/ul></description></item></channel></rss>