<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Classification | Haobin Tan</title><link>https://haobin-tan.netlify.app/tags/classification/</link><atom:link href="https://haobin-tan.netlify.app/tags/classification/index.xml" rel="self" type="application/rss+xml"/><description>Classification</description><generator>Hugo Blox Builder (https://hugoblox.com)</generator><language>en-us</language><lastBuildDate>Sat, 07 Nov 2020 00:00:00 +0000</lastBuildDate><image><url>https://haobin-tan.netlify.app/media/icon_hu7d15bc7db65c8eaf7a4f66f5447d0b42_15095_512x512_fill_lanczos_center_3.png</url><title>Classification</title><link>https://haobin-tan.netlify.app/tags/classification/</link></image><item><title>Classification</title><link>https://haobin-tan.netlify.app/docs/ai/machine-learning/classification/</link><pubDate>Mon, 07 Sep 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/machine-learning/classification/</guid><description/></item><item><title>K Nearest Neighbors</title><link>https://haobin-tan.netlify.app/docs/ai/machine-learning/classification/k-nearest-neighbor/</link><pubDate>Mon, 13 Jul 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/machine-learning/classification/k-nearest-neighbor/</guid><description>&lt;h2 id="non-parametric-methods">Non-parametric Methods&lt;/h2>
&lt;ul>
&lt;li>Store all the training data&lt;/li>
&lt;li>Use the training data for doing predictions&lt;/li>
&lt;li>Do &lt;strong>NOT&lt;/strong> adapt parameters&lt;/li>
&lt;li>Often referred to as &lt;em>&lt;strong>instance-based methods&lt;/strong>&lt;/em>&lt;/li>
&lt;/ul>
&lt;p>👍 &lt;strong>Advantages&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Complexity adapts to training data&lt;/li>
&lt;li>Very fast at training&lt;/li>
&lt;/ul>
&lt;p>👎 &lt;strong>Disadvantages&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Slow for prediction&lt;/li>
&lt;li>Hard to use for high-dimensional input&lt;/li>
&lt;/ul>
&lt;h2 id="k-nearest-neighbour-classifiers">$k$-Nearest Neighbour Classifiers&lt;/h2>
&lt;p>To classify a new input vector $x,$&lt;/p>
&lt;ol>
&lt;li>Examine the $k$-closest training data points to $x$ (common values for $k$: $k=3$, $k=5$)&lt;/li>
&lt;li>Assign the object to the &lt;strong>most frequently&lt;/strong> occurring class&lt;/li>
&lt;/ol>
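&lt;p>The two steps above can be sketched with NumPy (a minimal illustration; the helper name &lt;code>knn_predict&lt;/code> and the toy data are chosen here, not taken from a library):&lt;/p>

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by majority vote among its k nearest training points."""
    # Step 1: Euclidean distance from x to every stored training point
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]
    # Step 2: assign the most frequently occurring class among the k neighbours
    values, counts = np.unique(y_train[nearest], return_counts=True)
    return values[np.argmax(counts)]

# Toy data: two well-separated clusters
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
y = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X, y, np.array([0.15, 0.1])))  # 0
print(knn_predict(X, y, np.array([1.0, 0.95])))  # 1
```

&lt;p>Note that "training" is just storing $(X, y)$; all the work happens at prediction time, which is exactly the trade-off listed above.&lt;/p>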
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-100" >&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/image-20200128213429949.png" alt="image-20200128213429949" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>🤔 &lt;strong>When to consider?&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Can measure distances between data-points&lt;/li>
&lt;li>Fewer than 20 attributes per instance&lt;/li>
&lt;li>Lots of training data&lt;/li>
&lt;/ul>
&lt;p>👍 &lt;strong>Advantages&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Training is very fast&lt;/li>
&lt;li>Learn complex target functions&lt;/li>
&lt;li>Similar algorithm can be used for regression&lt;/li>
&lt;li>High accuracy&lt;/li>
&lt;li>Insensitive to outliers&lt;/li>
&lt;li>No assumptions about data&lt;/li>
&lt;/ul>
&lt;p>👎 &lt;strong>Disadvantages&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Computationally expensive&lt;/li>
&lt;li>Requires a lot of memory&lt;/li>
&lt;/ul>
&lt;h3 id="decision-boundaries">Decision Boundaries&lt;/h3>
&lt;ul>
&lt;li>The nearest neighbour algorithm does &lt;strong>NOT&lt;/strong> explicitly compute decision boundaries.&lt;/li>
&lt;li>The decision boundaries form a subset of the Voronoi diagram for the training data.&lt;/li>
&lt;li>The &lt;em>more data&lt;/em> points we have, the &lt;em>more complex the decision boundary&lt;/em> can become&lt;/li>
&lt;/ul>
&lt;h3 id="distance-metrics">Distance Metrics&lt;/h3>
&lt;p>Most common distance metric: &lt;strong>Euclidean distance (ED)&lt;/strong>&lt;/p>
$$
d(\boldsymbol{x}, \boldsymbol{y})=\|\boldsymbol{x}-\boldsymbol{y}\|=\sqrt{\left(\sum_{k=1}^{d}\left(\boldsymbol{x}_{k}-\boldsymbol{y}_{k}\right)^{2}\right)}
$$
&lt;ul>
&lt;li>
&lt;p>makes sense when the different features are &lt;strong>commensurate&lt;/strong>, i.e., each variable is measured in the &lt;strong>same units.&lt;/strong>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>If the units are different (e.g., length and weight), the data needs to be &lt;strong>normalised&lt;/strong> (so that the resulting input dimensions have zero mean and unit variance)
&lt;/p>
$$
\tilde{\boldsymbol{x}}=(\boldsymbol{x}-\boldsymbol{\mu}) \oslash \boldsymbol{\sigma}
$$
&lt;ul>
&lt;li>$\mu$: Mean&lt;/li>
&lt;li>$\sigma$: Standard deviation&lt;/li>
&lt;li>$\oslash$: &lt;strong>element-wise&lt;/strong> division&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
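&lt;p>A quick sketch of this normalisation with NumPy (toy numbers, assuming height in cm and weight in kg):&lt;/p>

```python
import numpy as np

# Features in different units: height in cm, weight in kg
X = np.array([[170.0, 60.0],
              [180.0, 80.0],
              [160.0, 70.0]])

mu = X.mean(axis=0)        # per-feature mean
sigma = X.std(axis=0)      # per-feature standard deviation
X_norm = (X - mu) / sigma  # element-wise division, as in the formula above

print(X_norm.mean(axis=0))  # ~[0, 0]
print(X_norm.std(axis=0))   # [1, 1]
```

&lt;p>After this step the Euclidean distance weights both dimensions comparably instead of being dominated by whichever feature happens to have the larger numeric range.&lt;/p>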
&lt;p>Other distance metrics:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Cosine Distance:&lt;/strong> Good for documents, images
$$
d(\boldsymbol{x}, \boldsymbol{y})=1-\frac{\boldsymbol{x}^{T} \boldsymbol{y}}{\|\boldsymbol{x}\|\|\boldsymbol{y}\|}
$$
&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Hamming Distance:&lt;/strong> For string data / categorical features
$$
d(\boldsymbol{x}, \boldsymbol{y})=\sum_{k=1}^{d}\left(\boldsymbol{x}_{k} \neq \boldsymbol{y}_{k}\right)
$$
&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Manhattan Distance:&lt;/strong> Coordinate-wise distance
$$
d(\boldsymbol{x}, \boldsymbol{y})=\sum_{k=1}^{d}\left|\boldsymbol{x}_{k}-\boldsymbol{y}_{k}\right|
$$
&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Mahalanobis Distance:&lt;/strong> Normalized by the sample covariance matrix – unaffected by coordinate transformations
$$
d(\boldsymbol{x}, \boldsymbol{y})=\|\boldsymbol{x}-\boldsymbol{y}\|_{\Sigma^{-1}}=\sqrt{(\boldsymbol{x}-\boldsymbol{y})^{T} \boldsymbol{\Sigma}^{-1}(\boldsymbol{x}-\boldsymbol{y})}
$$
&lt;/p>
&lt;/li>
&lt;/ul></description></item><item><title>Logistic Regression: Basics</title><link>https://haobin-tan.netlify.app/docs/ai/machine-learning/classification/logistic-regression/logistic-regression/</link><pubDate>Mon, 13 Jul 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/machine-learning/classification/logistic-regression/logistic-regression/</guid><description>&lt;p>💡 &lt;strong>Use regression algorithm for classification&lt;/strong>&lt;/p>
&lt;p>Logistic regression: &lt;strong>estimate the probability that an instance belongs to a particular class&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>If the estimated probability is &lt;strong>greater than 50%&lt;/strong>, then the model predicts that the instance belongs to that class (called the &lt;strong>positive&lt;/strong> class, labeled “1”),&lt;/li>
&lt;li>or else it predicts that it does not (i.e., it belongs to the &lt;strong>negative&lt;/strong> class, labeled “0”).&lt;/li>
&lt;/ul>
&lt;p>This makes it a &lt;strong>binary&lt;/strong> classifier.&lt;/p>
&lt;h2 id="logistic--sigmoid-function">Logistic / Sigmoid function&lt;/h2>
&lt;img src="https://upload.wikimedia.org/wikipedia/commons/5/53/Sigmoid-function-2.svg" style="zoom:60%; background-color:white">
&lt;p>$\sigma(t)=\frac{1}{1+\exp (-t)}$&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Bounded: $\sigma(t) \in (0, 1)$&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Symmetric: $1 - \sigma(t) = \sigma(-t)$&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Derivative: $\sigma^{\prime}(t)=\sigma(t)(1-\sigma(t))$&lt;/p>
&lt;/li>
&lt;/ul>
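&lt;p>The three properties can be verified numerically (a small self-check, not part of the original notes):&lt;/p>

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

t = np.linspace(-5, 5, 101)
s = sigmoid(t)

# Bounded: sigma(t) lies strictly between 0 and 1
assert np.all(s > 0) and np.all(1 - s > 0)
# Symmetric: 1 - sigma(t) = sigma(-t)
assert np.allclose(1 - s, sigmoid(-t))
# Derivative: sigma'(t) = sigma(t) (1 - sigma(t)), checked via central differences
h = 1e-6
numeric = (sigmoid(t + h) - sigmoid(t - h)) / (2 * h)
assert np.allclose(numeric, s * (1 - s))
```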
&lt;h2 id="estimating-probabilities-and-making-prediction">Estimating probabilities and making prediction&lt;/h2>
&lt;ol>
&lt;li>
&lt;p>Computes a weighted sum of the input features (plus a bias term)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Outputs the logistic of this result&lt;/p>
&lt;p>$\hat{p}=h_{\theta}(\mathbf{x})=\sigma\left(\mathbf{x}^{\mathrm{T}} \boldsymbol{\theta}\right)$&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Prediction:&lt;/p>
$$
\hat{y} = \begin{cases} 0 &amp; \text{ if } \hat{p}&lt;0.5\left(\Leftrightarrow \mathbf{x}^{\mathrm{T}} \boldsymbol{\theta}&lt;0\right) \\\\
1 &amp; \text{ if }\hat{p} \geq 0.5\left(\Leftrightarrow \mathbf{x}^{\mathrm{T}} \boldsymbol{\theta} \geq 0\right)\end{cases}
$$
&lt;/li>
&lt;/ol>
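&lt;p>The three steps can be sketched as follows (the bias term is folded into $\boldsymbol{\theta}$ via a leading 1-feature; names and numbers are illustrative):&lt;/p>

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def predict(theta, X):
    z = X.dot(theta)                     # 1. weighted sum of inputs (incl. bias)
    p_hat = sigmoid(z)                   # 2. logistic of the result
    y_hat = (p_hat >= 0.5).astype(int)   # 3. threshold at 0.5 (i.e. z at 0)
    return y_hat, p_hat

theta = np.array([-1.0, 2.0])            # bias -1, weight 2
X = np.array([[1.0, 0.0], [1.0, 1.0]])   # first column is the bias feature
y_hat, p_hat = predict(theta, X)
print(y_hat)  # [0 1]
```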
&lt;h2 id="train-and-cost-function">Train and cost function&lt;/h2>
&lt;p>Objective of training: to set the parameter vector $\boldsymbol{\theta}$ so that the model estimates:&lt;/p>
&lt;ul>
&lt;li>high probabilities ($\geq 0.5$) for positive instances ($y=1$)&lt;/li>
&lt;li>low probabilities ($&lt; 0.5$) for negative instances ($y=0$)&lt;/li>
&lt;/ul>
&lt;h3 id="cost-function-of-a-single-training-instance">Cost function of a single training instance:&lt;/h3>
$$
c(\boldsymbol{\theta}) = \begin{cases} -\log (\hat{p}) &amp; \text{ if } y=1 \\\\
-\log (1-\hat{p}) &amp; \text{ if } y=0\end{cases}
$$
&lt;blockquote>
&lt;img src="https://miro.medium.com/max/1621/1*_NeTem-yeZ8Pr9cVUoi_HA.png" style="zoom:30%; background-color:white">
&lt;ul>
&lt;li>Actual label: $y=1$, Misclassification: $\hat{y} = 0 \Leftrightarrow$ $\hat{p} = h_{\boldsymbol{\theta}}(x)$ close to 0 $\Leftrightarrow c(\boldsymbol{\theta})$ large&lt;/li>
&lt;li>Actual label: $y=0$, Misclassification: $\hat{y} = 1 \Leftrightarrow$ $\hat{p} = h_{\boldsymbol{\theta}}(x)$ close to 1 $\Leftrightarrow c(\boldsymbol{\theta})$ large&lt;/li>
&lt;/ul>
&lt;/blockquote>
&lt;h3 id="the-cost-function-over-the-whole-training-set">The cost function over the whole training set&lt;/h3>
&lt;p>Simply the average cost over all training instances (combining the two cases above into a single expression):&lt;/p>
&lt;p>$\begin{aligned} J(\boldsymbol{\theta}) &amp;=-\frac{1}{m} \sum_{i=1}^{m}\left[y^{(i)} \log \left(\hat{p}^{(i)}\right)+\left(1-y^{(i)}\right) \log \left(1-\hat{p}^{(i)}\right)\right] \\\\ &amp;=\frac{1}{m} \sum_{i=1}^{m}\left[-y^{(i)} \log \left(\hat{p}^{(i)}\right)-\left(1-y^{(i)}\right) \log \left(1-\hat{p}^{(i)}\right)\right] \end{aligned}$&lt;/p>
&lt;blockquote>
&lt;ul>
&lt;li>$y^{(i)} =1:-y^{(i)} \log \left(\hat{p}^{(i)}\right)-\left(1-y^{(i)}\right) \log \left(1-\hat{p}^{(i)}\right)=-\log \left(\hat{p}^{(i)}\right)$&lt;/li>
&lt;li>$y^{(i)} =0:-y^{(i)} \log \left(\hat{p}^{(i)}\right)-\left(1-y^{(i)}\right) \log \left(1-\hat{p}^{(i)}\right)=-\log \left(1-\hat{p}^{(i)}\right)$
(Exactly the same as $c(\boldsymbol{\theta})$ for a single instance above 👏)&lt;/li>
&lt;/ul>
&lt;/blockquote>
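&lt;p>A minimal numerical check of this cost (with made-up probabilities): confident correct predictions give a small average cost, confident wrong ones a large one.&lt;/p>

```python
import numpy as np

def cost(p_hat, y):
    """Average cross-entropy over m training instances."""
    m = len(y)
    return -(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat)).sum() / m

y    = np.array([1.0, 0.0, 1.0])
good = np.array([0.9, 0.1, 0.8])  # confident and correct
bad  = np.array([0.1, 0.9, 0.2])  # confident and wrong
print(cost(good, y))  # small (~0.14)
print(cost(bad, y))   # large (~2.07)
assert cost(bad, y) > cost(good, y)
```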
&lt;h3 id="training">Training&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>No closed-form equation 🤪&lt;/p>
&lt;/li>
&lt;li>
&lt;p>But it is convex so Gradient Descent (or any other optimization algorithm) is guaranteed to find the global minimum&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Partial derivatives of the cost function with regards to the $j$-th model parameter $\theta_j$:&lt;/p>
$$
\frac{\partial}{\partial \theta_{j}} J(\boldsymbol{\theta})=\frac{1}{m} \displaystyle \sum_{i=1}^{m}\left(\sigma\left(\boldsymbol{\theta}^{T} \mathbf{x}^{(i)}\right)-y^{(i)}\right) x_{j}^{(i)}
$$
&lt;/li>
&lt;/ul></description></item><item><title>Logistic Regression: Probabilistic view</title><link>https://haobin-tan.netlify.app/docs/ai/machine-learning/classification/logistic-regression/logistic-regression-in-probabilistic-view/</link><pubDate>Mon, 13 Jul 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/machine-learning/classification/logistic-regression/logistic-regression-in-probabilistic-view/</guid><description>&lt;p>Class label:&lt;/p>
$$
y_i \in \\{0, 1\\}
$$
&lt;p>Conditional probability distribution of the class label is&lt;/p>
$$
\begin{aligned}
p(y=1|\boldsymbol{x}) &amp;= \sigma(\boldsymbol{w}^T\boldsymbol{x}+b) \\\\
p(y=0|\boldsymbol{x}) &amp;= 1 - \sigma(\boldsymbol{w}^T\boldsymbol{x}+b)
\end{aligned}
$$
&lt;p>
with&lt;/p>
$$
\sigma(x) = \frac{1}{1+\operatorname{exp}(-x)}
$$
&lt;p>This is a &lt;strong>conditional Bernoulli distribution&lt;/strong>. Therefore, the probability can be represented as&lt;/p>
$$
\begin{array}{ll}
p(y|\boldsymbol{x}) &amp;= p(y=1|\boldsymbol{x})^y p(y=0|\boldsymbol{x})^{1-y} \\\\
&amp; = \sigma(\boldsymbol{w}^T\boldsymbol{x}+b)^y (1 - \sigma(\boldsymbol{w}^T\boldsymbol{x}+b))^{1-y}
\end{array}
$$
&lt;p>The &lt;strong>conditional Bernoulli log-likelihood&lt;/strong> is (assuming training data is i.i.d)&lt;/p>
$$
\begin{aligned}
\operatorname{loglik}(\boldsymbol{w}, \mathcal{D})
&amp;= \log(\operatorname{lik}(\boldsymbol{w}, \mathcal{D})) \\\\
&amp;= \log(\displaystyle\prod_i p(y_i|\boldsymbol{x}_i)) \\\\
&amp;= \log\left(\displaystyle\prod_i \sigma(\boldsymbol{w}^T\boldsymbol{x}_i+b)^{y_i} \left(1 - \sigma(\boldsymbol{w}^T\boldsymbol{x}_i+b)\right)^{1-y_i}\right) \\\\
&amp;= \displaystyle\sum_i y_i\log\left(\sigma(\boldsymbol{w}^T\boldsymbol{x}_i+b)\right)+ (1-y_i)\log\left(1 - \sigma(\boldsymbol{w}^T\boldsymbol{x}_i+b)\right)
\end{aligned}
$$
&lt;p>Let&lt;/p>
$$
\tilde{\boldsymbol{w}}=\left(\begin{array}{c}b \\\\ \boldsymbol{w} \end{array}\right), \quad \tilde{\boldsymbol{x}_i}=\left(\begin{array}{c}1 \\\\ \boldsymbol{x}_i \end{array}\right)
$$
&lt;p>Then:&lt;/p>
$$
\operatorname{loglik}(\boldsymbol{w}, \mathcal{D}) = \operatorname{loglik}(\tilde{\boldsymbol{w}}, \mathcal{D}) = \displaystyle\sum_i y_i\log\left(\sigma(\tilde{\boldsymbol{w}}^T\tilde{\boldsymbol{x}_i})\right)+ (1-y_i)\log\left(1 - \sigma(\tilde{\boldsymbol{w}}^T\tilde{\boldsymbol{x}_i})\right)
$$
&lt;p>Our objective is to find the $\tilde{\boldsymbol{w}}^*$ that &lt;strong>maximize the log-likelihood&lt;/strong>, i.e.&lt;/p>
$$
\begin{array}{cl}
\tilde{\boldsymbol{w}}^* &amp;= \underset{\tilde{\boldsymbol{w}}}{\arg \max} \quad \operatorname{loglik}(\tilde{\boldsymbol{w}}, \mathcal{D}) \\\\
&amp;= \underset{\tilde{\boldsymbol{w}}}{\arg \min} \quad -\operatorname{loglik}(\tilde{\boldsymbol{w}}, \mathcal{D})\\\\
&amp;= \underset{\tilde{\boldsymbol{w}}}{\arg \min} \quad \underbrace{-\left(\displaystyle\sum_i y_i\log\left(\sigma(\tilde{\boldsymbol{w}}^T\tilde{\boldsymbol{x}_i})\right) + (1-y_i)\log\left(1 - \sigma(\tilde{\boldsymbol{w}}^T\tilde{\boldsymbol{x}_i})\right)\right)}_{\text{cross-entropy loss}}
\end{array}
$$
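&lt;p>This identity is easy to confirm numerically: the log of the Bernoulli likelihood product equals the sum form, and its negation is the cross-entropy loss (random toy data, bias folded into the weight vector):&lt;/p>

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))        # 20 samples, 3 features (incl. a bias column)
y = rng.integers(0, 2, size=20)
w = rng.normal(size=3)

p = sigmoid(X.dot(w))               # p(y=1 | x) for each sample
lik = np.prod(p ** y * (1 - p) ** (1 - y))                # product of Bernoulli terms
loglik = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))  # sum form derived above
cross_entropy = -loglik

assert np.isclose(np.log(lik), loglik)
print(loglik, cross_entropy)
```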
&lt;p>In other words, &lt;strong>maximizing the (log-)likelihood is the same as minimizing the cross entropy.&lt;/strong>&lt;/p></description></item><item><title>SVM: Basics</title><link>https://haobin-tan.netlify.app/docs/ai/machine-learning/classification/svm/support-vector-machine/</link><pubDate>Mon, 13 Jul 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/machine-learning/classification/svm/support-vector-machine/</guid><description>&lt;h2 id="-goal-of-svm">🎯 Goal of SVM&lt;/h2>
&lt;p>To find the optimal separating hyperplane which &lt;strong>maximizes the margin&lt;/strong> of the training data&lt;/p>
&lt;ul>
&lt;li>it &lt;strong>correctly&lt;/strong> classifies the training data&lt;/li>
&lt;li>it is the one which will generalize better with unseen data (as far as possible from data points from each category)&lt;/li>
&lt;/ul>
&lt;h2 id="svm-math-formulation">SVM math formulation&lt;/h2>
&lt;p>Assuming the data is linearly separable&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/image-20200304135136513.png" alt="image-20200304135136513" style="zoom:50%;" />
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Decision boundary&lt;/strong>: Hyperplane $\mathbf{w}^{T} \mathbf{x}+b=0$&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Support Vectors:&lt;/strong> Data points closest to the decision boundary (other examples can be ignored)&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Positive&lt;/strong> support vectors: $\mathbf{w}^{T} \mathbf{x}_{+}+b=+1$&lt;/li>
&lt;li>&lt;strong>negative&lt;/strong> support vectors: $\mathbf{w}^{T} \mathbf{x}_{-}+b=-1$&lt;/li>
&lt;/ul>
&lt;blockquote>
&lt;p>Why do we use 1 and -1 as class labels?&lt;/p>
&lt;ul>
&lt;li>This makes the math manageable, because -1 and 1 are only different by the sign. We can write a single equation to describe the margin or how close a data point is to our separating hyperplane and not have to worry if the data is in the -1 or +1 class.&lt;/li>
&lt;li>If a point is far away from the separating plane on the positive side, then $w^Tx+b$ will be a large positive number, and $label*(w^Tx+b)$ will give us a large number. If it’s far from the negative side and has a negative label, $label*(w^Tx+b)$ will also give us a large positive number.&lt;/li>
&lt;/ul>
&lt;/blockquote>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Margin&lt;/strong> $\rho$ : distance between the support vectors and the decision boundary and should be &lt;strong>maximized&lt;/strong>
&lt;/p>
$$
\rho = \frac{\mathbf{w}^{T} \mathbf{x}\_{+}+b}{\|\mathbf{w}\|}-\frac{\mathbf{w}^{T} \mathbf{x}\_{-}+b}{\|\mathbf{w}\|}=\frac{2}{\|\mathbf{w}\|}
$$
&lt;/li>
&lt;/ul>
&lt;h3 id="svm-optimization-problem">SVM optimization problem&lt;/h3>
&lt;p>Requirement:&lt;/p>
&lt;ol>
&lt;li>Maximal margin&lt;/li>
&lt;li>Correct classification&lt;/li>
&lt;/ol>
&lt;p>Based on these requirements, we have:&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/image-20200713164553044.png" alt="image-20200713164553044" style="zoom:67%;" />
&lt;p>Reformulation:
&lt;/p>
$$
\begin{aligned}
\underset{\mathbf{w}}{\operatorname{argmin}} \quad &amp;\\|\mathbf{w}\\|^{2} \\\\ \text {s.t.} \quad &amp; y_{i}\left(\mathbf{w}^{T} \mathbf{x}\_{i}+b\right) \geq 1
\end{aligned}
$$
&lt;p>This is the &lt;strong>hard margin SVM&lt;/strong>.&lt;/p>
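&lt;p>A small sanity check of this formulation on toy data (the hyperplane below is picked by hand for illustration, not found by an optimizer):&lt;/p>

```python
import numpy as np

# Toy linearly separable data, labels in {-1, +1}
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -3.0]])
y = np.array([1, 1, -1, -1])

# Candidate hyperplane w^T x + b = 0
w = np.array([0.25, 0.25])
b = 0.0

# Hard-margin constraint: y_i (w^T x_i + b) >= 1 for all i
margins = y * (X.dot(w) + b)
assert np.all(margins >= 1)

# Geometric margin rho = 2 / ||w||; [2,2] and [-2,-2] sit exactly on the margin
rho = 2.0 / np.linalg.norm(w)
print(rho)  # ~5.657
```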
&lt;h3 id="soft-margin-svm">Soft margin SVM&lt;/h3>
&lt;h4 id="-idea">💡 Idea&lt;/h4>
&lt;p>&lt;strong>&amp;ldquo;Allow the classifier to make some mistakes&amp;rdquo;&lt;/strong> (Soft margin)&lt;/p>
&lt;p>➡️ &lt;strong>Trade-off between margin and classification accuracy&lt;/strong>&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/image-20200304141838595.png" alt="image-20200304141838595" style="zoom:50%;" />
&lt;ul>
&lt;li>
&lt;p>Slack-variables: ${\color {blue}{\xi_{i}}} \geq 0$&lt;/p>
&lt;/li>
&lt;li>
&lt;p>💡&lt;strong>Allows violating the margin conditions&lt;/strong>
&lt;/p>
$$
y_{i}\left(\mathbf{w}^{T} \mathbf{x}_{i}+b\right) \geq 1- \color{blue}{\xi_{i}}
$$
&lt;ul>
&lt;li>$0 \leq \xi\_{i} \leq 1$ : sample is between margin and decision boundary (&lt;span style="color:red">&lt;strong>margin violation&lt;/strong>&lt;/span>)&lt;/li>
&lt;li>$\xi\_{i} \geq 1$ : sample is on the wrong side of the decision boundary (&lt;span style="color:red">&lt;strong>misclassified&lt;/strong>&lt;/span>)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h4 id="soft-max-margin">Soft Max-Margin&lt;/h4>
&lt;p>Optimization problem
&lt;/p>
$$
\begin{array}{lll} \underset{\mathbf{w}}{\operatorname{argmin}} \quad &amp;\|\mathbf{w}\|^{2} + \color{blue}{C \sum_i^N \xi_i} \qquad \qquad &amp; \text{(Punish large slack variables)}\\\\
\text { s.t. } \quad &amp; y_{i}\left(\mathbf{w}^{T} \mathbf{x}_{i}+b\right) \geq 1 -\color{blue}{\xi_i}, \quad \xi_i \geq 0 \qquad \qquad &amp; \text{(Condition for soft-margin)}\end{array}
$$
&lt;ul>
&lt;li>$C$ : regularization parameter, determines how important $\xi$ should be
&lt;ul>
&lt;li>&lt;strong>Small&lt;/strong> $C$: Constraints have &lt;strong>little&lt;/strong> influence ➡️ &lt;strong>large&lt;/strong> margin&lt;/li>
&lt;li>&lt;strong>Large&lt;/strong> $C$: Constraints have &lt;strong>large&lt;/strong> influence ➡️ &lt;strong>small&lt;/strong> margin&lt;/li>
&lt;li>$C \to \infty$: Constraints are strictly enforced ➡️ &lt;strong>hard&lt;/strong> margin&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h4 id="soft-svm-optimization">Soft SVM Optimization&lt;/h4>
&lt;p>Reformulate into an unconstrained optimization problem&lt;/p>
&lt;ol>
&lt;li>Rewrite constraints: $\xi_{i} \geq 1-y_{i}\left(\mathbf{w}^{T} \mathbf{x}_{i}+b\right)=1-y_{i} f\left(\boldsymbol{x}_{i}\right)$&lt;/li>
&lt;li>Together with $\xi_{i} \geq 0 \Rightarrow \xi_{i}=\max \left(0,1-y_{i} f\left(\boldsymbol{x}_{i}\right)\right)$&lt;/li>
&lt;/ol>
&lt;p>&lt;strong>Unconstrained optimization&lt;/strong> (over $\mathbf{w}$):
&lt;/p>
$$
\underset{{\mathbf{w}}}{\operatorname{argmin}} \underbrace{\|\mathbf{w}\|^{2}}\_{\text {regularization }}+C \underbrace{\sum_{i=1}^{N} \max \left(0,1-y\_{i} f\left(\boldsymbol{x}\_{i}\right)\right)}_{\text {loss function }}
$$
&lt;p>
Points are in 3 categories:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>$y\_{i} f\left(\boldsymbol{x}\_{i}\right) > 1$ : Point &lt;strong>outside&lt;/strong> margin, &lt;strong>no contribution&lt;/strong> to loss&lt;/p>
&lt;/li>
&lt;li>
&lt;p>$y\_{i} f\left(\boldsymbol{x}\_{i}\right) = 1$: Point is &lt;strong>on&lt;/strong> the margin, &lt;strong>no contribution&lt;/strong> to loss as &lt;strong>in hard margin&lt;/strong>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>$y\_{i} f\left(\boldsymbol{x}\_{i}\right) &lt; 1$: &lt;span style="color:red">&lt;strong>Point violates the margin, contributes to loss&lt;/strong>&lt;/span>&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h4 id="loss-function">Loss function&lt;/h4>
&lt;p>SVMs use the &amp;ldquo;hinge&amp;rdquo; loss (an approximation of the 0-1 loss)&lt;/p>
&lt;blockquote>
&lt;p>&lt;a href="https://en.wikipedia.org/wiki/Hinge_loss">Hinge loss&lt;/a>&lt;/p>
&lt;p>For an intended output $t=\pm 1$ and a classifier score $y$, the hinge loss of the prediction $y$ is defined as
&lt;/p>
$$
> \ell(y)=\max (0,1-t \cdot y)
> $$
&lt;p>
Note that $y$ should be the &amp;ldquo;raw&amp;rdquo; output of the classifier&amp;rsquo;s decision function, not the predicted class label. For instance, in linear SVMs, $y = \mathbf{w}\cdot \mathbf{x}+ b$, where $(\mathbf{w},b)$ are the parameters of the hyperplane and $\mathbf{x}$ is the input variable(s).&lt;/p>
&lt;/blockquote>
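&lt;p>A one-line implementation makes the three regimes of the hinge loss concrete (illustrative values):&lt;/p>

```python
import numpy as np

def hinge(t, y):
    """Hinge loss max(0, 1 - t*y) for label t in {-1, +1} and raw score y."""
    return np.maximum(0.0, 1.0 - t * y)

print(hinge(1, 2.0))   # 0.0: score beyond the margin, no loss
print(hinge(1, 1.0))   # 0.0: exactly on the margin
print(hinge(1, 0.5))   # 0.5: margin violation
print(hinge(1, -1.0))  # 2.0: misclassified, large loss
```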
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/image-20200304172146690.png" alt="image-20200304172146690" style="zoom:40%;" />
&lt;p>The loss function of SVM is &lt;strong>convex&lt;/strong>:&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/image-20200304172349088.png" alt="image-20200304172349088" style="zoom: 33%;" />
&lt;p>I.e.,&lt;/p>
&lt;ul>
&lt;li>There is only &lt;strong>one&lt;/strong> minimum&lt;/li>
&lt;li>We can find it with gradient descent&lt;/li>
&lt;li>&lt;strong>However:&lt;/strong> Hinge loss is &lt;strong>not differentiable!&lt;/strong> 🤪&lt;/li>
&lt;/ul>
&lt;h2 id="sub-gradients">Sub-gradients&lt;/h2>
&lt;p>For convex function $f: \mathbb{R}^d \to \mathbb{R}$ :
&lt;/p>
$$
f(\boldsymbol{z}) \geq f(\boldsymbol{x})+\nabla f(\boldsymbol{x})^{T}(\boldsymbol{z}-\boldsymbol{x})
$$
&lt;p>
(Linear approximation underestimates function)&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/image-20200304172748278.png" alt="image-20200304172748278" style="zoom:33%;" />
&lt;p>A &lt;strong>subgradient&lt;/strong> of a convex function $f$ at point $\boldsymbol{x}$ is any $\boldsymbol{g}$ such that
&lt;/p>
$$
f(\boldsymbol{z}) \geq f(\boldsymbol{x})+\boldsymbol{g}^{T}(\boldsymbol{z}-\boldsymbol{x})
$$
&lt;ul>
&lt;li>Always exists (even $f$ is not differentiable)&lt;/li>
&lt;li>If $f$ is differentiable at $\boldsymbol{x}$, then: $\boldsymbol{g}=\nabla f(\boldsymbol{x})$&lt;/li>
&lt;/ul>
&lt;h3 id="example">Example&lt;/h3>
&lt;p>$f(x)=|x|$&lt;/p>
&lt;ul>
&lt;li>$x \neq 0$ : unique sub-gradient is $g= \operatorname{sign}(x)$&lt;/li>
&lt;li>$x =0$ : $g \in [-1, 1]$&lt;/li>
&lt;/ul>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/220px-Absolute_value.svg.png" alt="img">&lt;/p>
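&lt;p>This example can be checked directly against the subgradient inequality (a small script, choosing $g=0$ at $x=0$):&lt;/p>

```python
def subgradient_abs(x):
    """A subgradient of f(x) = |x|: sign(x) away from 0, any g in [-1, 1] at 0."""
    if x > 0:
        return 1.0
    if 0 > x:
        return -1.0
    return 0.0  # at x = 0 any value in [-1, 1] works; pick 0

# Subgradient inequality f(z) >= f(x) + g*(z - x) must hold for every z
x = 0.0
g = subgradient_abs(x)
for z in [-2.0, -0.5, 0.0, 0.5, 2.0]:
    assert abs(z) >= abs(x) + g * (z - x)
```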
&lt;h3 id="sub-gradient-method">Sub-gradient Method&lt;/h3>
&lt;p>&lt;strong>Sub-gradient Descent&lt;/strong>&lt;/p>
&lt;ol>
&lt;li>Given &lt;strong>convex&lt;/strong> $f$, not necessarily differentiable&lt;/li>
&lt;li>Initialize $\boldsymbol{x}_0$&lt;/li>
&lt;li>Repeat: $\boldsymbol{x}\_{t+1}=\boldsymbol{x}\_{t}-\eta \boldsymbol{g}$, where $\boldsymbol{g}$ is any sub-gradient of $f$ at point $\boldsymbol{x}_{t}$&lt;/li>
&lt;/ol>
&lt;p>‼️ Notes:&lt;/p>
&lt;ul>
&lt;li>Sub-gradients do not necessarily decrease $f$ at every step (no real descent method)&lt;/li>
&lt;li>Need to keep track of the best iterate $\boldsymbol{x}^*$&lt;/li>
&lt;/ul>
&lt;h4 id="sub-gradients-for-hinge-loss">Sub-gradients for hinge loss&lt;/h4>
$$
\mathcal{L}\left(\mathbf{x}\_{i}, y\_{i} ; \mathbf{w}\right)=\max \left(0,1-y\_{i} f\left(\mathbf{x}\_{i}\right)\right) \quad f\left(\mathbf{x}\_{i}\right)=\mathbf{w}^{\top} \mathbf{x}\_{i}+b
$$
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/image-20200304175930294.png" alt="image-20200304175930294" style="zoom:33%;" />
&lt;h4 id="sub-gradient-descent-for-svms">Sub-gradient descent for SVMs&lt;/h4>
&lt;p>Recall the &lt;strong>Unconstrained optimization&lt;/strong> for SVMs:
&lt;/p>
$$
\underset{{\mathbf{w}}}{\operatorname{argmin}} \quad C \underbrace{\sum\_{i=1}^{N} \max \left(0,1-y_{i} f\left(\boldsymbol{x}\_{i}\right)\right)}\_{\text {loss function }} + \underbrace{\|\mathbf{w}\|^{2}}\_{\text {regularization }}
$$
&lt;p>
At each iteration, pick random training sample $(\boldsymbol{x}_i, y_i)$&lt;/p>
&lt;ul>
&lt;li>
&lt;p>If $y_{i} f\left(\boldsymbol{x}_{i}\right)&lt;1$:
&lt;/p>
$$
\boldsymbol{w}\_{t+1}=\boldsymbol{w}\_{t}-\eta\left(2 \boldsymbol{w}\_{t}-C y\_{i} \boldsymbol{x}\_{i}\right)
$$
&lt;/li>
&lt;li>
&lt;p>Otherwise:
&lt;/p>
$$
\quad \boldsymbol{w}\_{t+1}=\boldsymbol{w}\_{t}-\eta 2 \boldsymbol{w}\_{t}
$$
&lt;/li>
&lt;/ul>
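&lt;p>Putting the two update rules together gives a minimal sub-gradient descent trainer (a sketch on toy data; the function name, step size, and step count are chosen for illustration):&lt;/p>

```python
import numpy as np

def train_svm_sgd(X, y, C=1.0, eta=0.01, steps=500, seed=0):
    """Sub-gradient descent on ||w||^2 + C * sum(hinge); bias folded into w."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        i = rng.integers(n)              # pick a random training sample
        if 1 > y[i] * X[i].dot(w):       # margin violated: hinge term is active
            w = w - eta * (2 * w - C * y[i] * X[i])
        else:                            # only the regularizer contributes
            w = w - eta * 2 * w
    return w

# Toy data with a bias feature appended, labels in {-1, +1}
X = np.array([[2.0, 2.0, 1.0], [3.0, 2.5, 1.0],
              [-2.0, -2.0, 1.0], [-3.0, -2.5, 1.0]])
y = np.array([1, 1, -1, -1])
w = train_svm_sgd(X, y, C=10.0)
print(np.sign(X.dot(w)))  # predictions should match the labels
```

&lt;p>As discussed above, making $C$ larger pushes the solution toward satisfying every margin constraint, at the price of a smaller margin.&lt;/p>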
&lt;h2 id="application-of-svms">Application of SVMs&lt;/h2>
&lt;ul>
&lt;li>Pedestrian Tracking&lt;/li>
&lt;li>text (and hypertext) categorization&lt;/li>
&lt;li>image classification&lt;/li>
&lt;li>bioinformatics (Protein classification, cancer classification)&lt;/li>
&lt;li>hand-written character recognition&lt;/li>
&lt;/ul>
&lt;p>Yet, in the last 5-8 years, neural networks have outperformed SVMs on most applications.🤪☹️😭&lt;/p></description></item><item><title>SVM: Kernel Methods</title><link>https://haobin-tan.netlify.app/docs/ai/machine-learning/classification/svm/kernel-methods/</link><pubDate>Mon, 13 Jul 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/machine-learning/classification/svm/kernel-methods/</guid><description>&lt;h2 id="kernel-function">Kernel function&lt;/h2>
&lt;p>Given a mapping function $\phi: \mathcal{X} \rightarrow \mathcal{V}$, the function&lt;/p>
$$
\mathcal{K}: \mathcal{X} \times \mathcal{X} \rightarrow \mathbb{R}, \quad \mathcal{K}\left(\mathbf{x}, \mathbf{x}^{\prime}\right)=\left\langle\phi(\mathbf{x}), \phi\left(\mathbf{x}^{\prime}\right)\right\rangle_{\mathcal{V}}
$$
&lt;p>is called a &lt;strong>kernel function&lt;/strong>.&lt;/p>
&lt;p>&lt;em>&amp;ldquo;A kernel is a function that returns the result of a dot product performed in another space.&amp;rdquo;&lt;/em>&lt;/p>
&lt;h2 id="kernel-trick">Kernel trick&lt;/h2>
&lt;p>Applying the kernel trick simply means &lt;strong>replacing the dot product of two examples by a kernel function&lt;/strong>.&lt;/p>
&lt;h3 id="typical-kernels">Typical kernels&lt;/h3>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Kernel Type&lt;/th>
&lt;th>Definition&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>&lt;strong>Linear kernel&lt;/strong>&lt;/td>
&lt;td>$k\left(\boldsymbol{x}, \boldsymbol{x}^{\prime}\right)=\left\langle\boldsymbol{x}, \boldsymbol{x}^{\prime}\right\rangle$&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Polynomial kernel&lt;/strong>&lt;/td>
&lt;td>$k\left(\boldsymbol{x}, \boldsymbol{x}^{\prime}\right)=\left\langle\boldsymbol{x}, \boldsymbol{x}^{\prime}\right\rangle^{d}$&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Gaussian / Radial Basis Function (RBF) kernel&lt;/strong>&lt;/td>
&lt;td>$k \left(\boldsymbol{x}, \boldsymbol{y}\right)=\exp \left(-\frac{\|\boldsymbol{x}-\boldsymbol{y}\|^{2}}{2 \sigma^{2}}\right)$&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
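&lt;p>The three kernels from the table, written out directly (straightforward translations of the definitions above):&lt;/p>

```python
import numpy as np

def linear_kernel(x, y):
    return x.dot(y)

def poly_kernel(x, y, d=2):
    return x.dot(y) ** d

def rbf_kernel(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

x = np.array([1.0, 2.0])
z = np.array([2.0, 1.0])
print(linear_kernel(x, z))  # 4.0
print(poly_kernel(x, z))    # 16.0
print(rbf_kernel(x, x))     # 1.0 (distance zero)
```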
&lt;h3 id="why-do-we-need-kernel-trick">Why do we need kernel trick?&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>Kernels can be used for all feature based algorithms that can be rewritten such that they contain &lt;strong>inner products&lt;/strong> of feature vectors&lt;/p>
&lt;ul>
&lt;li>This is true for almost all feature based algorithms (Linear regression, SVMs, &amp;hellip;)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Kernels can be used to map the data $\mathbf{x}$ in an infinite dimensional feature space (i.e., a function space)&lt;/p>
&lt;ul>
&lt;li>&lt;strong>The feature vector never has to be represented explicitly&lt;/strong>&lt;/li>
&lt;li>&lt;strong>As long as we can evaluate the inner product of two feature vectors&lt;/strong>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;p>➡️ We can obtain a more powerful representation than standard linear feature models.&lt;/p>
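&lt;p>For the degree-2 polynomial kernel in 2-D this can be made explicit: with the feature map $\phi(\mathbf{x})=(x_1^2, \sqrt{2} x_1 x_2, x_2^2)$, the inner product of mapped vectors equals $\langle\mathbf{x}, \mathbf{x}^{\prime}\rangle^2$, so the kernel evaluation never needs to form $\phi$ (a standard textbook identity, checked here numerically):&lt;/p>

```python
import numpy as np

def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel in 2-D."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def k(x, y):
    """Kernel: dot product in input space, squared."""
    return x.dot(y) ** 2

x = np.array([1.0, 3.0])
z = np.array([2.0, 0.5])
assert np.isclose(phi(x).dot(phi(z)), k(x, z))  # both are 12.25 here
```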
&lt;p>&lt;svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" version="1.1" width="1049px" viewBox="-0.5 -0.5 1049 675" content="&amp;lt;mxfile host=&amp;quot;app.diagrams.net&amp;quot; modified=&amp;quot;2020-07-13T14:50:43.530Z&amp;quot; agent=&amp;quot;5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36&amp;quot; etag=&amp;quot;XPL9LKrkJbpFtgSQEWYO&amp;quot; version=&amp;quot;13.4.2&amp;quot; type=&amp;quot;device&amp;quot;&amp;gt;&amp;lt;diagram id=&amp;quot;Q3K544789h8GwBtGwhde&amp;quot; name=&amp;quot;Page-1&amp;quot;&amp;gt;7Vpdj5s4FP01kXYeZgQGDHmcJDOt9kNadVbd9qlywAFmALPGmST99WsHEz7sJKQTdhJ1K7XF1zaxz7n3XBt7ZE3T9QeK8ugPEuBkBIxgPbJmIwBsCPm/wrApDa5rloaQxkFpahie4u9YGg1pXcYBLloNGSEJi/O20SdZhn3WsiFKyardbEGS9q/mKMSK4clHiWr9Ow5YJK0mHNcVH3EcRvKnPeCWFSnaNS4NRYQCsipN28lZDyNrSglh5VO6nuJEYFfhUiLwuKd2NzCKM9anw4w+efnrX96vrx8/fV1Nku9P3uZWvuUVJUs5YTlYtqkQ4G/hYPPCZBXFDD/lyBc1K043t0UsTXjJ5I+LOEmmJCF0289aLBbA97m9YJS84EZNAOfQgbxGnUI1HkwZXjdMckofMEkxoxveRNZ60k+ke1ljWV7VZAFb2qIGT2bVEUkHCXevrjHkDxLGEyB1hoM0QNhbCEgXJGMyVjjrGoih7+H54jwQ27CDMVAxhkCDsWUNhTFQMXamv/C/Iup48I7cyZeRO+OGGwV8PmvWRrmNXkYy3IFemlAShxkv+hxLzO0TgWHMf+5eVqRxECT7aKVkmQVYzMroMAirshykJrRPJs3sBIZt2AppurgAQ1E2VohoA0Ioi0hIMpT8TkgumXnGjG0kTGjJSJs3jg7dfJH9t4WvonDnVMXZulk522iw10bP46PB/+xqKuEHh+KpIEvq4wMAWDJ7IRpidqCdzB84aCUmlWiKE8Ti13aeOjttJlTVi4+sctWatYfaOhmMWfcYtQqV9mw83kNl0w1cWdb1HJ5y0zk359uu95SiTaNBTuKMFY03/ykMDc0AHc1wOyuKI+256HT8rRxB7X27qfy4Q1r7pH/Ol3bFJuX/cfVfc/X/Fl+J/h90sjdIvqPmaZ3k20NJfrWG36MdEtWfLA3YPTXBu6w04F5SGgDXlQb6Un4hacBty4jT3Vcebm661vBJwO6fBJ5/7iTgGO+cBOA+quqt2ufr2qqdgSXX7myprX4sDbY7M613TdVHBf19crXbV7jNi0rW7r6Qy6N430oZwESE2pzyp1A83VxJRJ5JNx3otBfPQP1e4jpqRFpwqIi03zUij+603yciveuMSO/kiHz+PyI7EenYao7UReRw21ndYUlJkph+ixb4z5JUFbfFFph73gDAfF1XVsRKX0hQFvLuAKJUAJzNi7ysPE22wfSUV2j9jFPoTOl2ON/UlZrSofTL0sqRLcGQVZfvruc6iHI6X1+g6q7mWLOmg4P5q3cwgwy0Z7+srXe1pb62hGGqp4i/YZptz9YXy8xnMcmagdhDetx90vOyV1x0SqKRkJt9knDwwOeEo89KFN7qLv2jeXeOXH3
V0HxL1e7QxoNF81jxCPNOOOw6T2I/FumHcckuFoRyvd66x8VLr6GSuFs1noFE4IC7zhrC1dBoaWh0hqJR9FcWER2iQo5Y3n/+u8svaF69wTiIi+l2cYEaXDz4n+JiHMeliFAuHv0lTTYTivwXIenHhKR2vx6ycsgjKWFlYFmz2/HbtzyH3aP/7YzOIdztztD0cQ2V3mALD6AulH2S5ssSPpQkm1K3cFaIRCpuZYkUFvGx3QZxKsyimajAiC0p3gLiM9HqqjWteckDDOYRZue+jsYfdjrY+pQxmD+oZ7ZApK6Yq5YgPqckWPrsysk9Q8KC4AITluasZQhhPmnfcBYRH5/rQ4XRUWBL86VCq8CD3UEE6vYBvZI4KNoLxhTleZyF5X5f7AI+icFtr82WC3hD1W0/wpzsK1XinU81ldgaZKmp2f47Oif4gSMdXqyvDZdHqvXda+vhXw==&amp;lt;/diagram&amp;gt;&amp;lt;/mxfile&amp;gt;" onclick="(function(svg){var src=window.event.target||window.event.srcElement;while (src!=null&amp;amp;&amp;amp;src.nodeName.toLowerCase()!='a'){src=src.parentNode;}if(src==null){if(svg.wnd!=null&amp;amp;&amp;amp;!svg.wnd.closed){svg.wnd.focus();}else{var r=function(evt){if(evt.data=='ready'&amp;amp;&amp;amp;evt.source==svg.wnd){svg.wnd.postMessage(decodeURIComponent(svg.getAttribute('content')),'*');window.removeEventListener('message',r);}};window.addEventListener('message',r);svg.wnd=window.open('https://app.diagrams.net/?client=1&amp;amp;lightbox=1&amp;amp;edit=_blank');}}})(this);" style="cursor:pointer;max-width:100%;max-height:675px;">&lt;defs>&lt;style xmlns="http://www.w3.org/1999/xhtml" type="text/css">.MathJax_Preview {color: #888}
#MathJax_Message {position: fixed; left: 1em; bottom: 1.5em; background-color: #E6E6E6; border: 1px solid #959595; margin: 0px; padding: 2px 8px; z-index: 102; color: black; font-size: 80%; width: auto; white-space: nowrap}
#MathJax_MSIE_Frame {position: absolute; top: 0; left: 0; width: 0px; z-index: 101; border: 0px; margin: 0px; padding: 0px}
.MathJax_Error {color: #CC0000; font-style: italic}
&lt;/style>&lt;style xmlns="http://www.w3.org/1999/xhtml" type="text/css">.MathJax_Hover_Frame {border-radius: .25em; -webkit-border-radius: .25em; -moz-border-radius: .25em; -khtml-border-radius: .25em; box-shadow: 0px 0px 15px #83A; -webkit-box-shadow: 0px 0px 15px #83A; -moz-box-shadow: 0px 0px 15px #83A; -khtml-box-shadow: 0px 0px 15px #83A; border: 1px solid #A6D ! important; display: inline-block; position: absolute}
.MathJax_Menu_Button .MathJax_Hover_Arrow {position: absolute; cursor: pointer; display: inline-block; border: 2px solid #AAA; border-radius: 4px; -webkit-border-radius: 4px; -moz-border-radius: 4px; -khtml-border-radius: 4px; font-family: 'Courier New',Courier; font-size: 9px; color: #F0F0F0}
.MathJax_Menu_Button .MathJax_Hover_Arrow span {display: block; background-color: #AAA; border: 1px solid; border-radius: 3px; line-height: 0; padding: 4px}
.MathJax_Hover_Arrow:hover {color: white!important; border: 2px solid #CCC!important}
.MathJax_Hover_Arrow:hover span {background-color: #CCC!important}
&lt;/style>&lt;style xmlns="http://www.w3.org/1999/xhtml" type="text/css">.MathJax_SVG_Display {text-align: center; margin: 1em 0em; position: relative; display: block!important; text-indent: 0; max-width: none; max-height: none; min-width: 0; min-height: 0; width: 100%}
.MathJax_SVG .MJX-monospace {font-family: monospace}
.MathJax_SVG .MJX-sans-serif {font-family: sans-serif}
#MathJax_SVG_Tooltip {background-color: InfoBackground; color: InfoText; border: 1px solid black; box-shadow: 2px 2px 5px #AAAAAA; -webkit-box-shadow: 2px 2px 5px #AAAAAA; -moz-box-shadow: 2px 2px 5px #AAAAAA; -khtml-box-shadow: 2px 2px 5px #AAAAAA; padding: 3px 4px; z-index: 401; position: absolute; left: 0; top: 0; width: auto; height: auto; display: none}
.MathJax_SVG {display: inline; font-style: normal; font-weight: normal; line-height: normal; font-size: 100%; font-size-adjust: none; text-indent: 0; text-align: left; text-transform: none; letter-spacing: normal; word-spacing: normal; word-wrap: normal; white-space: nowrap; float: none; direction: ltr; max-width: none; max-height: none; min-width: 0; min-height: 0; border: 0; padding: 0; margin: 0}
.MathJax_SVG * {transition: none; -webkit-transition: none; -moz-transition: none; -ms-transition: none; -o-transition: none}
.MathJax_SVG &amp;gt; div {display: inline-block}
.mjx-svg-href {fill: blue; stroke: blue}
.MathJax_SVG_Processing {visibility: hidden; position: absolute; top: 0; left: 0; width: 0; height: 0; overflow: hidden; display: block!important}
.MathJax_SVG_Processed {display: none!important}
.MathJax_SVG_test {font-style: normal; font-weight: normal; font-size: 100%; font-size-adjust: none; text-indent: 0; text-transform: none; letter-spacing: normal; word-spacing: normal; overflow: hidden; height: 1px}
.MathJax_SVG_test.mjx-test-display {display: table!important}
.MathJax_SVG_test.mjx-test-inline {display: inline!important; margin-right: -1px}
.MathJax_SVG_test.mjx-test-default {display: block!important; clear: both}
.MathJax_SVG_ex_box {display: inline-block!important; position: absolute; overflow: hidden; min-height: 0; max-height: none; padding: 0; border: 0; margin: 0; width: 1px; height: 60ex}
.mjx-test-inline .MathJax_SVG_left_box {display: inline-block; width: 0; float: left}
.mjx-test-inline .MathJax_SVG_right_box {display: inline-block; width: 0; float: right}
.mjx-test-display .MathJax_SVG_right_box {display: table-cell!important; width: 10000em!important; min-width: 0; max-width: none; padding: 0; border: 0; margin: 0}
.MathJax_SVG .noError {vertical-align: ; font-size: 90%; text-align: left; color: black; padding: 1px 3px; border: 1px solid}
&lt;/style>&lt;/defs>&lt;g>&lt;ellipse cx="138" cy="434" rx="120" ry="90" fill="#fff2cc" stroke="#d6b656" pointer-events="all"/>&lt;ellipse cx="708" cy="439" rx="310" ry="165" fill="#dae8fc" stroke="#6c8ebf" pointer-events="all"/>&lt;rect x="118" y="358" width="40" height="20" fill="none" stroke="none" pointer-events="all"/>&lt;g transform="translate(-0.5 -0.5)">&lt;switch>&lt;foreignObject style="overflow: visible; text-align: left;" pointer-events="none" width="100%" height="100%" requiredFeatures="http://www.w3.org/TR/SVG11/feature#Extensibility">&lt;div xmlns="http://www.w3.org/1999/xhtml" style="display: flex; align-items: unsafe center; justify-content: unsafe center; width: 38px; height: 1px; padding-top: 368px; margin-left: 119px;">&lt;div style="box-sizing: border-box; font-size: 0; text-align: center; ">&lt;div style="display: inline-block; font-size: 26px; font-family: Helvetica; color: #000000; line-height: 1.2; pointer-events: all; font-weight: bold; white-space: normal; word-wrap: normal; ">&lt;span class="MathJax_Preview" style="">&lt;/span>&lt;span class="MathJax_SVG" id="MathJax-Element-1-Frame" tabindex="0" style="font-size: 100%; display: inline-block;">&lt;svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" width="1.875ex" height="1.848ex" viewBox="0 -730.1 807.5 795.5" role="img" focusable="false" style="vertical-align: -0.152ex;">&lt;g stroke="currentColor" fill="currentColor" stroke-width="0" transform="matrix(1 0 0 -1 0 0)">&lt;path stroke-width="1" d="M324 614Q291 576 250 573Q231 573 231 584Q231 589 232 592Q235 601 244 614T271 643T324 671T400 683H403Q462 683 481 610Q485 594 490 545T498 454L501 413Q504 413 551 442T648 509T705 561Q707 565 707 578Q707 610 682 614Q667 614 667 626Q667 641 695 662T755 683Q765 683 775 680T796 662T807 623Q807 596 792 572T713 499T530 376L505 361V356Q508 346 511 278T524 148T557 75Q569 69 580 69Q585 69 593 77Q624 108 660 110Q667 110 670 110T676 106T678 94Q668 59 624 30T510 0Q487 0 471 
9T445 32T430 71T422 117T417 173Q416 183 416 188Q413 214 411 244T407 286T405 299Q403 299 344 263T223 182T154 122Q152 118 152 105Q152 69 180 69Q183 69 187 66T191 60L192 58V56Q192 41 163 21T105 0Q94 0 84 3T63 21T52 60Q52 77 56 90T85 131T155 191Q197 223 259 263T362 327T402 352L391 489Q391 492 390 505T387 526T384 547T379 568T372 586T361 602T348 611Q346 612 341 613T333 614H324Z"/>&lt;/g>&lt;/svg>&lt;/span>&lt;script type="math/tex" id="MathJax-Element-1">\mathcal{X}&lt;/script>&lt;/div>&lt;/div>&lt;/div>&lt;/foreignObject>&lt;text x="138" y="376" fill="#000000" font-family="Helvetica" font-size="26px" text-anchor="middle" font-weight="bold">\ma&amp;hellip;&lt;/text>&lt;/switch>&lt;/g>&lt;path d="M 158 422.47 L 494.79 396.63" fill="none" stroke="#ff0000" stroke-width="2" stroke-miterlimit="10" pointer-events="stroke"/>&lt;path d="M 500.77 396.17 L 493.1 400.77 L 494.79 396.63 L 492.49 392.79 Z" fill="#ff0000" stroke="#ff0000" stroke-width="2" stroke-miterlimit="10" pointer-events="all"/>&lt;path d="M 118 424 L 58 424 L 58 171.5 L 319.76 171.5" fill="none" stroke="#4d9900" stroke-width="2" stroke-miterlimit="10" pointer-events="stroke"/>&lt;path d="M 325.76 171.5 L 317.76 175.5 L 319.76 171.5 L 317.76 167.5 Z" fill="#4d9900" stroke="#4d9900" stroke-width="2" stroke-miterlimit="10" pointer-events="all"/>&lt;rect x="118" y="404" width="40" height="40" fill="none" stroke="none" pointer-events="all"/>&lt;g transform="translate(-0.5 -0.5)">&lt;switch>&lt;foreignObject style="overflow: visible; text-align: left;" pointer-events="none" width="100%" height="100%" requiredFeatures="http://www.w3.org/TR/SVG11/feature#Extensibility">&lt;div xmlns="http://www.w3.org/1999/xhtml" style="display: flex; align-items: unsafe center; justify-content: unsafe center; width: 38px; height: 1px; padding-top: 424px; margin-left: 119px;">&lt;div style="box-sizing: border-box; font-size: 0; text-align: center; ">&lt;div style="display: inline-block; font-size: 20px; font-family: Helvetica; color: 
#000000; line-height: 1.2; pointer-events: all; white-space: normal; word-wrap: normal; ">&lt;span class="MathJax_Preview" style="">&lt;/span>&lt;span class="MathJax_SVG" id="MathJax-Element-2-Frame" tabindex="0" style="font-size: 100%; display: inline-block;">&lt;svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" width="2.331ex" height="1.636ex" viewBox="0 -496.4 1003.8 704.4" role="img" focusable="false" style="vertical-align: -0.483ex;">&lt;g stroke="currentColor" fill="currentColor" stroke-width="0" transform="matrix(1 0 0 -1 0 0)">&lt;path stroke-width="1" d="M74 282H63Q43 282 43 296Q43 298 45 307T56 332T76 365T110 401T159 433Q200 451 233 451H236Q273 451 282 450Q358 437 382 400L392 410Q434 452 483 452Q538 452 568 421T599 346Q599 303 573 280T517 256Q494 256 478 270T462 308Q462 343 488 367Q501 377 520 385Q520 386 516 389T502 396T480 400T462 398Q429 383 415 341Q354 116 354 80T405 44Q449 44 485 74T535 142Q539 156 542 159T562 162H568H579Q599 162 599 148Q599 135 586 111T550 60T485 12T397 -8Q313 -8 266 35L258 44Q215 -7 161 -7H156Q99 -7 71 25T43 95Q43 143 70 165T125 188Q148 188 164 174T180 136Q180 101 154 77Q141 67 122 59Q124 54 136 49T161 43Q183 43 200 61T226 103Q287 328 287 364T236 400Q200 400 164 377T107 302Q103 288 100 285T80 282H74Z"/>&lt;g transform="translate(659,-150)">&lt;path stroke-width="1" transform="scale(0.707)" d="M184 600Q184 624 203 642T247 661Q265 661 277 649T290 619Q290 596 270 577T226 557Q211 557 198 567T184 600ZM21 287Q21 295 30 318T54 369T98 420T158 442Q197 442 223 419T250 357Q250 340 236 301T196 196T154 83Q149 61 149 51Q149 26 166 26Q175 26 185 29T208 43T235 78T260 137Q263 149 265 151T282 153Q302 153 302 143Q302 135 293 112T268 61T223 11T161 -11Q129 -11 102 10T74 74Q74 91 79 106T122 220Q160 321 166 341T173 380Q173 404 156 404H154Q124 404 99 371T61 287Q60 286 59 284T58 281T56 279T53 278T49 278T41 278H27Q21 284 21 287Z"/>&lt;/g>&lt;/g>&lt;/svg>&lt;/span>&lt;script type="math/tex" 
id="MathJax-Element-2">\boldsymbol{x}_i&lt;/script>&lt;/div>&lt;/div>&lt;/div>&lt;/foreignObject>&lt;text x="138" y="430" fill="#000000" font-family="Helvetica" font-size="20px" text-anchor="middle">\bol&amp;hellip;&lt;/text>&lt;/switch>&lt;/g>&lt;path d="M 158 476.19 L 494.81 513.1" fill="none" stroke="#ff0000" stroke-width="2" stroke-miterlimit="10" pointer-events="stroke"/>&lt;path d="M 500.78 513.76 L 492.39 516.86 L 494.81 513.1 L 493.26 508.91 Z" fill="#ff0000" stroke="#ff0000" stroke-width="2" stroke-miterlimit="10" pointer-events="all"/>&lt;path d="M 118 474 L 8 474 L 8 126.5 L 319.76 126.5" fill="none" stroke="#4d9900" stroke-width="2" stroke-miterlimit="10" pointer-events="stroke"/>&lt;path d="M 325.76 126.5 L 317.76 130.5 L 319.76 126.5 L 317.76 122.5 Z" fill="#4d9900" stroke="#4d9900" stroke-width="2" stroke-miterlimit="10" pointer-events="all"/>&lt;rect x="118" y="454" width="40" height="40" fill="none" stroke="none" pointer-events="all"/>&lt;g transform="translate(-0.5 -0.5)">&lt;switch>&lt;foreignObject style="overflow: visible; text-align: left;" pointer-events="none" width="100%" height="100%" requiredFeatures="http://www.w3.org/TR/SVG11/feature#Extensibility">&lt;div xmlns="http://www.w3.org/1999/xhtml" style="display: flex; align-items: unsafe center; justify-content: unsafe center; width: 38px; height: 1px; padding-top: 474px; margin-left: 119px;">&lt;div style="box-sizing: border-box; font-size: 0; text-align: center; ">&lt;div style="display: inline-block; font-size: 20px; font-family: Helvetica; color: #000000; line-height: 1.2; pointer-events: all; white-space: normal; word-wrap: normal; ">&lt;span class="MathJax_Preview" style="">&lt;/span>&lt;span class="MathJax_SVG" id="MathJax-Element-3-Frame" tabindex="0" style="font-size: 100%; display: inline-block;">&lt;svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" width="2.441ex" height="2.019ex" viewBox="0 -496.4 1051.2 869.2" role="img" focusable="false" 
style="vertical-align: -0.866ex;">&lt;g stroke="currentColor" fill="currentColor" stroke-width="0" transform="matrix(1 0 0 -1 0 0)">&lt;path stroke-width="1" d="M74 282H63Q43 282 43 296Q43 298 45 307T56 332T76 365T110 401T159 433Q200 451 233 451H236Q273 451 282 450Q358 437 382 400L392 410Q434 452 483 452Q538 452 568 421T599 346Q599 303 573 280T517 256Q494 256 478 270T462 308Q462 343 488 367Q501 377 520 385Q520 386 516 389T502 396T480 400T462 398Q429 383 415 341Q354 116 354 80T405 44Q449 44 485 74T535 142Q539 156 542 159T562 162H568H579Q599 162 599 148Q599 135 586 111T550 60T485 12T397 -8Q313 -8 266 35L258 44Q215 -7 161 -7H156Q99 -7 71 25T43 95Q43 143 70 165T125 188Q148 188 164 174T180 136Q180 101 154 77Q141 67 122 59Q124 54 136 49T161 43Q183 43 200 61T226 103Q287 328 287 364T236 400Q200 400 164 377T107 302Q103 288 100 285T80 282H74Z"/>&lt;g transform="translate(659,-150)">&lt;path stroke-width="1" transform="scale(0.707)" d="M297 596Q297 627 318 644T361 661Q378 661 389 651T403 623Q403 595 384 576T340 557Q322 557 310 567T297 596ZM288 376Q288 405 262 405Q240 405 220 393T185 362T161 325T144 293L137 279Q135 278 121 278H107Q101 284 101 286T105 299Q126 348 164 391T252 441Q253 441 260 441T272 442Q296 441 316 432Q341 418 354 401T367 348V332L318 133Q267 -67 264 -75Q246 -125 194 -164T75 -204Q25 -204 7 -183T-12 -137Q-12 -110 7 -91T53 -71Q70 -71 82 -81T95 -112Q95 -148 63 -167Q69 -168 77 -168Q111 -168 139 -140T182 -74L193 -32Q204 11 219 72T251 197T278 308T289 365Q289 372 288 376Z"/>&lt;/g>&lt;/g>&lt;/svg>&lt;/span>&lt;script type="math/tex" id="MathJax-Element-3">\boldsymbol{x}_j&lt;/script>&lt;/div>&lt;/div>&lt;/div>&lt;/foreignObject>&lt;text x="138" y="480" fill="#000000" font-family="Helvetica" font-size="20px" text-anchor="middle">\bol&amp;hellip;&lt;/text>&lt;/switch>&lt;/g>&lt;rect x="678" y="284" width="40" height="20" fill="none" stroke="none" pointer-events="all"/>&lt;g transform="translate(-0.5 -0.5)">&lt;switch>&lt;foreignObject style="overflow: visible; text-align: 
left;" pointer-events="none" width="100%" height="100%" requiredFeatures="http://www.w3.org/TR/SVG11/feature#Extensibility">&lt;div xmlns="http://www.w3.org/1999/xhtml" style="display: flex; align-items: unsafe center; justify-content: unsafe center; width: 38px; height: 1px; padding-top: 294px; margin-left: 679px;">&lt;div style="box-sizing: border-box; font-size: 0; text-align: center; ">&lt;div style="display: inline-block; font-size: 26px; font-family: Helvetica; color: #000000; line-height: 1.2; pointer-events: all; white-space: normal; word-wrap: normal; ">&lt;span class="MathJax_Preview" style="">&lt;/span>&lt;span class="MathJax_SVG" id="MathJax-Element-4-Frame" tabindex="0" style="font-size: 100%; display: inline-block;">&lt;svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" width="1.529ex" height="1.921ex" viewBox="0 -730.1 658.5 827.1" role="img" focusable="false" style="vertical-align: -0.225ex;">&lt;g stroke="currentColor" fill="currentColor" stroke-width="0" transform="matrix(1 0 0 -1 0 0)">&lt;path stroke-width="1" d="M25 633Q25 647 47 665T100 683Q291 683 291 306Q291 264 288 213T282 132L279 102Q281 102 308 126T378 191T464 279T545 381T596 479Q600 490 600 502Q600 527 581 550T523 577Q505 577 505 601Q505 622 516 647T542 681Q546 683 558 683Q605 679 631 645T658 559Q658 423 487 215Q409 126 308 37T190 -52Q177 -52 177 -28Q177 -26 183 15T196 127T203 270Q203 356 192 421T165 523T126 583T83 613T41 620Q25 620 25 633Z"/>&lt;/g>&lt;/svg>&lt;/span>&lt;script type="math/tex" id="MathJax-Element-4">\mathcal{V}&lt;/script>&lt;/div>&lt;/div>&lt;/div>&lt;/foreignObject>&lt;text x="698" y="302" fill="#000000" font-family="Helvetica" font-size="26px" text-anchor="middle">\ma&amp;hellip;&lt;/text>&lt;/switch>&lt;/g>&lt;path d="M 578 401 L 779.84 427.91" fill="none" stroke="#ff0000" stroke-width="2" stroke-miterlimit="10" pointer-events="stroke"/>&lt;path d="M 785.78 428.7 L 777.33 431.61 L 779.84 427.91 L 778.38 423.68 Z" fill="#ff0000" 
stroke="#ff0000" stroke-width="2" stroke-miterlimit="10" pointer-events="all"/>&lt;rect x="503" y="378" width="75" height="36" fill="none" stroke="none" pointer-events="all"/>&lt;g transform="translate(-0.5 -0.5)">&lt;switch>&lt;foreignObject style="overflow: visible; text-align: left;" pointer-events="none" width="100%" height="100%" requiredFeatures="http://www.w3.org/TR/SVG11/feature#Extensibility">&lt;div xmlns="http://www.w3.org/1999/xhtml" style="display: flex; align-items: unsafe center; justify-content: unsafe center; width: 73px; height: 1px; padding-top: 396px; margin-left: 504px;">&lt;div style="box-sizing: border-box; font-size: 0; text-align: center; ">&lt;div style="display: inline-block; font-size: 20px; font-family: Helvetica; color: #000000; line-height: 1.2; pointer-events: all; white-space: normal; word-wrap: normal; ">&lt;span class="MathJax_Preview" style="">&lt;/span>&lt;span class="MathJax_SVG" id="MathJax-Element-5-Frame" tabindex="0" style="font-size: 100%; display: inline-block;">&lt;svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" width="5.526ex" height="2.689ex" viewBox="0 -826 2379.3 1157.6" role="img" focusable="false" style="vertical-align: -0.77ex;">&lt;g stroke="currentColor" fill="currentColor" stroke-width="0" transform="matrix(1 0 0 -1 0 0)">&lt;path stroke-width="1" d="M409 688Q413 694 421 694H429H442Q448 688 448 686Q448 679 418 563Q411 535 404 504T392 458L388 442Q388 441 397 441T429 435T477 418Q521 397 550 357T579 260T548 151T471 65T374 11T279 -10H275L251 -105Q245 -128 238 -160Q230 -192 227 -198T215 -205H209Q189 -205 189 -198Q189 -193 211 -103L234 -11Q234 -10 226 -10Q221 -10 206 -8T161 6T107 36T62 89T43 171Q43 231 76 284T157 370T254 422T342 441Q347 441 348 445L378 567Q409 686 409 688ZM122 150Q122 116 134 91T167 53T203 35T237 27H244L337 404Q333 404 326 403T297 395T255 379T211 350T170 304Q152 276 137 237Q122 191 122 150ZM500 282Q500 320 484 347T444 385T405 400T381 404H378L332 217L284 29Q284 27 285 
27Q293 27 317 33T357 47Q400 66 431 100T475 170T494 234T500 282Z"/>&lt;g transform="translate(596,0)">&lt;path stroke-width="1" d="M94 250Q94 319 104 381T127 488T164 576T202 643T244 695T277 729T302 750H315H319Q333 750 333 741Q333 738 316 720T275 667T226 581T184 443T167 250T184 58T225 -81T274 -167T316 -220T333 -241Q333 -250 318 -250H315H302L274 -226Q180 -141 137 -14T94 250Z"/>&lt;/g>&lt;g transform="translate(986,0)">&lt;path stroke-width="1" d="M74 282H63Q43 282 43 296Q43 298 45 307T56 332T76 365T110 401T159 433Q200 451 233 451H236Q273 451 282 450Q358 437 382 400L392 410Q434 452 483 452Q538 452 568 421T599 346Q599 303 573 280T517 256Q494 256 478 270T462 308Q462 343 488 367Q501 377 520 385Q520 386 516 389T502 396T480 400T462 398Q429 383 415 341Q354 116 354 80T405 44Q449 44 485 74T535 142Q539 156 542 159T562 162H568H579Q599 162 599 148Q599 135 586 111T550 60T485 12T397 -8Q313 -8 266 35L258 44Q215 -7 161 -7H156Q99 -7 71 25T43 95Q43 143 70 165T125 188Q148 188 164 174T180 136Q180 101 154 77Q141 67 122 59Q124 54 136 49T161 43Q183 43 200 61T226 103Q287 328 287 364T236 400Q200 400 164 377T107 302Q103 288 100 285T80 282H74Z"/>&lt;g transform="translate(659,-150)">&lt;path stroke-width="1" transform="scale(0.707)" d="M184 600Q184 624 203 642T247 661Q265 661 277 649T290 619Q290 596 270 577T226 557Q211 557 198 567T184 600ZM21 287Q21 295 30 318T54 369T98 420T158 442Q197 442 223 419T250 357Q250 340 236 301T196 196T154 83Q149 61 149 51Q149 26 166 26Q175 26 185 29T208 43T235 78T260 137Q263 149 265 151T282 153Q302 153 302 143Q302 135 293 112T268 61T223 11T161 -11Q129 -11 102 10T74 74Q74 91 79 106T122 220Q160 321 166 341T173 380Q173 404 156 404H154Q124 404 99 371T61 287Q60 286 59 284T58 281T56 279T53 278T49 278T41 278H27Q21 284 21 287Z"/>&lt;/g>&lt;/g>&lt;g transform="translate(1989,0)">&lt;path stroke-width="1" d="M60 749L64 750Q69 750 74 750H86L114 726Q208 641 251 514T294 250Q294 182 284 119T261 12T224 -76T186 -143T145 -194T113 -227T90 -246Q87 -249 86 -250H74Q66 -250 63 -250T58 
-247T55 -238Q56 -237 66 -225Q221 -64 221 250T66 725Q56 737 55 738Q55 746 60 749Z"/>&lt;/g>&lt;/g>&lt;/svg>&lt;/span>&lt;script type="math/tex" id="MathJax-Element-5">\phi(\boldsymbol{x}_i )&lt;/script>&lt;/div>&lt;/div>&lt;/div>&lt;/foreignObject>&lt;text x="541" y="402" fill="#000000" font-family="Helvetica" font-size="20px" text-anchor="middle">\phi(\bo&amp;hellip;&lt;/text>&lt;/switch>&lt;/g>&lt;path d="M 578 505.67 L 779.96 460.79" fill="none" stroke="#ff0000" stroke-width="2" stroke-miterlimit="10" pointer-events="stroke"/>&lt;path d="M 785.82 459.49 L 778.88 465.13 L 779.96 460.79 L 777.14 457.32 Z" fill="#ff0000" stroke="#ff0000" stroke-width="2" stroke-miterlimit="10" pointer-events="all"/>&lt;rect x="503" y="494" width="75" height="40" fill="none" stroke="none" pointer-events="all"/>&lt;g transform="translate(-0.5 -0.5)">&lt;switch>&lt;foreignObject style="overflow: visible; text-align: left;" pointer-events="none" width="100%" height="100%" requiredFeatures="http://www.w3.org/TR/SVG11/feature#Extensibility">&lt;div xmlns="http://www.w3.org/1999/xhtml" style="display: flex; align-items: unsafe center; justify-content: unsafe center; width: 73px; height: 1px; padding-top: 514px; margin-left: 504px;">&lt;div style="box-sizing: border-box; font-size: 0; text-align: center; ">&lt;div style="display: inline-block; font-size: 20px; font-family: Helvetica; color: #000000; line-height: 1.2; pointer-events: all; white-space: normal; word-wrap: normal; ">&lt;span class="MathJax_Preview" style="">&lt;/span>&lt;span class="MathJax_SVG" id="MathJax-Element-6-Frame" tabindex="0" style="font-size: 100%; display: inline-block;">&lt;svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" width="5.636ex" height="2.784ex" viewBox="0 -826 2426.7 1198.8" role="img" focusable="false" style="vertical-align: -0.866ex;">&lt;g stroke="currentColor" fill="currentColor" stroke-width="0" transform="matrix(1 0 0 -1 0 0)">&lt;path stroke-width="1" d="M409 
688Q413 694 421 694H429H442Q448 688 448 686Q448 679 418 563Q411 535 404 504T392 458L388 442Q388 441 397 441T429 435T477 418Q521 397 550 357T579 260T548 151T471 65T374 11T279 -10H275L251 -105Q245 -128 238 -160Q230 -192 227 -198T215 -205H209Q189 -205 189 -198Q189 -193 211 -103L234 -11Q234 -10 226 -10Q221 -10 206 -8T161 6T107 36T62 89T43 171Q43 231 76 284T157 370T254 422T342 441Q347 441 348 445L378 567Q409 686 409 688ZM122 150Q122 116 134 91T167 53T203 35T237 27H244L337 404Q333 404 326 403T297 395T255 379T211 350T170 304Q152 276 137 237Q122 191 122 150ZM500 282Q500 320 484 347T444 385T405 400T381 404H378L332 217L284 29Q284 27 285 27Q293 27 317 33T357 47Q400 66 431 100T475 170T494 234T500 282Z"/>&lt;g transform="translate(596,0)">&lt;path stroke-width="1" d="M94 250Q94 319 104 381T127 488T164 576T202 643T244 695T277 729T302 750H315H319Q333 750 333 741Q333 738 316 720T275 667T226 581T184 443T167 250T184 58T225 -81T274 -167T316 -220T333 -241Q333 -250 318 -250H315H302L274 -226Q180 -141 137 -14T94 250Z"/>&lt;/g>&lt;g transform="translate(986,0)">&lt;path stroke-width="1" d="M74 282H63Q43 282 43 296Q43 298 45 307T56 332T76 365T110 401T159 433Q200 451 233 451H236Q273 451 282 450Q358 437 382 400L392 410Q434 452 483 452Q538 452 568 421T599 346Q599 303 573 280T517 256Q494 256 478 270T462 308Q462 343 488 367Q501 377 520 385Q520 386 516 389T502 396T480 400T462 398Q429 383 415 341Q354 116 354 80T405 44Q449 44 485 74T535 142Q539 156 542 159T562 162H568H579Q599 162 599 148Q599 135 586 111T550 60T485 12T397 -8Q313 -8 266 35L258 44Q215 -7 161 -7H156Q99 -7 71 25T43 95Q43 143 70 165T125 188Q148 188 164 174T180 136Q180 101 154 77Q141 67 122 59Q124 54 136 49T161 43Q183 43 200 61T226 103Q287 328 287 364T236 400Q200 400 164 377T107 302Q103 288 100 285T80 282H74Z"/>&lt;g transform="translate(659,-150)">&lt;path stroke-width="1" transform="scale(0.707)" d="M297 596Q297 627 318 644T361 661Q378 661 389 651T403 623Q403 595 384 576T340 557Q322 557 310 567T297 596ZM288 376Q288 405 262 405Q240 405 
220 393T185 362T161 325T144 293L137 279Q135 278 121 278H107Q101 284 101 286T105 299Q126 348 164 391T252 441Q253 441 260 441T272 442Q296 441 316 432Q341 418 354 401T367 348V332L318 133Q267 -67 264 -75Q246 -125 194 -164T75 -204Q25 -204 7 -183T-12 -137Q-12 -110 7 -91T53 -71Q70 -71 82 -81T95 -112Q95 -148 63 -167Q69 -168 77 -168Q111 -168 139 -140T182 -74L193 -32Q204 11 219 72T251 197T278 308T289 365Q289 372 288 376Z"/>&lt;/g>&lt;/g>&lt;g transform="translate(2037,0)">&lt;path stroke-width="1" d="M60 749L64 750Q69 750 74 750H86L114 726Q208 641 251 514T294 250Q294 182 284 119T261 12T224 -76T186 -143T145 -194T113 -227T90 -246Q87 -249 86 -250H74Q66 -250 63 -250T58 -247T55 -238Q56 -237 66 -225Q221 -64 221 250T66 725Q56 737 55 738Q55 746 60 749Z"/>&lt;/g>&lt;/g>&lt;/svg>&lt;/span>&lt;script type="math/tex" id="MathJax-Element-6">\phi(\boldsymbol{x}_j )&lt;/script>&lt;/div>&lt;/div>&lt;/div>&lt;/foreignObject>&lt;text x="541" y="520" fill="#000000" font-family="Helvetica" font-size="20px" text-anchor="middle">\phi(\bo&amp;hellip;&lt;/text>&lt;/switch>&lt;/g>&lt;rect x="788" y="414" width="190" height="60" fill="none" stroke="none" pointer-events="all"/>&lt;g transform="translate(-0.5 -0.5)">&lt;switch>&lt;foreignObject style="overflow: visible; text-align: left;" pointer-events="none" width="100%" height="100%" requiredFeatures="http://www.w3.org/TR/SVG11/feature#Extensibility">&lt;div xmlns="http://www.w3.org/1999/xhtml" style="display: flex; align-items: unsafe center; justify-content: unsafe center; width: 188px; height: 1px; padding-top: 444px; margin-left: 789px;">&lt;div style="box-sizing: border-box; font-size: 0; text-align: center; ">&lt;div style="display: inline-block; font-size: 26px; font-family: Helvetica; color: #000000; line-height: 1.2; pointer-events: all; white-space: normal; word-wrap: normal; ">&lt;font style="font-size: 26px">&lt;span class="MathJax_Preview" style="">&lt;/span>&lt;span class="MathJax_SVG" id="MathJax-Element-7-Frame" tabindex="0" 
style="font-size: 100%; display: inline-block;">&lt;svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" width="15.319ex" height="2.656ex" viewBox="0 -793.5 6595.8 1143.7" role="img" focusable="false" style="vertical-align: -0.813ex;">&lt;g stroke="currentColor" fill="currentColor" stroke-width="0" transform="matrix(1 0 0 -1 0 0)">&lt;path stroke-width="1" d="M333 -232Q332 -239 327 -244T313 -250Q303 -250 296 -240Q293 -233 202 6T110 250T201 494T296 740Q299 745 306 749L309 750Q312 750 313 750Q331 750 333 732Q333 727 243 489Q152 252 152 250T243 11Q333 -227 333 -232Z"/>&lt;g transform="translate(389,0)">&lt;path stroke-width="1" d="M409 688Q413 694 421 694H429H442Q448 688 448 686Q448 679 418 563Q411 535 404 504T392 458L388 442Q388 441 397 441T429 435T477 418Q521 397 550 357T579 260T548 151T471 65T374 11T279 -10H275L251 -105Q245 -128 238 -160Q230 -192 227 -198T215 -205H209Q189 -205 189 -198Q189 -193 211 -103L234 -11Q234 -10 226 -10Q221 -10 206 -8T161 6T107 36T62 89T43 171Q43 231 76 284T157 370T254 422T342 441Q347 441 348 445L378 567Q409 686 409 688ZM122 150Q122 116 134 91T167 53T203 35T237 27H244L337 404Q333 404 326 403T297 395T255 379T211 350T170 304Q152 276 137 237Q122 191 122 150ZM500 282Q500 320 484 347T444 385T405 400T381 404H378L332 217L284 29Q284 27 285 27Q293 27 317 33T357 47Q400 66 431 100T475 170T494 234T500 282Z"/>&lt;/g>&lt;g transform="translate(986,0)">&lt;path stroke-width="1" d="M94 250Q94 319 104 381T127 488T164 576T202 643T244 695T277 729T302 750H315H319Q333 750 333 741Q333 738 316 720T275 667T226 581T184 443T167 250T184 58T225 -81T274 -167T316 -220T333 -241Q333 -250 318 -250H315H302L274 -226Q180 -141 137 -14T94 250Z"/>&lt;/g>&lt;g transform="translate(1375,0)">&lt;path stroke-width="1" d="M74 282H63Q43 282 43 296Q43 298 45 307T56 332T76 365T110 401T159 433Q200 451 233 451H236Q273 451 282 450Q358 437 382 400L392 410Q434 452 483 452Q538 452 568 421T599 346Q599 303 573 280T517 256Q494 256 478 270T462 308Q462 343 488 367Q501 
377 520 385Q520 386 516 389T502 396T480 400T462 398Q429 383 415 341Q354 116 354 80T405 44Q449 44 485 74T535 142Q539 156 542 159T562 162H568H579Q599 162 599 148Q599 135 586 111T550 60T485 12T397 -8Q313 -8 266 35L258 44Q215 -7 161 -7H156Q99 -7 71 25T43 95Q43 143 70 165T125 188Q148 188 164 174T180 136Q180 101 154 77Q141 67 122 59Q124 54 136 49T161 43Q183 43 200 61T226 103Q287 328 287 364T236 400Q200 400 164 377T107 302Q103 288 100 285T80 282H74Z"/>&lt;g transform="translate(659,-150)">&lt;path stroke-width="1" transform="scale(0.707)" d="M184 600Q184 624 203 642T247 661Q265 661 277 649T290 619Q290 596 270 577T226 557Q211 557 198 567T184 600ZM21 287Q21 295 30 318T54 369T98 420T158 442Q197 442 223 419T250 357Q250 340 236 301T196 196T154 83Q149 61 149 51Q149 26 166 26Q175 26 185 29T208 43T235 78T260 137Q263 149 265 151T282 153Q302 153 302 143Q302 135 293 112T268 61T223 11T161 -11Q129 -11 102 10T74 74Q74 91 79 106T122 220Q160 321 166 341T173 380Q173 404 156 404H154Q124 404 99 371T61 287Q60 286 59 284T58 281T56 279T53 278T49 278T41 278H27Q21 284 21 287Z"/>&lt;/g>&lt;/g>&lt;g transform="translate(2379,0)">&lt;path stroke-width="1" d="M60 749L64 750Q69 750 74 750H86L114 726Q208 641 251 514T294 250Q294 182 284 119T261 12T224 -76T186 -143T145 -194T113 -227T90 -246Q87 -249 86 -250H74Q66 -250 63 -250T58 -247T55 -238Q56 -237 66 -225Q221 -64 221 250T66 725Q56 737 55 738Q55 746 60 749Z"/>&lt;/g>&lt;g transform="translate(2768,0)">&lt;path stroke-width="1" d="M78 35T78 60T94 103T137 121Q165 121 187 96T210 8Q210 -27 201 -60T180 -117T154 -158T130 -185T117 -194Q113 -194 104 -185T95 -172Q95 -168 106 -156T131 -126T157 -76T173 -3V9L172 8Q170 7 167 6T161 3T152 1T140 0Q113 0 96 17Z"/>&lt;/g>&lt;g transform="translate(3213,0)">&lt;path stroke-width="1" d="M409 688Q413 694 421 694H429H442Q448 688 448 686Q448 679 418 563Q411 535 404 504T392 458L388 442Q388 441 397 441T429 435T477 418Q521 397 550 357T579 260T548 151T471 65T374 11T279 -10H275L251 -105Q245 -128 238 -160Q230 -192 227 -198T215 
-205H209Q189 -205 189 -198Q189 -193 211 -103L234 -11Q234 -10 226 -10Q221 -10 206 -8T161 6T107 36T62 89T43 171Q43 231 76 284T157 370T254 422T342 441Q347 441 348 445L378 567Q409 686 409 688ZM122 150Q122 116 134 91T167 53T203 35T237 27H244L337 404Q333 404 326 403T297 395T255 379T211 350T170 304Q152 276 137 237Q122 191 122 150ZM500 282Q500 320 484 347T444 385T405 400T381 404H378L332 217L284 29Q284 27 285 27Q293 27 317 33T357 47Q400 66 431 100T475 170T494 234T500 282Z"/>&lt;/g>&lt;g transform="translate(3810,0)">&lt;path stroke-width="1" d="M94 250Q94 319 104 381T127 488T164 576T202 643T244 695T277 729T302 750H315H319Q333 750 333 741Q333 738 316 720T275 667T226 581T184 443T167 250T184 58T225 -81T274 -167T316 -220T333 -241Q333 -250 318 -250H315H302L274 -226Q180 -141 137 -14T94 250Z"/>&lt;/g>&lt;g transform="translate(4199,0)">&lt;path stroke-width="1" d="M74 282H63Q43 282 43 296Q43 298 45 307T56 332T76 365T110 401T159 433Q200 451 233 451H236Q273 451 282 450Q358 437 382 400L392 410Q434 452 483 452Q538 452 568 421T599 346Q599 303 573 280T517 256Q494 256 478 270T462 308Q462 343 488 367Q501 377 520 385Q520 386 516 389T502 396T480 400T462 398Q429 383 415 341Q354 116 354 80T405 44Q449 44 485 74T535 142Q539 156 542 159T562 162H568H579Q599 162 599 148Q599 135 586 111T550 60T485 12T397 -8Q313 -8 266 35L258 44Q215 -7 161 -7H156Q99 -7 71 25T43 95Q43 143 70 165T125 188Q148 188 164 174T180 136Q180 101 154 77Q141 67 122 59Q124 54 136 49T161 43Q183 43 200 61T226 103Q287 328 287 364T236 400Q200 400 164 377T107 302Q103 288 100 285T80 282H74Z"/>&lt;g transform="translate(659,-150)">&lt;path stroke-width="1" transform="scale(0.707)" d="M297 596Q297 627 318 644T361 661Q378 661 389 651T403 623Q403 595 384 576T340 557Q322 557 310 567T297 596ZM288 376Q288 405 262 405Q240 405 220 393T185 362T161 325T144 293L137 279Q135 278 121 278H107Q101 284 101 286T105 299Q126 348 164 391T252 441Q253 441 260 441T272 442Q296 441 316 432Q341 418 354 401T367 348V332L318 133Q267 -67 264 -75Q246 -125 194 -164T75 
-204Q25 -204 7 -183T-12 -137Q-12 -110 7 -91T53 -71Q70 -71 82 -81T95 -112Q95 -148 63 -167Q69 -168 77 -168Q111 -168 139 -140T182 -74L193 -32Q204 11 219 72T251 197T278 308T289 365Q289 372 288 376Z"/>&lt;/g>&lt;/g>&lt;g transform="translate(5251,0)">&lt;path stroke-width="1" d="M60 749L64 750Q69 750 74 750H86L114 726Q208 641 251 514T294 250Q294 182 284 119T261 12T224 -76T186 -143T145 -194T113 -227T90 -246Q87 -249 86 -250H74Q66 -250 63 -250T58 -247T55 -238Q56 -237 66 -225Q221 -64 221 250T66 725Q56 737 55 738Q55 746 60 749Z"/>&lt;/g>&lt;g transform="translate(5640,0)">&lt;path stroke-width="1" d="M55 732Q56 739 61 744T75 750Q85 750 92 740Q95 733 186 494T278 250T187 6T92 -240Q85 -250 75 -250Q67 -250 62 -245T55 -232Q55 -227 145 11Q236 248 236 250T145 489Q55 727 55 732Z"/>&lt;g transform="translate(389,-150)">&lt;path stroke-width="1" transform="scale(0.707)" d="M25 633Q25 647 47 665T100 683Q291 683 291 306Q291 264 288 213T282 132L279 102Q281 102 308 126T378 191T464 279T545 381T596 479Q600 490 600 502Q600 527 581 550T523 577Q505 577 505 601Q505 622 516 647T542 681Q546 683 558 683Q605 679 631 645T658 559Q658 423 487 215Q409 126 308 37T190 -52Q177 -52 177 -28Q177 -26 183 15T196 127T203 270Q203 356 192 421T165 523T126 583T83 613T41 620Q25 620 25 633Z"/>&lt;/g>&lt;/g>&lt;/g>&lt;/svg>&lt;/span>&lt;script type="math/tex" id="MathJax-Element-7">\langle \phi(\boldsymbol{x}_i ), \phi(\boldsymbol{x}_j ) \rangle_\mathcal{V} &lt;/script>&lt;/font>&lt;/div>&lt;/div>&lt;/div>&lt;/foreignObject>&lt;text x="883" y="452" fill="#000000" font-family="Helvetica" font-size="26px" text-anchor="middle">\langle \phi(\b&amp;hellip;&lt;/text>&lt;/switch>&lt;/g>&lt;path d="M 568 149 L 883 149 L 883 405.76" fill="none" stroke="#4d9900" stroke-width="2" stroke-miterlimit="10" pointer-events="stroke"/>&lt;path d="M 883 411.76 L 879 403.76 L 883 405.76 L 887 403.76 Z" fill="#4d9900" stroke="#4d9900" stroke-width="2" stroke-miterlimit="10" pointer-events="all"/>&lt;rect x="328" y="104" 
width="240" height="90" fill="none" stroke="#000000" pointer-events="all"/>&lt;g transform="translate(-0.5 -0.5)">&lt;switch>&lt;foreignObject style="overflow: visible; text-align: left;" pointer-events="none" width="100%" height="100%" requiredFeatures="http://www.w3.org/TR/SVG11/feature#Extensibility">&lt;div xmlns="http://www.w3.org/1999/xhtml" style="display: flex; align-items: unsafe center; justify-content: unsafe center; width: 238px; height: 1px; padding-top: 149px; margin-left: 329px;">&lt;div style="box-sizing: border-box; font-size: 0; text-align: center; ">&lt;div style="display: inline-block; font-size: 27px; font-family: Helvetica; color: #4D9900; line-height: 1.2; pointer-events: all; white-space: normal; word-wrap: normal; ">Kernel function&lt;br style="font-size: 27px" />&lt;span class="MathJax_Preview" style="">&lt;/span>&lt;span class="MathJax_SVG" id="MathJax-Element-8-Frame" tabindex="0" style="font-size: 100%; display: inline-block;">&lt;svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" width="8.827ex" height="2.63ex" viewBox="0 -795 3800.7 1132.5" role="img" focusable="false" style="vertical-align: -0.784ex;">&lt;g stroke="currentColor" fill="currentColor" stroke-width="0" transform="matrix(1 0 0 -1 0 0)">&lt;path stroke-width="1" d="M121 647Q121 657 125 670T137 683Q138 683 209 688T282 694Q294 694 294 686Q294 679 244 477Q194 279 194 272Q213 282 223 291Q247 309 292 354T362 415Q402 442 438 442Q468 442 485 423T503 369Q503 344 496 327T477 302T456 291T438 288Q418 288 406 299T394 328Q394 353 410 369T442 390L458 393Q446 405 434 405H430Q398 402 367 380T294 316T228 255Q230 254 243 252T267 246T293 238T320 224T342 206T359 180T365 147Q365 130 360 106T354 66Q354 26 381 26Q429 26 459 145Q461 153 479 153H483Q499 153 499 144Q499 139 496 130Q455 -11 378 -11Q333 -11 305 15T277 90Q277 108 280 121T283 145Q283 167 269 183T234 206T200 217T182 220H180Q168 178 159 139T145 81T136 44T129 20T122 7T111 -2Q98 -11 83 -11Q66 -11 57 -1T48 
16Q48 26 85 176T158 471L195 616Q196 629 188 632T149 637H144Q134 637 131 637T124 640T121 647Z"/>&lt;g transform="translate(521,0)">&lt;path stroke-width="1" d="M94 250Q94 319 104 381T127 488T164 576T202 643T244 695T277 729T302 750H315H319Q333 750 333 741Q333 738 316 720T275 667T226 581T184 443T167 250T184 58T225 -81T274 -167T316 -220T333 -241Q333 -250 318 -250H315H302L274 -226Q180 -141 137 -14T94 250Z"/>&lt;/g>&lt;g transform="translate(911,0)">&lt;path stroke-width="1" d="M74 282H63Q43 282 43 296Q43 298 45 307T56 332T76 365T110 401T159 433Q200 451 233 451H236Q273 451 282 450Q358 437 382 400L392 410Q434 452 483 452Q538 452 568 421T599 346Q599 303 573 280T517 256Q494 256 478 270T462 308Q462 343 488 367Q501 377 520 385Q520 386 516 389T502 396T480 400T462 398Q429 383 415 341Q354 116 354 80T405 44Q449 44 485 74T535 142Q539 156 542 159T562 162H568H579Q599 162 599 148Q599 135 586 111T550 60T485 12T397 -8Q313 -8 266 35L258 44Q215 -7 161 -7H156Q99 -7 71 25T43 95Q43 143 70 165T125 188Q148 188 164 174T180 136Q180 101 154 77Q141 67 122 59Q124 54 136 49T161 43Q183 43 200 61T226 103Q287 328 287 364T236 400Q200 400 164 377T107 302Q103 288 100 285T80 282H74Z"/>&lt;g transform="translate(659,-150)">&lt;path stroke-width="1" transform="scale(0.707)" d="M184 600Q184 624 203 642T247 661Q265 661 277 649T290 619Q290 596 270 577T226 557Q211 557 198 567T184 600ZM21 287Q21 295 30 318T54 369T98 420T158 442Q197 442 223 419T250 357Q250 340 236 301T196 196T154 83Q149 61 149 51Q149 26 166 26Q175 26 185 29T208 43T235 78T260 137Q263 149 265 151T282 153Q302 153 302 143Q302 135 293 112T268 61T223 11T161 -11Q129 -11 102 10T74 74Q74 91 79 106T122 220Q160 321 166 341T173 380Q173 404 156 404H154Q124 404 99 371T61 287Q60 286 59 284T58 281T56 279T53 278T49 278T41 278H27Q21 284 21 287Z"/>&lt;/g>&lt;/g>&lt;g transform="translate(1914,0)">&lt;path stroke-width="1" d="M78 35T78 60T94 103T137 121Q165 121 187 96T210 8Q210 -27 201 -60T180 -117T154 -158T130 -185T117 -194Q113 -194 104 -185T95 -172Q95 -168 106 
-156T131 -126T157 -76T173 -3V9L172 8Q170 7 167 6T161 3T152 1T140 0Q113 0 96 17Z"/>&lt;/g>&lt;g transform="translate(2359,0)">&lt;path stroke-width="1" d="M74 282H63Q43 282 43 296Q43 298 45 307T56 332T76 365T110 401T159 433Q200 451 233 451H236Q273 451 282 450Q358 437 382 400L392 410Q434 452 483 452Q538 452 568 421T599 346Q599 303 573 280T517 256Q494 256 478 270T462 308Q462 343 488 367Q501 377 520 385Q520 386 516 389T502 396T480 400T462 398Q429 383 415 341Q354 116 354 80T405 44Q449 44 485 74T535 142Q539 156 542 159T562 162H568H579Q599 162 599 148Q599 135 586 111T550 60T485 12T397 -8Q313 -8 266 35L258 44Q215 -7 161 -7H156Q99 -7 71 25T43 95Q43 143 70 165T125 188Q148 188 164 174T180 136Q180 101 154 77Q141 67 122 59Q124 54 136 49T161 43Q183 43 200 61T226 103Q287 328 287 364T236 400Q200 400 164 377T107 302Q103 288 100 285T80 282H74Z"/>&lt;g transform="translate(659,-150)">&lt;path stroke-width="1" transform="scale(0.707)" d="M297 596Q297 627 318 644T361 661Q378 661 389 651T403 623Q403 595 384 576T340 557Q322 557 310 567T297 596ZM288 376Q288 405 262 405Q240 405 220 393T185 362T161 325T144 293L137 279Q135 278 121 278H107Q101 284 101 286T105 299Q126 348 164 391T252 441Q253 441 260 441T272 442Q296 441 316 432Q341 418 354 401T367 348V332L318 133Q267 -67 264 -75Q246 -125 194 -164T75 -204Q25 -204 7 -183T-12 -137Q-12 -110 7 -91T53 -71Q70 -71 82 -81T95 -112Q95 -148 63 -167Q69 -168 77 -168Q111 -168 139 -140T182 -74L193 -32Q204 11 219 72T251 197T278 308T289 365Q289 372 288 376Z"/>&lt;/g>&lt;/g>&lt;g transform="translate(3411,0)">&lt;path stroke-width="1" d="M60 749L64 750Q69 750 74 750H86L114 726Q208 641 251 514T294 250Q294 182 284 119T261 12T224 -76T186 -143T145 -194T113 -227T90 -246Q87 -249 86 -250H74Q66 -250 63 -250T58 -247T55 -238Q56 -237 66 -225Q221 -64 221 250T66 725Q56 737 55 738Q55 746 60 749Z"/>&lt;/g>&lt;/g>&lt;/svg>&lt;/span>&lt;script type="math/tex" id="MathJax-Element-8">k(\boldsymbol{x}_i, \boldsymbol{x}_j) 
&lt;/script>&lt;/div>&lt;/div>&lt;/div>&lt;/foreignObject>&lt;text x="448" y="157" fill="#4D9900" font-family="Helvetica" font-size="27px" text-anchor="middle">Kernel function&amp;hellip;&lt;/text>&lt;/switch>&lt;/g>&lt;rect x="190.5" y="524" width="230" height="50" fill="none" stroke="none" pointer-events="all"/>&lt;g transform="translate(-0.5 -0.5)">&lt;switch>&lt;foreignObject style="overflow: visible; text-align: left;" pointer-events="none" width="100%" height="100%" requiredFeatures="http://www.w3.org/TR/SVG11/feature#Extensibility">&lt;div xmlns="http://www.w3.org/1999/xhtml" style="display: flex; align-items: unsafe center; justify-content: unsafe center; width: 228px; height: 1px; padding-top: 549px; margin-left: 192px;">&lt;div style="box-sizing: border-box; font-size: 0; text-align: center; ">&lt;div style="display: inline-block; font-size: 20px; font-family: Helvetica; color: #FF0000; line-height: 1.2; pointer-events: all; white-space: normal; word-wrap: normal; ">1. explicit transformation&lt;/div>&lt;/div>&lt;/div>&lt;/foreignObject>&lt;text x="306" y="555" fill="#FF0000" font-family="Helvetica" font-size="20px" text-anchor="middle">1. 
explicit transformat&amp;hellip;&lt;/text>&lt;/switch>&lt;/g>&lt;path d="M 550.5 204 L 545.5 204 Q 540.5 204 540.5 214 L 540.5 624 Q 540.5 634 535.5 634 L 533 634 Q 530.5 634 535.5 634 L 538 634 Q 540.5 634 540.5 644 L 540.5 1054 Q 540.5 1064 545.5 1064 L 550.5 1064" fill="none" stroke="#ff0000" stroke-width="2" stroke-miterlimit="10" transform="rotate(-90,540.5,634)" pointer-events="all"/>&lt;rect x="270.5" y="644" width="570" height="30" fill="none" stroke="none" pointer-events="all"/>&lt;g transform="translate(-0.5 -0.5)">&lt;switch>&lt;foreignObject style="overflow: visible; text-align: left;" pointer-events="none" width="100%" height="100%" requiredFeatures="http://www.w3.org/TR/SVG11/feature#Extensibility">&lt;div xmlns="http://www.w3.org/1999/xhtml" style="display: flex; align-items: unsafe center; justify-content: unsafe center; width: 568px; height: 1px; padding-top: 659px; margin-left: 272px;">&lt;div style="box-sizing: border-box; font-size: 0; text-align: center; ">&lt;div style="display: inline-block; font-size: 20px; font-family: Helvetica; color: #FF0000; line-height: 1.2; pointer-events: all; font-style: italic; white-space: normal; word-wrap: normal; ">computationally expensive for high-dimensional feature vector&lt;/div>&lt;/div>&lt;/div>&lt;/foreignObject>&lt;text x="556" y="665" fill="#FF0000" font-family="Helvetica" font-size="20px" text-anchor="middle" font-style="italic">computationally expensive for high-dimensional feature ve&amp;hellip;&lt;/text>&lt;/switch>&lt;/g>&lt;rect x="560.5" y="524" width="230" height="50" fill="none" stroke="none" pointer-events="all"/>&lt;g transform="translate(-0.5 -0.5)">&lt;switch>&lt;foreignObject style="overflow: visible; text-align: left;" pointer-events="none" width="100%" height="100%" requiredFeatures="http://www.w3.org/TR/SVG11/feature#Extensibility">&lt;div xmlns="http://www.w3.org/1999/xhtml" style="display: flex; align-items: unsafe center; justify-content: unsafe center; width: 228px; height: 1px; 
padding-top: 549px; margin-left: 562px;">&lt;div style="box-sizing: border-box; font-size: 0; text-align: center; ">&lt;div style="display: inline-block; font-size: 20px; font-family: Helvetica; color: #FF0000; line-height: 1.2; pointer-events: all; white-space: normal; word-wrap: normal; ">2. inner product&lt;/div>&lt;/div>&lt;/div>&lt;/foreignObject>&lt;text x="676" y="555" fill="#FF0000" font-family="Helvetica" font-size="20px" text-anchor="middle">2. inner product&lt;/text>&lt;/switch>&lt;/g>&lt;path d="M 458 -386 L 453 -386 Q 448 -386 448 -376 L 448 44 Q 448 54 443 54 L 440.5 54 Q 438 54 443 54 L 445.5 54 Q 448 54 448 64 L 448 484 Q 448 494 453 494 L 458 494" fill="none" stroke="#4d9900" stroke-width="2" stroke-miterlimit="10" transform="rotate(90,448,54)" pointer-events="all"/>&lt;rect x="190.5" y="14" width="520" height="20" fill="none" stroke="none" pointer-events="all"/>&lt;g transform="translate(-0.5 -0.5)">&lt;switch>&lt;foreignObject style="overflow: visible; text-align: left;" pointer-events="none" width="100%" height="100%" requiredFeatures="http://www.w3.org/TR/SVG11/feature#Extensibility">&lt;div xmlns="http://www.w3.org/1999/xhtml" style="display: flex; align-items: unsafe center; justify-content: unsafe center; width: 518px; height: 1px; padding-top: 24px; margin-left: 192px;">&lt;div style="box-sizing: border-box; font-size: 0; text-align: center; ">&lt;div style="display: inline-block; font-size: 20px; font-family: Helvetica; color: #4D9900; line-height: 1.2; pointer-events: all; font-weight: bold; font-style: italic; white-space: normal; word-wrap: normal; ">avoids explicit mapping &lt;span class="MathJax_Preview" style="">&lt;/span>&lt;span class="MathJax_SVG" id="MathJax-Element-9-Frame" tabindex="0" style="font-size: 100%; display: inline-block;">&lt;svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" width="2.324ex" height="1.54ex" viewBox="0 -578.8 1000.5 663.2" role="img" focusable="false" 
style="vertical-align: -0.196ex;">&lt;g stroke="currentColor" fill="currentColor" stroke-width="0" transform="matrix(1 0 0 -1 0 0)">&lt;path stroke-width="1" d="M580 514Q580 525 596 525Q601 525 604 525T609 525T613 524T615 523T617 520T619 517T622 512Q659 438 720 381T831 300T927 263Q944 258 944 250T935 239T898 228T840 204Q696 134 622 -12Q618 -21 615 -22T600 -24Q580 -24 580 -17Q580 -13 585 0Q620 69 671 123L681 133H70Q56 140 56 153Q56 168 72 173H725L735 181Q774 211 852 250Q851 251 834 259T789 283T735 319L725 327H72Q56 332 56 347Q56 360 70 367H681L671 377Q638 412 609 458T580 514Z"/>&lt;/g>&lt;/svg>&lt;/span>&lt;script type="math/tex" id="MathJax-Element-9">\Rightarrow&lt;/script> computationally cheaper&lt;/div>&lt;/div>&lt;/div>&lt;/foreignObject>&lt;text x="451" y="30" fill="#4D9900" font-family="Helvetica" font-size="20px" text-anchor="middle" font-weight="bold" font-style="italic">avoids explicit mapping \Rightarrow computationally&amp;hellip;&lt;/text>&lt;/switch>&lt;/g>&lt;/g>&lt;switch>&lt;g requiredFeatures="http://www.w3.org/TR/SVG11/feature#Extensibility"/>&lt;a transform="translate(0,-5)" xlink:href="https://desk.draw.io/support/solutions/articles/16000042487" target="_blank">&lt;text text-anchor="middle" font-size="10px" x="50%" y="100%">Viewer does not support full SVG 1.1&lt;/text>&lt;/a>&lt;/switch>&lt;/svg>&lt;/p></description></item><item><title>SVM: Kernelized SVM</title><link>https://haobin-tan.netlify.app/docs/ai/machine-learning/classification/svm/kernelized-svm/</link><pubDate>Mon, 13 Jul 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/machine-learning/classification/svm/kernelized-svm/</guid><description>&lt;h2 id="svm-with-features">SVM (with features)&lt;/h2>
&lt;ul>
&lt;li>
&lt;p>Maximum margin principle&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Slack variables allow for margin violation
&lt;/p>
$$
\begin{array}{ll} \underset{\mathbf{w}}{\operatorname{argmin}} \quad &amp;\|\mathbf{w}\|^{2} + C \sum_i^N \xi_i \\\\ \text { s.t. } \quad &amp; y_{i}\left(\mathbf{w}^{T} \color{red}{\phi(\mathbf{x}_{i})} + b\right) \geq 1 -\xi_i, \quad \xi_i \geq 0\end{array}
$$
&lt;/li>
&lt;/ul>
&lt;h2 id="math-basics">Math basics&lt;/h2>
&lt;p>Solve the constrained optimization problem: &lt;strong>Method of Lagrangian Multipliers&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Primal optimization problem&lt;/strong>:&lt;/li>
&lt;/ul>
$$
\begin{array}{ll}
\underset{\boldsymbol{x}}{\min} \quad &amp; f(\boldsymbol{x}) \\\\
\text { s.t. } \quad &amp; h_{i}(\boldsymbol{x}) \geq b_{i}, \text { for } i=1 \ldots K
\end{array}
$$
&lt;ul>
&lt;li>&lt;strong>Lagrangian optimization&lt;/strong>:&lt;/li>
&lt;/ul>
$$
\begin{array}{ll}
\underset{\boldsymbol{x}}{\min} \underset{\boldsymbol{\lambda}}{\max} \quad &amp; L(\boldsymbol{x}, \boldsymbol{\lambda}) = f(\boldsymbol{x}) - \sum_{i=1}^K \lambda_i(h_i(\boldsymbol{x}) - b_i) \\\\
\text{ s.t. } &amp;\lambda_i\geq 0, \quad i = 1\dots K
\end{array}
$$
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Dual optimization problem&lt;/strong>
&lt;/p>
$$
\begin{aligned}
\boldsymbol{\lambda}^{\*}=\underset{\boldsymbol{\lambda}}{\arg \max } g(\boldsymbol{\lambda}), \quad &amp; g(\boldsymbol{\lambda})=\min \_{\boldsymbol{x}} L(\boldsymbol{x}, \boldsymbol{\lambda}) \\\\
\text { s.t. } \quad \lambda_{i} \geq 0, &amp; \text { for } i=1 \ldots K
\end{aligned}
$$
&lt;ul>
&lt;li>$g$ : &lt;strong>dual function&lt;/strong> of the optimization problem&lt;/li>
&lt;li>Essentially swapped min and max in the definition of $L$&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Slater&amp;rsquo;s condition:&lt;/strong> For a &lt;strong>convex&lt;/strong> objective and &lt;strong>convex&lt;/strong> constraints (with a strictly feasible point), &lt;strong>solving the dual is equivalent to solving the primal&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>I.e., optimal primal parameters can be obtained from optimal dual parameters
$$
\boldsymbol{x}^* = \underset{\boldsymbol{x}}{\operatorname{argmin}}L(\boldsymbol{x}, \boldsymbol{\lambda}^*)
$$&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
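&lt;p>A minimal numeric sketch of this primal-dual relationship (the problem below is a made-up example): minimize $f(x) = x^2$ subject to $x \geq 1$. Here $g(\lambda) = \min_x \left[x^2 - \lambda(x - 1)\right] = -\frac{\lambda^2}{4} + \lambda$, maximized at $\lambda^* = 2$, and $x^* = \lambda^*/2 = 1$ recovers the primal optimum:&lt;/p>

```python
import numpy as np

# Primal: min f(x) = x^2  s.t.  x >= 1   (optimum: x* = 1, f(x*) = 1)
# Lagrangian: L(x, lam) = x^2 - lam * (x - 1)
# Dual function: g(lam) = min_x L(x, lam), with inner minimizer x = lam / 2

def g(lam):
    x = lam / 2.0
    return x ** 2 - lam * (x - 1.0)

lams = np.linspace(0.0, 4.0, 4001)    # lambda >= 0
lam_star = lams[np.argmax(g(lams))]   # dual optimum: lambda* = 2
x_star = lam_star / 2.0               # primal recovered from dual (Slater holds)

print(lam_star, x_star, g(lam_star))  # 2.0 1.0 1.0
```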
&lt;h2 id="dual-derivation-of-the-svm">Dual derivation of the SVM&lt;/h2>
&lt;ol>
&lt;li>
&lt;p>SVM optimization:
&lt;/p>
$$
\begin{array}{ll}
\underset{\boldsymbol{w}}{\operatorname{argmin}} \quad &amp;\frac{1}{2}\|\boldsymbol{w}\|^2 \\\\
\text{ s.t. } \quad &amp;y_i(\boldsymbol{w}^T\phi(\mathbf{x}_i) + b) \geq 1
\end{array}
$$
&lt;/li>
&lt;li>
&lt;p>Lagrangian function:
&lt;/p>
$$
L(\boldsymbol{w}, \boldsymbol{\alpha})=\frac{1}{2} \boldsymbol{w}^{T} \boldsymbol{w}-\sum_{i} \alpha_{i}\left(y_{i}\left(\boldsymbol{w}^{T} \phi\left(\boldsymbol{x}_{i}\right)+b\right)-1\right)
$$
&lt;/li>
&lt;li>
&lt;p>Compute optimal $\boldsymbol{w}$
&lt;/p>
$$
\begin{align}
&amp;\frac{\partial L}{\partial \boldsymbol{w}} = \boldsymbol{w} - \sum_i \alpha_i y_i \phi(\boldsymbol{x}_i) \overset{!}{=} 0 \\\\
\Leftrightarrow \quad &amp; \color{CornflowerBlue}{\boldsymbol{w}^* = \sum_i \alpha_i y_i \phi(\boldsymbol{x}_i)}
\end{align}
$$
&lt;ul>
&lt;li>
&lt;p>Many of the $\alpha_i$ will be zero (the corresponding constraint holds with strict inequality)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>If $\alpha_i \neq 0 \overset{\text{complementary slackness}}{\Rightarrow} y_{i}\left(\boldsymbol{w}^{T} \phi\left(\boldsymbol{x}_{i}\right)+b\right)-1 =0$&lt;/p>
&lt;p>$\Rightarrow \phi(\boldsymbol{x}_i)$ is a support vector&lt;/p>
&lt;/li>
&lt;li>
&lt;p>The optimal weight vector $\boldsymbol{w}$ is a &lt;strong>linear combination of the support vectors&lt;/strong>! 👏&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Optimality condition for $b$:
&lt;/p>
$$
\frac{\partial L}{\partial b} = - \sum_i \alpha_i y_i \overset{!}{=} 0 \quad \Rightarrow \sum_i \alpha_i y_i = 0
$$
&lt;ul>
&lt;li>We do not obtain a solution for $b$&lt;/li>
&lt;li>But an additional condition on the $\alpha_i$&lt;/li>
&lt;/ul>
&lt;p>$b$ can be computed from $w$:&lt;/p>
&lt;p>If $\alpha\_i > 0$, then $\boldsymbol{x}\_i$ lies on the margin due to the complementary slackness condition, i.e.:
&lt;/p>
$$
\begin{align}y_{i}\left(\boldsymbol{w}^{T} \phi\left(\boldsymbol{x}_{i}\right)+b\right)-1 &amp;= 0 \\\\y_{i}\left(\boldsymbol{w}^{T} \phi\left(\boldsymbol{x}_{i}\right)+b\right) &amp;= 1 \\\\ \underbrace{y_{i} y_{i}}_{=1}\left(\boldsymbol{w}^{T} \phi\left(\boldsymbol{x}_{i}\right)+b\right) &amp;= y_{i} \\\\ \Rightarrow b = y_{i} - \boldsymbol{w}^{T} \phi\left(\boldsymbol{x}_{i}\right)\end{align}
$$
&lt;/li>
&lt;/ol>
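&lt;p>The derivation can be checked numerically on a toy problem (the data below is made up for illustration): two 1-D points $x_1 = 1, y_1 = +1$ and $x_2 = -1, y_2 = -1$, with $\phi$ the identity. The condition $\sum_i \alpha_i y_i = 0$ forces $\alpha_1 = \alpha_2$, the dual reduces to $g(\alpha) = 2\alpha - 2\alpha^2$ with maximum at $\alpha^* = \frac{1}{2}$, giving $\boldsymbol{w}^* = 1$, $b = 0$, and both points exactly on the margin:&lt;/p>

```python
import numpy as np

# Toy hard-margin SVM: x1 = +1 (y1 = +1), x2 = -1 (y2 = -1), phi = identity
x = np.array([1.0, -1.0])
y = np.array([1.0, -1.0])
K = np.outer(x, x)                     # linear kernel k(xi, xj) = xi * xj

def g(a):
    alpha = np.array([a, a])           # sum_i alpha_i y_i = 0 forces equality
    v = alpha * y
    return alpha.sum() - 0.5 * v @ K @ v

grid = np.linspace(0.0, 1.0, 1001)
a_star = grid[np.argmax([g(a) for a in grid])]   # 0.5

w_star = np.sum(a_star * y * x)        # w* = sum_i alpha_i y_i x_i = 1
b_star = y[0] - w_star * x[0]          # b = y_i - w* x_i = 0 (support vector)

print(w_star, b_star)                  # 1.0 0.0
print(y * (w_star * x + b_star))       # [1. 1.]  -> both points on the margin
```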
&lt;h2 id="apply-kernel-tricks-for-svm">Apply kernel tricks for SVM&lt;/h2>
&lt;ul>
&lt;li>Lagrangian:&lt;/li>
&lt;/ul>
$$
L(\boldsymbol{w}, \boldsymbol{\alpha}) = {\color{red}{\frac{1}{2} \boldsymbol{w}^{T} \boldsymbol{w}}} - \sum_{i} \alpha\_{i}\left({\color{green}{y\_{i} (w^{T} \phi\left(x_{i}\right)}}+ b)-\color{CornflowerBlue}{1}\right), \quad \boldsymbol{w}^{\*}=\sum\_{i} \alpha_{i} y\_{i} \phi\left(\boldsymbol{x}\_{i}\right)
$$
&lt;ul>
&lt;li>Dual function (&lt;strong>Wolfe Dual Lagrangian function&lt;/strong>):&lt;/li>
&lt;/ul>
$$
\begin{aligned}
g(\boldsymbol{\alpha}) &amp;=L\left(\boldsymbol{w}^{*}, \boldsymbol{\alpha}\right) \\\\
&amp;=\color{red}{\frac{1}{2} \underbrace{\sum_{i} \sum_{j} \alpha_{i} \alpha_{j} y_{i} y_{j} \phi\left(\boldsymbol{x}_{i}\right)^{T} \phi\left(\boldsymbol{x}_{j}\right)}_{{\boldsymbol{w}^*}^T \boldsymbol{w}^*}} - \color{green}{\sum_{i} \alpha_{i} y_{i}(\underbrace{\sum_{j} \alpha_{j} y_{j} \phi\left(x_{j}\right)}_{\boldsymbol{w}^*})^{T} \phi\left(x_{i}\right)} + \color{CornflowerBlue}{\sum_{i} \alpha_{i}} \\\\
&amp;=\sum_{i} \alpha_{i}-\frac{1}{2} \sum_{i} \sum_{j} \alpha_{i} \alpha_{j} y_{i} y_{j} \underbrace{\phi\left(\boldsymbol{x}_{i}\right)^{T} \phi\left(\boldsymbol{x}_{j}\right)}_{\overset{}{=} \boldsymbol{k}(\boldsymbol{x}_i, \boldsymbol{x}_j)} \\\\
&amp;= \sum_{i} \alpha_{i}-\frac{1}{2} \sum_{i} \sum_{j} \alpha_{i} \alpha_{j} y_{i} y_{j} \boldsymbol{k}(\boldsymbol{x}_i, \boldsymbol{x}_j )
\end{aligned}
$$
&lt;ul>
&lt;li>&lt;strong>Wolfe dual optimization problem&lt;/strong>:&lt;/li>
&lt;/ul>
$$
\begin{array}{ll}
\underset{\boldsymbol{\alpha}}{\max} \quad &amp; \sum_{i} \alpha_{i}-\frac{1}{2} \sum_{i} \sum_{j} \alpha_{i} \alpha_{j} y_{i} y_{j} \boldsymbol{k}(\boldsymbol{x}_i, \boldsymbol{x}_j ) \\\\
\text{ s.t. } \quad &amp; \alpha_i \geq 0 \quad \forall i = 1, \dots, N \\\\
&amp; \sum_i \alpha_i y_i = 0
\end{array}
$$
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Compute primal from dual parameters&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Weight vector&lt;/strong>
&lt;/p>
$$
\boldsymbol{w}^{*}=\sum_{i} \alpha_{i} y_{i} \phi\left(\boldsymbol{x}_{i}\right)
\label{eq:weight vector}
$$
&lt;ul>
&lt;li>Cannot be represented explicitly (as $\phi$ may map into an infinite-dimensional space). But don&amp;rsquo;t worry, we don&amp;rsquo;t need the explicit representation&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Bias&lt;/strong>: For any $i$ with $\alpha_i > 0$ :&lt;/p>
&lt;/li>
&lt;/ul>
$$
\begin{array}{ll}
b &amp;=y_{k}-\mathbf{w}^{T} \phi\left(\boldsymbol{x}_{k}\right) \\\\
&amp;=y_{k}-\sum_{i} y_{i} \alpha_{i} k\left(\boldsymbol{x}_{i}, \boldsymbol{x}_{k}\right)
\end{array}
$$
&lt;ul>
&lt;li>&lt;strong>Decision function&lt;/strong> (Again, we use the kernel trick and therefore we don&amp;rsquo;t need the explicit representation of the weight vector $\boldsymbol{w}^*$)&lt;/li>
&lt;/ul>
$$
\begin{aligned}f(\boldsymbol{x}) &amp;= (\boldsymbol{w}^{*})^{T} \boldsymbol{\phi}(\boldsymbol{x}) + b \\\\
&amp;\overset{}{=} \left(\sum_{i} \alpha_{i} y_{i} \phi\left(\boldsymbol{x}_{i}\right)\right)^{T} \boldsymbol{\phi}(\boldsymbol{x}) + b \\\\
&amp;= \sum_{i} \alpha_{i} y_{i} \boldsymbol{\phi}(\boldsymbol{x}_i)^{T} \boldsymbol{\phi}(\boldsymbol{x}) + b \\\\
&amp; \overset{}{=}\sum_i y_{i} \alpha_{i} k\left(\boldsymbol{x}_{i}, \boldsymbol{x}\right)+b\end{aligned}
$$
&lt;/li>
&lt;/ul>
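&lt;p>In code, the kernelized decision function is just a weighted sum of kernel evaluations against the support vectors. A sketch (the support vectors, dual coefficients, and the RBF kernel width below are made-up values; in practice they come from a dual solver):&lt;/p>

```python
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    # k(a, b) = exp(-gamma * ||a - b||^2); the corresponding phi is never formed
    return np.exp(-gamma * np.sum((a - b) ** 2))

def decision_function(x, X_sv, y_sv, alpha_sv, b, gamma=1.0):
    # f(x) = sum_i alpha_i y_i k(x_i, x) + b -- no explicit phi(x) needed
    k_vals = np.array([rbf_kernel(xi, x, gamma) for xi in X_sv])
    return np.sum(alpha_sv * y_sv * k_vals) + b

# Hypothetical support vectors and dual coefficients, for illustration only:
X_sv = np.array([[0.0, 1.0], [1.0, 0.0]])
y_sv = np.array([1.0, -1.0])
alpha_sv = np.array([0.8, 0.8])
b = 0.0

x_new = np.array([0.1, 0.9])           # closer to the positive support vector
print(np.sign(decision_function(x_new, X_sv, y_sv, alpha_sv, b)))  # 1.0
```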
&lt;h2 id="relaxed-constraints-with-slack-variable">Relaxed constraints with slack variable&lt;/h2>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Primal optimization problem&lt;/strong>
&lt;/p>
$$
\begin{array}{ll} \underset{\mathbf{w}}{\operatorname{argmin}} \quad &amp;\|\mathbf{w}\|^{2} + \color{CornflowerBlue}{C \sum_i^N \xi_i} \\\\
\text { s.t. } \quad &amp; y_{i}\left(\mathbf{w}^{T} \phi(\mathbf{x}_{i}) + b\right) \geq 1 - \color{CornflowerBlue}{\xi_i}, \quad \color{CornflowerBlue}{\xi_i} \geq 0\end{array}
$$
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Dual optimization problem&lt;/strong>
&lt;/p>
$$
\begin{array}{ll}\underset{\boldsymbol{\alpha}}{\max} \quad &amp; \sum_{i} \alpha_{i}-\frac{1}{2} \sum_{i} \sum_{j} \alpha_{i} \alpha_{j} y_{i} y_{j} \boldsymbol{k}(\boldsymbol{x}_i, \boldsymbol{x}_j ) \\\\ \text{ s.t. } \quad &amp; \color{CornflowerBlue}{C \geq} \alpha_i \geq 0 \quad \forall i = 1, \dots, N \\\\ &amp; \sum_i \alpha_i y_i = 0\end{array}
$$
&lt;p>&lt;span style="color:CornflowerBlue">Add upper bound of &lt;/span> $\color{CornflowerBlue}{C}$ &lt;span style="color:CornflowerBlue">on&lt;/span> $\color{CornflowerBlue}{\alpha_i}$&lt;/p>
&lt;ul>
&lt;li>Without slack, $\alpha_i \to \infty$ when constraints are violated (points misclassified)&lt;/li>
&lt;li>The upper bound $C$ limits the $\alpha_i$, so some misclassifications are tolerated&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul></description></item><item><title>Classification And Regression Tree (CART)</title><link>https://haobin-tan.netlify.app/docs/ai/machine-learning/decision-tree/cart/</link><pubDate>Tue, 27 Oct 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/machine-learning/decision-tree/cart/</guid><description>&lt;h2 id="tree-based-methods">Tree-based Methods&lt;/h2>
&lt;p>&lt;strong>CART&lt;/strong>: &lt;strong>C&lt;/strong>lassification &lt;strong>A&lt;/strong>nd &lt;strong>R&lt;/strong>egression &lt;strong>T&lt;/strong>ree&lt;/p>
&lt;h3 id="grow-a-binary-tree">Grow a binary tree&lt;/h3>
&lt;ul>
&lt;li>At each node, “split” the data into two “daughter” nodes.&lt;/li>
&lt;li>Splits are chosen using a splitting criterion.&lt;/li>
&lt;li>Bottom nodes are “terminal” nodes.&lt;/li>
&lt;/ul>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>&lt;/th>
&lt;th>Type of tree&lt;/th>
&lt;th>Predicted value at a node&lt;/th>
&lt;th>Split criterion&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>&lt;strong>Regression&lt;/strong>&lt;/td>
&lt;td>Regression tree&lt;/td>
&lt;td>The predicted value at a node is the &lt;strong>average response&lt;/strong> variable for all observations in the node&lt;/td>
&lt;td>&lt;strong>Minimum residual sum of squares&lt;/strong> &lt;br />$$\mathrm{RSS}=\sum_{\text {left }}\left(y_{i}-\bar{y}_{L}\right)^{2}+\sum_{\text {right }}\left(y_{i}-\bar{y}_{R}\right)^{2}$$&lt;li />$\bar{y}_L$ / $\bar{y}_R$: average label values in the left / right subtree &lt;br />(Split such that variance in subtrees is minimized)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Classification&lt;/strong>&lt;/td>
&lt;td>Decision tree&lt;/td>
&lt;td>The predicted class is the &lt;strong>most common class&lt;/strong> in the node (majority vote).&lt;/td>
&lt;td>&lt;strong>Minimum entropy&lt;/strong> in subtrees&lt;br />$$\text { score }=N_{L} H\left(p_{\mathrm{L}}\right)+N_{R} H\left(p_{\mathrm{R}}\right)$$&lt;li />$H\left(p_{L}\right)=-\sum_{k} p_{L}(k) \log p_{L}(k)$: entropy in the left sub-tree &lt;li /> $p_L(k)$: proportion of class $k$ in left tree&lt;br />(Split such that class-labels in sub-trees are &amp;ldquo;pure&amp;rdquo;)&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
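&lt;p>Both split criteria are straightforward to compute; a sketch in NumPy (function names are ours, not from the text):&lt;/p>

```python
import numpy as np

def rss_score(y_left, y_right):
    # Regression: sum of squared residuals around each subtree's mean label
    def rss(y):
        return np.sum((y - y.mean()) ** 2)
    return rss(y_left) + rss(y_right)

def entropy(labels):
    # H(p) = -sum_k p(k) log p(k), with p(k) the proportion of class k
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log(p)))

def entropy_score(labels_left, labels_right):
    # Classification: score = N_L * H(p_L) + N_R * H(p_R)
    return (len(labels_left) * entropy(labels_left)
            + len(labels_right) * entropy(labels_right))

# A perfectly pure class split scores zero; mixed splits score higher
print(entropy_score(np.array([0, 0]), np.array([1, 1])))
print(rss_score(np.array([1.0, 1.0]), np.array([5.0, 9.0])))  # 0 + 8 = 8.0
```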
&lt;h3 id="when-stop">When stop?&lt;/h3>
&lt;p>&lt;strong>Stop if:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Minimum number of samples per node&lt;/li>
&lt;li>Maximum depth&lt;/li>
&lt;/ul>
&lt;p>&amp;hellip; has been reached&lt;/p>
&lt;p>(Both criteria again influence the &lt;strong>complexity&lt;/strong> of the tree)&lt;/p>
&lt;h3 id="controlling-the-tree-complexity">Controlling the tree complexity&lt;/h3>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Number of samples per leaf&lt;/th>
&lt;th>Effect&lt;/th>
&lt;th>&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>&lt;strong>Small&lt;/strong>&lt;/td>
&lt;td>Tree is &lt;strong>very sensitive&lt;/strong> to noise&lt;/td>
&lt;td>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/屏幕快照%202020-03-01%2023.26.23.png" alt="屏幕快照 2020-03-01 23.26.23" style="zoom:33%;" />&lt;br />&lt;img src="https://github.com/EckoTan0804/upic-repo/blob/master/uPic/%E5%B1%8F%E5%B9%95%E5%BF%AB%E7%85%A7%202020-03-01%2023.25.40.png?raw=true" alt="屏幕快照 2020-03-01 23.25.40.png" style="zoom:33%;" />&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Large&lt;/strong>&lt;/td>
&lt;td>Tree is &lt;strong>not expressive enough&lt;/strong>&lt;/td>
&lt;td>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/屏幕快照%202020-03-01%2023.25.50.png" alt="屏幕快照 2020-03-01 23.25.50" style="zoom:33%;" />&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h3 id="advantages-">Advantages 👍&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>Applicable to both regression and classification problems.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Handle categorical predictors naturally.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Computationally simple and quick to fit, even for large problems.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>No formal distributional assumptions (non-parametric).&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Can handle highly non-linear interactions and classification boundaries.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Automatic variable selection.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Very easy to interpret if the tree is small.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h3 id="disadvantages-">Disadvantages 👎&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>&lt;em>&lt;strong>Accuracy&lt;/strong>&lt;/em>&lt;/p>
&lt;p>Current methods, such as support vector machines and ensemble classifiers, often have 30% lower error rates than CART.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;em>&lt;strong>Instability&lt;/strong>&lt;/em>&lt;/p>
&lt;p>If we change the data a little, the tree can change a lot, so the interpretation is not as straightforward as it appears.&lt;/p>
&lt;/li>
&lt;/ul></description></item><item><title>Linear Discriminant Functions</title><link>https://haobin-tan.netlify.app/docs/ai/machine-learning/non-parametric/linear-discriminant-functions/</link><pubDate>Sat, 07 Nov 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/machine-learning/non-parametric/linear-discriminant-functions/</guid><description>&lt;ul>
&lt;li>No assumption about distributions $\Rightarrow$ &lt;strong>non-parametric&lt;/strong>&lt;/li>
&lt;li>Linear decision surfaces&lt;/li>
&lt;li>Begin by supervised training (given class of training data)&lt;/li>
&lt;/ul>
&lt;h2 id="linear-discriminant-functions-and-decision-surfaces">Linear Discriminant Functions and Decision Surfaces&lt;/h2>
&lt;p>A discriminant function that is a linear combination of the components of $x$ can be written as
&lt;/p>
$$
g(\mathbf{x})=\mathbf{w}^{T} \mathbf{x}+w\_{0}
$$
&lt;ul>
&lt;li>$\mathbf{x}$: feature vector&lt;/li>
&lt;li>$\mathbf{w}$: weight vector&lt;/li>
&lt;li>$w\_0$: bias or threshold weight&lt;/li>
&lt;/ul>
&lt;h3 id="the-two-category-case">The two category case&lt;/h3>
&lt;p>Decision rule:&lt;/p>
&lt;ul>
&lt;li>Decide $w\_1$ if $g(\mathbf{x}) > 0 \Leftrightarrow \mathbf{w}^{T} \mathbf{x}+w\_{0} > 0 \Leftrightarrow \mathbf{w}^{T} \mathbf{x}> -w\_{0}$&lt;/li>
&lt;li>Decide $w\_{2}$ if $g(\mathbf{x}) &lt; 0 \Leftrightarrow \mathbf{w}^{T} \mathbf{x}+w\_{0} &lt; 0 \Leftrightarrow \mathbf{w}^{T} \mathbf{x}&lt;-w\_{0}$&lt;/li>
&lt;li>$g(\mathbf{x}) = 0$: assign to either class or can be left undefined&lt;/li>
&lt;/ul>
&lt;p>The equation $g(\mathbf{x}) = 0$ defines the decision surface that separates points assigned to $w\_{1}$ from points assigned to $w\_{2}$. When $g(\mathbf{x})$ is linear, this decision surface is a &lt;strong>hyperplane&lt;/strong>.&lt;/p>
&lt;p>For arbitrary $\mathbf{x}\_1$ and $\mathbf{x}\_2$ on the decision surface, we have:
&lt;/p>
$$
\mathbf{w}^{\mathrm{T}} \mathbf{x}\_{1}+w\_{0}=\mathbf{w}^{\mathrm{T}} \mathbf{x}\_{2}+w\_{0}
$$
$$
\mathbf{w}^{\mathrm{T}}\left(\mathbf{x}\_{1}-\mathbf{x}\_{2}\right)=0
$$
&lt;p>$\Rightarrow \mathbf{w}$ is &lt;strong>normal&lt;/strong> to any vector lying in the hyperplane.&lt;/p>
&lt;p>In general, the hyperplane $H$ divides the feature space into two half-spaces:&lt;/p>
&lt;ul>
&lt;li>decision region $R\_1$ for $w\_1$&lt;/li>
&lt;li>decision region $R\_2$ for $w\_2$&lt;/li>
&lt;/ul>
&lt;p>Because $g(\mathbf{x}) > 0$ if $\mathbf{x}$ in $R\_1$, it follows that the normal vector $\mathbf{w}$ points into $R\_1$. Therefore, it is sometimes said that any $\mathbf{x}$ in $R\_1$ is on the &lt;em>positive&lt;/em> side of $H$, and any $\mathbf{x}$ in $R\_2$ is on the &lt;em>negative&lt;/em> side of $H$.&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/image015.jpg" alt="img">&lt;/p>
&lt;p>The discriminant function $g(\mathbf{x})$ gives an algebraic measure of the distance from $\mathbf{x}$ to the hyperplane. We can write $\mathbf{x}$ as
&lt;/p>
$$
\mathbf{x}=\mathbf{x}\_{p}+r \frac{\mathbf{w}}{\|\mathbf{w}\|}
$$
&lt;ul>
&lt;li>$\mathbf{x}\_{p}$: normal projection of $\mathbf{x}$ onto $H$&lt;/li>
&lt;li>$r$: desired algebraic distance which is positive if $\mathbf{x}$ is on the positive side, else negative&lt;/li>
&lt;/ul>
&lt;p>As $\mathbf{x}\_p$ is on the hyperplane&lt;/p>
$$
\begin{array}{ll}
g\left(\mathbf{x}\_{p}\right)=0 \\\\
\mathbf{w}^{\mathrm{T}} \mathbf{x}\_{p}+w\_{0}=0 \\\\
\mathbf{w}^{\mathrm{T}}\left(\mathbf{x}-r \frac{\mathbf{w}}{\|\mathbf{w}\|}\right)+w\_{0}=0 \\\\
\mathbf{w}^{\mathrm{T}} \mathbf{x}-r \frac{\mathbf{w}^{\mathrm{T}} \mathbf{w}}{\|\mathbf{w}\|}+w\_{0}=0 \\\\
\mathbf{w}^{\mathrm{T}} \mathbf{x}-r\|\mathbf{w}\| + w\_0 = 0 \\\\
\underbrace{\mathbf{w}^{\mathrm{T}} \mathbf{x} + w\_0}\_{=g(\mathbf{x})} = r\|\mathbf{w}\| \\\\
\Rightarrow g(\mathbf{x}) = r\|\mathbf{w}\| \\\\
\Rightarrow r = \frac{g(\mathbf{x})}{\|\mathbf{w}\|}
\end{array}
$$
&lt;p>In particular, the signed distance from the origin to the hyperplane $H$ is given by $\frac{w\_0}{\|\mathbf{w}\|}$&lt;/p>
&lt;ul>
&lt;li>$w\_0 > 0$: the origin is on the &lt;em>positive&lt;/em> side of $H$&lt;/li>
&lt;li>$w\_0 &lt; 0$: the origin is on the &lt;em>negative&lt;/em> side of $H$&lt;/li>
&lt;li>$w\_0 = 0$: $g(\mathbf{x})$ has the homogeneous form $\mathbf{w}^{\mathrm{T}} \mathbf{x}$ and the hyperplane passes through the origin&lt;/li>
&lt;/ul>
&lt;p>A linear discriminant function divides the feature space by a hyperplane decision surface:&lt;/p>
&lt;ul>
&lt;li>orientation: determined by the normal vector $\mathbf{w}$&lt;/li>
&lt;li>location: determined by the bias $w\_0$&lt;/li>
&lt;/ul>
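&lt;p>A quick numeric sketch of these quantities (the values of $\mathbf{w}$ and $w\_0$ below are example choices, not from the text):&lt;/p>

```python
import numpy as np

w = np.array([3.0, 4.0])   # normal vector, ||w|| = 5 (example values)
w0 = -5.0                  # bias / threshold weight

def g(x):
    # linear discriminant g(x) = w^T x + w0
    return w @ x + w0

def signed_distance(x):
    # r = g(x) / ||w||: positive on the positive side of H, negative otherwise
    return g(x) / np.linalg.norm(w)

print(signed_distance(np.array([3.0, 4.0])))  # (25 - 5) / 5 = 4.0
print(w0 / np.linalg.norm(w))                 # origin at signed distance -1.0
```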
&lt;h2 id="reference">Reference&lt;/h2>
&lt;ul>
&lt;li>&lt;a href="https://www.byclb.com/TR/Tutorials/neural_networks/ch9_1.htm">https://www.byclb.com/TR/Tutorials/neural_networks/ch9_1.htm&lt;/a>&lt;/li>
&lt;/ul></description></item><item><title>Linear Discriminant Analysis (LDA)</title><link>https://haobin-tan.netlify.app/docs/ai/machine-learning/non-parametric/lda-summary/</link><pubDate>Sat, 07 Nov 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/machine-learning/non-parametric/lda-summary/</guid><description>&lt;p>&lt;strong>Linear Discriminant Analysis (LDA)&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>also called &lt;strong>Fisher’s Linear Discriminant&lt;/strong>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>reduces dimension (like PCA)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>but focuses on &lt;strong>maximizing separability among known categories&lt;/strong>&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h2 id="-idea">💡 Idea&lt;/h2>
&lt;ol>
&lt;li>Create a new axis&lt;/li>
&lt;li>Project the data onto this new axis in a way that maximizes the separation of the two categories&lt;/li>
&lt;/ol>
&lt;h2 id="how-it-works">How it works?&lt;/h2>
&lt;h3 id="create-a-new-axis">Create a new axis&lt;/h3>
&lt;p>According to two criteria (considered simultaneously):&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Maximize the distance between means&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Minimize the variation $s^2$ (which LDA calls &amp;ldquo;scatter&amp;rdquo;) within each category&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-05-14%2015.11.22.png" alt="截屏2020-05-14 15.11.22" style="zoom:50%;" />
&lt;/li>
&lt;/ul>
&lt;p>We have:
&lt;/p>
$$
\frac{(\overbrace{\mu_1 - \mu_2}^{=: d})^2}{s_1^2 + s_2^2} \qquad\left(\frac{\text{"ideally large"}}{\text{"ideally small"}}\right)
$$
&lt;p>&lt;strong>Why are both distance and scatter important?&lt;/strong>&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2020-05-14%2015.17.59.png" alt="截屏2020-05-14 15.17.59">&lt;/p>
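&lt;p>The ratio above can be evaluated directly for two classes of projected 1-D values. A minimal pure-Python sketch (the sample values are invented; following the LDA convention, scatter $s^2$ is the sum of squared deviations from the class mean):&lt;/p>

```python
def fisher_criterion(class1, class2):
    """(mu1 - mu2)^2 / (s1^2 + s2^2), where s^2 is the within-class
    scatter: the sum of squared deviations from the class mean."""
    def mean(xs):
        return sum(xs) / len(xs)

    def scatter(xs):
        m = mean(xs)
        return sum((x - m) ** 2 for x in xs)

    d = mean(class1) - mean(class2)
    return d ** 2 / (scatter(class1) + scatter(class2))

# Well-separated, tight classes give a large score ...
print(fisher_criterion([1.0, 1.1, 0.9], [5.0, 5.1, 4.9]))
# ... while overlapping, spread-out classes give a small one
print(fisher_criterion([1.0, 3.0, 5.0], [2.0, 4.0, 6.0]))
```

&lt;p>A good projection axis is one whose projected samples score high on this criterion: large distance between the means &lt;em>and&lt;/em> small scatter within each category.&lt;/p>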
&lt;h4 id="more-than-2-dimensions">More than 2 dimensions&lt;/h4>
&lt;p>The process is the &lt;strong>same&lt;/strong> 👏:&lt;/p>
&lt;p>Create an axis that maximizes the distance between the means for the two categories while minimizing the scatter&lt;/p>
&lt;h4 id="more-than-2-categories-eg-3-categories">More than 2 categories (e.g. 3 categories)&lt;/h4>
&lt;p>There is only a small difference:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Measure the distances among the means&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Find the point that is &lt;strong>central&lt;/strong> to all of the data&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Then measure the distance from the central point of each category to the overall central point&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-05-14%2015.26.35.png" alt="截屏2020-05-14 15.26.35" style="zoom:50%;" />
&lt;/li>
&lt;li>
&lt;p>Maximize the distance between each category and the central point while minimizing the scatter for each category&lt;/p>
&lt;/li>
&lt;/ul>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-05-14%2015.28.40.png" alt="截屏2020-05-14 15.28.40" style="zoom:50%;" />
&lt;/li>
&lt;li>
&lt;p>Create 2 axes to separate the data (because the 3 central points for each category define a plane)&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-05-14%2015.30.16.png" alt="截屏2020-05-14 15.30.16" style="zoom:50%;" />
&lt;/li>
&lt;/ul>
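&lt;p>The multi-category recipe above can be sketched numerically: find the central point of all the data, then relate the spread of the category means around it to the within-category scatter. A toy pure-Python example with three invented 1-D categories (real LDA works with vectors and scatter matrices, but the idea is the same):&lt;/p>

```python
def multiclass_criterion(classes):
    """Weighted sum of squared distances from each category mean to the
    overall central point, divided by the total within-category scatter."""
    def mean(xs):
        return sum(xs) / len(xs)

    all_points = [x for c in classes for x in c]
    center = mean(all_points)  # central point of all the data
    # Spread of the category means around the central point
    between = sum(len(c) * (mean(c) - center) ** 2 for c in classes)
    # Total scatter within each category
    within = sum(sum((x - mean(c)) ** 2 for x in c) for c in classes)
    return between / within

classes = [[1.0, 1.2, 0.8], [5.0, 5.2, 4.8], [9.0, 9.2, 8.8]]
print(multiclass_criterion(classes))  # large: well-separated categories
```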
&lt;h2 id="lda-and-pca">LDA and PCA&lt;/h2>
&lt;h3 id="similarities">Similarities&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Both rank the new axes in order of importance&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>PC1 (the first new axis that PCA creates) accounts for the most variation in the data
&lt;ul>
&lt;li>PC2 (the second new axis) does the second best job&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>LD1 (the first new axis that LDA creates) accounts for the most variation between the categories
&lt;ul>
&lt;li>LD2 does the second best job&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Both can let you dig in and see which features are driving the new axes&lt;/strong>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Both try to reduce dimensions&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>PCA looks at the features with the most variation&lt;/li>
&lt;li>LDA tries to maximize the separation of known categories&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h2 id="reference">Reference&lt;/h2>
&lt;ul>
&lt;li>&lt;a href="https://www.youtube.com/watch?v=azXCzI57Yfc">https://www.youtube.com/watch?v=azXCzI57Yfc&lt;/a>&lt;/li>
&lt;/ul></description></item></channel></rss>