SVM: Basics

🎯 Goal of SVM

To find the optimal separating hyperplane which maximizes the margin of the training data

  • it correctly classifies the training data
  • it is the one that generalizes best to unseen data (it stays as far as possible from the data points of each class)

SVM math formulation

Assuming the data is linearly separable

  • Decision boundary: Hyperplane $\mathbf{w}^{T} \mathbf{x}+b=0$

  • Support Vectors: Data points closest to the decision boundary (other examples can be ignored)

    • Positive support vectors: $\mathbf{w}^{T} \mathbf{x}_{+}+b=+1$
    • Negative support vectors: $\mathbf{w}^{T} \mathbf{x}_{-}+b=-1$

    Why do we use 1 and -1 as class labels?

    • This makes the math manageable, because -1 and 1 are only different by the sign. We can write a single equation to describe the margin or how close a data point is to our separating hyperplane and not have to worry if the data is in the -1 or +1 class.
    • If a point is far away from the separating plane on the positive side, then $w^{T}x+b$ will be a large positive number, and $\text{label} \cdot (w^{T}x+b)$ will give us a large positive number. If it’s far from the separating plane on the negative side and has a negative label, $\text{label} \cdot (w^{T}x+b)$ will also give us a large positive number.
  • Margin $\rho$: distance between the support vectors and the decision boundary; it should be maximized (a one-line check of the formula follows below)

    $$\rho = \frac{\mathbf{w}^{T} \mathbf{x}_{+}+b}{\|\mathbf{w}\|}-\frac{\mathbf{w}^{T} \mathbf{x}_{-}+b}{\|\mathbf{w}\|}=\frac{2}{\|\mathbf{w}\|}$$
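
    Plugging in the support-vector conditions $\mathbf{w}^{T} \mathbf{x}_{+}+b=+1$ and $\mathbf{w}^{T} \mathbf{x}_{-}+b=-1$ gives the last equality:

    $$\rho = \frac{(+1)}{\|\mathbf{w}\|}-\frac{(-1)}{\|\mathbf{w}\|}=\frac{2}{\|\mathbf{w}\|}$$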

SVM optimization problem

Requirements:

  1. Maximal margin
  2. Correct classification

Based on these requirements, we have:

$$\underset{\mathbf{w}}{\operatorname{argmax}} \quad \frac{2}{\|\mathbf{w}\|} \qquad \text{s.t.} \quad y_{i}\left(\mathbf{w}^{T} \mathbf{x}_{i}+b\right) \geq 1$$

Reformulation:

$$\begin{aligned} \underset{\mathbf{w}}{\operatorname{argmin}} \quad & \|\mathbf{w}\|^{2} \\ \text{s.t.} \quad & y_{i}\left(\mathbf{w}^{T} \mathbf{x}_{i}+b\right) \geq 1 \end{aligned}$$

This is the hard margin SVM.
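
To make the hard-margin formulation concrete, here is a minimal sketch that solves this exact QP with the `cvxpy` library (an assumption; any QP solver works) on a tiny, made-up separable dataset:

```python
import numpy as np
import cvxpy as cp

# Toy, linearly separable 2D data (values are made up for illustration)
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 3.5],
              [-1.0, -1.0], [-2.0, -1.5], [-1.5, -2.5]])
y = np.array([1, 1, 1, -1, -1, -1])

w = cp.Variable(2)
b = cp.Variable()

# argmin ||w||^2  s.t.  y_i (w^T x_i + b) >= 1
problem = cp.Problem(cp.Minimize(cp.sum_squares(w)),
                     [cp.multiply(y, X @ w + b) >= 1])
problem.solve()

print("w =", w.value, " b =", b.value)
print("margin 2/||w|| =", 2 / np.linalg.norm(w.value))
```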

Soft margin SVM

💡 Idea

“Allow the classifier to make some mistakes” (Soft margin)

➡️ Trade-off between margin and classification accuracy

  • Slack variables: ${\color{blue}{\xi_{i}}} \geq 0$

  • 💡 Allows violating the margin conditions:

    $$y_{i}\left(\mathbf{w}^{T} \mathbf{x}_{i}+b\right) \geq 1- {\color{blue}{\xi_{i}}}$$
    • $0 \leq \xi_{i} \leq 1$: sample is between the margin and the decision boundary (margin violation)
    • $\xi_{i} \geq 1$: sample is on the wrong side of the decision boundary (misclassified)

Soft Max-Margin

Optimization problem

$$\begin{array}{lll} \underset{\mathbf{w}}{\operatorname{argmin}} \quad & \|\mathbf{w}\|^{2} + {\color{blue}{C \sum_i^N \xi_i}} \qquad & \text{(Punish large slack variables)}\\ \text{s.t.} \quad & y_{i}\left(\mathbf{w}^{T} \mathbf{x}_{i}+b\right) \geq 1 -{\color{blue}{\xi_i}}, \quad \xi_i \geq 0 \qquad & \text{(Condition for soft margin)}\end{array}$$
  • $C$: regularization parameter, determines how important $\xi$ should be (see the sketch after this list)
    • Small $C$: Constraints have little influence ➡️ large margin
    • Large $C$: Constraints have large influence ➡️ small margin
    • $C$ infinite: Constraints are enforced ➡️ hard margin
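
The following sketch illustrates this trade-off with scikit-learn's `SVC` (linear kernel); the synthetic overlapping dataset and the particular `C` values are made up for illustration:

```python
import numpy as np
from sklearn.svm import SVC

# Two overlapping Gaussian blobs (made-up toy data)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=[2, 2], scale=1.5, size=(50, 2)),
               rng.normal(loc=[-2, -2], scale=1.5, size=(50, 2))])
y = np.array([1] * 50 + [-1] * 50)

for C in [0.01, 1.0, 100.0]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    w = clf.coef_[0]
    margin = 2 / np.linalg.norm(w)     # rho = 2 / ||w||
    print(f"C={C:<7} margin={margin:.3f}  support vectors={len(clf.support_)}")
```

Smaller `C` tolerates more margin violations and yields a wider margin (and more support vectors); larger `C` approaches hard-margin behaviour.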

Soft SVM Optimization

Reformulate into an unconstrained optimization problem

  1. Rewrite constraints: $\xi_{i} \geq 1-y_{i}\left(\mathbf{w}^{T} \mathbf{x}_{i}+b\right)=1-y_{i} f\left(\boldsymbol{x}_{i}\right)$
  2. Together with $\xi_{i} \geq 0 \Rightarrow \xi_{i}=\max \left(0,1-y_{i} f\left(\boldsymbol{x}_{i}\right)\right)$

Unconstrained optimization (over $\mathbf{w}$):

$$\underset{\mathbf{w}}{\operatorname{argmin}} \quad \underbrace{\|\mathbf{w}\|^{2}}_{\text{regularization}}+C \underbrace{\sum_{i=1}^{N} \max \left(0,1-y_{i} f\left(\boldsymbol{x}_{i}\right)\right)}_{\text{loss function}}$$
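
Written out directly in Python (a minimal sketch; the helper name `soft_svm_objective` is an assumption, and $f(\boldsymbol{x}_{i})=\mathbf{w}^{T}\boldsymbol{x}_{i}+b$ as above):

```python
import numpy as np

def soft_svm_objective(w, b, X, y, C):
    """||w||^2 + C * sum_i max(0, 1 - y_i * (w^T x_i + b))."""
    scores = X @ w + b                          # f(x_i) for every sample
    hinge = np.maximum(0.0, 1.0 - y * scores)   # per-sample hinge loss
    return w @ w + C * hinge.sum()
```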

Points fall into 3 categories (a small numeric check follows this list):

  • $y_{i} f\left(\boldsymbol{x}_{i}\right) > 1$: point is outside the margin, no contribution to the loss

  • $y_{i} f\left(\boldsymbol{x}_{i}\right) = 1$: point is on the margin, no contribution to the loss, as in the hard margin

  • $y_{i} f\left(\boldsymbol{x}_{i}\right) < 1$: point violates the margin, contributes to the loss
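
A quick numeric check of the three cases (the raw scores below are made-up values for a positive example, $y_i = +1$):

```python
def hinge(y, fx):
    """Hinge loss max(0, 1 - y * f(x)) for a label y in {-1, +1} and raw score f(x)."""
    return max(0.0, 1.0 - y * fx)

# One made-up score per category: outside the margin, on the margin, violating it
for fx in [2.5, 1.0, 0.3]:
    print(f"y*f(x) = {1 * fx:4.1f}  ->  hinge loss = {hinge(+1, fx):.2f}")
# prints losses 0.00, 0.00, 0.70
```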

Loss function

SVMs use the “hinge” loss (an approximation of the 0-1 loss)

Hinge loss

For an intended output $t=\pm 1$ and a classifier score $y$, the hinge loss of the prediction $y$ is defined as

> $\ell(y)=\max (0,1-t \cdot y)$

Note that $y$ should be the “raw” output of the classifier’s decision function, not the predicted class label. For instance, in linear SVMs, $y = \mathbf{w}\cdot \mathbf{x}+ b$, where $(\mathbf{w},b)$ are the parameters of the hyperplane and $\mathbf{x}$ is the input variable(s).


The loss function of the SVM is convex, i.e.:

  • There is only one minimum
  • We can find it with gradient descent
  • However: Hinge loss is not differentiable! 🤪

Sub-gradients

For a convex, differentiable function $f: \mathbb{R}^d \to \mathbb{R}$:

$$f(\boldsymbol{z}) \geq f(\boldsymbol{x})+\nabla f(\boldsymbol{x})^{T}(\boldsymbol{z}-\boldsymbol{x})$$

(Linear approximation underestimates function)


A subgradient of a convex function $f$ at point $\boldsymbol{x}$ is any $\boldsymbol{g}$ such that

$$f(\boldsymbol{z}) \geq f(\boldsymbol{x})+\boldsymbol{g}^{T}(\boldsymbol{z}-\boldsymbol{x})$$
  • Always exists (even if $f$ is not differentiable)
  • If $f$ is differentiable at $\boldsymbol{x}$, then: $\boldsymbol{g}=\nabla f(\boldsymbol{x})$

Example

$f(x)=|x|$

  • $x \neq 0$: unique sub-gradient is $g= \operatorname{sign}(x)$
  • $x = 0$: $g \in [-1, 1]$


Sub-gradient Method

Sub-gradient Descent

  1. Given a convex $f$, not necessarily differentiable
  2. Initialize $\boldsymbol{x}_0$
  3. Repeat: $\boldsymbol{x}_{t+1}=\boldsymbol{x}_{t}-\eta \boldsymbol{g}$, where $\boldsymbol{g}$ is any sub-gradient of $f$ at point $\boldsymbol{x}_{t}$

‼️ Notes:

  • Sub-gradient steps do not necessarily decrease $f$ at every step (not a real descent method)
  • Need to keep track of the best iterate $\boldsymbol{x}^*$, as in the sketch below
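
A minimal sketch of the method in Python, assuming user-supplied callables `f` and `subgrad` (these names, the step size, and the $|x|$ test function are illustrative choices, not from the notes):

```python
import numpy as np

def subgradient_descent(f, subgrad, x0, eta=0.05, n_steps=500):
    """Generic sub-gradient method with best-iterate tracking."""
    x = np.asarray(x0, dtype=float)
    x_best, f_best = x.copy(), f(x)
    for _ in range(n_steps):
        x = x - eta * subgrad(x)              # step along any sub-gradient
        if f(x) < f_best:                     # steps may increase f,
            x_best, f_best = x.copy(), f(x)   # so remember the best iterate
    return x_best, f_best

# Example: minimize f(x) = |x|; sub-gradient is sign(x) (any value in [-1, 1] at x = 0)
x_best, f_best = subgradient_descent(lambda x: np.abs(x).sum(),
                                     lambda x: np.sign(x), x0=[3.0])
```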

Sub-gradients for hinge loss

$$\mathcal{L}\left(\mathbf{x}_{i}, y_{i} ; \mathbf{w}\right)=\max \left(0,1-y_{i} f\left(\mathbf{x}_{i}\right)\right), \qquad f\left(\mathbf{x}_{i}\right)=\mathbf{w}^{\top} \mathbf{x}_{i}+b$$

A sub-gradient with respect to $\mathbf{w}$ is

$$\frac{\partial \mathcal{L}}{\partial \mathbf{w}} = \begin{cases} -y_{i}\, \mathbf{x}_{i} & \text{if } y_{i} f\left(\mathbf{x}_{i}\right)<1 \\ 0 & \text{otherwise} \end{cases}$$
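
In code this case distinction is a single branch (the helper name `hinge_subgrad_w` is hypothetical):

```python
import numpy as np

def hinge_subgrad_w(x_i, y_i, w, b):
    """A sub-gradient of max(0, 1 - y_i * (w^T x_i + b)) with respect to w."""
    if y_i * (w @ x_i + b) < 1:    # margin violated: the hinge term is active
        return -y_i * x_i
    return np.zeros_like(w)        # flat region: 0 is a valid sub-gradient
```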

Sub-gradient descent for SVMs

Recall the unconstrained optimization problem for SVMs:

$$\underset{\mathbf{w}}{\operatorname{argmin}} \quad C \underbrace{\sum_{i=1}^{N} \max \left(0,1-y_{i} f\left(\boldsymbol{x}_{i}\right)\right)}_{\text{loss function}} + \underbrace{\|\mathbf{w}\|^{2}}_{\text{regularization}}$$

At each iteration, pick a random training sample $(\boldsymbol{x}_i, y_i)$ (a compact implementation sketch follows the two update rules below):

  • If $y_{i} f\left(\boldsymbol{x}_{i}\right)<1$:

    $$\boldsymbol{w}_{t+1}=\boldsymbol{w}_{t}-\eta\left(2 \boldsymbol{w}_{t}-C y_{i} \boldsymbol{x}_{i}\right)$$
  • Otherwise:

    $$\boldsymbol{w}_{t+1}=\boldsymbol{w}_{t}-\eta\, 2 \boldsymbol{w}_{t}$$
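
A compact sketch of this training loop in Python; the function name `fit_svm_subgradient`, the default hyperparameters, and the bias update are illustrative additions (the notes only state the $\boldsymbol{w}$ updates):

```python
import numpy as np

def fit_svm_subgradient(X, y, C=1.0, eta=1e-3, n_iters=20000, seed=0):
    """Soft-margin linear SVM via sub-gradient descent on ||w||^2 + C * sum of hinge losses."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(n_iters):
        i = rng.integers(n)                        # pick a random training sample
        if y[i] * (X[i] @ w + b) < 1:              # margin violated: hinge term is active
            w = w - eta * (2 * w - C * y[i] * X[i])
            b = b + eta * C * y[i]                 # analogous step for b (not shown in the notes)
        else:                                      # only the regularizer contributes
            w = w - eta * 2 * w
    return w, b
```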

Applications of SVMs

  • Pedestrian tracking
  • Text (and hypertext) categorization
  • Image classification
  • Bioinformatics (protein classification, cancer classification)
  • Hand-written character recognition

Yet, in the last 5-8 years, neural networks have outperformed SVMs on most applications. 🤪☹️😭