Goal: find the optimal separating hyperplane, i.e. the one that maximizes the margin of the training data.
It correctly classifies the training data.
It is the hyperplane that generalizes best to unseen data, since it stays as far as possible from the data points of each category.
SVM math formulation
Assuming the data is linearly separable
Decision boundary: the hyperplane w^T x + b = 0
Support vectors: the data points closest to the decision boundary (the other examples can be ignored)
Positive support vectors: w^T x_+ + b = +1
Negative support vectors: w^T x_- + b = -1
Why do we use 1 and -1 as class labels?
This makes the math manageable, because -1 and +1 differ only by sign. We can write a single equation to describe the margin, or how close a data point is to our separating hyperplane, without having to worry whether the point is in the -1 or the +1 class.
If a point is far from the separating plane on the positive side, then w^T x + b will be a large positive number, and label · (w^T x + b) will give us a large positive number. If it is far away on the negative side and has a negative label, label · (w^T x + b) will also give us a large positive number.
Margin ρ: the distance between the support vectors and the decision boundary; it should be maximized.
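A short derivation of why maximizing ρ leads to the usual SVM objective (standard material, restated here for completeness): using the support-vector equations above, the distance from a positive support vector x_+ to the hyperplane is |w^T x_+ + b| / ||w|| = 1 / ||w||, and likewise for a negative support vector, so
ρ = 2 / ||w||
Maximizing ρ is therefore equivalent to minimizing ||w||^2 subject to every training point being classified with margin at least 1:
min_{w,b} ||w||^2   s.t.   y_i (w^T x_i + b) ≥ 1   for all i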
For an intended output t = ±1 and a classifier score y, the hinge loss of the prediction y is defined as
ℓ(y) = max(0, 1 − t · y)
Note that y should be the "raw" output of the classifier's decision function, not the predicted class label. For instance, in linear SVMs, y = w · x + b, where (w, b) are the parameters of the hyperplane and x is the input variable(s).
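As a concrete illustration, here is a minimal NumPy sketch of the hinge loss for a linear classifier. The function name hinge_loss and the toy data are illustrative, not from the notes above.

```python
import numpy as np

def hinge_loss(w, b, X, t):
    """Mean hinge loss of a linear classifier with labels t in {-1, +1}.

    The score y = w . x + b is the raw decision value, not the
    predicted class label.
    """
    scores = X @ w + b                                # raw scores y_i
    return np.mean(np.maximum(0.0, 1.0 - t * scores))

# Toy usage: two 2-D points, one per class.
X = np.array([[2.0, 1.0], [-1.5, -1.0]])
t = np.array([+1, -1])
w = np.array([0.5, 0.5])
print(hinge_loss(w, b=0.0, X=X, t=t))                 # 0.0: both points are outside the margin
```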
The loss function of SVM is convex:
I.e.,
Every local minimum is a global minimum
We can find it with gradient descent
However: the hinge loss is not differentiable!
Sub-gradients
For a differentiable convex function f : R^d → R:
f(z) ≥ f(x) + ∇f(x)^T (z − x)
(Linear approximation underestimates function)
A subgradient of a convex function f at a point x is any g such that
f(z) ≥ f(x) + g^T (z − x) for all z
Always exists (even if f is not differentiable)
If f is differentiable at x, then g = ∇f(x)
Example
f(x) = |x|
x ≠ 0: the unique sub-gradient is g = sign(x)
x = 0: any g ∈ [−1, 1]
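A minimal sketch of this example in code (the function name abs_subgradient and its g0 argument are illustrative):

```python
def abs_subgradient(x, g0=0.0):
    """Return a sub-gradient of f(x) = |x|.

    For x != 0 the sub-gradient is unique and equals sign(x).
    At x = 0 any value in [-1, 1] is valid; g0 selects one of them.
    """
    if x > 0:
        return 1.0
    if x < 0:
        return -1.0
    return g0  # any g0 in [-1, 1] satisfies |z| >= g0 * z for all z

print(abs_subgradient(3.2), abs_subgradient(-0.7), abs_subgradient(0.0))
```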
Sub-gradient Method
Sub-gradient Descent
Given a convex f, not necessarily differentiable
Initialize x_0
Repeat: x_{t+1} = x_t − η g, where g is any sub-gradient of f at the point x_t
Notes:
Sub-gradients do not necessarily decrease f at every step (so this is not a true descent method)
Need to keep track of the best iterate x_best
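A minimal NumPy sketch of the method, assuming a fixed step size η and tracking the best iterate as noted above; the function and argument names are illustrative:

```python
import numpy as np

def subgradient_descent(f, subgrad, x0, eta=0.1, steps=1000):
    """Minimize a convex (possibly non-differentiable) function f.

    subgrad(x) must return any sub-gradient of f at x.  Because the
    objective is not guaranteed to decrease at every step, the best
    iterate seen so far is tracked and returned.
    """
    x = np.asarray(x0, dtype=float)
    x_best, f_best = x.copy(), f(x)
    for _ in range(steps):
        x = x - eta * subgrad(x)         # step along a sub-gradient
        fx = f(x)
        if fx < f_best:                  # keep the best iterate
            x_best, f_best = x.copy(), fx
    return x_best, f_best

# Example: minimize f(x) = |x| starting from x = 5.
x_best, f_best = subgradient_descent(
    f=lambda x: np.abs(x).sum(),
    subgrad=np.sign,                     # sign(x) is a valid sub-gradient of |x|
    x0=[5.0],
)
print(x_best, f_best)
```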
Sub-gradients for hinge loss
L(x_i, y_i; w) = max(0, 1 − y_i f(x_i)), where f(x_i) = w^T x_i + b
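Taking the sub-gradient of this loss with respect to w case by case (this is the step that produces the update rules in the next section):
∂_w L = −y_i x_i   if y_i f(x_i) < 1   (margin violated, loss is active)
∂_w L = 0          if y_i f(x_i) > 1   (loss is zero)
At y_i f(x_i) = 1 the hinge is not differentiable, and any value between these two cases is a valid sub-gradient.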
Sub-gradient descent for SVMs
Recall the unconstrained optimization problem for SVMs:
argmin_w  C Σ_{i=1}^{N} max(0, 1 − y_i f(x_i))  +  ||w||^2
(the first term is the loss function, the second is the regularization)
At each iteration, pick a random training sample (x_i, y_i)
If y_i f(x_i) < 1:
w_{t+1} = w_t − η (2 w_t − C y_i x_i)
Otherwise:
w_{t+1} = w_t − η · 2 w_t
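Putting the update rules together, here is a minimal NumPy sketch of stochastic sub-gradient descent for a linear SVM; the bias term is omitted for brevity, and the function name and toy data are illustrative:

```python
import numpy as np

def svm_subgradient_train(X, y, C=1.0, eta=1e-3, epochs=100, seed=0):
    """Stochastic sub-gradient descent for a linear SVM (no bias term).

    Follows the update rules above: for a randomly drawn (x_i, y_i),
        if y_i * f(x_i) < 1:  w <- w - eta * (2*w - C * y_i * x_i)
        otherwise:            w <- w - eta * 2*w
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(n):       # visit samples in random order
            if y[i] * (X[i] @ w) < 1:      # hinge loss is active
                w -= eta * (2 * w - C * y[i] * X[i])
            else:                          # only the regularizer acts
                w -= eta * 2 * w
    return w

# Toy linearly separable data.
X = np.array([[2.0, 2.0], [1.5, 2.5], [-2.0, -1.0], [-1.0, -2.5]])
y = np.array([+1, +1, -1, -1])
w = svm_subgradient_train(X, y, C=10.0, eta=0.01, epochs=200)
print(w, np.sign(X @ w))                   # learned weights and predicted labels
```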
Application of SVMs
Pedestrian tracking
Text (and hypertext) categorization
Image classification
Bioinformatics (protein classification, cancer classification)
Hand-written character recognition
Yet, in the last 5-8 years, neural networks have outperformed SVMs on most applications.