SVM: Kernelized SVM

SVM (with features)

  • Maximum margin principle

  • Slack variables allow for margin violation

    $$\begin{array}{ll} \underset{\mathbf{w}}{\operatorname{argmin}} \quad &\|\mathbf{w}\|^{2} + C \sum_i^N \xi_i \\ \text{ s.t. } \quad & y_{i}\left(\mathbf{w}^{T} \color{red}{\phi(\mathbf{x}_{i})} + b\right) \geq 1 - \xi_i, \quad \xi_i \geq 0 \end{array}$$

Math basics

Solving constrained optimization problems: the method of Lagrange multipliers

  • Primal optimization problem:
$$\begin{array}{ll} \underset{\boldsymbol{x}}{\min} \quad & f(\boldsymbol{x}) \\ \text{ s.t. } \quad & h_{i}(\boldsymbol{x}) \geq b_{i}, \text{ for } i=1 \ldots K \end{array}$$
  • Lagrangian optimization:
$$\begin{array}{ll} \underset{\boldsymbol{x}}{\min} \underset{\boldsymbol{\lambda}}{\max} \quad & L(\boldsymbol{x}, \boldsymbol{\lambda}) = f(\boldsymbol{x}) - \sum_{i=1}^K \lambda_i(h_i(\boldsymbol{x}) - b_i) \\ \text{ s.t. } & \lambda_i \geq 0, \quad i = 1 \dots K \end{array}$$
  • Dual optimization problem

    $$\begin{aligned} \boldsymbol{\lambda}^{*}=\underset{\boldsymbol{\lambda}}{\arg \max } g(\boldsymbol{\lambda}), \quad & g(\boldsymbol{\lambda})=\min_{\boldsymbol{x}} L(\boldsymbol{x}, \boldsymbol{\lambda}) \\ \text{ s.t. } \quad \lambda_{i} \geq 0, & \text{ for } i=1 \ldots K \end{aligned}$$
    • $g$: dual function of the optimization problem
    • Essentially swaps min and max in the definition of $L$
  • Slater's condition: For a convex objective and convex constraints (with a strictly feasible point), solving the dual is equivalent to solving the primal

    • I.e., the optimal primal parameters can be obtained from the optimal dual parameters: $\boldsymbol{x}^* = \underset{\boldsymbol{x}}{\operatorname{argmin}}\, L(\boldsymbol{x}, \boldsymbol{\lambda}^*)$
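To make the primal/dual recipe concrete, here is a minimal numerical sketch on a hypothetical toy problem (minimize $f(x) = x^2$ subject to $x \geq 1$): maximize the dual over $\lambda \geq 0$, then recover the primal solution from $\lambda^*$.

```python
import numpy as np

# Toy problem: minimize f(x) = x^2  s.t.  x >= 1  (i.e. h(x) = x, b = 1).
# Lagrangian: L(x, lam) = x^2 - lam * (x - 1).
# Dual function: g(lam) = min_x L(x, lam) = -lam**2 / 4 + lam  (minimizer x = lam / 2).

lams = np.linspace(0.0, 4.0, 4001)   # grid over feasible multipliers lam >= 0
g = -lams**2 / 4 + lams              # dual function values on the grid
lam_star = lams[np.argmax(g)]        # maximize the dual -> lam* = 2
x_star = lam_star / 2                # recover primal: x* = argmin_x L(x, lam*) = 1

print(lam_star, x_star)
```

The objective and constraint are convex and $x = 2$ is strictly feasible, so Slater's condition holds and the recovered $x^* = 1$ is indeed the primal optimum.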

Dual derivation of the SVM

  1. SVM optimization:

    $$\begin{array}{ll} \underset{\boldsymbol{w}}{\operatorname{argmin}} \quad & \|\boldsymbol{w}\|^2 \\ \text{ s.t. } \quad & y_i(\boldsymbol{w}^T\phi(\boldsymbol{x}_i) + b) \geq 1 \end{array}$$
  2. Lagrangian function:

    $$L(\boldsymbol{w}, \boldsymbol{\alpha})=\frac{1}{2} \boldsymbol{w}^{T} \boldsymbol{w}-\sum_{i} \alpha_{i}\left(y_{i}\left(\boldsymbol{w}^{T} \phi\left(\boldsymbol{x}_{i}\right)+b\right)-1\right)$$
  3. Compute the optimal $\boldsymbol{w}$:

    $$\begin{aligned} &\frac{\partial L}{\partial \boldsymbol{w}} = \boldsymbol{w} - \sum_i \alpha_i y_i \phi(\boldsymbol{x}_i) \overset{!}{=} 0 \\ \Leftrightarrow \quad & \color{CornflowerBlue}{\boldsymbol{w}^* = \sum_i \alpha_i y_i \phi(\boldsymbol{x}_i)} \end{aligned}$$
    • Many of the $\alpha_i$ will be zero (the corresponding constraint is strictly satisfied, i.e. inactive)

    • If $\alpha_i \neq 0$, then by complementary slackness $y_{i}\left(\boldsymbol{w}^{T} \phi\left(\boldsymbol{x}_{i}\right)+b\right)-1 = 0$

      $\Rightarrow \phi(\boldsymbol{x}_i)$ is a support vector

    • The optimal weight vector $\boldsymbol{w}$ is a linear combination of the support vectors! πŸ‘

  4. Optimality condition for $b$:

    $$\frac{\partial L}{\partial b} = - \sum_i \alpha_i y_i \overset{!}{=} 0 \quad \Rightarrow \quad \sum_i \alpha_i y_i = 0$$
    • We do not obtain a solution for $b$
    • But we do obtain an additional condition on $\boldsymbol{\alpha}$

    $b$ can be computed from $\boldsymbol{w}$:

    If $\alpha_i > 0$, then $\boldsymbol{x}_i$ lies on the margin due to the complementary slackness condition, i.e.:

    $$\begin{aligned} y_{i}\left(\boldsymbol{w}^{T} \phi\left(\boldsymbol{x}_{i}\right)+b\right)-1 &= 0 \\ y_{i}\left(\boldsymbol{w}^{T} \phi\left(\boldsymbol{x}_{i}\right)+b\right) &= 1 \\ \underbrace{y_{i} y_{i}}_{=1}\left(\boldsymbol{w}^{T} \phi\left(\boldsymbol{x}_{i}\right)+b\right) &= y_{i} \\ \Rightarrow b &= y_{i} - \boldsymbol{w}^{T} \phi\left(\boldsymbol{x}_{i}\right) \end{aligned}$$
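Steps 1–4 above can be sketched numerically. The following is a minimal illustration on a hypothetical 1-D separable dataset with $\phi(x) = x$ (a linear SVM): solve the dual with a generic SLSQP solver (an illustrative choice, not the standard SVM solver), then recover $\boldsymbol{w}^*$ and $b$ from the $\alpha_i$.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical 1-D toy data, linearly separable; phi(x) = x
X = np.array([-2.0, -1.0, 1.0, 2.0])
y = np.array([-1.0, -1.0, 1.0, 1.0])
N = len(X)
Q = np.outer(y, y) * np.outer(X, X)   # Q_ij = y_i y_j x_i x_j

def neg_dual(alpha):                  # minimize the negative of the dual function
    return -(alpha.sum() - 0.5 * alpha @ Q @ alpha)

res = minimize(
    neg_dual, x0=np.zeros(N), method="SLSQP",
    bounds=[(0, None)] * N,                              # alpha_i >= 0
    constraints={"type": "eq", "fun": lambda a: a @ y},  # sum_i alpha_i y_i = 0
)
alpha = res.x
w = (alpha * y) @ X                   # w* = sum_i alpha_i y_i phi(x_i)
k = np.argmax(alpha)                  # index of a support vector (alpha_k > 0)
b = y[k] - w * X[k]                   # b = y_k - w^T phi(x_k)
print(w, b)                           # w close to 1.0, b close to 0.0
```

Only the margin points $x = \pm 1$ receive nonzero $\alpha_i$; the points at $\pm 2$ satisfy their constraints strictly and drop out, matching the support-vector interpretation above.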

Applying the kernel trick to the SVM

  • Lagrangian:
$$L(\boldsymbol{w}, \boldsymbol{\alpha}) = {\color{red}{\frac{1}{2} \boldsymbol{w}^{T} \boldsymbol{w}}} - \sum_{i} \alpha_{i}\left({\color{green}{y_{i} (\boldsymbol{w}^{T} \phi\left(\boldsymbol{x}_{i}\right)}} + b) - \color{CornflowerBlue}{1}\right), \quad \boldsymbol{w}^{*}=\sum_{i} \alpha_{i} y_{i} \phi\left(\boldsymbol{x}_{i}\right)$$
  • Dual function (Wolfe Dual Lagrangian function):
$$\begin{aligned} g(\boldsymbol{\alpha}) &= L\left(\boldsymbol{w}^{*}, \boldsymbol{\alpha}\right) \\ &= \color{red}{\frac{1}{2} \underbrace{\sum_{i} \sum_{j} \alpha_{i} \alpha_{j} y_{i} y_{j} \phi\left(\boldsymbol{x}_{i}\right)^{T} \phi\left(\boldsymbol{x}_{j}\right)}_{{\boldsymbol{w}^*}^T \boldsymbol{w}^*}} - \color{green}{\sum_{i} \alpha_{i} y_{i} (\underbrace{\sum_{j} \alpha_{j} y_{j} \phi\left(\boldsymbol{x}_{j}\right)}_{\boldsymbol{w}^*})^{T} \phi\left(\boldsymbol{x}_{i}\right)} + \color{CornflowerBlue}{\sum_{i} \alpha_{i}} \\ &= \sum_{i} \alpha_{i} - \frac{1}{2} \sum_{i} \sum_{j} \alpha_{i} \alpha_{j} y_{i} y_{j} \underbrace{\phi\left(\boldsymbol{x}_{i}\right)^{T} \phi\left(\boldsymbol{x}_{j}\right)}_{= k(\boldsymbol{x}_i, \boldsymbol{x}_j)} \\ &= \sum_{i} \alpha_{i} - \frac{1}{2} \sum_{i} \sum_{j} \alpha_{i} \alpha_{j} y_{i} y_{j} k(\boldsymbol{x}_i, \boldsymbol{x}_j) \end{aligned}$$
  • Wolfe dual optimization problem:
minβ‘Ξ±βˆ‘iΞ±iβˆ’12βˆ‘iβˆ‘jΞ±iΞ±jyiyjk(xi,xj) s.t Ξ±iβ‰₯0βˆ€i=1,…,Nβˆ‘iΞ±iyi=0 \begin{array}{ll} \underset{\boldsymbol{\alpha}}{\min} \quad & \sum_{i} \alpha_{i}-\frac{1}{2} \sum_{i} \sum_{j} \alpha_{i} \alpha_{j} y_{i} y_{j} \boldsymbol{k}(\boldsymbol{x}_i, \boldsymbol{x}_j ) \\\\ \text{ s.t } \quad & \alpha_i \geq 0 \quad \forall i = 1, \dots, N \\\\ & \sum_i \alpha_i y_i = 0 \end{array}
  • Compute primal from dual parameters:

    • Weight vector

      $$\boldsymbol{w}^{*}=\sum_{i} \alpha_{i} y_{i} \phi\left(\boldsymbol{x}_{i}\right)$$
      • Cannot be represented explicitly (as $\phi$ is potentially infinite-dimensional). But don't worry, we never need the explicit representation
    • Bias: for any $k$ with $\alpha_k > 0$:

    $$\begin{array}{ll} b &= y_{k}-\boldsymbol{w}^{T} \phi\left(\boldsymbol{x}_{k}\right) \\ &= y_{k}-\sum_{i} y_{i} \alpha_{i} k\left(\boldsymbol{x}_{i}, \boldsymbol{x}_{k}\right) \end{array}$$
    • Decision function (again, we use the kernel trick, so the explicit representation of the weight vector $\boldsymbol{w}^*$ is not needed):
    $$\begin{aligned} f(\boldsymbol{x}) &= (\boldsymbol{w}^{*})^{T} \phi(\boldsymbol{x}) + b \\ &= \left(\sum_{i} \alpha_{i} y_{i} \phi\left(\boldsymbol{x}_{i}\right)\right)^{T} \phi(\boldsymbol{x}) + b \\ &= \sum_{i} \alpha_{i} y_{i} \phi(\boldsymbol{x}_i)^{T} \phi(\boldsymbol{x}) + b \\ &= \sum_i \alpha_{i} y_{i} k\left(\boldsymbol{x}_{i}, \boldsymbol{x}\right)+b \end{aligned}$$
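The kernelized pipeline can be sketched end to end on the classic XOR toy problem, which is not linearly separable in input space but is separable with an RBF kernel. This is an illustrative sketch: the dataset, the kernel bandwidth, and the generic SLSQP solver are all assumptions, not the standard SVM training procedure.

```python
import numpy as np
from scipy.optimize import minimize

# XOR toy data: not linearly separable, but separable with an RBF kernel
X = np.array([[0., 0.], [1., 1.], [0., 1.], [1., 0.]])
y = np.array([-1., -1., 1., 1.])
N = len(X)

def rbf(A, B, gamma=1.0):   # k(x, x') = exp(-gamma * ||x - x'||^2)
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

K = rbf(X, X)
Q = np.outer(y, y) * K

res = minimize(
    lambda a: -(a.sum() - 0.5 * a @ Q @ a),   # negative Wolfe dual
    x0=np.zeros(N), method="SLSQP",
    bounds=[(0, None)] * N,                              # alpha_i >= 0
    constraints={"type": "eq", "fun": lambda a: a @ y},  # sum_i alpha_i y_i = 0
)
alpha = res.x
k_sv = np.argmax(alpha)                      # index of a support vector
b = y[k_sv] - (alpha * y) @ K[:, k_sv]       # b = y_k - sum_i alpha_i y_i k(x_i, x_k)

def f(x_new):   # decision function: only kernel evaluations, no explicit w*
    return (alpha * y) @ rbf(X, x_new[None, :])[:, 0] + b

preds = np.sign([f(x) for x in X])
print(preds)
```

Note that neither the weight vector nor $\phi$ ever appears: training and prediction use only kernel evaluations $k(\boldsymbol{x}_i, \boldsymbol{x})$, exactly as in the derivation above.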

Relaxed constraints with slack variables

  • Primal optimization problem

    $$\begin{array}{ll} \underset{\mathbf{w}}{\operatorname{argmin}} \quad &\|\mathbf{w}\|^{2} + \color{CornflowerBlue}{C \sum_i^N \xi_i} \\ \text{ s.t. } \quad & y_{i}\left(\mathbf{w}^{T} \phi(\mathbf{x}_{i}) + b\right) \geq 1 - \color{CornflowerBlue}{\xi_i}, \quad \color{CornflowerBlue}{\xi_i} \geq 0 \end{array}$$
  • Dual optimization problem

    minβ‘Ξ±βˆ‘iΞ±iβˆ’12βˆ‘iβˆ‘jΞ±iΞ±jyiyjk(xi,xj) s.t Cβ‰₯Ξ±iβ‰₯0βˆ€i=1,…,Nβˆ‘iΞ±iyi=0 \begin{array}{ll}\underset{\boldsymbol{\alpha}}{\min} \quad & \sum_{i} \alpha_{i}-\frac{1}{2} \sum_{i} \sum_{j} \alpha_{i} \alpha_{j} y_{i} y_{j} \boldsymbol{k}(\boldsymbol{x}_i, \boldsymbol{x}_j ) \\\\ \text{ s.t } \quad & \color{CornflowerBlue}{C \geq} \alpha_i \geq 0 \quad \forall i = 1, \dots, N \\\\ & \sum_i \alpha_i y_i = 0\end{array}

    Adds an upper bound of $\color{CornflowerBlue}{C}$ on $\color{CornflowerBlue}{\alpha_i}$

    • Without slack, $\alpha_i \to \infty$ when a constraint is violated (the point is misclassified)
    • The upper bound $C$ limits the $\alpha_i$, so misclassifications are allowed
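The effect of the box constraint can be seen directly in the dual solution. In the following sketch (a hypothetical 1-D dataset with one mislabeled outlier; the SLSQP solver is again an illustrative choice), the only change from the hard-margin dual is the bound `(0, C)` on each $\alpha_i$, and the outlier's multiplier ends up capped at $C$ rather than diverging.

```python
import numpy as np
from scipy.optimize import minimize

# 1-D toy data with one mislabeled outlier at x = -1.5 (hypothetical example)
X = np.array([-2.0, -1.0, 1.0, 2.0, -1.5])
y = np.array([-1.0, -1.0, 1.0, 1.0, 1.0])  # last point violates any linear margin
C = 1.0
N = len(X)
Q = np.outer(y, y) * np.outer(X, X)        # linear kernel k(x, x') = x * x'

res = minimize(
    lambda a: -(a.sum() - 0.5 * a @ Q @ a),   # negative Wolfe dual
    x0=np.zeros(N), method="SLSQP",
    bounds=[(0, C)] * N,                      # box constraint: 0 <= alpha_i <= C
    constraints={"type": "eq", "fun": lambda a: a @ y},
)
alpha = res.x
print(alpha[-1])   # the margin-violating outlier has its alpha capped at C
```

Without the bound this dual would be unbounded (the outlier's constraint can never be satisfied), which is the $\alpha_i \to \infty$ behavior noted above.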