Kernelized Ridge Regression


Kernel regression

Kernel identities

Let

\boldsymbol{\Phi}_{X}=\left[\begin{array}{c} \boldsymbol{\phi}(\boldsymbol{x}_{1})^{T} \\ \vdots \\ \boldsymbol{\phi}(\boldsymbol{x}_{N})^{T} \end{array}\right] \in \mathbb{R}^{N \times d}, \qquad \boldsymbol{\Phi}_{X}^{T} = \left[ \boldsymbol{\phi}(\boldsymbol{x}_{1}), \dots, \boldsymbol{\phi}(\boldsymbol{x}_{N})\right] \in \mathbb{R}^{d \times N}

then the following identities hold:

  • Kernel matrix

    \boldsymbol{K}=\boldsymbol{\Phi}_{X} \boldsymbol{\Phi}_{X}^{T}

    with

    [\boldsymbol{K}]_{ij}=\boldsymbol{\phi}(\boldsymbol{x}_{i})^{T} \boldsymbol{\phi}(\boldsymbol{x}_{j}) = \langle \boldsymbol{\phi}(\boldsymbol{x}_{i}), \boldsymbol{\phi}(\boldsymbol{x}_{j}) \rangle = k(\boldsymbol{x}_{i}, \boldsymbol{x}_{j})
  • Kernel vector

    \boldsymbol{k}(\boldsymbol{x}^{*})=\left[\begin{array}{c} k(\boldsymbol{x}_{1}, \boldsymbol{x}^{*}) \\ \vdots \\ k(\boldsymbol{x}_{N}, \boldsymbol{x}^{*}) \end{array}\right]=\left[\begin{array}{c} \boldsymbol{\phi}(\boldsymbol{x}_{1})^{T} \boldsymbol{\phi}(\boldsymbol{x}^{*}) \\ \vdots \\ \boldsymbol{\phi}(\boldsymbol{x}_{N})^{T} \boldsymbol{\phi}(\boldsymbol{x}^{*}) \end{array}\right]=\boldsymbol{\Phi}_{X} \boldsymbol{\phi}(\boldsymbol{x}^{*})
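A minimal NumPy sketch of these identities, using a hypothetical explicit feature map `phi` (degree-2 polynomial features of a scalar input, chosen purely for illustration):

```python
import numpy as np

# Hypothetical explicit feature map: degree-2 polynomial features of a 1-D input.
def phi(x):
    return np.array([1.0, x, x**2])

X = np.array([0.5, -1.0, 2.0])          # N = 3 training inputs
Phi_X = np.stack([phi(x) for x in X])   # N x d feature matrix

# Kernel matrix: [K]_ij = <phi(x_i), phi(x_j)>
K = Phi_X @ Phi_X.T                     # N x N

# Kernel vector for a test point x*: k(x*) = Phi_X phi(x*)
x_star = 1.5
k_star = Phi_X @ phi(x_star)            # N-dim vector with entries k(x_i, x*)

print(K)
print(k_star)
```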

Kernel Ridge Regression

Ridge Regression: (See also: Polynomial Regression (Generalized linear regression models))

  • Squared error function + L2 regularization
  • Linear feature space

  • Not directly applicable in infinite dimensional feature spaces

  • Objective:

    L_{\text{ridge}}=\underbrace{(\boldsymbol{y}-\boldsymbol{\Phi} \boldsymbol{w})^{T}(\boldsymbol{y}-\boldsymbol{\Phi} \boldsymbol{w})}_{\text{sum of squared errors}}+\lambda \underbrace{\boldsymbol{w}^{T} \boldsymbol{w}}_{L_{2}\text{ regularization}}
    • \boldsymbol{\Phi}_{X}=\left[\begin{array}{c} \boldsymbol{\phi}(\boldsymbol{x}_{1})^{T} \\ \vdots \\ \boldsymbol{\phi}(\boldsymbol{x}_{N})^{T} \end{array}\right] \in \mathbb{R}^{N \times d}
  • Solution

    \boldsymbol{w}_{\text{ridge}}^{*}= \color{red}{\underbrace{(\boldsymbol{\Phi}^{T} \boldsymbol{\Phi}+\lambda \boldsymbol{I})^{-1}}_{d \times d \text{ matrix inversion}}} \boldsymbol{\Phi}^{T} \boldsymbol{y}

    Matrix inversion is infeasible in infinite-dimensional feature spaces! 😭
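When the feature map is finite-dimensional, the primal solution above can be computed directly. A minimal NumPy sketch (the feature map `phi`, toy data, and regularization value are assumptions for illustration):

```python
import numpy as np

def phi(x):                              # hypothetical finite-dimensional feature map
    return np.array([1.0, x, x**2])

X = np.array([0.5, -1.0, 2.0])
y = np.array([1.0, 0.0, 3.0])
lam = 0.1

Phi = np.stack([phi(x) for x in X])      # N x d
d = Phi.shape[1]

# Primal ridge solution: (Phi^T Phi + lambda I)^{-1} Phi^T y  (d x d system)
w_ridge = np.linalg.solve(Phi.T @ Phi + lam * np.eye(d), Phi.T @ y)
print(w_ridge)
```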

Apply kernel trick

Rewrite the solution in terms of inner products in feature space, using the following matrix identity:

(\boldsymbol{I} + \boldsymbol{A}\boldsymbol{B})^{-1}\boldsymbol{A} = \boldsymbol{A} (\boldsymbol{I} + \boldsymbol{B}\boldsymbol{A})^{-1}
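A quick numerical check of the regularized form of this identity used below, $(\boldsymbol{\Phi}^T\boldsymbol{\Phi}+\lambda \boldsymbol{I})^{-1}\boldsymbol{\Phi}^T = \boldsymbol{\Phi}^T(\boldsymbol{\Phi}\boldsymbol{\Phi}^T+\lambda \boldsymbol{I})^{-1}$ (random matrices with assumed shapes):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, lam = 4, 6, 0.5
Phi = rng.standard_normal((N, d))
y = rng.standard_normal(N)

# (Phi^T Phi + lambda I_d)^{-1} Phi^T y  ==  Phi^T (Phi Phi^T + lambda I_N)^{-1} y
lhs = np.linalg.solve(Phi.T @ Phi + lam * np.eye(d), Phi.T @ y)
rhs = Phi.T @ np.linalg.solve(Phi @ Phi.T + lam * np.eye(N), y)
print(np.allclose(lhs, rhs))   # True
```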

Then we get

\begin{array}{ll} \boldsymbol{w}_{\text{ridge}}^{*} &= \color{red}{\underbrace{(\boldsymbol{\Phi}^{T} \boldsymbol{\Phi}+\lambda \boldsymbol{I})^{-1}}_{d \times d \text{ matrix inversion}}} \boldsymbol{\Phi}^{T} \boldsymbol{y} \\ &= \boldsymbol{\Phi}^{T} \color{LimeGreen}{\underbrace{(\boldsymbol{\Phi}\boldsymbol{\Phi}^{T} +\lambda \boldsymbol{I})^{-1}}_{N \times N \text{ matrix inversion}}} \boldsymbol{y} \\ &= \boldsymbol{\Phi}^{T} \underbrace{(\boldsymbol{K} +\lambda \boldsymbol{I})^{-1}\boldsymbol{y}}_{=: \boldsymbol{\alpha}} \\ &= \boldsymbol{\Phi}^{T} \boldsymbol{\alpha} \end{array}
  • Beneficial for $d \gg N$
  • Still, $\boldsymbol{w}^* \in \mathbb{R}^d$ is potentially infinite-dimensional and cannot be represented explicitly

Yet, we can still evaluate the function $f$ without an explicit representation of $\boldsymbol{w}^*$ 😉

\begin{array}{ll} f(\boldsymbol{x}) &= \boldsymbol{\phi}(\boldsymbol{x})^{T} \boldsymbol{w}^{*} \\ &= \boldsymbol{\phi}(\boldsymbol{x})^{T} \boldsymbol{\Phi}^{T} \boldsymbol{\alpha} \\ &\overset{\text{kernel trick}}{=} \boldsymbol{k}(\boldsymbol{x})^{T} \boldsymbol{\alpha} \\ &= \sum_{i} \alpha_{i} k(\boldsymbol{x}_{i}, \boldsymbol{x}) \end{array}

For a Gaussian kernel

f(\boldsymbol{x})=\sum_{i} \alpha_{i} k(\boldsymbol{x}_{i}, \boldsymbol{x})=\sum_{i} \alpha_{i} \exp\left(-\frac{\|\boldsymbol{x}-\boldsymbol{x}_{i}\|^{2}}{2 \sigma^{2}}\right)
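Putting the pieces together, a minimal kernel ridge regression sketch with a Gaussian kernel; the toy data, bandwidth `sigma`, and regularization `lam` are assumed example values:

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    # Pairwise k(a, b) = exp(-||a - b||^2 / (2 sigma^2)) for rows of A and B
    sq_dists = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-sq_dists / (2 * sigma**2))

# Toy 1-D regression data (assumed for illustration)
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(30, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(30)

sigma, lam = 1.0, 0.1

# Fit: alpha = (K + lambda I)^{-1} y   (N x N system)
K = gaussian_kernel(X, X, sigma)
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)

# Predict: f(x*) = k(x*)^T alpha = sum_i alpha_i k(x_i, x*)
X_test = np.linspace(-3, 3, 5).reshape(-1, 1)
f_test = gaussian_kernel(X_test, X, sigma) @ alpha
print(f_test)
```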

Selecting hyperparameters

Parameters such as the bandwidth $\sigma$ of the Gaussian kernel

k(\boldsymbol{x}, \boldsymbol{y})=\exp\left(-\frac{\|\boldsymbol{x}-\boldsymbol{y}\|^{2}}{2 \sigma^{2}}\right)

are called hyperparameters.

How to choose them? Cross-validation!

Example:

image-20200305164457118
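A minimal k-fold cross-validation sketch for choosing $\sigma$; the candidate bandwidths, fold count, and toy data are assumptions for illustration:

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma**2))

def cv_error(X, y, sigma, lam, n_folds=5):
    # Mean squared validation error over folds for one (sigma, lam) setting
    folds = np.array_split(np.random.default_rng(0).permutation(len(X)), n_folds)
    errs = []
    for val in folds:
        train = np.setdiff1d(np.arange(len(X)), val)
        K = gaussian_kernel(X[train], X[train], sigma)
        alpha = np.linalg.solve(K + lam * np.eye(len(train)), y[train])
        pred = gaussian_kernel(X[val], X[train], sigma) @ alpha
        errs.append(np.mean((pred - y[val])**2))
    return np.mean(errs)

# Toy data and candidate bandwidths (assumed for illustration)
rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(50, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(50)

sigmas = [0.1, 0.3, 1.0, 3.0]
best = min(sigmas, key=lambda s: cv_error(X, y, s, lam=0.1))
print("best sigma:", best)
```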

Summary: kernel ridge regression

The solution for kernel ridge regression is given by

f^{*}(\boldsymbol{x})=\boldsymbol{k}(\boldsymbol{x})^{T}(\boldsymbol{K}+\lambda \boldsymbol{I})^{-1} \boldsymbol{y}
  • No evaluation of the feature vectors needed 👏
  • Only pair-wise scalar products (evaluated by the kernel) 👏
  • Need to invert an $\color{red}{N \times N}$ matrix (can be costly) 🤪

‼️Note:

  • Have to store all samples in kernel-based methods

    • Computationally expensive (matrix inversion is $O(N^{2.376})$)!
  • Hyperparameters of the method are given by the kernel parameters

    • Can be optimized on a validation set
  • Very flexible function representation with only a few hyperparameters 👍