Kernelized Ridge Regression
Kernel regression
Kernel identities
Let
$$
\Phi_X = \begin{bmatrix} \phi(x_1)^T \\ \vdots \\ \phi(x_N)^T \end{bmatrix} \in \mathbb{R}^{N \times d},
\qquad
\Phi_X^T = [\phi(x_1), \dots, \phi(x_N)] \in \mathbb{R}^{d \times N},
$$
then the following identities hold:
Kernel matrix
$$
K = \Phi_X \Phi_X^T
\quad \text{with} \quad
[K]_{ij} = \phi(x_i)^T \phi(x_j) = \langle \phi(x_i), \phi(x_j) \rangle = k(x_i, x_j)
$$
Kernel vector
$$
k(x_*) = \begin{bmatrix} k(x_1, x_*) \\ \vdots \\ k(x_N, x_*) \end{bmatrix}
= \begin{bmatrix} \phi(x_1)^T \phi(x_*) \\ \vdots \\ \phi(x_N)^T \phi(x_*) \end{bmatrix}
= \Phi_X \, \phi(x_*)
$$
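A minimal NumPy sketch of these identities; the quadratic feature map and toy inputs below are illustrative assumptions, not part of the notes:

```python
import numpy as np

# Illustrative feature map: phi(x) = (1, x, x^2) for scalar x (assumption for this sketch)
def phi(x):
    return np.array([1.0, x, x**2])

def k(x, y):
    # Kernel induced by the explicit feature map: k(x, y) = <phi(x), phi(y)>
    return phi(x) @ phi(y)

x_data = np.array([0.5, -1.0, 2.0, 3.5])        # N = 4 training inputs
Phi_X = np.stack([phi(x) for x in x_data])       # N x d feature matrix

# Kernel matrix: K = Phi_X Phi_X^T with [K]_ij = k(x_i, x_j)
K = Phi_X @ Phi_X.T
K_check = np.array([[k(xi, xj) for xj in x_data] for xi in x_data])
assert np.allclose(K, K_check)

# Kernel vector for a query point x_*: k(x_*) = Phi_X phi(x_*)
x_star = 1.25
k_star = Phi_X @ phi(x_star)
k_star_check = np.array([k(xi, x_star) for xi in x_data])
assert np.allclose(k_star, k_star_check)
```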
Kernel Ridge Regression
Ridge Regression: (See also: Polynomial Regression (Generalized linear regression models))
Apply kernel trick
Rewrite the solution in terms of inner products in feature space, using the following matrix identity:
$$
(I + AB)^{-1} A = A (I + BA)^{-1}
$$
Then we get
$$
w^*_{\text{ridge}}
= \underbrace{(\Phi^T \Phi + \lambda I)^{-1}}_{d \times d \text{ matrix inversion}} \Phi^T y
= \Phi^T \underbrace{(\Phi \Phi^T + \lambda I)^{-1}}_{N \times N \text{ matrix inversion}} y
= \Phi^T \underbrace{(K + \lambda I)^{-1} y}_{=: \alpha}
= \Phi^T \alpha
$$
- beneficial for $d \gg N$
- Still, $w^* \in \mathbb{R}^d$ is potentially infinite-dimensional and cannot be represented explicitly
Yet, we can still evaluate the function $f$ without an explicit representation of $w^*$ 😉
$$
f(x) = \phi(x)^T w^* = \phi(x)^T \Phi^T \alpha
\overset{\text{kernel trick}}{=} k(x)^T \alpha
= \sum_i \alpha_i \, k(x_i, x)
$$
For a Gaussian kernel:
$$
f(x) = \sum_i \alpha_i \, k(x_i, x)
= \sum_i \alpha_i \exp\!\left(-\frac{\|x - x_i\|^2}{2\sigma^2}\right)
$$
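A short NumPy sketch of the dual solution and prediction with a Gaussian kernel; the toy data, $\sigma$, and $\lambda$ below are illustrative assumptions:

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    # Pairwise Gaussian kernel between rows of A (n x d) and rows of B (m x d)
    sq_dists = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-sq_dists / (2 * sigma**2))

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(30, 1))                    # N = 30 training inputs
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(30)     # noisy targets

sigma, lam = 0.5, 1e-2                                   # illustrative hyperparameters
K = gaussian_kernel(X, X, sigma)                         # N x N kernel matrix
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)     # alpha = (K + lambda I)^{-1} y

X_test = np.linspace(-3, 3, 5).reshape(-1, 1)
f_test = gaussian_kernel(X_test, X, sigma) @ alpha       # f(x) = k(x)^T alpha
print(f_test)
```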
Select hyperparameter
Parameters such as the bandwidth $\sigma$ in the Gaussian kernel
$$
k(x, y) = \exp\!\left(-\frac{\|x - y\|^2}{2\sigma^2}\right)
$$
are called hyperparameters.
How to choose? Cross validation!
Example:
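A minimal sketch of selecting $\sigma$ by k-fold cross-validation; the candidate grid, fold count, and toy data are illustrative assumptions:

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma**2))

def cv_error(X, y, sigma, lam, n_folds=5):
    # Average validation MSE of kernel ridge regression over the folds
    folds = np.array_split(np.random.default_rng(0).permutation(len(X)), n_folds)
    errs = []
    for val_idx in folds:
        train_idx = np.setdiff1d(np.arange(len(X)), val_idx)
        K = gaussian_kernel(X[train_idx], X[train_idx], sigma)
        alpha = np.linalg.solve(K + lam * np.eye(len(train_idx)), y[train_idx])
        pred = gaussian_kernel(X[val_idx], X[train_idx], sigma) @ alpha
        errs.append(np.mean((pred - y[val_idx])**2))
    return np.mean(errs)

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(60, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(60)

lam = 1e-2
sigmas = [0.05, 0.1, 0.25, 0.5, 1.0, 2.0]
best_sigma = min(sigmas, key=lambda s: cv_error(X, y, s, lam))
print("selected bandwidth:", best_sigma)
```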
Summary: kernel ridge regression
The solution for kernel ridge regression is given by
$$
f^*(x) = k(x)^T (K + \lambda I)^{-1} y
$$
- No evaluation of the feature vectors needed 👏
- Only pair-wise scalar products (evaluated by the kernel) 👏
- Need to invert an $N \times N$ matrix (can be costly) 🤪
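As a cross-check of the closed form above, the sketch below compares it against scikit-learn's KernelRidge with an RBF kernel; using this library here is an assumption of the sketch, with gamma = 1/(2σ²) matching the Gaussian kernel written earlier:

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge

rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, size=(40, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(40)

sigma, lam = 0.5, 1e-2

# Closed form: f*(x) = k(x)^T (K + lambda I)^{-1} y with a Gaussian kernel
def gk(A, B):
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma**2))

alpha = np.linalg.solve(gk(X, X) + lam * np.eye(len(X)), y)
X_test = np.linspace(-3, 3, 7).reshape(-1, 1)
f_closed_form = gk(X_test, X) @ alpha

# scikit-learn's KernelRidge with an RBF kernel; gamma = 1 / (2 sigma^2)
model = KernelRidge(alpha=lam, kernel="rbf", gamma=1.0 / (2 * sigma**2))
model.fit(X, y)
f_sklearn = model.predict(X_test)

print(np.max(np.abs(f_closed_form - f_sklearn)))   # should be close to 0
```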
‼️Note:
Have to store all samples in kernel-based methods
- Computationally expensive (matrix inversion is $O(n^{2.376})$)!
Hyperparameters of the method are given by the kernel parameters
- Can be optimized on a validation set
Very flexible function representation, only a few hyperparameters 👍