Polynomial Regression

Polynomial Regression (Generalized linear regression models)

💡Idea

Use a linear model to fit nonlinear data: add powers of each feature as new features, then train a linear model on this extended set of features.
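As a quick illustration of this idea, here is a minimal sketch using NumPy and scikit-learn; the quadratic toy data, the degree, and all numbers are purely illustrative:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Illustrative 1-D data following a noisy quadratic (values chosen arbitrarily)
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = 0.5 * X[:, 0] ** 2 + X[:, 0] + 2 + rng.normal(scale=0.5, size=100)

# Extend the feature set with powers of x, then train an ordinary linear model
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)          # columns: [x, x^2]
model = LinearRegression().fit(X_poly, y)

print(model.intercept_, model.coef_)    # roughly 2 and [1, 0.5]
```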

Generalize Linear Regression to Polynomial Regression

In Linear Regression, $f$ is modelled as linear in $\boldsymbol{x}$ and $\boldsymbol{w}$ (with $\hat{\boldsymbol{x}}$ denoting the input vector augmented with a constant $1$ for the bias term):

$ f(\boldsymbol{x}) = \hat{\boldsymbol{x}}^T \boldsymbol{w} $

Rewrite it more generally:

$ f(\boldsymbol{x}) = \phi(\boldsymbol{x})^T \boldsymbol{w} $

  • $\phi(\boldsymbol{x})$: vector-valued function of the input vector $\boldsymbol{x}$ (models of this form are also called “linear basis function models”)
    • $\phi_i(\boldsymbol{x})$: the individual basis functions

In principle, this allows us to learn any non-linear function, if we know suitable basis functions (which is typically not the case 🤪).

Example 1

$\boldsymbol{x}=\left[\begin{array}{c} x_1 \\ x_2 \end{array}\right] \in \mathbb{R}^{2}$

$\phi: \mathbb{R}^2 \to \mathbb{R}^3, \left[\begin{array}{c} x_1 \\ x_2 \end{array}\right] \mapsto \left[\begin{array}{c} 1 \\ x_1 \\ x_2 \end{array}\right]\qquad $(I.e.: $\phi_1(\boldsymbol{x}) = 1, \phi_2(\boldsymbol{x}) = x_1, \phi_3(\boldsymbol{x}) = x_2$)

(Figure: Pol_Reg_Example_2)

Example 2

$\boldsymbol{x}=\left[\begin{array}{c} x_1 \\ x_2 \end{array}\right] \in \mathbb{R}^{2}$

$\phi: \mathbb{R}^2 \to \mathbb{R}^5, \left[\begin{array}{c} x_1 \\ x_2 \end{array}\right] \mapsto \left[\begin{array}{c} 1 \\ x_1 \\ x_2 \\ x_1^2 \\ x_2^2 \end{array}\right]$

(Figure: Pol_Reg_Example_3)
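A small NumPy sketch of the feature map from Example 2 (the function name `phi` and the sample input are my own choices for illustration):

```python
import numpy as np

def phi(x):
    """Basis-function map from Example 2: R^2 -> R^5."""
    x1, x2 = x
    return np.array([1.0, x1, x2, x1 ** 2, x2 ** 2])

print(phi(np.array([2.0, 3.0])))   # -> [1. 2. 3. 4. 9.]
```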

Optimal value of $\boldsymbol{w}$

$ \boldsymbol{w}^{*}=\left(\boldsymbol{\Phi}^{T} \boldsymbol{\Phi}\right)^{-1} \boldsymbol{\Phi}^{T} \boldsymbol{y}, \qquad \boldsymbol{\Phi}=\left[\begin{array}{c} \phi(\boldsymbol{x}_{1})^{T} \\ \vdots \\ \phi(\boldsymbol{x}_{n})^{T} \end{array}\right] $

(The same as in Linear Regression, except that the data matrix is now replaced by the basis function matrix $\boldsymbol{\Phi}$.)
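A minimal NumPy sketch of this closed-form solution, reusing the Example 2 feature map; it solves the normal equations with `np.linalg.solve` instead of forming the inverse explicitly, and all names, data points, and weights are illustrative:

```python
import numpy as np

def phi(x):
    """Feature map from Example 2: [x1, x2] -> [1, x1, x2, x1^2, x2^2]."""
    x1, x2 = x
    return np.array([1.0, x1, x2, x1 ** 2, x2 ** 2])

def fit_least_squares(Phi, y):
    """w* = (Phi^T Phi)^{-1} Phi^T y, computed by solving a linear system."""
    return np.linalg.solve(Phi.T @ Phi, Phi.T @ y)

# Rows of Phi are phi(x_i)^T, as in the definition above
X = np.array([[0.0, 1.0], [1.0, 2.0], [2.0, 0.5], [3.0, 1.5], [1.5, -1.0]])
Phi = np.stack([phi(x) for x in X])            # shape (n, 5)
w_true = np.array([2.0, -1.0, 0.5, 0.3, -0.2])
y = Phi @ w_true                               # noiseless targets for the demo

print(fit_least_squares(Phi, y))               # recovers w_true up to rounding
```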

Challenge of Polynomial Regression: Overfitting

Reason: the model is too complex (the degree of the polynomial is too high!). It fits the noise and behaves unpredictably between the training points. 😭

Solution: Regularization

Regularization

Regularization: Constrain a model to make it simpler and reduce the risk of overfitting.

💡 Avoid overfitting by forcing the weights $\boldsymbol{w}$ to be small

Assume that our model has degree 3 (features $x^1, x^2, x^3$), with corresponding parameters/weights $w_1, w_2, w_3$. If we force $w_3=0$, then $w_3 x^3 = 0$, so the model effectively has only degree 2. In other words, the model becomes simpler.

In general, a regularized model has the following cost/objective function:

$$ \underbrace{E_D(\boldsymbol{w})}_{\text{Data term}} + \underbrace{\lambda E_W(\boldsymbol{w})}_{\text{Regularization term}} $$

$\lambda$: regularization factor (a hyperparameter that needs to be tuned manually), controls how strongly you want to regularize the model.

Regularized Least Squares (Ridge Regression)

Consists of:

  • the Sum of Squared Errors (SSE) function
  • a quadratic regulariser ($L_2$ regularization)

$ \begin{aligned} L_{\text{ridge}} &= \text{SSE} + \lambda \|\boldsymbol{w}\|^2 \\ &= (\boldsymbol{y}-\boldsymbol{\Phi} \boldsymbol{w})^{T}(\boldsymbol{y}-\boldsymbol{\Phi} \boldsymbol{w})+\lambda \boldsymbol{w}^{T} \boldsymbol{w} \end{aligned} $

Solution:

$\boldsymbol{w}_{\mathrm{ridge}}^{*}=\left(\boldsymbol{\Phi}^{T} \boldsymbol{\Phi}+\lambda \boldsymbol{I}\right)^{-1} \boldsymbol{\Phi}^{T} \boldsymbol{y}$

  • $\boldsymbol{I}$: Identity matrix
  • For $\lambda > 0$, $\left(\boldsymbol{\Phi}^{T} \boldsymbol{\Phi}+\lambda \boldsymbol{I}\right)$ is always full rank (positive definite) and can be easily inverted; a sketch follows below.
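A minimal NumPy sketch of the ridge solution (the function name, the random basis-function matrix, and the $\lambda$ values are illustrative only):

```python
import numpy as np

def fit_ridge(Phi, y, lam):
    """Closed-form ridge solution: w* = (Phi^T Phi + lam * I)^{-1} Phi^T y."""
    d = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(d), Phi.T @ y)

# Illustrative use with a random basis-function matrix
rng = np.random.default_rng(0)
Phi = rng.normal(size=(20, 5))
y = rng.normal(size=20)

print(fit_ridge(Phi, y, lam=0.0))   # reduces to ordinary least squares
print(fit_ridge(Phi, y, lam=10.0))  # stronger regularization shrinks the weights
```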