<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Regression | Haobin Tan</title><link>https://haobin-tan.netlify.app/tags/regression/</link><atom:link href="https://haobin-tan.netlify.app/tags/regression/index.xml" rel="self" type="application/rss+xml"/><description>Regression</description><generator>Hugo Blox Builder (https://hugoblox.com)</generator><language>en-us</language><lastBuildDate>Tue, 27 Oct 2020 00:00:00 +0000</lastBuildDate><image><url>https://haobin-tan.netlify.app/media/icon_hu7d15bc7db65c8eaf7a4f66f5447d0b42_15095_512x512_fill_lanczos_center_3.png</url><title>Regression</title><link>https://haobin-tan.netlify.app/tags/regression/</link></image><item><title>Regression</title><link>https://haobin-tan.netlify.app/docs/ai/machine-learning/regression/</link><pubDate>Mon, 07 Sep 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/machine-learning/regression/</guid><description/></item><item><title>Linear Regression</title><link>https://haobin-tan.netlify.app/docs/ai/machine-learning/regression/linear-regression/</link><pubDate>Mon, 06 Jul 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/machine-learning/regression/linear-regression/</guid><description>&lt;h2 id="linear-regression-model">Linear Regression Model&lt;/h2>
&lt;p>A linear model makes a prediction $\hat{y}_i$ by &lt;strong>simply computing a weighted sum of the input $\boldsymbol{x}_i$, plus a constant $w_0$ called the &lt;em>bias&lt;/em> term&lt;/strong>:&lt;/p>
&lt;h3 id="for-single-sampleinstances">For single sample/instances&lt;/h3>
$$
\hat{y}_i = f \left( \boldsymbol{x}\_i \right) = w_0 + \sum\_{j=1}^{D}w\_{j} x\_{i, j}
$$
&lt;p>In vector form:&lt;/p>
$$
\hat{y}\_{i}=w_{0}+ \displaystyle \sum\_{j=1}^{D} w_{j} x_{i, j}=\tilde{\boldsymbol{x}}\_{i}^{T} \boldsymbol{w}
$$
&lt;ul>
&lt;li>
&lt;p>$\tilde{\boldsymbol{x}}\_{i} = \left[\begin{array}{c}{1} \\\\ {\boldsymbol{x}\_{i}}\end{array}\right] = \left[\begin{array}{c} {1} \\\\ x\_{i, 1} \\\\ \vdots \\\\ {x\_{i, D}}\end{array}\right] \in \mathbb{R}^{D+1}$&lt;/p>
&lt;/li>
&lt;li>
&lt;p>$\boldsymbol{w}=\left[\begin{array}{c}{w\_{0}} \\\\ {\vdots} \\\\ {w\_{D}}\end{array}\right] \in \mathbb{R}^{D+1}$&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h3 id="on-full-dataset">On full dataset&lt;/h3>
$$
\hat{\boldsymbol{y}}=\left[\begin{array}{c}{\hat{y}\_{1}} \\\\{\vdots} \\\\ {\hat{y}\_{n}}\end{array}\right]=\left[\begin{array}{c}{\tilde{\boldsymbol{x}}\_{1}^{T} \boldsymbol{w}} \\\\ {\vdots} \\\\ {\tilde{\boldsymbol{x}}\_{n}^{T} \boldsymbol{w}}\end{array}\right] = \underbrace{\left[\begin{array}{cc}{1} &amp; {\boldsymbol{x}\_{1}^{T}} \\\\ {\vdots} &amp; {\vdots} \\\\ {1} &amp; {\boldsymbol{x}\_{n}^{T}}\end{array}\right]}\_{=: \boldsymbol{X}} \boldsymbol{w} = \boldsymbol{X} \boldsymbol{w}
$$
&lt;ul>
&lt;li>$\hat{\boldsymbol{y}}$: vector containing the output for each sample&lt;/li>
&lt;li>$\boldsymbol{X}$: data matrix whose first column is a vector of ones, accounting for the bias term&lt;/li>
&lt;/ul>
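&lt;p>As a quick illustration (not part of the original notes), the full-dataset prediction $\hat{\boldsymbol{y}} = \boldsymbol{X}\boldsymbol{w}$ can be computed in a few lines of numpy; the data and weights below are made-up toy values:&lt;/p>

```python
import numpy as np

# Toy data: n = 3 samples, D = 2 features (made-up values)
X_raw = np.array([[1.0, 2.0],
                  [3.0, 4.0],
                  [5.0, 6.0]])
w = np.array([0.5, 1.0, 2.0])  # [w_0, w_1, w_2], bias term first

# Build the data matrix X: prepend a column of ones so that
# w_0 acts as the bias term
X = np.hstack([np.ones((X_raw.shape[0], 1)), X_raw])

# X @ w evaluates x_tilde_i^T w for every sample at once
y_hat = X @ w
print(y_hat)  # [ 5.5 11.5 17.5]
```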
&lt;blockquote>
&lt;p>$\hat{\boldsymbol y}=\underbrace{\begin{bmatrix}{\hat y}\_1 \\\\ \vdots\\\\{\hat y}\_n\end{bmatrix}}\_{\in\mathbb{R}^{n\times1}}=\begin{bmatrix}\tilde{\boldsymbol x}\_1^T\boldsymbol w\\\\\vdots\\\\\tilde{\boldsymbol x}\_n^T\boldsymbol w\end{bmatrix}=\begin{bmatrix}1\cdot w\_0+x\_{1,1}\cdot w\_1+\cdots+x\_{1,D}\cdot w\_D\\\\\vdots\\\\1\cdot w\_0+x\_{n,1}\cdot w\_1+\cdots+x\_{n,D}\cdot w\_D\end{bmatrix}=\underset{=\begin{bmatrix}1&amp;x\_1^T\\\\\vdots&amp;\vdots\\\\1&amp;x\_n^T\end{bmatrix}\\\\=:\boldsymbol X\in\mathbb{R}^{n\times(1+D)}}{\underbrace{\begin{bmatrix}1&amp;x\_{1,1}&amp;\cdots&amp;x\_{1,D}\\\\\vdots&amp;\vdots&amp;\ddots&amp;\vdots\\\\1&amp;x\_{n,1}&amp;\cdots&amp;x\_{n,D}\end{bmatrix}}\cdot}\underbrace{\begin{bmatrix}w\_0\\\\w\_1\\\\\vdots\\\\w\_D\end{bmatrix}}\_{=:\boldsymbol w\in\mathbb{R}^{(1+D)\times1}}$&lt;/p>
&lt;/blockquote></description></item><item><title>Polynomial Regression (Generalized linear regression models)</title><link>https://haobin-tan.netlify.app/docs/ai/machine-learning/regression/polynomial-regression/</link><pubDate>Mon, 13 Jul 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/machine-learning/regression/polynomial-regression/</guid><description>&lt;h2 id="idea">💡Idea&lt;/h2>
&lt;p>&lt;strong>Use a linear model to fit nonlinear data&lt;/strong>:
add powers of each feature as new features, then train a linear model on this extended set of features.&lt;/p>
&lt;h2 id="generalize-linear-regression-to-polynomial-regression">Generalize Linear Regression to Polynomial Regression&lt;/h2>
&lt;p>In Linear Regression $f$ is modelled as linear in $\boldsymbol{x}$ and $\boldsymbol{w}$&lt;/p>
&lt;p>$
f(\boldsymbol{x}) = \tilde{\boldsymbol{x}}^T \boldsymbol{w}
$&lt;/p>
&lt;p>Rewrite it more generally:&lt;/p>
&lt;p>$
f(\boldsymbol{x}) = \phi(\boldsymbol{x})^T \boldsymbol{w}
$&lt;/p>
&lt;ul>
&lt;li>$\phi(\boldsymbol{x})$: vector-valued function of the input vector $\boldsymbol{x}$ (models of this form are also called &amp;ldquo;&lt;strong>linear basis function models&lt;/strong>&amp;rdquo;)
&lt;ul>
&lt;li>$\phi_i(\boldsymbol{x})$: &lt;strong>basis functions&lt;/strong>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;p>In principle, this allows us to &lt;strong>learn any non-linear function&lt;/strong>, if we know suitable basis functions (which is typically not the case 🤪).&lt;/p>
&lt;h3 id="example-1">Example 1&lt;/h3>
&lt;p>$\boldsymbol{x}=\left[\begin{array}{c}{x_1} \\\\ {x_2}\end{array}\right] \in \mathbb{R}^{2}$&lt;/p>
&lt;p>$\phi: \mathbb{R}^2 \to \mathbb{R}^3, \left[\begin{array}{c}{x_1} \\\\ {x_2}\end{array}\right] \mapsto \left[\begin{array}{c}{1} \\\\ {x_1} \\\\ {x_2}\end{array}\right]\qquad $(I.e.: $\phi_1(\boldsymbol{x}) = 1, \phi_2(\boldsymbol{x}) = x_1, \phi_3(\boldsymbol{x}) = x_2$)&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/Pol_Reg_Example_2.png" alt="Pol_Reg_Example_2" style="zoom:50%;" />
&lt;h3 id="example-2">Example 2&lt;/h3>
&lt;p>$\boldsymbol{x}=\left[\begin{array}{c}{x_1} \\\\ {x_2}\end{array}\right] \in \mathbb{R}^{2}$&lt;/p>
&lt;p>$\phi: \mathbb{R}^2 \to \mathbb{R}^5, \left[\begin{array}{c}{x_1} \\\\ {x_2}\end{array}\right] \mapsto \left[\begin{array}{c}{1} \\\\ {x_1} \\\\ {x_2} \\\\ {x_1^2} \\\\ {x_2^2}\end{array}\right]$&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/Pol_Reg_Example_3.png" alt="Pol_Reg_Example_3" style="zoom:50%;" />
&lt;h3 id="optimal-value-of-boldsymbolw">Optimal value of $\boldsymbol{w}$&lt;/h3>
&lt;p>$
\boldsymbol{w}^{*}=\left(\boldsymbol{\Phi}^{T} \boldsymbol{\Phi}\right)^{-1} \boldsymbol{\Phi}^{T} \boldsymbol{y}, \qquad \boldsymbol{\Phi}=\left[\begin{array}{c}{\boldsymbol{\phi}(\boldsymbol{x}\_{1})^{T}} \\\\ {\vdots} \\\\ {\boldsymbol{\phi}(\boldsymbol{x}\_{n})^{T}}\end{array}\right]
$&lt;/p>
&lt;p>&lt;em>(The same as in Linear Regression, except that the data matrix is replaced by the basis function matrix)&lt;/em>&lt;/p>
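&lt;p>A minimal numpy sketch of this closed-form solution, fitting a degree-3 polynomial to noisy sine data (toy setup, illustrative only):&lt;/p>

```python
import numpy as np

# Toy data: noisy samples of sin(x)
rng = np.random.default_rng(0)
x = np.linspace(0.0, 2.0 * np.pi, 30)
y = np.sin(x) + 0.1 * rng.standard_normal(x.shape)

# Basis function matrix Phi: phi(x) = [1, x, x^2, x^3]
Phi = np.vander(x, N=4, increasing=True)

# w* = (Phi^T Phi)^{-1} Phi^T y, solved as a linear system
# (numerically preferable to forming the inverse explicitly)
w_star = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)
y_hat = Phi @ w_star
```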
&lt;h2 id="challenge-of-polynomial-regression-overfitting">Challenge of Polynomial Regression: Overfitting&lt;/h2>
&lt;img src="https://i0.wp.com/csmoon-ml.com/wp-content/uploads/2019/02/Screen-Shot-2019-02-19-at-11.06.04-AM.png?fit=640%2C213" style="zoom:100%; background-color:white">
&lt;p>Reason:
&lt;strong>The model is too complex&lt;/strong> (the degree of the polynomial is too high!). It fits the noise and behaves unpredictably between the training points.😭&lt;/p>
&lt;p>Solution: Regularization&lt;/p>
&lt;h2 id="regularization">Regularization&lt;/h2>
&lt;p>&lt;em>Regularization: Constrain a model to make it simpler and reduce the risk of overfitting.&lt;/em>&lt;/p>
&lt;p>💡 &lt;strong>Avoid overfitting by forcing the weights $\boldsymbol{w}$ to be small&lt;/strong>&lt;/p>
&lt;blockquote>
&lt;p>Assume that our model has degree 3 ($x^1, x^2, x^3$), and the corresponding parameters/weights are $w_1, w_2, w_3$. If we force $w_3=0$, then $w_3 x^3 = 0$, meaning that the model now has only degree 2. In other words, the model is simpler.&lt;/p>
&lt;/blockquote>
&lt;p>In general, a regularized model has the following cost/objective function:
&lt;/p>
$$
\underbrace{E\_D(\boldsymbol{w})}\_{\text{Data term}} + \underbrace{\lambda E\_W(\boldsymbol{w})}\_{\text{Regularization term}}
$$
&lt;p>
$\lambda$: regularization factor (a hyperparameter that needs to be tuned manually), controls how much you want to regularize the model.&lt;/p>
&lt;h3 id="regularized-least-squares-ridge-regression">Regularized Least Squares (Ridge Regression)&lt;/h3>
&lt;p>Consists of:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Sum of Squared Errors (SSE)&lt;/strong> function&lt;/li>
&lt;li>quadratic regulariser ($L_2$-Regularization)&lt;/li>
&lt;/ul>
&lt;p>$
\begin{aligned}
L_{\text {ridge }}
&amp;= \mathbf{SSE} + \lambda \|\boldsymbol{w}\|^2 \\\\
&amp;= (\boldsymbol{y}-\boldsymbol{\Phi} \boldsymbol{w})^{T}(\boldsymbol{y}-\boldsymbol{\Phi} \boldsymbol{w})+\lambda \boldsymbol{w}^{T} \boldsymbol{w}
\end{aligned}
$&lt;/p>
&lt;p>Solution:&lt;/p>
&lt;p>$\boldsymbol{w}_{\mathrm{ridge}}^{*}=\left(\boldsymbol{\Phi}^{T} \boldsymbol{\Phi}+\lambda \boldsymbol{I}\right)^{-1} \boldsymbol{\Phi}^{T} \boldsymbol{y}$&lt;/p>
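&lt;p>A small numpy sketch of the ridge solution (toy data; the $\lambda$ values are chosen only to illustrate the shrinkage effect):&lt;/p>

```python
import numpy as np

def ridge_fit(Phi, y, lam):
    """Closed-form ridge solution: w* = (Phi^T Phi + lam I)^{-1} Phi^T y."""
    d = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(d), Phi.T @ y)

# Toy data: degree-9 polynomial features on a few noisy points
rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 10)
y = np.sin(2.0 * np.pi * x) + 0.1 * rng.standard_normal(x.shape)
Phi = np.vander(x, N=10, increasing=True)

# A larger regularization factor forces the weights to be smaller
w_weak = ridge_fit(Phi, y, lam=0.01)
w_strong = ridge_fit(Phi, y, lam=10.0)
```

&lt;p>With $\lambda = 0$ this reduces to the unregularized least-squares solution.&lt;/p>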
&lt;ul>
&lt;li>$\boldsymbol{I}$: Identity matrix&lt;/li>
&lt;li>For $\lambda > 0$, $\left(\boldsymbol{\Phi}^{T} \boldsymbol{\Phi}+\lambda \boldsymbol{I}\right)$ is full rank (positive definite) and can be easily inverted&lt;/li>
&lt;/ul></description></item><item><title>Kernelized Ridge Regression</title><link>https://haobin-tan.netlify.app/docs/ai/machine-learning/regression/kernelized-ridge-regression/</link><pubDate>Mon, 13 Jul 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/machine-learning/regression/kernelized-ridge-regression/</guid><description>&lt;h2 id="kernel-regression">Kernel regression&lt;/h2>
&lt;h3 id="kernel-identities">Kernel identities&lt;/h3>
&lt;p>Let&lt;/p>
$$
\boldsymbol{\Phi}\_{X}=\left[\begin{array}{c}
\boldsymbol{\phi}\left(\boldsymbol{x}\_{1}\right)^{T} \\\\
\vdots \\\\
\boldsymbol{\phi}\left(\boldsymbol{x}\_{N}\right)^{T}
\end{array}\right] \in \mathbb{R}^{N \times d} , \qquad \left( \boldsymbol{\Phi}\_{X}^T = \left[ \boldsymbol{\phi}(x\_1), \dots, \boldsymbol{\phi}(x\_N)\right] \in \mathbb{R}^{d \times N} \right)
$$
&lt;p>then the following identities hold:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Kernel matrix&lt;/strong>&lt;/p>
$$
\boldsymbol{K}=\boldsymbol{\Phi}\_{X} \boldsymbol{\Phi}\_{X}^{T}
$$
&lt;p>with&lt;/p>
$$
[\boldsymbol{K}]\_{ij}=\boldsymbol{\phi}\left(\boldsymbol{x}\_{i}\right)^{T} \boldsymbol{\phi}(\boldsymbol{x}\_{j}) = \langle \boldsymbol{\phi}(\boldsymbol{x}\_{i}), \boldsymbol{\phi}(\boldsymbol{x}\_{j}) \rangle = k\left(\boldsymbol{x}\_{i}, \boldsymbol{x}\_{j}\right)
$$
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Kernel vector&lt;/strong>&lt;/p>
$$
\boldsymbol{k}\left(\boldsymbol{x}^{\*}\right)=\left[\begin{array}{c}
k\left(\boldsymbol{x}\_{1}, \boldsymbol{x}^{\*}\right) \\\\
\vdots \\\\
k\left(\boldsymbol{x}\_{N}, \boldsymbol{x}^{\*}\right)
\end{array}\right]=\left[\begin{array}{c}
\boldsymbol{\phi}\left(\boldsymbol{x}\_{1}\right)^{T} \boldsymbol{\phi}(\boldsymbol{x}^{\*}) \\\\
\vdots \\\\
\phi\left(\boldsymbol{x}\_{N}\right)^{T} \boldsymbol{\phi}(\boldsymbol{x}^{\*})
\end{array}\right]=\boldsymbol{\Phi}\_{X} \boldsymbol{\phi}\left(\boldsymbol{x}^{\*}\right)
$$
&lt;/li>
&lt;/ul>
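&lt;p>These identities are easy to check numerically; a small sketch with a Gaussian kernel (toy inputs, illustrative only):&lt;/p>

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    """k(a, b) = exp(-||a - b||^2 / (2 sigma^2)), for all row pairs of A, B."""
    sq_dists = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

# N = 4 toy inputs in R^2
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

K = gaussian_kernel(X, X)             # kernel matrix K, N x N
x_star = np.array([[0.5, 0.5]])
k_vec = gaussian_kernel(X, x_star)    # kernel vector k(x*), N x 1
```

&lt;p>$\boldsymbol{K}$ is symmetric with ones on the diagonal, since $k(\boldsymbol{x}, \boldsymbol{x}) = 1$ for the Gaussian kernel.&lt;/p>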
&lt;h3 id="kernel-ridge-regression">Kernel Ridge Regression&lt;/h3>
&lt;p>Ridge Regression: (See also: &lt;a href="https://haobin-tan.netlify.app/docs/ai/machine-learning/regression/polynomial-regression/">Polynomial Regression (Generalized linear regression models)&lt;/a>)&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Squared error function + $L_2$ regularization&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Linear feature space&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;span style="color:red">&lt;strong>Not directly applicable in infinite dimensional feature spaces&lt;/strong>&lt;/span>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Objective:&lt;/strong>&lt;/p>
$$
L_{\text {ridge }}=\underbrace{(\boldsymbol{y}-\boldsymbol{\Phi} \boldsymbol{w})^{T}(\boldsymbol{y}-\boldsymbol{\Phi} \boldsymbol{w})}_{\text {sum of squared errors }}+\lambda \underbrace{\boldsymbol{w}^{T} \boldsymbol{w}}_{L_{2} \text { regularization }}
$$
&lt;ul>
&lt;li>$\boldsymbol{\Phi}\_{X}=\left[\begin{array}{c}
\phi\left(\boldsymbol{x}\_{1}\right)^{T} \\\\
\vdots \\\\
\phi\left(\boldsymbol{x}\_{N}\right)^{T}
\end{array}\right] \in \mathbb{R}^{N \times d}$&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Solution&lt;/p>
$$
\boldsymbol{w}\_{\text {ridge }}^{*}= \color{red}{\underbrace{\left(\boldsymbol{\Phi}^{T} \boldsymbol{\Phi}+\lambda \boldsymbol{I}\right)^{-1}}\_{d \times d \text { matrix inversion }}} \boldsymbol{\Phi}^{T} \boldsymbol{y}
$$
&lt;p>&lt;span style="color:red">Matrix inversion &lt;strong>infeasible&lt;/strong> in &lt;strong>infinite&lt;/strong> dimensions!!!😭&lt;/span>&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h4 id="apply-kernel-trick">Apply kernel trick&lt;/h4>
&lt;p>Rewrite solution as &lt;strong>inner products&lt;/strong> of the feature space with the following matrix identity&lt;/p>
$$
(\boldsymbol{I} + \boldsymbol{A}\boldsymbol{B})^{-1}\boldsymbol{A} = \boldsymbol{A} (\boldsymbol{I} + \boldsymbol{B}\boldsymbol{A})^{-1}
$$
&lt;p>Then we get&lt;/p>
$$
\begin{array}{ll}
\boldsymbol{w}\_{\text {ridge }}^{*}
&amp;= \color{red}{\underbrace{\left(\boldsymbol{\Phi}^{T} \boldsymbol{\Phi}+\lambda \boldsymbol{I}\right)^{-1}}\_{d \times d \text { matrix inversion }}} \boldsymbol{\Phi}^{T} \boldsymbol{y} \\\\
&amp; \overset{}{=} \boldsymbol{\Phi}^{T} \color{LimeGreen}{\underbrace{\left( \boldsymbol{\Phi}\boldsymbol{\Phi}^{T} +\lambda \boldsymbol{I}\right)^{-1}}\_{N \times N \text { matrix inversion }}} \boldsymbol{y} \\\\
&amp;= \boldsymbol{\Phi}^{T} \underbrace{\left( \boldsymbol{K} +\lambda \boldsymbol{I}\right)^{-1}\boldsymbol{y}}_{=: \boldsymbol{\alpha}} \\\\
&amp;= \boldsymbol{\Phi}^{T} \boldsymbol{\alpha}
\end{array}
$$
&lt;ul>
&lt;li>beneficial for $d \gg N$&lt;/li>
&lt;li>&lt;strong>Still, $\boldsymbol{w}^\* \in \mathbb{R}^d$ is potentially infinite-dimensional and cannot be represented explicitly&lt;/strong>&lt;/li>
&lt;/ul>
&lt;p>Yet, we can still evaluate the function $f$ without the explicit representation of $\boldsymbol{w}^*$ 😉&lt;/p>
$$
\begin{array}{ll}
f(\boldsymbol{x})
&amp; =\boldsymbol{\phi}(\boldsymbol{x})^{T} \boldsymbol{w}^{*} \\\\
&amp; \overset{}{=}\boldsymbol{\phi}(\boldsymbol{x})^{T} \boldsymbol{\Phi}^{T} \boldsymbol{\alpha} \\\\
&amp; \overset{\text{kernel} \\ \text{trick}}{=}\boldsymbol{k}(\boldsymbol{x})^{T} \boldsymbol{\alpha} \\\\
&amp; =\sum\_{i} \alpha\_{i} k\left(\boldsymbol{x}\_{i}, \boldsymbol{x}\right)
\end{array}
$$
&lt;p>For a &lt;strong>Gaussian kernel&lt;/strong>&lt;/p>
$$
f(\boldsymbol{x})=\sum_{i} \alpha_{i} k\left(\boldsymbol{x}_{i}, \boldsymbol{x}\right)=\sum_{i} \alpha_{i} \exp \left(-\frac{\left\|\boldsymbol{x}-\boldsymbol{x}_{i}\right\|^{2}}{2 \sigma^{2}}\right)
$$
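&lt;p>Putting the pieces together, a compact numpy sketch of kernel ridge regression with a Gaussian kernel (toy 1-D data; $\lambda$ and $\sigma$ picked by hand):&lt;/p>

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    sq_dists = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def kernel_ridge_fit(X, y, lam, sigma):
    """alpha = (K + lam I)^{-1} y -- an N x N system, independent of d."""
    K = gaussian_kernel(X, X, sigma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def kernel_ridge_predict(X_train, alpha, X_new, sigma):
    """f(x) = k(x)^T alpha = sum_i alpha_i k(x_i, x)."""
    return gaussian_kernel(X_new, X_train, sigma) @ alpha

# Toy 1-D regression: learn sin(x) from 20 samples
X = np.linspace(0.0, 2.0 * np.pi, 20).reshape(-1, 1)
y = np.sin(X).ravel()
alpha = kernel_ridge_fit(X, y, lam=1e-3, sigma=1.0)
y_hat = kernel_ridge_predict(X, alpha, X, sigma=1.0)
```

&lt;p>Note that the feature map $\phi$ never appears explicitly: only kernel evaluations are needed.&lt;/p>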
&lt;h4 id="select-hyperparameter">Select hyperparameter&lt;/h4>
&lt;p>Kernel parameters, such as the bandwidth parameter $\sigma$ in the Gaussian kernel&lt;/p>
$$
k(\boldsymbol{x}, \boldsymbol{y})=\exp \left(-\frac{\|\boldsymbol{x}-\boldsymbol{y}\|^{2}}{2 \sigma^{2}}\right)
$$
&lt;p>are called &lt;strong>hyperparameters&lt;/strong>.&lt;/p>
&lt;p>How to choose? &lt;strong>Cross validation!&lt;/strong>&lt;/p>
&lt;p>Example:&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/image-20200305164457118.png" alt="image-20200305164457118">&lt;/p>
&lt;h3 id="summary-kernel-ridge-regression">Summary: kernel ridge regression&lt;/h3>
&lt;p>The solution for kernel ridge regression is given by
&lt;/p>
$$
f^{*}(\boldsymbol{x})=\boldsymbol{k}(\boldsymbol{x})^{T}(\boldsymbol{K}+\lambda \boldsymbol{I})^{-1} \boldsymbol{y}
$$
&lt;ul>
&lt;li>&lt;span style="color:LimeGreen">No evaluation of the feature vectors needed&lt;/span> 👏&lt;/li>
&lt;li>&lt;span style="color:LimeGreen">Only pair-wise scalar products (evaluated by the kernel)&lt;/span> 👏&lt;/li>
&lt;li>&lt;span style="color:red">Need to invert a &lt;/span> $\color{red}{N \times N}$ &lt;span style="color:red">matrix (can be costly)&lt;/span> 🤪&lt;/li>
&lt;/ul>
&lt;p>‼️&lt;strong>Note&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Have to store &lt;strong>all samples&lt;/strong> in kernel-based methods&lt;/p>
&lt;ul>
&lt;li>Computationally expensive (matrix inverse is $O(n^{2.376})$) !&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Hyperparameters of the method are given by the kernel-parameters&lt;/p>
&lt;ul>
&lt;li>Can be optimized on &lt;strong>validation-set&lt;/strong>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Very flexible function representation, only few hyper-parameters&lt;/strong> 👍&lt;/p>
&lt;/li>
&lt;/ul></description></item><item><title>Classification And Regression Tree (CART)</title><link>https://haobin-tan.netlify.app/docs/ai/machine-learning/decision-tree/cart/</link><pubDate>Tue, 27 Oct 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/machine-learning/decision-tree/cart/</guid><description>&lt;h2 id="tree-based-methods">Tree-based Methods&lt;/h2>
&lt;p>&lt;strong>CART&lt;/strong>: &lt;strong>C&lt;/strong>lassification &lt;strong>A&lt;/strong>nd &lt;strong>R&lt;/strong>egression &lt;strong>T&lt;/strong>ree&lt;/p>
&lt;h3 id="grow-a-binary-tree">Grow a binary tree&lt;/h3>
&lt;ul>
&lt;li>At each node, “split” the data into two “daughter” nodes.&lt;/li>
&lt;li>Splits are chosen using a splitting criterion.&lt;/li>
&lt;li>Bottom nodes are “terminal” nodes.&lt;/li>
&lt;/ul>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>&lt;/th>
&lt;th>Type of tree&lt;/th>
&lt;th>Predicted value at a node&lt;/th>
&lt;th>Split criterion&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>&lt;strong>Regression&lt;/strong>&lt;/td>
&lt;td>Regression tree&lt;/td>
&lt;td>The predicted value at a node is the &lt;strong>average response&lt;/strong> variable for all observations in the node&lt;/td>
&lt;td>&lt;strong>Minimum residual sum of squares&lt;/strong> &lt;br />$$\mathrm{RSS}=\sum_{\text {left }}\left(y_{i}-\bar{y}_{L}\right)^{2}+\sum_{\text {right }}\left(y_{i}-\bar{y}_{R}\right)^{2}$$&lt;li />$\bar{y}_L$ / $\bar{y}_R$: average label values in the left / right subtree &lt;br />(Split such that variance in subtrees is minimized)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Classification&lt;/strong>&lt;/td>
&lt;td>Decision tree&lt;/td>
&lt;td>The predicted class is the &lt;strong>most common class&lt;/strong> in the node (majority vote).&lt;/td>
&lt;td>&lt;strong>Minimum entropy&lt;/strong> in subtrees&lt;br />$$\text { score }=N_{L} H\left(p_{\mathrm{L}}\right)+N_{R} H\left(p_{\mathrm{R}}\right)$$&lt;li />$H\left(p_{L}\right)=-\sum_{k} p_{L}(k) \log p_{L}(k)$: entropy in the left sub-tree &lt;li /> $p_L(k)$: proportion of class $k$ in left tree&lt;br />(Split such that class-labels in sub-trees are &amp;ldquo;pure&amp;rdquo;)&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
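&lt;p>A minimal sketch of the regression split criterion, searching thresholds on a single feature for the minimum-RSS split (toy data; a full CART implementation would recurse over many features and nodes):&lt;/p>

```python
import numpy as np

def best_split(x, y):
    """Return the threshold on feature x minimizing
    RSS = sum_left (y_i - mean_L)^2 + sum_right (y_i - mean_R)^2."""
    best_t, best_rss = None, np.inf
    for t in np.unique(x)[:-1]:        # candidate thresholds
        right = x > t
        y_left, y_right = y[~right], y[right]
        rss = (np.sum((y_left - y_left.mean()) ** 2)
               + np.sum((y_right - y_right.mean()) ** 2))
        if best_rss > rss:
            best_t, best_rss = t, rss
    return best_t, best_rss

# Toy data with a clear jump between x = 3 and x = 4
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.0, 1.1, 0.9, 5.0, 5.1, 4.9])
t, rss = best_split(x, y)
print(t)  # 3.0 -- splits the two plateaus apart
```

&lt;p>The classification case is analogous, with the RSS score replaced by the weighted entropy score from the table.&lt;/p>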
&lt;h3 id="when-stop">When stop?&lt;/h3>
&lt;p>&lt;strong>Stop if:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Minimum number of samples per node&lt;/li>
&lt;li>Maximum depth&lt;/li>
&lt;/ul>
&lt;p>&amp;hellip; has been reached&lt;/p>
&lt;p>(Both criteria again influence the &lt;strong>complexity&lt;/strong> of the tree)&lt;/p>
&lt;h3 id="controlling-the-tree-complexity">Controlling the tree complexity&lt;/h3>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Number of samples per leaf&lt;/th>
&lt;th>Effect&lt;/th>
&lt;th>&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>&lt;strong>Small&lt;/strong>&lt;/td>
&lt;td>Tree is &lt;strong>very sensitive&lt;/strong> to noise&lt;/td>
&lt;td>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/屏幕快照%202020-03-01%2023.26.23.png" alt="屏幕快照 2020-03-01 23.26.23" style="zoom:33%;" />&lt;br />&lt;img src="https://github.com/EckoTan0804/upic-repo/blob/master/uPic/%E5%B1%8F%E5%B9%95%E5%BF%AB%E7%85%A7%202020-03-01%2023.25.40.png?raw=true" alt="屏幕快照 2020-03-01 23.25.40.png" style="zoom:33%;" />&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Large&lt;/strong>&lt;/td>
&lt;td>Tree is &lt;strong>not expressive enough&lt;/strong>&lt;/td>
&lt;td>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/屏幕快照%202020-03-01%2023.25.50.png" alt="屏幕快照 2020-03-01 23.25.50" style="zoom:33%;" />&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h3 id="advantages-">Advantages 👍&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>Applicable to both regression and classification problems.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Handle categorical predictors naturally.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Computationally simple and quick to fit, even for large problems.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>No formal distributional assumptions (non-parametric).&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Can handle highly non-linear interactions and classification boundaries.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Automatic variable selection.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Very easy to interpret if the tree is small.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h3 id="disadvantages-">Disadvantages 👎&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>&lt;em>&lt;strong>Accuracy&lt;/strong>&lt;/em>&lt;/p>
&lt;p>Current methods, such as support vector machines and ensemble classifiers, often have 30% lower error rates than CART.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;em>&lt;strong>Instability&lt;/strong>&lt;/em>&lt;/p>
&lt;p>If we change the data a little, the tree picture can change a lot, so the interpretation is not as straightforward as it appears.&lt;/p>
&lt;/li>
&lt;/ul></description></item></channel></rss>