<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Model Selection | Haobin Tan</title><link>https://haobin-tan.netlify.app/tags/model-selection/</link><atom:link href="https://haobin-tan.netlify.app/tags/model-selection/index.xml" rel="self" type="application/rss+xml"/><description>Model Selection</description><generator>Hugo Blox Builder (https://hugoblox.com)</generator><language>en-us</language><lastBuildDate>Mon, 07 Sep 2020 00:00:00 +0000</lastBuildDate><image><url>https://haobin-tan.netlify.app/media/icon_hu7d15bc7db65c8eaf7a4f66f5447d0b42_15095_512x512_fill_lanczos_center_3.png</url><title>Model Selection</title><link>https://haobin-tan.netlify.app/tags/model-selection/</link></image><item><title>Model Selection</title><link>https://haobin-tan.netlify.app/docs/ai/machine-learning/model-selection/</link><pubDate>Mon, 07 Sep 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/machine-learning/model-selection/</guid><description/></item><item><title>Objective Function</title><link>https://haobin-tan.netlify.app/docs/ai/machine-learning/model-selection/objective-function/</link><pubDate>Mon, 06 Jul 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/machine-learning/model-selection/objective-function/</guid><description>&lt;h2 id="how-does-the-objective-function-look-like">How does the objective function look like?&lt;/h2>
&lt;p>Objective function:&lt;/p>
$$
\operatorname{Obj}(\Theta)= \overbrace{L(\Theta)}^{\text {Training Loss}} + \underbrace{\Omega(\Theta)}_{\text{Regularization}}
$$
&lt;ul>
&lt;li>
&lt;p>Training loss: measures how well the model fits the training data (a code sketch combining both terms follows this list)
&lt;/p>
$$
L=\sum_{i=1}^{n} l\left(y_{i}, \hat{y}_{i}\right)
$$
&lt;ul>
&lt;li>Square loss:
$$
l(y_i, \hat{y}_i) = (y_i - \hat{y}_i)^2
$$&lt;/li>
&lt;li>Logistic loss:
$$
l(y_i, \hat{y}_i) = y_i \log(1 + e^{-\hat{y}_i}) + (1 - y_i) \log(1 + e^{\hat{y}_i})
$$&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Regularization: measures how complicated the model is&lt;/p>
&lt;ul>
&lt;li>$L_2$ norm (Ridge): $\Omega(w) = \lambda \|w\|_2^2$&lt;/li>
&lt;li>$L_1$ norm (Lasso): $\Omega(w) = \lambda \|w\|_1$&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
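&lt;p>As a concrete sketch of the two terms (a minimal example assuming NumPy; the function and variable names are illustrative, not from a particular library):&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np

def objective(w, X, y, lam):
    """Obj(w) = training loss L(w) + regularization Omega(w)."""
    y_hat = X @ w                       # linear model predictions
    loss = np.sum((y - y_hat) ** 2)     # squared training loss
    penalty = lam * np.sum(w ** 2)      # L2 penalty: lambda * ||w||^2
    return loss + penalty

rng = np.random.default_rng(0)
X, y, w = rng.normal(size=(50, 3)), rng.normal(size=50), rng.normal(size=3)
print(objective(w, X, y, lam=0.1))
&lt;/code>&lt;/pre>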
&lt;style type="text/css">
.tg {border-collapse:collapse;border-spacing:0;}
.tg td{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
overflow:hidden;padding:10px 5px;word-break:normal;}
.tg th{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
font-weight:normal;overflow:hidden;padding:10px 5px;word-break:normal;}
.tg .tg-0pky{border-color:inherit;text-align:left;vertical-align:top}
.tg .tg-fymr{border-color:inherit;font-weight:bold;text-align:left;vertical-align:top}
&lt;/style>
&lt;table class="tg">
&lt;thead>
&lt;tr>
&lt;th class="tg-0pky">&lt;/th>
&lt;th class="tg-fymr">Objective Function&lt;/th>
&lt;th class="tg-fymr">Linear model?&lt;/th>
&lt;th class="tg-fymr">Loss&lt;/th>
&lt;th class="tg-fymr">Regularization&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td class="tg-fymr">Ridge regression&lt;/td>
&lt;td class="tg-0pky">$\sum_{i=1}^{n}\left(y_{i}-w^{\top} x_{i}\right)^{2}+\lambda\|w\|^{2}$&lt;/td>
&lt;td class="tg-0pky">✅&lt;/td>
&lt;td class="tg-0pky">square&lt;/td>
&lt;td class="tg-0pky">$L_2$&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td class="tg-fymr">Lasso regression&lt;/td>
&lt;td class="tg-0pky">$\sum_{i=1}^{n}\left(y_{i}-w^{\top} x_{i}\right)^{2}+\lambda\|w\|$&lt;/td>
&lt;td class="tg-0pky">✅&lt;/td>
&lt;td class="tg-0pky">square&lt;/td>
&lt;td class="tg-0pky">$L_1$&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td class="tg-fymr">Logistic regression&lt;/td>
&lt;td class="tg-0pky">$\sum_{i=1}^{n}\left[y_{i} \cdot \ln \left(1+e^{-w^{\top} x_{i}}\right)+\left(1-y_{i}\right) \cdot \ln \left(1+e^{w^{\top} x_{i}}\right)\right]+\lambda\|w\|^{2}$&lt;/td>
&lt;td class="tg-0pky">✅&lt;/td>
&lt;td class="tg-0pky">logistic&lt;/td>
&lt;td class="tg-0pky">$L_2$&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
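&lt;p>Each row of the table maps to an off-the-shelf estimator. A minimal scikit-learn sketch (assuming scikit-learn is installed; its &lt;code>alpha&lt;/code> plays the role of $\lambda$ up to internal scaling constants, and &lt;code>C&lt;/code> is the inverse of $\lambda$ for logistic regression):&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np
from sklearn.linear_model import Ridge, Lasso, LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=100)

ridge = Ridge(alpha=1.0).fit(X, y)    # squared loss + L2 penalty
lasso = Lasso(alpha=0.1).fit(X, y)    # squared loss + L1 penalty
logreg = LogisticRegression(C=1.0).fit(X, (y > 0).astype(int))  # logistic loss + L2

print(ridge.coef_, lasso.coef_, logreg.coef_, sep="\n")
&lt;/code>&lt;/pre>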
&lt;h2 id="why-do-we-want-to-contain-two-component-in-the-objective">Why do we want to contain two component in the objective?&lt;/h2>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Optimizing training loss encourages predictive models&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;em>Fitting the training data well at least gets you close to the training distribution, which is hopefully close to the underlying distribution&lt;/em>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Optimizing regularization encourages simple models&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;em>Simpler models tend to have smaller variance in future predictions, making predictions stable&lt;/em>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul></description></item><item><title>Bias Variance Tradeoff</title><link>https://haobin-tan.netlify.app/docs/ai/machine-learning/model-selection/bias-variance-tradeoff/</link><pubDate>Mon, 06 Jul 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/machine-learning/model-selection/bias-variance-tradeoff/</guid><description>&lt;h2 id="tldr">TL;DR&lt;/h2>
&lt;style type="text/css">
.tg {border-collapse:collapse;border-spacing:0;}
.tg td{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
overflow:hidden;padding:10px 5px;word-break:normal;}
.tg th{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
font-weight:normal;overflow:hidden;padding:10px 5px;word-break:normal;}
.tg .tg-0pky{border-color:inherit;text-align:left;vertical-align:top}
.tg .tg-fymr{border-color:inherit;font-weight:bold;text-align:left;vertical-align:top}
&lt;/style>
&lt;table class="tg">
&lt;thead>
&lt;tr>
&lt;th class="tg-0pky">&lt;/th>
&lt;th class="tg-fymr">Resaon&lt;/th>
&lt;th class="tg-fymr">Example&lt;/th>
&lt;th class="tg-fymr">affect&lt;/th>
&lt;th class="tg-fymr">Model's complexity ⬆️&lt;/th>
&lt;th class="tg-fymr">Model's complexity ⬇️&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td class="tg-0pky">Bias&lt;/td>
&lt;td class="tg-0pky">wrong assumption&lt;/td>
&lt;td class="tg-0pky">assume a quadratic model to be linear&lt;/td>
&lt;td class="tg-0pky">underfitting&lt;/td>
&lt;td class="tg-0pky">⬇️&lt;/td>
&lt;td class="tg-0pky">⬆️&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td class="tg-0pky">Variance&lt;/td>
&lt;td class="tg-0pky">excessive sensitivity to small variations&lt;/td>
&lt;td class="tg-0pky">high-degree polynomial model&lt;/td>
&lt;td class="tg-0pky">overfitting&lt;/td>
&lt;td class="tg-0pky">⬆️&lt;/td>
&lt;td class="tg-0pky">⬇️&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td class="tg-0pky">Inreducible error&lt;/td>
&lt;td class="tg-0pky">noisy data&lt;/td>
&lt;td class="tg-0pky">&lt;/td>
&lt;td class="tg-0pky">&lt;/td>
&lt;td class="tg-0pky">&lt;/td>
&lt;td class="tg-0pky">&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/image-20200120105846503.png" alt="image-20200120105846503" style="zoom:50%;" />
&lt;h2 id="explaination">Explaination&lt;/h2>
&lt;p>A model’s generalization error can be expressed as the sum of three very different errors:&lt;/p>
&lt;h3 id="bias">Bias&lt;/h3>
&lt;p>This part of the generalization error is due to &lt;strong>wrong assumptions&lt;/strong>, such as assuming that the data is linear when it is actually quadratic.
A high-bias model is most likely to &lt;strong>underfit&lt;/strong> the training data.&lt;/p>
&lt;h3 id="variance">Variance&lt;/h3>
&lt;p>This part is due to the model’s &lt;strong>excessive sensitivity to small variations&lt;/strong> in the training data. &lt;br>
A model with many degrees of freedom (such as a high-degree polynomial model) is likely to have &lt;strong>high variance&lt;/strong>, and thus to &lt;strong>overfit&lt;/strong> the training data.&lt;/p>
&lt;h3 id="irreducible-error">Irreducible Error&lt;/h3>
&lt;p>This part is due to the &lt;strong>noisiness of the data&lt;/strong> itself.
The only way to reduce this part of the error is to &lt;strong>clean up the data&lt;/strong> (e.g., fix the data sources, such as broken sensors, or detect and remove outliers).&lt;/p>
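&lt;p>A small simulation makes the decomposition concrete (a sketch; the quadratic ground truth, noise level, and polynomial degrees are illustrative assumptions): refit models of varying complexity on many resampled training sets and measure squared bias and variance of their predictions at fixed test points.&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np

rng = np.random.default_rng(0)
f = lambda x: x ** 2                     # true (quadratic) function
x_test = np.linspace(-1, 1, 50)

def fit_predict(degree):
    """Fit a polynomial of the given degree on one noisy training set."""
    x = rng.uniform(-1, 1, 30)
    y = f(x) + rng.normal(0, 0.1, 30)    # irreducible noise
    return np.polyval(np.polyfit(x, y, degree), x_test)

for degree in (1, 2, 10):                # underfit, about right, overfit
    preds = np.array([fit_predict(degree) for _ in range(200)])
    bias2 = np.mean((preds.mean(axis=0) - f(x_test)) ** 2)
    var = preds.var(axis=0).mean()
    print(f"degree={degree}: bias^2={bias2:.4f}, variance={var:.4f}")
&lt;/code>&lt;/pre>
&lt;p>Degree 1 should show high bias (it cannot represent the quadratic), while degree 10 should show high variance (its fit swings with each resampled training set).&lt;/p>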
&lt;style type="text/css">
.tg {border-collapse:collapse;border-spacing:0;}
.tg td{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
overflow:hidden;padding:10px 5px;word-break:normal;}
.tg th{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
font-weight:normal;overflow:hidden;padding:10px 5px;word-break:normal;}
.tg .tg-0pky{border-color:inherit;text-align:left;vertical-align:top}
&lt;/style>
&lt;table class="tg">
&lt;thead>
&lt;tr>
&lt;th class="tg-0pky">&lt;/th>
&lt;th class="tg-0pky">High bias&lt;/th>
&lt;th class="tg-0pky">Low bias&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td class="tg-0pky">High variance&lt;/td>
&lt;td class="tg-0pky">something is terribly wrong! 😭&lt;/td>
&lt;td class="tg-0pky">Overfitting&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td class="tg-0pky">Low variance&lt;/td>
&lt;td class="tg-0pky">Underfitting&lt;/td>
&lt;td class="tg-0pky">too good to be true! 🤪&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table></description></item><item><title>Cross Validation</title><link>https://haobin-tan.netlify.app/docs/ai/machine-learning/model-selection/cross-validation/</link><pubDate>Mon, 06 Jul 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/machine-learning/model-selection/cross-validation/</guid><description>&lt;img src="https://scikit-learn.org/stable/_images/grid_search_cross_validation.png" style="zoom:60%; background-color:white">
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>&lt;/th>
&lt;th>How it works?&lt;/th>
&lt;th style="text-align:center">Illustration&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>&lt;strong>K-fold&lt;/strong>&lt;/td>
&lt;td>1. Create a $k$-fold partition of the dataset&lt;br />2. Estimate $k$ hold-out predictors, each using $1$ fold as the validation set and the remaining $k-1$ folds as the training set&lt;/td>
&lt;td style="text-align:center">&lt;br />&lt;img src="https://miro.medium.com/max/5535/1*QDH0DSCecArPmzQtEBh0yg.png" alt="img" style="zoom: 20%; background-color:white" />&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Leave-One-Out (LOO)&lt;/strong>&lt;/td>
&lt;td>&lt;strong>(Special case of K-fold with $k=n$)&lt;/strong> &lt;br />Estimate $n$ hold-out predictors, each using $1$ data point as the validation set and the remaining $n-1$ points as the training set&lt;/td>
&lt;td style="text-align:center">&lt;br />&lt;img src="https://miro.medium.com/max/5284/1*9bs3OMsKOJntR8blRnVE9g.png" alt="img" style="zoom:20%; background-color:white" />&lt;br />&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Random sub-sampling&lt;/strong>&lt;/td>
&lt;td>1. Randomly sample $\alpha \cdot n$ data points, $\alpha \in (0,1)$, for validation&lt;br />2. Train on the remaining points and validate; repeat $K$ times&lt;/td>
&lt;td style="text-align:center">&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
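&lt;p>A minimal sketch of K-fold and leave-one-out validation with scikit-learn (assumed available; the ridge model and synthetic data are illustrative):&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=100)

model = Ridge(alpha=1.0)
mse = "neg_mean_squared_error"  # single-sample folds rule out the default R^2 score

# K-fold: k hold-out estimates, one fold held out per round
kfold_cv = KFold(n_splits=5, shuffle=True, random_state=0)
kfold_scores = cross_val_score(model, X, y, cv=kfold_cv, scoring=mse)

# Leave-one-out: the special case k = n
loo_scores = cross_val_score(model, X, y, cv=LeaveOneOut(), scoring=mse)

print(-kfold_scores.mean(), -loo_scores.mean())
&lt;/code>&lt;/pre>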
&lt;h2 id="-explaination">🎥 Explaination&lt;/h2>
&lt;div style="position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;">
&lt;iframe allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen="allowfullscreen" loading="eager" referrerpolicy="strict-origin-when-cross-origin" src="https://www.youtube.com/embed/fSytzGwwBVw?autoplay=0&amp;controls=1&amp;end=0&amp;loop=0&amp;mute=0&amp;start=0" style="position: absolute; top: 0; left: 0; width: 100%; height: 100%; border:0;" title="YouTube video"
>&lt;/iframe>
&lt;/div></description></item></channel></rss>