<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Regression | Haobin Tan</title><link>https://haobin-tan.netlify.app/tags/regression/</link><atom:link href="https://haobin-tan.netlify.app/tags/regression/index.xml" rel="self" type="application/rss+xml"/><description>Regression</description><generator>Hugo Blox Builder (https://hugoblox.com)</generator><language>en-us</language><lastBuildDate>Tue, 27 Oct 2020 00:00:00 +0000</lastBuildDate><image><url>https://haobin-tan.netlify.app/media/icon_hu7d15bc7db65c8eaf7a4f66f5447d0b42_15095_512x512_fill_lanczos_center_3.png</url><title>Regression</title><link>https://haobin-tan.netlify.app/tags/regression/</link></image><item><title>Regression</title><link>https://haobin-tan.netlify.app/docs/ai/machine-learning/regression/</link><pubDate>Mon, 07 Sep 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/machine-learning/regression/</guid><description/></item><item><title>Linear Regression</title><link>https://haobin-tan.netlify.app/docs/ai/machine-learning/regression/linear-regression/</link><pubDate>Mon, 06 Jul 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/machine-learning/regression/linear-regression/</guid><description>&lt;h2 id="linear-regression-model">Linear Regression Model&lt;/h2>
&lt;p>A linear model makes a prediction $\hat{y}_i$ by &lt;strong>simply computing a weighted sum of the input $\boldsymbol{x}_i$, plus a constant $w_0$ called the &lt;em>bias&lt;/em> term&lt;/strong>:&lt;/p>
&lt;h3 id="for-single-sampleinstances">For single sample/instances&lt;/h3>
$$
\hat{y}_i = f \left( \boldsymbol{x}\_i \right) = w_0 + \sum\_{j=1}^{D}w\_{j} x\_{i, j}
$$
&lt;p>In vector form:&lt;/p>
$$
\hat{y}\_{i}=w_{0}+ \displaystyle \sum\_{j=1}^{D} w_{j} x_{i, j}=\tilde{\boldsymbol{x}}\_{i}^{T} \boldsymbol{w}
$$
&lt;ul>
&lt;li>
&lt;p>$\tilde{\boldsymbol{x}}\_{i} = \left[\begin{array}{c}{1} \\\\ {\boldsymbol{x}\_{i}}\end{array}\right] = \left[\begin{array}{c} {1} \\\\ x\_{i, 1} \\\\ \vdots \\\\ {x\_{i, D}}\end{array}\right] \in \mathbb{R}^{D+1}$&lt;/p>
&lt;/li>
&lt;li>
&lt;p>$\boldsymbol{w}=\left[\begin{array}{c}{w\_{0}} \\\\ {\vdots} \\\\ {w\_{D}}\end{array}\right] \in \mathbb{R}^{D+1}$&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h3 id="on-full-dataset">On full dataset&lt;/h3>
$$
\hat{\boldsymbol{y}}=\left[\begin{array}{c}{\hat{y}\_{1}} \\\\{\vdots} \\\\ {\hat{y}\_{n}}\end{array}\right]=\left[\begin{array}{c}{\tilde{\boldsymbol{x}}\_{1}^{T} \boldsymbol{w}} \\\\ {\vdots} \\\\ {\tilde{\boldsymbol{x}}\_{n}^{T} \boldsymbol{w}}\end{array}\right] = \underbrace{\left[\begin{array}{cc}{1} &amp; {\boldsymbol{x}\_{1}^{T}} \\\\ {\vdots} &amp; {\vdots} \\\\ {1} &amp; {\boldsymbol{x}\_{n}^{T}}\end{array}\right]}\_{=: \boldsymbol{X}} \boldsymbol{w} = \boldsymbol{X} \boldsymbol{w}
$$
&lt;ul>
&lt;li>$\hat{\boldsymbol{y}}$: vector containing the output for each sample&lt;/li>
&lt;li>$\boldsymbol{X}$: data matrix whose first column is a vector of ones, accounting for the bias term&lt;/li>
&lt;/ul>
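&lt;p>As a quick illustration (not part of the original notes), the full-dataset prediction $\hat{\boldsymbol{y}} = \boldsymbol{X}\boldsymbol{w}$ can be computed in a few lines of numpy; the data and weights below are made-up toy values:&lt;/p>

```python
import numpy as np

# Toy data: n = 3 samples, D = 2 features (made-up values)
X_raw = np.array([[1.0, 2.0],
                  [3.0, 4.0],
                  [5.0, 6.0]])
w = np.array([0.5, 1.0, 2.0])  # [w_0, w_1, w_2], bias term first

# Build the data matrix X: prepend a column of ones so that
# w_0 acts as the bias term
X = np.hstack([np.ones((X_raw.shape[0], 1)), X_raw])

# X @ w evaluates x_tilde_i^T w for every sample at once
y_hat = X @ w
print(y_hat)  # [ 5.5 11.5 17.5]
```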
&lt;blockquote>
&lt;p>$\hat{\boldsymbol y}=\underbrace{\begin{bmatrix}{\hat y}\_1 \\\\ \vdots\\\\{\hat y}\_n\end{bmatrix}}\_{\in\mathbb{R}^{n\times1}}=\begin{bmatrix}\tilde{\boldsymbol x}\_1^T\boldsymbol w\\\\\vdots\\\\\tilde{\boldsymbol x}\_n^T\boldsymbol w\end{bmatrix}=\begin{bmatrix}1\cdot w\_0+x\_{1,1}\cdot w\_1+\cdots+x\_{1,D}\cdot w\_D\\\\\vdots\\\\1\cdot w\_0+x\_{n,1}\cdot w\_1+\cdots+x\_{n,D}\cdot w\_D\end{bmatrix}=\underset{=\begin{bmatrix}1&amp;x\_1^T\\\\\vdots&amp;\vdots\\\\1&amp;x\_n^T\end{bmatrix}\\\\=:\boldsymbol X\in\mathbb{R}^{n\times(1+D)}}{\underbrace{\begin{bmatrix}1&amp;x\_{1,1}&amp;\cdots&amp;x\_{1,D}\\\\\vdots&amp;\vdots&amp;\ddots&amp;\vdots\\\\1&amp;x\_{n,1}&amp;\cdots&amp;x\_{n,D}\end{bmatrix}}\cdot}\underbrace{\begin{bmatrix}w\_0\\\\w\_1\\\\\vdots\\\\w\_D\end{bmatrix}}\_{=:\boldsymbol w\in\mathbb{R}^{(1+D)\times1}}$&lt;/p>
&lt;/blockquote></description></item><item><title>Polynomial Regression (Generalized linear regression models)</title><link>https://haobin-tan.netlify.app/docs/ai/machine-learning/regression/polynomial-regression/</link><pubDate>Mon, 13 Jul 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/machine-learning/regression/polynomial-regression/</guid><description>&lt;h2 id="idea">💡Idea&lt;/h2>
&lt;p>&lt;strong>Use a linear model to fit nonlinear data&lt;/strong>:
add powers of each feature as new features, then train a linear model on this extended set of features.&lt;/p>
&lt;h2 id="generalize-linear-regression-to-polynomial-regression">Generalize Linear Regression to Polynomial Regression&lt;/h2>
&lt;p>In Linear Regression $f$ is modelled as linear in $\boldsymbol{x}$ and $\boldsymbol{w}$&lt;/p>
&lt;p>$
f(\boldsymbol{x}) = \tilde{\boldsymbol{x}}^T \boldsymbol{w}
$&lt;/p>
&lt;p>Rewrite it more generally:&lt;/p>
&lt;p>$
f(\boldsymbol{x}) = \phi(\boldsymbol{x})^T \boldsymbol{w}
$&lt;/p>
&lt;ul>
&lt;li>$\phi(\boldsymbol{x})$: vector-valued function of the input vector $\boldsymbol{x}$ (models of this form are also called &amp;ldquo;&lt;strong>linear basis function models&lt;/strong>&amp;rdquo;)
&lt;ul>
&lt;li>$\phi_i(\boldsymbol{x})$: &lt;strong>basis functions&lt;/strong>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;p>In principle, this allows us to &lt;strong>learn any non-linear function&lt;/strong>, if we know suitable basis functions (which is typically not the case 🤪).&lt;/p>
&lt;h3 id="example-1">Example 1&lt;/h3>
&lt;p>$\boldsymbol{x}=\left[\begin{array}{c}{x_1} \\\\ {x_2}\end{array}\right] \in \mathbb{R}^{2}$&lt;/p>
&lt;p>$\phi: \mathbb{R}^2 \to \mathbb{R}^3, \left[\begin{array}{c}{x_1} \\\\ {x_2}\end{array}\right] \mapsto \left[\begin{array}{c}{1} \\\\ {x_1} \\\\ {x_2}\end{array}\right]\qquad $(I.e.: $\phi_1(\boldsymbol{x}) = 1, \phi_2(\boldsymbol{x}) = x_1, \phi_3(\boldsymbol{x}) = x_2$)&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/Pol_Reg_Example_2.png" alt="Pol_Reg_Example_2" style="zoom:50%;" />
&lt;h3 id="example-2">Example 2&lt;/h3>
&lt;p>$\boldsymbol{x}=\left[\begin{array}{c}{x_1} \\\\ {x_2}\end{array}\right] \in \mathbb{R}^{2}$&lt;/p>
&lt;p>$\phi: \mathbb{R}^2 \to \mathbb{R}^5, \left[\begin{array}{c}{x_1} \\\\ {x_2}\end{array}\right] \mapsto \left[\begin{array}{c}{1} \\\\ {x_1} \\\\ {x_2} \\\\ {x_1^2} \\\\ {x_2^2}\end{array}\right]$&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/Pol_Reg_Example_3.png" alt="Pol_Reg_Example_3" style="zoom:50%;" />
&lt;h3 id="optimal-value-of-boldsymbolw">Optimal value of $\boldsymbol{w}$&lt;/h3>
&lt;p>$
\boldsymbol{w}^{*}=\left(\boldsymbol{\Phi}^{T} \boldsymbol{\Phi}\right)^{-1} \boldsymbol{\Phi}^{T} \boldsymbol{y}, \qquad \boldsymbol{\Phi}=\left[\begin{array}{c}{\boldsymbol{\phi}(\boldsymbol{x}\_{1})^{T}} \\\\ {\vdots} \\\\ {\boldsymbol{\phi}(\boldsymbol{x}\_{n})^{T}}\end{array}\right]
$&lt;/p>
&lt;p>&lt;em>(The same as in Linear Regression, except that the data matrix is replaced by the basis function matrix)&lt;/em>&lt;/p>
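&lt;p>A minimal numpy sketch of this closed-form solution, fitting a degree-3 polynomial to noisy sine data (toy setup, illustrative only):&lt;/p>

```python
import numpy as np

# Toy data: noisy samples of sin(x)
rng = np.random.default_rng(0)
x = np.linspace(0.0, 2.0 * np.pi, 30)
y = np.sin(x) + 0.1 * rng.standard_normal(x.shape)

# Basis function matrix Phi: phi(x) = [1, x, x^2, x^3]
Phi = np.vander(x, N=4, increasing=True)

# w* = (Phi^T Phi)^{-1} Phi^T y, solved as a linear system
# (numerically preferable to forming the inverse explicitly)
w_star = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)
y_hat = Phi @ w_star
```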
&lt;h2 id="challenge-of-polynomial-regression-overfitting">Challenge of Polynomial Regression: Overfitting&lt;/h2>
&lt;img src="https://i0.wp.com/csmoon-ml.com/wp-content/uploads/2019/02/Screen-Shot-2019-02-19-at-11.06.04-AM.png?fit=640%2C213" style="zoom:100%; background-color:white">
&lt;p>Reason:
&lt;strong>The model is too complex&lt;/strong> (the degree of the polynomial is too high!). It fits the noise and behaves unpredictably between the training points.😭&lt;/p>
&lt;p>Solution: Regularization&lt;/p>
&lt;h2 id="regularization">Regularization&lt;/h2>
&lt;p>&lt;em>Regularization: Constrain a model to make it simpler and reduce the risk of overfitting.&lt;/em>&lt;/p>
&lt;p>💡 &lt;strong>Avoid overfitting by forcing the weights $\boldsymbol{w}$ to be small&lt;/strong>&lt;/p>
&lt;blockquote>
&lt;p>Assume that our model has degree 3 ($x^1, x^2, x^3$), and the corresponding parameters/weights are $w_1, w_2, w_3$. If we force $w_3=0$, then $w_3 x^3 = 0$, meaning that the model now has only degree 2. In other words, the model is simpler.&lt;/p>
&lt;/blockquote>
&lt;p>In general, a regularized model has the following cost/objective function:
&lt;/p>
$$
\underbrace{E\_D(\boldsymbol{w})}\_{\text{Data term}} + \underbrace{\lambda E\_W(\boldsymbol{w})}\_{\text{Regularization term}}
$$
&lt;p>
$\lambda$: regularization factor (a hyperparameter that needs to be tuned manually), controls how much you want to regularize the model.&lt;/p>
&lt;h3 id="regularized-least-squares-ridge-regression">Regularized Least Squares (Ridge Regression)&lt;/h3>
&lt;p>Consists of:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Sum of Squared Errors (SSE)&lt;/strong> function&lt;/li>
&lt;li>quadratic regulariser ($L_2$-Regularization)&lt;/li>
&lt;/ul>
&lt;p>$
\begin{aligned}
L_{\text {ridge }}
&amp;= \mathbf{SSE} + \lambda \|\boldsymbol{w}\|^2 \\\\
&amp;= (\boldsymbol{y}-\boldsymbol{\Phi} \boldsymbol{w})^{T}(\boldsymbol{y}-\boldsymbol{\Phi} \boldsymbol{w})+\lambda \boldsymbol{w}^{T} \boldsymbol{w}
\end{aligned}
$&lt;/p>
&lt;p>Solution:&lt;/p>
&lt;p>$\boldsymbol{w}_{\mathrm{ridge}}^{*}=\left(\boldsymbol{\Phi}^{T} \boldsymbol{\Phi}+\lambda \boldsymbol{I}\right)^{-1} \boldsymbol{\Phi}^{T} \boldsymbol{y}$&lt;/p>
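&lt;p>A small numpy sketch of the ridge solution (toy data; the $\lambda$ values are chosen only to illustrate the shrinkage effect):&lt;/p>

```python
import numpy as np

def ridge_fit(Phi, y, lam):
    """Closed-form ridge solution: w* = (Phi^T Phi + lam I)^{-1} Phi^T y."""
    d = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(d), Phi.T @ y)

# Toy data: degree-9 polynomial features on a few noisy points
rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 10)
y = np.sin(2.0 * np.pi * x) + 0.1 * rng.standard_normal(x.shape)
Phi = np.vander(x, N=10, increasing=True)

# A larger regularization factor forces the weights to be smaller
w_weak = ridge_fit(Phi, y, lam=0.01)
w_strong = ridge_fit(Phi, y, lam=10.0)
```

&lt;p>With $\lambda = 0$ this reduces to the unregularized least-squares solution.&lt;/p>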
&lt;ul>
&lt;li>$\boldsymbol{I}$: Identity matrix&lt;/li>
&lt;li>For $\lambda > 0$, $\left(\boldsymbol{\Phi}^{T} \boldsymbol{\Phi}+\lambda \boldsymbol{I}\right)$ is full rank (positive definite) and can be easily inverted&lt;/li>
&lt;/ul></description></item><item><title>Kernelized Ridge Regression</title><link>https://haobin-tan.netlify.app/docs/ai/machine-learning/regression/kernelized-ridge-regression/</link><pubDate>Mon, 13 Jul 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/machine-learning/regression/kernelized-ridge-regression/</guid><description>&lt;h2 id="kernel-regression">Kernel regression&lt;/h2>
&lt;h3 id="kernel-identities">Kernel identities&lt;/h3>
&lt;p>Let&lt;/p>
$$
\boldsymbol{\Phi}\_{X}=\left[\begin{array}{c}
\boldsymbol{\phi}\left(\boldsymbol{x}\_{1}\right)^{T} \\\\
\vdots \\\\
\boldsymbol{\phi}\left(\boldsymbol{x}\_{N}\right)^{T}
\end{array}\right] \in \mathbb{R}^{N \times d} , \qquad \left( \boldsymbol{\Phi}\_{X}^T = \left[ \boldsymbol{\phi}(x\_1), \dots, \boldsymbol{\phi}(x\_N)\right] \in \mathbb{R}^{d \times N} \right)
$$
&lt;p>then the following identities hold:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Kernel matrix&lt;/strong>&lt;/p>
$$
\boldsymbol{K}=\boldsymbol{\Phi}\_{X} \boldsymbol{\Phi}\_{X}^{T}
$$
&lt;p>with&lt;/p>
$$
[\boldsymbol{K}]\_{ij}=\boldsymbol{\phi}\left(\boldsymbol{x}\_{i}\right)^{T} \boldsymbol{\phi}(\boldsymbol{x}\_{j}) = \langle \boldsymbol{\phi}(\boldsymbol{x}\_{i}), \boldsymbol{\phi}(\boldsymbol{x}\_{j}) \rangle = k\left(\boldsymbol{x}\_{i}, \boldsymbol{x}\_{j}\right)
$$
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Kernel vector&lt;/strong>&lt;/p>
$$
\boldsymbol{k}\left(\boldsymbol{x}^{\*}\right)=\left[\begin{array}{c}
k\left(\boldsymbol{x}\_{1}, \boldsymbol{x}^{\*}\right) \\\\
\vdots \\\\
k\left(\boldsymbol{x}\_{N}, \boldsymbol{x}^{\*}\right)
\end{array}\right]=\left[\begin{array}{c}
\boldsymbol{\phi}\left(\boldsymbol{x}\_{1}\right)^{T} \boldsymbol{\phi}(\boldsymbol{x}^{\*}) \\\\
\vdots \\\\
\phi\left(\boldsymbol{x}\_{N}\right)^{T} \boldsymbol{\phi}(\boldsymbol{x}^{\*})
\end{array}\right]=\boldsymbol{\Phi}\_{X} \boldsymbol{\phi}\left(\boldsymbol{x}^{\*}\right)
$$
&lt;/li>
&lt;/ul>
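&lt;p>These identities are easy to check numerically; a small sketch with a Gaussian kernel (toy inputs, illustrative only):&lt;/p>

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    """k(a, b) = exp(-||a - b||^2 / (2 sigma^2)), for all row pairs of A, B."""
    sq_dists = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

# N = 4 toy inputs in R^2
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

K = gaussian_kernel(X, X)             # kernel matrix K, N x N
x_star = np.array([[0.5, 0.5]])
k_vec = gaussian_kernel(X, x_star)    # kernel vector k(x*), N x 1
```

&lt;p>$\boldsymbol{K}$ is symmetric with ones on the diagonal, since $k(\boldsymbol{x}, \boldsymbol{x}) = 1$ for the Gaussian kernel.&lt;/p>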
&lt;h3 id="kernel-ridge-regression">Kernel Ridge Regression&lt;/h3>
&lt;p>Ridge Regression: (See also: &lt;a href="https://haobin-tan.netlify.app/docs/ai/machine-learning/regression/polynomial-regression/">Polynomial Regression (Generalized linear regression models)&lt;/a>)&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Squared error function + $L_2$ regularization&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Linear feature space&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;span style="color:red">&lt;strong>Not directly applicable in infinite dimensional feature spaces&lt;/strong>&lt;/span>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Objective:&lt;/strong>&lt;/p>
$$
L_{\text {ridge }}=\underbrace{(\boldsymbol{y}-\boldsymbol{\Phi} \boldsymbol{w})^{T}(\boldsymbol{y}-\boldsymbol{\Phi} \boldsymbol{w})}_{\text {sum of squared errors }}+\lambda \underbrace{\boldsymbol{w}^{T} \boldsymbol{w}}_{L_{2} \text { regularization }}
$$
&lt;ul>
&lt;li>$\boldsymbol{\Phi}\_{X}=\left[\begin{array}{c}
\phi\left(\boldsymbol{x}\_{1}\right)^{T} \\\\
\vdots \\\\
\phi\left(\boldsymbol{x}\_{N}\right)^{T}
\end{array}\right] \in \mathbb{R}^{N \times d}$&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Solution&lt;/p>
$$
\boldsymbol{w}\_{\text {ridge }}^{*}= \color{red}{\underbrace{\left(\boldsymbol{\Phi}^{T} \boldsymbol{\Phi}+\lambda \boldsymbol{I}\right)^{-1}}\_{d \times d \text { matrix inversion }}} \boldsymbol{\Phi}^{T} \boldsymbol{y}
$$
&lt;p>&lt;span style="color:red">Matrix inversion &lt;strong>infeasible&lt;/strong> in &lt;strong>infinite&lt;/strong> dimensions!!!😭&lt;/span>&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h4 id="apply-kernel-trick">Apply kernel trick&lt;/h4>
&lt;p>Rewrite solution as &lt;strong>inner products&lt;/strong> of the feature space with the following matrix identity&lt;/p>
$$
(\boldsymbol{I} + \boldsymbol{A}\boldsymbol{B})^{-1}\boldsymbol{A} = \boldsymbol{A} (\boldsymbol{I} + \boldsymbol{B}\boldsymbol{A})^{-1}
$$
&lt;p>Then we get&lt;/p>
$$
\begin{array}{ll}
\boldsymbol{w}\_{\text {ridge }}^{*}
&amp;= \color{red}{\underbrace{\left(\boldsymbol{\Phi}^{T} \boldsymbol{\Phi}+\lambda \boldsymbol{I}\right)^{-1}}\_{d \times d \text { matrix inversion }}} \boldsymbol{\Phi}^{T} \boldsymbol{y} \\\\
&amp; \overset{}{=} \boldsymbol{\Phi}^{T} \color{LimeGreen}{\underbrace{\left( \boldsymbol{\Phi}\boldsymbol{\Phi}^{T} +\lambda \boldsymbol{I}\right)^{-1}}\_{N \times N \text { matrix inversion }}} \boldsymbol{y} \\\\
&amp;= \boldsymbol{\Phi}^{T} \underbrace{\left( \boldsymbol{K} +\lambda \boldsymbol{I}\right)^{-1}\boldsymbol{y}}_{=: \boldsymbol{\alpha}} \\\\
&amp;= \boldsymbol{\Phi}^{T} \boldsymbol{\alpha}
\end{array}
$$
&lt;ul>
&lt;li>beneficial for $d \gg N$&lt;/li>
&lt;li>&lt;strong>Still, $\boldsymbol{w}^\* \in \mathbb{R}^d$ is potentially infinite-dimensional and cannot be represented explicitly&lt;/strong>&lt;/li>
&lt;/ul>
&lt;p>Yet, we can still evaluate the function $f$ without the explicit representation of $\boldsymbol{w}^*$ 😉&lt;/p>
$$
\begin{array}{ll}
f(\boldsymbol{x})
&amp; =\boldsymbol{\phi}(\boldsymbol{x})^{T} \boldsymbol{w}^{*} \\\\
&amp; \overset{}{=}\boldsymbol{\phi}(\boldsymbol{x})^{T} \boldsymbol{\Phi}^{T} \boldsymbol{\alpha} \\\\
&amp; \overset{\text{kernel} \\ \text{trick}}{=}\boldsymbol{k}(\boldsymbol{x})^{T} \boldsymbol{\alpha} \\\\
&amp; =\sum\_{i} \alpha\_{i} k\left(\boldsymbol{x}\_{i}, \boldsymbol{x}\right)
\end{array}
$$
&lt;p>For a &lt;strong>Gaussian kernel&lt;/strong>&lt;/p>
$$
f(\boldsymbol{x})=\sum_{i} \alpha_{i} k\left(\boldsymbol{x}_{i}, \boldsymbol{x}\right)=\sum_{i} \alpha_{i} \exp \left(-\frac{\left\|\boldsymbol{x}-\boldsymbol{x}_{i}\right\|^{2}}{2 \sigma^{2}}\right)
$$
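&lt;p>Putting the pieces together, a compact numpy sketch of kernel ridge regression with a Gaussian kernel (toy 1-D data; $\lambda$ and $\sigma$ picked by hand):&lt;/p>

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    sq_dists = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def kernel_ridge_fit(X, y, lam, sigma):
    """alpha = (K + lam I)^{-1} y -- an N x N system, independent of d."""
    K = gaussian_kernel(X, X, sigma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def kernel_ridge_predict(X_train, alpha, X_new, sigma):
    """f(x) = k(x)^T alpha = sum_i alpha_i k(x_i, x)."""
    return gaussian_kernel(X_new, X_train, sigma) @ alpha

# Toy 1-D regression: learn sin(x) from 20 samples
X = np.linspace(0.0, 2.0 * np.pi, 20).reshape(-1, 1)
y = np.sin(X).ravel()
alpha = kernel_ridge_fit(X, y, lam=1e-3, sigma=1.0)
y_hat = kernel_ridge_predict(X, alpha, X, sigma=1.0)
```

&lt;p>Note that the feature map $\phi$ never appears explicitly: only kernel evaluations are needed.&lt;/p>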
&lt;h4 id="select-hyperparameter">Select hyperparameter&lt;/h4>
&lt;p>Kernel parameters, such as the bandwidth parameter $\sigma$ in the Gaussian kernel&lt;/p>
$$
k(\boldsymbol{x}, \boldsymbol{y})=\exp \left(-\frac{\|\boldsymbol{x}-\boldsymbol{y}\|^{2}}{2 \sigma^{2}}\right)
$$
&lt;p>are called &lt;strong>hyperparameters&lt;/strong>.&lt;/p>
&lt;p>How to choose? &lt;strong>Cross validation!&lt;/strong>&lt;/p>
&lt;p>Example:&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/image-20200305164457118.png" alt="image-20200305164457118">&lt;/p>
&lt;h3 id="summary-kernel-ridge-regression">Summary: kernel ridge regression&lt;/h3>
&lt;p>The solution for kernel ridge regression is given by
&lt;/p>
$$
f^{*}(\boldsymbol{x})=\boldsymbol{k}(\boldsymbol{x})^{T}(\boldsymbol{K}+\lambda \boldsymbol{I})^{-1} \boldsymbol{y}
$$
&lt;ul>
&lt;li>&lt;span style="color:LimeGreen">No evaluation of the feature vectors needed&lt;/span> 👏&lt;/li>
&lt;li>&lt;span style="color:LimeGreen">Only pair-wise scalar products (evaluated by the kernel)&lt;/span> 👏&lt;/li>
&lt;li>&lt;span style="color:red">Need to invert a &lt;/span> $\color{red}{N \times N}$ &lt;span style="color:red">matrix (can be costly)&lt;/span> 🤪&lt;/li>
&lt;/ul>
&lt;p>‼️&lt;strong>Note&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Have to store &lt;strong>all samples&lt;/strong> in kernel-based methods&lt;/p>
&lt;ul>
&lt;li>Computationally expensive (matrix inverse is $O(n^{2.376})$) !&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Hyperparameters of the method are given by the kernel-parameters&lt;/p>
&lt;ul>
&lt;li>Can be optimized on &lt;strong>validation-set&lt;/strong>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Very flexible function representation, only few hyper-parameters&lt;/strong> 👍&lt;/p>
&lt;/li>
&lt;/ul></description></item><item><title>Classification And Regression Tree (CART)</title><link>https://haobin-tan.netlify.app/docs/ai/machine-learning/decision-tree/cart/</link><pubDate>Tue, 27 Oct 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/machine-learning/decision-tree/cart/</guid><description>&lt;h2 id="tree-based-methods">Tree-based Methods&lt;/h2>
&lt;p>&lt;strong>CART&lt;/strong>: &lt;strong>C&lt;/strong>lassification &lt;strong>A&lt;/strong>nd &lt;strong>R&lt;/strong>egression &lt;strong>T&lt;/strong>ree&lt;/p>
&lt;h3 id="grow-a-binary-tree">Grow a binary tree&lt;/h3>
&lt;ul>
&lt;li>At each node, “split” the data into two “daughter” nodes.&lt;/li>
&lt;li>Splits are chosen using a splitting criterion.&lt;/li>
&lt;li>Bottom nodes are “terminal” nodes.&lt;/li>
&lt;/ul>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>&lt;/th>
&lt;th>Type of tree&lt;/th>
&lt;th>Predicted value at a node&lt;/th>
&lt;th>Split criterion&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>&lt;strong>Regression&lt;/strong>&lt;/td>
&lt;td>Regression tree&lt;/td>
&lt;td>The predicted value at a node is the &lt;strong>average response&lt;/strong> variable for all observations in the node&lt;/td>
&lt;td>&lt;strong>Minimum residual sum of squares&lt;/strong> &lt;br />$$\mathrm{RSS}=\sum_{\text {left }}\left(y_{i}-\bar{y}_{L}\right)^{2}+\sum_{\text {right }}\left(y_{i}-\bar{y}_{R}\right)^{2}$$&lt;li />$\bar{y}_L$ / $\bar{y}_R$: average label values in the left / right subtree &lt;br />(Split such that variance in subtrees is minimized)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Classification&lt;/strong>&lt;/td>
&lt;td>Decision tree&lt;/td>
&lt;td>The predicted class is the &lt;strong>most common class&lt;/strong> in the node (majority vote).&lt;/td>
&lt;td>&lt;strong>Minimum entropy&lt;/strong> in subtrees&lt;br />$$\text { score }=N_{L} H\left(p_{\mathrm{L}}\right)+N_{R} H\left(p_{\mathrm{R}}\right)$$&lt;li />$H\left(p_{L}\right)=-\sum_{k} p_{L}(k) \log p_{L}(k)$: entropy in the left sub-tree &lt;li /> $p_L(k)$: proportion of class $k$ in left tree&lt;br />(Split such that class-labels in sub-trees are &amp;ldquo;pure&amp;rdquo;)&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
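&lt;p>A minimal sketch of the regression split criterion, searching thresholds on a single feature for the minimum-RSS split (toy data; a full CART implementation would recurse over many features and nodes):&lt;/p>

```python
import numpy as np

def best_split(x, y):
    """Return the threshold on feature x minimizing
    RSS = sum_left (y_i - mean_L)^2 + sum_right (y_i - mean_R)^2."""
    best_t, best_rss = None, np.inf
    for t in np.unique(x)[:-1]:        # candidate thresholds
        right = x > t
        y_left, y_right = y[~right], y[right]
        rss = (np.sum((y_left - y_left.mean()) ** 2)
               + np.sum((y_right - y_right.mean()) ** 2))
        if best_rss > rss:
            best_t, best_rss = t, rss
    return best_t, best_rss

# Toy data with a clear jump between x = 3 and x = 4
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.0, 1.1, 0.9, 5.0, 5.1, 4.9])
t, rss = best_split(x, y)
print(t)  # 3.0 -- splits the two plateaus apart
```

&lt;p>The classification case is analogous, with the RSS score replaced by the weighted entropy score from the table.&lt;/p>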
&lt;h3 id="when-stop">When stop?&lt;/h3>
&lt;p>&lt;strong>Stop if:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Minimum number of samples per node&lt;/li>
&lt;li>Maximum depth&lt;/li>
&lt;/ul>
&lt;p>&amp;hellip; has been reached&lt;/p>
&lt;p>(Both criteria again influence the &lt;strong>complexity&lt;/strong> of the tree)&lt;/p>
&lt;h3 id="controlling-the-tree-complexity">Controlling the tree complexity&lt;/h3>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Number of samples per leaf&lt;/th>
&lt;th>Effect&lt;/th>
&lt;th>&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>&lt;strong>Small&lt;/strong>&lt;/td>
&lt;td>Tree is &lt;strong>very sensitive&lt;/strong> to noise&lt;/td>
&lt;td>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/屏幕快照%202020-03-01%2023.26.23.png" alt="屏幕快照 2020-03-01 23.26.23" style="zoom:33%;" />&lt;br />&lt;img src="https://github.com/EckoTan0804/upic-repo/blob/master/uPic/%E5%B1%8F%E5%B9%95%E5%BF%AB%E7%85%A7%202020-03-01%2023.25.40.png?raw=true" alt="屏幕快照 2020-03-01 23.25.40.png" style="zoom:33%;" />&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Large&lt;/strong>&lt;/td>
&lt;td>Tree is &lt;strong>not expressive enough&lt;/strong>&lt;/td>
&lt;td>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/屏幕快照%202020-03-01%2023.25.50.png" alt="屏幕快照 2020-03-01 23.25.50" style="zoom:33%;" />&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h3 id="advantages-">Advantages 👍&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>Applicable to both regression and classification problems.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Handle categorical predictors naturally.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Computationally simple and quick to fit, even for large problems.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>No formal distributional assumptions (non-parametric).&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Can handle highly non-linear interactions and classification boundaries.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Automatic variable selection.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Very easy to interpret if the tree is small.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h3 id="disadvantages-">Disadvantages 👎&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>&lt;em>&lt;strong>Accuracy&lt;/strong>&lt;/em>&lt;/p>
&lt;p>Current methods, such as support vector machines and ensemble classifiers, often have 30% lower error rates than CART.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;em>&lt;strong>Instability&lt;/strong>&lt;/em>&lt;/p>
&lt;p>If we change the data a little, the tree picture can change a lot, so the interpretation is not as straightforward as it appears.&lt;/p>
&lt;/li>
&lt;/ul></description></item></channel></rss>