<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Non-Parametric | Haobin Tan</title><link>https://haobin-tan.netlify.app/tags/non-parametric/</link><atom:link href="https://haobin-tan.netlify.app/tags/non-parametric/index.xml" rel="self" type="application/rss+xml"/><description>Non-Parametric</description><generator>Hugo Blox Builder (https://hugoblox.com)</generator><language>en-us</language><lastBuildDate>Sat, 07 Nov 2020 00:00:00 +0000</lastBuildDate><image><url>https://haobin-tan.netlify.app/media/icon_hu7d15bc7db65c8eaf7a4f66f5447d0b42_15095_512x512_fill_lanczos_center_3.png</url><title>Non-Parametric</title><link>https://haobin-tan.netlify.app/tags/non-parametric/</link></image><item><title>Non-parametric Machine Learning Algorithms</title><link>https://haobin-tan.netlify.app/docs/ai/machine-learning/non-parametric/</link><pubDate>Mon, 07 Sep 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/machine-learning/non-parametric/</guid><description/></item><item><title>Linear Discriminant Functions</title><link>https://haobin-tan.netlify.app/docs/ai/machine-learning/non-parametric/linear-discriminant-functions/</link><pubDate>Sat, 07 Nov 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/machine-learning/non-parametric/linear-discriminant-functions/</guid><description>&lt;ul>
&lt;li>No assumption about distributions -&amp;gt; &lt;strong>non-parametric&lt;/strong>&lt;/li>
&lt;li>Linear decision surfaces&lt;/li>
&lt;li>Trained by supervised learning (class labels of the training data are given)&lt;/li>
&lt;/ul>
&lt;h2 id="linear-discriminant-functions-and-decision-surfaces">Linear Discriminant Functions and Decision Surfaces&lt;/h2>
&lt;p>A discriminant function that is a linear combination of the components of $x$ can be written as
&lt;/p>
$$
g(\mathbf{x})=\mathbf{w}^{T} \mathbf{x}+w\_{0}
$$
&lt;ul>
&lt;li>$\mathbf{x}$: feature vector&lt;/li>
&lt;li>$\mathbf{w}$: weight vector&lt;/li>
&lt;li>$w\_0$: bias or threshold weight&lt;/li>
&lt;/ul>
&lt;h3 id="the-two-category-case">The two category case&lt;/h3>
&lt;p>Decision rule:&lt;/p>
&lt;ul>
&lt;li>Decide $w\_1$ if $g(\mathbf{x}) > 0 \Leftrightarrow \mathbf{w}^{T} \mathbf{x}+w\_{0} > 0 \Leftrightarrow \mathbf{w}^{T} \mathbf{x}> -w\_{0}$&lt;/li>
&lt;li>Decide $w\_{2}$ if $g(\mathbf{x}) &lt; 0 \Leftrightarrow \mathbf{w}^{T} \mathbf{x}+w\_{0} &lt; 0 \Leftrightarrow \mathbf{w}^{T} \mathbf{x}&lt;-w\_{0}$&lt;/li>
&lt;li>$g(\mathbf{x}) = 0$: can be assigned to either class, or left undefined&lt;/li>
&lt;/ul>
&lt;p>The equation $g(\mathbf{x}) = 0$ defines the decision surface that separates points assigned to $w\_{1}$ from points assigned to $w\_{2}$. When $g(\mathbf{x})$ is linear, this decision surface is a &lt;strong>hyperplane&lt;/strong>.&lt;/p>
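&lt;p>The decision rule above can be sketched in a few lines of numpy. The weight vector and bias here are hypothetical values chosen for illustration, not taken from the text:&lt;/p>

```python
import numpy as np

# Hypothetical weight vector and bias (illustrative values only)
w = np.array([2.0, 1.0])
w0 = -4.0

def g(x):
    """Linear discriminant g(x) = w^T x + w_0."""
    return w @ x + w0

def decide(x):
    """Two-category rule: w1 if g(x) > 0, w2 otherwise; ties are on H."""
    value = g(x)
    if value > 0:
        return "w1"
    if value == 0:
        return "tie"  # on the decision surface: assign arbitrarily
    return "w2"

print(decide(np.array([3.0, 3.0])))  # g = 5  -> w1
print(decide(np.array([0.0, 0.0])))  # g = -4 -> w2
```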
&lt;p>For arbitrary $\mathbf{x}\_1$ and $\mathbf{x}\_2$ on the decision surface, we have:
&lt;/p>
$$
\mathbf{w}^{\mathrm{T}} \mathbf{x}\_{1}+w\_{0}=\mathbf{w}^{\mathrm{T}} \mathbf{x}\_{2}+w\_{0}
$$
$$
\mathbf{w}^{\mathrm{T}}\left(\mathbf{x}\_{1}-\mathbf{x}\_{2}\right)=0
$$
&lt;p>$\Rightarrow \mathbf{w}$ is &lt;strong>normal&lt;/strong> to any vector lying in the hyperplane.&lt;/p>
&lt;p>In general, the hyperplane $H$ divides the feature space into two half-spaces:&lt;/p>
&lt;ul>
&lt;li>decision region $R\_1$ for $w\_1$&lt;/li>
&lt;li>decision region $R\_2$ for $w\_2$&lt;/li>
&lt;/ul>
&lt;p>Because $g(\mathbf{x}) > 0$ if $\mathbf{x}$ is in $R\_1$, the normal vector $\mathbf{w}$ points into $R\_1$. It is therefore sometimes said that any $\mathbf{x}$ in $R\_1$ is on the &lt;em>positive&lt;/em> side of $H$, and any $\mathbf{x}$ in $R\_2$ is on the &lt;em>negative&lt;/em> side of $H$.&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/image015.jpg" alt="img">&lt;/p>
&lt;p>The discriminant function $g(\mathbf{x})$ gives an algebraic measure of the distance from $\mathbf{x}$ to the hyperplane. We can write $\mathbf{x}$ as
&lt;/p>
$$
\mathbf{x}=\mathbf{x}\_{p}+r \frac{\mathbf{w}}{\|\mathbf{w}\|}
$$
&lt;ul>
&lt;li>$\mathbf{x}\_{p}$: normal projection of $\mathbf{x}$ onto $H$&lt;/li>
&lt;li>$r$: desired algebraic distance which is positive if $\mathbf{x}$ is on the positive side, else negative&lt;/li>
&lt;/ul>
&lt;p>Since $\mathbf{x}\_p$ lies on the hyperplane:&lt;/p>
$$
\begin{array}{ll}
g\left(\mathbf{x}\_{p}\right)=0 \\\\
\mathbf{w}^{\mathrm{T}} \mathbf{x}\_{p}+w\_{0}=0 \\\\
\mathbf{w}^{\mathrm{T}}\left(\mathbf{x}-r \frac{\mathbf{w}}{\|\mathbf{w}\|}\right)+w\_{0}=0 \\\\
\mathbf{w}^{\mathrm{T}} \mathbf{x}-r \frac{\mathbf{w}^{\mathrm{T}} \mathbf{w}}{\|\mathbf{w}\|}+w\_{0}=0 \\\\
\mathbf{w}^{\mathrm{T}} \mathbf{x}-r\|\mathbf{w}\| + w\_0 = 0 \\\\
\underbrace{\mathbf{w}^{\mathrm{T}} \mathbf{x} + w\_0}\_{=g(\mathbf{x})} = r\|\mathbf{w}\| \\\\
\Rightarrow g(\mathbf{x}) = r\|\mathbf{w}\| \\\\
\Rightarrow r = \frac{g(\mathbf{x})}{\|\mathbf{w}\|}
\end{array}
$$
&lt;p>In particular, the distance from the origin to hyperplane $H$ is given by $\frac{w\_0}{\|\mathbf{w}\|}$.&lt;/p>
&lt;ul>
&lt;li>$w\_0 > 0$: the origin is on the &lt;em>positive&lt;/em> side of $H$&lt;/li>
&lt;li>$w\_0 &lt; 0$: the origin is on the &lt;em>negative&lt;/em> side of $H$&lt;/li>
&lt;li>$w\_0 = 0$: $g(\mathbf{x})$ has the homogeneous form $\mathbf{w}^{\mathrm{T}} \mathbf{x}$ and the hyperplane passes through the origin&lt;/li>
&lt;/ul>
&lt;p>A linear discriminant function divides the feature space by a hyperplane decision surface:&lt;/p>
&lt;ul>
&lt;li>orientation: determined by the normal vector $\mathbf{w}$&lt;/li>
&lt;li>location: determined by the bias $w\_0$&lt;/li>
&lt;/ul>
&lt;h2 id="reference">Reference&lt;/h2>
&lt;ul>
&lt;li>&lt;a href="https://www.byclb.com/TR/Tutorials/neural_networks/ch9_1.htm">https://www.byclb.com/TR/Tutorials/neural_networks/ch9_1.htm&lt;/a>&lt;/li>
&lt;/ul></description></item><item><title>Linear Discriminant Analysis (LDA)</title><link>https://haobin-tan.netlify.app/docs/ai/machine-learning/non-parametric/lda-summary/</link><pubDate>Sat, 07 Nov 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/machine-learning/non-parametric/lda-summary/</guid><description>&lt;p>&lt;strong>Linear Discriminant Analysis (LDA)&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>also called &lt;strong>Fisher’s Linear Discriminant&lt;/strong>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>reduces dimension (like PCA)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>but focuses on &lt;strong>maximizing separability among known categories&lt;/strong>&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h2 id="-idea">💡 Idea&lt;/h2>
&lt;ol>
&lt;li>Create a new axis&lt;/li>
&lt;li>Project the data onto this new axis in a way to maximize the separation of two categories&lt;/li>
&lt;/ol>
&lt;h2 id="how-it-works">How it works?&lt;/h2>
&lt;h3 id="create-a-new-axis">Create a new axis&lt;/h3>
&lt;p>According to two criteria (considered simultaneously):&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Maximize the distance between means&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Minimize the variation $s^2$ (which LDA calls &amp;ldquo;scatter&amp;rdquo;) within each category&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-05-14%2015.11.22.png" alt="截屏2020-05-14 15.11.22" style="zoom:50%;" />
&lt;/li>
&lt;/ul>
&lt;p>We have:
&lt;/p>
$$
\frac{(\overbrace{\mu_1 - \mu_2}^{=: d})^2}{s_1^2 + s_2^2} \qquad\left(\frac{\text{"ideally large"}}{\text{"ideally small"}}\right)
$$
&lt;p>
&lt;strong>Why both distance and scatter are important?&lt;/strong>&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2020-05-14%2015.17.59.png" alt="截屏2020-05-14 15.17.59">&lt;/p>
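&lt;p>The criterion above can be evaluated directly for any candidate axis. The sketch below uses synthetic 2-D data (hypothetical classes, for illustration only) and shows that the axis joining the class means scores much better than an axis orthogonal to it:&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(0)
# Two hypothetical 2-D classes (synthetic data, illustration only)
X1 = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2))
X2 = rng.normal(loc=[3.0, 1.0], scale=0.5, size=(50, 2))

def fisher_criterion(v, X1, X2):
    """J(v) = (mu1 - mu2)^2 / (s1^2 + s2^2) after projecting onto axis v."""
    v = v / np.linalg.norm(v)
    p1, p2 = X1 @ v, X2 @ v                # 1-D projections onto the axis
    d2 = (p1.mean() - p2.mean()) ** 2      # squared distance between means
    scatter = ((p1 - p1.mean()) ** 2).sum() + ((p2 - p2.mean()) ** 2).sum()
    return d2 / scatter

# Axis through the class means vs. an axis orthogonal to it
good = fisher_criterion(X2.mean(axis=0) - X1.mean(axis=0), X1, X2)
poor = fisher_criterion(np.array([-1.0, 3.0]), X1, X2)
print(good > poor)  # True: large mean distance AND small scatter wins
```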
&lt;h4 id="more-than-2-dimensions">More than 2 dimensions&lt;/h4>
&lt;p>The process is the &lt;strong>same&lt;/strong> 👏:&lt;/p>
&lt;p>Create an axis that maximizes the distance between the means of the two categories while minimizing the scatter.&lt;/p>
&lt;h4 id="more-than-2-categories-eg-3-categories">More than 2 categories (e.g. 3 categories)&lt;/h4>
&lt;p>The procedure differs only slightly:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Measure the distances among the means&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Find the point that is &lt;strong>central&lt;/strong> to all of the data&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Then measure the distances between a point that is central in each category and the main central point&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-05-14%2015.26.35.png" alt="截屏2020-05-14 15.26.35" style="zoom:50%;" />
&lt;/li>
&lt;li>
&lt;p>Maximize the distance between each category and the central point while minimizing the scatter for each category&lt;/p>
&lt;/li>
&lt;/ul>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-05-14%2015.28.40.png" alt="截屏2020-05-14 15.28.40" style="zoom:50%;" />
&lt;/li>
&lt;li>
&lt;p>Create 2 axes to separate the data (because the 3 central points for each category define a plane)&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-05-14%2015.30.16.png" alt="截屏2020-05-14 15.30.16" style="zoom:50%;" />
&lt;/li>
&lt;/ul>
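&lt;p>The multi-category steps above (central point, between-class vs. within-class scatter, 2 new axes for 3 categories) can be sketched with a standard eigen-decomposition. The data and class locations are hypothetical; this is a minimal illustration, not a full LDA implementation:&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(1)
# Three hypothetical 3-D classes (synthetic data, illustration only)
classes = [rng.normal(loc=c, scale=0.4, size=(40, 3))
           for c in ([0, 0, 0], [2, 2, 0], [2, 0, 2])]

X = np.vstack(classes)
overall_mean = X.mean(axis=0)      # the point central to all of the data

# Within-class scatter S_w and between-class scatter S_b
S_w = np.zeros((3, 3))
S_b = np.zeros((3, 3))
for Xc in classes:
    mu = Xc.mean(axis=0)                       # central point of this class
    S_w += (Xc - mu).T @ (Xc - mu)             # scatter within the class
    diff = (mu - overall_mean).reshape(-1, 1)  # class center vs. main center
    S_b += Xc.shape[0] * diff @ diff.T

# LD axes: top eigenvectors of S_w^{-1} S_b (at most n_classes - 1 = 2)
eigvals, eigvecs = np.linalg.eig(np.linalg.inv(S_w) @ S_b)
order = np.argsort(eigvals.real)[::-1]
LD = eigvecs.real[:, order[:2]]    # the 2 axes that separate 3 categories
print(LD.shape)                    # (3, 2)
```

&lt;p>With 3 categories the three class centers span a plane, which is why only 2 discriminant axes carry information: the remaining eigenvalue is (numerically) zero.&lt;/p>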
&lt;h2 id="lda-and-pca">LDA and PCA&lt;/h2>
&lt;h3 id="similarities">Similarities&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Both rank the new axes in order of importance&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>PC1 (the first new axis that PCA creates) accounts for the most variation in the data
&lt;ul>
&lt;li>PC2 (the second new axis) does the second best job&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>LD1 (the first new axis that LDA creates) accounts for the most variation between the categories
&lt;ul>
&lt;li>LD2 does the second best job&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Both can let you dig in and see which features are driving the new axes&lt;/strong>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Both try to reduce dimensions&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>PCA looks at the features with the most variation&lt;/li>
&lt;li>LDA tries to maximize the separation of known categories&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h2 id="reference">Reference&lt;/h2>
&lt;ul>
&lt;li>&lt;a href="https://www.youtube.com/watch?v=azXCzI57Yfc">https://www.youtube.com/watch?v=azXCzI57Yfc&lt;/a>&lt;/li>
&lt;/ul></description></item></channel></rss>