<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>PCA | Haobin Tan</title><link>https://haobin-tan.netlify.app/tags/pca/</link><atom:link href="https://haobin-tan.netlify.app/tags/pca/index.xml" rel="self" type="application/rss+xml"/><description>PCA</description><generator>Hugo Blox Builder (https://hugoblox.com)</generator><language>en-us</language><lastBuildDate>Sat, 07 Nov 2020 00:00:00 +0000</lastBuildDate><image><url>https://haobin-tan.netlify.app/media/icon_hu7d15bc7db65c8eaf7a4f66f5447d0b42_15095_512x512_fill_lanczos_center_3.png</url><title>PCA</title><link>https://haobin-tan.netlify.app/tags/pca/</link></image><item><title>Principal Component Analysis (PCA)</title><link>https://haobin-tan.netlify.app/docs/ai/machine-learning/unsupervised/pca/</link><pubDate>Sat, 07 Nov 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/machine-learning/unsupervised/pca/</guid><description>&lt;h2 id="tldr">TL;DR&lt;/h2>
&lt;p>The usual procedure for a principal component analysis of $d$-dimensional feature vectors consists of the following steps:&lt;/p>
&lt;ol start="0">
&lt;li>
&lt;p>Calculate&lt;/p>
&lt;ul>
&lt;li>
&lt;p>average
&lt;/p>
$$
\bar{m}=\frac{1}{N}\sum\_{i=1}^{N} m\_{i} \in \mathbb{R}^{d}
$$
&lt;/li>
&lt;li>
&lt;p>data matrix
&lt;/p>
$$
\mathbf{M}=\left(m\_{1}-\bar{m}, \ldots, m\_{N}-\bar{m}\right) \in \mathbb{R}^{d \times \mathrm{N}}
$$
&lt;/li>
&lt;li>
&lt;p>scatter matrix (covariance matrix)
&lt;/p>
$$
\mathbf{S}=\mathbf{M M}^{\mathrm{T}} \in \mathbb{R}^{d \times d}
$$
&lt;/li>
&lt;/ul>
&lt;p>of all feature vectors $m\_{1}, \ldots, m\_{N}$&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Calculate the normalized ($\\|\cdot\\|=1$) eigenvectors $\mathbf{e}\_1, \dots, \mathbf{e}\_d$ of $\mathbf{S}$ and sort them such that the corresponding eigenvalues $\lambda\_1, \dots, \lambda\_d$ are decreasing, i.e. $\lambda\_1 \geq \lambda\_2 \geq \dots \geq \lambda\_d$&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Construct a matrix
&lt;/p>
$$
\mathbf{A}:=\left(\mathbf{e}\_{1}, \ldots, \mathbf{e}\_{d^{\prime}}\right) \in \mathbb{R}^{d \times d^{\prime}}
$$
&lt;p>
with the first $d^{\prime}$ eigenvectors as its columns&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Transform each feature vector $m\_i$ into a new feature vector
&lt;/p>
$$
m\_{i}^{\prime}=\mathbf{A}^{\mathrm{T}}\left(m\_{i}-\bar{m}\right) \quad \text { for } i=1, \ldots, N
$$
&lt;p>
of smaller dimension $d^{\prime}$&lt;/p>
&lt;/li>
&lt;/ol>
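The procedure above can be sketched in a few lines of NumPy. This is a minimal illustration, not code from the original notes; the function name `pca` and the toy data in the usage example are my own:

```python
import numpy as np

def pca(X, d_prime):
    """PCA following the steps above.
    X: (N, d) array whose rows are the feature vectors m_1, ..., m_N.
    Returns the (N, d') transformed features, the d x d' matrix A,
    and all eigenvalues sorted in decreasing order."""
    m_bar = X.mean(axis=0)                 # step 0a: average
    M = (X - m_bar).T                      # step 0b: d x N centered data matrix
    S = M @ M.T                            # step 0c: d x d scatter matrix
    eigvals, eigvecs = np.linalg.eigh(S)   # step 1: eigh since S is symmetric
    order = np.argsort(eigvals)[::-1]      # sort eigenvalues decreasing
    A = eigvecs[:, order[:d_prime]]        # step 2: first d' eigenvectors as columns
    return (A.T @ M).T, A, eigvals[order]  # step 3: project every feature vector
```

For example, on data whose variance is concentrated along the first axis, the first column of `A` ends up (up to sign) aligned with that axis.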
&lt;h2 id="dimensionality-reduction">Dimensionality reduction&lt;/h2>
&lt;ul>
&lt;li>
&lt;p>Goal: represent instances with fewer variables&lt;/p>
&lt;ul>
&lt;li>Try to preserve as much structure in the data as possible&lt;/li>
&lt;li>Discriminative: only structure that affects class separability&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Feature selection&lt;/p>
&lt;ul>
&lt;li>Pick a subset of the original dimensions&lt;/li>
&lt;li>Discriminative: pick good class &amp;ldquo;predictors&amp;rdquo;&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Feature extraction&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Construct a new set of dimensions
&lt;/p>
$$
E\_{i} = f(X\_1 \dots X\_d)
$$
&lt;ul>
&lt;li>$X\_1, \dots, X\_d$: features&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>(Linear) combinations of the original features&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h2 id="direction-of-greatest-variance">Direction of greatest variance&lt;/h2>
&lt;ul>
&lt;li>Define a set of principal components
&lt;ul>
&lt;li>1st: direction of the &lt;strong>greatest variability&lt;/strong> in the data (i.e. Data points are spread out as far as possible)&lt;/li>
&lt;li>2nd: &lt;em>perpendicular&lt;/em> to 1st, greatest variability of what&amp;rsquo;s left&lt;/li>
&lt;li>&amp;hellip;and so on until $d$ (original dimensionality)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>First $m \ll d$ components become $m$ dimensions
&lt;ul>
&lt;li>Change coordinates of every data point to these dimensions&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-06%2023.51.17.png" alt="截屏2021-02-06 23.51.17">&lt;/p>
&lt;div class="flex px-4 py-3 mb-6 rounded-md bg-primary-100 dark:bg-primary-900">
&lt;span class="pr-3 pt-1 text-primary-600 dark:text-primary-300">
&lt;svg height="24" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24">&lt;path fill="none" stroke="currentColor" stroke-linecap="round" stroke-linejoin="round" stroke-width="1.5" d="m11.25 11.25l.041-.02a.75.75 0 0 1 1.063.852l-.708 2.836a.75.75 0 0 0 1.063.853l.041-.021M21 12a9 9 0 1 1-18 0a9 9 0 0 1 18 0m-9-3.75h.008v.008H12z"/>&lt;/svg>
&lt;/span>
&lt;span class="dark:text-neutral-300">&lt;p>Q: Why greatest variability?&lt;/p>
&lt;p>A: Picking the direction with the highest variance preserves the distances between data points as much as possible&lt;/p>
&lt;/span>
&lt;/div>
&lt;h2 id="how-to-pca">How to PCA?&lt;/h2>
&lt;ol>
&lt;li>
&lt;p>&amp;ldquo;Center&amp;rdquo; the data at zero (subtract from each attribute $a$ its mean $\mu\_a$)
&lt;/p>
$$
x\_{i, a} = x\_{i, a} - \mu\_{a}
$$
&lt;/li>
&lt;li>
&lt;p>Compute the covariance matrix $\Sigma$ of the centered data&lt;/p>
&lt;blockquote>
&lt;p>The &lt;strong>covariance&lt;/strong> between two attributes is an indication of whether they change together (positive correlation) or in opposite directions (negative correlation).&lt;/p>
&lt;p>For example, $cov(x\_1, x\_2) = 0.8 > 0 \Rightarrow$ When $x\_1$ increases/decreases, $x\_2$ also increases/decreases.&lt;/p>
&lt;/blockquote>
$$
cov(b, a) = \frac{1}{n} \sum\_{i=1}^{n} x\_{ib} x\_{ia}
$$
&lt;/li>
&lt;li>
&lt;p>We want vectors $\mathbf{e}$ whose direction is not changed (only scaled) by the covariance matrix $\Sigma$:
&lt;/p>
$$
\Sigma \mathbf{e} = \lambda \mathbf{e}
$$
&lt;p>
$\Rightarrow$ $\mathbf{e}$ are eigenvectors of $\Sigma$, and $\lambda$ are corresponding eigenvalues&lt;/p>
&lt;p>&lt;strong>Principal components = eigenvectors with the largest eigenvalues&lt;/strong>&lt;/p>
&lt;/li>
&lt;/ol>
&lt;h3 id="finding-principle-components">Finding principal components&lt;/h3>
&lt;ol>
&lt;li>
&lt;p>Find the eigenvalues by solving the &lt;a href="https://en.wikipedia.org/wiki/Characteristic_polynomial">characteristic polynomial&lt;/a>
&lt;/p>
$$
\operatorname{det}(\Sigma - \lambda \mathbf{I}) = 0
$$
&lt;ul>
&lt;li>$\mathbf{I}$: Identity matrix&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Find the $i$-th eigenvector by solving
&lt;/p>
$$
\Sigma \mathbf{e}\_i = \lambda\_i \mathbf{e}\_i
$$
&lt;p>
and we want $\mathbf{e}\_{i}$ to have unit length ($\\|\mathbf{e}\_{i}\\| = 1$)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>The eigenvector with the largest eigenvalue is the first principal component, the eigenvector with the second-largest eigenvalue is the second principal component, and so on.&lt;/p>
&lt;/li>
&lt;/ol>
&lt;details>
&lt;summary>Example&lt;/summary>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-07%2000.21.08.png" alt="截屏2021-02-07 00.21.08" style="zoom:67%;" />
&lt;/details>
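As a concrete check of these three steps, here is a small NumPy sketch for a hypothetical $2 \times 2$ covariance matrix (the matrix is made up for illustration). The characteristic polynomial $(2-\lambda)^2 - 1 = 0$ gives $\lambda = 3$ and $\lambda = 1$:

```python
import numpy as np

# Hypothetical covariance matrix, used only for illustration.
Sigma = np.array([[2.0, 1.0],
                  [1.0, 2.0]])

# Step 1: det(Sigma - l*I) = (2 - l)**2 - 1 = 0  =>  l = 3 or l = 1.
# Step 2: np.linalg.eigh solves Sigma e = l e for symmetric matrices and
#         returns unit-length eigenvectors, eigenvalues in ascending order.
eigvals, eigvecs = np.linalg.eigh(Sigma)

# Step 3: sort decreasing, so the first column is the first principal component.
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]
```

Here the first principal component is $(1, 1)/\sqrt{2}$ (up to sign) with eigenvalue 3.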
&lt;h3 id="projecting-to-new-dimension">Projecting to new dimension&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>We pick the $m&lt;d$ eigenvectors $\mathbf{e}\_1, \dots, \mathbf{e}\_m$ with the largest eigenvalues. $\mathbf{e}\_1, \dots, \mathbf{e}\_m$ are now the new dimension vectors&lt;/p>
&lt;/li>
&lt;li>
&lt;p>For an instance $\mathbf{x} = \{x\_1, \dots, x\_d\}$ (original coordinates), we want new coordinates $\mathbf{x}^{\prime} = \{x^{\prime}\_1, \dots, x^{\prime}\_m\}$&lt;/p>
&lt;ul>
&lt;li>&amp;ldquo;Center&amp;rdquo; the instance (subtract the mean): $\mathbf{x} - \mathbf{\mu}$&lt;/li>
&lt;li>&amp;ldquo;Project&amp;rdquo; to each dimension: $(\mathbf{x} - \mathbf{\mu})^T \mathbf{e}\_j$ for $j=1, \dots, m$&lt;/li>
&lt;/ul>
&lt;details>
&lt;summary>Example&lt;/summary>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/PCA.png" alt="PCA" style="zoom:80%;" />
&lt;/details>
&lt;/li>
&lt;/ul>
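A tiny numeric sketch of the centering-then-projecting step (the mean, eigenvectors, and instance below are made-up illustrative numbers, not taken from the notes):

```python
import numpy as np

mu = np.array([2.0, 1.0, 0.0])   # mean of the (hypothetical) training data
E = np.array([[1.0, 0.0],        # columns are unit eigenvectors e_1, e_2
              [0.0, 1.0],
              [0.0, 0.0]])
x = np.array([3.0, 3.0, 5.0])    # a new instance in original coordinates

# "Center" the instance, then "project" onto each new dimension e_j.
x_new = (x - mu) @ E
print(x_new)                     # [1. 2.]
```

The third original coordinate is simply dropped, since neither eigenvector has a component along it.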
&lt;h2 id="go-deeper-in-details">Go deeper in details&lt;/h2>
&lt;h3 id="why-eigenvectors--greatest-variance">Why eigenvectors = greatest variance?&lt;/h3>
&lt;div style="position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;">
&lt;iframe allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen="allowfullscreen" loading="eager" referrerpolicy="strict-origin-when-cross-origin" src="https://www.youtube.com/embed/cIE2MDxyf80?autoplay=0&amp;controls=1&amp;end=0&amp;loop=0&amp;mute=0&amp;start=0" style="position: absolute; top: 0; left: 0; width: 100%; height: 100%; border:0;" title="YouTube video"
>&lt;/iframe>
&lt;/div>
&lt;h3 id="why-eigenvalue--variance-along-eigenvector">Why eigenvalue = variance along eigenvector?&lt;/h3>
&lt;div style="position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;">
&lt;iframe allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen="allowfullscreen" loading="eager" referrerpolicy="strict-origin-when-cross-origin" src="https://www.youtube.com/embed/tL0wFZ9aJP8?autoplay=0&amp;controls=1&amp;end=0&amp;loop=0&amp;mute=0&amp;start=0" style="position: absolute; top: 0; left: 0; width: 100%; height: 100%; border:0;" title="YouTube video"
>&lt;/iframe>
&lt;/div>
&lt;h3 id="how-many-dimensions-should-we-reduce-to">How many dimensions should we reduce to?&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>Now we have eigenvectors $\mathbf{e}\_1, \dots, \mathbf{e}\_d$ and we want a new dimensionality $m \ll d$&lt;/p>
&lt;/li>
&lt;li>
&lt;p>We pick $\mathbf{e}\_i$ that &amp;ldquo;explain&amp;rdquo; the most variance:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Sort eigenvectors s.t. $\lambda\_1 \geq \dots \geq \lambda\_d$&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Pick the first $m$ eigenvectors which explain 90% of the total variance (typical threshold values: 0.9 or 0.95)&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-07%2013.06.46.png" alt="截屏2021-02-07 13.06.46">&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Or we can use a scree plot&lt;/p>
&lt;/li>
&lt;/ul>
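The threshold rule above can be sketched as follows; the helper name `choose_m` and the example eigenvalues are my own:

```python
import numpy as np

def choose_m(eigvals, threshold=0.9):
    """Smallest m such that the first m eigenvalues (sorted decreasing)
    explain at least `threshold` of the total variance."""
    lams = np.sort(np.asarray(eigvals, dtype=float))[::-1]
    explained = np.cumsum(lams) / lams.sum()   # cumulative explained variance
    # index of the first cumulative ratio reaching the threshold, plus one
    return int(np.searchsorted(explained, threshold) + 1)
```

For eigenvalues 5, 3, 1.5, 0.5 the cumulative ratios are 0.5, 0.8, 0.95, 1.0, so a 0.9 threshold keeps the first three components.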
&lt;h2 id="pca-in-a-nutshell">PCA in a nutshell&lt;/h2>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-07%2013.09.32.png" alt="截屏2021-02-07 13.09.32">&lt;/p>
&lt;h2 id="pca-example-eigenfaces">PCA example: Eigenfaces&lt;/h2>
&lt;p>Perform PCA on bitmap images of human faces:&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-07%2016.22.02.png" alt="截屏2021-02-07 16.22.02">&lt;/p>
&lt;p>Below are the eigenvectors obtained after performing PCA on the dataset:&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-07%2016.25.01.png" alt="截屏2021-02-07 16.25.01">&lt;/p>
&lt;p>Then we can project a new face onto the space of eigenfaces and represent it as a linear combination of the principal components.&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-07%2016.24.28.png" alt="截屏2021-02-07 16.24.28">&lt;/p>
&lt;p>As we use more and more eigenvectors in this decomposition, the reconstruction looks more and more like the original face&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-07%2016.33.28.png" alt="截屏2021-02-07 16.33.28">&lt;/p>
&lt;details>
&lt;summary>Why is eigenface neat and interesting?&lt;/summary>
&lt;ul>
&lt;li>This is neat because by taking the first few eigenvectors you can get a pretty close representation of the face. Suppose that this corresponds to maybe 20 eigenvectors. &lt;strong>This means you&amp;rsquo;re using only 20 numbers to represent a face bitmap which looks kind of like the original guy!&lt;/strong> Can you use only 20 pixels to represent him nearly? No, there&amp;rsquo;s no way!&lt;/li>
&lt;li>You&amp;rsquo;re effectively picking 20 numbers/mixture coefficients/coordinates. One really nice way to use this is you can use this for &lt;strong>massive compression&lt;/strong> of the data. If you communicate to others if they all have access to the same eigenvectors, all they need to send between each other are just the projection coordinates. Then they can transmit arbitrary faces between them. This is massive reduction in the size of data.&lt;/li>
&lt;li>Your classifier or regression system now operates in a low-dimensional space, so it has far fewer parameters to fit and can learn a better hyperplane. &amp;#x1f44f;&lt;/li>
&lt;/ul>
&lt;/details>
&lt;h3 id="application-of-eigenface">Application of eigenface&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>Face similarity&lt;/p>
&lt;ul>
&lt;li>in the reduced space&lt;/li>
&lt;li>insensitive to lighting, expression, orientation&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Projecting new &amp;ldquo;faces&amp;rdquo;&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-07%2016.49.58.png" alt="截屏2021-02-07 16.49.58">&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h2 id="pratical-issues-of-pca">Pratical issues of PCA&lt;/h2>
&lt;ul>
&lt;li>
&lt;p>PCA is based on the covariance matrix, and covariance is extremely sensitive to large values&lt;/p>
&lt;ul>
&lt;li>
&lt;p>E.g., multiply some dimension by 1000. Then this dimension dominates the covariance and becomes a principal component.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Solution: normalize each dimension to zero mean and unit variance
&lt;/p>
$$
x^{\prime} = \frac{x - \text{mean}}{\text{standard deviation}}
$$
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>PCA assumes underlying subspace is linear.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>PCA can sometimes hurt the performance of classification&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Because PCA doesn&amp;rsquo;t see the labels&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Solution: &lt;a href="https://haobin-tan.netlify.app/docs/ai/machine-learning/non-parametric/lda-summary/">Linear Discriminant Analysis (LDA)&lt;/a>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Picks a new dimension that gives&lt;/p>
&lt;ul>
&lt;li>maximum separation between means of projected classes&lt;/li>
&lt;li>minimum variance within each projected class&lt;/li>
&lt;/ul>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-07%2017.23.36.png" alt="截屏2021-02-07 17.23.36">&lt;/p>
&lt;/li>
&lt;li>
&lt;p>But this relies on some assumptions of the data and does not always work. 🤪&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
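The normalization fix described above, as a minimal sketch (the function name `standardize` is illustrative):

```python
import numpy as np

def standardize(X):
    """Normalize each dimension (column) of X to zero mean and unit
    variance, so that no dimension dominates the covariance matrix
    by scale alone before PCA is applied."""
    return (X - X.mean(axis=0)) / X.std(axis=0)
```

After this step, each dimension contributes on the same scale to the covariance matrix, regardless of its original units.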
&lt;h2 id="reference">Reference&lt;/h2>
&lt;ul>
&lt;li>&lt;a href="https://www.youtube.com/watch?v=IbE0tbjy6JQ&amp;amp;list=PLBv09BD7ez_5_yapAg86Od6JeeypkS4YM&amp;amp;index=1">Principal Component Analysis&lt;/a>: a great series of video tutorials explaining PCA clearly 👍&lt;/li>
&lt;/ul></description></item></channel></rss>