# Supervised/Unsupervised Learning

## Supervised learning

The training data you feed to the algorithm includes the desired solutions, called labels.

Typical tasks:

- Classification
- Regression

Important supervised learning algorithms:

- k-Nearest Neighbors
- Linear Regression
- Logistic Regression
- Support Vector Machines (SVMs)
- Decision Trees and Random Forests
- Neural Networks

## Unsupervised learning

The training data is unlabeled.
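The role of the labels can be illustrated with a tiny 1-nearest-neighbor sketch. The data points and labels below are made up for illustration; the classifier is "supervised" precisely because `y_train` (the desired solutions) is part of the training set:

```python
import numpy as np

# Toy training set: two clusters of 2-D points with known labels.
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [8.0, 8.0], [7.8, 8.2]])
y_train = np.array([0, 0, 1, 1])  # the "desired solutions" (labels)

def predict_1nn(x):
    """Return the label of the closest training point."""
    dists = np.linalg.norm(X_train - x, axis=1)
    return y_train[np.argmin(dists)]

print(predict_1nn(np.array([1.1, 0.9])))  # near the first cluster -> 0
print(predict_1nn(np.array([8.1, 7.9])))  # near the second cluster -> 1
```

An unsupervised algorithm (e.g. k-Means) would receive only `X_train` and have to discover the two clusters on its own.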
2020-08-17
# TL;DR: Confusion Matrix, ROC, and AUC

## Confusion matrix

A confusion matrix tells you what your ML algorithm did right and what it did wrong. Each row is a prediction, each column a known truth, and each cell counts one of the four possible outcomes.

|                          | Known truth: Positive | Known truth: Negative |
| ------------------------ | --------------------- | --------------------- |
| **Prediction: Positive** | True Positive (TP)    | False Positive (FP)   |
| **Prediction: Negative** | False Negative (FN)   | True Negative (TN)    |

Derived metrics:

- Precision = TP / (TP + FP)
- TPR = Sensitivity = Recall = TP / (TP + FN)
- Specificity = TN / (FP + TN)
- FPR = FP / (FP + TN) = 1 - Specificity
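A minimal NumPy sketch of these counts and metrics; the prediction and truth vectors are made up for illustration:

```python
import numpy as np

# Hypothetical predictions vs. known truth (1 = positive, 0 = negative).
y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 1, 0, 1, 0])

TP = np.sum((y_pred == 1) & (y_true == 1))  # predicted +, truly +
FP = np.sum((y_pred == 1) & (y_true == 0))  # predicted +, truly -
FN = np.sum((y_pred == 0) & (y_true == 1))  # predicted -, truly +
TN = np.sum((y_pred == 0) & (y_true == 0))  # predicted -, truly -

precision   = TP / (TP + FP)
recall      = TP / (TP + FN)   # = TPR = sensitivity
specificity = TN / (FP + TN)
fpr         = FP / (FP + TN)   # = 1 - specificity

print(TP, FP, FN, TN)          # 3 1 1 3
print(precision, recall)       # 0.75 0.75
```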
2020-08-17
# 1. Look at the big picture

## 1.1 Frame the problem

Consider the business objective: how do we expect to use and benefit from this model?

## 1.2 Select a performance measure
2020-08-17
# Linear Algebra

## Vectors

A vector is a multi-dimensional quantity; each dimension contains a different piece of information (e.g. age, weight, height, …). Vectors are represented as bold symbols.

A vector $\boldsymbol{x}$ is always a column vector:

$$
\boldsymbol{x}=\left[\begin{array}{l} {1} \\\\ {2} \\\\ {4} \end{array}\right]
$$

A transposed vector $\boldsymbol{x}^T$ is a row vector:

$$
\boldsymbol{x}^{T}=\left[\begin{array}{lll} {1} & {2} & {4} \end{array}\right]
$$

## Vector operations

Multiplication by a scalar:

$$
2\left[\begin{array}{l} {1} \\\\ {2} \end{array}\right]=\left[\begin{array}{l} {2} \\\\ {4} \end{array}\right]
$$

Addition of vectors:

$$
\left[\begin{array}{l}{1} \\\\ {2} \end{array}\right]+\left[\begin{array}{l}{3} \\\\ {1}\end{array}\right]=\left[\begin{array}{l}{4} \\\\ {3} \end{array}\right]
$$

Scalar (inner) product: sum the element-wise products.

$$
\boldsymbol{v}=\left[\begin{array}{c}{1} \\\\ {2} \\\\ {4}\end{array}\right], \quad \boldsymbol{w}=\left[\begin{array}{l}{2} \\\\ {4} \\\\ {8}\end{array}\right]
$$

$$
\langle \boldsymbol{v}, \boldsymbol{w}\rangle= 1 \cdot 2+2 \cdot 4+4 \cdot 8=42
$$

Length of a vector: square root of the inner product with itself.

$$
\|\boldsymbol{v}\|=\langle\boldsymbol{v}, \boldsymbol{v}\rangle^{\frac{1}{2}}=\left(1^{2}+2^{2}+4^{2}\right)^{\frac{1}{2}}=\sqrt{21}
$$

## Matrices

A matrix is a rectangular array of numbers arranged in rows and columns.
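These operations map directly onto NumPy, using the same example vectors as above:

```python
import numpy as np

v = np.array([1, 2, 4])
w = np.array([2, 4, 8])

print(2 * np.array([1, 2]))                 # scalar multiplication -> [2 4]
print(np.array([1, 2]) + np.array([3, 1]))  # vector addition       -> [4 3]
print(np.dot(v, w))                         # inner product         -> 42
print(np.linalg.norm(v))                    # length = sqrt(21) ≈ 4.583
```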
2020-08-17
# SVM (with Features)

Maximum margin principle: slack variables $\xi_i$ allow for margin violations.

$$
\begin{array}{ll} \underset{\mathbf{w}}{\operatorname{argmin}} \quad &\|\mathbf{w}\|^{2} + C \sum_i^N \xi_i \\\\ \text { s.t. } \quad & y_{i}\left(\mathbf{w}^{T} \color{red}{\phi(\mathbf{x}_{i})} + b\right) \geq 1 -\xi_i, \quad \xi_i \geq 0\end{array}
$$

## Math basics

Solve the constrained optimization problem with the method of Lagrange multipliers.
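At the optimum each slack $\xi_i$ equals the hinge loss $\max(0,\, 1 - y_i(\mathbf{w}^T\phi(\mathbf{x}_i) + b))$, so the objective can be evaluated directly for given parameters. A minimal sketch with made-up data, assuming the identity feature map $\phi(\mathbf{x}) = \mathbf{x}$:

```python
import numpy as np

def soft_margin_objective(w, b, X, y, C):
    """||w||^2 + C * sum of slacks, where each slack is the hinge loss
    max(0, 1 - y_i (w^T x_i + b)).  Identity feature map assumed."""
    margins = y * (X @ w + b)
    slacks = np.maximum(0.0, 1.0 - margins)  # xi_i >= 0, nonzero on violations
    return w @ w + C * slacks.sum()

# Hypothetical toy data, labels in {-1, +1}.
X = np.array([[2.0, 0.0], [-2.0, 0.0], [0.5, 0.0]])
y = np.array([1.0, -1.0, 1.0])
w = np.array([1.0, 0.0]); b = 0.0

# The third point has margin 0.5 < 1, so its slack is 0.5.
print(soft_margin_objective(w, b, X, y, C=1.0))  # 1 + 1*0.5 = 1.5
```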
2020-07-13
# Kernel Function

Given a mapping function $\phi: \mathcal{X} \rightarrow \mathcal{V}$, the function

$$
\mathcal{K}: \mathcal{X} \times \mathcal{X} \rightarrow \mathbb{R}, \quad \mathcal{K}\left(\mathbf{x}, \mathbf{x}^{\prime}\right)=\left\langle\phi(\mathbf{x}), \phi\left(\mathbf{x}^{\prime}\right)\right\rangle_{\mathcal{V}}
$$

is called a kernel function.

> "A kernel is a function that returns the result of a dot product performed in another space."
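A concrete check of this definition, using the homogeneous degree-2 polynomial kernel $\mathcal{K}(\mathbf{x}, \mathbf{x}') = (\mathbf{x}^T\mathbf{x}')^2$, whose explicit feature map in 2-D is $\phi(\mathbf{x}) = (x_1^2,\, x_2^2,\, \sqrt{2}\,x_1 x_2)$; the input vectors are made up for illustration:

```python
import numpy as np

x  = np.array([1.0, 2.0])
xp = np.array([3.0, 1.0])

def phi(x):
    """Explicit feature map of the degree-2 polynomial kernel."""
    return np.array([x[0]**2, x[1]**2, np.sqrt(2) * x[0] * x[1]])

# Dot product performed in the feature space V ...
lhs = phi(x) @ phi(xp)
# ... equals the kernel evaluated in the input space X, no mapping needed.
rhs = (x @ xp) ** 2

print(lhs, rhs)  # both 25.0
```

This is the point of the quote above: the kernel computes the feature-space dot product without ever constructing $\phi(\mathbf{x})$.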
2020-07-13
# 🎯 Goal of SVM

To find the optimal separating hyperplane, which:

- maximizes the margin of the training data,
- correctly classifies the training data,
- is the one that will generalize best to unseen data (it stays as far as possible from the data points of each category).

## SVM math formulation

Assume the data is linearly separable.
2020-07-13
Class label:

$$
y_i \in \\{0, 1\\}
$$

The conditional probability distribution of the class label is

$$
\begin{aligned} p(y=1|\boldsymbol{x}) &= \sigma(\boldsymbol{w}^T\boldsymbol{x}+b) \\\\ p(y=0|\boldsymbol{x}) &= 1 - \sigma(\boldsymbol{w}^T\boldsymbol{x}+b) \end{aligned}
$$

with the sigmoid function $\sigma(z) = \frac{1}{1 + e^{-z}}$.
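A minimal numeric check of these two probabilities; the weights and input below are made up for illustration:

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid: maps any real score to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical parameters and input.
w = np.array([0.5, -0.25]); b = 0.1
x = np.array([2.0, 1.0])

z  = w @ x + b        # linear score: 0.5*2 - 0.25*1 + 0.1 = 0.85
p1 = sigmoid(z)       # p(y=1 | x)
p0 = 1.0 - p1         # p(y=0 | x)
print(p1 + p0)        # the two probabilities sum to 1
```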
2020-07-13
# 💡 Use a Regression Algorithm for Classification

Logistic regression estimates the probability that an instance belongs to a particular class. If the estimated probability is greater than 50%, the model predicts that the instance belongs to that class (the positive class, labeled "1"); otherwise it predicts that it does not (i.e., the instance belongs to the negative class, labeled "0").
2020-07-13
# What Does the Objective Function Look Like?

Objective function:

$$
\operatorname{Obj}(\Theta)= \overbrace{L(\Theta)}^{\text {Training Loss}} + \underbrace{\Omega(\Theta)}_{\text{Regularization}}
$$

Training loss: measures how well the model fits the training data.

$$
L=\sum_{i=1}^{n} l\left(y_{i}, \hat{y}_{i}\right)
$$

- Square loss: $l(y_i, \hat{y}_i) = (y_i - \hat{y}_i)^2$
- Logistic loss: $l(y_i, \hat{y}_i) = y_i \log(1 + e^{-\hat{y}_i}) + (1 - y_i) \log(1 + e^{\hat{y}_i})$

Regularization: measures how complicated the model is.
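The two losses can be sketched directly; note that in the logistic loss $\hat{y}_i$ is a raw score (a logit), not a probability. Labels and scores below are made up for illustration:

```python
import numpy as np

def square_loss(y, y_hat):
    """(y - y_hat)^2 -- for regression targets."""
    return (y - y_hat) ** 2

def logistic_loss(y, y_hat):
    """y in {0, 1}; y_hat is the raw model score (logit)."""
    return y * np.log1p(np.exp(-y_hat)) + (1 - y) * np.log1p(np.exp(y_hat))

print(square_loss(2.0, 1.5))   # 0.25
# A confident correct score gives a small loss, a wrong one a large loss:
print(logistic_loss(1, 3.0))   # small (~0.049)
print(logistic_loss(1, -3.0))  # large (~3.049)
```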
2020-07-06