<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>ML Basics | Haobin Tan</title><link>https://haobin-tan.netlify.app/tags/ml-basics/</link><atom:link href="https://haobin-tan.netlify.app/tags/ml-basics/index.xml" rel="self" type="application/rss+xml"/><description>ML Basics</description><generator>Hugo Blox Builder (https://hugoblox.com)</generator><language>en-us</language><lastBuildDate>Mon, 07 Sep 2020 00:00:00 +0000</lastBuildDate><image><url>https://haobin-tan.netlify.app/media/icon_hu7d15bc7db65c8eaf7a4f66f5447d0b42_15095_512x512_fill_lanczos_center_3.png</url><title>ML Basics</title><link>https://haobin-tan.netlify.app/tags/ml-basics/</link></image><item><title>Machine Learning Fundamentals</title><link>https://haobin-tan.netlify.app/docs/ai/machine-learning/ml-fundamentals/</link><pubDate>Mon, 07 Sep 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/machine-learning/ml-fundamentals/</guid><description/></item><item><title>Math Basics</title><link>https://haobin-tan.netlify.app/docs/ai/machine-learning/ml-fundamentals/math-basics/</link><pubDate>Mon, 17 Aug 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/machine-learning/ml-fundamentals/math-basics/</guid><description>&lt;h2 id="linear-algebra">Linear Algebra&lt;/h2>
&lt;h3 id="vectors">Vectors&lt;/h3>
&lt;p>&lt;strong>Vector&lt;/strong>: multi-dimensional quantity&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Each dimension contains different information (e.g.: Age, Weight, Height&amp;hellip;)&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/Vectors.png" alt="Vectors" style="zoom:70%;" />
&lt;/li>
&lt;li>
&lt;p>represented as &lt;strong>bold symbols&lt;/strong>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>A vector $\boldsymbol{x}$ is always a &lt;strong>column&lt;/strong> vector
&lt;/p>
$$
\boldsymbol{x}=\left[\begin{array}{l}
{1} \\\\
{2} \\\\
{4}
\end{array}\right]
$$
&lt;/li>
&lt;li>
&lt;p>A transposed vector $\boldsymbol{x}^T$ is a &lt;strong>row&lt;/strong> vector
&lt;/p>
$$
\boldsymbol{x}^{T}=\left[\begin{array}{lll}
{1} &amp; {2} &amp; {4}
\end{array}\right]
$$
&lt;/li>
&lt;/ul>
&lt;h4 id="vector-operations">Vector Operations&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Multiplication by scalars&lt;/strong>
&lt;/p>
$$
2\left[\begin{array}{l}
{1} \\\\
{2}
\end{array}\right]=\left[\begin{array}{l}
{2} \\\\
{4}
\end{array}\right]
$$
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Addition of vectors&lt;/strong>
&lt;/p>
$$
\left[\begin{array}{l}{1} \\\\ {2} \end{array}\right]+\left[\begin{array}{l}{3} \\\\ {1}\end{array}\right]=\left[\begin{array}{l}{4} \\\\ {3} \end{array}\right]
$$
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Scalar (Inner) products&lt;/strong>: Sum the element-wise products
&lt;/p>
$$
\boldsymbol{v}=\left[\begin{array}{c}{1} \\\\ {2} \\\\ {4}\end{array}\right], \quad \boldsymbol{w}=\left[\begin{array}{l}{2} \\\\ {4} \\\\ {8}\end{array}\right]
$$
&lt;/li>
&lt;/ul>
$$
\langle v, w\rangle= 1 \cdot 2+2 \cdot 4+4 \cdot 8=42
$$
&lt;ul>
&lt;li>&lt;strong>Length of a vector&lt;/strong>: Square root of the inner product with itself
$$
\|\boldsymbol{v}\|=\langle\boldsymbol{v}, \boldsymbol{v}\rangle^{\frac{1}{2}}=\left(1^{2}+2^{2}+4^{2}\right)^{\frac{1}{2}}=\sqrt{21}
$$&lt;/li>
&lt;/ul>
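&lt;p>As a sanity check, the operations above map one-to-one onto NumPy (used here purely as an illustrative sketch; the library is not part of these notes):&lt;/p>

```python
import numpy as np

v = np.array([1, 2, 4])
w = np.array([2, 4, 8])

scaled = 2 * np.array([1, 2])                # scalar multiplication: [2, 4]
added = np.array([1, 2]) + np.array([3, 1])  # vector addition: [4, 3]
inner = v @ w                                # inner product: 1*2 + 2*4 + 4*8 = 42
length = np.sqrt(v @ v)                      # vector length: sqrt(21)
```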
&lt;h3 id="matrices">Matrices&lt;/h3>
&lt;p>Matrix: rectangular array of numbers arranged in rows and columns&lt;/p>
&lt;ul>
&lt;li>
&lt;p>denoted with &lt;strong>bold upper-case letters&lt;/strong>
&lt;/p>
$$
\boldsymbol{X}=\left[\begin{array}{ll}{1} &amp; {3} \\\\ {2} &amp; {3} \\\\ {4} &amp; {7}\end{array}\right]
$$
&lt;/li>
&lt;li>
&lt;p>Dimension: $\\#rows \\times \\#columns$ (E.g.: 👆$X \in \mathbb{R}^{3 \times 2}$)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Vectors are special cases of matrices
&lt;/p>
$$
\boldsymbol{x}^{T}=\underbrace{\left[\begin{array}{ccc}{1} &amp; {2} &amp; {4}\end{array}\right]}_{1 \times 3 \text { matrix }}
$$
&lt;/li>
&lt;/ul>
&lt;h4 id="matrices-in-ml">Matrices in ML&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>Data set can be represented as matrix, where single samples are vectors&lt;/p>
&lt;p>e.g.:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>&lt;/th>
&lt;th>Age&lt;/th>
&lt;th>Weight&lt;/th>
&lt;th>Height&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Joe&lt;/td>
&lt;td>37&lt;/td>
&lt;td>72&lt;/td>
&lt;td>175&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Mary&lt;/td>
&lt;td>10&lt;/td>
&lt;td>30&lt;/td>
&lt;td>61&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Carol&lt;/td>
&lt;td>25&lt;/td>
&lt;td>65&lt;/td>
&lt;td>121&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Brad&lt;/td>
&lt;td>66&lt;/td>
&lt;td>67&lt;/td>
&lt;td>175&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
$$
\text { Joe: } \boldsymbol{x}\_{1}=\left[\begin{array}{c}{37} \\\\ {72} \\\\ {175}\end{array}\right], \qquad \text { Mary: } \boldsymbol{x}\_{2}=\left[\begin{array}{c}{10} \\\\ {30} \\\\ {61}\end{array}\right] \\\\
$$
$$
\text { Carol: } \boldsymbol{x}\_{3}=\left[\begin{array}{c}{25} \\\\ {65} \\\\ {121}\end{array}\right], \qquad \text { Brad: } \boldsymbol{x}\_{4}=\left[\begin{array}{c}{66} \\\\ {67} \\\\ {175}\end{array}\right]
$$
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Most typical representation:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>row ~ data sample (e.g. Joe)&lt;/li>
&lt;li>column ~ data entry (e.g. age)&lt;/li>
&lt;/ul>
$$
\boldsymbol{X}=\left[\begin{array}{l}{\boldsymbol{x}\_{1}^{T}} \\\\ {\boldsymbol{x}\_{2}^{T}} \\\\ {\boldsymbol{x}\_{3}^{T}} \\\\ {\boldsymbol{x}\_{4}^{T}}\end{array}\right]=\left[\begin{array}{ccc}{37} &amp; {72} &amp; {175} \\\\ {10} &amp; {30} &amp; {61} \\\\ {25} &amp; {65} &amp; {121} \\\\ {66} &amp; {67} &amp; {175}\end{array}\right]
$$
&lt;/li>
&lt;/ul>
&lt;h4 id="matrice-operations">Matrice Operations&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Multiplication with scalar&lt;/strong>
&lt;/p>
$$
3 \boldsymbol{M}=3\left[\begin{array}{lll}{3} &amp; {4} &amp; {5} \\\\ {1} &amp; {0} &amp; {1}\end{array}\right]=\left[\begin{array}{ccc}{9} &amp; {12} &amp; {15} \\\\ {3} &amp; {0} &amp; {3}\end{array}\right]
$$
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Addition of matrices&lt;/strong>
&lt;/p>
$$
\boldsymbol{M} + \boldsymbol{N}=\left[\begin{array}{lll}{3} &amp; {4} &amp; {5} \\\\ {1} &amp; {0} &amp; {1}\end{array}\right]+\left[\begin{array}{lll}{1} &amp; {2} &amp; {1} \\\\ {3} &amp; {1} &amp; {1}\end{array}\right]=\left[\begin{array}{lll}{4} &amp; {6} &amp; {6} \\\\ {4} &amp; {1} &amp; {2}\end{array}\right]
$$
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Transposed&lt;/strong>
&lt;/p>
$$
\boldsymbol{M}=\left[\begin{array}{lll}{3} &amp; {4} &amp; {5} \\\\ {1} &amp; {0} &amp; {1}\end{array}\right], \boldsymbol{M}^{T}=\left[\begin{array}{ll}{3} &amp; {1} \\\\ {4} &amp; {0} \\\\ {5} &amp; {1}\end{array}\right]
$$
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Matrix-Vector product&lt;/strong> (the vector&amp;rsquo;s dimensionality must equal the number of columns of the matrix)
&lt;/p>
$$
\underbrace{\left[\boldsymbol{w}\_{1}, \ldots, \boldsymbol{w}\_{n}\right]}_{\boldsymbol{W}} \underbrace{\left[\begin{array}{c}{v\_{1}} \\\\ {\vdots} \\\\ {v\_{n}}\end{array}\right]}\_{\boldsymbol{v}}=\underbrace{\left[\begin{array}{c}{v\_{1} \boldsymbol{w}\_{1}+\cdots+v\_{n} \boldsymbol{w}\_{n}}\end{array}\right]}\_{\boldsymbol{u}}
$$
&lt;p>
E.g.:
&lt;/p>
$$
\boldsymbol{u}=\boldsymbol{W} \boldsymbol{v}=\left[\begin{array}{ccc}{3} &amp; {4} &amp; {5} \\\\ {1} &amp; {0} &amp; {1}\end{array}\right]\left[\begin{array}{l}{1} \\\\ {0} \\\\ {2}\end{array}\right]=\left[\begin{array}{l}{3 \cdot 1+4 \cdot 0+5 \cdot 2} \\\\ {1 \cdot 1+0 \cdot 0+1 \cdot 2}\end{array}\right]=\left[\begin{array}{c}{13} \\\\ {3}\end{array}\right]
$$
&lt;p>
💡 &lt;em>Think of it as: We sum over the columns $\boldsymbol{w}_i$ of $\boldsymbol{W}$, weighted by $v_i$&lt;/em>&lt;/p>
&lt;/li>
&lt;/ul>
$$
u=v\_{1} w\_{1}+\cdots+v\_{n} w\_{n}=1\left[\begin{array}{l}{3} \\\\ {1}\end{array}\right]+0\left[\begin{array}{l}{4} \\\\ {0}\end{array}\right]+2\left[\begin{array}{l}{5} \\\\ {1}\end{array}\right]=\left[\begin{array}{c}{13} \\\\ {3}\end{array}\right]
$$
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Matrix-Matrix product&lt;/strong>
&lt;/p>
$$
\boldsymbol{U} = \boldsymbol{W} \boldsymbol{V}=\left[\begin{array}{lll}{3} &amp; {4} &amp; {5} \\\\ {1} &amp; {0} &amp; {1}\end{array}\right]\left[\begin{array}{ll}{1} &amp; {0} \\\\ {0} &amp; {3} \\\\ {2} &amp; {4}\end{array}\right]=\left[\begin{array}{ll}{3 \cdot 1+4 \cdot 0+5 \cdot 2} &amp; {3 \cdot 0+4 \cdot 3+5 \cdot 4} \\\\ {1 \cdot 1+0 \cdot 0+1 \cdot 2} &amp; {1 \cdot 0+0 \cdot 3+1 \cdot 4}\end{array}\right]=\left[\begin{array}{cc}{13} &amp; {32} \\\\ {3} &amp; {4}\end{array}\right]
$$
&lt;p>
💡 &lt;em>Think of it as: Each column $\boldsymbol{u}\_i = \boldsymbol{W} \boldsymbol{v}\_i$ can be computed by a matrix-vector product&lt;/em>
&lt;/p>
$$
\boldsymbol{W} \underbrace{\left[\boldsymbol{v}\_{1}, \ldots, \boldsymbol{v}\_{n}\right]}\_{\boldsymbol{V}}=[\underbrace{\boldsymbol{W} \boldsymbol{v}\_{1}}_{\boldsymbol{u}\_{1}}, \ldots, \underbrace{\boldsymbol{W} \boldsymbol{v}\_{n}}\_{\boldsymbol{u}\_{n}}]=\boldsymbol{U}
$$
&lt;ul>
&lt;li>
&lt;p>Non-commutative: $\boldsymbol{V} \boldsymbol{W} \neq \boldsymbol{W} \boldsymbol{V}$&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Associative: $\boldsymbol{V}(\boldsymbol{W} \boldsymbol{X})=(\boldsymbol{V} \boldsymbol{W}) \boldsymbol{X}$&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Transpose product:
&lt;/p>
$$
(\boldsymbol{V} \boldsymbol{W}) ^{T}=\boldsymbol{W}^{T} \boldsymbol{V}^{T}
$$
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Matrix inverse&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>scalar
&lt;/p>
$$
w \cdot w^{-1}=1
$$
&lt;/li>
&lt;li>
&lt;p>matrices
&lt;/p>
$$
\boldsymbol{W} \boldsymbol{W}^{-1}=\boldsymbol{I}, \quad \boldsymbol{W}^{-1} \boldsymbol{W}=\boldsymbol{I}
$$
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
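&lt;p>The matrix operations above can be reproduced with NumPy (an illustrative sketch, not part of the original notes); the numbers match the worked examples:&lt;/p>

```python
import numpy as np

M = np.array([[3, 4, 5],
              [1, 0, 1]])

tripled = 3 * M      # scalar multiplication: [[9, 12, 15], [3, 0, 3]]
Mt = M.T             # transpose, shape (3, 2)

v = np.array([1, 0, 2])
u = M @ v            # matrix-vector product: [13, 3]

V = np.array([[1, 0],
              [0, 3],
              [2, 4]])
U = M @ V            # matrix-matrix product: [[13, 32], [3, 4]]
```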
&lt;h4 id="important-special-cases">Important Special Cases&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Scalar (Inner) product:&lt;/strong>
&lt;/p>
$$
\langle\boldsymbol{w}, \boldsymbol{v}\rangle = \boldsymbol{w}^{T} \boldsymbol{v}=\left[w\_{1}, \ldots, w\_{n}\right]\left[\begin{array}{c}{v\_{1}} \\\\ {\vdots} \\\\ {v\_{n}}\end{array}\right]=w\_{1} v\_{1}+\cdots+w\_{n} v\_{n}
$$
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Compute row/column averages of matrix&lt;/strong>
&lt;/p>
$$
\boldsymbol{X}=\underbrace{\left[\begin{array}{ccc}{X\_{1,1}} &amp; {\dots} &amp; {X\_{1, m}} \\\\ {\vdots} &amp; {} &amp; {\vdots} \\\\ {X\_{n, 1}} &amp; {\dots} &amp; {X\_{n, m}}\end{array}\right]}\_{n \text { (samples) } \times m \text { (entries) }}
$$
&lt;ul>
&lt;li>
&lt;p>Vector of row averages (average over all entries per sample)
&lt;/p>
$$
\left[\begin{array}{c}{\frac{1}{m} \sum\_{i=1}^{m} X\_{1, i}} \\\\ {\vdots} \\\\ {\frac{1}{m} \sum\_{i=1}^{m} X\_{n, i}}\end{array}\right]=\boldsymbol{X}\left[\begin{array}{c}{\frac{1}{m}} \\\\ {\vdots} \\\\ {\frac{1}{m}}\end{array}\right]=\boldsymbol{X} \boldsymbol{a}, \quad \text { with } \boldsymbol{a}=\left[\begin{array}{c}{\frac{1}{m}} \\\\ {\vdots} \\\\ {\frac{1}{m}}\end{array}\right]
$$
&lt;/li>
&lt;li>
&lt;p>Vector of column averages (average over all samples per entry)
&lt;/p>
$$
\left[\frac{1}{n} \sum_{i=1}^{n} X\_{i, 1}, \ldots, \frac{1}{n} \sum\_{i=1}^{n} X\_{i, m}\right]=\left[\frac{1}{n}, \ldots, \frac{1}{n}\right] \boldsymbol{X}=\boldsymbol{b}^{T} \boldsymbol{X}, \text { with } \boldsymbol{b}=\left[\begin{array}{c}{\frac{1}{n}} \\\\ {\vdots} \\\\ {\frac{1}{n}}\end{array}\right]
$$
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
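&lt;p>The row/column averages as matrix-vector products can be checked with NumPy on the toy dataset from above (an illustrative sketch, not part of the notes):&lt;/p>

```python
import numpy as np

X = np.array([[37, 72, 175],
              [10, 30, 61],
              [25, 65, 121],
              [66, 67, 175]], dtype=float)
n, m = X.shape

a = np.full(m, 1 / m)    # vector of 1/m entries
b = np.full(n, 1 / n)    # vector of 1/n entries

row_avg = X @ a          # average over all entries per sample
col_avg = b @ X          # average over all samples per entry
```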
&lt;hr>
&lt;h2 id="calculus">Calculus&lt;/h2>
&lt;ul>
&lt;li>
&lt;blockquote>
&lt;p>“The derivative of a function of a real variable measures &lt;strong>the sensitivity to change of a quantity&lt;/strong> (a function value or dependent variable) which is determined by another quantity (the independent variable)”&lt;/p>
&lt;/blockquote>
&lt;/li>
&lt;/ul>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>&lt;/th>
&lt;th>Scalar&lt;/th>
&lt;th>Vector&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Function&lt;/td>
&lt;td>$f(x)$&lt;/td>
&lt;td>$f(\boldsymbol{x})$&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Derivative&lt;/td>
&lt;td>$\frac{\partial f(x)}{\partial x}=g$&lt;/td>
&lt;td>$\frac{\partial f(\boldsymbol{x})}{\partial \boldsymbol{x}}=\left[\frac{\partial f(\boldsymbol{x})}{\partial x\_{1}}, \ldots, \frac{\partial f(\boldsymbol{x})}{\partial x\_{d}}\right]^{T} =: \nabla f(x)\quad$&lt;br />(👆 gradient of function $f$ at $\boldsymbol{x}$)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Min/Max&lt;/td>
&lt;td>$\frac{\partial f(x)}{\partial x}=0$&lt;/td>
&lt;td>$\frac{\partial f(\boldsymbol{x})}{\partial \boldsymbol{x}}=[0, \ldots, 0]^{T}=\mathbf{0}$&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h3 id="matrix-calculus">Matrix Calculus&lt;/h3>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>&lt;/th>
&lt;th>Scalar&lt;/th>
&lt;th>Vector&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Linear&lt;/td>
&lt;td>$\frac{\partial a x}{\partial x}=a$&lt;/td>
&lt;td>$\nabla\_{\boldsymbol{x}} \boldsymbol{A} \boldsymbol{x}=\boldsymbol{A}^{T}$&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Quadratic&lt;/td>
&lt;td>$\frac{\partial x^{2}}{\partial x}=2 x$&lt;/td>
&lt;td>$\begin{array}{l}{\nabla\_{\boldsymbol{x}} \boldsymbol{x}^{T} \boldsymbol{x}=2 \boldsymbol{x}} \\\\ {\nabla\_{\boldsymbol{x}} \boldsymbol{x}^{T} \boldsymbol{A} \boldsymbol{x}=2 \boldsymbol{A} \boldsymbol{x} \text { (for symmetric } \boldsymbol{A})}\end{array}$&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table></description></item><item><title>End-to-End Machine Learning Project</title><link>https://haobin-tan.netlify.app/docs/ai/machine-learning/ml-fundamentals/e2e-ml-project/</link><pubDate>Mon, 17 Aug 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/machine-learning/ml-fundamentals/e2e-ml-project/</guid><description>&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-100" >&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/e2e_ML_Project.png" alt="e2e_ML_Project" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h2 id="1-look-at-the-big-picture">1. Look at the big picture&lt;/h2>
&lt;h3 id="11-frame-the-problem">1.1 Frame the problem&lt;/h3>
&lt;p>Consider the business objective: How do we expect to use and benefit from this model?&lt;/p>
&lt;h3 id="12-select-a-performance-measure">1.2 Select a performance measure&lt;/h3>
&lt;h3 id="13-check-the-assumptions">1.3 Check the assumptions&lt;/h3>
&lt;p>List and verify the assumptions.&lt;/p>
&lt;h2 id="2-get-the-data">2. Get the data&lt;/h2>
&lt;h3 id="21-download-the-data">2.1 Download the data&lt;/h3>
&lt;p>Automate this process: Create a small function to handle downloading, extracting, and storing data.&lt;/p>
&lt;h3 id="22-take-a-quick-look-at-the-data">2.2 Take a quick look at the data&lt;/h3>
&lt;ul>
&lt;li>Use &lt;code>DataFrame.head()&lt;/code> to look at the top rows of the data&lt;/li>
&lt;li>Use &lt;code>DataFrame.info()&lt;/code> to get a quick description of the data
&lt;ul>
&lt;li>For categorical attributes, use &lt;code>value_counts()&lt;/code> to see the categories and the number of samples in each&lt;/li>
&lt;li>For numerical attributes, use &lt;code>describe()&lt;/code> to get a summary of the numerical attributes.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
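&lt;p>A minimal sketch of this first look, using a tiny hypothetical DataFrame (the column names and values are invented for illustration):&lt;/p>

```python
import pandas as pd

# tiny stand-in dataset (hypothetical values, for illustration only)
df = pd.DataFrame({
    "age": [37, 10, 25, 66],
    "height": [175, 61, 121, 175],
    "category": ["a", "b", "a", "a"],
})

top = df.head()                         # first rows of the data
df.info()                               # column dtypes and non-null counts
counts = df["category"].value_counts()  # samples per category
summary = df.describe()                 # stats of the numerical attributes
```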
&lt;h3 id="create-a-test-set">Create a test set&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>If the dataset is large enough, use &lt;strong>purely random sampling&lt;/strong> (&lt;code>train_test_split&lt;/code>).&lt;/p>
&lt;/li>
&lt;li>
&lt;p>If the test set needs to be representative of the overall data, use &lt;strong>stratified sampling&lt;/strong>.&lt;/p>
&lt;/li>
&lt;/ul>
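&lt;p>Both splits in one hedged sketch (toy data; the &lt;code>cat&lt;/code> column is invented for illustration):&lt;/p>

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({"x": range(10), "cat": ["a"] * 5 + ["b"] * 5})

# purely random sampling
train_set, test_set = train_test_split(df, test_size=0.2, random_state=42)

# stratified sampling: keep the "cat" proportions identical in both sets
strat_train, strat_test = train_test_split(
    df, test_size=0.2, stratify=df["cat"], random_state=42)
```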
&lt;h2 id="3-discover-and-visualize-the-data-to-gain-insights">3. Discover and visualize the data to gain insights&lt;/h2>
&lt;ol>
&lt;li>Make sure to put the test set aside and explore only the training set&lt;/li>
&lt;li>If the training set is very large, sample an exploration set to make manipulations easy and fast&lt;/li>
&lt;/ol>
&lt;h3 id="31-visualizing-data">3.1 Visualizing data&lt;/h3>
&lt;h3 id="32-look-for-correlations">3.2 Look for correlations&lt;/h3>
&lt;p>Two ways:&lt;/p>
&lt;ul>
&lt;li>Compute the &lt;strong>standard correlation coefficient&lt;/strong> (also called &lt;strong>Pearson&amp;rsquo;s r&lt;/strong>) between every pair of attributes using the &lt;code>corr()&lt;/code> method.&lt;/li>
&lt;li>Or use the &lt;code>scatter_matrix&lt;/code> function from &lt;code>pandas&lt;/code>&lt;/li>
&lt;/ul>
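&lt;p>A small sketch of the first approach, on invented data chosen so the correlations are obvious (&lt;code>scatter_matrix&lt;/code> is left commented out because it needs matplotlib):&lt;/p>

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3, 4],
                   "y": [2, 4, 6, 8],
                   "z": [4, 3, 2, 1]})

corr = df.corr()   # Pearson's r for every pair of attributes
# x and y are perfectly positively correlated, x and z negatively:
# corr.loc["x", "y"] is 1.0 and corr.loc["x", "z"] is -1.0

# pandas.plotting.scatter_matrix(df) would draw the pairwise scatter plots
```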
&lt;h3 id="33-experimenting-with-attribute-combinations">3.3 Experimenting with attribute combinations&lt;/h3>
&lt;h2 id="4-prepare-the-data-for-ml-algorithms">4. Prepare the data for ML algorithms&lt;/h2>
&lt;p>&lt;strong>First, revert to a clean training set and separate the predictors from the labels.&lt;/strong>&lt;/p>
&lt;h3 id="41-data-cleaning">4.1 Data cleaning&lt;/h3>
&lt;p>Handle missing features:&lt;/p>
&lt;ul>
&lt;li>Get rid of the corresponding samples -&amp;gt; use &lt;code>dropna()&lt;/code>&lt;/li>
&lt;li>Get rid of the whole attribute -&amp;gt; use &lt;code>drop()&lt;/code>&lt;/li>
&lt;li>Set the values to some value (zero, the mean, the median, etc.) -&amp;gt; use &lt;code>fillna()&lt;/code>&lt;/li>
&lt;/ul>
&lt;p>Or apply &lt;code>SimpleImputer&lt;/code> from Scikit-Learn to all the numerical attributes.&lt;/p>
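&lt;p>All four options in one hedged sketch (the DataFrame and its columns are invented for illustration):&lt;/p>

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, 6.0]})

no_rows = df.dropna()                        # drop samples with missing values
no_col = df.drop(columns=["a"])              # drop the whole attribute
filled = df.fillna({"a": df["a"].median()})  # fill with e.g. the median (2.0)

# same idea applied to every numerical column at once
imputer = SimpleImputer(strategy="median")
imputed = imputer.fit_transform(df)
```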
&lt;h3 id="42-handle-text-and-categorical-attributes">4.2 Handle text and categorical attributes&lt;/h3>
&lt;p>Most ML algorithms prefer to work with numbers.
Transform text and categorical attributes into numerical attributes using one-hot encoding.&lt;/p>
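&lt;p>A minimal one-hot sketch using &lt;code>pandas.get_dummies&lt;/code> (Scikit-Learn&amp;rsquo;s &lt;code>OneHotEncoder&lt;/code> is the usual choice inside pipelines; the &lt;code>color&lt;/code> column here is invented):&lt;/p>

```python
import pandas as pd

cats = pd.DataFrame({"color": ["red", "blue", "red"]})

# one 0/1 column per category
onehot = pd.get_dummies(cats["color"])
```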
&lt;h3 id="43-custom-transformers">4.3 Custom transformers&lt;/h3>
&lt;p>The custom transformer should work seamlessly with Scikit-Learn functionalities (such as pipelines).
-&amp;gt; Create a class and implement three methods:&lt;/p>
&lt;ul>
&lt;li>&lt;code>fit()&lt;/code>&lt;/li>
&lt;li>&lt;code>transform()&lt;/code>&lt;/li>
&lt;li>&lt;code>fit_transform()&lt;/code> (can get it by simply adding &lt;code>TransformerMixin&lt;/code> as a base class)&lt;/li>
&lt;/ul>
&lt;p>If we add &lt;code>BaseEstimator&lt;/code> as a base class, we get two extra methods that are useful for automatic hyperparameter tuning:&lt;/p>
&lt;ul>
&lt;li>&lt;code>get_params()&lt;/code>&lt;/li>
&lt;li>&lt;code>set_params()&lt;/code>&lt;/li>
&lt;/ul>
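&lt;p>A sketch of such a custom transformer; the class &lt;code>RatioAdder&lt;/code> and its behaviour are hypothetical, but the three methods and the two base classes are exactly the ones listed above:&lt;/p>

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class RatioAdder(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: appends column i divided by column j."""

    def __init__(self, i=0, j=1):
        self.i = i
        self.j = j

    def fit(self, X, y=None):
        return self                       # nothing to learn here

    def transform(self, X):
        ratio = X[:, self.i] / X[:, self.j]
        return np.c_[X, ratio]

X = np.array([[2.0, 4.0],
              [9.0, 3.0]])
Xt = RatioAdder().fit_transform(X)   # fit_transform() from TransformerMixin
params = RatioAdder().get_params()   # get_params() from BaseEstimator
```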
&lt;h3 id="44-feature-scaling">4.4 Feature scaling&lt;/h3>
&lt;p>Common ways:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Min-max scaling (normalization)&lt;/strong>: Use &lt;code>MinMaxScaler&lt;/code>&lt;/li>
&lt;li>&lt;strong>Standardization&lt;/strong>
&lt;ul>
&lt;li>Use &lt;code>StandardScaler&lt;/code>&lt;/li>
&lt;li>Less affected by outliers&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
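&lt;p>Both scalers on a one-column toy matrix (illustrative only):&lt;/p>

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [2.0], [3.0]])

minmax = MinMaxScaler().fit_transform(X)      # rescales to the range [0, 1]
standard = StandardScaler().fit_transform(X)  # zero mean, unit variance
```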
&lt;h3 id="45-transformation-pipelines">4.5 Transformation pipelines&lt;/h3>
&lt;p>Group sequences of transformations into one step.&lt;/p>
&lt;p>&lt;code>Pipeline&lt;/code> from &lt;code>scikit-learn&lt;/code>:&lt;/p>
&lt;ul>
&lt;li>takes a list of name/estimator pairs defining a sequence of steps&lt;/li>
&lt;li>all but the last estimator must be transformers (i.e., have a &lt;code>fit_transform()&lt;/code> method)&lt;/li>
&lt;li>names can be anything as long as they are unique and contain no double underscores (&amp;ldquo;__&amp;rdquo;)&lt;/li>
&lt;/ul>
&lt;p>It is more convenient to use a &lt;strong>single&lt;/strong> transformer to handle both the categorical and the numerical columns.
-&amp;gt; Use &lt;code>ColumnTransformer&lt;/code>: it handles all columns, applies the appropriate transformations to each, and works well with Pandas DataFrames.&lt;/p>
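&lt;p>A small end-to-end preparation sketch combining &lt;code>Pipeline&lt;/code> and &lt;code>ColumnTransformer&lt;/code> (the DataFrame and column names are invented for illustration):&lt;/p>

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

df = pd.DataFrame({"age": [37.0, np.nan, 25.0],
                   "city": ["a", "b", "a"]})

# numerical columns: impute missing values, then standardize
num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

# one transformer routing each column type to the right steps
full_pipeline = ColumnTransformer([
    ("num", num_pipeline, ["age"]),
    ("cat", OneHotEncoder(), ["city"]),
])

prepared = full_pipeline.fit_transform(df)  # 1 scaled + 2 one-hot columns
```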
&lt;h2 id="5-select-a-model-and-train-it">5. Select a model and train it&lt;/h2>
&lt;h3 id="51-train-and-evaluate-on-the-trainging-set">5.1 Train and evaluate on the trainging set&lt;/h3>
&lt;h3 id="52-better-evaluation-using-cross-validation">5.2 Better evaluation using Cross-Validation&lt;/h3>
&lt;h2 id="6-fine-tune-the-model">6. Fine-tune the model&lt;/h2>
&lt;h3 id="61-grid-search">6.1 Grid search&lt;/h3>
&lt;p>When exploring &lt;strong>relatively few&lt;/strong> combinations, use &lt;code>GridSearchCV&lt;/code>: Tell it which hyperparameters we want to experiment with, and what values to try out. Then it will evaluate all the possible combinations of hyperparameter values, using cross-validation.&lt;/p>
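&lt;p>A minimal grid search sketch (the model, dataset, and parameter values are arbitrary choices for illustration):&lt;/p>

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=100, n_features=4, random_state=42)

# 2 x 2 = 4 hyperparameter combinations, each scored with 3-fold CV
param_grid = {"max_depth": [2, 4], "min_samples_leaf": [1, 5]}
search = GridSearchCV(DecisionTreeRegressor(random_state=42), param_grid,
                      cv=3, scoring="neg_mean_squared_error")
search.fit(X, y)

best_params = search.best_params_
```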
&lt;h3 id="62-randomized-search">6.2 Randomized search&lt;/h3>
&lt;p>When the hyperparameter search space is &lt;strong>large&lt;/strong>, use &lt;code>RandomizedSearchCV&lt;/code>. It evaluates a given number of random combinations by selecting a random value for each hyperparameter at every iteration.&lt;/p>
&lt;h3 id="63-ensemble-methods">6.3 Ensemble methods&lt;/h3>
&lt;p>Try to combine the models that perform best.&lt;/p>
&lt;h3 id="64-analyze-the-best-models-and-their-errors">6.4 Analyze the best models and their errors&lt;/h3>
&lt;p>Gain good insights on the problem by inspecting the best models.&lt;/p>
&lt;h3 id="65-evaluate-the-system-on-the-test-set">6.5 Evaluate the system on the test set&lt;/h3>
&lt;ol>
&lt;li>
&lt;p>Get the predictors and labels from test set&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Run full pipeline to transform the data&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Evaluate the final model on the test set&lt;/p>
&lt;/li>
&lt;/ol>
&lt;h2 id="7-present-the-solution">7. Present the solution&lt;/h2>
&lt;h2 id="8-launch-monitor-and-maintain-the-system">8. Launch, monitor, and maintain the system&lt;/h2>
&lt;ul>
&lt;li>Plug the production input data source into the system and write tests&lt;/li>
&lt;li>Write monitoring code to check the system&amp;rsquo;s live performance at regular intervals and trigger alerts when it drops&lt;/li>
&lt;li>Evaluate the system&amp;rsquo;s input data quality&lt;/li>
&lt;li>Train the models on a regular basis using fresh data (automate this process as much as possible!)&lt;/li>
&lt;/ul></description></item><item><title>Evaluation</title><link>https://haobin-tan.netlify.app/docs/ai/machine-learning/ml-fundamentals/evaluation/</link><pubDate>Mon, 17 Aug 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/machine-learning/ml-fundamentals/evaluation/</guid><description>&lt;h2 id="tldr">TL;DR&lt;/h2>
&lt;figure>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/Confusion_Matrix_and_ROC.png"
alt="Confusion matrix, ROC, and AUC">&lt;figcaption>
&lt;p>Confusion matrix, ROC, and AUC&lt;/p>
&lt;/figcaption>
&lt;/figure>
&lt;h2 id="confuse-matrix">Confuse matrix&lt;/h2>
&lt;p>A confusion matrix tells you what your ML algorithm did right and what it did wrong.&lt;/p>
&lt;style type="text/css">
.tg {border-collapse:collapse;border-spacing:0;}
.tg td{font-family:Arial, sans-serif;font-size:14px;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;border-color:black;}
.tg th{font-family:Arial, sans-serif;font-size:14px;font-weight:normal;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;border-color:black;}
.tg .tg-cly1{text-align:left;vertical-align:middle}
.tg .tg-tab6{color:#77b300;text-align:left;vertical-align:top}
.tg .tg-viqs{color:#fe0000;text-align:left;vertical-align:top}
.tg .tg-0lax{text-align:left;vertical-align:top}
.tg .tg-hjor{font-weight:bold;color:#9698ed;text-align:center;vertical-align:middle}
.tg .tg-dsu0{color:#9698ed;text-align:left;vertical-align:top}
.tg .tg-0sd6{font-weight:bold;color:#3399ff;text-align:center;vertical-align:top}
.tg .tg-12v1{color:#3399ff;text-align:left;vertical-align:top}
&lt;/style>
&lt;table class="tg">
&lt;tr>
&lt;th class="tg-0lax" colspan="2" rowspan="2">&lt;/th>
&lt;th class="tg-hjor" colspan="2">Known Truth&lt;/th>
&lt;th class="tg-cly1" rowspan="2">&lt;/th>
&lt;/tr>
&lt;tr>
&lt;td class="tg-dsu0">Positive&lt;/td>
&lt;td class="tg-dsu0">Negative&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td class="tg-0sd6" rowspan="2">&lt;br>Prediction&lt;/td>
&lt;td class="tg-12v1">Positive&lt;/td>
&lt;td class="tg-tab6">True Positive (TP)&lt;/td>
&lt;td class="tg-viqs">False Positive (FP)&lt;/td>
&lt;td class="tg-0lax">Precision = TP / (TP+FP)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td class="tg-12v1">Negative&lt;/td>
&lt;td class="tg-viqs">False Negative (FN)&lt;/td>
&lt;td class="tg-tab6">True Negative (TN)&lt;/td>
&lt;td class="tg-0lax" rowspan="2">&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td class="tg-0lax" colspan="2">&lt;/td>
&lt;td class="tg-0lax">TPR = Sensitivity = Recall &lt;br> = TP / (TP + FN)&lt;/td>
&lt;td class="tg-0lax">Specificity = TN / (FP+TN) &lt;br> FPR = FP / (FP + TN) = 1 - Specificity &lt;/td>
&lt;/tr>
&lt;/table>
&lt;ul>
&lt;li>Row: Prediction&lt;/li>
&lt;li>Column: Known truth&lt;/li>
&lt;/ul>
&lt;p>Each cell:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Positive/negative: refers to the prediction&lt;/p>
&lt;/li>
&lt;li>
&lt;p>True/False: whether the prediction matches the truth&lt;/p>
&lt;/li>
&lt;li>
&lt;p>The numbers along the diagonal (green) tell us how many times the samples were correctly classified&lt;/p>
&lt;/li>
&lt;li>
&lt;p>The numbers not on the diagonal (red) are samples the algorithm messed up.&lt;/p>
&lt;/li>
&lt;/ul>
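&lt;p>The four cells can be computed with Scikit-Learn. Note that &lt;code>confusion_matrix&lt;/code> puts the known truth in the &lt;em>rows&lt;/em> and the predictions in the &lt;em>columns&lt;/em>, transposed relative to the table here (the labels are toy data):&lt;/p>

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 0, 0, 0, 0, 1]   # known truth
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]   # model predictions

# rows = truth, columns = prediction; ravel() flattens to TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)   # 3 / 4
recall = tp / (tp + fn)      # 3 / 4
```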
&lt;h2 id="definition">Definition&lt;/h2>
&lt;h3 id="precision">&lt;strong>Precision&lt;/strong>&lt;/h3>
&lt;p>How many selected items are relevant?
&lt;/p>
$$
\text{ Precision } = \frac{TP}{TP + FP}
=\frac{\\# \text{ relevant items retrieved }}{\\# \text{ items retrieved }}
$$
&lt;h3 id="recall--true-positive-rate-tpr--sensitivity">&lt;strong>Recall / True Positive Rate (TPR) / Sensitivity&lt;/strong>&lt;/h3>
&lt;p>How many relevant items are selected?
&lt;/p>
$$
\text { Recall } = \frac{TP}{TP + FN}
=\frac{\\# \text { relevant items retrieved }}{\\# \text { relevant items in collection }}
$$
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/350px-Precisionrecall.svg.png" alt="img">&lt;/p>
&lt;details>
&lt;summary>Example&lt;/summary>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-09-15%2011.51.38.png" alt="截屏2020-09-15 11.51.38" style="zoom: 33%;" />
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-09-15%2011.51.43.png" alt="截屏2020-09-15 11.51.43" style="zoom:33%;" />
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-09-15%2011.51.46.png" alt="截屏2020-09-15 11.51.46" style="zoom:33%;" />
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-09-15%2011.51.49.png" alt="截屏2020-09-15 11.51.49" style="zoom:33%;" />
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-09-15%2011.51.52.png" alt="截屏2020-09-15 11.51.52" style="zoom:33%;" />
&lt;/details>
&lt;h3 id="f-score--f-measure">&lt;strong>F-score / F-measure&lt;/strong>&lt;/h3>
&lt;h4 id="f_1-score">$F\_1$ score&lt;/h4>
&lt;p>The traditional F-measure or balanced F-score (&lt;strong>$F\_1$ score&lt;/strong>) is the &lt;a href="https://en.wikipedia.org/wiki/Harmonic_mean#Harmonic_mean_of_two_numbers">harmonic mean&lt;/a> of precision and recall:
&lt;/p>
$$
F\_1=\frac{2 \cdot \text {precison} \cdot \text {recall}}{\text {precision}+\text {recall}} = \frac{2TP}{2TP + FP + FN}
$$
&lt;h4 id="f_beta-score">$F\_\beta$ score&lt;/h4>
&lt;p>$F\_\beta$ uses a positive real factor $\beta$, where $\beta$ is chosen such that &lt;strong>recall is considered $\beta$ times as important as precision&lt;/strong>
&lt;/p>
$$
F\_{\beta}=\left(1+\beta^{2}\right) \cdot \frac{\text { precision } \cdot \text { recall }}{\left(\beta^{2} \cdot \text { precision }\right)+\text { recall }}
$$
&lt;p>
Two commonly used values for $\beta$:&lt;/p>
&lt;ul>
&lt;li>$2$: weighs recall &lt;strong>higher&lt;/strong> than precision&lt;/li>
&lt;li>$0.5$: weighs recall &lt;strong>lower&lt;/strong> than precision&lt;/li>
&lt;/ul>
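&lt;p>These metrics are available directly in Scikit-Learn (the labels below are toy data; with TP=3, FN=1, FP=1 precision and recall are both 0.75, so every F-score is 0.75 as well):&lt;/p>

```python
from sklearn.metrics import f1_score, fbeta_score, precision_score, recall_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]   # TP=3, FN=1, FP=1, TN=3

p = precision_score(y_true, y_pred)        # 0.75
r = recall_score(y_true, y_pred)           # 0.75
f1 = f1_score(y_true, y_pred)              # harmonic mean of p and r
f2 = fbeta_score(y_true, y_pred, beta=2)   # weighs recall higher
```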
&lt;h3 id="specificity">Specificity&lt;/h3>
$$
\text{Specificity} = \frac{TN}{FP + TN}
$$
&lt;h3 id="false-positive-rate-fpr">False Positive Rate (FPR)&lt;/h3>
$$
\text{FPR} = \frac{FP}{FP + TN} \left(= 1- \frac{TN}{FP + TN} = 1- \text{Specificity}\right)
$$
&lt;h2 id="relation-between-sensitivity-specificity-fpr-and-threshold">Relation between Sensitivity, Specificity, FPR and Threshold&lt;/h2>
&lt;p>Assume that the distributions of the actual positive and negative classes look like this:&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/evaluation-metrics-Page-1.png" alt="evaluation-metrics-Page-1" style="zoom:67%;" />
&lt;p>And we have already defined a threshold: samples scoring above it are predicted as positive, and samples scoring below it as negative.&lt;/p>
&lt;p>If we set a lower threshold, we&amp;rsquo;ll get the following diagram:&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/evaluation-metrics-2.png" alt="evaluation-metrics-2" style="zoom:67%;" />
&lt;p>We can notice that FP ⬆️ , and FN ⬇️ .&lt;/p>
&lt;p>Therefore, we have the relationship:&lt;/p>
&lt;ul>
&lt;li>Threshold ⬇️
&lt;ul>
&lt;li>FP ⬆️ , FN ⬇️&lt;/li>
&lt;li>$\text{Sensitivity} (= TPR) = \frac{TP}{TP + FN}$ ⬆️ , $\text{Specificity} = \frac{TN}{TN + FP}$ ⬇️&lt;/li>
&lt;li>$FPR (= 1 - \text{Specificity})$⬆️&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>And vice versa&lt;/li>
&lt;/ul>
&lt;h2 id="auc-roc-curve">AUC-ROC curve&lt;/h2>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/evaluation-metrics-ROC-AUC.png" alt="evaluation-metrics-ROC-AUC" style="zoom:80%;" />
&lt;p>AUC (&lt;strong>Area Under The Curve&lt;/strong>)-ROC (&lt;strong>Receiver Operating Characteristics&lt;/strong>) curve&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Performance measurement for classification problems at various threshold settings.&lt;/p>
&lt;ul>
&lt;li>ROC is a probability curve&lt;/li>
&lt;li>AUC represents the degree or measure of separability&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Tells how much the model is capable of distinguishing between classes.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>The higher the AUC, the better the model is at predicting 0s as 0s and 1s as 1s&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h3 id="how-is-roc-plotted">How is ROC plotted?&lt;/h3>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="k">for&lt;/span> &lt;span class="n">threshold&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="n">thresholds&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="c1"># iterate over all thresholds&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">TPR&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">FPR&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">classify&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">threshold&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="c1"># calculate TPR and FPR based on threshold&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">plot_point&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">FPR&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">TPR&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="c1"># plot coordinate (FPR, TPR) in the diagram&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">connect_points&lt;/span>&lt;span class="p">()&lt;/span> &lt;span class="c1"># connect all plotted points to get ROC curve&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Example:&lt;/p>
&lt;p>Suppose that the probability of a series of samples being classified as positive has been derived, and we sort the samples in descending order:&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-02-24%2022.05.59.png" alt="截屏2021-02-24 22.05.59" style="zoom: 50%;" />
&lt;ul>
&lt;li>Class: actual label of test sample&lt;/li>
&lt;li>Score: probability of classifying test sample as positive&lt;/li>
&lt;/ul>
&lt;p>Next, we use the &amp;ldquo;Score&amp;rdquo; value as the threshold (from high to low).&lt;/p>
&lt;ul>
&lt;li>
&lt;p>When the probability that the test sample is a positive sample is greater than or equal to this threshold, we consider it a positive sample, otherwise it is a negative sample.&lt;/p>
&lt;ul>
&lt;li>For example, for the 4th sample, the &amp;ldquo;Score&amp;rdquo; value is 0.6. Samples 1, 2, 3, and 4 are then considered positive, because their &amp;ldquo;Score&amp;rdquo; values are $\geq$ 0.6; all other samples are classified as negative.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>By picking a different threshold each time, we obtain a pair of FPR and TPR values, i.e., a point on the ROC curve. In this way, we get a total of 20 such pairs. We plot them in the diagram:&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/081955100088586.jpg" alt="img" style="zoom:80%;" />
&lt;/li>
&lt;/ul>
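&lt;p>The threshold loop above can be made concrete with a small self-contained sketch. The &lt;code>labels&lt;/code> and &lt;code>scores&lt;/code> below are hypothetical toy data (not the 20 samples from the figure):&lt;/p>

```python
def roc_points(labels, scores):
    """Compute (FPR, TPR) pairs, using each score as a threshold (high to low)."""
    P = sum(labels)            # number of actual positives
    N = len(labels) - P        # number of actual negatives
    points = [(0.0, 0.0)]      # threshold above the highest score: nothing is positive
    for threshold in sorted(set(scores), reverse=True):
        # classify as positive when score >= threshold
        TP = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        FP = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((FP / N, TP / P))
    return points

# Toy data: 1 = positive class, scores = predicted probability of being positive
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
print(roc_points(labels, scores))
```

&lt;p>Connecting the returned points from (0, 0) to (1, 1) gives the ROC curve.&lt;/p>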
&lt;h3 id="how-to-speculate-about-the-performance-of-the-model">How to speculate about the performance of the model?&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>An &lt;strong>excellent&lt;/strong> model has an &lt;strong>AUC near 1&lt;/strong>, which means it has a good measure of separability.&lt;/p>
&lt;figure>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-24%2021.02.34.png"
alt="Ideal situation: two curves don’t overlap at all means model has an ideal measure of separability. It is perfectly able to distinguish between positive class and negative class.">&lt;figcaption>
&lt;p>Ideal situation: two curves don’t overlap at all means model has an ideal measure of separability. It is perfectly able to distinguish between positive class and negative class.&lt;/p>
&lt;/figcaption>
&lt;/figure>
&lt;/li>
&lt;li>
&lt;p>When $0.5 &lt; \text{AUC} &lt; 1$, there is a high chance that the classifier will be able to distinguish the positive class values from the negative class values. This is because the classifier detects more true positives and true negatives than false negatives and false positives.&lt;/p>
&lt;figure>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-24%2021.09.30.png"
alt="When AUC is 0.7, it means there is a 70% chance that the model will be able to distinguish between positive class and negative class.">&lt;figcaption>
&lt;p>When AUC is 0.7, it means there is a 70% chance that the model will be able to distinguish between positive class and negative class.&lt;/p>
&lt;/figcaption>
&lt;/figure>
&lt;/li>
&lt;li>
&lt;p>When AUC is 0.5, it means the model has no class separation capacity whatsoever.&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-24%2021.05.28.png" alt="When AUC is 0.5, the model has no class separation capacity">&lt;/p>
&lt;/li>
&lt;li>
&lt;p>A &lt;strong>poor&lt;/strong> model has an &lt;strong>AUC near 0&lt;/strong>, which means it has the worst measure of separability.&lt;/p>
&lt;figure>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-24%2021.05.54.png"
alt="When AUC is approximately 0, the model is actually inverting the classes. It means the model is predicting a negative class as a positive class and vice versa.">&lt;figcaption>
&lt;p>When AUC is approximately 0, the model is actually inverting the classes. It means the model is predicting a negative class as a positive class and vice versa.&lt;/p>
&lt;/figcaption>
&lt;/figure>
&lt;/li>
&lt;/ul>
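&lt;p>The probabilistic reading of AUC used above (an AUC of 0.7 means a 70% chance of ranking a positive above a negative) can be checked directly: AUC equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. A minimal sketch with hypothetical toy data:&lt;/p>

```python
def auc_by_ranking(labels, scores):
    """AUC = P(random positive scores higher than random negative); ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
print(auc_by_ranking(labels, scores))  # 8 of 9 positive/negative pairs ranked correctly
```

&lt;p>A perfectly separating model scores every positive above every negative and gets AUC 1; a model that inverts the classes gets AUC near 0.&lt;/p>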
&lt;h2 id="-video-tutorials">🎥 Video tutorials&lt;/h2>
&lt;h3 id="the-confusion-matrix">The confusion matrix&lt;/h3>
&lt;div style="position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;">
&lt;iframe allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen="allowfullscreen" loading="eager" referrerpolicy="strict-origin-when-cross-origin" src="https://www.youtube.com/embed/Kdsp6soqA7o?autoplay=0&amp;controls=1&amp;end=0&amp;loop=0&amp;mute=0&amp;start=0" style="position: absolute; top: 0; left: 0; width: 100%; height: 100%; border:0;" title="YouTube video"
>&lt;/iframe>
&lt;/div>
&lt;h3 id="sensitivity-and-specificity">Sensitivity and specificity&lt;/h3>
&lt;div style="position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;">
&lt;iframe allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen="allowfullscreen" loading="eager" referrerpolicy="strict-origin-when-cross-origin" src="https://www.youtube.com/embed/vP06aMoz4v8?autoplay=0&amp;controls=1&amp;end=0&amp;loop=0&amp;mute=0&amp;start=0" style="position: absolute; top: 0; left: 0; width: 100%; height: 100%; border:0;" title="YouTube video"
>&lt;/iframe>
&lt;/div>
&lt;h3 id="roc-and-auc">ROC and AUC&lt;/h3>
&lt;div style="position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;">
&lt;iframe allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen="allowfullscreen" loading="eager" referrerpolicy="strict-origin-when-cross-origin" src="https://www.youtube.com/embed/4jRBRDbJemM?autoplay=0&amp;controls=1&amp;end=0&amp;loop=0&amp;mute=0&amp;start=0" style="position: absolute; top: 0; left: 0; width: 100%; height: 100%; border:0;" title="YouTube video"
>&lt;/iframe>
&lt;/div>
&lt;h2 id="reference">Reference&lt;/h2>
&lt;ul>
&lt;li>&lt;a href="https://towardsdatascience.com/understanding-auc-roc-curve-68b2303cc9c5">Understanding AUC - ROC Curve&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://deepai.org/machine-learning-glossary-and-terms/f-score">What is the F-score?&lt;/a>: very nice explanation with examples&lt;/li>
&lt;li>&lt;a href="http://www.cnblogs.com/dlml/p/4403482.html">机器学习之分类器性能指标之ROC曲线、AUC值&lt;/a>&lt;/li>
&lt;/ul></description></item><item><title>Overview of Machine Learning Algorithms</title><link>https://haobin-tan.netlify.app/docs/ai/machine-learning/ml-fundamentals/ml-algo-overview/</link><pubDate>Mon, 17 Aug 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/machine-learning/ml-fundamentals/ml-algo-overview/</guid><description>&lt;h2 id="supervisedunsupervised-learning">Supervised/Unsupervised Learning&lt;/h2>
&lt;h3 id="supervised-learning">Supervised learning&lt;/h3>
&lt;p>The training data you feed to the algorithm &lt;strong>includes&lt;/strong> the desired solutions, called &lt;strong>labels&lt;/strong>&lt;/p>
&lt;p>Typical task:&lt;/p>
&lt;ul>
&lt;li>Classification&lt;/li>
&lt;li>Regression&lt;/li>
&lt;/ul>
&lt;p>Important supervised learning algorithms:&lt;/p>
&lt;ul>
&lt;li>k-Nearest Neighbors&lt;/li>
&lt;li>Linear Regression&lt;/li>
&lt;li>Logistic Regression&lt;/li>
&lt;li>Support Vector Machine (SVM)&lt;/li>
&lt;li>Decision Trees and Random Forests&lt;/li>
&lt;li>Neural Networks&lt;/li>
&lt;/ul>
&lt;h3 id="unsupervised-learning">Unsupervised learning&lt;/h3>
&lt;p>Training data is &lt;strong>unlabeled&lt;/strong>.&lt;/p>
&lt;p>Important unsupervised learning algorithms:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Clustering&lt;/p>
&lt;ul>
&lt;li>K-Means&lt;/li>
&lt;li>DBSCAN&lt;/li>
&lt;li>Hierarchical Cluster Analysis (HCA)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Anomaly detection and novelty detection&lt;/p>
&lt;ul>
&lt;li>One-class SVM&lt;/li>
&lt;li>Isolation Forest&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Visualization and dimensionality reduction&lt;/p>
&lt;ul>
&lt;li>Principal Component Analysis (PCA)&lt;/li>
&lt;li>Kernel PCA&lt;/li>
&lt;li>Locally-Linear Embedding (LLE)&lt;/li>
&lt;li>t-distributed Stochastic Neighbor Embedding (t-SNE)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Association rule learning&lt;/p>
&lt;ul>
&lt;li>Apriori&lt;/li>
&lt;li>Eclat&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
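&lt;p>As a taste of the clustering family, here is a minimal K-Means sketch (1-D data for brevity; the data points are hypothetical): it alternates between assigning each point to its nearest centroid and moving each centroid to the mean of its cluster.&lt;/p>

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal 1-D K-Means: alternate nearest-centroid assignment
    and centroid re-estimation as the cluster mean."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: (p - centroids[i]) ** 2)
            clusters[nearest].append(p)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sorted(centroids)

# Hypothetical data with two obvious groups around 1 and 10
data = [1.0, 1.1, 0.9, 10.0, 10.2, 9.8]
print(kmeans(data, k=2))  # centroids settle near 1.0 and 10.0
```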
&lt;h3 id="semisupervised-learning-supervised--unsupervised">Semisupervised learning (supervised + unsupervised)&lt;/h3>
&lt;p>Deals with partially labeled training data: usually a lot of unlabeled data and a little labeled data&lt;/p>
&lt;h3 id="reinforcement-learning">Reinforcement Learning&lt;/h3>
&lt;p>The learning system, called an &lt;strong>agent&lt;/strong> in this context, can observe the environment, select and perform actions, and get rewards in return or penalties in the form of negative rewards.&lt;/p>
&lt;p>It must then learn by itself the best strategy, called a &lt;strong>policy&lt;/strong>, to get the most reward over time.&lt;/p>
&lt;p>A policy defines what action the agent should choose when it is in a given situation.&lt;/p>
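&lt;p>As a minimal illustration, a policy can be represented as a plain state-to-action lookup table, and &amp;ldquo;learning&amp;rdquo; as adjusting that table toward the actions that earned the most reward. The states, actions, and rewards below are hypothetical, chosen only to make the idea concrete:&lt;/p>

```python
# A policy maps each situation (state) to the action the agent should take.
policy = {"low_battery": "explore", "clear_path": "wait"}

def greedy_update(policy, experience):
    """Replace each state's action with the best-rewarded action observed so far."""
    best = {}  # state -> (reward, action) with the highest reward seen
    for state, action, reward in experience:
        if state not in best or reward > best[state][0]:
            best[state] = (reward, action)
    for state, (_, action) in best.items():
        policy[state] = action
    return policy

# Observed (state, action, reward) triples from interacting with the environment
experience = [
    ("low_battery", "explore", -10),
    ("low_battery", "recharge", 5),
    ("clear_path", "forward", 1),
    ("clear_path", "wait", 0),
]
greedy_update(policy, experience)
print(policy)  # the agent now prefers "recharge" and "forward"
```

&lt;p>Real RL algorithms (e.g. Q-learning) refine this idea with value estimates and exploration rather than a one-shot greedy replacement.&lt;/p>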
&lt;h2 id="batch-and-online-learning">Batch and Online Learning&lt;/h2>
&lt;p>This criterion is whether or not the system can learn incrementally from a stream of incoming data.&lt;/p>
&lt;h3 id="batch-learning">Batch Learning&lt;/h3>
&lt;p>The system must be trained using all the available data (i.e., it is incapable of learning incrementally)&lt;/p>
&lt;p>First the system is trained, and then it is launched into production and runs without learning anymore; it just applies what it has learned. This is called &lt;strong>offline learning&lt;/strong>.&lt;/p>
&lt;p>Want a batch learning system to know about new data?&lt;/p>
&lt;p>Need to train a new version of the system from scratch on the full dataset (not just the new data, but also the old data). Then stop the old system and replace it with the new one.&lt;/p>
&lt;h3 id="online-learning">Online Learning&lt;/h3>
&lt;p>Train the system &lt;strong>incrementally&lt;/strong> by feeding it data instances sequentially, either individually or in small groups called &lt;strong>mini-batches&lt;/strong>.&lt;/p>
&lt;p>Each learning step is fast and cheap, so the system can learn about new data on the fly, as it arrives.&lt;/p>
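&lt;p>A minimal sketch of such an incremental step, assuming a simple linear model $y \approx wx + b$ trained with a mean-squared-error gradient (the data stream below is hypothetical):&lt;/p>

```python
def sgd_step(w, b, batch, lr=0.1):
    """One incremental update of a linear model y ≈ w*x + b
    using the mean squared-error gradient over one mini-batch."""
    n = len(batch)
    grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / n
    grad_b = sum(2 * (w * x + b - y) for x, y in batch) / n
    return w - lr * grad_w, b - lr * grad_b

# Data arrives as a stream of (x, y) mini-batches (hypothetical: roughly y = 2x + 1)
stream = [[(0, 1), (1, 3)], [(2, 5), (3, 7)], [(1, 3), (4, 9)]]
w, b = 0.0, 0.0
for batch in stream:
    w, b = sgd_step(w, b, batch)  # learn from the batch, then discard it
print(w, b)  # parameters have moved from (0, 0) toward the underlying (2, 1)
```

&lt;p>Each step touches only the current mini-batch, which is why the instances can be thrown away afterwards.&lt;/p>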
&lt;p>👍 Advantages:&lt;/p>
&lt;ul>
&lt;li>Great for systems that receive data as a continuous flow and need to adapt to change rapidly or autonomously&lt;/li>
&lt;li>Saves a huge amount of space (after learning from new data instances, the system no longer needs them and can simply discard them)&lt;/li>
&lt;/ul>
&lt;p>😠 Challenge: if bad data is fed to the system, the system&amp;rsquo;s performance will gradually decline.&lt;/p>
&lt;p>🔧 Solution:&lt;/p>
&lt;ul>
&lt;li>monitor the system closely&lt;/li>
&lt;li>promptly switch learning off if a drop in performance is detected&lt;/li>
&lt;li>monitor the input data and react to abnormal data&lt;/li>
&lt;/ul>
&lt;h2 id="instance-based-vs-model-based-learning">Instance-Based Vs. Model-Based Learning&lt;/h2>
&lt;h3 id="instance-based-learning">Instance-based learning&lt;/h3>
&lt;p>The system learns the examples by heart, then generalizes to new cases by comparing them to the learned examples (or a subset of them), using a similarity measure&lt;/p>
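&lt;p>A minimal sketch of this idea, using a 1-nearest-neighbor rule with Euclidean distance as the similarity measure (the stored examples below are hypothetical):&lt;/p>

```python
def nearest_neighbor_predict(examples, query):
    """Instance-based learning: memorize the examples, then predict the label
    of the stored instance most similar (smallest Euclidean distance) to the query."""
    def distance(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5
    features, label = min(examples, key=lambda ex: distance(ex[0], query))
    return label

# Memorized training examples: (feature vector, label)
examples = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((5.0, 5.0), "B")]
print(nearest_neighbor_predict(examples, (4.5, 5.2)))  # closest stored instance is "B"
```

&lt;p>There is no training step beyond storing the data; all the work happens at prediction time.&lt;/p>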
&lt;h3 id="model-based-learning">Model-based learning&lt;/h3>
&lt;p>Build a model of these examples, then use that model to make predictions&lt;/p></description></item></channel></rss>