<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Ensemble Learning | Haobin Tan</title><link>https://haobin-tan.netlify.app/tags/ensemble-learning/</link><atom:link href="https://haobin-tan.netlify.app/tags/ensemble-learning/index.xml" rel="self" type="application/rss+xml"/><description>Ensemble Learning</description><generator>Hugo Blox Builder (https://hugoblox.com)</generator><language>en-us</language><lastBuildDate>Sat, 07 Nov 2020 00:00:00 +0000</lastBuildDate><image><url>https://haobin-tan.netlify.app/media/icon_hu7d15bc7db65c8eaf7a4f66f5447d0b42_15095_512x512_fill_lanczos_center_3.png</url><title>Ensemble Learning</title><link>https://haobin-tan.netlify.app/tags/ensemble-learning/</link></image><item><title>Ensemble Learning</title><link>https://haobin-tan.netlify.app/docs/ai/machine-learning/ensemble-learning/</link><pubDate>Mon, 07 Sep 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/machine-learning/ensemble-learning/</guid><description/></item><item><title>Why ensemble learning?</title><link>https://haobin-tan.netlify.app/docs/ai/machine-learning/ensemble-learning/why-ensemble-learning/</link><pubDate>Sat, 07 Nov 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/machine-learning/ensemble-learning/why-ensemble-learning/</guid><description>&lt;p>&lt;strong>wisdom of the crowd&lt;/strong> : In many cases you will find that this aggregated answer is better than an expert’s answer.&lt;/p>
&lt;p>Similarly, if you aggregate the predictions of a group of predictors (such as classifiers or regressors), you will often get &lt;strong>better&lt;/strong> predictions than with the best individual predictor.&lt;/p>
&lt;p>A group of predictors is called an &lt;strong>ensemble&lt;/strong>; thus, this technique is called &lt;strong>Ensemble Learning&lt;/strong>, and an Ensemble Learning algorithm is called an &lt;strong>Ensemble method&lt;/strong>.&lt;/p>
&lt;p>Popular Ensemble methods:&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://haobin-tan.netlify.app/docs/ai/machine-learning/ensemble-learning/bagging-and-pasting/">Bagging and Pasting&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://haobin-tan.netlify.app/docs/ai/machine-learning/ensemble-learning/boosting/">Boosting&lt;/a>&lt;/li>
&lt;li>stacking&lt;/li>
&lt;li>&lt;a href="https://haobin-tan.netlify.app/docs/ai/machine-learning/ensemble-learning/voting-classifier/">Voting Classifier&lt;/a>&lt;/li>
&lt;/ul></description></item><item><title>Voting Classifier</title><link>https://haobin-tan.netlify.app/docs/ai/machine-learning/ensemble-learning/voting-classifier/</link><pubDate>Sat, 07 Nov 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/machine-learning/ensemble-learning/voting-classifier/</guid><description>&lt;p>Suppose we have trained a few classifiers, each one achieving about 80% accuracy.&lt;/p>
&lt;p>A very simple way to create an even better classifier is to aggregate the predictions of each classifier and predict the class that gets the &lt;strong>most&lt;/strong> votes.&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/Voting_Classifier.png" alt="Voting_Classifier" style="zoom:67%;" />
&lt;p>This majority-vote classifier is called a &lt;strong>hard voting classifier&lt;/strong>&lt;/p>
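&lt;p>A minimal hard-voting sketch with scikit-learn (the &lt;code>make_moons&lt;/code> dataset and the hyperparameters are illustrative assumptions, not from the original notes):&lt;/p>

```python
# Hard voting: aggregate the class predictions of a few diverse
# classifiers and predict the majority class.
from sklearn.datasets import make_moons
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)

voting_clf = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(random_state=42)),
        ("svc", SVC(random_state=42)),
        ("tree", DecisionTreeClassifier(random_state=42)),
    ],
    voting="hard",  # predict the class that gets the most votes
)
voting_clf.fit(X, y)
print(voting_clf.predict(X[:3]))
```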
&lt;blockquote>
&lt;p>Surprisingly, this voting classifier often achieves a higher accuracy than the best classifier in the ensemble. In fact, even if each classifier is a weak learner (meaning it does only slightly better than random guessing), the ensemble can still be a strong learner (achieving high accuracy), provided there are a sufficient number of weak learners and they are sufficiently diverse. (Reason behind: the law of large numbers)&lt;/p>
&lt;/blockquote>
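&lt;p>The law-of-large-numbers argument can be checked with a small simulation (illustrative, stdlib only): 1,000 independent weak "classifiers", each correct with probability 0.51, combined by majority vote.&lt;/p>

```python
import random

random.seed(0)

def majority_vote_correct(n_learners=1000, p_correct=0.51):
    """True if the majority of independent weak learners is correct."""
    votes = sum(random.random() < p_correct for _ in range(n_learners))
    return votes > n_learners / 2

# Fraction of trials in which the majority vote is right -- well above
# the 51% accuracy of any single weak learner.
trials = 1000
accuracy = sum(majority_vote_correct() for _ in range(trials)) / trials
print(accuracy)
```

Note this only holds because the simulated learners are fully independent; correlated errors would shrink the gain, which is why the diversity points below matter.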
&lt;p>&lt;strong>Ensemble methods work best when the predictors are as independent from one another as possible.&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>One way to get diverse classifiers is to &lt;strong>train them using very different algorithms.&lt;/strong> This increases the chance that they will make very different types of errors, improving the ensemble’s accuracy.&lt;/li>
&lt;li>Another approach is to use the &lt;strong>same&lt;/strong> training algorithm for every predictor, but to train them on different random subsets of the training set. (See &lt;a href="https://haobin-tan.netlify.app/docs/ai/machine-learning/ensemble-learning/bagging-and-pasting/">Bagging and Pasting&lt;/a>)&lt;/li>
&lt;/ul></description></item><item><title>Random Forest</title><link>https://haobin-tan.netlify.app/docs/ai/machine-learning/ensemble-learning/random-forest/</link><pubDate>Sat, 07 Nov 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/machine-learning/ensemble-learning/random-forest/</guid><description>&lt;img src="https://i.stack.imgur.com/iY55n.jpg" style="zoom:80%; background-color:white">
&lt;p>Train a group of Decision Tree classifiers (generally via the bagging method (or sometimes pasting)), each on a different random subset of the training set&lt;/p>
&lt;p>To make predictions, just obtain the predictions of all individual trees, then predict the class that gets the &lt;strong>most&lt;/strong> votes.&lt;/p>
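&lt;p>A minimal Random Forest sketch with scikit-learn (dataset and parameters are illustrative assumptions):&lt;/p>

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)

rnd_clf = RandomForestClassifier(
    n_estimators=500,    # number of trees in the forest
    max_leaf_nodes=16,   # regularize each individual tree
    n_jobs=-1,           # grow trees in parallel on all cores
    random_state=42,
)
rnd_clf.fit(X, y)
print(rnd_clf.predict(X[:3]))
```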
&lt;h2 id="why-is-random-forest-good">Why is Random Forest good?&lt;/h2>
&lt;p>The Random Forest algorithm &lt;strong>introduces extra randomness&lt;/strong> when growing trees; instead of searching for the very best feature when splitting a node, it searches for the best feature among a random subset of features. &lt;strong>This results in a greater tree diversity, which (once again) trades a higher bias for a lower variance, generally yielding an overall better model.&lt;/strong> 👏&lt;/p></description></item><item><title>Ensemble Learners</title><link>https://haobin-tan.netlify.app/docs/ai/machine-learning/ensemble-learning/ensemble-learners/</link><pubDate>Sat, 07 Nov 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/machine-learning/ensemble-learning/ensemble-learners/</guid><description>
&lt;div style="position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;">
&lt;iframe allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen="allowfullscreen" loading="eager" referrerpolicy="strict-origin-when-cross-origin" src="https://www.youtube.com/embed/Un9zObFjBH0?autoplay=0&amp;controls=1&amp;end=0&amp;loop=0&amp;mute=0&amp;start=0" style="position: absolute; top: 0; left: 0; width: 100%; height: 100%; border:0;" title="YouTube video"
>&lt;/iframe>
&lt;/div>
&lt;h2 id="why-emsemble-learners">Why emsemble learners?&lt;/h2>
&lt;p>Lower error&lt;/p>
&lt;ul>
&lt;li>Each learner (model) has its own bias. If we put them together, the biases tend to be reduced (they counteract each other to some extent)&lt;/li>
&lt;li>Less overfitting&lt;/li>
&lt;li>Tastes great&lt;/li>
&lt;/ul></description></item><item><title>Boosting</title><link>https://haobin-tan.netlify.app/docs/ai/machine-learning/ensemble-learning/boosting/</link><pubDate>Sat, 07 Nov 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/machine-learning/ensemble-learning/boosting/</guid><description>&lt;h1 id="boosting">Boosting&lt;/h1>
&lt;p>Refers to any Ensemble method that can &lt;strong>combine several weak learners into a strong learner&lt;/strong>&lt;/p>
&lt;p>💡 &lt;strong>General idea: train predictors sequentially, each trying to correct its predecessor.&lt;/strong>&lt;/p>
&lt;p>Popular boosting methods:&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://haobin-tan.netlify.app/docs/ai/machine-learning/ensemble-learning/adaboost/">AdaBoost&lt;/a>&lt;/li>
&lt;li>Gradient Boosting&lt;/li>
&lt;/ul></description></item><item><title>Bagging and Pasting</title><link>https://haobin-tan.netlify.app/docs/ai/machine-learning/ensemble-learning/bagging-and-pasting/</link><pubDate>Sat, 07 Nov 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/machine-learning/ensemble-learning/bagging-and-pasting/</guid><description>&lt;h2 id="tldr">TL;DR&lt;/h2>
&lt;ul>
&lt;li>
&lt;p>Bootstrap Aggregating (Bagging): Sampling &lt;strong>with&lt;/strong> replacement&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-100" >&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/Boostrap_Aggregating.png" alt="Boostrap_Aggregating" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Pasting: Sampling &lt;strong>without&lt;/strong> replacement&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h2 id="explaination">Explaination&lt;/h2>
&lt;p>Ensemble methods work best when the predictors are as independent from one another as possible.&lt;/p>
&lt;p>One way to get a diverse set of classifiers: &lt;strong>use the same training algorithm for every predictor, but to train them on different random subsets of the training set&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Sampling &lt;strong>with&lt;/strong> replacement: &lt;strong>bootstrap aggregating (Bagging)&lt;/strong>&lt;/li>
&lt;li>Sampling &lt;strong>without&lt;/strong> replacement: &lt;strong>pasting&lt;/strong>&lt;/li>
&lt;/ul>
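&lt;p>In scikit-learn, the &lt;code>bootstrap&lt;/code> flag is what separates bagging from pasting; a minimal sketch (dataset and parameters are illustrative assumptions):&lt;/p>

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=500,
    max_samples=100,
    bootstrap=True,    # sampling WITH replacement -> bagging
    n_jobs=-1,         # predictors train in parallel
    random_state=42,
)
paste_clf = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=500,
    max_samples=100,
    bootstrap=False,   # sampling WITHOUT replacement -> pasting
    n_jobs=-1,
    random_state=42,
)
bag_clf.fit(X, y)
paste_clf.fit(X, y)
```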
&lt;p>Once all predictors are trained, the ensemble can make a prediction for a new instance by simply aggregating the predictions of all predictors. The aggregation function is typically the &lt;strong>statistical mode&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>classification: the most frequent prediction (just like a hard voting classifier)&lt;/li>
&lt;li>regression: average&lt;/li>
&lt;/ul>
&lt;p>Each individual predictor has a higher bias than if it were trained on the original training set, but aggregation reduces both bias and variance. 👏&lt;/p>
&lt;p>Generally, the net result is that the ensemble has a &lt;strong>similar bias but a lower variance&lt;/strong> than a single predictor trained on the original training set.&lt;/p>
&lt;h2 id="advantages-of-bagging-and-pasting">Advantages of Bagging and Pasting&lt;/h2>
&lt;ul>
&lt;li>Predictors can all be trained in parallel, via different CPU cores or even different servers.&lt;/li>
&lt;li>Predictions can be made in parallel.&lt;/li>
&lt;/ul>
&lt;p>-&amp;gt; They scale very well 👍&lt;/p>
&lt;h2 id="bagging-vs-pasting">Bagging vs. Pasting&lt;/h2>
&lt;ul>
&lt;li>
&lt;p>Bootstrapping introduces a bit more diversity in the subsets that each predictor is trained on, so bagging ends up with a &lt;strong>slightly&lt;/strong> &lt;strong>higher bias&lt;/strong> than pasting, but this also means that predictors end up being &lt;strong>less correlated&lt;/strong> so the ensemble’s variance is reduced.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Overall, bagging often results in better models&lt;/strong>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>However, if you have spare time and CPU power, you can use cross-validation to evaluate both bagging and pasting and select the one that works best.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h2 id="out-of-bag-evaluation">Out-of-Bag Evaluation&lt;/h2>
&lt;p>With bagging, some instances may be sampled several times for any given predictor, while others may not be sampled at all. This means that only about 63% of the training instances are sampled on average for each predictor.&lt;/p>
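&lt;p>The 63% figure follows from a short calculation (illustrative sketch, stdlib only): the probability that a given instance is sampled at least once in $n$ draws with replacement is $1 - (1 - 1/n)^n$, which approaches $1 - 1/e \approx 0.632$ as $n$ grows.&lt;/p>

```python
import math

# Probability that a given instance appears in a bootstrap sample of
# size n drawn from n instances.
for n in (10, 100, 10_000):
    p_sampled = 1 - (1 - 1 / n) ** n
    print(n, round(p_sampled, 4))

print(1 - 1 / math.e)  # the limit, roughly 0.632
```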
&lt;p>The remaining 37% of the training instances that are not sampled are called &lt;strong>out-of-bag (oob) instances.&lt;/strong> Note that they are &lt;strong>not the same 37%&lt;/strong> for all predictors.&lt;/p>
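&lt;p>Out-of-bag evaluation is built into scikit-learn's &lt;code>BaggingClassifier&lt;/code>; a minimal sketch (dataset and parameters are illustrative assumptions):&lt;/p>

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=500,
    bootstrap=True,
    oob_score=True,   # evaluate each predictor on its oob instances
    n_jobs=-1,
    random_state=42,
)
bag_clf.fit(X, y)
print(bag_clf.oob_score_)  # an estimate of test-set accuracy
```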
&lt;p>Since a predictor never sees the oob instances during training, it can be evaluated on these instances, without the need for a separate validation set. You can evaluate the ensemble itself by averaging out the oob evaluations of each predictor.&lt;/p></description></item><item><title>AdaBoost</title><link>https://haobin-tan.netlify.app/docs/ai/machine-learning/ensemble-learning/adaboost/</link><pubDate>Sat, 07 Nov 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/machine-learning/ensemble-learning/adaboost/</guid><description>&lt;p>&lt;strong>Ada&lt;/strong>ptive &lt;strong>Boost&lt;/strong>ing:&lt;/p>
&lt;p>Correct its predecessor by paying a bit more attention to the training instances that the predecessor underfitted. This results in new predictors focusing more and more on the hard cases.&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/AdaBoost.png" alt="AdaBoost" style="zoom:80%;" />
&lt;h2 id="pseudocode">Pseudocode&lt;/h2>
&lt;ol>
&lt;li>
&lt;p>Assign each observation $i$ the initial weight $d\_{1,i}=\frac{1}{n}$ (equal weights)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>For $t=1:T$&lt;/p>
&lt;ol>
&lt;li>
&lt;p>Train the weak learning algorithm using data weighted by $d\_{t,i}$. This produces weak classifier $h\_t$&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Choose coefficient $\alpha\_t$ (tells us how good the classifier is at that round)&lt;/p>
&lt;/li>
&lt;/ol>
$$
\begin{aligned}
\operatorname{Error}\_{t} &amp;= \displaystyle\sum\_{i: h\_{t}\left(x\_{i}\right) \neq y\_{i}} d\_{t,i} \quad \text{(sum of weights of misclassified points)} \\\\
\alpha\_t &amp;= \frac{1}{2} \ln \left(\frac{1 - \operatorname{Error}\_{t}}{\operatorname{Error}\_{t}}\right)
\end{aligned}
$$
&lt;ol start="3">
&lt;li>
&lt;p>Update weights
&lt;/p>
$$
d\_{t+1, i}=\frac{d\_{t, i} \cdot \exp (-\alpha\_{t} y\_{i} h\_{t}\left(x\_{i}\right))}{Z\_{t}}
$$
&lt;ul>
&lt;li>
&lt;p>$Z\_t = \displaystyle \sum\_{i=1}^{n} d\_{t,i} \cdot \exp (-\alpha\_{t} y\_{i} h\_{t}\left(x\_{i}\right))$: &lt;strong>normalization factor&lt;/strong> (so that the updated weights sum to 1)&lt;/p>
&lt;blockquote>
&lt;ul>
&lt;li>If prediction $i$ is correct $\rightarrow y\_i h\_t(x\_i) = 1 \rightarrow$ the weight of observation $i$ is multiplied by $\exp(-\alpha\_t)$, i.e. decreased&lt;/li>
&lt;li>If prediction $i$ is incorrect $\rightarrow y\_i h\_t(x\_i) = -1 \rightarrow$ the weight of observation $i$ is multiplied by $\exp(\alpha\_t)$, i.e. increased&lt;/li>
&lt;/ul>
&lt;/blockquote>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol>
&lt;/li>
&lt;li>
&lt;p>Output the final classifier&lt;/p>
&lt;p>$
H(x)=\operatorname{sign}\left(\sum\_{t=1}^{T} \alpha\_{t} h\_{t}\left(x\right)\right)
$&lt;/p>
&lt;/li>
&lt;/ol>
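&lt;p>The pseudocode above can be sketched from scratch in Python. This is an illustrative implementation only: the dataset, the 50 rounds, and the depth-1 scikit-learn trees used as weak learners are all assumptions, and labels are mapped to $\{-1, +1\}$ as the formulas require.&lt;/p>

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

X, y01 = make_moons(n_samples=500, noise=0.30, random_state=42)
y = np.where(y01 == 1, 1, -1)          # labels in {-1, +1}

n = len(X)
d = np.full(n, 1 / n)                  # step 1: equal weights d_{1,i} = 1/n
learners, alphas = [], []

for t in range(50):                    # step 2: T rounds
    stump = DecisionTreeClassifier(max_depth=1, random_state=42)
    stump.fit(X, y, sample_weight=d)   # 2.1: train on weighted data
    pred = stump.predict(X)
    error = d[pred != y].sum()         # sum of weights of misclassified points
    alpha = 0.5 * np.log((1 - error) / error)  # 2.2: coefficient alpha_t
    d = d * np.exp(-alpha * y * pred)  # 2.3: up-weight the mistakes...
    d = d / d.sum()                    # ...and normalize (divide by Z_t)
    learners.append(stump)
    alphas.append(alpha)

# Step 3: final classifier H(x) = sign(sum_t alpha_t * h_t(x))
H = np.sign(sum(a * h.predict(X) for a, h in zip(alphas, learners)))
print((H == y).mean())                 # training accuracy of the ensemble
```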
&lt;h2 id="example">Example&lt;/h2>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/AdaBoost_Eg-00.png" alt="AdaBoost_Eg-00" style="zoom:50%;" />
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/AdaBoost_Eg-01.png" alt="AdaBoost_Eg-01" style="zoom:50%;" />
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/AdaBoost_Eg-02.png" alt="AdaBoost_Eg-02" style="zoom:50%;" />
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/AdaBoost_Eg-03.png" alt="AdaBoost_Eg-03" style="zoom:50%;" />
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/AdaBoost_Eg-04.png" alt="AdaBoost_Eg-04" style="zoom:50%;" />
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/AdaBoost_Eg-05.png" alt="AdaBoost_Eg-05" style="zoom:50%;" />
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/AdaBoost_Eg-06.png" alt="AdaBoost_Eg-06" style="zoom:50%;" />
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/AdaBoost_Eg-07.png" alt="AdaBoost_Eg-07" style="zoom:50%;" />
&lt;h2 id="tutorial">Tutorial&lt;/h2>
&lt;div style="position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;">
&lt;iframe allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen="allowfullscreen" loading="eager" referrerpolicy="strict-origin-when-cross-origin" src="https://www.youtube.com/embed/-DUxtdeCiB4?autoplay=0&amp;controls=1&amp;end=0&amp;loop=0&amp;mute=0&amp;start=0" style="position: absolute; top: 0; left: 0; width: 100%; height: 100%; border:0;" title="YouTube video"
>&lt;/iframe>
&lt;/div></description></item></channel></rss>