Back To The Basics
I’ve decided to compile the notes I’ve made over the course of my ML journey as a series of blog posts here on my website.
You can view other topics in this series by clicking on the ML NOTES category in the article header above.
Disclaimer
To compile my notes, I read through multiple sources: articles, documentation pages, research papers, and textbooks. I was looking to maximise my own understanding of the concepts and never originally intended to share the notes with the world, so I did not do a good job of documenting sources for later reference.
I’ll leave references to source materials if I have them saved. Please note that I’m not claiming sole authorship of these blog posts; these are just my personal notes and I’m sharing them here in the hopes that they’ll be helpful to you in your own ML journey.
Take these articles as a starting point to comprehend the concepts. If you spot any mistakes or errors in these articles or have suggestions for improvement, please feel free to share your thoughts with me through my LinkedIn.
We’ll take a look at ensemble methods in this post.
Ensembles
Ensemble methods combine multiple models to generate predictions as opposed to using a single model.
Ensemble models are based on the idea that combining a set of diverse, uncorrelated models can yield better performance than relying on any single model.
However, ensembles are more complex and harder to maintain in production. They are therefore typically reserved for situations where a small improvement in prediction accuracy translates into large financial gains, such as improving ad conversion rates, or a recommender system achieving better click-through rates by serving improved recommendations to users.
We’ll take a deeper look into the three main ensemble techniques.
1. Bootstrap Aggregation or Bagging
Bootstrap aggregation or bagging is an ensemble technique that uses different slices of the training data to train a bunch of decision trees.
These slices are created by sampling observations with replacement from the training data: an observation that has been drawn is put back into the pool, so it has the same chance of being picked on every subsequent draw. These slices are called bootstraps.
Sampling with replacement means some observations may end up in multiple bootstraps (or appear more than once within the same bootstrap), while others may not be picked at all.
A decision tree is then trained on each bootstrap and allowed to grow to maximum depth. No pruning is performed, so this process gives us a collection of (at least slightly) different trees.
The key factor that enables the ensemble to outperform a single tree is that the errors of each tree are (or should be) uncorrelated, or only weakly correlated, with the errors of the other trees in the ensemble. Only then will the ensemble perform better than an individual tree.
After the ensemble is trained, the test predictions from the individual unpruned trees are combined into a single prediction; this is the aggregation step. For regression tasks, the numeric predictions from each tree are averaged. For classification tasks, aggregation can be done by hard voting, where the class label predicted by the majority of the trees is the output, or by soft voting, where each tree outputs predicted probabilities for each class and those probabilities are averaged.
One example of a bagging model is the random forest algorithm.
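To make the bootstrap-and-aggregate loop concrete, here is a minimal sketch in Python, assuming scikit-learn for the trees and a synthetic dataset purely for illustration; the number of bootstraps and the soft-voting aggregation are arbitrary choices, not a fixed recipe.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Illustrative dataset; any tabular classification data would do.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

rng = np.random.default_rng(42)
n_trees = 25
trees = []

for _ in range(n_trees):
    # Bootstrap: sample row indices with replacement, same size as the training set.
    idx = rng.choice(len(X_train), size=len(X_train), replace=True)
    # Unpruned tree: no max_depth, so it grows until its leaves are pure.
    trees.append(DecisionTreeClassifier().fit(X_train[idx], y_train[idx]))

# Aggregation by soft voting: average each tree's predicted class probabilities.
avg_proba = np.mean([tree.predict_proba(X_test) for tree in trees], axis=0)
y_pred = avg_proba.argmax(axis=1)
print("bagged accuracy:", (y_pred == y_test).mean())
```

scikit-learn packages this pattern as BaggingClassifier, and random forests additionally decorrelate the trees by considering only a random subset of features at each split.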
Advantages
Bagging reduces the variance of the ensemble.
Choosing models with high variance, such as unpruned decision trees, makes the best use of the bagging technique. Using models which already have low variance, like k-NN or LDA, might not result in much of a performance improvement over the individual model.
Bagging also provides an internal estimate of the model’s predictive performance. While creating the bootstrap for each model in the ensemble, the samples that are not picked form what is called the “out-of-bag” sample set. These out-of-bag samples can be used as an internal measure of performance, known as the out-of-bag estimate.
This means you can use more of the data for training without setting aside a separate validation set: the out-of-bag estimate is generally accurate and correlates well with cross-validation or test-set estimates, saving the time, effort, and compute of an additional cross-validation step.
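As a rough illustration, assuming scikit-learn’s random forest implementation and another synthetic dataset, the out-of-bag estimate is available directly from the fitted model:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Illustrative dataset only.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# oob_score=True scores each sample using only the trees whose bootstraps
# did NOT contain that sample.
forest = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
forest.fit(X, y)

print("out-of-bag accuracy estimate:", forest.oob_score_)
```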
2. Boosting
In the boosting ensemble method, a strong ensemble is built by iteratively improving upon weak learners’ mistakes.
Process
- The boosting ensemble model starts off with a weak learner trained on the entire training dataset.
- The model’s predictions are analysed, and the samples it misclassified are assigned higher weights. This re-weighted dataset is used as the training data in the next iteration.
- The misclassified samples are assigned higher weights in the hopes that the next model will pay greater attention to those samples and learn to avoid the previous model’s mistakes.
- In the second iteration, the ensemble consists of the old model plus a newly added weak learner. The ensemble is trained on the re-weighted dataset created in the previous step, and once again the samples it misclassifies are assigned higher weights, producing yet another, slightly different, re-weighted dataset.
- Once again, in the subsequent iteration, another new weak learner is added to the ensemble, and the ensemble’s predictions are analysed to create the re-weighted dataset for the next step.
- Eventually, after repeating this process multiple times, the ensemble produced is a stronger model able to outperform single models.
- This process of re-weighting misclassified data points and adding new weak learners to the ensemble continues until some stopping criterion is met (see the sketch just after this list).
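As a minimal sketch of this loop, assuming scikit-learn’s AdaBoostClassifier, a classic boosting implementation that performs the re-weighting internally, with single-split decision stumps as the weak learners (recent scikit-learn versions take the weak learner via the estimator argument, older ones via base_estimator):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Illustrative dataset only.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A depth-1 tree (a "stump") is a classic weak learner.
stump = DecisionTreeClassifier(max_depth=1)

# Each new stump is fitted to a re-weighted dataset that emphasises the samples
# the current ensemble gets wrong; n_estimators acts as the stopping criterion.
boosted = AdaBoostClassifier(estimator=stump, n_estimators=100, random_state=0)
boosted.fit(X_train, y_train)

print("boosted accuracy:", boosted.score(X_test, y_test))
print("single stump accuracy:", stump.fit(X_train, y_train).score(X_test, y_test))
```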
What models to use in Boosting?
Generally, any classification algorithm can be used with boosting, but decision trees are the usual choice. Restricting the tree depth (down to single-split stumps) turns them into weak learners with high bias and low variance, and correcting that high bias is exactly what boosting is good at.
By repeatedly adding weak learners that focus on the previous ensemble’s mistakes, boosting drives the bias down, so the final ensemble ends up as a low-bias, low-variance model.
Boosting tends to deliver a smaller improvement for stable, low-variance models such as LDA or k-NN than it does for less stable learners such as decision trees, neural networks, or naive Bayes classifiers.
3. Stacking
Stacking (or stacked generalization) is an ensemble technique where multiple base learners’ predictions are fed to a meta-learner which is solely responsible for predicting the final output.
The meta-learner learns how to best combine predictions from each of the base learners.
The key thing to remember when assembling a stacked ensemble is to choose models that fail in different ways, i.e., models whose errors are uncorrelated (or only weakly correlated) with each other.
This is because having diverse models helps the ensemble improve on the performance of a single model and prevents the ensemble from compounding mistakes in the same direction.
The base learners can be linear/logistic regression models, decision trees, neural networks, and even other ensemble methods such as random forest.
The base learners’ predictions on out-of-sample data are gathered as inputs, and the ground-truth values for those samples are used as the targets; the meta-learner then learns how to combine the base learners’ predictions to produce the best output.
Typically, K-fold cross-validation is used: in each iteration, the base learners’ predictions on the held-out fold, together with the corresponding ground-truth values, are collected to build the training data for the meta-learner. The original input features may also be given to the meta-learner as additional context.
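A minimal sketch of that construction, assuming scikit-learn and its cross_val_predict helper; the base learners, number of folds, and dataset here are all illustrative:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Illustrative dataset only.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

base_learners = [DecisionTreeClassifier(random_state=0), KNeighborsClassifier()]

# Out-of-fold predictions: every training sample is predicted by models fitted
# on the other folds, so the meta-features are honest (not memorised).
meta_features = np.column_stack([
    cross_val_predict(model, X, y, cv=5, method="predict_proba")[:, 1]
    for model in base_learners
])

# The meta-learner is trained on the base learners' out-of-fold predictions
# paired with the original ground-truth labels.
meta_learner = LogisticRegression().fit(meta_features, y)
```

At prediction time, the base learners are refit on the full training set and their predictions on new data are fed through the meta-learner.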
The meta-learner can be a simple model such as a linear or logistic regression model or a neural network.
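If you would rather not wire this up by hand, scikit-learn’s StackingClassifier performs the same K-fold construction internally; a brief sketch, where passthrough=True optionally forwards the raw input features to the meta-learner as described above:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Illustrative dataset only.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(random_state=0)),
                ("svc", SVC(probability=True))],
    final_estimator=LogisticRegression(),  # a simple meta-learner
    cv=5,               # K-fold scheme used to build the meta-learner's training data
    passthrough=True,   # also hand the original input features to the meta-learner
)
stack.fit(X_train, y_train)
print("stacked accuracy:", stack.score(X_test, y_test))
```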
Thanks for reading! Hope you’ve found this post helpful, and as always, pop in occasionally to this space to read more on machine learning.
References
- Applied Predictive Modeling by Max Kuhn and Kjell Johnson
- Stacking Ensemble Machine Learning by Jason Brownlee
- Ensemble methods: bagging, boosting and stacking by Joseph Rocca