Introduction to AdaBoost

Now that we have a general understanding of what boosting is, let’s start with the original boosting algorithm: AdaBoost.

AdaBoost stands for Adaptive Boosting and was developed by Schapire and Freund, who later won the 2003 Gödel Prize for their work.

In the next video, Anjali will give you an overview of Adaboost and why we use it.

Before working through a numerical example, let us get an overview of the steps AdaBoost takes:

AdaBoost starts with a uniform distribution of weights over the training examples, i.e. it gives equal weight to every observation. These weights indicate the importance of each data point.
We start with a single weak learner to make the initial predictions.
Once the initial predictions are made, the patterns that the previous weak learner failed to capture are handled by the next weak learner, which gives more weight to the misclassified data points.
Apart from weighting each observation, the model also assigns a weight to each weak learner. The more errors a weak learner makes, the less weight it is given. This matters when the ensemble makes its final predictions.
After computing these two sets of weights, one for the observations and one for the individual weak learners, the next model (weak learner) in the sequence trains on the resampled data (data sampled according to the observation weights) to make the next prediction.
The model iteratively repeats the steps above for a pre-specified number of weak learners.
In the end, we take a weighted sum of the predictions from all these weak learners to get an overall strong learner.

A strong learner is formed by combining multiple weak learners, each trained on the mistakes of the previous one.
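The steps above can be sketched end to end in a few lines of Python. This is a minimal illustration, not a reference implementation: the toy data, the 1-D threshold stump, and the names train_stump, adaboost_fit and adaboost_predict are all assumptions made for the example. To keep it deterministic, the sketch passes the observation weights directly to the stump instead of resampling the data, which is a common variant of the resampling step described above.

```python
import math

def stump_predict(x, threshold, direction):
    """Predict `direction` for x <= threshold and the opposite class otherwise."""
    return direction if x <= threshold else -direction

def train_stump(xs, ys, weights):
    """Pick the threshold and direction with the lowest weighted error on 1-D data."""
    best = None
    for threshold in sorted(set(xs)):
        for direction in (1, -1):
            error = sum(
                w for w, x, y in zip(weights, xs, ys)
                if stump_predict(x, threshold, direction) != y
            )
            if best is None or error < best[0]:
                best = (error, threshold, direction)
    return best

def adaboost_fit(xs, ys, n_learners=3, eps=1e-10):
    n = len(xs)
    weights = [1.0 / n] * n  # equal weight for every observation
    ensemble = []
    for _ in range(n_learners):  # pre-specified number of weak learners
        error, threshold, direction = train_stump(xs, ys, weights)
        error = min(max(error, eps), 1 - eps)  # keep the logarithm finite
        alpha = 0.5 * math.log((1 - error) / error)  # say of this weak learner
        ensemble.append((alpha, threshold, direction))
        # y * prediction is -1 for a mistake and +1 for a correct answer, so
        # mistakes are scaled by e^alpha and correct answers by e^(-alpha).
        scaled = [
            w * math.exp(-alpha * y * stump_predict(x, threshold, direction))
            for w, x, y in zip(weights, xs, ys)
        ]
        total = sum(scaled)
        weights = [w / total for w in scaled]  # normalize so the weights sum to 1
    return ensemble

def adaboost_predict(ensemble, x):
    """Sign of the weighted sum of the weak learners' predictions."""
    score = sum(a * stump_predict(x, t, d) for a, t, d in ensemble)
    return 1 if score >= 0 else -1

# Hypothetical toy data: no single threshold separates the two classes.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
model = adaboost_fit(xs, ys, n_learners=3)
print([adaboost_predict(model, x) for x in xs])
```

On this toy data, each individual stump misclassifies at least three points, yet the weighted combination of three stumps classifies every point correctly.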

Now that you have an intuitive understanding of AdaBoost, let’s take a look at the inner workings of the algorithm with the help of a numerical example.

To summarize, here are the major takeaways from this video:

In AdaBoost, we start with a base model with equal weights given to every observation. In the next step, the observations which are incorrectly classified are given a higher weight so that when a new weak learner is trained, it pays more attention to these misclassified observations.

In the end, we get a series of models that each have a different say according to the predictions they have made. If a model performs poorly and makes many incorrect predictions, it is given less significance, whereas if it performs well and makes correct predictions most of the time, it is given more significance in the overall model.

The say/importance each weak learner, in our case the decision tree stump, has in the final classification depends on the total error it made.

α = 0.5 ln( (1 − Total error)/Total error )

The error rate lies between 0 and 1, so let’s see how α and the error are related.

When the base model has a low overall error then, as you can see in the plot above, α is a large positive value, which means that the weak learner will have a high say in the final model.
If the error is 0.5, the weak learner is no better than a random guess, so α = 0, i.e. the weak learner will have no say or significance in the final model.
If the model produces large errors (i.e. close to 1), then α is a large negative value, meaning that its predictions are incorrect most of the time. Such a weak learner has a negative say: in the final model, its vote counts towards the opposite of what it predicts.
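To make the relationship between the error and α concrete, here is a small sketch that evaluates the formula at a few error rates. The function name learner_say and the clipping constant eps are assumptions for this example; the clipping only keeps the logarithm finite when the error is exactly 0 or 1.

```python
import math

def learner_say(total_error, eps=1e-10):
    """Say (alpha) of a weak learner, given its total error."""
    # Clip the error away from 0 and 1 so the logarithm stays finite.
    # The value of eps is an assumption of this sketch.
    err = min(max(total_error, eps), 1 - eps)
    return 0.5 * math.log((1 - err) / err)

for err in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(f"error = {err:.1f} -> alpha = {learner_say(err):+.3f}")
```

The printed values are symmetric around an error of 0.5: an error of 0.1 and an error of 0.9 give the same magnitude of α with opposite signs.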

After calculating the say/importance of each weak learner, we find the new weights of each observation present in the training dataset. The following formulas are used to compute the new weight for each observation:

new sample weight for the incorrectly classified = original sample weight * e^(α)
new sample weight for the correctly classified = original sample weight * e^(−α)

After calculating the new weights, we need to normalize them so that they sum to 1, using the following formula:

Normalized weight = p(x_i) / ∑_j p(x_j)

The samples which the previous stump incorrectly classified will be given higher weights and the ones which the previous stump classified correctly will be given lower weights.
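The weight update and normalization can be checked with a short worked example. The five observations and the single misclassification below are made up for illustration, and update_weights is a hypothetical helper name.

```python
import math

def update_weights(weights, misclassified, alpha):
    """Scale each weight by e^alpha (wrong) or e^(-alpha) (right), then normalize."""
    scaled = [
        w * math.exp(alpha if wrong else -alpha)
        for w, wrong in zip(weights, misclassified)
    ]
    total = sum(scaled)
    return [w / total for w in scaled]

# Hypothetical example: 5 equally weighted observations, only the last one misclassified.
weights = [0.2] * 5
misclassified = [False, False, False, False, True]
total_error = sum(w for w, wrong in zip(weights, misclassified) if wrong)
alpha = 0.5 * math.log((1 - total_error) / total_error)
new_weights = update_weights(weights, misclassified, alpha)
print([round(w, 3) for w in new_weights])
```

With these numbers the misclassified observation ends up with a normalized weight of about 0.5, while each correctly classified observation drops to about 0.125.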

Next, a new stump is trained on a dataset created by randomly sampling the observations according to their weights. Because of these weights, the new dataset tends to contain multiple copies of the observations that were misclassified by the previous tree and may not contain all the observations that were classified correctly. This helps the next weak learner give more importance to the incorrectly classified samples so that it can correct the mistakes. This process is repeated until a pre-specified number of trees has been built, i.e. until the ensemble is complete.
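The resampling step can be sketched with the standard library's weighted sampling. The observation labels, the weights, and the fixed seed are assumptions for this example; resample is a hypothetical helper name.

```python
import random

def resample(observations, weights, seed=0):
    """Draw a dataset of the same size, with replacement, in proportion to the weights."""
    rng = random.Random(seed)  # fixed seed (an assumption) so the sketch is repeatable
    return rng.choices(observations, weights=weights, k=len(observations))

observations = ["a", "b", "c", "d", "e"]
weights = [0.125, 0.125, 0.125, 0.125, 0.5]  # "e" was misclassified by the previous stump
sample = resample(observations, weights)
print(sample)
```

Because sampling is done with replacement, a heavily weighted observation such as "e" is likely to appear several times, while some lightly weighted observations may not appear at all.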

The AdaBoost model makes predictions by having each tree in the ensemble classify the sample. Then, we split the trees into groups according to their decisions. For each group, we add up the significance of every tree inside the group. The final prediction of the ensemble is the class of the group with the largest total significance; with the two classes coded as +1 and −1, this is the sign of the weighted sum of the trees' predictions.
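As a minimal sketch of this voting step, assume the two classes are coded as +1 and −1. The votes and say values below are made up for illustration, and ensemble_predict is a hypothetical helper name.

```python
def ensemble_predict(votes, says):
    """votes: each weak learner's prediction (+1 or -1); says: the matching alpha values."""
    weighted_sum = sum(alpha * vote for vote, alpha in zip(votes, says))
    return 1 if weighted_sum >= 0 else -1  # a tie goes to +1, an arbitrary choice here

# Hypothetical ensemble of three stumps classifying one sample.
votes = [+1, -1, -1]
says = [1.10, 0.42, 0.35]
print(ensemble_predict(votes, says))
```

Here two of the three stumps vote for −1, but the single stump voting +1 has a larger say than the other two combined, so the ensemble predicts +1.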

NOTE:

The term weight is used in two different contexts: one is the weight of each observation, and the other is the importance/say given to each weak learner in the model.
