Understanding Gradient Boosting – I

Gradient Boosting, like AdaBoost, trains many models in a gradual, additive, and sequential manner. The major difference between the two is how they identify and handle the shortcomings of weak learners. AdaBoost gives more say, or weight, to the data points that were misclassified or wrongly predicted by earlier learners. Gradient boosting achieves the same goal by using the gradients of the loss function.

In the next video, Anjali will explain the fundamentals of the Gradient Boosting Machine and how it works.

The loss function is a measure of how well the model fits the underlying data. It varies from one problem statement to another and depends on what we are trying to optimize. For a regression problem, the loss function would be the error between the true and predicted target values. For a classification problem, it would be a measure of how well our predictive model classifies each sample point in the dataset.
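As a quick illustration of the two cases above, here is a minimal sketch of a regression loss (mean squared error) and a classification loss (binary log loss); the toy values and variable names are our own, not from the lesson:

```python
import math

# Toy regression targets and model predictions (illustrative values only).
y_true = [3.0, 5.0, 7.0]
y_pred = [2.5, 5.5, 6.0]

# Regression loss: mean squared error between true and predicted values.
mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Classification loss: binary log loss over true labels and predicted
# probabilities; lower values mean the model classifies samples better.
labels = [1, 0, 1]
probs = [0.9, 0.2, 0.6]
log_loss = -sum(
    y * math.log(p) + (1 - y) * math.log(1 - p) for y, p in zip(labels, probs)
) / len(labels)
```

A smaller loss on either measure means a better fit, which is exactly the quantity gradient boosting tries to drive down step by step.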

Let’s now understand GBM with the help of a numerical example for a regression problem.

Note

At timestamp 5:56, step 6 should read: use all of the trees in the ensemble model to make the final prediction.

To summarize, here are the broader points on how a GBM learns:

  • Build the first weak learner using a sample from the training data; we will use a decision tree as the weak learner, or base model. It need not be a stump; we can grow a bigger tree, but it should still be weak, i.e., not fully grown.
  • Predictions are then made on the training data using the decision tree just built.
  • The gradients, which in our case are the residuals, are computed; these residuals become the new response or target values for the next weak learner.
  • A new weak learner is built with the residuals as the target values and a sample of observations from the original training data.
  • Add the predictions of the current weak learner to the predictions of all the previous weak learners. The predictions at each step are multiplied by the learning rate so that no single model makes a huge contribution to the ensemble, thereby avoiding overfitting. Essentially, with the addition of each weak learner, the model takes a very small step in the right direction.
  • The next weak learner fits on the residuals obtained so far, and these steps are repeated until a prespecified number of weak learners has been built or the model starts to overfit, i.e., it starts to capture niche patterns of the training data.
  • GBM makes the final prediction by simply adding up the predictions of all the weak learners (each multiplied by the learning rate).
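The steps above can be sketched as a small from-scratch GBM for regression with squared-error loss, using depth-1 "stump" trees as the weak learners. All function names and the toy data are our own illustration, not the exact procedure from the video:

```python
def fit_stump(x, residuals):
    """Fit a depth-1 tree: find the split on x that best fits the residuals."""
    best = None
    for threshold in x:
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        if not left or not right:
            continue
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        # Sum of squared errors if each side predicts its mean residual.
        sse = sum((r - left_mean) ** 2 for r in left) + \
              sum((r - right_mean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, t, lm, rm = best
    return lambda xi: lm if xi <= t else rm

def gbm_fit(x, y, n_trees=50, learning_rate=0.1):
    base = sum(y) / len(y)            # initial prediction: mean of the targets
    preds = [base] * len(y)
    trees = []
    for _ in range(n_trees):
        # For squared-error loss, the negative gradients are the residuals.
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        tree = fit_stump(x, residuals)   # next weak learner targets residuals
        trees.append(tree)
        # Each tree's contribution is shrunk by the learning rate, so the
        # ensemble takes a small step in the right direction at each round.
        preds = [p + learning_rate * tree(xi) for p, xi in zip(preds, x)]
    return base, trees

def gbm_predict(base, trees, xi, learning_rate=0.1):
    # Final prediction: base value plus the scaled sum of all weak learners.
    return base + learning_rate * sum(tree(xi) for tree in trees)

# Tiny usage example on y = 2x.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 6.0, 8.0, 10.0]
base, trees = gbm_fit(x, y)
```

After 50 rounds the summed stump predictions closely track the training targets, even though each individual stump on its own is a very weak model.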
