We got an intuition of how the gradient boosting process helps in reducing the error with each iteration. To summarise, the gradient boosting algorithm has two steps at each iteration:
- Find the residual and fit a new model on the residual.
- Add the new model to the older model and continue the next iteration.
Let’s now see how this can be seen as a gradient descent problem.
Comprehension – Gradient Descent
You have learned that in the gradient descent algorithm, a function G(x) decreases fastest around a point a if one goes along the negative gradient of G(x). It follows the iterative process:
$$//a_n+1=a_{n-}\lambda\frac{dG\left(x\right)}{dx}//$$
Now if the function is G(x,y), we have the following formulation:
$$//x_{n+1}=x_{n-}\lambda\frac{\partial G(x,y)}{\partial x}\\\\y_{n+1}=y_{n-}\lambda\frac{\partial G(x,y)}{\partial y}//$$
Note that $$//\frac\partial{\partial x}//$$ refers to the partial derivative.
Keeping these in mind answer the following questions:
Let’s now look at the following video in which Prof. Raghavan explains how the gradient boosting algorithm can be seen as analogous to the gradient descent algorithm.
We see that the negative gradient of the square loss function gives the residue. So the strategy of closing in on the residues at each step can be seen as taking a step towards the negative gradient of the loss function. We can observe here that gradient descent helps in generalising the boosting process.
Please note that this explanation and the aforementioned process of closing in on the residues is only valid for square error loss function.
Additional Reading
- The following link has the list of cost functions that can be used while solving a problem using Gradient Descent.