Note
The content covered in this segment is optional and will also be covered in a live session.
In this segment, we will build a visual understanding of how ridge and lasso regression estimate their beta coefficients. Before we move forward, it is recommended that you watch this video to get an understanding of contours. A plot of RSS as a function of the two coefficients, β0 and β1, is a paraboloid, i.e., its shape would be as shown in the figure given below.
So, as you saw in the Khan Academy video, imagine a number of planes passing through different points of the plot, each perpendicular to the RSS axis.
The red dot gives those values of β0 and β1 that yield the least RSS. We will refer to this pair as β̂. These are the coefficients that we obtain when we use OLS alone, without any regularisation.
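To make this concrete, here is a minimal sketch on a made-up one-feature dataset (all names and values here are illustrative, not from the original material). It computes β̂ by OLS and confirms that any other (β0, β1) pair yields a larger RSS, which is exactly what the bowl shape of the paraboloid says.

```python
import numpy as np

# Toy one-feature dataset (illustrative values only)
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
y = 2.0 + 3.0 * x + rng.normal(0, 2.0, 50)  # true intercept 2, slope 3

def rss(b0, b1):
    """Residual sum of squares for candidate coefficients (b0, b1)."""
    return float(np.sum((y - b0 - b1 * x) ** 2))

# OLS estimates (beta-hat): the bottom of the RSS paraboloid
X = np.column_stack([np.ones_like(x), x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

print("beta-hat:", beta_hat)
print("RSS at beta-hat:", rss(*beta_hat))
print("RSS slightly away from beta-hat:", rss(beta_hat[0] + 0.5, beta_hat[1]))
```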
Then we flatten this three-dimensional figure into two dimensions (imagine looking at the image above from directly overhead). We get contours, as shown in the figure given below.
In the figure above, each ellipse is centred around β̂ (indicated by the red dot), and all of the points on any given ellipse share the same value of RSS.
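If you would like to reproduce such a contour plot yourself, a sketch along the following lines works, continuing the toy dataset above (the grid ranges and the number of levels are arbitrary choices):

```python
import matplotlib.pyplot as plt

# Evaluate RSS over a grid of (beta0, beta1) candidates
b0_grid, b1_grid = np.meshgrid(np.linspace(-2, 6, 200), np.linspace(1, 5, 200))
rss_grid = ((y[:, None, None] - b0_grid - b1_grid * x[:, None, None]) ** 2).sum(axis=0)

plt.contour(b0_grid, b1_grid, rss_grid, levels=20)  # the ellipses of equal RSS
plt.plot(beta_hat[0], beta_hat[1], "ro")            # red dot: beta-hat
plt.xlabel("beta0")
plt.ylabel("beta1")
plt.show()
```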
With OLS, the red dot at the centre of the contours gives the values of the model coefficients, β0 and β1, for which RSS is minimum. Now, what happens when we add a penalty, in the form of a constraint, to this cost function? Let’s take a look at the figures below to find out.
Let us first understand what this figure represents. The model coefficients that we obtain using OLS alone are given by β̂, i.e., the values of the coefficients that may have led to the model overfitting. To handle the overfitting, we perform regularisation, which adds a penalty term to the cost function. We have seen that this penalty term is λ times the sum of the squares of the coefficients for Ridge regression, and λ times the sum of the absolute values of the coefficients for Lasso. This penalty term regularises, or shrinks, the model coefficient estimates by adding constraints.
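In code, the two penalised cost functions look as follows, continuing the sketch above (`lam` plays the role of λ; note that real libraries usually leave the intercept unpenalised, but here both coefficients are penalised to match the two-coefficient figures):

```python
def ridge_cost(b0, b1, lam):
    # RSS + lambda * (sum of squared coefficients)
    return rss(b0, b1) + lam * (b0 ** 2 + b1 ** 2)

def lasso_cost(b0, b1, lam):
    # RSS + lambda * (sum of absolute coefficients)
    return rss(b0, b1) + lam * (abs(b0) + abs(b1))

# With lam = 0 both costs reduce to plain RSS, i.e., OLS
print(ridge_cost(*beta_hat, lam=0.0) == rss(*beta_hat))  # True
```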
Referring to the image above: in two dimensions, the constraint region for Ridge regression is the circle shown in the figure, and for Lasso, it is the diamond. The best coefficients for Ridge regression must lie in the region where the RSS contours and the constraint region overlap, i.e., we want the coefficient values for which RSS is minimum among all the values lying within the constraint region.
We get these where an ellipse just touches the circle, i.e., the constraint region for Ridge regression, since RSS keeps increasing as we move farther away from β̂. The point where the two regions touch gives us the coefficient values that produce the lowest RSS allowed by the constraint. The same reasoning applies to Lasso.
To understand this further, let us consider only two model coefficients: β0 and β1. For Ridge regression, the constraint is given by β0² + β1² ≤ c², represented by the circle in the figure above, with c being its radius. Similarly, for Lasso, the constraint is given by |β0| + |β1| ≤ c, denoted by the diamond, with c being the distance from the origin to each of its corners.
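These two regions are easy to express in code (an illustrative helper; c is the budget you choose):

```python
def in_ridge_region(b0, b1, c):
    return b0 ** 2 + b1 ** 2 <= c ** 2  # inside or on the circle of radius c

def in_lasso_region(b0, b1, c):
    return abs(b0) + abs(b1) <= c       # inside or on the diamond

# The diamond sits inside the circle: (1, 2) is in the circle but not the diamond
print(in_ridge_region(1.0, 2.0, c=2.5))  # True
print(in_lasso_region(1.0, 2.0, c=2.5))  # False
```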
For OLS (with no constraint, only the contours for RSS), β̂ gives the model coefficients that result in the least cost. However, since we have added a constraint to Ridge regression through the penalty, its model coefficients are the coefficient estimates with the smallest cost among those present within the circle.
Similarly, the model coefficients for Lasso regression are the estimates with the smallest cost among those lying within the diamond. In other words, the model coefficients must lie in the region where the RSS contours and the constraint region overlap. The best model coefficients given the constraint are indicated by the red dot, where a contour just touches the constraint region. This is because the farther we move away from the OLS model coefficients, the greater the increase in RSS, and our aim is to get the least RSS given the constraint. Hence, the red dot gives us the best model coefficients, whether for Ridge or Lasso.
In other words, as we move towards the outer ellipses, we move away from the OLS coefficient estimates, and the value of RSS increases. The Ridge regression coefficient estimates are given by the first point at which an ellipse touches the circle, and the Lasso regression coefficient estimates by the first point at which an ellipse touches the diamond.
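You can verify this picture numerically with a brute-force search over a grid of (β0, β1) pairs, keeping only the pairs inside each constraint region. This is a sketch under the same toy setup as above; the budget c = 2.5 is an arbitrary choice, small enough that β̂ lies outside both regions:

```python
# Brute-force: minimise RSS over a grid, subject to each constraint region
c = 2.5
B0, B1 = np.meshgrid(np.linspace(-4, 4, 401), np.linspace(-4, 4, 401))
RSS = ((y[:, None, None] - B0 - B1 * x[:, None, None]) ** 2).sum(axis=0)

for name, inside in [("ridge", B0 ** 2 + B1 ** 2 <= c ** 2),
                     ("lasso", np.abs(B0) + np.abs(B1) <= c)]:
    masked = np.where(inside, RSS, np.inf)  # exclude points outside the region
    i, j = np.unravel_index(np.argmin(masked), masked.shape)
    print(f"{name} constrained estimates: beta0={B0[i, j]:.2f}, beta1={B1[i, j]:.2f}")
```

On data like this, both solutions land on the boundary of their region, and the lasso solution typically sits at a corner, with β0 coming out exactly 0, which previews the feature-selection discussion below.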
Why does Lasso perform feature selection?
Observing the plots above, we see that a model coefficient can become 0 only if an ellipse touches the constraint region on the β0 or the β1 axis (the axes are exactly where one of the coefficients equals 0). Since the Ridge regression constraint is circular, without any sharp points, an ellipse will generally not touch the circular constraint region exactly on an axis.
Hence, the Ridge coefficients can become very small but will not become exactly 0. In the case of Lasso regression, since the diamond constraint has a corner on each axis, an ellipse will often touch the constraint region at one of its corners, resulting in the corresponding coefficient becoming 0. In higher dimensions, the constraint region has many more corners and edges, so a larger number of coefficients can become 0 at the same time.
This is the reason Lasso regression can perform feature selection.
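The same behaviour shows up with library implementations. Here is a small sketch using scikit-learn on a fresh made-up dataset (the data and the `alpha` values, scikit-learn's name for λ, are arbitrary choices):

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

# Toy data: 5 features, but only the first two actually matter
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=100)

ridge = Ridge(alpha=10.0).fit(X, y)
lasso = Lasso(alpha=0.5).fit(X, y)

print("ridge coefficients:", np.round(ridge.coef_, 3))  # all non-zero, irrelevant ones small
print("lasso coefficients:", np.round(lasso.coef_, 3))  # irrelevant ones typically exactly 0
```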
Please note that for very large values of c (i.e., smaller λ), where c is the radius of the circle, the constraint region would contain the centre of the ellipses, i.e., β̂, and we would simply get back the OLS model coefficients. What this means is that, for a large enough c, even Ridge regression yields the OLS coefficient estimates. We can extend this reasoning to Lasso regression as well.
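This limiting case is easy to check as well, continuing the scikit-learn sketch above: with a negligible penalty (a very small `alpha`, i.e., a very large budget c), Ridge reproduces the OLS estimates.

```python
from sklearn.linear_model import LinearRegression

ols = LinearRegression().fit(X, y)
ridge_tiny = Ridge(alpha=1e-8).fit(X, y)  # almost no penalty

# The constraint region now contains beta-hat, so ridge matches OLS
# (up to numerical precision)
print(np.allclose(ols.coef_, ridge_tiny.coef_, atol=1e-4))  # True
```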