So far, you have learnt that we need models that can identify general patterns in data and that they should work well with unseen data. In this segment, you will learn how Ridge regression helps us achieve this. So, let’s go ahead and learn more in the next video.

In OLS, we obtain the best coefficients by minimising the residual sum of squares (RSS). Ridge regression also estimates the model coefficients, but by minimising a different cost function, one that adds a penalty term to the RSS. As shown in the video, the penalty term is lambda multiplied by the sum of the squared model coefficients. This penalty, also called the shrinkage penalty, is small only when the coefficients are small, i.e., close to 0. Hence, since fitting a Ridge regression model means finding the coefficients that minimise the entire cost (RSS plus the penalty), it has the effect of shrinking the model coefficients, i.e., the betas, towards 0.
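To make the cost function concrete, here is a minimal NumPy sketch of it: RSS plus lambda times the sum of squared coefficients. The data and function name are illustrative, not from the segment.

```python
import numpy as np

def ridge_cost(X, y, beta, lam):
    """RSS plus the shrinkage penalty: lambda * sum of squared coefficients."""
    residuals = y - X @ beta
    rss = np.sum(residuals ** 2)
    penalty = lam * np.sum(beta ** 2)
    return rss + penalty

# Toy data: y is roughly 2x, so beta = [2.0] gives a small RSS
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.1, 3.9, 6.0])
beta = np.array([2.0])

print(ridge_cost(X, y, beta, lam=0.0))  # with lambda = 0, this is just the RSS
print(ridge_cost(X, y, beta, lam=1.0))  # same RSS plus the penalty 1.0 * 2.0**2
```

Notice that the penalty grows with the squared magnitude of beta, which is why minimising this cost pulls the coefficients towards 0.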

Now, what is the role of **lambda** here? If lambda is 0, then the cost function does not contain the penalty term and there is no shrinkage of the model coefficients; they would be the same as those from OLS. However, as lambda moves towards higher values, the shrinkage penalty increases, pushing the coefficients further towards 0, which may lead to underfitting. Choosing an appropriate lambda therefore becomes crucial: if it is too small, we will not solve the problem of overfitting, and with too large a lambda, we may actually end up underfitting.

Another point to note is that in OLS, we will get only one set of model coefficients when the RSS is minimised. However, in Ridge regression, for each value of lambda, we will get a different set of model coefficients. Let’s try and understand this in the next video.
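We can see this in code: fitting Ridge at several values of lambda on the same data yields a different coefficient vector each time, with the coefficients shrinking as lambda grows. A sketch using scikit-learn, where the lambda parameter is called `alpha` and the data is simulated for illustration:

```python
import numpy as np
from sklearn.linear_model import Ridge

# Simulated data with known true coefficients [3, -2, 1]
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=0.5, size=50)

# One set of coefficients per value of lambda (alpha in scikit-learn)
for alpha in [0.01, 1.0, 100.0]:
    model = Ridge(alpha=alpha).fit(X, y)
    print(alpha, np.round(model.coef_, 3))
```

For the smallest alpha, the coefficients are close to the OLS estimates; as alpha increases, their magnitudes shrink towards 0.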

Visually, we can see that when lambda is 0, i.e., when there is no regularisation, we have a model that is clearly overfitting. The blue line indicates the ideal fit and the red line indicates the way our model fits the data. For a small value of lambda, 0.1, the fitted model comes quite close to the actual data, which is what we want. However, as the lambda value increases further, we notice that the model starts underfitting. We want models that do not overfit the data but can still identify the underlying patterns in it. Hence, an appropriate choice of lambda becomes crucial. This can be achieved through hyperparameter tuning. We also observed that for each value of lambda, we get a different set of coefficients in Ridge regression.
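The hyperparameter tuning mentioned above is typically done with cross-validation: try a grid of lambda values and pick the one with the best held-out performance. A sketch using scikit-learn's `RidgeCV` (the grid and simulated data are illustrative):

```python
import numpy as np
from sklearn.linear_model import RidgeCV

# Simulated data; two of the five true coefficients are 0
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.5, 0.0, -2.0, 0.5, 0.0]) + rng.normal(scale=1.0, size=100)

# Search a grid of lambda values (called alpha in scikit-learn)
# using 5-fold cross-validation
alphas = np.logspace(-3, 3, 13)
model = RidgeCV(alphas=alphas, cv=5).fit(X, y)
print("best lambda:", model.alpha_)
```

The chosen value balances the two failure modes discussed above: too small leaves overfitting unaddressed, too large underfits.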

One point to keep in mind is that we need to **standardise** the data whenever working with Ridge regression. We have seen that regularisation puts a constraint on the magnitude of the model coefficients, and the penalty term depends on the magnitude of each coefficient. Since coefficient magnitudes depend on the scales of the variables, this makes it necessary to centre or standardise the variables. Centering the variables means that the intercept term will no longer be in the model. Please refer to this link for an example.
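In practice, standardisation can be done inside a pipeline so that the scaling is fitted on the training data and applied consistently. A sketch with scikit-learn, where the wildly different feature scales are simulated to show why scaling matters:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

# Features on very different scales: without standardisation, the
# penalty would punish the second feature's coefficient unfairly
rng = np.random.default_rng(1)
X = np.column_stack([rng.normal(size=80),
                     rng.normal(scale=1000.0, size=80)])
y = 2.0 * X[:, 0] + 0.001 * X[:, 1] + rng.normal(scale=0.1, size=80)

# Standardise first, then fit Ridge on the scaled features
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0)).fit(X, y)
print("R^2 on training data:", model.score(X, y))
```

After scaling, both features contribute comparably, so the penalty shrinks their coefficients on an equal footing.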

So, to summarise:

- Ridge regression has a particular advantage over OLS when the OLS estimates have high variance, i.e., when they overfit. Regularisation can significantly reduce model variance while not increasing bias much.

- The tuning parameter lambda helps us determine how much we wish to regularise the model. The higher the value of lambda, the smaller the model coefficients, and the stronger the regularisation.

- Choosing the right lambda is crucial so as to reduce only the variance in the model, without compromising much on identifying the underlying patterns, i.e., the bias.

- It is important to standardise the data when working with Ridge regression.

Ridge regression does have one obvious disadvantage: it includes all the predictors in the final model. This may not affect the accuracy of the predictions, but it can make model interpretation challenging when the number of predictors is very large. Let’s see how Lasso regression helps in addressing this in the next couple of segments.

In the next segment, we will look at the implementation of Ridge Regression.