In other words, a model's bias is high when it does not perform well on the training data itself, and its variance is high when it does not perform well on the test data. Note that poor performance on the test data here means that the model's results on the test data vary a lot as the training data changes. This may happen because the model coefficients are not reliable.
You also saw that there is a trade-off between bias and variance with respect to model complexity. As seen in the video, a simple model would usually have high bias and low variance, whereas a complex model would have low bias and high variance. In either case, the total error would be high.
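For reference, this trade-off can be stated formally: for squared-error loss, the expected test error at a point decomposes into a squared-bias term, a variance term and an irreducible noise term (the notation below is the standard decomposition, not taken from the video):

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \big(\mathrm{Bias}[\hat{f}(x)]\big)^2
  + \mathrm{Var}\big[\hat{f}(x)\big]
  + \sigma^2
```

A simple model keeps the variance term small but inflates the bias term; a complex model does the reverse, which is why the total error is high at both extremes.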
What we need is the lowest total error, i.e., low bias and low variance, so that the model identifies all the patterns that it should and is also able to perform well on unseen data.
For this, we need to manage model complexity: it should neither be too high, which would lead to overfitting, nor too low, which would lead to a model with high bias (a biased model) that does not even identify the necessary patterns in the data.
Note
Another point to keep in mind is that the model coefficients obtained from an ordinary-least-squares (OLS) model can be quite unreliable if only a few of the predictors used to build the model are significantly related to the response variable.
So, what is regularisation and how does it help solve this problem?
Regularisation helps with managing model complexity by essentially shrinking the model coefficient estimates towards 0. This discourages the model from becoming too complex, thus avoiding the risk of overfitting.
Let’s try and understand this a bit better now.
We know that when building an OLS model, we want to estimate the coefficients for which the cost/loss, i.e., the residual sum of squares (RSS), is minimum. Optimising this cost function results in model coefficients with the least possible bias, but the model may overfit and hence have high variance.
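As a quick reminder, for n observations and p predictors, the RSS that OLS minimises can be written as:

```latex
\mathrm{RSS} = \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2
```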
In case of overfitting, we know that we need to manage the model's complexity, primarily by taking care of the magnitudes of the coefficients. The more extreme the coefficient values are (large positive or negative values), the more complex the model is and, hence, the higher the chances of overfitting. Let's try and understand this in the forthcoming video.
When we use regularisation, we add a penalty term to the model’s cost function.
Here, the cost function would be Cost = RSS + Penalty.
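Written a little more explicitly (with λ ≥ 0 denoting the tuning parameter that controls the strength of the penalty; the exact form of the penalty term depends on the technique and is discussed in the next segments):

```latex
\mathrm{Cost} = \mathrm{RSS} + \lambda \cdot \mathrm{Penalty}(\beta_1, \ldots, \beta_p), \qquad \lambda \ge 0
```

Setting λ = 0 recovers plain OLS.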
Adding this penalty term to the cost function helps suppress or shrink the magnitudes of the model coefficients towards 0. This discourages the creation of an overly complex model, thereby reducing the risk of overfitting.
When we add this penalty and estimate the model parameters that optimise this updated cost function (RSS + Penalty), the coefficients we obtain for the given training data may not be the best possible, i.e., they may be somewhat more biased. However, with this minor compromise in terms of bias, the variance of the model may see a marked reduction. Essentially, with regularisation, we accept a little bias in exchange for a significant reduction in variance.
As we saw in the video, regularisation has a smoothing effect on the model fit; in other words, when regularisation is used, the fitted curve smooths out and the fit is close to what we want it to be.
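If you want to see this smoothing effect in code, here is a minimal sketch (not from the video; it uses scikit-learn and a made-up noisy sine dataset) that fits the same degree-15 polynomial with plain OLS and with ridge regularisation and compares the coefficient magnitudes:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression, Ridge

# Hypothetical toy data: a smooth sine-shaped pattern plus noise
rng = np.random.RandomState(42)
X = np.sort(rng.uniform(0, 1, 20)).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(scale=0.2, size=20)

# The same degree-15 polynomial basis, with and without a ridge penalty
ols_fit = make_pipeline(PolynomialFeatures(degree=15), LinearRegression())
ridge_fit = make_pipeline(PolynomialFeatures(degree=15), Ridge(alpha=0.01))
ols_fit.fit(X, y)
ridge_fit.fit(X, y)

# Regularisation shrinks the coefficients, which is what smooths the curve
print("max |coef| without penalty:", np.abs(ols_fit[-1].coef_).max().round(2))
print("max |coef| with penalty:   ", np.abs(ridge_fit[-1].coef_).max().round(2))
```

The exact numbers depend on the random seed, but the ridge coefficients are typically far smaller in magnitude, which is what produces the smoother, better-behaved fitted curve.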
Note that we also need to remember two points about the model coefficients that we obtain from OLS:
- These coefficients can be highly unstable. This can happen when only a few of the predictors considered while building the model are significantly related to the response variable and the rest are not very helpful, behaving essentially as random noise.
- These unrelated predictors can introduce large variability in the model coefficients, such that even a small change in the training data may lead to a large change in the coefficient estimates. Such model coefficients are no longer reliable, since we may get different coefficient values each time we retrain the model.
Multicollinearity, i.e., the presence of highly correlated predictors, may be another reason for the variability of model coefficients. Regularisation helps here as well.
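The sketch below (again a hypothetical illustration, not part of the segment) makes both points concrete: it builds a dataset with five highly correlated predictors, only one of which truly drives the response, refits OLS and ridge regression on bootstrap resamples of the training data, and compares how much the coefficients vary:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.RandomState(0)
n, n_resamples = 100, 50

# One true signal plus four near-copies of it (multicollinearity);
# only the underlying signal actually drives the response y.
signal = rng.normal(size=n)
X = np.column_stack([signal + rng.normal(scale=0.05, size=n) for _ in range(5)])
y = 3.0 * signal + rng.normal(scale=1.0, size=n)

ols_coefs, ridge_coefs = [], []
for _ in range(n_resamples):
    idx = rng.choice(n, size=n, replace=True)  # bootstrap resample of the training data
    ols_coefs.append(LinearRegression().fit(X[idx], y[idx]).coef_)
    ridge_coefs.append(Ridge(alpha=1.0).fit(X[idx], y[idx]).coef_)

# Standard deviation of each coefficient across resamples:
# a proxy for how unstable the estimates are as the training data changes
print("coef std (OLS):  ", np.std(ols_coefs, axis=0).round(2))
print("coef std (ridge):", np.std(ridge_coefs, axis=0).round(2))
```

The ridge coefficients typically vary far less across resamples, which is exactly the stability that regularisation buys us.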
So, to summarise, we use regularisation because we want our models to work well with unseen data, without missing out on identifying the underlying patterns in the data. For this, we are willing to make a compromise by allowing a little bias for a significant reduction in variance. We also understood that the more extreme the values of the model coefficients are, the higher the chances of model overfitting. Regularisation prevents this by shrinking the coefficients towards 0. In the next two segments, we will discuss the two regularisation techniques, Ridge and Lasso, and understand how the penalty term helps with the shrinkage.