
Bias-Variance Tradeoff

So far, we have discussed the pros and cons of simple and complex models. On the one hand, simple models are generalisable and robust; on the other hand, some problems are inherently complex in nature. There is a trade-off between the two, which is known as the bias-variance tradeoff in machine learning. You will learn about this topic in more detail in the sessions to come. Let us listen to Prof. Raghavan as he introduces this topic to you.

Bias and Variance

We considered the example of a model memorising the entire training dataset. If you change the dataset a little, this model will need to change drastically. The model is, therefore, unstable and sensitive to changes in training data, and this is called “high variance”.

The ‘variance’ of a model is the variability in its output on test data with respect to changes in the training data. In other words, variance here refers to how much the model itself changes when the training data changes.
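The following is a minimal sketch (not from the course material) that makes this concrete: the same model class is trained on two slightly different samples of the same data, and its predictions on a fixed set of points are compared. The data and model here are purely illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=200)

X_fixed = np.linspace(0, 10, 50).reshape(-1, 1)   # fixed points to predict on

predictions = []
for seed in (1, 2):
    # A bootstrap sample: a small change in the training data
    idx = np.random.default_rng(seed).integers(0, 200, size=200)
    tree = DecisionTreeRegressor(random_state=0)  # unconstrained depth -> memorises data
    tree.fit(X[idx], y[idx])
    predictions.append(tree.predict(X_fixed))

# A large average gap between the two prediction sets signals high variance:
# the model changes drastically when the training data changes only slightly.
print("Mean prediction gap:", np.mean(np.abs(predictions[0] - predictions[1])))
```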

Bias quantifies how far off the model is likely to be on future (test) data because of the simplifying assumptions it makes. Extremely simple models are likely to fail in predicting complex real-world phenomena; simplicity has its own disadvantages.

Imagine solving digital image processing problems using simple linear regression when much more complex models like neural networks are typically successful in these problems. We say that the linear model has a high bias since it is way too simple to be able to learn the complexity involved in the task.
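As a small illustration (again, not the course's own example), the sketch below fits a straight line to data generated by a clearly non-linear function. The poor score on the training data itself is the signature of high bias.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(2 * X).ravel() + rng.normal(scale=0.1, size=300)   # non-linear ground truth

linear = LinearRegression().fit(X, y)
# Even on the data it was trained on, the R^2 score is low: the model is too
# simple to capture the pattern, i.e. it underfits and has high bias.
print("Training R^2:", round(linear.score(X, y), 3))
```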

The very first model in the house price prediction exercise, built on the single feature ‘area’, is also indicative of high bias: it scores only 28% on the training set and fails to capture the underlying patterns in the data. Thus, it is an underfitting model with high bias.

In the heart disease prediction model, again, the decision tree obtained after controlling the depth of the tree, which gave a train accuracy of 74% and a test accuracy of 60%, is an example of a high-variance model, as there is a significant gap between the train and test accuracies. This means that a change in the data resulted in a large difference in the model’s performance, giving rise to high variability and model instability.
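The sketch below shows how such a train/test gap is typically measured. A synthetic dataset stands in for the heart disease data, which is not reproduced here; the 74%/60% figures above come from the course exercise, not from this code.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Synthetic stand-in for the heart disease dataset
X, y = make_classification(n_samples=500, n_features=13, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

tree = DecisionTreeClassifier(max_depth=5, random_state=42)   # depth is controlled
tree.fit(X_train, y_train)

train_acc = accuracy_score(y_train, tree.predict(X_train))
test_acc = accuracy_score(y_test, tree.predict(X_test))
# A large gap between the two numbers indicates high variance.
print(f"Train accuracy: {train_acc:.2f}, Test accuracy: {test_acc:.2f}")
```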

In the later parts of the module, you will work on a bank marketing prediction model. You will try building different classification models on the same dataset to come up with the best model that suits your business requirements. You will see that the logistic regression model there shows a train accuracy of nearly 71% and a test accuracy of 69%. This means that the model performs almost identically on the train and test sets. Hence, this is an example of a low-variance model, one whose performance does not vary much when new data is introduced to it.

There, you will also see that the random forest model shows 99% training accuracy, which is very high, indicating that it is a low-bias model that has most likely overfit the training data and is therefore unlikely to perform as well on unseen data.
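A minimal sketch of these two contrasting behaviours is given below, again on a synthetic stand-in for the bank marketing dataset; the 71%, 69% and 99% figures come from the course exercise, not from this code.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the bank marketing dataset
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

for name, model in [("Logistic regression", LogisticRegression(max_iter=1000)),
                    ("Random forest", RandomForestClassifier(random_state=0))]:
    model.fit(X_train, y_train)
    print(f"{name}: train={model.score(X_train, y_train):.2f}, "
          f"test={model.score(X_test, y_test):.2f}")

# Logistic regression typically shows similar train and test scores (low variance),
# while an unconstrained random forest can score close to 100% on the training set,
# a warning sign of overfitting.
```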

A model that suffers from underfitting is usually a high-bias, low-variance model, whereas a model that suffers from overfitting is usually a low-bias, high-variance model due to its increased complexity.

In an ideal case, we want to reduce both the bias and the variance, because the expected total error of a model is the sum of the squared bias, the variance and the irreducible noise, as shown in the figure below.
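For a regression model trained with squared error, this decomposition is commonly written as follows (this is the standard textbook formulation, not something specific to the course datasets); here $\hat{f}(x)$ denotes the model's prediction and $\sigma^2$ the irreducible noise that no model can remove:

\[
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
= \underbrace{\mathrm{Bias}\big[\hat{f}(x)\big]^2}_{\text{model too simple}}
+ \underbrace{\mathrm{Var}\big[\hat{f}(x)\big]}_{\text{model too unstable}}
+ \underbrace{\sigma^2}_{\text{irreducible noise}}
\]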

In practice, however, we often cannot have a model with both low bias and low variance. As the model complexity goes up, the bias reduces while the variance increases; hence the trade-off, as the sketch below illustrates.
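In the sketch, tree depth acts as a proxy for model complexity: training accuracy keeps improving as the tree grows, while test accuracy improves at first and then stops improving or degrades. The data is synthetic and purely illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Label noise (flip_y) ensures a fully grown tree will overfit
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           flip_y=0.1, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

for depth in (1, 3, 5, 10, None):   # None = fully grown tree
    tree = DecisionTreeClassifier(max_depth=depth, random_state=1).fit(X_train, y_train)
    print(f"max_depth={depth}: train={tree.score(X_train, y_train):.2f}, "
          f"test={tree.score(X_test, y_test):.2f}")
```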

Recall that in the competitive exam analogy, the first person learns using a much more complex mental model than the second one.
