
Bias-Variance Trade-off

Linear regression is a model with low variance and, usually, high bias. If you have been reading articles on Machine Learning, you will almost certainly have come across the term Bias-Variance Trade-Off. So, let’s start by understanding what bias and variance mean.


Earlier in the course, during the Python demonstration of linear regression using Numpy, you saw that we added a little noise to a straight line to generate the data points. Similarly, the data we get in real life contains noise. Any model you create is an attempt to capture the true underlying function. As earlier, let’s assume that the true function that fits the dataset is f(X) and that the model built by us is f′(X). Then, the relationship between the response Y (which is essentially the data available for modelling) and the predictor variables X is given by

Y=f(X)+ϵ

where ϵ is the irreducible error term, which cannot be modelled. Note that the irreducible error is assumed to be normally distributed with mean 0, since it is pure noise.
Hence, the Mean Squared Error (MSE) at an unseen data point x is the expected value of (Y − f′(x))², i.e. MSE = E[(Y − f′(x))²].

Then MSE can be decomposed into the following,

MSE = (E[f′(x)] − f(x))² + (E[f′(x)²] − E[f′(x)]²) + σ²ϵ

The first term is the squared bias of the model, the second term is the variance of f′(x), the model used for prediction, and the third term is the variance of the irreducible error.

MSE = Bias(f′(x))² + Variance(f′(x)) + σ²ϵ

Or,

MSE = Bias² + Variance + Irreducible error

As you can see from the above equation, for the same MSE, if the bias decreases, the variance must increase, and vice versa. Hence, there is a trade-off between bias and variance in the ML modelling process. As you go through this segment, you’ll realise that the best model is one with both low bias and low variance.
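This decomposition can be checked numerically. The sketch below (the sine-shaped true function, noise level, and choice of a linear model are illustrative assumptions, not part of the discussion above) repeatedly redraws the training set from Y = f(X) + ϵ, fits a straight line each time, and compares Bias² + Variance + σ²ϵ against the empirical MSE at a single test point:

```python
import numpy as np

rng = np.random.default_rng(0)

def true_f(x):
    # hypothetical true function f(x), assumed for illustration
    return np.sin(2 * np.pi * x)

x_test = 0.3        # a single unseen point at which we evaluate the model
n_datasets = 2000   # number of independently redrawn training sets
sigma = 0.3         # standard deviation of the irreducible noise ϵ

preds = np.empty(n_datasets)
for i in range(n_datasets):
    # draw a fresh training set from Y = f(X) + ϵ
    x_train = rng.uniform(0, 1, 30)
    y_train = true_f(x_train) + rng.normal(0, sigma, 30)
    # fit a straight line (degree-1 polynomial): a high-bias model here
    b1, b0 = np.polyfit(x_train, y_train, 1)
    preds[i] = b1 * x_test + b0

bias_sq = (preds.mean() - true_f(x_test)) ** 2   # (E[f'(x)] - f(x))^2
variance = preds.var()                           # E[f'(x)^2] - E[f'(x)]^2
# empirical MSE against fresh noisy observations Y at x_test
mse = np.mean((true_f(x_test) + rng.normal(0, sigma, n_datasets) - preds) ** 2)

print(f"bias^2 + variance + sigma^2 = {bias_sq + variance + sigma**2:.4f}")
print(f"empirical MSE               = {mse:.4f}")
```

The two printed numbers should agree closely, confirming the identity. Because the straight line is far too simple for the sine-shaped function, the squared bias dominates the variance in this setup.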

Bias error is the error between the predicted value and the true value of the data. Note that the predicted value depends on the assumptions the model makes. Hence, if the model makes strong assumptions about the data (as linear regression does), the bias error will be high, and vice versa.

Note that the true value may differ from the actual value available to you due to the presence of the irreducible error. But for the purpose of evaluating your model, you often consider the training error as a representative of the bias error. Hence, if the training error is high, the bias is high, and vice versa.

Bias quantifies how accurate the model is likely to be on future (test) data. Extremely simple models are likely to fail in predicting complex real-world phenomena. Simplicity has its own disadvantages.

Imagine solving digital image processing problems using simple linear regression when much more complex models like neural networks are typically successful in such problems. We say that the linear model has a high bias as it is way too simple to be able to learn the complexities involved in the task.

In the diagram given below, the data has been assumed to follow a linear trend, and hence a linear model has been fitted; since the actual trend is not linear, this results in a high bias.

Since this model is found to poorly fit the training data, such a model is said to be underfitting.
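To make underfitting concrete, here is a minimal sketch (the curved data-generating function and noise level are assumptions chosen for illustration) that fits a straight line to data with a clearly non-linear trend and compares its training error against a more flexible polynomial:

```python
import numpy as np

rng = np.random.default_rng(1)

# hypothetical curved data: a straight line cannot capture this trend
x = np.linspace(0, 1, 50)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, 50)

# degree-1 fit: too simple for the trend -> high bias (underfitting)
line = np.polyfit(x, y, 1)
mse_line = np.mean((np.polyval(line, x) - y) ** 2)

# a degree-5 polynomial for comparison: flexible enough for one sine period
poly = np.polyfit(x, y, 5)
mse_poly = np.mean((np.polyval(poly, x) - y) ** 2)

print(mse_line, mse_poly)  # the linear model's training error is far larger
```

The high training error of the straight line, relative to the noise level, is exactly the signature of underfitting described above.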

High bias suggests that the model makes many assumptions about the data, keeping the algorithm simple. Linear regression is an example of a high-bias algorithm.
 

Variance error is the error that arises from paying too much attention to the training data.

The ‘variance’ of a model is the variability of its output on the same test data as the training data changes. In other words, variance here refers to how much the model itself changes when the training data changes.

Consider the example of the model (shown below) memorising the entire training dataset. If you change the dataset even a little, this model will change drastically. The model is, therefore, unstable and sensitive to changes in the training data, and this is called high variance. This happens because the model has also modelled the irreducible error, which by definition cannot be modelled.

As you can see, the model fits the training data perfectly but fits the testing data poorly. Hence, the testing error is often taken as a representation of the model’s variance.

Since this model is found to capture the unwanted noise within the training data and is overly sensitive to the training data, such a model is said to be overfitting. Overfitting is a phenomenon where a model becomes too specific to the data it is trained on and fails to generalise to other unseen data points in the larger domain. A model that has become too specific to a training dataset has actually ‘learnt’ not just the hidden patterns in the data but also the noise and inconsistencies in the data. In a typical case of overfitting, the model performs very well on the training data but fails miserably on the test data.
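A minimal sketch of overfitting, under the assumption of a sinusoidal true function and a small training set (both chosen for illustration): a degree n−1 polynomial through n training points memorises the data, so its training error is essentially zero while its test error is far larger:

```python
import numpy as np

rng = np.random.default_rng(2)

def make_data(n):
    # hypothetical data: a noisy sine, standing in for signal + irreducible error
    x = rng.uniform(0, 1, n)
    return x, np.sin(2 * np.pi * x) + rng.normal(0, 0.2, n)

x_train, y_train = make_data(15)
x_test, y_test = make_data(200)

# a degree-14 polynomial through 15 points interpolates (memorises) the training set
coeffs = np.polyfit(x_train, y_train, 14)
train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)

print(train_mse)  # essentially zero: the noise has been memorised
print(test_mse)   # far larger: the model fails to generalise
```

Numpy may warn that such a high-degree fit is poorly conditioned, which is itself a symptom of how extreme this model is.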

Bias-Variance Trade-off

In an ideal case, we want to reduce both the bias and the variance, because the expected total error of a model is the sum of the errors due to bias and variance. In practice, we often cannot have a model with both low bias and low variance: as the model complexity increases, the bias decreases but the variance increases, resulting in a trade-off.

These terms are used very frequently in the domain of Machine Learning when you are building models and validating the test results.

Comprehension – Bias-Variance Tradeoff

An artificially generated dataset was used, with data of the form (x, 2x + 55 + e), where e is normally distributed noise with mean 0 and variance 1. Three regression models have been created to fit the data: a linear model, a degree-15 polynomial, and a higher-degree polynomial that passes through all the training points.
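One way to work through this comprehension is to generate such a dataset and fit the three models yourself. The sketch below assumes x is drawn uniformly from [0, 1] and uses 30 points (neither detail is specified above), with numpy.polyfit standing in for the three regression models:

```python
import numpy as np

rng = np.random.default_rng(3)

# dataset of the form (x, 2x + 55 + e), e ~ N(0, 1); the x-range [0, 1]
# and sample size are assumptions, as the text does not specify them
x = rng.uniform(0, 1, 30)
y = 2 * x + 55 + rng.normal(0, 1, 30)

train_mse = {}
for name, degree in [("linear", 1),
                     ("degree-15", 15),
                     ("interpolating", len(x) - 1)]:
    coeffs = np.polyfit(x, y, degree)
    train_mse[name] = np.mean((np.polyval(coeffs, x) - y) ** 2)
    print(name, train_mse[name])
```

The linear model’s training error should sit close to the noise variance of 1, while the more flexible fits drive it lower by fitting the noise. In theory, the interpolating polynomial reaches zero training error; in practice, a fit of that degree is numerically ill-conditioned, and numpy will warn about it.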
