Let’s understand what overfitted and underfitted models mean in the context of machine learning.
Overfitting is a phenomenon where a model becomes too specific to the data it is trained on and fails to generalise to other unseen data points in the larger domain. Such a model has ‘learnt’ not just the underlying patterns in the training data but also its noise and inconsistencies. In a typical case of overfitting, the model performs very well on the training data but fails miserably on the test data.
Luckily, there are quite a few techniques you can use to tackle overfitting. These techniques include:
- Increase the training data
- Remove insignificant features through feature selection
- Apply cross-validation schemes to train multiple models on the same dataset
- Use early stopping to prevent the model from memorising the training data
- Apply regularisation techniques to reduce model complexity (a brief sketch follows this list). You can learn more about regularisation from the additional resources available in the last session of this module.
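To make the regularisation point concrete, here is a minimal sketch using scikit-learn’s Ridge regression; the synthetic data and the penalty strength `alpha` are assumptions chosen purely for illustration and are not part of the original module.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split

# Hypothetical noisy data: only two of the twenty features carry signal,
# so an unregularised model is tempted to fit the noise in the rest.
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 20))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=2.0, size=100)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Plain linear regression: no penalty on coefficient size.
plain = LinearRegression().fit(X_train, y_train)

# Ridge regression: penalises large coefficients, reducing effective
# model complexity; alpha (the penalty strength) is an assumed value.
ridge = Ridge(alpha=10.0).fit(X_train, y_train)

print("plain - train R^2:", round(plain.score(X_train, y_train), 2),
      "test R^2:", round(plain.score(X_test, y_test), 2))
print("ridge - train R^2:", round(ridge.score(X_train, y_train), 2),
      "test R^2:", round(ridge.score(X_test, y_test), 2))
```

Typically, the regularised model gives up a little training performance in exchange for a smaller gap between the training and test scores.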
Underfitting is a phenomenon where a model is too naive to capture the underlying patterns or trends in the data. Such a model does not learn enough from the training data and hence fails to generalise to new data, which leads to unreliable results. Underfitting usually happens when you have too little training data to build a sufficiently complex model. In a typical case of underfitting, the model performs poorly on both the training data and the test data.
Like overfitting, there are established techniques to tackle underfitting as well; these include:
- Increase the training data
- Increase the number of features through feature engineering (see the sketch after this list)
- Increase the model complexity by adding more parameters
- Increase the training time until the objective function is minimised
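To illustrate how extra features and parameters can lift an underfitting model, here is a minimal sketch that engineers polynomial features with scikit-learn; the synthetic data and the chosen degree are assumptions for demonstration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical non-linear data that a straight line cannot capture.
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(-3, 3, size=(80, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.2, size=80)

# A single-feature linear model underfits this curve.
linear = LinearRegression().fit(X, y)

# Engineering polynomial features (the degree is an assumed choice) gives
# the model enough parameters to follow the underlying trend.
poly = make_pipeline(PolynomialFeatures(degree=4), LinearRegression()).fit(X, y)

print("linear model   - train R^2:", round(linear.score(X, y), 2))
print("degree-4 model - train R^2:", round(poly.score(X, y), 2))
```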
The image above depicts the cases of underfitting and overfitting. The model in the first graph does not fit the data points at all, showing underfitting. The model in the third graph captures the patterns in the data too closely and appears to memorise them, showing overfitting. The second graph strikes the right balance between the two: it is neither underfitting nor overfitting. The goal, therefore, is to build a model that strikes this balance and fits the data points just well enough to generalise to new data.
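The image itself cannot be reproduced here, but the three cases can be mimicked in code by fitting polynomial models of increasing degree; the dataset and the degrees below are assumptions chosen only to reproduce the underfit/balanced/overfit pattern.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical noisy curve standing in for the data points in the image.
rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(40, 1))
y = np.cos(1.5 * np.pi * X).ravel() + rng.normal(scale=0.1, size=40)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Degree 1 tends to underfit, a moderate degree balances,
# and a very high degree tends to overfit the training points.
for degree in (1, 4, 15):
    model = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
    model.fit(X_train, y_train)
    print(f"degree {degree:2d}: train R^2 = {model.score(X_train, y_train):.2f}, "
          f"test R^2 = {model.score(X_test, y_test):.2f}")
```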
Recall the heart disease prediction model that you built in the Tree Models module. The decision tree obtained after controlling the depth of the tree gave a train accuracy of 74% and a test accuracy of 60%. This noticeable gap between training and test performance shows that the model is overfitting: it has picked up patterns specific to the training data that do not carry over to unseen data. Hence, this is an example of an overfitting model.
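The heart disease dataset itself is not reproduced here, but the train-versus-test check described above can be sketched as follows; the synthetic data and the `max_depth` value are assumptions, not the actual module setup.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in binary classification data; the real exercise used the
# heart disease dataset from the Tree Models module.
X, y = make_classification(n_samples=300, n_features=13, random_state=7)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=7)

# max_depth limits how specific the tree can become; 3 is an assumed value.
tree = DecisionTreeClassifier(max_depth=3, random_state=7).fit(X_train, y_train)

# A sizeable gap between these two numbers is the usual sign of overfitting.
print("train accuracy:", round(tree.score(X_train, y_train), 2))
print("test accuracy: ", round(tree.score(X_test, y_test), 2))
```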
In the house pricing prediction model that you built in the Linear Regression module, the very first model, built on the single feature ‘area’, had a very low performance of 28% on the training set. This is an example of underfitting, as the trained model is unable to capture the underlying patterns even in the data it was trained on. Poor performance on the training set itself indicates underfitting.
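As a rough sketch of that diagnosis, checking the training score of a one-feature linear model looks like the following; the column names and synthetic data are assumptions, since the original exercise used the housing dataset from the Linear Regression module.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical housing data: price depends on 'area' plus a lot of
# unexplained variation, so a single-feature model cannot do well.
rng = np.random.default_rng(3)
area = rng.uniform(500, 3500, size=200)
price = 100 * area + rng.normal(scale=150_000, size=200)
df = pd.DataFrame({"area": area, "price": price})

model = LinearRegression().fit(df[["area"]], df["price"])

# A low score on the training data itself is the signature of underfitting:
# the model cannot even explain the data it was trained on.
print("training R^2:", round(model.score(df[["area"]], df["price"]), 2))
```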