
MNIST: Epochs, Batches, Overfitting and Underfitting

In the previous segment, you learnt how to write the code snippets for building and training a neural network using Keras. You saw it implemented in two different examples and understood the meaning behind each line of code used to build the model.

In this segment, we will discuss hyperparameters such as batch size and epochs in depth. We will also discuss the different aspects of the output summary obtained from the network.

Let’s revisit what we did while building the ANN for MNIST. We built a full-fledged classification architecture. Let’s now start analysing the main elements of the architecture other than its parameters. These elements are called hyperparameters.

Epochs:

Let’s start with the term ‘epochs’. In the following line of code, a model is trained using the fit() function. Several arguments, which we have seen earlier, are passed to this function. We will concentrate on the ‘epochs’ argument first:

model.fit(X_train, y_train, batch_size=64, epochs=5, validation_data=(X_val, y_val))

The number of epochs mentioned in the code snippet defines the number of times the learning algorithm will work through the entire data set. One epoch indicates that each training example has had an opportunity to update the internal model parameters, i.e., the weights and biases.

Now, let’s consider the batch size hyperparameter, represented by the ‘batch_size’ argument in the same line of code:

model.fit(X_train, y_train, batch_size=64, epochs=5, validation_data=(X_val, y_val))

Batch size:

The term batch size refers to the number of training examples utilised in one iteration, i.e., the number of examples the model works through before updating the internal model parameters.
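To see how batch size and epochs interact, the following sketch counts the parameter updates performed, assuming the MNIST numbers used in this segment (37,800 training examples, batch size 64, 5 epochs):

```python
import math

n_train = 37800   # training examples (90% of 42,000)
batch_size = 64   # examples processed per gradient update
epochs = 5        # full passes over the training set

# Number of batches (iterations) per epoch; the last batch may be smaller.
steps_per_epoch = math.ceil(n_train / batch_size)

# Each batch triggers one update of the weights and biases.
total_updates = steps_per_epoch * epochs

print(steps_per_epoch)  # 591 batches per epoch
print(total_updates)    # 2955 updates over 5 epochs
```

So with these settings, the weights and biases are updated 591 times in every epoch, and 2,955 times over the whole training run.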

Model summary:

The model summary states the details of the parameters used and displays the layers of the architecture. A simple call to the summary() function is all that is required:

model.summary()

This can be used after we compile the model (refer to the previous segment for more details on this). For the MNIST architecture, we have the summary given below.

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
dense (Dense)                (None, 128)               100480
_________________________________________________________________
dense_1 (Dense)              (None, 128)               16512
_________________________________________________________________
dense_2 (Dense)              (None, 128)               16512
_________________________________________________________________
dense_3 (Dense)              (None, 10)                1290
=================================================================
Total params: 134,794
Trainable params: 134,794
Non-trainable params: 0
_________________________________________________________________

Some important points regarding this model summary are as follows:

  1. The model is given as ‘sequential’, which means that the layers are set up one after the other in a singular sequence.
  2. We know the input data’s size is 784, which means that 784 neurons are present in the input layer.
  3. The first hidden layer is a dense layer, which means that all of its neurons are fully connected with the neurons of the previous layer, which is the input layer. The output shape is defined as 128, which means that the hidden layer has 128 neurons. The number of parameters is as follows:
    1. If the input layer has 784 neurons and the first hidden layer has 128 neurons and is fully connected (dense), then the weight matrix will be of size 784 x 128, and there will be 128 elements in the bias vector, one for each neuron in the hidden layer.
    2. Total elements in the weight matrix = 784 x 128 = 100352
    3. Total elements in the bias vector = 128
    4. Total number of parameters  = 100352 + 128 = 100480 (as is given)
  4. The second hidden layer is a dense layer and has 128 neurons. The number of parameters is as follows:
    1. If the first hidden layer has 128 neurons and the second hidden layer has 128 neurons, then the weight matrix is of size 128 x 128 and the bias vector is of size 128.
    2. Total elements in the weight matrix = 128 x 128 = 16384
    3. Total elements in the bias vector = 128
    4. Total number of parameters = 16384 + 128 = 16512 (as is given)
  5. Similarly, the third hidden layer is dense and has 128 neurons. Since the second hidden layer has 128 neurons too, the total number of parameters will be 16512 (just as described in the previous point).
  6. The last dense layer is the output layer with 10 neurons (classes). The number of parameters is as follows:
    1. The third hidden layer has 128 neurons, and the output layer has 10 neurons. The weight matrix is of size 128 x 10 and the bias vector is of size 10. 
    2. Total elements in the weight matrix = 128 x 10 = 1280
    3. Total elements in the bias vector = 10
    4. Total number of parameters = 1280 + 10 = 1290 (as is given)
  7. In the end, the summary shows the totals of trainable and non-trainable parameters. Trainable parameters are the ones that go through the learning process, i.e., the weights and biases. Non-trainable parameters are the ones that are not updated during training. For example, in the following code, ‘0.3’ defines the fraction of randomly selected neuron outputs set to zero during training, and this rate is fixed and does not change throughout the training process:

     keras.layers.Dropout(0.3)

     Hence, it is a non-trainable parameter. (Note: You will learn more about Dropouts in the optional session.)
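To make the dropout idea concrete, here is a minimal toy sketch (plain Python, not the Keras internals) of what a rate of 0.3 means during training: each output is independently zeroed with probability 0.3, and the rate itself is never learned:

```python
import random

random.seed(42)  # fixed seed so the toy result is reproducible

rate = 0.3                    # fixed dropout rate; a hyperparameter, not learned
activations = [1.0] * 10000   # toy layer outputs

# During training, each output is kept with probability 1 - rate.
dropped = [a if random.random() >= rate else 0.0 for a in activations]

zero_fraction = dropped.count(0.0) / len(dropped)
print(round(zero_fraction, 2))  # close to 0.3
```

Note that the Keras Dropout layer additionally scales the kept outputs by 1/(1 - rate) during training so that their expected sum is unchanged; the sketch above only illustrates the zeroing.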

So far, the model includes only weights and biases and no other types of parameters; hence, all the parameters are trainable parameters. This gives us the sum of all the parameters across all the layers as 134,794.
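The parameter counts listed above can be verified with a few lines of arithmetic; for a dense layer, the count is inputs × outputs (the weight matrix) plus outputs (the bias vector):

```python
def dense_params(n_in, n_out):
    """Parameters of a fully connected layer: weight matrix plus bias vector."""
    return n_in * n_out + n_out

# (inputs, outputs) for each dense layer in the MNIST architecture
layers = [(784, 128), (128, 128), (128, 128), (128, 10)]
counts = [dense_params(i, o) for i, o in layers]

print(counts)       # [100480, 16512, 16512, 1290]
print(sum(counts))  # 134794
```

These values match the Param # column and the total reported by model.summary().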

While training the model using the model.fit() function, you must have seen something like this:

Epoch 1/5
591/591 [==============================] - 7s 7ms/step - loss: 1.6376 - accuracy: 0.8366 - val_loss: 0.4750 - val_accuracy: 0.8967
Epoch 2/5
591/591 [==============================] - 4s 7ms/step - loss: 0.3136 - accuracy: 0.9265 - val_loss: 0.3424 - val_accuracy: 0.9255
Epoch 3/5
591/591 [==============================] - 4s 6ms/step - loss: 0.2075 - accuracy: 0.9460 - val_loss: 0.3897 - val_accuracy: 0.9140
Epoch 4/5
591/591 [==============================] - 4s 6ms/step - loss: 0.1615 - accuracy: 0.9563 - val_loss: 0.2382 - val_accuracy: 0.9479
Epoch 5/5
591/591 [==============================] - 4s 6ms/step - loss: 0.1431 - accuracy: 0.9599 - val_loss: 0.2876 - val_accuracy: 0.9343

The text above can be analysed as follows: 

  1. The text 591/591 indicates the number of batches the training step runs through in each epoch. Since the batch size is 64 and the training data set has 37,800 examples (90% of the total data set of 42,000, while the remaining 10% forms the validation data set), it is separated into 591 batches (37,800/64, rounded up).
  2. You can also see the amount of time it is taking for training a single batch, for example, 7ms/step.
  3. The text snippets loss and val_loss show the sparse categorical cross-entropy loss (mentioned while compiling the model; refer to the previous segment for this).
  4. The text snippets accuracy and val_accuracy show the proportion of matches between the predicted class and the actual class. This proportion is calculated on the whole training and validation data sets.
  5. And, after 5 epochs (5 run-throughs of the whole training data set), we can see that the validation accuracy reached approximately 93%.
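The accuracy reported in the log is simply the fraction of predictions that match the labels. A minimal sketch of that calculation, using made-up predicted and actual classes for ten examples:

```python
# Hypothetical predicted classes and true classes for 10 examples.
predicted = [7, 2, 1, 0, 4, 1, 4, 9, 5, 9]
actual    = [7, 2, 1, 0, 4, 1, 4, 9, 6, 9]

# Accuracy = number of matching predictions / total predictions.
matches = sum(p == a for p, a in zip(predicted, actual))
accuracy = matches / len(actual)

print(accuracy)  # 0.9
```

Keras computes the same proportion, with accuracy measured on the training batches and val_accuracy on the validation data set.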

Note: The calculation speed may differ in each runtime as it depends on computational power, and the loss and accuracy may differ slightly across runs owing to random weight initialisation and data shuffling.

There are two points to keep in mind regarding training a model on a data set. Firstly, the model should be able to determine generalised trends in a proper manner for smarter predictions. Secondly, it should be able to apply these observations and trends to future data (the data that the model has never seen) and make predictions accurately. To measure how it is performing on these two bases, we assess whether the model may be overfitting or underfitting. If the model overfits, it will perform well on the training data set, but not on the testing data set. If the model underfits, it will find it difficult to identify even the major patterns present in the data.

If the model is not able to understand the underlying trend of the given data set, then the model is said to be underfitting, and its accuracy is low even on the training data set. This usually happens when the amount of data to train on is small or the model defined is too simple, with mostly linear elements and too few non-linear relationships for it to understand complex trends and patterns. When this happens, the model is too rigid and inflexible, which results in incorrect predictions even on the training data. The opposite is true when the model overfits, that is, the model learns the exact patterns of the training data, including its noise, and is unable to generalise to unseen data. Both underfitting and overfitting are issues that need to be addressed.
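One rough way to flag these two failure modes from the metrics printed each epoch is to compare training and validation accuracy. The thresholds in this sketch are illustrative assumptions, not fixed rules:

```python
def diagnose(train_acc, val_acc, min_acc=0.8, max_gap=0.05):
    """Heuristic check on one epoch's metrics (thresholds are illustrative)."""
    if train_acc < min_acc:
        # Low accuracy even on data the model has seen suggests underfitting.
        return "possible underfitting: low accuracy even on training data"
    if train_acc - val_acc > max_gap:
        # A large train/validation gap suggests the model memorised training data.
        return "possible overfitting: training accuracy far above validation"
    return "looks reasonable"

print(diagnose(0.62, 0.60))  # possible underfitting
print(diagnose(0.99, 0.88))  # possible overfitting
print(diagnose(0.96, 0.93))  # looks reasonable
```

In practice you would also look at how the loss curves evolve across epochs, but this train-versus-validation comparison is the core idea.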

To summarise, you have understood what different aspects of the model output mean when training the model after we have defined the model architecture and hyperparameters.

The training process requires your careful attention to ensure that the model is learning well, neither too much nor too little, and is able to observe underlying trends and patterns to make more accurate predictions on future data.

This brings us to the end of this session. In the next segment, you will look at the summary of all the concepts learned in this module.
