In this segment, you will learn how neural networks are trained. Recall that the training task is to compute the optimal weights and biases by minimising some cost function. Let’s start with a quick recap of the training task.
The task of training neural networks is similar to that of other ML models such as linear regression and logistic regression. The loss (or cost) measures how far the predicted output (the output of the last layer) is from the actual output, and we have to tune the parameters w and b such that the total cost is minimised.
The loss function for a regression model can be given as follows:
$$L = RSS = \sum (\text{actual} - h^L)^2, \qquad L = f(W, b)$$
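To make this concrete, here is a minimal sketch in Python (using NumPy) of the RSS loss for a batch of predictions. The arrays h_L and y are hypothetical values used only for illustration; they are not part of the course material.

```python
import numpy as np

def rss_loss(h_L, y):
    """Residual sum of squares: sum of squared (actual - predicted)."""
    return np.sum((y - h_L) ** 2)

# Example: predictions from the last layer for 4 data points vs. their actual values
h_L = np.array([2.5, 0.0, 2.1, 7.8])
y   = np.array([3.0, -0.5, 2.0, 7.5])
print(rss_loss(h_L, y))  # 0.6 (= 0.25 + 0.25 + 0.01 + 0.09)
```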
To start training a neural network, we randomly initialise the weights.
An important point to note is that if the dataset is large (which is often the case), the loss computation itself can become expensive. For example, if you have a million data points, they will be fed into the network (in batches), the output will be calculated using feedforward, and the loss $L_i$ (for the $i^{th}$ data point) will be calculated. The total loss is the sum of the losses of all the individual data points. Hence:
$$\text{Total loss} = L = L_1 + L_2 + L_3 + \ldots + L_{1000000}$$
The total loss L is a function of w’s and b’s. Once the total loss is computed, the weights and biases are updated (in the direction of decreasing loss). In other words, L is minimised with respect to the w’s and b’s.
One important point to note here is that we minimise the average of the individual losses and not the total loss, as you will see shortly. Since the average is simply the total loss divided by the number of data points, minimising the average loss also minimises the total loss.
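As a sketch of this bookkeeping, the snippet below accumulates the batch-wise squared-error losses and returns their average. Here, feedforward is a hypothetical function standing in for the network's forward pass; the names and batch size are assumptions used only for illustration.

```python
import numpy as np

def average_loss(X, y, feedforward, batch_size=1000):
    """Average squared-error loss over all data points, computed batch by batch."""
    total_loss = 0.0
    n = len(X)
    for start in range(0, n, batch_size):
        X_batch = X[start:start + batch_size]
        y_batch = y[start:start + batch_size]
        h_L = feedforward(X_batch)                   # predictions for this batch
        total_loss += np.sum((y_batch - h_L) ** 2)   # sum of L_i over the batch
    return total_loss / n                            # average loss over all points

# Illustration with a toy 'network' that just multiplies its input by 2
X = np.arange(10, dtype=float)
y = 2 * X + 0.1
print(average_loss(X, y, feedforward=lambda batch: 2 * batch, batch_size=4))
```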
This can be done using any optimisation routine such as gradient descent.
The parameters being optimised are iterated in the direction of decreasing cost according to the following rule:
$$W_{new} = W_{old} - \alpha \, \frac{\partial L}{\partial W}$$
The same can be written for biases. Note that weights and biases are often collectively represented by one matrix called W. Going forward, W will, by default, refer to the matrix of all weights and biases.
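The update rule above can be sketched in a few lines of NumPy. The names W and grad_L below are illustrative placeholders for the full parameter matrix and its gradient (which backpropagation would supply); this is a sketch of a single gradient-descent step, not a full training loop.

```python
import numpy as np

def gradient_descent_step(W, grad_L, alpha=0.01):
    """W_new = W_old - alpha * dL/dW, applied element-wise to the parameter matrix."""
    return W - alpha * grad_L

# Example: a small parameter matrix and an assumed gradient dL/dW
W = np.array([[0.2, -0.5], [0.1, 0.4]])
grad_L = np.array([[0.05, -0.10], [0.02, 0.08]])
W = gradient_descent_step(W, grad_L, alpha=0.1)
print(W)
```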
The main challenge is that W is a huge matrix, and thus, the total loss L as a function of W is a complex function. Let’s watch the next video to understand how to deal with this complexity.
As you learnt in the video above, the loss function of even a very small and simple neural network can be quite complex. The best way to minimise this complex loss function is by using gradient descent.
Let us next summarise what you learnt in this session.