In the previous session, you learnt that training refers to the task of finding the optimal combination of weights and biases to minimize the total loss (with a fixed set of hyperparameters).
This optimization is achieved using the familiar gradient descent algorithm.
For a neural network, you will learn how the loss function is minimized using the gradient descent algorithm, by finding the optimum values of the weights and biases through backpropagation. In the next video, you will learn how backpropagation works.
In this video, you learnt that the gradient descent algorithm presents us with the following parameter update equation:
$$w^l_{kj} = w^l_{kj} - \eta \frac{\partial L}{\partial w^l_{kj}}$$
where $k$ and $j$ are the indices of the weight in the weight matrix, $l$ is the index of the layer to which it belongs, and $\eta$ is the learning rate.
Given the neural network in the diagram, for the output layer, the following weights and bias terms will be updated using the gradient descent update equation:
$$w^2_{11} = w^2_{11} - \eta \frac{\partial L}{\partial w^2_{11}}$$
$$w^2_{12} = w^2_{12} - \eta \frac{\partial L}{\partial w^2_{12}}$$
$$b^2_{1} = b^2_{1} - \eta \frac{\partial L}{\partial b^2_{1}}$$
Now, for the hidden layer, the following weight and biases will be updated:
$$w^1_{11} = w^1_{11} - \eta \frac{\partial L}{\partial w^1_{11}} \qquad w^1_{21} = w^1_{21} - \eta \frac{\partial L}{\partial w^1_{21}}$$
$$w^1_{12} = w^1_{12} - \eta \frac{\partial L}{\partial w^1_{12}} \qquad w^1_{22} = w^1_{22} - \eta \frac{\partial L}{\partial w^1_{22}}$$
$$b^1_{1} = b^1_{1} - \eta \frac{\partial L}{\partial b^1_{1}} \qquad b^1_{2} = b^1_{2} - \eta \frac{\partial L}{\partial b^1_{2}}$$
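To make the update rule concrete, here is a minimal Python (NumPy) sketch of how these updates could be applied, assuming, from the parameters listed above, a network with two inputs, two hidden neurons and one output neuron. The variable names, the parameter and gradient values, and the learning rate are illustrative assumptions, not values from the network in the diagram.

```python
import numpy as np

# Illustrative parameter shapes for a 2-2-1 network (assumed from the
# weights and biases listed above): W1 holds the w^1_kj, W2 holds the w^2_1j.
W1 = np.array([[0.1, 0.3],      # hidden-layer weights (2 x 2)
               [0.2, 0.4]])
b1 = np.array([0.05, 0.05])     # hidden-layer biases
W2 = np.array([[0.5, 0.6]])     # output-layer weights (1 x 2)
b2 = np.array([0.1])            # output-layer bias

eta = 0.01                      # learning rate (assumed value)

# Suppose backpropagation has already produced these gradients
# (placeholder numbers, used purely to show the update step).
dL_dW1 = np.array([[0.01, -0.02],
                   [0.03,  0.00]])
dL_db1 = np.array([0.005, -0.01])
dL_dW2 = np.array([[0.02, -0.01]])
dL_db2 = np.array([0.004])

# Gradient descent update w <- w - eta * dL/dw, applied element-wise;
# each element corresponds to one of the scalar update equations above.
W1 -= eta * dL_dW1
b1 -= eta * dL_db1
W2 -= eta * dL_dW2
b2 -= eta * dL_db2
```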
To compute these gradients, we use an algorithm called backpropagation.
As you can see, each of the update equations above contains a partial derivative of the loss function $L$ with respect to a weight or a bias. To compute these derivatives, we use the chain rule, which makes the dependencies between the different layers explicit in the gradient computation.
Now, let’s simplify the neural network given above and represent it in a condensed format – this is shown below.
In this case, the loss function is a function of $w_1$, $b_1$, $w_2$ and $b_2$.
The loss function, the activation function and the cumulative input are shown in the following expressions:
Loss function: $$L = \frac{1}{2}(y - h_2)^2$$
Cumulative input: $$z_i = w_i h_{i-1} + b_i$$
Output using the tanh activation function: $$h_i = \tanh(z_i)$$
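The forward pass of this condensed network can be sketched in a few lines of Python. This is a minimal illustration assuming that the input to the first layer is a single scalar $x$ (i.e. $h_0 = x$); the parameter and target values shown are arbitrary placeholders.

```python
import math

# Illustrative scalar parameters and data for the condensed network
# (all values are arbitrary placeholders).
w1, b1 = 0.4, 0.1    # first (hidden) layer
w2, b2 = 0.7, -0.2   # second (output) layer
x, y = 0.5, 0.8      # input (h_0 = x) and target

# Cumulative input and activation of the first layer: z_1 = w_1*h_0 + b_1
z1 = w1 * x + b1
h1 = math.tanh(z1)

# Cumulative input and activation of the second layer: z_2 = w_2*h_1 + b_2
z2 = w2 * h1 + b2
h2 = math.tanh(z2)

# Loss: L = 1/2 * (y - h_2)^2
L = 0.5 * (y - h2) ** 2
print(L)
```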
Now, let’s compute the gradient of the loss function with respect to one of the weights to understand how backpropagation works. Suppose we want to calculate $\frac{\partial L}{\partial w_2}$:
Using the chain rule, we can say that:
$$\frac{\partial L}{\partial w_2} = \frac{\partial L}{\partial h_2} \cdot \frac{\partial h_2}{\partial z_2} \cdot \frac{\partial z_2}{\partial w_2}$$
Based on the definition of the loss function, $L$ is a direct function of $h_2$, $h_2$ is a direct function of $z_2$, and $z_2$ is a direct function of $w_2$. Now, in the following video, let’s calculate each of these terms one by one.
In this video, you saw the computations of the gradients required to obtain $\frac{\partial L}{\partial w_2}$.
We know that $\frac{\partial L}{\partial w_2} = \frac{\partial L}{\partial h_2} \cdot \frac{\partial h_2}{\partial z_2} \cdot \frac{\partial z_2}{\partial w_2}$; let’s compute each term present in this expression.
First term:
Loss function: $$L = \frac{1}{2}(y - h_2)^2$$
Taking the derivative of the loss function $L$ with respect to $h_2$:
$$\frac{\partial L}{\partial h_2} = \frac{\partial}{\partial h_2}\left[\frac{1}{2}(y - h_2)^2\right] = -(y - h_2) \qquad \dots (1)$$
Second term:
Applying the tanh activation function to the cumulative input $z_2$, we get $h_2$:
$$h_2 = \tanh(z_2)$$
Taking the derivative of $h_2$ with respect to $z_2$:
$$\frac{\partial h_2}{\partial z_2} = 1 - \tanh^2(z_2) = 1 - (h_2)^2 \qquad \dots (2)$$
Third term:
Cumulative input: $$z_2 = w_2 h_1 + b_2$$
Taking the derivative of $z_2$ with respect to $w_2$:
$$\frac{\partial z_2}{\partial w_2} = h_1 \qquad \dots (3)$$
Hence, from expressions (1), (2) and (3), we get the gradient of the loss function $L$ with respect to $w_2$, as shown below.
$$\frac{\partial L}{\partial w_2} = \frac{\partial L}{\partial h_2} \cdot \frac{\partial h_2}{\partial z_2} \cdot \frac{\partial z_2}{\partial w_2} = [-(y - h_2)]\,[1 - (h_2)^2]\,[h_1]$$
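Continuing the scalar sketch from above, the same placeholder values can be used to evaluate this expression and cross-check it against a numerical (finite-difference) estimate of the gradient. The helper function `forward_loss` and the step size `eps` are illustrative assumptions.

```python
import math

# Same illustrative placeholder values as in the forward-pass sketch above.
w1, b1, w2, b2 = 0.4, 0.1, 0.7, -0.2
x, y = 0.5, 0.8

def forward_loss(w2_value):
    """Loss as a function of w2, with all other parameters held fixed."""
    h1 = math.tanh(w1 * x + b1)
    h2 = math.tanh(w2_value * h1 + b2)
    return 0.5 * (y - h2) ** 2

# Analytical gradient from expressions (1), (2) and (3):
# dL/dw2 = [-(y - h2)] * [1 - (h2)^2] * [h1]
h1 = math.tanh(w1 * x + b1)
h2 = math.tanh(w2 * h1 + b2)
dL_dw2 = -(y - h2) * (1 - h2 ** 2) * h1

# Numerical check with a small central finite difference.
eps = 1e-6
numerical = (forward_loss(w2 + eps) - forward_loss(w2 - eps)) / (2 * eps)

print(dL_dw2, numerical)  # the two values should agree closely
```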
Now, we have completed the computation of the gradient of the loss function L with respect to the weight w2 for backpropagation. We can similarly compute the gradient of the loss function with respect to all the weights and biases present in the network.
In the next video, we will observe the iterative nature of the computation of the gradients for multiple hidden layers.
Once the gradients of all the weights and biases are computed, the gradient descent update equation can be used to obtain the updated values of the weights and biases.
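As an illustration of that complete cycle, here is a minimal sketch, continuing the same scalar example and placeholder values, that backpropagates through both layers of the condensed network to obtain the gradients with respect to $w_1$, $b_1$, $w_2$ and $b_2$, and then applies one gradient descent update. The learning rate is an assumed value.

```python
import math

# Illustrative scalar parameters and data (same placeholders as before).
w1, b1, w2, b2 = 0.4, 0.1, 0.7, -0.2
x, y = 0.5, 0.8
eta = 0.1  # learning rate (assumed)

# Forward pass
z1 = w1 * x + b1
h1 = math.tanh(z1)
z2 = w2 * h1 + b2
h2 = math.tanh(z2)

# Backward pass (chain rule, as derived above)
dL_dh2 = -(y - h2)               # expression (1)
dh2_dz2 = 1 - h2 ** 2            # expression (2)
dL_dz2 = dL_dh2 * dh2_dz2
dL_dw2 = dL_dz2 * h1             # expression (3)
dL_db2 = dL_dz2                  # since dz2/db2 = 1

# Propagate further back into the first layer.
dL_dh1 = dL_dz2 * w2             # since dz2/dh1 = w2
dL_dz1 = dL_dh1 * (1 - h1 ** 2)  # tanh derivative again
dL_dw1 = dL_dz1 * x              # since dz1/dw1 = x (the input)
dL_db1 = dL_dz1

# Gradient descent update for every weight and bias.
w1 -= eta * dL_dw1
b1 -= eta * dL_db1
w2 -= eta * dL_dw2
b2 -= eta * dL_db2
```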
Before proceeding to the next segment, attempt the following question to deepen your understanding of backpropagation. While attempting it, write down the equations, referring to the theory discussed in this segment.
Now that you have learnt about backpropagation in a simple network, let’s proceed to the next segment to apply the technique of backpropagation to a numerical example.