Numerical Example Demonstrating Backpropagation

You have gained an understanding of the backpropagation technique on a very simple neural network. You will now learn how the weights and biases, i.e., the parameters of the neural network used in the house price prediction example, change during backpropagation.

The housing data set has two inputs, which are the size of the house and the number of rooms available, and one output, which is the price of the house.

As seen in the computation of the forward pass, we randomly initialise the weights and biases in the network. Let’s take the same initialisation and the same input observation that we used earlier while doing forward propagation.

Note: Solve these equations using pen and paper for better understanding. Do keep in mind that understanding how backpropagation works takes time and effort, and it may take a few repetitions to grasp it well. Please be patient!

So, we initialised the parameters to the values shown below when doing forward propagation. We will use the same values when doing backpropagation. Also, we will consider the same input observation. The values of the weights, biases and the input are as follows:

Weights:

$$W^1 = \begin{bmatrix} w^1_{11} & w^1_{12} \\ w^1_{21} & w^1_{22} \end{bmatrix} = \begin{bmatrix} 0.2 & 0.15 \\ 0.5 & 0.6 \end{bmatrix}, \qquad W^2 = \begin{bmatrix} w^2_{11} & w^2_{12} \end{bmatrix} = \begin{bmatrix} 0.3 & 0.2 \end{bmatrix}$$

Biases:

$$b^1 = \begin{bmatrix} b^1_1 \\ b^1_2 \end{bmatrix} = \begin{bmatrix} 0.1 \\ 0.25 \end{bmatrix}, \qquad b^2 = \begin{bmatrix} b^2_1 \end{bmatrix} = \begin{bmatrix} 0.4 \end{bmatrix}$$

Input data:

$$X^1 = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} -0.32 \\ -0.66 \end{bmatrix}$$

The network architecture we consider is shown below:

As we have calculated previously, the output prediction $h^2_1$ obtained is 0.63, whereas the actual output $y$ is $-0.54$. Using backpropagation, we will update the weights and biases such that this difference between the predicted and the actual output gets minimised.
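If you would like to verify these forward-pass numbers programmatically, the following is a minimal NumPy sketch (the variable names and layout are our own, not from the video) that reproduces $h^1_1 \approx 0.484$, $h^1_2 \approx 0.424$ and the prediction $h^2_1 \approx 0.63$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Initial parameters and the input observation used in the forward pass
W1 = np.array([[0.20, 0.15],      # [[w1_11, w1_12],
               [0.50, 0.60]])     #  [w1_21, w1_22]]
b1 = np.array([0.10, 0.25])       # [b1_1, b1_2]
W2 = np.array([0.30, 0.20])       # [w2_11, w2_12]
b2 = 0.40                         # b2_1
x  = np.array([-0.32, -0.66])     # input [x1, x2]
y  = -0.54                        # actual output (price)

# Forward pass: sigmoid activation in the hidden layer, linear activation at the output
z1 = W1 @ x + b1                  # cumulative input to the hidden layer
h1 = sigmoid(z1)                  # hidden-layer output  -> approx. [0.484, 0.424]
z2 = W2 @ h1 + b2                 # cumulative input to the output neuron
h2 = z2                           # linear activation    -> approx. 0.63

loss = 0.5 * (y - h2) ** 2        # squared-error loss   -> approx. 0.68
print(h1, h2, loss)
```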

Now, let’s revise the notations of some terms:

1. $h^2$ is the final predicted output that is obtained by applying the activation function (in this case, linear) on the cumulative input.
2. $z^2$ is the cumulative input fed to the output neuron.
3. $W^2$ and $b^2$ are the weights and biases between the hidden layer and the output layer.
4. $h^1$ is the output of the hidden layer.
5. $z^1$ is the cumulative input to the hidden layer.
6. $W^1$ and $b^1$ are the weights and biases of the hidden layer, respectively.
7. $X^1$ is the input from the housing data set.

We will only focus on updating the weights and biases in this network. 

We strongly recommend that you perform the calculations yourself along with Gunnvant to grasp the concepts efficiently.

Note: At timestamp 3:32, Gunnvant mentioned that “we can use the gradient descent update equation to get the updated value of the error term”. It is not the updated value of the error term but the updated value of the weight term $w^2_{11}$.

As seen in the video, the steps taken to update the weights and biases between the hidden layer and the output layer are shown below. 

First, we will focus on the weights for the output layer.

Output Layer: Compute gradient of L with respect to $w^2_{11}$:

First, let’s take the gradient of $L$ with respect to $w^2_{11}$. We know that:

$$\frac{\partial L}{\partial w^2_{11}} = \frac{\partial L}{\partial h^2_1}\,\frac{\partial h^2_1}{\partial z^2_1}\,\frac{\partial z^2_1}{\partial w^2_{11}} \quad \text{(using the chain rule)}$$

1) First term $\frac{\partial L}{\partial h^2_1}$:

$$\frac{\partial L}{\partial h^2_1} = \frac{\partial}{\partial h^2_1}\,\frac{1}{2}\left(y - h^2_1\right)^2 = -\left(y - h^2_1\right)$$

$$\frac{\partial L}{\partial h^2_1} = -(-0.54 - 0.63) = 1.17$$

2) Second term $\frac{\partial h^2_1}{\partial z^2_1}$:

$$\frac{\partial h^2_1}{\partial z^2_1} = 1, \quad \text{as } h^2_1 = z^2_1 \text{ (linear activation function)}$$

3) Third term $\frac{\partial z^2_1}{\partial w^2_{11}}$:

$$\frac{\partial z^2_1}{\partial w^2_{11}} = \frac{\partial}{\partial w^2_{11}}\left(b^2_1 + w^2_{11}h^1_1 + w^2_{12}h^1_2\right)$$

$$\frac{\partial z^2_1}{\partial w^2_{11}} = h^1_1 = 0.484$$

Hence, $\frac{\partial L}{\partial w^2_{11}}$ evaluates to:

$$\frac{\partial L}{\partial w^2_{11}} = 1.17 \times 1 \times 0.484 = 0.5663$$

Now, we apply the gradient descent update rule with the learning rate $\eta$ taken as 0.2:

$$w^2_{11}(\text{updated}) = w^2_{11} - \eta\frac{\partial L}{\partial w^2_{11}} = 0.3 - (0.2 \times 0.5663) = 0.1867$$

Output Layer: Compute gradient of L with respect to $w^2_{12}$:

Similarly,

$$\frac{\partial L}{\partial w^2_{12}} = \frac{\partial L}{\partial h^2_1}\,\frac{\partial h^2_1}{\partial z^2_1}\,\frac{\partial z^2_1}{\partial w^2_{12}}$$

Since we have already computed the first two derivatives, let’s compute the third one:

$$\frac{\partial z^2_1}{\partial w^2_{12}} = h^1_2 = 0.424$$

Hence, this evaluates to:

$$\frac{\partial L}{\partial w^2_{12}} = 1.17 \times 1 \times 0.424 = 0.4961$$

Now, using the gradient descent update equation,

$$w^2_{12}(\text{updated}) = w^2_{12} - \eta\frac{\partial L}{\partial w^2_{12}} = 0.2 - (0.2 \times 0.4961) = 0.1008$$

Similarly, for the bias term, we know that:

$$\frac{\partial L}{\partial b^2_1} = \frac{\partial L}{\partial h^2_1}\,\frac{\partial h^2_1}{\partial z^2_1}\,\frac{\partial z^2_1}{\partial b^2_1}$$

We have computed the first two derivatives already, and the third one can be computed as shown below: 

$$\frac{\partial z^2_1}{\partial b^2_1} = \frac{\partial}{\partial b^2_1}\left(b^2_1 + w^2_{11}h^1_1 + w^2_{12}h^1_2\right)$$

$$\frac{\partial z^2_1}{\partial b^2_1} = 1$$

Hence, this evaluates to:

$$\frac{\partial L}{\partial b^2_1} = 1.17 \times 1 \times 1 = 1.17$$

Now,

$$b^2_1(\text{updated}) = b^2_1 - \eta\frac{\partial L}{\partial b^2_1} = 0.4 - (0.2 \times 1.17) = 0.166$$

So, we have the updated values of the weights and biases of the output layer from a single iteration:

$$w^2_{11}(\text{updated}) = 0.1867, \quad w^2_{12}(\text{updated}) = 0.1008, \quad b^2_1(\text{updated}) = 0.166$$
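If you want to verify these output-layer numbers in code, here is a short continuation of the NumPy sketch above (again, the variable names are our own); it reproduces the three gradients and the updated parameters:

```python
# Output layer: gradients of L w.r.t. w2_11, w2_12 and b2_1, and their updates
eta = 0.2                              # learning rate

dL_dh2  = -(y - h2)                    # = 1.17
dh2_dz2 = 1.0                          # linear output activation, so dh2/dz2 = 1

dL_dw211 = dL_dh2 * dh2_dz2 * h1[0]    # ~ 0.5663
dL_dw212 = dL_dh2 * dh2_dz2 * h1[1]    # ~ 0.4961
dL_db21  = dL_dh2 * dh2_dz2 * 1.0      # = 1.17

w211_new = W2[0] - eta * dL_dw211      # ~ 0.1867
w212_new = W2[1] - eta * dL_dw212      # ~ 0.1008
b21_new  = b2    - eta * dL_db21       # = 0.166
print(w211_new, w212_new, b21_new)
```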

In the next video, we will move to the previous layer, and you will learn how to update the weights and biases of the first layer (hidden layer). 

As seen in this video, the steps involved in computing the updated weights and biases in the hidden layer are shown below. 

Now, let’s start with computing the weights and biases corresponding to the first neuron of the hidden layer.

Hidden Layer: Compute gradient of L with respect to $w^1_{11}$:

Taking the gradient of $L$ with respect to $w^1_{11}$, we can say that:

$$\frac{\partial L}{\partial w^1_{11}} = \frac{\partial L}{\partial h^2_1}\,\frac{\partial h^2_1}{\partial h^1_1}\,\frac{\partial h^1_1}{\partial w^1_{11}} = \frac{\partial L}{\partial h^2_1}\,\frac{\partial h^2_1}{\partial h^1_1}\left[\frac{\partial h^1_1}{\partial z^1_1}\,\frac{\partial z^1_1}{\partial w^1_{11}}\right]$$

We know the first derivative term: 

$$\frac{\partial L}{\partial h^2_1} = \frac{\partial}{\partial h^2_1}\,\frac{1}{2}\left(y - h^2_1\right)^2 = -\left(y - h^2_1\right) = -(-0.54 - 0.63) = 1.17$$

Now, let’s compute the second, third and fourth derivative terms:

1) $$\frac{\partial h^2_1}{\partial h^1_1} = \frac{\partial}{\partial h^1_1}\left(b^2_1 + w^2_{11}h^1_1 + w^2_{12}h^1_2\right) = w^2_{11} = 0.30$$

2) $$\frac{\partial h^1_1}{\partial z^1_1} = \sigma(z^1_1)\left(1 - \sigma(z^1_1)\right) = h^1_1\left(1 - h^1_1\right) = 0.484(1 - 0.484)$$ (considering a sigmoid activation function in the hidden layer)

3) $$\frac{\partial z^1_1}{\partial w^1_{11}} = \frac{\partial}{\partial w^1_{11}}\left(b^1_1 + w^1_{11}x_1 + w^1_{12}x_2\right) = x_1 = -0.32$$

Hence, this evaluates to:

$$\frac{\partial L}{\partial w^1_{11}} = 1.17 \times 0.30 \times 0.484 \times (1 - 0.484) \times (-0.32) = -0.028$$

Now, using the gradient descent update equation,

$$w^1_{11}(\text{updated}) = w^1_{11} - \eta\frac{\partial L}{\partial w^1_{11}} = 0.2 - 0.2 \times (-0.028) = 0.2056$$

Hidden Layer: Compute gradient of L with respect to $w^1_{12}$:

Similarly,

$$\frac{\partial L}{\partial w^1_{12}} = \frac{\partial L}{\partial h^2_1}\,\frac{\partial h^2_1}{\partial h^1_1}\,\frac{\partial h^1_1}{\partial w^1_{12}} = \frac{\partial L}{\partial h^2_1}\,\frac{\partial h^2_1}{\partial h^1_1}\left[\frac{\partial h^1_1}{\partial z^1_1}\,\frac{\partial z^1_1}{\partial w^1_{12}}\right]$$

Since we have already computed the values of the first three terms, we simply need to calculate the pending derivative term: 

$$\frac{\partial z^1_1}{\partial w^1_{12}} = \frac{\partial}{\partial w^1_{12}}\left(b^1_1 + w^1_{11}x_1 + w^1_{12}x_2\right) = x_2 = -0.66$$

Hence, this evaluates to:

$$\frac{\partial L}{\partial w^1_{12}} = 1.17 \times 0.30 \times 0.484 \times (1 - 0.484) \times (-0.66) = -0.058$$

Now, using the gradient descent update equation,

$$w^1_{12}(\text{updated}) = w^1_{12} - \eta\frac{\partial L}{\partial w^1_{12}} = 0.15 - 0.2 \times (-0.058) = 0.1616$$

Hidden Layer: Compute gradient of L with respect to $b^1_1$:

$$\frac{\partial L}{\partial b^1_1} = \frac{\partial L}{\partial h^2_1}\,\frac{\partial h^2_1}{\partial h^1_1}\left[\frac{\partial h^1_1}{\partial z^1_1}\,\frac{\partial z^1_1}{\partial b^1_1}\right]$$

Consider the last term on the right-hand side of the equation above:

$$\frac{\partial z^1_1}{\partial b^1_1} = \frac{\partial}{\partial b^1_1}\left(b^1_1 + w^1_{11}x_1 + w^1_{12}x_2\right) = 1$$

Hence, this evaluates to:

$$\frac{\partial L}{\partial b^1_1} = 1.17 \times 0.30 \times 0.484 \times (1 - 0.484) \times 1 = 0.088$$

Now, using the gradient descent update equation,

$$b^1_1(\text{updated}) = b^1_1 - \eta\frac{\partial L}{\partial b^1_1} = 0.1 - 0.2 \times (0.088) = 0.0824$$

Hence, for the first node, the updated values of the weights and biases using gradient descent and a learning rate of 0.2 (η) are:

$$w^1_{11}(\text{updated}) = w^1_{11} - \eta\frac{\partial L}{\partial w^1_{11}} = 0.2 - 0.2 \times (-0.028) = 0.2056$$

$$w^1_{12}(\text{updated}) = w^1_{12} - \eta\frac{\partial L}{\partial w^1_{12}} = 0.15 - 0.2 \times (-0.058) = 0.1616$$

$$b^1_1(\text{updated}) = b^1_1 - \eta\frac{\partial L}{\partial b^1_1} = 0.1 - 0.2 \times (0.088) = 0.0824$$
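Continuing the same NumPy sketch, the gradients and updates for the first hidden neuron can be checked as follows (the variable names are our own):

```python
# Hidden layer, first neuron: gradients of L w.r.t. w1_11, w1_12 and b1_1
dh2_dh11  = W2[0]                      # w2_11 = 0.30
dh11_dz11 = h1[0] * (1 - h1[0])        # sigmoid derivative ~ 0.484 * (1 - 0.484)

dL_dw111 = dL_dh2 * dh2_dh11 * dh11_dz11 * x[0]   # ~ -0.028
dL_dw112 = dL_dh2 * dh2_dh11 * dh11_dz11 * x[1]   # ~ -0.058
dL_db11  = dL_dh2 * dh2_dh11 * dh11_dz11 * 1.0    # ~  0.088

w111_new = W1[0, 0] - eta * dL_dw111   # ~ 0.2056
w112_new = W1[0, 1] - eta * dL_dw112   # ~ 0.1616
b11_new  = b1[0]    - eta * dL_db11    # ~ 0.0824
print(w111_new, w112_new, b11_new)
```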

In the same manner, we calculate the weights and biases corresponding to the second neuron in the hidden layer.

Hidden Layer: Compute gradient of L with respect to $w^1_{21}$:

We start by finding the derivative of the loss function $L$ with respect to $w^1_{21}$:

$$\frac{\partial L}{\partial w^1_{21}} = \frac{\partial L}{\partial h^2_1}\,\frac{\partial h^2_1}{\partial h^1_2}\,\frac{\partial h^1_2}{\partial w^1_{21}} = \frac{\partial L}{\partial h^2_1}\,\frac{\partial h^2_1}{\partial h^1_2}\left[\frac{\partial h^1_2}{\partial z^1_2}\,\frac{\partial z^1_2}{\partial w^1_{21}}\right]$$

We have already computed the first derivative term:

$$\frac{\partial L}{\partial h^2_1} = \frac{\partial}{\partial h^2_1}\,\frac{1}{2}\left(y - h^2_1\right)^2 = -\left(y - h^2_1\right) = -(-0.54 - 0.63) = 1.17$$

Let’s compute the second, third and fourth terms:

1) $$\frac{\partial h^2_1}{\partial h^1_2} = \frac{\partial}{\partial h^1_2}\left(b^2_1 + w^2_{11}h^1_1 + w^2_{12}h^1_2\right) = w^2_{12} = 0.20$$

2) $$\frac{\partial h^1_2}{\partial z^1_2} = \sigma(z^1_2)\left(1 - \sigma(z^1_2)\right) = h^1_2\left(1 - h^1_2\right) = 0.424(1 - 0.424)$$

3) $$\frac{\partial z^1_2}{\partial w^1_{21}} = \frac{\partial}{\partial w^1_{21}}\left(b^1_2 + w^1_{21}x_1 + w^1_{22}x_2\right) = x_1 = -0.32$$

Also, for $w^1_{22}$ and $b^1_2$, the first three terms will remain the same; only the last term will change. Hence, we will compute only the last terms:

$$\frac{\partial z^1_2}{\partial w^1_{22}} = \frac{\partial}{\partial w^1_{22}}\left(b^1_2 + w^1_{21}x_1 + w^1_{22}x_2\right) = x_2 = -0.66$$

$$\frac{\partial z^1_2}{\partial b^1_2} = \frac{\partial}{\partial b^1_2}\left(b^1_2 + w^1_{21}x_1 + w^1_{22}x_2\right) = 1$$

Hence, for the second node:

$$\frac{\partial L}{\partial w^1_{21}} = \frac{\partial L}{\partial h^2_1}\,\frac{\partial h^2_1}{\partial h^1_2}\left[\frac{\partial h^1_2}{\partial z^1_2}\,\frac{\partial z^1_2}{\partial w^1_{21}}\right] = 1.17 \times 0.20 \times 0.424 \times (1 - 0.424) \times (-0.32) = -0.018$$

$$\frac{\partial L}{\partial w^1_{22}} = \frac{\partial L}{\partial h^2_1}\,\frac{\partial h^2_1}{\partial h^1_2}\left[\frac{\partial h^1_2}{\partial z^1_2}\,\frac{\partial z^1_2}{\partial w^1_{22}}\right] = 1.17 \times 0.20 \times 0.424 \times (1 - 0.424) \times (-0.66) = -0.038$$

$$\frac{\partial L}{\partial b^1_2} = \frac{\partial L}{\partial h^2_1}\,\frac{\partial h^2_1}{\partial h^1_2}\left[\frac{\partial h^1_2}{\partial z^1_2}\,\frac{\partial z^1_2}{\partial b^1_2}\right] = 1.17 \times 0.20 \times 0.424 \times (1 - 0.424) \times 1 = 0.057$$

Now, computing the updated values of weights and biases using gradient descent and a learning rate of 0.2 (η):

$$w^1_{21}(\text{updated}) = w^1_{21} - \eta\frac{\partial L}{\partial w^1_{21}} = 0.5 - 0.2 \times (-0.018) = 0.5036$$

$$w^1_{22}(\text{updated}) = w^1_{22} - \eta\frac{\partial L}{\partial w^1_{22}} = 0.6 - 0.2 \times (-0.038) = 0.6076$$

$$b^1_2(\text{updated}) = b^1_2 - \eta\frac{\partial L}{\partial b^1_2} = 0.25 - 0.2 \times (0.057) = 0.2386$$
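And the corresponding check for the second hidden neuron, continuing the same sketch:

```python
# Hidden layer, second neuron: gradients of L w.r.t. w1_21, w1_22 and b1_2
dh2_dh12  = W2[1]                      # w2_12 = 0.20
dh12_dz12 = h1[1] * (1 - h1[1])        # sigmoid derivative ~ 0.424 * (1 - 0.424)

dL_dw121 = dL_dh2 * dh2_dh12 * dh12_dz12 * x[0]   # ~ -0.018
dL_dw122 = dL_dh2 * dh2_dh12 * dh12_dz12 * x[1]   # ~ -0.038
dL_db12  = dL_dh2 * dh2_dh12 * dh12_dz12 * 1.0    # ~  0.057

w121_new = W1[1, 0] - eta * dL_dw121   # ~ 0.5036
w122_new = W1[1, 1] - eta * dL_dw122   # ~ 0.6076
b12_new  = b1[1]    - eta * dL_db12    # ~ 0.2386
print(w121_new, w122_new, b12_new)
```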

To summarise, using gradient descent in backpropagation, we can update the weights and biases of the whole neural network. 

Given below are the new values for weights and biases after one step of gradient descent for the hidden and the output layer, respectively.

Updated weights:

$$W^1(\text{updated}) = \begin{bmatrix} w^1_{11}(\text{updated}) & w^1_{12}(\text{updated}) \\ w^1_{21}(\text{updated}) & w^1_{22}(\text{updated}) \end{bmatrix} = \begin{bmatrix} 0.2056 & 0.1616 \\ 0.5036 & 0.6076 \end{bmatrix}$$

$$W^2(\text{updated}) = \begin{bmatrix} w^2_{11}(\text{updated}) & w^2_{12}(\text{updated}) \end{bmatrix} = \begin{bmatrix} 0.1867 & 0.1008 \end{bmatrix}$$

Updated biases:

$$\begin{bmatrix} b^1_1(\text{updated}) \\ b^1_2(\text{updated}) \end{bmatrix} = \begin{bmatrix} 0.0824 \\ 0.2386 \end{bmatrix}, \qquad \begin{bmatrix} b^2_1(\text{updated}) \end{bmatrix} = \begin{bmatrix} 0.166 \end{bmatrix}$$
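For reference, the entire update step can also be written in vectorised (matrix) form, which is how backpropagation is usually implemented in practice. This is a sketch under the same assumptions as the snippets above, and it reproduces all the updated values in one go:

```python
# The same single backpropagation step in vectorised (matrix) form.
# Starting from the initial W1, b1, W2, b2 and the forward-pass values h1 and h2,
# this reproduces all the updated parameters in one shot.
delta2 = -(y - h2)                     # output-layer error term (linear activation)
dW2 = delta2 * h1                      # gradients for [w2_11, w2_12]
db2 = delta2                           # gradient for b2_1

delta1 = delta2 * W2 * h1 * (1 - h1)   # hidden-layer error terms (sigmoid derivative)
dW1 = np.outer(delta1, x)              # gradients for [[w1_11, w1_12], [w1_21, w1_22]]
db1 = delta1                           # gradients for [b1_1, b1_2]

W2_new, b2_new = W2 - eta * dW2, b2 - eta * db2
W1_new, b1_new = W1 - eta * dW1, b1 - eta * db1
print(W1_new)    # ~ [[0.2056, 0.1616], [0.5036, 0.6076]]
print(W2_new)    # ~ [0.1867, 0.1008]
print(b1_new)    # ~ [0.0824, 0.2386]
print(b2_new)    # ~ 0.166
```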

Forward Pass with Updated Parameters

Now, let’s perform another forward pass and check if performing backpropagation and updating the weights and biases once has helped in reducing the loss.
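Here is a minimal sketch of this check, continuing the code above (the exact loss values depend on rounding, but the decrease is what matters):

```python
# Forward pass with the updated parameters, and comparison of the loss
z1_new = W1_new @ x + b1_new
h1_new = sigmoid(z1_new)
h2_new = W2_new @ h1_new + b2_new      # linear output activation

loss_before = 0.5 * (y - h2) ** 2      # ~ 0.68 (with h2 ~ 0.63)
loss_after  = 0.5 * (y - h2_new) ** 2  # ~ 0.35, clearly lower than before
print(loss_before, loss_after)
```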

You can see that the loss computed with the updated weights and biases is lower than before, which is what we want. By repeatedly performing backpropagation to obtain optimum values of the weights and biases, we can continue reducing the loss. Eventually, this will give us a predicted output that is as close as possible to the actual output. This is how a neural network learns using backpropagation.

Since this is a simple neural network, you could compute these values manually. But as the number of hidden layers and the number of neurons per hidden layer increase, computing these values manually will no longer be possible; the machine will perform these computations. The aim of this example is to demonstrate how a basic neural network behaves so that you can extrapolate the ideas learnt here to larger networks.

Now, you have an in-depth understanding of how weights and biases are optimised using the loss function and gradient descent through the neural network. In the next segment, we will cover a generalised step-by-step algorithm to summarise backpropagation.
