Adam is yet another gradient descent optimisation method. It combines momentum and RMSProp, using exponentially weighted averages with the corresponding bias corrections. Let us learn about this algorithm in the upcoming video.
Adam is the abbreviated form of Adaptive Moment Estimation. As you saw in the video above, it combines the momentum and RMSProp techniques. Let us understand the implementation of the Adam optimiser as shown below:
- Initialize vdW=0, vdb=0, SdW=0, Sdb=0
- On iteration t:
- Compute dW, db for the current mini-batch
- vdW=β1∗vdW+(1−β1)dW
- vdb=β1∗vdb+(1−β1)db
- SdW = β2∗SdW + (1−β2)∗dW²
- Sdb = β2∗Sdb + (1−β2)∗db²
- Apply bias correction on all the terms as follows:
- vdW_corrected = vdW / (1 − β1^t)
- vdb_corrected = vdb / (1 − β1^t)
- SdW_corrected = SdW / (1 − β2^t)
- Sdb_corrected = Sdb / (1 − β2^t)
- Wnew = Wold − α ∗ vdW_corrected / (√SdW_corrected + ϵ)
- bnew = bold − α ∗ vdb_corrected / (√Sdb_corrected + ϵ)
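The steps above can be sketched for a single scalar parameter as follows. This is a minimal illustrative sketch: the function name `adam_step`, its signature, and the default hyperparameter values are assumptions for the example, not taken from the text.

```python
import math

def adam_step(w, dw, v, s, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter.

    w  : current parameter value
    dw : gradient for the current mini-batch
    v  : running momentum term (vdW)
    s  : running RMSProp term (SdW)
    t  : iteration count starting at 1 (needed for bias correction)
    """
    v = beta1 * v + (1 - beta1) * dw       # momentum: EWA of gradients
    s = beta2 * s + (1 - beta2) * dw ** 2  # RMSProp: EWA of squared gradients
    v_corr = v / (1 - beta1 ** t)          # bias correction on both terms
    s_corr = s / (1 - beta2 ** t)
    w = w - alpha * v_corr / (math.sqrt(s_corr) + eps)
    return w, v, s
```

In practice, you would initialise `v = s = 0` once and call `adam_step` in a loop over mini-batches, incrementing `t` on every call.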
Let us understand a bit more about the significance of bias correction in the Adam optimiser. As you can see, there are two values of β, i.e., β1 and β2. You have already seen that the value of β1 (the momentum-related hyperparameter) is usually 0.9; therefore, 1 − β1 becomes 0.1. On the other hand, β2 is usually taken to be 0.999; therefore, 1 − β2 becomes 0.001.
Let us take the momentum equation and substitute the values of β to understand the need for bias correction. Take a look at the equation below:
vdW=β1∗vdW+(1−β1)dW=0.9∗vdW+0.1∗dW
The initial value of vdW is 0. Therefore, on the first iteration (t = 1):
vdW1 = 0.9∗0 + 0.1∗dW
Therefore, vdW1 = 0.1∗dW
Now let us go back to the stock price forecasting example you saw in the momentum-based optimisers. Consider the price of the stock today to be 55 INR. The predicted price for tomorrow is given by the momentum equation with dW = 55.
Tomorrow’s price, vdW1=0.1dW=0.1∗55
vdW1=5.5
There is a huge difference between today's stock price and tomorrow's predicted price. However, such a drop is very unlikely, so the prediction is incorrect. This happens because we initialised vdW = 0, which biases the early estimates towards zero. To fix this issue and get a good result, bias correction is performed as follows:
vdW_corrected = vdW / (1 − β1^t)
Therefore, for day 2 (t = 2), you will get:
vdW=0.9∗vdW+0.1∗dW=0.9∗5.5+0.1∗dW=4.95+0.1∗dW
vdW_corrected = vdW / (1 − β1^t) = (4.95 + 0.1∗dW) / (1 − 0.9²) = (4.95 + 0.1∗dW) / 0.19 ≈ 26.05 + 0.53∗dW
For dW=55, you will get the following result:
vdW_corrected ≈ 26.05 + 0.53∗dW = 26.05 + 0.53∗55 ≈ 55.2
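This two-day arithmetic can be checked numerically. A small sketch, with illustrative variable names: with exact arithmetic the bias-corrected estimate comes out to exactly 55 on both days (the slight deviation in 55.2 above is a rounding artefact of using 0.53 for 0.526…), because the underlying signal dW is constant.

```python
beta1 = 0.9
dw = 55.0        # constant "stock price" signal from the example

v = 0.0
corrected = []
for t in (1, 2):
    v = beta1 * v + (1 - beta1) * dw        # raw EWA, biased towards 0 early on
    corrected.append(v / (1 - beta1 ** t))  # bias-corrected estimate
# Day 1: v = 5.5 but corrected value is 5.5 / 0.1 = 55.0
# Day 2: v = 10.45 but corrected value is 10.45 / 0.19 = 55.0
```

The raw EWA values (5.5, then 10.45) are far below 55, while the corrected values recover the true level immediately.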
As you can see, the corrected prediction now lies close to the current stock price. This is a far more plausible and therefore much better prediction. This is how bias correction helps model performance.
As you can see above, Adam computes both the momentum term and the RMSProp term and uses both to update the parameters. It therefore provides an adaptive learning rate along with smoother, less noisy gradient estimates. This advantage makes Adam one of the most widely used optimisers in the world of neural networks.
Now that you have covered the different gradient descent optimisers, let us learn about a problem faced by neural networks and how to deal with it in the next segment.