
Summary

You have come a long way! Let’s look at a summary of all that you have covered in this session.

You learnt about the architecture of the Transformer and how all of its components work together.

Encoder:

  • The input to the encoder comprises both word embeddings and positional encodings (see the positional-encoding sketch after this list).
  • The input is then fed to a ‘multi-head attention’ block, which generates the context vector (Z); a sketch of this block also follows the list.
  • This is then fed to an ‘add and normalise’ layer, followed by the ‘position-wise feed-forward’ network, with residual connections employed around each of these sub-layers. Lastly, there is another ‘add and normalise’ layer with a residual connection around it (see the sub-layer sketch below).
  • These steps are repeated across six stacked encoder layers, each building a newer representation on top of the previous one.
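
To make the first bullet concrete, here is a minimal sketch of how sinusoidal positional encodings can be built and added to word embeddings. The sequence length, model dimension, and random embeddings are illustrative assumptions, not values from this session.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encodings: sine on even dimensions, cosine on odd ones."""
    positions = np.arange(seq_len)[:, np.newaxis]              # (seq_len, 1)
    dims = np.arange(d_model)[np.newaxis, :]                    # (1, d_model)
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates                            # (seq_len, d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])                       # even dimensions
    pe[:, 1::2] = np.cos(angles[:, 1::2])                       # odd dimensions
    return pe

# Toy example (hypothetical sizes): 5 tokens, model dimension 16
word_embeddings = np.random.rand(5, 16)
encoder_input = word_embeddings + positional_encoding(5, 16)    # fed to multi-head attention
```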
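The ‘multi-head attention’ block from the second bullet can be sketched as follows. The weight matrices and dimensions here are random placeholders, purely for illustration; a real layer would use learnt parameters.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, num_heads, W_q, W_k, W_v, W_o):
    """Project the input, split into heads, attend in each head, then concatenate."""
    seq_len, d_model = X.shape
    d_head = d_model // num_heads
    Q, K, V = X @ W_q, X @ W_k, X @ W_v                         # (seq_len, d_model)
    split = lambda M: M.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    Qh, Kh, Vh = split(Q), split(K), split(V)                    # (heads, seq_len, d_head)
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)        # scaled dot-product scores
    Z = softmax(scores) @ Vh                                     # per-head context vectors
    Z = Z.transpose(1, 0, 2).reshape(seq_len, d_model)           # concatenate the heads
    return Z @ W_o                                               # final context vector Z

# Hypothetical toy setup: 5 tokens, d_model = 16, 4 heads
d_model, heads = 16, 4
X = np.random.rand(5, d_model)
W_q, W_k, W_v, W_o = (np.random.rand(d_model, d_model) for _ in range(4))
Z = multi_head_attention(X, heads, W_q, W_k, W_v, W_o)
```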
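And here is a rough sketch of the ‘add and normalise’ and ‘position-wise feed-forward’ sub-layers with residual connections. The layer normalisation is simplified (no learnt gain or bias), and the parameter names are hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalise each position's features to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def feed_forward(x, W1, b1, W2, b2):
    """Position-wise feed-forward network: two linear layers with a ReLU in between."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

def encoder_layer(x, attention_fn, W1, b1, W2, b2):
    """One encoder layer: a residual connection and 'add and normalise' around each sub-layer."""
    z = layer_norm(x + attention_fn(x))                          # add & normalise around attention
    return layer_norm(z + feed_forward(z, W1, b1, W2, b2))       # add & normalise around feed-forward
```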

Decoder:

  • We pass the target output as the input to the decoder along with the positional encodings. The target sequence is shifted one position to the right, which ensures that the prediction for the current position depends only on the previous positions of the input.
  • The multi-head attention in the decoder is applied only to tokens up to the current timestep. This is done using a ‘look-ahead mask’, a weight matrix with ‘-inf’ values in the upper triangle and ‘0’ values in the lower triangle (see the mask sketch after this list).
  • Once the masked attention layer is applied, an ‘add and normalise’ layer is applied, similar to what we have done in the encoder.
  • After the normalisation layer, cross-attention (encoder-decoder attention) is applied, using the output of the previous layer as the queries (Q) and the output of the encoder as the keys and values (K, V); a sketch of this step also follows the list.
  • Post cross-attention, the following layers are added:
      • Add and normalise
      • Feed-forward network
      • Add and normalise

  • These steps are repeated across six stacked decoder layers, each building a newer representation on top of the previous one.
  • After the final decoder layer, a linear layer followed by a softmax function is applied to generate the probability scores for the next token.
  • Once the most probable token is predicted, it is appended to the end of the decoder’s input sequence (a sketch of this decoding loop follows the list).
  • This is repeated until we receive the <end> token, which marks the completion of the output sequence.
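
As a quick illustration of the ‘look-ahead mask’ mentioned above, here is a minimal sketch; a large negative number stands in for ‘-inf’ so that masked positions receive near-zero weight after the softmax.

```python
import numpy as np

def look_ahead_mask(size):
    """'-inf' (approximated by -1e9) above the diagonal, 0 on and below it."""
    upper = np.triu(np.ones((size, size)), k=1)   # 1s strictly above the diagonal
    return upper * -1e9                           # added to attention scores before softmax

# look_ahead_mask(4) has the pattern:
# [[   0, -1e9, -1e9, -1e9],
#  [   0,    0, -1e9, -1e9],
#  [   0,    0,    0, -1e9],
#  [   0,    0,    0,    0]]
# so position i can only attend to positions 0..i.
```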
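The encoder-decoder (cross) attention step can be sketched like this; a single head is shown for brevity, and the projection matrices are hypothetical placeholders.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_attention(decoder_state, encoder_output, W_q, W_k, W_v):
    """Queries (Q) come from the decoder; keys (K) and values (V) come from the encoder."""
    Q = decoder_state @ W_q                    # (target_len, d_model)
    K = encoder_output @ W_k                   # (source_len, d_model)
    V = encoder_output @ W_v                   # (source_len, d_model)
    scores = Q @ K.T / np.sqrt(Q.shape[-1])    # (target_len, source_len)
    return softmax(scores) @ V                 # each target position attends over the source
```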
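Finally, the last three bullets (linear layer plus softmax, appending the predicted token, and stopping at <end>) can be put together as a simple greedy decoding loop. `decoder_fn`, `W_linear`, and the token ids are hypothetical stand-ins for whatever decoder stack and vocabulary you are using.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def greedy_decode(decoder_fn, W_linear, start_id, end_id, max_len=50):
    """Generate one token at a time until the <end> token (end_id) appears."""
    tokens = [start_id]
    for _ in range(max_len):
        hidden = decoder_fn(tokens)                # (len(tokens), d_model) decoder output
        logits = hidden[-1] @ W_linear             # linear layer: project to vocabulary size
        probs = softmax(logits)                    # softmax gives the probability scores
        next_id = int(np.argmax(probs))            # pick the most probable next token
        tokens.append(next_id)                     # append it to the decoder's input sequence
        if next_id == end_id:                      # <end> marks completion of the output
            break
    return tokens
```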

In the next segment, apply all the learnings of this session to solve the graded questions. All the best!
