The main function of the decoder block is to predict the next token of the sequence using the previously generated tokens and the attention output from the encoder block. While training the decoder, we pass the target sequence as the input to the decoder. However, it is shifted one position to the right by prepending a <start> token, since no tokens have been generated at the first timestep. Hence, using the <start> token and the context vector, the decoder predicts the first token.
Every subsequently predicted word is added as an input in the next timestep, and the whole process is repeated until the <end> token is predicted. Let’s understand this process in the next video.
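As a small concrete sketch of this right shift during training (the tokens here are purely illustrative), the decoder input is simply the target sequence with <start> prepended and the last token dropped:

```python
# Illustrative target sequence for one training example
target = ["I", "love", "NLP", "<end>"]

# Shift right: prepend <start>, drop the last token, so that at each
# position the decoder sees only the tokens before the one it must predict.
decoder_input = ["<start>"] + target[:-1]

print(decoder_input)  # ['<start>', 'I', 'love', 'NLP']
```

At position 0 the decoder sees only <start> and must predict "I"; at position 1 it sees <start> and "I" and must predict "love", and so on.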
As explained in the video, the encoder and the decoder work like the typical encoder-decoder architecture you saw in the previous module.
In the diagram above of the decoder block, the output embedding and positional embedding layers have the same role and structure as in the encoder. However, the decoder block in the transformer architecture has some additional components. You will understand them in the next video.
As seen in the video, there are two main additions in the decoder block:
- Cross-attention (encoder-decoder attention)
- Look-ahead mask
Look Ahead Mask
The multi-head attention in the decoder is applied only to the tokens up to the current timestep (the index till which the transformer has made predictions) and not to the future tokens (the ones that have not been predicted yet). However, since the input to the decoder is fed in parallel, we must block the future tokens explicitly. This is done using a ‘look-ahead mask’: a weight matrix with ‘-inf’ values in the upper triangle (above the diagonal) and ‘0’ values on and below the diagonal. Hence, the attention block combined with the look-ahead mask is referred to as the ‘masked multi-head attention’ layer.
Let’s understand this through an example in the next video.
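The mask can be sketched in a few lines of NumPy (dimensions here are illustrative). Adding -inf to a score before the softmax drives its weight to zero, so each position can attend only to itself and earlier positions:

```python
import numpy as np

def look_ahead_mask(T):
    """-inf strictly above the diagonal, 0 on and below it."""
    upper = np.triu(np.ones((T, T), dtype=bool), k=1)
    return np.where(upper, -np.inf, 0.0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

T = 4
scores = np.random.randn(T, T)               # raw attention scores (T x T)
weights = softmax(scores + look_ahead_mask(T))

# Row i has non-zero weights only for columns 0..i (no peeking ahead)
print(np.round(weights, 2))
```

Note the first row always becomes [1, 0, 0, 0]: the first token can attend only to itself.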
Once the attention layer is applied, an add-and-normalisation layer is applied, similar to what we have done in the encoder.
This type of attention obtains its queries (Q) from the previous decoder sub-layer (T × d_model, where T is the sequence length of the target), whereas the keys (K) and values (V) are acquired from the encoder output (S × d_model, where S is the sequence length of the source). This allows every position in the decoder to attend over all the positions in the input sequence (similar to the typical encoder-decoder architecture).
Note that this layer does not compute new input representations of its own: the queries are taken directly from the previous decoder sub-layer, and the keys and values from the encoder output, although the projection matrices applied to these inputs are trained like those of any other attention layer.
$$CrossAttention=\lbrack output\_decoder_{sublayer-1}(Q),\ output\_encoder(K,V)\rbrack$$
In the next video, Ankush will explain this in detail.
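A minimal single-head NumPy sketch of this cross-attention follows. The dimensions are illustrative, and the projection matrices W_q, W_k and W_v are random stand-ins for weights that would be learned during training:

```python
import numpy as np

rng = np.random.default_rng(0)
S, T, d = 5, 3, 8                        # source length, target length, head dim

enc_out = rng.standard_normal((S, d))    # encoder output           (S x d)
dec_out = rng.standard_normal((T, d))    # previous decoder sub-layer (T x d)

# Random stand-ins for the learned projection matrices
W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))

Q = dec_out @ W_q                        # queries from the decoder (T x d)
K = enc_out @ W_k                        # keys from the encoder    (S x d)
V = enc_out @ W_v                        # values from the encoder  (S x d)

scores = Q @ K.T / np.sqrt(d)            # (T x S)
scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)

out = weights @ V                        # (T x d): one row per target position
print(out.shape)
```

Each of the T target positions ends up with a weighted mixture of the S encoder outputs, which is exactly what lets the decoder "look over" the whole input sequence.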
- Like the earlier attention operations, here, we first perform a dot product between the query (taken from the decoder) and the key (taken from the encoder). This gives us attention scores with the shape (T × d) × (S × d)ᵀ = T × S.
- On top of the attention scores, we apply scaling and softmax to normalise the values into the range (0, 1).
- Finally, we multiply the normalised attention scores (T × S) with the value matrix (S × d). The resulting output has the shape T × d.
- This is done in parallel across all eight attention heads, and their outputs are concatenated, which finally transforms the output into the shape T × d_model (the same as the input to the decoder).
- Post cross-attention, the following layers are added:
  - Add and normalisation
  - Feed-forward network
  - Add and normalisation
- These steps are repeated across six decoder blocks, each building newer representations on top of the previous ones.
- After the final decoder block, a linear layer followed by a softmax function is applied to generate the probability scores for the next token.
- Once the most probable token is predicted, it is appended to the input sequence of the decoder for the next timestep.
- This is repeated until the <end> token is predicted, which marks the completion of the output sequence.
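The generation loop described above can be sketched as follows. Here decode_step is a hypothetical stand-in that returns canned probabilities; in a real transformer it would be the full decoder stack followed by the linear + softmax head:

```python
import numpy as np

VOCAB = ["<start>", "<end>", "I", "love", "NLP"]

def decode_step(tokens):
    # Hypothetical stand-in for the decoder stack + linear + softmax:
    # returns a probability distribution over VOCAB for the next token.
    canned = {1: "I", 2: "love", 3: "NLP", 4: "<end>"}
    probs = np.full(len(VOCAB), 0.01)
    probs[VOCAB.index(canned[len(tokens)])] = 1.0
    return probs / probs.sum()

tokens = ["<start>"]
while tokens[-1] != "<end>" and len(tokens) < 10:
    probs = decode_step(tokens)
    next_token = VOCAB[int(np.argmax(probs))]  # pick the most probable token
    tokens.append(next_token)                  # feed it back in the next timestep

print(tokens)  # ['<start>', 'I', 'love', 'NLP', '<end>']
```

The length cap guards against the model never emitting <end>; in practice a maximum output length is always enforced.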
This brings us to the end of our discussion on the encoder and decoder blocks. In the next segments, we will take a look at the different variants of the transformer architecture.