
Pointwise Feed-Forward Network

In the previous segments, you learnt about the various operations that we perform on an input sequence. However, positional information can get diluted as the data moves from one operation to the next. Therefore, before we pass the output of the multi-head attention block to the next sub-layer, we use a skip connection that carries an untouched copy of the original input forward.

Let’s understand this in the next video.

As you saw in the previous video, there is an ‘add and normalise’ layer after the ‘multi-head attention’ layer. The ‘residual connection’ added around the layers is the most important part of this sub-layer for the following reasons:

  • It helps retain the position-related information we add to the input representation/embedding across the network.
  • It allows each layer to look at the input of the previous layer and its output simultaneously.
  • Like the ResNet architecture, it also helps in avoiding gradient vanishing/explosion while backpropagating the gradients.

Also, it has been observed that the network showed catastrophic results once the residual connections were removed.

To regulate computation, a layer normalisation operation is performed on the resulting output from the skip connection. It normalises the feature vector at each position (each row), so that every position has the same mean and standard deviation across its features.

The output from this block is 𝑿′ = LayerNorm(MultiHeadAttention(𝑿) + 𝑿).
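To make the formula above concrete, here is a minimal NumPy sketch of the ‘add and normalise’ step. It is an illustration under stated assumptions, not the paper's implementation: the names layer_norm and add_and_norm are hypothetical, the attention output is a random stand-in, and the learnable scale and shift that layer normalisation usually carries are left out.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalise each position (row) across its features (last axis).
    # Assumption: the learnable scale and shift are omitted for brevity.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def add_and_norm(x, sublayer_out):
    # Residual (skip) connection followed by layer normalisation.
    return layer_norm(x + sublayer_out)

rng = np.random.default_rng(0)
seq_len, d_model = 4, 512

X = rng.normal(size=(seq_len, d_model))         # embeddings + positional encodings
attn_out = rng.normal(size=(seq_len, d_model))  # random stand-in for the attention output

X_prime = add_and_norm(X, attn_out)
print(X_prime.shape)  # same shape as the input X
```

Because the normalisation runs along the last axis, each of the four positions ends up with a mean close to 0 and a standard deviation close to 1 across its 512 features.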

The next layer is a position-wise fully connected feed-forward network. This network is applied to each position individually and in exactly the same way. It consists of two linear layers with a ReLU activation function in between. Equivalently, it can be viewed as two convolutions with a kernel size of 1.

In the next video, Ankush will explain the utility of this layer.

Here, we must keep the shapes of the input and output of this layer the same. Therefore, the last layer has a dimension of 512, matching the model dimension. The first layer is usually given a higher dimension; the authors of the paper used a dimension of 2048 for it.
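The shapes described above can be checked with a short NumPy sketch. This is only an illustration: the weights are random rather than trained, and the names feed_forward, W1, b1, W2 and b2 are hypothetical. The dimensions 512 and 2048 are the ones mentioned in the text.

```python
import numpy as np

d_model, d_ff = 512, 2048
rng = np.random.default_rng(0)

# Randomly initialised weights (assumption: a trained model would learn these).
W1 = rng.normal(scale=0.02, size=(d_model, d_ff))
b1 = np.zeros(d_ff)
W2 = rng.normal(scale=0.02, size=(d_ff, d_model))
b2 = np.zeros(d_model)

def feed_forward(x):
    # First linear layer expands 512 -> 2048, followed by ReLU.
    hidden = np.maximum(0.0, x @ W1 + b1)
    # Second linear layer projects 2048 -> 512, restoring the input shape.
    return hidden @ W2 + b2

x = rng.normal(size=(10, d_model))  # 10 positions, one row per position
out = feed_forward(x)

# 'Position-wise': running a single position on its own gives the same row.
single = feed_forward(x[3:4])
print(out.shape, np.allclose(out[3], single[0]))
```

The same weights are shared by every position, so the network never mixes information between positions. That mixing is left to the attention layer.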

This wraps up all the components of the encoder block. Let’s quickly recap the layers of the encoder block in the next video.

Let’s summarise all the steps that we have performed in the encoder block:

  • The input to the encoder comprises both word embeddings and positional encodings.
  • The input is then fed to a ‘multi-head attention’ block, which generates the context vector (Z).
  • This is then fed to an ‘add and normalise’ layer, followed by the ‘position-wise feed-forward’ network and a second ‘add and normalise’ layer. A residual connection is employed around each of the two sub-layers (attention and feed-forward).
  • These steps are repeated six times, in six stacked encoder layers, each building newer representations on top of the previous ones.
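The steps above can be sketched end to end in NumPy. Treat this as a rough illustration rather than a reference implementation. It assumes random untrained weights, no biases or dropout, and small dimensions (64 and 256 in place of 512 and 2048) to keep it quick. The choice of 8 attention heads follows the original paper, and all function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, n_heads, n_layers = 64, 256, 8, 6  # small sizes for illustration

def layer_norm(x, eps=1e-6):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def init_layer():
    # One set of random weights per encoder layer (biases omitted).
    w = lambda m, n: rng.normal(scale=0.02, size=(m, n))
    return {"Wq": w(d_model, d_model), "Wk": w(d_model, d_model),
            "Wv": w(d_model, d_model), "Wo": w(d_model, d_model),
            "W1": w(d_model, d_ff), "W2": w(d_ff, d_model)}

def multi_head_attention(x, p):
    seq_len = x.shape[0]
    d_head = d_model // n_heads

    def split(t):
        # (seq_len, d_model) -> (n_heads, seq_len, d_head)
        return t.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(x @ p["Wq"]), split(x @ p["Wk"]), split(x @ p["Wv"])
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # scaled dot product
    z = softmax(scores) @ v                              # context vectors per head
    z = z.transpose(1, 0, 2).reshape(seq_len, d_model)   # concatenate the heads
    return z @ p["Wo"]

def encoder_layer(x, p):
    # Sub-layer 1: multi-head attention, then add and normalise.
    x = layer_norm(x + multi_head_attention(x, p))
    # Sub-layer 2: position-wise feed-forward, then add and normalise.
    ffn = np.maximum(0.0, x @ p["W1"]) @ p["W2"]
    return layer_norm(x + ffn)

layers = [init_layer() for _ in range(n_layers)]
out = rng.normal(size=(5, d_model))  # stand-in for embeddings + positional encodings
for p in layers:                     # six stacked layers, each with its own weights
    out = encoder_layer(out, p)
print(out.shape)  # the shape is preserved through the whole stack
```

Note that each of the six layers has its own weights. The layers share a structure, not parameters, which is why the output shape must match the input shape at every step.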

The next session will introduce you to the decoder in the transformer architecture and how it works together with the encoder.
