
The Birth of Transformers

Throughout this program, with each module, you have seen how different algorithms have evolved and rendered the previous generation of algorithms obsolete through their superior performance and better results. In the field of NLP, you saw how recurrent networks aided the processing of text with their ability to pass information through recurrent neurons. However, they struggled with long-term dependencies when longer or multiple sentences were fed to them, as they could not retain information over long spans and were thus unable to understand the context of the information fed to them.

This was an issue, particularly in the encoder-decoder setting, where the information has to travel between two sets of recurrent networks. In the previous module, we also raised an important question: Is the last hidden state enough to capture global information pertaining to neural machine translation?

This gap was bridged through the introduction of the attention mechanism, which helped in creating a context vector that can effectively capture the relative importance of each token in a given sequence.

Using the cosine similarity between a decoder’s hidden state and an encoder’s hidden state, an attention network tells us which encoder states are important at a particular timestep. Therefore, the attention mechanism gives the decoder the ability to ‘look back’ at all of the encoder’s hidden states based on its current state. This allows the decoder to extract only the relevant information from the encoder states (keys) for a particular decoder hidden state (query), thus learning more complicated dependencies between the input and the output.
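To make this concrete, here is a minimal NumPy sketch of a single attention step, with toy shapes and data that are purely illustrative: the current decoder state acts as the query, each encoder state acts as a key, the similarity scores are normalised with a softmax, and the weighted sum of the encoder states is the context vector.

```python
import numpy as np

def attention_context(decoder_state, encoder_states):
    # Cosine similarity between the query (decoder state) and each key (encoder state)
    scores = encoder_states @ decoder_state / (
        np.linalg.norm(encoder_states, axis=1) * np.linalg.norm(decoder_state) + 1e-9
    )
    # Softmax turns the similarities into attention weights that sum to 1
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # Context vector: encoder states weighted by their relevance to this query
    return weights @ encoder_states, weights

# Toy example: 5 encoder timesteps, hidden size 8
rng = np.random.default_rng(0)
context, weights = attention_context(rng.normal(size=8), rng.normal(size=(5, 8)))
```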

But even with the advent of such a mechanism, text was still processed sequentially, one token at a time. The transformer architecture was created to resolve this problem. You will understand this in the following video.

As Ankush explained, the design of the transformer architecture is inspired by combining the relational dependencies captured by attention mechanisms with the parallel processing capability of convolutional neural networks (CNNs).

With such an architecture, the model can process all the tokens of a sequence in parallel and produce context vectors that capture the relational dependencies of any input fed to it.

The paper from the Google Brain team – ‘Attention Is All You Need’ – was a breakthrough and one of the most revolutionary papers in the field of NLP. It introduced everyone to the architecture of transformers and changed the progress of research in the domain of natural language understanding (NLU). It proposed a novel approach to encoding positional information and applying the attention mechanism in parallel, thus accelerating training.
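As a quick preview of that positional-encoding idea, here is a minimal NumPy sketch of the sinusoidal encodings described in the paper; the sequence length and model dimension used here are only illustrative.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    # Each position gets a unique pattern of sines and cosines at different
    # frequencies; this is added to the token embeddings so the model can use
    # word order even though it processes all positions in parallel.
    positions = np.arange(max_len)[:, None]                # (max_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]               # (1, d_model/2)
    angles = positions / np.power(10000, dims / d_model)   # (max_len, d_model/2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions: cosine
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=512)  # added to the embeddings
```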

However, the architecture does not look simple at all. Therefore, let’s dissect each block to understand the whole working of a transformer.

The base architecture of a transformer is like a sequence-to-sequence model consisting of two main building blocks: the encoder generates a sequence of embedding vectors z = (z₁, …, zₙ) from an input representation sequence x = (x₁, …, xₙ); using this contextual information z, the decoder generates the output sequence y = (y₁, …, yₘ), one token at a time.
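As a hedged illustration of this two-block layout, the sketch below wires up PyTorch’s built-in encoder and decoder modules; the tensor shapes, vocabulary size, and random inputs are toy placeholders rather than the course’s implementation.

```python
import torch
import torch.nn as nn

# Illustrative sizes (the paper's base model uses d_model = 512, 8 heads, N = 6 layers)
d_model, n_heads, n_layers, vocab_size = 512, 8, 6, 10000
embed = nn.Embedding(vocab_size, d_model)

# Encoder block: maps the input representation x = (x1, ..., xn) to z = (z1, ..., zn)
encoder = nn.TransformerEncoder(nn.TransformerEncoderLayer(d_model, n_heads), num_layers=n_layers)

# Decoder block: uses z (the "memory") plus the tokens generated so far to produce y = (y1, ..., ym)
decoder = nn.TransformerDecoder(nn.TransformerDecoderLayer(d_model, n_heads), num_layers=n_layers)

src = embed(torch.randint(0, vocab_size, (7, 1)))   # (n, batch, d_model): 7 toy input tokens
tgt = embed(torch.randint(0, vocab_size, (3, 1)))   # embeddings of the tokens generated so far
z = encoder(src)                                    # contextual embeddings z
out = decoder(tgt, z)                               # a full model projects this to the vocabulary
```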

Note:

Here, the decoder produces the output using an auto-regressive technique, where the model uses the previously generated words, along with the encoder’s contextual information, to predict the next word.
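A minimal sketch of that auto-regressive loop is shown below; `next_token_scores` is a hypothetical stand-in for a real decoder plus softmax, and the token IDs are illustrative.

```python
import numpy as np

def greedy_decode(next_token_scores, bos_id=1, eos_id=2, max_len=20):
    # Start from a beginning-of-sequence token and repeatedly feed the tokens
    # generated so far back into the model to score the next token.
    output = [bos_id]
    for _ in range(max_len):
        scores = next_token_scores(output)    # distribution over the vocabulary
        next_token = int(np.argmax(scores))   # greedy choice of the next word
        output.append(next_token)
        if next_token == eos_id:              # stop once the model emits end-of-sequence
            break
    return output

# Toy stand-in for a trained decoder: random scores over a 10-word vocabulary
rng = np.random.default_rng(0)
tokens = greedy_decode(lambda prefix: rng.random(10))
```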

Let’s break this architecture down in the next video.

As explained in the video, a transformer has multiple encoder and decoder layers stacked on top of each other (a stack of N = 6 identical layers on each side).

Each layer is composed of sub-layers, whose main components are as follows:

  • A multi-head self-attention mechanism.
  • Positional encoding and a position-wise feed-forward neural network.

We will cover each of these layers and components individually to help you understand the complete architecture at the end.
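As a quick, hedged preview of how these sub-layers fit together before that detailed walkthrough, here is a toy PyTorch sketch of one encoder-style layer and a stack of N = 6 identical copies; the class name and sizes are illustrative only.

```python
import torch
import torch.nn as nn

class ToyEncoderLayer(nn.Module):
    # One layer in the spirit of the paper: a multi-head self-attention sub-layer
    # followed by a position-wise feed-forward sub-layer, each with a residual
    # connection and layer normalisation.
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x):                       # x: (seq_len, batch, d_model)
        attn_out, _ = self.self_attn(x, x, x)   # self-attention: query = key = value = x
        x = self.norm1(x + attn_out)            # residual connection + layer norm
        x = self.norm2(x + self.ffn(x))         # position-wise feed-forward sub-layer
        return x

# The encoder is a stack of N = 6 identical layers
encoder = nn.Sequential(*[ToyEncoderLayer() for _ in range(6)])
out = encoder(torch.randn(7, 1, 512))           # 7 toy tokens, batch size 1
```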

In the next segment, you will understand the encoder layer and its components.

Additional Reading:

How Uber Predicts Arrival Times Using Transformers

Check out this blog from Uber to understand how their ML team has used transformers to predict ETAs.
