As seen earlier, the architecture of the original transformer comprises stacks of encoder and decoder blocks. However, it is not necessary to have both the encoder and the decoder to perform every NLP task. For example, the next-word prediction task has no use for the encoder block. Similarly, to classify texts, you do not need a decoder block. Therefore, the transformer architecture has inspired many researchers to experiment with it and come up with different variants. You will understand them in the next video.
Broadly, you can categorise the variants into three different categories:
Autoregressive models
- They correspond to the decoder of the original transformer model.
- They are pre-trained on the classic language modelling task, i.e., predicting the next word in a sequence.
- Their most natural application is text generation, e.g., the GPT family.
Autoencoding models
- They correspond to the encoder of the original transformer model.
- They are pre-trained by masking the input tokens and training the model to reconstruct the original sentence.
- Their most natural application is sentence classification or token classification, e.g., the BERT family.
Sequence-to-sequence models
- They use both the encoder and the decoder of the original transformer.
- Their most natural applications are translation, summarisation and question answering.
- The original transformer model is an example of such a model (it was trained only for translation). Other examples include T5, BART, etc. A short code sketch illustrating these three families follows the list.
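To make the three families concrete, here is a minimal sketch (assuming the Hugging Face transformers library, which is introduced later in this session) that loads one representative checkpoint from each family; the specific checkpoints chosen here are illustrative, not prescriptive.

```python
# Minimal sketch: one representative pre-trained checkpoint per transformer family.
from transformers import (
    AutoModelForCausalLM,    # autoregressive (decoder-only), e.g. the GPT family
    AutoModelForMaskedLM,    # autoencoding (encoder-only), e.g. the BERT family
    AutoModelForSeq2SeqLM,   # sequence-to-sequence (encoder + decoder), e.g. T5/BART
)

autoregressive = AutoModelForCausalLM.from_pretrained("gpt2")
autoencoding = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
seq2seq = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
```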
You can check this link to understand other variants of the transformer architecture.
As you saw in the deep learning course, it is not advisable to build a transformer architecture and train it from scratch. Transformer models are very large, often with billions of parameters, so training them from scratch is rarely practical. Instead, you can use the principle of transfer learning to leverage the power of pre-trained models. As you learnt earlier, in most cases of transfer learning, the body of a pre-trained model is reused with a new model head. The new head, designed for the task at hand, builds on the representations learnt by the earlier layers, which reduces the overall training time and improves performance. Watch the next video to learn more about this.
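As a rough illustration of this idea, the sketch below reuses a pre-trained BERT body and attaches a new, randomly initialised classification head; the checkpoint name and the two-class setup are assumptions made only for the example.

```python
# Transfer learning sketch: pre-trained body + new task-specific head.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",  # pre-trained encoder body (weights are reused)
    num_labels=2,         # new classification head for a hypothetical 2-class task
)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# Only the head is initialised from scratch; fine-tuning this model on labelled
# task data takes far less time than training the whole network from scratch.
```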
Many models have been built by developers and researchers to perform numerous NLP tasks. You can check them out in the Hugging Face model hub. All you need to do is pick the domain you want to work with and select the model. Voila!
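For instance, a sentiment-analysis model picked from the hub can be used in a few lines through the pipeline API; the checkpoint named below is just one example of a fine-tuned model available on the hub.

```python
# Using a model from the Hugging Face model hub via the pipeline API.
from transformers import pipeline

classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",  # example checkpoint
)
print(classifier("Transformers make NLP tasks much easier."))
# Expected output shape: [{'label': 'POSITIVE', 'score': ...}]
```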
One of the most amazing features of an autoregressive transformer model, when it comes to training, is that it does not need separately labelled targets. The input sequence itself, shifted by one position, serves as the ground truth. Ankush will help you understand this in the next video.
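A minimal sketch of this idea, assuming the Hugging Face transformers library and the GPT-2 checkpoint: for causal language models, passing the input ids as the labels is enough, because the model shifts the targets internally and computes the next-token prediction loss.

```python
# Autoregressive training needs no separate labels: the input doubles as the target.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The cat sat on the mat", return_tensors="pt")
outputs = model(input_ids=inputs["input_ids"], labels=inputs["input_ids"])
print(outputs.loss)  # next-word prediction loss, computed without any extra labels
```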
For most NLP problems, we follow the approach of first pre-training a model and then fine-tuning it. Using a large amount of unlabelled text, you can build a general model that is capable of understanding language. Once this is achieved, the same model can be fine-tuned for specific tasks such as text generation, machine translation and text summarisation.
As explained in the video, there are two popular pre-training methods, both of which use a part of the input sequence itself as the target to teach the model.
Causal language models
This type of modelling helps predict the next token in a sequence of tokens, and the model can only attend to tokens on the left. Such a training scheme makes this model unidirectional in nature.
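The unidirectional behaviour comes from a causal attention mask. Here is a small sketch (using PyTorch, purely for illustration) of what that mask looks like.

```python
# Causal attention mask: position i may attend only to positions <= i.
import torch

seq_len = 5
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
print(causal_mask)
# Row i is True only up to column i, so a token never "sees" tokens to its right,
# which is what makes causal language models unidirectional.
```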
Masked language models
This type of modelling masks a word or a certain percentage of words in a given sentence, and the model is trained to predict those masked words based on the rest of the words in the sentence. Such a training scheme makes this model bidirectional in nature because the representation of the masked word is learnt based on the words that occur to its left and right.
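To see bidirectional prediction in action, a masked word can be filled in with a pre-trained BERT model through the fill-mask pipeline; the checkpoint and the example sentence below are illustrative assumptions.

```python
# Masked language modelling: predict the [MASK] token from both left and right context.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for prediction in fill_mask("The capital of France is [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))
# The top candidate is typically "paris", inferred from the surrounding words.
```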
This brings us to the end of our theoretical discussion on the architecture of transformers. In the next segment, let’s summarise all the learnings of this session.