
Understanding the Encoder-Decoder Architecture

Before proceeding further, let’s take a pause and play a translation game. In the next video, Mohit will introduce you to the rules of the game and how it is played.

As explained in the video, the teacher passes a set of English words to a group of students belonging to a particular class. Each student takes in the new word passed by the teacher along with the information received from the preceding student and, after combining both, whispers the message into the next student's ear. Once the set of words is finished, the teacher ends the communication by sending a stop signal to the student at the end of the group.

It is this last student's responsibility to pass the entire message to a student of another class. The job of the new class is to take in this information and produce the required translation in Hindi. The entire message acts as the context for this new class, but each of its students contributes only one word of the translation. Once a student produces a translated word, the next student hears it and provides the following word. This continues until a student produces a stop signal, which halts the translation. Thus, the full set of words produced by the new class acts as the translation of the input set provided by the teacher.

Here, the first group of students together acts as the encoder for the game, while the second class plays the decoder that provides the required translation.

This translation game is analogous to the sequence-to-sequence (seq2seq) model, where each student represents a cell of a Recurrent Neural Network (RNN).

The next video will help you connect the dots between the game and the seq2seq model.

A sequence-to-sequence (seq2seq) model follows an encoder-decoder architecture and is made up of two RNNs. Both the encoder and the decoder consist of a series of RNN cells, where each cell's output is fed as input to the next cell.

Encoder

The encoder RNN takes the input sequence and encodes it into a fixed-size context vector. It does this by reading the input tokens one at a time. The context vector, which is the final cell's hidden state, is then fed to the decoder as its input.
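The reading loop above can be sketched as a tiny Elman-style RNN in NumPy. All sizes, token IDs, and weight initialisations here are illustrative assumptions, not values from the lesson; the point is only that the loop reads tokens one at a time and always ends with a hidden state of the same fixed size.

```python
import numpy as np

# Illustrative dimensions and random weights (assumptions for the sketch).
rng = np.random.default_rng(0)
emb_dim, hid_dim, vocab = 8, 16, 50

E   = rng.normal(0, 0.1, (vocab, emb_dim))    # embedding table
W_x = rng.normal(0, 0.1, (hid_dim, emb_dim))  # input-to-hidden weights
W_h = rng.normal(0, 0.1, (hid_dim, hid_dim))  # hidden-to-hidden weights
b   = np.zeros(hid_dim)

def encode(token_ids):
    """Read the input tokens one at a time; the final hidden state
    serves as the fixed-size context vector."""
    h = np.zeros(hid_dim)
    for t in token_ids:
        h = np.tanh(W_x @ E[t] + W_h @ h + b)
    return h

context = encode([4, 11, 7, 2])  # any input length ...
print(context.shape)             # ... same fixed context size: (16,)
```

Note that an input of one token and an input of a hundred tokens both compress into the same 16-dimensional vector, which is exactly the property the decoder relies on.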

One of the attractive features of the seq2seq model is its ability to turn a sequence of words into a vector of fixed dimensionality (the context vector). The plot below shows some of the learned representations of input sequences; the phrases in the plot are clustered according to their meaning (which also captures word order).

Decoder

The decoder RNN uses this context vector to generate the output sequence. The first cell of the decoder is initialised with the hidden state received from the encoder. The decoder acts as a language model, as it predicts the next word based on its previous prediction and the hidden state passed from the cell at the previous time step.
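A minimal greedy-decoding sketch of this loop, again in NumPy with assumed sizes and random weights (so the output IDs are not a real translation). The first hidden state is the context vector, and each step feeds the previous prediction back in:

```python
import numpy as np

# Illustrative dimensions, token IDs, and random weights (assumptions).
rng = np.random.default_rng(1)
emb_dim, hid_dim, vocab = 8, 16, 50
SOS, EOS = 0, 1  # hypothetical start/end token IDs

E   = rng.normal(0, 0.1, (vocab, emb_dim))
W_x = rng.normal(0, 0.1, (hid_dim, emb_dim))
W_h = rng.normal(0, 0.1, (hid_dim, hid_dim))
W_o = rng.normal(0, 0.1, (vocab, hid_dim))  # hidden -> vocabulary scores

def decode(context, max_len=10):
    h, y, out = context, SOS, []            # first cell starts from the context
    for _ in range(max_len):
        h = np.tanh(W_x @ E[y] + W_h @ h)   # condition on the previous prediction
        y = int(np.argmax(W_o @ h))         # greedily pick the next word
        if y == EOS:
            break
        out.append(y)
    return out

predicted_ids = decode(np.zeros(hid_dim))
```

In a trained model the argmax (or a sampled choice) over real learned weights would produce target-language words; here it only demonstrates the feedback loop.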

The entire seq2seq/NMT model is called a conditional language model:

  • Conditional, as the decoder's prediction is based on the context input (the condition) it has received from the encoder.
  • Language model, as the decoder predicts the next word based on its previous predictions.

NOTE:

A language model estimates the unconditional probability of a sequence, p(yt|y1,…,yt−1), whereas a seq2seq model estimates the conditional probability p(yt|y1,…,yt−1, x) of a sequence given a source x.

To understand the decoding process, assume the input sentence is represented by (x1, x2, …, xn) and the target by (y1, y2, …, ym). At timestep t, the model produces a probability distribution p(∗|y1, y2, …, yt−1, x1, x2, …, xn).
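Concretely, that distribution is usually obtained by pushing the decoder's hidden state through an output layer and a softmax. A small sketch (the logit values are made-up numbers for a toy 4-word vocabulary):

```python
import numpy as np

def softmax(z):
    """Turn arbitrary scores into a probability distribution."""
    z = z - z.max()          # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Illustrative scores for 4 vocabulary words at one decoding timestep.
logits = np.array([2.0, 1.0, 0.5, 0.1])
p = softmax(logits)

print(round(float(p.sum()), 6))  # a valid distribution: 1.0
```

The word with the highest score gets the highest probability, and the model can either pick it greedily or sample from p.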

You must have noticed that once the input tokens are read by the encoder, we pass it a special token, <stop>/<end>. This token signals the encoder to stop encoding and pass the final hidden state to the decoder. The special token can be named <eos> (end of sentence) or anything else, as it is just an indicator for the model's convenience.


Along with the context vector, the decoder also receives a special token. In this case, the <start>/<sos> (start of sentence) token signals the model to start decoding. Once it has produced the relevant translation, it generates an <end>/<eos> token indicating the end of the target sentence.
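The token bookkeeping above amounts to a small pre-processing step. A sketch, assuming a hypothetical `prepare` helper and the English–Hindi example sentence (the token names `<sos>`/`<eos>` are conventions, not fixed by any framework):

```python
SOS, EOS = "<sos>", "<eos>"  # assumed token names; any marker works

def prepare(source, target):
    """Attach the special tokens each side of the model expects."""
    enc_in  = source + [EOS]   # encoder stops when it reads <eos>
    dec_in  = [SOS] + target   # decoder starts generating from <sos>
    dec_out = target + [EOS]   # decoder learns to emit <eos> at the end
    return enc_in, dec_in, dec_out

enc_in, dec_in, dec_out = prepare(["how", "are", "you"], ["kaise", "ho"])
print(enc_in)   # ['how', 'are', 'you', '<eos>']
print(dec_in)   # ['<sos>', 'kaise', 'ho']
print(dec_out)  # ['kaise', 'ho', '<eos>']
```

Shifting the target by one position (decoder input starts with `<sos>`, decoder output ends with `<eos>`) is what lets the model learn both when to start and when to stop.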

Based on the input and output, you can see that the NMT architecture can handle variable-length inputs and outputs. To produce an efficient translation, the NMT model must satisfy certain conditions, which we will discuss in the next segment – Requirements of NMT Architecture.
