We have seen the architecture and how it works. But how do we train the model?
Let’s quickly go through the entire process once again.
- The encoder in the NMT model takes in the input sequence and produces a context vector, which is the encoder's representation of the entire sequence/sentence. Here, we keep only the final hidden state and discard the outputs coming from the encoder RNN.
- It takes in a sequence of embedding vectors produced by the embedding layer from the list of token IDs generated by the pre-processing layer. Along with the special tokens, the input tokens are processed by each layer of the GRU (the RNN variant considered here), where each layer's outputs serve as the input sequence to the next layer. The final hidden state of the encoder serves as the context/conditioning for the decoder.
- The decoder, once initialised with the context vector, is a conditional language model which produces an output based on the input it receives (the <start> token).
- Along with the output, the GRU produces a hidden state, which is fed to the next step. The dense layer placed after the GRU applies a linear transformation to the GRU's output and, after a softmax, you get a probability for every word in the vocabulary. The word with the highest probability is selected as the prediction for that timestep.
- This predicted word is passed as the input to the next step of the decoder, and the GRU repeats the generation. This continues until the <end>/<stop> token is produced, which signifies the end of the translation (see the sketch after this list).
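To make this flow concrete, here is a minimal TensorFlow/Keras sketch of the encoder producing a context vector and the decoder greedily generating a translation from it. The layer sizes, the `START_ID`/`END_ID` token IDs, and the `greedy_decode` helper are illustrative assumptions for this sketch, not the exact implementation discussed above.

```python
import tensorflow as tf

# Hypothetical sizes chosen for illustration only
vocab_size, embedding_dim, units = 5000, 256, 512
batch_size, max_length = 64, 10

# --- Encoder: embed token IDs, run them through a GRU, keep only the final state ---
encoder_embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
encoder_gru = tf.keras.layers.GRU(units, return_state=True)

source_ids = tf.random.uniform((batch_size, max_length), maxval=vocab_size, dtype=tf.int32)
# return_state=True gives (outputs, final_state); the outputs are discarded,
# and the final hidden state becomes the context vector.
_, context_vector = encoder_gru(encoder_embedding(source_ids))

# --- Decoder: a conditional language model initialised with the context vector ---
decoder_embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
decoder_gru = tf.keras.layers.GRU(units, return_state=True)
output_layer = tf.keras.layers.Dense(vocab_size)  # one score per vocabulary word

START_ID, END_ID = 1, 2  # placeholder special-token IDs (assumption)

def greedy_decode(context, max_steps=20):
    """Generate one translation per batch element by greedy decoding."""
    state = context
    token = tf.fill([tf.shape(context)[0], 1], START_ID)  # start with <start>
    result = []
    for _ in range(max_steps):
        # One GRU step: the previous prediction is the input, and the
        # hidden state carries over to the next timestep.
        out, state = decoder_gru(decoder_embedding(token), initial_state=state)
        logits = output_layer(out)  # (batch, vocab_size); softmax would turn these into probabilities
        token = tf.argmax(logits, axis=-1, output_type=tf.int32)[:, tf.newaxis]
        result.append(token)
        if tf.reduce_all(tf.equal(token, END_ID)):
            break  # every sequence in the batch has emitted <end>
    return tf.concat(result, axis=1)

translations = greedy_decode(context_vector)
```

Note that argmax over the logits picks the same word as argmax over the softmax probabilities, which is why the dense layer here outputs raw logits; the softmax matters during training, when a probability distribution is compared against the target word.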
The input sequence to the NMT model needs to be of a fixed length; generally, it is limited to the maximum length of a sentence present in the sample data. The input data should have the shape (batch_size, max_length). Therefore, for sentences shorter than the maximum length, padding (empty tokens) is added to make all sequences in a batch fit the given standard length.
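As a quick sketch of that step, Keras provides a `pad_sequences` utility that pads a batch of unequal-length token-ID lists into exactly this (batch_size, max_length) shape; the token IDs below are made up for illustration.

```python
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Hypothetical token-ID sequences of unequal length
batch = [[12, 7, 93, 4], [5, 88, 2], [41, 6, 19, 73, 2]]

# Append zeros ("empty" tokens) so every sequence matches the longest one
padded = pad_sequences(batch, padding='post')
print(padded)
# [[12  7 93  4  0]
#  [ 5 88  2  0  0]
#  [41  6 19 73  2]]
print(padded.shape)  # (3, 5) -> (batch_size, max_length)
```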
Consider the example consisting of tokenized words: