
Model Training and Prediction

We have seen the architecture and how it works. But how do we train the model?

Let's quickly go through the entire process once again.

  • The encoder in the NMT model takes in the input sequence and produces a context vector, which is the encoded representation of the entire sequence/sentence. Here, we only keep the hidden state and discard the outputs coming from the encoder RNN. 
  • It takes in a sequence of embedding vectors produced by the embedding layer from the list of token IDs generated by the pre-processing layer. Along with the special tokens, the input tokens are processed by each layer of the GRU (the variant of the RNN considered here), where each layer’s outputs form the input sequence to the next layer. The last hidden state of the encoder serves as the context/conditioning for the decoder.
  • The decoder, once initialised with the context vector, is a conditional language model which produces an output based on the input it received (<start> token). 
  • Along with the output, the GRU produces a hidden state, which is fed to the next cell. The dense layer placed after the GRU does a linear transformation on the previous output and, thus, you get a list of probability values for all the words present in the vocabulary. The word with the highest probability is selected as the prediction for that timestep.
  • This predicted word is passed as an input to the next cell of the decoder, and the GRU repeats the generation. The sequence continues until we receive the <end>/<stop> token, which signifies the end of the translation. (A minimal code sketch of this encoder-decoder setup follows this list.)
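
To make this concrete, here is a minimal sketch of such an encoder and decoder in TensorFlow/Keras. The vocabulary size, embedding dimension and number of GRU units are illustrative values, not the ones used in the course notebook.

```python
import tensorflow as tf

class Encoder(tf.keras.Model):
    def __init__(self, vocab_size=5000, embedding_dim=256, units=512):
        super().__init__()
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        # return_state=True lets us keep the final hidden state (the context vector)
        self.gru = tf.keras.layers.GRU(units, return_sequences=True, return_state=True)

    def call(self, token_ids):
        x = self.embedding(token_ids)       # (batch, max_length, embedding_dim)
        outputs, state = self.gru(x)        # outputs are discarded by the caller
        return outputs, state               # state: (batch, units) -> context vector


class Decoder(tf.keras.Model):
    def __init__(self, vocab_size=5000, embedding_dim=256, units=512):
        super().__init__()
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = tf.keras.layers.GRU(units, return_sequences=True, return_state=True)
        # Dense layer projects each GRU output to a score for every word in the vocabulary
        self.fc = tf.keras.layers.Dense(vocab_size)

    def call(self, token_ids, initial_state):
        x = self.embedding(token_ids)
        outputs, state = self.gru(x, initial_state=initial_state)
        logits = self.fc(outputs)           # (batch, seq_len, vocab_size)
        return logits, state
```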

To summarise this, Mohit will recap how the entire NMT architecture looks.

The input sequence to the NMT model needs to be of fixed length, and it is generally limited to the maximum length of a sentence present in the sample data. The shape of the input should be (batch_size, max_length). Therefore, for sentences shorter than the maximum length, padding (empty tokens) is added so that all sequences in a batch fit the given standard length.

In the next video, Mohit will explain how to add the padding.

Consider the example consisting of tokenized words:

Converting these sentences to their respective tokens, the data will look as follows:

From the mentioned list of samples, it can be seen that the maximum length is 8. So, samples shorter than 8 need to be padded with empty (0) tokens.

With the help of tf.keras.preprocessing.sequence.pad_sequences, you can pad the samples to the maximum length.

Note

Since the padding is kept as ‘post’, all the empty tokens are added after each sequence. It is recommended to keep post padding when working with RNN layers.
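
For instance, with a few made-up token-ID sequences whose maximum length is 8, post padding could look as follows (the token IDs are illustrative, not from the course data):

```python
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Illustrative token-ID sequences of different lengths; the longest has 8 tokens.
samples = [
    [12, 7, 45, 3],
    [5, 98, 23, 7, 61, 2],
    [8, 14, 3, 77, 21, 9, 30, 4],   # max length = 8
]

# padding='post' appends the empty (0) tokens after each sequence,
# which is the recommended setting when feeding RNN layers.
padded = pad_sequences(samples, maxlen=8, padding='post')
print(padded)
# [[12  7 45  3  0  0  0  0]
#  [ 5 98 23  7 61  2  0  0]
#  [ 8 14  3 77 21  9 30  4]]
```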

Now that you have understood how to pre-process the data, let’s look into the whole training process.

Model Training

As we have seen earlier, the decoder is trained to predict the next token of the target sequence given the previous prediction. But what if, for a target sequence “A boy climbing up the ladder”, the model at the second timestep produces a wrong prediction: “dog” instead of “boy”? At the third timestep, this wrong prediction will be fed back into the model and will lead to another wrong prediction. This process continues, the loss accumulates at each timestep, and it becomes difficult for the model to learn.

Once the entire sequence is predicted, we may get a totally different prediction – “A dog running on the streets”. 

Let’s try to understand this using another example. Imagine an exam where you are presented with a question that has multiple parts. Each question is based on the answer to the previous question. So, if you answer a question incorrectly, it will carry forward to the question following it, leading to an overall poor score.

What if, for each question, your teacher checks your response immediately and lets you know the correct answer? With this kind of approach, you will always have a better chance of answering the next question. 

When the same analogy is applied to model training, it is called teacher forcing, where the model is fed the correct target token at each timestep, regardless of the model’s prediction at the previous timestep. With this technique, the model becomes more robust than when it is sometimes fed its own predictions from the previous timestep.

Therefore, the decoder learns to generate target[ : i+1] given target[ : i], which helps the model converge faster.
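
A rough sketch of one teacher-forced training step is given below. It assumes the hypothetical Encoder and Decoder classes sketched earlier, uses sparse categorical cross-entropy on the shifted target sequence, and masks out the padded (0) positions; the exact implementation in the course notebook may differ.

```python
import tensorflow as tf

optimizer = tf.keras.optimizers.Adam()
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction='none')

def train_step(encoder, decoder, source_ids, target_ids):
    with tf.GradientTape() as tape:
        _, context = encoder(source_ids)
        # Teacher forcing: the decoder input is the ground-truth target shifted
        # right by one position; the model never sees its own predictions here.
        decoder_input = target_ids[:, :-1]
        decoder_target = target_ids[:, 1:]
        logits, _ = decoder(decoder_input, initial_state=context)
        # Mask out the padded (0) positions so they do not contribute to the loss.
        mask = tf.cast(tf.not_equal(decoder_target, 0), tf.float32)
        per_token_loss = loss_object(decoder_target, logits) * mask
        loss = tf.reduce_sum(per_token_loss) / tf.reduce_sum(mask)
    variables = encoder.trainable_variables + decoder.trainable_variables
    gradients = tape.gradient(loss, variables)
    optimizer.apply_gradients(zip(gradients, variables))
    return loss
```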

Model Inference

During inference, i.e., during testing, we change the prediction process slightly.

  • Once the decoder receives the context vector, it is fed the input token <start>.
  • Based on this input, the trained decoder produces a list of predictions (probability scores) for the first word.
  • By taking the argmax of these predictions, the next word is computed and appended to an empty list ‘Result’.
  • This previous prediction, stored in ‘Result’, is then sent as an input to the model at the next timestep.
  • The entire process is repeated until the model produces an <end> token, which indicates the end of the prediction process. (A greedy-decoding sketch of this loop follows the list.)
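
Below is a rough greedy-decoding sketch of this loop, again assuming the hypothetical Encoder and Decoder classes from earlier and illustrative token IDs for <start> and <end>.

```python
import tensorflow as tf

# Illustrative ids: 1 for <start>, 2 for <end>; the real ids come from the tokenizer.
def translate(encoder, decoder, source_ids, start_id=1, end_id=2, max_steps=40):
    _, context = encoder(source_ids)            # context vector from the encoder
    next_token = tf.constant([[start_id]])      # shape (1, 1): the <start> token
    state = context
    result = []                                 # the 'Result' list from the steps above
    for _ in range(max_steps):
        logits, state = decoder(next_token, initial_state=state)
        # argmax over the vocabulary gives the predicted word for this timestep
        predicted_id = int(tf.argmax(logits[0, -1]).numpy())
        if predicted_id == end_id:              # stop at the <end> token
            break
        result.append(predicted_id)
        # feed the prediction back in as the next decoder input
        next_token = tf.constant([[predicted_id]])
    return result
```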

Loss Function

During training, at each timestep the objective is to maximise the probability that the model assigns to the correct word. This probability can be written as

$$P(y_t \mid y_1, y_2, \dots, y_{t-1}, x_1, x_2, \dots, x_n)$$

at a timestep t.

The loss function (i.e., the cross-entropy loss) compares this predicted distribution $P$ with the target distribution $\hat{P}$ (which is represented by a list of one-hot encoded values).

$$\mathrm{Loss}(\hat{P}, P) = -\hat{P}\log(P) = -\sum_{i=1}^{|V|}\hat{P}_i \log(P_i),$$

where $|V|$ is the total vocabulary size.

Since the target $\hat{P}$ contains only one non-zero value (at the correct word $y_t$), all the other products become zero. 

$$\mathrm{Loss}(\hat{P}, P) = -\log(P(y_t)) = -\log(P(y_t \mid y_1, y_2, \dots, y_{t-1}, x))$$

Therefore, in order to minimise this loss, the probability of the correct word is maximised.
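
As a toy numerical check of this formula, the snippet below computes the cross-entropy for a single timestep with a made-up five-word vocabulary and made-up probabilities; only the term for the correct word survives.

```python
import numpy as np

# The correct word y_t sits at index 2 of this hypothetical 5-word vocabulary.
p_hat = np.array([0.0, 0.0, 1.0, 0.0, 0.0])    # one-hot target distribution
p = np.array([0.05, 0.15, 0.60, 0.15, 0.05])   # model's predicted distribution

# Every term with p_hat_i = 0 vanishes, so the loss reduces to -log(P(y_t)).
loss = -np.sum(p_hat * np.log(p))
print(loss)                                    # 0.5108..., i.e. -log(0.60)
```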

The loss that you have calculated is for an individual word. For the entire sequence, you accumulate the loss over all the words in the sequence:

$$\mathrm{Loss} = -\sum_{t=1}^{n}\log(P(y_t \mid y_1, y_2, \dots, y_{t-1}, x))$$

The illustration below demonstrates how the loss is calculated at each timestep.

Now that you have understood the whole model training process, let’s look into the evaluation metric that we employ for the predictions: the BLEU score.

BLEU (BiLingual Evaluation Understudy) is a metric for automatically evaluating machine-translated text. The BLEU score measures how closely the predicted sentence matches a set of high-quality reference translations and is represented by a value between zero and one. A value of 0 (low quality) signifies that there is no overlap between the predicted/machine-translated output and the reference translation, while a value of 1 (high quality) means there is a perfect overlap between the two. The BLEU score correlates well with human judgment of translation quality and is therefore a good benchmark for determining the quality of machine translation.


In order to capture the quality of the prediction better, the BLEU score is computed as the fraction of n-grams in the predicted sentence that appear in the ground truth. You can choose the number of words to be matched: for example, single words (1-gram), pairs (2-gram), triplets (3-gram) or quadruplets (4-gram). The default BLEU calculates a score for up to 4-grams using uniform weights (and is thus called BLEU-4).

The weights for each gram order are specified as a tuple, where each index refers to the gram order. For 1-gram matches only, you can specify a weight of 1 for 1-grams and 0 for the other indexes: (1, 0, 0, 0).
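
For illustration, the snippet below computes the BLEU score with NLTK's sentence_bleu (one possible implementation, not necessarily the one used in the course), once with the default BLEU-4 weights and once with 1-gram weights only.

```python
from nltk.translate.bleu_score import sentence_bleu

# sentence_bleu expects a list of tokenised reference sentences and a
# tokenised candidate; this reference/candidate pair is made up for illustration.
reference = [['a', 'boy', 'is', 'climbing', 'up', 'the', 'ladder']]
candidate = ['a', 'boy', 'is', 'climbing', 'up', 'the', 'ladder']

# Default weights (0.25, 0.25, 0.25, 0.25) give BLEU-4; a perfect overlap scores 1.0.
print(sentence_bleu(reference, candidate))                         # 1.0

# 1-gram matching only: weight 1 for 1-grams and 0 for the other orders.
print(sentence_bleu(reference, candidate, weights=(1, 0, 0, 0)))   # 1.0
```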

Now that you understand all the basics of NMT, let’s see how you can implement a traditional machine translation system in TensorFlow in the next segment.
