In the previous segment, you studied different types of RNN architectures. Let's now look at how RNNs are trained. The training procedure differs slightly across architectures; let's walk through the process and these differences.
You saw that the loss calculation depends on the type of task and the architecture. In a many-to-one architecture (such as classifying a sentence as correct/incorrect), the loss is computed by comparing the single predicted label with the actual label. The loss is computed and backpropagated only after the entire sequence has been digested by the network.
On the other hand, in a many-to-many architecture, the network emits an output at multiple time steps, and the loss is calculated at each time step. The total loss (=the sum of the losses at each time step) is propagated back into the network after the entire sequence has been ingested.
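The difference in where the loss comes from can be sketched in plain Python. This is a toy illustration, not RNN training code: `step_preds` stands in for hypothetical softmax outputs of the network at each time step over three classes, and the labels are made up.

```python
import math

def cross_entropy(pred, label):
    """Cross-entropy of a predicted distribution against a true class index."""
    return -math.log(pred[label])

# Hypothetical per-step softmax outputs over 3 classes, at 4 time steps.
step_preds = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.3, 0.5],
    [0.6, 0.3, 0.1],
]

# Many-to-one: only the final output is compared with the single sequence label.
label = 0
loss_many_to_one = cross_entropy(step_preds[-1], label)

# Many-to-many: there is a label at every step; the per-step losses are summed.
labels = [0, 1, 2, 0]
loss_many_to_many = sum(cross_entropy(p, t) for p, t in zip(step_preds, labels))
```

In both cases, the resulting scalar is backpropagated only after the whole sequence has been processed; the architectures differ in how many output steps contribute to it.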
Let’s formally write down the loss expressions for a general RNN-based network. Let the network have T1 input time steps and T2 output time steps. The input and output sequences are thus (x1, x2, …, xT1) and (y1, y2, …, yT2), where T1 and T2 can be unequal.
In many-to-one architectures, the output length is T2 = 1, i.e. there is only a single output y_out for each sequence. If the actual correct label of the sequence is y, then the loss L for each sequence is (assuming a cross-entropy loss):
L = cross-entropy(y_out, y)
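As a concrete (hypothetical) instance: when the target y is a one-hot label, the cross-entropy reduces to the negative log of the probability the network assigns to the correct class.

```python
import math

# Hypothetical softmax output of the network over 3 classes for one sequence.
y_out = [0.1, 0.7, 0.2]
y = 1  # index of the actual correct class

# With a one-hot target, cross-entropy reduces to -log(prob. of the true class).
L = -math.log(y_out[y])
```

Note that the loss is small when the network assigns high probability to the correct class, and grows without bound as that probability approaches zero.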
In a many-to-many architecture, if the actual output is (y1, y2, …, yT2) and the predicted output is (y′1, y′2, …, y′T2), then the loss L for each sequence is:

L = ∑_{i=1}^{T2} cross-entropy(y′i, yi)
We can now add the losses for all sequences (i.e. for a batch of input sequences) and backpropagate the total loss into the network.
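Summing over a batch can be sketched as follows. The per-sequence losses here are the many-to-many losses defined above, with made-up numbers for the predictions and labels; in a real framework, the backward pass would then propagate this single scalar through the network.

```python
import math

def cross_entropy(pred, label):
    # pred: predicted probability distribution; label: true class index
    return -math.log(pred[label])

def sequence_loss(step_preds, step_labels):
    # Many-to-many loss for one sequence: sum of per-step cross-entropies.
    return sum(cross_entropy(p, t) for p, t in zip(step_preds, step_labels))

# Hypothetical batch of two sequences (3 classes, 2 output time steps each).
batch_preds = [
    [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]],
    [[0.5, 0.3, 0.2], [0.2, 0.2, 0.6]],
]
batch_labels = [[0, 1], [0, 2]]

# Total batch loss: this single scalar is what gets backpropagated.
total_loss = sum(sequence_loss(p, t) for p, t in zip(batch_preds, batch_labels))
```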
In the next section, you’ll learn about some other RNN architectures.