
Comprehension: RNN Architecture

In a previous section, you studied the feedforward equations of RNNs. Let's now analyse the architecture in a bit more detail: we'll compute the dimensions of the weight matrices, the biases, the outputs of the layers, etc. We will also look at a concise form of writing the feedforward equations.

As you already know, the architecture of an RNN and its feedforward equations are as follows:
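For reference, the feedforward equations (written out again in the 'RNNs: Simplified Notations' part below) are:

$$z^l_t = W^l_F\, a^{l-1}_t + W^l_R\, a^l_{t-1} + b^l$$

$$a^l_t = f^l\!\left(z^l_t\right)$$

Here $a^l_t$ is the output of layer $l$ at time step $t$ and $f^l$ is the activation function of layer $l$.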

The $W_F$'s are the feedforward weights, which propagate information from one layer to the next (left to right). The $W_R$'s are the recurrent weights, which propagate information across the time dimension. Each layer $l$ also has a bias $b^l$.

Let's now analyse the dimensions of the weight matrices and the biases. You're already aware of the dimensions of the weights $W_F$ and the bias term $b$ from the first module on neural networks.

Thus, the $W_F$'s are the usual feedforward weights. Now let's look at the dimensions of the other entities.

You know that the recurrent weights $W_R$ connect a layer to itself across different time steps. For example, the recurrent weight matrix of layer 3, $W^3_R$, connects the outputs of the third layer from one time step to the next: $a^3_1$ to $a^3_2$, $a^3_2$ to $a^3_3$, and so on. Notice that each of these outputs has the same size (it is the same layer, so the number of neurons at each time step is the same, and hence the output size is the same). Thus, all the $W_R$'s are square matrices, and the size of the output vector $a^l_{t+1}$ (for any $l$ and $t$) is the same as that of $a^l_t$. In other words, the recurrent operation does not change the size of the output.

The bias $b^l$ of each layer, as usual, has size equal to the number of neurons in that layer.

You can now easily figure out the number of parameters and the shapes of the outputs at each layer. Since data is usually fed in batches (of size $m$), the output of layer $l$ at time $t$, $a^l_t$, is a matrix of size (number of neurons in layer $l$, $m$). To summarise, if layer $l$ has $n_l$ neurons, then $W^l_F$ has shape $(n_l, n_{l-1})$, $W^l_R$ has shape $(n_l, n_l)$, $b^l$ has shape $(n_l, 1)$ and $a^l_t$ has shape $(n_l, m)$. Please pay careful attention to the batch size.
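As a quick sanity check, here is a minimal NumPy sketch of a single step of a recurrent layer (the layer sizes are hypothetical, not taken from the material); it only verifies that the shapes above are consistent:

```python
import numpy as np

# Hypothetical sizes: neurons in layer l-1, neurons in layer l, batch size
n_prev, n_l, m = 5, 4, 32

W_F = np.random.randn(n_l, n_prev)   # feedforward weights of layer l
W_R = np.random.randn(n_l, n_l)      # recurrent weights of layer l (square)
b = np.random.randn(n_l, 1)          # bias of layer l

a_prev_t = np.random.randn(n_prev, m)   # a^{l-1}_t: output of layer l-1 at time t
a_l_tprev = np.random.randn(n_l, m)     # a^l_{t-1}: output of layer l at time t-1

z = W_F @ a_prev_t + W_R @ a_l_tprev + b  # z^l_t
a = np.tanh(z)                            # a^l_t (tanh used as an example activation)

print(z.shape, a.shape)  # (4, 32) (4, 32): both are (n_l, m)
```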

To answer the following questions, consider a neural network with three neurons in the input layer (layer-0), 7 neurons in the only hidden layer (layer-1) and a single neuron in the output softmax layer (layer-2). Consider a batch size of 64. We'll denote the parameters of layers 0, 1 and 2 as $(W^0, b^0)$, $(W^1, b^1)$ and $(W^2, b^2)$ respectively.

This network is being trained to classify each input sentence as grammatically correct or incorrect. The sequence size, that is, the number of words in a sentence (= the number of time steps T) is 10.

If you are thinking ‘what if all sentences do not have exactly 10 words?’ – hold on, we will discuss that shortly. Hint: padding.
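If you'd like to verify your own calculations, here is a rough sketch that computes the shapes and parameter counts for this network, assuming (as in the general equations above) that every non-input layer has feedforward weights, recurrent weights and a bias:

```python
# Shapes and parameter counts for the example network, under the assumption
# that every non-input layer has W_F, W_R and b (as in the general equations).
layer_sizes = [3, 7, 1]   # layer-0 (input), layer-1 (hidden), layer-2 (output)
m, T = 64, 10             # batch size, number of time steps

total_params = 0
for l in range(1, len(layer_sizes)):
    n_prev, n_l = layer_sizes[l - 1], layer_sizes[l]
    shapes = {"W_F": (n_l, n_prev), "W_R": (n_l, n_l), "b": (n_l, 1)}
    n_params = sum(rows * cols for rows, cols in shapes.values())
    total_params += n_params
    print(f"layer-{l}: {shapes} -> {n_params} parameters")
    print(f"layer-{l}: output a_t has shape ({n_l}, {m}) at each of the {T} time steps")

print("total parameters:", total_params)
```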

Attempt the following quiz based on this architecture.

RNNs: Simplified Notations

You may commonly come across a concise, simplified notation scheme for RNNs. Let’s discuss that as well. The RNN feedforward equations are:

$$z^l_t = W^l_F\, a^{l-1}_t + W^l_R\, a^l_{t-1} + b^l$$

$$a^l_t = f^l\!\left(z^l_t\right)$$

The above equation can be written in the following matrix form:

$$z^l_t = \begin{bmatrix} W^l_F & W^l_R \end{bmatrix} \begin{bmatrix} a^{l-1}_t \\ a^l_{t-1} \end{bmatrix} + b^l$$

In the above equation, the term $\begin{bmatrix} W^l_F & W^l_R \end{bmatrix} \begin{bmatrix} a^{l-1}_t \\ a^l_{t-1} \end{bmatrix}$ is equal to $W^l_F\, a^{l-1}_t + W^l_R\, a^l_{t-1}$ (this follows from block matrix multiplication).

You can now merge the two weight matrices into one to get the following notation:

$$z^l_t = W^l \left[ a^{l-1}_t,\, a^l_{t-1} \right] + b^l$$

where $W^l$ denotes the combined feedforward + recurrent weights of layer $l$, formed by stacking (or concatenating) $W^l_F$ and $W^l_R$ side by side, and $[a^{l-1}_t, a^l_{t-1}]$ is formed by stacking the activations $a^{l-1}_t$ and $a^l_{t-1}$ on top of each other. Try doing a consistency check of the dimensions and convince yourself that the new notation is consistent with the old one.
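As a sketch of that consistency check (writing $n_l$ for the number of neurons in layer $l$ and $m$ for the batch size):

$$\underbrace{W^l}_{(n_l,\; n_{l-1}+n_l)} \; \underbrace{\begin{bmatrix} a^{l-1}_t \\ a^l_{t-1} \end{bmatrix}}_{(n_{l-1}+n_l,\; m)} + \underbrace{b^l}_{(n_l,\; 1)} \;\Rightarrow\; z^l_t : (n_l,\; m)$$

which is exactly the shape produced by the original two-term form $W^l_F\, a^{l-1}_t + W^l_R\, a^l_{t-1} + b^l$.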

This form is not only more concise but also more computationally efficient. Rather than doing two matrix multiplications and adding them, the network can do one large matrix multiplication. 
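Here is a minimal NumPy sketch (using the layer sizes from the running example) showing that the merged form produces exactly the same result as the original two-term form:

```python
import numpy as np

n_prev, n_l, m = 3, 7, 64   # neurons in layer l-1, neurons in layer l, batch size

W_F = np.random.randn(n_l, n_prev)
W_R = np.random.randn(n_l, n_l)
b = np.random.randn(n_l, 1)

a_prev_t = np.random.randn(n_prev, m)   # a^{l-1}_t
a_l_tprev = np.random.randn(n_l, m)     # a^l_{t-1}

# Original form: two matrix multiplications, then a sum
z_two = W_F @ a_prev_t + W_R @ a_l_tprev + b

# Merged form: one large matrix multiplication
W = np.hstack([W_F, W_R])                      # shape (n_l, n_prev + n_l)
a_stacked = np.vstack([a_prev_t, a_l_tprev])   # shape (n_prev + n_l, m)
z_one = W @ a_stacked + b

print(np.allclose(z_two, z_one))  # True: the two forms are equivalent
```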

Now consider the same example with the modified notations in mind. You have a neural network with three neurons in the input layer (layer-0), 7 neurons in the hidden layer (layer-1) and one neuron in the output softmax layer (layer-2). Consider a batch size of 64. The sequence size is 10.
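For instance, for layer-1 of this network the merged quantities have the following shapes:

$$W^1 = \begin{bmatrix} W^1_F & W^1_R \end{bmatrix} : (7,\; 3+7) = (7,\; 10), \qquad \begin{bmatrix} a^0_t \\ a^1_{t-1} \end{bmatrix} : (3+7,\; 64) = (10,\; 64)$$

so $z^1_t = W^1\,[a^0_t, a^1_{t-1}] + b^1$ has shape $(7, 64)$ at each of the 10 time steps.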

In the next few sections, you'll look at various types of RNN architectures and some problems that can be solved using them.
