Let’s now look at the architecture of an RNN visually and compare it to that of a normal feedforward network.
The following figure shows the RNN architecture along with the feedforward equations.
The green layer is the input layer, in which the xi’s are elements of a sequence – words of a sentence, frames of a video, etc. The layers in red are the ‘recurrent layers’ – they represent the various states evolving over time as new inputs are seen by the network.
The blue layer is the output layer, where the yi’s are the outputs emitted by the network at each timestep. For example, in a part-of-speech (POS) tagging task (assigning tags such as noun, verb, adjective etc. to each word in a sentence), the yi’s will be the POS tags of the corresponding xi’s. Note that in this figure the input and output sequences are of equal length, but this is not necessary. For example, to classify a sentence as ‘positive/negative’ (sentiment-wise), the output layer will emit just one label (0/1) at the end of T timesteps (you will see some more examples shortly).
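To make the shapes concrete, here is a minimal sketch contrasting the two cases – one output per input element (as in POS tagging) versus a single output for the whole sequence (as in sentiment classification). The sizes T, input_dim and num_tags below are made-up numbers for illustration, not taken from the figure.

```python
import numpy as np

# Made-up sizes for illustration only
T, input_dim, num_tags = 6, 50, 12   # sequence length, feature size per word, number of POS tags

# Many-to-many (e.g. POS tagging): the network emits one output per input element
x_seq = np.random.randn(T, input_dim)   # x_1 ... x_T, e.g. word vectors of a sentence
y_seq = np.zeros((T, num_tags))         # y_1 ... y_T, one POS tag (distribution) per word

# Many-to-one (e.g. sentiment classification): only one label emitted after T timesteps
y_sentiment = 0                         # a single 0/1 label for the whole sentence
```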
You can see that the layers of an RNN are similar to those of vanilla neural nets (MLPs) – each layer has some neurons and is connected to the previous and the next layers. The only difference is that each layer now has a copy of itself along the time dimension (the various states of a layer, shown in red). Thus, all the copies of a layer along the time dimension have the same number of neurons (since they represent the various states of the same lth layer over time).
For example, let’s say that layer-2 and layer-3 have 10 and 20 neurons respectively. Each of the red copies of the second layer will have 10 neurons, and likewise for layer-3.
The flow of information in RNNs is as follows: each layer gets input from two directions – activations from the previous layer at the current timestep and activations from the current layer at the previous timestep. Similarly, the activations (outputs of each layer) go in two directions – towards the next layer at the current timestep (through WF), and towards the next timestep in the same layer (through WR). A small sketch of this flow is shown below.
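The following sketch shows this flow for a single layer, reusing the made-up sizes from the example above (layer-2 with 10 neurons feeding layer-3 with 20 neurons). The weight names WF and WR mirror the figure; the activation function, initialisation and sizes are assumptions for illustration only.

```python
import numpy as np

# Made-up layer sizes from the example above: layer-2 has 10 neurons, layer-3 has 20
n_2, n_3 = 10, 20

# W_F: feedforward weights carrying layer-2's activations into layer-3 at the same timestep
# W_R: recurrent weights carrying layer-3's own state from the previous timestep
W_F = np.random.randn(n_3, n_2) * 0.01   # shape (20, 10)
W_R = np.random.randn(n_3, n_3) * 0.01   # shape (20, 20)
b   = np.zeros(n_3)

def layer3_step(a2_t, a3_prev):
    """One timestep of layer-3: combine input from the previous layer (a2_t)
    with the layer's own activations at the previous timestep (a3_prev)."""
    return np.tanh(W_F @ a2_t + W_R @ a3_prev + b)

# Unroll over T timesteps: the same W_F and W_R are reused at every step
T = 5
a2_sequence = np.random.randn(T, n_2)    # layer-2 activations at each timestep
a3 = np.zeros(n_3)                       # initial state of layer-3
for t in range(T):
    a3 = layer3_step(a2_sequence[t], a3)
```

Note that the same WF and WR matrices are reused at every timestep – the red copies in the figure are the same layer unrolled in time, not separate layers with their own weights.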
In the next section, you will see how various types of sequences are fed to the network.