In the next few segments, you will build the network using the ResNet architecture. On this page, we will recap the architecture of ResNets and discuss some improvements that were later proposed to it. This is a text-only, optional page intended to give you a high-level overview of the architecture – you can skip it if you want to move to the Python implementation directly.
ResNets – Original Architecture and Proposed Improvements
Since ResNets have become quite prevalent in the industry, it is worth spending some time understanding the important elements of their architecture. You may quickly revisit the ResNet segment here, though the broad ideas are discussed again below.
Let’s start with the original architecture proposed here. The basic problem ResNet solved was that very deep networks were hard to train – beyond a point, adding layers actually hurt performance, e.g. a 56-layer net had a lower training accuracy than a 20-layer net. By the way, before ResNets, anything having more than 20 layers was considered very deep.
The ResNet team argued that a net with n+1 layers should perform at least as well as one with n layers. This is because even if the additional layer simply lets the input pass through it (i.e. acts as an identity function f(x) = x), the deeper net will perform identically to the n-layered network.
Now let’s see how ResNets solved this problem. Consider the figure below (from the paper). Say the input to some ‘unit’ of a network is x (the unit has two weight layers), and that, ideally, this unit should learn some function H(x), i.e. given the input x, it should learn to produce the desired output H(x).
In a normal neural net, these two layers (i.e. this unit) would try to learn the function H(x) directly. ResNets tried a different trick. They argued: let F(x) denote the residual between H(x) and x, i.e. F(x) = H(x) − x. They hypothesized that it would be easier to learn the residual function F(x) than to learn H(x). In the extreme case that the unit should simply let the signal pass through (i.e. H(x) = x is the optimal thing to learn), it is easier to push the residual F(x) towards zero than to make a stack of non-linear layers fit the identity mapping.
Experiments on deep nets showed that this hypothesis was indeed true – if letting the signal pass through was the optimal thing to do (i.e. it reduced the loss), the units learnt F(x) = 0; but if something more useful was to be learnt, the units learnt that instead. These units are called residual units.
Once a unit has learnt the residual F(x), the feedforward pass goes on as usual with the output H(x) = F(x) + x (since F(x) = H(x) − x). This addition is performed by the shortcut (or skip) connections shown in the figure. These connections do not add any extra parameters (and thus no extra complexity) to the network – they simply add the input to the residual.
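To make this concrete, here is a minimal sketch of a two-layer residual unit in Python, written with PyTorch as an illustrative assumption (the upcoming segments may use a different framework or a slightly different layer ordering). Note how the forward pass first computes the residual F(x) and then simply adds the input x to it.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualUnit(nn.Module):
    # A basic two-layer residual unit: output = ReLU(F(x) + x)
    def __init__(self, channels):
        super().__init__()
        # Two weight layers that learn the residual F(x)
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))      # this is the residual F(x)
        return F.relu(out + x)               # H(x) = F(x) + x; the '+ x' is the skip connection

For example, passing a dummy batch of shape (1, 64, 32, 32) through ResidualUnit(64) returns an output of the same shape – the identity shortcut requires the input and the residual to have matching dimensions.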
Bottleneck Residual Blocks
The figure above shows a residual block (or unit) of two layers. The ResNet team experimented with other types of blocks as well. One particularly successful one was the bottleneck architecture, designed especially for deeper nets (we’ll be using this in the upcoming sections). The bottleneck block has three convolutional layers in this sequence: (1, 1), (3, 3) and (1, 1) filters (right side in the figure below).
The reason why the bottleneck architecture works better than the vanilla one is beyond the scope of this discussion – you can read the intuition here and the details in the original paper. For practical purposes, it will suffice to remember that bottleneck blocks facilitate the training of deeper nets; a sketch of one such block is shown below.
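As a rough illustration (again in PyTorch, and again an assumption about the framework), the two (1, 1) layers first reduce and then restore the number of channels, so the relatively expensive (3, 3) convolution operates on a ‘bottlenecked’ representation:

import torch.nn as nn
import torch.nn.functional as F

class BottleneckUnit(nn.Module):
    # Bottleneck residual block: (1, 1) reduce -> (3, 3) -> (1, 1) restore, plus an identity skip
    def __init__(self, channels, reduction=4):
        super().__init__()
        mid = channels // reduction                                              # e.g. 256 -> 64 channels
        self.conv1 = nn.Conv2d(channels, mid, kernel_size=1, bias=False)         # (1, 1): reduce channels
        self.bn1 = nn.BatchNorm2d(mid)
        self.conv2 = nn.Conv2d(mid, mid, kernel_size=3, padding=1, bias=False)   # (3, 3)
        self.bn2 = nn.BatchNorm2d(mid)
        self.conv3 = nn.Conv2d(mid, channels, kernel_size=1, bias=False)         # (1, 1): restore channels
        self.bn3 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = F.relu(self.bn2(self.conv2(out)))
        out = self.bn3(self.conv3(out))        # residual F(x)
        return F.relu(out + x)                 # H(x) = F(x) + x

The reduction factor of 4 mirrors the original paper’s deeper ResNets (e.g. 256 -> 64 -> 256 channels), but the exact numbers here are only illustrative.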
Improved ResNet Architecture
In 2016, the ResNet team proposed some improvements to the original architecture here. Using these modifications, they trained nets of more than 1000 layers (e.g. ResNet-1001), which showed improved performance on the CIFAR-10 and CIFAR-100 datasets. The basic ideas of the proposed design are explained below.
You know that skip connections act as a ‘direct path’ for information propagation within a residual block. The new architecture essentially extended the idea of skip connections from individual residual blocks to the entire network, i.e. if multiple consecutive units should ideally learn identity functions, the signal can be propagated directly across all of them.
Another unconventional change proposed here concerns where the activation function (ReLU) and batch normalization are applied relative to the weight layers – before them (pre-activation) rather than after them (post-activation), as in the original design. On the right side of the figure above, the grey arrows show the ‘direct path’, whereas the other layers (BN, ReLU etc.) lie on the usual path. This modification improved gradient propagation (and thus the ease of training) and was therefore used to train nets deeper than 1000 layers.
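A hedged sketch of such a pre-activation residual unit (PyTorch again, as an assumption): BN and ReLU are applied before each convolution, and the skip connection adds the raw input, so nothing on the direct path transforms the signal.

import torch.nn as nn
import torch.nn.functional as F

class PreActResidualUnit(nn.Module):
    # Pre-activation residual unit: (BN -> ReLU -> conv) twice, plus an identity skip
    def __init__(self, channels):
        super().__init__()
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)

    def forward(self, x):
        out = self.conv1(F.relu(self.bn1(x)))    # BN and ReLU come *before* the weight layer
        out = self.conv2(F.relu(self.bn2(out)))
        return out + x                           # identity skip; no ReLU after the addition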
You can read more about the original and the improved architectures in the papers provided below. In the next few segments, you will use some variants of ResNet (ResNet-18, ResNet-34 etc.).
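As a quick illustration of how easily these variants can be instantiated, here is a hedged example using torchvision (an assumption – the upcoming segments may build the network in a different library or from scratch):

import torch
from torchvision import models

# Instantiate two common ResNet variants (randomly initialised weights by default)
resnet18 = models.resnet18()
resnet34 = models.resnet34()

# A forward pass on a dummy batch of 224x224 RGB images
dummy = torch.randn(1, 3, 224, 224)
print(resnet18(dummy).shape)   # torch.Size([1, 1000]) – 1000 ImageNet classes by default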
Additional Reading