In this session, we will briefly look into the architectures of AlexNet and VGGNet.
AlexNet was one of the very first architectures to achieve extraordinary results in the ImageNet competition (with about a 17% error rate). It used 8 layers (5 convolutional and 3 fully connected). One distinct feature of AlexNet was its use of kernels of various large sizes, such as (11, 11) and (5, 5). AlexNet was also one of the first architectures to use dropout, which was quite a recent technique back then.
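To make this layer structure concrete, here is a minimal Keras sketch of an AlexNet-style network (purely illustrative; the filter counts and kernel sizes follow the original paper, but the two-GPU split and local response normalisation are omitted):

```python
import tensorflow as tf

# AlexNet-style architecture: 5 convolutional + 3 fully connected layers,
# large kernels (11 x 11, 5 x 5) in the early layers and dropout in the FC layers.
alexnet = tf.keras.Sequential([
    tf.keras.Input(shape=(227, 227, 3)),
    tf.keras.layers.Conv2D(96, (11, 11), strides=4, activation='relu'),
    tf.keras.layers.MaxPooling2D((3, 3), strides=2),
    tf.keras.layers.Conv2D(256, (5, 5), padding='same', activation='relu'),
    tf.keras.layers.MaxPooling2D((3, 3), strides=2),
    tf.keras.layers.Conv2D(384, (3, 3), padding='same', activation='relu'),
    tf.keras.layers.Conv2D(384, (3, 3), padding='same', activation='relu'),
    tf.keras.layers.Conv2D(256, (3, 3), padding='same', activation='relu'),
    tf.keras.layers.MaxPooling2D((3, 3), strides=2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(4096, activation='relu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(4096, activation='relu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1000, activation='softmax'),
])
alexnet.summary()
```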
You are already familiar with VGGNet from the previous session. Recall that VGGNet used filters of a single size, (3, 3), and had more layers (VGG-16 has 16 layers with trainable weights, VGG-19 has 19 layers, and so on).
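For comparison, a VGG-16-style network can be sketched in the same way (again, purely illustrative; dropout in the fully connected layers is omitted for brevity). Note that every convolution uses a (3, 3) kernel, and the 13 convolutional plus 3 fully connected layers account for the 16 layers with trainable weights:

```python
import tensorflow as tf

def vgg_block(filters, n_convs):
    # A VGG "block": n_convs (3, 3) convolutions followed by 2 x 2 max pooling.
    return [tf.keras.layers.Conv2D(filters, (3, 3), padding='same', activation='relu')
            for _ in range(n_convs)] + [tf.keras.layers.MaxPooling2D((2, 2), strides=2)]

# 13 convolutional + 3 fully connected layers = 16 layers with trainable weights.
vgg16 = tf.keras.Sequential(
    [tf.keras.Input(shape=(224, 224, 3))]
    + vgg_block(64, 2) + vgg_block(128, 2)
    + vgg_block(256, 3) + vgg_block(512, 3) + vgg_block(512, 3)
    + [tf.keras.layers.Flatten(),
       tf.keras.layers.Dense(4096, activation='relu'),
       tf.keras.layers.Dense(4096, activation='relu'),
       tf.keras.layers.Dense(1000, activation='softmax')]
)
vgg16.summary()
```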
VGGNet succeeded AlexNet in the ImageNet challenge by reducing the error rate from about 17% to less than 8%. Let's compare the architectures of the two nets.
There are some other important points to note about AlexNet, which are summarised below. We highly recommend going through the AlexNet paper (you should be able to read most CNN papers comfortably by now).
Because good computing hardware was scarce at the time, AlexNet was trained on small GPUs (with only 3 GB of memory each), so the training was distributed across two GPUs in parallel (figure shown below). AlexNet was also one of the first architectures to make heavy use of the ReLU activation.
Comprehension – Effective Receptive Field
The key idea in moving from AlexNet to VGGNet was to increase the depth of the network by using smaller filters. Let’s understand what happens when we use a smaller filter of size (3, 3) instead of larger ones such as (5, 5) or (7, 7).
Consider the example below. Say we have a 5 x 5 image, and in two different convolution experiments, we use two different filters of size (5, 5) and (3, 3) respectively.
In the first convolution, the (5, 5) filter produces a feature map with a single element (note that the convolution is followed by a non-linear function as well). This filter has 25 parameters.
In the second case with the (3, 3) filter, two successive convolutions (with stride=1, no padding) produce a feature map with one element.
We say that the stack of two (3, 3) filters has the same effective receptive field as that of one (5, 5) filter. This is because both these convolutions produce the same output (of size 1 x 1 here) whose receptive field is the same 5 x 5 image.
Notice that with a smaller (3, 3) filter, we can make a deeper network with more non-linearities and fewer parameters. In the above case:
- The (5, 5) filter has 25 parameters and one non-linearity.
- The stack of two (3, 3) filters has 18 (9 + 9) parameters and two non-linearities.
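This equivalence is easy to verify with a small sketch (Keras is used here purely for illustration). Both versions are built on a 5 x 5 single-channel input; biases are disabled so that the parameter counts match the 25 and 18 quoted above:

```python
import tensorflow as tf

# One (5, 5) filter: produces a 1 x 1 output with 25 weights and one non-linearity.
net_5x5 = tf.keras.Sequential([
    tf.keras.Input(shape=(5, 5, 1)),
    tf.keras.layers.Conv2D(1, (5, 5), activation='relu', use_bias=False),
])

# Two stacked (3, 3) filters (stride 1, no padding): also a 1 x 1 output,
# but with 9 + 9 = 18 weights and two non-linearities.
net_3x3 = tf.keras.Sequential([
    tf.keras.Input(shape=(5, 5, 1)),
    tf.keras.layers.Conv2D(1, (3, 3), activation='relu', use_bias=False),
    tf.keras.layers.Conv2D(1, (3, 3), activation='relu', use_bias=False),
])

x = tf.random.normal((1, 5, 5, 1))                  # a dummy 5 x 5 input
print(net_5x5(x).shape, net_5x5.count_params())     # (1, 1, 1, 1) 25
print(net_3x3(x).shape, net_3x3.count_params())     # (1, 1, 1, 1) 18
```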
Since VGGNet used smaller filters (all 3 x 3) compared to AlexNet (which used 11 x 11 and 5 x 5 filters), it could apply a larger number of non-linear activations with fewer parameters.
In the next segment, we will briefly study GoogLeNet, which outperformed VGGNet.
Additional Readings
We strongly recommend reading the VGGNet paper provided below. You should now be able to read many CNN-based papers comfortably.