In this exercise, we will dissect each layer of the VGG-16 architecture. This exercise will help you apply all the concepts learnt so far.

The VGG-16 was trained on the ImageNet challenge (ILSVRC) 1000-class **classification** task. The network takes a (224, 224, 3) RGB image as the input. The '16' in its name comes from the fact that the network has 16 layers with trainable weights – **13 convolutional layers** and **3 fully connected** ones (the VGG team experimented with many other configurations, such as VGG-19, which is also quite popular).

The architecture is given in the table below (taken from the original paper). Each column in the table (from A to E) denotes an architecture the team experimented with. In this discussion, we will refer only to column D, which corresponds to VGG-16 (column E is VGG-19).

The convolutional layers are denoted in the table as conv&lt;size of filter&gt;-&lt;number of filters&gt;. Thus, conv3-64 means 64 (3, 3) square filters. Note that all the conv layers in VGG-16 use (3, 3) filters and that the number of filters increases in powers of two (64, 128, 256, 512).

All the convolutional layers use a stride of 1 pixel with a padding of 1 pixel on each side, thereby preserving the spatial dimensions (height and width) of the input.
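You can verify this with the standard convolution output-size formula, output = floor((n − f + 2p) / s) + 1, where n is the input size, f the filter size, p the padding, and s the stride. A quick sketch:

```python
def conv_output_size(n, f=3, p=1, s=1):
    """Spatial output size of a convolution: floor((n - f + 2p) / s) + 1."""
    return (n - f + 2 * p) // s + 1

# A (3, 3) filter with stride 1 and padding 1 preserves the input size
print(conv_output_size(224))  # 224
```

With f = 3, p = 1, and s = 1, the formula always returns n, which is why every conv layer in VGG-16 keeps the height and width unchanged.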

After every set of convolutional layers, there is a max pooling layer. All the pooling layers in the network use a window of 2 x 2 pixels with stride 2. Finally, the output of the last pooling layer is flattened and fed to a fully connected (FC) layer with 4096 neurons, followed by another FC layer with 4096 neurons, and finally to a 1000-way softmax output. The softmax layer uses the usual cross-entropy loss. All layers apart from the softmax use the ReLU activation function.

The number of parameters and the output size from any layer can be calculated as demonstrated in the MNIST notebook on the previous page. For example, the first convolutional layer takes a (224, 224, 3) image as the input and has 64 filters of size (3, 3, 3). Note that the **depth of a filter** is always **equal to the number of channels** in the input which it convolves. Thus, the first convolutional layer has 64 x 3 x 3 x 3 (weights) + 64 (biases) = 1792 trainable parameters. Since stride and padding of 1 pixel are used, the output spatial size is preserved, and the output will be (224, 224, 64).
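The same arithmetic can be packaged into a small helper (a sketch; the function name is ours, not from the notebook):

```python
def conv_layer_params(filters, kernel, in_channels):
    """Trainable parameters of a conv layer:
    weights (filters * kernel * kernel * in_channels) plus one bias per filter."""
    return filters * kernel * kernel * in_channels + filters

# First conv layer of VGG-16: 64 filters of size (3, 3, 3)
print(conv_layer_params(64, 3, 3))  # 1792
```

You can reuse this for the deeper layers by passing the previous layer's filter count as `in_channels` (e.g. the second conv layer has 64 filters convolving a 64-channel input).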

Now answer the following questions (you will need a calculator). Keep track of the number of channels at each layer. Don’t forget to add the biases.

The total number of trainable parameters in the VGG-16 is about **138 million** (138,357,544 exactly), which is enormous. In an upcoming session, you will see that some of the more recent architectures (such as ResNet) have achieved much better performance with far fewer parameters.
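As a check on your calculator work, the full tally can be reproduced by summing the 13 conv layers (listed as (filters, input channels) pairs for column D) and the 3 FC layers:

```python
def conv_params(filters, in_channels):
    # 3x3 kernels throughout: weights + one bias per filter
    return 3 * 3 * in_channels * filters + filters

def fc_params(n_in, n_out):
    # weights + one bias per output neuron
    return n_in * n_out + n_out

convs = [(64, 3), (64, 64),                      # block 1
         (128, 64), (128, 128),                  # block 2
         (256, 128), (256, 256), (256, 256),     # block 3
         (512, 256), (512, 512), (512, 512),     # block 4
         (512, 512), (512, 512), (512, 512)]     # block 5

total = sum(conv_params(f, c) for f, c in convs)
total += fc_params(7 * 7 * 512, 4096)  # flattened pool output -> FC-4096
total += fc_params(4096, 4096)         # FC-4096 -> FC-4096
total += fc_params(4096, 1000)         # FC-4096 -> 1000-way softmax
print(total)  # 138357544
```

Note that the first FC layer alone contributes over 102 million parameters – the fully connected layers, not the convolutions, account for most of the total.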

In the next few segments, the professor will demonstrate some experiments with various CNN hyperparameters on the CIFAR-10 dataset.
