
Summary

In this session, you learned the basics of convolutional neural networks (CNNs) and their common applications in computer vision, such as image classification and object detection. You also learned that CNNs are not limited to images and can be extended to videos, text, audio and other data types.

The design of CNNs draws on several observations from the animal visual system: each retinal neuron looks at its own local receptive field (applying essentially identical processing), some neurons respond proportionally to the summed activity over excitatory regions (analogous to pooling), and images are perceived in a hierarchical manner.

You learned that images are naturally represented as arrays of numbers. Greyscale images have a single channel, while colour images have three channels: red, green and blue (RGB). The number of channels, or the 'depth' of the image, can vary depending on how the image is represented. Each channel value of a pixel, typically an integer between 0 and 255, indicates the 'intensity' of a certain colour.
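To make this concrete, here is a minimal NumPy sketch (the array shapes and random values are illustrative, not from the session) showing how greyscale and colour images are represented:

    import numpy as np

    # A greyscale image is a 2-D array of intensities (height x width);
    # a colour image adds a third axis with one slice per channel.
    grey = np.random.randint(0, 256, size=(28, 28), dtype=np.uint8)    # H x W
    rgb = np.random.randint(0, 256, size=(28, 28, 3), dtype=np.uint8)  # H x W x 3 (R, G, B)

    print(grey.shape)  # (28, 28)
    print(rgb.shape)   # (28, 28, 3)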

You saw that specialised filters, or kernels, can be designed to extract specific features from an image (such as vertical edges). A filter is convolved over an image, extracting features from each 'patch'; multiple filters are used to extract different features. Convolutions can be performed with various strides and paddings.
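As a small illustration (the image and filter values below are made up), the following NumPy sketch convolves a hypothetical 6 × 6 greyscale image with a hand-designed 3 × 3 vertical-edge filter, using stride 1 and no padding:

    import numpy as np

    # A hypothetical 6 x 6 greyscale image: dark left half, bright right half
    image = np.array([[0, 0, 0, 255, 255, 255]] * 6, dtype=float)

    # A simple 3 x 3 vertical-edge filter
    kernel = np.array([[-1, 0, 1],
                       [-1, 0, 1],
                       [-1, 0, 1]], dtype=float)

    # Slide the filter over every 3 x 3 patch (stride 1, no padding -> 4 x 4 output)
    out = np.zeros((4, 4))
    for i in range(4):
        for j in range(4):
            patch = image[i:i+3, j:j+3]
            out[i, j] = np.sum(patch * kernel)

    print(out)  # large responses only where the vertical edge lies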

The formula to calculate the output shape after convolution is given by:

((n + 2P − k)/S + 1) × ((n + 2P − k)/S + 1), where

  • The image is of size n × n
  • The filter is of size k × k
  • Padding is P
  • Stride is S
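As a quick sanity check of this formula, here is a small Python helper (the function name is ours, for illustration):

    def conv_output_size(n, k, P=0, S=1):
        """Spatial size of the output when an n x n image is convolved
        with a k x k filter, using padding P and stride S."""
        return (n + 2 * P - k) // S + 1

    # Example: a 5 x 5 image with a 3 x 3 filter, no padding, stride 1
    print(conv_output_size(5, 3, P=0, S=1))  # 3, i.e. a 3 x 3 output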

The filters are learned during training via backpropagation. Each filter (consisting of weights and biases) is called a neuron. Multiple neurons are used to convolve an image (or the feature maps from previous layers) to generate new feature maps. Each feature map contains the output of a convolution followed by a non-linear activation applied to the input.
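For instance, applying a ReLU activation (one common choice; the pre-activation values below are made up) to the raw convolution output yields the feature map:

    import numpy as np

    # Hypothetical pre-activation convolution outputs
    pre_activation = np.array([[-3.0, 5.0],
                               [ 2.0, -1.0]])

    # ReLU sets negative responses to zero, leaving positives unchanged
    feature_map = np.maximum(pre_activation, 0.0)
    print(feature_map)  # [[0. 5.]
                        #  [2. 0.]]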

A typical CNN unit (or layer) in a large CNN-based network comprises multiple filters (or neurons), followed by non-linear activations and then a pooling layer. The pooling layer computes a statistical aggregate (max, sum, etc.) over various regions of the input, which reduces sensitivity to minor, local variations in the image. Multiple such CNN units are stacked together and finally followed by some fully connected layers to form a deep convolutional network.
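For example, a 2 × 2 max pooling with stride 2 (a common configuration; the feature-map values here are illustrative) halves each spatial dimension:

    import numpy as np

    # A hypothetical 4 x 4 feature map
    fmap = np.array([[1, 3, 2, 0],
                     [4, 2, 1, 5],
                     [0, 1, 3, 2],
                     [2, 6, 1, 4]], dtype=float)

    # 2 x 2 max pooling, stride 2: take the max over each 2 x 2 block
    pooled = fmap.reshape(2, 2, 2, 2).max(axis=(1, 3))
    print(pooled)  # [[4. 5.]
                   #  [6. 4.]]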

In the next session, you will learn to build and train CNNs using Python, Keras and GPUs.
