So far, we have been performing convolutions only on 2D arrays (images), say of size 6 x 6. But most real images are coloured (RGB) images, i.e. 3D arrays of size m x n x 3. In general, we represent an image as a 3D matrix of size height x width x channels.
To convolve such images, we simply use 3D filters. The basic idea of convolution remains the same – we take the element-wise product and sum up the values. The only difference is that the filters are now 3-dimensional, e.g. 3 x 3 x 3 or 5 x 5 x 3 (the last '3' indicates that the filter has as many channels as the image).
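To make the element-wise product-and-sum concrete, here is a minimal NumPy sketch of a single convolution step on a 3 x 3 x 3 patch; the patch and filter values are made up purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
patch = rng.integers(0, 256, size=(3, 3, 3))   # one 3 x 3 image patch across all 3 channels
kernel = rng.standard_normal((3, 3, 3))        # filter has the same depth as the image

# Element-wise product of 27 pairs of numbers, summed to a single scalar
step_output = np.sum(patch * kernel)
print(step_output)
```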
Let’s now see how convolutions are performed on 3D arrays and what it is that a CNN ‘learns’ during training.
To summarise, you learnt the following things in the video:
- We use 3D filters to perform convolution on 3D images. For example, if we have an image of size (224, 224, 3), we can use filters of size (3, 3, 3), (5, 5, 3), (7, 7, 3), etc. (with appropriate padding). We can use a filter of any spatial size as long as the number of channels in the filter is the same as that in the input image (the short sketch after this list illustrates this).
- The filters are learnt during training (i.e. during backpropagation). Hence, the individual values of the filters are often called the weights of a CNN.
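To illustrate both points, here is a hedged Keras sketch (the filter count of 32 and the padding choice are arbitrary assumptions, not from the text above): the layer is given only the spatial kernel size, the channel depth of 3 is inferred from the input, and the filter values are trainable weights.

```python
import tensorflow as tf

# kernel_size specifies only the spatial dims; the depth (3) is inferred
# from the RGB input, so each of the 32 filters is effectively (3, 3, 3).
model = tf.keras.Sequential([
    tf.keras.Input(shape=(224, 224, 3)),
    tf.keras.layers.Conv2D(filters=32, kernel_size=(3, 3), padding='same'),
])
model.summary()  # the filter values appear as trainable weights (plus biases)
```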
Comprehension – Weights and Biases
In the discussion so far, we have talked only about weights, but convolutional layers (i.e. filters) also have biases. Let's look at an example to understand this concretely.
Suppose we have an RGB image and a (2, 2, 3) filter as shown below. The filter has three channels, and each channel of the filter convolves the corresponding channel of the image. Thus, each step of the convolution involves the element-wise multiplication of 12 pairs of numbers and the summation of the resultant products into a single scalar output.
The GIF below shows the convolution operation – note that in each step, a single scalar number is generated, and at the end of the convolution, a 2D array is generated:
You can express the convolution operation as a dot product between the weights and an image patch. If you treat the (2, 2, 3) filter as a vector $w$ of length 12, and the 12 corresponding elements of the input image as a vector $p$ (i.e. both unrolled into 1D vectors), each step of the convolution is simply the dot product $w^T \cdot p$. The dot product is computed at every patch to get a (3, 3) output array, as shown above.
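The same computation can be sketched in NumPy. This assumes a hypothetical 4 x 4 x 3 image, so that the (2, 2, 3) filter with stride 1 and no padding yields the (3, 3) output described above:

```python
import numpy as np

rng = np.random.default_rng(42)
image = rng.integers(0, 256, size=(4, 4, 3))    # hypothetical 4 x 4 RGB image
w = rng.standard_normal((2, 2, 3)).reshape(-1)  # (2, 2, 3) filter unrolled to a 12-vector

out = np.zeros((3, 3))
for i in range(3):          # stride 1, no padding: (4 - 2 + 1) = 3 positions per axis
    for j in range(3):
        p = image[i:i+2, j:j+2, :].reshape(-1)  # patch unrolled to a 12-vector
        out[i, j] = w @ p                       # the dot product w^T . p
print(out)                                      # the (3, 3) output array
```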
Apart from the weights, each filter can also have a bias. In this case, the output of the convolution operation is a (3, 3) array (or, unrolled, a vector of length 9), so the bias will also be a vector of length 9. However, a common practice in CNNs is to give all the elements of the bias vector the same value (called a tied bias). For example, a tied bias for the filter shown above can be represented as:
$$
w^T \cdot x + b =
\begin{bmatrix}
\operatorname{sum}(w^T \cdot p_{11}) & \operatorname{sum}(w^T \cdot p_{12}) & \operatorname{sum}(w^T \cdot p_{13}) \\
\operatorname{sum}(w^T \cdot p_{21}) & \operatorname{sum}(w^T \cdot p_{22}) & \operatorname{sum}(w^T \cdot p_{23}) \\
\operatorname{sum}(w^T \cdot p_{31}) & \operatorname{sum}(w^T \cdot p_{32}) & \operatorname{sum}(w^T \cdot p_{33})
\end{bmatrix}
+
\begin{bmatrix}
b & b & b \\
b & b & b \\
b & b & b
\end{bmatrix}
$$