Understanding the Visual System of Mammals

You have already seen that every neuron is trained to look at a particular patch in the retina, called the receptive field of that neuron.

This raises some questions such as: Are the shapes and sizes of these receptive fields identical across neurons or do they vary? Do all the neurons ‘see’ the same ‘features’, or are some neurons specialised to ‘see’ certain features?

Let’s seek answers to some of these questions. You will also study, at a high level, how higher-level abstract ‘features’ such as ‘movement’ are detected by the visual system.

In this lecture, you studied two main observations from the paper:

The receptive fields of all neurons are almost identical in shape and size
There is a hierarchy in the units: units at the initial level do very basic tasks, such as picking out raw features (for example, horizontal edges) in the image. The subsequent units extract more abstract features, such as identifying textures or detecting movement. The layers ‘higher’ in the hierarchy typically aggregate the features in the lower ones.

The image below illustrates the hierarchy in units – the first level extracts low-level features (such as vertical edges) from the image, while the second level calculates the statistical aggregate of the first layer to extract higher-level features (such as texture, colour schemes etc.).

Using this idea, if we design a complex network with multiple layers to do image classification (for example), the layers in the network should do something like this:

The first layer extracts raw features, like vertical and horizontal edges
The second layer extracts more abstract features such as textures (using the features extracted by the first layer)
The subsequent layers may identify certain parts of the image such as skin, hair, nose, mouth etc. based on the textures.
Layers further up may identify faces, limbs etc.
Finally, the last layer may classify the image as ‘human’, ‘cat’ etc.
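The layered idea above can be sketched in a few lines of plain Python. This is only a toy illustration, not the network from the paper or a real CNN: it has just two feature layers, the kernels are hand-written rather than learned, and all names (conv2d_valid, layer1, layer2, classify) and the ‘stripes’ labels are hypothetical stand-ins for classes such as ‘human’ or ‘cat’.

```python
def conv2d_valid(img, kernel):
    """Slide a small kernel over the image; each output value is the
    response of one unit to its own receptive field."""
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for r in range(len(img) - kh + 1):
        row = []
        for c in range(len(img[0]) - kw + 1):
            row.append(sum(kernel[i][j] * img[r + i][c + j]
                           for i in range(kh) for j in range(kw)))
        out.append(row)
    return out

# Hand-written kernels (an assumption; a real CNN learns these).
VERTICAL = [[-1, 1], [-1, 1]]    # responds to left-to-right brightness changes
HORIZONTAL = [[-1, -1], [1, 1]]  # responds to top-to-bottom brightness changes

def layer1(img):
    # First layer: extract raw features (vertical and horizontal edges).
    return {"vertical": conv2d_valid(img, VERTICAL),
            "horizontal": conv2d_valid(img, HORIZONTAL)}

def layer2(feature_maps):
    # Second layer: aggregate each edge map into one number, a crude
    # 'texture' measure (mean absolute edge response).
    return {name: sum(abs(v) for row in fmap for v in row)
                  / (len(fmap) * len(fmap[0]))
            for name, fmap in feature_maps.items()}

def classify(summary):
    # Last layer: pick a label from the aggregated features.
    if summary["vertical"] > summary["horizontal"]:
        return "vertical stripes"
    return "horizontal stripes"

# Two hypothetical 4x4 grayscale images.
vertical_stripes = [[0, 9, 0, 9] for _ in range(4)]
horizontal_stripes = [[0] * 4, [9] * 4, [0] * 4, [9] * 4]

label_a = classify(layer2(layer1(vertical_stripes)))
label_b = classify(layer2(layer1(horizontal_stripes)))
```

Each function consumes only the output of the one before it, which mirrors how every layer builds on the features extracted by the previous layer.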

Apart from explaining the visual system, the paper also suggested that similar phenomena have been observed in the auditory system and in the touch and pressure senses of the somatosensory system. This suggests that CNN-like architectures can also be used for speech processing and for analysing signals coming from touch or pressure sensors.

Let’s have a look at some of the conclusions.

We have already discussed most of the key ideas of the CNN architecture through this paper. The main points are summarised below:

Each unit, or neuron, is dedicated to its own receptive field. Thus, every unit is meant to ignore everything other than what is found in its own receptive field.
The receptive field of each neuron is almost identical in shape and size.
The subsequent layers compute the statistical aggregate of the previous layers of units. This is analogous to the ‘pooling layer’ in a typical CNN.
Inference or the perception of the image happens at various levels of abstraction. The first layer pulls out raw features, subsequent layers pull out higher-level features based on the previous features and so on. Finally, the network gets an overall perception of an image in the last layer.
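To make the ‘statistical aggregate’ point concrete, here is a minimal sketch of pooling in plain Python, assuming non-overlapping 2x2 windows. The function name pool2d, its parameters and the sample feature map are hypothetical and chosen only for illustration; deep learning libraries provide their own pooling layers.

```python
def pool2d(fmap, size=2, aggregate=max):
    """Summarise each non-overlapping size x size window of a feature
    map with a single number (max by default)."""
    out = []
    for r in range(0, len(fmap) - size + 1, size):
        row = []
        for c in range(0, len(fmap[0]) - size + 1, size):
            window = [fmap[r + i][c + j]
                      for i in range(size) for j in range(size)]
            row.append(aggregate(window))
        out.append(row)
    return out

def mean(values):
    return sum(values) / len(values)

# Hypothetical 4x4 feature map produced by an earlier layer.
feature_map = [
    [1, 3, 2, 0],
    [5, 7, 4, 2],
    [0, 2, 8, 6],
    [4, 6, 2, 0],
]

max_pooled = pool2d(feature_map, aggregate=max)   # [[7, 4], [6, 8]]
avg_pooled = pool2d(feature_map, aggregate=mean)  # [[4.0, 2.0], [3.0, 4.0]]
```

Either choice of aggregate shrinks the 4x4 map to 2x2, so the next layer sees a coarser summary of what the previous units detected.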

From the next segment onwards, you will study specific elements of the CNN architecture.
