
Video Analysis

In this segment, you will learn how videos can be analysed using CNNs. A video is essentially a sequence of frames, where each frame is an image. You already know that CNNs can be used to extract features from an image. Let's now see how CNNs can be used to process a series of images (i.e. a video).

Let's summarise the process of video analysis using a CNN + RNN (Recurrent Neural Network) stack. At this point, you only need to understand that RNNs are good at processing sequential information, such as videos (a sequence of images), text (a sequence of words or sentences), etc. You will study RNNs in the next module.

For a video classification task, here's what we can do. Suppose each video is 1 minute long. If we extract frames from each video at a rate of 2 frames per second (FPS), we will have 120 frames (or images) per video. We pass each of these images through a convolutional net (such as VGGNet) and extract a feature vector (of size 4096, say) for each image. Thus, each video is represented by 120 feature vectors.
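A rough sketch of this frame-extraction stage is shown below. The CNN here is a hypothetical stand-in (a real pipeline would run each frame through a pretrained network such as VGGNet and read off a penultimate-layer activation); the frame size and the `extract_features` function are assumptions for illustration only.

```python
import numpy as np

# Assumed setup: a 1-minute video sampled at 2 frames per second.
video_length_s = 60
fps = 2
n_frames = video_length_s * fps  # 120 frames per video

# VGGNet's fully connected layers output 4096-dimensional vectors,
# which is the feature size used in the text.
feature_dim = 4096

rng = np.random.default_rng(0)

def extract_features(frame):
    # Hypothetical placeholder for a CNN forward pass: a real pipeline
    # would feed the frame through a pretrained CNN and return the
    # activations of the penultimate layer.
    return rng.standard_normal(feature_dim)

# Dummy frames (tiny 8x8 RGB arrays stand in for real video frames).
frames = [np.zeros((8, 8, 3)) for _ in range(n_frames)]
features = np.stack([extract_features(f) for f in frames])

print(features.shape)  # (120, 4096): one 4096-d vector per frame
```

The key takeaway is the shape of the result: a 1-minute video becomes a `(120, 4096)` array, i.e. a sequence of 120 feature vectors.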

These 120 feature vectors, representing the video as a sequence of images, can now be fed sequentially into an RNN, which classifies the video into one of the categories.

The main point here is that a CNN acts as a feature extractor for images and can therefore be used in a variety of ways to process images.

In the next few segments, you will study the main elements of CNNs in detail: convolutions, pooling, feature maps, etc.
