Till now, you have used CNNs on images, as they are good at extracting the spatial features of an image. In this session, you will learn to use CNNs for text processing.
Why use CNNs?
It turns out that CNNs are also really effective at extracting features from text, not just images. You can feed text directly to an RNN, but when the sequences are very long, RNNs become hard to use because they are computationally expensive.
Consider that you have twenty news headlines running each day (which is an understatement in the first place, considering that the actual number of headlines is in the thousands); that is a really long sequence to process with an RNN. Think about training an RNN where each input is a 1,000-word sequence and you have 1,000,000 such sequences! Training an RNN model on such long sequences is a massive computational expense. A CNN, however, is much faster and significantly less expensive than an RNN.
Why is it called 1D convolution?
As you can see above, the length of the filter (3 in this case) is the same as the length of the word embedding (3 in this case). Unlike 2D convolution, where the filter moves along both the x and y axes, here the filter can move in only one direction (down the sequence of words), and that is precisely why it is called 1D convolution. Here, the input dimension is 4 x 3 and the output dimension is just 3 x 1. Even if each word had an embedding dimension of 300, the output dimension would still be 3 x 1, which means the convolution greatly reduces the dimensionality.
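The arithmetic above can be sketched directly in NumPy. This is a minimal illustration, not the course's actual code: the embedding values and filter weights below are made-up numbers chosen only to show that a 4 x 3 input convolved with a width-2 filter yields a 3 x 1 output.

```python
import numpy as np

# 4 words ("I", "am", "going", "home"), each with a made-up 3-dim embedding
sentence = np.array([
    [0.2, 0.1, 0.4],   # "I"
    [0.5, 0.3, 0.1],   # "am"
    [0.9, 0.7, 0.2],   # "going"
    [0.8, 0.6, 0.3],   # "home"
])

# A filter of width 2: it spans 2 words and the full embedding length (3)
conv_filter = np.array([
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
])

width = conv_filter.shape[0]
n_words = sentence.shape[0]

# Slide the filter down the sequence one word at a time;
# each position produces a single number (elementwise product, then sum)
output = np.array([
    np.sum(sentence[i:i + width] * conv_filter)
    for i in range(n_words - width + 1)
])

print(output.shape)  # (3,) -> a 3 x 1 output from a 4 x 3 input
```

Note that the output length is `n_words - width + 1`, so the embedding dimension disappears entirely from the output, which is why a 300-dimensional embedding would still give a 3 x 1 result.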
In the example presented above, since the width of the filter is 2 (represented by the number of rows in the filter), it can process a sequence of two words at a time. At the first step, the filter processes the two words ‘I’ and ‘am’. Next, it processes ‘am’ and ‘going’, and then ‘going’ and ‘home’.
In the above figure, we compare the activations of one filter on two different sentences (‘I am going home’ and ‘He is leaving office’). Since the context of ‘going home’ and ‘leaving office’ is similar, this convolutional filter produces high activations (here 0.97 and 0.90) for the two 2-grams with similar meanings. Similarly, we can have a stack of multiple 2-gram filters to extract various aspects of meaning in a sentence.
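The idea of stacking several 2-gram filters can be sketched as follows. This is an illustrative sketch with random weights, not trained filters: each filter produces its own column of activations, and stacking them gives a feature map with one row per filter.

```python
import numpy as np

rng = np.random.default_rng(0)

sentence = rng.standard_normal((4, 3))    # 4 words, embedding dim 3
filters = rng.standard_normal((5, 2, 3))  # 5 filters, each of width 2

# Apply every filter at every valid position (4 - 2 + 1 = 3 positions)
feature_map = np.array([
    [np.sum(sentence[i:i + 2] * f) for i in range(3)]
    for f in filters
])

print(feature_map.shape)  # (5, 3): 5 filters, 3 activations each
```

In a trained network, each of the five filters would learn to respond to a different kind of 2-gram, so each row of the feature map captures a different aspect of the sentence.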
Here, we are processing two words at a time, so the width of the filter is 2. Similarly, the filter width will be 3 if we process three words at a time.
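The relationship between filter width and output length can be made explicit with a small helper. The function name `conv1d` and the random filter are hypothetical, for illustration only: a width-w filter over an n-word sentence yields n - w + 1 activations.

```python
import numpy as np

def conv1d(embeddings, filter_width, rng=np.random.default_rng(0)):
    """Slide a random filter of the given width over the word sequence."""
    n_words, emb_dim = embeddings.shape
    conv_filter = rng.standard_normal((filter_width, emb_dim))
    return np.array([
        np.sum(embeddings[i:i + filter_width] * conv_filter)
        for i in range(n_words - filter_width + 1)
    ])

sentence = np.random.default_rng(1).standard_normal((4, 3))  # 4 words

print(conv1d(sentence, 2).shape)  # width 2 -> 2-grams -> (3,)
print(conv1d(sentence, 3).shape)  # width 3 -> 3-grams -> (2,)
```

Wider filters capture longer n-grams but produce fewer output positions per sentence.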