So far, you have loaded the GloVe word embeddings. The next steps are as follows:
- Change the words present in the headlines into integers,
- Pad the headlines to bring the news of each day to a fixed length.
While working on the POS tagging task, you saw that Keras’ Tokenizer() and pad_sequences() can accomplish the two tasks listed above. However, in this segment, you’ll learn how to do these steps in plain Python, without using Keras.
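The word-to-integer step can be sketched in plain Python as shown here. This is a minimal illustration rather than the course's implementation: the helper names `build_word_index` and `encode_headline`, the lowercasing, the whitespace tokenization, and the choice to reserve 0 for padding and unknown words are all assumptions made for the example.

```python
def build_word_index(headlines):
    """Assign each distinct word a positive integer in order of first appearance.

    Index 0 is deliberately left unused so it can serve as the padding value
    (an assumption of this sketch).
    """
    word_index = {}
    for headline in headlines:
        for word in headline.lower().split():
            if word not in word_index:
                word_index[word] = len(word_index) + 1
    return word_index


def encode_headline(headline, word_index, unknown_id=0):
    """Replace every word of a headline with its integer index.

    Words missing from the vocabulary map to unknown_id (assumed to be 0 here).
    """
    return [word_index.get(word, unknown_id) for word in headline.lower().split()]


# Hypothetical sample headlines, used only to demonstrate the helpers.
sample_headlines = [
    "Markets rally as oil prices fall",
    "Oil prices fall again",
]
word_index = build_word_index(sample_headlines)
encoded = [encode_headline(h, word_index) for h in sample_headlines]
print(encoded)  # [[1, 2, 3, 4, 5, 6], [4, 5, 6, 7]]
```

In practice, the vocabulary would likely be aligned with the words available in the loaded GloVe embeddings, so that each integer can later be used to look up an embedding vector.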
For each day, there are multiple headlines, and each headline has a variable length. Therefore, the padding task is two-fold:
- A headline with 16 words or fewer is appended as it is. A headline longer than 16 words is truncated from the right, so that only its first 16 words are appended.
- Combine all the headlines of a day into a single sequence and pad it to a length of 200 words if it is shorter than 200 words. If it is longer than 200 words, truncate it from the right side.
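The two-fold padding described above can be sketched in plain Python as follows. The function name `prepare_day` and the constants are hypothetical, the headlines are assumed to be already converted to lists of integers, and the text does not say on which side the padding goes or which value is used; this sketch assumes zeros appended on the right.

```python
MAX_HEADLINE_LEN = 16   # words kept per headline
MAX_DAILY_LEN = 200     # fixed length of one day's combined news
PAD_ID = 0              # assumed padding value


def prepare_day(encoded_headlines,
                max_headline_len=MAX_HEADLINE_LEN,
                max_daily_len=MAX_DAILY_LEN,
                pad_id=PAD_ID):
    """Turn one day's integer-encoded headlines into a fixed-length sequence."""
    daily = []
    for headline in encoded_headlines:
        # Step 1: keep at most the first max_headline_len words of each headline.
        daily.extend(headline[:max_headline_len])

    # Step 2: truncate the combined sequence from the right if it is too long...
    daily = daily[:max_daily_len]
    # ...and pad it on the right (an assumption) if it is too short.
    daily += [pad_id] * (max_daily_len - len(daily))
    return daily


# Hypothetical day with one 20-word headline and one 3-word headline.
day = [list(range(1, 21)), [5, 6, 7]]
padded = prepare_day(day)
print(len(padded))    # 200
print(padded[:19])    # first 16 words of headline 1, then all of headline 2
```

Slicing with `[:n]` returns the list unchanged when it has `n` items or fewer, so short headlines are appended as they are without a separate length check.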
With the above preprocessing, the data is ready to be fed into the CNN-RNN model. In the next segment, you’ll learn how a 1D convolution operation works over text, followed by how to train a CNN-RNN model in Keras.