Let us continue with the text pre-processing for a larger corpus.
We use the same tokenizer that we defined earlier, but our corpus has changed:
‘The quick brown fox jumps over the lazy dog.’,
‘The quick brown fox.’,
‘The lazy dog.’,
‘The dog.’,
‘Dog and the fox.’,
‘Hello, world!’
Here, we tokenised each of the sentences. Words that are not present in the tokenizer's vocabulary, which was built from the sentence 'The quick brown fox jumps over the lazy dog', are assigned the number 0.
After that, padding is applied to these sequences using the pad_sequences function. Zeros are prepended to each shorter sequence so that every sequence has the maximum sequence length (9, the length of the longest sentence).
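The two steps above can be sketched in plain Python. This is a minimal hand-rolled illustration of what the Keras Tokenizer and pad_sequences do here; the regex-based word splitting and the 0-for-unknown-words convention are assumptions that mirror the behaviour described in the text.

```python
import re

# The six sentences of the larger corpus, as listed above.
corpus = [
    "The quick brown fox jumps over the lazy dog.",
    "The quick brown fox.",
    "The lazy dog.",
    "The dog.",
    "Dog and the fox.",
    "Hello, world!",
]

def words(text):
    # Lowercase and strip punctuation (an assumption mirroring
    # Keras's default text filters).
    return re.findall(r"[a-z]+", text.lower())

# Vocabulary built from the first sentence only; indices start at 1
# because 0 is reserved for padding and unknown words.
vocab = {}
for w in words(corpus[0]):
    if w not in vocab:
        vocab[w] = len(vocab) + 1

# Tokenise: unknown words map to 0.
sequences = [[vocab.get(w, 0) for w in words(s)] for s in corpus]

# Pre-pad with zeros so every row has the maximum length (9 here),
# giving a matrix of size (6, 9).
max_len = max(len(seq) for seq in sequences)
padded = [[0] * (max_len - len(seq)) + seq for seq in sequences]
```

For example, 'Hello, world!' contains no known words, so its row is all zeros, while 'The quick brown fox.' becomes five leading zeros followed by the four token indices.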
After creating a matrix of size (6,9), let’s create word embeddings for all the tokens present in the matrix in the following video.
We pass this matrix of tokens through the embedding layer, which creates a 128-dimensional word vector for each token. We get a tensor of size (6, 9, 128).
We cannot feed this three-dimensional tensor directly into the dense layers of the neural network; hence, we perform global average pooling on it.
Global average pooling averages the token vectors within each sentence and gives an output matrix of size (6, 128).
Each sentence is represented with a vector of size 128.
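The embedding-and-pooling step can be sketched with NumPy. In Keras this would be a trainable Embedding layer followed by GlobalAveragePooling1D; here the embedding matrix is randomly initialised purely for illustration, and the (6, 9) token matrix is the padded output assumed from the previous step.

```python
import numpy as np

vocab_size, embed_dim = 9, 128   # token indices 0..8; 128-dimensional vectors
rng = np.random.default_rng(0)
embedding = rng.normal(size=(vocab_size, embed_dim))  # stand-in for a trained layer

# The (6, 9) matrix of padded token indices from the previous step.
padded = np.array([
    [1, 2, 3, 4, 5, 6, 1, 7, 8],
    [0, 0, 0, 0, 0, 1, 2, 3, 4],
    [0, 0, 0, 0, 0, 0, 1, 7, 8],
    [0, 0, 0, 0, 0, 0, 0, 1, 8],
    [0, 0, 0, 0, 0, 8, 0, 1, 4],
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
])

# Embedding lookup: one 128-dimensional vector per token.
vectors = embedding[padded]      # shape (6, 9, 128)

# Global average pooling: average over the token axis,
# one 128-dimensional vector per sentence.
pooled = vectors.mean(axis=1)    # shape (6, 128)
```

Averaging over axis 1 collapses the nine token positions of each sentence into a single vector, which is what lets a dense classifier consume the result.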
To summarise, for our classification problem, we want to feed a review as an input to the neural network. Each review is represented by a matrix of token vectors that needs to be pooled into a single vector.
The next segment will cover the case study of the IMDB movie review classification.