
Text Pre-Processing - Part 1

Now that you have understood how the algorithm works, you will learn how to code it. The case study that will be covered at the end of this session is an IMDB movie review classification. The key steps involved in the case study are as follows:

1. Extract word vectors using a corpus of text

2. Feed these word vectors into a neural network to classify the reviews according to their sentiment

To implement the second step, you need to understand the concepts of the Embedding layer and Global Average Pooling 1D. Hence, before we dive into the case study, let's understand these concepts in detail.

Note: The case study itself will be covered in detail in the upcoming segments.

After getting the word vectors from the corpus, we need to preprocess them before feeding them into the neural network. Let's understand this in detail.

Before you start with the code, please install the following libraries:

1. tensorflow

2. numpy

3. tsensor (https://pypi.org/project/tensor-sensor/)

You can download the code files from here. Refer to the README.md in the link for instructions on downloading the code files.

We created tokens for the words in the sentence using the Tokenizer() function. After tokenising the words, we need to convert each token into its vector. In the next video, Jaidev will demonstrate this.
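For illustration, here is a minimal sketch of the tokenisation step. The sentence text, and therefore the exact token ids printed, are assumptions made for this example; the case study works with IMDB reviews instead.

Python
from tensorflow.keras.preprocessing.text import Tokenizer

# One illustrative 9-word sentence standing in for a review
sentence = ["the movie was great and the acting was brilliant"]

# Build the vocabulary and assign an integer id to every word
tokenizer = Tokenizer()
tokenizer.fit_on_texts(sentence)

# Convert the sentence into its sequence of token ids
tokens = tokenizer.texts_to_sequences(sentence)
print(tokenizer.word_index)  # e.g. {'the': 1, 'was': 2, 'movie': 3, ...}
print(tokens)                # one list of 9 integers, i.e. shape (1, 9)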

Here, we have used the Embedding layer from the Keras library to create the word embedding for each token.

When we feed the vector of size (1, 9) into the embedding layer, we get an output tensor of size (1, 9, 128).
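As a rough sketch, these shapes can be checked as follows. The token ids and the small vocabulary size of 10 are assumptions for this example; the embedding dimension of 128 matches the one discussed above.

Python
import numpy as np
import tensorflow as tf

# Token ids for one 9-word sentence, shape (1, 9); the values are illustrative
token_ids = np.array([[1, 3, 2, 4, 5, 1, 6, 2, 7]])

# Embedding layer: an assumed vocabulary of 10 tokens,
# each mapped to a 128-dimensional vector
embedding = tf.keras.layers.Embedding(input_dim=10, output_dim=128)
embedded = embedding(token_ids)

print(embedded.shape)  # (1, 9, 128): one 128-dim vector per token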

We need to feed this into our neural network architecture, but a matrix of per-token vectors cannot be fed in directly; it first has to be reduced to a single vector per sentence. In the next video, you will learn how this is achieved.

We have already converted the words into a matrix of embeddings; we now form a single vector from it using:

Python
GlobalAveragePooling1D()

Here, we average across tokens and get one vector for the whole sentence rather than a matrix: the first elements of all the word vectors are averaged, then the second elements, and so on for each of the 128 dimensions. The result is a single vector for the whole sentence.

We have a tensor of shape (1, 9, 128) that will be converted into (1, 128).

The transition goes from (1,9) to (1,9,128) to (1,128).
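A small sketch of this step, with a randomly generated (1, 9, 128) tensor standing in for the embedded sentence, might look as follows:

Python
import numpy as np
import tensorflow as tf

# Random stand-in for the embedded sentence: 1 sentence, 9 tokens, 128 dimensions
embedded = np.random.rand(1, 9, 128).astype("float32")

# GlobalAveragePooling1D averages across the token axis (axis 1)
pooled = tf.keras.layers.GlobalAveragePooling1D()(embedded)
print(pooled.shape)  # (1, 128): one vector for the whole sentence

# The same result can be obtained by taking the mean over the token dimension
manual = tf.reduce_mean(embedded, axis=1)
print(np.allclose(pooled.numpy(), manual.numpy()))  # True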

This vector can then be fed into neural networks for further processing.
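To make this direction concrete, here is a minimal, illustrative sketch of how such a network might be assembled in Keras. The vocabulary size of 10,000, the sequence length of 9, the single sigmoid output, and the optimizer are assumptions for this sketch; the actual architecture used in the case study is covered in the upcoming segments.

Python
import tensorflow as tf

# Illustrative model: embed tokens, average them into one vector per review,
# then classify the sentiment with a single sigmoid unit
model = tf.keras.Sequential([
    tf.keras.Input(shape=(9,)),                                   # padded token ids
    tf.keras.layers.Embedding(input_dim=10000, output_dim=128),   # -> (batch, 9, 128)
    tf.keras.layers.GlobalAveragePooling1D(),                     # -> (batch, 128)
    tf.keras.layers.Dense(1, activation="sigmoid"),               # -> (batch, 1)
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()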

We just went through an example with a single sentence, but in reality, a corpus contains many sentences. In the next segment, you will learn how to preprocess multiple sentences.
