In the previous segment, you learnt how to tokenise and pad the input text data. Let’s convert these tokens into word embeddings.
In the previous session, we used the Embedding layer from Keras to convert the text into word vectors. Here, we have used our own model, trained on the Wikipedia data on countries, to convert the tokens into word embeddings.
Please keep the countries.wiki model that we trained in the previous session in the same folder as the Python file.
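The sketch below shows one way this step could look. It assumes the model from the previous session was saved with gensim's Word2Vec under the file name countries.wiki and that the fitted Keras tokenizer from the previous segment is available as tokenizer; the variable names are illustrative, not the exact ones used in the session.

```python
# A minimal sketch (assumed names): load the Word2Vec model trained on the
# countries Wikipedia pages and build an embedding matrix for our vocabulary.
import numpy as np
from gensim.models import Word2Vec

# Assumed file name from the previous session; keep it next to this file.
w2v_model = Word2Vec.load("countries.wiki")

embedding_dim = w2v_model.wv.vector_size
vocab_size = len(tokenizer.word_index) + 1   # 'tokenizer' was fitted in the previous segment

# Row i of the matrix holds the pre-trained vector for the word with index i
embedding_matrix = np.zeros((vocab_size, embedding_dim))
for word, idx in tokenizer.word_index.items():
    if word in w2v_model.wv:                 # copy the vector only if the word is in the model
        embedding_matrix[idx] = w2v_model.wv[word]
```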
After getting the word embeddings for the tokens, we created a model using the Sequential API from Keras. We used a GlobalAveragePooling1D layer in this model to convert the matrix for each sentence into a single vector per sentence. Now, let’s train this model and view the results.
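A minimal sketch of such a model is given below. It assumes the embedding matrix built above and the padded sequences and labels (X_train, y_train, X_test, y_test) from the earlier segments; the layer sizes and number of epochs are illustrative choices, not the exact values used in the session.

```python
# A minimal sketch of the Sequential model described above (hyperparameters assumed).
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, GlobalAveragePooling1D, Dense
from tensorflow.keras.initializers import Constant

model = Sequential([
    # Embedding layer initialised with the pre-trained vectors and kept frozen
    Embedding(input_dim=vocab_size,
              output_dim=embedding_dim,
              embeddings_initializer=Constant(embedding_matrix),
              trainable=False),
    # Average the word vectors so each review becomes a single fixed-size vector
    GlobalAveragePooling1D(),
    Dense(16, activation='relu'),
    Dense(1, activation='sigmoid')          # binary sentiment label for IMDB reviews
])

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(X_train, y_train, epochs=10, validation_data=(X_test, y_test))
```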
We used the countries.wiki model, which was trained on the Wikipedia pages on countries, to create the word vectors, and the accuracy obtained was approximately 75%. The problem was that these embeddings were trained on Wikipedia pages about countries but applied to IMDB movie reviews; this mismatch limited the accuracy. Now, let’s check whether we can solve this problem.
Download the GloVe word embeddings: Wikipedia 2014 + Gigaword 5 (6B tokens, 400K vocab, uncased, 300d vectors, 822 MB download), glove.6B.zip, linked from https://github.com/stanfordnlp/GloVe
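One way to load these vectors is sketched below, assuming the 300-dimensional file glove.6B.300d.txt has been extracted from the zip into the working folder, and that tokenizer and vocab_size are the same objects as in the earlier sketch.

```python
# A minimal sketch (assumed file name glove.6B.300d.txt): read the GloVe vectors
# and rebuild the embedding matrix for our vocabulary using them.
import numpy as np

embedding_dim = 300
glove_index = {}
with open("glove.6B.300d.txt", encoding="utf-8") as f:
    for line in f:
        parts = line.split()
        word, vector = parts[0], np.asarray(parts[1:], dtype="float32")
        glove_index[word] = vector

embedding_matrix = np.zeros((vocab_size, embedding_dim))   # 'vocab_size' as before
for word, idx in tokenizer.word_index.items():
    vector = glove_index.get(word)
    if vector is not None:                                 # keep zeros for unknown words
        embedding_matrix[idx] = vector
```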
GloVe (Global Vectors for Word Representation) vectors are trained on the entire Wikipedia data (together with the Gigaword 5 corpus). With these embeddings, the accuracy improves to approximately 85%.
In the graph given below, you can observe how the accuracy increases when the word vectors are trained on the entire Wikipedia data.
To summarise, we converted the IMDB text data into word vectors using two pre-trained models: the countries Wikipedia embeddings and the GloVe vectors. The classification accuracy is higher with the GloVe vectors because they are trained on the entire Wikipedia data.