In the previous segment, you learnt to train word vectors using a corpus of your choice and did some basic exploratory analysis on them.
In this segment, we will experiment with the parameters of word embeddings, such as the length (dimensionality) of the word vectors and the training architecture used to learn them (skip-gram or continuous bag-of-words, i.e. CBOW), and see how the ‘quality’ of the word vectors is affected by them.
Length of Word Embedding
Let’s first see how the length (dimensionality) of word vectors affects the output of a word2vec model.
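As a concrete illustration, here is a minimal sketch of such an experiment, assuming gensim 4.x; the toy corpus below is purely illustrative and should be replaced with the tokenised corpus you built in the previous segment:

```python
from gensim.models import Word2Vec

# Toy corpus, purely illustrative; in practice, pass the tokenised
# corpus (a list of token lists) from the previous segment.
sentences = [
    ["the", "river", "was", "meandering", "through", "the", "valley"],
    ["the", "stream", "was", "winding", "through", "the", "hills"],
    ["the", "river", "flowed", "into", "the", "sea"],
]

# Two models that differ only in the length of the word vectors.
# min_count=1 is used only because the toy corpus is tiny.
model_small = Word2Vec(sentences, vector_size=50, window=5, min_count=1, seed=42)
model_large = Word2Vec(sentences, vector_size=300, window=5, min_count=1, seed=42)

# Compare the nearest neighbours of the same word under both lengths.
print(model_small.wv.most_similar("river", topn=3))
print(model_large.wv.most_similar("river", topn=3))
```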
So, you saw that a higher vector length captures more subtleties. Note that in the absence of quantitative metrics to measure the ‘goodness’ of word vectors, we are measuring ‘goodness’ purely qualitatively by observing a few instances.
Skip-gram and CBOW
Apart from the skip-gram model, there is another model that can be used to learn word embeddings: the Continuous Bag-of-Words (CBOW) model.
Recall that skip-gram takes the target/given word as the input and predicts the context words (in the window), whereas CBOW takes the context terms as the input and predicts the target/given term.
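In gensim (assuming version 4.x), switching between the two architectures is controlled by a single parameter, sg; here is a minimal sketch:

```python
from gensim.models import Word2Vec

# Same illustrative corpus as before; replace with your own.
sentences = [
    ["the", "river", "was", "meandering", "through", "the", "valley"],
    ["the", "stream", "was", "winding", "through", "the", "hills"],
]

# sg=1 -> skip-gram: the target word predicts its context words.
skipgram_model = Word2Vec(sentences, sg=1, vector_size=100, window=5, min_count=1)

# sg=0 (the default) -> CBOW: the context words predict the target word.
cbow_model = Word2Vec(sentences, sg=0, vector_size=100, window=5, min_count=1)
```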
You saw that word embeddings trained using skip-gram are slightly ‘better’ than those trained using CBOW for less frequent words (such as ‘meandering’). By ‘better’, we simply mean that the words found to be similar to an infrequent word such as ‘meandering’ will themselves be infrequent words.
You also learnt another important (though subtle) concept – the effect of window size on what the embeddings learn. When you choose a small window size (such as 5), ‘similar’ words will often be synonyms, whereas with a larger window size (such as 10), ‘similar’ words will be contextually related words, i.e. words that tend to appear in the same topics.
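To observe this effect yourself, vary only the window parameter and keep everything else fixed; a minimal sketch, under the same assumptions as before:

```python
from gensim.models import Word2Vec

# Same illustrative corpus; replace with your own tokenised corpus.
sentences = [
    ["the", "river", "was", "meandering", "through", "the", "valley"],
    ["the", "stream", "was", "winding", "through", "the", "hills"],
]

# Small window: neighbours tend to be substitutable words (synonyms).
model_w5 = Word2Vec(sentences, sg=1, vector_size=100, window=5, min_count=1)

# Large window: neighbours tend to be topically related words.
model_w10 = Word2Vec(sentences, sg=1, vector_size=100, window=10, min_count=1)

print(model_w5.wv.most_similar("river", topn=3))
print(model_w10.wv.most_similar("river", topn=3))
```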
Effect of Context (Training Corpus) – Optional
The training corpus also has a significant effect on what the embeddings learn, as you would expect. For example, word embeddings trained on a corpus of movie reviews and those trained on financial documents would learn very different vocabularies and semantic associations.
In the following optional lecture, you will compare word embeddings trained on the Brown corpus and the movie reviews corpus from NLTK.
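As a preview of that comparison, here is a minimal sketch, assuming NLTK and gensim are installed (the two corpora are downloaded on first use):

```python
import nltk
from nltk.corpus import brown, movie_reviews
from gensim.models import Word2Vec

nltk.download("brown")
nltk.download("movie_reviews")

# Train one model per corpus, keeping all other hyperparameters identical.
model_brown = Word2Vec(brown.sents(), vector_size=100, window=5, min_count=5)
model_movies = Word2Vec(movie_reviews.sents(), vector_size=100, window=5, min_count=5)

# The same query word picks up corpus-specific associations.
print(model_brown.wv.most_similar("good", topn=5))
print(model_movies.wv.most_similar("good", topn=5))
```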
GloVe Embeddings
We had mentioned earlier that apart from Word2Vec, several other word embeddings have been developed by various teams. One of the most popular is GloVe (Global Vectors for Word Representation), developed by a Stanford research group. These embeddings are trained on a corpus of about 6 billion tokens and are available as pre-trained word vectors ready to use in text applications.
Let’s now learn to use the pre-trained GloVe embeddings.
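One convenient way to do this is via gensim’s downloader API; the sketch below assumes an internet connection and uses the pre-packaged ‘glove-wiki-gigaword-100’ vectors (100-dimensional, trained on the 6-billion-token Wikipedia + Gigaword corpus):

```python
import gensim.downloader as api

# Downloads (once) and loads 100-dimensional GloVe vectors trained on
# Wikipedia + Gigaword; returns a gensim KeyedVectors object.
glove = api.load("glove-wiki-gigaword-100")

# Pre-trained GloVe vectors are queried just like word2vec vectors.
print(glove.most_similar("river", topn=5))
print(glove.similarity("river", "stream"))

# The classic analogy: king - man + woman is closest to queen.
print(glove.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```

If you have downloaded the Stanford release files instead (such as glove.6B.100d.txt), gensim 4.x can load them directly with KeyedVectors.load_word2vec_format(path, binary=False, no_header=True).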