Comprehension – Word2Vec

Word2vec is a technique used to compute word embeddings (or word vectors) using a large corpus as the training data.

Say you have a large corpus with a vocabulary of |V| = 10,000 words. The task is to create a word embedding of, say, 300 dimensions for each word (i.e. each word should be a vector of size 300 in this 300-dimensional space).

The first step is to create a distributed representation of the corpus using a technique such as skip-gram, where each word is used to predict its neighbouring ‘context words’. Let’s assume that you have used some k-skip-n-grams, as sketched below.
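For instance, here is a minimal sketch (in Python) of how (target, context) pairs could be generated for skip-gram; the toy sentence and the window size of 2 are illustrative assumptions, not values from the text above.

```python
# Toy corpus: one made-up sentence standing in for a large training corpus.
corpus = [["ants", "bite", "and", "walk", "on", "soil"]]

def skipgram_pairs(sentence, window=2):
    """Generate (target, context) pairs within a fixed window around each word."""
    pairs = []
    for i, target in enumerate(sentence):
        # every word within `window` positions of the target is a context word
        for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
            if j != i:
                pairs.append((target, sentence[j]))
    return pairs

print(skipgram_pairs(corpus[0])[:4])
# [('ants', 'bite'), ('ants', 'and'), ('bite', 'ants'), ('bite', 'and')]
```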

The model (a neural network, shown below) has the task of learning to predict the context words correctly for each input word. The input to the network is a one-hot encoded vector representing one word. For example, the figure below shows the input vector for the word ‘ants’.
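As a rough illustration, a one-hot vector can be built as follows; the five-word vocabulary here is a made-up stand-in for the 10,000-word vocabulary.

```python
import numpy as np

# Toy vocabulary and index mapping (illustrative assumptions).
vocab = ["ants", "bite", "walk", "car", "tree"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word, vocab_size):
    vec = np.zeros(vocab_size)
    vec[word_to_index[word]] = 1.0   # 1 at the word's index, 0 everywhere else
    return vec

print(one_hot("ants", len(vocab)))   # [1. 0. 0. 0. 0.]
```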

The hidden layer is a layer of neurons – in this case, 300 neurons. Each of the 10,000 elements in the input vector is connected to each of the 300 neurons (though only three connections are shown above). Each of these 10,000 x 300 connections has a weight associated with it. This matrix of weights is of size 10,000 x 300, where each row represents a word vector of size 300.
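A small numpy sketch of this step: multiplying a one-hot input vector by the 10,000 x 300 weight matrix simply selects the row of the matrix corresponding to that word (the word index 42 below is arbitrary, and the random matrix stands in for learned weights).

```python
import numpy as np

vocab_size, embedding_dim = 10_000, 300            # sizes from the text
W = np.random.rand(vocab_size, embedding_dim)      # hidden-layer weight matrix

x = np.zeros(vocab_size)                           # one-hot vector for a word
x[42] = 1.0                                        # at a hypothetical index 42

hidden = x @ W                                     # forward pass to the hidden layer
assert np.allclose(hidden, W[42])                  # equals row 42: that word's 300-d vector
```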

The output of the network is a vector of size 10,000. Each element of this 10,000-vector represents the probability of an output context word for the given (one-hot) input word.

For example, if the context words for the word ‘ants’ are ‘bite’ and ‘walk’, the elements corresponding to these two words should have much higher probabilities than those of the other words. This layer is called the softmax layer since it uses the softmax function to convert the raw scores for the discrete classes (words) into a probability distribution over those classes.
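A minimal sketch of the softmax function itself; the scores below are made-up numbers, not actual network outputs.

```python
import numpy as np

def softmax(scores):
    # subtract the max for numerical stability, then normalise to probabilities
    exps = np.exp(scores - np.max(scores))
    return exps / exps.sum()

scores = np.array([2.0, 1.0, 0.1])     # illustrative raw output scores
probs = softmax(scores)
print(probs, probs.sum())               # probabilities that sum to 1
```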

The cost function of the network is thus the difference between the ideal output (high probabilities for ‘bite’ and ‘walk’) and the actual output produced with the current set of weights.
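For instance, with a cross-entropy style cost (a common choice for softmax outputs, used here only as an illustration; the probabilities are made up), the cost is small when the true context word receives a high predicted probability.

```python
import numpy as np

predicted = np.array([0.7, 0.2, 0.1])    # network output after softmax (illustrative)
true_index = 0                           # index of the actual context word

loss = -np.log(predicted[true_index])    # small when the true word gets high probability
print(loss)                              # ~0.357
```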

The training task is to learn the weights such that the output of the network is as close to the expected output as possible. Once trained (using some optimisation routine such as gradient descent), the 10,000 x 300 weights of the network represent the word embeddings – each of the 10,000 words having an embedding/vector of size 300.
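In practice, you rarely implement this network by hand. As one possible sketch (the gensim library is not mentioned above and is used here purely as an illustration, with a tiny made-up corpus), training skip-gram word2vec takes only a few lines:

```python
from gensim.models import Word2Vec

# Toy corpus; in practice this would be millions of tokenised sentences.
sentences = [
    ["ants", "bite", "and", "walk", "on", "soil"],
    ["ants", "walk", "in", "long", "trails"],
]

# sg=1 selects the skip-gram architecture; vector_size=300 matches the text above.
model = Word2Vec(sentences, vector_size=300, window=2, sg=1, min_count=1)

vector = model.wv["ants"]    # the learned 300-dimensional embedding for 'ants'
print(vector.shape)          # (300,)
```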


The neural network mentioned above is informally called ‘shallow’ because it has only one hidden layer, though one can increase the number of such layers. Such a shallow network architecture was used by Mikolov et al. to train word embeddings on about 1.6 billion words, which became popularly known as Word2Vec.

Other Word Embeddings

After the widespread success of Word2Vec, several other (and perhaps more effective) word embedding techniques have been developed by various teams. One of the most popular is GloVe (Global Vectors for Word Representation), developed by a Stanford research group.

Another recently developed and fast-growing word embedding library is fastText, developed by Facebook AI Research (FAIR). Apart from English word embeddings, it contains pre-trained embeddings for about 157 languages (including Hindi, Marathi, Goan, Malayalam and Tamil).

These embeddings are trained on corpora containing billions of tokens and, thankfully, are available as pre-trained word vectors ready to use for text applications (for free!). You will learn to use pre-trained word vectors for text processing tasks such as classification and clustering in the upcoming sessions.
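For example, one possible way to fetch and query pre-trained vectors is gensim’s downloader module (an assumption for illustration; the model name below is just one of the identifiers it exposes, not something prescribed by the text above):

```python
import gensim.downloader as api

# Downloads a small pre-trained GloVe model on first use and loads it as KeyedVectors.
wv = api.load("glove-wiki-gigaword-50")

print(wv["computer"].shape)                  # (50,) – a 50-dimensional word vector
print(wv.most_similar("computer", topn=3))   # words closest to 'computer' in the vector space
```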

You can read more about Word2Vec, GloVe and fastText in the additional readings provided below.

Additional Readings

  • Word2Vec: The original word2vec paper by Mikolov et al.
  • GloVe vectors: Homepage (contains downloadable trained GloVe vectors, training methodology, etc.).
  • fastText: Word embeddings developed by FAIR for multiple languages, available for download here.
