In the previous segments, you learnt in detail how the CBOW model works. In this segment, you will understand how the Skip-Gram model works.
In the case of skip-gram, we try to predict the context words given the target word. So, the training data for the skip-gram model with a context size of 1 looks like this.
The input of CBOW becomes the output of skip-gram and vice versa. An example is given below.
| Model | X (input) | Y (output) |
|---|---|---|
| CBOW | [with, has] | upGrad |
| Skip-Gram | upGrad | [with, has] |
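To make this concrete, here is a minimal Python sketch that builds CBOW and skip-gram training samples for a context size of 1. The toy sentence is only an illustration and is not the exact corpus used in the lectures.

```python
# Sketch: build CBOW and skip-gram training samples for context size 1.
# The toy sentence below is an assumed example, chosen so that the pair
# for 'upGrad' matches the table above.
sentence = ["learning", "with", "upGrad", "has", "benefits"]
context_size = 1

cbow_samples = []       # ([context words], target word)
skipgram_samples = []   # (target word, [context words])

for i, target in enumerate(sentence):
    left = sentence[max(0, i - context_size):i]
    right = sentence[i + 1:i + 1 + context_size]
    context = left + right
    cbow_samples.append((context, target))
    skipgram_samples.append((target, context))

print(cbow_samples[2])      # (['with', 'has'], 'upGrad')
print(skipgram_samples[2])  # ('upGrad', ['with', 'has'])
```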
Remember that these training samples need to be converted to one-hot encoded (OHE) vectors before being fed into the neural network.
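As a sketch of that conversion, assuming the small five-word vocabulary from the snippet above, the one-hot encoding can be done with plain NumPy:

```python
import numpy as np

# Assumed toy vocabulary; in practice it is built from the full corpus.
vocab = ["learning", "with", "upGrad", "has", "benefits"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word, vocab_size=len(vocab)):
    """Return the one-hot vector for a word."""
    vec = np.zeros(vocab_size)
    vec[word_to_index[word]] = 1.0
    return vec

# Skip-gram sample (upGrad -> with): input and label as one-hot vectors.
x = one_hot("upGrad")   # input: target word
y = one_hot("with")     # output: one of the context words
print(x)  # [0. 0. 1. 0. 0.]
print(y)  # [0. 1. 0. 0. 0.]
```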
Now that you have learnt how to create the training samples, you will next understand the architecture and training of the skip-gram model.
The architecture of skip-gram is the same as that of the CBOW model: a shallow neural network with one hidden layer.
The input layer and the output layer each have 7 neurons, which is equal to the size of the vocabulary.
The hidden layer has a linear activation function, whereas the output layer has the softmax activation function.
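A minimal sketch of this architecture in Keras is shown below. The vocabulary size of 7 follows the lecture example, while the embedding dimension of 3 is an assumption; the course may use a different framework or a from-scratch implementation.

```python
import tensorflow as tf

vocab_size = 7        # size of the vocabulary, as in the lecture example
embedding_dim = 3     # assumed size of the hidden (embedding) layer

# Shallow network: one linear hidden layer, softmax output over the vocabulary.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(vocab_size,)),
    tf.keras.layers.Dense(embedding_dim, activation="linear",
                          use_bias=False, name="embedding_layer"),
    tf.keras.layers.Dense(vocab_size, activation="softmax",
                          use_bias=False, name="output_layer"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy")
model.summary()
```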
The difference appears when you perform the forward pass and backpropagation.
The main difference is in the output of the first (hidden) layer: in CBOW, you take the average of the projections of the context words, whereas in skip-gram, no averaging is needed because only one input word is present.
The output of the first layer in CBOW is
$$h = \frac{1}{C}\sum_{c=1}^{C} W^{T}x_c,$$
where $C$ is the number of context words, $x_c$ is the one-hot vector of the $c^{th}$ context word and $W$ is the weight matrix between the input and the hidden layer, whereas in skip-gram, it is simply $h = W^{T}x$ for the single input word $x$.
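The following NumPy sketch shows this difference in the hidden-layer output. The weight matrix is randomly initialised purely for illustration, and the word indices for 'with', 'has' and 'upGrad' are assumed.

```python
import numpy as np

vocab_size, embedding_dim = 7, 3          # dimensions as assumed above
rng = np.random.default_rng(42)
W = rng.standard_normal((vocab_size, embedding_dim))  # input-to-hidden weights

# One-hot vectors for the two context words and the target word (assumed indices).
x_with, x_has, x_upgrad = np.eye(vocab_size)[[1, 3, 2]]

# CBOW: hidden output is the average of the context word projections.
h_cbow = (W.T @ x_with + W.T @ x_has) / 2

# Skip-gram: hidden output is just the projection of the single target word.
h_skipgram = W.T @ x_upgrad

print(h_cbow.shape, h_skipgram.shape)  # (3,) (3,)
```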