In the previous segments, you learnt in detail how the CBOW model works. In this segment, you will understand how the Skip-Gram model works.
In the case of skip-gram, we try to predict the context words given the target word. So, the training data for the skip-gram model with a context size of 1 looks like this.
The input of CBOW becomes the output of skip-gram and vice versa. An example is given below.
| Model | X (input) | Y (output) |
|---|---|---|
| CBOW | [with, has] | upGrad |
| Skip-Gram | upGrad | [with, has] |
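To make this concrete, here is a minimal Python sketch that builds CBOW and skip-gram training samples for a context size of 1. The toy sentence is only an illustration and is not the exact corpus used in the lectures.

```python
# Sketch: build CBOW and skip-gram training samples for context size 1.
# The toy sentence below is an assumed example, chosen so that the pair
# for 'upGrad' matches the table above.
sentence = ["learning", "with", "upGrad", "has", "benefits"]
context_size = 1

cbow_samples = []       # ([context words], target word)
skipgram_samples = []   # (target word, [context words])

for i, target in enumerate(sentence):
    left = sentence[max(0, i - context_size):i]
    right = sentence[i + 1:i + 1 + context_size]
    context = left + right
    cbow_samples.append((context, target))
    skipgram_samples.append((target, context))

print(cbow_samples[2])      # (['with', 'has'], 'upGrad')
print(skipgram_samples[2])  # ('upGrad', ['with', 'has'])
```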
Remember that these training samples need to be converted to one-hot encoded (OHE) vectors before being fed into the neural network.
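As a sketch of that conversion, assuming the small five-word vocabulary from the snippet above, the one-hot encoding can be done with plain NumPy:

```python
import numpy as np

# Assumed toy vocabulary; in practice it is built from the full corpus.
vocab = ["learning", "with", "upGrad", "has", "benefits"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word, vocab_size=len(vocab)):
    """Return the one-hot vector for a word."""
    vec = np.zeros(vocab_size)
    vec[word_to_index[word]] = 1.0
    return vec

# Skip-gram sample (upGrad -> with): input and label as one-hot vectors.
x = one_hot("upGrad")   # input: target word
y = one_hot("with")     # output: one of the context words
print(x)  # [0. 0. 1. 0. 0.]
print(y)  # [0. 1. 0. 0. 0.]
```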
Now that you have learnt how to create the training samples, you will next understand the architecture and training of the skip-gram model.
The architecture of skip-gram is the same as that of the CBOW model: a shallow neural network with one hidden layer.
The input layer and the output layer each have 7 neurons, which is equal to the size of the vocabulary.
The hidden layer has a linear activation function, whereas the output layer has the softmax activation function.
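A minimal sketch of this architecture in Keras is shown below. The vocabulary size of 7 follows the lecture example, while the embedding dimension of 3 is an assumption; the course may use a different framework or a from-scratch implementation.

```python
import tensorflow as tf

vocab_size = 7        # size of the vocabulary, as in the lecture example
embedding_dim = 3     # assumed size of the hidden (embedding) layer

# Shallow network: one linear hidden layer, softmax output over the vocabulary.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(vocab_size,)),
    tf.keras.layers.Dense(embedding_dim, activation="linear",
                          use_bias=False, name="embedding_layer"),
    tf.keras.layers.Dense(vocab_size, activation="softmax",
                          use_bias=False, name="output_layer"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy")
model.summary()
```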
The difference appears when you perform the forward pass and backpropagation.
The main difference is in the output of the first (hidden) layer: in CBOW, you take the average of the projections of the context words, whereas in skip-gram, no averaging is needed because only one input word is present.
The output of the first layer in CBOW is
$$h = \frac{1}{C}\sum_{c=1}^{C} W^{T}x_c,$$
where $C$ is the number of context words, $x_c$ is the one-hot vector of the $c^{th}$ context word and $W$ is the weight matrix between the input and the hidden layer, whereas in skip-gram, it is simply $h = W^{T}x$ for the single input word $x$.
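The following NumPy sketch shows this difference in the hidden-layer output. The weight matrix is randomly initialised purely for illustration, and the word indices for 'with', 'has' and 'upGrad' are assumed.

```python
import numpy as np

vocab_size, embedding_dim = 7, 3          # dimensions as assumed above
rng = np.random.default_rng(42)
W = rng.standard_normal((vocab_size, embedding_dim))  # input-to-hidden weights

# One-hot vectors for the two context words and the target word (assumed indices).
x_with, x_has, x_upgrad = np.eye(vocab_size)[[1, 3, 2]]

# CBOW: hidden output is the average of the context word projections.
h_cbow = (W.T @ x_with + W.T @ x_has) / 2

# Skip-gram: hidden output is just the projection of the single target word.
h_skipgram = W.T @ x_upgrad

print(h_cbow.shape, h_skipgram.shape)  # (3,) (3,)
```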