In the previous segment, you learnt how to prepare training data for the CBOW model. In the next video, you will gain an understanding of the architecture of the neural network.
The architecture of CBOW is shown below. The network is a shallow neural network with one hidden layer.
The input layer and the output layer have 7 neurons each, which is equal to the size of the vocabulary. You decide the number of neurons in the hidden layer according to the number of dimensions that you want in your word embeddings.
In our case, the hidden layer has a linear activation function, whereas the output layer has the softmax activation function.
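To make the shapes concrete, here is a minimal NumPy sketch of the two weight matrices and the output activation, assuming a vocabulary of 7 words and 3-dimensional embeddings as in this example (the variable names and random initialisation are my own, not the course's code):

```python
import numpy as np

# Shapes of the CBOW network in this example: the input and output layers have
# V = 7 neurons (the vocabulary size); the hidden layer has N = 3 neurons,
# i.e. 3-dimensional word embeddings (N is your choice; 3 is assumed here).
V, N = 7, 3

rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(V, N))  # input -> hidden weights (linear activation)
W2 = rng.normal(scale=0.1, size=(N, V))  # hidden -> output weights

def softmax(z):
    """Softmax activation used at the output layer."""
    e = np.exp(z - z.max())
    return e / e.sum()

# After training, the embedding of vocabulary word i is simply row i of W1.
```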
For a training pair ([with, has], upGrad)
input X = [with,has]
output Y = upGrad
The task of the neural network is to predict the output, given this input.
The one-hot encodings of the words 'with' and 'has' will be the inputs of the neural network, and the output of the network should be 'upGrad'.
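The one-hot encoding step can be sketched as follows. The 7-word vocabulary here is hypothetical (only 'with', 'has' and 'upGrad' appear in the source; the other words are invented for illustration, since only the word-to-index mapping matters):

```python
import numpy as np

# Hypothetical 7-word vocabulary; words other than 'with', 'has' and
# 'upGrad' are invented for illustration.
vocab = ['i', 'learn', 'nlp', 'with', 'upGrad', 'has', 'courses']
word_to_idx = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Return the one-hot encoding of `word` as a length-7 vector."""
    v = np.zeros(len(vocab))
    v[word_to_idx[word]] = 1.0
    return v

x_with = one_hot('with')      # first context input
x_has = one_hot('has')        # second context input
y_target = one_hot('upGrad')  # desired output
```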
The one-hot encodings (OHEs) of the inputs flow through the network in the forward pass in the following manner:
$$l_0[\text{out}] = \frac{x_{\text{with}} + x_{\text{has}}}{2}, \qquad l_1[\text{out}] = l_0[\text{out}] \cdot W_1$$

Here, each input OHE is a (1 x 7) vector, $W_1$ is the (7 x 3) input weight matrix, and $l_1[\text{out}]$ is a (1 x 3) vector.

Note: The output of the first hidden layer is calculated by taking the average of the input vectors, as specified in the Stanford CS224d lecture notes (https://cs224d.stanford.edu/lecture_notes/notes1.pdf) and in Chapter 6 of Jurafsky and Martin's Speech and Language Processing (https://web.stanford.edu/~jurafsky/slp3/6.pdf). Other heuristics can also be used.

The output of the neural network will not be the exact one-hot encoding vector of 'upGrad'. Since the output layer has the softmax activation function, we will train the network such that the probability of the target word is high in the output.

Now that you have learnt how an input passes through the forward pass, you will learn how the network is trained using backpropagation. The backpropagation algorithm compares the actual output with the predicted output and updates the weights so as to minimise the loss. Although you saw this for only one training sample, the weight matrices are updated after training on all the training samples that were defined earlier.

Now that the training is complete, we can predict a word given its context words. However, considering that you already knew those words, why are we doing this? In the next segment, you will gain an understanding of this.
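The forward pass and the backpropagation updates described above can be sketched end to end in plain NumPy. This is an illustrative, assumed implementation rather than the course's exact code: the vocabulary, the random initialisation, the learning rate, and the number of update steps are all invented, and 'upGrad' is placed at the fifth position to line up with the diagram.

```python
import numpy as np

# Invented 7-word vocabulary; 'upGrad' sits at index 4 (the fifth position).
vocab = ['i', 'learn', 'nlp', 'with', 'upGrad', 'has', 'courses']
idx = {w: i for i, w in enumerate(vocab)}
V, N, lr = len(vocab), 3, 0.5    # vocab size, embedding size, assumed learning rate

def one_hot(word):
    v = np.zeros(V)
    v[idx[word]] = 1.0
    return v

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(42)
W1 = rng.normal(scale=0.1, size=(V, N))    # input -> hidden weights
W2 = rng.normal(scale=0.1, size=(N, V))    # hidden -> output weights

x = (one_hot('with') + one_hot('has')) / 2  # l0[out]: average of the context OHEs
target = idx['upGrad']

for _ in range(200):                # repeated gradient steps on this single pair
    h = x @ W1                      # l1[out]: (1 x 3) hidden vector (linear activation)
    y_hat = softmax(h @ W2)         # (1 x 7) predicted probability distribution
    grad_z = y_hat.copy()           # softmax + cross-entropy gradient: y_hat - y
    grad_z[target] -= 1.0
    grad_W2 = np.outer(h, grad_z)          # gradient w.r.t. hidden -> output weights
    grad_W1 = np.outer(x, grad_z @ W2.T)   # gradient w.r.t. input -> hidden weights
    W2 -= lr * grad_W2
    W1 -= lr * grad_W1

# After training, the highest probability is at the position of 'upGrad'.
print(vocab[int(np.argmax(softmax((x @ W1) @ W2)))])
```

In practice, word2vec loops over all training pairs (and uses tricks such as negative sampling for large vocabularies); this sketch trains on one pair only to make the mechanics visible.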
As shown in the diagram given above, the probability is the highest at the fifth position, which corresponds to the target word 'upGrad'.