
Semantic Processing - Topic Modelling

Industry Applications of Distributional Semantics

Architecture for Binary Classification

In the previous module, you learnt how traditional algorithms such as skip-gram and CBOW work. In this session, you will learn about another heuristic of the word2vec models, one that reduces the computational load.

Binary classification is used to decrease the computational load. In the previous session, our text corpus had a vocabulary of size 7. In reality, a text corpus can have a huge vocabulary of approximately half a million words.

The following table shows the approximate number of unique words in different languages.

Language    Approximate number of words    Dictionary
English     470,000                        Webster
Dutch       400,000                        Woordenboek
Tamil       380,000                        Sorkuvai
Chinese     378,103                        Hanyu Da Cidian

For a text corpus with a vocabulary of half a million words, the output layer has to compute the softmax function over half a million neurons.
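To see why this is expensive, here is a minimal NumPy sketch (V = 500,000 is an assumed, illustrative vocabulary size): every one of the V output scores must be exponentiated and normalised, and this for every single training example.

```python
import numpy as np

# Illustrative sketch: a full softmax over a half-million-word vocabulary.
V = 500_000                       # assumed vocabulary size
logits = np.random.randn(V)       # raw scores from the output layer

# A numerically stable softmax touches all V scores,
# and this has to be repeated for every training example.
exp_scores = np.exp(logits - logits.max())
probabilities = exp_scores / exp_scores.sum()

print(probabilities.shape)        # (500000,)
```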

Let’s consider an example sentence: ‘An aardvark ate a zyzzyva.’ The one-hot encoding (OHE) of this sentence is as follows:
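Here is a minimal Python sketch of this encoding. For readability, the vocabulary is restricted to the five unique words of the sentence; in a real corpus, each one-hot vector would have V ≈ 500,000 dimensions.

```python
import numpy as np

# Sketch: one-hot encoding of 'An aardvark ate a zyzzyva.'
# using only the sentence's own five-word vocabulary.
sentence = "an aardvark ate a zyzzyva".split()
vocab = sorted(set(sentence))      # ['a', 'aardvark', 'an', 'ate', 'zyzzyva']
index = {word: i for i, word in enumerate(vocab)}

one_hot = np.zeros((len(sentence), len(vocab)), dtype=int)
for row, word in enumerate(sentence):
    one_hot[row, index[word]] = 1

for word, vector in zip(sentence, one_hot):
    print(f"{word:>8}: {vector}")
```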

The vocabulary size V is approximately half a million; hence, the shape of W1, the word embedding matrix, is (V, d), which is huge. In the next video, you will see how this prediction task can be converted into a binary classification problem.
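As a quick back-of-the-envelope check of how huge W1 is (d = 300 is an assumed embedding dimension, a common choice rather than a value given above):

```python
# Sketch: parameter count of the embedding matrix W1, of shape (V, d).
V = 500_000    # approximate vocabulary size
d = 300        # assumed embedding dimension

print(f"W1 holds {V * d:,} parameters")   # 150,000,000
```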

In the original CBOW model, we had to predict a centre word given one or more context words. In the skip-gram model, we had to predict context words given a centre word. In both cases, it was a classification problem with half a million classes.

In binary classification, given a pair of words, the model predicts whether the two words occur in each other’s context.

The input to the network is a combination (concatenation, addition or some other heuristic) of the one-hot encodings of a pair of words. The output neuron predicts 1 if the words occur in each other’s context and 0 if they do not. Consider the following example; a short code sketch of this labelling follows it.

‘The quick brown fox jumps over the lazy dog’

For a context size of 2:

The pair (quick, fox) will give an output of 1.

The pair (quick, jumps) will give an output of 0.
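Here is a minimal Python sketch of this labelling, assuming a window size of 2 and pairs drawn only from this single sentence; a real implementation would sample negative pairs from the whole corpus (word2vec uses a smoothed unigram distribution for this).

```python
# Sketch: label word pairs from one sentence as 1 (in context) or 0 (not).
sentence = "the quick brown fox jumps over the lazy dog".split()
window = 2

pairs = []
for i, centre in enumerate(sentence):
    # Words within `window` positions on either side of the centre word.
    context = set(sentence[max(0, i - window):i] + sentence[i + 1:i + 1 + window])
    for word in set(sentence) - {centre}:
        pairs.append((centre, word, 1 if word in context else 0))

# Note: 'the' occurs twice, so its pairs depend on each occurrence's position.
print([p for p in pairs if p[:2] == ("quick", "fox")])    # [('quick', 'fox', 1)]
print([p for p in pairs if p[:2] == ("quick", "jumps")])  # [('quick', 'jumps', 0)]
```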

In the next segment, let us understand how to create training data for negative sampling.
