
Bag of Words Representation

In the previous module on lexical processing, you learnt about frequency-based methods such as TF-IDF or the bag-of-words approach for creating word vectors.

Let us revise the bag-of-words representation for creating word vectors.

Bag of Words is a representation of text that describes the occurrence of words within a corpus of text, treating each word independently.
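As a minimal sketch of this idea (the sentence and the use of Counter are illustrative choices, not part of the course material), each word can be mapped independently to its occurrence count:

```python
from collections import Counter

# Bag of words treats each word independently: only occurrence
# counts are kept, not the order, meaning or context of the words.
text = "it was the best of times it was the worst of times"
bag = Counter(text.split())

print(bag["it"])     # 2 - appears twice in the text
print(bag["best"])   # 1 - appears once
```

Note that the two halves of the sentence become indistinguishable apart from 'best' and 'worst'; the counts alone say nothing about which words appeared together.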


For example, the following text extract from A Tale of Two Cities authored by Charles Dickens can be converted into a bag-of-words representation: 

“It was the best of times,

it was the worst of times.

It was the age of wisdom,

it was the age of foolishness.

It was the season of Light,

it was the season of Darkness.

It was the spring of hope,

it was the winter of despair”

                               best  worst  wisdom  foolishness  hope  despair  spring  winter  light  season  times  age
It was the best of times         1     0      0         0         0      0       0       0       0      0       1     0
It was the worst of times        0     1      0         0         0      0       0       0       0      0       1     0
It was the age of wisdom         0     0      1         0         0      0       0       0       0      0       0     1
It was the age of foolishness    0     0      0         1         0      0       0       0       0      0       0     1
It was the season of Light       0     0      0         0         0      0       0       0       1      1       0     0
It was the season of Darkness    0     0      0         0         0      0       0       0       0      1       0     0
It was the spring of hope        0     0      0         0         1      0       1       0       0      0       0     0
It was the winter of despair     0     0      0         0         0      1       0       1       0      0       0     0

The above table represents the one-hot encoded vectors for the text extract: if a word is present in the sentence, a value of ‘1’ is assigned; otherwise, ‘0’ is assigned.
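The table above can be reproduced with a short sketch (a pure-Python illustration; a library such as scikit-learn’s CountVectorizer would typically be used in practice):

```python
# Build the binary bag-of-words matrix for the extract: one row per
# line, one column per vocabulary word, 1 if the word occurs in it.
lines = [
    "it was the best of times",
    "it was the worst of times",
    "it was the age of wisdom",
    "it was the age of foolishness",
    "it was the season of light",
    "it was the season of darkness",
    "it was the spring of hope",
    "it was the winter of despair",
]
vocab = ["best", "worst", "wisdom", "foolishness", "hope", "despair",
         "spring", "winter", "light", "season", "times", "age"]

matrix = [[1 if word in line.split() else 0 for word in vocab]
          for line in lines]

for line, row in zip(lines, matrix):
    print(row, line)
```

Each row matches the corresponding row of the table; stop words such as ‘it’, ‘was’, ‘the’ and ‘of’ are excluded from the vocabulary, as in the table.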

Note that attention is not given to the meaning, sequence or context of the words in this representation. In the next video, Jaidev will discuss the limitations of this technique.

To understand the relationship between words, cosine similarity can be applied to the one-hot encoded representation of this text corpus. For example, on taking the words ‘best’ and ‘worst’ and applying cosine similarity as shown below, we get 0. This means that the words are completely unrelated.

S = cos(x, y) = (x · y) / (||x|| ||y||)

S(best, worst) = 0 / (1 × 1) = 0

However, ‘best’ and ‘worst’ are antonyms, and their cosine similarity should be negative as you learnt earlier. 

Similarly:

S(wisdom, foolishness) = 0

S(wisdom, winter) = 0

S(winter,light) = 0

S(winter, season) = 0

S(spring, season) = 0

The cosine similarity comes out to zero for unrelated words, related words and antonyms alike.
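These similarities can be verified with a short sketch, using the word (column) vectors from the table above; each vector marks the lines of the extract in which the word appears:

```python
import math

# Column vectors from the table: one entry per line of the extract.
vectors = {
    "best":   [1, 0, 0, 0, 0, 0, 0, 0],
    "worst":  [0, 1, 0, 0, 0, 0, 0, 0],
    "wisdom": [0, 0, 1, 0, 0, 0, 0, 0],
    "winter": [0, 0, 0, 0, 0, 0, 0, 1],
    "season": [0, 0, 0, 0, 1, 1, 0, 0],
}

def cosine(x, y):
    # S = cos(x, y) = (x . y) / (||x|| ||y||)
    dot = sum(a * b for a, b in zip(x, y))
    norms = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norms

print(cosine(vectors["best"], vectors["worst"]))    # 0.0
print(cosine(vectors["winter"], vectors["season"])) # 0.0
```

Because no two of these words ever occur in the same line, their vectors are orthogonal, and every pairwise similarity is zero regardless of how the words are actually related.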

This shows that the bag-of-words representation is not an accurate way to convert words into their vectors. The vectors produced by bag of words do not capture the meaning of words.
