In the previous section, you learnt the following two approaches to create a co-occurrence matrix:
- Occurrence context matrix
- x-skip-n-grams (more generally, skip-grams)
In this lecture, you will learn each of these approaches in detail. Please note that, in the following video, c1 and c2 represent context 1 and context 2, respectively.
Note:
At 0:25, the cell corresponding to cat (row) and wall (column) should have a value of 1 instead of 0.
At 1:39, the 3-skip-2-grams also contain one more pair: ‘fly wall’.
It has been assumed that a word occurs in its own context, so all the diagonal elements in the matrix are 1.
To summarise, there are two approaches to create the term-term co-occurrence matrix.
Occurrence context:
A context can be defined as, for example, an entire sentence. Two words are said to co-occur if they appear in the same sentence.
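As a minimal sketch of the occurrence-context idea, the snippet below marks two words as co-occurring when they share a sentence. The function name `sentence_cooccurrence` and the two toy sentences are hypothetical and are not taken from the video.

```python
from itertools import combinations

def sentence_cooccurrence(sentences):
    """Map each unordered word pair to 1 if the two words share a sentence."""
    pairs = {}
    for sentence in sentences:
        # Each sentence is one context; duplicates within it are ignored.
        words = set(sentence.lower().split())
        for a, b in combinations(sorted(words), 2):
            pairs[(a, b)] = 1
    return pairs

# Hypothetical toy sentences, used only for illustration.
sentences = ["the cat sat on the wall", "the fly sat on the wall"]
pairs = sentence_cooccurrence(sentences)

print(("cat", "wall") in pairs)  # True: both words appear in the first sentence
print(("cat", "fly") in pairs)   # False: they never share a sentence
```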
Skipgrams:
3-skip means that the two words being considered should have at most 3 words between them, and 2-gram means that we select two words from the window.
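The skip-gram definition can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `skip_bigrams`, its parameter `k`, and the toy token list are assumptions, not part of the lecture.

```python
def skip_bigrams(tokens, k):
    """Return pairs (tokens[i], tokens[j]), i < j, with at most k tokens between them."""
    pairs = []
    for i, left in enumerate(tokens):
        # j - i - 1 tokens are skipped, so j may go up to i + k + 1.
        for j in range(i + 1, min(i + k + 2, len(tokens))):
            pairs.append((left, tokens[j]))
    return pairs

# Hypothetical tokens, used only for illustration.
tokens = ["a", "b", "c", "d", "e", "f"]
pairs = skip_bigrams(tokens, k=3)

print(pairs[:4])            # the four pairs that start at "a"
print(("a", "f") in pairs)  # False: four tokens lie between "a" and "f"
```

With `k=0` the same function returns ordinary bigrams, which is why skip-grams are described as the more general form.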
In a previous question on term-document matrices, you used the dot product of two vectors to measure their similarity. Let’s now look at some other similarity metrics we can use.
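One commonly used alternative to the raw dot product is cosine similarity, which divides the dot product by the lengths of the two vectors so that only their direction matters. The sketch below is an assumption about which metric you might pick, since the text does not name one, and the vectors `u`, `v` and `w` are made up for illustration.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def cosine_similarity(u, v):
    """Dot product normalised by the vector lengths; 0.0 if either vector is all zeros."""
    norm_u = math.sqrt(dot(u, u))
    norm_v = math.sqrt(dot(v, v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot(u, v) / (norm_u * norm_v)

# Hypothetical word vectors: v points in the same direction as u, w is orthogonal to u.
u = [1, 1, 0, 0]
v = [2, 2, 0, 0]
w = [0, 0, 1, 1]

print(dot(u, v), cosine_similarity(u, v))  # dot product grows with length, cosine stays near 1
print(dot(u, w), cosine_similarity(u, w))  # both are 0 for orthogonal vectors
```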
Let’s now visualize the word vectors.
Comprehension - Word Vectors
Say you are given the following paragraph from the book Harry Potter and the Sorcerer’s Stone:
“‘Sorry,’ he grunted, as the tiny old man stumbled and almost fell. It was a few seconds before Mr Dursley realized that the man was wearing a violet cloak. He didn’t seem at all upset at being almost knocked to the ground.”
Let’s assume that our vocabulary only contains a few words as listed below. After removing the stop words and punctuation and retaining only the words in our vocabulary, the paragraph becomes:
man stumbled seconds Dursley man cloak upset knocked ground
Using the 3-skip-2-gram technique, create a co-occurrence matrix from this paragraph and answer the following questions (use a similarity metric of your choice).
The vocabulary would be:
(man, stumbled, seconds, Dursley, cloak, upset, knocked, ground)
The co-occurrence pairs that you get would be (the positions of the left and right words do not matter; they can be switched):
(man, stumbled) (man, seconds) (man, Dursley) (man, man)
(stumbled, seconds) (stumbled, Dursley) (stumbled, man) (stumbled, cloak)
(seconds, Dursley) (seconds, man) (seconds, cloak) (seconds, upset)
(Dursley, man) (Dursley, cloak) (Dursley, upset) (Dursley, knocked)
(man, cloak) (man, upset) (man, knocked) (man, ground)
(cloak, upset) (cloak, knocked) (cloak, ground)
(upset, knocked) (upset, ground)
(knocked, ground)
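Now, fill each cell of the co-occurrence matrix with 1 if the corresponding word pair appears in the list above, and with 0 otherwise. The diagonal cells are also 1, because a word is assumed to occur in its own context.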
Please ensure that the co-occurrence matrix that you get matches the following matrix.
| | man | stumbled | seconds | Dursley | cloak | upset | knocked | ground |
|---|---|---|---|---|---|---|---|---|
| man | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| stumbled | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 |
| seconds | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 |
| Dursley | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 |
| cloak | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| upset | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 |
| knocked | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 1 |
| ground | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 1 |
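The matrix above can be checked with a short script. The following is a sketch under the assumptions stated in this section (3-skip-2-grams, symmetric pairs, and a word occurring in its own context); the variable names and the `dot` helper are illustrative choices, and the dot product is used only because it is the metric already seen earlier.

```python
tokens = ["man", "stumbled", "seconds", "Dursley", "man",
          "cloak", "upset", "knocked", "ground"]
vocab = ["man", "stumbled", "seconds", "Dursley",
         "cloak", "upset", "knocked", "ground"]
index = {word: i for i, word in enumerate(vocab)}
K = 3  # 3-skip: at most three words between the two words of a pair

n = len(vocab)
matrix = [[0] * n for _ in range(n)]
for i in range(n):
    matrix[i][i] = 1  # assumption: a word occurs in its own context

for i, left in enumerate(tokens):
    for j in range(i + 1, min(i + K + 2, len(tokens))):
        a, b = index[left], index[tokens[j]]
        matrix[a][b] = 1
        matrix[b][a] = 1  # the order of the words in a pair does not matter

for word, row in zip(vocab, matrix):
    print(f"{word:>9}", row)

def dot(u, v):
    """Dot product of two rows, used here as a simple similarity score."""
    return sum(x * y for x, y in zip(u, v))

# 'stumbled' and 'ground' share only the contexts 'man' and 'cloak'.
print(dot(matrix[index["stumbled"]], matrix[index["ground"]]))  # 2
```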