So far, you have learnt how to reduce words to their base form. But there is another common scenario that you’ll encounter while working with text. Suppose there is an article titled “Higher Technical Education in India” that discusses the state of the Indian education system in the engineering space. Let’s say it contains the names of various Indian colleges, such as ‘International Institute of Information Technology, Bangalore’, ‘Indian Institute of Technology, Mumbai’, ‘National Institute of Technology, Kurukshetra’ and many others. Now, when you tokenise this document, all these college names will be broken into individual words such as ‘Indian’, ‘Institute’, ‘International’, ‘National’, ‘Technology’ and so on. But you don’t want this. You want an entire college name to be represented by one token.
To solve this issue, you could replace each college name with a single term. So, ‘International Institute of Information Technology, Bangalore’ could be replaced by ‘IIITB’. But this is a very manual process: to replace terms in this manner, you would need to read the entire corpus and look for them yourself.
It turns out that there is a metric called pointwise mutual information (PMI) that helps here. You can calculate the PMI score of each candidate term. The PMI score of a term such as ‘International Institute of Information Technology, Bangalore’ will be much higher than that of an arbitrary sequence of words that merely happen to appear next to each other. If the PMI score of a term exceeds a certain threshold, you can choose to replace it with a single token such as ‘International_Institute_of_Information_Technology_Bangalore’.
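To make the replacement step concrete, here is a minimal Python sketch. It assumes you already have a list of terms whose PMI score passed your threshold; the function name `merge_phrases` and the sample sentence are made up for illustration and are not part of the lesson.

```python
import re

def merge_phrases(text, phrases):
    """Replace each multi-word phrase in `text` with one underscore-joined token."""
    # Handle longer phrases first so a shorter phrase never splits a longer one.
    for phrase in sorted(phrases, key=len, reverse=True):
        words = re.findall(r"\w+", phrase)
        # Allow whitespace and/or commas between the words of the phrase.
        pattern = r"\b" + r"[\s,]+".join(map(re.escape, words)) + r"\b"
        text = re.sub(pattern, "_".join(words), text)
    return text

# Assumption: this term has already passed the PMI threshold.
phrases = ["Indian Institute of Technology, Mumbai"]
text = "She studied at the Indian Institute of Technology, Mumbai for four years."

merged = merge_phrases(text, phrases)
tokens = merged.split()  # simple whitespace tokenisation, for illustration only
print(tokens)
```

After merging, the college name survives tokenisation as the single token ‘Indian_Institute_of_Technology_Mumbai’ instead of five separate words.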
But what is PMI and how is it calculated? In the following video, Professor Srinath explains PMI.
You saw how to calculate the PMI of a term that has two words. The PMI score for such a term is:
PMI(x, y) = log [ P(x, y) / (P(x) P(y)) ]
For terms with three words, the formula becomes:
PMI(z, y, x) = log [ P(z, y, x) / (P(z) P(y) P(x)) ]
= log [ P(z|y, x) P(y|x) P(x) / (P(z) P(y) P(x)) ]
= log [ P(z|y, x) P(y|x) / (P(z) P(y)) ]
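If you want to convince yourself that the chain-rule form gives the same score as the direct form, you can check it numerically. The sketch below uses hypothetical probabilities chosen only for illustration, and it assumes a base-2 logarithm, since the lesson does not fix the base.

```python
import math

# Hypothetical probabilities for a three-word term (x, y, z); illustrative only.
p_x, p_y, p_z = 0.5, 0.5, 0.25
p_yx = 0.25    # P(y, x)
p_zyx = 0.125  # P(z, y, x)

# Conditional probabilities obtained from the joint probabilities.
p_y_given_x = p_yx / p_x      # P(y|x)
p_z_given_yx = p_zyx / p_yx   # P(z|y, x)

# Direct form: log [ P(z, y, x) / (P(z) P(y) P(x)) ]
pmi_direct = math.log2(p_zyx / (p_z * p_y * p_x))

# Chain-rule form: log [ P(z|y, x) P(y|x) / (P(z) P(y)) ]
pmi_chain = math.log2(p_z_given_yx * p_y_given_x / (p_z * p_y))

print(pmi_direct, pmi_chain)
```

Both expressions give the same value because P(x) cancels between the numerator and the denominator.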
Now, how do you actually calculate these probabilities? This is explained in the following video.
Note: At 1:02, it should be “Construction” instead of “Constraction”.
Correction:
At 1:49, the calculation of PMI(New Delhi) should be log [ P(New Delhi) / (P(New) P(Delhi)) ].
So far, you have used words as the occurrence context to calculate the probability of a word. But you could also choose a sentence or even a paragraph as the occurrence context.
If you choose words as the occurrence context, then the probability of a word is:
P(w) = (Number of times the word ‘w’ appears in the corpus) / (Total number of words in the corpus)
Similarly, if a sentence is the occurrence context, then the probability of a word is given by:
P(w) = (Number of sentences that contain ‘w’) / (Total number of sentences in the corpus)
Similarly, you could calculate the probability of a word with paragraphs as occurrence context.
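The two definitions can give quite different values for the same word. The following Python sketch compares them on a small made-up corpus (it is not the corpus used in the exercise). The helper names, the lowercasing of words and the simple regex-based sentence splitter are all assumptions made for illustration.

```python
import re

# Hypothetical toy corpus, used only to illustrate the two definitions.
corpus = ("New Delhi is the capital of India. "
          "Delhi is a large city. "
          "The city has a new metro line.")

def tokenize(text):
    # Assumption: lowercase everything and keep only alphanumeric runs.
    return re.findall(r"[a-z0-9]+", text.lower())

def split_sentences(text):
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def p_word_context(word, text):
    """P(w) with words as the occurrence context."""
    tokens = tokenize(text)
    return tokens.count(word.lower()) / len(tokens)

def p_sentence_context(word, text):
    """P(w) with sentences as the occurrence context."""
    sentences = split_sentences(text)
    hits = sum(word.lower() in tokenize(s) for s in sentences)
    return hits / len(sentences)

print(p_word_context("Delhi", corpus))      # 2 occurrences out of 19 words
print(p_sentence_context("Delhi", corpus))  # 2 sentences out of 3
```

Whichever context you pick, use the same one for every probability that goes into a PMI score.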
Once you have the probabilities, you can simply plug the values into the formula to get the PMI score.
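As a rough end-to-end illustration, the sketch below computes the PMI of a two-word term with sentences as the occurrence context. The four sentences are a made-up toy corpus, the helper names `p_term` and `pmi` are hypothetical, and the base-2 logarithm is an assumption, since the lesson does not specify the base.

```python
import math
import re

# Hypothetical toy corpus: each string is one sentence (one occurrence context).
sentences = [
    "New Delhi is the capital of India",
    "Delhi has a metro line",
    "New Delhi hosts many embassies",
    "The capital has many parks",
]

def tokens(text):
    # Assumption: lowercase and keep only alphanumeric runs.
    return re.findall(r"[a-z0-9]+", text.lower())

def p_term(term, sentences):
    """Fraction of sentences in which the words of `term` appear consecutively."""
    target = tokens(term)
    n = len(target)
    hits = 0
    for sentence in sentences:
        toks = tokens(sentence)
        if any(toks[i:i + n] == target for i in range(len(toks) - n + 1)):
            hits += 1
    return hits / len(sentences)

def pmi(x, y, sentences):
    """PMI(x, y) = log2 [ P(x, y) / (P(x) P(y)) ]."""
    p_xy = p_term(f"{x} {y}", sentences)
    if p_xy == 0:
        return float("-inf")  # the two words never occur together
    return math.log2(p_xy / (p_term(x, sentences) * p_term(y, sentences)))

score = pmi("New", "Delhi", sentences)
print(score)
```

Here P(New) = 2/4, P(Delhi) = 3/4 and P(New Delhi) = 2/4, so the score is log2[(2/4) / ((2/4)(3/4))] = log2(4/3), a positive value, which indicates that the two words occur together more often than they would if they were independent.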
Now, you’re given the following corpus of text:
“The Nobel Prize is a set of five annual international awards bestowed in several categories by Swedish and Norwegian institutions in recognition of academic, cultural, or scientific advances. In the 19th century, the Nobel family who were known for their innovations to the oil industry in Azerbaijan was the leading representative of foreign capital in Baku. The Nobel Prize was funded by personal fortune of Alfred Nobel. The Board of the Nobel Foundation decided that after this addition, it would allow no further new prize.”
Use the above corpus to attempt the following exercise, taking each sentence of the corpus as the occurrence context.
In the next section, you’ll learn how to calculate PMI of terms with more than two words.