In the following segments, you’ll study a commonly used probabilistic algorithm for POS tagging – the Hidden Markov Model (HMM).
Before moving ahead, it will be useful to recall Bayes’ theorem and the chain rule of probability, which you learnt in the Naive Bayes module. Say you have two features X = (x1, x2) and a binary target variable y (class = 0/1). According to Bayes’ rule, the probability of a point (x1, x2) belonging to the class c1 is given by:
$$P(class = c_1 \vert x_1, x_2) = \frac{P(x_1, x_2 \vert c_1) \ast P(c_1)}{P(x_1, x_2)}$$
Now, according to the chain rule of probability, the term P(x1, x2 | c1) can be rewritten as P(x1 | c1) ∗ P(x2 | x1, c1), and under the Naive Bayes assumption of conditional independence this simplifies to P(x1 | c1) ∗ P(x2 | c1). You can now compute all the probabilities on the right-hand side to compute P(class = c1 | x1, x2).
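To make this concrete, here is a minimal Python sketch of the same computation. All the probability values (priors and conditional likelihoods) are made-up numbers, purely for illustration:

```python
# Hypothetical class priors P(c) and likelihoods P(x_i | c).
prior = {"c1": 0.6, "c2": 0.4}
likelihood = {
    "c1": {"x1": 0.5, "x2": 0.2},
    "c2": {"x1": 0.1, "x2": 0.7},
}

def unnormalised_posterior(c):
    """Numerator of Bayes' rule: P(x1 | c) * P(x2 | c) * P(c)."""
    return likelihood[c]["x1"] * likelihood[c]["x2"] * prior[c]

scores = {c: unnormalised_posterior(c) for c in prior}
evidence = sum(scores.values())                 # P(x1, x2), the denominator
posterior = {c: s / evidence for c, s in scores.items()}
print(posterior)                                # {'c1': 0.68..., 'c2': 0.31...}
```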
A similar idea is used by probabilistic parsers to assign POS tags to sequences of words. Say you want to tag the word sequence ‘The high cost’ and you want to compute the probability that the tag sequence is (DT, JJ, NN). For simplicity, let’s work with only these three POS tags, and also assume that only three tag sequences are possible – (DT, JJ, NN), (DT, NN, JJ) and (JJ, DT, NN).
The probability that the tag sequence is (DT, JJ, NN) can be computed as:
$$P(DT, JJ, NN \vert The, high, cost) = \frac{P(The, high, cost \vert DT, JJ, NN) \ast P(DT, JJ, NN)}{P(The, high, cost)}$$
Similarly, we can compute the probabilities of the other two sequences and assign the sequence that has the maximum probability. We will need to use some simplifying assumptions to compute the right-hand side of the equation. Let’s see how.
Let’s analyse the process of stochastic parsing step by step.
The objective is to find the most probable POS tag sequence for the phrase: ‘The high cost’.
We have made a simplifying assumption that only three possible tags exist: Determiner (DT), Adjective (JJ) and Noun (NN).
Each word in the sentence can be assigned any one of the three tags DT, JJ or NN. So, there could be 3^3 = 27 possible tag sequences (DT-DT-DT, DT-JJ-NN, DT-JJ-JJ, …); a short sketch after the list below enumerates them all. For simplicity, we’ll consider only these three possible tag sequences:
- (DT, JJ, NN)
- (DT, NN, JJ)
- (JJ, DT, NN)
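As a quick aside, the full set of 27 candidate sequences can be enumerated in a couple of lines of Python (the tag and word lists simply mirror the example above):

```python
from itertools import product

tags = ["DT", "JJ", "NN"]
words = ["The", "high", "cost"]

# One tag per word, so 3^3 = 27 possible tag sequences.
all_sequences = list(product(tags, repeat=len(words)))
print(len(all_sequences))    # 27
print(all_sequences[:3])     # [('DT', 'DT', 'DT'), ('DT', 'DT', 'JJ'), ('DT', 'DT', 'NN')]
```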
We need to find the maximum of P(tag sequence | observation sequence) among the possible tag sequences. Let’s start with the first tag sequence: (DT, JJ, NN).
$$P(DT, JJ, NN \vert The, high, cost) = \frac{P(The, high, cost \vert DT, JJ, NN) \ast P(DT, JJ, NN)}{P(The, high, cost)}$$
Expanding the terms on the right-hand side using the chain rule:
$$P(DT, JJ, NN \vert The, high, cost) = \frac{P(the \vert DT, JJ, NN) \ast P(high \vert the, DT, JJ, NN) \ast P(cost \vert the, high, DT, JJ, NN) \ast P(DT) \ast P(JJ \vert DT) \ast P(NN \vert DT, JJ)}{P(The, high, cost)}$$
We make an important assumption here called the Markov assumption. (You’ll learn about the Markov process formally in the upcoming segments). Briefly, a Markov process transitions from one ‘state’ to another with some probability. In this case, the states are the POS tags.
The Markov assumption states that the probability of a state depends only on the probability of the previous state leading to it. That implies, in a tag sequence (DT, JJ, NN), the probability that a word is tagged as NN depends only on the previous tag JJ and not on DT.
Another simplifying assumption we make is that the probability of a word w being assigned a tag t depends only on the tag t and not on any other tag. Thus, the probability P(the|DT), i.e. the probability that a word is ‘the’ given its tag is DT, depends only on DT and not on any of the other tags NN or JJ.
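Putting the two assumptions together, the numerator factorises into emission terms P(word | tag) and transition terms P(tag | previous tag). The sketch below shows this factorisation in Python; the probability tables are made-up, illustrative numbers rather than estimates from any real corpus:

```python
# Hypothetical probabilities, for illustration only.
start = {"DT": 0.6, "JJ": 0.3, "NN": 0.1}          # P(first tag)
transition = {                                      # P(next tag | previous tag)
    "DT": {"DT": 0.05, "JJ": 0.45, "NN": 0.50},
    "JJ": {"DT": 0.05, "JJ": 0.15, "NN": 0.80},
    "NN": {"DT": 0.30, "JJ": 0.20, "NN": 0.50},
}
emission = {                                        # P(word | tag)
    "DT": {"the": 0.40, "high": 0.00, "cost": 0.00},
    "JJ": {"the": 0.00, "high": 0.10, "cost": 0.01},
    "NN": {"the": 0.00, "high": 0.02, "cost": 0.05},
}

def sequence_score(words, tags):
    """Numerator of the simplified Bayes expression for one candidate tag sequence."""
    score = start[tags[0]] * emission[tags[0]][words[0]]
    for prev_tag, tag, word in zip(tags, tags[1:], words[1:]):
        score *= transition[prev_tag][tag] * emission[tag][word]
    return score

print(sequence_score(["the", "high", "cost"], ["DT", "JJ", "NN"]))   # ~0.000432
```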
Simplifying the equation using these assumptions, we get:
$$P(DT, JJ, NN \vert The, high, cost) = \frac{P(the \vert DT) \ast P(high \vert JJ) \ast P(cost \vert NN) \ast P(DT) \ast P(JJ \vert DT) \ast P(NN \vert JJ)}{P(The, high, cost)}$$
Let’s write similar equations for the other two candidate sequences:
$$P(DT, NN, JJ \vert The, high, cost) = \frac{P(the \vert DT) \ast P(high \vert NN) \ast P(cost \vert JJ) \ast P(DT) \ast P(NN \vert DT) \ast P(JJ \vert NN)}{P(The, high, cost)}$$
$$P(JJ, DT, NN \vert The, high, cost) = \frac{P(the \vert JJ) \ast P(high \vert DT) \ast P(cost \vert NN) \ast P(JJ) \ast P(DT \vert JJ) \ast P(NN \vert DT)}{P(The, high, cost)}$$
The denominator can simply be ignored since it is the same across all candidate sequences.
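Continuing the sketch above (the same hypothetical tables and the `sequence_score` function), the three candidates can be scored and compared directly; the shared denominator is simply never computed:

```python
words = ["the", "high", "cost"]
candidates = [("DT", "JJ", "NN"), ("DT", "NN", "JJ"), ("JJ", "DT", "NN")]

# Unnormalised scores: dropping P(the, high, cost) does not change the argmax.
scores = {tags: sequence_score(words, tags) for tags in candidates}
best = max(scores, key=scores.get)
print(best)    # ('DT', 'JJ', 'NN') with the made-up numbers above
```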
In the next segment, you’ll see how to optimally arrive at a solution for getting the most probable tag sequence without calculating the probabilities for the 27 tag sequences. For now, let’s solve some problems.