In the previous segment, you learnt how to calculate the probability of a tag sequence given a sequence of words. The idea is to compute the probabilities of all possible tag sequences and choose the sequence with the maximum probability.

Although this approach can work in principle, it is computationally very expensive. For example, if you have just three POS tags – DT, JJ and NN – and you want to tag the sentence “The high cost”, there are 3³ = 27 possible tag sequences (each word can have three possible tags).

In general, for a sequence of n words and t tags, a total of tⁿ tag sequences are possible. The Penn Treebank tagset in NLTK itself has 36 POS tags, so for a sentence of length, say, 10, there are 36¹⁰ possible tag sequences (that’s more than three thousand trillion!).
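You can verify these counts directly; the short sketch below enumerates the tag sequences for the toy three-tag example above and computes the count for a 10-word sentence with 36 tags:

```python
from itertools import product

tags = ["DT", "JJ", "NN"]
sentence = ["The", "high", "cost"]

# Enumerate every possible tag sequence for the 3-word sentence.
sequences = list(product(tags, repeat=len(sentence)))
print(len(sequences))   # 27, i.e. 3^3

# For a 10-word sentence with the 36 Penn Treebank tags:
print(36 ** 10)         # 3656158440062976, i.e. ~3.7 * 10^15
```

Enumerating all of them is feasible for 27 sequences, but the `36 ** 10` figure shows why the brute-force approach collapses for realistic sentences.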

Clearly, computing trillions of probabilities to tag a 10-word sentence is impractical. Thus, we need to find a much more efficient approach to tagging.

Prof. Srinath will explain one such approach, called the **Viterbi heuristic**, also commonly known as the **Viterbi algorithm**.

To summarise, the basic idea of the **Viterbi algorithm** is as follows – given a list of observations (words) O1, O2, …, On to be tagged, rather than computing the probabilities of all possible tag sequences, you assign tags sequentially, i.e. you assign the most likely tag to each word using the previous tag.

More formally, you assign the tag Tj to each word Oi such that it maximises the likelihood:

$$P(T_j|O_i)=P(O_i|T_j)\;\ast\;P(T_j|T_{j-1}),$$

where Tj−1 is the tag assigned to the previous word. Recall that according to the Markov assumption, the probability of a tag Tj is assumed to be **dependent only on the previous tag** Tj−1, and hence the term P(Tj|Tj−1).

In other words, you assign tags sequentially such that the tag assigned to every word Oi maximises the likelihood P(Tj|Oi) locally, i.e. it is assumed to be dependent only on the current word and the previous tag. This algorithm does not guarantee that the resulting tag sequence will be the most likely sequence, but by making these simplifying assumptions, it reduces the computational time manifold (you’ll see how much in the next lecture).

This is also why it is called a **greedy algorithm** – it assigns the tag that is most likely *at every word*, rather than looking for the overall most likely sequence. In the following video, Prof. Srinath will compare the computational costs of the two tagging algorithms – the brute force algorithm and the Viterbi algorithm.
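The greedy rule described above can be sketched in a few lines of Python. The emission and transition probabilities here are made-up toy values for illustration, not estimates from a real corpus:

```python
# Hypothetical P(word | tag) values.
emission = {
    ("The", "DT"): 0.6, ("high", "JJ"): 0.3, ("high", "NN"): 0.05,
    ("cost", "NN"): 0.2, ("cost", "JJ"): 0.01,
}
# Hypothetical P(tag | previous tag) values; "^" marks the sentence start.
transition = {
    ("DT", "^"): 0.5, ("JJ", "DT"): 0.3, ("NN", "DT"): 0.5,
    ("NN", "JJ"): 0.6, ("JJ", "JJ"): 0.1,
}
tags = ["DT", "JJ", "NN"]

def greedy_tag(words):
    prev, result = "^", []
    for word in words:
        # Pick the tag maximising P(word|tag) * P(tag|prev) locally,
        # without looking ahead at the rest of the sentence.
        best = max(tags, key=lambda t: emission.get((word, t), 0.0)
                                       * transition.get((t, prev), 0.0))
        result.append(best)
        prev = best
    return result

print(greedy_tag(["The", "high", "cost"]))  # → ['DT', 'JJ', 'NN']
```

Note that each word is tagged in one pass with no backtracking, which is exactly why the result is fast but not guaranteed to be the globally most likely sequence.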

#### Note

In the video below, the professor mistakenly derives the final result as O(ns); that result is incorrect, so please disregard that part of the video.

**Correct explanation**:

The Viterbi algorithm finds the most likely sequence of hidden states, called the ‘Viterbi path’, conditioned on a sequence of observations in an HMM.

If there is a sequence of n words and each word is to be assigned one of T possible tags, the time complexity is O(nT²).

For each word, the Viterbi algorithm considers each of the T candidate tags; for each candidate tag, it computes a probability via each of the T possible previous tags and takes the maximum among them. That is O(T) work per candidate tag, and with T candidate tags, the cost per word is O(T²).

This has to be done for n words in the sequence.

$$\underbrace{T^2+T^2+\dots+T^2}_{n\text{ times}}=nT^2.$$

So, the time complexity is

$$O(nT^2).$$
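To make the nT² cost concrete, here is a minimal sketch of the Viterbi recursion. The probability tables are hypothetical toy values, and the two nested loops over tags are exactly the T × T work done for each word:

```python
def viterbi(words, tags, emission, transition, start):
    # delta[t] = probability of the best tag sequence for the words seen
    # so far that ends in tag t; backpointers recover that sequence.
    delta = {t: start.get(t, 0.0) * emission.get((words[0], t), 0.0)
             for t in tags}
    backpointers = []
    for word in words[1:]:              # n - 1 remaining words
        new_delta, pointers = {}, {}
        for t in tags:                  # T candidate tags ...
            # ... each scored against T previous tags: T*T work per word.
            prev = max(tags, key=lambda p: delta[p] * transition.get((t, p), 0.0))
            new_delta[t] = (delta[prev] * transition.get((t, prev), 0.0)
                            * emission.get((word, t), 0.0))
            pointers[t] = prev
        backpointers.append(pointers)
        delta = new_delta
    # Trace back from the most probable final tag.
    best = max(tags, key=lambda t: delta[t])
    path = [best]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return list(reversed(path))

# Toy (hypothetical) model for the running "The high cost" example.
tags = ["DT", "JJ", "NN"]
start = {"DT": 0.5, "JJ": 0.3, "NN": 0.2}           # P(tag at position 1)
emission = {("The", "DT"): 0.6, ("high", "JJ"): 0.3, ("cost", "NN"): 0.2}
transition = {("JJ", "DT"): 0.3, ("NN", "JJ"): 0.6}  # P(tag | previous tag)
print(viterbi(["The", "high", "cost"], tags, emission, transition, start))
# → ['DT', 'JJ', 'NN']
```

Unlike the greedy pass, this version keeps, for every tag, the best path ending in that tag, so the backtrace recovers the overall most likely sequence while still doing only O(nT²) work.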

Now that you are familiar with the basic Viterbi algorithm, you will study Markov processes and Hidden Markov Models more formally in the next segment. In a later segment, you’ll also learn to write a POS tagging algorithm using the Viterbi heuristic.