You have learnt the basic idea behind the two main problems in building an HMM for POS tagging – **the learning problem** (learning the probabilities) and **the explanation problem **(solved using the Viterbi algorithm). In the next segment, you will learn to build an HMM using the Penn Treebank as the training corpus.

Before that, let’s take a while to reiterate the important concepts and understand the pseudocode of this program.

The Penn Treebank is a manual of around a million words taken from 1989 Wall Street Journal’s articles. This manual is available in NLTK toolkit of Python. You can explore this data by importing NLTK and then run the following code:

The ‘wsj’ object is a list of tuples. Each tuple is in the form of (word, POS tag). You have already performed EDA on the Treebank data while building the lexicon and rule-based taggers.

**Learning the HMM Model Parameters**

Let’s start with training the HMM to learn the probabilities. For this, you need to calculate the **transition probability** (from one state to another), the **emission probability** (of a state emitting a word), and the **initial state probabilities** (of a state appearing at the start of a sentence).

**Transition Probability**

Transition probability is the probability of moving from one state to another. You learnt how to calculate it in the last segment. Let’s write the pseudocode to calculate the transition probabilities.

Thus, we compute the transition probabilities in a loop to get the probabilities for all possible POS tag pairs

$$//P(t_j\backslash\;t_i).//$$

**Emission Probability**

Next, we calculate the **emission probabilities**, i.e. the probability that a tag ‘t’ will ‘emit’ a word w.

We can compute the emission probabilities for all possible word-tag pairs in the corpus. Also, we can infer the start probabilities, i.e. P(t|start) for all possible tags t by computing the number of times the tag t appears at the start of a sentence.

**The Explanation/Decoding Problem: Viterbi Algorithm**

After learning the model parameters, we need to find the best possible state (tag) sequence for each given sentence. We’ll use the Viterbi algorithm – for every word w in the sentence, a tag t is assigned to w such that it maximises the likelihood of the occurrence of P(tag|word).

P(tag|word) = P(word|tag) * P(tag|previous tag)

` = Emission probability * Transition probability`

In other words, we assign the tag t to the word w which has the max P(tag|word).

We’ll keep on storing the assigned tags and words as a list of tuples. As we move to the next word in the list, each tag to be assigned will use the tag of the previous word. You may find **the trellis **quite useful to visualise the algorithm.

For the start word of a sentence, which is a ‘boundary case’, the sentence terminator of the previous tag can be used to calculate the initial state probabilities.

This forms the basis of developing a POS tagger using the Viterbi heuristic. In the next segment, we’ll see how to implement this pseudocode in Python.