
HMM & the Viterbi Algorithm: Python Implementation

You will now learn to build an HMM POS tagger using the Penn Treebank dataset as the training corpus.

In the next few exercises, you will learn to do the following in Python:

  • Conduct exploratory analysis on a tagged corpus.
  • Sample the data into 70:30 train-test sets.
  • Train an HMM model using the tagged corpus:
  • Calculating the emission probabilities

$$P(w_i \mid t_j)$$

  • Calculating the transition probabilities

$$P(t_i \mid t_{i-1})$$

  • Write the Viterbi algorithm to POS-tag a sequence of words (a sentence); a sketch follows this list.
  • Evaluate the model predictions against the ground truth.
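
As a preview of the Viterbi step, here is a minimal sketch of the algorithm. It is not the professor's exact implementation: the names `transition_p` and `emission_p` are placeholders for the probability tables you will build in the upcoming exercises, and unseen word-tag pairs simply get zero probability (a real tagger would add smoothing).

```python
def viterbi(words, tags, transition_p, emission_p, start_tag='.'):
    """Return the most probable tag sequence for a list of words.

    transition_p[(t_prev, t)] -- P(t | t_prev)
    emission_p[(word, t)]     -- P(word | t)
    The sentence-terminator tag '.' doubles as the start state.
    """
    # best[i][t] = (probability of the best path ending in tag t
    #               at position i, backpointer to the previous tag)
    best = [{t: (transition_p.get((start_tag, t), 0.0) *
                 emission_p.get((words[0], t), 0.0), None)
             for t in tags}]

    for i in range(1, len(words)):
        best.append({})
        for t in tags:
            prob, prev = max(
                (best[i - 1][pt][0] * transition_p.get((pt, t), 0.0) *
                 emission_p.get((words[i], t), 0.0), pt)
                for pt in tags)
            best[i][t] = (prob, prev)

    # Trace the highest-probability path backwards.
    last_tag = max(best[-1], key=lambda t: best[-1][t][0])
    path = [last_tag]
    for i in range(len(words) - 1, 0, -1):
        last_tag = best[i][last_tag][1]
        path.append(last_tag)
    return list(reversed(path))
```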

Prof. Srinath will explain the Python implementation step-by-step. Please download the following Jupyter notebook. We recommend that you run the code along with the video and experiment with it.

So, we explored the Penn Treebank dataset. The tagged sentences are in the form of a list of tuples, where the first element of each tuple is a word and the second is its POS tag. We have also sampled the data into train and test sets.
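
For reference, here is a minimal sketch of that exploration and split, assuming NLTK's bundled sample of the Penn Treebank (the notebook in the video may load the data differently):

```python
import nltk
from sklearn.model_selection import train_test_split

nltk.download('treebank')  # NLTK ships a ~10% sample of the Penn Treebank

# Each sentence is a list of (word, tag) tuples, e.g. ('Pierre', 'NNP').
tagged_sentences = list(nltk.corpus.treebank.tagged_sents())
print(tagged_sentences[0][:3])

# 70:30 train-test split; a fixed seed keeps the sample reproducible.
train_set, test_set = train_test_split(tagged_sentences,
                                       train_size=0.70,
                                       random_state=42)
print(len(train_set), len(test_set))
```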

In the next lecture, the professor will explain the code for computing the emission and transition probabilities.

Correction: At 00:50, the professor says that ‘T represents the terms and V represents the tags’; he meant that T represents the tags while V represents the terms/vocabulary. Also, at 05:18, the tag ‘PRP’ refers to a pronoun, not a preposition.

You saw how to compute the emission, transition, and initial state probabilities (P(tag | start)). Note that a sentence can end with any of the three terms ‘.’, ‘?’ or ‘!’. They are all called sentence terminators and are tagged as ‘.’. Thus, P(tag | start) is equivalent to P(tag | ‘.’).
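
To make this concrete, here is one way to estimate all three tables by counting over the training sentences (a sketch, assuming `train_set` holds lists of (word, tag) tuples as above; plain maximum-likelihood estimates with no smoothing):

```python
from collections import Counter, defaultdict

tag_counts = Counter()                    # C(tag)
emission_counts = defaultdict(Counter)    # C(tag, word)
transition_counts = defaultdict(Counter)  # C(prev_tag, tag)

for sentence in train_set:
    # Seeding prev_tag with '.' makes P(tag | start) fall out as P(tag | '.').
    prev_tag = '.'
    for word, tag in sentence:
        tag_counts[tag] += 1
        emission_counts[tag][word] += 1
        transition_counts[prev_tag][tag] += 1
        prev_tag = tag

# Maximum-likelihood estimates: P(word | tag) and P(t_i | t_{i-1}).
emission_p = {(word, tag): count / tag_counts[tag]
              for tag, words in emission_counts.items()
              for word, count in words.items()}
transition_p = {(prev, tag): count / sum(nexts.values())
                for prev, nexts in transition_counts.items()
                for tag, count in nexts.items()}
```

These two dictionaries have exactly the shape the Viterbi sketch earlier in this section expects, so the pieces compose directly.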