You saw that the lexicon tagger uses a simple statistical tagging algorithm: for each token, it assigns the POS tag most frequently assigned to that token in the training corpus. For example, it will assign the tag “verb” to any occurrence of the word “run” if “run” appears as a verb more often than as any other part of speech.
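As a minimal sketch of this idea (an illustration, not the course's exact code), the most-frequent-tag lookup can be built from NLTK's tagged Treebank sample in a few lines of Python:

```python
import nltk
from collections import Counter

nltk.download("treebank")  # small tagged sample of the Penn Treebank

# Count how often each tag is assigned to each word in the corpus.
tag_counts = {}
for word, tag in nltk.corpus.treebank.tagged_words():
    tag_counts.setdefault(word.lower(), Counter())[tag] += 1

def lexicon_tag(word):
    """Return the tag most frequently assigned to `word`, or None if unseen."""
    counts = tag_counts.get(word.lower())
    return counts.most_common(1)[0][0] if counts else None

print(lexicon_tag("run"))  # the corpus-wide most frequent tag for "run"
```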
Rule-based taggers first assign tags using the lexicon method and then apply predefined rules (sketched in code after this list). Some examples of rules are:
- Change the tag to VBG for words ending in ‘-ing’.
- Change the tag to VBD for words ending in ‘-ed’.
- Replace VBD with VBN if the previous word is ‘has/have/had’.
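The first two rules are suffix patterns and can be written directly in the pattern format of NLTK's RegexpTagger; the third looks at the previous word, so the sketch below (an illustration under assumed patterns, not the course's exact code) handles it with a small post-processing pass:

```python
from nltk.tag import RegexpTagger

# Suffix rules expressed as RegexpTagger patterns.
patterns = [
    (r".*ing$", "VBG"),  # rule 1: words ending in -ing -> VBG
    (r".*ed$", "VBD"),   # rule 2: words ending in -ed  -> VBD
]
rule_tagger = RegexpTagger(patterns)
print(rule_tagger.tag(["walking", "walked", "walk"]))
# [('walking', 'VBG'), ('walked', 'VBD'), ('walk', None)]

# Rule 3 is contextual, so it runs as a second pass over the tagged output.
def apply_vbn_rule(tagged):
    """Replace VBD with VBN when the previous word is has/have/had."""
    out = list(tagged)
    for i in range(1, len(out)):
        word, tag = out[i]
        if tag == "VBD" and out[i - 1][0].lower() in {"has", "have", "had"}:
            out[i] = (word, "VBN")
    return out

print(apply_vbn_rule(rule_tagger.tag(["has", "walked"])))
# [('has', None), ('walked', 'VBN')]
```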
Defining such rules requires some exploratory data analysis and intuition.
In this segment, you’ll learn to implement lexicon and rule-based taggers on the Treebank corpus of NLTK. Let’s first explore the corpus.
Programming Exercise – Exploratory Analysis for POS Tagging
In the following practice exercise, you will use the Jupyter notebook attached below to answer the questions posed there. The notebook contains some starter code to read the data and explain its structure. We recommend that you try to answer the questions by writing code to conduct some exploratory analysis. Solutions to these questions are provided in the TA videos at the bottom of this page.
The following TA video walks through the solutions to the questions and also provides the intuition behind the lexicon and rule-based approaches.
The following notebook contains the exploratory analysis code.
Now that you have an intuition of how lexicon and rule-based taggers work, let’s build these taggers in NLTK. NLTK comes with built-in implementations of both: the lexicon tagger is called the Unigram tagger and the rule-based tagger is called the Regular Expression tagger. We’ll use them to train taggers on the Penn Treebank corpus, as sketched below.
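Here is a minimal sketch of that training setup; the 95/5 train/test split and the specific regex patterns are illustrative choices, not prescribed by the course:

```python
import nltk
from nltk.tag import RegexpTagger, UnigramTagger

nltk.download("treebank")

tagged_sents = nltk.corpus.treebank.tagged_sents()
split = int(0.95 * len(tagged_sents))
train_sents, test_sents = tagged_sents[:split], tagged_sents[split:]

# Rule-based (Regular Expression) tagger.
patterns = [
    (r".*ing$", "VBG"),                # gerunds
    (r".*ed$", "VBD"),                 # simple past
    (r"^-?[0-9]+(\.[0-9]+)?$", "CD"),  # cardinal numbers
    (r".*", "NN"),                     # default: noun
]
rule_tagger = RegexpTagger(patterns)

# Lexicon (Unigram) tagger; words unseen in training back off to the rules.
lexicon_tagger = UnigramTagger(train_sents, backoff=rule_tagger)

# `accuracy` is called `evaluate` in older NLTK releases.
print(lexicon_tagger.accuracy(test_sents))
```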
The combined tagger above uses the backoff technique: when the unigram tagger has never seen a word, it hands the word off to the regular-expression tagger instead of leaving it untagged. You can refer to this Stack Overflow answer to learn more about the backoff technique. Next, you will study a widely used probabilistic POS tagging model – the Hidden Markov Model (HMM).