In this segment, we’ll use the following two probabilistic models to get the most probable IOB tags for words. Recall that you have studied the unigram and bigram models for POS tagging earlier:
- A unigram chunker computes the probability P(IOB label | pos) for each POS tag and assigns each word the label that is most likely given its POS tag.
- A bigram chunker works similarly to a unigram chunker; the only difference is that the probability of an IOB label is now conditioned on both the current and the previous POS tags, i.e. P(label | pos, prev_pos).
Let’s study both these chunkers in detail.
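Before diving into the full implementation, here is a minimal sketch of both ideas using plain Python dictionaries. The tiny training set of (POS tag, IOB label) pairs below is made up for illustration; in a real setting you would extract such pairs from a chunked corpus (e.g. conll2000):

```python
from collections import Counter, defaultdict

# Tiny hand-made training set: sentences as lists of (pos_tag, iob_label)
# pairs. These rows are illustrative only; in practice you would derive
# them from a chunk-annotated corpus such as conll2000.
train_sents = [
    [("DT", "B-NP"), ("NN", "I-NP"), ("VBD", "O"), ("DT", "B-NP"), ("NN", "I-NP")],
    [("DT", "B-NP"), ("JJ", "I-NP"), ("NN", "I-NP"), ("VBD", "O"), ("IN", "O")],
]

class UnigramChunker:
    """Assign each POS tag its most frequent IOB label: argmax P(label | pos)."""

    def __init__(self, sents):
        counts = defaultdict(Counter)
        for sent in sents:
            for pos, label in sent:
                counts[pos][label] += 1
        # For each POS tag, remember the single most frequent label
        self.best = {pos: c.most_common(1)[0][0] for pos, c in counts.items()}

    def tag(self, pos_seq):
        # Fall back to "O" (outside any chunk) for unseen POS tags
        return [self.best.get(pos, "O") for pos in pos_seq]

class BigramChunker(UnigramChunker):
    """Condition on (prev_pos, pos): argmax P(label | pos, prev_pos),
    backing off to the unigram estimate for unseen contexts."""

    def __init__(self, sents):
        super().__init__(sents)  # build the unigram backoff table
        counts = defaultdict(Counter)
        for sent in sents:
            prev = "<s>"  # sentence-start marker
            for pos, label in sent:
                counts[(prev, pos)][label] += 1
                prev = pos
        self.best2 = {ctx: c.most_common(1)[0][0] for ctx, c in counts.items()}

    def tag(self, pos_seq):
        labels, prev = [], "<s>"
        for pos in pos_seq:
            labels.append(self.best2.get((prev, pos), self.best.get(pos, "O")))
            prev = pos
        return labels

uni = UnigramChunker(train_sents)
print(uni.tag(["DT", "NN", "VBD"]))  # ['B-NP', 'I-NP', 'O']
bi = BigramChunker(train_sents)
print(bi.tag(["DT", "JJ", "NN"]))    # ['B-NP', 'I-NP', 'I-NP']
```

The course code wraps the same logic in NLTK's tagger classes (`UnigramTagger`, `BigramTagger`) rather than raw dictionaries, but the probability estimates being maximised are the ones shown here.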
Note
We have used Python classes in the code. If you are not familiar with classes and object-oriented programming in general, we highly recommend studying them briefly. The additional resources provided below will help you learn the basics of classes and object-oriented programming in Python. Alternatively, you should be able to rewrite the code using functions only.
Another way to identify named entities (like cities and states) is to look them up in a dictionary or a gazetteer. A gazetteer is a geographical directory that stores the names of geographical entities (cities, states, countries) along with some other features related to these entities. An example gazetteer file for the US is linked below.
Data download URL: https://raw.githubusercontent.com/grammakov/USA-cities-and-states/master/us_cities_states_counties.csv
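A gazetteer lookup can be as simple as a set-membership test. The sketch below builds a city lookup from a few sample rows written in a pipe-delimited layout; the column names and delimiter here are assumptions for illustration, so check the header of the actual downloaded file before reusing this parsing code:

```python
import csv
import io

# Sample rows mimicking a pipe-delimited gazetteer layout
# (City|State short|State full|County|City alias). This format is an
# assumption for illustration; verify it against the real file's header.
sample = """City|State short|State full|County|City alias
Abilene|TX|Texas|Taylor|Abilene
Boston|MA|Massachusetts|Suffolk|Boston
Chicago|IL|Illinois|Cook|Chicago
"""

reader = csv.DictReader(io.StringIO(sample), delimiter="|")
cities = {row["City"].lower() for row in reader}  # lowercase for matching

def is_city(word):
    """Gazetteer lookup: True if the word appears as a city name."""
    return word.lower() in cities

print(is_city("Chicago"), is_city("table"))  # True False
```

For the real file, you would replace `io.StringIO(sample)` with an open file handle over the downloaded CSV; the lookup itself stays an O(1) set-membership test.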
In the next section, you’ll learn to use this gazetteer lookup as a feature to predict IOB labels.
Additional Resources
- Corey Schafer’s tutorials on classes and OOP in Python (highly recommended for beginners)
- Official Python documentation of classes (recommended only after going through a gentler introduction such as the one above)