
Probabilistic Models for Entity Recognition

Recall that you studied unigram and bigram models for POS tagging earlier. In this segment, we’ll use the following two probabilistic models to find the most probable IOB label for each word:

  • The unigram chunker computes the unigram probabilities P(label | pos) and assigns each word the IOB label that is most likely for its POS tag.
  • The bigram chunker works similarly to the unigram chunker; the only difference is that the probability of an IOB label is now conditioned on both the current and the previous POS tags, i.e. P(label | pos, prev_pos).

Let’s study both these chunkers in detail.
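
As a rough illustration of the two chunkers described above, here is a minimal sketch written with plain functions rather than classes. The toy training sentences, the function names train_chunker and tag, and the "<s>" start-of-sentence marker are assumptions made for this example, not the code used in the course. The sketch counts how often each IOB label occurs with each POS context and keeps the most frequent label, which is the same as picking the label with the highest estimated P(label | pos) or P(label | pos, prev_pos).

```python
from collections import Counter, defaultdict

# Hypothetical toy training data: each sentence is a list of (word, pos, iob) triples.
TRAIN = [
    [("flights", "NNS", "O"), ("from", "IN", "O"), ("boston", "NNP", "B-city"),
     ("to", "TO", "O"), ("new", "NNP", "B-city"), ("york", "NNP", "I-city")],
    [("fly", "VB", "O"), ("to", "TO", "O"), ("san", "NNP", "B-city"),
     ("francisco", "NNP", "I-city")],
]

START = "<s>"  # assumed marker for "no previous POS tag" at the start of a sentence


def make_context(pos, prev_pos, use_prev_pos):
    """Unigram context is (pos,); bigram context is (prev_pos, pos)."""
    return (prev_pos, pos) if use_prev_pos else (pos,)


def train_chunker(sentences, use_prev_pos=False):
    """Map each POS context to its most frequent IOB label."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        prev_pos = START
        for _word, pos, iob in sentence:
            counts[make_context(pos, prev_pos, use_prev_pos)][iob] += 1
            prev_pos = pos
    # The most frequent label maximises the estimated conditional probability.
    return {ctx: labels.most_common(1)[0][0] for ctx, labels in counts.items()}


def tag(model, pos_tags, use_prev_pos=False, default="O"):
    """Assign an IOB label to each POS tag; unseen contexts fall back to `default`."""
    labels = []
    prev_pos = START
    for pos in pos_tags:
        labels.append(model.get(make_context(pos, prev_pos, use_prev_pos), default))
        prev_pos = pos
    return labels


unigram_model = train_chunker(TRAIN, use_prev_pos=False)
bigram_model = train_chunker(TRAIN, use_prev_pos=True)

test_pos = ["TO", "NNP", "NNP"]  # e.g. "to new york"
unigram_labels = tag(unigram_model, test_pos, use_prev_pos=False)
bigram_labels = tag(bigram_model, test_pos, use_prev_pos=True)
print(unigram_labels)
print(bigram_labels)
```

On this toy data the unigram model can only give every NNP the same label, while the bigram model can tell an NNP that follows another NNP (likely inside an entity) from one that follows TO (likely starting an entity).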

Note

We have used Python classes in the code. If you are not familiar with classes and object-oriented programming in general, we highly recommend studying them briefly. The additional resources provided below will help you learn the basics of classes and object-oriented programming in Python. You should also be able to rewrite the code using functions only.

Another way to identify named entities (like cities and states) is to look them up in a dictionary or a gazetteer. A gazetteer is a geographical directory that stores the names of geographical entities (cities, states, countries) along with some other features of those places. An example gazetteer file for the US is given below.

Data download URL: https://raw.githubusercontent.com/grammakov/USA-cities-and-states/master/us_cities_states_counties.csv
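
A gazetteer lookup could be sketched as follows. This is only an illustration: the function names load_gazetteer and gazetteer_lookup are hypothetical, and the code assumes the file is pipe-delimited with a header row containing "City" and "State full" columns, which you should verify against the downloaded file. A small inline sample stands in for the real file so that the example runs without a download.

```python
import csv
import io

# Inline sample standing in for the downloaded file.
# Assumption: pipe-delimited, with "City" and "State full" columns in the header.
SAMPLE = """City|State short|State full|County|City alias
Boston|MA|Massachusetts|SUFFOLK|Boston
New York|NY|New York|NEW YORK|New York
San Francisco|CA|California|SAN FRANCISCO|San Francisco
"""


def load_gazetteer(handle, columns=("City", "State full"), delimiter="|"):
    """Collect the lowercase words that appear in the chosen columns."""
    words = set()
    for row in csv.DictReader(handle, delimiter=delimiter):
        for column in columns:
            value = (row.get(column) or "").strip().lower()
            # Store individual words so that multi-word names such as
            # "new york" can be matched one token at a time.
            words.update(value.split())
    return words


def gazetteer_lookup(word, gazetteer):
    """Return True if the word occurs in any gazetteer entry."""
    return word.lower() in gazetteer


gazetteer = load_gazetteer(io.StringIO(SAMPLE))
for token in ["flights", "to", "Boston", "york"]:
    print(token, gazetteer_lookup(token, gazetteer))
```

To use the real data, pass an open file handle for the downloaded CSV instead of io.StringIO(SAMPLE). Storing single words keeps the lookup simple, at the cost of false positives for common words such as "new" that also appear inside place names.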

In the next section, you’ll learn to use a gazetteer lookup as a feature to predict IOB labels.

Additional Resources
