POS Tagging

Since the ATIS dataset is available in the form of individual tokens, the initial preprocessing steps (tokenisation etc.) are not required, so we move to POS tagging as the first preprocessing task. POS tagging gives a good intuition of which words could form an entity.

The main objective of this session is to learn to accurately assign IOB labels to the tokens. This is similar to POS tagging in that it is a sequence labelling task, except that instead of part-of-speech tags, we want to assign IOB labels to words.

The Named Entity Recognition (NER) task identifies ‘entities’ in the text. Entities could be names of people, organizations (e.g. Air India, United Airlines), places/cities (Mumbai, Chicago), dates and time points (May, Wednesday, morning flight), numbers of specific types (e.g. money: 5000 INR), etc. POS tagging by itself cannot identify such entities, which is why IOB labelling is required. The NER task, then, is to predict the IOB label of each word.
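To make the IOB scheme concrete, here is a minimal sketch (in Python) of what an ATIS-style query looks like once every token carries an IOB label. The label names below, such as B-fromloc.city_name, follow the ATIS slot convention but are used purely for illustration and may differ from the exact labels in your dataset.

    # Each token is paired with an IOB label:
    # 'B-' marks the beginning of an entity, 'I-' its continuation, 'O' a non-entity token.
    query = [
        ("show",    "O"),
        ("flights", "O"),
        ("from",    "O"),
        ("boston",  "B-fromloc.city_name"),   # single-token 'from' city entity
        ("to",      "O"),
        ("new",     "B-toloc.city_name"),     # multi-token 'to' city entity starts here ...
        ("york",    "I-toloc.city_name"),     # ... and continues with an 'I-' label
        ("on",      "O"),
        ("monday",  "B-depart_date.day_name"),
    ]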

Ashish will now list and explain the different approaches one can use for sequence labelling tasks:

To summarise, NER is a sequence labelling task where the labels are the IOB labels. There are several approaches we can use to predict the IOB labels:

Rule-based techniques:

  • Regular expression-based rules (see the sketch below)
  • Chunking

Probabilistic models:

  • Unigram and bigram models
  • Naive Bayes classifier, decision trees, SVMs, etc.
  • Conditional Random Fields (CRFs)
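For instance, a minimal regular-expression rule might look like the sketch below. The pattern and the flight_number entity type are purely illustrative and are not part of the actual ATIS label set.

    import re

    # Illustrative rule: flag tokens that look like flight numbers,
    # i.e. two uppercase letters followed by 2-4 digits, e.g. 'AA1234'.
    flight_number_pattern = re.compile(r"^[A-Z]{2}\d{2,4}$")

    def tag_flight_numbers(tokens):
        """Label tokens matching the rule as B-flight_number, everything else as O."""
        return [
            (tok, "B-flight_number" if flight_number_pattern.match(tok) else "O")
            for tok in tokens
        ]

    print(tag_flight_numbers(["book", "AA1234", "from", "boston"]))
    # [('book', 'O'), ('AA1234', 'B-flight_number'), ('from', 'O'), ('boston', 'O')]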

We’ll go through each of these approaches in the subsequent sections. You will also learn about chunking and CRFs in detail. Before that, let’s start with POS tagging.

[Correction: at 7:00 in the lecture, Ashish incorrectly mentions that the POS tag ‘PRP’ is a preposition (for the word ‘us’), though it is actually a pronoun.]

You saw that the NLTK POS tagger is not accurate: any word after ‘to’ gets tagged as a verb. For example, in all the queries of the form ‘.. from city_1 to city_2’, city_2 gets tagged as a verb. Think about why this might be happening.
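You can reproduce this behaviour with a quick check such as the one below. The exact output depends on your NLTK version and tagger model, but lowercase, unseen city names following ‘to’ often get tagged as verbs because the tagger has mostly seen ‘to’ introducing an infinitive.

    import nltk

    # Newer NLTK versions may name this resource "averaged_perceptron_tagger_eng".
    nltk.download("averaged_perceptron_tagger", quiet=True)

    tokens = ["i", "want", "a", "flight", "from", "memphis", "to", "seattle"]
    print(nltk.pos_tag(tokens))
    # Depending on the model, 'seattle' may come out as VB rather than NN/NNP,
    # since the tagger strongly associates the context after 'to' with verbs.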

To correct the POS tags manually, you can use the backoff option available in NLTK’s taggers (in the nltk.tag module). The backoff option allows you to chain multiple taggers together: if one tagger doesn’t know how to tag a word, it backs off to another one.
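A minimal sketch of such a chain is shown below. For illustration, it trains on the Penn Treebank sample shipped with NLTK rather than the ATIS data.

    import nltk
    from nltk.corpus import treebank

    nltk.download("treebank", quiet=True)
    train_sents = treebank.tagged_sents()[:3000]

    default_tagger = nltk.DefaultTagger("NN")                        # last resort: tag everything NN
    unigram_tagger = nltk.UnigramTagger(train_sents, backoff=default_tagger)
    bigram_tagger = nltk.BigramTagger(train_sents, backoff=unigram_tagger)

    # The bigram tagger is tried first; contexts it hasn't seen fall back to the
    # unigram tagger, and words unseen by both fall back to the default tag.
    print(bigram_tagger.tag(["show", "flights", "from", "boston", "to", "denver"]))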

It is difficult to get 100% accuracy in POS tagging. Therefore, in this exercise, we’ll stick with the NLTK POS tagger and use it for predicting the IOB labels. Also, while building classifiers for NER in the next sections, you’ll see that POS tags form just one ‘feature’ for prediction; we use other features as well (such as the morphology of words, the words themselves, and other derived features). We’ll see these features in later segments.
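As a preview, the feature dictionary for a word could look something like the sketch below. The exact features are illustrative; the later segments define the feature set actually used.

    def word_features(sent, i):
        """Illustrative feature dictionary for the i-th (word, POS tag) pair of a sentence."""
        word, pos = sent[i]
        return {
            "word": word.lower(),
            "pos": pos,                           # the POS tag is just one feature ...
            "suffix_3": word[-3:],                # ... word morphology is another
            "is_title": word.istitle(),
            "is_digit": word.isdigit(),
            "prev_word": sent[i - 1][0].lower() if i > 0 else "<START>",
            "next_word": sent[i + 1][0].lower() if i < len(sent) - 1 else "<END>",
        }

    sent = [("flights", "NNS"), ("from", "IN"), ("boston", "NN"), ("to", "TO"), ("denver", "NN")]
    print(word_features(sent, 2))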

Now, for each word, we have the POS tag. The final dataset is in 3-tuple form: (word, POS tag, IOB label). But NLTK doesn’t process the data in the form of tuples, so these tuples are converted to trees using the conlltags2tree() method in NLTK.

You can read more about this function from the following link: https://stackoverflow.com/questions/40879520/nltk-convert-a-chunked-tree-into-a-list-iob-tagging
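The round trip between the two representations is a one-liner in each direction, as sketched below with illustrative labels.

    from nltk.chunk import conlltags2tree, tree2conlltags

    # Each token is a (word, POS tag, IOB label) triple.
    conll_tags = [
        ("show",    "VB",  "O"),
        ("flights", "NNS", "O"),
        ("from",    "IN",  "O"),
        ("boston",  "NN",  "B-fromloc.city_name"),
        ("to",      "TO",  "O"),
        ("denver",  "NN",  "B-toloc.city_name"),
    ]

    tree = conlltags2tree(conll_tags)    # group IOB-labelled tokens into an nltk.Tree
    print(tree)

    # tree2conlltags() converts the tree back into (word, POS tag, IOB label) triples.
    assert tree2conlltags(tree) == conll_tags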