The ATIS dataset has five zip files. Each zip file contains three datasets (train, test and validation) and a dictionary. In the upcoming lecture, Ashish will walk you through the structure of the dataset.
NOTE: In recent versions of scikit-learn, RandomizedSearchCV is under sklearn.model_selection and no longer under sklearn.grid_search.
All three datasets are tuples containing three lists. The first list holds the tokenized words of the queries, encoded as integers such as 554, 194, 268 and so on. For example, the first three integers 554, 194, 268 encode the words ‘what’, ‘flights’, ‘leave’. Ignore the second list. The third list contains the (encoded) label of each word.
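The structure described above can be sketched with toy data. This is only an illustration, not the real ATIS files: the variable names, the second query and every label id are made up, and only the word ids 554, 194, 268 come from the text.

```python
# Toy stand-in for one ATIS split; the real data is loaded from the zip files.
# Only the word ids 554, 194, 268 come from the text; everything else is made up.
word_ids = [[554, 194, 268], [554, 194]]    # first list: encoded words, one list per query
ignored = [[0, 0, 0], [0, 0]]               # second list: not used here
label_ids = [[126, 126, 126], [126, 126]]   # third list: one (hypothetical) label id per word

train_set = (word_ids, ignored, label_ids)

# Unpack the tuple and skip the second list.
queries, _, labels = train_set

# Each word has exactly one label, so the two lists line up position by position.
pairs_per_query = [list(zip(q, l)) for q, l in zip(queries, labels)]
print(pairs_per_query[0])
```

Pairing each word id with its label id like this makes it easy to check that the first and third lists have matching lengths before decoding them.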
Labels are similar to POS tags, but instead of noun, verb, etc., we’ll use IOB (inside-outside-beginning) tags of entities like flight-time, source-city, etc. You’ll learn about IOB tagging in the next segment.
Decoding the list
Let’s decode the lists of words and labels using the dictionaries provided in the ATIS data set. In the upcoming lecture, you’ll see how to decode using the dictionary provided in the dataset.
The ATIS dataset has three dictionaries, of which two are required: ‘words2idx’, which is used to convert the first list back to the actual words of the queries, and ‘labels2idx’, which is used to convert the third list back to labels. As their names suggest, both map names to integer indices, so decoding uses them in reverse.
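A minimal sketch of the decoding step, assuming the dictionaries map strings to integer indices. The miniature dictionaries below are hypothetical stand-ins (the label names and ids are invented for illustration), and `invert` and `decode` are helper names introduced here, not part of the dataset.

```python
# Hypothetical miniature dictionaries; the real ones ship with the ATIS data.
words2idx = {"what": 554, "flights": 194, "leave": 268}
labels2idx = {"O": 0, "B-source-city": 1, "I-source-city": 2}


def invert(mapping):
    """Turn a {name: index} dictionary into {index: name}."""
    return {index: name for name, index in mapping.items()}


def decode(ids, idx2name):
    """Replace each integer id with the name it stands for."""
    return [idx2name[i] for i in ids]


idx2words = invert(words2idx)
idx2labels = invert(labels2idx)

query = decode([554, 194, 268], idx2words)
tags = decode([0, 0, 0], idx2labels)
print(list(zip(query, tags)))
```

Building the inverted dictionaries once and reusing them keeps decoding a simple lookup per token.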
You also saw some sample queries asking for information about flights. The structure of each query is quite different, so you’ll learn how to build a machine learning model that can fit some structure to these queries and extract the relevant entities from them.
IOB Labels
The next lecture focuses on decoding with the dictionary of labels and on IOB labeling. IOB labeling (also called BIO in some texts) is a standard way of labeling named entities.
The IOB (or BIO) method tags each token in a sentence with one of three labels: I (inside an entity), O (outside any entity) and B (beginning of an entity). You saw that IOB labeling is especially helpful when entities contain multiple words: we want our system to read words like ‘Air India’ and ‘New Delhi’ as single entities.
Consider the following example for IOB labeling:
I | booked | a | at | Smoke | House | Deli | for | two | on | Wednesday | at | 8:00
O | O | B-NP | O | B-restname | I-restname | I-restname | O | B-count | O | B-day | O | B-time
For any entity with more than one word, such as ‘Dallas Fort Worth’ or ‘Smoke House Deli’, the first word is labeled B-entity and the remaining words are labeled I-entity. A single-word entity gets only a B-entity label, and every token outside an entity is labeled O.
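The rule above can be sketched in code by reading the tags left to right and gluing each B- token to the I- tokens that follow it. The function name `extract_entities` is an assumption made for this illustration, and treating a stray I- tag (one that does not continue an open entity) as outside is just one possible choice.

```python
def extract_entities(tokens, tags):
    """Group IOB-tagged tokens into (entity_type, text) pairs."""
    entities = []
    current_type, current_words = None, []

    def flush():
        # Emit the entity collected so far, if any, and reset the buffer.
        nonlocal current_type, current_words
        if current_words:
            entities.append((current_type, " ".join(current_words)))
        current_type, current_words = None, []

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new entity.
            flush()
            current_type, current_words = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # An I- tag of the same type extends the open entity.
            current_words.append(token)
        else:
            # "O", or an I- tag that does not continue the open entity.
            flush()
    flush()
    return entities


tokens = ["at", "Smoke", "House", "Deli", "for", "two"]
tags = ["O", "B-restname", "I-restname", "I-restname", "O", "B-count"]
print(extract_entities(tokens, tags))
```

On this fragment of the example, ‘Smoke House Deli’ comes out as one restname entity and ‘two’ as a single-word count entity.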