Let’s now study the Python implementation of a CRF on the ATIS data. We will build a CRF model by defining a set of features. We will also create some new features beyond those used previously, such as the prefix and suffix of the word (i.e. the first and last few letters of the word); this is just to demonstrate that one can create such features based on word morphology.
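To make the idea of prefix and suffix features concrete, here is a minimal sketch of a feature function. It assumes each sentence is a list of (word, POS tag) tuples and that each word is described by a dictionary of features, which is the input format accepted by libraries such as sklearn-crfsuite. The names `word2features` and `sent2features` and the exact feature set are illustrative assumptions; the notebook’s actual features may differ.

```python
def word2features(sent, i):
    """Build a feature dict for the i-th word of a sentence.

    `sent` is assumed to be a list of (word, pos_tag) tuples.
    """
    word, pos = sent[i][0], sent[i][1]
    features = {
        "bias": 1.0,
        "word.lower": word.lower(),
        "prefix3": word[:3].lower(),   # first three letters of the word
        "suffix3": word[-3:].lower(),  # last three letters of the word
        "pos": pos,
        "is_digit": word.isdigit(),
    }
    # Context features: neighbouring words, or sentence-boundary flags.
    if i == 0:
        features["BOS"] = True
    else:
        features["prev_word"] = sent[i - 1][0].lower()
    if i == len(sent) - 1:
        features["EOS"] = True
    else:
        features["next_word"] = sent[i + 1][0].lower()
    return features


def sent2features(sent):
    """Feature dicts for every word in the sentence, in order."""
    return [word2features(sent, i) for i in range(len(sent))]


example = [("flights", "NNS"), ("to", "TO"), ("boston", "NN")]
for feats in sent2features(example):
    print(feats)
```

Words shorter than three letters simply yield the whole word as both prefix and suffix, which is usually acceptable for a first experiment.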
Let’s now understand what the CRF classifier has learnt. First, we will save the trained CRF model in a pickle file, and then use it to make predictions on the test set.
This brings us to the end of CRFs and the session on Information Extraction.
Building a Flight Booking Application: An Exercise
The bottom part of the notebook (under the heading ‘Building an Application’) contains some extra code that is not covered in the lectures. That section is entirely optional, though we recommend that you go through the code and experiment with it if you have extra time.
The idea of ‘building an application’ is that once you are able to parse queries and extract entities from them, you can pass those entities to a database or an external API, which returns a list of flights (i.e. schedules) that you can suggest to the user. Such a system can be used to build, say, a flight-booking chatbot.
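As a rough illustration of the step between the model and the API call, the sketch below groups per-token predictions into entities. It assumes the model emits IOB-style labels (for example `B-fromloc.city_name`, `I-fromloc.city_name`, `O`), which is how ATIS slot labels are commonly encoded; the function name `collect_entities` is hypothetical and is not taken from the notebook.

```python
def collect_entities(tokens, labels):
    """Group IOB-labelled tokens into a dict of entity type -> list of phrases."""
    spans = []        # list of (entity_type, [words]) in order of appearance
    open_type = None  # entity type of the span currently being extended
    for token, label in zip(tokens, labels):
        if label.startswith("B-"):
            open_type = label[2:]
            spans.append((open_type, [token]))
        elif label.startswith("I-") and label[2:] == open_type:
            spans[-1][1].append(token)
        else:
            # "O", or an I- tag that does not continue the open span:
            # close the span and drop the stray token.
            open_type = None

    entities = {}
    for entity_type, words in spans:
        entities.setdefault(entity_type, []).append(" ".join(words))
    return entities


tokens = "flights from new york to boston".split()
labels = ["O", "O", "B-fromloc.city_name", "I-fromloc.city_name",
          "O", "B-toloc.city_name"]
print(collect_entities(tokens, labels))
```

The resulting dictionary is a convenient shape for building a database query or the parameters of an API request.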
In this case, we have used the FlightStats API to get flight schedules. In the code, you will observe that this task involves a number of non-trivial data processing/cleaning steps, such as:
- Resolving inconsistencies in data format:
The API needs the date of the flight in a specific format, such as ‘dd/mm/yyyy’, so you need to convert entities such as ‘the thirtieth of August’ to ‘30/08/2018’ (this is a fairly non-trivial problem).
There are multiple airports in some cities (New York has three), and the API call returns a list of flights to/from all airports. You may want to filter the list to contain only the main airports before showing results to the user. Also, the API needs airport codes such as ‘JFK’ for the New York airport, so you need to map the city name to the corresponding airport code.
- Preprocessing:
User queries will be in the form of strings, not a list of tuples like we have in the training set. You need to word-tokenize the query, assign POS tags, and convert it to a suitable format to feed to the model.
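The cleaning steps listed above can be sketched with the standard library alone. This is a simplified illustration, not the notebook’s code: the lookup tables, the function names (`tokenize`, `normalize_date`, `airport_code`), the choice of one main airport per city, and the need to supply the year explicitly are all assumptions. POS tags would still have to come from a tagger (for instance NLTK’s `pos_tag`), which is not shown here.

```python
import re
from datetime import date

# Illustrative lookup tables (assumptions); a real system needs far more coverage.
ORDINALS = {
    "first": 1, "second": 2, "third": 3, "fourth": 4, "fifth": 5,
    "sixth": 6, "seventh": 7, "eighth": 8, "ninth": 9, "tenth": 10,
    "eleventh": 11, "twelfth": 12, "thirteenth": 13, "fourteenth": 14,
    "fifteenth": 15, "sixteenth": 16, "seventeenth": 17, "eighteenth": 18,
    "nineteenth": 19, "twentieth": 20, "thirtieth": 30,
}
TENS = {"twenty": 20, "thirty": 30}
MONTH_NAMES = ["january", "february", "march", "april", "may", "june", "july",
               "august", "september", "october", "november", "december"]
MONTHS = {name: number for number, name in enumerate(MONTH_NAMES, start=1)}
# One "main" airport per city, chosen here purely for illustration.
AIRPORT_CODES = {"new york": "JFK", "boston": "BOS", "denver": "DEN"}


def tokenize(query):
    """Lowercase a raw query string and split it into word tokens."""
    return re.findall(r"[a-z0-9]+(?:'[a-z]+)?", query.lower())


def normalize_date(phrase, year):
    """Convert e.g. 'the thirtieth of August' to 'dd/mm/yyyy', or return None."""
    words = [w for w in re.findall(r"[a-z]+", phrase.lower())
             if w not in {"the", "of"}]
    if len(words) < 2 or words[-1] not in MONTHS:
        return None  # only the '<day> of <month>' word order is handled
    day = 0
    for w in words[:-1]:
        # Day words are simply summed ('twenty' + 'first' = 21);
        # odd combinations are not validated in this sketch.
        if w in TENS:
            day += TENS[w]
        elif w in ORDINALS:
            day += ORDINALS[w]
        else:
            return None  # unknown word: refuse rather than guess
    try:
        return date(year, MONTHS[words[-1]], day).strftime("%d/%m/%Y")
    except ValueError:
        return None  # impossible date, e.g. 'thirty first of June'


def airport_code(city):
    """Map a city name to an airport code, or None if the city is unknown."""
    return AIRPORT_CODES.get(city.strip().lower())


print(tokenize("Show me flights from Boston to New York"))
print(normalize_date("the thirtieth of August", 2018))  # 30/08/2018
print(airport_code("New York"))
```

Returning `None` for anything the tables do not cover keeps failures visible, so the application can ask the user to rephrase instead of sending a wrong date or airport to the API.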
This exercise is intended to give you some exposure to the tasks that one needs to perform to build an ML-based product, though in many cases these are done by engineering teams, not data scientists.