Even though textual data is widely available, the complexity of natural language makes it difficult to extract useful information from text. In this session, you’ll learn to build an Information Extraction (IE) system that can extract structured data from unstructured text. A key component of information extraction systems is Named-Entity Recognition (NER). You’ll learn various techniques and models for building NER systems in this session.
Let’s say you are making a conversational flight-booking system, which can show relevant flights when given a natural-language query such as “Please show me all morning flights from Bangalore to Mumbai on next Monday.” For the system to be able to process this query, it has to extract useful named entities from the unstructured text query and convert them to a structured format, such as the following dictionary/JSON object:
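As an illustration, such an object might look like the sketch below. The key names (`source`, `destination`, `day`, `time_of_day`) are assumptions made for this example, not a fixed schema used by any particular system.

```python
import json

# The natural-language query from the example above.
query = "Please show me all morning flights from Bangalore to Mumbai on next Monday."

# Hypothetical structured form of the query.
# Key names and value formats are illustrative assumptions.
entities = {
    "source": "Bangalore",
    "destination": "Mumbai",
    "day": "Monday",
    "time_of_day": "morning",
}

# The same object serialized as JSON.
print(json.dumps(entities, indent=2))
```

Every value in the dictionary is a span of text taken directly from the query; the extraction system's job is to find those spans and assign each one the right label.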
Using these entities, you could query a database and get all relevant flight results. In general, named entities refer to names of people, organizations (Google), places (India, Mumbai), specific dates and times (Monday, 8 pm), and so on.
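To make the "query a database" step concrete, here is a minimal sketch that filters a small in-memory list instead of a real database. The flight records, the flight codes, the field names, and the `find_flights` helper are all hypothetical and exist only for this illustration.

```python
from typing import Dict, List

# Hypothetical in-memory stand-in for a flight database.
FLIGHTS: List[Dict[str, str]] = [
    {"flight": "XX101", "source": "Bangalore", "destination": "Mumbai",
     "day": "Monday", "time_of_day": "morning"},
    {"flight": "XX202", "source": "Bangalore", "destination": "Mumbai",
     "day": "Monday", "time_of_day": "evening"},
    {"flight": "XX303", "source": "Delhi", "destination": "Mumbai",
     "day": "Monday", "time_of_day": "morning"},
]

def find_flights(entities: Dict[str, str],
                 flights: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Return the flights whose fields match every extracted entity."""
    return [
        flight for flight in flights
        if all(flight.get(key) == value for key, value in entities.items())
    ]

# Entities assumed to have been extracted from the user's query.
entities = {
    "source": "Bangalore",
    "destination": "Mumbai",
    "day": "Monday",
    "time_of_day": "morning",
}

matches = find_flights(entities, FLIGHTS)
print([flight["flight"] for flight in matches])
```

Once the entities are in a structured form, the lookup itself is an ordinary exact-match filter; the hard part, which this session focuses on, is producing the `entities` dictionary from free text.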
In this session, you will learn to build systems that can extract structured information from unstructured text data. In the process, you will learn the concepts listed below.
In this session
This session will introduce you to the following topics:
- Named-Entity Recognition
  - I-O-B labels
- Building models for entity recognition
  - Rule-based techniques
    - Regular expression-based techniques
    - Chunking
  - Probabilistic models
    - Unigram and bigram models
    - Naive Bayes classifier
    - Decision trees
    - Conditional Random Fields (CRFs) (optional)
People you will hear from in this session:
Software Engineer, Google
We’ll use the ATIS (Airline Travel Information System) dataset to build an IE system. The ATIS dataset consists of English language queries for booking (or requesting information about) flights in the US. In the following lecture, Ashish will provide an overview of this session and the ATIS dataset.
You can download the ATIS dataset from here:
We’ll use the following Jupyter notebook for the entire session. We recommend that you run the code alongside the videos, going through each code chunk and understanding the logic behind it.
NOTE: In recent versions of scikit-learn, RandomizedSearchCV is located in sklearn.model_selection and no longer in sklearn.grid_search.
In the next few segments, you will understand the structure of the dataset and perform some basic preprocessing steps, such as POS tagging.