The first layer of a conversational system, Natural Language Understanding (NLU), interprets the free text provided by the user. It takes an unstructured phrase or sentence, infers what the user most likely intends to say, extracts entities from the text, and converts it into structured data.
Consider the following example where a user wants to know the weather on a given day:
Unstructured user query: What's the weather like in Bangalore today?
Intent: Weather search; Entities: [location=Bangalore, date=today]
This data can then be stored, for example, as a dictionary:
{"intent":"weather_search",
"entity":
{"entity_value":"Bangalore",
"entity_type":"location"},
{"entity_value":"today",
"entity_type":"date"}
}
In the next video, Aiana will explain the concept of intents and entities in more detail.
To summarise, Rasa NLU and Rasa Core are two open-source libraries for building conversational agents. Rasa NLU is the tool used for intent classification and entity extraction.
You can read more about Rasa NLU from its official page.
Note that Rasa NLU and Rasa Core are two independent layers (APIs) – you can use any one or both of them in a project, although in this project, you will use both.
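To make this concrete, here is a minimal sketch of how you might query a trained NLU model from Python. It assumes a Rasa 1.x installation; the model path is hypothetical:

from rasa.nlu.model import Interpreter

# Load a previously trained NLU model from disk (path is hypothetical).
interpreter = Interpreter.load("./models/nlu")

# Parse a raw user message into structured data.
result = interpreter.parse("What's the weather like in Bangalore today?")
print(result["intent"])    # e.g. {"name": "weather_search", "confidence": 0.93}
print(result["entities"])  # e.g. [{"entity": "location", "value": "Bangalore"}, ...]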
In the next video, Aiana will explain the process of generating training data for Rasa NLU. Before that, please download the zipped folder provided below; it contains all the data sets and starter code used in this session. We recommend that you go through the files in the folder and run the code along with this session (for which you need to complete the installation).
In general, the training data for Rasa NLU is structured into three parts, namely:
- (Training) Examples
- Synonyms
- Regex features
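As a rough sketch (with illustrative intent and entity names), these three parts appear as separate sections in a markdown training file:

## intent:greet
- hey there

## synonym:Bangalore
- Bengaluru

## regex:zipcode
- [0-9]{5}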
If you look at the data folder provided in the Rasa_basic_folder, you will see that the nlu.md file is divided into three components, namely, regex_features, entity_synonyms and common_examples.
The ‘common_examples’ component is the most important since it contains all your training examples. The more numerous and varied your training examples, the better the performance of your NLU layer (i.e., intent and entity recognition).
Each common example further has three components, namely, text, intent and entities:
- Text is the search query: an example of what would be submitted for parsing. [required]
- Intent is the intent that should be associated with the text. [optional]
- Entities are the specific parts of the text that need to be identified. [optional]
## intent:restaurant_search
- I am looking for some restaurants in [Delhi](location).
Here, Delhi is the entity, specified in square brackets, and its entity type is location, which is specified in round brackets.
You can also create regular expression features for both intent and entity extraction. For example, the following piece of code specifies regexes for extracting a zip code and for recognising a greeting:
## regex:greet
- hey[^\s]*
## regex:zipcode
- [0-9]{5}
These regexes are used as feature functions of Conditional Random Fields (CRFs). Like HMMs (Hidden Markov Models), CRFs are a sequence modelling algorithm, but unlike HMMs, CRFs use feature functions.
The purpose of a feature function is to express some characteristics of the sequence that the data point represents. You can read about CRFs in detail in the optional content provided here. It is not necessary for you to go through CRFs in detail; they are used in the ner_crf component provided by Rasa NLU, and you'll study NLU components shortly.
Also, note that in the nlu.md file, ‘zipcode’ and ‘greet’ are not entities; they are simply human-readable names that we have chosen for our convenience. In Rasa NLU, they will simply be used as feature functions of the CRF, and, hopefully, be useful for extracting some entities.
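To illustrate what a feature function looks like, here is a minimal sketch using the sklearn-crfsuite library rather than Rasa's internal ner_crf component; the tokens, labels and features are illustrative:

import re
import sklearn_crfsuite

def word2features(tokens, i):
    # Express some characteristics of the token at position i.
    word = tokens[i]
    return {
        "word.lower": word.lower(),
        "is_digit": word.isdigit(),
        # A regex used as a feature function, as in the zipcode example above.
        "matches_zipcode": bool(re.fullmatch(r"[0-9]{5}", word)),
    }

# One toy training sentence with BIO entity labels.
tokens = ["ship", "it", "to", "90210"]
labels = ["O", "O", "O", "B-zipcode"]

X = [[word2features(tokens, i) for i in range(len(tokens))]]
y = [labels]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, y)
print(crf.predict(X))  # expected: [['O', 'O', 'O', 'B-zipcode']]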
Rasa also supports mapping synonyms or misspellings to a single entity value.
For example: Bangalore <-> Bengaluru
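In the markdown training data, this mapping is written as a synonym section, for example:

## synonym:Bangalore
- Bengaluru

Here, whenever the entity extractor finds ‘Bengaluru’, the entity value will be mapped to ‘Bangalore’.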
Lookup tables can also be specified in the training data, either as external files or as lists of elements.
These lookup tables are designed to contain all of the known values that you’d expect your entities to take.
Lookup elements may be directly included as a list:
## lookup:plates
- beans
- rice
- tacos
- cheese
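Alternatively, a lookup table can refer to an external file containing one element per line; the file path here is hypothetical:

## lookup:plates
data/plates.txt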
For additional reading on entity extraction, you can refer to the blog linked here.
The training data can be in JSON format as well, but the markdown format is preferred as it is more human-readable than JSON. You can read more about the markdown format here.
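For reference, here is a sketch of the earlier restaurant_search example in the JSON format (‘start’ and ‘end’ are character offsets of the entity within the text):

{
  "rasa_nlu_data": {
    "common_examples": [
      {
        "text": "I am looking for some restaurants in Delhi.",
        "intent": "restaurant_search",
        "entities": [
          {"start": 37, "end": 42, "value": "Delhi", "entity": "location"}
        ]
      }
    ],
    "regex_features": [],
    "entity_synonyms": []
  }
}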
Read the documentation on NLU training and data format, and answer the following questions: