
Information Extraction

Natural language is highly unstructured and complex, which makes it difficult for a system to process directly. Information Extraction (IE) is the task of retrieving structured information from unstructured text data.

IE is used in a wide variety of NLP applications, such as conversational agents (chatbots) and extracting structured summaries from large corpora such as Wikipedia. In fact, modern virtual assistants such as Apple’s Siri, Amazon’s Alexa and Google Assistant use sophisticated IE systems to extract information from large encyclopedias.

However, no matter how complex the IE task, there are some common steps (or subtasks) which form the pipeline of almost all IE systems.

Next, you’ll study the major steps in an IE pipeline and build them one by one. Most IE pipelines start with the usual text preprocessing steps: sentence segmentation, word tokenisation and POS tagging. After preprocessing, the common tasks are Named Entity Recognition (NER) and, optionally, relation recognition and record linkage. A generic IE pipeline therefore looks something like this: preprocessing, followed by entity recognition, relation recognition and record linkage.

NER is arguably the most important and least trivial task in the pipeline. Next, Ashish will discuss all the elements of the pipeline.

To summarise, a generic IE pipeline is as follows:

Preprocessing

  • Sentence Tokenisation: splits the text into sentences (sentence segmentation).
  • Word Tokenisation: breaks sentences down into tokens.
  • POS tagging: assigns POS tags to the tokens. The POS tags help in defining which words could form an entity.
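
The three preprocessing steps above can be sketched in a few lines. This is only an illustration using the standard library: the regular expressions, the function names and the tiny `TOY_LEXICON` are assumptions made for this example, and the sample sentence is invented rather than taken from the course dataset. A real pipeline would normally use a trained tokeniser and POS tagger from an NLP library.

```python
import re

# Toy lexicon (hypothetical): a real POS tagger is trained on annotated data.
TOY_LEXICON = {
    "i": "PRP", "want": "VBP", "to": "TO", "fly": "VB", "from": "IN",
    "boston": "NNP", "denver": "NNP", "show": "VB", "me": "PRP",
    "flights": "NNS",
}

def split_sentences(text):
    """Sentence tokenisation: split after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def tokenize(sentence):
    """Word tokenisation: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)

def pos_tag(tokens):
    """POS tagging by lexicon lookup; punctuation is tagged as itself,
    and unknown words default to NN (a crude assumption)."""
    tagged = []
    for tok in tokens:
        if not tok.isalnum():
            tagged.append((tok, tok))
        else:
            tagged.append((tok, TOY_LEXICON.get(tok.lower(), "NN")))
    return tagged

text = "I want to fly from Boston to Denver. Show me flights."
sentences = split_sentences(text)
tagged = [pos_tag(tokenize(s)) for s in sentences]

for sent in tagged:
    print(sent)
```

The output of each step feeds the next one: sentences are split into tokens, and tokens receive tags that later steps can use to decide which words may form an entity.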

Entity Recognition

  • Rule-based models
  • Probabilistic models

In entity recognition, every token is tagged with an IOB (inside-outside-beginning) label, and adjacent tokens are then combined into entities based on their labels.
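
As a minimal sketch of the combining step, the function below merges IOB-labelled tokens into entities. It assumes the common `B-TYPE` / `I-TYPE` / `O` label format; the function name `iob_to_entities`, the entity types and the example tokens are hypothetical and chosen only for illustration.

```python
def iob_to_entities(tagged_tokens):
    """Merge consecutive B-/I- tokens of the same type into (text, type) pairs."""
    entities = []
    current_tokens, current_type = [], None

    def flush():
        # Close the entity being built, if there is one.
        nonlocal current_tokens, current_type
        if current_tokens:
            entities.append((" ".join(current_tokens), current_type))
        current_tokens, current_type = [], None

    for token, label in tagged_tokens:
        if label.startswith("B-"):
            flush()  # a B- tag always starts a new entity
            current_tokens, current_type = [token], label[2:]
        elif label.startswith("I-") and current_type == label[2:]:
            current_tokens.append(token)  # continue the open entity
        else:
            flush()  # "O" ends the entity; a stray I- tag is ignored
    flush()
    return entities

example = [
    ("flights", "O"), ("from", "O"),
    ("San", "B-LOC"), ("Francisco", "I-LOC"),
    ("to", "O"),
    ("New", "B-LOC"), ("York", "I-LOC"),
    ("on", "O"),
    ("Air", "B-ORG"), ("India", "I-ORG"),
]
print(iob_to_entities(example))
```

The labels themselves would come from a rule-based or probabilistic model; this sketch only covers how the labels are turned into entity spans.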

  • Relation Recognition is the task of identifying relationships between the named entities. Using entity recognition, we can identify places (pl), organisations (o) and persons (p). Relation recognition then finds the relation between a pair such as (pl, o), where o is located in pl, or (o, p), where p works at o.
  • Record Linkage refers to the task of linking two or more records that belong to the same entity. For example, Bangalore and Bengaluru refer to the same entity.
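
Both tasks can be illustrated with a small rule-based sketch. Everything here is an assumption made for the example: the cue phrases in `RULES`, the function names, the sample sentence and the similarity threshold of 0.6. Record linkage is approximated with string similarity from the standard library's `difflib`, which is a stand-in and not the record linkage toolkit mentioned under Additional Reading.

```python
import re
from difflib import SequenceMatcher

# Hypothetical cue patterns:
# (type of first entity, regex for the text in between, type of second entity, relation)
RULES = [
    ("ORG", r"\b(is located in|is based in|in)\b", "LOC", "located_in"),
    ("PER", r"\b(works at|works for|works in)\b", "ORG", "works_in"),
]

def find_relations(sentence, entities):
    """Relation recognition over adjacent entity pairs.
    `entities` is a list of (text, type) in order of appearance,
    and every entity text is assumed to occur in `sentence`."""
    relations = []
    for (a_text, a_type), (b_text, b_type) in zip(entities, entities[1:]):
        start = sentence.find(a_text) + len(a_text)
        end = sentence.find(b_text, start)
        between = sentence[start:end]  # text connecting the two entities
        for first, cue, second, name in RULES:
            if a_type == first and b_type == second and re.search(cue, between):
                relations.append((a_text, name, b_text))
    return relations

def same_entity(a, b, threshold=0.6):
    """Record linkage by string similarity; the threshold is an arbitrary choice."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

sentence = "Ravi works at Infosys, which is based in Bengaluru."
entities = [("Ravi", "PER"), ("Infosys", "ORG"), ("Bengaluru", "LOC")]
print(find_relations(sentence, entities))
print(same_entity("Bangalore", "Bengaluru"))
print(same_entity("Bangalore", "Mumbai"))
```

Pure string similarity is a weak signal on its own: distinct places with similar spellings, such as Bangalore and Mangalore, would also be linked by this check, so practical systems combine several fields or use curated alias lists.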

Next, we will implement the first few preprocessing steps on the airlines’ dataset.

Additional Reading

  • Relation Recognition: You can read further on this topic from here. (Refer to the 6th segment).
  • Record Linkage: Refer to the record linkage toolkit in Python for further reading.
