
Basic Lexical Processing

Introduction

Welcome to the second session of the first module. In the last session, you learnt regular expressions and their use cases. That is an essential skill to have before taking on any kind of text processing.

In this session, you will learn basic lexical processing. You will get to know the various preprocessing steps you need to apply before you can do any kind of text analytics, such as applying machine learning to text, building language models, building chatbots, or building sentiment analysis systems. These steps are used in almost all applications that work with textual data. We will also build a spam-ham detector system side by side on a very unclean corpus of text. 'Corpus' is simply the term used in NLP jargon for a body of textual data.

Now, you have already built a spam detector while learning about the Naive Bayes classifier. Here, you will learn all the preprocessing steps that one needs to carry out before using a machine learning algorithm on the spam messages dataset. Note that the preprocessing steps taught here are not limited to building a spam detector.

Specifically, you will learn:

  • How to preprocess text using techniques such as the following (a brief preview sketch follows this list):
    • Tokenisation
    • Stop words removal
    • Stemming
    • Lemmatisation
  • How to build a spam detector using one of the following models (also sketched briefly below):
    • Bag-of-words model
    • TF-IDF model
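As a brief preview (each technique is covered in detail later in this session), the sketch below walks through the four preprocessing steps on a single made-up message using the NLTK library. It assumes NLTK is installed and that the 'punkt', 'stopwords' and 'wordnet' resources are available; the message and variable names are purely illustrative.

    import nltk
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer, WordNetLemmatizer

    # One-time downloads of the resources used below.
    nltk.download('punkt')
    nltk.download('stopwords')
    nltk.download('wordnet')

    message = "Congratulations! You have won a free ticket. Call now to claim."

    # 1. Tokenisation: split the raw message into individual tokens (words).
    tokens = nltk.word_tokenize(message.lower())

    # 2. Stop words removal: drop very common words ('a', 'to', 'have', ...)
    #    that carry little information for classification.
    stop_words = set(stopwords.words('english'))
    tokens = [t for t in tokens if t.isalpha() and t not in stop_words]

    # 3. Stemming: crudely chop each token down to a root form.
    stemmer = PorterStemmer()
    stems = [stemmer.stem(t) for t in tokens]

    # 4. Lemmatisation: reduce each token to its dictionary base form.
    lemmatizer = WordNetLemmatizer()
    lemmas = [lemmatizer.lemmatize(t) for t in tokens]

    print(stems)   # note: stems need not be valid English words
    print(lemmas)

Notice that stemming simply chops word endings and may produce tokens that are not real words, whereas lemmatisation returns valid dictionary forms at the cost of being slower.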
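Likewise, here is an illustrative sketch of the two feature-extraction models, built with scikit-learn's CountVectorizer and TfidfVectorizer (the exact tooling used later in the session may differ, and a reasonably recent scikit-learn version is assumed). The toy messages are only for demonstration.

    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

    messages = [
        "Congratulations, you have won a free prize",
        "Are we still meeting for lunch today",
        "Free entry in a weekly prize draw, text WIN now",
    ]

    # Bag-of-words: every message becomes a vector of raw word counts
    # over the vocabulary built from the whole corpus.
    bow = CountVectorizer(stop_words='english')
    bow_matrix = bow.fit_transform(messages)
    print(bow.get_feature_names_out())
    print(bow_matrix.toarray())

    # TF-IDF: the counts are re-weighted so that words appearing in many
    # messages (and hence less discriminative) receive lower weights.
    tfidf = TfidfVectorizer(stop_words='english')
    tfidf_matrix = tfidf.fit_transform(messages)
    print(tfidf_matrix.toarray().round(2))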

Prerequisites

There are no prerequisites for this session other than knowledge of the previous session and the previous module.

Guidelines for in-module questions

The in-video and in-content questions for this module are not graded. Note that graded questions are given on a separate page labeled ‘Graded Questions’ at the end of this session. The graded questions in this session will adhere to the following guidelines:

                              First Attempt Marks    Second Attempt Marks
Question with 2 Attempts      10                      5
Question with 1 Attempt       10                      0
