Introduction
In the previous session, you learnt the basics of lexical processing, such as stop word removal, tokenisation, stemming and lemmatisation, followed by creating bag-of-words and tf-idf models and, finally, building a spam detector. These preprocessing steps are applicable in almost every text analytics application.
Even after going through all those preprocessing steps that you learnt in the previous session, a lot of noise is still present in the data. For example, spelling mistakes, which occur unintentionally (typos) as well as by choice (informal words such as ‘lol’, ‘awsum’, etc.). To handle such situations, you’ll learn how to identify and process incorrectly spelt words. Also, you’ll learn how to deal with spelling variations of a word that occur due to different pronunciations (e.g. Bangalore, Bengaluru).
At the end of the session, you’ll also learn how to tokenise text efficiently. You’ve already learnt how to tokenise words, but one problem with the simple tokenisation approach is that it can’t detect terms that are made up of more than one word. Terms such as ‘Hong Kong’, ‘Calvin Klein’, ‘International Institute of Information Technology’, etc. consist of more than one word, yet each represents a single ‘token’. There is no reason why we should have ‘Hong’ and ‘Kong’ as separate tokens. You’ll study techniques for building such intelligent tokenisers.
In this session, you’ll learn:
- Phonetic hashing and the Soundex algorithm to handle different pronunciations of a word
- The minimum-edit-distance algorithm and building a spell corrector
- The pointwise mutual information (PMI) score to preserve terms that consist of more than one word (short illustrative sketches of all three techniques follow this list)
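
To give you a flavour of what’s ahead, here is a minimal sketch of the Soundex algorithm in Python. This follows the common American Soundex variant; the finer rules (for instance, how ‘h’ and ‘w’ are treated) differ slightly across implementations, so treat this as one reasonable version rather than the definitive one. It hashes a word to a letter followed by three digits, so that words that sound alike map to the same code:

```python
def soundex(word):
    """American Soundex: hash a word to a letter followed by three digits."""
    word = word.upper()
    # Digit codes for consonant groups; vowels and y get no code
    codes = {**dict.fromkeys('BFPV', '1'),
             **dict.fromkeys('CGJKQSXZ', '2'),
             **dict.fromkeys('DT', '3'),
             'L': '4',
             **dict.fromkeys('MN', '5'),
             'R': '6'}
    hash_code = word[0]                    # keep the first letter as-is
    prev = codes.get(word[0], '')
    for char in word[1:]:
        code = codes.get(char, '')
        if code and code != prev:          # drop consecutive repeats of a code
            hash_code += code
        if char not in 'HW':               # h/w do not break a run of equal codes
            prev = code
    return (hash_code + '000')[:4]         # pad with zeros / trim to length 4

print(soundex('Bangalore'), soundex('Bengaluru'))   # B524 B524
```

Both spellings hash to B524, which is exactly how phonetic hashing lets you treat pronunciation variants of a word as one canonical form.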
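Next, a minimal dynamic-programming sketch of the minimum edit (Levenshtein) distance, assuming unit costs for insertion, deletion and substitution. A spell corrector typically suggests the dictionary word at the smallest edit distance from a misspelt word:

```python
def edit_distance(source, target):
    """Minimum number of insertions, deletions and substitutions
    needed to turn `source` into `target`."""
    m, n = len(source), len(target)
    # dp[i][j] = distance between source[:i] and target[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i                               # delete all i characters
    for j in range(n + 1):
        dp[0][j] = j                               # insert all j characters
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,       # deletion
                           dp[i][j - 1] + 1,       # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[m][n]

print(edit_distance('awsum', 'awesome'))   # 3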
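Finally, a toy illustration of the PMI score, defined as PMI(w1, w2) = log2( P(w1, w2) / (P(w1) P(w2)) ). Here the probabilities are estimated from raw unigram and bigram counts over a small made-up token list (the corpus below is hypothetical; a real application would use a large corpus). A high PMI indicates that two words co-occur far more often than chance, suggesting they form a single term:

```python
import math
from collections import Counter

def pmi(tokens, w1, w2):
    """PMI(w1, w2) = log2( P(w1, w2) / (P(w1) * P(w2)) ),
    with probabilities estimated from raw corpus counts."""
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))     # n - 1 bigrams in the list
    p_w1 = unigrams[w1] / n
    p_w2 = unigrams[w2] / n
    p_pair = bigrams[(w1, w2)] / (n - 1)
    return math.log2(p_pair / (p_w1 * p_w2))

# Hypothetical toy corpus: 'hong' and 'kong' always occur together
tokens = ('the weather in hong kong is humid and '
          'hong kong is a busy city').split()
print(round(pmi(tokens, 'hong', 'kong'), 2))       # 2.91 -> strong association
```

The high score for ('hong', 'kong') is the signal an intelligent tokeniser uses to keep ‘Hong Kong’ as a single token.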
Prerequisites
There are no prerequisites for this session other than knowledge of the previous session and the previous module.
Guidelines for in-module questions
The in-video and in-content questions for this module are not graded. Note that graded questions are given on a separate page labelled ‘Graded Questions’ at the end of this session. The graded questions in this session will adhere to the following guidelines:
| | First Attempt Marks | Second Attempt Marks |
| --- | --- | --- |
| Question with 2 Attempts | 10 | 5 |
| Question with 1 Attempt | 10 | 0 |