In the last section , you saw how to deal with different pronunciations of a particular word . Next, you’ll learn how to deal with misspelling . As already discussed , misspellings need to be corrected in order to
stem or lemmatize efficiently. The problem of misspellings is so common these days, especially in text data from social media, that it makes working with text extremely difficult, if not dealt with.
Now, to handle misspellings, you’ll learn how to make a spell corrector. All the misspelt words will be corrected to the correct spelling. In other words, all the misspelt words will be canonicalised to the base form, which is the correct spelling of that word. But to really understand how a spell corrector works, you’ll need to understand the concept of edit distance.
An edit distance is a distance between two strings which is a non-negative integer number. Professor Srinath explains the concept of edit distance in the following lecture.
As you just learnt, an edit distance is the number of edits that are needed to convert a source string to a target string.
Now, the question that comes to the mind is – what’s an edit? An edit operation can be one of the following:
- Insertion of a letter in the source string. To convert ‘color’ to ‘colour’, you need to insert the letter ‘u’ in the source string.
- Deletion of a letter from the source string. To convert ‘Matt’ to ‘Mat’, you need to delete one of the ‘t’s from the source string.
- Substitution of a letter in the source string. To convert ‘Iran’ to ‘Iraq’, you need to substitute ‘n’ with ‘q’
So, that’s how the Levenshtein edit distance is calculated. Now, attempt the following exercise to practice and strengthen the concept of edit distance.
Since the process of calculating edit distance is fixed, and now that you know it, you can write an algorithm to automate this computation. Professor Srinath explains how to write the algorithm in the following lecture. Download the Jupyter notebook given below to follow along.
So that’s how you compute the edit distance between two given strings. You also saw another variation of the edit distance – the Damerau–Levenshtein distance. The Damerau–Levenshtein distance, apart from allowing the three edit operations, also allows the swap (transposition) operation between two adjacent characters which costs only one edit instead of two.
This edit operation was introduced because swapping is a very common mistake. For example, while typing, people mistype ‘relief’ to ‘releif’. This has to be accounted as a single mistake (one edit distance), not two.
But how to make a spell corrector which was the main objective in the first place? You’ll learn to do that in the next section.