A spell corrector is a widely used application that you see almost everywhere on the internet.
If you have the autocorrect feature enabled on your phone, incorrect spellings get replaced by the correct ones. Another example: when you mistype a word while searching on a search engine such as Google, it suggests the correct word.
Spell correction is an important part of lexical processing. In many applications, spell correction forms an initial preprocessing layer. For example, if you are making a chatbot to book flights, and you get the user request ‘Book a flight from Mumbai to Bangalor’, you want to gracefully handle that spelling error and return relevant results.
Now, people have made various attempts to build spell correctors using different techniques. Some are basic, using only lexical processing, while others are state-of-the-art performers that use deep learning architectures.
Here, you’re going to learn Norvig’s spell corrector, which gives you really good results given its simplicity. Watch the following video where Krishna explains how to build a Norvig spell corrector.
Now, let’s look at what each function does. The function words() is pretty straightforward: it tokenises any document passed to it. You have already learnt how to tokenise words using the NLTK library; you could also use regular expressions to do the same. The ‘Counter’ class, which you just saw in the Jupyter notebook, creates a frequency distribution of the words present in the seed document. Each word is stored along with its count in a Python dictionary format. You could also use NLTK’s FreqDist() function to achieve the same result. There is more than one way to do most things in Python, and it’s always nice to know more than one way to perform the same task.
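Since the notebook’s exact code isn’t reproduced here, the following is a minimal sketch of this step, modelled on Norvig’s original implementation. The filename ‘big.txt’ is the seed document discussed just below; the variable name all_words is illustrative, not necessarily the one used in the video.

```python
import re
from collections import Counter

def words(document):
    """Tokenise a document into lowercase words using a regular expression."""
    return re.findall(r'\w+', document.lower())

# Frequency distribution of all words in the seed document.
# `all_words` is an illustrative name; it maps each word to its count.
all_words = Counter(words(open('big.txt').read()))

print(all_words.most_common(5))  # the five most frequent words with their counts
```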
Now, the seed document ‘big.txt’ is nothing but a book: ‘The Adventures of Sherlock Holmes’, available in text format on Project Gutenberg’s website. A seed document acts like a lookup dictionary for a spell corrector, since it contains the correct spellings of the words.
Now, you might ask: why not just use a dictionary instead of a book? You’ll get to know why a little later. In the next part, Krishna demonstrates the use of edit distance in the spell corrector.
You just saw two functions: edits_one() and edits_two(). The edits_one() function creates all the possible words that are one edit distance away from the input word. Most of the strings this function creates are garbage, i.e. they are not valid English words. For example, if you pass the word ‘laern’ (a misspelling of ‘learn’) to edits_one(), the resulting list will contain ‘lgern’, since it is one edit away from ‘laern’, but it’s not an English word. Only a subset of the strings will be actual English words.
Similarly, the edits_two() function creates a list of all the possible words that are two edits away from the input word. Most of these words will also be garbage.
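As a concrete reference, here is a sketch of these two functions along the lines of Norvig’s original edits1() and edits2(), renamed to match the names used in this session. An edit is a deletion, a transposition of two adjacent letters, a replacement, or an insertion:

```python
def edits_one(word):
    """All strings that are one edit away from `word`."""
    alphabets = 'abcdefghijklmnopqrstuvwxyz'
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]                      # remove one letter
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]  # swap adjacent letters
    replaces = [L + c + R[1:] for L, R in splits if R for c in alphabets]    # change one letter
    inserts = [L + c + R for L, R in splits for c in alphabets]              # add one letter
    return set(deletes + transposes + replaces + inserts)

def edits_two(word):
    """All strings that are two edits away from `word` (can be a large set)."""
    return set(e2 for e1 in edits_one(word) for e2 in edits_one(e1))

print(len(edits_one('laern')))  # a few hundred candidate strings, mostly not valid words
```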
In the video below, you will see how to filter out the valid English words from these lists by creating a separate function called known().
The known() function filters out the valid English words from a list of candidate words, using the frequency distribution created from the seed document as a dictionary. If the words created using edits_one() and edits_two() are not in this dictionary, they’re discarded.
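A sketch of known(), assuming the all_words frequency distribution and the edits_one() function from the earlier snippets:

```python
def known(word_list):
    """Keep only the words that occur in the seed document's frequency distribution."""
    return set(w for w in word_list if w in all_words)

# Only genuine English words survive the filter; for 'laern', this typically
# includes 'learn' (a transposition of 'a' and 'e').
print(known(edits_one('laern')))
```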
Now, the function possible_corrections() returns a list of all the potential words that could be the correct spelling. For example, say the user has typed the word ‘wut’, which is wrong. There are multiple words that could be the intended spelling, such as ‘cut’, ‘but’, ‘gut’, etc. This function will return all such words for the given incorrect word ‘wut’. But how exactly does it find all these suggestions? It works as follows (see the sketch after this list):
- It first checks if the word is correct, i.e. whether the word typed by the user is present in the dictionary. If it is, the function returns the word itself with no alternative suggestions, since it is already a correct dictionary word.
- If the user types a word which is not a dictionary word, the function creates a list of all the known words that are one edit distance away. Only if there are no valid words in the list created by edits_one() does the function fetch a list of all known words that are two edits away from the input word.
- If there are no known words two edits away either, the function returns the original input word. This means the spell corrector could not find any alternatives, so it simply returns the word unchanged.
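Putting the pieces together, here is a sketch of possible_corrections() that implements the three steps above. It relies on Python’s `or` returning the first non-empty result, so the fewest-edits candidates always win:

```python
def possible_corrections(word):
    """Candidate corrections for `word`, preferring the fewest edits."""
    return (known([word]) or           # 1. the word itself, if it is already correct
            known(edits_one(word)) or  # 2. otherwise, known words one edit away
            known(edits_two(word)) or  # 3. otherwise, known words two edits away
            [word])                    # 4. otherwise, fall back to the input word

print(possible_corrections('wut'))  # e.g. {'but', 'cut', 'gut', 'nut', 'out', ...}
```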
Finally, there is the prob() function, which returns the probability of an input word. This is exactly why you need a seed document instead of a dictionary: a dictionary only contains a list of all correct English words, whereas a seed document not only contains all the correct words but can also be used to create a frequency distribution of those words. This frequency is taken into consideration when there is more than one possible correction for a given misspelled word. Say the user has input the word ‘wut’; the correct alternative could be one of ‘cut’, ‘but’, ‘gut’, etc. The possible_corrections() function returns all these words, but the prob() function computes a probability for each suggestion, and the one with the highest probability is returned. For example, if the word ‘but’ occurs 2000 times out of a total of one million words in the seed document, its probability would be 2000/1000000, i.e. 0.002.
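A sketch of prob() and of picking the best suggestion, assuming all_words and possible_corrections() from the earlier snippets. The name spell_check is illustrative, not necessarily the one used in the video:

```python
TOTAL_WORDS = sum(all_words.values())  # total word count of the seed document

def prob(word):
    """Relative frequency of `word` in the seed document (0 for unseen words)."""
    return all_words[word] / TOTAL_WORDS

def spell_check(word):
    # `spell_check` is an illustrative name; it simply picks the candidate
    # with the highest seed-document probability.
    return max(possible_corrections(word), key=prob)

print(prob('the'))         # a very frequent word gets a high probability
print(spell_check('wut'))  # the single most probable suggestion, e.g. 'but'
```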