You’ve learnt all the basic preprocessing steps required for most text analytics tasks. In this section, you will learn how to apply these steps to build a spam detector.
Until now, you have learnt how to use the scikit-learn library to train machine learning algorithms. Here, Krishna will demonstrate how to build a spam detector
using the NLTK library, which, as you might have already realised, is your go-to tool when you’re working with text.
Now, it is not strictly necessary for you to learn how to use NLTK’s machine learning functions, but it’s always good to know more than one tool. More importantly, he’ll demonstrate how to extract features from raw text without using the scikit-learn package. So treat this demonstration as a bonus: you’ll learn how to preprocess text and build a classifier using NLTK. Before getting started, download the Jupyter notebook provided below to follow along:
The code so far is simple. You load the messages and preprocess them using the preprocess function that you’ve already seen. Note that Krishna eliminated all words that are two characters long or shorter.
Words below this length threshold are removed to eliminate tokens such as double exclamation marks or double dots (two period characters). You won’t lose much information by doing this, because almost all words of two characters or fewer are stopwords (such as ‘am’, ‘is’, etc.).
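The filtering step described above can be sketched as follows. This is a minimal illustration, not the notebook’s actual code: the function name `preprocess` comes from the text, but the regex tokeniser here is a stand-in for whatever tokenisation the notebook uses.

```python
import re

def preprocess(message):
    # Lowercase the message and split it into alphanumeric tokens
    # (a simple stand-in for NLTK's word_tokenize).
    words = re.findall(r"[a-z0-9']+", message.lower())
    # Keep only tokens longer than two characters; this drops leftover
    # punctuation runs like '!!' or '..' and short stopwords such as
    # 'am' and 'is'.
    return [w for w in words if len(w) > 2]

print(preprocess("Am I a winner?? Yes!! You won a free prize.."))
# Short words and punctuation are gone; the informative words remain.
```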
You’ve already learnt how to create a bag-of-words model using scikit-learn’s CountVectorizer function. However, Krishna will demonstrate how to build a bag-of-words model without that function, that is, manually. The first step towards achieving that goal is to create a vocabulary from the text corpus that you have. In the following video, you’re going to learn how to create a vocabulary from the dataset.
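Building a vocabulary manually amounts to collecting every distinct word across the preprocessed messages. Here is a short sketch of that idea; the function name and variable names are illustrative, not taken from the notebook.

```python
def build_vocabulary(messages):
    """Map each distinct word in the corpus to a fixed column index."""
    # messages is a list of preprocessed messages, each a list of words.
    # Sorting the set of words gives every word a stable index, which the
    # bag-of-words step can later use as a feature-vector position.
    vocab = sorted({word for msg in messages for word in msg})
    return {word: i for i, word in enumerate(vocab)}

corpus = [["free", "prize", "winner"], ["call", "free", "now"]]
vocabulary = build_vocabulary(corpus)
print(vocabulary)
# Each word maps to one index, e.g. 'call' -> 0, 'free' -> 1, ...
```

With this mapping in hand, each message can later be turned into a count vector whose length equals the vocabulary size, which is exactly what a bag-of-words model is.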
You learnt how to create a vocabulary manually using all the words in the text corpus. In the next section, you’ll look at how to create a bag-of-words model from it.