After creating the vocabulary, the next step is to build the feature matrix (the bag-of-words model) and then train a machine learning algorithm on it. The algorithm we're going to use is the Naive Bayes classifier.
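As a rough orientation before the video, here is a minimal sketch of this pipeline using scikit-learn's CountVectorizer and MultinomialNB. Note that this is only an illustration of the technique, not Krishna's actual code (as you'll see, he builds the representation from scratch), and the sample messages and variable names are made up.

```python
# A minimal sketch: bag-of-words matrix + Naive Bayes classifier.
# The sample messages and labels below are illustrative only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split

messages = ["free entry win cash now",
            "are we meeting for lunch today",
            "urgent claim your prize",
            "see you at the office tomorrow"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = ham

# Build the bag-of-words matrix from the vocabulary
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(messages)

X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.25, random_state=42)

# Train the Naive Bayes classifier and check accuracy on held-out data
clf = MultinomialNB()
clf.fit(X_train, y_train)
print("Test accuracy:", clf.score(X_test, y_test))
```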
Let’s hear it from Krishna as he explains the remaining steps to build the classifier.
We've got an accuracy of 98% on the test set. Although this is excellent, you could improve it further by trying other models.
Note that Krishna created the bag-of-words representation from scratch, without using the CountVectorizer() function. He also used a binary representation instead of word counts: in this bag-of-words table, '1' means the word is present in the document and '0' means it is absent. You can get the same representation by setting the 'binary' parameter to 'True' in the CountVectorizer() function.
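Here is a short sketch of what that looks like with CountVectorizer; the two documents are illustrative.

```python
# binary=True records presence (1) or absence (0) of each word
# instead of its count. The documents below are made-up examples.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["win cash win prizes", "lunch at noon"]

vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())
# Note "win" appears twice in the first document but is still marked 1
print(X.toarray())
```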
You also saw that Krishna used the pickle library to save the model to disk. Saving a trained model this way lets you reuse it later, or even send it to a different computer or platform.
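A minimal sketch of saving and reloading a model with pickle is shown below; the classifier and filename here are stand-ins, not the ones from the video.

```python
# Save a trained model to disk with pickle and load it back later.
# The tiny classifier and the filename are illustrative stand-ins.
import pickle
from sklearn.naive_bayes import MultinomialNB

# Stand-in for the trained spam classifier
clf = MultinomialNB().fit([[1, 0], [0, 1]], [1, 0])

# Save the trained model to disk
with open("spam_classifier.pkl", "wb") as f:
    pickle.dump(clf, f)

# Later, possibly on a different machine, load it back and reuse it
with open("spam_classifier.pkl", "rb") as f:
    loaded_clf = pickle.load(f)

print(loaded_clf.predict([[1, 0]]))  # same predictions as the original
```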
In the next video, Krishna explains what we could have done differently to improve our detector even further.
The steps you just saw should convince you that getting excellent results requires careful attention to the nuances of the dataset you're working on. You need to understand the data inside out to take these steps, because they can't be generalised to every text classifier, or even to other spam datasets.
This brings us to the end of the second session. In the next section, you’ll summarise all the concepts of this session.