
Summary

In this session, you learnt several essential preprocessing steps that you would need to apply when working with a corpus of text. First, you learnt about word frequencies and how to plot them for a given corpus. Then you learnt about stop words, which are words that add no information in applications such as a spam detector, and how to remove English stop words using NLTK's list of stopwords.
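As a quick refresher, here is a minimal sketch of these two steps in Python using NLTK. The sample sentence is made up purely for illustration, and plotting assumes matplotlib is installed:

```python
import nltk
from nltk.corpus import stopwords
from nltk.probability import FreqDist
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)      # newer NLTK versions may also need "punkt_tab"
nltk.download("stopwords", quiet=True)

text = "This is a small sample corpus. This corpus is only a small sample."

# Tokenise, lowercase, and keep only alphabetic tokens
words = [w.lower() for w in word_tokenize(text) if w.isalpha()]

# Drop NLTK's English stop words
stop_words = set(stopwords.words("english"))
filtered = [w for w in words if w not in stop_words]

# Count word frequencies and plot the most common ones
freq = FreqDist(filtered)
print(freq.most_common(5))
freq.plot(5)  # requires matplotlib
```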

Next, you went through the process of tokenising a document. You learnt that a document can be tokenised into words, sentences or paragraphs, or even using your own custom regular expression.
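The sketch below shows all four kinds of tokenisation with NLTK; the sample document is invented for illustration:

```python
from nltk.tokenize import (word_tokenize, sent_tokenize,
                           blankline_tokenize, regexp_tokenize)

doc = "Send the report by 5 pm. Email it to me!\n\nAlso tag it #urgent."

print(word_tokenize(doc))       # word-level tokens
print(sent_tokenize(doc))       # sentence-level tokens
print(blankline_tokenize(doc))  # paragraph-level tokens (split on blank lines)

# Custom tokenisation with a regular expression, e.g. extracting hashtags only
print(regexp_tokenize(doc, pattern=r"#\w+"))
```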

Then you learnt about the importance of removing redundant words from the corpus using two techniques: stemming and lemmatization. You learnt that stemming converts a word to its root form by chopping off its suffix, while lemmatization reduces a word to its base form, called the lemma, by looking it up in the WordNet database. You also learnt the advantages and disadvantages of each technique.
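To see the difference between the two techniques, here is a small sketch using NLTK's PorterStemmer and WordNetLemmatizer; the word list is chosen only to highlight the contrast:

```python
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet", quiet=True)

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["studies", "flying", "driving"]:
    # Stemming just chops the suffix and can produce non-words (e.g. "studi"),
    # while lemmatization returns a valid base form from WordNet
    print(word, "->", stemmer.stem(word), "|", lemmatizer.lemmatize(word, pos="v"))
```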

Then you created a representation of the text corpus that could be used to train a classifier: the bag-of-words model. Along similar lines, you also learnt about the more advanced tf-idf model, which is a more robust representation of the text than the bag-of-words model.
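Both representations are available in scikit-learn, which you used earlier in the course. The sketch below builds each one on a tiny made-up corpus (it assumes scikit-learn 1.0 or later for get_feature_names_out):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["free prize waiting for you",
        "meeting moved to friday",
        "claim your free prize now"]

# Bag-of-words: each document becomes a vector of raw term counts
bow = CountVectorizer()
print(bow.fit_transform(docs).toarray())
print(bow.get_feature_names_out())

# tf-idf: counts are weighted down for terms that appear across many documents
tfidf = TfidfVectorizer()
print(tfidf.fit_transform(docs).toarray().round(2))
```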

Finally, you went through the process of building the spam detector using all the preprocessing steps that you had just learnt. You used a different library from scikit-learn to build the spam classifier: the NLTK library.
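As a reminder of how that fits together, here is a minimal sketch of an NLTK-based classifier. The four labelled messages are hypothetical stand-ins for the real spam corpus used in the session, and the simple word-presence features stand in for the full preprocessing pipeline:

```python
import nltk

# Toy labelled data standing in for the real spam corpus from the session
train = [
    ("win a free prize now", "spam"),
    ("claim your free reward", "spam"),
    ("are we still on for lunch", "ham"),
    ("see you at the meeting", "ham"),
]

def features(text):
    # Bag-of-words style features: which words are present in the message
    return {word: True for word in text.split()}

train_set = [(features(text), label) for text, label in train]
classifier = nltk.NaiveBayesClassifier.train(train_set)

print(classifier.classify(features("free prize inside")))  # likely 'spam'
```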


This brings us to the end of session two. In the next section, you’ll test your newfound text preprocessing skills by attempting the graded questions.
