The bag-of-words representation, while effective, is a naive way of representing text: it relies only on the frequencies of the words in a document. But should a word's representation rely solely on its frequency? There is another way to represent documents in matrix format that weights each word more intelligently. It is called the TF-IDF representation, and it is the one that data scientists often prefer.
The term TF stands for term frequency, and the term IDF stands for inverse document frequency. How is this different from the bag-of-words representation? Professor Srinath explains the concept of TF-IDF below.
The TF-IDF representation, also called the TF-IDF model, takes into account the importance of each word. In the bag-of-words model, each word is assumed to be equally important, which is rarely true in practice.
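The TF-IDF weight of a term in a document is calculated from two quantities, the term frequency and the inverse document frequency:
$$//tf_{t,d}=\frac{\text{frequency of term } t \text{ in document } d}{\text{total terms in document } d}//$$
$$//idf_{t}=\log\frac{\text{total number of documents}}{\text{number of documents that contain term } t}//$$
The log in the above formula is with base 10. Now, the tf-idf score for any term in a document is just the product of these two terms:
$$//tf\text{-}idf_{t,d}=tf_{t,d}\times idf_{t}//$$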
Higher weights are assigned to terms that occur frequently in a document and are rare across the rest of the documents. Conversely, a low score is assigned to terms that are common across all documents.
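To make the weighting concrete, here is a minimal sketch in plain Python. It assumes tf is the term count divided by the document length, and idf is the base-10 log of the total number of documents divided by the number of documents containing the term. The toy corpus, the whitespace tokenisation, and the function names `tf`, `idf` and `tf_idf` are illustrative assumptions, not part of the course notebook.

```python
import math
from collections import Counter


def tf(term, doc_tokens):
    """Term frequency: count of `term` divided by the total terms in the document."""
    return Counter(doc_tokens)[term] / len(doc_tokens)


def idf(term, corpus_tokens):
    """Inverse document frequency: log10(total documents / documents containing `term`).

    Assumes `term` occurs in at least one document; otherwise the division fails.
    """
    n_containing = sum(1 for doc in corpus_tokens if term in doc)
    return math.log10(len(corpus_tokens) / n_containing)


def tf_idf(term, doc_tokens, corpus_tokens):
    """tf-idf score: the product of tf and idf."""
    return tf(term, doc_tokens) * idf(term, corpus_tokens)


# Hypothetical toy corpus, tokenised by splitting on whitespace.
corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the bird sang",
]
tokens = [doc.split() for doc in corpus]

first_doc = tokens[0]
score_the = tf_idf("the", first_doc, tokens)  # "the" is in all 3 documents
score_cat = tf_idf("cat", first_doc, tokens)  # "cat" is in 2 of 3 documents
score_mat = tf_idf("mat", first_doc, tokens)  # "mat" is in 1 of 3 documents

print(score_the, score_cat, score_mat)
```

Although "the" is the most frequent word in the first document, it appears in every document, so its idf is log(1) = 0 and its score is zero. The rarer "mat" ends up with the highest score of the three.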
Now, attempt the following quiz. Questions 1-3 are based on the following set of documents:
Document1: “Vapour, Bangalore has a really great terrace seating and an awesome view of the Bangalore skyline”
Document2: “The beer at Vapour, Bangalore was amazing. My favourites are the wheat beer and the ale beer.”
Document3: “Vapour, Bangalore has the best view in Bangalore.”
Note that tf-idf is implemented in different ways in different languages and packages. For the tf score, some implementations use only the raw frequency of the term, i.e. they don’t divide the frequency of the term by the total number of terms in the document. For the idf score, some implementations use the natural log instead of the log with base 10. Because of this, you may see different scores for the same terms in the same set of documents. But the goal remains the same: assign a weight according to the word’s importance.
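The effect of these choices can be seen in a small sketch. The function `score` and its parameters `normalize_tf` and `log` are hypothetical names introduced only for illustration, and the corpus is a made-up example; no particular library is claimed to behave exactly this way.

```python
import math


def score(term, doc_tokens, corpus_tokens, normalize_tf=True, log=math.log10):
    """tf-idf with two switchable conventions.

    normalize_tf: divide the raw count by the document length if True.
    log: the logarithm used for idf (math.log10 or math.log for natural log).
    Assumes `term` occurs in at least one document of the corpus.
    """
    count = doc_tokens.count(term)
    tf = count / len(doc_tokens) if normalize_tf else count
    n_containing = sum(1 for doc in corpus_tokens if term in doc)
    return tf * log(len(corpus_tokens) / n_containing)


# Hypothetical toy corpus, tokenised by splitting on whitespace.
tokens = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the bird sang".split(),
]
doc = tokens[1]

base10_normalized = score("dog", doc, tokens)
base10_raw_count = score("dog", doc, tokens, normalize_tf=False)
natural_log_normalized = score("dog", doc, tokens, log=math.log)

print(base10_normalized, base10_raw_count, natural_log_normalized)
```

All three numbers describe the same term in the same document, yet they differ. Switching from base 10 to the natural log multiplies every score by the same constant, ln(10), so it changes the scale of the scores but not their order.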
Now, let’s see how tf-idf is implemented in Python. Download the Jupyter notebook to follow along:
Now, attempt the following coding exercise to strengthen your skills.
In the next section, you’ll train a machine learning algorithm on the spam dataset to create a spam detector using the NLTK library.