While working with any kind of data, the first step is usually to explore it and understand it better. To explore text data, you need to perform some basic preprocessing steps. In the next few segments, you will learn some basic preprocessing and exploratory steps applicable to almost all types of textual data.
Now, a text is made of characters, words, sentences and paragraphs. The most basic statistical analysis you can do is to look at the word frequency distribution, i.e. visualising the word frequencies of a given text corpus.
It turns out that there is a common pattern you see when you plot word frequencies in a fairly large corpus of text, such as a corpus of news articles, user reviews, Wikipedia articles, etc. In the following lecture, Professor Srinath will demonstrate some interesting insights from word frequency distributions. You will also learn what stopwords are and why they are less relevant than other words.
To summarise, Zipf’s law (named after the linguist George Zipf) states that the frequency of a word is inversely proportional to its rank, where rank 1 is given to the most frequent word, rank 2 to the second most frequent, and so on. This is an example of a power-law distribution.
Zipf’s law gives us the basic intuition behind stopwords – these are the words with the highest frequencies (i.e. the lowest ranks) in the text, and they are typically of limited ‘importance’.
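To see what this relationship looks like in practice, here is a minimal sketch that counts word frequencies and plots frequency against rank on a log-log scale; under Zipf’s law the points fall roughly on a straight downward-sloping line. The corpus file name here is just a placeholder – any reasonably large text corpus will show the pattern.

```python
# Minimal sketch of a Zipf's-law plot: word frequency vs. rank on a log-log scale.
# 'corpus.txt' is a placeholder for any large text corpus you have available.
from collections import Counter
import matplotlib.pyplot as plt

text = open('corpus.txt').read().lower()   # hypothetical corpus file
words = text.split()                       # naive whitespace tokenisation
freqs = Counter(words)

# Sort frequencies in descending order; rank 1 = most frequent word
sorted_freqs = sorted(freqs.values(), reverse=True)
ranks = range(1, len(sorted_freqs) + 1)

plt.loglog(ranks, sorted_freqs, marker='.')
plt.xlabel('Rank of word')
plt.ylabel('Frequency of word')
plt.title("Zipf's law: frequency is roughly inversely proportional to rank")
plt.show()
```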
Broadly, there are three kinds of words present in any text corpus:
- Highly frequent words, called stop words, such as ‘is’, ‘an’, ‘the’, etc.
- Significant words, which are typically more important to understand the text
- Rarely occurring words, which are again less important than significant words
Generally speaking, stopwords are removed from the text for two reasons:
- They provide little useful information, especially in applications such as a spam detector or a search engine. Therefore, you’re going to remove stopwords from the spam dataset.
- Since stopwords occur very frequently, removing them substantially reduces the size of the data, which makes computation on the text faster. It also leaves you with fewer features to deal with.
However, there are exceptions when these words should not be removed. In the next module, you’ll learn concepts such as POS (part-of-speech) tagging and parsing, where stopwords are preserved because they provide meaningful (grammatical) information in those applications. In general, stopwords are removed unless they prove to be very helpful in your application or analysis.
On the other hand, you’re not going to remove rarely occurring words, because they might provide useful information for spam detection. Also, removing them adds little computational benefit, since their frequency is already so low.
Now that you’ve learnt about word frequencies and stopwords, let’s see how to make the frequency chart on your own. Professor Srinath explains how to build a frequency distribution from a text corpus and how to remove stopwords in Python using the NLTK library.
You can download the Jupyter notebook from the link given below:
Note: At 2:40, the professor mistakenly says “it is a plot for 20 words” instead of “it is a plot for 15 words”.
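If you want to try the same steps yourself, the rough sketch below shows one way to do it with NLTK. The sample sentence is just a placeholder, and it assumes the required NLTK data packages (‘punkt’ and ‘stopwords’) have already been downloaded.

```python
# Rough sketch: frequency distribution before and after stopword removal,
# assuming the NLTK 'punkt' and 'stopwords' data packages are available.
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# nltk.download('punkt')      # uncomment on first run
# nltk.download('stopwords')  # uncomment on first run

# Placeholder text -- replace with your own corpus
text = "Education is the most powerful weapon which you can use to change the world."
words = word_tokenize(text.lower())

# Frequency distribution including stopwords
freq_dist = nltk.FreqDist(words)
freq_dist.plot(15)  # plot the 15 most frequent words

# Remove English stopwords and plot the distribution again
stop_words = set(stopwords.words('english'))
filtered_words = [w for w in words if w not in stop_words]
nltk.FreqDist(filtered_words).plot(15)
```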
You saw how to create word frequency plots and how to remove stopwords using NLTK’s list of stopwords. Practice this skill in the following coding exercise.
In the next section, you’ll learn how to break text into smaller ‘terms’ called tokens.