Now that we’re ready with the raw dataset by combining the news and price index, the next step is to preprocess these variables. We’ll preprocess the following variables:
- Price
- News
Let’s see how it’s done in the following lecture.
Both the variables have been preprocessed. For text, it is generally preferred to expand contractions and replace acronyms, as this avoids duplicate tokens. Suppose you have the terms ‘didn’t’ and ‘did not’: these two mean the same thing, but if you don’t expand the contraction, you’ll end up with separate tokens for each. Normalising the target variable (the price) is also good practice.
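The two steps above can be sketched in a few lines. This is a minimal illustration, not the notebook’s actual code: the contraction map is a tiny hypothetical subset (a real project would use a fuller dictionary or a library), and min–max scaling is just one common way to normalise the target.

```python
import re

# A small illustrative contraction map (hypothetical subset).
CONTRACTIONS = {
    "didn't": "did not",
    "don't": "do not",
    "can't": "cannot",
    "it's": "it is",
}

def expand_contractions(text):
    """Replace each contraction with its expanded form."""
    for short, full in CONTRACTIONS.items():
        text = re.sub(re.escape(short), full, text, flags=re.IGNORECASE)
    return text

def normalise(values):
    """Min-max normalise numeric target values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

print(expand_contractions("He didn't sell"))  # -> He did not sell
print(normalise([10.0, 15.0, 20.0]))          # -> [0.0, 0.5, 1.0]
```

After this step, ‘didn’t’ and ‘did not’ produce identical tokens, so the vocabulary carries one entry instead of two.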
Next, let’s load the word embedding. You have already been introduced to word embeddings and their usage earlier in this module. For this problem, we’re going to use the GloVe embeddings. You can learn more about GloVe (Global Vectors) here. The code required to download the embeddings is present in the Jupyter notebook that was provided at the start of the session.
Let’s revisit how a word embedding represents a word.
Having gotten a brief idea of word representation with the help of embeddings, let’s load the GloVe embedding used in this project. Here, we are using a pre-trained GloVe word embedding. We may not need stemming/lemmatisation as a text preprocessing step because, for most common words, vector representations of a word’s variations are already present in the embedding space. For example, vectors for variations of the word ‘learn’, such as ‘learns’ and ‘learning’, are already present. So, it makes more sense to use words in their actual form wherever we can.
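Pre-trained GloVe files are plain text: each line is a word followed by its vector components, separated by spaces. A minimal parser, tried here on a tiny two-line sample written in the same format (so it runs without downloading the full file; the sample words and numbers are made up):

```python
import numpy as np

def load_glove(path):
    """Parse a GloVe text file into a word -> vector dictionary."""
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            embeddings[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return embeddings

# Tiny sample in the GloVe format, for illustration only.
with open("glove_sample.txt", "w", encoding="utf-8") as f:
    f.write("learn 0.1 0.2 0.3\nlearns 0.4 0.5 0.6\n")

glove = load_glove("glove_sample.txt")
print(glove["learn"])  # a 3-dimensional vector for the word 'learn'
```

The real pre-trained files work the same way, just with many more lines and higher-dimensional vectors.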
You learnt how to load the GloVe embedding. You also discarded the rarely occurring words using a threshold value of 10. This threshold is a hyperparameter and needs to be decided carefully. You can play around with its value and see how the model performs. Now, we have a dictionary, vocab_to_int{}, which contains four types of entries:
- Words which are present in the document and also present in the pre-trained GloVe model.
- Words which are not present in the pre-trained GloVe model but have a frequency of more than 10 in the document.
- The token <UNK> for words which are not present in the pre-trained GloVe model and also have a frequency of less than 10 in the document.
- The token <PAD> for padding the sentences.
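The construction of vocab_to_int{} can be sketched as follows. This is an illustrative version, not the notebook’s exact code: the corpus, the pretend GloVe vocabulary, and the word ‘fintok’ are all made up for the demo.

```python
from collections import Counter

def build_vocab(headlines, glove_vocab, threshold=10):
    """Give an id to every word that is either in GloVe or frequent
    enough; everything else will later fall back to the <UNK> id."""
    counts = Counter(w for line in headlines for w in line.split())
    vocab_to_int = {"<PAD>": 0, "<UNK>": 1}
    for word, freq in counts.items():
        if word in glove_vocab or freq > threshold:
            vocab_to_int[word] = len(vocab_to_int)
    return vocab_to_int

# Toy corpus: 'rises' is rare and absent from this pretend GloVe set,
# so it gets no id of its own; 'fintok' is unknown to GloVe but frequent.
pretend_glove = {"stock", "market"}
heads = ["stock market rises"] + ["fintok fintok"] * 6
v2i = build_vocab(heads, pretend_glove, threshold=10)
print(sorted(v2i))  # 'rises' is missing: it will map to <UNK>
```

Raising the threshold shrinks the vocabulary further; lowering it keeps more rare words at the cost of more randomly initialised vectors.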
As we know, some words were discarded because they appeared fewer times than the threshold and were not present in the GloVe embedding. Discarding a word doesn’t mean removing it from the headlines. It means we don’t consider that word a part of the vocabulary: we represent it with the ‘<UNK>’ token rather than giving it a separate token. By doing this, we reduce the vocabulary size significantly. We do this because these ‘rare’ words may not carry significant information, and it is also difficult to learn a good vector representation for a word that appears only a few times in the document. Later in the segment, you will see that we train the word embeddings for all the words in vocab_to_int{}, including the pre-trained ones.
The ‘<PAD>’ token will be used to pad the news headlines to a fixed length, which is the standard way to represent input text.
Word vector initialisation for training:
Now, while initialising a word there can be three scenarios:
- A word that was below the threshold value and not present in GloVe: Such a word will be represented by the <UNK> token and the <UNK> token will be initialised with a random word vector. Since all such words have the same representation <UNK>, they will have the same word vector.
- A word that was above the threshold value and not in GloVe embedding: Such a word will be initialised to a random word vector. Note that, although the word vectors are random, they will also be unique.
- A word that was above the threshold and present in GloVe: Such words will be represented by the GloVe embedding vectors.
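The three scenarios above come together when the embedding matrix is built: one row per vocabulary entry, filled from GloVe when the word is known and from a random initialiser otherwise. A sketch with made-up dimensions and words (the real project uses the full GloVe dimensionality):

```python
import numpy as np

def build_embedding_matrix(vocab_to_int, glove, dim, seed=0):
    """One row per vocabulary entry: the GloVe vector when available,
    otherwise a unique random vector. <UNK> gets a single random row,
    which all below-threshold words share via their common id."""
    rng = np.random.default_rng(seed)
    matrix = np.zeros((len(vocab_to_int), dim), dtype=np.float32)
    for word, idx in vocab_to_int.items():
        if word in glove:
            matrix[idx] = glove[word]               # scenario 3: GloVe
        else:
            matrix[idx] = rng.uniform(-1, 1, dim)   # scenarios 1 and 2
    return matrix

# Toy vocabulary: 'stock' is in the pretend GloVe, 'fintok' is not.
pretend_glove = {"stock": np.ones(4, dtype=np.float32)}
v2i = {"<PAD>": 0, "<UNK>": 1, "stock": 2, "fintok": 3}
emb = build_embedding_matrix(v2i, pretend_glove, dim=4)
```

Because each out-of-GloVe word draws its own sample, the random vectors are unique, matching the second scenario above.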
Now, while training, the word vectors can either be kept fixed or be trained further. We’ll train the vectors as the neural network training takes place, which you’ll see in the later part of the session.
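What “training the vectors” means is simply that the embedding matrix is treated as a parameter: each optimisation step nudges the rows of the words that appear in the batch. A minimal NumPy sketch with all numbers hypothetical (the gradient here is a stand-in for what backpropagation would produce):

```python
import numpy as np

# Hypothetical setup: a 3-word vocabulary with 4-dimensional vectors.
emb = np.zeros((3, 4), dtype=np.float32)
grad = np.ones((3, 4), dtype=np.float32)  # pretend gradient from the loss
lr = 0.1
batch_ids = [1, 2]                        # word indices seen in this batch

before = emb.copy()
emb[batch_ids] -= lr * grad[batch_ids]    # gradient-descent update
# Row 0 is untouched; rows 1 and 2 have moved.
```

In frameworks this is just a flag, e.g. the `trainable` argument of a Keras `Embedding` layer, or `freeze` in PyTorch’s `nn.Embedding.from_pretrained`.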
In the next segment, you’ll learn how to change words to numeric form and then bring each sequence to a fixed length.