Preparing Data

In the last segment, you were introduced to the problem statement. You also briefly looked at the architecture of a CNN-RNN model which is going to be used in this exercise. As you might already know, data ingestion is one of the most important steps in the model building process. Hence, let’s start to prepare the Dow Jones index and the news headlines data so that it’s ready for modelling.

The problem we are trying to solve is predicting the stock index from news headlines, so we need a one-to-one mapping between the dates common to both the news data and the Dow Jones price index dataset. The difficulty is that the mapping between dates and news is one-to-many: for each date, there are multiple news headlines. You looked at how to merge the two datasets based on the dates.
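One way to turn the one-to-many mapping into one row per date is to join all of a day’s headlines into a single string. The snippet below is a minimal sketch on toy data; the column names `Date` and `News` and the headline text are assumptions made for illustration, not the actual dataset schema.

```python
import pandas as pd

# Hypothetical toy news data: several headlines can share the same date.
news = pd.DataFrame({
    "Date": ["2016-06-06", "2016-06-06", "2016-06-07"],
    "News": ["Headline A", "Headline B", "Headline C"],
})

# Collapse to one row per date by joining that date's headlines with a space.
headlines_per_date = (
    news.groupby("Date")["News"]
        .agg(" ".join)
        .reset_index()
        .rename(columns={"News": "headlines"})
)

print(headlines_per_date)
```

After this step, each date appears exactly once, which makes a date-based join with the price data straightforward.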

Now, there are two ways to model the price index based on the news: we could either predict the absolute opening price of the next day, or we could predict the rise or fall in the next day’s opening price relative to the previous day. Which one should we go for, then? Predicting the rise or fall seems more logical in this case since we’re trying to analyse the effect of the news on the price index.

Let’s now understand this approach of predicting the rise or fall of the price index with an example. Suppose the government announced a new policy on Monday that would potentially help the country’s economy substantially. Let’s assume that the opening index on Monday was 5000 points. Now, let’s say that the Dow Jones opened at 5100 points on Tuesday (assuming the effect of the positive news). We want a dataset where Monday’s news is the independent variable and +100 is the dependent variable.
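The Monday/Tuesday example can be sketched in pandas as follows. This is only an illustration under stated assumptions: the column names `Date` and `Open`, the dates, and the third day’s value are made up, and the target for each day is taken to be the next day’s opening index minus that day’s opening index.

```python
import pandas as pd

# Hypothetical opening indices; only the first two values come from the example.
prices = pd.DataFrame({
    "Date": pd.to_datetime(["2016-06-06", "2016-06-07", "2016-06-08"]),
    "Open": [5000.0, 5100.0, 5080.0],
})

# Make sure rows are in chronological order before comparing adjacent days.
prices = prices.sort_values("Date").reset_index(drop=True)

# Target for a given day = next day's open minus that day's open.
prices["price"] = prices["Open"].shift(-1) - prices["Open"]

# The last day has no "next day", so its target is NaN and is dropped.
prices = prices.dropna(subset=["price"])

print(prices[["Date", "price"]])
```

With this alignment, the row for the first day carries +100, so that day’s news is paired with the change it is assumed to have influenced.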

Also, note that we are only considering the opening price in this problem, so we don’t need the other variables such as high, low, closing price, etc. Keeping this in mind, let’s see how to prepare the data.

You have seen how to create the target variable by calculating the difference between the opening indices of the current and the previous day. But the news and the price index are still two separate datasets. Let’s see how to merge them into a single dataset using the date as the joining variable.

The resulting dataset has two columns: headlines, which contains all the headlines for a particular date, and price, which is the target variable for that date.
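A merge of this kind might look like the sketch below. The two input frames are hypothetical stand-ins for the per-date headlines and the per-date target; the `Date` column name and all values are assumptions, and an inner join is used so that only dates present in both datasets are kept.

```python
import pandas as pd

# Hypothetical per-date headlines (one row per date).
headlines_per_date = pd.DataFrame({
    "Date": ["2016-06-06", "2016-06-07", "2016-06-09"],
    "headlines": ["Headline A Headline B", "Headline C", "Headline D"],
})

# Hypothetical per-date target values.
targets = pd.DataFrame({
    "Date": ["2016-06-06", "2016-06-07", "2016-06-08"],
    "price": [100.0, -20.0, 35.0],
})

# Inner join on the date keeps only the dates common to both datasets,
# then only the two modelling columns are retained.
dataset = headlines_per_date.merge(targets, on="Date", how="inner")
dataset = dataset[["headlines", "price"]]

print(dataset)
```

Dates that appear in only one of the two sources (here 2016-06-08 and 2016-06-09) are dropped by the inner join.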

In the next segment, you’ll see how to preprocess the input and the output variables to make them ready to be fed to the CNN-RNN model.
