In this segment, you will learn how decision trees and random forests stack up against logistic regression. You will revisit the telecom churn prediction example from the earlier module, using the same data set and problem statement, and build tree-based models to understand where they improve on the logistic regression model. Before you begin, go through the data set and the previous Python notebook to recall the initial steps of data cleaning and preparation.
You will use 21 variables related to customer behaviour (such as monthly bill and internet usage) to predict whether a particular customer will switch to another telecom provider, i.e., whether or not they will churn.
Problem Statement
You have a telecom firm that has collected data on all of its customers. The main types of attributes are as follows:
- Demographics (age, gender, etc.).
- Services availed (internet packs purchased, special offers taken, etc.).
- Expenses (amount of recharge done per month, etc.).
Based on all of this past information, you want to build a model that predicts whether a particular customer will churn, i.e., switch to a different service provider. The variable of interest, i.e., the target variable, is ‘Churn’, which tells us whether or not a particular customer has churned. It is a binary variable: 1 means that the customer has churned, and 0 means that the customer has not.
You can download the data sets from the links given below:
Also, here’s the data dictionary:
You can also download the code file given below and follow along.
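Once you have the data set, a minimal first step is to load it and inspect the target variable. In the sketch below, the file name is an assumption (use the file you downloaded), and the Yes/No check is an assumption about how ‘Churn’ may be stored in the raw data.

```python
import pandas as pd

# Load the churn data set (the file name is an assumption;
# use the data set you downloaded above).
churn = pd.read_csv("telecom_churn_data.csv")

# If 'Churn' is stored as Yes/No text, convert it to the binary 1/0
# encoding described above (1 = churned, 0 = not churned).
if churn["Churn"].dtype == object:
    churn["Churn"] = churn["Churn"].map({"Yes": 1, "No": 0})

# Check the class balance: churn data sets are typically imbalanced,
# with far fewer churners than non-churners, which matters when
# judging model accuracy later.
print(churn["Churn"].value_counts(normalize=True))
```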
In the video below, Rahim will explain how decision trees and random forests stack up against logistic regression. Watch it to learn how to build decision trees on the same data set and understand the trade-offs, benefits and limitations of decision trees compared with logistic regression.
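If you would like to code along, here is a minimal sketch of fitting a decision tree classifier with scikit-learn. It is illustrative rather than the exact code from the video or the downloadable code file: it assumes the churn data frame from the sketch above, with all categorical predictors already encoded numerically (as in the data-preparation steps from the earlier module), and the hyperparameter values are placeholders.

```python
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

# X holds the 21 predictor variables, y the binary 'Churn' target.
# (Assumes the categorical predictors have already been encoded as numbers.)
X = churn.drop("Churn", axis=1)
y = churn["Churn"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# No scaling is needed: a tree splits on raw feature thresholds,
# so the scale of each variable does not affect the splits.
tree = DecisionTreeClassifier(max_depth=5, random_state=42)
tree.fit(X_train, y_train)

print(classification_report(y_test, tree.predict(X_test)))
```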
You saw in the video above that, without putting much effort into scaling, handling multicollinearity, checking p-values or performing feature selection, you obtained impressive, better results using a decision tree as compared with the logistic regression model. However, remember that decision trees are high-variance models: they can change quite rapidly with small changes in the data. With this in mind, watch the next video to learn how random forests stack up against both the logistic regression and decision tree models on this prediction problem.
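Before moving on, the high-variance behaviour mentioned above is easy to see for yourself. The sketch below, which continues from the previous one, grows two fully grown trees on slightly different 90% samples of the same training data and measures how often their test-set predictions disagree; the sampling fraction and random seeds are arbitrary, illustrative choices.

```python
import numpy as np

# Grow two unpruned trees on two slightly different samples of the
# same training data and compare their test-set predictions.
preds = []
for seed in (0, 1):
    sample = X_train.sample(frac=0.9, random_state=seed)
    t = DecisionTreeClassifier(random_state=42)
    t.fit(sample, y_train.loc[sample.index])
    preds.append(t.predict(X_test))

# Fraction of test points where the two trees disagree: a direct
# measure of how sensitive a single tree is to the training sample.
disagreement = np.mean(preds[0] != preds[1])
print(f"Trees disagree on {disagreement:.2%} of test points")
```

Even a modest disagreement rate here shows why averaging many such trees, as a random forest does, stabilises the predictions.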
Random forests clearly delivered better results than both the logistic regression and decision tree models, with much less effort. A random forest exploits the predictive power of decision trees and learns much more than a single decision tree could alone. However, it offers little visibility into the key features and the direction of their effects on the prediction, which is something a logistic regression model does well. If interpretation is not of key significance, random forests definitely do a great job.
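For reference, here is a minimal sketch of fitting a random forest on the same train–test split, continuing from the earlier sketches. It is illustrative rather than the exact code from the video, and the hyperparameter values are assumptions. The feature importances printed at the end give a rough ranking of the key features, although, as noted above, they do not show the direction of each feature's effect the way logistic regression coefficients do.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# A random forest averages many decorrelated trees, trading the
# interpretability of a single tree for lower variance and,
# typically, better accuracy.
rf = RandomForestClassifier(
    n_estimators=500, max_depth=8, random_state=42, n_jobs=-1
)
rf.fit(X_train, y_train)
print(classification_report(y_test, rf.predict(X_test)))

# Feature importances give a rough ranking of the key features, but
# unlike logistic regression coefficients they do not reveal whether
# a feature pushes the prediction towards churn or away from it.
importances = pd.Series(rf.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(10))
```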
Rahim has also built a manual ensemble model on the housing price prediction example that you solved in the linear regression module. If you are curious, you can go through it in this optional segment.