You chose a cutoff of 0.5 to classify the customers into ‘Churn’ and ‘Non-Churn’. Since you’re classifying the customers into two classes, there will inevitably be some errors. The two types of errors here are:
- ‘Churn’ customers being (incorrectly) classified as ‘Non-Churn’.
- ‘Non-Churn’ customers being (incorrectly) classified as ‘Churn’.
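As a minimal sketch of how this cutoff-based classification might look in Python (the DataFrame and column names below are assumptions for illustration, not the exact code from this module):

```python
import pandas as pd

# Hypothetical actual labels and predicted churn probabilities;
# in practice these would come from your logistic regression model
df = pd.DataFrame({
    'Churn': [0, 1, 0, 1, 0],                 # actual labels
    'Churn_Prob': [0.2, 0.4, 0.7, 0.9, 0.1]   # predicted probabilities
})

# Label a customer as 'Churn' (1) if the predicted probability exceeds the 0.5 cutoff
df['Predicted'] = (df['Churn_Prob'] > 0.5).astype(int)
print(df)
```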
To capture these errors, and to evaluate how good the model is, you’ll use something known as the ‘confusion matrix’. A typical confusion matrix looks like the following.
This table shows a comparison of the predicted and actual labels. The actual labels are along the vertical axis, while the predicted labels are along the horizontal axis. Thus, the cell in the second row, first column (263) is the number of customers who actually churned but whom the model predicted as ‘non-churn’.
Similarly, the cell in the second row, second column (298) is the number of customers who actually churned and were also predicted as ‘churn’.
Note that this is an example table, not the output you’ll get in Python for the model you’ve built so far; it is used only to illustrate the concept.
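As a hedged sketch of how such a table could be generated in practice, you could pass the actual and predicted labels to scikit-learn’s `confusion_matrix` (the variable names and toy labels below are assumptions):

```python
from sklearn import metrics

# Hypothetical actual labels and labels predicted at the 0.5 cutoff
y_actual    = [0, 0, 1, 1, 0, 1, 0, 1]
y_predicted = [0, 1, 1, 0, 0, 1, 0, 1]

# Rows are actual classes, columns are predicted classes:
# [[actual non-churn, predicted non-churn | actual non-churn, predicted churn],
#  [actual churn, predicted non-churn     | actual churn, predicted churn    ]]
confusion = metrics.confusion_matrix(y_actual, y_predicted)
print(confusion)
```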
Now, the simplest evaluation metric for classification models is accuracy – the percentage of correctly predicted labels. So what would the correctly predicted labels be? They are:
- ‘Churn’ customers being correctly identified as ‘churn’.
- ‘Non-churn’ customers being correctly identified as ‘non-churn’.
As you can see from the table above, the correctly predicted labels lie along the main diagonal of the matrix: the first row, first column (‘non-churn’ customers correctly predicted as ‘non-churn’) and the second row, second column (‘churn’ customers correctly predicted as ‘churn’).
Now, accuracy is defined as:

$$\text{Accuracy} = \frac{\text{Number of correctly predicted labels}}{\text{Total number of labels}}$$
Hence, for this table, the accuracy is the sum of the diagonal cells (the correctly predicted labels) divided by the total number of customers across all four cells.
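In code, this amounts to summing the diagonal of the confusion matrix and dividing by the total; a minimal sketch with a hypothetical 2x2 matrix:

```python
import numpy as np

# Hypothetical confusion matrix: rows = actual, columns = predicted
confusion = np.array([[50, 10],
                      [ 8, 32]])

# Accuracy = correctly predicted labels (the diagonal) / total labels
accuracy = np.trace(confusion) / confusion.sum()
print(accuracy)  # (50 + 32) / 100 = 0.82
```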
Now that you know about the confusion matrix and accuracy, let’s see how good the model you’ve built so far is, based on its accuracy. But first, answer a couple of questions.
So, using the confusion matrix, you got an accuracy of about 80.8%, which seems to be a good number to begin with. The steps you need to follow to calculate accuracy are:
- Create the confusion matrix.
- Calculate the accuracy by applying the ‘accuracy_score’ function to the actual and predicted labels (the function takes the two sets of labels, not the matrix itself, as input), as shown in the sketch below.
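Those two steps might look like the following hedged sketch (the variable names and toy labels are assumptions, not the exact code from this module):

```python
from sklearn import metrics

# Hypothetical actual and predicted labels
y_actual    = [0, 0, 1, 1, 0, 1, 0, 1]
y_predicted = [0, 1, 1, 0, 0, 1, 0, 1]

# Step 1: create the confusion matrix
print(metrics.confusion_matrix(y_actual, y_predicted))

# Step 2: calculate the accuracy from the actual and predicted labels
print(metrics.accuracy_score(y_actual, y_predicted))
```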
Coming Up
So far, you have only selected features using RFE. You still need to eliminate features manually using p-values and VIFs; you’ll do that in the next section.