Let’s hear Rahim summarise the last two sessions for you.
So, to summarise, these were the steps you performed throughout model building and model evaluation (a short code sketch of these steps follows the list):
- Data cleaning and preparation
  - Combining three dataframes
  - Handling categorical variables
    - Mapping categorical variables to integers
    - Dummy variable creation
  - Handling missing values
- Test-train split and scaling
- Model building
  - Feature elimination based on correlations
  - Feature selection using RFE (coarse tuning)
  - Manual feature elimination (using p-values and VIFs)
- Model evaluation
  - Accuracy
  - Sensitivity and specificity
  - Optimal cut-off using the ROC curve
- Predictions on the test set
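The sketch below walks through a few of these steps end to end. It uses made-up data and placeholder column names (not the actual dataframes from the sessions), so treat it as a minimal illustration of the workflow rather than the course notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Toy stand-in for the cleaned, combined dataframe
df = pd.DataFrame({
    'tenure':    [1, 34, 2, 45, 8, 22, 10, 60],
    'charges':   [70, 56, 90, 42, 80, 55, 75, 40],
    'paperless': ['Yes', 'No', 'Yes', 'No', 'Yes', 'Yes', 'No', 'No'],
    'contract':  ['Monthly', 'Yearly', 'Monthly', 'Yearly',
                  'Monthly', 'Monthly', 'Monthly', 'Yearly'],
    'churn':     [1, 0, 1, 0, 1, 0, 1, 0],
})

# Map a binary categorical variable to integers
df['paperless'] = df['paperless'].map({'Yes': 1, 'No': 0})

# Dummy variable creation (drop_first avoids a redundant column)
df = pd.get_dummies(df, columns=['contract'], drop_first=True)

# Test-train split
X = df.drop('churn', axis=1)
y = df['churn']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=42, stratify=y)
X_train, X_test = X_train.copy(), X_test.copy()

# Scaling: fit on the training set only, then transform both splits
scaler = StandardScaler()
num_cols = ['tenure', 'charges']
X_train[num_cols] = scaler.fit_transform(X_train[num_cols])
X_test[num_cols] = scaler.transform(X_test[num_cols])

# Coarse tuning with RFE: keep only the top-ranked features
rfe = RFE(estimator=LogisticRegression(), n_features_to_select=2)
rfe.fit(X_train, y_train)
print('RFE-selected features:', list(X_train.columns[rfe.support_]))
```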
So first, classes were assigned to all the customers in the test data set using a probability cutoff of 0.5. The resulting model was quite accurate (accuracy = ~80%), but it had a very low sensitivity (~53%). A different cutoff, 0.3, was therefore tried, which gave a model with slightly lower accuracy (~77%) but a much better sensitivity (~78%). The lesson is that you should not blindly use 0.5 as the probability cutoff every time you build a model. Business understanding must be applied; here, that means experimenting with the cutoff until you get the most useful model.
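How a cutoff turns predicted probabilities into classes, and how the metrics shift when the cutoff moves, can be sketched as follows. The arrays are placeholders; in the actual exercise the probabilities come from the fitted logistic regression model (e.g. via `predict_proba`).

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Placeholder actual labels and predicted probabilities
y_actual = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_prob   = np.array([0.62, 0.20, 0.35, 0.71, 0.45, 0.10, 0.38, 0.55])

for cutoff in (0.5, 0.3):
    y_pred = (y_prob >= cutoff).astype(int)   # class 1 if probability >= cutoff
    print(f'cutoff={cutoff}: '
          f'accuracy={accuracy_score(y_actual, y_pred):.2f}, '
          f'sensitivity={recall_score(y_actual, y_pred):.2f}')
```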
Also, recall that the sensitivity of a model is the proportion of yeses (or positives) it correctly predicts as yeses, while its specificity is the proportion of nos (or negatives) it correctly predicts as nos. For any given model, if you change the cutoff so that sensitivity increases, specificity goes down.
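Both definitions map directly onto the confusion matrix, as in this small sketch with made-up predictions:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up actual and predicted classes
y_actual = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred   = np.array([1, 0, 0, 1, 1, 0, 1, 0])

tn, fp, fn, tp = confusion_matrix(y_actual, y_pred).ravel()
print('sensitivity =', tp / (tp + fn))   # yeses correctly predicted as yes
print('specificity =', tn / (tn + fp))   # nos correctly predicted as no
```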
You cannot get high values of both from a single model simply by moving the cutoff, so you have to choose which one you want to be higher. The safest option is to pick the cutoff at which accuracy, sensitivity and specificity are roughly equal, but the right choice depends entirely on the business context: sometimes you might want a higher sensitivity, sometimes a higher specificity.
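One common way to find such a balanced cutoff is to sweep over candidate cutoffs, compute the three metrics at each, and pick the point where sensitivity and specificity are closest. A rough sketch with placeholder data:

```python
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix

# Placeholder labels and probabilities; in practice these come from the model
y_actual = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])
y_prob   = np.array([0.80, 0.30, 0.40, 0.70, 0.50, 0.20, 0.60, 0.35, 0.55, 0.15])

rows = []
for cutoff in np.arange(0.1, 0.91, 0.1):
    pred = (y_prob >= cutoff).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_actual, pred, labels=[0, 1]).ravel()
    rows.append({'cutoff': round(float(cutoff), 1),
                 'accuracy': (tp + tn) / (tp + tn + fp + fn),
                 'sensitivity': tp / (tp + fn),
                 'specificity': tn / (tn + fp)})

metrics = pd.DataFrame(rows)
# Balanced choice: the cutoff where sensitivity and specificity are closest
balanced = metrics.loc[(metrics['sensitivity'] - metrics['specificity']).abs().idxmin()]
print(metrics.round(2))
print('Balanced cutoff:', balanced['cutoff'])
```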
You also saw another view of the same metrics: precision and recall, which are closely related to sensitivity and specificity. Precision is the proportion of predicted yeses that were actually yeses. Recall, on the other hand, is the same as sensitivity, i.e. the proportion of actual yeses that you correctly predicted.
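Scikit-learn exposes both metrics directly; a tiny sketch with made-up predictions:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

y_actual = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred   = np.array([1, 0, 0, 1, 1, 0, 1, 0])

# Precision: of the predicted yeses, how many were actually yes -> TP / (TP + FP)
# Recall (= sensitivity): of the actual yeses, how many were predicted yes -> TP / (TP + FN)
print('precision =', precision_score(y_actual, y_pred))
print('recall    =', recall_score(y_actual, y_pred))
```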