Before model building, you first need to perform the train–test split of the data, then scale the numeric features and build some basic predictive models. Scaling is an important step because it brings all the numeric variables onto a comparable scale, which helps the optimisation converge and makes the model coefficients easier to compare and interpret. So, let’s watch the next video and see how this works.
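To make this concrete, here is a minimal sketch of the split-then-scale workflow in scikit-learn. The synthetic data set generated below is only a stand-in for the course data; with the actual data, you would split the real feature matrix and target in the same way.

```python
# A minimal sketch of the split-then-scale workflow on a synthetic data set;
# in the course, X and y would come from the actual data set instead.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

# Hold out a test set before any preprocessing, so no information leaks from test to train
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Fit the scaler on the training data only, then apply the same transformation to the test data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```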
So, as you saw in the video, after performing the train–test split and feature scaling, we built a basic logistic regression model with all the features, without doing any feature selection or changing the default model settings. This gives you a basic model to get started with. But how good is this model? Does it generalise to unseen data as well? Let’s go ahead and find out the answers in the next video.
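A sketch of what this baseline looks like in code, continuing from the split-and-scale snippet above; the model uses all the features and the default settings, with no tuning.

```python
# Baseline model: logistic regression with all features and default settings
# (X_train_scaled and y_train come from the split/scale step above).
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression()        # default hyperparameters, no tuning
logreg.fit(X_train_scaled, y_train)  # all features, no feature selection
```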
So, you obtained a baseline model here and evaluated it on both the train and test sets using different metrics, such as accuracy and the confusion matrix. In the video, you saw that even without any feature engineering or hyperparameter tuning, the model’s results did not turn out to be that bad.
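Continuing the same sketch, the baseline can be evaluated on both the train and test sets using the accuracy score and the confusion matrix:

```python
# Evaluate the baseline model on both the train and test sets.
from sklearn.metrics import accuracy_score, confusion_matrix

y_train_pred = logreg.predict(X_train_scaled)
y_test_pred = logreg.predict(X_test_scaled)

print('Train accuracy:', accuracy_score(y_train, y_train_pred))
print('Test accuracy :', accuracy_score(y_test, y_test_pred))

# Rows = actual classes, columns = predicted classes
print('Test confusion matrix:\n', confusion_matrix(y_test, y_test_pred))
```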
But how do you decide whether your model is performing well or not? Is it possible to do this evaluation without the test set? In the next video, we will build a more powerful non-linear model on this data set and try to answer some of these questions.
A simple model acts as a baseline and gives you an initial idea of the achievable performance. Then, if interpretability is something you are looking for, you may want to go for a decision tree and compare its performance with the linear/logistic regression model. Finally, if you still do not meet the requirements, use random forests. But keep in mind the time and resource constraints you have, as random forests are computationally expensive. Go ahead and build more complex models such as random forests only if you are not satisfied with your current model and you have sufficient time and resources in hand.
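As an illustration of this progression, here is a sketch that fits a decision tree and a random forest on the same training data used above; the specific hyperparameter values (max_depth, n_estimators) are placeholders, not the course’s settings.

```python
# Sketch of the progression from the baseline to more complex, non-linear models.
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Decision tree: non-linear but still easy to interpret
dtree = DecisionTreeClassifier(max_depth=5, random_state=42)
dtree.fit(X_train_scaled, y_train)
print('Decision tree test accuracy:', dtree.score(X_test_scaled, y_test))

# Random forest: usually stronger, but more expensive to train
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train_scaled, y_train)
print('Random forest test accuracy:', rf.score(X_test_scaled, y_test))
```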
As you saw in the video, cross-validation is one of the techniques that can be leveraged for model evaluation. In this technique, the training data is first divided into multiple folds (say, k). The model is then trained on k − 1 of these folds and validated on the remaining fold. This is repeated k times, so that each fold serves as the validation set exactly once, and the final assessment is the average of the model’s performance across the k folds.
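A sketch of how this looks with scikit-learn’s cross_val_score, using the baseline logistic regression from above and k = 5 folds:

```python
# k-fold cross-validation on the training data: the model is trained on
# k - 1 folds, validated on the held-out fold, and the k scores are averaged.
from sklearn.model_selection import cross_val_score

cv_scores = cross_val_score(logreg, X_train_scaled, y_train, cv=5)  # k = 5 folds
print('Fold scores     :', cv_scores)
print('Mean CV accuracy:', cv_scores.mean())
```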
So, you build the model on some subsets of the training data and validate it on the remaining, held-out subset of the same training data. In effect, you carve a validation set out of the training data and use it to check the model’s performance, while the untouched test set remains the real test of the model on unseen data.
We evaluated both the logistic regression and the random forest model using the cross-validation score. This gave us a more reliable assessment of each model’s generalised performance on unseen data. You could see this clearly with the random forest, whose training performance was misleadingly high earlier, but whose cross-validation score came much closer to the actual test score.
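A sketch of this comparison, reusing the models fitted above: the mean cross-validation score on the training data is computed for each model and placed next to its actual test-set accuracy.

```python
# Compare the cross-validation estimate with the actual test-set accuracy
# for both the logistic regression and the random forest fitted earlier.
from sklearn.model_selection import cross_val_score

for name, model in [('Logistic regression', logreg), ('Random forest', rf)]:
    cv_mean = cross_val_score(model, X_train_scaled, y_train, cv=5).mean()
    test_acc = model.score(X_test_scaled, y_test)
    print(f'{name}: mean CV accuracy = {cv_mean:.3f}, test accuracy = {test_acc:.3f}')
```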
So, in the next video, you will learn how the OOB score in random forests is similar to the cross-validation score, and how evaluation metrics other than accuracy can be used with the cross-validation approach to assess model performance.
Recall the OOB (out-of-bag) score, which you learnt about with random forests; it measures the prediction error of a random forest by evaluating each tree on the training samples that were left out of its bootstrap sample. The OOB score therefore gives an estimate similar to the one produced by the cross-validation score.
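Here is a sketch of how the OOB score can be obtained in scikit-learn by setting oob_score=True when building the forest; the hyperparameters shown are placeholders.

```python
# OOB score: each tree is evaluated on the training rows it did not see during
# bootstrapping, giving a CV-like estimate without a separate validation split.
from sklearn.ensemble import RandomForestClassifier

rf_oob = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=42)
rf_oob.fit(X_train_scaled, y_train)
print('OOB score    :', rf_oob.oob_score_)
print('Test accuracy:', rf_oob.score(X_test_scaled, y_test))
```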
Another point to remember is that, for classifiers, the cross-validation score uses accuracy as the default metric to evaluate the model. But you can use other metrics, as per your requirement, by passing the scoring parameter; the full list of available scorer names can be obtained from sklearn.metrics.get_scorer_names().
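A sketch of passing a different metric, ROC AUC here, through the scoring parameter of cross_val_score, along with a way to list the scorer names scikit-learn recognises (get_scorer_names is available in recent scikit-learn versions).

```python
# Use a metric other than accuracy during cross-validation via `scoring`.
from sklearn.metrics import get_scorer_names
from sklearn.model_selection import cross_val_score

print(sorted(get_scorer_names())[:10])  # a few of the available scorer names

# ROC AUC instead of the default accuracy for the baseline logistic regression
auc_scores = cross_val_score(logreg, X_train_scaled, y_train, cv=5, scoring='roc_auc')
print('Mean CV ROC AUC:', auc_scores.mean())
```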