The following points from model evaluation and hyperparameter tuning are worth reiterating:
Validation Data
- Since hyperparameter tuning requires data that the model has not seen during training, a separate validation set is used.
- It prevents the learning algorithm from ‘peeking’ into the test data while tuning the hyperparameters.
- A severe and common practical limitation of this approach is that data is often not abundant (a typical train/validation/test split is sketched after this list).
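Below is a minimal sketch of such a split using scikit-learn's train_test_split. The synthetic data set and the 60/20/20 proportions are illustrative assumptions, not the exact setup used in the session.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Illustrative synthetic data standing in for the bank marketing data set.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# First hold out a test set that the tuning process never sees.
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Split the remainder into training and validation sets; the validation set
# is used only while tuning hyperparameters, keeping the test set untouched.
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42)  # 0.25 of 80% = 20% overall
```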
Cross-Validation
- It is a statistical technique that enables us to make efficient use of the available data.
- It divides the data into several pieces, or ‘folds’, and uses each fold as the held-out evaluation data one at a time, training on the remaining folds.
You also learnt about feature selection using cross-validation: RFECV (recursive feature elimination with cross-validation), which helps determine the optimal number of features for model building and prediction, as shown in the sketch below.
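A minimal sketch of both ideas, assuming scikit-learn; the synthetic data, the logistic regression estimator and the 5-fold setup are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, n_informative=8,
                           random_state=42)

# 5-fold cross-validation: each fold serves as the held-out fold exactly once.
model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
print("Mean CV accuracy:", scores.mean())

# RFECV: recursive feature elimination, using cross-validation to pick
# the number of features that maximises the chosen score.
selector = RFECV(estimator=LogisticRegression(max_iter=1000),
                 step=1,
                 cv=StratifiedKFold(5),
                 scoring='accuracy')
selector.fit(X, y)
print("Optimal number of features:", selector.n_features_)
```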
Hyperparameters
- Hyperparameters are used to ‘fine-tune’ or regularize the model so as to keep it optimally complex, and cross-validation is used to choose their values.
- The learning algorithm is given the hyperparameters as an ‘input’ and returns the model parameters as the output.
- Hyperparameters are not a part of the final model output.
Hyperparameter tuning for large data sets can be done efficiently using RandomizedSearchCV, which is computationally faster than GridSearchCV because it evaluates only a fixed number of randomly sampled hyperparameter combinations instead of the full grid, and it usually finds a good set of hyperparameters in far fewer iterations. A comparison of the two is sketched below.
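A minimal sketch contrasting the two searches, assuming scikit-learn and SciPy; the random forest estimator, the parameter ranges and the n_iter value are illustrative assumptions chosen only to show the API.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# GridSearchCV evaluates every combination in the grid (3 x 3 = 9 candidates,
# each cross-validated on 5 folds).
grid = GridSearchCV(RandomForestClassifier(random_state=42),
                    param_grid={'n_estimators': [50, 100, 200],
                                'max_depth': [5, 10, None]},
                    cv=5, scoring='accuracy')
grid.fit(X, y)
print("Grid search best params:", grid.best_params_)

# RandomizedSearchCV samples a fixed number of combinations (n_iter=10) from
# the given distributions, which is far cheaper when the search space is large.
rand = RandomizedSearchCV(RandomForestClassifier(random_state=42),
                          param_distributions={'n_estimators': randint(50, 300),
                                               'max_depth': randint(3, 20)},
                          n_iter=10, cv=5, scoring='accuracy', random_state=42)
rand.fit(X, y)
print("Randomized search best params:", rand.best_params_)
```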
In this session, you used logistic regression and random forest models for the bank marketing prediction problem. In practice, there are many models to work with, and you may feel overwhelmed by the range of algorithms available for handling classification and regression problems. It is always advisable to start with a simple model suited to the type of problem you have: begin with a linear regression model for a regression task or a logistic regression model for a classification task. Using a simple model serves two purposes:
- It acts as a baseline (benchmark) model.
- It gives you an idea about the model performance.
Then, you may want to try a decision tree if interpretability is something you are looking for, and compare its performance with that of the linear/logistic regression model.
Finally, if the requirements are still not met, use random forests. But keep in mind the time and resource constraints you have, as random forests are computationally expensive. Build more complex models such as random forests only if you are not satisfied with your current model and have sufficient time and resources in hand. A quick baseline-to-random-forest comparison is sketched below.
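A minimal sketch of this baseline-first comparison, assuming scikit-learn; the synthetic data and the hyperparameter values are illustrative stand-ins for the bank marketing data set.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Compare the simple baseline against progressively more complex models
# using the same cross-validation scheme.
models = {
    'logistic regression (baseline)': LogisticRegression(max_iter=1000),
    'decision tree': DecisionTreeClassifier(max_depth=5, random_state=42),
    'random forest': RandomForestClassifier(n_estimators=200, random_state=42),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
    print(f"{name}: mean CV accuracy = {scores.mean():.3f}")
```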
But remember, these are not the only models available to you. As you go ahead and explore new models, feel free to apply them to your business problems to obtain insightful results and good performance.