
Random Forests in Python

In this segment, you will learn how to implement random forests in sklearn. You will experiment with hyperparameters such as the number of trees and the number of variables considered at each split, and you will build a random forest classifier on the same heart disease data set.

The data set used in the following video can be downloaded from the link given below.

Download the Python code used in the following video from the link given below to practice along.

Please run the code in the notebook and understand the initial few steps – data understanding, multiple regression model and decision tree – before moving on to the video.

Let’s now build a model using RandomForestClassifier() with some arbitrary parameters, chosen for simplicity while still giving reasonably good prediction results.
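
If you want to code along, a minimal sketch of this step could look like the one below. The file name 'heart.csv' and the target column 'target' are assumptions; use the data set downloaded from the link above.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Load the heart disease data (assumed file and column names)
df = pd.read_csv('heart.csv')
X = df.drop('target', axis=1)
y = df['target']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Arbitrary starting parameters; oob_score=True lets us inspect the
# out-of-bag estimate discussed in the video
rf = RandomForestClassifier(n_estimators=100, max_depth=4,
                            max_features=5, oob_score=True,
                            random_state=42)
rf.fit(X_train, y_train)

print('OOB score:', rf.oob_score_)
print('Test accuracy:', rf.score(X_test, y_test))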

You built the model, looked at some sample trees and got an idea of how decisions are made. You also looked at the OOB score to understand how the individual trees perform. Let’s now watch the next video to learn how to tune some hyperparameters using GridSearchCV() to make our ensemble model perform better.
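
For reference, a hedged sketch of the tuning step is given below; it continues from the snippet above, and the grid values are illustrative rather than the ones used in the video.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Illustrative grid; adjust the ranges to your data and compute budget
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [3, 5, 7],
    'max_features': [3, 5, 7],
}

grid_search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid=param_grid,
    cv=5,                  # 5-fold cross-validation
    scoring='accuracy',
    n_jobs=-1)             # use all available cores

# As the note at the end of this segment points out, fit on the
# training data and evaluate the best model on the held-out test set
grid_search.fit(X_train, y_train)

print('Best parameters:', grid_search.best_params_)
print('Best CV accuracy:', grid_search.best_score_)
print('Test accuracy:', grid_search.best_estimator_.score(X_test, y_test))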

Let’s now watch the following video to look at the notion of variable importance in random forests.
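
As a quick illustration, once a forest such as the rf object from the earlier sketch has been fitted, sklearn exposes the importances directly:

import pandas as pd

# feature_importances_ averages the impurity reduction contributed by
# each feature across all the trees in the forest
importances = pd.Series(rf.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))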

To summarise, you learnt how to build a random forest in sklearn. Apart from the hyperparameters that you have in a decision tree, there are two more hyperparameters in random forests: max_features and n_estimators. The effects of both the hyperparameters are briefly summarised below.

The effect of max_features

You learnt that there is an optimal value of max_features, i.e., at very low values, the component trees are too simple to learn anything useful, while at extremely high values, the component trees become similar to each other (and violate the ‘diversity’ criterion).
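
If you want to reproduce this behaviour yourself, the sketch below uses sklearn's validation_curve to vary max_features; the specific values are assumptions and presume the data has at least seven features.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import validation_curve

max_features_range = [1, 2, 3, 4, 5, 6, 7]
train_scores, cv_scores = validation_curve(
    RandomForestClassifier(n_estimators=100, random_state=42),
    X_train, y_train,
    param_name='max_features',
    param_range=max_features_range,
    cv=5, scoring='accuracy', n_jobs=-1)

# Average the scores over the cross-validation folds
for m, tr, val in zip(max_features_range,
                      train_scores.mean(axis=1),
                      cv_scores.mean(axis=1)):
    print(f'max_features={m}: train={tr:.3f}, cv={val:.3f}')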

The effect of n_estimators

When you observe the plot of n_estimators against the training and test accuracies, you will see that as you increase the value of n_estimators, the accuracies of both the training and test sets gradually increase. More importantly, the model does not overfit even as its complexity increases. This is an important benefit of random forests: you can increase the number of trees as much as you like without worrying about overfitting (provided your computational resources allow it).
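
A simple way to recreate this experiment is sketched below (printed accuracies rather than a plot; the parameter values are assumptions). With warm_start=True the forest keeps its existing trees and only grows the additional ones, so the loop stays reasonably fast.

from sklearn.ensemble import RandomForestClassifier

rf_grow = RandomForestClassifier(warm_start=True, max_depth=4,
                                 random_state=42)

# Record train and test accuracy as trees are added to the forest
for n in [10, 50, 100, 200, 500]:
    rf_grow.set_params(n_estimators=n)
    rf_grow.fit(X_train, y_train)
    print(f'n_estimators={n}: '
          f'train={rf_grow.score(X_train, y_train):.3f}, '
          f'test={rf_grow.score(X_test, y_test):.3f}')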

Also, as you saw, fitting such a large number of models took quite a long time. If you want to gain a better understanding of the time taken to build random forests, you can go through this optional segment.

Now try answering the following question and test your understanding.

NOTE:

Although grid search is fitted on the entire data set here, note that in practice grid_search.fit() is applied on the training data set and the resulting model is evaluated on the test data.

If you are comfortable with building random forests now, we recommend that you attempt some optional model-building coding exercises on the DoSelect console. The questions can be accessed here.
