
Choosing Tree Hyperparameters in Python

So far, you have learnt how to tune different hyperparameters manually. However, manual tuning does not always find the best set of hyperparameters for the model. Instead, you can use GridSearchCV() in Python, which is based on the cross-validation technique. Now, what exactly is cross-validation, and how is it helpful? Let’s watch the next video and hear what Rahim has to say about it.

As you learnt in this video, the problems with manual hyperparameter tuning are as follows:

  • Split into train and test sets only: Tuning hyperparameters against the test set means the model indirectly ‘sees’ the test data, and the results depend on the specific train-test split.
  • Split into train, validation and test sets: The validation data eats into the training set, leaving less data to actually fit the model.

In the cross-validation technique, on the other hand, you split the data into train and test sets and train multiple models by repeatedly resampling the train set. The test set is then used only once, for the final evaluation of the chosen hyperparameters.

Specifically, you can apply the k-fold cross-validation technique, where you divide the training data into k folds (groups of samples). If k = 5, you build the model on k-1 (i.e., 4) folds and validate it on the remaining fold, repeating this k times so that each fold serves as the validation set exactly once.

It is important to remember that k-fold cross-validation is applied only on the train data; the test data is used for the final evaluation. The extra step performed in cross-validation is that the train data itself is divided into train and test (or validation) parts, and this split is rotated across the k folds so that the model generalises better. This is depicted in the image below.

The green and orange boxes constitute the training data: the green ones are the actual training data points, and the orange ones are the test (or validation) data points selected within the training dataset. As you can see, the training data is divided into 5 blocks or folds; each time, 4 blocks are used as training data and the remaining block is used as validation data. Once the training process is complete, you move on to model evaluation on the test data, depicted by the blue box.
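To make this concrete, here is a minimal sketch of 5-fold cross-validation in scikit-learn. The breast-cancer dataset, the 70:30 split and the max_depth value are illustrative assumptions, not the choices made in the video:

```python
# A minimal sketch of 5-fold cross-validation on the training data only.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Illustrative dataset; replace with your own X and y.
X, y = load_breast_cancer(return_X_y=True)

# Hold out a test set first; cross-validation happens only on the train split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

tree = DecisionTreeClassifier(max_depth=3, random_state=42)

# cv=5 splits the training data into 5 folds: each fold is used once for
# validation while the other 4 folds are used for training.
scores = cross_val_score(tree, X_train, y_train, cv=5)
print("Fold accuracies:", scores)
print("Mean CV accuracy:", scores.mean())
```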


Now, coming back to the question, how do you control the complexity (or size) of a tree? A very ‘big’ or complex tree will result in overfitting. On the other hand, if you build a relatively small tree, it may not be able to achieve a good enough accuracy, i.e., it will underfit. So, what values of hyperparameters should you choose? As you would have guessed, you can use grid search with cross-validation to find the optimal hyperparameters.
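As a rough illustration, a grid search with 5-fold cross-validation over a few tree-size hyperparameters might look like the sketch below. The dataset, the grid values and the scoring metric are assumptions made for illustration, not Rahim’s exact settings:

```python
# A minimal sketch of grid search with 5-fold cross-validation over
# tree-size hyperparameters; the grid values below are illustrative only.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)   # illustrative dataset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

param_grid = {
    "max_depth": [2, 3, 5, 10],
    "min_samples_leaf": [5, 10, 20, 50],
    "criterion": ["gini", "entropy"],
}

grid = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=42),
    param_grid=param_grid,
    cv=5,                  # 5-fold cross-validation on the training data only
    scoring="accuracy",
    n_jobs=-1,
)
grid.fit(X_train, y_train)

print("Best hyperparameters:", grid.best_params_)
print("Best mean CV accuracy:", grid.best_score_)
```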

In the next video, Rahim will explain how each hyperparameter affects a tree and how you should choose the optimal set of hyperparameters.

You played around with different values of different hyperparameters using GridSearchCV(). This function lets you try out different combinations of hyperparameters, which makes it much easier to figure out the best values. It is, however, important to note that the values tried out in the demonstration above may not necessarily have given the best results in terms of accuracy. You can go ahead and try out other combinations as well and see if you can surpass Rahim’s test score.
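One way to experiment further might look like the sketch below: widen the grid, refit, and then check the accuracy on the held-out test set. The dataset, the grid ranges and the split are again illustrative assumptions, not the values from the demonstration:

```python
# A minimal sketch: widen the search grid and check the score on the
# held-out test set. All grid values here are illustrative only.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)   # illustrative dataset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

wider_grid = {
    "max_depth": list(range(2, 15)),
    "min_samples_leaf": [1, 5, 10, 25, 50, 100],
}

grid = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid=wider_grid,
    cv=5,
    n_jobs=-1,
)
grid.fit(X_train, y_train)

# By default refit=True, so best_estimator_ is already retrained on the
# full training data with the best hyperparameters found by the search.
print("Best hyperparameters:", grid.best_params_)
print("Test accuracy:", grid.best_estimator_.score(X_test, y_test))
```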

Phew! That was surely a lot to take in over the past couple of segments. Let’s now watch the next video to recall and summarise what you have learnt so far while building decision trees.

You are required to read the documentation of sklearn’s DecisionTreeClassifier and understand the meaning of its hyperparameters. The following questions are based on the documentation.
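If you want a quick overview before diving into the documentation, one option is to print the estimator’s hyperparameters and their default values, as in the short sketch below (the sample output in the comment is indicative and may differ across scikit-learn versions):

```python
# Print every hyperparameter of DecisionTreeClassifier with its default value.
from sklearn.tree import DecisionTreeClassifier

print(DecisionTreeClassifier().get_params())
# e.g. {'ccp_alpha': 0.0, 'criterion': 'gini', 'max_depth': None, ...}
```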

If you are comfortable with building decision trees now, we recommend that you attempt some optional model-building coding exercises on the DoSelect console. The questions can be accessed here.
