Earlier, you learnt that decision trees have a strong tendency to overfit the training data, which is a serious problem. So, you need to pay attention to the size of the tree: if it is allowed to grow unchecked, it may end up with leaves that cater to just a single data point each.
There are two broad strategies to control overfitting in decision trees: truncation and pruning. In this video, you will learn how these two techniques work:
- Truncation – Stop the tree while it is still growing so that it does not end up with leaves containing very few data points. Note that truncation is also known as pre-pruning.
- Pruning – Let the tree grow to its full complexity, and then cut its branches back in a bottom-up fashion, starting from the leaves. In practical implementations, pruning strategies are more commonly used to avoid overfitting.
Please note that this session mainly covers truncation techniques. If you also want to learn about pruning techniques (which is an independent topic), we have provided an optional segment that can be accessed here. We recommend going through it after you have finished this session on hyperparameter tuning.
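To make the contrast concrete, here is a minimal sketch of truncation. It assumes sklearn's built-in Iris dataset purely for illustration and compares a fully grown tree with one whose growth is stopped early via max_depth:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Iris is used here only as a convenient illustrative dataset
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Fully grown tree: keeps splitting until leaves are pure, so it tends to overfit
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Truncated (pre-pruned) tree: growth is stopped early with max_depth
truncated_tree = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_train, y_train)

print("Full tree      -> depth:", full_tree.get_depth(), "leaves:", full_tree.get_n_leaves())
print("Truncated tree -> depth:", truncated_tree.get_depth(), "leaves:", truncated_tree.get_n_leaves())
print("Test accuracy  -> full:", full_tree.score(X_test, y_test),
      "truncated:", truncated_tree.score(X_test, y_test))
```

The truncated tree is far smaller, and its test accuracy is usually comparable to (or better than) that of the fully grown tree.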
Let’s look into various ways in which you can truncate a tree:
Though there are various ways to truncate or prune trees, the DecisionTreeClassifier() class in sklearn lets you control the following hyperparameters (a combined usage sketch is provided after this list):
- criterion (“gini” or “entropy”): It defines the homogeneity metric used to measure the quality of a split. Sklearn supports “gini” for the Gini index and “entropy” for information gain. By default, it takes the value “gini”.
- max_features: It defines the number of features to consider when looking for the best split. It can take an integer, a float, a string or None:
  - If an integer is passed, then that many features are considered at each split.
  - If a float is passed, it is treated as a fraction, and int(max_features * n_features) features are considered at each split.
  - If “auto” or “sqrt” is passed, then max_features = sqrt(n_features).
  - If “log2” is passed, then max_features = log2(n_features).
  - If None is passed, then max_features = n_features. By default, it takes the value None.
- max_depth: The max_depth parameter denotes the maximum depth of the tree. It can take any integer value or None. If None, then nodes are expanded until all leaves are pure (which can lead to overfitting) or until all leaves contain fewer than min_samples_split samples. By default, it takes the value None.
- min_samples_split: This is the minimum number of samples required to split an internal node. If an integer is passed, min_samples_split is that minimum number. If a float is passed, min_samples_split is a fraction, and ceil(min_samples_split * n_samples) is the minimum number of samples required to split a node. By default, it takes the value 2.
- min_samples_leaf: This is the minimum number of samples required to be at a leaf node. If an integer is passed, min_samples_leaf is that minimum number. If a float is passed, it is a fraction, and ceil(min_samples_leaf * n_samples) is the minimum number of samples per leaf. By default, it takes the value 1.
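Putting these together, here is a sketch of how the hyperparameters above are passed to DecisionTreeClassifier; the specific values are arbitrary and chosen only to illustrate the arguments, not as recommendations:

```python
from sklearn.tree import DecisionTreeClassifier

# Illustrative values only; in practice, these are tuned rather than hand-picked
clf = DecisionTreeClassifier(
    criterion="entropy",    # homogeneity metric: "gini" (default) or "entropy"
    max_features="sqrt",    # consider sqrt(n_features) features at each split
    max_depth=5,            # truncate the tree at depth 5
    min_samples_split=20,   # an internal node needs at least 20 samples to be split
    min_samples_leaf=10,    # every leaf must contain at least 10 samples
)
```

In the rest of this session on hyperparameter tuning, you will typically search over such values (for example, with GridSearchCV) rather than fixing them by hand.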
There are other hyperparameters as well in DecisionTreeClassifier. You can read the documentation in Python using:
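For example, one way to bring up the class documentation is:

```python
from sklearn.tree import DecisionTreeClassifier

# Prints the full docstring, including all available hyperparameters
help(DecisionTreeClassifier)
```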