Let’s consider the heart disease example that was introduced in the earlier segments to understand decision trees. Now, you will calculate the homogeneity measure for some of the features, using the Gini index on actual numbers, to determine the attribute that you should split on first.
Recall that the Gini index is calculated as follows:

Gini index = 1 – (p1² + p2² + … + pk²)

where pi is the probability of finding a point with the label i, and k is the number of classes.
The data set is not homogeneous, and you need to split the data such that the resulting partitions are as homogeneous as possible. This is a classification problem with two output classes or labels – having heart disease or not. Here, you use the Gini index as the homogeneity measure. Let’s go ahead and see how the Gini index can be used to decide which attribute to split the data on. While making your first split, you need to choose an attribute such that the purity gain is maximum. You can calculate the Gini index of the split on ‘sex’ (gender) and compare it with the Gini index of the split on ‘cholesterol’.
Suppose you have data for 100 patients, and the target variable consists of two classes: class 0, with 60 people who do not have heart disease, and class 1, with 40 people who have heart disease.
Expressing this in terms of probabilities, you get:

P(class 0) = 60/100 = 0.6
P(class 1) = 40/100 = 0.4
Now, you can calculate the Gini index for the data before making any splits as follows:

Gini impurity (before split) = 1 – (0.6² + 0.4²) = 1 – 0.52 = 0.48
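If you prefer to verify this with code, here is a minimal Python sketch of the same calculation (the variable names are just for illustration):

```python
# Gini impurity of the full dataset (60 patients in class 0, 40 in class 1).
p0, p1 = 60 / 100, 40 / 100
gini_before = 1 - (p0 ** 2 + p1 ** 2)
print(round(gini_before, 2))  # 0.48
```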
Let’s now evaluate which split gives the maximum reduction in impurity among the possible choices. You have the following information about the target variable and the two attributes.

| Sex | Class 0 (no heart disease) | Class 1 (heart disease) | Total |
| --- | --- | --- | --- |
| Male | 50 | 20 | 70 |
| Female | 10 | 20 | 30 |

| Cholesterol | Class 0 (no heart disease) | Class 1 (heart disease) | Total |
| --- | --- | --- | --- |
| Low (< 250) | 50 | 10 | 60 |
| High (> 250) | 10 | 30 | 40 |

As you can see, the tables above show the number of diseased and non-diseased persons for each level of the two attributes – ‘Sex’ and ‘Cholesterol’. Let’s calculate the reduction in impurity for each attribute individually, starting with ‘Sex’.
Split based on Sex
Let’s consider the first candidate split, based on sex/gender. As you can see from the first table, of the 100 people, there are 70 males and 30 females. Among the 70 males, i.e. in the child node containing males, 50 belong to class 0 (they do not have heart disease) and the remaining 20 belong to class 1 (they have heart disease). So, for the split on ‘Sex’, you have the following two child nodes:

- Male node: 70 patients, of which 50 are in class 0 and 20 are in class 1
- Female node: 30 patients, of which 10 are in class 0 and 20 are in class 1

Now, the probabilities of the two classes within the male subset come out to be:

P(class 0 | male) = 50/70 ≈ 0.71
P(class 1 | male) = 20/70 ≈ 0.29

Using the same formula, the Gini impurity of the male node becomes:

Gini(male) = 1 – ((50/70)² + (20/70)²) ≈ 0.41

Let’s now take the other case, i.e. the child node containing females: there are 30 females, out of which 10 belong to class 0 (no heart disease) and 20 belong to class 1 (heart disease). The probabilities of the two classes within the female subset come out to be:

P(class 0 | female) = 10/30 ≈ 0.33
P(class 1 | female) = 20/30 ≈ 0.67

Using the formula, the Gini impurity of the female node becomes:

Gini(female) = 1 – ((10/30)² + (20/30)²) ≈ 0.44

Now, how do you get the overall impurity for the attribute ‘Sex’ after the split? You aggregate the Gini impurities of the two child nodes by taking a weighted average of the impurities of the male and female nodes, where the weights are the fractions of patients in each node. So, you have:

Gini(after split on Sex) = (70/100) × Gini(male) + (30/100) × Gini(female)

This gives the Gini impurity after the split based on gender as:

Gini(after split on Sex) = 0.7 × 0.41 + 0.3 × 0.44 ≈ 0.42
Thus, the split based on gender gives the following insights:
- Gini impurity before split = 0.48
- Gini impurity after split = 0.42
- Reduction in Gini impurity = 0.48 – 0.42 = 0.06
Hence, you get the following tree after splitting on ‘Sex’:

- Root node (100 patients: 60 in class 0, 40 in class 1), Gini = 0.48
  - Male node (70 patients: 50 in class 0, 20 in class 1), Gini ≈ 0.41
  - Female node (30 patients: 10 in class 0, 20 in class 1), Gini ≈ 0.44
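If you want to double-check these numbers yourself, here is a minimal Python sketch of the same calculation:

```python
# Quick check of the split on 'Sex' using the class counts from the table.
gini_male = 1 - ((50 / 70) ** 2 + (20 / 70) ** 2)    # ≈ 0.41
gini_female = 1 - ((10 / 30) ** 2 + (20 / 30) ** 2)  # ≈ 0.44

# Weighted average of the two child nodes (70 males, 30 females).
gini_after_sex = (70 / 100) * gini_male + (30 / 100) * gini_female

print(round(gini_after_sex, 2))         # 0.42
print(round(0.48 - gini_after_sex, 2))  # 0.06 reduction
```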
Split based on Cholesterol
Let’s now take another candidate split based on cholesterol. You divide the dataset into two subsets: Low Cholesterol (Cholesterol < 250) and High Cholesterol (Cholesterol > 250). There are 60 people belonging to the low cholesterol group and 40 people belonging to the high cholesterol group.
If you look at the second table given above, you will notice that among the 60 low cholesterol people, 50 belong to class 0 (they do not have heart disease) and the remaining 10 belong to class 1 (they have heart disease). So, for the split on ‘Cholesterol’, you have the following two child nodes:

- Low cholesterol node: 60 patients, of which 50 are in class 0 and 10 are in class 1
- High cholesterol node: 40 patients, of which 10 are in class 0 and 30 are in class 1

Now, the probabilities of the two classes within the low cholesterol subset come out to be:

P(class 0 | low cholesterol) = 50/60 ≈ 0.83
P(class 1 | low cholesterol) = 10/60 ≈ 0.17

Using the formula, the Gini impurity of the low cholesterol node becomes:

Gini(low cholesterol) = 1 – ((50/60)² + (10/60)²) ≈ 0.28

Let’s now take the other case, i.e. the child node with high cholesterol (Cholesterol > 250): there are 40 such people, out of which 10 belong to class 0 (no heart disease) and 30 belong to class 1 (heart disease). The probabilities of the two classes within the high cholesterol subset come out to be:

P(class 0 | high cholesterol) = 10/40 = 0.25
P(class 1 | high cholesterol) = 30/40 = 0.75

Using the formula, the Gini impurity of the high cholesterol node becomes:

Gini(high cholesterol) = 1 – (0.25² + 0.75²) = 0.375

The overall impurity of the data after the split based on cholesterol can be computed by taking a weighted average of the impurities of the low and high cholesterol nodes. So, you have:

Gini(after split on Cholesterol) = (60/100) × Gini(low cholesterol) + (40/100) × Gini(high cholesterol)

This gives the Gini impurity after the split based on cholesterol as:

Gini(after split on Cholesterol) = 0.6 × 0.28 + 0.4 × 0.375 ≈ 0.32
Thus, the split based on cholesterol gives the following insights:
- Gini impurity before split = 0.48
- Gini impurity after split ≈ 0.32
- Reduction in Gini impurity = 0.48 – 0.32 = 0.16
Hence, you get the following tree after splitting on ‘Cholesterol’:

- Root node (100 patients: 60 in class 0, 40 in class 1), Gini = 0.48
  - Low cholesterol node (60 patients: 50 in class 0, 10 in class 1), Gini ≈ 0.28
  - High cholesterol node (40 patients: 10 in class 0, 30 in class 1), Gini = 0.375
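As before, here is a minimal Python sketch that reproduces these numbers:

```python
# Quick check of the split on 'Cholesterol' using the counts from the table.
gini_low = 1 - ((50 / 60) ** 2 + (10 / 60) ** 2)   # ≈ 0.28
gini_high = 1 - ((10 / 40) ** 2 + (30 / 40) ** 2)  # = 0.375

# Weighted average of the two child nodes (60 low, 40 high cholesterol).
gini_after_chol = (60 / 100) * gini_low + (40 / 100) * gini_high

print(round(gini_after_chol, 2))         # 0.32
print(round(0.48 - gini_after_chol, 2))  # 0.16 reduction
```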
Hence, from the above example, it is evident that you get a significantly higher reduction in Gini impurity when you split the dataset on cholesterol (0.16) than when you split on gender (0.06). So, cholesterol is the attribute you should split on first.
Let’s summarise all the steps you performed:

1. Calculate the Gini impurity of the whole dataset before any split.
2. Consider any one of the available attributes.
3. Calculate the Gini impurity after splitting on this attribute, for each of the levels of the attribute. In the example above, you considered the attribute ‘Sex’ and calculated the Gini impurity for the male and female nodes separately.
4. Combine the Gini impurities of all the levels (as a weighted average) to get the Gini impurity after the split on that attribute.
5. Repeat steps 2–4 with another attribute until you have exhausted all of them.
6. Compare the reduction in Gini impurity across all attributes and select the one that offers the maximum reduction (see the sketch below).
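To tie the steps together, here is a minimal, self-contained Python sketch of the whole procedure; the helper names (gini_impurity, gini_after_split) are purely illustrative and not part of any library:

```python
# A sketch of the split-selection procedure summarised above.

def gini_impurity(counts):
    """Gini impurity = 1 - sum(p_i^2), where p_i are the class proportions."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

def gini_after_split(level_counts):
    """Weighted average of the child-node impurities over an attribute's levels."""
    n = sum(sum(level) for level in level_counts)
    return sum(sum(level) / n * gini_impurity(level) for level in level_counts)

# [class 0, class 1] counts per level, taken from the example above.
candidate_splits = {
    "Sex": [[50, 20], [10, 20]],          # male, female
    "Cholesterol": [[50, 10], [10, 30]],  # low (< 250), high (> 250)
}

gini_before = gini_impurity([60, 40])  # 0.48

# Steps 2-5: impurity reduction for each candidate attribute.
reductions = {attr: gini_before - gini_after_split(levels)
              for attr, levels in candidate_splits.items()}

# Step 6: select the attribute with the maximum reduction.
best_attribute = max(reductions, key=reductions.get)

print({attr: round(r, 2) for attr, r in reductions.items()})  # {'Sex': 0.06, 'Cholesterol': 0.16}
print(best_attribute)                                         # Cholesterol
```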
You can also perform the same exercise using entropy instead of the Gini index as your homogeneity measure.
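For instance, here is a minimal sketch of the same comparison using entropy instead of the Gini index:

```python
# The same split comparison, with entropy as the homogeneity measure.
from math import log2

def entropy(counts):
    """Entropy = -sum(p_i * log2(p_i)) over the class proportions."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def entropy_after_split(level_counts):
    n = sum(sum(level) for level in level_counts)
    return sum(sum(level) / n * entropy(level) for level in level_counts)

entropy_before = entropy([60, 40])  # ≈ 0.97 bits

gain_sex = entropy_before - entropy_after_split([[50, 20], [10, 20]])
gain_chol = entropy_before - entropy_after_split([[50, 10], [10, 30]])

print(round(gain_sex, 2), round(gain_chol, 2))  # 0.09 0.26 -> cholesterol still wins
```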
Important: Please note that the Gini index is also often referred to as Gini impurity. Also, some sources, websites, and books may mention a different formula for the Gini index. There is nothing wrong in using either of the formulas, because the ultimate interpretation regarding the impurity of a feature remains unchanged across both, but in order to avoid any confusion, we would recommend that you stick to the one mentioned in this session, as this is the formula we will be using consistently throughout the module. Here it is again:

Gini index = 1 – (p1² + p2² + … + pk²)
In the next segment, you will learn about the property of feature importance in decision trees.