Impurity Measures

Now, you have narrowed down the decision tree construction problem to this: you want to split the data set such that the homogeneity of the resultant partitions is maximum. But how do you measure this homogeneity?

Various methods, such as the classification error, Gini index and entropy, can be used to quantify homogeneity. You will learn about each of these methods in the upcoming video.

Note

Note that at the timestamp [5:11] in the video, the formula for entropy includes a negative sign along with the expression. Also, at [5:42], the second term should be -0.8 log(0.8) and not -0.8 log(-0.8).

The classification error is calculated as follows:

$$E = 1 - \max_i\left(p_i\right)$$

The Gini index is calculated as follows:

$$G = \sum_{i=1}^{k} p_i\left(1 - p_i\right)$$

Entropy is calculated as follows:

$$D = -\sum_{i=1}^{k} p_i\log_2\left(p_i\right),$$

where $p_i$ is the probability of finding a point with the label $i$, and $k$ is the number of classes.
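
As a quick illustration of these three formulas, here is a minimal Python sketch that computes each measure from a list of class counts. The helper names (class_probabilities, classification_error, gini_index, entropy) are purely illustrative and not part of any particular library.

```python
from math import log2

def class_probabilities(counts):
    """Convert raw class counts into probabilities p_i."""
    total = sum(counts)
    return [c / total for c in counts]

def classification_error(counts):
    """E = 1 - max(p_i)"""
    return 1 - max(class_probabilities(counts))

def gini_index(counts):
    """G = sum of p_i * (1 - p_i) over all classes."""
    return sum(p * (1 - p) for p in class_probabilities(counts))

def entropy(counts):
    """D = -sum of p_i * log2(p_i), using the convention 0 * log2(0) = 0."""
    return -sum(p * log2(p) for p in class_probabilities(counts) if p > 0)

# Example: 20 points of class 0 and 80 points of class 1
print(classification_error([20, 80]))  # approximately 0.2
print(gini_index([20, 80]))            # approximately 0.32
print(entropy([20, 80]))               # approximately 0.72
```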

Let’s now tweak the above example and try to understand how these impurity measures change with the class distribution.

From the above example, you understood how to calculate the different impurity measures and how they change with different class distributions.

Case I: Class 0 = 20, Class 1 = 80
Case II: Class 0 = 50, Class 1 = 50
Case III: Class 0 = 80, Class 1 = 20

Impurity Measure        Case I    Case II    Case III
Classification Error    0.2       0.5        0.2
Gini Impurity           0.32      0.5        0.32
Entropy                 0.72      1          0.72

You can see that for completely non-homogeneous data with an equal class distribution, the values of the classification error and the Gini impurity are the same, i.e., 0.5, and the value of entropy is 1.
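
As a quick check on Case I in the table, with $p_0 = 0.2$ and $p_1 = 0.8$:

$$E = 1 - \max(0.2,\; 0.8) = 0.2$$

$$G = 1 - \left(0.2^2 + 0.8^2\right) = 1 - 0.68 = 0.32$$

$$D = -\left(0.2\log_2 0.2 + 0.8\log_2 0.8\right) \approx 0.46 + 0.26 \approx 0.72$$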

The scaled version of the entropy in the illustration shown in the video is nothing but entropy/2. It has been used to emphasize that the Gini index is an intermediate measure between entropy and the classification error.

In practice, classification error does not perform well. So, we generally prefer using either the Gini index or entropy over it.

Gini Index

The Gini index measures the probability of a randomly chosen data point being classified incorrectly if it were labelled at random according to the distribution of classes in the data set. The formula for the Gini index can also be written as follows:

$$G = \sum_{i=1}^{k} p_i\left(1 - p_i\right) = \sum_{i=1}^{k}\left(p_i - p_i^2\right) = \sum_{i=1}^{k} p_i - \sum_{i=1}^{k} p_i^2 = 1 - \sum_{i=1}^{k} p_i^2,$$

where $p_i$ is the probability of finding a point with the label $i$, and $k$ is the number of classes.

(Think: why is $\sum_{i=1}^{k} p_i$ equal to 1?)

A Gini index of 0 indicates that all the data points belong to a single class. For a two-class problem, a Gini index of 0.5 indicates that the data points are equally distributed between the two classes.

Suppose you have a data set with two class labels. If the data set is completely homogeneous, i.e., all the data points belong to label 1, then the probability of finding a data point corresponding to label 2 will be 0 and that of label 1 will be 1. So, $p_1 = 1$ and $p_2 = 0$. In such a case, the Gini index equals 0, its lowest possible value. Hence, the higher the homogeneity, the lower the Gini index.
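
For the two-class case described above, with $p_1 = p$ and $p_2 = 1 - p$, the Gini index reduces to

$$G = 1 - \left(p^2 + (1 - p)^2\right) = 2p(1 - p),$$

which is 0 when $p = 1$ (or $p = 0$) and reaches its maximum value of 0.5 when $p = 0.5$.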

Entropy

Entropy quantifies the degree of disorder in the given data; for a two-class data set, its value varies from 0 to 1. Entropy and the Gini index behave similarly numerically. If a data set is completely homogeneous, then its entropy will be 0, i.e., there is no disorder in the data. If a data set contains an equal distribution of both the classes, then its entropy will be 1, i.e., there is complete disorder in the data. Hence, like the Gini index, the higher the homogeneity, the lower the entropy.
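
The two extreme cases work out as follows (using the convention $0\log_2 0 = 0$):

$$D = -\left(1\cdot\log_2 1 + 0\cdot\log_2 0\right) = 0 \quad \text{(completely homogeneous data)}$$

$$D = -\left(0.5\log_2 0.5 + 0.5\log_2 0.5\right) = 1 \quad \text{(equal class distribution)}$$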

Now that you have understood the different methods to quantify the purity/impurity of a node, how do you identify the attribute that results in the best split? Let’s learn more about this in the upcoming video.

The change in impurity, or the purity gain, is given by the difference between the impurity before the split and the impurity after the split, i.e.,

$$\Delta = \text{Impurity (pre-split)} - \text{Impurity (post-split)}$$

The post-split impurity is calculated as the weighted average of the impurities of the two child nodes, weighted by the fraction of data points that falls into each node. The split that results in the maximum gain is chosen as the best split.

To summarise, the information gain is calculated as:

$$\text{Gain}(A) = D - D_A,$$

where $D$ is the entropy of the parent set (the data before splitting) and $D_A$ is the weighted entropy of the partitions obtained after splitting on attribute $A$. Note that a reduction in entropy implies information gain.

Let’s understand how we compute information gain with an example. Suppose you have four data points, of which two belong to the class label ‘1’ and the other two belong to the class label ‘2’. You split the points such that the left partition has the two data points belonging to label ‘1’, and the right partition has the other two data points that belong to label ‘2’. Now, let’s assume that you split on some attribute called ‘A’.
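
Working this out with the entropy formula: the parent node has $p_1 = p_2 = 0.5$, and each child node is pure, so

$$D = -\left(0.5\log_2 0.5 + 0.5\log_2 0.5\right) = 1,$$

$$D_A = \tfrac{2}{4}\cdot 0 + \tfrac{2}{4}\cdot 0 = 0,$$

$$\text{Gain}(A) = D - D_A = 1 - 0 = 1.0.$$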

So, the information gain after splitting the original data set on attribute ‘A’ is 1.0. You always try to maximise information gain by achieving maximum homogeneity, and this is possible only when the entropy decreases after splitting, relative to the parent set.

In the case of a classification problem, you always try to maximise the purity gain, i.e., reduce the impurity at a node, after every split, and this process is repeated till you reach the leaf nodes, which make the final prediction.
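
To make this process concrete, here is a minimal Python sketch that picks the best split by information gain, assuming each candidate split is represented simply by the class counts of its left and right partitions (the function and variable names are illustrative, not from any particular library):

```python
from math import log2

def entropy(counts):
    """D = -sum of p_i * log2(p_i) over the class counts of a node."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, left_counts, right_counts):
    """Gain = entropy(parent) - weighted average entropy of the child nodes."""
    n = sum(parent_counts)
    weighted_child_entropy = (sum(left_counts) / n) * entropy(left_counts) \
                             + (sum(right_counts) / n) * entropy(right_counts)
    return entropy(parent_counts) - weighted_child_entropy

# Candidate splits of the four-point example above: (left counts, right counts)
candidate_splits = {
    "A": ([2, 0], [0, 2]),   # perfectly homogeneous partitions
    "B": ([1, 1], [1, 1]),   # no improvement in homogeneity
}
parent = [2, 2]

# The split with the maximum gain is chosen as the best split
best = max(candidate_splits, key=lambda a: information_gain(parent, *candidate_splits[a]))
print(best)  # 'A', with a gain of 1.0 versus 0.0 for 'B'
```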

Additional Readings:

If you are curious to know about these topics, you may go through the following link.
