Recall the scenario from session one where you were a doctor asking a series of questions to determine whether a person has heart disease or not. Based on your past experience with other patients, the answers to these questions would finally lead to a prediction of whether the person has heart disease. Now, if you’re a doctor, you would know which questions to ask first, right? When a person might be at risk of heart disease, the doctor would probably first measure the person’s cholesterol level. If it is higher than 300, the doctor can be fairly sure that the person is at risk. He/She may then ask some more questions to confirm this prediction further.
Now suppose that instead of measuring the cholesterol level, the doctor first asks the person’s age. If the person says, say, 55, the doctor may not be completely sure whether the person has heart disease. He/She would need to ask further questions.
From both these scenarios, it is easy to infer that to predict the target variable (in this case, heart disease), some questions/attributes are more important for its prediction than others. In our case, you saw that the cholesterol level is a more significant attribute than age in the prediction of heart disease. Why is that? It is because, for the doctor, it might be evident from past records that people with higher cholesterol levels have a high chance of having heart disease. Say, in his/her experience, the doctor has seen that out of every 100 patients with cholesterol higher than 300, 90 had heart disease while 10 did not. Also, say that out of 100 patients over the age of 54 that the doctor has consulted, fifty had heart disease. Now, it is evident that ‘cholesterol’ is better for determining heart disease than ‘age’, since 90% of the patients with cholesterol greater than 300 were diagnosed with heart disease.
You can clearly see that one of the classes (disease) is significantly dominant over the other (non-disease) after splitting on the basis of cholesterol > 300, and this dominance helps the doctor make a more confident prediction, unlike ‘age’, which gives a 50-50 split of the two classes and leaves us in a dilemma again.
So we basically arrive at these questions: Given many attributes, how do you decide which rules obtained from the attributes to use to split the data set? From a single feature, you can derive many rules, and you may use any of them to make the split. Do you select one at random, or should there be a selection criterion for choosing one over the other? What are you trying to achieve with the split?
All these questions will be covered in the following session and you will learn about the various algorithms and criteria that are involved in constructing a decision tree.
If a partition contains data points with identical labels (for example, label 1), then you can classify the entire partition as that particular label (label 1). However, this is an oversimplified example. In real-world data sets, you will almost never have completely homogeneous data sets (or even nodes) after splitting the data. Hence, it is important that you try to split the nodes such that the resulting nodes are as homogeneous as possible. One important thing to remember is that homogeneity here always refers to the homogeneity of the response (target) variable.
For example, consider the same heart disease example, in which you want to classify whether a person has heart disease or not. If one of the nodes splits on ‘Blood Pressure’, try to choose a rule such that, as far as possible, the data points that pass the rule have one label and those that do not pass it have the other label. Thus, you need to ensure that the response variable’s homogeneity in the resultant splits is as high as possible.
A split that results in a homogeneous subset is much more desirable than one that results in a 50-50 distribution (in the case of two labels). In a completely homogeneous set, all the data points belong to one particular label. Hence, you must try to generate partitions that result in such sets.
For classification purposes, a data set is completely homogeneous if it contains only a single class label. For regression purposes, a data set is completely homogeneous if the variance of the target values is zero; in practice, you try to make this variance as small as possible. You will understand regression trees better in the upcoming segments.
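To make the definition concrete, here is a minimal Python sketch of what ‘completely homogeneous’ means in the two settings. The helper names below are purely illustrative and are not taken from any library used in this module.

```python
import numpy as np

# Illustrative helpers, not from any particular library.

def is_homogeneous_classification(labels):
    # A classification node is completely homogeneous if all points share one class label.
    return len(set(labels)) == 1

def is_homogeneous_regression(targets, tol=1e-9):
    # A regression node is (practically) completely homogeneous if the target variance is ~0.
    return float(np.var(targets)) <= tol

print(is_homogeneous_classification(["disease", "disease", "disease"]))   # True
print(is_homogeneous_classification(["disease", "no disease"]))           # False
print(is_homogeneous_regression([120.0, 120.0, 120.0]))                   # True
```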
Let’s take a look at the illustration given below to further understand homogeneity.
Consider a data set ‘D’ with homogeneity ‘H’ and a defined threshold value. When the homogeneity exceeds the threshold value, you stop splitting the node and assign a prediction to it. As this node does not need further splitting, it becomes a leaf node.
Suppose you set the threshold value at 70%. The homogeneity of the node in the illustration given below is about 86%, which is clearly above the threshold. Hence, no splitting is required and this node becomes a leaf node.
But what do you do when the homogeneity ‘H’ is less than the threshold value?
Now, keeping the same threshold value of 70%, you can see from the illustration below that the homogeneity of the first node is exactly 50%: the node contains an equal number of data points from both class labels. Hence, you cannot assign a prediction to this node, as it does not meet the passing criterion that has been set. There is still ambiguity in this node, since it is not clear which label should be assigned as the final prediction. This brings in the need for further splitting, after which the homogeneity of each resulting node is again compared with the threshold to arrive at a decision.
As long as the homogeneity ‘H’ of a node is less than the threshold, you need to continue splitting it. The splitting process continues until the homogeneity exceeds the threshold value, i.e., until a sufficiently large majority of the data points in the node belong to the same class.
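The short Python sketch below mirrors this stopping rule under simple assumptions: homogeneity is taken here as the fraction of points belonging to the majority class, and the 70% threshold from the illustrations above is used. The function names are illustrative only.

```python
from collections import Counter

def homogeneity(labels):
    # Fraction of data points in the node that belong to the majority class.
    counts = Counter(labels)
    return max(counts.values()) / len(labels)

def decide(node_labels, threshold=0.70):
    # Stopping rule: stop and predict when homogeneity exceeds the threshold.
    if homogeneity(node_labels) >= threshold:
        # Leaf node: assign the majority class as the prediction.
        return "leaf", Counter(node_labels).most_common(1)[0][0]
    # Otherwise the node is still ambiguous and must be split further
    # (how to find the best splitting rule is covered in the next segments).
    return "split further", None

print(decide(["disease"] * 86 + ["no disease"] * 14))   # ('leaf', 'disease'); ~86% homogeneity
print(decide(["disease"] * 50 + ["no disease"] * 50))   # ('split further', None); 50% homogeneity
```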
This is an abstract example to give you an intuition about homogeneity and splitting. You will get better clarity in the next segment, where you will learn how to quantify and measure homogeneity to arrive at a prediction through repeated splitting. You will learn how to use specific measures of homogeneity, namely the Gini index, entropy and classification error (for classification), and MSE (for regression).
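As a preview, these measures can be sketched in a few lines of Python. The formulas are the standard ones; treat the snippet as an illustrative sketch rather than the module's reference implementation.

```python
import numpy as np

def gini(labels):
    # Gini index: 1 - sum(p_i^2); equals 0 for a perfectly pure node.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    # Entropy: -sum(p_i * log2(p_i)); equals 0 for a perfectly pure node.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def classification_error(labels):
    # Classification error: 1 - max(p_i).
    _, counts = np.unique(labels, return_counts=True)
    return 1.0 - counts.max() / counts.sum()

def mse(targets):
    # Mean squared error around the node mean, used for regression trees.
    targets = np.asarray(targets, dtype=float)
    return np.mean((targets - targets.mean()) ** 2)

labels = ["disease"] * 90 + ["no disease"] * 10
print(round(gini(labels), 3), round(entropy(labels), 3), round(classification_error(labels), 3))
```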
Now, how do you find the best split for different types of attributes in a decision tree using the CART algorithm?
A tree can be split based on different rules derived from an attribute, and these attributes can be categorical or continuous in nature. If an attribute is nominal categorical, then there are 2^(k−1) − 1 possible binary splits for this attribute, where k is the number of categories (levels) of the attribute. In this case, each possible subset of categories is examined to determine the best split.
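A small Python sketch of this enumeration is given below; the function name is hypothetical. The count is 2^(k−1) − 1 because each partition and its mirror image describe the same split, so one category is kept fixed on the left side to count each split only once.

```python
from itertools import combinations

def nominal_binary_splits(categories):
    # Enumerate the 2^(k-1) - 1 distinct binary partitions of a nominal attribute's k categories.
    cats = sorted(set(categories))
    first, rest = cats[0], cats[1:]
    splits = []
    # Keep the first category fixed on the left so each partition is counted only once.
    for r in range(len(rest)):           # exclude r == len(rest), which would leave the right side empty
        for combo in combinations(rest, r):
            left = {first, *combo}
            right = set(cats) - left
            splits.append((left, right))
    return splits

# For an attribute with 3 categories there are 2^(3-1) - 1 = 3 candidate splits.
for left, right in nominal_binary_splits(["low", "normal", "high"]):
    print(left, "vs", right)
```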
If an attribute is ordinal categorical or continuous in nature with n different values, there are n − 1 possible splits for it. The values of the attribute are sorted from the smallest to the largest, and candidate splits based on the individual values are examined to determine the best split point, i.e., the one that maximizes the homogeneity at a node.
There are various other techniques for handling continuous features in different algorithms, such as calculating percentiles or midpoints of the sorted values; this process is known as discretization. Although the exact technicalities are beyond the scope of this module, curious students can read more about this in detail in the additional resources given below.
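For instance, one common way to generate the n − 1 candidate thresholds for a continuous attribute is to take the midpoints between consecutive sorted distinct values, as sketched below; the function name is illustrative only.

```python
import numpy as np

def candidate_thresholds(values):
    # For a continuous (or ordinal) attribute with n distinct values, return the n - 1
    # candidate thresholds, taken here as midpoints between consecutive sorted values.
    v = np.unique(values)                 # sorted, distinct values
    return (v[:-1] + v[1:]) / 2.0         # midpoints between neighbours

cholesterol = [250, 180, 300, 210, 300]   # toy data; duplicates collapse to 4 distinct values
print(candidate_thresholds(cholesterol))  # 3 candidate split points: 195.0, 230.0, 275.0
```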
Now let’s answer some questions based on your learnings so far.