
Comprehension – Decision Tree Classification in Python

Let’s consider the heart disease data set that we discussed in the earlier segment. The data lists various tests conducted on patients, along with some other patient details. Now, given the test results and other attributes, suppose you want to predict whether a person has heart disease or not.

Please download the dataset from below.

To keep this simple and focus on building a decision tree only, we are skipping any data preparation or feature manipulation techniques.

You also need to install Graphviz to visualize the decision tree. The steps to be followed are provided towards the end of the page.

Note

In the video above at timestamp [7:07], Rahim meant 81 records in the test set instead of 189.

So far, you have imported the required libraries, read and inspected the data, and split the entire data set into train and test sets. Let’s now build the decision tree using the default parameters of DecisionTreeClassifier(), except for the tree depth.
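A minimal sketch of this step is given here. It is not the code from the video: the heart disease file is not reproduced on this page, so a synthetic data set from make_classification stands in for it. The 270 rows, 13 features, 70/30 split, depth of 3 and random_state values are all assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the heart disease data (assumed: 270 rows, 13 features).
X, y = make_classification(
    n_samples=270, n_features=13, n_informative=5, random_state=42
)

# Hold out 30% of the rows as the test set (assumed split ratio).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Every hyperparameter is left at its default except the tree depth.
dt = DecisionTreeClassifier(max_depth=3, random_state=42)
dt.fit(X_train, y_train)

print("Train rows:", len(X_train), "| Test rows:", len(X_test))
print("Depth of the fitted tree:", dt.get_depth())
print("Number of leaves:", dt.get_n_leaves())
```

With your own data, X and y would instead come from the DataFrame you read earlier, with the target column separated from the features.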

Now that you have built the decision tree and visualised it using the graphviz library, let’s evaluate how the model performs on unseen data.
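One common way to run this evaluation is to compare accuracy on the train and test sets and to look at the confusion matrix for the test set. The sketch below again uses a synthetic stand-in for the heart disease data (an assumption), so the numbers it prints will not match the ones in the video.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the heart disease data (assumed shape and split).
X, y = make_classification(
    n_samples=270, n_features=13, n_informative=5, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

dt = DecisionTreeClassifier(max_depth=3, random_state=42)
dt.fit(X_train, y_train)

# Predict on both sets so the two accuracies can be compared.
y_train_pred = dt.predict(X_train)
y_test_pred = dt.predict(X_test)

train_accuracy = accuracy_score(y_train, y_train_pred)
test_accuracy = accuracy_score(y_test, y_test_pred)

# Rows are the actual classes, columns are the predicted classes.
test_confusion = confusion_matrix(y_test, y_test_pred)

print("Train accuracy:", round(train_accuracy, 3))
print("Test accuracy:", round(test_accuracy, 3))
print("Test confusion matrix:")
print(test_confusion)
```

A large gap between the train and test accuracy is a sign that the tree has fitted the training data more closely than it generalises.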

You can see that the current model does not perform well on the test set. This is because we built it with the default hyperparameters, changing only the depth. Hyperparameter tuning can improve the performance of decision trees considerably. So, in the upcoming sessions, we will tune these hyperparameters to improve the model and get better predictions.

Question

What are hyperparameters?

Hyperparameters are simply the parameters that we pass on to the learning algorithm to control the training of the model. Hyperparameters are choices that the algorithm designer makes to ‘tune’ the behaviour of the learning algorithm. The choice of hyperparameters, therefore, has a lot of bearing on the final model produced by the learning algorithm.

So, anything that is passed on to the algorithm before it begins its training or learning process is a hyperparameter, i.e., these are the parameters that the user provides and not something that the algorithm learns on its own during the training process. Here, one of the hyperparameters you input was “max_depth”, which limits how many levels of nodes the tree can have from root to leaf. The algorithm does not learn this value on its own during training; it has to be provided by the user. Hence, it is a hyperparameter.
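The distinction can be seen directly on a fitted scikit-learn tree. In the sketch below, get_params() lists what the user set before training, while the root node's split feature and threshold are examples of what the algorithm learned from the data. The synthetic data and the depth of 3 are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the heart disease data (assumed shape).
X, y = make_classification(
    n_samples=270, n_features=13, n_informative=5, random_state=42
)

dt = DecisionTreeClassifier(max_depth=3, random_state=42)

# Hyperparameters: fixed by the user before training starts.
hyperparams = dt.get_params()
print("max_depth (set by the user):", hyperparams["max_depth"])
print("criterion (left at its default):", hyperparams["criterion"])

dt.fit(X, y)

# Learned during training: which feature the root node splits on, and where.
root_feature = dt.tree_.feature[0]
root_threshold = dt.tree_.threshold[0]
print("Root splits on feature index:", root_feature)
print("Root split threshold:", round(root_threshold, 3))
print("Depth actually reached:", dt.get_depth())
```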

Since hyperparameters can take many values, it is essential to determine the values at which the model performs best. This process of optimising hyperparameters is called hyperparameter tuning. You will learn to do that in the next session. First, let’s answer some questions based on your learnings so far.
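As a small preview of the idea, the sketch below tries a handful of max_depth values and keeps the one with the best cross-validated accuracy on the training set. This is only one simple way to do it and is not necessarily the approach used in the next session. The candidate depths, the 5-fold setting and the synthetic data are all assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the heart disease data (assumed shape and split).
X, y = make_classification(
    n_samples=270, n_features=13, n_informative=5, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Illustrative candidate values for the max_depth hyperparameter.
candidate_depths = [1, 2, 3, 4, 5, 6]

# Mean 5-fold cross-validated accuracy for each depth, using training data only.
cv_scores = {}
for depth in candidate_depths:
    model = DecisionTreeClassifier(max_depth=depth, random_state=42)
    cv_scores[depth] = cross_val_score(model, X_train, y_train, cv=5).mean()

best_depth = max(cv_scores, key=cv_scores.get)

for depth, score in cv_scores.items():
    print("max_depth =", depth, "| mean CV accuracy =", round(score, 3))
print("Best depth by cross-validation:", best_depth)
```

The test set is deliberately left out of the loop so that it can still serve as unseen data for the final evaluation.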

You need to use the resultant decision tree structure to answer the following questions.

Installing Graphviz

Python requires the library ‘pydotplus’ and the external software Graphviz to visualise the decision tree. If you are using Windows, then you will need to add Graphviz to the path so that pydotplus can find the dot executable that ships with Graphviz.

Please refer to the user guide and the steps below to install Graphviz.

Steps for Windows users:

  • Download Graphviz from here (ZIP file).
  • Unzip the file and copy the extracted folder into C:\Program Files (x86)\.
  • Make sure your file is unzipped and placed in Program Files (x86).
  • Environment Variable: Add C:\Program Files (x86)\graphviz-2.38\release\bin to the user path.
  • Environment Variable: Add C:\Program Files (x86)\graphviz-2.38\release\bin\dot.exe to the system path.
  • Install the Python Graphviz package – pip install graphviz.
  • Install pydotplus – pip install pydotplus.

Instructions to add the environment variable: click here.

Steps for Mac Users:

  • Use Homebrew to install Graphviz on your Mac:
  • Install Homebrew from here.
  • Run this in the terminal.
  • Install pydotplus – pip install pydotplus.
  • Install the Python Graphviz package – pip install graphviz.
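Once the steps for your operating system are done, a quick way to check the setup is to ask Python whether the Graphviz dot executable can be found on the path. The sketch below uses only the standard library. The helper name find_dot_executable is hypothetical and is not part of Graphviz or pydotplus.

```python
import shutil


def find_dot_executable():
    """Return the full path to Graphviz's `dot` program, or None if it is not on the path."""
    return shutil.which("dot")


dot_path = find_dot_executable()

if dot_path is None:
    print("Graphviz 'dot' was not found on the path; revisit the installation steps.")
else:
    print("Graphviz 'dot' found at:", dot_path)
```

If dot is not found right after you edit the environment variables, restarting the terminal or the notebook server usually makes the new path visible.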

Alternative method for Graphviz

If you face issues visualizing a decision tree with the Graphviz software, you can instead use the plot_tree function from sklearn.tree. You can read more about the function in the documentation provided here.

The code for visualizing a decision tree using plot_tree along with the output has been provided below:
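The original code and output are not reproduced in this text, so what follows is a minimal sketch of how plot_tree is typically used. The synthetic data, the feature names, the class names "No Disease" and "Disease", and the output file name decision_tree.png are all assumptions for illustration.

```python
import matplotlib

matplotlib.use("Agg")  # Non-interactive backend so the script also runs without a display.
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, plot_tree

# Synthetic stand-in for the heart disease data (assumed shape).
X, y = make_classification(
    n_samples=270, n_features=13, n_informative=5, random_state=42
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]  # Hypothetical names.

dt = DecisionTreeClassifier(max_depth=3, random_state=42)
dt.fit(X, y)

# Draw the fitted tree; filled=True colours each node by its majority class.
fig, ax = plt.subplots(figsize=(16, 8))
annotations = plot_tree(
    dt,
    feature_names=feature_names,
    class_names=["No Disease", "Disease"],  # Assumed labels for classes 0 and 1.
    filled=True,
    ax=ax,
)

# Save the figure to a file (hypothetical file name).
fig.savefig("decision_tree.png")
print("Number of boxes drawn:", len(annotations))
```

In a Jupyter notebook, you can drop the matplotlib.use("Agg") line and the savefig call; the figure is then displayed inline below the cell.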

Additional Readings:

Parameters and Hyperparameters
