So far, you have learnt about splits for discrete target variables. But how is splitting performed for continuous output variables? You calculate the weighted mean square error (WMSE) of data sets (before and after splitting) in a similar manner as you do for linear regression models. In this video, you will learn about this in detail.

In decision tree regression, each leaf represents the average of all the values as the prediction as opposed to taking an majority vote in classification trees. This **average** is calculated using the following formula:

$$yt=\frac1{N_t}{\textstyle\sum_{i?D_t}}y^{(i)}$$

This is nothing but the sum of all data points divided by the total number of data points.

Also, the **impurity measure** for a given node is measured by the **weighted mean square error (WMSE)**, also known as** variance**, which is calculated by the following formula:

MSE(t)=\frac1{N_t}{\textstyle\sum_{i?\mbox{\large$D$}_t}}\begin{pmatrix}y^{(i)}-\widehat y&t\end{pmatrix}

This is nothing but the variance of all data points.

A higher value of MSE means that the data values are dispersed widely around mean, and a lower value of MSE means that the data values are dispersed closely around mean and this is usually the preferred case while building a regression tree.

The regression tree building process can be summarised as follows:

- Calculate the MSE of the target variable.
- Split the data set based on different rules obtained from the attributes and calculate the MSE for each of these nodes.
- The resulting MSE is subtracted from the MSE before the split. This result is called the
**MSE reduction**. - The attribute with the largest MSE reduction is chosen for the decision node.
- The dataset is divided based on the values of the selected attribute. This process is run recursively on the non-leaf branches, until you get significantly low MSE and the node becomes as homogeneous as possible.
- Finally, when no further splitting is required, assign this as the leaf node and calculate the average as the final prediction when the number of instances is more than one at a leaf node.

So, you need to split the data such that the weighted MSE of the partitions obtained after splitting is lower than that obtained with the original or parent data set. In other words, the fit of the model should be as ‘good’ as possible after splitting. As you can see, the process is surprisingly similar to what you did for classification using trees.

This module includes an optional demonstration on regression trees for those who want to explore and understand the process in detail.