Experienced users of decision trees have long appreciated that decision trees are often not impressive performers when it comes to regression. This does not in the least suggest that regression trees are not valuable analytical tools. As always, they are fabulous for gaining insight into data, making rapid out-of-the-box progress even when working with highly flawed data, detecting hidden but important flaws in the data, and identifying valuable predictors. Regression trees are among the most useful of tools during exploratory data analysis, when the modeler is struggling to understand the data and elicit the dominant predictive patterns. This is especially true when the data is strewn with missing values, as the CART regression tree user will not need to do any special data preparation to deal with them: CART will handle the missing values effectively. But regression trees (at least single regression trees) often yield lower predictive accuracy than other methods, in part because they produce a rather limited number of distinct predictions. All records falling into a specific terminal node of a regression tree share the same prediction, lumping all broadly similar records into the same predictive bucket. Regression trees suffer from one further problem that is rarely appreciated: because the criterion used to build the model is the same as the criterion used to assess its performance, regression trees have an enhanced tendency to overfit the training data. (More on this latter point later.)

There have been many attempts to improve the performance of regression trees (as measured by test set mean squared error). These include fitting a simple one-predictor linear regression to the data in each terminal node, or developing some other form of model that would allow distinct records in the same terminal node to receive distinct predictions. Jerome H. Friedman's MARS (Multivariate Adaptive Regression Splines) was created explicitly to improve on the CART regression tree, borrowing as many CART ideas as possible and importing them into the world of linear regression. Here we discuss another interesting approach that is worth consideration: converting the regression problem into a classification problem by first binning the target variable into a moderate number of bins. While this method does not address the problem of producing a limited number of distinct predictions, it can diminish the potential overfitting and generate alternative insights into the data.
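The core idea can be sketched outside SPM as well. Below is a minimal illustration using scikit-learn as a stand-in for the CART engine; the synthetic data, bin count, and tree settings are our assumptions for illustration, not anything produced by SPM:

```python
# Sketch: convert a regression problem into a classification problem by
# binning the continuous target into a handful of classes.
# scikit-learn stands in for SPM's CART engine; the data is synthetic.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(400, 3))          # three synthetic predictors
y = 2.0 * X[:, 0] + rng.normal(0, 1, 400)      # continuous target

# Bin the target into 5 roughly equal-frequency classes (labels 1..5)
edges = np.quantile(y, [0.2, 0.4, 0.6, 0.8])
y_class = np.digitize(y, edges) + 1

# Fit a classification tree to the binned target instead of a regressor
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y_class)
```

The classifier now optimizes a classification criterion (here, Gini impurity) rather than the squared-error criterion used both to grow and to evaluate a regression tree, which is the decoupling the binning approach exploits.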

This discussion touches on two important topics that deserve far more coverage than we offer here, namely, binning of variables in SPM and using the FORCE mechanism to impose a predetermined pattern of splits on a new analysis that might even include a new target variable. The discussion is organized as follows:

- Brief comments on the data set used for illustrative purposes
- Using SPM to bin a continuous dependent (target) variable
- Building a multi-class tree
- Using cost-sensitive learning to shape the tree
- Extracting the structure of the tree into a set of FORCE commands
- Applying the FORCE commands to the original continuous dependent variable to obtain regression performance statistics

**1)** We use the 1970s-era BOSTON.CSV housing data set, which records the median value of homes within each of 506 census tracts in the greater Boston area. In addition to house price information, the authors who assembled the data merged information from a variety of disparate publicly available sources, including crime statistics, pollution measures, school quality, and zoning characteristics of the neighborhoods. The full list of variables and descriptions appears below:

Goal: Study the relationship between quality-of-life variables and property values.

| Variable | Description |
| --- | --- |
| MV | Median value of owner-occupied homes in tract ($1,000s) |
| CRIM | Per capita crime rate |
| NOX | Concentration of nitric oxides (parts per 10 million) |
| AGE | Percent of homes built before 1940 |
| DIS | Weighted distance to centers of employment |
| RM | Average number of rooms per house |
| LSTAT | % lower status of the population |
| RAD | Accessibility to radial highways |
| CHAS | Borders Charles River (0/1) |
| INDUS | Percent non-retail business |
| TAX | Property tax rate per $10,000 |
| PT | Pupil-teacher ratio |

More detail is available in our presentation on the Evolution of Regression and in the original article by Harrison and Rubinfeld. On our website we have made available a special version of this data set that augments the original data with a SAMPLE$ variable coded as either “Learn” or “Test” (BOSTON_LT.CSV). The variable serves as a way to ensure that we always use the same Learn and Test partitions as we move through a variety of models and operations. (Usually SPM will look after this for you automatically so this is just a precaution.)
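If you are preparing such a partition flag yourself, the idea is simply a reproducible random assignment. A hedged sketch in pandas follows; the column name "SAMPLE", the 80/20 proportions, and the seed are our choices for illustration, not SPM conventions:

```python
# Hedged sketch of building a fixed Learn/Test flag like SAMPLE$.
# The data frame is a synthetic stand-in for BOSTON.CSV (506 tracts).
import numpy as np
import pandas as pd

df = pd.DataFrame({"MV": np.linspace(5.0, 50.0, 506)})  # 506 tracts
rng = np.random.default_rng(42)                          # fixed seed => same split every run
df["SAMPLE"] = np.where(rng.random(len(df)) < 0.20, "Test", "Learn")
```

Because the seed is fixed, every model built against this file sees exactly the same Learn and Test records, which is the point of carrying the flag in the data itself.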

**2)** To bin a variable inside SPM we simply select "Data Binning" as the Analysis Method on the GUI model setup dialog. Below, observe that we have checked off only the variable MV, the continuous variable we intend to bin. For our current purposes we have also selected "No independent testing" on the Testing tab. There may be times when you want to bin using only information from the learning data, but we think that is not necessary for this example and thus use all available data.

Visiting the “Binning” tab we see the next display:

In our example, since we wish to bin only our intended dependent variable, we have two choices of binning method:

a) Equal-sized bins (equal number of training set observations in each bin)

b) CART self-guided binning, in which the CART machinery is used to extract a data-driven set of bin sizes (usually not equal)

We generally prefer the latter and have shown it selected in the screen shot.

For any binning method we must specify the "ideal" number of bins (user choice) and whether, when the ideal number cannot be generated, we prefer the closest number larger than, or smaller than, the ideal. We usually elect to go with fewer bins.

**3)** Selecting the number of bins wisely is important. If we are going to use a test method (cross-validation or an explicit test partition) we must ensure that every level of the new, binned target appears in the test partition (or in every cross-validation fold). The simplest way to assure this is to work with a small number of bins, and here we have elected to work with 5 bins for our target. While so few distinct target levels will limit how refined the tree's predictions can be, this choice will still yield good results.
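As a rough illustration of why a small bin count helps, here is a sketch using pandas' `qcut` as a stand-in for SPM's equal-sized binning; the synthetic MV values and the 20% test fraction are our assumptions for illustration:

```python
# Sketch: 5 equal-frequency bins via pandas qcut (a stand-in for SPM's
# "equal sized bins"; SPM's CART-guided binning has no direct analog here).
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
mv = pd.Series(rng.normal(22, 9, 506)).clip(5, 50)   # rough stand-in for MV
mv_bin = pd.qcut(mv, q=5, labels=[1, 2, 3, 4, 5])    # ~101 records per bin

# With only 5 well-populated levels, a random test partition almost
# certainly contains every level of the binned target.
test_mask = rng.random(506) < 0.20
bins_in_test = set(mv_bin[test_mask].unique())
```

With many thin bins, by contrast, the chance that some level is entirely absent from the test partition (or from a cross-validation fold) rises quickly.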

Finally, do not forget to request a saved output data set including all of your original input data and the new, binned versions of MV.

Clicking on the “START” button generates this report, listing the size of each bin, the bin boundary separating each bin from its neighbor, and the mean, min, and max values of MV in each bin.

MV - 5 bins (Source)

| Bin | N | W | % | Cut Point | Mean | StdDev | Min | Max |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 134 | 134.00 | 26.48 | 17.35000 | 12.95896 | 3.06069 | 5.00000 | 17.30000 |
| 2 | 138 | 138.00 | 27.27 | 21.75000 | 19.71522 | 1.20249 | 17.40000 | 21.70000 |
| 3 | 128 | 128.00 | 25.30 | 27.70000 | 23.83594 | 1.46070 | 21.80000 | 27.50000 |
| 4 | 74 | 74.00 | 14.62 | 39.25000 | 32.19730 | 2.97588 | 27.90000 | 38.70000 |
| 5 | 32 | 32.00 | 6.32 | . | 47.21250 | 3.39390 | 39.80000 | 50.00000 |

To move forward we now open the new dataset, which will contain all the original variables and two new binned variables:

- MV_BIN: an integer bin code (1, 2, 3, 4, 5)
- MV_BINNED: the mean value of MV in each bin
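A hypothetical sketch of what these two columns represent, using the cut points from the binning report above (the six MV values are made up, and pandas stands in for SPM's saved output data set):

```python
# Hypothetical reconstruction of MV_BIN and MV_BINNED from the report's
# cut points; the MV values below are illustrative, not real tract data.
import pandas as pd

df = pd.DataFrame({"MV": [12.0, 16.0, 20.0, 25.0, 45.0, 48.0]})
cuts = [0.0, 17.35, 21.75, 27.70, 39.25, 50.0]        # report cut points
df["MV_BIN"] = pd.cut(df["MV"], bins=cuts, labels=[1, 2, 3, 4, 5]).astype(int)
df["MV_BINNED"] = df.groupby("MV_BIN")["MV"].transform("mean")
```

Every record in a bin carries that bin's mean of MV in MV_BINNED, which is why the bin means can be read directly off the classification output later.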

If we now move to generate classification trees we can use either version of the binned MV; both yield the same results. We will work with MV_BINNED because we can then read the bin means directly from the output. On the model setup dialog we:

- Select the target variable: MV_BINNED
- Select the predictors: We need to exclude the original target MV, the binned version MV_BIN, and the SAMPLE$ variable.
- Note that 13 variables have been selected as predictors
- Analysis type is “Classification”
- MV is selected as an AUX variable (optional)

The selection of "Classification" is of course just a device: our ultimate objective is to develop a regression model.

The testing tab also needs to be visited to use the SAMPLE$ separation variable.

Clicking “Start” will generate the following multi-class classification tree:

We see an 8-node tree but no easy way to evaluate the performance of this as a regression model. Going to the “Summary” (click on the Summary button at the bottom of the navigator), the Prediction Success (Confusion Matrix) tab shows:

We have chosen the test sample and we display ROW percentages. Of course we would like to have seen near 100% in each element of the diagonal but the results appear to be largely on the right track.

Another perspective on performance is to go to the “Profile” tab (this tab will only be created if there are AUXILIARY variables requested). Here we select options as indicated in the image below:

Note the “Test” sample has been selected, and the average value (within each terminal node) of the profile variable (MV) has been selected for display in the bar chart. Finally, to get the above display you must click on the column header “Avg. Sum Learn” to sort the bars by average value of MV in the learn sample. We see that the terminal nodes also order correctly for the test data. (No test data ever reached terminal 2, a hazard when working with rather small test partitions.)

All very interesting, but what does this have to do with regression? An easy way to render this as a regression model is to use the mean of MV in each terminal node as the prediction made by our re-purposed classification tree. Any record reaching terminal node 7 of our tree will be predicted to have an MV of 13.47 (see the first row of the table in the above display). Any record reaching terminal node 6 will be predicted to have an MV of 18.45, and so on for every node in the tree. But of course we would like to know how this model performs on test data, and we would like to generate all these predictions automatically.
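In scikit-learn terms, this node-mean trick looks like the sketch below: fit a classification tree to the binned target, then predict each record's MV as the mean of MV among training records in its terminal node. Everything here, data included, is a synthetic stand-in for the SPM workflow:

```python
# Sketch of the node-mean trick with scikit-learn as a stand-in for SPM:
# fit a classification tree to a binned target, then predict the continuous
# target as the mean of MV among training records in each terminal node.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(500, 3))
mv = 3.0 * X[:, 0] + rng.normal(0, 1, 500)            # continuous target
mv_class = np.digitize(mv, np.quantile(mv, [0.25, 0.5, 0.75]))

clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, mv_class)

leaf = clf.apply(X)                                    # terminal node per record
leaf_mean = {n: mv[leaf == n].mean() for n in np.unique(leaf)}
pred = np.array([leaf_mean[n] for n in leaf])          # node-mean predictions

mse = np.mean((mv - pred) ** 2)
r2 = 1.0 - mse / np.var(mv)
```

In the SPM workflow these same quantities come from the Profile tab and from re-running the tree structure as a regression model; here they are computed by hand, and on the training data only.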

Now comes the trick that allows us to re-purpose any tree to any new use! This includes taking a classification tree and using its structure (sequence of splits) as a regression tree, which is what we want to do now. Here are the simple steps:

**(1)** On the navigator display, right-click on the root node and select "Node plus children" (see below)

This will open a new SPM notepad window containing the following plain-text commands:

```
REM Node 1, depth 1, Left Child = 2, Right Child = -8
FORCE ROOT ON RM AT 7.437000
REM Node 2, depth 2, Parent = 1, Left Child = 3, Right Child = 7
FORCE L ON LSTAT AT 14.434999
REM Node 3, depth 3, Parent = 2, Left Child = 4, Right Child = -5
FORCE LL ON RM AT 6.543000
REM Node 4, depth 4, Parent = 3, Left Child = -1, Right Child = 5
FORCE LLL ON LSTAT AT 7.650000
REM Node 5, depth 5, Parent = 4, Left Child = -2, Right Child = 6
FORCE LLLR ON DIS AT 1.227150
REM Node 6, depth 6, Parent = 5, Left Child = -3, Right Child = -4
FORCE LLLRR ON RM AT 6.318000
REM Node 7, depth 3, Parent = 2, Left Child = -6, Right Child = -7
FORCE LR ON CRIM AT 0.614845
```

These commands dictate the sequence of splits seen in the above tree. Our trick is to now use those splits but in the context of a *regression* tree (remember the above tree is a classification tree).

**(2)** Return to the Model Setup dialog, switch the target variable back to the continuous variable MV, and select Regression as the "Analysis Type".

**(3)** From the Notepad window containing the FORCE commands either use the keyboard shortcut CTRL-W or from the File Menu select “Submit Window”. This locks the entire tree structure in place.

**(4)** Finally, run the model via the START button on the model setup dialog, via the toolbar icon shown below, or by adding CART GO as a final command at the bottom of the Notepad containing the FORCE commands before submitting.

If you have set up everything correctly you will now obtain the regression tree below, with a test sample R-Squared of .713 and a test MSE of 17.54:

**Final Observations**

In our example above we extracted the FORCE commands for the optimal tree generated by the classification model. We actually should have extracted the force commands from a somewhat *larger* tree in order to permit more flexibility in selecting what is optimal from the perspective of the regression tree. This could well yield an optimal regression tree that is somewhat more evolved and with better performance.

Thanks, and let us know how you liked this tutorial!