Simply Salford Blog

One Tree for Several Targets? Vector CART for Regression

Posted by Dan Steinberg on Wed, May 15, 2013 @ 04:39 AM

There are several tricks available for maneuvering CART into generating a single tree structure that will output predictions for several different target (dependent) variables in each terminal node. For CART the idea seems very natural in that the structure of the model is just a segmentation of the data into mutually exclusive and collectively exhaustive segments. If the segments of a CART tree designed for one target variable have been well constructed, then the segments could easily be relevant for the prediction of many outcomes. A segmentation (CART tree) based on common demographics and Facebook likes, for example, could be used to predict consumption of tuna fish, frequency of cinema visits, and monthly hair stylist spend. Of course, the question is: could a common segmentation in fact be useful for three such diverse behaviors, and, if such a segmentation existed, would we be able to find it?

AUXILIARY VARIABLES

The first crude approach to common segmentation is simply to select one of the behaviors as the lead target variable and to list all the other “co-targets” as AUXILIARY variables. The method is crude because the tree will be optimized for the segmentation of just one behavior, and we may find that there are few useful differences in the other “co-targets” across the tree terminal nodes (segments). To the extent that the behaviors are indeed similar, the predictive accuracy for any one of the targets should be similar to the others. But without taking the other co-targets into account it is difficult to imagine that this approach could frequently be successful. Nevertheless, using AUXILIARY variables can reveal interesting patterns in data, and we try it first below.

Starting with GOODBAD.CSV as an example data set, we consider CREDIT_LIMIT and NUMCARDS as our set of “co-targets” and set up the following model:

CATEGORY
MODEL CREDIT_LIMIT
KEEP AGE, EDUCATION$, GENDER, HH_SIZE, INCOME, MARITAL$,
     N_INQUIRIES, OCCUP_BLANK, OWNRENT$, POSTBIN, TIME_EMPLOYED
AUXILIARY NUMCARDS

which we set up using the GUI dialogs as shown below.  The important details are:

  • Analysis type is set to REGRESSION
  • CREDIT_LIMIT is chosen as the target variable (we may only have one target)
  • NUMCARDS is selected as AUXILIARY
  • NUMCARDS is not selected as a predictor
  • TARGET is not selected as a predictor (to keep the analysis simpler)

We also set LIMITS ATOM=10, MINCHILD=5 and use 10-fold cross validation for testing given the small size of the data set.  The GUI model setup dialog below shows several (but not all) of these settings.

[Figure: Model Setup dialog for the CREDIT_LIMIT tree]
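For readers who want to experiment outside SPM, a rough scikit-learn analogue of this setup is sketched below. It is a sketch only: the mapping of ATOM and MINCHILD onto min_samples_split and min_samples_leaf, and the dummy-coding of the categorical predictors, are our assumptions rather than anything SPM does internally.

import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Load the example data set used throughout this post.
df = pd.read_csv("GOODBAD.CSV")
predictors = ["AGE", "EDUCATION$", "GENDER", "HH_SIZE", "INCOME", "MARITAL$",
              "N_INQUIRIES", "OCCUP_BLANK", "OWNRENT$", "POSTBIN", "TIME_EMPLOYED"]

# Dummy-code the categorical columns (a crude stand-in for CART's native handling).
X = pd.get_dummies(df[predictors])
y = df["CREDIT_LIMIT"]

# ATOM=10 and MINCHILD=5 are mapped, by assumption, onto the two size limits below.
tree = DecisionTreeRegressor(min_samples_split=10, min_samples_leaf=5, random_state=0)
print(cross_val_score(tree, X, y, cv=10, scoring="r2").mean())  # 10-fold cross-validation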

The resulting model is captured in the navigator below:

[Figure: Navigator for the CREDIT_LIMIT tree]

and the Summary button leads us to the full set of performance stats for both the LEARN and TEST (cross-validated) partitions:

[Figure: Summary performance statistics for the LEARN and TEST partitions]

The Terminal Nodes tab contains a graph of the spread of values for CREDIT_LIMIT in each of the terminal nodes. This is hardly an impressive model, but it does create a rough segmentation with generally increasing credit limits across the nodes.

[Figure: Terminal Nodes display of the CREDIT_LIMIT spread by node]

To view a display for our other “co-target” we can click the “Profile” tab of the Summary display, select NUMCARDS as the profile variable to display, and click the “Average” button underneath the graph.

[Figure: Profile display of NUMCARDS averages by terminal node]

Interestingly enough, NUMCARDS shows a distinct pattern across these nodes even though the variable was explicitly left out of the analysis and tree construction. As an auxiliary variable it is only involved in the summary reports. Having ignored this variable while growing the tree, we then observe how the tree segmentation works for it as an auxiliary variable. It is important to keep in mind that the nodes are sorted in a different order for NUMCARDS than they are for CREDIT_LIMIT, so even if the segmentation induced by our tree works well for both variables, the highest credit limit segment is not necessarily the segment with the highest number of cards. But this does not diminish the potential value of the segmentation.
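The same auxiliary-variable profiling can be imitated outside SPM in a couple of lines: fit the tree on the lead target only, then group the data by terminal node and average the co-target. A sketch, continuing the hypothetical scikit-learn setup above:

# NUMCARDS plays no role in tree construction; we only profile it afterwards.
tree.fit(X, y)

# apply() returns the terminal (leaf) node id for every row; averaging the
# co-target within each leaf mimics the Profile tab's "Average" view.
leaves = tree.apply(X)
print(df.groupby(leaves)["NUMCARDS"].mean().sort_values())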

You can use this approach with any number of AUXILIARY variables, but you must walk through the results variable by variable in the Summary/Profile window to see the results.

FORCING TREE STRUCTURE Version 1

This method is available only in SPM 7.1, which is expected to ship in September 2013. SPM 7.0 offers a similar approach, as we will discuss below, but the 7.1 method offers more control. We start with a model built on one of the co-targets and, if necessary, prune the tree to our preferred size. Here we start with the CART tree for CREDIT_LIMIT shown earlier, with six terminal nodes. Now right-click on the root node in the navigator (first green arrow) and select the option highlighted in the submenu (second green arrow). Clicking the highlighted item brings up a new SPM Notepad containing all the commands needed to reproduce the splitting logic of this tree, including the surrogate logic.

[Figure: Navigator right-click menu for extracting the tree structure]

The window that comes next is not necessarily one that you will want to read or study: it is SPM programming code that captures the structure of the CART tree. We want this structure because we want to impose it on our other co-targets, in this case the variable NUMCARDS. From the command line, TRANSLATE LANGUAGE=TOPOLOGY yields the same output.

[Figure: SPM Notepad showing the generated FORCE commands]

Above, we only show the top part of the FORCE commands; the notepad window contains full details for every node that is to be split. It is important to understand what we have and have not done here.  All that the FORCE commands embody are the splitting rules and their surrogates. They do not contain any information about any target and they do not contain any form of prediction. 

If we now supply a new target variable (NUMCARDS) by adding a command at the top such as

MODEL NUMCARDS

and also add a

CART GO

at the bottom, the tree we obtain will be a tree about NUMCARDS, and every node in the tree and every report produced will be about NUMCARDS. This second tree, while totally unaware of CREDIT_LIMIT, will nonetheless follow the exact logic of the CREDIT_LIMIT tree, thanks to the FORCE commands.

[Figure: FORCE commands with MODEL NUMCARDS and CART GO added]
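There is no direct FORCE equivalent in scikit-learn, but the effect can be approximated: keep the partition induced by the first tree fixed and simply recompute each terminal node's prediction from the new target. A minimal sketch, again building on the code above:

# Reuse the CREDIT_LIMIT tree's segments; re-estimate each terminal node's
# prediction as the mean of NUMCARDS within that node.
y2 = df["NUMCARDS"]
leaves = tree.apply(X)
node_means = y2.groupby(leaves).mean()
pred = node_means.loc[leaves].to_numpy()   # every row inherits its node's mean

# Resubstitution R-squared for the new target under the forced structure.
rss = ((y2.to_numpy() - pred) ** 2).sum()
tss = ((y2 - y2.mean()) ** 2).sum()
print("R-squared for NUMCARDS:", 1 - rss / tss)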

In order to get exactly what we want we need to insert one more SPM 7.1 command at the top:

FORCE STRICT

This grows the tree to the exact size and structure of our original six-node tree and then stops, ensuring that the tree we want is available. In SPM 7.0, the tree will grow as large as possible, which means that it may grow freely well beyond the splits governed by the FORCE commands. SPM CART will then generate a pruning sequence that does not necessarily include the tree we actually want to impose on NUMCARDS. (The explanation of this point is quite technically involved and requires a good understanding of the pruning process, but it can be ignored for now.) The SPM 7.1 generated tree is:

[Figure: SPM 7.1 tree for NUMCARDS forced to the CREDIT_LIMIT structure]

which, as you can see, has exactly the same topology as our first tree. (It also has the exact same splitters and surrogate splitters.) Now, looking at the terminal nodes display on the Summary report we get:

[Figure: Terminal node display for the forced six-node NUMCARDS tree]

which illustrates the poor R-squared result. While the medians show an increasing pattern across the nodes, the box plots show considerable overlap of values, even between the highest and the lowest nodes.

So what have we accomplished? Here we get an A for effort but not much more. Our idea of obtaining a common structure by arbitrarily picking one of the variables to extract a template, which is then applied to the other co-targets, may not work well in practice. Obviously, if we have built separate models for several targets and have observed that the trees exhibit similarities, we could try this method, at least for the targets with similar trees, with a greater chance of achieving a satisfactory outcome. Is there another way? There are in fact at least two other ways forward, both of which try to generate a best common tree from the beginning.

FACTOR ANALYSIS, PRINCIPAL COMPONENTS, AND SINGULAR VALUE DECOMPOSITION

In this approach, instead of developing a model on any one of our original target variables, we proceed by creating a synthetic target that “represents” all of the co-targets.  This method can be applied to any set of continuous targets and there is no limit on the number of targets we can work with.  However, we need to keep in mind that the more co-targets, the less likely that a single synthetic target will do a good job of representing them all.  There are many ways to describe what we are doing, but the essential idea is to represent the set of different co-targets by a single new variable that incorporates as much information as possible about each of the actual co-targets.  You could think of what we are going to create as an optimally weighted average of all the co-targets (with negative weights allowed).

In SPM 7.0 we need to use the command processor to create the needed synthetic target as follows:

[Figure: SPM 7.0 commands for creating the SVD vectors]

This will create a new data set containing all of the original variables and two new variables: SVLIST_1_1 and SVLIST_1_2. (With K variables we can generate at most K SVD vectors; enter HELP SVD as a command for more information.) These are the vectors representing our list of co-targets. As an aside, if you run a regression of either co-target on these two new variables you will get an R-squared of 1.00, a perfect reconstruction of the original variable. We will only be using SVLIST_1_1 as our synthetic target; running a regression of each original co-target on this variable yields an R-squared of 0.79. In other words, our synthetic target here appears to be a good representation of both original co-targets.
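The synthetic target can also be sketched in NumPy: stack the co-targets into an n-by-2 matrix and take its leading singular vector. Whether SPM's SVD command standardizes the variables first is our assumption; the variable names below simply echo the post.

import numpy as np

# Co-target matrix; standardizing puts the two variables on an equal footing
# (our assumption about the preprocessing).
Y = df[["CREDIT_LIMIT", "NUMCARDS"]].to_numpy(dtype=float)
Yz = (Y - Y.mean(axis=0)) / Y.std(axis=0)

# Thin SVD: with 2 co-targets we get at most 2 vectors.
U, s, Vt = np.linalg.svd(Yz, full_matrices=False)
svlist_1_1 = U[:, 0] * s[0]   # leading vector, used as the synthetic target
svlist_1_2 = U[:, 1] * s[1]

# Echoes the aside in the post: both vectors together reconstruct each
# co-target exactly (an R-squared of 1.00).
print(np.allclose(U @ np.diag(s) @ Vt, Yz))   # True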

Having saved the new variables, we now open the new data set and choose SVLIST_1_1 as our target, with the same KEEP list of 11 predictors we used earlier for our solo CREDIT_LIMIT regression tree. Observe that we have selected regression as the “Analysis Type” and have, for convenience, listed CREDIT_LIMIT and NUMCARDS as AUXILIARY variables. Remember that AUXILIARY variables play no role whatsoever in tree construction; they appear only in node-specific reports as added descriptive information.

[Figure: Model Setup with SVLIST_1_1 as target and the co-targets as auxiliary variables]

Recall that we also set LIMITS ATOM=10, MINCHILD=5 on the “Limits” tab of the Model Setup dialog, shown below.

[Figure: Limits tab showing the minimum node size settings]

Now we are ready to run the new model on the synthetic target SVLIST_1_1 to get:

[Figure: Navigator for the SVLIST_1_1 tree]

Here we have used the left-side button (indicated by the green arrow below) controlling the lower-panel error display to highlight all the tree sizes CART considers statistically equivalent in terms of test sample performance. We use this to guide us to the smallest tree within this range of equally good trees.

[Figure: Navigator with statistically equivalent tree sizes highlighted]

Clicking the Summary button takes us to the Profiles for the variables we really care about: CREDIT_LIMIT and NUMCARDS.  These displays follow:

[Figure: Profile display for CREDIT_LIMIT]

[Figure: Profile display for NUMCARDS]

While it is not easy to assess these graphs formally, we can see that the common tree does appear to induce some discrimination among the terminal nodes (or segments). But this is not our end point in this analysis; the graphs are just rough guides to what might have been accomplished.

The next step is to use the FORCE mechanism exactly as we did before.  But instead of forcing the structure optimized for one variable onto another we force the structure optimized for the synthetic target variable.  To the extent that the co-targets can be segmented in the same way, and to the extent that our synthetic target captures the essence of our original co-targets, the method should work well.

Keep in mind that the tree structure we are now imposing on our actual co-targets does not come from any of the original co-targets; it comes from a CART tree grown on the synthetic variable SVLIST_1_1, which is intended to “represent” all of our co-targets. Below we show the tree that results when we force this structure on CREDIT_LIMIT, and observe that we get a test sample (cross-validated) R-squared of 0.4053. This is actually better than a direct CREDIT_LIMIT model achieves, but it is not clear this measurement is entirely convincing. We would prefer a separate test sample to get an unambiguous performance measure.

[Figure: Navigator for the forced CREDIT_LIMIT tree]

From the Summary display we show the terminal node box plots.

[Figure: Terminal node box plots for CREDIT_LIMIT (summary for 7 nodes)]

Now we do the same with NUMCARDS to obtain a tree with the identical structure but for the different target:

[Figure: Navigator for the forced NUMCARDS tree]

While the R-squared here appears to be low, the forced-structure tree actually performs better (as cross-validated) than the direct model (0.1023 vs. 0.0615). The terminal node box plots show little discriminatory power (except for the leftmost terminal node):

[Figure: Terminal node box plots for NUMCARDS (summary for 6 nodes)]

The SVD method shows promise and it is relatively easy to execute.  In SPM 7.0 the steps are:

  • Select the continuous variables you hope to analyze jointly
  • Run the SVD command on those variables (this will create the principal vectors)
  • Run CART with the first vector as the target, making sure to list all of the original co-targets as AUXILIARY variables
  • Browse the profile displays for the common CART tree to assess the potential of the results.

If you SCORE your data using the common CART tree, SPM will create a NODE variable with values 1, 2, …, T, where T is the number of terminal nodes of your preferred-size tree. These are the cluster assignments, which you can use for any further analysis or reporting. This is all straightforward in SPM 7.0.
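In the scikit-learn sketches above, the analogous step is one line of bookkeeping: renumber the leaf ids returned by apply() into consecutive labels 1 through T. (NODE is SPM's name for the scored variable; the renumbering below is ours.)

import numpy as np

# Map scikit-learn's internal leaf ids onto cluster labels 1..T, mimicking
# the NODE variable that SPM's SCORE step creates.
leaves = tree.apply(X)
relabel = {node_id: k + 1 for k, node_id in enumerate(np.unique(leaves))}
df["NODE"] = [relabel[n] for n in leaves]
print(df["NODE"].value_counts().sort_index())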

To produce the identical tree and navigator for every target, you can regenerate the tree for each target using the FORCE commands that capture the structure of the common tree. To guarantee that CART will not prune away branches of this structure, you must use SPM 7.1 and the FORCE STRICT command.

BREIMAN VECTOR CART

In 1992, Leo Breiman suggested another way to develop the optimal single CART tree for a set of co-targets. Here we discuss the idea as it applies to continuous co-targets, and we are going to implement it manually. Starting again with our GOODBAD.CSV data set and the CREDIT_LIMIT and NUMCARDS variables, we are going to create a new data set which is essentially a replicated (copied) version of the original data. Given that we have two co-targets, we are going to work with two replicates of the data, one stacked on top of the other.

Imagine that our main data set looks like this:

X1   X2   X3   Y1   Y2   COPY
11   12   13   14   15   1
21   22   23   24   25   1
31   32   33   34   35   1

Here the X1, X2, and X3 columns are our common predictors and Y1 and Y2 are our co-targets. We now want to stack two copies of this data on top of each other, while also adding one new column, NEWY, as follows:

X1   X2   X3   Y1   Y2   COPY   NEWY
11   12   13   14   15   1      14
21   22   23   24   25   1      24
31   32   33   34   35   1      34
11   12   13   14   15   2      15
21   22   23   24   25   2      25
31   32   33   34   35   2      35

Notice that NEWY is equal to Y1 when COPY=1 but equal to Y2 when COPY=2. In other words, we have “stacked” Y1 and Y2 on top of each other. This is the essence of the trick: NEWY will be our target variable, and it will literally contain both co-targets. If we wanted to work with 10 targets, we would need 10 stacked copies of the data, and NEWY would be constructed so that its values were taken from a different Y for each value of COPY.

In general, we need to take care of one other detail if we want to allow each co-target to have the same “weight” in the overall results: each co-target needs to be adjusted to have the same mean and variance prior to stacking the data. The easiest thing to do, naturally, is to standardize each co-target to mean 0 and standard deviation 1. Using the View menu to reach the descriptive stats, we see that CREDIT_LIMIT has a mean of 14967 and a standard deviation of 20700, while NUMCARDS has a mean of 1.8012 and a standard deviation of 1.8388. We can use this information to standardize these two variables using SPM's data prep services. (From the menus, click the Activity Window icon and then Data Prep, or just use the keyboard shortcut CTRL-N.)

[Figure: Data Prep shortcut and data viewer in SPM]

Below we show the BASIC script that will create the two parts of the data set we require:

[Figure: BASIC script in SPM creating the two standardized data parts]

This creates the two parts we need for the new data set as described above. Now we need to concatenate them into one file, which we can do in many different ways; you can use our command-line utility, or cut and paste in Excel, for example. We saved the “stacked” data into a new file called GOODBAD_VECTOR.CSV. The last steps require us to assemble the new stacked target variable with the BASIC programming statements below and then run the model:

[Figure: BASIC statements assembling the stacked target variable]
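For readers working outside SPM, the whole data prep step above (standardize, replicate, stack, save) can be sketched in a few lines of pandas; the z-scores are computed from the data rather than typed in from the rounded figures quoted earlier, and the output file name echoes the post.

import pandas as pd

def zscore(col):
    # Standardize to mean 0 and standard deviation 1.
    return (col - col.mean()) / col.std()

# One replicate per co-target, flagged by COPY, with NEWY holding the
# standardized target for that replicate.
rep1 = df.assign(COPY=1, NEWY=zscore(df["CREDIT_LIMIT"]))
rep2 = df.assign(COPY=2, NEWY=zscore(df["NUMCARDS"]))

stacked = pd.concat([rep1, rep2], ignore_index=True)
stacked.to_csv("GOODBAD_VECTOR.CSV", index=False)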

The new CART tree we build uses less a synthetic target than a literal copy of the two targets, one stacked above the other (but standardized to have the same mean of 0 and the same variance of 1). It is therefore the best candidate developed so far for a true “common,” “consensus,” or “vector target” model. Below, we decided to prune the tree back to the smallest tree showing cross-validated test performance within 1 standard error of the literally best-performing tree (in modeling, small is often beautiful). This gives us the 8-node tree, which we take as our multi-target tree.

[Figure: Navigator for the stacked-target CART tree pruned to 8 nodes]
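The 1-standard-error pruning rule just used is easy to state in code: among the candidate tree sizes in the pruning sequence, pick the smallest whose cross-validated error is within one standard error of the minimum. A toy sketch with made-up numbers:

# Each candidate is (terminal nodes, CV error, standard error of that estimate);
# the numbers are illustrative only, not taken from the model above.
candidates = [(17, 0.92, 0.04), (12, 0.90, 0.04), (8, 0.93, 0.05), (4, 1.05, 0.06)]

best_err, best_se = min((err, se) for _, err, se in candidates)
eligible = [size for size, err, _ in candidates if err <= best_err + best_se]
print("1-SE tree size:", min(eligible))   # -> 8 in this toy example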

Displaying the AUXILIARY variables via the Summary…Profile graphs:

[Figure: Profile display for CREDIT_LIMIT]

[Figure: Profile display for NUMCARDS]

The big payoff comes when we use the FORCE facility to regrow the identical 8-node tree above once for each co-target. The tree will look just like the one above (you might need to select the 8-node tree on the navigator if a smaller tree is just as good or better, since CART will assume you want to see the best tree). Now we can display the box plots for the two co-targets from the Summary report's Terminal Nodes tab:

[Figure: Terminal node box plots for CREDIT_LIMIT]

[Figure: Terminal node box plots for NUMCARDS]

Since the box plots display the variability of the target in each terminal node we get a much better feel for the predictive reliability of the common tree when applied to each of the co-targets.

SUMMARY AND CONCLUSION

We went through quite a bit of work here, and the obvious question is: was the game worth the candle? In the example above, probably not. But for other problems, and for circumstances in which we truly must work with a fair number of co-targets, it may well be worth it. We did see that the common tree appeared to be slightly better for both co-targets, and there may be constraints limiting us to just one segmentation for a variety of outcomes. Further, once you understand what needs to be done, automating the process via scripting is possible. Please check back for further FAQ and blog items that might advance the automation of vector CART models; our goal is to make all of this available with a button push before long.


Topics: CART, classification, Regression