Simply Salford Blog

Super-Fast Data Scans With One-Split CART Trees [tutorial]

Posted by Dan Steinberg on Mon, May 5, 2014 @ 10:35 AM

Data analysts are always under pressure to get quick results regardless of the size or complexity of the data.  In this brief note we show how to leverage the “Root Node Splits” report in a single-split CART tree to gain rapid insights into your data.  Our example is based on some fictitious, but highly realistic, financial data.  The data set contains 264,578 records with several potential target variables.  We illustrate our main points here using the variable DEFAULT_90, which flags the 11.46% of customer records associated with being at least 90 days late on a payment; in total we have 50 variables available, which is of course far fewer than we would normally work with for such data.

Upon opening the data set we see:

SuperFact Scans1  

and for the purposes of the example we assume we already know that we want to predict the yes/no outcome DEFAULT_90.  To conduct our quick scan we go to the Model Setup dialog:

SuperFact Scans2

Unlike typical advanced modeling, here we do not need to be concerned with predictor selection, and it is fine to include all possible predictors.  Of course, if you are working with a very large number of records or predictors (hundreds of thousands or millions), it might pay to eliminate those predictors you already know you will never use. In this example we happen to have only 49 available predictors, and we will allow them all into the analysis.

Now we turn to the “Testing” tab, where we want to turn off all testing and go with “Exploratory” mode, as indicated by the boxes we have drawn below.  We would normally want some form of testing, but in this case we are only looking for a description of the data in the form of the elementary predictive power of each variable.

SuperFact Scans3

The last item we must pay attention to is the “Limits” tab: we want to grow a tree with only one split (we call this a tree with a depth of one).  We are not actually interested in growing a tree; instead we just want to obtain a score for each possible predictor (the goodness-of-split score at the top of the tree).  Observe that we have set the maximum depth of the tree to 1.  We could easily obtain the report we are looking for without this limit, but we would end up using much more computer time growing a tree we do not want or need.  Limiting the depth to 1 will vastly speed up this rapid data scan.

SuperFact Scans4

Now we are ready to click the “Start” button to get our minimal CART tree

SuperFact Scans5  

and click on the “Summary” button to get the following display.  We show two screenshots to display the upper and lower portions of the ranking of the predictors.

SuperFact Scans6

 SuperFact Scans7

This display shows all the predictors ranked by their “Improvement” or splitting-power scores.  This score is based on the ability of the predictor to separate the “yes” from the “no” records in just a single simple split, and is thus far from a complete assessment of each variable's power to predict the target if given the opportunity to do so in the best possible way.  But the single-split test is still enormously informative and can help us make rapid early modeling decisions.
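For readers who want to reproduce the idea outside SPM, a rough analogue of the root-node improvement score can be sketched with scikit-learn: fit a depth-1 tree (a “stump”) on each predictor alone and measure the Gini impurity reduction of its best split. The variable names and synthetic data below are purely illustrative, and this is only an approximation of CART's own report, not its exact computation.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def root_split_improvement(x, y):
    """Gini improvement of the best single split on one predictor."""
    stump = DecisionTreeClassifier(max_depth=1, random_state=0).fit(x.reshape(-1, 1), y)
    t = stump.tree_
    if t.node_count < 3:              # defensive: the stump found no split
        return 0.0
    n = t.n_node_samples.astype(float)
    # parent impurity minus the sample-weighted impurities of the two children
    return t.impurity[0] - (n[1] / n[0]) * t.impurity[1] - (n[2] / n[0]) * t.impurity[2]

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=2000)                        # binary target
predictors = {
    "LEAKY_VAR": y + rng.normal(scale=0.01, size=2000),  # near-copy of the target
    "WEAK_VAR":  rng.normal(size=2000) + 0.2 * y,        # mildly informative
    "NOISE_VAR": rng.normal(size=2000),                  # pure noise
}

scores = {name: root_split_improvement(col, y) for name, col in predictors.items()}
for name, s in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name:10s} {s:.3f}")
```

A leaky near-copy of the target scores far above the 0.10 rule of thumb discussed below, while noise scores near zero, which is exactly the pattern the report is designed to expose.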

Starting with the first display, we see that the top predictor is DEFAULT_ON_LOAN with an improvement score of 0.422.  In general, scores greater than 0.10 are suspiciously good, and here we are considerably beyond this rule of thumb.  In this example, DEFAULT_ON_LOAN is so strong that it might well be an alternative version of the target and thus an illegitimate predictor.  We will certainly want to build further models excluding this predictor. We have three further variables with much lower but still extremely strong scores, and we want to think about whether these variables have any legitimate reason to be included. FRAUD_TYPE$ looks to be the kind of variable we would not want to use; in fact, it suggests that our data is a mixture of normal loans, some defaults, and some cases of fraud, and surely these subsets of the data need to be analyzed separately.

To complete the discussion, let’s also look at the lower portion of the report.  Observe that every variable has been scored, with the lowest scores approaching zero.  The display allows us to highlight variables and then select their names to form a new KEEP list, which is ultimately the purpose of this analysis. In these early stages of data analysis we recommend eliminating predictors only from the top of the ranking, as our quick scan is better designed to capture obviously ultra-strong predictors than to reveal subtlety.

SuperFact Scans8 

Above, we selected all predictors with improvement scores less than 0.10 and then clicked “New Keep List” to get an SPM Notepad that we can run to set up a new model:

SuperFact Scans9
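The filtering step itself is simple enough to sketch in a few lines. In the snippet below, only the 0.422 score for DEFAULT_ON_LOAN comes from the article; the other scores are made-up stand-ins for the report's improvement column, and the `KEEP` line mimics the general shape of an SPM command (an assumption about the exact format the Notepad generates).

```python
# Made-up improvement scores; DEFAULT_ON_LOAN's 0.422 is from the report,
# the rest are illustrative placeholders.
scores = {
    "DEFAULT_ON_LOAN": 0.422,   # "too good to be true"
    "FRAUD_TYPE$":     0.310,   # also suspiciously strong
    "INCOME":          0.040,
    "N_LATE_30":       0.025,
    "REGION":          0.008,
}

CUTOFF = 0.10  # the rule-of-thumb threshold from the discussion above
keep = [name for name, s in scores.items() if s < CUTOFF]

# Emit a KEEP command in the spirit of the SPM Notepad (syntax assumed)
print("KEEP " + ", ".join(keep))
```

Running this prints `KEEP INCOME, N_LATE_30, REGION`, dropping the two suspicious predictors.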

This conveniently allows us to set up a new model omitting all of the “too good to be true” predictors.  We can use the menus to reach File...Submit Window or use the CTRL-W keyboard shortcut. From here on we would want to run a more conventional model:

  • Choose a suitable test method (e.g., a 20% random sample reserved for testing)
  • Allow a deeper tree (set the depth limit back to “AUTO”, the default)
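As a rough scikit-learn analogue of that follow-up run (in SPM you would simply re-run with the new KEEP list in place), the two bullets amount to holding out a 20% random test sample and removing the depth-1 cap. The data below is synthetic and the setup is only a sketch:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
X = rng.normal(size=(5000, 5))          # stands in for the kept predictors
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=5000) > 0).astype(int)

# 20% random sample reserved for testing
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.20, random_state=42)

# depth no longer capped at 1 -- let the tree grow to full depth
full_tree = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)
print(f"held-out accuracy: {full_tree.score(X_te, y_te):.3f}")
```

With the leaky predictors already excluded, the held-out score now reflects genuine predictive power rather than target leakage.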

The idea behind our exercise is that by eliminating the too-good-to-be-true variables in a quick initial scan, we have a much better chance of obtaining a model we can actually learn something from when we devote the resources to growing a full-depth CART tree.

So save yourself some time and trouble by trying this trick whenever you start out with a new and unfamiliar data set.




Topics: CART