Simply Salford Blog

Data Mining & Sampling Issues: Do we need a 3-way partition of data (learn, validate, test)?

Posted by Dan Steinberg on Wed, May 14, 2014 @ 10:51 AM

The short answer to this question is “no”: we do not think that the 3-way partition is mandatory for SPM core models such as CART and TreeNet.  Here we discuss the issue.

Machine learning and data mining specialists have long understood that we typically cannot trust results based on learning (or training) data alone, as modern learning machines are flexible enough to often yield near-perfect fits to such data.  To ensure that our results are “honest,” as Breiman, Friedman, Olshen, and Stone wrote in their 1984 monograph “Classification and Regression Trees,” we need to evaluate performance on data that was not used to build the model.  In cases where learning data is truly scarce we have the option of using cross-validation, which synthesizes a test data set via some very clever and genuinely reliable sample re-use ideas.  One way or another we can and must make use of at least a two-way data partition (learn and other, where “other” is sometimes called “test” and sometimes “validation”).

The question we address here is whether this is really enough when the process of model development is lengthy and may involve the building of dozens, hundreds, or even thousands of models.  If each model is evaluated on the second data partition, one might argue that we are in fact learning from it.  The learning may be limited and indirect, but the cumulative effect of intensive model development and evaluation blunts the independence of the second partition data.  The concern of several researchers has been that this rather standard strategy of model development will lead to overfitting, that is, to models that appear to perform better than they actually will when deployed and applied to genuinely never-seen data.  To ensure that we do not fall into this trap, it is argued, a third partition of data is required.  The sole purpose of this third partition is to provide an honest estimate of the performance of the model selected as best; otherwise, the third sample plays (or should play) no role in model development or model selection.  If we truly have large volumes of data there is no reason not to create a three-way data partition, as there is essentially no cost to doing so and it can provide one extra bit of assurance regarding our model performance.  But when data is not so plentiful, should we be concerned about having a third “holdout” partition?

We illustrate our conclusions with results obtained from an intensive attempt to optimize a model based on results from a validate partition and report parallel performance results for the holdout sample.  As the holdout sample plays no role in the model refinement it provides an honest assessment of the performance of each stage of the model’s evolution.  Our thinking is that if the process of optimization was indeed learning substantially from the second data partition then this should be clearly revealed by the results from the holdout performance.

Our first example is derived from banking data where the target variable is DEFAULT_90, a flag for failure to make a payment on an account for 90 days.  The predictors are common credit bureau and credit related behavioral data.  Below we see some basic file information in the SPM Activity window: 

sampling issues 1

The data has been limited to 45 plausible predictors to simplify the following stages of analysis, and we begin with a straightforward CART model.  Note that the dependent variable (target) has been checked off in the target column and that all variables except SAMPLE$ (the Learn/Test/Holdout indicator) and an ID variable have been selected as predictors.

 sampling issues 2

As the whole point of this analysis is the comparison of performances across sample partitions, we must visit the “Testing” tab where we specify the variable we created to identify the data partitions. (See our separate discussion of data partitions in the relevant FAQ.)  This variable, SAMPLE$, has values “Learn”, “Test”, and “Holdout” and was created for us by SPM in a prior data step, but we could have created it ourselves in some other environment (e.g. scripting, Excel, SQL, statistics package).
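For readers who want to build such an indicator outside SPM, here is a minimal Python sketch. The function name `assign_partitions`, the 70/15/15 fractions, and the fixed seed are our own illustrative assumptions; the article does not state the fractions used, only that the Test and Holdout partitions are the same size.

```python
import random

def assign_partitions(n_records, fractions=(0.70, 0.15, 0.15), seed=0):
    """Return one 'Learn'/'Test'/'Holdout' label per record.

    fractions = (learn, test, holdout); these values are assumed for
    illustration and are not taken from the article.
    """
    n_test = round(n_records * fractions[1])
    n_holdout = round(n_records * fractions[2])
    n_learn = n_records - n_test - n_holdout  # remainder goes to Learn
    labels = ["Learn"] * n_learn + ["Test"] * n_test + ["Holdout"] * n_holdout
    # Shuffle with a fixed seed so the same partition can be reproduced later.
    random.Random(seed).shuffle(labels)
    return labels

sample = assign_partitions(1000)
print({name: sample.count(name) for name in ("Learn", "Test", "Holdout")})
```

Storing the labels as a column in the data (rather than re-drawing a random split for each run) is what makes comparisons across many models reliable: every model sees exactly the same Learn, Test, and Holdout records.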

Observe on the next screen display that we offer the following options for testing:

  • No testing, use all of the data, exploratory mode
  • Random partition into two partitions (learn and test)
  • Test partition in another file (this tends to slow things down)
  • Variable separates data into partitions

The last option gives you the greatest flexibility and facilitates reliable comparisons when engaging in complex and multi-stage analyses.

sampling issues 3

Here we have selected our separation variable. (The command language equivalent is PARTITION SEPVAR=SAMPLE$; you could run this command from an SPM NotePad and get the same effect as we do from the GUI display.)

At the time of this writing we are not making the actual data available but we still want to document everything we are doing in setting up these runs.  You might try to do the same on your own data.

The last settings we manipulate in the GUI are found on the Model Setup “Limits” tab where we want to control the minimum sample sizes for nodes appearing near the bottom of the tree.

  • ATOM, the smallest splittable node.  Here we set this to 100, meaning that we stop splitting once we arrive at a node with fewer than 100 data records.
  • SMALLEST CHILD (MINCHILD).  When we split a node we create two child nodes and this control prevents the smaller of the two becoming smaller than some limit, which here we set to 30.

sampling issues 4  

This last setting ensures that any small terminal nodes still have plausible sample sizes and safely limits tree growth, which could be considerable if left unrestrained on a data set with more than 260,000 records such as the one we have here. This is just an aside regarding our model setup and not strictly relevant to our current discussion.
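The two node-size limits described above can be written as a simple check. This is only a sketch of the rule as stated in this post, not SPM's internal implementation; the function name `split_allowed` is hypothetical, and the defaults are the values used in this run (ATOM = 100, MINCHILD = 30).

```python
def split_allowed(n_node, n_left, n_right, atom=100, minchild=30):
    """Return True if a candidate split respects both size limits.

    atom:     a node with fewer than `atom` records is not split further.
    minchild: neither child may have fewer than `minchild` records.
    """
    if n_left + n_right != n_node:
        raise ValueError("child sizes must add up to the parent size")
    if n_node < atom:
        return False  # parent is too small to be split at all
    return min(n_left, n_right) >= minchild

print(split_allowed(100, 70, 30))  # parent and both children are large enough
print(split_allowed(100, 71, 29))  # smaller child falls below MINCHILD
print(split_allowed(99, 50, 49))   # parent falls below ATOM
```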

Growing the tree in the GUI will yield the usual tree topology display we call the CART Navigator, which also shows the performance of the optimal tree on the Learn and Test partitions.  We precede this with some of the classic output showing the breakdown of the target variable by its 0/1 value and also by sample partition.  You can see that the Test and Holdout partitions are actually identical in size, although the realized samples show slightly different rates of default.

 sampling issues 5

sampling issues 6

We have highlighted the Learn and Test ROC values which are quite close and this would be our first reason to expect that the Holdout ROC would also be close to .8412.  As we do not automatically produce a report contrasting the Holdout ROC we now extract that result via the following steps:

  • Visit the SCORE dialog, either by clicking the “Score…” button at the bottom right of the Navigator or via the menus
  • SELECT records belonging only to the Holdout sample
  • Score 

sampling issues 7 

The SCORE dialog requires you to specify (or accept the decision made by SPM):

  • The data which is to be scored
  • The model which will be used to do the scoring 

Here we accept the information assumed by SPM: we want to use the data we already have open and we want to use the model we just built.  We do not show the essential SELECT operation that confines the scoring to the data from the Holdout partition.
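Outside SPM, the same idea (score only the Holdout records, then compute the ROC area) can be sketched in plain Python. This is a generic rank-based (Mann-Whitney) calculation of the area under the ROC curve, not SPM's own scoring code; the record layout and the `score` field name are assumptions made for illustration, while SAMPLE$ and DEFAULT_90 follow the variable names used in this post.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via rank sums; tied scores get average ranks."""
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        average_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1..j+1
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1
    n_pos = sum(1 for _, y in pairs if y == 1)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one record of each class")
    rank_sum_pos = sum(r for r, (_, y) in zip(ranks, pairs) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def partition_auc(records, partition="Holdout"):
    """Select one partition (the SELECT step) and compute its ROC area."""
    subset = [r for r in records if r["SAMPLE$"] == partition]
    return roc_auc([r["DEFAULT_90"] for r in subset],
                   [r["score"] for r in subset])

# Toy records (made up for illustration, not the banking data).
records = [
    {"SAMPLE$": "Holdout", "DEFAULT_90": 0, "score": 0.10},
    {"SAMPLE$": "Holdout", "DEFAULT_90": 0, "score": 0.40},
    {"SAMPLE$": "Holdout", "DEFAULT_90": 1, "score": 0.35},
    {"SAMPLE$": "Holdout", "DEFAULT_90": 1, "score": 0.80},
    {"SAMPLE$": "Learn",   "DEFAULT_90": 1, "score": 0.05},  # ignored
]
print(partition_auc(records))
```

Because the filter is applied before the metric is computed, the Learn and Test records cannot leak into the holdout estimate.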

Observe that the score operation was conducted on 39,678 records (as expected) and that the ROC value is 0.83904, which is reasonably close to the 0.8412 seen for the test data.  The difference of .00216 is about one half of a standard deviation of the ROC estimate (the classic output reports the variance of the ROC estimate as about .00002, which corresponds to a standard deviation of about .0045).

sampling issues 8  
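The comparison of the test and holdout ROC values against the reported variance is easy to verify. The short sketch below simply repeats the arithmetic using the figures quoted in this post; the variable names are our own.

```python
import math

test_roc = 0.8412        # ROC on the Test partition
holdout_roc = 0.83904    # ROC on the Holdout partition
roc_variance = 0.00002   # variance of the ROC estimate from the classic output

difference = test_roc - holdout_roc
std_dev = math.sqrt(roc_variance)
ratio = difference / std_dev  # difference expressed in standard deviations

print(f"difference = {difference:.5f}")
print(f"std. dev.  = {std_dev:.4f}")
print(f"ratio      = {ratio:.2f}")
```

A gap of roughly half a standard deviation is well within what sampling variation alone would produce.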

To simulate the process of repeated optimization of a model in response to trial-and-error modeling, we invoke the automation capabilities of SPM.

sampling issues 9

The automation selected is SHAVING, which repeatedly removes a variable (or variables) from the KEEP list of predictors based on the variable importance rankings of the most recently completed model.  As we progressively refine this list of predictors we make use of test sample performance at every step to determine whether we are improving our model and when, if at all, to stop the refinement process. Below we see that the automation suggests that we can cut all the way back from 45 predictors to 5 while improving model performance on test data.

sampling issues 10
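The shaving loop can be sketched generically as follows. This is not SPM's implementation: we assume here that the least important variables are the ones removed at each cycle, and the `fit` callback, the `toy_fit` stand-in model, and all names are hypothetical. The point is only to show how test-sample performance is consulted at every cycle.

```python
def shave(predictors, fit, n_remove=1, min_predictors=1):
    """Repeatedly refit and drop the least important predictors.

    fit(keep) must return (test_score, importance), where importance maps
    each variable in `keep` to a number. Returns one (keep_list, test_score)
    entry per cycle.
    """
    keep = list(predictors)
    history = []
    while True:
        test_score, importance = fit(keep)
        history.append((list(keep), test_score))
        if len(keep) - n_remove < min_predictors:
            break
        ranked = sorted(keep, key=lambda v: importance[v])  # least important first
        dropped = set(ranked[:n_remove])
        keep = [v for v in keep if v not in dropped]
    return history

def toy_fit(keep):
    """Stand-in for a real model: only A and B carry signal, and every
    variable in the model costs a little test performance."""
    signal = {"A": 0.5, "B": 0.3}
    importance = {v: signal.get(v, 0.01) for v in keep}
    score = sum(signal.get(v, 0.0) for v in keep) - 0.01 * len(keep)
    return score, importance

history = shave(["A", "B", "C", "D"], toy_fit)
best_keep, best_score = max(history, key=lambda item: item[1])
print(best_keep)
```

Note that the model chosen as best is the one with the highest test score, which is exactly the repeated use of the test partition whose consequences the holdout comparison is meant to expose.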

The holdout sample results for this series of models are not automatically produced by SPM, but we organized them in a convenient fashion and show them graphically.  Each pair of bars displays the test and holdout ROC for a single cycle of predictor elimination, and we can see that the holdout results remain very close to the test results throughout.  The average difference is .009 and the differences range from .0019 to .0165.  These results suggest that from a practical perspective we are unlikely to be misled by relying on test data alone. It is indeed a surprise that the Holdout sample uniformly shows a slight performance advantage, which if anything should further strengthen our case.

 sampling issues 11
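If you collect the test and holdout ROC for each cycle on your own data, the summary statistics quoted above (average gap, range, and whether the holdout is uniformly ahead) can be computed along these lines. The function name `gap_summary` is hypothetical, and the numbers in the usage example are made up purely to show the call; they are not the per-cycle values behind the chart.

```python
def gap_summary(test_roc, holdout_roc):
    """Summarize holdout-minus-test ROC gaps across modeling cycles."""
    if len(test_roc) != len(holdout_roc) or not test_roc:
        raise ValueError("need two non-empty lists of equal length")
    gaps = [h - t for t, h in zip(test_roc, holdout_roc)]
    abs_gaps = [abs(g) for g in gaps]
    return {
        "mean_abs_gap": sum(abs_gaps) / len(abs_gaps),
        "min_abs_gap": min(abs_gaps),
        "max_abs_gap": max(abs_gaps),
        "holdout_always_higher": all(g > 0 for g in gaps),
    }

# Made-up illustrative values, one entry per shaving cycle.
summary = gap_summary([0.80, 0.81, 0.82], [0.81, 0.83, 0.83])
print(summary)
```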

The example we have shown here is just one of many such comparisons we have conducted, all supporting the same conclusion.  We do want to emphasize that our conclusions are based on experience with CART and TreeNet and with cross-sectional data coming from banking, insurance, and targeted marketing.

Topics: train and test data, partition, sample size