When assessing predictive model performance using cross-validation, the model we obtain after all the computation is actually a model built on all of the data, that is, a model for which no data was reserved for testing. The test results reported for this all-data model are estimated and synthesized from the supplementary models built on parts of the data. Typically, the supplementary models are thrown away once they have served their purpose of helping us construct educated guesses about the future performance of the all-data model on new, previously unseen data.
We can, however, save the predictions made by the supplementary models and thus obtain, for every record in the learn sample, a prediction derived from a model built when that record was in the test partition. Borrowing from the bagger and Random Forests terminology, we call these predictions OOB or “out of bag” predictions. We illustrate how to obtain them in SPM 7.0 with some screen captures below. We use a data set pertaining to default on a loan, and thus set up a classification model, but the same considerations apply to regression models as well.
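The mechanics behind those OOB predictions can be sketched generically (this is not SPM code, and the trivial mean-value “model” is a stand-in purely to keep the sketch self-contained): in k-fold cross-validation every record lands in the test partition exactly once, so collecting the test-fold predictions yields one held-out prediction per record.

```python
# Generic sketch of OOB prediction collection in k-fold CV.
# The "model" here just predicts the training-partition mean;
# any learner could be substituted.
import random

def oob_predictions(y, k=10, seed=0):
    """Return one held-out prediction per record from k-fold CV."""
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    oob = [None] * len(y)
    for test in folds:
        train = [i for i in idx if i not in set(test)]
        # Stand-in "model": the mean outcome in the training partition.
        pred = sum(y[i] for i in train) / len(train)
        for i in test:
            oob[i] = pred          # this record was out of bag for this model
    return oob

y = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
print(oob_predictions(y, k=5))
```

Because each record appears in exactly one test partition, the returned vector has no gaps: every learn-sample record gets exactly one OOB prediction.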
The specifics of this data set are not relevant, and we encourage you to follow along with one of your own data sets. Of course, we begin by setting up the model and identifying the “legal” predictors. This is followed by a visit to the “Testing” tab, where we want to pay attention to the three highlighted items:
- First, we select V-fold cross-validation, opting for 10 folds.
- Next we highlight the option to save the OOB scores to a file. These will include the predictions made for every record in the learn sample when the record was set aside for testing in one of the 10 cross-validation models.
- Finally, we also highlight the option to save the cross-validation models into the grove that will also contain the main all-data model. There is no requirement to save these models if all we are interested in is the OOB predictions. Checking the boxes and letting SPM know where we want the outputs saved is all we have to do. In our session we elected to call the output file “example_OOB.csv”. Now just click the “Start” button and wait for the results. Below we show the results from our run, where we obtain a 39-node tree with a cross-validated test estimate of the area under the ROC curve of .8460, almost identical to the learn-sample ROC.
So far this looks like a normal CART run, and indeed the GUI will not display anything to suggest that OOB data has been saved. But if we open up the saved data we will see something like this for a CART model:
This requires a bit of explanation. First, remember that in CART we develop not one tree but a sequence of pruned trees, starting with the largest tree grown and going all the way down to just a two-node tree. So when we generate the OOB predictions, they will be specific to a tree of a given size.
This explains the column names in the output data set: first comes the original dependent variable, and then the CVFOLD number. Next come the OOB predictions, which are tree-size specific. Above, the column named PRUNING_18_43_NODES corresponds to results aligned with the 43-node tree generated from all the data. When we move from a tree with T terminal nodes to one with T-1 terminal nodes, only the records in the nodes combined by that pruning have their predictions changed, so we might have to move through several prunings to see a change for a given row of the data. To work with this data we should decide on a specific tree we want to focus on and then just work with the set of OOB predictions for that tree. You might, however, be interested in the trajectory of predicted values that comes from moving through a range of different trees in the tree sequence.
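Reading the saved file and isolating one pruning's predictions takes only a few lines. A minimal sketch follows; the column names (TARGET for the dependent variable, plus CVFOLD and PRUNING_18_43_NODES as in the example above) and the tiny in-memory stand-in for “example_OOB.csv” are assumptions for illustration, and your own file will differ.

```python
# Sketch: load the saved OOB file and pull out the predictions for one
# chosen pruning.  In practice you would open "example_OOB.csv";
# a StringIO stand-in keeps the sketch self-contained.
import csv
import io

oob_csv = io.StringIO(
    "TARGET,CVFOLD,PRUNING_18_43_NODES,PRUNING_19_39_NODES\n"
    "1,1,0.82,0.80\n"
    "0,1,0.15,0.15\n"
    "1,2,0.67,0.71\n"
    "0,2,0.28,0.28\n"
)

rows = list(csv.DictReader(oob_csv))

# One OOB prediction per learn-sample record, taken from the
# pruning we care about (here the 43-node tree of the example).
chosen = "PRUNING_18_43_NODES"
actual = [int(r["TARGET"]) for r in rows]
oob_pred = [float(r[chosen]) for r in rows]
fold = [int(r["CVFOLD"]) for r in rows]

print(actual, oob_pred, fold)
```

From here, switching to a different tree in the sequence is just a matter of changing the `chosen` column name.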
What can you do with this data?
Most often we might wish to plot actual against predicted outcomes (in regression) or to review predictions within specific subsets of the data. The benefit of this output is that you can manipulate and query the data at the record level.
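As one record-level query of the kind the standard output does not give you, we might flag the records whose OOB prediction disagrees most with the actual outcome. The data values below are made up for illustration.

```python
# Sketch: rank records by how badly the OOB prediction missed.
actual = [1, 0, 1, 0, 1]
oob_pred = [0.875, 0.125, 0.25, 0.75, 0.625]   # OOB probability of class 1

residuals = [a - p for a, p in zip(actual, oob_pred)]
worst = sorted(range(len(actual)), key=lambda i: -abs(residuals[i]))[:2]
print(worst)  # indices of the two records the model got most wrong
```

Records 2 and 3 surface here: a class-1 record scored low and a class-0 record scored high, exactly the cases worth inspecting individually.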
One caution: CART decides on its own whether it is outputting a probability for class 0 or a probability for class 1 in this data set, so you may have to inspect the file to determine how to read the scores.
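One way to check the orientation, sketched below with made-up data: if records whose actual outcome is 1 tend to get lower scores than records whose outcome is 0, the column is most likely the class-0 probability and should be flipped with 1-prob. The function name is ours, not SPM's.

```python
# Sketch: orient saved scores so that higher always means class 1.
def orient_scores(actual, scores):
    """Flip scores with 1 - p if class-1 records tend to score lower."""
    mean1 = sum(s for a, s in zip(actual, scores) if a == 1) / actual.count(1)
    mean0 = sum(s for a, s in zip(actual, scores) if a == 0) / actual.count(0)
    return scores if mean1 >= mean0 else [1 - s for s in scores]

# Example: scores were stored as P(class 0), so the 1s score low.
actual = [1, 0, 1, 0]
scores = [0.25, 0.75, 0.125, 0.875]
print(orient_scores(actual, scores))  # flipped: [0.75, 0.25, 0.875, 0.125]
```

A check like this only works when the model is better than random on the learn sample, which is almost always the case for a tree you intend to keep.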
One further caveat for the advanced user: you might think that if you used the saved predicted probabilities (or 1-prob if necessary) to rank the data, you could extract an area under the ROC curve equal to that reported on the CART navigator. But the CART navigator and Summary calculate ROC separately for every CV fold and then take a weighted average over the folds to arrive at the final (and correct) measure. Using the OOB vector of predictions usually gives very similar results, but the explanation for why the weighted-average approach is correct will have to wait for another post.
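The difference between the two computations is easy to see on toy numbers. The sketch below uses the Mann-Whitney form of the AUC and a made-up two-fold layout; it illustrates the arithmetic, not SPM's internals.

```python
# Sketch: pooling all OOB scores into one ROC curve is not the same
# computation as a weighted average of the per-fold ROC areas.
def auc(actual, scores):
    """Area under the ROC curve via the Mann-Whitney formulation."""
    pos = [s for a, s in zip(actual, scores) if a == 1]
    neg = [s for a, s in zip(actual, scores) if a == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

folds = [
    ([1, 1, 0, 0], [0.9, 0.6, 0.7, 0.2]),   # fold 1: AUC 0.75
    ([1, 1, 0, 0], [0.8, 0.7, 0.3, 0.1]),   # fold 2: AUC 1.00
]

# Weighted average of per-fold AUCs (weights = fold sizes).
sizes = [len(a) for a, _ in folds]
weighted = sum(auc(a, s) * n for (a, s), n in zip(folds, sizes)) / sum(sizes)

# AUC of the pooled OOB prediction vector.
all_a = [a for fa, _ in folds for a in fa]
all_s = [s for _, fs in folds for s in fs]
pooled = auc(all_a, all_s)

print(weighted, pooled)   # 0.875 vs 0.90625 -- close, but not equal
```

Here the pooled AUC is inflated because a score from one fold's model gets compared against scores from another fold's model, comparisons the per-fold calculation never makes.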