Occasionally users ask us how to make use of a model they have just built, and specifically, how to generate predictions from a model. In this note we will discuss RandomForests models, although the general ideas apply to any model generated by SPM.
Here we will be using SPM 7.0 (engine release 466). To check your version, go to the “Help” menu and select “About”, which will bring up this display reporting the version number (highlighted with the green arrow and box below):
If you have an earlier version we recommend that you update to take advantage of a steady stream of improvements and performance enhancements.
To illustrate the process we make use of a data set of responses to a marketing offer made by an auto insurance company. The target variable is RESPOND, and we set up the model using all variables not obviously inadmissible. (This data set is not available for download, but the type of information used here is typical.)
For the purpose of illustration we also visit the “Select Cases” tab to exclude some of the data from the learn sample. Here we have selected the records for which the variable PRIVATE has a value of 0; be sure to click the “Add To List” button to activate the SELECT. The records we exclude now are going to be the records we score later once the RandomForests model has been built.
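The same learn/score split can be sketched outside SPM. Here is a hypothetical pandas analogy (the column names come from the text above; the data values are made up):

```python
import pandas as pd

# Hypothetical stand-in for the insurance marketing data described above
df = pd.DataFrame({
    "RESPOND": [1, 0, 0, 1, 0, 1],
    "PRIVATE": [0, 0, 1, 1, 0, 1],
})

learn = df[df["PRIVATE"] == 0]    # records used to build the model
holdout = df[df["PRIVATE"] == 1]  # records set aside to score later
print(len(learn), len(holdout))   # 3 3
```

In SPM the same partition is accomplished entirely through the “Select Cases” dialog; no data manipulation outside the tool is required.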
Finally, we visit the testing tab and elect to let RandomForests do its own internal self-testing via “Out of Bag” or OOB records, testing each tree on the random subset of records excluded from that particular tree. OOB testing is not the same as k-fold cross-validation but has much in common with it and is a highly reliable method for honestly estimating model performance on new data (generalization error).
Clicking “START” runs the RandomForests model and yields:
Click on the “Summary” button at the bottom of the Results display above to reveal:
Here we note the area under the ROC curve (AUC) estimated via OOB data, and then click on the “Score” button to apply the model to new data.
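As an aside, if you save record-level predictions you can recompute the same ROC area outside SPM. A scikit-learn sketch (not an SPM feature; the numbers are made up for illustration):

```python
from sklearn.metrics import roc_auc_score

y_true = [1, 0, 0, 1, 1, 0]              # actual RESPOND values (made up)
y_prob = [0.9, 0.2, 0.4, 0.7, 0.6, 0.3]  # predicted probability of RESPOND=1

# Every responder here is ranked above every non-responder, so AUC = 1.0
print(roc_auc_score(y_true, y_prob))     # 1.0
```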
The SCORE dialog is our control center from which we can decide exactly how to deploy our model. We have highlighted the critical decision areas in green:
- What data do we want to score? This involves specifying an input file and, optionally, a specific subset of rows of data to process
- What model do we want to use to do the scoring? If we click on the SCORE button of a model results window SPM will assume that this is the model we want to use. However, we can always elect to use some other model by pointing to the GROVE file in which that model was saved.
- If we would like to save the predictions, record by record, then to which file? For simple performance measurement we are not required to save anything.
In addition, we can visit the “Select Cases” window to refine the specification of the data to be scored. In this case, we will SELECT the records not used to build the model, i.e. the records for which PRIVATE=1.
Now all we need to do is click on the “Score” button to obtain the performance report, where we observe that the performance on this other subset of data is decent but not nearly as good as seen on the subset for which PRIVATE=0:
The Advanced tab on the Scoring dialog in SPM 7.0 contains two controls relevant to RandomForests models, which we show below.
The options are “save individual tree predictions” and “enable buffered scoring”. The former allows you to save record-specific predictions generated separately by every individual tree in the forest. In our case, since we ran 500 trees, we would be saving 500 new columns of data for every record being scored. Normally this is done for further processing of the predictions, such as with GPS regularized regression.

If you do dig into these predictions you must keep in mind that RandomForests predictions and performance measures are calculated differently internally during model building than when scoring data after model construction. The difference has to do with “out of bag” data and is very important to understand. When calculating the prediction for a specific record in the learn data we NEVER make use of all trees grown. Instead we use only the trees for which the record in question was OOB, i.e. not used to grow that tree. On average, a specific record is OOB for about 37% of the trees in the forest; thus, with 500 trees, on average about 185 trees meet the OOB criterion.
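The 37% figure comes from the bootstrap sampling used to grow each tree: a record is left out of a tree with probability (1 − 1/n)^n, which approaches e⁻¹ ≈ 0.368 as the sample size n grows. A quick Python check (the learn-sample size here is an arbitrary assumption):

```python
import math

n_trees = 500       # trees in the forest, matching the example above
n_records = 10_000  # illustrative learn-sample size (an assumption)

# Probability a given record is excluded from one bootstrap sample
p_oob = (1 - 1 / n_records) ** n_records
print(round(p_oob, 3))         # 0.368, essentially e**-1

# Expected number of trees for which a specific record is OOB
print(round(n_trees * p_oob))  # 184
```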
When we SCORE data we no longer keep track of the in-bag out-of-bag distinction and essentially assume that every record was OOB for every tree. This, of course, is true if we are scoring holdout data but will not be true if we simply score the original training data. Our point here is: do not be surprised if, when scoring your training data, you obtain substantially better performance statistics than reported by SPM during model building. The reason will be that during model building SPM RandomForests reports OOB performance stats whereas when scoring SPM calculates and reports a mixture of both learn sample and OOB performance.
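This gap is easy to reproduce with any random forest implementation. A scikit-learn sketch (not SPM, and with synthetic data) showing that the honest OOB estimate sits below the optimistic training-set score:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic classification data standing in for the insurance example
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

rf = RandomForestClassifier(n_estimators=200, oob_score=True,
                            random_state=0).fit(X, y)

# OOB accuracy: each record judged only by trees that never saw it
print(f"OOB accuracy:      {rf.oob_score_:.3f}")
# Training accuracy: every record judged by ALL trees, including in-bag ones
print(f"Training accuracy: {rf.score(X, y):.3f}")
```

The training accuracy will typically be near-perfect, while the OOB figure is a far better guide to performance on genuinely new data.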
“Enable Buffered Scoring” is an option you should use if you encounter memory problems when attempting to load a large RandomForests grove (for example, a grove with many trees, extremely large trees, or both). With buffering the scoring operation will proceed, but more slowly, as parts of the forest are swapped in and out of main memory.
Below we display a view of the data saved by the scoring operation when we opt to save the individual tree predictions.
The first column is the actual target variable RESPOND, which is followed by the class label assigned to each record by the forest (the predicted class). The PROB variables are the probabilities assigned to each level of the target, and are based on a weighted average of the individual tree predictions.