Simply Salford Blog

Why Data Scientists Split Data into Train and Test

Posted by Dan Steinberg on Mon, Mar 3, 2014 @ 07:47 AM

Newcomers to Data Science frequently wonder why we insist on partitioning data into separate roles of learn (aka train) and test rather than just working with all of the data. As we have recently received a number of questions on this topic, we decided to put together a series of blog posts to help clarify the issues.

In everyday business intelligence settings we typically search for answers to relatively simple questions usually posed as queries to a database. How many 16 oz. tins of Fruto do Mar tuna fish did we sell last week in each of our large-format grocery stores? How many visitors did we get to our website yesterday? For such questions it would not make much sense to partition the data; in fact, our questions require us to use all of it. Although we are focused on the real world, if we are given trustworthy data, we can think of the questions as being primarily about the data. Answers in the form of descriptions and summaries of the data are exactly what we are looking for. We ought to have no reason to doubt or be skeptical about the answers that our queries generate, and if we do, we probably have some very straightforward ways to fix any problematic data.

When our objective turns to prediction, and in particular toward the development of predictive models, we will typically use our models to guide many decisions and to make hundreds, thousands, or even billions of predictions. With a predictive model our principal focus is no longer on the data but on a type of theory about reality. Here we have every reason to be cautious, if not skeptical, as we are going considerably beyond the confines of descriptive statements about our data. Predictions by their very nature lead us into unknown territory; we commit to statements about what has not yet happened, in contrast to typical business intelligence queries, which are about what has already happened. If we develop predictive models, we must have a way to assess their accuracy, reliability, and credibility.

When we say that we want to assess the quality of a predictive model we are saying we would like to know what will happen if we commit to the use of a model in making predictions. Will our predictions be relatively close to the actual outcomes we eventually get to see? When predictions are incorrect, approximately how large are our prediction errors? Will our model make consistent errors, such as always over-predicting the outcome for an identifiable group of entities? These are common-sense questions, but the underpinnings of the correct answers are technically challenging to follow.

The simplest way for us to get a handle on the ability of a predictive model to perform on future data is to try to simulate this eventuality. Although we cannot literally gain access to the future before it occurs, we can reserve some of our currently available data and treat it as if it were data from the future. For example, if we are predicting which Internet ads a web site visitor will click on, we might build predictive models using data from two days ago and make predictions for yesterday. This is actually a perfect simulation, with the benefit that we presumably already know which ads were clicked on, and by whom, yesterday. We can thus compare our predictions with the outcomes that actually occurred. In marketing campaigns and credit risk models we usually work with data pertaining to a single point in time (or interval in time such as one week, one month, one campaign). Such data is often referred to as cross-sectional. For such problems we typically divide the available data into separate partitions randomly, developing our models on one of these partitions and using the other for predictive model assessment and possibly model refinement. We turn to this topic now.
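To make the date-based version of this idea concrete, here is a minimal pandas sketch. The file name, column names, and dates are purely illustrative (they are not taken from any real project); the point is simply that the model is built on the older day and judged on the newer one.

```python
import pandas as pd

# Hypothetical ad-click log; file and column names are made up for illustration.
ads = pd.read_csv("ad_clicks.csv", parse_dates=["event_date"])

two_days_ago = pd.Timestamp("2014-03-01")
yesterday = pd.Timestamp("2014-03-02")

# Build the model on the older day's data ...
learn = ads[ads["event_date"] == two_days_ago]

# ... and treat yesterday's data as "the future": we already know its outcomes,
# so we can compare the model's predictions against what actually happened.
test = ads[ads["event_date"] == yesterday]
```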

The simplest partition possible for cross-sectional data is a two-way random partition to generate a learning (or training) set and a test set (sometimes instead referred to as a validation set). The thinking underlying such a division is that:

  • The data available for analytics fairly represents the real world processes we wish to model

  • The real world processes we wish to model are expected to remain relatively stable over time so that a well-constructed model built on last month’s data is reasonably expected to perform adequately on next month’s data 

If our assumptions are more or less correct then the data we have today is a reasonable representation of the data we expect to have in the future. Holding back some of today’s data for testing is therefore a fair approximation to having future data for testing.  

The division of the data into learn and test must be executed carefully to avoid introducing any systematic differences between learn and test. For example, we would never want to simply select the first half of the data for learning, as there is a risk that the data has been ordered in a specific way: all records pertaining to a click or a default might come before the non-clicks, or the good accounts might precede the defaults. A common way to ensure a lack of systematic difference between the partitions is simple random assignment. For every record, we flip a digital coin with a predefined probability of a “heads”, and assign that record to a partition depending on the outcome of the digital coin flip. There are other ways to accomplish the partition but they all rely on a form of random assignment.
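A minimal sketch of the digital coin flip, written here with NumPy and pandas (the 70/30 split probability and the toy DataFrame are our own illustrative choices, not a recommendation from any particular software):

```python
import numpy as np
import pandas as pd

# Toy stand-in for a real dataset; any DataFrame of records works the same way.
df = pd.DataFrame({"x1": np.arange(1000),
                   "clicked": np.random.randint(0, 2, size=1000)})

rng = np.random.default_rng(seed=42)     # fixed seed so the split is reproducible
is_learn = rng.random(len(df)) < 0.7     # "heads" with probability 0.7

learn = df[is_learn]                     # records whose coin came up heads
test = df[~is_learn]                     # everything else
```

Because assignment depends only on the coin flip and not on how the file happens to be sorted, any ordering of clicks before non-clicks is broken up automatically.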

Why Bother Creating a Test Partition? 

First and foremost, we create test partitions to provide us with honest assessments of the performance of our predictive models. No amount of mathematical reasoning and manipulation of results based on the training data will be convincing to an experienced observer. Most of us have encountered strategies for profitable stock selection that perform brilliantly on past (training) data but somehow fall down where it counts, namely on future data. The same will apply to any predictive model we generate with modern learning machines.

Beyond the need to demonstrate performance in a convincing and common-sense way to decision makers and non-specialists, the test partition plays a critical model selection role for the CART decision tree and TreeNet Gradient Boosting. The CART learning machine is so adaptive that, with a sufficiently large tree, it can often achieve 100% predictive accuracy on the training data. This is also true for some other learning machines. For CART we use the test partition to evaluate the predictive performance of trees of different sizes in order to identify and present the “right sized tree”. In TreeNet gradient boosting we use the test partition as a constant monitor of the progressively growing ensemble of trees and finally to select the “right sized ensemble”, that is, the optimal number of trees to keep. In other words, the test partition plays not just the passive role of evaluating a specific model, but also the active role of model selection from a well-defined set of alternative models. There is much more to say about this active role and we touch on it briefly below.
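The ensemble-size selection idea can be sketched in code as well. The example below uses scikit-learn's gradient boosting rather than TreeNet itself, on purely synthetic data, simply to show how a test partition can monitor the growing ensemble and pick the number of trees to keep:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

# Synthetic data standing in for a real modeling dataset.
X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_learn, X_test, y_learn, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Grow a deliberately generous number of trees on the learn partition only.
gbm = GradientBoostingClassifier(n_estimators=500, learning_rate=0.05, random_state=0)
gbm.fit(X_learn, y_learn)

# Monitor the test partition as the ensemble grows, one stage at a time.
test_loss = [log_loss(y_test, p) for p in gbm.staged_predict_proba(X_test)]

# The "right sized ensemble" is the stage with the lowest test loss.
best_n_trees = int(np.argmin(test_loss)) + 1
print(f"keep {best_n_trees} trees (test log-loss {min(test_loss):.4f})")
```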

Roles of Learn and Test Partition 

Newcomers to Data Science are often unclear about the distinct roles of the learn and test partitions. The learn partition has a single and essential role: it provides the raw material from which the predictive model is generated. All the details of a CART, MARS, TreeNet, Random Forests, or GPS/Lasso model are based on this data. Decision tree node splitters, coefficients of a Lasso regression, and, most importantly, the model predictions are based on the learn sample. We do not use, and indeed do not even need access to, the test partition in order to build the predictive model. This may sound obvious, but appreciating the implications of these statements often requires some time to absorb. So we continue to elaborate these points here (with, for now, some convenient oversimplification).

The predictions that a CART decision tree makes out of the box are derived from the learn partition data. To determine these predictions we examine the data inside each terminal node and typically take the results as literal model predictions. If in a given terminal node we observe a click-through rate of 2.7%, then our model prediction for all future records arriving at this terminal node will be 2.7%. If in our retail sales example a CART node shows that a given supermarket sold 1,738 16 oz. tins of Fruto do Mar tuna fish last week, then any future record reaching this terminal node will be associated with predicted sales of 1,738. Similar observations can be made for TreeNet gradient boosting ensembles, which are constructed from many small CART trees. The test partition data play no role whatever in the construction of these predictions, and we would have arrived at the same trees and the same predictions even if we had no test data at all.
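We can verify this "leaf value comes from the learn sample" behavior with a small experiment. The sketch below uses scikit-learn's regression tree rather than CART in SPM, and entirely synthetic data, but the mechanism is the same: the prediction stored in each terminal node is just the mean of the learn-sample targets that landed there.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X_learn = rng.normal(size=(500, 3))                                    # synthetic predictors
y_learn = 1500 + 200 * X_learn[:, 0] + rng.normal(scale=50, size=500)  # e.g. weekly tins sold

tree = DecisionTreeRegressor(max_leaf_nodes=8, random_state=0).fit(X_learn, y_learn)

# apply() reports which terminal node each learn record falls into.
leaf_ids = tree.apply(X_learn)
for leaf in np.unique(leaf_ids):
    learn_mean = y_learn[leaf_ids == leaf].mean()      # summary of learn data in the node
    stored_prediction = tree.tree_.value[leaf][0][0]   # what the tree will predict there
    print(f"node {leaf}: learn mean {learn_mean:.1f}  stored prediction {stored_prediction:.1f}")
```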

Among other uses, the test partition is employed to evaluate the performance of the model. If our learn data predicts a 2.7% click-through rate (CTR) in a given node and the test data displays a rate of 2.69%, we are likely to be pleased with the results, given how close the learn and test results are. However, if we instead see a test CTR of 0.5%, we would want to reject the model and continue the process of model re-engineering. Even if we have already committed to a model, the test partition provides us with some guidance as to the accuracy we might expect from the predictions. If in the tuna fish example our predictions never stray further than a 10% difference between learn and test results, we could use this in our planning and strategy for avoiding overstock or understock outcomes on supermarket shelves.
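Continuing the sketch above, this kind of node-by-node comparison is easy to reproduce: route the test records through a tree that was built on learn data alone and compare the event rates seen in each terminal node. Again, scikit-learn and synthetic data stand in for the real thing.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic "click" data with roughly a 3% positive rate.
X, y = make_classification(n_samples=20000, weights=[0.97], random_state=1)
X_learn, X_test, y_learn, y_test = train_test_split(X, y, test_size=0.5, random_state=1)

# The tree is built from the learn partition only.
tree = DecisionTreeClassifier(max_leaf_nodes=10, random_state=1).fit(X_learn, y_learn)

# Route both partitions to their terminal nodes and compare click rates per node.
learn_leaf, test_leaf = tree.apply(X_learn), tree.apply(X_test)
for leaf in np.unique(learn_leaf):
    learn_rate = y_learn[learn_leaf == leaf].mean()
    in_test = test_leaf == leaf
    if in_test.any():
        print(f"node {leaf}: learn CTR {learn_rate:.2%}  test CTR {y_test[in_test].mean():.2%}")
```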

For CART and TreeNet the test partition plays another essential role: the selection of a “right sized model” from a defined menu of options. A CART decision tree is typically first grown to a relatively large size, reaching potentially hundreds or even thousands of terminal nodes. This large tree, called the maximal tree, is then put through a process of pruning, by which we move to progressively smaller and smaller trees. The pruning process does not stop until no tree remains and every split has been pruned away. The trees we arrive at in this process are our candidate models, all having been generated from the learn data. If this process appears a bit murky, you could instead think of the trees as growing rather than being pruned. We start with no tree and make our first split, reaching a small tree with two terminal nodes. Next, we split one of these two terminal nodes into two new terminal nodes and arrive at a three-terminal-node tree. And so on. With a very large number of trees of different sizes to consider, which one should we use for making predictions? This is where the test data come into the picture.

Any given sized tree built on the learn data commits to specific predictions which can be compared to the actual outcomes in the test data. Doing this for every tree size available allows us to produce a curve displaying test partition performance plotted against the size of the tree. For CART, we typically display a variation of the misclassification error rate or the test partition normalized sum of squared prediction errors. Generally, this curve is bathtub- or U-shaped: an overly simple model leveraging just one or two predictive factors will not perform as well as a more fully developed model that draws on richer predictive information, so error rates start relatively high and decrease steadily as the tree grows. However, once the model becomes overly complicated, it will typically show progressively deteriorating predictive accuracy with increasing complexity. Without a test partition we will never be able to tell exactly where the sweet spot of ideal model size is located. We also always have to allow for the possibility that no model performs well on previously unseen data and that we are working with data unsuitable for any type of predictive modeling. Should this be the case, the best model is in fact no model at all.
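Here is a compact sketch of that error-versus-size curve, again using scikit-learn's cost-complexity pruning on synthetic data as a stand-in for CART's pruning sequence in SPM: every candidate tree is grown from the learn partition, and the test partition alone decides which size wins.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=4000, n_features=20, random_state=0)
X_learn, X_test, y_learn, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

# Candidate trees of every size along the cost-complexity pruning path,
# all derived from the learn partition only.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_learn, y_learn)

curve = []
for alpha in path.ccp_alphas:
    t = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_learn, y_learn)
    curve.append((t.get_n_leaves(), 1.0 - t.score(X_test, y_test)))  # (tree size, test error)

# The test-error curve is typically U-shaped; the "right sized tree" sits at its minimum.
best_size, best_error = min(curve, key=lambda point: point[1])
print(f"best tree size: {best_size} terminal nodes, test error {best_error:.3f}")
```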

To illustrate some of the above points we display some screenshots from models built using the Salford Predictive Modeler software suite v7.0. The first screenshot shows results for a model built with no test partition at all. The upper panel displays the topology of the entire tree and the lower portion displays learn sample performance plotted against tree size. The Y-axis displays the misclassification rate (relative to a null model), and we see that this rate reaches zero (perfect classification of all learn data) when the tree reaches its maximum possible size. Observe that the area under the ROC curve is 1.00 for a tree with 349 terminal nodes. For the smallest possible tree we can grow, the normalized error rate is a vastly larger 0.60.

[Screenshot 1: CART tree built with no test partition; learn-sample error falls to zero as the tree grows]

Now we run the identical analysis on the identical learn partition but make use of a test partition. (The method we use is cross-validation, which requires its own set of explanations. For now, we ask anyone not familiar with the method to trust our statement that it provides us with a test partition of the same size as the learn partition.)
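For readers who want a little more than trust, here is a minimal cross-validation sketch (scikit-learn, synthetic data, and our own choice of 10 folds): each record is held out exactly once, so the pooled out-of-fold predictions form a test partition covering every record, even though each individual model was built on learn data only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, random_state=0)

# 10-fold cross-validation: each record is predicted by a model that never saw it,
# so the out-of-fold predictions act as a test partition the size of the full dataset.
oof_pred = cross_val_predict(DecisionTreeClassifier(random_state=0), X, y, cv=10)
print(f"cross-validated misclassification rate: {(oof_pred != y).mean():.3f}")
```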

[Screenshot 2: the same model built with a test partition obtained via cross-validation]

Note that the test-partition-based measure of the normalized misclassification error rate is not even close to zero. Instead, it is 0.573. Looking at the panel on the right containing “Model Statistics”, we see that the model with the best performance on test data, as measured by the area under the ROC curve, contains only 19 terminal nodes. There is no way we could possibly have guessed this from our learn-data-only results.

Are We “Learning” From the Test Partition?

If we first commit to a model and then evaluate it with test data, we are clearly conducting a “clean” evaluation. But if we use test results to help us pick the model we will subsequently commit to, are we not learning from the test partition? This line of thinking has convinced many machine learning practitioners to work with a 3-way partition of the data, reserving a final “holdout” sample to ensure that our winning model truly stands up to one more truly clean evaluation. In our experience with CART and TreeNet we have never encountered a situation in which a third partition failed to confirm what the test partition revealed. In other words, the amount of “learning” is so modest that we are not really at risk of having spoiled the independent evaluation role of the test partition.
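A 3-way partition is easy to set up; the sketch below (scikit-learn, with split proportions chosen purely for illustration) carves out a holdout set first and only then divides the remainder into learn and test:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=10000, random_state=0)

# First reserve a final holdout sample, to be touched exactly once at the very end.
X_rest, X_holdout, y_rest, y_holdout = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Then split the remainder into learn (model building) and test (model selection),
# giving roughly a 60/20/20 division overall.
X_learn, X_test, y_learn, y_test = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0)
```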

Final Remarks

We have taken quite some time to get to the punch line of this story, which is: the model we build is constructed only from the learn partition of the data, and all of the fine details are derived from the learn data only. The test partition is used for model evaluation and for model selection, but its influence on the model itself is at best indirect. This does not mean that a modeler cannot go on to build a new model on data where the learn and test data are pooled; in fact, we have done this in many circumstances. But this is a separate phase of the modeling and is discussed in a separate post.



Topics: train and test data, data science