The short answer to this question is “no”: we do not think that the 3-way partition is mandatory for SPM core models such as CART and TreeNet. Here we discuss the issue.
Newcomers to Data Science frequently wonder why we insist on partitioning data into separate roles of learn (aka train) and test rather than just working with all of the data. Because we have recently received a number of questions on this topic, we decided to put together a series of blog posts to help clarify the issues.
Question: When can you usefully build models using all of your data?
Updated: July 16, 2013
In their 1984 monograph, Classification and Regression Trees, Breiman, Friedman, Olshen and Stone discussed at length the need to obtain “honest” estimates of the predictive accuracy of a tree-based model. At the time the monograph was written, many data sets were small, so the authors took great pains to work out an effective way to use cross-validation with CART trees.
The result was a major advance for data mining, introducing ideas that were radically new at the time. The main point of the discussion was that the only way to avoid overfitting was to rely on test data. With plentiful data we can always reserve a portion for testing, but with less data we might have to rely on cross-validation. In either case, however, only the test or cross-validated results should be trusted. In contrast, earlier approaches tended to focus on training data performance and ignore the need for test data.
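To make the contrast between training results and "honest" results concrete, here is a minimal Python sketch. It is not SPM, CART, or TreeNet code: the helper names (`holdout_split`, `kfold_splits`, `Memorizer`) and the synthetic coin-flip data are assumptions invented for illustration. The deliberately overfit model memorizes its learn data, so it looks perfect on the rows it was built on while the held-out and cross-validated estimates reveal that it has learned nothing.

```python
import random


def holdout_split(n_rows, test_fraction=0.2, seed=0):
    """Shuffle row indices and reserve a fraction of them for testing."""
    rng = random.Random(seed)
    idx = list(range(n_rows))
    rng.shuffle(idx)
    n_test = round(n_rows * test_fraction)
    return idx[n_test:], idx[:n_test]  # (learn rows, test rows)


def kfold_splits(n_rows, k=10, seed=0):
    """Yield k (learn, test) pairs; each row lands in exactly one test fold."""
    rng = random.Random(seed)
    idx = list(range(n_rows))
    rng.shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i, test_rows in enumerate(folds):
        learn_rows = [r for j, fold in enumerate(folds) if j != i for r in fold]
        yield learn_rows, test_rows


class Memorizer:
    """Hypothetical overfit model: stores every learn row's label verbatim
    and falls back to the majority learn label for rows it has never seen."""

    def fit(self, X, y):
        self.table = dict(zip(X, y))
        self.default = max(set(y), key=y.count)
        return self

    def predict(self, X):
        return [self.table.get(x, self.default) for x in X]


def accuracy(model, X, y):
    """Fraction of rows whose predicted label matches the actual label."""
    hits = sum(p == t for p, t in zip(model.predict(X), y))
    return hits / len(y)


def take(values, rows):
    """Select the entries of `values` at the given row indices."""
    return [values[r] for r in rows]


# Synthetic data (an assumption for this sketch): a unique row id is the
# only predictor and the labels are coin flips, so nothing can be learned.
data_rng = random.Random(42)
n = 1000
X = list(range(n))
y = [data_rng.randint(0, 1) for _ in range(n)]

# Plentiful data: reserve a portion for testing.
learn, test = holdout_split(n, test_fraction=0.2, seed=1)
model = Memorizer().fit(take(X, learn), take(y, learn))
learn_acc = accuracy(model, take(X, learn), take(y, learn))
test_acc = accuracy(model, take(X, test), take(y, test))

# Scarce data: 10-fold cross-validation, pooling hits over all test folds.
cv_hits = 0
for cv_learn, cv_test in kfold_splits(n, k=10, seed=1):
    fold_model = Memorizer().fit(take(X, cv_learn), take(y, cv_learn))
    fold_pred = fold_model.predict(take(X, cv_test))
    cv_hits += sum(p == t for p, t in zip(fold_pred, take(y, cv_test)))
cv_acc = cv_hits / n

print(f"learn accuracy:           {learn_acc:.3f}")
print(f"test accuracy:            {test_acc:.3f}")
print(f"cross-validated accuracy: {cv_acc:.3f}")
```

The learn accuracy is perfect by construction, while the test and cross-validated accuracies should hover near chance. Only the latter two say anything about how the model would behave on new data.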