Simply Salford Blog

Data Mining & Sampling Issues: Do we need a 3-way partition of data (learn, validate, test)?

Posted by Dan Steinberg on Wed, May 14, 2014 @ 10:51 AM

The short answer to this question is “no”: we do not think that a 3-way partition is mandatory for SPM core models such as CART and TreeNet. Here we discuss the issue.


Topics: train and test data, partition, sample size

Why Data Scientists Split Data into Train and Test

Posted by Dan Steinberg on Mon, Mar 3, 2014 @ 07:47 AM

Newcomers to Data Science frequently wonder why we insist on partitioning data into separate roles of learn (aka train) and test rather than just working with all of the data. As we have recently received a number of questions related to this topic, we decided to put together a series of blog posts to help clarify the issues.


Topics: train and test data, data science

All Train and No Test? Build Predictive Models Using All of Your Data

Posted by Dan Steinberg on Tue, Aug 20, 2013 @ 12:33 PM

Question: When can you usefully build models using all of your data?


Topics: overfitting, train and test data

The History Behind Data Mining Train/Test Performance

Posted by Dan Steinberg on Tue, Jul 16, 2013 @ 12:56 PM

Updated: July 16, 2013

In their 1984 monograph, Classification and Regression Trees, Breiman, Friedman, Olshen and Stone discussed at length the need to obtain “honest” estimates of the predictive accuracy of a tree-based model. At the time the monograph was written, many data sets were small, so the authors took great pains to work out an effective way to use cross-validation with CART trees.

The result was a major advance for data mining, introducing ideas that at the time were radically new. The main point of the discussion was that the only way to avoid overfitting was to rely on test data. With plentiful data we can always reserve a portion for testing, but with fewer data we might have to rely on cross-validation. In either case, however, only the test or cross-validated results should be trusted. In contrast, earlier approaches tended to rely on the training data performance results and ignore the test data.


Topics: TreeNet, CART, train and test data, Cross-Validation, tr

Data Mining: How to Partition Data into Train and Test

Posted by Dan Steinberg on Fri, May 18, 2012 @ 11:33 AM

There are several options for randomly partitioning data into train and test sets, and for repeating the process to obtain different partitions.
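One such option can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure used in SPM or CART: the function names `train_test_split_indices` and `repeated_partitions`, the 20% test fraction, and the seed scheme are all hypothetical choices made for the example.

```python
import random


def train_test_split_indices(n_rows, test_fraction=0.2, seed=None):
    """Randomly partition row indices 0..n_rows-1 into (train, test) lists.

    Hypothetical helper for illustration; not part of any Salford product.
    """
    if n_rows < 2:
        raise ValueError("need at least 2 rows to form train and test sets")
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")

    rng = random.Random(seed)      # a private generator keeps the split reproducible
    indices = list(range(n_rows))
    rng.shuffle(indices)           # random order decides which rows go to test

    # Keep at least one row on each side of the partition.
    n_test = min(n_rows - 1, max(1, round(n_rows * test_fraction)))
    test_idx = sorted(indices[:n_test])
    train_idx = sorted(indices[n_test:])
    return train_idx, test_idx


def repeated_partitions(n_rows, n_repeats=3, test_fraction=0.2, base_seed=0):
    """Repeat the random split with a different seed each time."""
    return [
        train_test_split_indices(n_rows, test_fraction, seed=base_seed + r)
        for r in range(n_repeats)
    ]


if __name__ == "__main__":
    for train_idx, test_idx in repeated_partitions(n_rows=10, n_repeats=3):
        print("train:", train_idx, "test:", test_idx)
```

Fixing the seed makes any single partition reproducible, while changing it from one repeat to the next yields the different partitions needed to check how stable a model's test performance is.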


Topics: CART, train and test data, partition