Simply Salford Blog

Recent Posts

Choosing Your Own Preferred MARS Model

Posted by Dan Steinberg on Wed, Aug 20, 2014 @ 09:46 AM

When MARS develops a model, it actually develops many, and presents you with the one it judges best based on a self-testing procedure. But the so-called MARS optimal model may not be satisfactory from your perspective. It might be too small (include too few variables), too large (include too many variables), too complex (include too many splines, basis functions, or breaks in variables), or otherwise not to your liking given your domain knowledge. So what can you do to override the MARS process?
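
Salford's MARS is driven from its own interface, so as a rough, hedged analogue here is a sketch using the open-source py-earth package (our assumption, not Salford's MARS): fit models under several caps on the number of basis functions and inspect each one yourself, rather than accepting the single automatically selected model.

```python
# Hedged sketch using the open-source py-earth package (not Salford's MARS):
# fit several candidate model sizes and choose one yourself instead of
# accepting the automatically selected "optimal" model.
import numpy as np
from pyearth import Earth  # assumes py-earth is installed

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(500, 5))
y = np.sin(3 * X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.normal(0, 0.1, size=500)

for max_terms in (3, 6, 12):             # candidate caps on basis functions
    model = Earth(max_terms=max_terms).fit(X, y)
    print(max_terms, model.score(X, y))  # training R^2 for each candidate
    # print(model.summary())             # inspect the basis functions chosen
```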

Read More

Topics: data mining, Variable Importance, MARS, data science, predictive modeling, predictive model, data analysis, Dan Steinberg, statistics, machine learning

Data Mining & Sampling Issues: Do we need a 3-way partition of data (learn, validate, test)?

Posted by Dan Steinberg on Wed, May 14, 2014 @ 10:51 AM

The short answer to this question is "no": we do not think the 3-way partition is mandatory for SPM core models such as CART and TreeNet. Here we discuss the issue.
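
For concreteness, here is a minimal sketch (scikit-learn, not SPM) of the 3-way partition under discussion, carved out with two successive splits:

```python
# Minimal sketch of a learn/validate/test partition (60/20/20) using
# scikit-learn; SPM users would set partitions in the software itself.
import numpy as np
from sklearn.model_selection import train_test_split

X, y = np.arange(1000).reshape(-1, 1), np.arange(1000)

X_learn, X_rest, y_learn, y_rest = train_test_split(
    X, y, test_size=0.4, random_state=0)
X_valid, X_test, y_valid, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=0)

print(len(X_learn), len(X_valid), len(X_test))  # 600 200 200
```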

Read More

Topics: train and test data, partition, sample size

Super-Fast Data Scans With One-Split CART Trees [tutorial]

Posted by Dan Steinberg on Mon, May 5, 2014 @ 10:35 AM

Data analysts are always under pressure to get quick results regardless of the size or complexity of the data. In this brief note we show how to leverage the "Root Node Splits" report from a single-split CART tree to gain rapid insights into your data. Our example is based on some fictitious, but highly realistic, financial data. The data set contains 264,578 records with several potential target variables. We illustrate our main points here using the variable DEFAULT_90, which flags the 11.46% of customer records associated with being at least 90 days late on a payment; in total we have 50 variables available, which is of course far fewer than we would normally work with for such data.
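
SPM produces this report directly; as a rough open-source analogue, the sketch below (scikit-learn, with synthetic data standing in for the financial file) fits one single-split tree per predictor and ranks the predictors by the quality of their best root split.

```python
# Rough analogue of a "Root Node Splits" scan: one depth-1 CART tree per
# predictor, ranked by how well its single split separates the target.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the financial data described above.
X, y = make_classification(n_samples=5000, n_features=20, n_informative=5,
                           random_state=0)

results = []
for j in range(X.shape[1]):
    stump = DecisionTreeClassifier(max_depth=1).fit(X[:, [j]], y)
    auc = roc_auc_score(y, stump.predict_proba(X[:, [j]])[:, 1])
    results.append((auc, j))

for auc, j in sorted(results, reverse=True)[:5]:  # best single-split variables
    print(f"feature {j}: one-split AUC = {auc:.3f}")
```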

Read More

Topics: CART

How Do I Avoid Overfitting My MARS Model?

Posted by Dan Steinberg on Fri, Mar 21, 2014 @ 11:05 AM

Overfitting is an issue for most machine learning tools. The learners are very flexible and can thus adapt to the noise in the data as well as to the signal. A classic technique to avoid overfitting is to ensure that we have both learn and validate (or test) data, and then to monitor the learning process, comparing the goodness of fit or performance on learn and validate data as a function of the amount of training.
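
As a minimal sketch of that monitoring (scikit-learn gradient boosting standing in for any flexible learner, not MARS itself):

```python
# Monitor learn vs. validate error as the amount of training grows; the
# validate curve eventually turns upward while the learn curve keeps falling.
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=2000, n_features=10, noise=10.0, random_state=0)
X_learn, X_valid, y_learn, y_valid = train_test_split(X, y, random_state=0)

gbm = GradientBoostingRegressor(n_estimators=500, random_state=0)
gbm.fit(X_learn, y_learn)

for i, (p_l, p_v) in enumerate(zip(gbm.staged_predict(X_learn),
                                   gbm.staged_predict(X_valid))):
    if (i + 1) % 100 == 0:
        print(i + 1,
              mean_squared_error(y_learn, p_l),   # learn error: keeps falling
              mean_squared_error(y_valid, p_v))   # validate error: watch for rise
```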

Read More

Topics: MARS, overfitting

Why Data Scientists Split Data into Train and Test

Posted by Dan Steinberg on Mon, Mar 3, 2014 @ 07:47 AM

Newcomers to Data Science frequently wonder why we insist on partitioning data into the separate roles of learn (aka train) and test rather than just working with all of the data. As we have recently received a number of questions on this topic, we decided to put together a series of blog posts to help clarify the issues.

Read More

Topics: train and test data, data science

A Quick Overview of Unsupervised Learning in Salford SPM

Posted by Dan Steinberg on Tue, Feb 4, 2014 @ 06:30 AM

The Salford Predictive Modeler (SPM) software suite offers several tools for clustering and segmentation, including CART, Random Forests, and a classical statistical module, CLUSTER. In this article we illustrate the use of these tools with the well-known Boston Housing data set (pertaining to 1970s housing prices and neighborhood characteristics in the greater Boston area).
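
As a rough open-source analogue (scikit-learn's KMeans rather than SPM's CLUSTER module, and assuming the Boston Housing data can be fetched from OpenML):

```python
# Hedged sketch: k-means segmentation of the Boston Housing tracts using
# scikit-learn; substitute your own data file if OpenML is unavailable.
from sklearn.cluster import KMeans
from sklearn.datasets import fetch_openml
from sklearn.preprocessing import StandardScaler

boston = fetch_openml(name="boston", version=1, as_frame=True)
X = StandardScaler().fit_transform(boston.data.select_dtypes("number"))

km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
print(km.labels_[:20])      # segment assignment for the first 20 tracts
print(km.cluster_centers_)  # standardized profile of each segment
```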

Read More

Topics: SPM, Random Forests, CART, unsupervised learning, Cluster Analysis

Probabilities in CART Trees (Yes/No Response Models)

Posted by Dan Steinberg on Tue, Oct 15, 2013 @ 12:43 PM

Probabilities in CART trees are quite straightforward and are displayed for every node in the CART navigator.  Below we show a simple example from the KDD Cup ‘98 data predicting response to a direct mail marketing campaign.
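
Outside the SPM navigator, the same quantities are easy to read off a fitted tree; here is a minimal scikit-learn sketch (synthetic yes/no data, not the KDD Cup '98 file):

```python
# Per-node class probabilities in a CART-style tree: the class mix in each
# node, normalized to sum to one.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=8, random_state=0)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

t = tree.tree_
for node in range(t.node_count):
    v = t.value[node][0]
    probs = v / v.sum()  # P(class | node); robust across scikit-learn versions
    print(node, t.n_node_samples[node], probs.round(3))
```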

Read More

Topics: Battery, CART, classification

Scoring RandomForests Models: Applying Models to New Data [tutorial]

Posted by Dan Steinberg on Fri, Oct 11, 2013 @ 05:28 AM

Occasionally users ask us how to make use of a model they have just built, and specifically, how to generate predictions from a model. In this note we discuss RandomForests models, although the general ideas are relevant for any SPM-generated model.
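
The general pattern, sketched here with open-source tools (scikit-learn plus joblib, not SPM's own scoring workflow), is: train, persist the model, then load it and apply it to new rows with the same columns.

```python
# Minimal sketch: persist a trained random forest and score new data with it.
import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
joblib.dump(rf, "rf_model.joblib")           # save the trained model

# Later, possibly in another session:
model = joblib.load("rf_model.joblib")
X_new = np.random.default_rng(1).normal(size=(5, 10))  # new data, same columns
print(model.predict(X_new))                  # predicted classes
print(model.predict_proba(X_new)[:, 1])      # predicted probabilities
```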

Read More

Topics: SPM, Random Forest, model scoring

Why Leave-One-Out (LOO) Cross-Validation Does Not Work For Trees

Posted by Dan Steinberg on Wed, Aug 28, 2013 @ 11:24 AM

The "leave-one-out" (LOO) or jackknife testing method is well known for regression models, and users often ask if they could use it for CART models.  For example, if you had a dataset with 200 rows, you could ask for 200-fold cross-validation, resulting in 200 runs; each of which would be built on 199 training records and tested on the single record which was left out. Those who have experimented with this for regression trees already know from experience that this does not work well, and you do not obtain reliable estimates of the generalization error (performance of your tree on previously unseen data). In this post I comment on why this is the case and what your options are.

Read More

Topics: Regression, classification trees, Cross-Validation

All Train and No Test? Build Predictive Models Using All of Your Data

Posted by Dan Steinberg on Tue, Aug 20, 2013 @ 12:33 PM

Question: When can you usefully build models using all of your data?

Read More

Topics: overfitting, train and test data