When MARS develops a model, it actually develops many and presents you with the one it judges best based on a self-testing procedure. But the so-called MARS optimal model may not be satisfactory from your perspective. It might be too small (include too few variables), too large (include too many variables), too complex (include too many splines, basis functions, or breaks in variables), or otherwise not to your liking based on your domain knowledge. So what can you do to override the MARS process?
The short answer to this question is "no": we do not think that the 3-way partition is mandatory for SPM core models such as CART and TreeNet. Here we discuss the issue.
Data analysts are always under pressure to get quick results regardless of the size or complexity of the data. In this brief note we show how to leverage the "Root Node Splits" report in a single-split CART tree to gain rapid insights into your data. Our example is based on some fictitious, but highly realistic, financial data. The data set contains 264,578 records with several potential target variables. We illustrate our main points here using the variable DEFAULT_90, which flags the 11.46% of the customer records associated with being late by at least 90 days on a payment; in total we have 50 variables available, which is of course far fewer than we would normally work with for such data.
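The underlying idea is simple: fit a tree one level deep on each candidate predictor and rank the predictors by how much their best single split improves node purity. SPM produces this report directly; as a hedged open-source sketch of the same idea (using scikit-learn on synthetic stand-in data, since the financial data above is fictitious and proprietary), a depth-1 "stump" per variable does the job:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 5000
# Synthetic stand-in for the financial data: four candidate predictors.
X = rng.normal(size=(n, 4))
# A DEFAULT_90-style rare binary target, driven mainly by column 2.
y = (X[:, 2] + 0.3 * rng.normal(size=n) > 1.2).astype(int)

# One single-split ("root node only") tree per variable, ranked by the
# Gini impurity improvement of its best split.
scores = {}
for j in range(X.shape[1]):
    stump = DecisionTreeClassifier(max_depth=1, random_state=0)
    stump.fit(X[:, [j]], y)
    t = stump.tree_
    if t.node_count > 1:  # a split was found: root is node 0, children 1 and 2
        n_root = t.n_node_samples[0]
        improve = t.impurity[0] - (
            t.n_node_samples[1] / n_root * t.impurity[1]
            + t.n_node_samples[2] / n_root * t.impurity[2])
    else:
        improve = 0.0
    scores[f"X{j}"] = improve

for name, s in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {s:.4f}")
```

Ranking variables this way surfaces the strongest single predictors of the target in seconds, even on large files, before any full model is grown.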
Overfitting is an issue for most machine learning tools. The learners are very flexible and can thus adapt to the noise in the data as well as to the signal. A classic technique to avoid overfitting is to ensure that we have both learn and validate (or test) data, and then to monitor the learning process, comparing the goodness of fit or performance on learn and validate data as a function of the amount of training.
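SPM performs this monitoring automatically. To see the mechanism itself, here is a minimal sketch with open-source tools (scikit-learn, not part of SPM): as we let a decision tree grow deeper, learn-sample accuracy keeps climbing toward perfection while validate-sample accuracy stalls and then slips, because the extra flexibility is spent fitting noise:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
n = 4000
X = rng.normal(size=(n, 5))
# Two informative variables plus irreducible noise; the rest are pure noise.
y = ((X[:, 0] + X[:, 1] + rng.normal(size=n)) > 0).astype(int)

X_learn, X_valid, y_learn, y_valid = train_test_split(
    X, y, test_size=0.5, random_state=1)

# Grow increasingly flexible trees and compare learn vs. validate accuracy.
results = {}
for depth in (1, 2, 4, 8, 16, None):  # None = grow the tree out fully
    model = DecisionTreeClassifier(max_depth=depth, random_state=1)
    model.fit(X_learn, y_learn)
    results[depth] = (model.score(X_learn, y_learn),
                      model.score(X_valid, y_valid))
    print(f"depth={depth}: learn={results[depth][0]:.3f} "
          f"validate={results[depth][1]:.3f}")
```

The fully grown tree scores perfectly on the learn sample but noticeably worse on the validate sample; the gap between the two curves is the overfitting we are trying to detect.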
Newcomers to Data Science frequently wonder why we insist on partitioning data into the separate roles of learn (aka train) and test rather than just working with all of the data. As we have recently received a number of questions related to this topic, we decided to put together a series of blog posts to help clarify the topic and the issues.
The SPM Salford Predictive Modeler software suite offers several tools for clustering and segmentation, including CART, Random Forests, and a classical statistical module CLUSTER. In this article we illustrate the use of these tools with the well-known Boston Housing data set (pertaining to 1970s housing prices and neighborhood characteristics in the greater Boston area).
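For readers who want to experiment outside SPM, the classical-clustering side of this workflow can be sketched with open-source tools (this is an illustration, not the CLUSTER module itself). The synthetic data below stands in for neighborhood-level features like those in Boston Housing; the key practical step is standardizing the variables before clustering so that no single variable dominates the distance metric:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
# Stand-in for neighborhood-level features: three latent segments
# with different centers (e.g., price, crime, room counts).
centers = np.array([[0, 0, 0], [4, 4, 0], [0, 4, 4]], dtype=float)
segment = rng.integers(0, 3, size=600)
X = centers[segment] + rng.normal(size=(600, 3))

# Standardize first, then cluster with k-means.
Xs = StandardScaler().fit_transform(X)
km = KMeans(n_clusters=3, n_init=10, random_state=2).fit(Xs)
print("cluster sizes:", np.bincount(km.labels_))
```

With well-separated segments like these, k-means recovers three clusters of roughly equal size; on real data, choosing the number of clusters is the harder question.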
Probabilities in CART trees are quite straightforward and are displayed for every node in the CART navigator. Below we show a simple example from the KDD Cup '98 data predicting response to a direct mail marketing campaign.
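The essential fact is that a node's predicted probability is simply the fraction of learn-sample records of each class that land in that node. A minimal sketch with open-source tools (scikit-learn rather than SPM's navigator, on simulated response data) verifies this identity directly:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(3)
n = 2000
X = rng.uniform(size=(n, 1))
# Simulated direct-mail data: response rate jumps from 5% to 30%
# once the predictor crosses 0.5.
y = (rng.uniform(size=n) < np.where(X[:, 0] > 0.5, 0.30, 0.05)).astype(int)

tree = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X, y)

# For each terminal node, the displayed probability equals the class
# fraction among the learn records routed to that node.
leaf = tree.apply(X)
for node in np.unique(leaf):
    frac = y[leaf == node].mean()
    proba = tree.predict_proba(X[leaf == node][:1])[0, 1]
    print(f"node {node}: response fraction = {frac:.3f}, "
          f"predict_proba = {proba:.3f}")
```

(Class weights or priors, when specified, rescale these fractions, but the unweighted case shown here is the default.)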
Occasionally users ask us how to make use of a model they have just built, and specifically, how to generate predictions from a model. In this note we will discuss Random Forests models, although the general ideas are relevant for any SPM-generated model.
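The general pattern, whatever the tool, is the same: fit the model, persist it, then reload it later to score new records. SPM has its own GROVE/TRANSLATE machinery for this; as a hedged open-source parallel (scikit-learn and Python's pickle, not SPM's format), the round trip looks like:

```python
import pickle
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(4)
X = rng.normal(size=(500, 3))
y = (X[:, 0] > 0).astype(int)

# Fit once on the learn data.
model = RandomForestClassifier(n_estimators=50, random_state=4).fit(X, y)

# Persist the fitted model, then reload it (e.g., in a scoring job).
blob = pickle.dumps(model)
reloaded = pickle.loads(blob)

# Score previously unseen records: hard class labels and probabilities.
X_new = rng.normal(size=(5, 3))
print(reloaded.predict(X_new))
print(reloaded.predict_proba(X_new)[:, 1])  # class-1 probabilities
```

The reloaded model produces exactly the same predictions as the original; the only requirement is that the new records carry the same predictor columns, in the same order, as the learn data.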
The "leave-one-out" (LOO) or jackknife testing method is well known for regression models, and users often ask if they could use it for CART models. For example, if you had a dataset with 200 rows, you could ask for 200-fold cross-validation, resulting in 200 runs, each of which would be built on 199 training records and tested on the single record which was left out. Those who have experimented with this for regression trees already know from experience that this does not work well, and you do not obtain reliable estimates of the generalization error (the performance of your tree on previously unseen data). In this post I comment on why this is the case and what your options are.
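The mechanics are easy to reproduce outside SPM. This sketch (scikit-learn, a hypothetical 200-row regression problem of our own making) runs the 200 fits; note how widely the individual per-record squared errors vary, which is part of why averaging them yields such a noisy estimate for trees:

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(5)
n = 200
X = rng.uniform(-3, 3, size=(n, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=n)

# 200 fits: each tree is built on 199 records and tested on the one left out.
errors = []
for train_idx, test_idx in LeaveOneOut().split(X):
    tree = DecisionTreeRegressor(max_depth=3, random_state=0)
    tree.fit(X[train_idx], y[train_idx])
    pred = tree.predict(X[test_idx])
    errors.append((y[test_idx][0] - pred[0]) ** 2)

errors = np.array(errors)
print(f"LOO MSE estimate: {errors.mean():.3f}")
print(f"std of per-record squared errors: {errors.std():.3f}")
```

Deleting a single record rarely changes the tree at all, so the 200 trees are nearly identical, and the estimate inherits the full variance of single-observation squared errors rather than averaging it away.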
Question: When can you usefully build models using all of your data?