Simply Salford Blog

Understanding CART Splits, Competitors, and Surrogates [video]

Posted by Heather Hinman on Thu, Jun 6, 2013 @ 04:04 AM

This blog post is extracted from one of Salford Systems' video tutorial lectures offered by Dan Steinberg. Take some time out of your day to improve your knowledge of CART.

Read More

Topics: CART, classification trees, Tutorial

One Tree for Several Targets? Vector CART for Regression

Posted by Dan Steinberg on Wed, May 15, 2013 @ 04:39 AM

There are several tricks available for maneuvering CART into generating a single tree structure that will output predictions for several different target (dependent) variables in each terminal node. For CART the idea seems very natural in that the structure of the model is just a segmentation of the data into mutually exclusive and collectively exhaustive segments. If the segments of a CART tree designed for one target variable have been well constructed, then those segments could easily be relevant for the prediction of many outcomes. A segmentation (CART tree) based on common demographics and Facebook likes, for example, could be used to predict consumption of tuna fish, frequency of cinema visits, and monthly hair stylist spend. Of course, the question is: could a common segmentation in fact be useful for three such diverse behaviors, and, if such a segmentation existed, would we be able to find it?
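The same idea (one tree, several predictions per terminal node) can be tried directly with scikit-learn's multi-output regression trees, where each leaf stores one predicted value per target. A minimal sketch on synthetic data; the features and targets below are purely illustrative, not from the original post:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))          # shared "demographic-style" features

# Three loosely related outcomes driven by the same underlying segments
y = np.column_stack([
    X[:, 0] + rng.normal(scale=0.1, size=500),
    2 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=500),
    np.where(X[:, 0] > 0, 5.0, 1.0) + rng.normal(scale=0.1, size=500),
])

# Passing a 2-D y fits a single tree whose leaves predict all three targets
tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)
pred = tree.predict(X[:5])             # one row per record, one column per target
print(pred.shape)                      # (5, 3)
```

Whether a shared segmentation is actually good for all targets still has to be checked per target, for example by scoring each output column separately on held-out data.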

Read More

Topics: CART, classification, Regression

Unsupervised Learning and Cluster Analysis with CART

Posted by Dan Steinberg on Tue, May 7, 2013 @ 10:01 AM

CART in its classification role is an excellent example of "supervised" learning: you cannot start a CART classification analysis without first selecting a target or dependent variable. All partitioning of the data into homogeneous segments is guided by the primary objective of separating the target classes. If the terminal nodes are sufficiently pure in a single target class the analysis will be considered successful even if two or more terminal nodes are very similar on most predictor variables.

Read More

Topics: CART, Cluster Analysis, Biomedical Application

Best Practices for Building the Optimal CART Tree [mini tutorial]

Posted by Dan Steinberg on Mon, Apr 15, 2013 @ 09:24 AM

In this blog I'll address the CART tree sequence. CART follows a forward-growing and backward-pruning process to arrive at the optimal tree. In the process CART generates for us not just one model, but a collection of progressively simpler models. This collection of models is known as the "tree sequence." In this article I will explain the forward and backward tree generation process. I will also discuss how a modeler might use judgment to select a near-optimal tree that might be better for deployment than the so-called optimal tree. (This blog is a transcript of the video below.)
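The tree sequence can be reproduced with scikit-learn's minimal cost-complexity pruning, which mirrors CART's backward pruning: growing one large tree and then pruning it back yields a nested family of progressively simpler trees, one per complexity penalty alpha. A sketch on a bundled dataset (not the data from the video):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Grow the maximal tree, then recover the pruning sequence
full = DecisionTreeClassifier(random_state=0).fit(X, y)
path = full.cost_complexity_pruning_path(X, y)

# Refit at each alpha: each member of the "tree sequence" has fewer leaves
sizes = []
for alpha in path.ccp_alphas:
    t = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X, y)
    sizes.append(t.get_n_leaves())

print(sizes[0], sizes[-1])   # largest tree's leaf count, down to the root-only tree
```

In practice the "optimal" member of the sequence is chosen by cross-validated or test-sample performance, and a modeler may deliberately step back to a slightly smaller tree for easier deployment, as the post discusses.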

Read More

Topics: CART, classification trees

Regression Model Building via Classification Trees [tutorial]

Posted by Dan Steinberg on Tue, Apr 9, 2013 @ 12:13 PM

Experienced users of decision trees have long appreciated that decision trees in general are often not impressive performers when it comes to regression. This does not in the least suggest that regression trees are not valuable analytical tools. As always, they are fabulous for gaining insight into data, making rapid out-of-the-box progress even when working with highly flawed data, detecting hidden but important flaws in the data, and identifying valuable predictors. Regression trees are among the most useful of tools during exploratory data analysis, when the modeler is struggling to understand the data and elicit the dominant predictive patterns. This will be especially true when the data is strewn with missing values, as the CART regression tree user will not need to do any special data preparation devoted to dealing with them: CART will handle the missing values effectively. But regression trees (at least single regression trees) often yield lower predictive accuracy than other methods, in part because they generally produce a rather limited number of distinct predictions. All records falling into a specific terminal node of a regression tree share the same prediction, lumping all modestly similar records into the same predictive bucket. Regression trees suffer from one further problem that is rarely appreciated: because the criterion used to build the model is the same as the criterion used to assess the performance of the model, regression trees have an enhanced tendency to overfit to the training data. (More on this latter point later.)
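The "limited number of distinct predictions" point is easy to verify: a regression tree can emit at most one distinct value per terminal node, so a smooth target gets approximated by a small step function. A quick illustrative check on synthetic data:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(1000, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=1000)   # smooth target

tree = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X, y)
preds = tree.predict(X)

# The tree's output takes at most one distinct value per leaf
print(len(np.unique(preds)), tree.get_n_leaves())
```

A depth-4 tree has at most 16 leaves, so at most 16 distinct predicted values for the 1,000 records, however smoothly the true target varies.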

Read More

Topics: CART, Regression, classification trees, Tutorial, SPM 7

Finding R-Squared for CART Regression Trees

Posted by Dan Steinberg on Tue, Feb 19, 2013 @ 10:14 AM

CART users often ask where they can find the value of the R-squared for their regression trees. The answer is very simple: it comes straight from conventional statistics.
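In conventional statistics, R-squared is 1 - SSE/SST, where SSE is the model's sum of squared errors and SST is the sum of squared errors from predicting the mean. The same formula applies directly to a regression tree's predictions, as this illustrative sketch on synthetic data shows:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=500)

tree = DecisionTreeRegressor(max_depth=5, random_state=0).fit(X, y)

sse = np.sum((y - tree.predict(X)) ** 2)   # tree's squared error
sst = np.sum((y - y.mean()) ** 2)          # squared error of the mean
r2 = 1 - sse / sst

print(round(r2, 3))   # identical to tree.score(X, y)
```

For an honest estimate, compute the same ratio on a test sample rather than the training data, since a deep tree's training R-squared is optimistically inflated.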

Read More

Topics: CART, Regression

A Brief Note On CART and Parallelization

Posted by Dan Steinberg on Thu, Dec 20, 2012 @ 11:04 AM

From time to time we like to call attention to articles that we think are of general interest to our user community, and we especially like to spotlight articles written by our mentors Drs. Leo Breiman (in the picture below, around 30 years old), Jerome Friedman, Richard Olshen and Charles Stone. The article highlighted here was written in 1995, when a top-flight personal computer was equipped with 8MB of memory (that is MB, not GB) and a hard drive with 1GB (that is, one gigabyte), and the CPU ran about 100 times slower than today's norm.

No one will be surprised to learn that Dr. Breiman was also forward thinking when it came to parallelization of machine learning algorithms. In this succinct paper, Parallelizing CART Using a Workstation Network (with coauthor Phil Spector), Breiman discusses the essential concepts of parallelizing CART, which is remarkably straightforward. The paper goes on to report the results of several experiments run parallelizing CART over a network, observing that speed increases were at best disappointing, but that the techniques could be the only feasible way to deal with data too large to be stored on any one server. They conclude that substantial speed increases would be expected only on multi-CPU shared memory servers.

Download the pdf here.
Read More

Topics: CART, parallelization

Accurate results with limited data in CART and TreeNet

Posted by Dan Steinberg on Fri, Dec 7, 2012 @ 07:11 AM

How large a sample do I need? Or, can I achieve first-class results with just a few hundred training samples?

Read More

Topics: TreeNet, CART, sample size

How to Apply CART and Logistic Regression to Help Diagnose HIV

Posted by Heather Hinman on Thu, Nov 29, 2012 @ 09:13 AM

This presentation by Dr. Jason Haukoos was given at the 2012 Salford Analytics and Data Mining Conference (ADMC) in San Diego, CA. Enjoy!

Read More

Topics: CART, Logistic Regression

Using CART to Unravel Clusters in Asthma Databases

Posted by Heather Hinman on Wed, Nov 14, 2012 @ 01:11 PM

Video blog covering how CART is able to unravel clusters!

Read More

Topics: CART