This blog post is extracted from one of Salford Systems' video tutorial lectures offered by Dan Steinberg. Take some time out of your day to improve your knowledge of CART.
There are several tricks available for maneuvering CART into generating a single tree structure that will output predictions for several different target (dependent) variables in each terminal node. For CART the idea seems very natural, in that the structure of the model is just a segmentation of the data into mutually exclusive and collectively exhaustive segments. If the segments of a CART tree designed for one target variable have been well constructed, then those segments could easily be relevant for the prediction of many outcomes. A segmentation (CART tree) based on common demographics and Facebook likes, for example, could be used to predict consumption of tuna fish, frequency of cinema visits, and monthly hair stylist spend. Of course, the question is: could a common segmentation in fact be useful for three such diverse behaviors, and, if such a segmentation existed, would we be able to find it?
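To make the idea concrete, here is a minimal sketch using scikit-learn's DecisionTreeRegressor as a stand-in for CART (the data, variable names, and the second outcome are all invented for illustration): grow the tree on one target, then treat its terminal nodes as a fixed segmentation and score any other outcome by its per-leaf mean.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))                            # demographic-style predictors
y_primary = X[:, 0] + rng.normal(scale=0.3, size=500)    # target used to grow the tree
y_other = 2 * X[:, 0] + rng.normal(scale=0.3, size=500)  # second outcome, never shown to the tree

tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y_primary)

# The leaf assignments define the segmentation; any outcome can then be
# scored by its mean within each terminal node.
leaves = tree.apply(X)
leaf_means = {leaf: y_other[leaves == leaf].mean() for leaf in np.unique(leaves)}
y_other_pred = np.array([leaf_means[leaf] for leaf in leaves])
```

Because the second outcome here happens to depend on the same predictor that drives the primary target, the borrowed segmentation predicts it well; for genuinely unrelated behaviors the per-leaf means would collapse toward the overall mean, which is exactly the question raised above.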
CART in its classification role is an excellent example of "supervised" learning: you cannot start a CART classification analysis without first selecting a target or dependent variable. All partitioning of the data into homogeneous segments is guided by the primary objective of separating the target classes. If the terminal nodes are sufficiently pure in a single target class the analysis will be considered successful even if two or more terminal nodes are very similar on most predictor variables.
In this blog post I'll address the CART tree sequence. CART follows a forward-growing and backward-pruning process to arrive at the optimal tree. In the process CART generates for us not just one model, but a collection of progressively simpler models. This collection of models is known as the "tree sequence." In this article I will explain the forward and backward tree generation process. I will also discuss how a modeler might use judgment to select a near-optimal tree that might be better for deployment than the so-called optimal tree. (This blog post is a transcript of the video below.)
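The same grow-then-prune idea appears in scikit-learn as minimal cost-complexity pruning, which can serve as a rough sketch of the tree sequence (synthetic data; this is not Salford's CART implementation): grow the maximal tree, then walk back through the pruning path, obtaining one progressively simpler tree per complexity penalty.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Grow the maximal tree, then recover the pruning path: each alpha
# corresponds to one member of the "tree sequence."
full = DecisionTreeClassifier(random_state=0).fit(X, y)
path = full.cost_complexity_pruning_path(X, y)

sequence = [DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X, y)
            for a in path.ccp_alphas]
sizes = [t.tree_.node_count for t in sequence]   # monotonically shrinking trees
```

In practice the "optimal" member of the sequence is chosen by test-set or cross-validated performance, and a modeler may deliberately step back to a slightly smaller tree with nearly the same accuracy for easier deployment.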
Experienced users of decision trees have long appreciated that decision trees in general are often not impressive performers when it comes to regression. This does not in the least suggest that regression trees are not valuable analytical tools. As always, they are fabulous for gaining insight into data, making rapid out-of-the-box progress even when working with highly flawed data, detecting hidden but important flaws in the data, and identifying valuable predictors. Regression trees are among the most useful tools during exploratory data analysis, when the modeler is struggling to understand the data and elicit the dominant predictive patterns. This is especially true when the data is strewn with missing values, as the CART regression tree user will not need to do any special data preparation to deal with them: CART will handle the missing values effectively. But regression trees (at least single regression trees) often yield lower predictive accuracy than other methods, in part because they generally produce a rather limited number of distinct predictions. All records falling into a specific terminal node of a regression tree share the same prediction, lumping all modestly similar records into the same predictive bucket. Regression trees suffer from one further problem that is rarely appreciated: because the criterion used to build the model is the same as the criterion used to assess the performance of the model, regression trees have an enhanced tendency to overfit the training data. (More on this latter point later.)
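The "limited number of distinct predictions" point is easy to demonstrate: a regression tree of depth d can emit at most 2^d distinct values, one per terminal node, no matter how many records it scores. A minimal sketch using scikit-learn's DecisionTreeRegressor as a stand-in for CART (invented data):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, size=(1000, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=1000)

# A depth-4 tree has at most 2**4 = 16 terminal nodes, so at most
# 16 distinct predicted values for all 1000 training records.
tree = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X, y)
preds = tree.predict(X)
n_distinct = len(np.unique(preds))
```

A smooth target like a sine curve must therefore be approximated by a coarse staircase, which is one source of the accuracy gap relative to methods that produce continuous predictions.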
CART users often ask where they can find the value of the R‐squared for their regression trees. The answer is very simple: in conventional statistics.
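Whatever a given package chooses to report, the conventional R-squared can always be computed directly from the tree's predictions. A minimal sketch (the helper name is mine):

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Conventional R^2: 1 - SS_residual / SS_total."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot
```

Perfect predictions give R^2 = 1, while simply predicting the overall mean gives 0; for a regression tree, y_pred is just the terminal-node mean assigned to each record.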
No one will be surprised to learn that Dr. Breiman was also forward thinking when it came to parallelization of machine learning algorithms. In this succinct paper, Parallelizing CART Using a Workstation Network (with coauthor Phil Spector), Breiman discusses the essential concepts of parallelizing CART, which is remarkably straightforward. The paper goes on to report the results of several experiments parallelizing CART over a workstation network, observing that the speed increases were at best disappointing, but that the techniques could be the only feasible way to deal with data too large to be stored on any one server. The authors conclude that substantial speed increases would be expected only on multi-CPU shared-memory servers.
Download the pdf here.
How Large a Sample Do I Need? Or, Can I Achieve First-Class Results with Just a Few Hundred Training Samples?
This presentation by Dr. Jason Haukoos was given at the 2012 Salford Analytics and Data Mining Conference (ADMC) in San Diego, CA. Enjoy!
Video blog covering how CART is able to unravel clusters!