During the course of Salford Systems' 4-part webinar series "The Evolution of Regression," some very good questions from the audience have made their way to presenter Dr. Dan Steinberg, CEO and Founder. Here are a few responses that we thought would benefit everyone who is interested in regression, nonlinear regression, regularized regression, decision tree ensembles and post-processing techniques.
Q1: Does Huber L1|L2 hybrid error confer sparsity on the solution?
A: The choice of loss function in TreeNet (say, between variations of the Huber loss) is not, in and of itself, expected to have any impact on the sparsity of the final TreeNet model. It turns out that for TreeNet binary logistic models, opting for classification accuracy as the optimality criterion (for selecting the right-sized ensemble) usually results in the smallest number of trees in the model, whereas opting for log-likelihood (or cross-entropy) usually leads to the largest number of trees, with ROC close behind. For regression models, however, we have not noticed a pattern in optimal model size across the optimality criteria.
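For readers unfamiliar with the "L1|L2 hybrid" wording, here is a minimal sketch of the standard Huber loss in Python. The function names and the threshold name `delta` are our own choices for illustration and do not describe TreeNet's internal implementation. The point to notice is that the L1/L2 mix acts on the residuals, not on model coefficients, which is consistent with the loss not inducing sparsity by itself.

```python
def huber_loss(residual, delta=1.0):
    """Standard Huber loss for a single residual.

    Quadratic (L2-like) when |residual| <= delta, linear (L1-like) beyond it.
    `delta` is an illustrative name for the switch-over threshold.
    """
    a = abs(residual)
    if a <= delta:
        return 0.5 * residual ** 2
    return delta * (a - 0.5 * delta)


def huber_gradient(residual, delta=1.0):
    """Derivative of the loss with respect to the residual.

    This is the residual itself, clipped to [-delta, delta], so large
    outliers pull on the fit no harder than a residual of size delta.
    """
    return max(-delta, min(delta, residual))


if __name__ == "__main__":
    for r in (-3.0, -0.2, 0.5, 3.0):
        print(r, huber_loss(r), huber_gradient(r))
```

In a gradient boosting setting, the clipped gradient is what each new tree would be fit to, which is why the choice mainly affects robustness to outliers.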
Q2: Can we use Importance Sampled Learning Ensembles (ISLE) with Random Forests?
A: The post-processing of a complex model with regularized regression is a generic idea that can (and should) be applied to many data mining methods. We have applied it to the following:
- CART: Each internal node in the tree is represented as a -1/0/+1 indicator variable (-1 = go left, +1 = go right, 0 = does not pass through this node). Generalized PathSeeker (GPS) post-processing selects the nodes to keep (a form of pruning).
- MARS: Take the maximal model with the largest number of basis functions. GPS selects the basis functions to keep.
- TreeNet: Original version of ISLE to compress a TreeNet model
- Random Forests: GPS applied to the trees just as done for TreeNet. Compression is expected to be far less than for TreeNet as the trees are typically very different from each other. Nonetheless, the selection and reweighting can improve the performance of the forest on new unseen data.
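The selection and reweighting of trees described above can be sketched as follows. This is an illustration under stated assumptions, not the GPS algorithm: a plain lasso solved by coordinate descent stands in for GPS, there is no intercept, and the function names, the penalty value `lam`, and the toy predictions are all made up for the example. Each column of `tree_preds` holds one tree's predictions, and the lasso decides which trees receive a nonzero weight.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam; values within [-lam, lam] become 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_reweight(tree_preds, y, lam, n_sweeps=100):
    """Lasso over tree outputs via cyclic coordinate descent.

    tree_preds[i][j] is the prediction of tree j for record i.
    Minimizes (1/2n) * sum((y - X w)^2) + lam * sum(|w_j|), no intercept.
    Returns one weight per tree; a weight of 0.0 means the tree is dropped.
    """
    n, m = len(y), len(tree_preds[0])
    cols = [[row[j] for row in tree_preds] for j in range(m)]
    w = [0.0] * m
    resid = list(y)  # residual y - X w, kept up to date as w changes
    for _ in range(n_sweeps):
        for j, col in enumerate(cols):
            norm = sum(c * c for c in col) / n
            if norm == 0.0:
                continue
            # Correlate column j with the residual that excludes tree j.
            rho = sum(c * (r + w[j] * c) for c, r in zip(col, resid)) / n
            new_w = soft_threshold(rho, lam) / norm
            step = new_w - w[j]
            if step != 0.0:
                resid = [r - step * c for r, c in zip(resid, col)]
                w[j] = new_w
    return w


if __name__ == "__main__":
    y = [1.0, -1.0, 2.0, -2.0]
    # Toy ensemble: tree 0 tracks y, tree 1 is a cruder version, tree 2 is unrelated.
    tree_preds = [
        [1.0, 1.0, 1.0],
        [-1.0, -1.0, 1.0],
        [2.0, 1.0, -1.0],
        [-2.0, -1.0, -1.0],
    ]
    weights = lasso_reweight(tree_preds, y, lam=0.1)
    kept = [j for j, wj in enumerate(weights) if wj != 0.0]
    print("weights:", weights)
    print("trees kept:", kept)
```

On this toy data only tree 0 survives, with a weight shrunk slightly below 1. With a real forest the trees are far less redundant, so the same procedure would typically keep many more of them.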
As for the idea of using MARS to do post-processing, this would be appealing in the type of hybrid model in which we pool the tree-generated features (either the final outputs of the trees, one feature per tree, or node indicators such as whether a record passed through a given node, yes or no) with the original raw variables. The raw variables can cause problems for GPS (outliers, erroneous coding, missing values), and MARS is far more adept at handling such problems, so using MARS in this context is potentially useful.
Salford Systems introduced just such a hybrid in 1997, in which the terminal nodes of a single CART tree are introduced into a logistic regression along with other raw variables (related papers appear on our website). The motivation for using MARS then was to allow MARS to group similar terminal nodes together (a typical byproduct of a MARS model) and also to allow MARS to discover interactions between terminal nodes and raw variables, which would define subgroup-specific regressions.
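As a rough illustration of how such a hybrid input might be assembled, the sketch below turns the terminal node of a single tree into dummy variables and appends the raw variables. The three-leaf tree, the variable names `age` and `income`, and the split values are all hypothetical. In practice the tree would come from a fitted CART model, and the resulting rows would be passed to a logistic regression or MARS.

```python
N_LEAVES = 3
RAW_VARS = ["age", "income"]  # hypothetical raw predictors


def terminal_node(record):
    """Hypothetical hand-written tree: returns the leaf index for a record."""
    if record["age"] < 30:
        return 0
    if record["income"] < 50000:
        return 1
    return 2


def hybrid_row(record):
    """One design-matrix row: terminal-node dummies followed by raw variables."""
    leaf = terminal_node(record)
    dummies = [1 if k == leaf else 0 for k in range(N_LEAVES)]
    raw = [record[v] for v in RAW_VARS]
    return dummies + raw


if __name__ == "__main__":
    records = [
        {"age": 25, "income": 80000},
        {"age": 40, "income": 30000},
        {"age": 55, "income": 90000},
    ]
    for rec in records:
        print(hybrid_row(rec))
```

If the downstream regression includes an intercept, one of the node dummies would normally be dropped to avoid perfect collinearity.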
Q3: Weren't there predecessors to CART?
A: A review of the bibliography in "Classification and Regression Trees" will reveal citations to work in decision tree technology that came both before and after Friedman's first draft of what became CART.
Morgan and Sonquist's (1963) paper on Automatic Interaction Detection (AID) was a notable predecessor. Its flaws were so severe that the statistical community became quite hardened against the idea of tree-based analytics. Against this backdrop, the CART authors faced a steep uphill battle to have other researchers take their ideas seriously. One big difference between AID and the work of Friedman at Stanford (in 1975), Breiman (in Los Angeles as an independent consultant), and Stone at UCLA (in 1975) was the latter's emphasis on nonparametric concepts from the get-go. Hypothesis testing and conventional statistical testing concepts were not part of the mechanism that became CART.
Morgan and Messenger's 1973 THAID did introduce a precursor of the "twoing" splitting rule found in CART. CHAID, an attempt to repair AID that was introduced in 1980 by Kass, also relies heavily on statistical testing notions.
For a review of the innovations found in CART, such as pruning, cost-sensitive learning, and missing value handling, see my article in "Top 10 Algorithms in Data Mining" (be sure to look for the book and not the super-brief article).
Q4: Is there an API for the SPM software?
A: Salford Systems is hard at work on shared library and DLL versions of our data mining engines to permit them to be embedded in user-created systems.
TreeNet will be the first to be ready, with expected beta release by June 1, 2013.
The TreeNet engine supports:
- ONE-TREE models, meaning it can be used as a quite effective single decision tree engine
- Random Forests mode
- Multi-core functionality allowing for faster processing
MARS and GPS are expected to follow as shared libraries in September 2013.
Part 4 of the series is on Friday, April 12 at 10am PST. We hope to see you there!