Simply Salford Blog

What Type of Automation Does A Data Scientist Need?

Posted by Salford Systems on Fri, Jun 10, 2016 @ 07:00 AM

Cross-post from Dan Steinberg's blog on data mining automation. Dan's article discusses Salford Systems' approach to modeling automation, which is to assist the analyst as much as possible by anticipating the routine stages of model building. The goal is to speed up the decision making that goes into building a predictive model and to help avoid missing useful test measures and diagnostics. The goal is NOT to replace the data scientist, but to arrive at fast, accurate models!

The last thing most data scientists want is a machine that replaces them! The idea that we can build a machine to conduct sophisticated analyses from start to finish has been around for some time now, and new attempts surface every few years. The fully automated data scientist will be attractive to some organizations with no analytics experience whatsoever, but in more sophisticated organizations the promise of such automation is bound to be met with skepticism and worry. Can you imagine visiting a machine-learning-driven medical service, accepting a diagnosis and prescriptions, and even undergoing surgery with no human oversight involved? Even though pilots tell us that modern airplanes can be flown entirely by computer, few of us are ready to board a pilotless flight, and that remains true even as the driverless car makes impressive headway.

In our opinion, automation in predictive analytics is not just a luxury or a future hope; it is an essential component of our everyday modeling practice. The automation we develop for ourselves works its way into every release of our Salford Predictive Modeler. We look at this automation as a way to assist the human data scientist by doing what automation has always done best: relieving the data scientist of tedious, repetitive, and fairly simple tasks, such as rerunning a cross-validation many times with different random seeds and summarizing the results so that the learning from the experiment is immediately visible to the analyst. Today, some of our automated pipelines do indeed begin at a rather early stage of data exploration and drive all the way through to the delivery of a candidate deployable predictive model, encompassing on the order of 15 stages of data processing, remodeling, and automated decision making. We view this as a way to quickly assemble a collection of results that an experienced data scientist can review, critique, modify, and rerun on the way to a predictive model (or models) that is vetted by humans and can be trusted.
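To make that chore concrete, here is a minimal sketch (in Python with scikit-learn, not SPM itself) of the kind of automation described above: rerun cross-validation under a handful of random seeds and summarize the spread, so the stability of the estimate is visible at a glance. The data here is a synthetic stand-in.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import StratifiedKFold, cross_val_score

    # Synthetic stand-in data; in practice this would be the analyst's own file.
    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

    all_scores = []
    for seed in range(10):
        # Re-partition the folds and re-seed the learner on every repetition.
        cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
        model = RandomForestClassifier(n_estimators=200, random_state=seed)
        all_scores.extend(cross_val_score(model, X, y, cv=cv, scoring="roc_auc"))

    all_scores = np.array(all_scores)
    # One summary line, so the learning from the experiment is immediately visible.
    print(f"AUC over 10 seeds x 5 folds: mean {all_scores.mean():.3f}, "
          f"std {all_scores.std():.3f}, min {all_scores.min():.3f}, max {all_scores.max():.3f}")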

To a large extent, even running a single Random Forests model can be viewed as predictive modeling automation. The user has no need to worry about the issues that plague legacy statisticians, such as missing values, transformations of predictors, possible interaction effects, outliers in the predictors, or multicollinearity. Without some human oversight, however, there is a genuine risk of what one of my most experienced colleagues calls “blunders” that can cause enormous pain if not caught before deployment or before critical decisions are taken. Data science veterans know well of predictive models that went bad due to a mismatch between the training data and the data to which the models were later applied. Just today I discussed this issue with a client confronting such a mismatch: the medical training data was gathered in different regions of the world than the regions in which the model is intended to be used. We know that even how the data is collected will differ from region to region, and that data errors will be neither rare nor innocuous. The point of the exercise is to save lives, and we cannot accomplish that mission with routine modeling alone. In developing an automated system to predict sales of products promoted in a network of large grocery stores, we found products that appeared to violate the “law of demand” (all else being equal, higher prices should mean fewer units sold). Clearly, our system did not recommend increasing prices during special promotions. If such problems were rare exceptions, we could argue that full-on automation of predictive modeling is largely safe and effective, and that a few simple rules would catch the odd problem case. In our experience over more than two decades of predictive modeling, unexpected problems somewhere along the path from data acquisition to the final deployed model are the rule, not the exception.
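As an illustration of the kind of sanity check a human-guided pipeline might add, here is a hypothetical sketch (in Python with pandas, not part of any Salford product) that flags products whose estimated price response is positive, i.e., products that appear to sell more at higher prices. The file name and column names are placeholders, not the actual grocery data.

    import numpy as np
    import pandas as pd

    # "sales.csv" and its columns (product_id, price, units_sold) are placeholders.
    sales = pd.read_csv("sales.csv")

    def price_slope(group: pd.DataFrame) -> float:
        # Least-squares slope of units sold on price for one product.
        slope, _intercept = np.polyfit(group["price"], group["units_sold"], deg=1)
        return slope

    slopes = sales.groupby("product_id").apply(price_slope)
    suspect = slopes[slopes > 0]  # apparent violations of the law of demand
    print(f"{len(suspect)} of {len(slopes)} products appear to sell more at higher prices")
    print(suspect.sort_values(ascending=False).head(20))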

By no means am I arguing against a warm embrace of automation in data science and predictive modeling. We have been promoting such automation since we first released a commercial version of the CART decision tree in collaboration with Leo Breiman and his coauthors. (This was before many of today’s data scientists were even born.) We have been building progressively more automation into our SPM product and into the systems we have built for our clients over the years and we will continue to do so. One of our systems retrained itself on new data every six hours, spit out millions of predictions per day, and operated with no downtime for three years before it was retired in favor of more modern technology. The automation we are trying to build is a set of tools that allow data scientists to spend more time thinking about the problems they are trying to solve, to recognize possible problems that can impede their progress or damage the generalization power of their models, and to arrive at the needed results far faster than was ever possible, even a few years ago. However, at least for the present, we see the data scientist as a mandatory participant in the process and our job is to assist them.


 Check out Dan Steinberg's blog for more on the Salford Predictive Modeler®, data mining, and predictive analytics.


Read More

Topics: SPM, CART, data mining, data science, predictive modeling, Dan Steinberg, Leo Breiman, Salford Predictive Modeler

Salford Systems' CART Featured in New Predictive Analytics Book

Posted by Salford Systems on Wed, Mar 9, 2016 @ 09:03 AM

Eric Siegel’s Predictive Analytics: The Power to Predict Who Will Click, Buy, Lie, or Die is a nontechnical overview of modern analytics, with detailed discussion of how machine learning is being deployed across all industries and in all major corporations. Eric is a hugely entertaining writer and brings with him the expertise you would expect of a Columbia University-trained Ph.D. Geoffrey Moore writes that the book is “deeply informative,” and Tom Peters calls it “The most readable ‘big data’ book I’ve come across. By far.”

Read More

Topics: CART, classification, predictive modeling, classification trees, decision trees, regression trees, predictive analytics, decision tree

Super-Fast Data Scans With One-Split CART Trees [tutorial]

Posted by Dan Steinberg on Mon, May 5, 2014 @ 10:35 AM

Data analysts are always under pressure to get quick results regardless of the size or complexity of the data. In this brief note we show how to leverage the “Root Node Splits” report of a single-split CART tree to gain rapid insights into your data. Our example is based on fictitious, but highly realistic, financial data. The data set contains 264,578 records with several potential target variables. We illustrate our main points using the variable DEFAULT_90, which flags the 11.46% of customer records associated with being at least 90 days late on a payment; in total we have 50 variables available, which is of course far fewer than we would normally work with for such data.
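The “Root Node Splits” report itself is produced inside SPM. As a rough stand-in for readers without the software, the sketch below fits a one-split (depth-1) tree to each predictor in turn with scikit-learn and ranks the candidate root-node splits. The file path is a placeholder and the predictors are assumed numeric; only the DEFAULT_90 name is taken from the description above.

    import pandas as pd
    from sklearn.metrics import roc_auc_score
    from sklearn.tree import DecisionTreeClassifier

    # Placeholder path for the 264,578-record file; predictors assumed numeric here.
    df = pd.read_csv("financial.csv")
    y = df["DEFAULT_90"]                      # 0/1 flag: at least 90 days late
    predictors = [c for c in df.columns if c != "DEFAULT_90"]

    rankings = []
    for col in predictors:
        X = df[[col]].fillna(df[col].median())   # a one-split stump needs complete input
        stump = DecisionTreeClassifier(max_depth=1, min_samples_leaf=500, random_state=0)
        stump.fit(X, y)
        rankings.append((col, roc_auc_score(y, stump.predict_proba(X)[:, 1])))

    # The strongest single splits, ranked: a quick first scan of the data.
    for col, auc in sorted(rankings, key=lambda r: r[1], reverse=True)[:10]:
        print(f"{col:30s} one-split AUC = {auc:.3f}")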

Read More

Topics: CART

A Quick Overview of Unsupervised Learning in Salford SPM

Posted by Dan Steinberg on Tue, Feb 4, 2014 @ 06:30 AM

The SPM (Salford Predictive Modeler) software suite offers several tools for clustering and segmentation, including CART, Random Forests, and a classical statistical module, CLUSTER. In this article we illustrate the use of these tools with the well-known Boston Housing data set (pertaining to 1970s housing prices and neighborhood characteristics in the greater Boston area).
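SPM's CART- and Random Forests-based clustering is not reproduced here, but as a rough, hypothetical stand-in for the CLUSTER module, the sketch below standardizes the Boston Housing variables and runs k-means with scikit-learn, then profiles the resulting segments. The CSV path is a placeholder and the columns are assumed numeric.

    import pandas as pd
    from sklearn.cluster import KMeans
    from sklearn.preprocessing import StandardScaler

    # "boston.csv" is a placeholder for the Boston Housing data; all columns numeric.
    boston = pd.read_csv("boston.csv")
    X = StandardScaler().fit_transform(boston)   # put the variables on a common scale

    kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
    boston["segment"] = kmeans.labels_

    # Profile each segment by its average neighborhood characteristics.
    print(boston.groupby("segment").mean().round(2))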

Read More

Topics: SPM, Random Forests, CART, unsupervised learning, Cluster Analysis

Data Mining 101: A Beginners' Boot Camp

Posted by Heather Hinman on Tue, Jan 28, 2014 @ 04:12 AM

Let's get right to it! You're a beginner, and you want to know what is needed to start data mining and become an experienced data scientist overnight. We get it - this is the world we live in - quick and dirty. So here we go, take notes!

Read More

Topics: TreeNet, CART, data mining, predictive model, beginner, Data Prep

Probabilities in CART Trees (Yes/No Response Models)

Posted by Dan Steinberg on Tue, Oct 15, 2013 @ 12:43 PM

Probabilities in CART trees are quite straightforward and are displayed for every node in the CART navigator. Below we show a simple example from the KDD Cup ’98 data, predicting response to a direct mail marketing campaign.
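For readers without the CART navigator at hand, here is a minimal sketch of the underlying idea using scikit-learn: the probability reported in a node is simply the proportion of each class among the training records that land in that node. The KDD Cup ’98 file is not bundled here, so a synthetic yes/no target stands in for the mailing response.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.tree import DecisionTreeClassifier

    # Synthetic data with a rare "yes" class, standing in for mailing response.
    X, y = make_classification(n_samples=5000, n_features=8, weights=[0.95, 0.05],
                               random_state=0)
    tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

    leaf_ids = tree.apply(X)                     # which leaf each record falls into
    for leaf in np.unique(leaf_ids):
        n = tree.tree_.n_node_samples[leaf]      # records reaching this node
        value = tree.tree_.value[leaf][0]
        probs = value / value.sum()              # class proportions in the node
        print(f"node {leaf:2d}: n={n:4d}  P(no)={probs[0]:.3f}  P(yes)={probs[1]:.3f}")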

Read More

Topics: Battery, CART, classification

Using CART For Beginners With A Telco Example

Posted by Heather Hinman on Thu, Jul 18, 2013 @ 12:45 PM

Familiarize yourself with CART decision tree technology in this beginner's tutorial using a telecommunications example dataset from the 1990s. By the end of this tutorial you should feel comfortable using CART on your own with sample or real-world data.

Read More

Topics: CART, telecommunications, beginner, Tutorial

The History Behind Data Mining Train/Test Performance

Posted by Dan Steinberg on Tue, Jul 16, 2013 @ 12:56 PM

Updated: July 16, 2013

In their 1984 monograph, Classification and Regression Trees, Breiman, Friedman, Olshen and Stone discussed at length the need to obtain “honest” estimates of the predictive accuracy of a tree-based model. At the time the monograph was written, many data sets were small, so the authors took great pains to work out an effective way to use cross-validation with CART trees.

The result was a major advance for data mining, introducing ideas that at the time were radically new. The main point of the discussion was that the only way to avoid overfitting is to rely on test data. With plentiful data we can always reserve a portion for testing, but with smaller data sets we may have to rely on cross-validation. In either case, however, only the test or cross-validated results should be trusted. Earlier approaches, in contrast, had tended to look only at performance on the training data and to ignore testing altogether.
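A small illustration of the honest-estimate point, sketched with scikit-learn rather than CART itself: an unpruned tree scores nearly perfectly on the data it was grown on, while cross-validation gives a far more sobering, and far more trustworthy, number.

    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    # Noisy synthetic data, so an unpruned tree is guaranteed to overfit.
    X, y = make_classification(n_samples=1000, n_features=20, flip_y=0.2, random_state=0)

    tree = DecisionTreeClassifier(random_state=0)
    train_acc = tree.fit(X, y).score(X, y)               # resubstitution accuracy
    cv_acc = cross_val_score(tree, X, y, cv=10).mean()   # 10-fold cross-validated accuracy

    print(f"training accuracy:        {train_acc:.3f}")  # essentially perfect
    print(f"cross-validated accuracy: {cv_acc:.3f}")     # the honest number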

Watch This Tutorial on Train/Test Consistency in CART
 
Read More

Topics: TreeNet, CART, train and test data, Cross-Validation, tr

The Experienced Data Scientist's Guide To CART Decision Trees

Posted by Heather Hinman on Wed, Jul 10, 2013 @ 12:23 PM

This guide is for data mining practitioners or data scientists with experience using CART Classification and Regression Trees. Walk yourself through the SlideShare presentation for a more in-depth understanding of how CART decision trees can be implemented in today's data mining applications.

Read More

Topics: CART, classification trees, Tutorial

How to Utilize 'Out-Of-Bag' Predictions with Cross-Validation in CART

Posted by Dan Steinberg on Fri, Jun 21, 2013 @ 08:15 AM

When assessing predictive model performance with cross-validation, the model we obtain after all the computation is actually a model built on all of the data, that is, a model for which no data was reserved for testing. The standard test results reported for this all-data model are estimated and synthesized from the supplementary models built on parts of the data. Typically, the supplementary models are thrown away once they have served their purpose of helping us construct educated guesses about the future performance of the all-data model on new, previously unseen data.
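Here is a minimal sketch of those mechanics using scikit-learn (a stand-in for CART's cross-validation machinery, not SPM itself): each record's honest prediction comes from the fold model that did not see it, while the model actually kept is refit on all of the data.

    from sklearn.datasets import make_classification
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import cross_val_predict
    from sklearn.tree import DecisionTreeClassifier

    # Synthetic stand-in data.
    X, y = make_classification(n_samples=2000, n_features=15, random_state=0)
    model = DecisionTreeClassifier(max_depth=4, random_state=0)

    # Honest record-level predictions assembled from the supplementary fold models.
    oof_probs = cross_val_predict(model, X, y, cv=10, method="predict_proba")[:, 1]
    print(f"cross-validated AUC for the all-data model: {roc_auc_score(y, oof_probs):.3f}")

    # The model actually kept for deployment is then refit on every record.
    final_model = model.fit(X, y)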

Read More

Topics: OOB, CART, Cross-Validation