Simply Salford Blog

Diary of a Data Scientist - Inside the Mind of a Statistician

Posted by Charles Harrison on Wed, Jun 29, 2016 @ 07:00 AM

Cross-post from Diary of a Data Scientist, a first-hand account of the life of a data scientist, sharing the struggles, triumphs, and day-to-day perspective of a technical research professional. Click to subscribe to the Diary of a Data Scientist Blog here.


Read More

Topics: data science, predictive modeling, statistician, diary of a data scientist

What Type of Automation Does A Data Scientist Need?

Posted by Salford Systems on Fri, Jun 10, 2016 @ 07:00 AM

Cross-post from Dan Steinberg's Blog on data mining automation. Dan's article discusses Salford Systems' approach to modeling automation, which is to assist the analyst as much as possible by anticipating the routine stages of model building. The goal is to speed up the decision-making process that goes into building a predictive model and to help avoid missing useful test measures and diagnostics. The goal is NOT to replace the data scientist, but to achieve fast and accurate models!

The last thing most data scientists want is a machine that replaces them! The idea that we can build a machine to conduct sophisticated analyses from start to finish has been around for some time now, and new attempts surface every few years. The fully automated data scientist is going to be attractive to some organizations with no analytics experience whatsoever, but for more sophisticated organizations the promise of such automation is bound to be met with skepticism and worry. Can you imagine visiting a machine-learning-driven medical service, accepting a diagnosis and prescriptions, and even undergoing surgery with no human oversight involved? Today, even though pilots say that airplanes can be 100% computer flown, few of us are ready to take a pilotless airplane ride, even as the driverless car appears to be making impressive headway.

In our opinion, automation in predictive analytics is not just a luxury or a future hope. It is an essential component of our everyday modeling practice. The automation we develop for ourselves works its way into every release of our Salford Predictive Modeler. We look at this automation as a way to assist the human data scientist by doing what automation has always done best: relieving the data scientist of tedious, repetitive, and fairly simple tasks, such as rerunning a cross-validation many times using different random seeds and summarizing the results so that the learning from the experiment is immediately visible to the analyst. Today, some of our automated pipelines do indeed begin from a rather early stage in data exploration and drive all the way through to the delivery of a candidate deployable predictive model, encompassing on the order of 15 stages of data processing, remodeling, and automated decision making. We view this as a way to quickly assemble a collection of results that an experienced data scientist can review, critique, modify, and rerun on the way to arriving at a predictive model (or models) that is vetted by humans and can be trusted.
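To make the cross-validation example concrete, here is a minimal sketch of the idea in plain Python. It is not how the Salford Predictive Modeler does it; the synthetic data, the one-variable least-squares model, and every function name below are assumptions chosen only to keep the illustration self-contained. The point is the pattern: rerun the same k-fold cross-validation under several random seeds, then hand the analyst a compact summary of how much the score moves.

```python
import random
import statistics


def make_data(n=200, seed=0):
    """Synthetic regression data, y = 3x + noise (purely illustrative)."""
    rng = random.Random(seed)
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [3.0 * x + rng.gauss(0, 2.0) for x in xs]
    return xs, ys


def fit_line(xs, ys):
    """Ordinary least squares with one predictor; returns (slope, intercept)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def cv_rmse(xs, ys, k, seed):
    """One k-fold cross-validation run; the seed controls fold assignment."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    sq_errors = []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        slope, intercept = fit_line([xs[i] for i in train],
                                    [ys[i] for i in train])
        sq_errors.extend((ys[i] - (slope * xs[i] + intercept)) ** 2
                         for i in fold)
    return (sum(sq_errors) / len(sq_errors)) ** 0.5


def repeated_cv(xs, ys, k=5, seeds=range(10)):
    """Rerun cross-validation once per seed and summarize the spread."""
    scores = [cv_rmse(xs, ys, k, s) for s in seeds]
    return {
        "runs": len(scores),
        "mean": statistics.fmean(scores),
        "stdev": statistics.stdev(scores),
        "min": min(scores),
        "max": max(scores),
    }


if __name__ == "__main__":
    xs, ys = make_data()
    for name, value in repeated_cv(xs, ys).items():
        print(f"{name:>5}: {value}")
```

A wide gap between the minimum and maximum scores would suggest that a single cross-validation run is too noisy to base a decision on, which is exactly the kind of signal the analyst should see at a glance rather than assemble by hand.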

To a large extent, the running of even a single Random Forests model can be viewed as predictive modeling automation. The user has no need to worry about the issues that plague legacy statisticians, such as missing values, transformations of predictors, possible interaction effects, outliers in the predictors, or multicollinearity. However, without some human oversight there is a genuine risk of what one of my most experienced colleagues refers to as “blunders” that can cause enormous pain if not caught before deployment or before critical decisions are taken. Data science veterans know well of predictive models that went bad due to a mismatch between the training data and the data to which the models were to be applied. Just today I discussed this issue with a client confronting such a mismatch; the medical training data was gathered in different regions of the world from those in which the model is intended to be used. We know that even the way the data is collected will differ from one part of the world to another, and data errors will be neither rare nor innocuous. The point of the exercise is to save lives, and we cannot accomplish our mission with just routine modeling. In developing an automated system to predict sales of products promoted in a network of large grocery stores, we found products that appear to violate the “law of demand” (higher prices cause lower units sold, everything else being equal). Clearly, our system did not recommend increasing the prices during special promotions. If such problems were rare exceptions, we could argue that full-on automation of predictive modeling could be largely safe and effective and that a few simple rules might help us catch the odd problem cases. In our experience of more than two decades of predictive modeling, unexpected problems in some part of the process leading from data acquisition to the final deployed model are the rule and not the exception.
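As an illustration of the kind of simple rule that can surface the “law of demand” cases mentioned above, here is a rough sketch in plain Python. It is not the system described in this post; the row layout, the product names, the numbers, and the function names are all hypothetical. For each product it estimates the slope of log(units sold) on log(price) and flags any product whose slope comes out positive.

```python
import math
from collections import defaultdict

# Hypothetical (product, price, units_sold) observations, invented for the demo.
EXAMPLE_ROWS = [
    ("cereal", 4.0, 120), ("cereal", 3.5, 150),
    ("cereal", 3.0, 190), ("cereal", 2.5, 240),
    ("salsa", 2.0, 80), ("salsa", 2.5, 95),
    ("salsa", 3.0, 110), ("salsa", 3.5, 130),
]


def log_log_slope(prices, units):
    """Least-squares slope of log(units) on log(price), a crude elasticity."""
    if len(set(prices)) < 2:
        return None  # price never varied, so no slope can be estimated
    lx = [math.log(p) for p in prices]
    ly = [math.log(u) for u in units]
    mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
    sxx = sum((x - mx) ** 2 for x in lx)
    sxy = sum((x - mx) * (y - my) for x, y in zip(lx, ly))
    return sxy / sxx


def flag_demand_violations(rows, min_obs=4):
    """Return {product: slope} for products whose estimated slope is positive."""
    by_product = defaultdict(list)
    for product, price, units in rows:
        if price > 0 and units > 0:  # logarithms need positive values
            by_product[product].append((price, units))
    flagged = {}
    for product, obs in by_product.items():
        if len(obs) < min_obs:
            continue  # too few observations to say anything
        slope = log_log_slope([p for p, _ in obs], [u for _, u in obs])
        if slope is not None and slope > 0:
            flagged[product] = slope
    return flagged


if __name__ == "__main__":
    for product, slope in flag_demand_violations(EXAMPLE_ROWS).items():
        print(f"review {product}: estimated slope {slope:+.2f}")
```

A positive slope here is only a prompt for human review, not proof of a violation: the check ignores the “everything else being equal” condition, so promotions, seasonality, or data errors could all produce the same pattern.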

By no means am I arguing against a warm embrace of automation in data science and predictive modeling. We have been promoting such automation since we first released a commercial version of the CART decision tree in collaboration with Leo Breiman and his coauthors. (This was before many of today’s data scientists were even born.) We have been building progressively more automation into our SPM product and into the systems we have built for our clients over the years, and we will continue to do so. One of our systems retrained itself on new data every six hours, spit out millions of predictions per day, and operated with no downtime for three years before it was retired in favor of more modern technology. The automation we are trying to build is a set of tools that allow data scientists to spend more time thinking about the problems they are trying to solve, to recognize possible problems that can impede their progress or damage the generalization power of their models, and to arrive at the needed results far faster than was possible even a few years ago. However, at least for the present, we see the data scientist as a mandatory participant in the process, and our job is to assist them.


 Check out Dan Steinberg's blog for more on the Salford Predictive Modeler®, data mining, and predictive analytics.


Read More

Topics: SPM, CART, data mining, data science, predictive modeling, Dan Steinberg, Leo Breiman, Salford Predictive Modeler

How Data Science Can Help Us Discover Our Planet’s History

Posted by Kimberly Fahrnkopf on Wed, Oct 15, 2014 @ 06:55 AM

To see how data science can help us discover the Earth’s history, it is important first to know about the Gaia Hypothesis.

Read More

Topics: data mining, data science, predictive modeling, machine learning

Data Science in Biology: A Few Problems & Solutions [guest post]

Posted by Kimberly Fahrnkopf on Thu, Sep 11, 2014 @ 10:10 AM

Guest post by Grant Humphries, Postdoctoral Researcher, University of California, Davis

Read More

Topics: TreeNet, data mining, big data, data science, predictive modeling, data analysis

Predicting Shifts in El Niño Using Birds & Data Mining

Posted by Kimberly Fahrnkopf on Wed, Sep 3, 2014 @ 10:38 AM

Dr. Grant Humphries, from the Zoology department at the University of Otago, New Zealand, has spent the last three years studying how a bird species called Sooty Shearwaters can help predict upcoming El Niño occurrences. After much time and research, he has figured out a way to do so using data mining.

Read More

Topics: TreeNet, data mining, Variable Importance, big data, data science, prediction, predictive modeling, predictive model

Choosing Your Own Preferred MARS Model

Posted by Dan Steinberg on Wed, Aug 20, 2014 @ 09:46 AM

When MARS develops a model it actually develops many and presents you with the one that it judges best based on a self-testing procedure.  But the so-called MARS optimal model may not be satisfactory from your perspective.  It might be too small (include too few variables), too large (include too many variables), too complex (include too many splines, basis functions, or breaks in variables), or otherwise not to your liking based on your domain knowledge. So what can you do to override the MARS process?

Read More

Topics: data mining, Variable Importance, MARS, data science, predictive modeling, predictive model, data analysis, Dan Steinberg, statistics, machine learning

Musings on Becoming a Data Scientist [guest post]

Posted by Heather Hinman on Fri, May 9, 2014 @ 07:25 AM

Guest Post by Scott Terry, Rapid Progress Marketing and Modeling, LLC 

Read More

Topics: big data, data science

Why Data Scientists Split Data into Train and Test

Posted by Dan Steinberg on Mon, Mar 3, 2014 @ 07:47 AM

Newcomers to data science frequently wonder why we insist on partitioning data into separate roles of learn (aka train) and test rather than just working with all of the data. As we have recently received a number of questions related to this topic, we decided to put together a series of blog posts to help clarify the topic and the issues.

Read More

Topics: train and test data, data science

6 LinkedIn Groups Every Data Scientist Should Join

Posted by Heather Hinman on Fri, Jan 17, 2014 @ 11:00 AM

Staying informed and up-to-date on the latest industry news doesn't need to be a chore. Have the latest news and trending topics delivered right to your inbox through LinkedIn Groups.

Read More

Topics: data mining, data science

A Data Science Prediction for 2014 [In 120 Words]

Posted by Heather Hinman on Mon, Jan 6, 2014 @ 09:51 AM

January is commonly a time to reflect on the past and make predictions about the future. To each his own, I suppose, but I’m confident that many of us will agree on a few common themes for 2014.

Read More

Topics: data science, prediction
