Simply Salford Blog

Webinar Recap: 3 Ways to Improve Regression, Part 1

Posted by Kaitlin Onthank on Thu, Jan 21, 2016 @ 09:19 AM

Did you miss our webinar yesterday? It's never too late to register to get the recording.

Read More

Topics: RandomForests, Random Forests, stochastic gradient boosting, Nonlinear Regression, Regression

9 Data Mining Challenges From a Data Scientist Like You

Posted by Salford Systems on Tue, Jan 19, 2016 @ 07:00 AM

Data mining has a plethora of challenging aspects. Some of these challenges are common among nearly all data scientists, analysts, and predictive modelers while others are more industry-specific. Nevertheless, we all run into a snag here and there (hopefully more like there, not here) and it can be a trying task to overcome our day-to-day or project-to-project challenges.

Read More

Topics: command line, sample size, big data, GUI, missing values, data analysis, data mining in education

Top LinkedIn Groups Every Data Scientist Should Join

Posted by Salford Systems on Thu, Jan 14, 2016 @ 08:00 AM

Did you know that the hottest skill on LinkedIn that resulted in hiring is Statistical Analysis and Data Mining? Or that there are one million professional post publishers on LinkedIn?*

LinkedIn is quickly becoming the go-to space for influencers, bloggers, thinkers, and readers, especially in the world of data science. While keeping up with all of the latest buzz in the industry may seem impossible, subscribing to LinkedIn groups lets you read about all of these topics from top minds in the field (or from your average Joe!) at your own leisure. You can control email alerts, contribute to the discussion yourself, peruse job postings, sign up for free webinars, and much more. 

Below are some of the top LinkedIn Groups that we think every Data Scientist should join:

 

Read More

Topics: data mining

Podcast Recap: Dan Steinberg's Early Days

Posted by Nicole Finzi on Tue, Jan 12, 2016 @ 08:00 AM

Did you miss our podcast with Dan Steinberg yesterday? Subscribe to our podcast, Afternoon Analytics, for instant notification reminders! It's never too late to go back and listen! 

Read More

Topics: data mining, data analysis

Trending Data Topics: 2015 vs. 2016

Posted by Kaitlin Onthank on Thu, Jan 7, 2016 @ 08:00 AM

It's no secret that data science popped up on every corner in 2015. We've already seen numerous articles on the hottest topics of the year as well as predictions for 2016, so we thought we'd weigh in on the subject! Although it goes without saying, "data science" is at its peak popularity and we have the proof:

While we know "data science" was a buzzword (buzz-phrase?) of the year, there were certain aspects of the phenomenon that dominated the blogosphere. Below, we'll cover three trending topics in data science: what they are, how they exploded in 2015, and where we see them in 2016. But first, a quick poll...

 

Big Data 

‘Big Data’ is a term that you’ve probably been hearing for some time now. You might even be sick of hearing it. It is a broad term that gets tossed around easily, but in its simplest definition it often refers to “data sets so large or complex that traditional data processing applications are inadequate” (Wikipedia). What does this mean? The data may be so large or so complex that it cannot be processed or analyzed with traditional methods. For instance, you may find that R, which by default holds its data in memory, struggles to build a model on many millions of observations. Three factors form the basis for big data: volume, variety, and velocity (Wikipedia).

Another term that you may have heard in conjunction with big data is Hadoop. Hadoop is a framework for storing and processing data across clusters of commodity hardware. So if you’re panicking over how to handle massive amounts of data, or you lack the systems to do so, take this advice from our Senior Scientist, Mikhail Golovnya:

Proper sampling of a 'big' dataset can yield models just as good as or even better than models using all of the data. You don’t need to use every data point available to you.
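To make the sampling advice concrete, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the data is synthetic (a straight line plus noise), the model is a one-variable least-squares fit, and the helper names `fit_line`, `mse`, and `make_row` are hypothetical. It fits one model on all 100,000 training rows and another on a 1% simple random sample, then scores both on the same holdout set.

```python
import random

def fit_line(points):
    """Ordinary least squares for y = slope * x + intercept on (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def mse(model, points):
    """Mean squared error of a (slope, intercept) model on (x, y) pairs."""
    slope, intercept = model
    return sum((y - (slope * x + intercept)) ** 2 for x, y in points) / len(points)

rng = random.Random(0)  # fixed seed so the run is repeatable

def make_row():
    # Synthetic data (an assumption for this sketch): y = 3x + 2 plus unit noise.
    x = rng.uniform(0, 10)
    return x, 3.0 * x + 2.0 + rng.gauss(0, 1.0)

train = [make_row() for _ in range(100_000)]
holdout = [make_row() for _ in range(10_000)]

# Simple random sample without replacement: 1% of the training rows.
sample = rng.sample(train, 1_000)

full_model = fit_line(train)
sample_model = fit_line(sample)

full_mse = mse(full_model, holdout)
sample_mse = mse(sample_model, holdout)

print(f"full data   ({len(train)} rows): holdout MSE = {full_mse:.4f}")
print(f"1% sample   ({len(sample)} rows): holdout MSE = {sample_mse:.4f}")
```

On data this simple the two holdout errors should land very close together. Real data sets with rare events or many predictors may need a larger or stratified sample, so treat this as a starting point rather than a rule.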

While we could talk about big data for days, we'll leave you to draw your own conclusions about the necessity of big data applications. But be sure to stay tuned for an in-depth blog post and webinar by CEO Dan Steinberg on whether you need big data or just enough data.

 

Open Data

Open data refers to data that can be used, re-used, and distributed freely by anyone. 2015 saw a surge in the amount of publicly available open data. This movement has been extremely beneficial for a few reasons:

  • Small companies (especially non-profits) are taking advantage by releasing their data to the public and challenging data scientists around the world to solve their analytics problems
  • Larger companies are using platforms like Kaggle to both educate and hire data scientists
  • Data scientists-in-training (and students) now have access to hundreds of thousands of data sets to learn from

On the other side of this movement is the issue of data privacy. Most companies are hesitant to release data due to confidentiality, which is why a good portion of open datasets out there are actually sanitized. Sanitized data is transformed and disguised to the point where it merely mimics the original. Analysis is still possible but variable names and values have been changed to protect the owners. Keep this in mind when drawing conclusions pertaining to open data!

Here are a few of our favorite open data sites:

http://www.data.gov/

http://archive.ics.uci.edu/ml/

 

Data Scientists

A common controversy in the land of data science this year is the definition of those who practice in the field. So-called "data scientists" are supposedly paid too much, paid too little, don't have enough programming skills, don't have enough decision-making skills, and so on. The problem is that no single role can cover the expanse of responsibilities in data science, but we're all using the same term for everyone in the field! We're starting to see statisticians venture into data mining and programmers venture into analytics, causing a sudden explosion of this versatile new job title.

According to VentureBeat [1], the data scientist career path is expected to grow by 18.7% between 2010 and 2020. Many universities across the country are now offering degrees in data science as well as adding entire departments devoted to this field.

This seems like a good point to bring up a related poll I came across the other day. On KDnuggets, half of the voters said they believe data scientists will be automated out of a job by 2025. At Salford Systems, we couldn't disagree more. Every day we come across situations where automation cannot replace the human ability to analyze a problem and reach a conclusion. Software can be extremely powerful, but there is always a need for intelligent decision-making apart from a computer.

We fully expect to see the rise of the Data Scientist continue in 2016. In fact, data scientists in every department at sizable companies could be the norm within the next few years.

 

[1] http://venturebeat.com/2013/11/11/data-scientists-needed/

 

Feel free to leave your thoughts below or email us at mlq@salford-systems.com with questions and comments!

Read More

Welcome to Simply Salford!

Posted by Salford Systems on Tue, Jan 5, 2016 @ 08:00 AM

Simply Salford is a lighter and less technical read for people of all backgrounds. You don’t have to be a statistician to enjoy what we are talking about. You can expect to read about everything from trending topics in data to common technical support questions. We want to hear from you! Email us at mlq@salford-systems.com with questions, comments, or topics that you want to hear about.

Read More

Forecasting with Predictive Analytics

Posted by Eric Lee on Tue, Sep 29, 2015 @ 10:54 AM

Did you have a chance to check out our latest webinar on forecasting with analytics? It's available on-demand. View the recording and learn how data can be utilized quickly in accurate and actionable models. The slides, data set, software, and a step-by-step tutorial are all available here as well.

Forecasting with Predictive Analytics recorded webinar

Read More

Joint Statistical Meetings 2015 - Computer Technology Workshop Presentations

Posted by Eric Lee on Wed, Aug 19, 2015 @ 09:46 AM

Were you able to attend our Computer Technology Workshops at JSM 2015 in Seattle, WA this year? If so, we hope you enjoyed the sessions and were able to walk away with additional insight on data mining and modeling.

Read More

Enter a KDD Cup or Kaggle Competition. You don’t need to be an expert!

Posted by Eric Lee on Mon, Jun 22, 2015 @ 08:36 AM

Learn how TreeNet, AKA stochastic gradient boosting, can be used to quickly achieve a place within the top 5 standings of the 2009 KDD Cup competition.

Read More

TreeNet Gradient Boosting and CART Decision Trees: A Winning Combination

Posted by Kimberly Fahrnkopf on Thu, Apr 9, 2015 @ 10:14 AM

6 Reasons to Combine CART and TreeNet:

#1 Build predictive models quickly: One advantage of CART is that it can build models relatively fast.

#2 Incorporate all types of variables: Your model can include numeric, binary, and categorical variables, including variables with missing values.

#3 Interpretable model representation: CART’s easy-to-understand decision tree graphics will make your job easy when explaining the model to your boss! All you have to do is print it out!

#4 Maintain model stability: One of TreeNet’s top advantages is that it retains a stable model because its prediction combines the responses of many small decision trees, something that is difficult to achieve with a single CART tree.

#5 Produce a high interaction order model: TreeNet allows precise control over interactions among multiple variables.

#6 Include ALL variables: In CART, relatively few predictors make it into the model, but with TreeNet each tree works with the entire data set, so variables have many opportunities to enter the model.

When you combine TreeNet and CART, you maintain the simplicity of CART while overcoming its challenges with TreeNet gradient boosting.
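For readers curious about what sits behind the comparison above, here is a toy sketch of the general idea in plain Python. It is not TreeNet or CART, and it makes no claim about how either product is implemented. It assumes a single numeric predictor, synthetic data, squared-error loss, and one-split trees ("stumps"); the names `fit_stump`, `boost`, and `make_data` are hypothetical. It compares one small tree on its own with a stochastic gradient boosting ensemble of many small trees, each fitted to the current residuals on a random half of the data.

```python
import math
import random

def fit_stump(xs, ys):
    """Best single-split regression tree on one feature (least squares)."""
    n = len(xs)
    order = sorted(range(n), key=lambda i: xs[i])
    total = sum(ys)
    best = None  # (score, threshold, left_mean, right_mean)
    left_sum = 0.0
    for k in range(1, n):
        i = order[k - 1]
        left_sum += ys[i]
        if xs[order[k]] == xs[i]:
            continue  # cannot split between identical x values
        right_sum = total - left_sum
        # Minimising squared error is equivalent to maximising this score.
        score = left_sum ** 2 / k + right_sum ** 2 / (n - k)
        if best is None or score > best[0]:
            threshold = (xs[i] + xs[order[k]]) / 2
            best = (score, threshold, left_sum / k, right_sum / (n - k))
    if best is None:  # all x values identical: fall back to the mean
        mean = total / n
        return lambda x: mean
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def boost(xs, ys, rounds=200, learning_rate=0.1, subsample=0.5, seed=0):
    """Stochastic gradient boosting for squared error, using stumps."""
    rng = random.Random(seed)
    base = sum(ys) / len(ys)
    preds = [base] * len(xs)
    stumps = []
    for _ in range(rounds):
        # Each round sees a random subsample (the "stochastic" part).
        idx = rng.sample(range(len(xs)), int(subsample * len(xs)))
        # Fit the next small tree to the current residuals.
        stump = fit_stump([xs[i] for i in idx], [ys[i] - preds[i] for i in idx])
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for p, x in zip(preds, xs)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)

def mse(model, xs, ys):
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

data_rng = random.Random(1)

def make_data(n):
    # Synthetic data (an assumption for this sketch): a noisy sine curve.
    xs = [data_rng.uniform(0, 6) for _ in range(n)]
    ys = [math.sin(x) + data_rng.gauss(0, 0.1) for x in xs]
    return xs, ys

train_x, train_y = make_data(300)
test_x, test_y = make_data(300)

single_tree = fit_stump(train_x, train_y)
boosted = boost(train_x, train_y)

single_mse = mse(single_tree, test_x, test_y)
boosted_mse = mse(boosted, test_x, test_y)
print(f"single small tree: test MSE = {single_mse:.4f}")
print(f"boosted ensemble:  test MSE = {boosted_mse:.4f}")
```

The single tree stays easy to read (one split, two predicted values), while the ensemble follows the curve more closely. That trade-off between readability and accuracy is the reason the two approaches are often used side by side.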

Read More