Simply Salford Blog

Kaitlin Onthank

Recent Posts

Webinar Recap: 3 Ways to Improve Regression, Part 2

Posted by Kaitlin Onthank on Thu, Jan 28, 2016 @ 10:22 AM

Did you miss our webinar yesterday? It's never too late to register to get the recording

Topics: stochastic gradient boosting, Nonlinear Regression, Regression Splines, Regression

Webinar Recap: 3 Ways to Improve Regression, Part 1

Posted by Kaitlin Onthank on Thu, Jan 21, 2016 @ 09:19 AM

Did you miss our webinar yesterday? It's never too late to register to get the recording

Topics: RandomForests, Random Forests, stochastic gradient boosting, Nonlinear Regression, Regression

Trending Data Topics: 2015 vs. 2016

Posted by Kaitlin Onthank on Thu, Jan 7, 2016 @ 08:00 AM

It's no secret that data science popped up on every corner in 2015. We've already seen numerous articles on the hottest topics of the year as well as predictions for 2016, so we thought we'd weigh in on the subject! Although it goes without saying, "data science" is at its peak popularity and we have the proof:

While we know "data science" was a buzzword (buzz-phrase?) of the year, there were certain aspects of the phenomenon that dominated the blogosphere. Below, we'll cover three trending topics in data science: what they are, how they exploded in 2015, and where we see them in 2016.

Big Data

‘Big Data’ is a term that you’ve probably been hearing for some time now. You might even be sick of hearing it. It is a broad term that gets tossed around easily, but in its simplest definition it often refers to “data sets so large or complex that traditional data processing applications are inadequate” (Wikipedia). What does this mean? The data may be so large, or so difficult to work with, that it cannot be analyzed or processed with traditional analysis methods. For instance, you may find that R, which holds data in memory by default, cannot handle building a model with millions of observations. Three factors form the basis for big data: volume, variety, and velocity (Wikipedia).

Another term that you may have heard in conjunction with big data is Hadoop. Hadoop is an open-source framework for storing and processing data across clusters of commodity hardware. So if you’re panicking over how to handle massive amounts of data, or you lack the systems to do so, take this advice from our Senior Scientist, Mikhail Golovnya:

Proper sampling of a 'big' dataset can yield models just as good as or even better than models using all of the data. You don’t need to use every data point available to you.
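To make the sampling idea concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the synthetic dataset, the 1% sample size, and the hypothetical helper `fit_line`. It fits a simple straight line to all 200,000 rows and then to a random 1% sample, so you can see how close the two fits land. It is not how any Salford Systems product samples data, and a real project would check this on its own data and model.

```python
import random


def fit_line(points):
    """Ordinary least squares for y = intercept + slope * x (hypothetical helper)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


rng = random.Random(0)  # fixed seed so the run is repeatable

# Synthetic "big" dataset (an assumption): y = 2 + 3x plus Gaussian noise.
full = []
for _ in range(200_000):
    x = rng.uniform(0, 10)
    full.append((x, 2.0 + 3.0 * x + rng.gauss(0, 1)))

# Simple random sample without replacement: 1% of the rows.
sample = rng.sample(full, 2_000)

intercept_full, slope_full = fit_line(full)
intercept_sample, slope_sample = fit_line(sample)

print(f"all rows:  intercept={intercept_full:.3f}, slope={slope_full:.3f}")
print(f"1% sample: intercept={intercept_sample:.3f}, slope={slope_sample:.3f}")
```

On data this well behaved, the two fits should agree closely. Rare events or highly skewed targets are where the "proper" in proper sampling starts to matter, for example by stratifying the sample.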

While we could talk about big data for days, we'll leave you to draw your own conclusions about the necessity of big data applications. But be sure to stay tuned for an in-depth blog post and webinar by CEO Dan Steinberg on whether you need big data or just enough data.

Open Data

Open data refers to data that can be used, re-used, and distributed freely by anyone. 2015 saw a surge in the amount of publicly available open data. This movement has been extremely beneficial for a few reasons:

  • Small companies (especially non-profits) are taking advantage by releasing their data to the public and challenging data scientists around the world to solve their analytics problems
  • Larger companies are using platforms like Kaggle to both educate and hire data scientists
  • Data scientists-in-training (and students) now have access to hundreds of thousands of data sets to learn from

On the other side of this movement is the issue of data privacy. Most companies are hesitant to release data due to confidentiality, which is why a good portion of open datasets out there are actually sanitized. Sanitized data is transformed and disguised to the point where it merely mimics the original. Analysis is still possible but variable names and values have been changed to protect the owners. Keep this in mind when drawing conclusions pertaining to open data!
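For a rough feel of what sanitizing can look like, here is a small Python sketch. The column names, the records, and the `sanitize` function are all made up for illustration, and it assumes every column is numeric. It swaps real column names for generic labels (V1, V2, ...) and rescales each column to the 0-1 range, so relationships between rows survive but the original names and units do not.

```python
def sanitize(rows):
    """Rename columns to V1, V2, ... and min-max scale each numeric column.

    Hypothetical illustration only; assumes every value is numeric.
    """
    columns = list(rows[0])
    alias = {name: f"V{i}" for i, name in enumerate(columns, start=1)}
    lows = {c: min(r[c] for r in rows) for c in columns}
    highs = {c: max(r[c] for r in rows) for c in columns}

    sanitized_rows = []
    for r in rows:
        clean = {}
        for c in columns:
            span = highs[c] - lows[c]
            # A constant column has no spread, so map it to 0.0.
            clean[alias[c]] = (r[c] - lows[c]) / span if span else 0.0
        sanitized_rows.append(clean)
    return sanitized_rows


# Made-up records standing in for a confidential dataset.
records = [
    {"age": 34, "income": 52000},
    {"age": 51, "income": 88000},
    {"age": 27, "income": 40000},
]

sanitized = sanitize(records)
for row in sanitized:
    print(row)
```

Renaming and rescaling alone is a simplification. Data owners typically go further than this, which is one more reason to be careful about reading real-world meaning into a sanitized variable.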

Here are a few of our favorite open data sites:

http://www.data.gov/

http://archive.ics.uci.edu/ml/

Data Scientists

A common controversy in the land of data science this year is the definition of those who practice in the field. So-called "data scientists" are supposedly paid too much, paid too little, don't have enough programming skills, don't have enough decision-making skills, and so on. The problem is that no single role can cover the expanse of responsibilities in data science, but we're all using the same term for everyone in the field! We're starting to see statisticians venture into data mining and programmers venture into analytics, causing a sudden explosion of this versatile new job title.

According to VentureBeat [1], the data scientist career path is expected to grow by 18.7% between 2010 and 2020. Many universities across the country are now offering degrees in data science as well as adding entire departments devoted to this field.

This seems like a good point to bring up a related poll I came across the other day. In a KDnuggets poll, half of voters said they believe data scientists will be automated and unemployed by 2025. At Salford Systems, we couldn't disagree more. Every day we come across situations where automation cannot replace the human ability to analyze a problem and reach a conclusion. Software can be extremely powerful, but there is always a need for intelligent decision-making apart from a computer.

We fully expect to see the rise of the Data Scientist continue in 2016. In fact, data scientists in every department at sizable companies could be the norm within the next few years.

[1] http://venturebeat.com/2013/11/11/data-scientists-needed/

Feel free to leave your thoughts below or email us at mlq@salford-systems.com with questions and comments!
