Let's get right to it! You're a beginner, and you want to know what is needed to start data mining and become an experienced data scientist overnight. We get it - this is the world we live in - quick and dirty. So here we go, take notes!
Get to Know Your Data
So, where to start? We will begin with how to analyze your dataset. You MUST get to know your data before beginning a project. If you don't know your data (what its quirks are, what its trends are, what its problems are, what its information means), you are doomed from the beginning, according to the late, great Dr. Leo Breiman.
Make sure you define your project goal and understand which variables may NOT be good predictors (maybe too many values are missing, or maybe there is some obscure variable titled 'recordid' that definitely should not be a predictor). It is best to grab a cup of coffee or tea, sit down, sift through the data, and take notes on the quality and underlying themes within the data EVEN BEFORE you build a model.
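If you like to take those notes in code, here is a minimal sketch of a first pass in Python with pandas. It is only one way to do it: the `profile_columns` helper, the 50% missing-value cutoff, and the tiny made-up dataset are all assumptions for illustration, not part of any particular tool.

```python
import pandas as pd


def profile_columns(df, max_missing=0.5):
    """Summarize each column and flag ones that are unlikely to be good predictors.

    max_missing is an assumed cutoff; pick one that suits your project.
    """
    n_rows = len(df)
    report = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing_frac": df.isna().mean(),
        "n_unique": df.nunique(dropna=True),
    })
    # Too many missing values leaves little information to learn from.
    report["too_many_missing"] = report["missing_frac"] > max_missing
    # One distinct value per row usually means an identifier such as 'recordid'.
    report["looks_like_id"] = report["n_unique"] == n_rows
    return report


# Tiny made-up dataset, for illustration only.
df = pd.DataFrame({
    "recordid": [101, 102, 103, 104, 105, 106],
    "age": [34, 51, None, 29, 42, 42],
    "income": [None, None, None, None, 58000, 61000],
    "churned": [0, 1, 0, 0, 1, 1],
})

report = profile_columns(df)
print(report)
```

Treat the flags as prompts for a closer look rather than verdicts. A continuous measurement with no repeated values would also be flagged as an ID by this simple check, so you still need to read the column and decide.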
Build a 'Top-Level' Model First
We recommend starting with a simple decision tree model so that you can make initial judgments about the quality of the data and see what is going on from a top level. Partitioning a test sample, setting model limits, choosing the type of analysis, and selecting predictors are all part of model setup.
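To make that setup concrete, here is a rough sketch using scikit-learn's `DecisionTreeClassifier`, which is a stand-in and not the software this post is based on. The synthetic data, the 20% test partition, and the depth and leaf-size limits are assumed values chosen only to keep the example small.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data; swap in your own predictors and target.
X, y = make_classification(
    n_samples=500, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
feature_names = [f"x{i}" for i in range(X.shape[1])]

# Partition a test sample (20% held out here, an assumed split).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Set model limits: a shallow tree with reasonably large leaves stays readable.
tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=20, random_state=0)
tree.fit(X_train, y_train)

# Judge the model on data it has not seen.
test_accuracy = tree.score(X_test, y_test)
print(f"test accuracy: {test_accuracy:.3f}")

# Print the splits so you can see which variables the tree relies on.
print(export_text(tree, feature_names=feature_names))
```

The printed tree is the useful part at this stage. If the top splits use a variable that should not be a predictor, or the test accuracy is no better than guessing, that is your cue to go back to the data.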
Boost Your Model
Once you have built an initial model that provides a general synopsis of what is going on with the data, it's time to dig deeper for more insight and predictive accuracy. You may have learned that you can simplify your model further and reduce the number of predictors used. Or, you may have learned that there isn't a signal at all and you need to take a step back and re-evaluate your data. Maybe your dataset is just too small and you need to gather more data in order to build adequate models. Regardless, the next step in perfecting your model is to build a model ensemble using a boosted methodology. (There are other approaches, but this is what has been recommended by Salford Systems' experts.) The tutorial for this step uses TreeNet Stochastic Gradient Boosting.
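TreeNet is Salford Systems' own product, so as a rough stand-in, here is a minimal sketch of stochastic gradient boosting using scikit-learn's `GradientBoostingClassifier` instead. The synthetic data and every parameter value below are assumptions for illustration, not recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; swap in your own predictors and target.
X, y = make_classification(
    n_samples=500, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
feature_names = [f"x{i}" for i in range(X.shape[1])]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# subsample < 1.0 fits each tree on a random fraction of the training rows,
# which is what makes the boosting "stochastic".
boosted = GradientBoostingClassifier(
    n_estimators=200,
    learning_rate=0.05,
    max_depth=3,
    subsample=0.5,
    random_state=0,
)
boosted.fit(X_train, y_train)

test_accuracy = boosted.score(X_test, y_test)
print(f"test accuracy: {test_accuracy:.3f}")

# Rank predictors by importance to see which ones you might drop.
ranked = sorted(
    zip(feature_names, boosted.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranked:
    print(f"{name}: {importance:.3f}")
```

Compare the test accuracy here with what your single tree gave you on the same partition. The importance ranking is also a handy starting point if you suspect you can get by with fewer predictors.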
Save and Repeat
Once you have a model that you are satisfied with, why reinvent the wheel? You can save your models and call them again to be used on new data, more data, subsets of data, you name it! Of course, depending on your industry, you may need to tweak your model parameters over time in order to remain relevant. Deployment baffles so many modelers out there, but it doesn't need to be a mystery adventure. In a quality software package, you'll be able to translate your models into various programming languages, so that you can move back and forth between packages and platforms as needed.
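Here is a small sketch of the save-and-reuse part, assuming a scikit-learn model and Python's built-in `pickle` module. The file name, the synthetic data, and the temporary directory are placeholders. This only covers saving and reloading within Python; translating a model into another programming language depends on your software package and is not shown.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Train a small model on synthetic stand-in data.
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
model = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# "New data" arriving later; here it is just more synthetic rows.
X_new, _ = make_classification(
    n_samples=50, n_features=6, n_informative=3, n_redundant=0, random_state=1
)

with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "tree_model.pkl"  # placeholder file name

    # Save the fitted model to disk.
    with open(model_path, "wb") as f:
        pickle.dump(model, f)

    # Later, or in another script: load it back instead of retraining.
    with open(model_path, "rb") as f:
        reloaded = pickle.load(f)

# The reloaded model should score new data exactly like the original.
same_predictions = np.array_equal(model.predict(X_new), reloaded.predict(X_new))
print("reloaded model matches original:", same_predictions)
```

Two cautions: only unpickle files you trust, and reload with the same library version you saved with, since pickled models are not guaranteed to work across versions.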
What else do data mining beginners need to know that wasn't covered in this quick and dirty summary? Add your comments and suggestions below!