Cross-post from Diary of a Data Scientist, a first-hand account of the life of a data scientist, sharing the struggles, triumphs, and day-to-day perspective of a technical research professional. Click to subscribe to the Diary of a Data Scientist Blog here.
Have you ever wondered how something is “distributed”? Have you ever noticed a bridge under construction, and thought about which factors will determine how quickly it will be built? Ever thought about the assumptions that go into producing election polls? Welcome to some of the daily musings of a statistician. Statistics is the science of analyzing data. Statistics and data science are very similar disciplines, and analysts almost always draw from both to produce their analyses. With that said, there are differences between statistics and data science. In many ways statistics and machine learning can be thought of as sub-disciplines of data science in the sense that both techniques are often used to explore the data, build models, or make inferences. Data science includes not only statistics and machine learning but also cleaning and reshaping data, building infrastructure for data, and more.
Before I take you inside the mind of a statistician I should first point out that statisticians are a diverse group of people with different areas of expertise. We almost always work on a team that includes people with different areas of knowledge like computer science, business, engineering, management, and more. How a statistician would approach a problem depends on any number of things including programming skill, abilities of teammates, statistical expertise, and writing ability.
Approaching a Problem
So how do I approach problems in data science? Well, like any data scientist would. My process, although flexible, involves pulling or obtaining my own data (not always an easy task), exploring, reshaping and cleaning, writing scripts to automate the reshaping and cleaning, exploring again, building and assessing models, and then using those models to answer questions or solve problems.
My process is very similar to that of any other analyst, so what makes a statistician unique? Our training. Most of a statistician’s training in applied statistics revolves around model building and diagnostics. Some of the things that we care a lot about (trust me, there are a lot more!) are assumptions, bias and variance, model selection, and properly answering questions.
Assumptions underlie everything in statistics. As an example, let’s say I want to predict whether a brain surgery will be successful based on a patient’s characteristics and I assume I have data on both successful and unsuccessful surgeries. If I discover the data only consists of successful surgeries (discovered during the exploration phase), then I can describe the characteristics of patients whose surgery was successful but I can’t build a predictive model. Assumptions form the foundation of every analysis and should never be ignored.
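As a minimal sketch of this kind of assumption check, the snippet below builds a toy set of surgery records (the data and outcome labels are invented for illustration) and counts the outcome classes before any modeling begins:

```python
import random
from collections import Counter

# Hypothetical patient records for illustration: each is (features, outcome).
random.seed(0)
records = [({"age": random.randint(30, 80)}, "success") for _ in range(200)]

# Count how many of each outcome actually appear in the data.
outcome_counts = Counter(outcome for _, outcome in records)
print(outcome_counts)  # Counter({'success': 200})

# A classifier needs examples of every class it must predict.
if len(outcome_counts) < 2:
    print("Only one outcome present: describe the data, don't build a predictive model.")
```

Here the check fires because every record is a success, which is exactly the situation described above: the data can be summarized, but there is nothing for a predictive model to discriminate between.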
Bias and Variance
Playing darts is a great way to conceptualize bias and variance. Bias is how close you are, on average, to the desired target (in this case, the bullseye). If your darts are centered near the bullseye then your bias is low, but as the center of your throws drifts away from the bullseye, bias is introduced. Variance measures the “spread” of your darts. If your darts hit all over the dart board (or even the wall!) then your variance is high. One way bias and variance are used is in the context of model complexity and error. A model is biased if, on average, its predictions incorrectly estimate the true value, and a model has high or low variance if its predictions are erratic or consistent, respectively. As an example of model complexity, a regression model with eight terms is more complex than one with six. It is helpful to look at a plot of squared prediction error (the loss used in linear regression) versus model complexity:
As model complexity increases, squared bias decreases, variance increases, and total error, at least up to a point, decreases. This relationship is referred to as the bias-variance tradeoff. The purple line in the chart indicates that the optimal model in this case is one that allows for both some bias and some variance, meaning that a compromise is necessary to minimize model error. This can be interpreted as fitting a model complex enough to capture the signal (reducing bias) but not so flexible that it cannot predict new observations (keeping variance in check). I use this idea when I think about model selection. For instance, linear regression models tend to have high bias and low variance, while random forests have low bias and use averaging to reduce variance. The bias-variance tradeoff is an important modeling concept and it is always on my mind when I build models.
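The darts analogy can be simulated directly. The sketch below (the offsets and spreads are made up for illustration) aims simulated throws at a bullseye at the origin, then measures bias as the distance of the average landing point from the target and variance as the average squared scatter around that average:

```python
import math
import random

random.seed(42)

def throw_darts(n, offset, spread):
    """Simulate n dart throws aimed at a bullseye at (0, 0).
    offset shifts the aim (introduces bias); spread scales the scatter (variance)."""
    return [(offset + random.gauss(0, spread),
             offset + random.gauss(0, spread)) for _ in range(n)]

def bias_and_variance(throws):
    # Bias: distance of the average landing point from the bullseye.
    mx = sum(x for x, _ in throws) / len(throws)
    my = sum(y for _, y in throws) / len(throws)
    bias = math.hypot(mx, my)
    # Variance: average squared distance of each throw from that average point.
    var = sum((x - mx) ** 2 + (y - my) ** 2 for x, y in throws) / len(throws)
    return bias, var

accurate = bias_and_variance(throw_darts(10_000, offset=0.0, spread=1.0))
skewed = bias_and_variance(throw_darts(10_000, offset=3.0, spread=1.0))
print("centered throws:", accurate)
print("off-center throws:", skewed)
```

With the same spread, both sets of throws show the same variance, but shifting the aim moves the bias well away from zero, which is exactly the separation between the two quantities that the darts picture is meant to convey.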
Statisticians spend quite a bit of time thinking about model selection. But how do you decide which model to use? Should you just try every possible model you can think of? Probably not. I tend to focus on things such as purpose and model performance. Perhaps the most important consideration is purpose. If I am presenting to a group with little technical expertise that cares about understanding past behavior, then a regression model or a CART tree is appropriate because of its interpretability. After defining the purpose I build a few models and then assess them (this is an iterative process). If I am building a model to predict, I focus on holdout sample performance as well as variable importance and behavior.
Variable importance measures show which variables, like blood pressure, are important in predicting successful surgeries, whereas variable plots show how higher levels of blood pressure affect the chance of a successful surgery. Variable plots, especially those that correspond to important variables, give me an idea of how my model is behaving and what relationships it is capturing.
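One common way to compute a variable importance measure is permutation importance: shuffle one variable's values and see how much model accuracy drops. The sketch below uses invented data and a hand-written rule standing in for a fitted model; blood pressure carries all the signal, so permuting it hurts accuracy while permuting the noise feature does not:

```python
import random

random.seed(1)

# Toy dataset (an assumption for illustration): surgery succeeds when
# blood pressure is low; the "noise" feature carries no signal.
n = 1_000
bp = [random.uniform(90, 180) for _ in range(n)]
noise = [random.uniform(0, 1) for _ in range(n)]
outcome = [1 if b < 130 else 0 for b in bp]

def model(b, z):
    # Stand-in for a fitted model: predicts success for low blood pressure.
    return 1 if b < 130 else 0

def accuracy(bps, zs):
    hits = sum(model(b, z) == y for b, z, y in zip(bps, zs, outcome))
    return hits / n

baseline = accuracy(bp, noise)

def importance_of_bp():
    shuffled = bp[:]
    random.shuffle(shuffled)          # break the bp-outcome link
    return baseline - accuracy(shuffled, noise)

def importance_of_noise():
    shuffled = noise[:]
    random.shuffle(shuffled)          # the model never used this anyway
    return baseline - accuracy(bp, shuffled)

print("baseline accuracy:", baseline)
print("importance of blood pressure:", importance_of_bp())
print("importance of noise:", importance_of_noise())
```

Shuffling blood pressure destroys most of the model's accuracy while shuffling the noise feature changes nothing, which is the signature of an important versus an unimportant variable.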
Holdout sample performance refers to how a model performs on “unseen” data. If the data consist of 500 patient outcomes, I randomly select 30% (this percentage can vary) of the records and do not use them to build the model (hence the term “holdout”). The model is actually built using the other 70%, and then I use the model to predict observations in the holdout sample and report the performance. Why report the holdout performance? It is more realistic. In practice the model is used to predict brain surgery outcomes for patients that were not in the data used to build the model, so performance on the holdout data, which was also not in the model-building data, is more indicative of actual model performance. Selecting a model is an important part of my process, but the model is meaningless if I cannot use it to answer questions.
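A 70/30 holdout split like the one described above takes only a few lines; the patient records here are random placeholders:

```python
import random

random.seed(7)

# 500 hypothetical patient records (1 = successful surgery), invented for illustration.
patients = [{"id": i, "outcome": random.choice([0, 1])} for i in range(500)]

# Randomly hold out 30% of the records; the model never sees them.
holdout_size = int(0.30 * len(patients))
holdout_ids = set(random.sample(range(len(patients)), holdout_size))

holdout = [p for p in patients if p["id"] in holdout_ids]
training = [p for p in patients if p["id"] not in holdout_ids]

# Build the model on `training` (70%), then report performance on `holdout` (30%).
print(len(training), len(holdout))  # 350 150
```

The key property is that the two sets are disjoint: any record used to fit the model is excluded from the performance estimate, which is what makes holdout performance a fairer preview of how the model will do on genuinely new patients.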
Correctly answering questions defines data science, so providing answers and explaining my work is crucial. I always convey to the audience the appropriate use of a model and the uncertainty associated with it. Context is important for explaining the appropriate use of a model. If I built a model to predict successful brain surgeries at a hospital in New York, then the model should not be used to predict brain surgery outcomes at a hospital in Australia. Statistical models are not perfect reflections of reality and have a degree of uncertainty. Explaining uncertainty can be accomplished using things like holdout performance measures, confidence and prediction intervals, and simply reminding the audience to “take things with a grain of salt.” Models always come with their own set of stipulations, and it is the job of a statistician to properly communicate this to the audience.
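As one concrete way to express that uncertainty, a normal-approximation confidence interval around holdout accuracy can be computed in a few lines. The counts below are invented for illustration:

```python
import math

# Suppose (hypothetically) the model classified 126 of 150 holdout surgeries correctly.
correct, n = 126, 150
p_hat = correct / n

# 95% normal-approximation confidence interval for the true accuracy.
se = math.sqrt(p_hat * (1 - p_hat) / n)
lo, hi = p_hat - 1.96 * se, p_hat + 1.96 * se
print(f"holdout accuracy {p_hat:.2f}, 95% CI ({lo:.2f}, {hi:.2f})")
```

Reporting the interval rather than the bare 84% makes the “grain of salt” explicit: the holdout sample is finite, so the true accuracy could plausibly sit anywhere in that range.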
When I approach a problem I think of it in terms of my training in statistics. I conceptualize things in terms of assumptions, the tradeoff between bias and variance, model selection, properly explaining my work, and more. Every problem is different. I use these concepts because they produce useful insights and, ultimately, solve problems.