When beginning a data analysis project, analysts often discover that the data as presented or made available is not ready for analysis. The reasons for this lack of readiness could be many, including:
- The coding of the data is inconsistent (e.g. date is sometimes Day-Month-Year, and sometimes Month-Day-Year)
- Data is made available in separate tables, but the keys needed to join them are missing
- Dependent variables for the analysis are largely missing
- Many fields appear to contain wild (clearly impossible) values
- Ambiguity regarding whether a value is valid or missing (e.g. age is 99)
- The unit of observation in the data is not appropriate for analysis (e.g. transaction-level data when the analysis is required at the customer level)
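Several of the problems listed above can be screened for mechanically before any modeling starts. The following is a minimal Python sketch, not a complete data audit. The records, the field names (`customer_id`, `date`, `amount`, `age`), the treatment of 99 as a possible missing-value code, and the rule that negative amounts are impossible are all assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical transaction-level records (all names and values are illustrative).
transactions = [
    {"customer_id": "C1", "date": "03-04-2000", "amount": 20.0, "age": 34},
    {"customer_id": "C1", "date": "25-12-2000", "amount": 35.5, "age": 34},
    {"customer_id": "C2", "date": "12-25-2000", "amount": 12.0, "age": 99},
    {"customer_id": "C3", "date": "07-19-2000", "amount": -5000.0, "age": 41},
]

def date_readings(text):
    """Return the formats (day-first 'DMY', month-first 'MDY') under which text parses."""
    readings = set()
    for label, fmt in (("DMY", "%d-%m-%Y"), ("MDY", "%m-%d-%Y")):
        try:
            datetime.strptime(text, fmt)
            readings.add(label)
        except ValueError:
            pass
    return readings

def unambiguous_formats(rows):
    """Formats proven to occur in the column; more than one means inconsistent coding."""
    proven = set()
    for row in rows:
        readings = date_readings(row["date"])
        if len(readings) == 1:
            proven |= readings
    return proven

def screen(rows, age_sentinels=(99,)):
    """Collect (row index, issue) pairs for a few simple data-quality checks."""
    issues = []
    for i, row in enumerate(rows):
        if len(date_readings(row["date"])) != 1:
            issues.append((i, "date is ambiguous or unparseable"))
        if row["age"] in age_sentinels:
            issues.append((i, "age may be a missing-value code"))
        if row["amount"] < 0:
            issues.append((i, "amount looks impossible"))
    return issues

def to_customer_level(rows):
    """Roll transaction rows up to one summary record per customer."""
    totals = defaultdict(lambda: {"n_transactions": 0, "total_amount": 0.0})
    for row in rows:
        summary = totals[row["customer_id"]]
        summary["n_transactions"] += 1
        summary["total_amount"] += row["amount"]
    return dict(totals)

issues = screen(transactions)
customers = to_customer_level(transactions)
for index, problem in issues:
    print(index, problem)
print("date formats proven present:", sorted(unambiguous_formats(transactions)))
```

Checks like these only flag candidates. Whether an age of 99 is real or a code, and whether a flagged row should be repaired or dropped before the roll-up, still has to be decided by someone who knows the data.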
When a type of analysis is repeated regularly, and the data always flows from the same sources, it may eventually be possible to have many of these issues resolved before the analytical project starts, and tools for dealing with the remainder (e.g. data cleaning and reorganization) may be prebuilt. But even so, there is always the possibility of new surprises and new ways in which problems may enter the data.
So the topic of this discussion is: how much time is typically spent wrestling with the data and "beating it into shape" so that the actual analytics can begin, and how can we reduce that time?
The first thing to point out is that although the data preparation phase of an analytic project may not be the most enjoyable part, it is far from valueless and indeed can teach the analyst quite a bit. The close interaction of the analyst with the data can provide insights into possible deficiencies that could color the subsequent analysis, say due to selection bias (e.g. the geographical distribution of some customer data does not follow an expected pattern and there is a shortage of data for a region). In some cases it will be discovered that the flaws are in reality not with the data at all but with the real-world processes generating the data (e.g. the call center always rejects incoming calls from area code 212). The process of "wrangling" the data may also provoke hypotheses, stimulate new ideas, and give the analyst a sense of the reliability of certain parts of the data.
Second, the process of data preparation is not really a totally distinct phase of the analytical project; we do not prepare data once and for all, and then go on to model building. Instead, we go through a process of preliminary data preparation and then trial modeling. The results and insights from the initial models will almost invariably point us to a new round of data preparation and data repair. This cycle of model-data-update-model may be repeated several times. Normally, after the first few cycles, the data preparation phase no longer includes data repair and is instead focused on feature extraction (new predictor construction). However, outright errors in the data may still be discovered very late in the analytical project.
In the KDDCup 2000 competition (in which Salford took first place in two separate challenges) we rebuilt our analytical database 10 times over the course of six weeks. Further, around week 5 we discovered a major error in the data for one specific field that was vital to the analysis (more on this below). So, even though we like to have nice and neat workflow diagrams to explain our process, real-world analytics is rarely so neat.
So, going back to the key question of this article: what fraction of time is spent in data preparation for modeling? I have been asking analysts this question since 1997, when Salford started a series of live data mining training sessions focused on CART and MARS.
I have continued to ask this question of any group of analysts I happen to meet, and the answers have been remarkably consistent: the most common response is 80%.
Literally hundreds of practicing data miners and statistical modelers, most of them working at major corporations supporting extensive analytical projects, have reported that they spend 80% of their effort in manipulating the data so that they can analyze it!
I have heard other numbers too: a minority report 50%, and occasionally someone reports 90%.
Regardless of what the fraction is for you and your colleagues, you probably have less time for modeling than you would like. To put it another way, most analysts feel that they have not had sufficient time to perfect their models and that a model is declared final when there is no time left to make it better (just like a timed test, when you are told to "put your pencils down").
So what can we do about this? At Salford Systems our emphasis has been on developing and leveraging tools that both (a) accelerate the process of error detection and correction, and (b) allow modelers to get further with less than perfect data. For this, we need tools that are robust in the face of outliers, missing values, and outright miscoded data. The strongest tool we offer for dealing with problematic data is CART, as its missing-value handling and automatic nesting of sub-models allow it to breeze through these problems.
Let's take an example from the KDDCup 2000. One of our models in this competition concerned prediction and characterization of the customers of an e-commerce website. Which visitors become customers and actually buy something? We started with a CART model that selected as its root node splitter "Is the visitor registered on this site?" This seems to be a most sensible way to start; people do not register on sites they have no interest in. The next split, for those who registered, was on the geographic location of the visitor, and it isolated residents of New York and California as much more likely to buy than residents of other states (remember this was data from the year 2000). So far so good; everything makes sense.
On the other side of the tree, looking at visitors who had not previously registered, the CART tree again wanted to split on geographic region. But at that time, and in that data, non-registered users were supposed to be completely anonymous, and there was supposed to be no way for us to know their geographic location (this would not be true today). Looking at this tree, we dug back into the data and discovered that we had complete registration information for a good number of visitors who were flagged as not-registered! In other words, the REGISTERED flag had been miscoded in the database and was sometimes correct and sometimes incorrect.
Discovering this error was not difficult if you looked at a CART tree; however, it would probably have remained invisible in the context of other predictive models. But beyond the tree, spotting the error required some domain-specific knowledge and, of course, paying attention. In other words, it needed a tool and a sufficiently capable analyst who was actually paying attention to the details. Interestingly, no one else discovered this error (which we reported to the competition organizers as soon as we confirmed it).
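Once a tree has raised a suspicion like this one, the miscoding can be confirmed with a direct cross-field check: look for records whose flag says "not registered" but which still carry information that should exist only for registered visitors. The sketch below illustrates that kind of check in Python. The records and the field names (`visitor_id`, `registered`, `state`) are hypothetical and are not the KDDCup 2000 schema.

```python
# Hypothetical visitor records; field names and values are assumptions for illustration.
visitors = [
    {"visitor_id": 1, "registered": True, "state": "NY"},
    {"visitor_id": 2, "registered": False, "state": None},
    {"visitor_id": 3, "registered": False, "state": "CA"},
    {"visitor_id": 4, "registered": True, "state": "TX"},
    {"visitor_id": 5, "registered": False, "state": "NY"},
]

def find_flag_conflicts(rows, flag="registered", dependent_fields=("state",)):
    """Return rows whose flag is False but which hold values expected only when it is True."""
    return [
        row
        for row in rows
        if not row[flag]
        and any(row.get(field) is not None for field in dependent_fields)
    ]

conflicts = find_flag_conflicts(visitors)
conflict_ids = [row["visitor_id"] for row in conflicts]
print("visitors flagged not-registered but with registration data:", conflict_ids)
```

A check like this is easy to write once you know which rule the data is supposed to obey. The harder part, as in the example above, is noticing that a rule has been broken in the first place.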
Another example of a major error we discovered with CART involved a bad merge where the dependent variable of the analysis had been linked with customer data describing another person. Given the self-testing built into CART we could quickly ascertain that we could not predict the outcome at all. Using other methods might tempt you into an overfitting exercise driven by the insistence that the data was predictable!
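The idea behind this kind of self-testing can be illustrated with a small holdout experiment. The sketch below is not CART: a single-split "stump" stands in for a full tree, the data is synthetic, and the bad merge is simulated by reassigning outcomes across rows with a fixed permutation. All names are hypothetical. The point is only that performance on held-out rows, compared with a majority-class baseline, shows when a target cannot be predicted from the predictors it has been attached to.

```python
def accuracy(xs, ys, rule):
    """Share of rows where the rule (threshold, high_label) predicts the outcome."""
    threshold, high_label = rule
    hits = sum(
        (high_label if x >= threshold else 1 - high_label) == y
        for x, y in zip(xs, ys)
    )
    return hits / len(ys)

def fit_stump(xs, ys):
    """Pick the threshold and orientation with the best training accuracy."""
    best_acc, best_rule = -1.0, None
    for threshold in sorted(set(xs)):
        for high_label in (0, 1):
            acc = accuracy(xs, ys, (threshold, high_label))
            if acc > best_acc:
                best_acc, best_rule = acc, (threshold, high_label)
    return best_rule

def holdout_check(xs, ys):
    """Train on even-indexed rows, score on odd-indexed rows, report the baseline too."""
    train_x, train_y = xs[0::2], ys[0::2]
    test_x, test_y = xs[1::2], ys[1::2]
    rule = fit_stump(train_x, train_y)
    ones = sum(test_y)
    baseline = max(ones, len(test_y) - ones) / len(test_y)  # majority-class share
    return accuracy(test_x, test_y, rule), baseline

n = 100
xs = list(range(n))                          # synthetic predictor
y_true = [1 if x >= 50 else 0 for x in xs]   # outcome correctly linked to each row
# Simulated bad merge: outcomes reassigned across rows by a fixed permutation.
y_bad = [y_true[(37 * i) % n] for i in range(n)]

good_acc, good_base = holdout_check(xs, y_true)
bad_acc, bad_base = holdout_check(xs, y_bad)
print(f"correct merge: holdout accuracy {good_acc:.2f} vs baseline {good_base:.2f}")
print(f"bad merge:     holdout accuracy {bad_acc:.2f} vs baseline {bad_base:.2f}")
```

With the correctly linked outcome the held-out accuracy is far above the baseline; with the scrambled outcome it stays close to it. A result like the second one is a cue to re-examine the merge rather than to keep tuning the model.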
A further example involved a data preparation coding error, which we will describe in detail in another context. We became convinced that something was wrong with the data when our CART tree predicting credit card risk used a relatively minor credit bureau measure as the root node splitter. In many previous analyses of similar data, that variable was never considered to be important or particularly useful. Suddenly it had been elevated to the top rank. We might have decided that we had just made a major new discovery but instead we decided that we had bad data. In this case, the tree revealed something about the data that would be appreciated only by a domain specialist.
In addition to CART's ability to provide X-ray vision into data, we like to use TreeNet (Friedman's stochastic gradient boosting), which is based on a series of small CART-like trees. TreeNet is a powerful learner capable of dealing with missing values, outliers, and major errors in the dependent variable (such as recording a YES as a NO). Like CART, TreeNet models can be constructed on fairly rough data sets and still yield important insight into the data. TreeNet results are understood chiefly by reviewing the performance statistics (does it predict well?) and the graphs displaying the relationship between Y (the target) and X (a predictor). Strange-looking graphs can be signals of data problems.