The explosion of interest in predictive analytics, and in sophisticated predictive analytics in particular, has led to renewed interest in how to plan a predictive analytics project. The topic is hardly new, but anyone involved in planning or conducting a serious analysis project needs to be cognizant of the fundamentals. Here we draw on insights and ideas from two classic discussions: the 1996 paper by Fayyad, Piatetsky-Shapiro, and Smyth, “From Data Mining to Knowledge Discovery in Databases,” and the 1999 “Cross-Industry Standard Process (CRISP) for Data Mining” developed by analytical practitioners at Daimler-Chrysler, ISL (later acquired by SPSS and now part of IBM), and NCR (Teradata). While these first-generation discussions clearly date from an earlier technological era, the principal ideas are still relevant and are being rediscovered and translated into modern terminology daily. Here we offer a summary of the essentials viewed from our perspective of 20 years of experience in the world of data mining and predictive analytics.
Stage #1: Business Understanding (Specification of an Objective)
To conduct an analytics project successfully, it is first necessary to know what the project is meant to accomplish, described in the practical language of business or decision-making. Often the objective can be clearly stated. Examples include:
- In an election campaign, identify the swing voters most likely to ultimately vote for a specific candidate (precisely what the Obama campaign undertook in 2012).
- Rank the offers you might provide to a website visitor by the probability of acceptance or expected value of the response (something Salford Systems undertook on behalf of a client).
Not all projects have such a clear and focused objective; some are more generic:
- How can I make my overall grocery store operation more efficient logistically?
In this discussion we have in mind the more clearly defined, and thus more narrowly scoped, projects. The CRISP approach calls this first prerequisite “Business Understanding,” which, as Fayyad et al. point out, includes the identification of the objective from the point of view of the client.
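To make the offer-ranking objective mentioned above concrete, here is a minimal sketch in Python. The `Offer` type, the offer names, and the numbers are all hypothetical, invented purely for illustration; in a real project the acceptance probabilities would come from a predictive model and the values from the business.

```python
from typing import NamedTuple


class Offer(NamedTuple):
    name: str
    p_accept: float  # hypothetical model-estimated probability of acceptance
    value: float     # hypothetical payoff to the business if accepted


def rank_offers(offers, by="expected_value"):
    """Sort offers best-first by acceptance probability or by expected value."""
    if by == "probability":
        key = lambda o: o.p_accept
    elif by == "expected_value":
        key = lambda o: o.p_accept * o.value
    else:
        raise ValueError(f"unknown ranking criterion: {by!r}")
    return sorted(offers, key=key, reverse=True)


# Illustrative numbers only.
offers = [
    Offer("free shipping", 0.30, 5.0),
    Offer("10% discount", 0.20, 12.0),
    Offer("premium upgrade", 0.05, 40.0),
]

for criterion in ("probability", "expected_value"):
    print(criterion, [o.name for o in rank_offers(offers, by=criterion)])
```

The two criteria can produce different orderings, which is one reason the objective needs to be pinned down before any modeling begins.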
Stage #2: Data Inventory and Understanding
Getting a good grasp of exactly what data is available for analysis is critical and involves canvassing both internal (proprietary) and external (publicly available or purchasable) data sources. Identifying, locating, and gaining access to potentially useful data within an organization can be challenging, and a project may be forced to proceed using less data than actually exists because institutional constraints isolate data in silos. An effort to identify all sources of potentially useful data is generally worthwhile, as many large organizations have a myriad of data stores, some of which are generally unknown outside the department that collected them. External data is increasingly available via the Internet and can come from government sources and/or private organizations. Such external data can be extremely helpful in improving the predictive performance of a model and refining the analytical insights it can yield. Examples of such data include:
- Popularity of certain Internet search terms
- Stock market index movements
- Snowfall in Fairbanks, Alaska
Supplemental data can be of enormous value in providing the relevant context for what is tracked from the perspective of a single enterprise. For example, supplemental data may include prices charged by competitors for the same or similar products offered by a retailer.
In our consulting experience for financial services enterprises, we have generally found that the data such enterprises amass on their own customers is far more valuable than anything that could be collected or purchased about those same customers externally. Even so, the added value of external data could still justify the cost of acquiring it.
Stage #3: Assessment of Data for Suitability
Having identified objectives and located potential data, the next step is to assess whether the available data can in fact support the original objective, or perhaps some modified or even scaled-back objectives. One may discover that the available data is missing essential details (e.g. descriptors for the ad actually shown to a web site visitor) or that data is available only for a subpopulation (e.g. only data for the clicks was retained and all non-click information was discarded). Such data omissions make the job of predictive analytics vastly more difficult, if not impossible. The topic of mishandled or mismanaged data is a large one, and all experienced analysts have their own catalog of horror stories. Our point here is simply that we have to allow for the possibility that data problems may complicate the project, if not stop it in its tracks.
The data may also offer too small a sample size for meaningful analysis. This may appear to be a thing of the past, something that will never occur again in the brave new world of big data. But even with huge data stores the fraction that is relevant to a specific analysis may be small.
If you need to build a click model for an ad that has been clicked on only 15 times, you are short of data even if you have one billion impressions for that ad. In sum, we need both sufficient quality and sufficient quantity of data.
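One rough way to see the problem is to put an interval around the observed click rate. The sketch below uses the Wilson score interval for a binomial proportion; that choice of interval is ours for illustration, and other interval methods would tell a similar story.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Approximate 95% Wilson score interval for a binomial proportion."""
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half


clicks, impressions = 15, 1_000_000_000
low, high = wilson_interval(clicks, impressions)

print(f"observed click rate: {clicks / impressions:.2e}")
print(f"plausible range:     {low:.2e} to {high:.2e}")
print(f"upper/lower ratio:   {high / low:.1f}")
```

Even the overall click rate is uncertain by more than a factor of two here, and a model that must distinguish clickers from non-clickers has only 15 positive examples to learn from, however many impressions surround them.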
Stage #4: Pilot Project
During the last five years or so, all of our consulting projects have begun with a significantly scaled-down dry run or pilot project; one major year-long project we conducted in 1996 began with a 3-month scaled-down trial. This is something of an innovation in the practice of advanced analytics, even though the notions are anything but new. Engineers and architects build scale models, jet engines are tested in stationary mounts on the ground long before they are placed on an airplane, and new web pages and marketing campaigns are typically first exposed to just tiny fractions of a major web site’s traffic. Obvious or not, the notion that data analysts should consciously opt for a trial run may be new to some practitioners and is well worth considering. At the very least, analytics could be conducted on a radically slimmed-down version of the data set to accelerate modeling run times.
Real-World Example: In a project to predict the sales of individual promoted products in a chain of large grocery stores we started with data for a few dozen products for one department, and then expanded with further selected products from other departments. When done with the pilot project we moved on to making the system work for 122,000 products.
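As a minimal sketch of carving out a slimmed-down pilot data set of this kind, the following Python keeps all rows for a random sample of products from a single department. The record layout (`product`, `department`, `week`, `units`), the function name `pilot_subset`, and the synthetic data are assumptions made for illustration, not a description of the actual grocery project.

```python
import random


def pilot_subset(records, department, n_products, seed=0):
    """Keep all rows for a random sample of products from one department."""
    in_dept = [r for r in records if r["department"] == department]
    products = sorted({r["product"] for r in in_dept})
    rng = random.Random(seed)  # fixed seed so the pilot sample is reproducible
    chosen = set(rng.sample(products, min(n_products, len(products))))
    return [r for r in in_dept if r["product"] in chosen]


# Synthetic stand-in data: 200 products, 4 weeks of sales each.
records = [
    {"product": f"P{i:03d}",
     "department": "dairy" if i % 2 else "bakery",
     "week": w,
     "units": i + w}
    for i in range(200)
    for w in range(4)
]

pilot = pilot_subset(records, "dairy", n_products=24)
print(len({r["product"] for r in pilot}), "products,", len(pilot), "rows")
```

Sampling whole products rather than individual rows keeps each product's sales history intact, which matters when the model depends on that history.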
Stage #5: Prepare (and Explore) Data
From a manager’s point of view the important point here is that the preparation of the data (“beating it into shape”) so that it can be successfully modeled may require considerable time and resources. David Vogel et al. were the winners of the first (and second) round progress prizes in the $3 million Heritage Provider Network predictive modeling competition. In the paper explaining their approach and methods, about one third is devoted to documenting their preparation of the data. While some data mining and predictive analytics tools can work remarkably effectively with unprepared data (TreeNet is clearly one such tool), careful data preparation invariably leads to better, and often substantially better, performance.
Much of the data preparation is guided by rather straightforward examination of the main components of the data expected to be relevant to modeling.
Routine first steps:
- Ensuring that the data display appropriate values
- Scanning for extreme values and/or invalid values
- Locating oddities in the data such as unexpected gaps
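The checks listed above can be sketched in a few lines of Python. The helper names, the expected age range, and the sample values below are hypothetical; real projects would set the ranges from domain knowledge.

```python
from datetime import date, timedelta


def check_range(values, low, high):
    """Return the non-missing values that fall outside the expected [low, high] range."""
    return [v for v in values if v is not None and not (low <= v <= high)]


def find_date_gaps(dates):
    """Return (before, after) pairs of consecutive observed dates more than one day apart."""
    ordered = sorted(set(dates))
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > timedelta(days=1)]


# Illustrative data: two implausible ages and a three-day hole in daily sales dates.
ages = [34, 51, -1, 27, None, 999, 42]
sale_dates = [date(2013, 1, d) for d in (1, 2, 3, 7, 8)]

suspect_ages = check_range(ages, low=0, high=120)
gaps = find_date_gaps(sale_dates)

print("suspect ages:", suspect_ages)
print("date gaps:", gaps)
```

Checks like these do not fix anything by themselves; they only produce a list of items for the analyst to investigate.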
Often special care needs to be devoted to the handling of missing information, which might be encoded in different ways in different parts of the data. Data that has been collected over a fairly lengthy period of time may also suffer from changes in coding conventions over time, rendering the overall data set inconsistent. A simple example would be post codes initially stored as 9-digit ZIP codes and then later stored always in the ZIP+4 format. Without attention to such details, certain aspects of the analysis will suffer.
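A minimal sketch of this kind of cleanup follows, assuming the inconsistency is between undelimited digits and the hyphenated ZIP+4 form. The set of missing-value tokens is hypothetical; each data source needs its own list.

```python
import re

# Hypothetical spellings of "missing" seen across different source systems.
MISSING_TOKENS = {"", "na", "n/a", "null", "none", "?", "-99"}


def normalize_missing(raw):
    """Map the various encodings of missing data to None; otherwise return stripped text."""
    if raw is None:
        return None
    text = str(raw).strip()
    return None if text.lower() in MISSING_TOKENS else text


def normalize_zip(raw):
    """Return a ZIP code as '12345' or '12345-6789', or None if missing or unrecognized."""
    text = normalize_missing(raw)
    if text is None:
        return None
    digits = re.sub(r"\D", "", text)  # drop hyphens, spaces, and other separators
    if len(digits) == 5:
        return digits
    if len(digits) == 9:
        return f"{digits[:5]}-{digits[5:]}"
    return None  # unrecognized format: treated as missing here, could be flagged instead


for raw in ["921081234", "92108-1234", "92108", " N/A ", "-99", "9210"]:
    print(repr(raw), "->", normalize_zip(raw))
```

Treating unrecognized values as missing is a design choice made for brevity; logging them for manual review is often the safer option.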
There is nothing especially glamorous about data preparation and perhaps for this reason there are very few books or courses devoted specifically to this topic. The data cleaning process is crucial for achieving the best possible results, however, and we expect to offer a series of blog entries on this topic in the future.
Stage #6: Modeling
For the professional data analyst, modeling is the most enjoyable and stimulating part of the project. It is here that we get to show off our skills and the magic of modern technology, delivering results with modern analytical tools that were simply unthinkable using only standard statistical methods. In SPM 7.0 we offer CART, MARS, TreeNet, RandomForests, GPS Generalized PathSeeker (lasso-style regression, logistic regression, and the extended elastic net), ISLE model compression, and conventional regression, which gives the modeler a fair number of options to choose from.
Although modelers enjoy building and assessing models, the process of moving from first drafts to final versions can involve a fair bit of experimentation and trial and error. To accelerate this search process we recommend making use of modeling automation, such as that embedded in the SPM BATTERY feature.
By modeling automation we do not mean letting the machine do all the work, but allowing the machine to assist the modeler by automatically running and summarizing a series of experiments in which variations on a theme, tailored by the modeler, are run. This allows the modeler to keep analytical servers running 24 hours a day, building model variations that can be quickly reviewed in graphical summaries.
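The general idea of running "variations on a theme" can be illustrated with a short, generic Python sketch. This is not how the SPM BATTERY feature is implemented; the function `run_battery`, the toy nearest-neighbour model, and the synthetic data are all assumptions chosen to keep the example self-contained.

```python
import itertools
import math
import random


def knn_holdout_error(k, weighted, train, test):
    """Mean squared error of a toy 1-D k-nearest-neighbour regressor on held-out data."""
    total = 0.0
    for x, y in test:
        nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
        if weighted:
            weights = [1.0 / (abs(px - x) + 1e-9) for px, _ in nearest]
        else:
            weights = [1.0] * len(nearest)
        pred = sum(w * py for w, (_, py) in zip(weights, nearest)) / sum(weights)
        total += (pred - y) ** 2
    return total / len(test)


def run_battery(grid, score, **fixed):
    """Run one experiment per combination of settings in `grid`; return results best-first."""
    names = list(grid)
    results = []
    for combo in itertools.product(*(grid[name] for name in names)):
        settings = dict(zip(names, combo))
        results.append((score(**settings, **fixed), settings))
    return sorted(results, key=lambda r: r[0])


# Synthetic data: a noisy sine curve, split into training and holdout sets.
rng = random.Random(0)
data = [(x / 100, math.sin(x / 10) + rng.gauss(0, 0.2)) for x in range(200)]
rng.shuffle(data)
train, test = data[:150], data[150:]

# The modeler chooses which variations to try; the machine runs and ranks them.
summary = run_battery(
    {"k": [1, 5, 15, 45], "weighted": [False, True]},
    knn_holdout_error,
    train=train,
    test=test,
)
for error, settings in summary:
    print(f"{error:.4f}  {settings}")
```

The modeler still decides which settings are worth varying and how to read the summary; the loop only removes the drudgery of launching each run by hand.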
Stage #7: Evaluation, Interpretation, Understanding
Although any model can be treated as a black box prediction machine, there are usually valuable insights that can be extracted from the model, and it is worthwhile spending the time to secure them. We might learn, for example, that certain subpopulations of customers are unusually favorable or unfavorable for our objectives, or that we can shift selling effort from one segment to another. Evaluation, interpretation, and insight are by definition the domain of the thoughtful and informed reviewer, and we do not expect this stage of the project to be automated any time soon (with or without Watson).
The traditional large consulting companies have always understood this very well and have typically devoted the lion’s share of their efforts to telling the story unveiled by the technical analysis in any project. When analytics are undertaken in house the responsible department should take pains to perfect this portion of the project as well.
Stage #8: Full Project
The pilot project may yield results good enough to justify moving forward with the resulting models and insights. In this case, we would move immediately to “Deployment” below and then consider expansion of the pilot.
If the pilot is successful but the business requires the full project before deployment can be considered, then the steps above must be repeated in the context of a much larger scope. While much should have been learned from the pilot, the full project may reveal a host of new problems and challenges, and several conclusions derived from the pilot may need to be revised.
Pilot projects often benefit from the luxury of allowing detailed study of a narrowly defined set of data, supporting very high quality results. A risk for the expansion of the project is that insufficient resources are allocated under the mistaken understanding that the simple scaling up of the project is routine. Salford Systems has been part of several projects in which we thought the pilot was of far higher quality than the full project in all aspects of its execution due to resource allocation.
Stage #9: Deployment
Unless a model is intended to guide only strategic thinking, its real payoff comes only when it is deployed in some real-world process. Our retail sales prediction models are used to guide the logistics of a grocery store chain, dictating the number of cans of tuna fish to ship to each store in the network next week. Our ad optimization system recommends which of possibly many thousands of ads should be displayed right now for a given visitor to a given web page. Our lifetime customer value model scores credit card applicants on the basis of their likelihood to have nonzero balances on their card 12 months from now. These models all needed to be embedded in business processes to speed and improve the quality of decisions. Without deployment, models become little more than academic exercises or illustrations of what might be accomplished.
Deployment is a complex topic deserving of an extended discussion all of its own. The only point here is that a modeling system should support easy deployment by, for example, allowing export of models into reusable code such as Java, C, or PMML (SPM offers all three and more).
Here are the stages we described once again:
- Business Understanding (Specification of an Objective)
- Data Inventory and Understanding
- Assessment of Data for Suitability
- Pilot Project
- Prepare (and Explore) Data
- Modeling (Build Predictive Models)
- Evaluation, Interpretation, Understanding
- Full Project
- Deployment
We might also add a final stage: Extract lessons learned and refine “best practices” documents to guide future projects.