Customer churn presents a particularly vexing problem for businesses; every company loses clients or customers over time. It's no wonder that companies pour money and time into this issue: we've all heard that it's less costly to retain a customer than to attract a new one. Take the wireless telecommunications industry as an example. In 2003, 20-40% of wireless customers left their provider in a given year. As once-explosive subscriber growth slowed, retaining existing customers became increasingly important to a company's overall profitability. Currently, annual churn rates for telecommunications companies range from 10% to 67%. If the customers who are likely to churn can be identified, the company can target them with retention campaigns, giving them an incentive to stay and preventing loss of revenue.
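The targeting step described above can be sketched in a few lines. This is a minimal, hypothetical scoring rule in plain Python — the field names (`support_calls`, `months_tenure`, `monthly_bill`) and the hand-set weights are illustrative assumptions, not a fitted model; a real deployment would learn the weights from historical churn records.

```python
# Hypothetical customer records; field names are illustrative only.
customers = [
    {"id": 1, "support_calls": 7, "months_tenure": 3,  "monthly_bill": 80},
    {"id": 2, "support_calls": 0, "months_tenure": 48, "monthly_bill": 45},
    {"id": 3, "support_calls": 4, "months_tenure": 6,  "monthly_bill": 95},
]

def churn_score(c):
    # Hand-set illustrative weights: frequent support calls and short
    # tenure raise the score; long tenure lowers it.
    return (0.5 * c["support_calls"]
            - 0.1 * c["months_tenure"]
            + 0.01 * c["monthly_bill"])

# Rank customers so a retention campaign can target the riskiest first.
at_risk = sorted(customers, key=churn_score, reverse=True)
print([c["id"] for c in at_risk])  # → [1, 3, 2], riskiest first
```

In practice the scoring function would be replaced by a model trained on labeled history (who actually churned), but the campaign logic — score, rank, target the top of the list — stays the same.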
Our CEO and founder, Dr. Dan Steinberg, recently wrote about gradient boosting machines, a powerful machine learning technique that has been deployed with great success over the years in Kaggle competitions.
Guest post by Dr. Grant Humphries, postdoctoral researcher, University of California, Davis
Dr. Grant Humphries, from the Zoology department at the University of Otago, New Zealand, has spent the last three years studying how a bird species called Sooty Shearwaters can help predict upcoming El Niño occurrences. After much time and research, he has figured out a way to do so using data mining.
We recently had a question about running a model using GPS, and wanted to share the answer in case anyone else has the same issue.
Let's get right to it! You're a beginner, and you want to know what is needed to start data mining and become an experienced data scientist overnight. We get it - this is the world we live in - quick and dirty. So here we go, take notes!
Updated: July 16, 2013
In their 1984 monograph, Classification and Regression Trees, Breiman, Friedman, Olshen and Stone discussed at length the need to obtain “honest” estimates of the predictive accuracy of a tree-based model. At the time the monograph was written, many data sets were small, so the authors took great pains to work out an effective way to use cross-validation with CART trees.
The result was a major advance for data mining, introducing ideas that at the time were radically new. The main point of the discussion was that the only way to avoid overfitting was to rely on test data. With plentiful data we can always reserve a portion for testing, but with fewer data we might have to rely on cross-validation. In either case, however, only the test or cross-validated results should be trusted. In contrast, earlier approaches tended to rely on training data performance and ignore test results altogether.
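The mechanics of cross-validation can be shown in plain Python. This sketch uses a trivial 1-nearest-neighbour classifier as a stand-in model (it does not reimplement CART); the point is the fold bookkeeping: every observation is scored exactly once, by a model that never saw it during training.

```python
def k_fold_indices(n, k):
    """Yield (train, test) index lists for k roughly equal folds."""
    fold_size = n // k
    indices = list(range(n))
    for i in range(k):
        test = indices[i * fold_size:(i + 1) * fold_size]
        train = indices[:i * fold_size] + indices[(i + 1) * fold_size:]
        yield train, test

def predict_1nn(train_x, train_y, x):
    # Classify by the label of the nearest training point.
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]

# Toy one-dimensional data: two well-separated classes.
xs = [0.1, 0.2, 0.3, 0.4, 2.1, 2.2, 2.3, 2.4]
ys = [0,   0,   0,   0,   1,   1,   1,   1]

correct = total = 0
for train, test in k_fold_indices(len(xs), 4):
    tx = [xs[i] for i in train]
    ty = [ys[i] for i in train]
    for i in test:
        # Each point is predicted by a model fit WITHOUT that point.
        correct += predict_1nn(tx, ty, xs[i]) == ys[i]
        total += 1

print(f"cross-validated accuracy: {correct / total:.2f}")  # → 1.00
```

Because each held-out point never influences the model that scores it, the resulting accuracy is an "honest" estimate in the monograph's sense, unlike accuracy measured on the training data itself.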
One of the most important controls in TreeNet is the maximum number of terminal nodes permitted in each tree (the NODES=number parameter setting on the TreeNet command). You might think that if you ask for, say, NODES=4, all of your trees would have no more than 4 terminal nodes. However, that is not exactly how things turn out unless your data contain no missing values. If there are missing values in your data and variables with missing values are used as splitters, then the trees may actually contain more nodes than expected.
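A small sketch makes the node-count effect concrete. This is not TreeNet's actual internal representation — the nested-dict trees below are a hypothetical illustration — but it shows how a splitter that also routes missing values down its own branch can push a nominally 4-leaf tree to 5 terminal nodes.

```python
def count_leaves(node):
    """Count terminal nodes; a node with no children is a leaf."""
    children = node.get("children", [])
    if not children:
        return 1
    return sum(count_leaves(child) for child in children)

# A 4-leaf tree, as a NODES=4 setting would suggest.
clean_tree = {"split": "age", "children": [
    {"split": "income", "children": [{}, {}]},
    {"split": "tenure", "children": [{}, {}]},
]}

# Same structure, but "income" has missing values, so that split
# carries an extra branch for the missing cases: 5 leaves total.
tree_with_missings = {"split": "age", "children": [
    {"split": "income", "children": [{}, {}, {"note": "missing"}]},
    {"split": "tenure", "children": [{}, {}]},
]}

print(count_leaves(clean_tree))          # → 4
print(count_leaves(tree_with_missings))  # → 5
```

The takeaway: when auditing tree sizes, count terminal nodes in the fitted trees themselves rather than assuming the NODES cap was hit exactly.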
When beginning a data analysis project, analysts often discover that the data as presented or made available is not ready for analysis. The reasons for this lack of readiness could be many, including: