Data mining has a plethora of challenging aspects. Some of these challenges are common among nearly all data scientists, analysts, and predictive modelers while others are more industry-specific. Nevertheless, we all run into a snag here and there (hopefully more like there, not here) and it can be a trying task to overcome our day-to-day or project-to-project challenges.
I recently stumbled upon a 2005 article, The Top 10 Data Mining Mistakes -- and how to avoid them, by John Elder of Elder Research, Inc., and it got me thinking about the challenges that data scientists, analysts, and statisticians face when they are deep into a data mining project. I decided to ask some data scientist friends of mine what they consider to be the most challenging aspects of data mining. The response I got was a great overview of realistic, data-driven challenges that folks in this line of work deal with regularly.
Data Mining Challenges You Might Recognize:
- Poor-quality data, such as dirty data, missing values, inadequate data size, and poor representation in data sampling.
- Lack of understanding/lack of diffusion of data mining techniques in academic arenas.
- The lack of good literature on important data mining topics and techniques.
- Academic institutions have trouble accessing commercial-grade software at a reasonable cost.
- Data variety - trying to accommodate data that comes from different sources and in a variety of different forms (images, geo data, text, social, numeric, etc.).
- Data velocity - online machine learning requires models to be constantly updated with new, incoming data.
- Dealing with huge datasets, or 'Big Data,' that require distributed approaches.
- Coming up with the right question or problem - "More data beats the better algorithm, but smarter questions beat more data," Gregory Piatetsky, www.KDnuggets.com
- Remaining objective and allowing the data to lead you, not the opposite. Preconceived notions can be dangerous, but luckily it is in our power to resist them...
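To make the data velocity item above a little more concrete: the core idea is to fold each new observation into what you already know, without revisiting old records. Below is a minimal sketch of that idea using a running mean and variance (Welford's algorithm). It is only an illustration, not a full online learning model, and the class name `RunningStats` and the sample values are hypothetical.

```python
class RunningStats:
    """Mean and variance updated one observation at a time (Welford's algorithm)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        """Fold one new observation in; earlier data is never revisited."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Sample variance (n - 1 denominator); 0.0 until two values are seen."""
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0


stats = RunningStats()
for value in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:  # stand-in for a live feed
    stats.update(value)
print(stats.n, stats.mean, stats.variance)
```

The same pattern (keep a small state, update it per record) is what incremental model training relies on, just with model parameters in place of a mean and variance.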
In this article I am unable to elaborate on all of the challenges, but I can offer some insight and possible solutions to a few of them:
Issues with Limited Data
Small datasets often yield a less-than-suitable predictive model. They may contain missing values that skew the results, and they may not be a representative sample of the population, which leads to unrepresentative results. These problems create further challenges for data usefulness, predictive accuracy, and variable interaction effects. However, limited data does not have to be a major roadblock in the modeling process. This article shows how you can still achieve accurate results when working with small datasets.
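One common way to get an honest accuracy estimate from a small dataset is leave-one-out cross-validation: each row is held out once, the model is fit on the rest, and the held-out errors are averaged. The sketch below is my own illustration and not necessarily the approach taken in the article mentioned above. It assumes a simple one-variable linear model, and the function names `fit_line` and `loo_cv_mae` and the toy data are hypothetical.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a + b * x; returns (a, b)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:  # all x values identical: fall back to predicting the mean
        return my, 0.0
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b * mx, b


def loo_cv_mae(xs, ys):
    """Leave-one-out cross-validation: mean absolute error on held-out rows."""
    errors = []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]  # every row except row i
        train_y = ys[:i] + ys[i + 1:]
        a, b = fit_line(train_x, train_y)
        errors.append(abs(ys[i] - (a + b * xs[i])))
    return mean(errors)


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]  # toy data, roughly y = 2x
print(loo_cv_mae(xs, ys))
```

Because every row serves as test data exactly once, no observations are "wasted" on a fixed holdout set, which matters most when rows are scarce.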
Useful Literature in Data Mining
Although reference material for data mining can sometimes be hard to come by, knowing the right places to look can reduce the tedious exercise of searching the Web for good literature. I've listed a few specific articles, as well as some key websites, for you to familiarize yourself with:
- Greedy Function Approximation: A Gradient Boosting Machine
- The Elements of Statistical Learning
- Constrained Tree Structure Method System
- CART Monograph
- Stochastic Gradient Boosting
- From Data Mining To Knowledge Discovery In Databases
- Stanford Technical Reports Archive
- Data Mining Blogs
- On-Demand Webinars (Salford Systems)
- On-Demand Webinars (General)
Software Cost Solutions
This challenge mostly affects academic institutions and small consulting firms that want to work with top-of-the-line predictive modeling software but lack the budget and/or ability to justify such a purchase. Sadly there is little to comment on here, but there ARE a couple of pathways that can help support the cause.
Salford Systems provides free software use for university students if the department, university, or professor has obtained a license to be used as part of the course curriculum. Additionally, special discounts are offered to academic institutions, students, and professors (as low as $75 per license for students). Pricing information can be obtained online.
As I mentioned, consultants and smaller companies usually deal with the challenge of price sensitivity as well. What I have discovered (at least when it comes to Salford's software) is that people face the dilemma of deciding which specific tools will prove the most useful (generate the most ROI) in their data mining projects, and which tools they are willing to forgo. An article that may help if you share this dilemma is titled "Data Mining on a Budget: Choose Wisely."
Working with Big Data
Big data, big data, big data - buzzword or real word? Well, the definition of big data will vary from modeler to modeler, making this another difficult challenge to tackle. According to Wikipedia, big data "is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization."
Although I cannot speak for other software vendors, if you need to work with large data, Salford Systems recommends combining the benefits of the GUI and non-GUI to support massive amounts of data.
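Independent of any vendor's tooling, the basic shape of most approaches to large data is the same: summarize manageable chunks on their own, then merge the summaries. Here is a minimal, generic sketch of that pattern using only the Python standard library. The function names, the chunk size, and the tiny in-memory sample are hypothetical; in practice the input would be a large file handle, and the per-chunk step could run on separate machines.

```python
import csv
import io
from collections import Counter
from itertools import islice


def chunks(rows, size):
    """Yield lists of at most `size` rows so only one chunk is in memory at a time."""
    rows = iter(rows)
    while True:
        chunk = list(islice(rows, size))
        if not chunk:
            return
        yield chunk


def count_chunk(chunk, column):
    """'Map' step: summarize one chunk on its own."""
    return Counter(row[column] for row in chunk)


def count_column(lines, column, size=2):
    """'Reduce' step: merge the per-chunk summaries into one total."""
    total = Counter()
    for chunk in chunks(csv.DictReader(lines), size):
        total += count_chunk(chunk, column)
    return total


# A tiny in-memory sample stands in for a file too large to load at once.
sample = io.StringIO("region,sales\nwest,10\neast,7\nwest,3\nnorth,5\nwest,1\n")
region_counts = count_column(sample, "region")
print(region_counts)
```

Counts merge cleanly because addition does not care about chunk boundaries; statistics that are not simple sums need a summary designed to be mergeable.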
Did we miss any major challenges that you face? Leave more data mining challenges in the comments below!