Over the past few weeks I have been talking to a lot of people who are looking at how data mining can be used in their organisation, for their projects and to people who have been doing data mining for a log time.
What comes across from talking to the experienced people, and these people are not tied to a particular product, is that you need to concentrate on the business problem. Once you have this well defined then you can drill down to the deeper levels of the project. Some of these levels will include what data is needed (not what data you have), tools, algorithms, etc.
Statistics is only a very small part of a data mining project. Some people who have PhDs in statistics who work in data mining say you do not use or very rarely use their statistics skills.
Some quotes that I like are:
"Focus hard on Business Question and the relevant target variable that captures the essence of the question." Dean Abbott PAW Conf April 2012
"Find me something interesting in my data is a question from hell. Analysis should be guided by business goals." Colin Shearer PAW Conf Oct 2011
There has need a lot of blog posting and articles on what are the key skills for a Data Miner and the more popular Data Scientist. What is very clear from all of these is that you will spend most of your time looking at, examining, integrating, manipulating, preparing, standardising and formatting the data. It has been quoted that all of these tasks can take up to 70% to 85% of a Data Mining/Data Scientist time. All of these tasks are commonly performed by database developers and in particular the developers and architects involved in Data Warehousing projects. The rest of the time for the running of the data mining algorithms, examining the results, and yes some stats too.
Every little time is spent developing algorithms!!! Why is this ? Would it be that the algorithms are already developed (for a long time now and are well turned) and available in all the data mining tools. We can almost treat these algorithms as a black box. So one of the key abilities of a data miner/data scientist would be to know what the algorithms can do, what kind of problems they can be used for, know what kind of outputs they produce, etc.
Domain knowledge is important, no matter how little it is, in preparing for and being involved in a data mining project. As we define our business problem the domain expert can bring their knowledge to the problem and allows us separate the domain related problems from the data related problems. So the domain expertise is critical at that start of a project, but the domain expertise is also critical when we have the outputs from the data mining algorithms. We can use the domain knowledge to tied the outputs from the data mining algorithms back to the original problem to bring real meaning to the original business problem we are working on.
So what is the formula of skill sets for a data mining or data scientist. Well it is a little like the title of this blog;
Domain Knowledge + Data Skills + Data Mining Skills + a little bit of Machine Learning + a little bit of Stats = a Data Miner / Data Scientist