Wednesday, August 24, 2016

How to get ORE to work with APEX

This blog post will bring you through the steps of how to get Oracle R Enterprise (ORE) to work with APEX.

The reason for this blog posts is that since ORE 1.4+ the security model has changed for how you access and run in-database user defined R scripts using the ORE SQL API functions.

I have a series of blog posts going out on using Oracle Text, Oracle R Enterprise and Oracle Data Mining. It was during one of these posts I wanted to show how easy it was to display an R chart using ORE in APEX. Up to now my APEX environment consisted of APEX 4 and ORE 1.3. Everything worked, nice and easy. But in my new APEX environment (APEX 5 and ORE 1.5), it didn't work. This is the calling of an in-database user defined R script using the SQL API functions didn't work. Here is the error message that is displayed.

NewImage

So something extra was needed with using ORE 1.5. The security model around the use of in-database user defined R scripts has changed. Extra functions are now available to allow you who can run these scripts. For example we have an ore.grant function where you can grant another user the privilege to run the script.

But the problem was, when I was in APEX, the application was defined on the same schema that the r script was created in (this was the RQUSER schema). When I connect to the RQUSER schema using ORE and SQL, I was able to see and run this R script (see my previous blog post for these details). But when I was in APEX I wasn't able to see the R script. For example, when using the SQL Workshop in APEX, I just couldn't see the R script.

NewImage

Something strange is going on. It turns out that the view definitions for the in-database ORE scripts are defined with

owner=SYS_CONTEXT('USERENV', 'SESSION_USER');

(Thanks to the Oracle ORE team and the Oracle APEX team for their help in working out what needed to be done)

This means when I'm connected to APEX, using my schema (RQUSER), I'm not able to see any of my ORE objects.

How do you overcome this problem ?

To fix this problem, I needed to grant the APEX_PUBLIC_USER access to my ORE script.

ore.grant(name = "prepare_tm_data_2", type = "rqscript", user = "APEX_PUBLIC_USER")

Now when I query the ALL_RQ_SCRIPTS view again, using the APEX SQL Workshop, I now get the following.

NewImage

Great. Now I can see the ORE script in my schema.

Now when I run my APEX application I now get graphic produced by R, running on my DB server, and delivered to my APEX application using SQL (via a BLOB object), displayed on my screen.

NewImage

Monday, August 22, 2016

My 2nd Book: is now available: Real World SQL and PL/SQL

It has been a busy 12 month. In addition to the day jobs, I've also been busy writing. (More news on this in a couple of weeks!)

Today is a major milestone as my second book is officially released and available in print and ebook formats.

The tile of the book is 'Real Word SQL and PL/SQL: Advice from the Experts'. Check it out on Amazon.

Now that sounds like a very fancy title, but it isn't meant to be. This book is written by 5 people (including me), who are all Oracle ACE Directors, who all have 20+ years of experience, each, of working with the Oracle Database, and we all love sharing our knowledge. My co-authors are Arup Nanda, Heli Helskyaho, Martin Widlake and Alex Nuitjen. It was a pleasure working with you.

I haven't seen a physical copy of the book yet !!! Yes the book is released and I haven't held it in my hands. Although I have seen pictures of it that other people have taken. There was a delay in sending out the author copies of the book, but as of this morning my books are sitting in Stansted Airport and should be making their way to Ireland today. So fingers crossed I'll have them tomorrow. I'll update this blog post with a picture when I have them. UPDATE: They finally arrived at 13:25 on the 22nd August.

NewImage

In addition to the 5 authors we also had Chet Justice (Oraclenerd), and Oracle ACE Director, as the technical editor. We also had Tim Hall, Oracle ACE Director, wrote a foreword for us.

NewImage NewImage

To give you some background to the book and why we wrote it, here is an extract from the start of the book, where I describe how the idea for this book came about and the aim of the book.

"While attempting to give you an idea into our original thinking behind the need for this book and why we wanted to write it, . the words of Rod Stewart’s song ‘Sailing’ keeps popping into my mind. These are ‘We are sailing, we are sailing, home again 'cross the sea’. This is because the idea for this book was born on a boat. Some call it a ship. Some call it a cruise ship. Whatever you want to call it, this book was born at the OUG Norway conference in March 2015. What makes the OUG Norway conference special is that it is held on a cruise ship that goes between Oslo in Norway to Kiel in Germany and back again. This means as a speaker and conference attendee you are ‘trapped’ on the cruise ship for 2 days filled with presentations, workshops, discussions and idea sharing for the Oracle community.

It was during this conference that Heli and Brendan got talking about their books. Heli had just published her Oracle SQL Developer Data Modeler book and Brendan had published his book on Oracle Data Miner the previous year. Whilst they were discussing their experiences of writing and sharing their knowledge and how much they enjoyed this,they both recognized that there are a lot of books for the people starting out in their Oracle career and then there are lots of books on specialized topics. What was missing were books that covered the middle group. A question they kept on asking but struggled to answer was, ‘after reading the introductory books, what book would they read next before getting onto the specialized books?’ This was particularly true of SQL and PL/SQL.

They also felt that something that was missing from many books, especially introductory ones, was the “Why and How” of doing things in certain ways that comes from experience. It is all well and good knowing the syntax of commands and the options, but what takes people from understanding a language to being productive in using it is that real-world derived knowledge that comes from using it for real tasks. It would be great to share some of that experience.

Then over breakfast on the final day of the OUG Norway conference, as the cruise ship was sailing through the fjorrd and around the islands that lead back to Oslo, Heli and Brendan finally agreed that this book should happen. They then listed the type of content they thought would be in such a book and who are the recognized experts (or super heroes) for these topics. This list of experts was very easy to come up with and the writing team of Oracle ACE Directors was formed, consisting of Arup Nanda, Martin Widlake and Alex Nuijten, along with Heli Helskyaho and Brendan Tierney. The author team then got to work defining the chapters and their contents. Using their combined 120+ years of SQL and PL/SQL experience they finally came up with scope and content for the book at Oracle Open World.

..."

As you can see, this book was 17 months in the making. This consisted of 4 months of proposal writing, research and refinement, 8 months of writing, 3 months of editing and 2 months for production of book.

Yes it takes a lot of time and commitment. We all finished our last tasks and final edits on the book back in early June. Since then the book has been sent for printing, converted into an ebook, books shipped to Oracle Press warehouse, then shipped to Amazon and other book sellers. Today it is finally available officially.

(when I say officially, it seems that Amazon has shipped some pre-ordered books a week ago)

If you are at Oracle Open World (OOW) in September make sure to check out the book in the Oracle Book Store, and if you buy a copy try to track us down to get us to sign it. The best way to do this is to contact us on Twitter, leave a message at the Oracle Press stand, or you will find us hanging out at the OTN Lounge.

Wednesday, August 17, 2016

Oracle Text, Oracle R Enterprise and Oracle Data Mining - Part 3

This is the third blog post of a series on using Oracle Text, Oracle R Enterprise and Oracle Data Mining. Check out the first and second blog posts of the series, as the data used in this blog post was extracted, processed and stored in a databases table.

This blog post is divided into 3 parts. The first part will build on what was covered in in the previous blog post and will expand the in-database ORE R script to include more data processing. The second part of this blog post will look at how you can use SQL to call our in-database ORE R scripts and to be able to include it in our custom applications, for example using APEX (part 3).

Part 1 - Expanding our in-database ORE R script for Text Mining

In my previous blog post we created an ORE user defined R script, that is stored in the database, and this script was used to perform text mining and to create a word cloud. But the data/text to be mined was processed beforehand and passed into this procedure.

But what if we wanted to have a scenario where we just wanted to say, here is the table that contains the data. Go ahead and process it. To do this we need to expand our user defined R script to include the loop to merge the webpage text into one variable. The following is a new version of our ORE user defined R script.

> ore.scriptCreate("prepare_tm_data_2", function (local_data) { 
  library(tm)
  library(SnowballC)
  library(wordcloud)
  
  tm_data <-""
  for(i in 1:nrow(local_data)) {
    tm_data <- paste(tm_data, local_data[i,]$DOC_TEXT, sep=" ")
  }
    
  txt_corpus <- Corpus (VectorSource (tm_data))
  
  # data clean up
  tm_map <- tm_map (txt_corpus, stripWhitespace) # remove white space
  tm_map <- tm_map (tm_map, removePunctuation) # remove punctuations
  tm_map <- tm_map (tm_map, removeNumbers) # to remove numbers
  tm_map <- tm_map (tm_map, removeWords, stopwords("english")) # to remove stop words
  tm_map <- tm_map (tm_map, removeWords, c("work", "use", "java", "new", "support"))

  # prepare matrix of words and frequency counts
  Matrix <- TermDocumentMatrix(tm_map) # terms in rows
  matrix_c <- as.matrix (Matrix)
  freq <- sort (rowSums (matrix_c)) # frequency data
  
  res <- data.frame(words=names(freq), freq)
  wordcloud (res$words, res$freq, max.words=100, min.freq=3, scale=c(7,.5), random.order=FALSE, colors=brewer.pal(8, "Dark2"))
} ) 

To call this R scipts using the embedded R execution we can use the ore.tableApply function. Our parameter to our new R script will now be an ORE data frame. This can be a table in the database or we can create a subset of table and pass it as the parameter. This will mean all the data process will occur on the Oracle Database server. No data is passed to the client or processing performed on the client. All work is done on the database server. The only data that is passed back to the client is the result from the function and that is the word cloud image.

> res <- ore.tableApply(MY_DOCUMENTS, FUN.NAME="prepare_tm_data_2")
> res

Part 2 - Using SQL to perform R Text Mining

Another way you ccan call this ORE user defined R function is using SQL. Yes we can use SQL to call R code and to produce an R graphic. Then doing this the R graphic will be returned as a BLOB. So that makes it easy to view and to include in your applications, just like APEX.

To call our ORE user defined R function, we can use the rqTableEval SQL function. You only really need to set two of the parameters to this function. The first parameter is a SELECT statement the defines the data set to be passed to the function. This is similar to what I showed above using the ore.tableApply R function, except we can have easier control on what records to pass in as the data set. The fourth parameter gives the name of the ORE user defined R script.

select *
from table(rqTableEval( cursor(select * from MY_DOCUMENTS),
                        null,
                        'PNG',
                        'prepare_tm_data_2'));

This is the image that is produced by this SQL statement and viewed in SQL Developer.

NewImage

Part 3 - Adding our R Text Mining to APEX

Adding the SQL to call an ORE user defined script is very simple in APEX. You can create a form or a report based on a query, and this query can be the same query that is given above.

Something that I like to do is to create a view for the ORE SELECT statement. This gives me some flexibility with some potential future modifications. This could be as simple as just changing the name of the script. Also if I discover a new graphic that I want to use, all I need to do is to change the R code in my user defined R script and it will automatically be picked up and displayed in APEX. See the images below.

WARNING: Yes I do have a slight warning. Since the introduction of ORE 1.4 and higher there is a slightly different security model around the use of user defined R scripts. Instead of going into the details of this and what you need to do in this blog post, I will have a separate blog post that describes the behaviour and what you need to do allow APEX to use ORE and to call the user defined R scripts in your schema. So look out for this blog post coming really soon.

NewImage In this blog post I showed you how you use Oracle R Enterprise and the embedded R execution features of ORE to use the text from the webpages and to create a word cloud. This is a useful tool to be able to see visually what words can stand out most on your webpage and if the correct message is being put across to your customers.

Monday, August 8, 2016

Oracle Text, Oracle R Enterprise and Oracle Data Mining - Part 2

This is the second blog post of a series on using Oracle Text, Oracle R Enterprise and Oracle Data Mining. Check out the first blog post of the series, as the data used in this blog post was extracted, processed and stored in a databases table.

In this blog post I will show you how you use Oracle R Enterprise and the embedded R execution features of ORE to use the text from the webpages and to create a word cloud. This is a useful tool to be able to see visually what words can stand out most on your webpage and if the correct message is being put across to your customers.

Prerequisites: You will need to load the following R packages into your R environment 'tm', 'word cloud' 'SnowballC'. These are required to process the following R code segments.

install.packages (c( "tm", "wordcloud", "SnowballC"))
library (tm)
library (wordcloud)
library (SnowballC)

Select data from table and prepare: We need to select the data from the table in our schema and to merge it into one variable.

local_data <- ore.pull(MY_DOCUMENTS)

tm_data <-""
for(i in 1:nrow(local_data)) {
  tm_data <- paste(tm_data, local_data[i,]$DOC_TEXT, sep=" ")
}
tm_data

Create function to perform Text Mining: In my previous blog post on creating a word cloud I gave the R code. In order to allow for this R code to be run on the database server (using the embedded R execution of ORE) we need to package this text mining R code up into a ORE user defined R script. This is stored in the database.

ore.scriptDrop("prepare_tm_data")
ore.scriptCreate("prepare_tm_data", function (tm_data) { 
  library(tm)
  library(SnowballC)
  library(wordcloud)
  
  txt_corpus <- Corpus (VectorSource (tm_data))
  
  # data clean up
  tm_map <- tm_map (txt_corpus, stripWhitespace) # remove white space
  tm_map <- tm_map (tm_map, removePunctuation) # remove punctuations
  tm_map <- tm_map (tm_map, removeNumbers) # to remove numbers
  tm_map <- tm_map (tm_map, removeWords, stopwords("english")) # to remove stop words
  tm_map <- tm_map (tm_map, removeWords, c("work", "use", "java", "new", "support"))
  
  # prepare matrix of words and frequency counts
  Matrix <- TermDocumentMatrix(tm_map) # terms in rows
  matrix_c <- as.matrix (Matrix)
  freq <- sort (rowSums (matrix_c)) # frequency data
  
  res <- data.frame(words=names(freq), freq)
  wordcloud (res$words, res$freq, max.words=100, min.freq=3, scale=c(7,.5), random.order=FALSE, colors=brewer.pal(8, "Dark2"))
} ) 

Before we can run this user define R script, we need to ensure that we have the 'tm', 'SnowballC' and 'wordcloud' R packages installed on the Oracle Database server. On the Oracle Database server you need to rune ORE.

> library(ORE)

Then run the following command to install these R packages

> install_packages(c('tm','wordcloud', 'SnowballC'))

Run the function on the DB Server: You are now ready to run the function. In an earlier step we had gathered the data. Now we can pass this data to the in-database R script.

> res <- ore.doEval(FUN.NAME="prepare_tm_data", tm_data=tm_data)

The ore.doEval function is a general purpose ORE function. In this case we pass it two parameters. The first parameter is the neame of the user defined R script stored in the database, and the second parameter is the data. The function returns and ORE object that contains the word cloud graphic.

Display the results: You can very easily display the results.

> res

This gives us the following graphic.

NewImage

In my next blog post, of this series, I will show you how you can use the function created above and some other bits and pieces, using some other features of ORE and also in SQL.

Monday, August 1, 2016

Why Data Science projects fail

Over the past few weeks or months (maybe even years) I've had several conversations with various people about why Data Science (or whatever you want to call it) projects fail or never really get started.

Before we go any further perhaps we need to define what 'fail' means in these conversations. Typically fail means that the project doesn't deliver what was hoped for, it got bogged down is some technical or political issues, it did not deliver useful results, and more typically it is run once (or a couple of times) and never run again. You get the idea.

The following points outline some of the most typical reasons why Data Science projects fail, but this is not an exhaustive list. This list is just some of the most typical reason.

  • We need Big Data: It seems like everything that you read says you need Big Data for your data science project. Firstly what big data means to one person or company can be very different to what it means for another person/company. One possible definition is that it might include all the various social media and log type of data. If you don't have all of this data then no big deal. You can still do data science projects. You have lots and lots of other data. The data that you generate every day for the general running of your business. You can use that. If you have some history of this data going back over a few months or a couple of years then even better (and most of you will say Yes I have that data). Work with the data that you already have, that you already understand, that you are already using, etc and use that data to see if you can gain extra insights that will have some value to your business (it needs to have value otherwise whats the point). Some people call this everyday type of data you have, 'Small Data'. Big Data or Small Data are really bad terms. It is just Data. Let us work with data we already have and incrementally add in newer data (from your typical 'Big Data' sources) with each iteration of the data science project.
  • We need Big Technology: This kind of follows on from the mistake of believing we need Big Data to do our data science projects. As most companies will be working with the data that they already have, and you will have various technology solutions in place to manage this data. Then do we really need Big Data Technology solutions for our Data Science projects? Technologies like Hadoop and everything that goes along with it. The simple answer is 'No You Don't'. Now don't get me wrong. These technologies are important with it comes to managing Big Data, but you don't needs these to perform your data science projects. Many, many companies both large and small are performing data science projects using their existing technology solutions and have perhaps just added some analytics tools to support their project using the data that they are already managing. Most companies have databases to store and manage their data. You can use your analytics software to work with the data in these database to analyse, model and predict. Any results that are produced can be easily integrated back into these databases and the results can then be used by various groups within your organisation. Use the technologies you have, that you understand, that you can use to the max, supplemented with some newer analytics software that works with all of these for your data science projects. (An example: one project I've worked on included a retail organisation for one of the largest countries in the work. I was working with 3 years of sales data. Is this big data? I was able to use my laptop to perform advanced analytics on all their data)
  • Old School Data Science: Give me all your data, I'll analyse it and tell you what is happening. Unfortunately this kind of phrases are still very common. They are common and considered out of date 20 years ago when I worked on my first data science project (it wasn't called data science back then). If you do come across someone saying this to you, I would question their ability to deliver anything. If it was me, I would just say 'No thank you', and move onto someone else. You as a company will already know a lot of what is happening in your business, what data is currently being used for and any potential areas where you know advanced analytics and data science can help. You will know that the focus areas should be and how good or not your data is. You need someone who can help you to identify the key areas and what data science techniques can be used to help you to gain (a possible) greater insight into what is happening.
  • No clear objective or business question/problem and no measurable outcomes: In a way this is very similar to the previous point. You don't get into your car each morning and start driving, with the eventual hope that you arrive at work on time. No, you plan what you want to do (get to work), how you are going to get then (using your car) and when you want to get there by (your work start time). Using these you then plan out what is the best route you need to take to get to work, in the most efficient way you can, using your knowledge and experience of the road network, supplemented by traffic reports and making adjustments as necessary, to ensure that you get to work on time. This is exactly the same for data science projects. You need a good clear objective, that can be broken down into distinct problems, that will each require a specific set of advanced analytics to generate a measurable outcome. The measurable outcomes should allow you to measure if the advanced analytics actually gives you a valuable return. For example if you predict that you can increase sales by 3%, this sound good. But if the cost of implementing the solution is treating any the profit generated then you might decide that this solution is not worth continuing with.
  • Not productionalising the outcomes: This point follows on from the previous two points. A lot of what you read and a lot of what I've seen is that Data Science looks are discovering some new (and actionable) insights. But that is where the discussion ends. As if a report is produced that makes a recommendation or a list of customers to target, and that is it. What happens to your data science project then. It really gets canned or you might be told that we will come back to it in a few months (and possibly a year) from now. This is not what you really want. Why? because when you finally remember to come back to review the project and to do another run, the people who where involved in the original project have moved on or are not available. It then become too difficult to start over again and that is when the data science project fails. I've used the word 'productionalising' (is that a real word?) What I mean by that is that we need to take our data science project and build it into our every day applications and processes. For example if we build a customer risk model for loans in a bank. This should be built into the application that captures the loan application by the customer. That way when the bank employee is entering the loan application they can be given live feedback. They can then use this live feedback to address any issues with the customer. What can be typical is that this is discovered some weeks later when the loan has already been approved. We need to automate the use of our data science work. Another example is fraud detection. I know of several companies who have fraud detection measures in place. It can take them 4-6 weeks to identify a potential fraud case that needs investigation. Using data science and building this into their transaction monitoring systems they can now detect potential fraud cases in near real time )no big data architectures being used). By automating it we get quicker response and take actions at the right time. The quicker we can react the more money we can make or save. This is an area that a lot of companies are now focusing on when they are looking at data science project as this is they way that they can get a quicker return on their investment in their data science projects.
  • Very little senior management support: I think most of the data science projects are supported by senior management to some extent. The more successful the data science project the more involved the senior managers are and the more they understand of what these projects can potentially deliver. But with the ever changing and evolving world of IT most of the senior managers are very focused on the here and now, keeping the lights on, making sure their day-to-day applications are up and running, the backups and recovery processes are in place (and tested), and future proofing their application. It is well known that very little time and resources (human and money) are available for adding new functionality. Most of what I've mentioned is very IT related and perhaps the IT managers are not the most suitable people to sponsor data science projects. I've already some of the reasons but sometimes IT can get a bit caught up with the technology and trying to use the newest thing. Some of the most successful projects I've worked on have had senior managers from a business function. They will not be focused on the technology but on the processes around the data science project and how the outputs of the data science project can be used. The more focused they are on this the more successful the project will be. They will then act as the key to informing (and selling) the rest of the business on the success of the project. This in turn create more and more data scicene projects and will keep you busy for a long time to come
  • Ticking the box: Unfortunately I've seen this in way too many companies. Board level or the senior management team have hear about data science and all the magic that is can produce. The message is then passed down through the organisation that we need to be doing more and more of this. A business unit is chosen as for the pilot project. The pilot is completed, successfully, and the good news message is fed back up the ladder. But that is when enthusiasm ends. We have done a data science project, it was successful and now lets move on to the next thing. I've seen pilot or POC project that have proven to potentially save $10+M a year with a cost of $100K per year, being canned. Yes I've been told this is fantastic, this is beyond our wildest dreams. Only for nothing else to happen.
  • The data is no good: You need data, you need historical data. The more you have more more useful it will be for the data science project. But what if the data is of poor quality? How can this happen? Well it can happen very frequently. You may have applications that are poorly designed, that have a very poor data model, the staff are not trained correctly to ensure that good data gets entered, etc. etc. The list could go on and on. It is one thing for an application to capture data but if that data cannot be used for any meaningful purposes then it has very little value. Some companies have people hired that constantly inspect the data, assess the quality of the data and are then feeding back ideas on how to improve the quality of the data captured by the applications and also by the people inputting the data. Without good quality data then there is very little a data science project can do to magically convert it into good quality data. I've been in the situation where >90% of the data was unusable. We give them a list what improvements they needed to make and only come back to use then they have completed these and have at least 6 months of good quality data. We might be able to do something then. We never heard from them again. Also I get to talk to a lot of start ups who want to have data science build in from day one. These have very little 'real' data. Again I get to tell them come back to me when you have 6 months of data.
  • Too much focus on descriptive analytics: Although descriptive analytics is an important step in the early stages of all data science projects, they is still a huge number of consulting and product companies who are promoting this as a data science project. Like I said descriptive analytics is an important step, but it doesn't end there. It is just the beginning. When selecting a consulting or product company to partner with on your data science projects you need to ensure that they are offering more than just descriptive analytics. In a similar way to what I've mentioned in the points above, you need to look at how you can make use of these descriptive analytics and share them with the wider community in your company. But you also need to have some control over the proliferation of various visualisation tools. Descriptive analytics and visualisations is not data science or a major output of data science. It is only one part of a data science project and far more value outputs from a data science project can be achieved by using one or more of the advanced analytics methods that are available to you.
  • Ignoring your BI/DW: Unfortunately when it comes to a lot of data science projects your have two very different approaches to working with the data. One approach seems to be that we will look at your data that is available in the transactional databases (and other data sources), we will then look at how to integrate and clean this data before getting onto the fund stuff of exploring and then performing the advanced analytics. This approach completely ignores the BI team and any data warehouse that might exist. If a data warehouse already exists then it probably contains all or most of the data you are going to use. Therefore you can avoid all that them spent integrating and cleaning the data. The data warehouse will have this done for you. Plus the data warehouse will have a lot more data than what the current transactional databases will contain. Please, Please, Please use the data in the data warehouse and you will find that you will save a lot of time on your data science project. In addition to the time saved you will have a lot more (possibly years of) data to work with. I always try to work with data warehouse data. When I do I can go back 5 years and build predictive models from back then. I can then roll these through various time periods and can easily measure how good the level of predictive I'm getting. I also get to see if there are any changes in the data and how they affect the models. Plus I also get to see how the various algorithms and their associated models change and evolve over time. This allows me to demonstrate to the customer how the use of data science and predictive models works with their data over the past 5 years. This build up confidence with the customer on what is being done and what can be achieved. In one case I was able to demonstrate that if they implemented my solution 5 years ago, they would have save $40+M in that time period. If I didn't use the data warehouse I wouldn't have been able to prove this. Needless to say the customer was very happy.
  • Make up of team is wrong: You don't need a team of PhDs: There has been lots written about what the make up of skills what your data science team should be. Back a few years ago all the talk was that you need to have people with PhDs maths, stats or related states. Plus all you needed to do was to hire one of these. We all know that this is not true but was part of the rubbish that people were talking about. We all know that you really need a team of people and perhaps you already have some of these people already employed in your company already. You have database people, you have ETL people, you have data integration people, you have data analysts, you have project managers, you have business analysts, you have domain experts, etc. How many of those people have PhDs or require a PhD to do their job. But perhaps you don't have people with the skills of applying advanced analytic techniques to your data and business problems. Perhaps it is these people who you really need the most. Do these people really need to have a PhD? No they don't. You need someone who knows and understands the various techniques and most importantly how to use these to solve business problems. All too often people try to show off about using a particular technique or parameter setting, or a particular formula, or graphic technique, or using a certain language over another, or what library or package is the best. Don't engage in this. Look for people that can apply the correct technique or combination of techniques to your business problems. But despite what I said in the first two point, as your data management requirements grow you are going to need some addition people with some big data technologies.
  • Communication: being able to explain what data science can do, what it is producing and relating that back to the business. Being able to work with the management team, end users and all involved to show and explain what and how the data science project can do to support their work. Most technical people are not good at this. Bus some people are and these are a very valuable resource as part of your data science team or are keen supporter of what data science can do and how it can be used to help the business developed new and interesting actionable insights.
  • The output is not a report => You need to operationise/productionalise the data science project: See the point above on productionalising your data science work. The outputs should not be a report or a list of some form. With proper planning data science can become a central to all the operational systems in your company. They can help you make better and quicker decisions on how you interact with your customers, improve the efficiencies of your processes, etc. The list goes on and on. All data science projects are cyclical in nature. For example you developer a churn prediction system. You use this to interact with your customers. You are trying to change or alter their behaviour and this in turn changes them as a customer. This in turn affect the churn prediction system. It will no longer be as effective. So you will need to update it on a semi-regular basis. This could be every 3, 4, 6, or 12 months. It all depends. You can build in checks into your productionalised data science projects to detect when the predictive models need updating. This in turn helps your data science team to be more productive, with quicker turn around times of each iteration. Also with each iteration you can look to see if new data is available for you to include and use. Maybe at this point some of your big data sources are coming online with some useful data.

So when looking to start a Data Science project it is important to know a few things before you start. The following attempts to use the 5 W's to try explains these.

  • what you are doing
  • why you are doing it
  • who it is for and what they will gain from it
  • where will it be used within your applications/processes
  • when you are going to commence the project and how it will fit into strategic goals of your organisation

There has been plenty written about what magic Data Science projects will produce and bring to your organisation. You need to be careful of people who only talk about the magic. You also need to understand that it may not work or deliver what you are lead to believe. In all the projects I've worked on we have had some amazing results. But in one or two projects we have had results that where only a percentage or two better than what they are already doing.

Perhaps I need to write another blog post on 'Why Data Science projects succeed', and this will only be based on what I've experienced (in the real-world).

Like I said at the beginning, this is not an exhaustive list. There are many more and I'm sure you will have a few of your own. These are the typical reasons that I've come across in my 20 years of doing these kind of projects and long before the term data science existed.