Monday, September 12, 2016

My 3rd Book is now officially released

Today, 12th September 2016, is the official release date of my 3rd book.

The title of the book is 'Oracle R Enterprise'. Make sure to check it out on Amazon.

It has been a busy 17 months. As you may have noticed, I had another book released a few weeks ago. Check it out here.

Yes, I was working on two books at the same time.

Yes, that was a lot of work, and looking back on it, it was a lot of fun too.

This new book (Oracle R Enterprise) is a good companion for my first book (Predictive Analytics using Oracle Data Miner), as I now have a book for each of the components of the Oracle Advanced Analytics option.

NewImage NewImage

Here is what is on the back cover of the book.

"Effectively manage your enterprise’s big data and keep complex processes running smoothly using the hands-on information contained in this Oracle Press guide. Oracle R Enterprise: Harnessing the Power of R in Oracle Database shows, step-by-step, how to create and execute large-scale predictive analytics and maintain superior performance. Discover how to explore and prepare your data, accurately model business processes, generate sophisticated graphics, and write and deploy powerful scripts. You will also find out how to effectively incorporate Oracle R Enterprise features in APEX applications, OBIEE dashboards, and Apache Hadoop systems. Learn to:

  • Install, configure, and administer Oracle R Enterprise
  • Establish connections and move data to the database
  • Create Oracle R Enterprise packages and functions
  • Use the R language to work with data in Oracle Database
  • Build models using ODM, ORE, and other algorithms
  • Develop and deploy R scripts and use the R script repository
  • Execute embedded R scripts and employ ORE SQL API functions
  • Map and manipulate data using Oracle R Advanced Analytics for Hadoop
  • Use ORE in Oracle Data Miner, OBIEE, and other applications ... "

This book is ideally suited to people who are starting out with Oracle R Enterprise (ORE), or who have some experience with it, and want to see what they can do with it and how it can be used with other products like APEX, OBIEE, Hadoop and Spark. Yes, I touch on these in the book. This book may also be of interest to those who are working with the products I've just listed and want to see how to use ORE.

If you are at Oracle Open World (OOW) next week make sure to check out the book in the Oracle Book Store, and if you buy a copy try to track me down to get me to sign it. The best way to do this is to contact me on Twitter, leave a message at the Oracle Press stand, or you will find me hanging out at the OTN Lounge.

A special thanks to my technical editor, Mark Hornick, who is a Director of Oracle Advanced Analytics Product Management, for Oracle's R Technologies.

Here are quotes from some people about the book.

The book ‘Oracle R Enterprise’, written by Brendan Tierney, is a valuable resource for any data scientist who wants to use the R language with the Oracle Database. It demonstrates very well the many features of Oracle R Enterprise, from performing simple analytics to utilising the many performance features of the Oracle Database, allowing you to work with all your datasets - Big or small. Additionally the book demonstrates how you can use the power of the R language with the SQL language as well as with other Oracle products including APEX and OBIEE, as well as Hadoop and Spark.

- John Donnelly - Regional Director, Oracle Ireland

The new book by Brendan Tierney, Oracle ACE Director, on Oracle R Enterprise details how users can gain maximal value out of the Oracle Database’s tight integration with the popular open source R statistical programming language. The author guides the R community into how they can, through the ease and familiarity of R, tap into the power of the Oracle Database Enterprise Edition with its Oracle Advanced Analytics Option or the Oracle Database Cloud Service. Brendan, an expert in this field, clearly articulates how to get quickly started and provides extensive “how to” examples and R scripts. Readers of the book can learn how they can access data directly in the Database, eliminate data movement while exploiting the openness and flexibility of R. Readers can then tap into the scalability and security of SQL of the Oracle Database and leverage Oracle’s proprietary, parallelized in-database machine learning algorithms and Oracle R Enterprise’s R “push down” to SQL functions. Read this book and learn how to leverage R and reduce model development and enterprise model deployment from days/weeks to minutes/hours!

-Charlie Berger

Sr. Director Product Management, Oracle Advanced Analytics and Machine Learning

"Brendan Tierney conveys very clearly all the aspects required for a successful Data Scientist that wants to work with large Databases and large Big Data clusters. It contains a great articulation of all aspects related to building and deploying Machine Learning algorithms in an Oracle Database environment with an overview on the algorithms on Hadoop clusters, as well as the integration with Business Intelligence dashboards and Applications. This is an essential reference for anyone in the Data Science field today working with Oracle Databases.

Marcos Arancibia, Product Manager, Oracle Data Science.

Tuesday, September 6, 2016

Change the size of ORE PNG graphics using in-database R functions

In a previous blog post I showed you how to create and display a ggplot2 R graphic using SQL. Make sure to check it out before reading the rest of this blog post.
In that post I mentioned that the PNG graphic returned by the embedded R execution SQL statement was not the same as the one produced when you create the graphic in an R session.
Here is the same ggplot2 graphic. The first one is what is produced in an R session and the second is what is produced by the SQL query and the embedded R execution in Oracle.
NewImage NewImage
As you can see the second image (produced using the embedded R execution) gives a very square image.
The reason for this is that Oracle R Enterprise (ORE) creates the graphic image in PNG format, and the default size for this is 480 x 480 pixels. You will find this information when you go digging in the R documentation and not in the Oracle documentation.
So, how can I get my ORE produced graphic to appear like what is produced in R?
What you need to do is to change the height and width of the PNG image produced by ORE. You can do this by passing parameters in the SQL statement used to call the user defined R function, that in turn produces the ggplot2 image.
In my previous post, I gave the SQL statement to call and produce the graphic (shown above). One of the parameters to the rqTableEval function was set to null. This was because we didn't have any parameters to pass, apart from the data set.
We can replace this null with any parameters we want to pass to the user defined R function (demo_ggpplot). To pass the parameters we need to define them using a SELECT statement.
cursor(select 500 as "ore.png.height", 850 as "ore.png.width" from dual),
The full SELECT statement now becomes
select *
from table(rqTableEval( cursor(select * from claims),
                        cursor(select 500 as "ore.png.height", 850 as "ore.png.width" from dual),
                        'PNG',
                        'demo_ggpplot'));
When you view the graphic in SQL Developer, you will get something that looks a bit more like what you would expect or want to see.
NewImage
For each graphic image you want to produce using ORE you will need to figure out what the best PNG height and width settings are. It also depends on what tool or application you are going to use to display the images (e.g. APEX).
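If you want to see the effect of the height and width settings without going near the database, the following is a minimal sketch using the standard png() device in a plain R session. The file names and the plotted data are made up for illustration, and it assumes your R build has PNG support. It writes the same plot twice, once with the default size and once with the 500 high by 850 wide settings used above, and then reads the dimensions back from each file.

```r
# Sketch only: file names and plot data are invented for illustration.
default_file <- file.path(tempdir(), "default_size.png")
wide_file    <- file.path(tempdir(), "wide_size.png")

# png() defaults to width = 480 and height = 480 (pixels)
png(default_file)
plot(1:10, main = "default size")
invisible(dev.off())

# Same plot with the height and width given explicitly
png(wide_file, width = 850, height = 500)
plot(1:10, main = "850 wide x 500 high")
invisible(dev.off())

# Read width and height back from the PNG header.
# Layout: 8 byte signature, 4 byte chunk length, 4 byte "IHDR",
# then width and height as 4 byte big-endian integers.
png_size <- function(path) {
  con <- file(path, "rb")
  on.exit(close(con))
  readBin(con, "raw", n = 16)  # skip signature and chunk header
  dims <- readBin(con, "integer", n = 2, size = 4, endian = "big")
  c(width = dims[1], height = dims[2])
}

png_size(default_file)
png_size(wide_file)
```

Trying a few sizes this way is a quick way to settle on values before putting them into the SQL statement.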

Thursday, September 1, 2016

How to Display a BLOB image in an APEX Report

Do you want to display an image on a report in APEX?

Is the image stored as a BLOB data type in your schema, or is the BLOB returned by a function?

If so, then displaying the BLOB is not a simple or straightforward task.

Actually it is a simple and straightforward task, as long as you know "the trick" you need to define in your APEX report.

The following steps outline what you need to do to create a report with BLOB images. Most of these are the standard steps, except for Step 4. That is the important one.

1. Create the Report using the APEX wizard

Create a new report. In my example here I'm going to create a classic report.

NewImage Enter a title for the report, and accept the default settings NewImage

Create a new navigation menu entry

NewImage

2. Define the Table or Query for the Report

Select the table or view that contains the data or define the SQL Query to return the results. It might be best to select this latter option as it will make things clearer and easier to change in Step 4.

NewImage

Click next on the next 2 screens of the wizard and then click the Create button.

3. Set the BLOB attribute settings

When you run the report you will get something like the following being displayed. As you can see it clearly does not display the BLOB image.

NewImage

Next we need to set up the BLOB attribute settings, as shown in the following.

Screenshot 2016 08 26 13 59 30

When we run the report now, we get an error message.

NewImage

4. Change the report query to return the length of the BLOB

Now this is the magic bit.

To get the image to display you need to go back to the Report level and change the query in the SQL Query box so that the BLOB attribute is wrapped in dbms_lob.getlength(), the function that returns the length of the image in the BLOB attribute (in my example this attribute is called IMAGE).

select ID,
       dbms_lob.getlength(image)  image
from V_DOCUMENT_TM_IMAGE
Screenshot 2016 08 26 14 07 59

5. The BLOB object now appears :-)

That's it. Now when you run your report the image will be displayed.

NewImage

So now you know how to display a BLOB image in an APEX Report.

(Thanks to Roel and Joel for the help in working out how to do this)

Wednesday, August 24, 2016

How to get ORE to work with APEX

This blog post will bring you through the steps of how to get Oracle R Enterprise (ORE) to work with APEX.

The reason for this blog post is that since ORE 1.4 the security model has changed for how you access and run in-database user defined R scripts using the ORE SQL API functions.

I have a series of blog posts going out on using Oracle Text, Oracle R Enterprise and Oracle Data Mining. It was during one of these posts I wanted to show how easy it was to display an R chart using ORE in APEX. Up to now my APEX environment consisted of APEX 4 and ORE 1.3. Everything worked, nice and easy. But in my new APEX environment (APEX 5 and ORE 1.5), it didn't work. That is, calling an in-database user defined R script using the SQL API functions didn't work. Here is the error message that is displayed.

NewImage

So something extra was needed when using ORE 1.5. The security model around the use of in-database user defined R scripts has changed. Extra functions are now available to allow you to control who can run these scripts. For example we have an ore.grant function where you can grant another user the privilege to run the script.

But the problem was that, in APEX, the application was defined on the same schema that the R script was created in (this was the RQUSER schema). When I connected to the RQUSER schema using ORE and SQL, I was able to see and run this R script (see my previous blog post for these details). But when I was in APEX I wasn't able to see the R script. For example, when using the SQL Workshop in APEX, I just couldn't see the R script.

NewImage

Something strange is going on. It turns out that the view definitions for the in-database ORE scripts are defined with

owner=SYS_CONTEXT('USERENV', 'SESSION_USER');

(Thanks to the Oracle ORE team and the Oracle APEX team for their help in working out what needed to be done)

This means when I'm connected to APEX, using my schema (RQUSER), I'm not able to see any of my ORE objects.

How do you overcome this problem ?

To fix this problem, I needed to grant the APEX_PUBLIC_USER access to my ORE script.

ore.grant(name = "prepare_tm_data_2", type = "rqscript", user = "APEX_PUBLIC_USER")

When I query the ALL_RQ_SCRIPTS view again, using the APEX SQL Workshop, I now get the following.

NewImage

Great. Now I can see the ORE script in my schema.

When I run my APEX application I now get the graphic produced by R, running on my DB server and delivered to my APEX application using SQL (via a BLOB object), displayed on my screen.

NewImage

Monday, August 22, 2016

My 2nd Book is now available: Real World SQL and PL/SQL

It has been a busy 12 months. In addition to the day jobs, I've also been busy writing. (More news on this in a couple of weeks!)

Today is a major milestone as my second book is officially released and available in print and ebook formats.

The title of the book is 'Real World SQL and PL/SQL: Advice from the Experts'. Check it out on Amazon.

Now that sounds like a very fancy title, but it isn't meant to be. This book is written by 5 people (including me), who are all Oracle ACE Directors, who each have 20+ years of experience of working with the Oracle Database, and who all love sharing our knowledge. My co-authors are Arup Nanda, Heli Helskyaho, Martin Widlake and Alex Nuijten. It was a pleasure working with you.

I haven't seen a physical copy of the book yet !!! Yes the book is released and I haven't held it in my hands. Although I have seen pictures of it that other people have taken. There was a delay in sending out the author copies of the book, but as of this morning my books are sitting in Stansted Airport and should be making their way to Ireland today. So fingers crossed I'll have them tomorrow. I'll update this blog post with a picture when I have them. UPDATE: They finally arrived at 13:25 on the 22nd August.

NewImage

In addition to the 5 authors we also had Chet Justice (Oraclenerd), an Oracle ACE Director, as the technical editor. We also had Tim Hall, Oracle ACE Director, write a foreword for us.

NewImage NewImage

To give you some background to the book and why we wrote it, here is an extract from the start of the book, where I describe how the idea for this book came about and the aim of the book.

"While attempting to give you an idea into our original thinking behind the need for this book and why we wanted to write it, the words of Rod Stewart’s song ‘Sailing’ keep popping into my mind. These are ‘We are sailing, we are sailing, home again 'cross the sea’. This is because the idea for this book was born on a boat. Some call it a ship. Some call it a cruise ship. Whatever you want to call it, this book was born at the OUG Norway conference in March 2015. What makes the OUG Norway conference special is that it is held on a cruise ship that goes between Oslo in Norway to Kiel in Germany and back again. This means as a speaker and conference attendee you are ‘trapped’ on the cruise ship for 2 days filled with presentations, workshops, discussions and idea sharing for the Oracle community.

It was during this conference that Heli and Brendan got talking about their books. Heli had just published her Oracle SQL Developer Data Modeler book and Brendan had published his book on Oracle Data Miner the previous year. Whilst they were discussing their experiences of writing and sharing their knowledge and how much they enjoyed this, they both recognized that there are a lot of books for the people starting out in their Oracle career and then there are lots of books on specialized topics. What was missing were books that covered the middle group. A question they kept on asking but struggled to answer was, ‘after reading the introductory books, what book would they read next before getting onto the specialized books?’ This was particularly true of SQL and PL/SQL.

They also felt that something that was missing from many books, especially introductory ones, was the “Why and How” of doing things in certain ways that comes from experience. It is all well and good knowing the syntax of commands and the options, but what takes people from understanding a language to being productive in using it is that real-world derived knowledge that comes from using it for real tasks. It would be great to share some of that experience.

Then over breakfast on the final day of the OUG Norway conference, as the cruise ship was sailing through the fjord and around the islands that lead back to Oslo, Heli and Brendan finally agreed that this book should happen. They then listed the type of content they thought would be in such a book and who are the recognized experts (or super heroes) for these topics. This list of experts was very easy to come up with and the writing team of Oracle ACE Directors was formed, consisting of Arup Nanda, Martin Widlake and Alex Nuijten, along with Heli Helskyaho and Brendan Tierney. The author team then got to work defining the chapters and their contents. Using their combined 120+ years of SQL and PL/SQL experience they finally came up with scope and content for the book at Oracle Open World.

..."

As you can see, this book was 17 months in the making. This consisted of 4 months of proposal writing, research and refinement, 8 months of writing, 3 months of editing and 2 months for production of the book.

Yes it takes a lot of time and commitment. We all finished our last tasks and final edits on the book back in early June. Since then the book has been sent for printing, converted into an ebook, books shipped to Oracle Press warehouse, then shipped to Amazon and other book sellers. Today it is finally available officially.

(when I say officially, it seems that Amazon has shipped some pre-ordered books a week ago)

If you are at Oracle Open World (OOW) in September make sure to check out the book in the Oracle Book Store, and if you buy a copy try to track us down to get us to sign it. The best way to do this is to contact us on Twitter, leave a message at the Oracle Press stand, or you will find us hanging out at the OTN Lounge.

Wednesday, August 17, 2016

Oracle Text, Oracle R Enterprise and Oracle Data Mining - Part 3

This is the third blog post of a series on using Oracle Text, Oracle R Enterprise and Oracle Data Mining. Check out the first and second blog posts of the series, as the data used in this blog post was extracted, processed and stored in a database table.

This blog post is divided into 3 parts. The first part builds on what was covered in the previous blog post and expands the in-database ORE R script to include more data processing. The second part looks at how you can use SQL to call our in-database ORE R script. The third part shows how to include it in our custom applications, for example using APEX.

Part 1 - Expanding our in-database ORE R script for Text Mining

In my previous blog post we created an ORE user defined R script, that is stored in the database, and this script was used to perform text mining and to create a word cloud. But the data/text to be mined was processed beforehand and passed into this procedure.

But what if we wanted to have a scenario where we just wanted to say, here is the table that contains the data. Go ahead and process it. To do this we need to expand our user defined R script to include the loop to merge the webpage text into one variable. The following is a new version of our ORE user defined R script.

> ore.scriptCreate("prepare_tm_data_2", function (local_data) { 
  library(tm)
  library(SnowballC)
  library(wordcloud)
  
  tm_data <-""
  for(i in 1:nrow(local_data)) {
    tm_data <- paste(tm_data, local_data[i,]$DOC_TEXT, sep=" ")
  }
    
  txt_corpus <- Corpus (VectorSource (tm_data))
  
  # data clean up
  tm_map <- tm_map (txt_corpus, stripWhitespace) # remove white space
  tm_map <- tm_map (tm_map, removePunctuation) # remove punctuations
  tm_map <- tm_map (tm_map, removeNumbers) # to remove numbers
  tm_map <- tm_map (tm_map, removeWords, stopwords("english")) # to remove stop words
  tm_map <- tm_map (tm_map, removeWords, c("work", "use", "java", "new", "support"))

  # prepare matrix of words and frequency counts
  Matrix <- TermDocumentMatrix(tm_map) # terms in rows
  matrix_c <- as.matrix (Matrix)
  freq <- sort (rowSums (matrix_c)) # frequency data
  
  res <- data.frame(words=names(freq), freq)
  wordcloud (res$words, res$freq, max.words=100, min.freq=3, scale=c(7,.5), random.order=FALSE, colors=brewer.pal(8, "Dark2"))
} ) 

To call this R script using the embedded R execution we can use the ore.tableApply function. The parameter to our new R script will now be an ORE data frame. This can be a table in the database or we can create a subset of a table and pass it as the parameter. This means all the data processing will occur on the Oracle Database server. No data is passed to the client and no processing is performed on the client. The only data that is passed back to the client is the result from the function, and that is the word cloud image.

> res <- ore.tableApply(MY_DOCUMENTS, FUN.NAME="prepare_tm_data_2")
> res

Part 2 - Using SQL to perform R Text Mining

Another way you can call this ORE user defined R function is using SQL. Yes, we can use SQL to call R code and to produce an R graphic. When doing this the R graphic will be returned as a BLOB. That makes it easy to view and to include in your applications, such as APEX.

To call our ORE user defined R function, we can use the rqTableEval SQL function. You only really need to set two of the parameters to this function. The first parameter is a SELECT statement that defines the data set to be passed to the function. This is similar to what I showed above using the ore.tableApply R function, except we have easier control over what records to pass in as the data set. The fourth parameter gives the name of the ORE user defined R script.

select *
from table(rqTableEval( cursor(select * from MY_DOCUMENTS),
                        null,
                        'PNG',
                        'prepare_tm_data_2'));

This is the image that is produced by this SQL statement and viewed in SQL Developer.

NewImage

Part 3 - Adding our R Text Mining to APEX

Adding the SQL to call an ORE user defined script is very simple in APEX. You can create a form or a report based on a query, and this query can be the same query that is given above.

Something that I like to do is to create a view for the ORE SELECT statement. This gives me some flexibility with some potential future modifications. This could be as simple as just changing the name of the script. Also if I discover a new graphic that I want to use, all I need to do is to change the R code in my user defined R script and it will automatically be picked up and displayed in APEX. See the images below.

WARNING: Yes I do have a slight warning. Since the introduction of ORE 1.4 there is a slightly different security model around the use of user defined R scripts. Instead of going into the details of this in this blog post, I will have a separate blog post that describes the behaviour and what you need to do to allow APEX to use ORE and to call the user defined R scripts in your schema. So look out for this blog post coming really soon.

NewImage

In this blog post I showed you how to use Oracle R Enterprise and the embedded R execution features of ORE to take the text from the webpages and create a word cloud. This is a useful tool to be able to see visually which words stand out most on your webpage and whether the correct message is being put across to your customers.

Monday, August 8, 2016

Oracle Text, Oracle R Enterprise and Oracle Data Mining - Part 2

This is the second blog post of a series on using Oracle Text, Oracle R Enterprise and Oracle Data Mining. Check out the first blog post of the series, as the data used in this blog post was extracted, processed and stored in a database table.

In this blog post I will show you how to use Oracle R Enterprise and the embedded R execution features of ORE to take the text from the webpages and create a word cloud. This is a useful tool to be able to see visually which words stand out most on your webpage and whether the correct message is being put across to your customers.

Prerequisites: You will need to install and load the following R packages in your R environment: 'tm', 'wordcloud' and 'SnowballC'. These are required to process the following R code segments.

install.packages (c( "tm", "wordcloud", "SnowballC"))
library (tm)
library (wordcloud)
library (SnowballC)

Select data from table and prepare: We need to select the data from the table in our schema and to merge it into one variable.

local_data <- ore.pull(MY_DOCUMENTS)

tm_data <-""
for(i in 1:nrow(local_data)) {
  tm_data <- paste(tm_data, local_data[i,]$DOC_TEXT, sep=" ")
}
tm_data
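As an aside, the merge loop above can also be written as a single paste() call with the collapse argument. The following is a small sketch rather than part of the original code. It uses a made-up data frame as a stand-in for the result of ore.pull(MY_DOCUMENTS), and assumes DOC_TEXT comes back as an ordinary character column.

```r
# Stand-in for ore.pull(MY_DOCUMENTS); the rows are invented for illustration.
local_data <- data.frame(
  DOC_ID   = 1:3,
  DOC_TEXT = c("Oracle R Enterprise", "text mining in R", "word cloud example"),
  stringsAsFactors = FALSE
)

# Loop version, as used in this post
tm_loop <- ""
for (i in seq_len(nrow(local_data))) {
  tm_loop <- paste(tm_loop, local_data[i, ]$DOC_TEXT, sep = " ")
}

# Vectorised version: one call, and no leading space
tm_vec <- paste(local_data$DOC_TEXT, collapse = " ")

# The two results differ only by the leading space that the loop adds
identical(trimws(tm_loop), tm_vec)
```

For the word cloud the leading space makes no difference, as the white space is stripped during the clean up step anyway.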

Create function to perform Text Mining: In my previous blog post on creating a word cloud I gave the R code. In order to allow this R code to be run on the database server (using the embedded R execution of ORE) we need to package this text mining R code up into an ORE user defined R script. This is stored in the database.

ore.scriptDrop("prepare_tm_data")
ore.scriptCreate("prepare_tm_data", function (tm_data) { 
  library(tm)
  library(SnowballC)
  library(wordcloud)
  
  txt_corpus <- Corpus (VectorSource (tm_data))
  
  # data clean up
  tm_map <- tm_map (txt_corpus, stripWhitespace) # remove white space
  tm_map <- tm_map (tm_map, removePunctuation) # remove punctuations
  tm_map <- tm_map (tm_map, removeNumbers) # to remove numbers
  tm_map <- tm_map (tm_map, removeWords, stopwords("english")) # to remove stop words
  tm_map <- tm_map (tm_map, removeWords, c("work", "use", "java", "new", "support"))
  
  # prepare matrix of words and frequency counts
  Matrix <- TermDocumentMatrix(tm_map) # terms in rows
  matrix_c <- as.matrix (Matrix)
  freq <- sort (rowSums (matrix_c)) # frequency data
  
  res <- data.frame(words=names(freq), freq)
  wordcloud (res$words, res$freq, max.words=100, min.freq=3, scale=c(7,.5), random.order=FALSE, colors=brewer.pal(8, "Dark2"))
} ) 
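To get a feel for what the "matrix of words and frequency counts" step in the script produces, here is a rough sketch in base R only. It is not what tm does internally. The tm package has its own tokenisation rules and defaults, so the counts need not match exactly. The sample text and the tiny stop word list are made up for illustration.

```r
# Hypothetical input text, standing in for tm_data
txt <- "Oracle R and Oracle Text. R with Oracle Data Mining, text and data!"

# Rough clean up: lower case, then replace punctuation and digits with spaces
clean <- tolower(txt)
clean <- gsub("[[:punct:][:digit:]]", " ", clean)

# Split on white space, then drop empty tokens and the (made up) stop words
words <- strsplit(clean, "[[:space:]]+")[[1]]
words <- words[nzchar(words) & !(words %in% c("and", "with"))]

# Count each word and sort ascending, like sort(rowSums(...)) in the script
freq <- sort(table(words))

# Same shape as the 'res' data frame that is passed to wordcloud()
res <- data.frame(words = names(freq), freq = as.vector(freq))
res
```

The res data frame has one row per distinct word with its count, which is the input the wordcloud function needs.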

Before we can run this user defined R script, we need to ensure that we have the 'tm', 'SnowballC' and 'wordcloud' R packages installed on the Oracle Database server. On the Oracle Database server you need to run ORE.

> library(ORE)

Then run the following command to install these R packages

> install.packages(c('tm','wordcloud', 'SnowballC'))

Run the function on the DB Server: You are now ready to run the function. In an earlier step we had gathered the data. Now we can pass this data to the in-database R script.

> res <- ore.doEval(FUN.NAME="prepare_tm_data", tm_data=tm_data)

The ore.doEval function is a general purpose ORE function. In this case we pass it two parameters. The first parameter is the name of the user defined R script stored in the database, and the second parameter is the data. The function returns an ORE object that contains the word cloud graphic.

Display the results: You can very easily display the results.

> res

This gives us the following graphic.

NewImage

In my next blog post, of this series, I will show you how you can use the function created above and some other bits and pieces, using some other features of ORE and also in SQL.

Monday, August 1, 2016

Why Data Science projects fail

Over the past few weeks or months (maybe even years) I've had several conversations with various people about why Data Science (or whatever you want to call it) projects fail or never really get started.

Before we go any further perhaps we need to define what 'fail' means in these conversations. Typically fail means that the project doesn't deliver what was hoped for, it got bogged down in some technical or political issues, it did not deliver useful results, and more typically it is run once (or a couple of times) and never run again. You get the idea.

The following points outline some of the most typical reasons why Data Science projects fail. This is not an exhaustive list.

  • We need Big Data: It seems like everything that you read says you need Big Data for your data science project. Firstly, what big data means to one person or company can be very different to what it means for another person or company. One possible definition is that it might include all the various social media and log type of data. If you don't have all of this data then no big deal. You can still do data science projects. You have lots and lots of other data: the data that you generate every day for the general running of your business. You can use that. If you have some history of this data going back over a few months or a couple of years then even better (and most of you will say 'Yes, I have that data'). Work with the data that you already have, that you already understand and that you are already using, and use that data to see if you can gain extra insights that will have some value to your business (it needs to have value, otherwise what's the point). Some people call this everyday type of data 'Small Data'. Big Data and Small Data are really bad terms. It is just Data. Let us work with the data we already have and incrementally add in newer data (from your typical 'Big Data' sources) with each iteration of the data science project.
  • We need Big Technology: This kind of follows on from the mistake of believing we need Big Data to do our data science projects. Most companies will be working with the data that they already have, and will have various technology solutions in place to manage this data. So do we really need Big Data technology solutions, like Hadoop and everything that goes along with it, for our data science projects? The simple answer is 'No, you don't'. Now don't get me wrong. These technologies are important when it comes to managing Big Data, but you don't need them to perform your data science projects. Many, many companies both large and small are performing data science projects using their existing technology solutions and have perhaps just added some analytics tools to support their project using the data that they are already managing. Most companies have databases to store and manage their data. You can use your analytics software to work with the data in these databases to analyse, model and predict. Any results that are produced can be easily integrated back into these databases and the results can then be used by various groups within your organisation. Use the technologies you have, that you understand, that you can use to the max, supplemented with some newer analytics software that works with all of these for your data science projects. (An example: one project I've worked on involved a retail organisation in one of the largest countries in the world. I was working with 3 years of sales data. Is this big data? I was able to use my laptop to perform advanced analytics on all their data.)
  • Old School Data Science: 'Give me all your data, I'll analyse it and tell you what is happening.' Unfortunately phrases like this are still very common. They were common, and already considered out of date, 20 years ago when I worked on my first data science project (it wasn't called data science back then). If you do come across someone saying this to you, I would question their ability to deliver anything. If it was me, I would just say 'No thank you', and move onto someone else. You as a company will already know a lot of what is happening in your business, what data is currently being used for and any potential areas where you know advanced analytics and data science can help. You will know what the focus areas should be and how good or not your data is. You need someone who can help you to identify the key areas and what data science techniques can be used to help you to gain (a possible) greater insight into what is happening.
  • No clear objective or business question/problem and no measurable outcomes: In a way this is very similar to the previous point. You don't get into your car each morning and start driving, with the eventual hope that you arrive at work on time. No, you plan what you want to do (get to work), how you are going to get there (using your car) and when you want to get there by (your work start time). Using these you then plan out the best route to take to get to work, in the most efficient way you can, using your knowledge and experience of the road network, supplemented by traffic reports and making adjustments as necessary, to ensure that you get to work on time. This is exactly the same for data science projects. You need a good clear objective, that can be broken down into distinct problems, that will each require a specific set of advanced analytics to generate a measurable outcome. The measurable outcomes should allow you to measure if the advanced analytics actually gives you a valuable return. For example, if you predict that you can increase sales by 3%, this sounds good. But if the cost of implementing the solution is eating up all the profit generated then you might decide that this solution is not worth continuing with.
  • Not productionalising the outcomes: This point follows on from the previous two points. A lot of what you read, and a lot of what I've seen, is that Data Science is about discovering some new (and actionable) insights. But that is where the discussion ends, as if a report is produced that makes a recommendation or gives a list of customers to target, and that is it. What happens to your data science project then? It either gets canned or you might be told that we will come back to it in a few months (and possibly a year) from now. This is not what you really want. Why? Because when you finally remember to come back to review the project and to do another run, the people who were involved in the original project have moved on or are not available. It then becomes too difficult to start over again and that is when the data science project fails. I've used the word 'productionalising' (is that a real word?). What I mean by that is that we need to take our data science project and build it into our everyday applications and processes. For example, if we build a customer risk model for loans in a bank, it should be built into the application that captures the loan application from the customer. That way, when the bank employee is entering the loan application they can be given live feedback. They can then use this live feedback to address any issues with the customer. What can be typical is that this is discovered some weeks later when the loan has already been approved. We need to automate the use of our data science work. Another example is fraud detection. I know of several companies who have fraud detection measures in place. It can take them 4-6 weeks to identify a potential fraud case that needs investigation. Using data science and building this into their transaction monitoring systems they can now detect potential fraud cases in near real time (no big data architectures being used). By automating it we get a quicker response and can take actions at the right time.
The quicker we can react the more money we can make or save. This is an area that a lot of companies are now focusing on when they are looking at data science projects, as this is the way that they can get a quicker return on their investment in their data science projects.
  • Very little senior management support: I think most data science projects are supported by senior management to some extent. The more successful the data science project, the more involved the senior managers are and the more they understand of what these projects can potentially deliver. But with the ever changing and evolving world of IT most of the senior managers are very focused on the here and now, keeping the lights on, making sure their day-to-day applications are up and running, the backups and recovery processes are in place (and tested), and future proofing their applications. It is well known that very little time and resources (human and money) are available for adding new functionality. Most of what I've mentioned is very IT related and perhaps the IT managers are not the most suitable people to sponsor data science projects. I've already given some of the reasons, but sometimes IT can get a bit caught up with the technology and trying to use the newest thing. Some of the most successful projects I've worked on have had senior managers from a business function as sponsors. They will not be focused on the technology but on the processes around the data science project and how the outputs of the data science project can be used. The more focused they are on this the more successful the project will be. They will then act as the key to informing (and selling) the rest of the business on the success of the project. This in turn creates more and more data science projects and will keep you busy for a long time to come.
  • Ticking the box: Unfortunately I've seen this in way too many companies. Board level or the senior management team have heard about data science and all the magic that it can produce. The message is then passed down through the organisation that we need to be doing more and more of this. A business unit is chosen for the pilot project. The pilot is completed, successfully, and the good news message is fed back up the ladder. But that is when the enthusiasm ends. We have done a data science project, it was successful and now let's move on to the next thing. I've seen pilot or POC projects that have proven to potentially save $10+M a year, with a cost of $100K per year, being canned. Yes, I've been told this is fantastic, this is beyond our wildest dreams. Only for nothing else to happen.
  • The data is no good: You need data, you need historical data. The more you have the more useful it will be for the data science project. But what if the data is of poor quality? How can this happen? Well it can happen very frequently. You may have applications that are poorly designed, that have a very poor data model, the staff are not trained correctly to ensure that good data gets entered, etc. The list could go on and on. It is one thing for an application to capture data, but if that data cannot be used for any meaningful purpose then it has very little value. Some companies have people hired to constantly inspect the data, assess the quality of the data and feed back ideas on how to improve the quality of the data captured by the applications and also by the people inputting the data. Without good quality data there is very little a data science project can do to magically convert it into good quality data. I've been in the situation where >90% of the data was unusable. We gave them a list of the improvements they needed to make and told them to only come back to us when they had completed these and had at least 6 months of good quality data. We might be able to do something then. We never heard from them again. Also I get to talk to a lot of start ups who want to have data science built in from day one. These have very little 'real' data. Again I get to tell them to come back to me when they have 6 months of data.
  • Too much focus on descriptive analytics: Although descriptive analytics is an important step in the early stages of all data science projects, there is still a huge number of consulting and product companies who are promoting this as a data science project. Like I said, descriptive analytics is an important step, but it doesn't end there. It is just the beginning. When selecting a consulting or product company to partner with on your data science projects you need to ensure that they are offering more than just descriptive analytics. In a similar way to what I've mentioned in the points above, you need to look at how you can make use of these descriptive analytics and share them with the wider community in your company. But you also need to have some control over the proliferation of various visualisation tools. Descriptive analytics and visualisations are not data science or a major output of data science. They are only one part of a data science project, and far more valuable outputs can be achieved by using one or more of the advanced analytics methods that are available to you.
  • Ignoring your BI/DW: Unfortunately when it comes to a lot of data science projects you have two very different approaches to working with the data. One approach seems to be that we will look at your data that is available in the transactional databases (and other data sources), we will then look at how to integrate and clean this data before getting onto the fun stuff of exploring and then performing the advanced analytics. This approach completely ignores the BI team and any data warehouse that might exist. If a data warehouse already exists then it probably contains all or most of the data you are going to use. Therefore you can avoid all that time spent integrating and cleaning the data. The data warehouse will have this done for you. Plus the data warehouse will have a lot more data than the current transactional databases will contain. Please, Please, Please use the data in the data warehouse and you will find that you will save a lot of time on your data science project. In addition to the time saved you will have a lot more (possibly years of) data to work with. I always try to work with data warehouse data. When I do, I can go back 5 years and build predictive models from back then. I can then roll these through various time periods and can easily measure the level of predictive accuracy I'm getting. I also get to see if there are any changes in the data and how they affect the models. Plus I also get to see how the various algorithms and their associated models change and evolve over time. This allows me to demonstrate to the customer how the use of data science and predictive models works with their data over the past 5 years. This builds up confidence with the customer on what is being done and what can be achieved. In one case I was able to demonstrate that if they had implemented my solution 5 years ago, they would have saved $40+M in that time period. If I didn't use the data warehouse I wouldn't have been able to prove this.
Needless to say the customer was very happy.
  • Make up of team is wrong: You don't need a team of PhDs: There has been lots written about what the make up of skills in your data science team should be. Back a few years ago all the talk was that you need to have people with PhDs in maths, stats or related fields. Plus all you needed to do was to hire one of these. We all know that this is not true, but it was part of the rubbish that people were talking about. We all know that you really need a team of people, and perhaps you have some of these people already employed in your company. You have database people, you have ETL people, you have data integration people, you have data analysts, you have project managers, you have business analysts, you have domain experts, etc. How many of those people have PhDs or require a PhD to do their job? But perhaps you don't have people with the skills of applying advanced analytic techniques to your data and business problems. Perhaps it is these people who you really need the most. Do these people really need to have a PhD? No they don't. You need someone who knows and understands the various techniques and, most importantly, how to use these to solve business problems. All too often people try to show off about using a particular technique or parameter setting, or a particular formula, or graphic technique, or using a certain language over another, or what library or package is the best. Don't engage in this. Look for people that can apply the correct technique or combination of techniques to your business problems. But despite what I said in the first two points, as your data management requirements grow you are going to need some additional people with skills in some of the big data technologies.
  • Communication: Being able to explain what data science can do, what it is producing and relating that back to the business. Being able to work with the management team, end users and all involved to show and explain what the data science project can do, and how, to support their work. Most technical people are not good at this. But some people are, and these are a very valuable resource as part of your data science team, or as keen supporters of what data science can do and how it can be used to help the business develop new and interesting actionable insights.
  • The output is not a report => You need to operationalise/productionalise the data science project: See the point above on productionalising your data science work. The outputs should not be a report or a list of some form. With proper planning data science can become central to all the operational systems in your company. It can help you make better and quicker decisions on how you interact with your customers, improve the efficiencies of your processes, etc. The list goes on and on. All data science projects are cyclical in nature. For example, you develop a churn prediction system. You use this to interact with your customers. You are trying to change or alter their behaviour and this in turn changes them as a customer. This in turn affects the churn prediction system. It will no longer be as effective. So you will need to update it on a semi-regular basis. This could be every 3, 4, 6, or 12 months. It all depends. You can build checks into your productionalised data science projects to detect when the predictive models need updating. This in turn helps your data science team to be more productive, with quicker turn around times for each iteration. Also with each iteration you can look to see if new data is available for you to include and use. Maybe at this point some of your big data sources are coming online with some useful data.
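
The checks mentioned in the last point can be as simple as tracking how accurate the deployed model is over time. The following is only a minimal sketch. It assumes a hypothetical table called CHURN_SCORE_LOG that stores each prediction (PREDICTED_CHURN) alongside the outcome that was observed later (ACTUAL_CHURN). None of these names come from a real project.

```sql
-- Monthly accuracy of a deployed churn model.
-- CHURN_SCORE_LOG and its columns are hypothetical names for illustration.
select trunc(score_date, 'MM') as score_month,
       count(*)                as num_scored,
       round(avg(case when predicted_churn = actual_churn
                      then 1 else 0 end), 3) as accuracy
from   churn_score_log
where  actual_churn is not null      -- only rows where the outcome is known
group by trunc(score_date, 'MM')
order by score_month;
```

A steady drop in the accuracy column from month to month would be one signal that the model is due for a refresh.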

So when looking to start a Data Science project it is important to know a few things before you start. The following attempts to use the 5 W's to try to explain these.

  • what you are doing
  • why you are doing it
  • who it is for and what they will gain from it
  • where will it be used within your applications/processes
  • when you are going to commence the project and how it will fit into strategic goals of your organisation

There has been plenty written about what magic Data Science projects will produce and bring to your organisation. You need to be careful of people who only talk about the magic. You also need to understand that it may not work or deliver what you are led to believe. In all the projects I've worked on we have had some amazing results. But in one or two projects we have had results that were only a percentage or two better than what they were already doing.

Perhaps I need to write another blog post on 'Why Data Science projects succeed', and this will only be based on what I've experienced (in the real-world).

Like I said at the beginning, this is not an exhaustive list. There are many more and I'm sure you will have a few of your own. These are the typical reasons that I've come across in my 20 years of doing these kinds of projects, starting long before the term data science existed.


Friday, July 29, 2016

Checking out the Oracle Reserved Words using V$RESERVED_WORDS

When working with SQL or PL/SQL we all know there are some words we cannot use in our code or to label various parts of it. These languages have a number of reserved words that form the language.
Sometimes it can be a challenge to know what is or isn't a reserved word. Yes we can check the Oracle documentation for the SQL reserved words and the PL/SQL reserved words. There are other references and lists in the Oracle documentation listing the reserved and key words.
But we also have the concept of Key Words (as opposed to reserved words). In the SQL documentation these are not listed. In the PL/SQL documentation most are listed.
What is a Key Word in Oracle?
Oracle SQL keywords are not reserved. BUT Oracle uses them internally in specific ways. If you use these words as names for objects and object parts, then your SQL statements may be more difficult to read and may lead to unpredictable results.
But if we didn't have access to the documentation (or google), how can we find out what the key words are? You can use the data dictionary view called V$RESERVED_WORDS.
NewImage
But this view isn't available to everyone. So if you want to get your hands on it you will need the SYS user. Alternatively, if you are a DBA you could share this with all your developers.
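If you are the DBA, one way to share the view is to grant access on the underlying SYS view, since V$RESERVED_WORDS is a public synonym for SYS.V_$RESERVED_WORDS. This is only a sketch, and DEV_USER is a made-up schema name.

```sql
-- Run as SYS (or another suitably privileged user).
-- DEV_USER is a hypothetical schema name used for illustration.
grant select on sys.v_$reserved_words to dev_user;
```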
When we query this view we get 2,175 entries (for 12.1.0.2 Oracle Database).
NewImage
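
As a minimal sketch of the kind of queries you can run against this view (assuming you have been given access to it), the following counts the reserved and non-reserved entries and then looks up a single word. The RESERVED column is set to 'Y' for the words that are truly reserved. The word COMMENT is just an illustrative choice.

```sql
-- How many entries are reserved words and how many are key words?
select reserved, count(*) as num_words
from   v$reserved_words
group by reserved;

-- Check one particular word before using it as an identifier
select keyword, length, reserved
from   v$reserved_words
where  keyword = 'COMMENT';
```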

Tuesday, July 26, 2016

Oracle Text, Oracle R Enterprise and Oracle Data Mining - Part 1

A project that I've been working on for a while now involves the use of Oracle Text, Oracle R Enterprise and Oracle Data Mining. Oracle Text comes with your Oracle Database licence. Oracle R Enterprise and Oracle Data Mining are part of the Oracle Advanced Analytics (extra cost) option.

What I will be doing over the course of 4 or maybe 5 blog posts is showing how these products can work together to help you gain a greater insight into your data, where part of your data is made up of large text items like free format text, documents (in various forms e.g. html, xml, pdf, ms word), etc.

Unfortunately I cannot show you examples from the actual project I've been working on (and still am, from time to time). But what I can do is to show you how products and components can work together.

In this blog post I will just do some data setup. As with all project scenarios there can be many ways of performing the same tasks. Some might be better than others. But what I will be showing you is for demonstration purposes.

The scenario: The scenario for this blog post is that I want to extract text from some webpages and store them in a table in my schema. I then want to use Oracle Text to search the text from these webpages.

Schema setup: We need to create a table that will store the text from the webpages. We also want to create an Oracle Text index so that this text is searchable.

drop sequence my_doc_seq;
create sequence my_doc_seq;

drop table my_documents;

create table my_documents (
doc_pk number(10) primary key, 
doc_title varchar2(100), 
doc_extracted date, 
data_source varchar2(200), 
doc_text clob);

create index my_documents_ot_idx on my_documents(doc_text) 
indextype is CTXSYS.CONTEXT;

In the table we have a number of descriptive attributes and then a CLOB for storing the website text. We will only be storing the website text and not the html document (more on that later). In order to make the website text in the DOC_TEXT attribute searchable we need to create an Oracle Text index of type CONTEXT.

There are a few challenges with using this type of index. For example, when you insert a new record or update the DOC_TEXT attribute, the new values/text will not be reflected in the index instantly, as we are used to with traditional indexes. Instead you have to decide when you want the index to be updated. For example, if you would like the index to be updated after each commit then you can create the index using the following.

create index my_documents_ot_idx on my_documents(doc_text) 
indextype is CTXSYS.CONTEXT
parameters ('sync (on commit)');

Depending on the number of documents you have being committed to the DB, this might not be for you. You need to find the balance. Alternatively you could schedule the index to be updated by passing an interval to the 'sync' in the above command. Alternatively you might want to use DBMS_JOB to schedule the update.

To manually sync (or via DBMS_JOB) the index, assuming we used the first 'create index' statement, we would need to run the following.

EXEC CTX_DDL.SYNC_INDEX('my_documents_ot_idx');

This procedure just adds the new documents to the index. This can, over time, lead to some fragmentation of the index, and will require it to be re-organised on a semi-regular basis. Perhaps you can schedule this to happen every night, or once a week, or whatever makes sense to you.

BEGIN
  CTX_DDL.OPTIMIZE_INDEX('my_documents_ot_idx','FULL');
END;
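
The scheduled sync mentioned above can also be sketched with DBMS_SCHEDULER, the newer alternative to DBMS_JOB. This is only an illustration. The job name SYNC_MY_DOCS_IDX and the hourly interval are my assumptions, and it assumes the schema is allowed to create jobs and to execute CTX_DDL.

```sql
BEGIN
  DBMS_SCHEDULER.CREATE_JOB (
    job_name        => 'SYNC_MY_DOCS_IDX',   -- hypothetical job name
    job_type        => 'PLSQL_BLOCK',
    job_action      => 'BEGIN CTX_DDL.SYNC_INDEX(''my_documents_ot_idx''); END;',
    start_date      => SYSTIMESTAMP,
    repeat_interval => 'FREQ=HOURLY',        -- assumed interval, adjust to your load
    enabled         => TRUE);
END;
/
```

A second job of the same shape, calling CTX_DDL.OPTIMIZE_INDEX on a nightly or weekly interval, would cover the re-organisation step.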

(I could talk a lot more about setting up some basics of Oracle Text, the indexes, etc. But I'll leave that for another day or you can read some of the many blog posts that already exist on the topic.)

Extracting text from a webpage using R: Some time ago I wrote a blog post on using some of the text mining features and packages in R to produce a word cloud based on some of the Oracle Advanced Analytics webpages. I'm going to use the same webpages and some of the same code/functions/packages here. The first task you need to do is to get your hands on the 'htmlToText' function. You can download the htmlToText function on github. This function requires the 'RCurl' and 'XML' R packages, so you may need to install these. I also use the str_replace_all function ('stringr' R package) to remove some of the html that remains, to remove some special quotes and to replace any occurrences of '&' with 'and'.

# Load the function and required R packages
source("c:/app/htmltotext.R")
library(stringr)

# Remove line breaks, tabs and quote characters, then replace '&' with 'and'
clean_text <- function(url) {
  txt <- str_replace_all(htmlToText(url), "[\r\n\t\"\'\u201C\u201D]", "")
  str_replace_all(txt, "&", "and")
}

data1 <- clean_text("http://www.oracle.com/technetwork/database/options/advanced-analytics/overview/index.html")
data2 <- clean_text("http://www.oracle.com/technetwork/database/options/advanced-analytics/odm/index.html")
data3 <- clean_text("http://www.oracle.com/technetwork/database/database-technologies/r/r-technologies/overview/index.html")
data4 <- clean_text("http://www.oracle.com/technetwork/database/database-technologies/r/r-enterprise/overview/index.html")

We now have the text extracted and cleaned up.

Create a data frame to contain all our data: Now that we have the text extracted, we can prepare the other data items we need to insert the data into our table ('my_documents'). The first step is to construct a data frame to contain all the data.

data_source = c("http://www.oracle.com/technetwork/database/options/advanced-analytics/overview/index.html",
                 "http://www.oracle.com/technetwork/database/options/advanced-analytics/odm/index.html",
                 "http://www.oracle.com/technetwork/database/database-technologies/r/r-technologies/overview/index.html",
                 "http://www.oracle.com/technetwork/database/database-technologies/r/r-enterprise/overview/index.html")
doc_title = c("OAA_OVERVIEW", "OAA_ODM", "R_TECHNOLOGIES", "OAA_ORE")
doc_extracted = Sys.Date()
data_text <- c(data1, data2, data3, data4)

my_docs <- data.frame(doc_title, doc_extracted, data_source, data_text)

Insert the data into our database table: With the data in our data frame (my_docs) we can now insert it into our database table. There are a number of ways of doing this in R. What I'm going to show you here is how to do it using Oracle R Enterprise (ORE). The thing with ORE is that there is no explicit functionality for inserting and updating records in a database table. What you need to do is to construct, in my case, the insert statement and then use ore.exec to execute this statement in the database.

library(ORE)
ore.connect(user="ora_text", password="ora_text", host="localhost", service_name="PDB12C", 
            port=1521, all=TRUE) 

for(i in 1:nrow(my_docs)) {
  insert_stmt <- "BEGIN insert_tab_document ('"
  insert_stmt <- paste(insert_stmt,  my_docs[i,]$doc_title, sep="")
  insert_stmt <- paste(insert_stmt, "', '",  my_docs[i,]$doc_extracted, "'", sep="")
  insert_stmt <- paste(insert_stmt, ", '",  my_docs[i,]$data_source, sep="")
  insert_stmt <- paste(insert_stmt, "', '",  my_docs[i,]$data_text, "');", " END;", sep="")
  ore.exec(insert_stmt)
}
ore.exec("commit")
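
The loop above calls a procedure named insert_tab_document, which is not listed in this post. A minimal sketch of what such a procedure could look like is given below. The parameter names and the date format mask are assumptions. The mask matches the YYYY-MM-DD string that R produces when the Sys.Date() value is pasted into the statement.

```sql
-- Hypothetical helper procedure; parameter names are assumptions.
create or replace procedure insert_tab_document (
   p_doc_title     in varchar2,
   p_doc_extracted in varchar2,   -- date passed from R as a 'YYYY-MM-DD' string
   p_data_source   in varchar2,
   p_doc_text      in clob)
as
begin
   insert into my_documents
          (doc_pk, doc_title, doc_extracted, data_source, doc_text)
   values (my_doc_seq.nextval,
           p_doc_title,
           to_date(p_doc_extracted, 'YYYY-MM-DD'),
           p_data_source,
           p_doc_text);
end;
/
```

Because the webpage text arrives as a string literal inside a PL/SQL block, very large pages may run into the 32,767 byte limit on PL/SQL string literals, so this approach is best suited to demonstration sized documents.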

You can now view the inserted webpage text using R or using SQL.
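
For the SQL route, a minimal sketch follows. It assumes the index was created without the 'sync (on commit)' parameter, so the index is synchronised first. The search term 'analytics' is just an example.

```sql
-- Make the newly inserted rows searchable
EXEC CTX_DDL.SYNC_INDEX('my_documents_ot_idx');

-- Look at the first 100 characters of each document
select doc_pk, doc_title, doc_extracted,
       dbms_lob.substr(doc_text, 100, 1) as text_start
from   my_documents
order by doc_pk;

-- Search the text using the Oracle Text index, best matches first
select doc_pk, doc_title, score(1) as relevance
from   my_documents
where  contains(doc_text, 'analytics', 1) > 0
order by score(1) desc;
```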

In my next blog post in this series, I will look at how you can use the ORE embedded features to read and process this data.

Monday, July 11, 2016

Creating ggplot2 graphics using SQL

Did you read the title of this blog post? Read it again.

Yes, Yes, I know what you are saying, "SQL cannot produce graphics or charts and particularly not ggplot2 graphics".

You are correct to a certain extent. SQL is rubbish at creating graphics (and I'm being polite).

But with Oracle R Enterprise you can now produce graphics on your data using the embedded R execution feature of Oracle R Enterprise using SQL. In this blog post I will show you how.

1. Pre-requisites

You need to have installed Oracle R Enterprise on your Oracle Database Server. Plus you need to install the ggplot2 R package.

In your R session you will need to set up an ORE connection to your Oracle schema.

2. Write and Test your R code to produce the graphic

It is always a good idea to write and test your R code before you go near using it in a user defined function.

For our (first) example we are going to create a bar chart using the ggplot2 R package. This is a basic example and the aim is to illustrate the steps you need to go through to call and produce this graphic using SQL.

The following code uses the CLAIMS data set that is available with/for Oracle Advanced Analytics. The first step is to pull the data from the table in your Oracle schema to your R session. This is because ggplot2 cannot work with data referenced by an ore.frame object.

data.subset <- ore.pull(CLAIMS) 

Next we need to aggregate the data. Here we are counting the number of records for each Make of car.

aggdata2 <- aggregate(data.subset$POLICYNUMBER,
                      by = list(MAKE = data.subset$MAKE),
                      FUN = length)

Now load the ggplot2 R package and use it to build the bar chart.

library(ggplot2)

ggplot(data=aggdata2, aes(x=MAKE, y=x, fill=MAKE)) + 
       geom_bar(color="black", stat="identity") +
       xlab("Make of Car") + 
       ylab("Num of Accidents") + 
       ggtitle("Accidents by Make of Car")

The following is the graphic that our call to ggplot2 produces in R.

NewImage

At this point we have written and tested our R code and know that it works.

3. Create a user defined R function and store it in the Oracle Database

Our next step in the process is to create an in-database user defined R function. This is where we store R code in our Oracle Database and make this available as an R function. To create the user defined R function we can use some PL/SQL to define it, and then take our R code (see above) and place it inside the definition.

BEGIN
   -- sys.rqScriptDrop('demo_ggpplot');
   sys.rqScriptCreate('demo_ggpplot', 
      'function(dat) {
         library(ggplot2)
         
         aggdata2 <- aggregate(dat$POLICYNUMBER,
                      by = list(MAKE = dat$MAKE),
                      FUN = length)

        g <-ggplot(data=aggdata2, aes(x=MAKE, y=x, fill=MAKE)) + geom_bar(color="black", stat="identity") +
                   xlab("Make of Car") + ylab("Num of Accidents") + ggtitle("Accidents by Make of Car")

        plot(g)
   }');
END;

We have to make a small addition to our R code. We need to include a call to the plot function so that the image can be returned as a BLOB object. If you do not do this then the SQL query in step 4 will return no rows.
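
To confirm that the function has been stored, you can query the R script repository. This is a minimal sketch, assuming your ORE version exposes the sys.rq_scripts view and that your user is allowed to query it.

```sql
-- List the stored script; the name matches the one used in rqScriptCreate
select name, script
from   sys.rq_scripts
where  name = 'demo_ggpplot';
```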

4. Write the SQL to call it

To call our defined R function we will need to use one of the ORE SQL API functions. In the following example we are using the rqTableEval function. The first parameter for this function passes in the data to be processed. In our case this is the data from the CLAIMS table. The second parameter is set to null. The third parameter is set to the output format and in our case we want this to be PNG. The fourth parameter is the name of the user defined R function.

select *
from table(rqTableEval( cursor(select * from claims),
                        null,
                        'PNG',
                        'demo_ggpplot'));                        

5. How to view the results

The SQL query in Step 4 above will return one row and this row will contain a column with a BLOB data type.

NewImage

The easiest way to view the graphic that is produced is to use SQL Developer. It has an inbuilt feature that allows you to display BLOB objects. All you need to do is to double click on the BLOB cell (under the column labeled IMAGE). A window will open called 'View Value'. In this window click the 'View As Image' check box in the top right hand corner of the window. When you do, the R ggplot2 graphic will be displayed.

NewImage

Yes the image is not 100% the same as the image produced in our R session. I will have another blog post that deals with this at a later date.

But, now you have written a SQL query, that calls R code to produce an R graphic (using ggplot2) of our data.

6. Now you can enhance the graphics (without changing your SQL)

What if you get bored with the bar chart and you want to change it to a different type of graphic? All you need to do is to change the relevant code in the user defined R function.

For example, we might want to change the graphic to a polar plot. The following is the PL/SQL code that re-defines the user defined R script.

BEGIN
   sys.rqScriptDrop('demo_ggpplot');
   sys.rqScriptCreate('demo_ggpplot', 
      'function(dat) {
         library(ggplot2)
         
         aggdata2 <- aggregate(dat$POLICYNUMBER,
                      by = list(MAKE = dat$MAKE),
                      FUN = length)

         n <- nrow(aggdata2)
         degrees <- 360/n

        aggdata2$MAKE_ID <- 1:nrow(aggdata2)

        g<- ggplot(data=aggdata2, aes(x=MAKE, y=x, fill=MAKE)) + geom_bar(color="black", stat="identity") +
               xlab("Make of Car") + ylab("Num of Accidents") + ggtitle("Accidents by Make of Car") + coord_polar(theta="x") 
        plot(g)
   }');
END;

We can use the exact same SQL query we defined in Step 4 above to produce the new graphic.

NewImage

All done.

Now that was easy! Right?

It kind of is easy once you have been shown. There are a few challenges when working with in-database user defined R functions and writing the SQL to call them. Most of the challenges are around the formatting of R code in the function and the syntax of the SQL statement to call it. With a bit of practice it does get easier.

7. Where/How can you use these graphics ?

Any application or program that can call and process a BLOB data type can display these images. For example, I've been able to include these graphics in applications developed in APEX.
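
Before wiring the query into an application it can help to check that a non-empty image really comes back. The following is a sketch only. It assumes the PNG output exposes ID and IMAGE columns (IMAGE being the column mentioned in step 5).

```sql
-- Report the size in bytes of each image returned by the R function
select id,
       dbms_lob.getlength(image) as image_bytes
from table(rqTableEval( cursor(select * from claims),
                        null,
                        'PNG',
                        'demo_ggpplot'));
```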

Tuesday, July 5, 2016

Cluster Distance using SQL with Oracle Data Mining - Part 4

This is the fourth and last blog post in a series that looks at how you can examine the details of predicted clusters using Oracle Data Mining. In the previous blog posts I looked at how to use CLUSTER_ID, CLUSTER_PROBABILITY and CLUSTER_SET.

In this blog post we will look at CLUSTER_DISTANCE. We can use the function to determine how close a record is to the centroid of the cluster. Perhaps we can use this to determine what customers etc. we might want to focus on most. The customers who are closest to the centroid are the ones we want to focus on first. So we can use it as a way to prioritise our workflows, particularly when it is used in combination with the value for CLUSTER_PROBABILITY.

Here is an example of using CLUSTER_DISTANCE to list all the records that belong to Cluster 14 and the results are ordered based on closeness to the centroid of this cluster.

SELECT customer_id, 
       cluster_probability(clus_km_1_37 USING *) as cluster_Prob,
       cluster_distance(clus_km_1_37 USING *) as cluster_Distance
FROM   insur_cust_ltv_sample
WHERE   cluster_id(clus_km_1_37 USING *) = 14
order by cluster_Distance asc;

Here is a subset of the results from this query.

NewImage

When you examine the results you may notice that the record that is listed first, and is closest to the centre of cluster 14, has a very low probability. You need to remember that we are working in an N-dimensional space here. Although this first record is closest to the centre of cluster 14 it has a really low probability, and if we examine this record in more detail we will find that it is at an overlapping point between a number of clusters.

This is why we need to use the CLUSTER_DISTANCE and CLUSTER_PROBABILITY functions together in our workflows and applications to determine how we need to process records like these.
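
One way to combine the two functions is to keep only the records that are confidently assigned to cluster 14 and then order those by distance. The following is a sketch only. The probability cut-off of 0.7 is an assumed value chosen purely for illustration.

```sql
SELECT customer_id,
       cluster_probability(clus_km_1_37 USING *) as cluster_Prob,
       cluster_distance(clus_km_1_37 USING *)    as cluster_Distance
FROM   insur_cust_ltv_sample
WHERE  cluster_id(clus_km_1_37 USING *) = 14
AND    cluster_probability(clus_km_1_37 USING *) >= 0.7   -- assumed threshold
ORDER BY cluster_Distance asc;
```

Records that are close to the centroid but fall below the threshold can then be routed to a separate review step.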