
Monday, August 8, 2016

Oracle Text, Oracle R Enterprise and Oracle Data Mining - Part 2

This is the second blog post in a series on using Oracle Text, Oracle R Enterprise and Oracle Data Mining. Check out the first blog post of the series, as the data used in this blog post was extracted, processed and stored in a database table.

In this blog post I will show you how you can use Oracle R Enterprise, and its embedded R execution features, to take the text from the webpages and create a word cloud. This is a useful way to see visually which words stand out most on your webpages and whether the correct message is being put across to your customers.

Prerequisites: You will need to load the following R packages into your R environment: 'tm', 'wordcloud' and 'SnowballC'. These are required by the R code segments that follow.

install.packages (c( "tm", "wordcloud", "SnowballC"))
library (tm)
library (wordcloud)
library (SnowballC)

Select data from table and prepare: We need to select the data from the table in our schema and merge it into one variable.

local_data <- ore.pull(MY_DOCUMENTS)

tm_data <-""
for(i in 1:nrow(local_data)) {
  tm_data <- paste(tm_data, local_data[i,]$DOC_TEXT, sep=" ")
}
tm_data
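
For what it's worth, the same concatenation can be done in a single call using paste with the collapse argument (a minimal equivalent sketch, assuming DOC_TEXT comes back as a character column in the pulled data frame):

tm_data <- paste(local_data$DOC_TEXT, collapse=" ")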

Create function to perform Text Mining: In my previous blog post on creating a word cloud I gave the R code. To allow this R code to be run on the database server (using the embedded R execution features of ORE) we need to package the text mining R code up into an ORE user-defined R script. This is stored in the database.

ore.scriptDrop("prepare_tm_data")
ore.scriptCreate("prepare_tm_data", function (tm_data) { 
  library(tm)
  library(SnowballC)
  library(wordcloud)
  
  txt_corpus <- Corpus (VectorSource (tm_data))
  
  # data clean up
  tm_map <- tm_map (txt_corpus, stripWhitespace) # remove white space
  tm_map <- tm_map (tm_map, removePunctuation) # remove punctuations
  tm_map <- tm_map (tm_map, removeNumbers) # to remove numbers
  tm_map <- tm_map (tm_map, removeWords, stopwords("english")) # to remove stop words
  tm_map <- tm_map (tm_map, removeWords, c("work", "use", "java", "new", "support"))
  
  # prepare matrix of words and frequency counts
  Matrix <- TermDocumentMatrix(tm_map) # terms in rows
  matrix_c <- as.matrix (Matrix)
  freq <- sort (rowSums (matrix_c)) # frequency data
  
  res <- data.frame(words=names(freq), freq)
  wordcloud (res$words, res$freq, max.words=100, min.freq=3, scale=c(7,.5), random.order=FALSE, colors=brewer.pal(8, "Dark2"))
} ) 

Before we can run this user-defined R script, we need to ensure that the 'tm', 'SnowballC' and 'wordcloud' R packages are installed on the Oracle Database server. On the Oracle Database server you need to start R and load ORE.

> library(ORE)

Then run the following command to install these R packages

> install.packages(c('tm', 'wordcloud', 'SnowballC'))

Run the function on the DB Server: You are now ready to run the function. In an earlier step we had gathered the data. Now we can pass this data to the in-database R script.

> res <- ore.doEval(FUN.NAME="prepare_tm_data", tm_data=tm_data)

The ore.doEval function is a general purpose ORE function for embedded R execution. In this case we pass it two parameters. The first parameter is the name of the user-defined R script stored in the database, and the second parameter is the data. The function returns an ORE object that contains the word cloud graphic.

Display the results: You can very easily display the results.

> res

This gives us the following graphic.

[Image: word cloud generated from the webpage text]

In my next blog post of this series, I will show you how you can use the function created above, along with some other features of ORE, and how to call it from SQL.

Tuesday, July 26, 2016

Oracle Text, Oracle R Enterprise and Oracle Data Mining - Part 1

A project that I've been working on for a while now involves the use of Oracle Text, Oracle R Enterprise and Oracle Data Mining. Oracle Text comes with your Oracle Database licence. Oracle R Enterprise and Oracle Data Mining are part of the Oracle Advanced Analytics (extra cost) option.

What I will be doing over the course of 4 or maybe 5 blog posts is to show how these products can work together to help you gain a greater insight into your data, particularly when part of your data consists of large text items like free format text, documents (in various forms e.g. html, xml, pdf, ms word), etc.

Unfortunately I cannot show you examples from the actual project I've been working on (and still am, from time to time). But what I can do is to show you how products and components can work together.

In this blog post I will just do some data setup. As with all project scenarios there can be many ways of performing the same tasks. Some might be better than others. But what I will be showing you is for demonstration purposes.

The scenario: The scenario for this blog post is that I want to extract text from some webpages and store them in a table in my schema. I then want to use Oracle Text to search the text from these webpages.

Schema setup: We need to create a table that will store the text from the webpages. We also want to create an Oracle Text index so that this text is searchable.

drop sequence my_doc_seq;
create sequence my_doc_seq;

drop table my_documents;

create table my_documents (
doc_pk number(10) primary key, 
doc_title varchar2(100), 
doc_extracted date, 
data_source varchar2(200), 
doc_text clob);

create index my_documents_ot_idx on my_documents(doc_text) 
indextype is CTXSYS.CONTEXT;

In the table we have a number of descriptive attributes and then a CLOB for storing the webpage text. We will only be storing the webpage text and not the HTML document (more on that later). In order to make the text in the DOC_TEXT attribute searchable we need to create an Oracle Text index of type CONTEXT.

There are a few challenges with using this type of index. For example, when you insert a new record or update the DOC_TEXT attribute, the new values/text will not be reflected instantly, unlike what we are used to with traditional indexes. Instead you have to decide when you want the index to be updated. For example, if you would like the index to be updated after each commit then you can create the index using the following.

create index my_documents_ot_idx on my_documents(doc_text) 
indextype is CTXSYS.CONTEXT
parameters ('sync (on commit)');

Depending on the number of documents being committed to the DB, this might not be for you; you need to find the balance. Alternatively you could schedule the index to be updated by passing an interval to the 'sync' parameter in the above command, or you could use DBMS_JOB to schedule the update.
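
For example, to have the index synced at a regular interval rather than on every commit, you can pass an interval string to the 'sync' parameter. The interval string below uses the DBMS_SCHEDULER calendaring syntax, and the hourly value is just an illustration; pick whatever frequency suits your load.

create index my_documents_ot_idx on my_documents(doc_text) 
indextype is CTXSYS.CONTEXT
parameters ('sync (every "freq=hourly; interval=1")');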

To manually sync the index (or via DBMS_JOB), assuming we used the first 'create index' statement, we would need to run the following.

EXEC CTX_DDL.SYNC_INDEX('my_documents_ot_idx');

This procedure just adds the new documents to the index. This can, over time, lead to some fragmentation of the index, which will require it to be re-organised on a semi-regular basis. Perhaps you can schedule this to happen every night, or once a week, or whatever makes sense to you.

BEGIN
  CTX_DDL.OPTIMIZE_INDEX('my_documents_ot_idx','FULL');
END;
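
If you would like this re-organisation to happen automatically, one option is to wrap the call in a scheduler job. The following is only a sketch (the job name and the nightly 2am schedule are made-up example values):

BEGIN
  DBMS_SCHEDULER.CREATE_JOB (
    job_name        => 'OPTIMIZE_MY_DOCUMENTS_IDX',   -- example job name
    job_type        => 'PLSQL_BLOCK',
    job_action      => 'BEGIN CTX_DDL.OPTIMIZE_INDEX(''my_documents_ot_idx'', ''FULL''); END;',
    start_date      => SYSTIMESTAMP,
    repeat_interval => 'FREQ=DAILY; BYHOUR=2',        -- run every night at 2am (example)
    enabled         => TRUE);
END;
/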

(I could talk a lot more about setting up some basics of Oracle Text, the indexes, etc. But I'll leave that for another day or you can read some of the many blog posts that already exist on the topic.)

Extracting text from a webpage using R: Some time ago I wrote a blog post on using some of the text mining features and packages in R to produce a word cloud based on some of the Oracle Advanced Analytics webpages. I'm going to use the same webpages and some of the same code/functions/packages here. The first task is to get your hands on the htmlToText function, which you can download from github. This function requires the 'RCurl' and 'XML' R packages, so you may need to install these. I also use the str_replace_all function (from the 'stringr' R package) to remove some of the html that remains, to remove some special quotes and to replace any occurrences of '&' with 'and'.

# Load the function and required R packages
source("c:/app/htmltotext.R")
library(stringr)
data1 <- str_replace_all(htmlToText("http://www.oracle.com/technetwork/database/options/advanced-analytics/overview/index.html"), "[\r\n\t\"\'\u201C\u201D]" , "")
data1 <- str_replace_all(data1, "&", "and")
data2 <- str_replace_all(str_replace_all(htmlToText("http://www.oracle.com/technetwork/database/options/advanced-analytics/odm/index.html"), "[\r\n\t\"\'\u201C\u201D]" , ""), "&", "and")
data2 <- str_replace_all(data2, "&", "and")
data3 <- str_replace_all(str_replace_all(htmlToText("http://www.oracle.com/technetwork/database/database-technologies/r/r-technologies/overview/index.html"), "[\r\n\t\"\'\u201C\u201D]" , ""), "&", "and")
data3 <- str_replace_all(data3, "&", "and")
data4 <- str_replace_all(str_replace_all(htmlToText("http://www.oracle.com/technetwork/database/database-technologies/r/r-enterprise/overview/index.html"), "[\r\n\t\"\'\u201C\u201D]" , ""), "&", "and")
data4 <- str_replace_all(data4, "&", "and")
We now have the text extracted and cleaned up.

Create a data frame to contain all our data: Now that we have the text extracted, we can prepare the other data items we need to insert the data into our table ('my_documents'). The first step is to construct a data frame to contain all the data.
data_source = c("http://www.oracle.com/technetwork/database/options/advanced-analytics/overview/index.html",
                 "http://www.oracle.com/technetwork/database/options/advanced-analytics/odm/index.html",
                 "http://www.oracle.com/technetwork/database/database-technologies/r/r-technologies/overview/index.html",
                 "http://www.oracle.com/technetwork/database/database-technologies/r/r-enterprise/overview/index.html")
doc_title = c("OAA_OVERVIEW", "OAA_ODM", "R_TECHNOLOGIES", "OAA_ORE")
doc_extracted = Sys.Date()
data_text <- c(data1, data2, data3, data4)

my_docs <- data.frame(doc_title, doc_extracted, data_source, data_text)

Insert the data into our database table: With the data in our data frame (my_docs) we can now insert it into our database table. There are a number of ways of doing this in R. What I'm going to show you here is how to do it using Oracle R Enterprise (ORE). The thing with ORE is that there is no explicit functionality for inserting and updating records in a database table. What you need to do is to construct, in my case, the insert statement and then use ore.exec to execute this statement in the database.

library(ORE)
ore.connect(user="ora_text", password="ora_text", host="localhost", service_name="PDB12C", 
            port=1521, all=TRUE) 

for(i in 1:nrow(my_docs)) {
  insert_stmt <- "BEGIN insert_tab_document ('"
  insert_stmt <- paste(insert_stmt,  my_docs[i,]$doc_title, sep="")
  insert_stmt <- paste(insert_stmt, "', '",  my_docs[i,]$doc_extracted, "'", sep="")
  insert_stmt <- paste(insert_stmt, ", '",  my_docs[i,]$data_source, sep="")
  insert_stmt <- paste(insert_stmt, "', '",  my_docs[i,]$data_text, "');", " END;", sep="")
  ore.exec(insert_stmt)
}
ore.exec("commit")

You can now view the inserted webpage text using R or using SQL.
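
And because we created the Oracle Text CONTEXT index on DOC_TEXT, you can also run keyword searches against the stored webpage text. A simple illustrative query (the search term is just an example) would be:

select doc_pk, doc_title, score(1) as relevance
from   my_documents
where  contains(doc_text, 'analytics', 1) > 0
order by score(1) desc;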

In my next blog post in this series, I will look at how you can use the ORE embedded features to read and process this data.

Monday, July 11, 2016

Creating ggplot2 graphics using SQL

Did you read the title of this blog post! Read it again.

Yes, Yes, I know what you are saying, "SQL cannot produce graphics or charts and particularly not ggplot2 graphics".

You are correct to a certain extent. SQL is rubbish at creating graphics (and I'm being polite).

But with Oracle R Enterprise you can now produce graphics on your data using the embedded R execution feature of Oracle R Enterprise using SQL. In this blog post I will show you how.

1. Pre-requisites

You need to have Oracle R Enterprise installed on your Oracle Database server, plus you need to install the ggplot2 R package (on both your client and the database server).

In your R session you will need to set up an ORE connection to your Oracle schema.

2. Write and Test your R code to produce the graphic

It is always a good idea to write and test your R code before you go near using it in a user defined function.

For our (first) example we are going to create a bar chart using the ggplot2 R package. This is a basic example and the aim is to illustrate the steps you need to go through to call and produce this graphic using SQL.

The following code uses the CLAIMS data set that is available with/for Oracle Advanced Analytics. The first step is to pull the data from the table in your Oracle schema to your R session. This is because ggplot2 cannot work with data referenced by an ore.frame object.

data.subset <- ore.pull(CLAIMS) 

Next we need to aggregate the data. Here we are counting the number of records for each Make of car.

aggdata2 <- aggregate(data.subset$POLICYNUMBER,
                      by = list(MAKE = data.subset$MAKE),
                      FUN = length)

Now load the ggplot2 R package and use it to build the bar chart.

library(ggplot2)

ggplot(data=aggdata2, aes(x=MAKE, y=x, fill=MAKE)) + 
       geom_bar(color="black", stat="identity") +
       xlab("Make of Car") + 
       ylab("Num of Accidents") + 
       ggtitle("Accidents by Make of Car")

The following is the graphic that our call to ggplot2 produces in R.

[Image: ggplot2 bar chart of accidents by make of car]

At this point we have written and tested our R code and know that it works.

3. Create a user defined R function and store it in the Oracle Database

Our next step in the process is to create an in-database user-defined R function. This is where we store R code in our Oracle Database and make it available as an R function. To create the user-defined R function we can use some PL/SQL to define it, and then take our R code (see above) and embed it in it.

BEGIN
   -- sys.rqScriptDrop('demo_ggpplot');
   sys.rqScriptCreate('demo_ggpplot', 
      'function(dat) {
         library(ggplot2)
         
         aggdata2 <- aggregate(dat$POLICYNUMBER,
                      by = list(MAKE = dat$MAKE),
                      FUN = length)

        g <-ggplot(data=aggdata2, aes(x=MAKE, y=x, fill=MAKE)) + geom_bar(color="black", stat="identity") +
                   xlab("Make of Car") + ylab("Num of Accidents") + ggtitle("Accidents by Make of Car")

        plot(g)
   }');
END;

We have to make a small addition to our R code. We need to include a call to the plot function so that the image can be returned as a BLOB object. If you do not do this then the SQL query in step 4 will return no rows.

4. Write the SQL to call it

To call our defined R function we will need to use one of the ORE SQL API functions. In the following example we are using the rqTableEval function. The first parameter for this function passes in the data to be processed. In our case this is the data from the CLAIMS table. The second parameter is set to null. The third parameter is set to the output format and in our case we want this to be PNG. The fourth parameter is the name of the user defined R function.

select *
from table(rqTableEval( cursor(select * from claims),
                        null,
                        'PNG',
                        'demo_ggpplot'));                        

5. How to view the results

The SQL query in Step 4 above will return one row and this row will contain a column with a BLOB data type.

[Image: query result with a single row containing a BLOB column]

The easiest way to view the graphic that is produced is to use SQL Developer. It has an inbuilt feature that allows you to display BLOB objects. All you need to do is to double click on the BLOB cell (under the column labeled IMAGE). A window will open called 'View Value'. In this window click the 'View As Image' check box on the top right hand corner of the window. When you do the R ggplot2 graphic will be displayed.

[Image: the ggplot2 bar chart displayed in SQL Developer's View Value window]

Yes the image is not 100% the same as the image produced in our R session. I will have another blog post that deals with this at a later date.

But, now you have written a SQL query, that calls R code to produce an R graphic (using ggplot2) of our data.

6. Now you can enhance the graphics (without changing your SQL)

What if you get bored with the bar chart and you want to change it to a different type of graphic? All you need to do is to change the relevant code in the user defined R function.

For example, we might want to change the graphic to a polar plot. The following is the PL/SQL code that re-defines the user-defined R script.

BEGIN
   sys.rqScriptDrop('demo_ggpplot');
   sys.rqScriptCreate('demo_ggpplot', 
      'function(dat) {
         library(ggplot2)
         
         aggdata2 <- aggregate(dat$POLICYNUMBER,
                      by = list(MAKE = dat$MAKE),
                      FUN = length)

         n <- nrow(aggdata2)
         degrees <- 360/n

        aggdata2$MAKE_ID <- 1:nrow(aggdata2)

        g<- ggplot(data=aggdata2, aes(x=MAKE, y=x, fill=MAKE)) + geom_bar(color="black", stat="identity") +
               xlab("Make of Car") + ylab("Num of Accidents") + ggtitle("Accidents by Make of Car") + coord_polar(theta="x") 
        plot(g)
   }');
END;

We can use the exact same SQL query we defined in Step 4 above to produce the new graphic.

[Image: ggplot2 polar plot of accidents by make of car]

All done.

Now that was easy! Right?

It kind of is easy once you have been shown. There are a few challenges when working with in-database user-defined R functions and writing the SQL to call them. Most of the challenges are around the formatting of R code in the function and the syntax of the SQL statement to call it. With a bit of practice it does get easier.

7. Where/How can you use these graphics ?

Any application or program that can call and process a BLOB data type can display these images. For example, I've been able to include these graphics in applications developed in APEX.

Tuesday, July 5, 2016

Cluster Distance using SQL with Oracle Data Mining - Part 4

This is the fourth and last blog post in a series that looks at how you can examine the details of predicted clusters using Oracle Data Mining. In the previous blog posts I looked at how to use CLUSTER_ID, CLUSTER_PROBABILITY and CLUSTER_SET.

In this blog post we will look at CLUSTER_DISTANCE. We can use this function to determine how close a record is to the centroid of the cluster. Perhaps we can use this to determine which customers etc. we might want to focus on most: the customers who are closest to the centroid are the ones we want to focus on first. So we can use it as a way to prioritise our workflows, particularly when it is used in combination with the value for CLUSTER_PROBABILITY.

Here is an example of using CLUSTER_DISTANCE to list all the records that belong to Cluster 14 and the results are ordered based on closeness to the centroid of this cluster.

SELECT customer_id, 
       cluster_probability(clus_km_1_37 USING *) as cluster_Prob,
       cluster_distance(clus_km_1_37 USING *) as cluster_Distance
FROM   insur_cust_ltv_sample
WHERE   cluster_id(clus_km_1_37 USING *) = 14
order by cluster_Distance asc;

Here is a subset of the results from this query.

[Image: subset of the query results ordered by cluster distance]

When you examine the results you may notice that the record listed first, and hence closest to the centre of cluster 14, has a very low probability. You need to remember that we are working in an N-dimensional space here. Although this first record is closest to the centre of cluster 14 it has a really low probability, and if we examine this record in more detail we will find that it sits at an overlapping point between a number of clusters.

This is why we need to use the CLUSTER_DISTANCE and CLUSTER_PROBABILITY functions together in our workflows and applications to determine how we need to process records like these.
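
For example, a workflow might only treat a record as a 'core' member of cluster 14 when it is both close to the centroid and has a reasonably high probability. A sketch of such a query (the 0.5 probability threshold is just an arbitrary example value) is:

SELECT customer_id, 
       cluster_probability(clus_km_1_37 USING *) as cluster_Prob,
       cluster_distance(clus_km_1_37 USING *) as cluster_Distance
FROM   insur_cust_ltv_sample
WHERE  cluster_id(clus_km_1_37 USING *) = 14
AND    cluster_probability(clus_km_1_37 USING *) >= 0.5
order by cluster_Distance asc;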

Thursday, June 23, 2016

Cluster Sets using SQL with Oracle Data Mining - Part 3

This is the third blog post in my series on examining the clusters that were predicted by an Oracle Data Mining model. Check out the previous blog posts.

In the previous posts we were able to list the predicted cluster for each record in our data set. This is the cluster that the record belonged to the most. I also mentioned that a record could belong to many clusters.

So how can you list all the clusters that a record belongs to?

You can use the CLUSTER_SET SQL function. This will list the cluster id and a probability measure for each cluster. This function returns an array consisting of the set of all clusters that the record belongs to.

The following example illustrates how to use the CLUSTER_SET function for a particular cluster model.

SELECT t.customer_id, s.cluster_id, s.probability
FROM   (select customer_id, cluster_set(clus_km_1_37 USING *) as Cluster_Set
        from   insur_cust_ltv_sample 
        WHERE  customer_id in ('CU13386', 'CU100')) T,
      TABLE(T.cluster_set) S
order by t.customer_id, s.probability desc; 

The output from this query will be an ordered data set based on the customer id and then the clusters listed in descending order of probability. The cluster with the highest probability is what would be returned by the CLUSTER_ID function. The output from the above query is shown below.

[Image: CLUSTER_SET results for the selected customers]

If you would like to see the details of each of the clusters and to examine the differences between these clusters then you will need to use the CLUSTER_DETAILS function (see previous blog post).

You can specify topN and cutoff to limit the number of clusters returned by the function. By default, both topN and cutoff are null and all clusters are returned.

- topN is the N most probable clusters. If multiple clusters share the Nth probability, then the function chooses one of them.

- cutoff is a probability threshold. Only clusters with probability greater than or equal to cutoff are returned. To filter by cutoff only, specify NULL for topN.

You may want to use these individually or combined together if you have a large number of customers. To return up to the N most probable clusters that are greater than or equal to cutoff, specify both topN and cutoff.

The following example illustrates using the topN value to return the top 4 clusters.

SELECT t.customer_id, s.cluster_id, s.probability
FROM   (select customer_id, cluster_set(clus_km_1_37, 4, null USING *) as Cluster_Set
        from   insur_cust_ltv_sample 
        WHERE  customer_id in ('CU13386', 'CU100')) T,
      TABLE(T.cluster_set) S
order by t.customer_id, s.probability desc;

and the output from this query shows only 4 clusters displayed for each record.

[Image: CLUSTER_SET results limited to the top 4 clusters]

Alternatively you can select the clusters based on a cut off value for the probability. In the following example this is set to 0.05.

SELECT t.customer_id, s.cluster_id, s.probability
FROM   (select customer_id, cluster_set(clus_km_1_37, NULL, 0.05 USING *) as Cluster_Set
        from   insur_cust_ltv_sample 
        WHERE  customer_id in ('CU13386', 'CU100')) T,
      TABLE(T.cluster_set) S
order by t.customer_id, s.probability desc;

and the output this time looks a bit different.

[Image: CLUSTER_SET results with the 0.05 probability cutoff]

Finally, yes you can combine these two parameters to work together.

SELECT t.customer_id, s.cluster_id, s.probability
FROM   (select customer_id, cluster_set(clus_km_1_37, 2, 0.05 USING *) as Cluster_Set
        from   insur_cust_ltv_sample 
        WHERE  customer_id in ('CU13386', 'CU100')) T,
      TABLE(T.cluster_set) S
order by t.customer_id, s.probability desc;

Thursday, June 16, 2016

Cluster Details with Oracle Data Mining - Part 2

This is the second blog post in my series on examining the clusters that an Oracle Data Mining model predicts for your data. In my previous blog post I showed you how to use the CLUSTER_ID and CLUSTER_PROBABILITY functions. These are the core functions you will use when working with clusters and automating the process.

In this blog post I will look at what details are used by the clustering model to make the prediction. The function that you can use is called CLUSTER_DETAILS. I had an earlier blog post on using PREDICTION_DETAILS to see some of the details that are produced when performing classification.

CLUSTER_DETAILS returns the cluster details for each row in the selection. The return value is an XML string that describes the attributes of the highest probability cluster.

Here is an example of using the CLUSTER_DETAILS function in a SELECT statement.

select cluster_details(clus_km_1_37, 14 USING *) as Cluster_Details
from   insur_cust_ltv_sample 
where  customer_id = 'CU13386';

The output is an XML string and the easiest way to view this is in SQL Developer. It will list the top 5 highest weighted attributes for the cluster centroid.

[Images: CLUSTER_DETAILS XML output and the top weighted attributes shown in SQL Developer]

The returned attributes are ordered by weight. The weight of an attribute expresses its positive or negative impact on cluster assignment. A positive weight indicates an increased likelihood of assignment. A negative weight indicates a decreased likelihood of assignment. By default, CLUSTER_DETAILS returns the attributes with the highest positive weights in descending order.
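
If you would prefer to work with these attributes as rows and columns in SQL, rather than reading the raw XML, one option is to shred the output with XMLTABLE. The following is only a sketch: it assumes the returned XML has a Details/Attribute structure with name, actualValue, weight and rank attributes, so check the actual output in your own database before relying on it.

select d.attr_name, d.attr_value, d.attr_weight, d.attr_rank
from   (select customer_id,
               xmltype(cluster_details(clus_km_1_37, 14 USING *)) as cluster_xml
        from   insur_cust_ltv_sample
        where  customer_id = 'CU13386') s,
       xmltable('/Details/Attribute'
                passing s.cluster_xml
                columns attr_name   varchar2(30) path '@name',
                        attr_value  varchar2(30) path '@actualValue',
                        attr_weight number       path '@weight',
                        attr_rank   number       path '@rank') d;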

Tuesday, June 7, 2016

Examining predicted Clusters and Cluster details using SQL

In a previous blog post I gave some details of how you can examine some of the details behind a prediction made using a classification model. This seemed to spark a lot of interest. But before I come back to looking at classification prediction details and other information, this blog post is the first in a 4 part blog post on examining the details of Clusters, as identified by a cluster model created using Oracle Data Mining.

The 4 blog posts will consist of:

  • 1 - (this blog post) will look at how to determine the predicted cluster and cluster probability for your record.
  • 2 - will show you how to examine the details behind and used to predict the cluster.
  • 3 - A record could belong to many clusters. In this blog post we will look at how you can determine what clusters a record can belong to.
  • 4 - Cluster distance is a measure of how far the record is from the cluster centroid. As a data point or record can belong to many clusters, it can be useful to know the distances as you can build logic to perform different actions based on the cluster distances and cluster probabilities.

Right. Let's have a look at the first of these cluster functions: CLUSTER_ID and CLUSTER_PROBABILITY.

CLUSTER_ID : Returns the number of the cluster that the record most closely belongs to. This is measured by the cluster distance to the centroid of the cluster. A data point or record can belong to, or be part of, many clusters. So the CLUSTER_ID is the cluster number that the data point or record most closely belongs to.

CLUSTER_PROBABILITY : Is a probability measure of the likelihood that the data point or record belongs to a cluster. The cluster with the highest probability score is the cluster that is returned by the CLUSTER_ID function.

Now let us have a quick look at the SQL for these two functions. This first query returns the cluster number that each record most strongly belongs to.

SELECT customer_id, 
       cluster_id(clus_km_1_37 USING *) as Cluster_Id
FROM   insur_cust_ltv_sample
WHERE  customer_id in ('CU13386', 'CU6607', 'CU100');

[Image: CLUSTER_ID results for the selected customers]

Now let us add in the cluster probability function.

SELECT customer_id, 
       cluster_id(clus_km_1_37 USING *) as Cluster_Id,
       cluster_probability(clus_km_1_37 USING *) as cluster_Prob       
FROM   insur_cust_ltv_sample
WHERE  customer_id in ('CU13386', 'CU6607', 'CU100');

[Image: CLUSTER_ID and CLUSTER_PROBABILITY results]

These functions give us some insight into what the clustering model is doing. In the remaining blog posts in this series I will look at how you can delve deeper into the predictions that the clustering algorithm is making.

Monday, May 30, 2016

PREDICTION_DETAILS function in Oracle

When building predictive models the data scientist can spend a large amount of time examining the models produced and how they work and perform on their hold-out sample data sets. They do this to understand whether the model gives a good general representation of the data and can identify/predict many different scenarios. When the "best" model has been selected, it is typically deployed in some sort of reporting environment, where a list is produced. This is a typical deployment method but it is far from ideal. A better deployment method is to build the predictive models into the everyday applications that the company uses. For example, build it into the call centre application, so that the staff have live and real-time feedback and predictions as they are talking to the customer.

But what kind of live and real-time feedback and predictions are possible? Again, if we look at what is traditionally done in these applications, they will get a predicted outcome (will they be a good customer or a bad customer) or some indication of their value (maybe lifetime value, possible claim payout value) etc.

But can we get any more information? Information like the reason for the prediction. This is sometimes called prediction insight. Can we get some details of what the prediction model used to decide on the predicted value? In most predictive analytics products this is not possible, as all you are told is the final outcome.

What would be useful is to know some of the thinking that the predictive model used to make its prediction. The reasons why one customer may be a "bad customer" might be different to those of another customer. Knowing this kind of information can be very useful to the staff who are dealing with the customers. Those who design the workflows can then build more advanced workflows to support the staff when dealing with the customers.

Oracle has a unique feature that allows us to see some of the details that the prediction model used to make the prediction. This function (based on using the Oracle Advanced Analytics option and Oracle Data Mining to build your predictive model) is called PREDICTION_DETAILS.

When you go to use PREDICTION_DETAILS you need to be careful, as it works differently in the 11.2 and 12c versions of the Oracle Database (Enterprise Edition). In Oracle Database 11.2 the PREDICTION_DETAILS function only works for Decision Tree models. But in 12c (and above) it has been opened up to include details for models created using all the classification algorithms, all the regression algorithms and also anomaly detection.

The following gives an example of using the PREDICTION_DETAILS function.

select cust_id, 
       prediction(clas_svm_1_27 using *) pred_value,
       prediction_probability(clas_svm_1_27 using *) pred_prob,
       prediction_details(clas_svm_1_27 using *) pred_details
from mining_data_apply_v;

The PREDICTION_DETAILS function produces its output in XML, and this consists of the attributes used and their values that determined why a record had the predicted value. The following gives some examples of the XML produced for some of the records.

[Image: example PREDICTION_DETAILS XML output]

I've used this particular function in lots of my projects, particularly when building the applications for a particular business unit. Oracle too has built this functionality into many of their applications. The images below are from the HCM application, where you can examine the details of why an employee may or may not leave/churn. You can then perform real-time what-if analysis by changing some of the attribute values to see if the predicted outcome changes.

[Image: prediction details and what-if analysis in the Oracle HCM application]

Friday, April 29, 2016

Accessing the R datasets in ORE and SQL

When you install R you also get a set of pre-compiled datasets. These are great for trying out many of the features that are available with R and all the new packages that are being produced on an almost daily basis.

The exact list of data sets available will depend on the version of R that you are using.

To get the list of available data sets in R you can run the following.

> library(help="datasets")

This command will list all the data sets that you can reference and start using immediately.

I'm currently running the latest version of Oracle R Distribution version 3.2. See the listing at the end of this blog post for the available data sets.

But are these data sets available to you if you are using Oracle R Enterprise (ORE)? The answer is Yes of course they are.

But are these accessible on the Oracle Database server? Yes they are, as you have R installed there and you can use ORE to access and use the data sets.

But how? How can I list what is on the Oracle Database server using R? Simple: use the following ORE code to run an embedded R execution function using the ORE R API.

What? What does that mean? Using R on your client machine, you can use ORE to send some R code to the Oracle Database server. The R code will be run on the Oracle Database server and the results will be returned to the client. Try the following code.

ore.doEval(function() library(help="datasets")) 

# let us create a function for this code
myFn <- function() {library(help="datasets")}

# Now send this function to the DB server and run it there.
ore.doEval(myFn)

# create an R script in the Oracle Database that contains our R code
ore.scriptDrop("inDB_R_DemoData")
ore.scriptCreate("inDB_R_DemoData", myFn)
# Now run the R script, stored in the Oracle Database, on the Database server
#   and return the results to my client
ore.doEval(FUN.NAME="inDB_R_DemoData")

Simple, Right!

Yes it is. You have shown us how to do this in R using the ORE package. But what if I'm a SQL developer, can I do this in SQL? Yes you can. Connect to your schema using SQL Developer/SQL*Plus/SQLcl or whatever tool you use to run SQL. Then run the following SQL.

select * 
from table(rqEval(null, 'XML', 'inDB_R_DemoData'));

This SQL code will return the results in XML format. You can parse this to extract and display the results, and when you do you will get something like the following listing, which is exactly the same as what is produced when you run the R code that I gave above.

So what this means is that even if you have an empty schema with no data in it, as long as you have the privileges to run embedded R execution, you actually have access to all these different data sets. You can use these to try out R using the ORE SQL APIs too.

		Information on package ‘datasets’

Description:

Package:       datasets
Version:       3.2.0
Priority:      base
Title:         The R Datasets Package
Author:        R Core Team and contributors worldwide
Maintainer:    R Core Team 
Description:   Base R datasets.
License:       Part of R 3.2.0
Built:         R 3.2.0; ; 2015-08-07 02:20:26 UTC; windows

Index:

AirPassengers           Monthly Airline Passenger Numbers 1949-1960
BJsales                 Sales Data with Leading Indicator
BOD                     Biochemical Oxygen Demand
CO2                     Carbon Dioxide Uptake in Grass Plants
ChickWeight             Weight versus age of chicks on different diets
DNase                   Elisa assay of DNase
EuStockMarkets          Daily Closing Prices of Major European Stock
                        Indices, 1991-1998
Formaldehyde            Determination of Formaldehyde
HairEyeColor            Hair and Eye Color of Statistics Students
Harman23.cor            Harman Example 2.3
Harman74.cor            Harman Example 7.4
Indometh                Pharmacokinetics of Indomethacin
InsectSprays            Effectiveness of Insect Sprays
JohnsonJohnson          Quarterly Earnings per Johnson & Johnson Share
LakeHuron               Level of Lake Huron 1875-1972
LifeCycleSavings        Intercountry Life-Cycle Savings Data
Loblolly                Growth of Loblolly pine trees
Nile                    Flow of the River Nile
Orange                  Growth of Orange Trees
OrchardSprays           Potency of Orchard Sprays
PlantGrowth             Results from an Experiment on Plant Growth
Puromycin               Reaction Velocity of an Enzymatic Reaction
Theoph                  Pharmacokinetics of Theophylline
Titanic                 Survival of passengers on the Titanic
ToothGrowth             The Effect of Vitamin C on Tooth Growth in
                        Guinea Pigs
UCBAdmissions           Student Admissions at UC Berkeley
UKDriverDeaths          Road Casualties in Great Britain 1969-84
UKLungDeaths            Monthly Deaths from Lung Diseases in the UK
UKgas                   UK Quarterly Gas Consumption
USAccDeaths             Accidental Deaths in the US 1973-1978
USArrests               Violent Crime Rates by US State
USJudgeRatings          Lawyers' Ratings of State Judges in the US
                        Superior Court
USPersonalExpenditure   Personal Expenditure Data
VADeaths                Death Rates in Virginia (1940)
WWWusage                Internet Usage per Minute
WorldPhones             The World's Telephones
ability.cov             Ability and Intelligence Tests
airmiles                Passenger Miles on Commercial US Airlines,
                        1937-1960
airquality              New York Air Quality Measurements
anscombe                Anscombe's Quartet of 'Identical' Simple Linear
                        Regressions
attenu                  The Joyner-Boore Attenuation Data
attitude                The Chatterjee-Price Attitude Data
austres                 Quarterly Time Series of the Number of
                        Australian Residents
beavers                 Body Temperature Series of Two Beavers
cars                    Speed and Stopping Distances of Cars
chickwts                Chicken Weights by Feed Type
co2                     Mauna Loa Atmospheric CO2 Concentration
crimtab                 Student's 3000 Criminals Data
datasets-package        The R Datasets Package
discoveries             Yearly Numbers of Important Discoveries
esoph                   Smoking, Alcohol and (O)esophageal Cancer
euro                    Conversion Rates of Euro Currencies
eurodist                Distances Between European Cities and Between
                        US Cities
faithful                Old Faithful Geyser Data
freeny                  Freeny's Revenue Data
infert                  Infertility after Spontaneous and Induced
                        Abortion
iris                    Edgar Anderson's Iris Data
islands                 Areas of the World's Major Landmasses
lh                      Luteinizing Hormone in Blood Samples
longley                 Longley's Economic Regression Data
lynx                    Annual Canadian Lynx trappings 1821-1934
morley                  Michelson Speed of Light Data
mtcars                  Motor Trend Car Road Tests
nhtemp                  Average Yearly Temperatures in New Haven
nottem                  Average Monthly Temperatures at Nottingham,
                        1920-1939
npk                     Classical N, P, K Factorial Experiment
occupationalStatus      Occupational Status of Fathers and their Sons
precip                  Annual Precipitation in US Cities
presidents              Quarterly Approval Ratings of US Presidents
pressure                Vapor Pressure of Mercury as a Function of
                        Temperature
quakes                  Locations of Earthquakes off Fiji
randu                   Random Numbers from Congruential Generator
                        RANDU
rivers                  Lengths of Major North American Rivers
rock                    Measurements on Petroleum Rock Samples
sleep                   Student's Sleep Data
stackloss               Brownlee's Stack Loss Plant Data
state                   US State Facts and Figures
sunspot.month           Monthly Sunspot Data, from 1749 to "Present"
sunspot.year            Yearly Sunspot Data, 1700-1988
sunspots                Monthly Sunspot Numbers, 1749-1983
swiss                   Swiss Fertility and Socioeconomic Indicators
                        (1888) Data
treering                Yearly Treering Data, -6000-1979
trees                   Girth, Height and Volume for Black Cherry Trees
uspop                   Populations Recorded by the US Census
volcano                 Topographic Information on Auckland's Maunga
                        Whau Volcano
warpbreaks              The Number of Breaks in Yarn during Weaving
women                   Average Heights and Weights for American Women

Tuesday, April 12, 2016

Oracle Advanced Analytics on Oracle Cloud

You have heard about the cloud? Right? Even the Oracle Cloud?

If you haven't, then maybe we need to look at how you can learn more about the Oracle Cloud.

Over the past while, and in the past few weeks in particular, Oracle has been advertising how you can get a trial Oracle cloud service set up for FREE. Well, it is free for one month when you set it up on the Oracle website (cloud.oracle.com).

As I like to talk about and use the Oracle Advanced Analytics (OAA) option (a lot), I thought I'd just give you some pointers on how to use OAA on the Oracle cloud.

To do this you need to set up an account on the Oracle cloud website (your Oracle single sign-on should help make that process a lot quicker). There are lots of websites and blogs that will walk you through the process. Then you need to select which Database as a Service you want to set up.

OAA is not available on the Database Schema Service just yet (maybe one day it will be).

Oracle Advanced Analytics comes pre-installed in the Oracle Enterprise Edition database (yes, it is a separately priced option) when you install it on your own servers, but for the Enterprise Edition DaaS, OAA is not part of it.

DaaS has the following versions

  • Standard Edition Service 
  • Enterprise Edition Service 
  • High Performance Service 
  • Extreme Performance Service

OAA is only available in the last two of these DaaS versions.

High Performance DaaS: Multitenant, Partitioning, Real Application Testing, Advanced Compression, Advanced Security, Label Security, Database Vault, OLAP, Advanced Analytics, Spatial and Graph, Diagnostics Pack, Tuning Pack, Database Lifecycle Management Pack, Data Masking & Subsetting Pack and Cloud Management Pack for Oracle Database.

Extreme Performance DaaS: In-Memory Database, RAC (Real Application Clusters), Active Data Guard, Multitenant, Partitioning, Real Application Testing, Advanced Compression, Advanced Security, Label Security, Database Vault, OLAP, Advanced Analytics, Spatial and Graph, Diagnostics Pack, Tuning Pack, Database Lifecycle Management Pack, Data Masking & Subsetting Pack and Cloud Management Pack for Oracle Database.

Oracle Advanced Analytics has two main products or components. The first is the in-database Oracle Data Mining features. These are part of the High Performance and Extreme Performance DaaS offerings. But Oracle R Enterprise is not installed on these DaaS. Well, it kind of is, if you can get an 11g DaaS, but at the time of writing this post ORE is not part of the 12c DaaS images. So you will need to factor in some time to go and install ORE, if you need to use it.

I've been lucky to have one of these DaaS with OAA trials and with thanks to Thomas Kurian he has extended these trials to 12 months for all Oracle ACE Directors. Thank you Thomas.

When you get your DaaS setup you just need to configure your connection privileges, ssh, etc and away you go. All you need to do is to move your data across the internet to your own Oracle DaaS, and once it is in the DaaS all your OAA and other analytics is performed on the Database Server. Only the results are returned to you and displayed in your tool. This significantly reduces the processing time for your data and removes the need to constantly extract your data (in whole or in parts) to feed into other advanced analytics tools.

So if you haven't tried Oracle Advanced Analytics yet, then go ahead and set up your free trial of Oracle DaaS and try it out. You never know what you might discover by using Oracle Advanced Analytics (in the cloud).

[Image]

Tuesday, February 2, 2016

ORE video : Demo Code Part 5

The following is the fifth and final set of demo code from my video on using R in the Oracle Database. Check out the video before using the following code. The blog post for the video will be updated to contain links to all blog posts that have the various demo code.

The following examples illustrate how you can use the Oracle R Enterprise capabilities within SQL and PL/SQL. They illustrate building a GLM model using the glm algorithm that comes with the R language, and then using this model to score or label new data that is stored in a table. The last part of the example illustrates how you can perform what-if analysis using this ORE model.

-- Build & save the R script, called Demo_GLM in the DB
--  This builds a GLM  DM model in the DB
--
Begin
   sys.rqScriptDrop('Demo_GLM');
   sys.rqScriptCreate('Demo_GLM',
      'function(dat,datastore_name) {
          mod <- glm(AFFINITY_CARD ~ CUST_GENDER + AGE + CUST_MARITAL_STATUS + COUNTRY_NAME + CUST_INCOME_LEVEL + EDUCATION + HOUSEHOLD_SIZE + YRS_RESIDENCE, dat, family = binomial())
       ore.save(mod, name=datastore_name, overwrite=TRUE)   }');
end;
/

--
-- After creating the script you need to run it to create the GLM model
--
select * 	
from table(rqTableEval(
             cursor(select CUST_GENDER,
                           AGE,
                           CUST_MARITAL_STATUS,
                           COUNTRY_NAME,
                           CUST_INCOME_LEVEL,
                           EDUCATION,
                           HOUSEHOLD_SIZE,
                           YRS_RESIDENCE,
                           AFFINITY_CARD
                    from mining_data_build_v),
             cursor(select 1 as "ore.connect", 'myDatastore' as "datastore_name" from dual),
                  'XML', 'Demo_GLM' ));


--
-- There are 2 ways to use the GLM model : in Batch and in Real-Time mode
--
-- First Step : Build the in-database R script to score you new data
--
Begin
   sys.rqScriptDrop('Demo_GLM_Batch');
   sys.rqScriptCreate('Demo_GLM_Batch',
      'function(dat, datastore_name) {
      ore.load(datastore_name)
      prd <- predict(mod, newdata=dat)
      prd[as.integer(rownames(prd))] <- prd
      res <- cbind(dat, PRED = prd)
      res}');
end;
/

-- 
-- Now you can run the script to score the new data in Batch model
--   The data is located in the table MINING_DATA_APPLY
--
select * from table(rqTableEval(
              cursor(select CUST_GENDER, AGE, CUST_MARITAL_STATUS, COUNTRY_NAME, CUST_INCOME_LEVEL, EDUCATION, HOUSEHOLD_SIZE, YRS_RESIDENCE
                     from   MINING_DATA_APPLY_V
                     where rownum <= 10),
             cursor(select 1 as "ore.connect", 'myDatastore' as "datastore_name" from dual),
             'select CUST_GENDER, AGE, CUST_MARITAL_STATUS, COUNTRY_NAME, CUST_INCOME_LEVEL, EDUCATION, HOUSEHOLD_SIZE, YRS_RESIDENCE, 1 PRED from MINING_DATA_APPLY_V','Demo_GLM_Batch'))
order by 1, 2, 3;

--
-- Now let us use the Demo_GLM_Batch script to score data in Real-Time
--  The data values are passed to the GLM model
--
select * from table(rqTableEval(
              cursor(select 'M' CUST_GENDER,
                            23 AGE,
                            'Married' CUST_MARITAL_STATUS, 
                            'United States of America' COUNTRY_NAME,
                            'B: 30,000 - 49,999' CUST_INCOME_LEVEL, 
                            'Assoc-A' EDUCATION,
                            '3' HOUSEHOLD_SIZE, 
                            5 YRS_RESIDENCE
                     from dual),
              cursor(select 'myDatastore' "datastore_name", 1 "ore.connect" from dual),
                    'select CUST_GENDER, AGE, CUST_MARITAL_STATUS, COUNTRY_NAME, CUST_INCOME_LEVEL, EDUCATION, HOUSEHOLD_SIZE, YRS_RESIDENCE, 1 PRED from MINING_DATA_APPLY',
              'Demo_GLM_Batch')) order by 1, 2, 3; 

Tuesday, January 19, 2016

ORE video : Demo Code Part 4

The following is the fourth set of demo code from my video on using R in the Oracle Database. Check out the video before using the following code. The blog post for the video will be updated to contain links to all blog posts that have the various demo code.

The following code example illustrates how you can build a Data Mining model using the in-database data mining algorithms. In this example a Decision Tree model is created. This model is then applied to new data, scoring this data with the predicted values.

> #
> # Build am in-database ODM Decision Tree
> #
> dtData <- ore.get("MINING_DATA_BUILD_V")
> # Create a ODM DT model in the DB : Only a temporary model. It is deleted when you logout
> dtModel <- ore.odmDT(AFFINITY_CARD ~ ., dtData)
> # View the details of the ODM model
> #summary(dtModel)
> names(dtModel)
 [1] "name"          "settings"      "attributes"    "costs"         "distributions”
 [6] "nodes"         "formula"       "extRef"        "call"         
> dtModel$name
 [1] "ORE$208_210”
> dtModel$settings
                          value
prep.auto                    on
impurity.metric   impurity.gini
term.max.depth                7
term.minpct.node           0.05
term.minpct.split           0.1
term.minrec.node             10
term.minrec.split            20
> dtModel$attributes
                 name        type data.type data.length precision scale is.target
1       AFFINITY_CARD categorical    number          22         0     0      TRUE
2                 AGE   numerical    number          22        NA    NA     FALSE
3 CUST_MARITAL_STATUS categorical  varchar2          20        NA    NA     FALSE
4           EDUCATION categorical  varchar2          21        NA    NA     FALSE
5      HOUSEHOLD_SIZE categorical  varchar2          21        NA    NA     FALSE
6          OCCUPATION categorical  varchar2          21        NA    NA     FALSE
7       YRS_RESIDENCE   numerical    number          22        NA    NA     FALSE
>

> ## Compute the Confusion Matrix
> dtResults <- predict(dtModel, dtData, "AFFINITY_CARD")
> with(dtResults, table(AFFINITY_CARD, PREDICTION))
             PREDICTION
AFFINITY_CARD    0    1
            0 1056   64
            1  201  179
> ## How do you persist the model in the DB
> ##     Rename and save the model in the database
> dtModel$name
 [1] "ORE$208_210"

> ## Save the ODM model in the in-database R datastore
> ore.save(dtModel, name = "ORE_MODELS", overwrite=TRUE)
> ore.load(name = "ORE_MODELS")
 [1] "dtModel"

> ## Score new data using the DM Model
> ore.sync(table = c("MINING_DATA_APPLY"))
> ore.ls()
 [1] "DEMO_R_APPLY_RESULT"   "DEMO_R_TABLE"          "DEMO_SUBSET_TABLE"    
 [4] "INSUR_CUST_LTV_SAMPLE" "MINING_DATA_APPLY"     "MINING_DATA_BUILD_V"  
 [7] "MINING_DATA_TEST_V"   > dtApply <- ore.get("MINING_DATA_APPLY")
> dim(dtApply)
 [1] 1500   18
> class(dtApply)
 [1] "ore.frame”
 attr(,"package")
 [1] "OREbase”
> DTAPPLY <- ore.push(dtApply)
> dtApplyResult <- predict(dtModel, DTAPPLY)
> head(dtApplyResult)
             '0'        '1' PREDICTION
100001 0.9521912 0.04780876          0
100002 0.9521912 0.04780876          0
100003 0.9521912 0.04780876          0
100004 0.9521912 0.04780876          0
100005 0.2633745 0.73662551          1
100006 0.9521912 0.04780876          0
> dim(dtApplyResult)
 [1] 1500    3
> dim(dtApply)
 [1] 1500   18
> dtResults <- cbind(dtApply, dtApplyResult)
> dim(dtResults)
 [1] 1500   21
> ore.drop(table = "DEMO_R_APPLY_RESULT")
> ore.create(dtApplyResult, table="DEMO_R_APPLY_RESULT")
> ## Run the following for the first time you will rename a model
> # ore.exec(paste("BEGIN
> #                  DBMS_DATA_MINING.RENAME_MODEL(model_name => '", dtModel$name, "',
> #                      new_model_name => 'DEMO_R_DT_MODEL'); END;",sep=""))
> ## Run the following to refresh an existing model
> ore.exec(paste("BEGIN
+ DBMS_DATA_MINING.DROP_MODEL('DEMO_R_DT_MODEL');
+ DBMS_DATA_MINING.RENAME_MODEL(model_name => '", dtModel$name,"',
+ new_model_name => 'DEMO_R_DT_MODEL');
+ END;",sep=""))

Wednesday, January 6, 2016

ORE video : Demo Code Part 2

The following is the second set of demo code from my video on using R in the Oracle Database. Check out the video before using the following code. The blog post for the video will be updated to contain links to all blog posts that have the various demo code.

The following code gives a very quick demonstration of using the ROracle R package to access the data in your Oracle schema. ROracle has a number of advantages over using RJDBC, and most of the advantages are about performance improvements. Typically when using ROracle you will see a many-fold improvement when selecting data and moving it to your R client, processing data in the database and also writing data back to the Oracle Database. In some tests you can see a 7 times improvement in performance over RJDBC. Now that is a big difference.

But the problem with ROracle is that it is only available on certain platforms/OS. For example, it is not officially available for the Mac. But if you google this issue carefully you will find unofficial ways of overcoming this problem.

ROracle is dependent on the Oracle Client. So you will need to have the Oracle Client installed on your machine and have it available on the search path.

When you have Oracle Client installed and the ROracle R package installed you are ready to start using it.

So here is the demo code from the video.

> library(ROracle)
> drv <- dbDriver("Oracle")
> # Create the connection string
> host <- "localhost"
> port <- 1521
> sid <- "orcl"
> connect.string <- paste("(DESCRIPTION=", "(ADDRESS=(PROTOCOL=tcp)(HOST=", host, ")(PORT=", port, "))",
>    "(CONNECT_DATA=(SID=", sid, ")))", sep = "")

> con <- dbConnect(drv, username = "dmuser", password = "dmuser",dbname=connect.string)

> rs <- dbSendQuery(con, "select view_name from user_views")
> # fetch records from the resultSet into a data.frame
> data <- fetch(rs)
> # extract all rows
> dim(data)
[1] 6 1
> data
                  VIEW_NAME
1       MINING_DATA_APPLY_V
2       MINING_DATA_BUILD_V
3        MINING_DATA_TEST_V
4  MINING_DATA_TEXT_APPLY_V
5  MINING_DATA_TEXT_BUILD_V
6   MINING_DATA_TEXT_TEST_V
> dbCommit(con)
> dbClearResult(rs)
> dbDisconnect(con)


Wednesday, December 30, 2015

ORE Video : Demo Code part 1

In a previous blog post I posted a video on using R with the Oracle Database and using Oracle R Enterprise. This is a part 1 extension of that blog post that gives the first set of demo code.

This first set of demonstration code is for using RJDBC to connect to the Oracle Database. Using RJDBC relies on using the JDBC jar file for Oracle. It is easily found in various installations of Oracle products and will be called something like ojdbc.jar. I like to take a copy of this file and place it in the root/home directory.

> library(RJDBC)
> # Create connection driver and open connection
> jdbcDriver <- JDBC(driverClass="oracle.jdbc.OracleDriver", classPath="c:/ojdbc6.jar")
> jdbcConnection <- dbConnect(jdbcDriver, "jdbc:oracle:thin:@//localhost:1521/orcl", "dmuser", "dmuser")
> #list the tables in the schema
> #dbListTables(jdbcConnection)
> #get the DB connections details - it get LOTS of info - Do not run unless it is really needed
> dbGetInfo(jdbcConnection)
> # Query on the Oracle instance name.
> #instanceName <- dbGetQuery(jdbcConnection, "SELECT instance_name FROM v$instance")
> #print(instanceName)
> tableNames <- dbGetQuery(jdbcConnection, "SELECT table_name from user_tables where  
                                                 table_name not like 'DM$%' and table_name not like 'ODMR$%'")
> print(tableNames)
              TABLE_NAME
1  INSUR_CUST_LTV_SAMPLE
2             OUTPUT_1_2
> viewNames <- dbGetQuery(jdbcConnection, "SELECT view_name from user_views")
> print(viewNames)
                  VIEW_NAME
1       MINING_DATA_APPLY_V
2       MINING_DATA_BUILD_V
3        MINING_DATA_TEST_V
4  MINING_DATA_TEXT_APPLY_V
5  MINING_DATA_TEXT_BUILD_V
6   MINING_DATA_TEXT_TEST_V

> v <- dbReadTable(jdbcConnection, "MINING_DATA_BUILD_V")
> names(v)
[1] "CUST_ID"                 "CUST_GENDER"             "AGE"                     
[4] "CUST_MARITAL_STATUS"     "COUNTRY_NAME"            "CUST_INCOME_LEVEL"       
[7] "EDUCATION"               "OCCUPATION"              "HOUSEHOLD_SIZE"         
[10] "YRS_RESIDENCE"           "AFFINITY_CARD"           "BULK_PACK_DISKETTES"    
[13] "FLAT_PANEL_MONITOR"      "HOME_THEATER_PACKAGE"    "BOOKKEEPING_APPLICATION”
[16] "PRINTER_SUPPLIES"        "Y_BOX_GAMES"             "OS_DOC_SET_KANJI" 
> dim(v)
[1] 1500   18
> summary(v)
    CUST_ID       CUST_GENDER             AGE        CUST_MARITAL_STATUS COUNTRY_NAME       
Min.   :101501   Length:1500        Min.   :17.00   Length:1500         Length:1500        
1st Qu.:101876   Class :character   1st Qu.:28.00   Class :character    Class :character   
Median :102251   Mode  :character   Median :37.00   Mode  :character    Mode  :character   
Mean   :102251                      Mean   :38.89                                          
3rd Qu.:102625                      3rd Qu.:47.00                                          
Max.   :103000                      Max.   :90.00                                          
CUST_INCOME_LEVEL   EDUCATION          OCCUPATION        HOUSEHOLD_SIZE     YRS_RESIDENCE    
Length:1500        Length:1500        Length:1500        Length:1500        Min.   : 0.000   
Class :character   Class :character   Class :character   Class :character   1st Qu.: 3.000   
Mode  :character   Mode  :character   Mode  :character   Mode  :character   Median : 4.000                                                                               
                                                                            Mean   : 4.089                                                                               
                                                                            3rd Qu.: 5.000                                                                               
                                                                            Max.   :14.000 
> hist(v$YRS_RESIDENCE)
> hist(v$AGE)
> dbDisconnect(jdbcConnection)
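If you also want to push an R data frame back into the schema over the same JDBC connection (i.e. before the dbDisconnect() call above), RJDBC provides a dbWriteTable() method. A rough sketch only, with a hypothetical table name:

> # Write the data frame back to the schema as a new table (table name is made up for this example)
> dbWriteTable(jdbcConnection, "MINING_DATA_COPY", v, overwrite=TRUE)
> dbListTables(jdbcConnection)   # the new table should now appear in the list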

Make sure to check out the other demonstration scripts that are shown in the video.

Tuesday, December 29, 2015

Oracle R Enterprise 1.5 (new release)

The Oracle Santa had a busy time just before Christmas with the release of several new versions of products. One of these was Oracle R Enterprise version 1.5.

Oracle R Enterprise (1.5) is part of the Oracle Advanced Analytics option for the enterprise edition of the Oracle Database.

As with every new release of a product there is a range of bug fixes. But ORE 1.5 also comes with some important new features, including:

  • A new Random Forest algorithm specific to ORE.
  • New ORE Data Store functions and privileges.
  • Partitioning on multiple columns for ore.groupApply (see the sketch after this list).
  • Multiple improvements to ore.summary.
  • Now performs parallel in-database execution for functions prcomp and svd.
  • BLOB and CLOB data types are now supported in some of the ORE functions.
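To give a feel for the ore.groupApply change listed above, here is a rough sketch of what partitioning on two columns could look like. This is only an illustration: the connection details are illustrative, it uses the MINING_DATA_BUILD_V ore.frame from the sample schema, and the exact behaviour of the INDEX argument should be checked against the ORE 1.5 documentation.

library(ORE)
ore.connect(user="dmuser", password="dmuser", host="localhost", sid="orcl", all=TRUE)

# Run the function once per (CUST_GENDER, CUST_MARITAL_STATUS) partition
res <- ore.groupApply(MINING_DATA_BUILD_V,
                      INDEX = MINING_DATA_BUILD_V[, c("CUST_GENDER", "CUST_MARITAL_STATUS")],
                      function(dat) {
                        data.frame(num_cust = nrow(dat), avg_age = mean(dat$AGE))
                      })

# Pull the per-partition results back to the client R session
ore.pull(res)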

Check out the ORE 1.5 Release Notes for more details on the new features.

ORE 1.5 is (for now) only certified on R 3.2.x, in both open source R and Oracle R Distribution 3.2.

Check out the ORE 1.5 Documentation.

You can download ORE 1.5 Server side and Client side software here.

Monday, December 21, 2015

Running R in the Oracle Database video

Earlier this year I was asked by the Business Analytics & Big Data SIG (of the UKOUG) to give a presentation on Oracle R Enterprise. Unfortunately I had already committed to giving the same presentation at the OUG Norway conference on the same day.

But then they asked me if I could record a video of the presentation and they would show it at the SIG. The following video is what I recorded.

At the UKOUG annual (2015) conference I was supposed to give a 2 hour presentation during their Super Sunday event. Unfortunately, due to a storm passing over Ireland on the Saturday, all flights going to the UK were cancelled. This meant that I would miss my 2 hour presentation.

Instead of trying to find an alternative speaker for my presentation slot at such short notice, the committee suggested that they would show the video.

Based on the feedback and the people who thanked me in person during the rest of the conference, I've decided to make it available to everyone. Hopefully you will find it useful.

The following are the links to the demo code that is shown or referred to in the video.

People have been asking me if the demo scripts I used in the video are available. You will probably find some of these on various blog posts. So to make it easier for everyone I will post the demo scripts in one or more blog posts over the coming weeks. When these are available I will update this blog post with the links.

I have a few new presentations on Oracle R Enterprise in 2016 so watch out for these at an Oracle User Group conference.

Saturday, December 12, 2015

KScope 2016 Acceptances

I've never been to KScope. Yes never.

I've always wanted to. Each year you hear of all of these stories about how much people really enjoy KScope and how much they learn.

So back in October I decided to submit 5 presentations to KScope. 4 of these are solo presentations and 1 is a joint presentation.

This week I have received the happy news that 2 of my solo presentations have been accepted, plus my joint presentation with Kim Berg Hansen.

So at the end of June 2016 I will be making my way to Chicago for a week of Oracle geekie fun at KScope.

My presentations will be:

  • Is Oracle SQL the best language for Statistics?
  • Running R in your Oracle Database using Oracle R Enterprise

and my joint presentation is called

Forecasting in Oracle using the Power of SQL (this will talk about ROracle, Forecasting in R, Using Oracle R Enterprise and SQL)

I was really hoping that one of my rejected presentations would have been accepted. I really enjoy this presentation and I get to share stories about some of my predictive analytics projects. Ah well, maybe in 2017.

The last time I was in Chicago was over 15 years ago when I spent 5 days in Cellular One (the brand was sold to Trilogy Partners by AT&T in 2008 shortly after AT&T had completed its acquisition of Dobson Communications). I was there to kick off a project to build them a data warehouse and to build their first customer churn predictive model. I stayed in a hotel across the road from their office which was famous because a certain person had stayed in it while on the run. Unfortunately I didn't get time to visit downtown Chicago.

Thursday, November 5, 2015

Slides from my OOW15 Presentations

At Oracle Open World (OOW15) I gave 2 presentations on the Sunday during the Oracle User Group Forum. The slides are now available for download from the Oracle Open World website.

Go get them now!

More Than Another 12 on Oracle Database 12c [UGF3190]

During this session I was one of 16 presenters talking about various features in the Oracle Database. All of the presenters were from the EOUC region.

Real Business Value from Big Data and Advanced Analytics [UGF4519]

I co-presented with Antony Heljula from Peak Indicators. During this presentation we talked about some of the Advanced Analytics projects we have worked on over the past 18-24 months. We also announced a new Analytics-as-a-Service offering.

The slides are also available for most of the other Oracle Open World Presentations and these can be accessed here. Just go search for the topic you are interested in.

Check out my previous blog post that summarises just a small part of what I got up to at OOW15.


Friday, August 14, 2015

Managing ORE in-database Data Stores using SQL

When working with ORE you will end up creating a number of different data stores in the database. And as your data science team grows, the number of data stores can become very large.

When you install Oracle R Enterprise you will get a number of views that are made available to ORE users to see what ORE Data stores they have and what objects exist in them. All using SQL.

Perhaps some of the time the ORE developers and data analysts will use the set of ORE functions to manage the in-database ORE Data stores. These include:

  • ore.save
  • ore.load
  • ore.datastore
  • ore.datastoreSummary
  • ore.delete

When using these ORE functions the schema user/data scientist can see what ORE Data stores they have. You can use ore.delete to delete an ORE Data store when it is no longer needed.
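For completeness, here is a small sketch of that R-side workflow. It assumes an ORE connection is already open; the datastore name and the saved object are just illustrative.

# Save a local data.frame into a named datastore in the database
cars_ds <- data.frame(mtcars)
ore.save(cars_ds, name="ORE_DS", description="Example of ORE Datastore")

ore.datastore()                        # list the datastores in the current schema
ore.datastoreSummary(name="ORE_DS")    # list the objects saved in one datastore
ore.load(name="ORE_DS")                # restore the saved objects into the R session

ore.delete(name="ORE_DS")              # drop the datastore when it is no longer needed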

But the problem here is that over time your schemas can get a bit clogged up with ORE Data stores. Particularly when the data scientist is no longer working on the project or there is no need to maintain the ORE Data stores. This is common on data science projects where you might have a number of data scientists working in/sharing the one database schema.

For a DBA, whose role will be to clean up the ORE Data stores that are no longer needed, there are 4 options.

The first of these is, if all the ORE Data stores exist in the data scientist's schema and nothing else in the schema is needed, you can just go ahead and drop the schema.

The second option is to log into the schema using SQL and drop the ORE Data stores. See an example of this below.

The third option is to connect to the Oracle schema using R and ORE and then use the ore.delete function to drop the ORE Data stores.

The fourth option is to connect to the RQSYS schema. This schema is the owner of the views used to query the ORE Data stores in each schema. After the RQSYS schema was created it was locked as part of the ORE installation. You, as the DBA, will need to unlock it and then connect.

The following SQL lists the ORE Data stores that were created for that schema.

column dsname format a20
column description format a35

SELECT * FROM rquser_DataStoreList;

DSNAME                     NOBJ     DSSIZE CDATE     DESCRIPTION
-------------------- ---------- ---------- --------- -----------------------------------
ORE_DS                        2       5104 04-AUG-15 Example of ORE Datastore
ORE_FOR_DELETION              1       1675 14-AUG-15 Need to Delete this ORE Datastore
ORE_DS2                       5   51466509 04-AUG-15 DS for all R Env Data

You can also view what objects have been saved in the ORE Data stores.

column objname format a15
column class format a15
SELECT * FROM rquser_DataStoreContents;
 
DSNAME               OBJNAME         CLASS              OBJSIZE     LENGTH       NROW       NCOL
-------------------- --------------- --------------- ---------- ---------- ---------- ----------
ORE_DS               CARS_DATA       ore.frame             1306         11         32         11
ORE_DS               cars_ds         data.frame            3798         11         32         11
ORE_DS2              cars_ds         data.frame            3798         11         32         11
ORE_DS2              cars_ore_ds     ore.frame             1675         11         32         11
ORE_DS2              sales_ds        data.frame        51455575          7     918843          7
ORE_DS2              usa_ds          ore.frame             2749         23      18520         23
ORE_DS2              usa_ds2         ore.frame             2712         23      18520         23
ORE_FOR_DELETION     cars_ore_ds     ore.frame             1675         11         32         11

To drop an ORE Data store for your current schema you can use the rqDropDataStore SQL function.

BEGIN
   rqDropDataStore('ORE_FOR_DELETION');
END;
/

For the DBA, when you unlock and connect to the RQSYS schema you will be able to see all the ORE Data stores in the database. The views will contain an additional column.

But if you use the above SQL function to delete an ORE Data store it will not work. This is because this SQL function will only drop an ORE Data store if it exists in your schema. If we have connected to the RQSYS schema we will not have any ORE Data stores in it.

We can create a procedure that will allow us to delete/drop any ORE Data store in any schema.

create or replace PROCEDURE my_ORE_Datastore_Drop(
  ds_owner  in VARCHAR2,
  ds_name  IN VARCHAR2
  )
IS
  del_objIds rqNumericSet;
BEGIN
  del_objIds := rq$DropDataStoreImpl(ds_owner, ds_name);
  IF del_objIds IS NULL THEN
    raise_application_error(-20101, 'DataStore ' ||
                            ds_name || ' does not exist');
  END IF;

  -- remove from rq$datastoreinventory
  BEGIN
    execute immediate
       'delete from RQ$DATASTOREINVENTORY c where c.objID IN (' ||
       'select column_value from table(:del_objIds))' using del_objIds;
     COMMIT;
  EXCEPTION WHEN others THEN null;
  END;
END;
/

We, as the DBA logged into the RQSYS schema, can now delete any ORE Data store in the database using the following.

BEGIN
   my_ORE_Datastore_Drop('ORE_USER', 'ORE_FOR_DELETION');
END;
/

Monday, July 13, 2015

V506 of Oracle OBIEE SampleApp Virtual Machine

A few days ago Oracle released the latest version of the Virtual Machine for OBIEE SampleApp. The current version has a number of new features and new product versions (see below).

To get this latest version, go to the following link to download the VM files and to install. As always this is a beast of a VM and you should only consider the install and setup if you have the disk space and, in particular, 16GB of RAM.

Oracle Business Intelligence Enterprise Edition Samples on OTN.


v506 New Features

  • DB 12.1.0.2 with the In-Memory option
  • Load and process JSON data
  • Integrates with Big Data SQL
  • Connects to Impala
  • Session tracking in UT
  • Exalytics Aggregation Functions
  • Lots of new Visualizations
  • Custom Style Features
  • Hierarchical Session Variables
  • etc.

Software on V506

  • Oracle Enterprise Linux 6.5 x64
  • OBIEE 11.1.1.9 GA (two distinct OBIEE instances), Essbase 11.1.2.4, updated BIMAD
  • Oracle MapViewer 11.1.1.9.1
  • Oracle BICS Data Sync v1
  • Oracle Database 12c IMDB 12.1.0.2, PDB Install, AWM 12.1.0.2a, APEX 4.2.6 & ORDS 2.0.1, ODM, Oracle Spatial and Graph
  • ORE 1.4.1 & R 3.1.1
  • ENDECA 3.1, Server 7.6.1, Studio 3.1, Provisioning Services
  • Cloudera CDH 5.1.2, Oracle BigData SQL, Oracle BigData Connectors
  • Plug and Play Companions : EPM 11.1.2.3, BIApps Demos
  • Utils: Start scripts, MapBuilder, SQLDev 4.1