Showing posts with label oracle data mining. Show all posts
Showing posts with label oracle data mining. Show all posts

Monday, April 1, 2019

Data Normalization in Oracle Data Mining

Normalization is the process of scaling continuous values down to a specific range, often between zero and one. Normalization transforms each numerical value by subtracting a number, called the shift, and dividing the result by another number called the scale. The normalization techniques include:
  • Min-Max Normalization : There is where the normalization is based on the using the minimum value for the shift and the (maximum-minimum) for the scale.
  • Scale Normalization : This is where the normalization is based on zero being used for the shift and the value calculated using max[abs(max), abs(min)] being used for the scale
  • Z-Score Normalization : This is where the normalization is based on using the mean value for the shift and the standard deviation for the scale.
When using Automatic Data Processing the normalization functions are used. But sometimes you may want to process the data is a more explicit manner. To do so you can use the various normalization function. To use these there is a three stage process. The first stage involves the creation of a table that will contain the normalization transformation data. The second stage applies the normalization procedures to your data source, defines the normalization required and inserts the required transformation data  into the table create during the first stage. The third stage involves the defining of a view that applies the normalization transformations to your data source and displays the output via a database view. The following example illustrates how you can normalize the AGE and YRS_RESIDENCE attributes. The input data source will be the view that was created as the output of the previous transformation (MINING_DATA_V_2). This is passed on the original MINING_DATA_BUILD_V data set. The final output from this transformation step and all the other data transformation steps is MINING_DATA_READY_V.

BEGIN
   -- Clean-up : Drop the previously created tables
   BEGIN
      execute immediate 'drop table TRANSFORM_NORMALIZE';
   EXCEPTION
      WHEN others THEN
         null;
   END;

   -- Stage 1 : Create the table for the transformations
   -- Perform normalization for: AGE and YRS_RESIDENCE
   dbms_data_mining_transform.CREATE_NORM_LIN (
      norm_table_name => 'MINING_DATA_NORMALIZE');       

   -- Step 2 : Insert the normalization data into the table
   dbms_data_mining_transform.INSERT_NORM_LIN_MINMAX (
      norm_table_name => 'MINING_DATA_NORMALIZE',
      data_table_name => 'MINING_DATA_V_2',
      exclude_list    => DBMS_DATA_MINING_TRANSFORM.COLUMN_LIST (
                         'affinity_card',
                         'bookkeeping_application',
                         'bulk_pack_diskettes',
                         'cust_id',
                         'flat_panel_monitor',
                         'home_theater_package',
                         'os_doc_set_kanji',
                         'printer_supplies',
                         'y_box_games'));

   -- Stage 3 : Create the view with the transformed data
   DBMS_DATA_MINING_TRANSFORM.XFORM_NORM_LIN (
      norm_table_name => 'MINING_DATA_NORMALIZE',
      data_table_name => 'MINING_DATA_V_2',
      xform_view_name => 'MINING_DATA_READY_V');

END;
/
 
The above example performs normalization based on the Minimum-Maximum values of the variables/columns. The other normalization functions are:

INSERT_NORM_LIN_SCALEInserts linear scale normalization definitions in a transformation definition table.
INSERT_NORM_LIN_ZSCOREInserts linear zscore normalization definitions in a transformation definition table.

Monday, December 12, 2016

Renaming & Commenting Oracle Data Mining Models

As your company evolves with their data mining projects, the number of models produced and in use in production will increase dramatically.

Care needs to be taken when it comes to managing these. This includes using meaningful names, adding descriptions of what the model is about or for, and being able to track their usage, etc.

I will look at tracking the usage of the models in another blog post, but the following gives examples of how to rename Oracle Data Mining models and how to add comments or descriptions to these models. This is particularly useful because our data analytics teams have a constant turn over or it has been many months since you last worked on a model and you want a quick idea of what purpose of the model was for.

If you have been using the Oracle Data Mining tool (part of SQL Developer) will will see your model being created with some sort of sequencing numbers. For example for a Support Vector Machine (SVM) model you might see it labelled for classification:

CLAS_SVM_5_22

While you are working on this project you will know and understand what it was about and why it is being used. But afterward you may forget as you will be dealing with many hundreds of models. Yes you could check your documentation for the purpose of this model but that can take some time.

What if you could run a SQL query to find out?

But first we need to rename the model.

DBMS_DATA_MINING.RENAME_MODEL('CLAS_SVM_5_22', 'HIGH_VALUE_CHURN_CLAS_SVM');

Next we will want to add a longer description of what the model is about. We can do this by adding a comment to the model.

COMMENT ON MINING MODEL high_value_churn_clas_svm IS
'Classification Model to Predict High Value Customers most likely to Churn';

We can now see these updated details when we query the Oracle Data Mining models in a user schema.

SELECT model_name, mining_function, algorithm, comments 
FROM user_mining_models;

These are two very useful commands.

Monday, December 5, 2016

Evaluating Cluster Dispersion in Oracle Data Mining

When working with the Clustering algorithms, and particularly k-Means, in the Oracle Data Miner tool there is no way of seeing how compact or dispersed the data is within a cluster.

There are a number of measures typically used in various tools and algorithms, but with Oracle Data Miner we are not presented with any of this information.

But if we flip from using the Oracle Data Miner tool to using SQL we can get to see some more details of the clusters produced by the k-Means algorithm along with some additional and useful information.

As I said there are a number of different measures used to evaluate clusters. The one that Oracle uses is called Dispersion. Now there are a few different definitions of what this could be and I haven't been able to locate what is Oracle's own definition of it in any of the documentation.

We can use the Dispersion value as a measure of how compact or how spread out the data is within a cluster. The Dispersion value is a number greater than 0. The lower the value of the more compact the cluster is i.e. the data points are close the the centroid of the cluster. The larger the value the more disperse or spread out the data points are.

The DBMS_DATA_MINING PL/SQL package comes with a function called GET_MODEL_DETAILS_KM. This function returns a record of the form DM_CLUSTERS.

(id                   NUMBER,
 cluster_id           VARCHAR2(4000),
 record_count         NUMBER,
 parent               NUMBER,
 tree_level           NUMBER,
 dispersion           NUMBER,
 split_predicate      DM_PREDICATES,
 child                DM_CHILDREN,
 centroid             DM_CENTROIDS,
 histogram            DM_HISTOGRAMS,
 rule                 DM_RULE)

We can not use the following query to get the Dispersion value for each of the clusters from an ODM cluster model.

SELECT cluster_id,
       record_count,
       parent,
       tree_level,
       dispersion
FROM  table(dbms_data_mining.get_model_details_km('CLUS_KM_3_2'));
NewImage

Monday, November 14, 2016

Using the Identity column for Oracle Data Miner

If you are a user of the Oracle Data Miner tool (the workflow data mining tool that is part of SQL Developer), then you will have noticed that for many of the algorithms you can specify a Case Id attribute along with, say, the target attribute.

NewImage

The idea is that you have one attribute that is a unique identifier for each case record. This may or may not be the case in your data model and you may have a multiple attribute primary key or case record identifier.

But what is the Case Id field used for in Oracle Data Miner?

Based on the documentation this field does not need to have a value. But it is recommended that you do identify an attribute for the Case Id, as this will allow for reproducible results. What this means is that if we run our workflow today and again in a few days time, on the exact same data, we should get the same results. So the Case Id allows this to happen. But how? Well it looks like the attribute used or specified for the Case Id is used as part of the Hashing algorithm to partition the data into a train and test data set, for classification problems.

So if you don't have a single attribute case identifier in your data set, then you need to create one. There are a few options open to you to do this.

  • Create one: write some code that will generate a unique identifier for each of your case records based on some defined rule.
  • Use a sequence: and update the records to use this sequence.
  • Use ROWID: use the unique row identifier value. You can write some code to populate this value into an attribute. Or create a view on the table containing the case records and add a new attribute that will use the ROWID. But if you move the data, then the next time you use the view then you will be getting different ROWIDs and that in turn will mean we may have different case records going into our test and training data sets. So our workflows will generate different results. Not what we want.
  • Use ROWNUM: This is kind of like using the ROWID. Again we can have a view that will select ROWNUM for each record. Again we may have the same issues but if we have our data ordered in a way that ensures we get the records returned in the same order then this approach is OK to use.
  • Use Identity Column: In Oracle 12c we have a new feature called Identify Column. This kind of acts like a sequence but we can defined an attribute in a table to be an Identity Column, and as records are inserted into the the data (in our scenario our case table) then this column will automatically generate a unique number for our data. Again if we need to repopulate the case table, you will need to drop and recreate the table to get the Identity Column to reset, otherwise the newly inserted records will start with the next number of the Identity Column

Here is an example of using the Identity Column in a case table.

CREATE TABLE case_table (
id_column	NUMBER GENERATED ALWAYS AS IDENTITY,
affinity_card 	NUMBER,
age		NUMBER,
cust_gender	VARCHAR2(5),
country_name	VARCHAR2(20)
...
);

You can now use this Identity Column as the Case Id in your Oracle Data Miner workflows.

NewImage

Tuesday, October 25, 2016

Oracle Text, Oracle R Enterprise and Oracle Data Mining - Part 5

In this 5th blog post in my series on using the capabilities of Oracle Text, Oracle R Enterprise and Oracle Data Mining to process documents and text, I will have a look at some of the machine learning features of Oracle Text.

Oracle Text comes with a number of machine learning algorithms. These can be divided into two types. The first is called 'Supervised Learning' where we have two machine learning algorithms for classification type of problem. The second type is called 'Unsupervised Learning' where we have the ability to use clustering machine learning algorithms to look for patterns in our text documents and to find similarities between documents based on their contents.

It is this second type of document clustering that I will work through in this blog post.

When using clustering with text documents, the machine learning algorithm will look for patterns that are common between the documents. These patterns will include the words used, the frequency of the words, the position or ordering of these words, the co-occurance of words, etc. Yes this is a large an complex task and that is why we need a machine learning algorithm to help us.

With Oracle Text we only have one clustering machine learning algorithm available to use. When we move onto using the Oracle Advanced Analytics Option (Oracle Data Mining and Oracle R Enterprise) we more algorithms available to us.

With Oracle Text the clustering algorithm is called k-Means. In a way the actual algorithm is unimportant as it is the only one available to us when using Oracle Text. To use this algorithm we have the CTX_CLS.CLUSTERING procedure. This procedure takes the documents we want to compare and will then identify the clusters (using hierarchical clustering) and will then tells us, for each document, what clusters the documents belong to and they probability value. With clustering a document (or a record) can belong to many clusters. Typically in the text books we see clusters that are very distinct and are clearly separated from each other. When you work on real data this is never the case. We will have many over lapping clusters and a data point/record can belong to one or more clusters. This is why we need the probability vale. We can use this to determine what cluster our record belongs to most and what other clusters it is associated with.

Using the example documents that I have been using during this series of blog posts we can use the CTX_CLS.CLUSTERING algorithm to cluster and identify similarities in these documents.

We need to setup the parameters that will be used by the CTX_CLS.CLUSTERING procedure. Tell it to use the k-Means algorithm and then the number of clusters to generate. As with all Oracle Text procedures or algorithms there are a number of settings you can configure or you can just accept the default values.

exec ctx_ddl.drop_preference('Cluster_My_Documents');
exec ctx_ddl.create_preference('Cluster_My_Documents','KMEAN_CLUSTERING');
exec ctx_ddl.set_attribute('Cluster_My_Documents','CLUSTER_NUM','3');

The code above is an example of the basics of what you need to setup for clustering. Other attribute or cluster parameter setting available to you include, MAX_DOCTERMS, MAX_FEATURES, THEME_ON, TOKEN_ON, STEM_ON, MEMORY_SIZE and SECTION_WEIGHT.

Now we can run the CTX_CLS.CLUSTERING procedure on our documents. This procedure has the following parameters.

- The Oracle Text Index Name

- Document Id Column Name

- Document Assignment (cluster assignment) Table Name. This table will be created if it doesn't already exist

- Cluster Description Table Name. This table will be created if it doesn't already exist.

- Name of the Oracle Text Preference (list)

exec ctx_cls.clustering(
'MY_DOCUMENTS_OT_IDX',
'DOC_PK',
'OT_CLUSTER_RESULTS',
'DOC_CLUSTER_DETAILS',
'Cluster_My_Documents');

When the procedure has completed we can now examine the OT_CLUSTER_RESULTS and the DOC_CLUSTER_DETAILS tables. The first of these (OT_CLUSTER_RESULTS) allows us to see what documents have been clustered together. The following is what was produced for my documents.

SELECT d.doc_pk, 
       d.doc_title, 
       r.clusterid, 
       r.score 
FROM my_documents d, 
     ot_cluster_results r 
WHERE d.doc_pk = r.docid;

NewImage

We can see that two of the documents have been grouped into the same cluster (ClusterId=2). If you have a look back at what these documents are about then you can see that yes these are very similar. For the other two documents we can see that they have been clustered into separate clusters (ClusterId=4 & 5). The clustering algorithms have said that they are different types of documents. Again when you examine these documents you will see that they are talking about different topics. So the clustering process worked !

You can also explore the various features of the clusters by looking that he DOC_CLUSTER_DETAILS table. Although the details in this table are not overly useful but it will give you some insight into what clusters the k-Means algorithm has produced.

Hopefully I've shown you how easy it is to setup and use the clustering feature of Oracle Text.

WARNING: Before using the Clustering or Classification with Oracle Text, you need to check with your local Oracle Sales representative about if there is licence implication. There seems to be some mentions the the algorithms used are those that come with Oracle Data Mining. Oracle Data Mining is a licence cost option for the database. So make sure you check before you go using these features.

Monday, August 8, 2016

Oracle Text, Oracle R Enterprise and Oracle Data Mining - Part 2

This is the second blog post of a series on using Oracle Text, Oracle R Enterprise and Oracle Data Mining. Check out the first blog post of the series, as the data used in this blog post was extracted, processed and stored in a databases table.

In this blog post I will show you how you use Oracle R Enterprise and the embedded R execution features of ORE to use the text from the webpages and to create a word cloud. This is a useful tool to be able to see visually what words can stand out most on your webpage and if the correct message is being put across to your customers.

Prerequisites: You will need to load the following R packages into your R environment 'tm', 'word cloud' 'SnowballC'. These are required to process the following R code segments.

install.packages (c( "tm", "wordcloud", "SnowballC"))
library (tm)
library (wordcloud)
library (SnowballC)

Select data from table and prepare: We need to select the data from the table in our schema and to merge it into one variable.

local_data <- ore.pull(MY_DOCUMENTS)

tm_data <-""
for(i in 1:nrow(local_data)) {
  tm_data <- paste(tm_data, local_data[i,]$DOC_TEXT, sep=" ")
}
tm_data

Create function to perform Text Mining: In my previous blog post on creating a word cloud I gave the R code. In order to allow for this R code to be run on the database server (using the embedded R execution of ORE) we need to package this text mining R code up into a ORE user defined R script. This is stored in the database.

ore.scriptDrop("prepare_tm_data")
ore.scriptCreate("prepare_tm_data", function (tm_data) { 
  library(tm)
  library(SnowballC)
  library(wordcloud)
  
  txt_corpus <- Corpus (VectorSource (tm_data))
  
  # data clean up
  tm_map <- tm_map (txt_corpus, stripWhitespace) # remove white space
  tm_map <- tm_map (tm_map, removePunctuation) # remove punctuations
  tm_map <- tm_map (tm_map, removeNumbers) # to remove numbers
  tm_map <- tm_map (tm_map, removeWords, stopwords("english")) # to remove stop words
  tm_map <- tm_map (tm_map, removeWords, c("work", "use", "java", "new", "support"))
  
  # prepare matrix of words and frequency counts
  Matrix <- TermDocumentMatrix(tm_map) # terms in rows
  matrix_c <- as.matrix (Matrix)
  freq <- sort (rowSums (matrix_c)) # frequency data
  
  res <- data.frame(words=names(freq), freq)
  wordcloud (res$words, res$freq, max.words=100, min.freq=3, scale=c(7,.5), random.order=FALSE, colors=brewer.pal(8, "Dark2"))
} ) 

Before we can run this user define R script, we need to ensure that we have the 'tm', 'SnowballC' and 'wordcloud' R packages installed on the Oracle Database server. On the Oracle Database server you need to rune ORE.

> library(ORE)

Then run the following command to install these R packages

> install_packages(c('tm','wordcloud', 'SnowballC'))

Run the function on the DB Server: You are now ready to run the function. In an earlier step we had gathered the data. Now we can pass this data to the in-database R script.

> res <- ore.doEval(FUN.NAME="prepare_tm_data", tm_data=tm_data)

The ore.doEval function is a general purpose ORE function. In this case we pass it two parameters. The first parameter is the neame of the user defined R script stored in the database, and the second parameter is the data. The function returns and ORE object that contains the word cloud graphic.

Display the results: You can very easily display the results.

> res

This gives us the following graphic.

NewImage

In my next blog post, of this series, I will show you how you can use the function created above and some other bits and pieces, using some other features of ORE and also in SQL.

Tuesday, July 26, 2016

Oracle Text, Oracle R Enterprise and Oracle Data Mining - Part 1

A project that I've been working on for a while now involves the use of Oracle Text, Oracle R Enterprise and Oracle Data Mining. Oracle Text comes with your Oracle Database licence. Oracle R Enterprise and Oracle Data Mining are part of the Oracle Advanced Analytics (extra cost) option.

What I will be doing over the course of 4 or maybe 5 blog posts is how these products can work together to help you gain a grater insight into your data, and part of your data being large text items like free format text, documents (in various forms e.g. html, xml, pdf, ms word), etc.

Unfortunately I cannot show you examples from the actual project I've been working on (and still am, from time to time). But what I can do is to show you how products and components can work together.

In this blog post I will just do some data setup. As with all project scenarios there can be many ways of performing the same tasks. Some might be better than others. But what I will be showing you is for demonstration purposes.

The scenario: The scenario for this blog post is that I want to extract text from some webpages and store them in a table in my schema. I then want to use Oracle Text to search the text from these webpages.

Schema setup: We need to create a table that will store the text from the webpages. We also want to create an Oracle Text index so that this text is searchable.

drop sequence my_doc_seq;
create sequence my_doc_seq;

drop table my_documents;

create table my_documents (
doc_pk number(10) primary key, 
doc_title varchar2(100), 
doc_extracted date, 
data_source varchar2(200), 
doc_text clob);

create index my_documents_ot_idx on my_documents(doc_text) 
indextype is CTXSYS.CONTEXT;

In the table we have a number of descriptive attributes and then a club for storing the website text. We will only be storing the website text and not the html document (More on that later). In order to make the website text searchable in the DOC_TEXT attribute we need to create an Oracle Text index of type CONTEXT.

There are a few challenges with using this type of index. For example when you insert a new record or update the DOC_TEXT attribute, the new values/text will not be reflected instantly, just like we are use to with traditional indexes. Instead you have to decide when you want to index to be updated. For example, if you would like the index to be updated after each commit then you can create the index using the following.

create index my_documents_ot_idx on my_documents(doc_text) 
indextype is CTXSYS.CONTEXT
parameters ('sync (on commit)');

Depending on the number of documents you have being committed to the DB, this might not be for you. You need to find the balance. Alternatively you could schedule the index to be updated by passing an interval to the 'sync' in the above command. Alternatively you might want to use DBMS_JOB to schedule the update.

To manually sync (or via DBMS_JOB) the index, assuming we used the first 'create index' statement, we would need to run the following.

EXEC CTX_DDL.SYNC_INDEX('my_documents_ot_idx');

This function just adds the new documents to the index. This can, over time, lead to some fragmentation of the index, and will require it to the re-organised on a semi-regular basis. Perhaps you can schedule this to happen every night, or once a week, or whatever makes sense to you.

BEGIN
  CTX_DDL.OPTIMIZE_INDEX('my_documents_ot_idx','FULL');
END;

(I could talk a lot more about setting up some basics of Oracle Text, the indexes, etc. But I'll leave that for another day or you can read some of the many blog posts that already exist on the topic.)

Extracting text from a webpage using R: Some time ago I wrote a blog post on using some of the text mining features and packages in R to produce a word cloud based on some of the Oracle Advanced Analytics webpages. I'm going to use the same webpages and some of the same code/functions/packages here. The first task you need to do is to get your hands on the 'htmlToText function. You can download the htmlToText function on github. This function requires the 'Curl' and 'XML' R packages. So you may need to install these. I also use the str_replace_all function ("stringer' R package) to remove some of the html that remains, to remove some special quotes and to replace and occurrences of '&' with 'and'. # Load the function and required R packages source("c:/app/htmltotext.R") library(stringr)
data1 <- str_replace_all(htmlToText("http://www.oracle.com/technetwork/database/options/advanced-analytics/overview/index.html"), "[\r\n\t\"\'\u201C\u201D]" , "")
data1 <- str_replace_all(data1, "&", "and")
data2 <- str_replace_all(str_replace_all(htmlToText("http://www.oracle.com/technetwork/database/options/advanced-analytics/odm/index.html"), "[\r\n\t\"\'\u201C\u201D]" , ""), "&", "and")
data2 <- str_replace_all(data2, "&", "and")
data3 <- str_replace_all(str_replace_all(htmlToText("http://www.oracle.com/technetwork/database/database-technologies/r/r-technologies/overview/index.html"), "[\r\n\t\"\'\u201C\u201D]" , ""), "&", "and")
data3 <- str_replace_all(data3, "&", "and")
data4 <- str_replace_all(str_replace_all(htmlToText("http://www.oracle.com/technetwork/database/database-technologies/r/r-enterprise/overview/index.html"), "[\r\n\t\"\'\u201C\u201D]" , ""), "&", "and")
data4 <- str_replace_all(data4, "&", "and")
We now have the text extracted and cleaned up. Create a data frame to contain all our data: Now that we have the text extracted, we can prepare the other data items we need to insert the data into our table ('my_documents'). The first stept is to construct a data frame to contain all the data.
data_source = c("http://www.oracle.com/technetwork/database/options/advanced-analytics/overview/index.html",
                 "http://www.oracle.com/technetwork/database/options/advanced-analytics/odm/index.html",
                 "http://www.oracle.com/technetwork/database/database-technologies/r/r-technologies/overview/index.html",
                 "http://www.oracle.com/technetwork/database/database-technologies/r/r-enterprise/overview/index.html")
doc_title = c("OAA_OVERVIEW", "OAA_ODM", "R_TECHNOLOGIES", "OAA_ORE")
doc_extracted = Sys.Date()
data_text <- c(data1, data2, data3, data4)

my_docs <- data.frame(doc_title, doc_extracted, data_source, data_text)

Insert the data into our database table: With the data in our data fram (my_docs) we can now use this data to insert into our database table. There are a number of ways of doing this in R. What I'm going to show you here is how to do it using Oracle R Enterprise (ORE). The thing with ORE is that there is no explicit functionality for inserting and updating records in a database table. What you need to do is to construct, in my case, the insert statement and then use ore.exec to execute this statement in the database.

library(ORE)
ore.connect(user="ora_text", password="ora_text", host="localhost", service_name="PDB12C", 
            port=1521, all=TRUE) 

for(i in 1:nrow(my_docs)) {
  insert_stmt <- "BEGIN insert_tab_document ('"
  insert_stmt <- paste(insert_stmt,  my_docs[i,]$doc_title, sep="")
  insert_stmt <- paste(insert_stmt, "', '",  my_docs[i,]$doc_extracted, "'", sep="")
  insert_stmt <- paste(insert_stmt, ", '",  my_docs[i,]$data_source, sep="")
  insert_stmt <- paste(insert_stmt, "', '",  my_docs[i,]$data_text, "');", " END;", sep="")
  ore.exec(insert_stmt)
}
ore.exec("commit")

You can now view the inserted webpage text using R or using SQL.

In my next blog post in this series, I will look at how you can use the ORE embedded features to read and process this data.

Wednesday, June 3, 2015

PMML in Oracle Data Mining

PMML (Predictive Model Markup Langauge) is an XML formatted output that defines the core elements and settings for your Predictive Models. This XML formatted output can be used to migrate your models from one data mining or predictive modelling tool to another data mining or predictive modelling tool, such as Oracle.

Using PMML to migrate your models from one tool to another allows for you to use the most appropriate tools for developing your models and then allows them to be imported into another tool that will be used for deploying your predictive models in batch or real-time mode. In particular the ability to use your Predictive Model within your everyday applications enables you to work in the area of Automatic or Prescriptive Analytics. Oracle Data Mining and the Oracle Database are ideal or even the best possible tools to allow for Automatic and Prescriptive Analytics for your transa

PMML is an XML based standard specified by the Data Mining Group

Oracle Data Mining supports the importing of PMML models that are compliant with version 3.1 of the standard and for Regression Models only. The regression models can be for linear regression or binary logistic regression.

The Data Mining Group Archive webpage have a number of sample PMML files for you to download and then to load into your Oracle database.

To Load the PMML file into your Oracle Database you can use the DBMS_DATA_MINING.IMPORT_MODEL function. I’ve given examples of how you can use this function to import an Oracle Data Mining model that was exported using the EXPORT_MODEL function.

The syntax of the IMPORT_MODEL function when importing a PMML file is the following

DBMS_DATA_MINING.IMPORT_MODEL (
      model_name        IN  VARCHAR2,
      pmmldoc           IN  XMLTYPE
      strict_check      IN  BOOLEAN DEFAULT FALSE);

The following example shows how you can load the version 3.1 Logistic Regression PMML file from the Data Mining Group archive webpage

NewImage

 

BEGIN    
   dbms_data_mining.IMPORT_MODEL (‘PMML_MODEL',
        XMLType (bfilename (‘IMPORT_DIR', 'sas_3.1_iris_logistic_reg.xml'),
          nls_charset_id ('AL32UTF8')
        ));
END;

 

This example uses the default value for STRICT_CHECK as FALASE. In this case if there are any errors in the PMML structure then these will be ignored and the imported model may contain “features” that may make it perform in a slightly odd manner.

Monday, May 4, 2015

Oracle Data Miner (ODM 4.1) New Features

With the release of SQL Developer 4.1 we also get a number of new features with Oracle Data Miner (ODMr). These include:

  • Data Source node can now include data sources that contain JSON data, generating JSON schema and has a JSON viewer
  • Create Table can now create data in JSON
  • JSON Query Node allows you to view, query and process JSON data, combine it with relational data, generate sub-group by, and nested columns to be part of input to algorithms
  • New PL/SQL APIs for managing Data Miner projects and workflows. This includes run, cancel, rename, delete, import and export of workflows using PL/SQL.
  • New ODMr Repository views that allows us to query and monitor our workflows.
  • Transformation Node now allows you different ways of handling NULLS.
  • Transformation Node now allows us to create Custom Bins, define bin labels and bin values
  • Overall Workflow and ODMr environment improvements to allow for greater efficiency in workflow behaviour and interactions with the database. So using ODMr should feel quicker and more responsive.

What out for the Gotchas: Although support for JSON has been added to ODMr, as outlined above, you are still a bit limited to what else you can do with your JSON data. Based on the documentation you can use JSON data in the Association and Classification build nodes.

I'm not sure about the other nodes and this will need a bit of investigation to see what nodes can and cannot use JSON data. I'm sure this will all be sorted out in the next release.

Keep an eye out for some blog posts over the coming weeks on how to explore and use these new features of Oracle Data Miner.

Thursday, April 30, 2015

Viewing Models Details for Decision Trees using SQL

When you are working with and developing Decision Trees by far the easiest way to visualise these is by using the Oracle Data Miner (ODMr) tool that is part of SQL Developer.
Developing your Decision Tree models using the ODMr allows you to explore the decision tree produced, to drill in on each of the nodes of the tree and to see all the statistics etc that relate to each node and branch of the tree.
But when you are working with the DBMS_DATA_MINING PL/SQL package and with the SQL commands for Oracle Data Mining you don't have the same luxury of the graphical tool that we have in ODMr. For example here is an image of part of a Decision Tree I have and was developed using ODMr.
Blog dt 1
What if we are not using the ODMr tool? In that case you will be using SQL and PL/SQL. When using these you do not have luxury of viewing the Decision Tree.
So what can you see of the Decision Tree? Most of the model details can be used by a variety of functions that can apply the model to your data. I've covered many of these over the years on this blog.
For most of the data mining algorithms there is a PL/SQL function available in the DBMS_DATA_MINING package that allows you to see inside the models to find out the settings, rules, etc. Most of these packages have a name something like GET_MODEL_DETAILS_XXXX, where XXXX is the name of the algorithm. For example GET_MODEL_DETAILS_NB will get the details of a Naive Bayes model. But when you look through the list there doesn't seem to be one for Decision Trees.
Actually there is and it is called GET_MODEL_DETAILS_XML. This function takes one parameter, the name of the Decision Tree model and produces an XML formatted output that contains the attributes used by the model, the overall model settings, then for each node and branch the attributes and the values used and the other statistical measures required for each node/branch.
The following SQL uses this PL/SQL function to get the Decision Tree details for model called CLAS_DT_1_59.
SELECT dbms_data_mining.get_model_details_xml('CLAS_DT_1_59')
FROM dual;

If you are using SQL Developer you will need to double click on the output column and click on the pencil icon to view the full listing.
Blog dt 2
Nothing too fancy like what we get in ODMr, but it is something that we can work with.
If you examine the XML output you will see references to PMML. This refers to the Predictive Model Markup Language (PMML) and this is defined by the Data Mining Group (www.dmg.org). I will discuss the PMML in another blog post and how you can use it with Oracle Data Mining.

Friday, April 24, 2015

Changing REVERSE Transformations in Oracle Data Miner

In my previous blog post I showed you how you can have a look at the transformations that the Automatic Data Preparation (ADP) feature of Oracle Data Mining produces. I also gave some example of the different types of ADF that are performed for different algorithms.

One of the features of the transformations produced is that it will generate a REVERSE_EXPRESSION. This will take the scored results and apply the inverse of the transformation that was performed when the data was being prepared for input to the algorithm.

Somethings you may want to have the scored data returned in a slightly different ways or labeled in a slightly different way.

In this blog post I will show you how to define an alternative REVERSE_EXPRESSION for an attribute.

The function we need to use for this is the ALTER_REVERSE_EXPRESSION procedure that is part of the DBMS_DATA_MINING package.

When we score data for a typical classification problem we typically use 0 (zero) and 1 to be the target variable values. But what if we wanted the output from our classification model to label the scored data slighted differently.

In this case we can use the ALTER_REVERSE_EXPRESSION procedure to define the new values. What if we wanted the zero to be labeled as NO and the 1 as YES. In this case we can use the following.

BEGIN

    dbms_data_mining.alter_reverse_expression(

       model_name => 'CLAS_NB_1_59',

       expression => 'decode(affinity_card, ''1'', ''YES'', ''NO'')',

       attribute_name => 'AFFINITY_CARD');

END;

When we view the transformations for our data mining model we can now see the transformation.

Blog dat trans 3

Now when we score our data the predicted target variable will now have our newly defined values.

SELECT cust_id,

        PREDICTION(CLAS_NB_1_59 USING *) PRED

FROM mining_data_apply_v

FETHC FIRST 5 ROWS ONLY;

Blog dat trans 4

You can see that this is a very powerful feature and allows use to turn the scored data values is a different way to make them more useful. This is particularly the case as we work towards a more Automatic type of Predictive Analytics.

Saturday, April 18, 2015

ODM : View Transformations generated by Automatic Data Prepreparation

A very powerful feature of Oracle Data Mining and one that I think does not get enough notice is called Automatic Data Preparation.

Data Preparation is one of the most time consuming, repetitive and boring parts of the work that a Data Miner or Data Scientist performs as part of their daily tasks. Apart from gathering the data, integrating the data, getting the data into the required formation the most interesting part of the work is with feature engineering.

Then you have all the other boring data preparation tasks of how to handle missing data, type conversion, binning, normalization, outlier treatment etc.

With Automatic Data Preparation (ADP) in Oracle Data Mining you can let Oracle work all of these things out for you and to perform all the necessary coding and to store all of this coding as part of the in-database data mining model.

This is Fantastic. This ADP feature can same you hours and in some cases days of effort.

But (there is always a but :-) ) what if you are a bit unsure if the transformations that are being performed are exactly what you would wanted. Maybe you would like to see what Oracle is doing and depending on this you can do it a different way.

The first step is to examine the transformations that are generated by stored as part of the in-database data mining model. The DBMS_DATA_MINING package has a function called GET_MODEL_TRANSFORMATIONS. When you query this function, passing in the name of the data mining model, you will get returned the list of transformations that have been applied to each model.

In the following example a GLM model was created using the Oracle Data Miner tool (that is part of SQL Developer). When you use Oracle Data Miner, ADP is automatically turned on.

The following query calls the GET_MODEL_TRANSFORMATIONS function with the data mining model called CLAS_GLM_1_59/.

SELECT * FROM TABLE(DBMS_DATA_MINING.GET_MODEL_TRANSFORMATIONS('CLAS_GLM_1_59'));

The following image contains the output generated by this query.

Blog dat trans 1

When you look at the data under the EXPRESSION column we get to see what the ADP did to the data. In most of the cases there are just some simple data clean-up being performed and formatting for getting the data ready for input into the algorithm.

If we now look at the Naive Bayes model for the same data set we get a very different sent of transformations being listed under the EXPRESSION column.

SELECT * FROM TABLE(DBMS_DATA_MINING.GET_MODEL_TRANSFORMATIONS('CLAS_NB_1_59'));

Blog dat trans 2

Now we get to see some of the data binning that ADP performs and is required for input to the Naive Bayes algorithm. You will also notices that we also have some transformations in the REVERSE_EXPRESSION column. These are the inverse or reverse of the transformation that was generated in the EXPRESSION column.

I will let you explore the data transformations that are produced by ADP for the SVM and Decision Tree algorithms.

I will show you how you change the reverse expression in my next blog post, as there are times when you might want the data to be presented slightly differently after the model has been run to score your data.

To get more details of what Automatic Data Preparation is performed for each data mining algorithm you can check out this link in the 11g documentaion. This section seems to be missing from the online 12c documentation.

Tuesday, December 16, 2014

ODMr 4.1 EA1 Repository Upgrade

If you are downloading the EA1 of SQL Developer that includes Oracle Data Miner (ODMr), and you intend to use Oracle Data Miner then you will need to update the ODMr Repository.

You could do it the hard way and run the upgrade repository sql scripts that are located in the ...\sqldeveloper-4.1.0.17.29-no-jre\sqldeveloper\dataminer\scripts directory.

Or you could do it the easy way and let the inbuilt functionality in Oracle Data Miner do it for you.

To do it the easy way all you need to do is to open the ODMr Connections window and the double click on one of your ODM connections.

ODMr will check the version of the repository you have installed and if needed it will prompt you about upgrading the repository. Select Yes and you will be prompted to enter the SYS password. So talk kindly with your DBA for them to enter the password for you. Then click on the Start button. They will lick off the OMDr Repository Upgrade scripts.

NB: Make sure you have a backup of your workflows before you do this. A little think happened to me during the SQL Dev / ODMr 4.0 upgrade back in September 2013 where all my workflows disappeared. You can imagine how happy I was about that. Since then the ODMr team have added some functionality to ensure something like this doesn't happen again. But you never know.

To backup your ODMr workflows use the Export Workflow option.

When the repository upgrade has finished you will get a 'Task Complete Successfully' message in the upgrade window. Click on the close button and away you go with this updated version.

Check out this blog post for details of what is new in ODMr 4.1.

Friday, November 7, 2014

ODMr : Graph Node: Zooming in on Graphs

When Oracle Data Miner (ODMr) 4.0 (which is part of SQL Developer) came out back in late 2013 there was a number of new features added to the tool. One of these was a Graph node that allows us to create various graphs and charts that include Line, Scatter, Bar, Histogram and Box plot.

I've been using this node recently to produce graphs and particularly scatter plots. I've been using the scatter plots to graph the Actual values in a data set against the Predicted values that was generated by ODMr. In this scenario I had a separate data set for training my ODM data mining models and another testing data set for, well testing how well the model performed against an unseen data set.

In general the graphs produced by the Graph node look good and gives you the information that you need. But what I found was that as you increased the size of the data set, the scatter plot can look a messy. This was in part due to the size of the square used to represent a data point. As the volume of data increased then your scatter plot could just look like a coloured in area of blue squares. This is illustrated in the following image.

Graph node 1

What I discovered today is that you can zoom in on this graph to explore different regions and data point on it. This do this you need to select an data that is within the x-axis and y-axis area. When you do this you will see a box form on your graph that selects the area that you indicate by moving your mouse. After you have finished selecting the area, the Graph Node will zooms into this part of the graph and shows the data points. For example if I select the area from about 1000 on the x-axis and 1000 on the y-axis, I will get the following.

Graph node 2

Again if I select a similar are area of 350 on the x-axis and 400 on the y-axis I get the following zoomed area.

Graph node 3

You can keep zooming in on various areas.

At some point you will have finished zooming in and you will want to return to the original graph. To zoom back outward all you need to do in the graph is to click on it. When you do this you will go back to the previous step or image of the graph. You can keep doing this until you get back to the original graph. Alternatively you can zoom in and out on various parts of the graph.

Hopefully you will find this feature useful.

Monday, September 15, 2014

Oracle Advanced Analytics sessions at OOW14

With Oracle Open World just a few days away now, I was going through the list of presentations that are focused on using the Oracle Advanced Analytics Option. These will cover Oracle Data Miner and Oracle R Enterprise.

So I've decided to share this list with you :-) and hopefully I will get to see you are some or all of these sessions.

DateTimeLocationPresentation Title
Sunday 28th Sept.9:00-9:45Moscone South Room 304What Are They Thinking? With Oracle Application Express and Oracle Data Miner [UGF2861]. (This is my presentation with Roel Hartman.)
Tuesday 30th Sept.17:00-17:45Intercontinental - Grand Ballroom CAdvanced Predictive Analytics for Database Developers on Oracle [CON7977]
Tuesday 30th Sept.18:00-18:45Moscone South - 303Oracle’s Big Data Management System [MTE9350]
Wednesday 1st Oct.10:15-11:00Moscone South - 301Big Data and Predictive Analytics: Fiserv Data Mining Case Study [CON8631]
Wednesday 1st Oct.10:30-10:50Big Data Theater, Moscone South, Big Data ShowcaseBig Data: Maximize the Business Impact with Oracle Advanced Analytics [THT10395]
Wednesday 1st Oct.11:30-12:15Moscone South - 300A Perfect Storm: Oracle Big Data Science for Enterprise R and SAS Users [CON8331]
Wednesday 1st Oct.12:45-13:30Moscone West - 3002Predictive Analytics with Oracle Data Mining [CON8596]
Wednesday 1st Oct.14:00-14:45Moscone South - 308Developing Relevant Dining Visits with Oracle Advanced Analytics at Olive Garden [CON2898]

If I have missed any sessions then do please let me know and I can update the list above.

Thursday, March 20, 2014

Issues with using latest release of ODM

The title of this blog post makes it sound more dramatic than it actually is.

The reason for this blog post is down to me receiving a recent comment on the blog, plus having received numerous emails and a recent OTN Discussion Forum topic for Oracle Data Mining.

The main thing that they have in common is that if I use the latest version of Oracle Data Mining (ODM) it tells me that I need to upgrade my ODM Repository. What impact will this have?

The ODM Repository stores lots of information about the workflows you create using the (free) Oracle Data Mining tool that comes as part of SQL Developer. Yes you do have to pay for the OAA option, so is it really free? Well some part are like the explore node and the graph node.

If you download and want to use the latest version of the ODM tool or you want to try it out before rolling it out to others then you will need to upgrade your ODM repository.

And this the problem that people are facing.

If you upgrade then the ODM Repository it is updated to work with the latest version of the ODM tool. But what happens to everyone else who is using the previous release of the tool? The answer to that is they can no longer use ODM against their database.

Why is that? Well the version of the tool is tied to a version of the Repository. If you upgrade to the newer tool and repository then your older versions of the ODM tool no longer work.

The result of all of this is that you cannot have a mixture of versions of the ODM tool (SQL Developer) being used in your team/company.

There is a very simple solution to all of this. Everyone uses the same version of the ODM tool (i.e. the same version of SQL Developer). For example your team might be using SQL Dev 4 that was released last December. But in early March there was a new patch release 4.1. In order to use this new version of the tool all of your team needs to start using it at the same time. The first person to use it will be prompted to migrate the ODM repository. This is automatically done once you enter the password for SYS.

But in some teams this is not possible to do, you want to try out the tool to see that it works correctly before getting others to use it. The way around this is to have a separate database and use it for your testing. You can easily copy across your workflows and ODM objects to the test database.

This might not be possible for everyone, so what can you do. Create a Virtual Machine and try it out on your own desktop is one way.

The answer to this problem is not ideal, but hopefully you have a better idea of why things are happening this way and what you can or cannot do about it.

Like I said at the topic of this blog post that the title is a bit more dramatic than is really the case :-)


My next blog post will be on another question I've been asked a few times and this is 'When I go to use the ODM tool it tells me that the Oracle Text feature of Oracle needs to be enabled'

Wednesday, December 11, 2013

Running PL/SQL Procedures in Parallel

As your data volumes increase, particularly as you evolve into the big data world, you will be start to see that your Oracle Data Mining scoring functions will start to take longer and longer.  To apply an Oracle Data Mining model to new data is a very quick process. The models are, what Oracle calls, first class objects in the database. This basically means that they run Very quickly with very little overhead.

But as the data volumes increase you will start to see that your Apply process or scoring the data will start to take longer and longer. As with all OLTP or OLAP environments as the data grows you will start to use other in-database features to help your code run quicker. One example of this is to use the Parallel Option.

You can use the Parallel Option to run your Oracle Data Mining functions in real-time and in batch processing mode. The examples given below shows you how you can do this.

Let us first start with some basics. What are the typical commands necessary to setup our schema or objects to use Parallel. The following commands are examples of what we can use

ALTER session enable parallel dml;
ALTER TABLE table_name PARALLEL (DEGREE 8);
ALTER TABLE table_name NOPARALLEL;
CREATE TABLE … PARALLEL degree …
ALTER  TABLE … PARALLEL degree …
CREATE INDEX … PARALLEL degree …
ALTER  INDEX … PARALLEL degree …

You can force parallel operations for tables that have a degree of 1 by using the force option.

ALTER SESSION ENABLE PARALLEL DDL;
ALTER SESSION ENABLE PARALLEL DML;
ALTER SESSION ENABLE PARALLEL QUERY;

alter session force parallel query PARALLEL 2

You can disable parallel processing with the following session statements.

ALTER SESSION DISABLE PARALLEL DDL;
ALTER SESSION DISABLE PARALLEL DML;
ALTER SESSION DISABLE PARALLEL QUERY;

We can also tell the database what degree of Parallelism to use


ALTER SESSION FORCE PARALLEL DDL PARALLEL 32;
ALTER SESSION FORCE PARALLEL DML PARALLEL 32;
ALTER SESSION FORCE PARALLEL QUERY PARALLEL 32;


 


Using your Oracle Data Mining model in real-time using Parallel


When you want to use your Oracle Data Mining model in real-time, on one record or a set of records you will be using the PREDICTION and PREDICTION_PROBABILITY function. The following example shows how a Classification model is being applied to some data in a view called MINING_DATA_APPLY_V.


column prob format 99.99999
SELECT cust_id,
       PREDICTION(DEMO_CLASS_DT_MODEL USING *)  Pred,
       PREDICTION_PROBABILITY(DEMO_CLASS_DT_MODEL USING *) Prob
FROM   mining_data_apply_v
WHERE  rownum <= 18
/


   CUST_ID       PRED      PROB
---------- ---------- ---------
    100574          0    .63415
    100577          1    .73663
    100586          0    .95219
    100593          0    .60061
    100598          0    .95219
    100599          0    .95219
    100601          1    .73663
    100603          0    .95219
    100612          1    .73663
    100619          0    .95219
    100621          1    .73663
    100626          1    .73663
    100627          0    .95219
    100628          0    .95219
    100633          1    .73663
    100640          0    .95219
    100648          1    .73663
    100650          0    .60061


If the volume of data warrants the use of the Parallel option then we can add the necessary hint to the above query as illustrated in the example below.


SELECT /*+ PARALLEL(mining_data_apply_v, 4) */
       cust_id,
       PREDICTION(DEMO_CLASS_DT_MODEL USING *)  Pred,
       PREDICTION_PROBABILITY(DEMO_CLASS_DT_MODEL USING *) Prob
FROM   mining_data_apply_v
WHERE  rownum <= 18
/


If you turn on autotrace you will see that Parallel was used. So you should now be able to use your Oracle Data Mining models to work on a Very large number of records and by adjusting the degree of parallelism you can improvements.


Using your Oracle Data Mining model in Batch mode using Parallel


When you want to perform some batch scoring of your data using your Oracle Data Mining model you will have to use the APPLY procedure that is part of the DBMS_DATA_MINING package. But the problem with using a procedure or function is that you cannot give it a hint to tell it to use the parallel option. So unless you have the tables(s) setup with parallel and/or the session to use parallel, then you cannot run your Oracle Data Mining model in Parallel using the APPLY procedure.


So how can you get the DBMA_DATA_MINING.APPLY procedure to run in parallel?


The answer is that you can use the DBMS_PARALLEL_EXECUTE package. The following steps walks you through what you need to do to use the DMBS_PARALLEL_EXECUTE package to run your Oracle Data Mining models in parallel.


The first step required is for you to put the DBMS_DATA_MINING.APPLY code into a stored procedure. The following code shows how our DEMO_CLASS_DT_MODEL can be used by the APPLY procedure and how all of this can be incorporated into a stored procedure called SCORE_DATA.


create or replace procedure score_data
is
begin


dbms_data_mining.apply(
  model_name => 'DEMO_CLAS_DT_MODEL',
  data_table_name => 'NEW_DATA_TO_SCORE',
  case_id_column_name => 'CUST_ID',
  result_table_name => 'NEW_DATA_SCORED');


end;
/


Next we need to create a Parallel Task for the DBMS_PARALLEL_EXECUTE package. In the following example this is called ODM_SCORE_DATA.


-- Create the TASK
  DBMS_PARALLEL_EXECUTE.CREATE_TASK ('ODM_SCORE_DATA');


Next we need to define the Parallel Workload Chunks details

 -- Chunk the table by ROWID
DBMS_PARALLEL_EXECUTE.CREATE_CHUNKS_BY_ROWID('ODM_SCORE_DATA', 'DMUSER', 'NEW_DATA_TO_SCORE', true, 100);
The scheduled jobs take an unassigned workload chunk, process it and will then move onto the next unassigned chunk. 
 
Now you are ready to execute the stored procedure for your Oracle Data Mining model, in parallel by 10.

DECLARE
   l_sql_stmt   varchar2(200);
BEGIN
   -- Execute the DML in parallel
   l_sql_stmt := 'begin score_data(); end;';
  
   DBMS_PARALLEL_EXECUTE.RUN_TASK('ODM_SCORE_DATA', l_sql_stmt, DBMS_SQL.NATIVE,
                                  parallel_level => 10);
END;
/


When every thing is finished you can then clean up and remove the task using



BEGIN
   dbms_parallel_execute.drop_task('ODM_SCORE_DATA');
END;
/


 

NOTE: The schema that will be running the above code will need to have the necessary privileges to run DBMS_SCHEDULER, for example


grant create job to dmuser;

Friday, June 7, 2013

DBMS_PREDICTIVE_ANALYTICS & Predict

In this blog post I will look at the PREDICT procedure that is part of the DBMS_PREDICTIVE_ANALTYICS package. This package allows you to perform data mining in an automated way without having to go through the steps of building, testing and scoring data.

I had a previous blog post that showed how to use the EXPLAIN function to create an Attribute Importance model.

The predictive analytics procedures analyze and prepare the input data, create and test mining models using the input data, and then use the input data for scoring. The results of scoring are returned to the user. The models and supporting objects are not persisted and are removed from the database when the procedure is finished.

The PREDICT procedure should only be used for a Classification problem and data set.

The PREDICT procedure create a model based on the supplied data (out input table) and a target value,  and returns scored data set in a new table. When using PREDICT you do not get to select an algorithm to use.

The input data source should contain records that already have the target value populated.  It can also contain records where you do not have the target value. In this case the PREDICT function will use the records that have a target value to generate the model. This model will then score all records a the predicted target value

The syntax of the PREDICT procedure is:

DBMS_PREDICTIVE_ANALYTICS.PREDICT (
   accuracy OUT NUMBER,
   data_table_name IN VARCHAR2,
   case_id_column_name IN VARCHAR2,
   target_column_name IN VARCHAR2,
   result_table_name IN VARCHAR2,
   data_schema_name IN VARCHAR2 DEFAULT NULL);

Where

Parameter Name Description
accuracy This output parameter from the procedure. You do not pass anything into this parameter. The Accuracy value returned is the predictive confidence of the model generated/used by the PREDICT procedure
data_table_name The name of the table that contains the data you want to use
case_id_column_name The case id for each record. This is unique for each record/case.
target_column_name The name of the column that contains the target column to be predicted
result_table_name The name of the table that will contain the results. This table should not exist in your schema, otherwise an error will occur.
data_schema_name The name of the schema where the table containing the input data is located. This is probably in your current schema, so you can leave this parameter NULL.

The PREDICT procedure will produce an output tables (result_table_name parameter) and will contain 3 attributes.

CASE_ID This is the Case Id of the record from the original data_table_name. This will allow you to link up the data in the source table to the prediction in the result_table_name
PREDICTION This will be the predicted value of the target attribute
PROBABILITY This is the probability of the prediction being correct

Using the sample example data set that I have given in previous blog posts and in the blog post on the EXPLAIN procedure, the following code illustrates how to use the PREDICT procedure.

set serveroutput on

DECLARE
   v_accuracy NUMBER(10,9);
BEGIN
   DBMS_PREDICTIVE_ANALYTICS.PREDICT(
      accuracy => v_accuracy,
      data_table_name => 'mining_data_build_v',
      case_id_column_name => 'cust_id',
      target_column_name => 'affinity_card',
      result_table_name => 'PA_PREDICT');
   DBMS_OUTPUT.PUT_LINE('Accuracy of model = ' || v_accuracy);
END;

image

This took about 15 seconds to run on my laptop, which is surprisingly quick given all the work that is doing internally. To see the predictions and the results from the PREDICT procedure, you will need to query the PA_PREDICT table.

image

The final step that you might be interested in is to compare the original target value with the prediction value.

SELECT v.cust_id,
       v.affinity_card,
       p.prediction,
       p.probability
FROM   mining_data_build_v  v,
       pa_predict p
WHERE  v.cust_id = p.cust_id
AND    rownum <= 12;

image

Remember we do not get to see how or what Oracle did to generate these results. We do not get the opportunity to tune the process and the model.

So you have to be careful when you use the PREDICT function and on what data. Would you use this as a way to explore your data and to see if predictive analytics/data mining might be useful for your? Yes it would. Would you use it in a production scenario? the answer is maybe but it depends on the scenario. In reality if you want to do this in a production environment you will put some work into developing data mining models that best fit your data. To do this you will need to move onto the ODM tool and the DBMS_DATA_MINING package. But the PREDICT function is a quick way to get some small data scored (in some way) based on your existing data. If your marketing department says they want to start a tele marketing campaign in a couple of hours then PREDICT is what you need to use. It may not give you the most accurate of results, but it does give you results that you can start using quickly.

Tuesday, March 5, 2013

Clustering in Oracle Data Miner–Part 4

This is a the fourth part of a five (5) part blog post on building and using Clustering in Oracle Data Miner. The following outlines the contents of each post in this series on Clustering.
  1. The first part we looked at what clustering features exist in ODM and how to setup the data that we will be using in the examples
  2. The second part will focus on how to building Clusters in ODM .
  3. The third post will focus on examining the clusters produced by ODM and how to use the Clusters to apply to new data using ODM.
  4. The fourth post will look at how you can build and evaluate a Clustering model using the ODM SQL and PL/SQL functions.
  5. The fifth and final post will look at how you can apply your Clustering model to new data using the ODM SQL and PL/SQL functions.
With Clustering we are trying to find hidden patterns in our data. Unlike classification we a not directing the algorithms on what areas/attributes to focus on.
In my preceding post on Clustering in Oracle Data Miner I showed how you can use the Oracle Data Miner (ODM) tool, that is part of SQL Developer, to build a cluster model and to apply it to new data.
As I have shown in previous blog posts we can do all of the same steps using SQL and PL/SQL. In this blog post I will show you how to to these SQL and PL/SQL functions to generate a Clustering model and how you can evaluate the model produced.
Step 1 – What Cluster Models to we already have
Oracle Data Miner comes with a number of data dictionary tables. We can use these to see what already exists in our schema. In the example below we will have a look to see the Cluster models that were produced in the Part 2 blog post.
format model_name format a20
format mining_function format a20
format algorithm format a20

SELECT model_name,
       mining_function, 
       algorithm, 
       build_duration, 
       model_size
FROM ALL_MINING_MODELS
WHERE mining_function = 'CLUSTERING';

image
We can also look at the model settings that ODM produced. We can look at the settings of the K-Means model. This model (in my case) is called CLUS_KM_1_25.
column setting_value format a20
select setting_name,
         setting_value,
         setting_type
from  all_mining_model_settings
where model_name = 'CLUS_KM_1_25'

image
We can also look to see what attributes are used in the K-Mean model.
SELECT attribute_name, 
       attribute_type, 
       usage_type, 
       target
from  all_mining_model_attributes
where model_name = 'CLUS_KM_1_25'

image
I will show you how to use this Clustering model or the one that we are about to generate in my next blog post.
Step 2 – Creating the Setting table
When creating an Oracle Data Mining model in SQL you need to create Setting table. This will contain a record for each setting they you need to set for the algorithm you want to run. The settings table has a very basic structure and only consists of two columns. It is a good idea to create a separate Settings table for each algorithm or type of data mining you want to do. To create the settings table we run the following:
CREATE TABLE CLUSTER_SETTINGS (
Setting_Name  VARCHAR2(30),
Setting_Value VARCHAR2(4000));

The next step is to define the setting we want to use. The first of these will be the actual algorithm you want to run. Oracle has two Clustering algorithms: algo_kmeans and algo_o_cluster. In the examples we are working through we are using the K-Mean algorithm and we want to set the number of clusters to 10. We are also going to set the Automatic Data Preparation (ADP) on. Here is the code.
BEGIN
   INSERT INTO CLUSTER_SETTINGS (setting_name, setting_value)
   VALUES (dbms_data_mining.algo_name, dbms_data_mining.ALGO_KMEANS);
   INSERT INTO CLUSTER_SETTINGS (setting_name, setting_value)
   VALUES (dbms_data_mining.prep_auto, dbms_data_mining.PREP_AUTO_ON);
   INSERT INTO CLUSTER_SETTINGS (setting_name, setting_value)
   VALUES (dbms_data_mining.clus_num_clusters, 10);
  
   COMMIT;
END;

We can check to see that these records were inserted.
column setting_value format a25
select * from CLUSTER_SETTINGS;
image
The other settings will be dependent on what clustering algorithm is being used. Each one has their own setting. If you do not define any additional setting Oracle will use the in-build default settings.
To see what the default settings:
column setting_value format a35
select *
from table(dbms_data_mining.get_default_settings)
where setting_name like 'KM%';

image
If you want to use a different value to the default, just create a new record in the CLUSTER_SETTINGS table with the new value.

Step 3 – Create the new Cluster Model
We have the algorithm settings ready, so the next steps is to create the Cluster model. We will be using the DBMS_DATA_MINING.CREATE_MODEL function and pointing it to our CLUSTER_SETTINGS table.
BEGIN
   DBMS_DATA_MINING.CREATE_MODEL(
      model_name          => 'CLUSTER_KMEANS_MODEL',
      mining_function     => dbms_data_mining.clustering,
      data_table_name     => 'INSURANCE_CUST_LTV',
      case_id_column_name => 'CUSTOMER_ID',
      target_column_name  => null,
      settings_table_name => 'CLUSTER_SETTINGS');
END;

This takes a few seconds to run on my laptop. When finished we can uses queries like we used in Step 1 above to find out the details of what was produced.
Step 4 – Examining the Cluster model
The first thing that we will need to check is that the model was created.
format model_name format a20
format mining_function format a15
format algorithm format a15

SELECT model_name, 
       mining_function,
       algorithm,
       build_duration,
       model_size
FROM ALL_MINING_MODELS
WHERE mining_function = 'CLUSTERING';

image
One thing that is a bit curious is the mode size. We have basically created two K-means models using the same settings, but the final models have different sizes. See below for an explanation.
Now lets look at the algorithms settings for our new Cluster model (Cluster KMeans Model)
select setting_name,
       setting_value,
       setting_type
from  all_mining_model_settings
where model_name = 'CLUSTER_KMEANS_MODEL'

image
We can also look at the attributes used in the clusters.
SELECT attribute_name, 
       attribute_type, 
       usage_type, 
       target
from  all_mining_model_attributes
where model_name = 'CLUSTER_KMEANS_MODEL'
image
If we compare this list back to the list of attributes that were part of the model that we developed in the ODM tool, you will see that we have some extra attributes listed for the new K-means model.
The reason for these extra attributes and the bigger size (although still small) will be due to us having applied a sampling in the ODM tool. We took a subset of the data in the ODM tool and built the model based on this data. For the model we just created in SQL we took all the data (no sampling). That is why it would have looked at including more attributes because it was looking at a much bigger data set, etc.
Step 5 – Examining the Cluster Details
In this step we are going to look at some of the clusters produced by our new Cluster model.
We can produce the information relating to the clusters that were produced and what level in the hierarchy each cluster belongs. We can use this information to create/draw our own cluster hierarchy diagram.
set lines 90
column child format a40

select id, 
       record_count, 
       parent, 
       tree_level, 
       child
from table(dbms_data_mining.get_model_details_km('CLUSTER_KMEANS_MODEL'))

image
To look at the Centroid details for a particular cluster e.g. Cluster ID = 7, we can run:
column attribute_name format a25
column mode_value format a25

select t.id,
       c.attribute_name, 
       c.mean,
       c.mode_value,
       c.variance
from table (dbms_data_mining.get_model_details_KM('CLUSTER_KMEANS_MODEL')) t,
     table(t.centroid) c
where t.id = 7
order by c.attribute_name

The results will contain details of the centroid for cluster number 7. The centroid details will include the attributes that were used to form the centroid and the values for each attribute.
image
The last thing we can look at is the rule for a cluster. The following query will give you the full details of the cluster rule for Cluster 7. This will give for each attribute what the values or ranges of values that is used for that cluster. I’ll let you run this as it returns a large number of records. Run it and have a look at some of the output.
select t.id, 
       a.attribute_name, 
       a.conditional_operator,
       nvl(a.attribute_str_value,
       a.attribute_num_value) AS value,
       a.attribute_support,
       a.attribute_confidence
from TABLE(dbms_data_mining.get_model_details_km('CLUSTER_KMEANS_MODEL'))  t, 
     TABLE(t.rule.antecedent)  a
where t.id = 7
ORDER BY t.id, a.attribute_name, attribute_support, attribute_confidence desc, value;


My next blog post on Clustering in Oracle, will be the final one in the series. It will show you how to apply a Clustering model to new data in your database.

Wednesday, February 27, 2013

Clustering in Oracle Data Miner–Part 3

This is a the third part of a five (5) part blog post on building and using Clustering in Oracle Data Miner. The following outlines the contents of each post in this series on Clustering.

  1. The first part we looked at what clustering features exist in ODM and how to setup the data that we will be using in the examples
  2. The second part will focus on how to building Clusters in ODM .
  3. The third post will focus on examining the clusters produced by ODM and how to use the Clusters to apply to new data using ODM.
  4. The fourth post will look at how you can build and evaluate a Clustering model using the ODM SQL and PL/SQL functions.
  5. The fifth and final post will look at how you can apply your Clustering model to new data using the ODM SQL and PL/SQL functions.

In my previous posts on Clustering in ODM we have setup our data, we have explored it, we have taken a sample of the data and used this sample as input to the Cluster Build node. Oracle Data Miner has two clustering algorithms and our Cluster Build node created a clustering model for each.

In this post we will look at the next steps. The first of these is that we will look at examining what clustering models ODM produced. In the second part we will look at how we can use one of these clustering models to apply and label new data.

image

Step 1 – View the Cluster Models

To view the the cluster modes we need to right click the Cluster Build node and select View Models from the drop down list. We get an additional down down menu that gives the names of the two cluster models that were developed.

In my case these are called CLUS_KM_1_25 and CLUS_OC_1_25. You may get different numbers on your model names. These numbers are generated internally in ODM

image

The first mode that we will look at will be the K-Mean Cluster Model (CLUS_KM_1_25). Select this from the menu.

Step 2 – View the Cluster Rules

The hierarchical K-Mean cluster mode will be displayed. You might need to readjust/resize some of the worksheets/message panes etc in ODM to get the good portion of the diagram to display.

image

With ODM you cannot change, alter, merge, split, etc any the clusters that were generated. Oracle take the view of, this is what we have found it it up to you now to decide how you are going to use it.

To see that the cluster rules are for each cluster you can click on a cluster. When you do this you should get a pane (under the cluster diagram) that will contain two tabs, Centroid and Cluster Rule.

The Centroid tab provides a list of the attributes that best define the selected cluster, along with the average value for each attribute and some basic statistical information.

image

The Cluster Rules tab contains a set of rules that define the cluster in a IF/THEN statement format.

image

For each cluster in the tree we can see the number of cases in each cluster the percentage of overall cases for this cluster.

Work your way down the tree exploring each of the clusters produced.

The further down the tree you go the smaller the percentage of cases will fall into each cluster. In some tools you can merge these clusters. Not so in ODM. What you have to do is to use an IF statement in your code. Something like IF cluster_num IN (16, 17, 18, 19) THEN …..

Step 3 – Compare Clusters

In addition to the cluster tree, ODM also has two addition tabs to allow us to explore the clusters. These are Detail and Compare tabs.

image

Click on the Detail tab. We now get a detailed screen that contain various statistical information for each attribute. We can for each attribute get a histogram of the values within each attribute for this cluster.

We can use this important to start building up a picture of what each cluster might represent based on the values (and their distribution) for each cluster.

image

Try this out for a few clusters.

Step 4 – Multi-Cluster - Multi-variable Comparison of Clusters

The next level of comparison and evaluation of the clusters can be found under the Compare tab.

This lets us compare two clusters against each other at an attribute level. For example let us compare cluster 4 and 9. The attribute and graphics section gets updated to reflect the data for each of cluster. These are colour coded to distinguish the two clusters.

image 

We can work our way down through each attribute and again we can use this information to help us to understand what each cluster might represent.

An additional feature here is that we can do multi-variable (attribute) comparison. Holding down the control button select LTV_BIN, SEX and AGE. With each selection we get a new graph appearing at the bottom of the screen. This shows the distribution of the values by attribute for each cluster.  We can learn a lot from this.

image

So one possible conclusion we could draw from this data would be that Cluster 4 could be ‘Short Term Value Customers’ and Cluster 9 could be ‘Long Term Value Customer’

Step 5 – Renaming Clusters

When you have discovered a possible meaning for a Cluster, you can give it a meaningful name instead of it having a number. In our example, we would like to re-label Cluster 4 to ‘Short Term Value Customers’. To do this click on the Edit button that is beside the drop down that has cluster 4. Enter the new label and click OK.

SNAGHTML1bf2277

In the drop down we will now get the new label appearing instead of the cluster number.

Similarly we can do this for the other cluster e.g. ‘Long Term Value Customer’.

image

 

We have just looked at how to explore our K-Means model. You can do similar exploration of the O-Cluster model. I’ll leave that for you to do.

 

We have now explored our clusters and we have decided which of our Clustering Models best suits our needs. In our scenario we are going to select the K-Mean model to apply and label our new data.

Step 1 – Create the Apply Node

We have already setup our sample of data that we are going to use as our Apply Data Set. We did this when we setup the two different Sample node.

We are going to use the Sample node that was set to 40%.

The first step requires us to create an Apply Node. This can be found under the Component Palette and Evaluate and Apply tab. Click on the Apply node and move the mouse to the workflow worksheet and click near the Sample Apply node.

To connect the two nodes, move the mouse to the Sample Apply node and right click. Select Connect from the drop down menu and then move the mouse to the Apply node and click again. An connection arrow will be created joining these nodes.

Step 2 – Specify which Clustering Model to use & Output Data

Next we need to specify which of the clustering model we want to use to apply to our new data.

We need to connect the Cluster Build node to the Apply node. Move the mouse to the Cluster Build node, right click and select connect from the drop down menu. Move the mouse to the Apply node and click. We get the connection arrow between the two node.

We now have joined the Data and the Cluster Build node to the Apply node.

The final step is to specify what clustering mode we would like to use. In our scenario we are going to specify the K-Mean model.

(Single) Click the Cluster Build node. We now need to use the Property Inspector to select the K-Means model for the apply set. In the Models tab of the Property Inspector we should have our two cluster models listed. Under the Output column click in the box for the O-Cluster model. We should now get a little red X mark appearing. The K-Mean model should still have the green arrow under the Output column.

Step 3 – Run the Apply Node

We have one last data setup to do on the Apply node. We need to specify what data from the apply data set we want to include in the output from the Apply node.  For simplicity we want to just include the primary key, but you could include all the attributes.  In addition to including the attributes from the apply data source, the Apply Node will also create some attributes based on the Cluster model we selected. In our scenario, the K-Means model will create two additional attributes. One of these will contain the Cluster ID and the other attribute will be the probability of the that cluster being valid.

To include the attributes from the source data, double click on the Apply node. This will open the Edit Apply Node window. You will see that it already contains the two attributes that will be created by the K-Mean model.

image

To add the attributes from the source data, click on the Data Columns tab and then click on the green ‘+’ symbol. For simplicity we are going to just select the CUSTOMER_ID. Click the OK button to finish.

image

Now we are ready to run the Apply node. To do this right click on the Apply Node and select Run from the drop down menu. When everything is finished you will get the little green tick mark on the top right hand corner of the Apply node.

image

Step 4 – View the Results

To view the results and the output produced by the Apply node, right click on the Apply node and select View Data from the drop down menu.

We get a new tab opened in SQL Developer that will contain the data. This will consist of the CUSTOMER_ID, the K-means Cluster ID and the Cluster Probability. You will see that the some of the clusters assigned will have a number and some will have the cluster labels that we assigned in a previous step.

image

It is now up to you to decide how you are going to use this clustering information in an operational or strategic way in your organisation.

 

In my next (fourth) blog post in the series on Clustering in Oracle Data Miner, I will show how you can perform similar steps, of building and evaluating clustering models, using the SQL and PL/SQL functions in the database. So we will not be using the ODM tool. We will be doing everything in SQL and SQL/PLSQL.