Brendan Tierney - Oralytics Blog: data mining blog

Showing posts with label data mining blog. Show all posts

Tuesday, January 3, 2012

ODM 11gR2–Using different data sources for Build and Testing a Model

There are 2 ways to connect a data source to the Model build node in Oracle Data Miner.

The typical method is to use a single data source that contains the data for the build and testing stages of the Model Build node. Using this method you can specify what percentage of the data, in the data source, to use for the Build step and the remaining records will be used for testing the model. The default is a 50:50 split but you can change this to what ever percentage that you think is appropriate (e.g. 60:40). The records will be split randomly into the Built and Test data sets.

The second way to specify the data sources is to use a separate data source for the Build and a separate data source for the Testing of the model.

To do this you add a new data source (containing the test data set) to the Model Build node. ODM will assign a label (Test) to the connector for the second data source.

If the label was assigned incorrectly you can swap what data sources. To do this right click on the Model Build node and select Swap Data Sources from the menu.

Tuesday, December 20, 2011

Updating your ODM (11g R2) model in production

In my previous blog posts on creating an ODM model, I gave the details of how you can do this using the ODM PL/SQL API.

But at some point you will have a fairly stable environment. What this means is that you will know what type of algorithm and its corresponding settings work best for for your data.

At this point you should be able to re-create your ODM model in the production database. The frequency of doing this update is dependent on number of new cases that you have. So you need to update your ODM model could be daily, weekly, monthly, etc.

To update your model you will need to:

- Creating a settings table for your model
- Create a new ODM model
- Rename your new ODM model to the production name

The following examples are based on the example data, model names, etc that I’ve used in my previous post.

Creating a Settings Table

The first step is to create a setting table for your algorithm. This will contain all the parameter settings needed to create the new model. You will have worked out these setting from your previous attempts at creating your models and you will know what parameters and their values work best.

-- Create the settings table
CREATE TABLE decision_tree_model_settings (
setting_name VARCHAR2(30),
setting_value VARCHAR2(30));

-- Populate the settings table
-- Specify DT. By default, Naive Bayes is used for classification.
-- Specify ADP. By default, ADP is not used.
BEGIN
    INSERT INTO decision_tree_model_settings (setting_name, setting_value)
    VALUES (dbms_data_mining.algo_name,
           dbms_data_mining.algo_decision_tree);

    INSERT INTO decision_tree_model_settings (setting_name, setting_value)
    VALUES (dbms_data_mining.prep_auto,dbms_data_mining.prep_auto_on);

    COMMIT;
END;

Create a new ODM Model

We will need to use the DBMS_DATA_MINING.CREATE_MODEL procedure. In our example we will want to create a Decision Tree based on our sample data, which contains the previously generated cases and the new cases since the last model rebuild.

BEGIN
    DBMS_DATA_MINING.CREATE_MODEL(
        model_name          => ‘Decision_Tree_Method2',
        mining_function     => dbms_data_mining.classification,
        data_table_name     => 'mining_data_build_v',
        case_id_column_name => 'cust_id',
        target_column_name => 'affinity_card',
        settings_table_name => ‘decision_tree_model_settings');
END;

Rename your ODM model to production name

The model we have create created above is not the name that is used in our production software. So we will need to rename it to our production name.

But we need to be careful about when we do this. If you drop a model or rename a model when it is being used then you can end up with indeterminate results.

What I suggest you do, is to pick a time of the day when your production software is not doing any data mining. You should drop the existing mode (or rename it) and the to rename the new model to the production model name.

DBMS_DATA_MINING.DROP_MODEL('CLAS_DECISION_TREE‘);

and then

DBMS_DATA_MINING.RENAME_MODEL('Decision_Tree_Method2', 'CLAS_DECISION_TREE');

Monday, December 19, 2011

Oracle Analytics Update & Plan for 2012

On Friday 16th December, Charlie Berger (Sr. Director, Product Management, Data Mining & Advanced Analytics) posted the following on the Oracle Data Mining forum on OTN.

“… soon you'll be able to use the new Oracle R Enterprise (ORE) functionality. ORE is currently in beta and is targeted to go General Availability in the near future. ORE brings additional functionality to the ODM Option, which will then be renamed to the Oracle Advanced Analytics Option to reflect the significant adv. analytical functionality enhancements. ORE will allow R users to write R scripts and run them inside the database and eliminate and/or minimize data movement in/out of the DB. ORE will provide R to SQL transparency for SQL push-down to in-DB SQL and and expanding library of Oracle in-DB statistical functions. Packages that cannot be pushed down will be run in embedded R mode while the DB manages all data flows to the multiple R engines running inside the DB.

In January, we'll open up a new OTN discussion forum specifically for Oracle R Enterprise focused technical discussions. Stay tuned.”

I’m looking forward to getting my hands on the new Oracle R Enterprise, in 2012. In particular I’m keen to see what additional functionality will be added to the Oracle Data Mining option in the DB.

So watch out for the rebranding to Oracle Advanced Analytics

Charlie – Any chance of an advanced copy of ORE and related DB bits and bobs.

Monday, December 12, 2011

My UKOUG Presentation on ODM PL/SQL API

On Wednesday 7th Dec I gave my presentation at the UKOUG conference in Birmingham. The main topic of the presentation was on using the Oracle Data Miner PL/SQL API to implement a model in a production environment.

There was a good turn out considering it was the afternoon of the last day of the conference.

I asked the attendees about their experience of using the current and previous versions of the Oracle Data Mining tool. Only one of the attendees had used the pre 11g R2 version of the tool.

From my discussions with the attendees, it looks like they would have preferred an introduction/overview type presentation of the new ODM tool. I had submitted a presentation on this, but sadly it was not accepted. Not enough people had voted for it.

For for next year, I will submit an introduction/overview presentation again, but I need more people to vote for it. So watch out for the vote stage next June and vote of it.

Here are the links to the presentation and the demo scripts (which I didn’t get time to run)

My Presentation

Demo Script 1 – Exploring and Exporting model

Demo Script 2 – Import, Dropping and Renaming the model. Plus Queries that use the model

Thursday, November 24, 2011

Applying an ODM Model to new data in Oracle – Part 2

This is the second of a two part blog posting on using an Oracle Data Mining model to apply it to or score new data. The first part looked at how you can score data the DBMS_DATA_MINING.APPLY procedure for scoring data batch type process.

This second part looks at how you can apply or score the new data, using our ODM model, in a real-time mode, scoring a single record at a time.

PREDICTION Function

The PREDICTION SQL function can be used in many different ways. The following examples illustrate the main ways of using it. Again we will be using the same data set with data in our (NEW_DATA_TO_SCORE) table.

The syntax of the function is

PREDICTION ( model_name, USING attribute_list);

Example 1 – Real-time Prediction Calculation

In this example we will select a record and calculate its predicted value. The function will return the predicted value with the highest probability

SELECT cust_id, prediction(clas_decision_tree using *)
FROM NEW_DATA_TO_SCORE
WHERE cust_id = 103001;

CUST_ID PREDICTION(CLAS_DECISION_TREEUSING*)
---------- ------------------------------------
103001 0

So a predicted class value is 0 (zero) and this has a higher probability than a class value of 1.

We can compare and check this results with the result that was produced using the DBMS_DATA_MINING.APPLY function (see previous blog post).

SQL> select * from new_data_scored
2 where cust_id = 103001;

   CUST_ID PREDICTION PROBABILITY
---------- ---------- -----------
    103001          0           1
    103001          1           0

Here we can see that the class value of 0 has a probability of 1 (100%) and the class value of 1 has a probability of 0 (0%).

Example 2 – Selecting top 10 Customers with Class value of 1

For this we are selecting from our NEW_DATA_TO_SCORE table. We want to find the records that have a class value of 1 and has the highest probability. We only want to return the first 10 of these

SELECT cust_id
FROM NEW_DATA_TO_SCORE
WHERE PREDICTION(clas_decision_tree using *) = 1
AND rownum <=10;

   CUST_ID
----------
    103005
    103007
    103010
    103014
    103016
    103018
    103020
    103029
    103031
    103036

Example 3 – Selecting records based on Prediction value and Probability

For this example we want to find our from what Countries do the customer come from where the Prediction is 0 (wont take up offer) and the Probability of this occurring being 1 (100%). This example introduces the PREDICTION_PROBABILITY function. This function allows use to use the probability strength of the prediction.

select country_name, count(*)
from new_data_to_score
where prediction(clas_decision_tree using *) = 0
and prediction_probability (clas_decision_tree using *) = 1
group by country_name
order by count(*) asc;

COUNTRY_NAME                               COUNT(*)
---------------------------------------- ----------
Brazil                                            1
China                                             1
Saudi Arabia                                      1
Australia                                         1
Turkey                                            1
New Zealand                                       1
Italy                                             5
Argentina                                        12
United States of America                        293

The examples that I have give above are only the basic examples of using the PREDICTION function. There are a number of other uses that include using the PREDICTION_COST, PREDICTION_SET, PREDICTION_DETAILS. Examples of these will be covered in a later blog post

Monday, November 21, 2011

Applying an ODM Model to new data in Oracle – Part 1

This is the first of a two part blog posting on using an Oracle Data Mining model to apply it to or score new data. This first part looks at the how you can score data using the DBMS_DATA_MINING.APPLY procedure in a batch type process.

The second part will be posted in a couple of days and will look how you can apply or score the new data, using our ODM model, in a real-time mode, scoring a single record at a time.

DBMS_DATA_MINING.APPLY

Instead of applying the model to data as it is captured, you may need to apply a model to a large number of records at the same time. To perform this bulk processing we can use the APPLY procedure that is part of the DBMS_DATA_MINING package. The format of the procedure is

DBMS_DATA_MINING.APPLY (
      model_name           IN VARCHAR2,
      data_table_name      IN VARCHAR2,
      case_id_column_name IN VARCHAR2,
      result_table_name    IN VARCHAR2,
      data_schema_name     IN VARCHAR2 DEFAULT NULL);

Parameter Name	Description
Model_Name	The name of your data mining model
Data_Table_Name	The source data for the model. This can be a tree or view.
Case_Id_Column_Name	The attribute that give uniqueness for each record. This could be the Primary Key or if the PK contains more than one column then a new attribute is needed
Result_Table_Name	The name of the table where the results will be stored
Data_Schema_Name	The schema name for the source data

The main condition for applying the model is that the source table (DATA_TABLE_NAME) needs to have the same structure as the table that was used when creating the model.

Also the data needs to be prepossessed in the same way as the training data to ensure that the data in each attribute/feature has the same formatting.

When you use the APPLY procedure it does not update the original data/table, but creates a new table (RESULT_TABLE_NAME) with a structure that is dependent on what the underlying DM algorithm is. The following gives the Result Table description for the main DM algorithms:

For a Classification algorithms

case_id VARCHAR2/NUMBER
prediction NUMBER / VARCHAR2 -- depending a target data type
probability NUMBER

For Regression

case_id VARCHAR2/NUMBER
prediction NUMBER

For Clustering

case_id VARCHAR2/NUMBER
cluster_id NUMBER
probability NUMBER

Example / Case Study

My last few blog posts on ODM have covered most of the APIs for building and transferring models. We will be using the same data set in these posts. The following code uses the same data and models to illustrate how we can use the DBMS_DATA_MINING.APPLY procedure to perform a bulk scoring of data.

In my previous post we used the EXPORT and IMPORT procedures to move a model from one database (Test) to another database (Production). The following examples uses the model in Production to score new data. I have setup a sample of data (NEW_DATA_TO_SCORE) from the SH schema using the same set of attributes as was used to create the model (MINING_DATA_BUILD_V). This data set contains 1500 records.

SQL> desc NEW_DATA_TO_SCORE
Name                                 Null?    Type
------------------------------------ -------- ------------
CUST_ID                              NOT NULL NUMBER
CUST_GENDER                          NOT NULL CHAR(1)
AGE                                           NUMBER
CUST_MARITAL_STATUS                           VARCHAR2(20)
COUNTRY_NAME                         NOT NULL VARCHAR2(40)
CUST_INCOME_LEVEL                             VARCHAR2(30)
EDUCATION                                     VARCHAR2(21)
OCCUPATION                                    VARCHAR2(21)
HOUSEHOLD_SIZE                                VARCHAR2(21)
YRS_RESIDENCE                                 NUMBER
AFFINITY_CARD                                 NUMBER(10)
BULK_PACK_DISKETTES                           NUMBER(10)
FLAT_PANEL_MONITOR                            NUMBER(10)
HOME_THEATER_PACKAGE                          NUMBER(10)
BOOKKEEPING_APPLICATION                       NUMBER(10)
PRINTER_SUPPLIES                              NUMBER(10)
Y_BOX_GAMES                                   NUMBER(10)
OS_DOC_SET_KANJI                              NUMBER(10)

SQL> select count(*) from new_data_to_score;

COUNT(*)
----------
1500

The next step is to run the the DBMS_DATA_MINING.APPLY procedure. The parameters that we need to feed into this procedure are

Parameter Name	Description
Model_Name	CLAS_DECISION_TREE -- we imported this model from our test database
Data_Table_Name	NEW_DATA_TO_SCORE
Case_Id_Column_Name	CUST_ID -- this is the PK
Result_Table_Name	NEW_DATA_SCORED -- new table that will be created that contains the Prediction and Probability.

The NEW_DATA_SCORED table will contain 2 records for each record in the source data (NEW_DATA_TO_SCORE). For each record in NEW_DATA_TO_SCORE we will have one record for the each of the Target Values (O or 1) and the probability for each target value. So for our NEW_DATA_TO_SCORE, which contains 1,500 records, we will get 3,000 records in the NEW_DATA_SCORED table.

To apply the model to the new data we run:

BEGIN
dbms_data_mining.apply(
model_name => 'CLAS_DECISION_TREE',
data_table_name => 'NEW_DATA_TO_SCORE',
case_id_column_name => 'CUST_ID',
result_table_name => 'NEW_DATA_SCORED');
END;
/

This takes 1 second to run on my laptop, so this apply/scoring of new data is really quick.

The new table NEW_DATA_SCORED has the following description

SQL> desc NEW_DATA_SCORED
Name                            Null?    Type
------------------------------- -------- -------
CUST_ID                         NOT NULL NUMBER
PREDICTION                               NUMBER
PROBABILITY                              NUMBER

SQL> select count(*) from NEW_DATA_SCORED;

COUNT(*)
----------
3000

We can now look at the prediction and the probabilities

SQL> select * from NEW_DATA_SCORED where rownum <=12;

   CUST_ID PREDICTION PROBABILITY
---------- ---------- -----------
    103001          0           1
    103001          1           0
    103002          0 .956521739
    103002          1 .043478261
    103003          0 .673387097
    103003          1 .326612903
    103004          0 .673387097
    103004          1 .326612903
    103005          1 .767241379
    103005          0 .232758621
    103006          0           1
    103006          1           0

12 rows selected.

Wednesday, November 9, 2011

ODM–PL/SQL API for Exporting & Importing Models

In a previous blog post I talked about how you can take a copy of a workflow developed in Oracle Data Miner, and load it into a new schema.
When you data mining project gets to a mature stage and you need to productionalise the data mining process and model updates, you will need to use a different set of tools.

As you gather more and more data and cases, you will be updating/refreshing your models to reflect this new data. The new update data mining model needs to be moved from the development/test environment to the production environment. As with all things in IT we would like to automate this updating of the model in production.
There are a number of database features and packages that we can use to automate the update and it involves the setting up of some scripts on the development/test database and also on the production database.

These steps include:

Creation of a directory on the development/test database
Exporting of the updated Data Mining model
Copying of the exported Data Mining model to the production server
Removing the existing Data Mining model from production
Importing of the new Data Mining model.
Rename the imported mode to the standard name

The DBMS_DATA_MINING PL/SQL package has 2 functions that allow us to export a model and to import a model. These functions are an API to the Oracle Data Pump. The function to export a model is DBMS_DATA_MINING.EXPORT_MODEL and the function to import a model is DBMS_DATA_MINING.IMPORT_MODEL.The parameters to these function are what you would expect use if you were to use Data Pump directly, but have been tailored for the data mining models.

Lets start with listing the models that we have in our development/test schema:

SQL> connect dmuser2/dmuser2
Connected.
SQL> SELECT model_name FROM user_mining_models;
MODEL_NAME
------------------------------
CLAS_DT_1_6
CLAS_SVM_1_6
CLAS_NB_1_6
CLAS_GLM_1_6

Create/define the directory on the server where the models will be exported to.

CREATE OR REPLACE DIRECTORY DataMiningDir_Exports AS 'c:\app\Data_Mining_Exports';

The schema you are using will need to have the CREATE ANY DIRECTORY privilege.

Now we can export our mode. In this example we are going to export the Decision Tree model (CLAS_DT_1_6)

DBMS_DATA_MINING.EXPORT_MODEL function
The function has the following structure

DBMS_DATA_MINING.EXPORT_MODEL (
     filename IN VARCHAR2,
     directory IN VARCHAR2,
     model_filter IN VARCHAR2 DEFAULT NULL,
     filesize IN VARCHAR2 DEFAULT NULL,
     operation IN VARCHAR2 DEFAULT NULL,
     remote_link IN VARCHAR2 DEFAULT NULL,
     jobname IN VARCHAR2 DEFAULT NULL);

If we wanted to export all the models into a file called Exported_DM_Models, we would run:

DBMS_DATA_MINING.EXPORT_MODEL('Exported_DM_Models', 'DataMiningDir');

If we just wanted to export our Decision Tree model to file Exported_CLASS_DT_Model, we would run:

DBMS_DATA_MINING.EXPORT_MODEL('Exported_CLASS_DT_Model', 'DataMiningDir', 'name in (''CLAS_DT_1_6'')');

DBMS_DATA_MINING.DROP_MODEL function
Before you can load the new update data mining model into your production database we need to drop the existing model. Before we do this we need to ensure that this is done when the model is not in use, so it would be advisable to schedule the dropping of the model during a quiet time, like before or after the nightly backups/processes.

DBMS_DATA_MINING.DROP_MODEL('CLAS_DECISION_TREE', TRUE)

DBMS_DATA_MINING.IMPORT_MODEL function
Warning : When importing the data mining model, you need to import into a tablespace that has the same name as the tablespace in the development/test database. If the USERS tablespace is used in the development/test database, then the model will be imported into the USERS tablespace in the production database.

Hint : Create a DATAMINING tablespace in your development/test and production databases. This tablespace can be used solely for data mining purposes.

To import the decision tree model we exported previously, we would run

DBMS_DATA_MINING.IMPORT_MODEL('Exported_CLASS_DT_Model', 'DataMiningDir', 'name=’CLAS_DT_1_6''', 'IMPORT', null, null, 'dmuser2:dmuser3');

We now have the new updated data mining model loaded into the production database.

DBMS_DATA_MINING.RENAME_MODEL function
The final step before we can start using the new updated model in our production database is to rename the imported model to the standard name that is being used in the production database.

DBMS_DATA_MINING.RENAME_MODEL('CLAS_DT_1_6', 'CLAS_DECISION_TREE');

Scheduling of these steps
We can wrap most of this up into stored procedures and have schedule it to run on a semi-regular bases, using the DBMS_JOB function. The following example schedules a procedure that controls the importing, dropping and renaming of the models.

DBMS_JOB.SUBMIT(jobnum.nextval, 'import_new_data_mining_model', trunc(sysdate), add_month(trunc(sysdate)+1);

This schedules the the running of the procedure to import the new data mining models, to run immediately and then to run every month.

Saturday, November 5, 2011

What Conference ? If I had the time and money

If I had lots of free time and enough money what conferences would I go to around the world. I regularly get asked for recommendations on what conferences should a person attend. It all depends on what you want to get out of your conference trip. Be is training, education, information building, networking, etc. or to enjoy the local attractions.

The table below is my preferred list of conferences to attend. All of the conferences below are focused on two main areas. The first area is Oracle and the second area is that of Data Mining/Predictive Analytics.

I hope you find the list useful. If you can recommend some others let me know.

Month	Conference
January
February
March	Annual Ireland Oracle Conference – Dublin, Ireland Predictive Analytics World – USA (San Francisco) Text Analytics World Hotsos Symposium
April	Collaborate (IOUG Conference USA) Enterprise Data World (USA) Miracle OpenWorld (Denmark)
May	OUG Harmony (Finland)
June	Oracle Development Tools User Group Kaleidoscope (Kscope) Data Governance – Summer Conference Oracle Benelux User Group Conference
July	VirtaThon – Online Oracle Conference
August	ACM SIGKDD Conference on KDD & Data Mining
September
October	Oracle Open World – San Francisco, USA Predictive Analytics World – USA (New York) SAS Analytics Conference
November	TDWI World Conference Data Governance – Winter Conference (USA) Predictive Analytics World – UK International Conference on Data Mining & Engineering (ICDMKE) Australia Oracle User Group Conference Germany Oracle User Group Conference (DOAG)
December	Annual UKOUG Conference – Birmingham, UK IEEE International Conference on Data Mining (ICDM) Oracle Open World Latin America

There is a lot of conferences in the October, November and December months. Some of these are on overlapping dates, which is a pity. Perhaps the organisers of some of these conferences. Also during the January and February months there does not seem to be any conferences in the areas.

If you would like to sponsor a trip to one or more of these then drop me an email Smile

Thursday, November 3, 2011

ODM 11.2 Data Dictionary Views.

The Oracle 11.2 database contains the following Oracle Data Mining views. These allow you to query the database for the metadata relating to what Data Mining Models you have, what the configurations area and what data is involved.

ALL_MINING_MODELS

Describes the high level information about the data mining models in the database. Related views include DBA_MINING_MODELS and USER_MINING_MODELS.

Attribute	Data Type	Description
OWNER	Varchar2(30) NN	Owner of the mining model
MODEL_NAME	Varchar2(30) NN	Name of the mining model
MINING_FUNCTION	Varchar2(30)	What data mining function to use CLASSIFICATION REGRESSION CLUSTERING FEATURE_EXTRACTION ASSOCIATION_RULES ATTRIBUTE_IMPORTANCE
ALGORITHM	Varchar2(30)	Algorithm used by the model NAIVE_BAYES ADAPTIVE_BAYES_NETWORK DECISION_TREE SUPPORT_VECTOR_MACHINES KMEANS O_CLUSTER NONNEGATIVE_MATRIX_FACTOR GENERALIZED_LINEAR_MODEL APRIORI_ASSOCIATION_RULES MINIMUM_DESCRIPTION_LENGTH
CREATION_DATE	Date NN	Date model was created
BUILD_DURATION	Number	Time in seconds for the model build process
MODEL_SIZE	Number	Size of model in MBytes
COMMENTS	Varchar2(4000)

Lets query the my DMUSER2 data mining schema. This was created during a previous post where we exported some ODM models from schema and loaded them into DMUSER2 schema

SELECT model_name,
       mining_function,
       algorithm,
       build_duration,
       model_size
FROM ALL_MINING_MODELS;

MODEL_NAME     MINING_FUNCTION ALGORITHM                      BUILD_DURATION MODEL_SIZE
------------- ---------------- -------------------------- -------------- ----------
CLAS_SVM_1_6   CLASSIFICATION    SUPPORT_VECTOR_MACHINES                     3      .1515
CLAS_DT_1_6    CLASSIFICATION    DECISION_TREE                               2      .0842
CLAS_GLM_1_6   CLASSIFICATION    GENERALIZED_LINEAR_MODEL                    3      .0877
CLAS_NB_1_6    CLASSIFICATION    NAIVE_BAYES                                 2      .0459

ALL_MINING_MODEL_ATTRIBUTES

Describes the attributes of the data mining models. Related views are DBA_MINING_MODEL_ATTRIBUTES and USER_MINING_MODEL_ATTRIBUTES.

Attribute	Data Type	Description
OWNER	Varchar2(30) NN	Owner of the mining model
MODEL_NAME	Varchar2(30) NN	Name of the mining mode
ATTRIBUTE_NAME	Varchar2(30) NN	Name of the attribute
ATTRIBUTE_TYPE	Varchar2(11)	Logical type of attribute NUMERICAL – numeric data CATEGORICAL – character data
DATA_TYPE	Varchar2(12)	Data type of attribute
DATA_LENGTH	Number	Length of data type
DATA_PRECISION	Number	Precision of a fixed point number
DATA_SCALE	Number	Scale of the fixed point number
USAGE_TYPE	Varchar2(8)	Indicated if the attribute was used to create the model (ACTIVE) or not (INACTIVE)
TARGET	Varchar2(3)	Indicates if the attribute is the target

If we take one of our data mining models that was listed about and select what attributes are used by that model;

SELECT attribute_name,
       attribute_type,
       usage_type,
       target
from all_mining_model_attributes
where model_name = 'CLAS_DT_1_6';

ATTRIBUTE_NAME                 ATTRIBUTE_T USAGE_TY TAR
------------------------------ ----------- -------- ---
AGE                            NUMERICAL   ACTIVE   NO
CUST_MARITAL_STATUS            CATEGORICAL ACTIVE   NO
EDUCATION                      CATEGORICAL ACTIVE   NO
HOUSEHOLD_SIZE                 CATEGORICAL ACTIVE   NO
OCCUPATION                     CATEGORICAL ACTIVE   NO
YRS_RESIDENCE                  NUMERICAL   ACTIVE   NO
Y_BOX_GAMES                    NUMERICAL   ACTIVE   NO
AFFINITY_CARD                  CATEGORICAL ACTIVE   YES

The first thing to note here is that all the attributes are listed as ACTIVE. This is the default and will be the case for all attributes for all the algorithms, so we can ignore this attribute in our queries, but it is good to check just in case.

The second thing to note is for the last row we have the AFFINITY_CARD has a target attribute value of YES. This is the target attributes used by the classification algorithm.

ALL_MINING_MODEL_SETTINGS

Describes the setting of the data mining models. The settings associated with a model are algorithm dependent. The Setting values can be provided as input to the model build process. Alternatively, separate settings table can used. If no setting values are defined of provided, then the algorithm will use its default settings.

Attribute	Data Type	Description
OWNER	Varchar2(30) NN	Owner of the mining model
MODEL_NAME	Varchar2(30) NN	Name of the mining model
SETTING_NAME	Varchar2(30) NN	Name of the Setting
SETTING_VALUE	Varchar2(4000)	Value of the Setting
SETTING_TYPE	Varchar2(7)	Indicates whether the default value (DEFAULT) or a user specified value (INPUT) is used by the model

Lets take our previous example of the 'CLAS_DT_1_6' model and query the database to see what the setting are.

column setting_value format a30
select setting_name,
setting_value,
setting_type
from all_mining_model_settings
where model_name = 'CLAS_DT_1_6';

SETTING_NAME            SETTING_VALUE                SETTING
----------------------- ---------------------------- -------
ALGO_NAME               ALGO_DECISION_TREE           INPUT
PREP_AUTO               ON                           INPUT
TREE_TERM_MINPCT_NODE   .05                          INPUT
TREE_TERM_MINREC_SPLIT 20                           INPUT
TREE_IMPURITY_METRIC    TREE_IMPURITY_GINI           INPUT
CLAS_COST_TABLE_NAME    ODMR$15_42_50_762000JERWZYK INPUT
TREE_TERM_MINPCT_SPLIT .1                           INPUT
TREE_TERM_MAX_DEPTH     7                            INPUT
TREE_TERM_MINREC_NODE   10                           INPUT

Monday, October 31, 2011

ODM 11.2–Data Mining PL/SQL Packages

The Oracle 11.2 database contains 3 PL/SQL packages that allow you to perform all (well almost all) of your data mining functions.

So instead of using the Oracle Data Miner tool you can write some PL/SQL code that will you to do the same things.

Before you can start using these PL/SQL packages you need to ensure that the schema that you are going to use has been setup with the following:

Create a schema or use and existing one
Grant the schema all the data mining privileges: see my earlier posting on how to setup an Oracle schema for data mining – Click here and YouTube video
Grant all necessary privileges to the data that you will be using for data mining

The first PL/SQL package that you will use is the DBMS_DATA_MINING_TRANSFORM. This PL/SQL package allows you to transform the data to make it suitable for data mining. There are a number of functions in this package that allows you to transform the data, but depending on the data you may need to write your own code to perform the transformations. When you apply your data model to the test or the apply data sets, ODM will automatically take the transformation functions defined using this package and apply them to the new data sets.

The second PL/SQL package is DBMS_DATA_MINING. This is the main data mining PL/SQL package. It contains functions to allow you to:

To create a Model
Describe the Model
Exploring and importing of Models
Computing costs and text metrics for classification Models
Applying the Model to new data
Administration of Models, like dropping, renaming, etc

The next (and last) PL/SQL package is DBMS_PREDICTIVE_ANALYTICS.The routines included in this package allows you to prepare data, build a model, score a model and return results of model scoring. The routines include EXPLAIN which ranks attributes in order of influence in explaining a target column. PREDICT which predicts the value of a target attribute based on values in the input. PROFILE which generates rules that describe the cases from the input data.

Over the coming weeks I will have separate blog posts on each of these PL/SQL packages. These will cover the functions that are part of each packages and will include some examples of using the package and functions.

Saturday, October 29, 2011

ODM PL/SQL API 11.2 New Features

The PL/SQL API interface for Oracle Data Miner has had a number of new features. These are listed below along with the new API features added with the 11.1 release.

Support for Native Transactional Data with Association Rules: you can build association rule models without first transforming the transactional data.
SVM class weights specified with CLAS_WEIGHTS_TABLE_NAME: including the GLM class weights
FORCE argument to DROP_MODEL: you can now force a drop model operation even if a serious system error has interrupted the model build process
GET_MODEL_DETAILS_SVM has a new REVERSE_COEF parameter: you can obtain the transformed attribute coefficients used internally by an SVM model by setting the new REVERSE_COEF parameter to 1

11.1g API New Features

Mining Model schema objects: previous releases, DM models were implemented as a collection of tables and metadata within the DMSYS schema. in 11.1 models are implemented as data dictionary objects in the SYS schema. A new set of DD views present DM models and their properties
Automatic and Embedded Data Preparation: previously data preparation was the responsibility of the user. Now it can be automated
Scoping of Nested Data: supports nested data types for both categorical and numerical data. Most algorithms require multi-record case data to the presented as columns of nested rows, each containing an attribute name/value pair. ODM processes each nested row as a separate attribute.
Standardised Handling of Sparse Data & Missing Values: standardised across all algorithms.
Generalised Linear Models: has a new algorithm and supports classification (logistic regression) and regression (linear regression)
New SQL Data Mining Function: PREDICTION_BOUNDS has been introduced for Generalised Linear Models. This returns the confidence bounds on predicted values (regression models) or predicted probabilities (classification)
Enhanced Support for Cost-Sensitive Decision Making: can be added or removed using DATA_MINING.ADD_COST_MATRIX and DBMS_DATA_MINING_REMOVE_COST_MATRIX.

Friday, October 21, 2011

Interesting quotes from Predictive Analytics World

The Predictive Analytics World conference is finishing up today in New York. Over the past few days the conference has had some of the leading analytic type people presenting at it.

Twitter, as usual, has been busy and there has been some very interesting and important quotes.

The list of tweets (#pawcon) below are the ones I found most interesting:

Manu Sharma from LinkedIn: "Guru" job title is down, "Ninja" is up.

Despite the "data science" buzz, the biggest skill among #pawcon attendees is " #DataMining

Andrea Medinaceli: Visualization is very powerful for making analytics results accessible to upper management (and for buy-in)

Social Network Analytics (SNA) with Zynga, 20M daily active users, 90M monthly active users; 10K nodes, 45K edges (big!)

Vertica: Zynga is an analytics company in the disguise of a gaming company; graph analytics find users/influencers

Colin Shearer: Find me something interesting in my data is a question from hell (analysis should be guided by business goals)

John Elder advocates ensemble methods - usually improve analytics results

Tom Davenport: to get real value, #analytics need to move from one-time craft to industrialized activity

10 years from now all Fortune 500 companies will have a Chief Analytics Officer at the level of COO or CFO

Must be a sign of the economy, so much of the focus on the value of predictive is on retaining customers. #PAWCON.

Tom Davenport: #Analytics is not about math, it is about relationships (with your business client) - says Intel Chief Mathematician

Karl Rexer: companies with higher analytic capabilities are doing better than their peers

Wednesday, October 19, 2011

ODM API Demos in PL/SQL (& Java)

If you have been using Oracle Data Miner to develop your data mining workflows and models, at some point you will want to move away from the tool and start using the ODM APIs.

Oracle Data Mining provides a PL/SQL API and a Java API for creating supervised and unsupervised data mining models. The two APIs are fully interoperable, so that a model can be created with one API and then modified or applied using the other API.

I will cover the Java APIs in a later post, so watch out for that.

To help you get started with using the APIs there are a number of demo PL/SQL programs available. These were available as part of the the pre-11.2g version of the tool. But they don’t seem to packaged up with the 11.2 (SQL Developer 3) application.

The following table gives a list of the PL/SQL demo programs that are available. Although these were part of the pre-11.2g tool, they still seem to work on your 11.2g database.

You can download a zip of these files from here.

The sample PL/SQL programs illustrate each of the algorithms supported by Oracle Data Mining. They include examples of data transformations appropriate for each algorithm.

I will be exploring the main APIs, how to set them up, the parameters, etc., over the next few weeks, so check back for these posts.

Tuesday, October 18, 2011

Book Donation by Oracle

Today I received two boxes, containing 48 books of

The Performance Management Revolution by Howard Dresner

These books have been kindly donated by Duncan Fitter, UK Business Development Director at Oracle.

I will be distributing these books to my MSc Data Mining students over the next week.

Thanks Duncan and Oracle

Wednesday, October 12, 2011

SQL Developer 3.1 EA & Bug

The new/updated SQL Developer 3.1 Early Adopter has just been released.

For the Data Miner, there are no major changes and it appears that there has been some bug fixes and some minor enhancements to so parts.

The main ODM features, apart from bug fixes, in this release include:

Globalization support, including translated error messages and GUI for all languages supported by SQL Developer
Improved accessibility features including the addition of a Structure navigator that lists all the nodes and links displayed in a workflow

Bug / Feature

After unzipping the download I opened SQL Developer. With each new release you will have to upgrade the existing ODM repository. The easiest way of doing this is to open the ODM connections pane and double click on one of your ODM schemas. SQL Developer will then run the necessary scripts to upgrade the repository.

I discovered a bug/feature with SQL Developer 3.1 EA1 upgrade script. The repository upgrade does not complete and an error is report.

I logged this error on the ODM forum on OTN. Mark Kelly who is the Development Manager for ODM and monitors the ODM forum, and his team, were quickly onto investigating the error. Mark has posted an update on the ODM form and give a script that needs to be run before you upgrade your existing repository.

You can download the pre-upgrade script from here.

If you don’t have an existing repository then you don’t have to run the script.

Check out the message on the ODM forum.

https://forums.oracle.com/forums/ann.jspa?annID=1678

https://forums.oracle.com/forums/thread.jspa?threadID=2296374&tstart=0

How to Upgrade SQL Developer & ODM

You will have to download the new SQL Developer 3.1 EA install files.

http://www.oracle.com/technetwork/developer-tools/sql-developer/sqldev-ea-download-486950.html

Unzip this into your SQL Developer directory
Create a shortcut for sqldeveloper.exe on your desktop and relabel it SQL Developer 3.1 EA
Double-click this short cut

You should be presented with the above window. Select the Yes button to migrate you previous install settings
SQL Developer should now open and contains all your previous connections

If you have an existing ODM repository, you need to run the pre-upgrade script (see above) at this point

You will now have to upgrade the ODM repository in the database. The simplest way of doing this is to allow SQL Developer to run the necessary scripts.
From the View Menu, select Oracle Data Miner –> Connections
In the ODM Connections pane double click one of your ODM schemas. Enter the username and password and click OK

You will then be prompted to migrate/update the ODM repository to the new version. Click Yes.
Enter the SYS username and Password

Click Start button, to start the migrate/upgrade scripts
On my laptop this migrate/upgrade step took less than 1 minute
The upgrade is now finished and you can start using ODM.

ODM – SQL Developer 3.1 EA – Release Notes

The ODM release notes can be found at

http://www.oracle.com/technetwork/database/options/odm/dataminer-31-relnotes-489144.html

Saturday, August 6, 2011

New Frontiers for Oracle Data Miner

Oracle Data Miner functionality is now well established and proven over the years. In particular with the release of the ODM 11gR2 version of the tool. But how will Oracle Data Miner develop into the future.

There are 4 main paths or Frontiers for future developments for Oracle Data Miner:

Oracle Data Miner Tool

The new ODM 11gR2 tool is a major development over the previous version of the tool. With the introduction of workflows and some added functionality for some of the features. the tool is now comparable with the likes of SAS Enterprise Miner and SPSS.

But the new tool is not complete and still needs a bit of fine tuning of most of the features. In particular with the usability and interactions. Some of the colour schemes needs to be looked at or to allow users to select their own colours.

Apart from the usability improvements for the tool another major development that is needed, is the ability to translate the workflow and the underlying database objects into usable code. This code can then be incorporated into our applications and other tools. The tool does allow you to produce shell code of the nodes, but there is still a lot of effort needed to make this usable. Under the previous version of the tool there was features available in JDeveloper and SQL Developer to produced packaged code that was easy to include in our applications.

“A lot done – More to do”

Oracle Applications

Over the past couple of months there has been a few postings on how Oracle Data Miner (11gR2) has been, or will be, incorporated in various Oracle Applications. For example Oracle Fusion Human Capital Management and Oracle Real Time Decision (RTD). Watch out of other applications that will be including Oracle Data Miner.

“A bit done – Lots more to do”

Oracle Business Intelligence

One of the most common places where ODM can be used is with OBIEE. OBIEE is the core engine for the delivery of the BI needs for an organisation. OBIEE coordinates the gathering of data from various sources, the defining of the business measures and then the delivery of this information in various forms to the users. Oracle Data Miner can be included in this process and can add significant value to the BI needs and report.

“A lot done – Need to publicise more”

Customized Projects

Most data mining projects are independent of various Applications and BI requirements. They are projects that are hoping to achieve a competitive insight into their organisational data. Over time as the success of some pilot projects become know they need for more data mining projects will increase. This will lead to organisations have a core data mining team to support these project. With this, the team will need tools to support them in the delivery of their project and with the delivery. This is were OBIEE and Oracle Fusion Apps will come increasingly important.

“A lot done – more to do”

Wednesday, July 20, 2011

Data Exploration using Oracle Data Miner 11gR2

Before beginning any data mining task we need to performs some data investigation. This will allow us to explore the data and to gain a better understanding of the data values. We can discover a lot by doing this can it can help us to identify areas for improvement in the source applications, as well as identifying data that does not contribute to our business problem (this is called feature reduction), and it can allow us to identify data that needs reformatting into a number of additional features (feature creation). A simple example of this is a date of birth field provides no real value, but by creating a number of additional attributes (features) we can now use the date of birth field to determine what age group they fit into.

As with most of the interface in Oracle Data Miner 11gR2, there is a new Data Exploration interface. In this blog post I will talk you through how to set-up and use the new Data Exploration interface and show you how you can use the data exploration features to gain an understanding of the data before you begin using the data mining algorithms.

The examples given here are based on my previous blog posts and we will use the same sample data sets, that were set-up as part of the install and configuration.

See my other blog post and videos on installing and setting up Oracle Data Miner.

Data Set-up

Before we can begin the data exploration we need to identify data we are going to use. To do this we need to select the Data tab from the Component Palette, and then select Data Source.

To create the Data Node on our Workflow we need to click and drag the Data Source onto the workflow. Select the MINING_DATA_BUILD_V and select all the data.

The next step is to create the Explore Data node on our workflow. From the Data tab in the Component Palette, select and drag the Explore Data node onto the workflow. Now we need to link the Data node to the Explore Data node.

Right-click on the Explore Data mode and click Run. This will make the ODM tool go to the database and analyse the data that is specified in our Data node. The analyse results will be used in the Explore Data note.

Exploring the Data

When the Explore Data node has finished we can look at the data it has generated. Right-click the Explore Data node and select View Data.

A lot of statistical information has been generated for each of the attributes in our Data node. In addition to the statistical information we also get a histogram of the attribute distributions.

We can work through each attribute taking the statistical data and the histograms to build up a picture of the data.

The data we are using is for an Electronics Goods store.

A few interesting things in the data are:

90% of the data comes from the United States of America
PRINTER_SUPPLIES attribute only has one value. We can eliminate this from our data set as it will not contribute to the data mining algorithms
Similarly for OS_DOC_SET_KENJI, which also has one one value

The histograms are based on predetermined number of bins. This is initially set to 10, but you may need to changed this value up or down to see if a pattern exists in the data.

An example of this is if we select AGE and set the number of bins to 10. We get a nice histogram showing that most of our customers are in the 31 to 46 age ranges. So maybe we should be concentrating on these.

Now if we change the number of bins to 30 can get a completely different picture of what is going on in the data.

To change the number of bin we need to go to the Workflow pane and select the Property Inspector. Scroll down to the Histogram section and change the Numerical Bins to 25. You then need to rerun the Explore Data node.

Now we can see that there are a number of important age groups what stand out more than others. If we look at the 31 to 46 age range, in the first histogram we can see that there is not much change between each of the age bins. But when we look at the second histogram for the 25 bins for the same 21 to 34 age range we get a very different view of the data. In this second histogram we see that that the ages of the customers vary a lot. What does mean. Well it can mean lots of different things and it all depends on the business scenario. In our example we are looking at an electronic goods store. What we can deduce from this second histogram is that there are a small number of customers up to about age 23. Then there is an increase. Is this due to people having obtained their main job after school having some disposable income. This peak is followed by a drop off in customers followed by another peak, drop off, peak, drop off etc. Maybe we can build a profile of our customer based on their age just like what our financial organisations due to determine what products to sell to use based on our age and life stage.

Conclusions on the data

From this histogram we can maybe categorise the customers into the follow

• Early 20s – out of education, fist job, disposable income
• Late 20s to early 30s – settling down, own home
• Late 30s – maybe kids, so have less disposable income
• 40s – maybe people are trading up and need new equipment. Or maybe the kids have now turned into teenagers and are encouraging their parents to buy up todate equipment.
• Late 50s – These could be empty nesters where their children have left home, maybe setting up home by themselves and their parents are building things for their home. Or maybe the parents are treating themselves with new equipment as they have more disposable income
• 60s + – parents and grand-parents buying equipment for their children and grand-children. Or maybe we have very techie people who have just retired
• 70+ – we have a drop off here.

As you can see we can discover a lot in the day by changing the number of bins and examining the data. The important part of this examination is trying to relate what you are seeing from the graphical representation of the data on the screen, back to the type of business we are examining. A lot can be discovered but you will have to spend some time looking for it.

ODM 11gR2 Extra Data Exploration Functionality

In ODM 11gR2 we now have an extra feature for our data analysis feature. We can now produce the histograms that are grouped by one of the other attributes. Typically this would be the Target or Class attribute but you can also use it with the other attributes.

To set this extra feature, double click on the Explore Data node. The Group By drop down lets you to select the attribute you want to group the other attributes by.

Using our example data, the target variable is AFFINITY_CARD. Select this in the drop down and run the Explore Data node again. When you look at the newly generated histograms you will now see each bin has two colours. If you hover the mouse of each coloured part you will be able to get the number of records in each group. You can use other attributes, such as the CUST_GENDER, COUNTRY_NAME, etc. Only use the attributes where it would make sense to analyse the data by.

This is a powerful new feature that allows you to gain a deeper level of insight into the data you are analysing

Brendan Tierney

Pages