Brendan Tierney - Oralytics Blog: Applying an ODM Model to new data in Oracle

This is the first of a two part blog posting on using an Oracle Data Mining model to apply it to or score new data. This first part looks at the how you can score data using the DBMS_DATA_MINING.APPLY procedure in a batch type process.

The second part will be posted in a couple of days and will look how you can apply or score the new data, using our ODM model, in a real-time mode, scoring a single record at a time.

DBMS_DATA_MINING.APPLY

Instead of applying the model to data as it is captured, you may need to apply a model to a large number of records at the same time. To perform this bulk processing we can use the APPLY procedure that is part of the DBMS_DATA_MINING package. The format of the procedure is

DBMS_DATA_MINING.APPLY (
      model_name           IN VARCHAR2,
      data_table_name      IN VARCHAR2,
      case_id_column_name IN VARCHAR2,
      result_table_name    IN VARCHAR2,
      data_schema_name     IN VARCHAR2 DEFAULT NULL);

Parameter Name	Description
Model_Name	The name of your data mining model
Data_Table_Name	The source data for the model. This can be a tree or view.
Case_Id_Column_Name	The attribute that give uniqueness for each record. This could be the Primary Key or if the PK contains more than one column then a new attribute is needed
Result_Table_Name	The name of the table where the results will be stored
Data_Schema_Name	The schema name for the source data

The main condition for applying the model is that the source table (DATA_TABLE_NAME) needs to have the same structure as the table that was used when creating the model.

Also the data needs to be prepossessed in the same way as the training data to ensure that the data in each attribute/feature has the same formatting.

When you use the APPLY procedure it does not update the original data/table, but creates a new table (RESULT_TABLE_NAME) with a structure that is dependent on what the underlying DM algorithm is. The following gives the Result Table description for the main DM algorithms:

For a Classification algorithms

case_id VARCHAR2/NUMBER
prediction NUMBER / VARCHAR2 -- depending a target data type
probability NUMBER

For Regression

case_id VARCHAR2/NUMBER
prediction NUMBER

For Clustering

case_id VARCHAR2/NUMBER
cluster_id NUMBER
probability NUMBER

Example / Case Study

My last few blog posts on ODM have covered most of the APIs for building and transferring models. We will be using the same data set in these posts. The following code uses the same data and models to illustrate how we can use the DBMS_DATA_MINING.APPLY procedure to perform a bulk scoring of data.

In my previous post we used the EXPORT and IMPORT procedures to move a model from one database (Test) to another database (Production). The following examples uses the model in Production to score new data. I have setup a sample of data (NEW_DATA_TO_SCORE) from the SH schema using the same set of attributes as was used to create the model (MINING_DATA_BUILD_V). This data set contains 1500 records.

SQL> desc NEW_DATA_TO_SCORE
Name                                 Null?    Type
------------------------------------ -------- ------------
CUST_ID                              NOT NULL NUMBER
CUST_GENDER                          NOT NULL CHAR(1)
AGE                                           NUMBER
CUST_MARITAL_STATUS                           VARCHAR2(20)
COUNTRY_NAME                         NOT NULL VARCHAR2(40)
CUST_INCOME_LEVEL                             VARCHAR2(30)
EDUCATION                                     VARCHAR2(21)
OCCUPATION                                    VARCHAR2(21)
HOUSEHOLD_SIZE                                VARCHAR2(21)
YRS_RESIDENCE                                 NUMBER
AFFINITY_CARD                                 NUMBER(10)
BULK_PACK_DISKETTES                           NUMBER(10)
FLAT_PANEL_MONITOR                            NUMBER(10)
HOME_THEATER_PACKAGE                          NUMBER(10)
BOOKKEEPING_APPLICATION                       NUMBER(10)
PRINTER_SUPPLIES                              NUMBER(10)
Y_BOX_GAMES                                   NUMBER(10)
OS_DOC_SET_KANJI                              NUMBER(10)

SQL> select count(*) from new_data_to_score;

COUNT(*)
----------
1500

The next step is to run the the DBMS_DATA_MINING.APPLY procedure. The parameters that we need to feed into this procedure are

Parameter Name	Description
Model_Name	CLAS_DECISION_TREE -- we imported this model from our test database
Data_Table_Name	NEW_DATA_TO_SCORE
Case_Id_Column_Name	CUST_ID -- this is the PK
Result_Table_Name	NEW_DATA_SCORED -- new table that will be created that contains the Prediction and Probability.

The NEW_DATA_SCORED table will contain 2 records for each record in the source data (NEW_DATA_TO_SCORE). For each record in NEW_DATA_TO_SCORE we will have one record for the each of the Target Values (O or 1) and the probability for each target value. So for our NEW_DATA_TO_SCORE, which contains 1,500 records, we will get 3,000 records in the NEW_DATA_SCORED table.

To apply the model to the new data we run:

BEGIN
dbms_data_mining.apply(
model_name => 'CLAS_DECISION_TREE',
data_table_name => 'NEW_DATA_TO_SCORE',
case_id_column_name => 'CUST_ID',
result_table_name => 'NEW_DATA_SCORED');
END;
/

This takes 1 second to run on my laptop, so this apply/scoring of new data is really quick.

The new table NEW_DATA_SCORED has the following description

SQL> desc NEW_DATA_SCORED
Name                            Null?    Type
------------------------------- -------- -------
CUST_ID                         NOT NULL NUMBER
PREDICTION                               NUMBER
PROBABILITY                              NUMBER

SQL> select count(*) from NEW_DATA_SCORED;

COUNT(*)
----------
3000

We can now look at the prediction and the probabilities

SQL> select * from NEW_DATA_SCORED where rownum <=12;

   CUST_ID PREDICTION PROBABILITY
---------- ---------- -----------
    103001          0           1
    103001          1           0
    103002          0 .956521739
    103002          1 .043478261
    103003          0 .673387097
    103003          1 .326612903
    103004          0 .673387097
    103004          1 .326612903
    103005          1 .767241379
    103005          0 .232758621
    103006          0           1
    103006          1           0

12 rows selected.

Brendan Tierney - Oralytics Blog

Pages

Monday, November 21, 2011

Applying an ODM Model to new data in Oracle – Part 1

No comments:

Post a Comment