Wednesday, February 13, 2013

Clustering in ODM–Part 1

This is the first part of a five-part series of blog posts on building and using Clustering in Oracle Data Miner (ODM). The following outlines the contents of each post in the series.

  1. In this first post we will look at what clustering features exist in ODM and how to set up the data that we will be using in the examples.
  2. The second post will focus on how to build Clusters and examine the clusters produced in ODM.
  3. The third post will focus on applying the Clusters to new data using ODM.
  4. The fourth post will look at how you can build and evaluate a Clustering model using the ODM SQL and PL/SQL functions.
  5. The fifth and final post will look at how you can apply your Clustering model to new data using the ODM SQL and PL/SQL functions.

Clustering is an unsupervised technique designed to find groupings of related data, where the members of a group are more similar to each other than they are to the members of other groups. Typically clustering is used in customer segmentation analysis to try to better understand what types of customers you have.

As with all data mining techniques, Clustering will not give you some magic insight into your data. Instead it gives you more information for you to interpret and to add business meaning to. With Clustering you can explore the data that forms each cluster to understand what it really means.

The Clusters given by Oracle Data Miner are just patterns that it has found in the data.

Oracle has two Clustering algorithms:

K-Means : Oracle Data Miner runs an enhanced version of the typical k-means algorithm. ODM builds models in a hierarchical manner, using a top-down approach with binary splits and a refinement of all nodes at the end. The centroids of the inner nodes in the hierarchy are updated to reflect changes as the tree grows. The tree grows one node at a time. The node with the largest variance is split to increase the size of the tree until the desired number of clusters is reached.

O-Cluster : O-Cluster (Orthogonal Partitioning Clustering) creates a hierarchical, grid-based clustering model. It operates recursively, generating a hierarchical structure. The resulting clusters define dense areas.
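
To make the top-down k-means growth described above a little more concrete, here is a minimal Python sketch of the general idea: start with one node, repeatedly split the node with the largest variance in two, and stop when the desired number of clusters is reached. This is only an illustration and not Oracle's implementation. The function names are made up for this example, the way the two starting centres are chosen is an assumption, and the final refinement of all nodes is left out.

```python
from statistics import mean


def centroid(points):
    """Component-wise mean of a list of equal-length numeric tuples."""
    return tuple(mean(axis) for axis in zip(*points))


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def variance(points):
    """Mean squared distance of the points to their centroid."""
    c = centroid(points)
    return mean(sq_dist(p, c) for p in points)


def split_in_two(points, iterations=10):
    """Binary split of one node using a plain 2-means pass.

    Assumption for this sketch: the starting centres are the point
    farthest from the centroid and the point farthest from that one.
    """
    c = centroid(points)
    first = max(points, key=lambda p: sq_dist(p, c))
    second = max(points, key=lambda p: sq_dist(p, first))
    centers = [first, second]
    for _ in range(iterations):
        left = [p for p in points
                if sq_dist(p, centers[0]) <= sq_dist(p, centers[1])]
        right = [p for p in points
                 if sq_dist(p, centers[0]) > sq_dist(p, centers[1])]
        if not left or not right:
            break  # guard: never compute the centroid of an empty group
        centers = [centroid(left), centroid(right)]
    return left, right


def grow_clusters(points, k):
    """Grow the tree one split at a time until there are k leaf clusters."""
    leaves = [list(points)]
    while len(leaves) < k:
        widest = max(leaves, key=variance)  # node with the largest variance
        if variance(widest) == 0:
            break  # every remaining node holds identical points
        leaves.remove(widest)
        leaves.extend(split_in_two(widest))
    return leaves


if __name__ == "__main__":
    # Made-up two-dimensional points, purely for illustration.
    data = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
            (5.0, 5.0), (5.1, 4.9), (4.8, 5.2),
            (9.0, 1.0), (9.2, 0.9), (8.9, 1.2)]
    for cluster in grow_clusters(data, 3):
        print(centroid(cluster), len(cluster))
```

Only the leaf nodes are kept here. A fuller version would also keep the inner nodes of the tree, together with their centroids, so that the hierarchy can be explored afterwards.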

The Data Set for our Clustering examples

I’m going to use a data set that is available on OTN (somewhere) and has been used for demos in the versions of ODM prior to the 11gR2 version (SQL Developer 3). It has gone by many names but the table name we are going to use is INSURANCE_CUST_LTV.

The file is in CSV format and we will use the Import feature in SQL Developer to import it.

1. In the schema you are using for Oracle Data Miner, right click Tables in the Connections tab. The Import option will appear on the menu. Select this.

2. Go to the directory where you saved the file, select it and then click on the Open button.

3. Set the file Format to ‘Delimited’ and the Delimiter to ‘|’.

4. In the next step give the table name as INSURANCE_CUST_LTV.

5. In the next step select all the Attributes. This should be the default. Click Next.

6. In Step 4 of the wizard you can set the data type for each attribute. The simplest way is to set the following character attributes to VARCHAR2(50):

CUSTOMER_ID, LAST, FIRST, STATE, REGION, SEX, PROFESSION, BUY_INSURANCE (set the size of this one to 3), MARITAL_STATUS, LTV_BIN

Set all the number attributes (all the others) to NUMBER without any precision or scale.

7. Click the Next button and then the Finish button. SQL Developer will now load 15,342 records into the INSURANCE_CUST_LTV table, with no errors (hopefully!)
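
If you would like to check the file outside of SQL Developer, the settings used in the steps above (a ‘|’ delimiter, character columns versus number columns) can be sketched in a few lines of Python. This is only a rough sketch under some assumptions: it assumes the file has a header line with the column names, the sample rows and the AGE and LTV columns are made up for illustration, and the function name is hypothetical.

```python
import csv
import io

# Columns the import steps treat as character data; all others are numbers.
TEXT_COLUMNS = {"CUSTOMER_ID", "LAST", "FIRST", "STATE", "REGION", "SEX",
                "PROFESSION", "BUY_INSURANCE", "MARITAL_STATUS", "LTV_BIN"}

# Made-up rows for illustration only; AGE and LTV are assumed column names.
SAMPLE = """CUSTOMER_ID|LAST|FIRST|STATE|AGE|LTV
CU1|DOE|JANE|CA|41|20000.5
CU2|ROE|JOHN|NY|35|
"""


def read_pipe_delimited(handle):
    """Read '|'-delimited rows with a header line, converting number columns."""
    rows = []
    for record in csv.DictReader(handle, delimiter="|"):
        row = {}
        for name, value in record.items():
            if name in TEXT_COLUMNS:
                row[name] = value  # keep character data as text
            else:
                # Empty fields become None, similar to a NULL in the table.
                row[name] = float(value) if value else None
        rows.append(row)
    return rows


if __name__ == "__main__":
    rows = read_pipe_delimited(io.StringIO(SAMPLE))
    print(len(rows), "rows read")
```

To try this on the real file, pass an opened file handle instead of the sample text, for example `open(path, newline="")` where `path` is wherever you saved the CSV, and compare the number of rows read with the 15,342 records that SQL Developer reports.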

 

We are now ready to start our work with the Clustering algorithms in ODM.

In the next blog post we will look at exploring the data, building our Clustering models and examining the clusters that were produced by ODM.

2 comments:

  1. Hi Brendan,

    Firstly, thanks for this blog. It is one of the useful blogs that I follow.

    I have a problem I've searched and couldn't find any way to solve. I want to split my data to build and test. I'll use Naive Bayes algorithm on my build data, and I'll apply it on my test data.

    I easily split my data "TableName->Right Click->Transforms->Split" with ODM 11g R1. It creates build and test tables under Tables. But I want to do this with ODM 11g R2. Is there any way to do this?

    Could you please help?

    Kind Regards,
    Busra

  2. Hi Busra

    So you are using the Classification Node to build a Naive Bayes model.

    When you create the Classification Node and connect it to the data source, you can set the split percentage using the Property Inspector. This should appear below the Workflow editor pane/window. The default for the split will be 60% for build and 40% for testing. If you do not change this value then ODM will use the defaults. Alternatively, you can change the split value.

    To change the split value, go to the Property Inspector pane. Click on the Test tab and you can change the percentage there.

    If you want to use a different table for building your model and a different table of data for testing your model, check out the following blog post that explains how to do this.

    Regards
    Brendan
