Thursday, June 30, 2016

googleVis R package for creating google charts in R

I've recently come across the 'googleVis' R package. This allows you to create a variety of typical, standard charts in R, but with the look and feel of the charts you see on a number of different Google sites.

I won't bore you with lots of examples in this post, but I'll point you to a good tutorial on the various charts.

Here is the link to the mini-tutorial.

Before you can use the package you will need to install it. The simplest way is to run the following in your R session.

> install.packages("googleVis")

Depending on your version of R, you may need to upgrade R before the package will install.

Here is a selection of some of the charts you can create, and there are many, many more.

NewImage

Some of you might be familiar with the presentations that Hans Rosling gives. Some of the same technology is behind these bubble charts from Google, as Google bought the software years ago. Hans typically uses a data set that consists of GDP, Population and Life Expectancy for countries around the world. You can use this same data set too, as it is available from rdatamarket. The following R code will extract this data set to your local R session, and you can then use it as input to the various chart functions in googleVis.

install.packages("rdatamarket")
library(rdatamarket)
dminit(NULL)

# Pull in life expectancy and population data
life_expectancy <- dmlist("15r2!hrp")
population <- dmlist("1cfl!r3d")

# Pull in the yearly GDP for each country
gdp <- dmlist("15c9!hd1")

# Load in the plyr package
library("plyr")

# Rename the Value for each dataset
names(gdp)[3] <- "GDP"

# Use plyr to join your three data frames into one: development 
gdp_life_exp <- join(gdp, life_expectancy)
names(gdp_life_exp)[4] <- "LifeExpectancy"
development <- join(gdp_life_exp, population)
names(development)[5] <- "Population"

Here is an example of the bubble chart using this data set.

NewImage

There are a few restrictions when using this package. All the results are displayed in a web browser, so you need to make sure that this is possible. Some of the charts require Flash. Again, you need to make sure you have the latest version, and you may have restrictions in your workplace on using it.

Thursday, June 23, 2016

Cluster Sets using SQL with Oracle Data Mining - Part 3

This is the third blog post in my series on examining the clusters that are predicted by an Oracle Data Mining model. Check out the previous blog posts.

In the previous posts we were able to list the predicted cluster for each record in our data set. This is the cluster that the records belonged to the most. I also mentioned that a record could belong to many clusters.

So how can you list all the clusters that a record belongs to?

You can use the CLUSTER_SET SQL function. This function returns an array consisting of the set of all clusters that the record belongs to, giving the cluster id and a probability measure for each cluster.

The following example illustrates how to use the CLUSTER_SET function for a particular cluster model.

SELECT t.customer_id, s.cluster_id, s.probability
FROM   (select customer_id, cluster_set(clus_km_1_37 USING *) as Cluster_Set
        from   insur_cust_ltv_sample 
        WHERE  customer_id in ('CU13386', 'CU100')) T,
      TABLE(T.cluster_set) S
order by t.customer_id, s.probability desc; 

The output from this query will be an ordered data set based on the customer id and then the clusters listed in descending order of probability. The cluster with the highest probability is what would be returned by the CLUSTER_ID function. The output from the above query is shown below.

NewImage

If you would like to see the details of each of the clusters and to examine the differences between these clusters then you will need to use the CLUSTER_DETAILS function (see previous blog post).

You can specify topN and cutoff to limit the number of clusters returned by the function. By default, both topN and cutoff are null and all clusters are returned.

- topN is the N most probable clusters. If multiple clusters share the Nth probability, then the function chooses one of them.

- cutoff is a probability threshold. Only clusters with probability greater than or equal to cutoff are returned. To filter by cutoff only, specify NULL for topN.

You may want to use these individually or combined together if you have a large number of customers. To return up to the N most probable clusters that are greater than or equal to cutoff, specify both topN and cutoff.

The following example illustrates using the topN value to return the top 4 clusters.

SELECT t.customer_id, s.cluster_id, s.probability
FROM   (select customer_id, cluster_set(clus_km_1_37, 4, null USING *) as Cluster_Set
        from   insur_cust_ltv_sample 
        WHERE  customer_id in ('CU13386', 'CU100')) T,
      TABLE(T.cluster_set) S
order by t.customer_id, s.probability desc;

The output from this query shows only 4 clusters for each record.

NewImage

Alternatively you can select the clusters based on a cut off value for the probability. In the following example this is set to 0.05.

SELECT t.customer_id, s.cluster_id, s.probability
FROM   (select customer_id, cluster_set(clus_km_1_37, NULL, 0.05 USING *) as Cluster_Set
        from   insur_cust_ltv_sample 
        WHERE  customer_id in ('CU13386', 'CU100')) T,
      TABLE(T.cluster_set) S
order by t.customer_id, s.probability desc;

The output this time looks a bit different.

NewImage

Finally, yes you can combine these two parameters to work together.

SELECT t.customer_id, s.cluster_id, s.probability
FROM   (select customer_id, cluster_set(clus_km_1_37, 2, 0.05 USING *) as Cluster_Set
        from   insur_cust_ltv_sample 
        WHERE  customer_id in ('CU13386', 'CU100')) T,
      TABLE(T.cluster_set) S
order by t.customer_id, s.probability desc;
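
If you have a large number of customers, you may prefer a summary to a full listing. The following is a minimal sketch, assuming the same model and table as above, that counts how many clusters each customer has at or above the 0.05 cutoff. The column aliases num_clusters and top_probability are just names I made up for this illustration.

```sql
-- Sketch: one row per customer, counting the clusters returned by CLUSTER_SET
-- with a probability cutoff of 0.05 (topN left as NULL so only the cutoff applies).
SELECT t.customer_id,
       COUNT(*)           AS num_clusters,     -- illustrative alias
       MAX(s.probability) AS top_probability   -- illustrative alias
FROM   (SELECT customer_id,
               cluster_set(clus_km_1_37, NULL, 0.05 USING *) AS cluster_set
        FROM   insur_cust_ltv_sample) t,
       TABLE(t.cluster_set) s
GROUP BY t.customer_id
ORDER BY num_clusters DESC, t.customer_id;
```

Customers at the top of this list are the ones that sit between several clusters, and these may be worth a closer look.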

Thursday, June 16, 2016

Cluster Details with Oracle Data Mining - Part 2

This is the second blog post in my series on examining the clusters that are predicted by an Oracle Data Mining model for your data. In my previous blog post I showed you how to use the CLUSTER_ID and CLUSTER_PROBABILITY functions. These are the core of what you will use when working with clusters and automating the process.

In this blog post I will look at what details are used by the clustering model to make the prediction. The function that you can use is called CLUSTER_DETAILS. I had an earlier blog post on using PREDICTION_DETAILS to see some of the details that are produced when performing classification.

CLUSTER_DETAILS returns the cluster details for each row in the selection. The return value is an XML string that describes the attributes of the highest probability cluster.

Here is an example of using the CLUSTER_DETAILS function in a SELECT statement.

select cluster_details(clus_km_1_37, 14 USING *) as Cluster_Details
from   insur_cust_ltv_sample 
where  customer_id = 'CU13386';

The output is an XML string and the easiest way to view this is in SQL Developer. It will list the top 5 highest weighted attributes for the cluster centroid.

NewImage NewImage

The returned attributes are ordered by weight. The weight of an attribute expresses its positive or negative impact on cluster assignment. A positive weight indicates an increased likelihood of assignment. A negative weight indicates a decreased likelihood of assignment. By default, CLUSTER_DETAILS returns the attributes with the highest positive weights in descending order.
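
As far as I understand the function syntax, you can also pass a topN value after the cluster id and add an ordering keyword (DESC, ASC or ABS) to change which attributes are returned. Treat the following as a sketch and check the SQL Language Reference for your database version. The topN of 3 and the use of ABS are choices I made for this illustration only.

```sql
-- Sketch: return the 3 attributes with the largest weights for cluster 14,
-- whether those weights are positive or negative (ABS).
SELECT cluster_details(clus_km_1_37, 14, 3 ABS USING *) AS cluster_details
FROM   insur_cust_ltv_sample
WHERE  customer_id = 'CU13386';
```

Using ABS can be handy when you want to see the attributes that count most strongly against a cluster assignment as well as those that count for it.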

Tuesday, June 7, 2016

Examining predicted Clusters and Cluster details using SQL

In a previous blog post I gave some details of how you can examine some of the details behind a prediction made using a classification model. This seemed to spark a lot of interest. But before I come back to looking at classification prediction details and other information, this blog post is the first in a 4 part series on examining the details of clusters, as identified by a cluster model created using Oracle Data Mining.

The 4 blog posts will consist of:

  • 1 - (this blog post) will look at how to determine the predicted cluster and cluster probability for your record.
  • 2 - will show you how to examine the details used to predict the cluster.
  • 3 - A record could belong to many clusters. In this blog post we will look at how you can determine what clusters a record can belong to.
  • 4 - Cluster distance is a measure of how far the record is from the cluster centroid. As a data point or record can belong to many clusters, it can be useful to know the distances as you can build logic to perform different actions based on the cluster distances and cluster probabilities.

Right. Let's have a look at the first set of these cluster functions. These are CLUSTER_ID and CLUSTER_PROBABILITY.

CLUSTER_ID : Returns the number of the cluster that the record most closely belongs to. This is measured by the cluster distance to the centroid of the cluster. A data point or record can belong to, or be part of, many clusters. So the CLUSTER_ID is the cluster number that the data point or record most closely belongs to.

CLUSTER_PROBABILITY : Is a probability measure of the likelihood that the data point or record belongs to a cluster. The cluster with the highest probability score is the cluster that is returned by the CLUSTER_ID function.

Now let us have a quick look at the SQL for these two functions. This first query returns the cluster number that each record most strongly belongs to.

SELECT customer_id, 
       cluster_id(clus_km_1_37 USING *) as Cluster_Id
FROM   insur_cust_ltv_sample
WHERE  customer_id in ('CU13386', 'CU6607', 'CU100');

NewImage

Now let us add in the cluster probability function.

SELECT customer_id, 
       cluster_id(clus_km_1_37 USING *) as Cluster_Id,
       cluster_probability(clus_km_1_37 USING *) as cluster_Prob       
FROM   insur_cust_ltv_sample
WHERE  customer_id in ('CU13386', 'CU6607', 'CU100');

NewImage
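
CLUSTER_PROBABILITY can also take an optional cluster id as a second parameter, in which case it returns the probability of the record belonging to that particular cluster instead of the most likely one. Here is a small sketch of this. The cluster id of 14 and the column alias are assumptions for illustration, so swap in a cluster id that exists in your own model.

```sql
-- Sketch: probability of each record belonging to one named cluster
-- (cluster 14 is an assumed cluster id), alongside its most likely cluster.
SELECT customer_id,
       cluster_id(clus_km_1_37 USING *)              AS cluster_id,
       cluster_probability(clus_km_1_37, 14 USING *) AS cluster_14_prob
FROM   insur_cust_ltv_sample
WHERE  customer_id IN ('CU13386', 'CU6607', 'CU100');
```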

These functions give us some insight into what the cluster predictive model is doing. In the remaining blog posts in this series I will look at how you can delve deeper into the predictions that the cluster algorithm is making.