Brendan Tierney - Oralytics Blog: January 2015

Friday, January 30, 2015

Evaluating Classification Models in ODM (Part 2)

In a previous blog post I talked about and showed some of the typical statistical methods to evaluate the classification models that you develop. Click to see this (first) blog post.

In this blog post I want to show you how you can go about evaluating your classification models that you develop using Oracle Data Miner (part of SQL Developer).

What I'm not going to show you here is how to develop classification models using Oracle Data Mining :-( I've had several blog posts over the years on this topics. So you can go and search of those posts or alternately this topic is cover in a lot more detail in my Oracle Data Miner book :-)

After you have developed your ODM models in Oracle Data Miner you have 2 levels of details available to you. The first of these is the Compare Test Results. You can find this by right clicking on the Classification node of your ODM Workflow, as showing below.

Viewing the Test Results of all ODM Models

When you select the Compare Test Results a new (worksheet) tab will open. This will display summary statistics and graphics for the summary statistics for each Oracle Data Ming model created. In the following image an ODM model was created for each In-Database Classification algorithm in the Oracle Database.

Blog odm test results 2

Here we get to see 2 of the statistical measures that I talked about in my previous blog post, the (average) Accuracy and the Overall Accuracy. We can look at and examine this in a bit more detail in a minute. A new measure that I haven't mentioned before is the Predictive Confidence.

The Predictive Confidence measure provides an estimate of the overall goodness of the model. Predictive Confidence is a number between 0 and 1. Data Miner displays Predictive Confidence as a percent.

If Predictive Confidence=0, then it indicates that the predictions of the model are no better than the predictions made by using the naive model.
If Predictive Confidence=1, then it indicates that the predictions are perfect.
If Predictive Confidence=0.5, then it indicates that the model has cut the error of a naive model by 50%./li>

So the higher the value for Predictive Confidence the better the model. Particularly when it is higher than 50%.

After evaluation these summary statistical measures you will want to drill down on these to see the lower level statistical measures, for example you will want to see the confusion matrix and the corresponding statistical measures. To view the confusion matrix all you need to do is to click on the Performance Matrix tab. Before you can really start evaluating the models you will need to click on the Display drop down and select 'Show Detail' from the drop down list. Another thing you will need to do is to click/check the 'Show totals and codes' check box on the lower part of the screen. This will give you some of the statistical measures that I outlined in my previous blog post.

Blog odm test results 3

When you examine the statistical measures displayed on the screen you will notice that some of the statistical measures I outlined in my previous blog post are missing. Some of these missing measures are ones that you will want to consider and use as part of your evaluation of you ODM models.

So what how do you find out what these missing statistical measures are? Well ODM does not display these so the only real option open to you is to go and calculate them yourself :-( This is not ideal but these are relatively easy to calculate and you can do this on a piece of paper or you can open your spreadsheet software and let it calculate them for you (once you have defined to formula for each). Here is an example of the completed/extended confusion matrix based on the results from the CLAS_SVM_1_59 model shown in the above image.

In my next blog post I will look at how you can evaluate a classification model that was developed using the in-database Oracle Data Mining algorithms (Oracle Data Miner GUI was not used). The evaluation criteria that I will show will be based on the statistical methods that I highlighted in my first blog post on this topic.

Tuesday, January 20, 2015

Evaluating Classification Results

When you are working on building classification models you will need some ways of measuring the effectiveness of each model that you will build. This measurement/evaluation is perform during the model build process.

Typically the model build process consists of 2 steps (I'm assuming all data preparation etc has been completed:

Build the model: During this step you will feed in a portion of your data set to the data mining algorithm. Typical this data will be a subset of your data set and will typically consist of 60% to 70% of the data. This data is used to by the data mining algorithm to build the model.
Test the model: After the model has been built you will need to test the model to see how efficient it is at making the predictions. This is where we use the data that was not used to build the model. For this data we already know the outcome. So after we have applied the model to this data subset we can measure the predicted values against the actual values.

Most of the data mining tools will automate these two steps, specifically the splitting the data into the build and test data sets. But if you are using a language like R, etc then you will need to manually perform these steps.

The most common way of collating the test results is to use the Confusion Matrix. This allows us to layout the correct predictions, the incorrect predictions and to perform a number of other statistical measurements.

True Positives

True Negatives

False Positives

False Negatives

The last two of the above values are also commonly referred to in statistics as Type 1 (false positive) and Type 2 (false negative) errors.

Depending on your project you will concentrate on a combination of the true and false values of either the Positives or the negatives.

For example, in Medical Diagnostics for cancer, you will be looking to keep the False Negatives to a minimum. This is where you have predicted someone does not have cancer, but actually does. The consequence of this is that the person is not brought back for addition testing and we all know what will happen. On the other hand it is OK to have a hight False Positive in this case. In this scenario you bring the person back for additional tests and discover that they are all clear :-)

Precision = How many of the selected items are relevant? (as a percentage)

Recall = How many of the relevant items are selected? (as a percentage)

Accuracy = How many did we correctly predict? (as a percentage)

The following table illustrates these measurements and tests.

There are lots of other statistical tests that can be performed on your results. Everyone will have their own preferences. What I have highlighted here are the main statistical test for you to look at.

You cannot use one or a few of the statistical tests to make a decision on what data mining model works best for your data. It is a combination of these statistical test, your understanding of the data and you understanding of the business project that need to be considered.

In my next 2 blog posts I will show you how you can perform these tests on the results generated by the Oracle Data Miner tool and then on the Oracle Data Miner models produced using PL/SQL.

Friday, January 16, 2015

Pulling Large Database tables in R

As the volume of the data in your tables grows, particularly in the big data world, you may run into some memory issues or package restrictions with pulling down the tables to your R environment.

Some of the R packages and drivers have some recommended numbers or limits for the number of records that can be fetched.

Caveate: My laptop is a Mac and at this point in time the ROracle package is unavailable for a Mac. It is for Windows, Solaris and AIX.

In the following example I'm looking at downloading a table with 300K records from an Oracle Database. I've already setup my DB connection using the Oracle JDBC driver. But when I run the following command I get an error.

> res<-dbSendQuery(jdbcConnection, "select * from my_large_table")

> dbFetch(res)

Error in .jcall(rp, "I", "fetch", stride) :

java.lang.OutOfMemoryError: Java heap space

I also get a similar error if I run the following command.

> train_data <- dbReadTable(jdbcConnection, "MY_LARGE_TABLE")

How can you pull down a large table in R? So that you are not restricted to memory restrictions or limits on the number of records.

One way to do this is to loop through the data, pull the records down in chunks (a certain fetch size), put these into an array, and then merge them all together into a data frame. The following code illustrates how to do this.

> res<-dbSendQuery(jdbcConnection, "select * from my_large_table")

> dbFetch(res)

> rm(result)

> result<-list()

> i=1

> result[[i]]<-dbFetch(res,n=1000)

> while(nrow(chunk <- dbFetch(res, n = 1000))>0){

+ i<-i+1

+ result[[i]]<-chunk

+ }

> train_data<-do.call(rbind,result)

The above code runs surprisingly quickly, generate no errors and I now have all the data I need in my R environment.

The fetch size in the above example is set to 1000. This is a bit small really and is only set to that for illustration purposes here. You will need to play with this size to find out what size works best for your environment.

As with all programming languages and with R too there can be many different ways of performing the same thing.

Saturday, January 10, 2015

Creating a Word Cloud of Oracle's OAA webpages in R

The following is not something new but something that I have put together this evening, and I mainly make the following available as a note to myself and what I did. If you find it useful or interesting then you are more than welcome to use and share. You will also find lots of similar solutions on the web.

This evening I was playing around the the Text Mining (tm) package in R. So I decided to create a Word Cloud of the Advanced Analytics webpages on Oracle.com. These webpages contain the Overview webpage for the Advanced Analytics webpage, the Oracle Data Mining webpages and the Oracle R Enterprise webpages.

I've broken the R code into a number of sections.

1. Setup

The first thing that you need to do is to install four R packages these are "tm", "wordcloud" , "Curl" and "XML". The first two of these packages are needed for the main part of the Text processing and generating the word cloud. The last two of these packages are needed by the function "htmlToText". You can download the htmlToText function on github.

install.packages (c ( "tm", "wordcloud", "RCurl", "XML", "SnowballC")) # install 'tm'' package

library (tm)

library (wordcloud)

library (SnowballC)

# load htmlToText

source("/Users/brendan.tierney/htmltotext.R")

2. Read in the Oracle Advanced Analytics webpages using the htmlToText function

data1 <- htmlToText("http://www.oracle.com/technetwork/database/options/advanced-analytics/overview/index.html")

data2 <- htmlToText("http://www.oracle.com/technetwork/database/options/advanced-analytics/odm/index.html")

data3 <- htmlToText("http://www.oracle.com/technetwork/database/database-technologies/r/r-technologies/overview/index.html")

data4 <- htmlToText("http://www.oracle.com/technetwork/database/database-technologies/r/r-enterprise/overview/index.html")

You will need to combine each of these webpages into one for processing in later steps.

data <- c(data1, data2)

data <- c(data, data3)

data <- c(data, data4)

3. Convert into a Corpus and perfom Data Cleaning & Transformations

To convert our web documents into a Corpus.

txt_corpus <- Corpus (VectorSource (data)) # create a corpus

We can use the summary function to get some of the details of the Corpus. We can see that we have 4 documents in the corpus.

> summary(txt_corpus)

A corpus with 4 text documents

The metadata consists of 2 tag-value pairs and a data frame

Available tags are:

create_date creator

Available variables in the data frame are:

MetaID

Remove the White Space in these documents

tm_map <- tm_map (txt_corpus, stripWhitespace) # remove white space

Remove the Punctuations from the documents

tm_map <- tm_map (tm_map, removePunctuation) # remove punctuations

Remove number from the documents

tm_map <- tm_map (tm_map, removeNumbers) # to remove numbers

Remove the typical list of Stop Words

tm_map <- tm_map (tm_map, removeWords, stopwords("english")) # to remove stop words(like ‘as’ ‘the’ etc….)

Apply stemming to the documents

If needed you can also apply stemming on your data. I decided to not perform this as it seemed to trunc some of the words in the word cloud.

# tm_map <- tm_map (tm_map, stemDocument)

If you do want to perform stemming then just remove the # symbol.

Remove any addition words (would could add other words to this list)

tm_map <- tm_map (tm_map, removeWords, c("work", "use", "java", "new", "support"))

If you want to have a look at the output of each of the above commands you can use the inspect function.

inspect(tm_map)

4. Convert into a Text Document Matrix and Sort

Matrix <- TermDocumentMatrix(tm_map) # terms in rows

matrix_c <- as.matrix (Matrix)

freq <- sort (rowSums (matrix_c)) # frequency data

freq #to view the words and their frequencies

5. Generate the Word Cloud

tmdata <- data.frame (words=names(freq), freq)

wordcloud (tmdata$words, tmdata$freq, max.words=100, min.freq=3, scale=c(7,.5), random.order=FALSE, colors=brewer.pal(8, "Dark2"))

and the World Clould will look something like the following. Everything you generate the Word Cloud you will get a slightly different layout of the words.

Pages