Friday, January 30, 2015

Evaluating Classification Models in ODM (Part 2)

In a previous blog post I talked about and showed some of the typical statistical methods to evaluate the classification models that you develop. Click to see this (first) blog post.

In this blog post I want to show you how you can go about evaluating your classification models that you develop using Oracle Data Miner (part of SQL Developer).

What I'm not going to show you here is how to develop classification models using Oracle Data Mining :-( I've had several blog posts over the years on this topics. So you can go and search of those posts or alternately this topic is cover in a lot more detail in my Oracle Data Miner book :-)

After you have developed your ODM models in Oracle Data Miner you have 2 levels of details available to you. The first of these is the Compare Test Results. You can find this by right clicking on the Classification node of your ODM Workflow, as showing below.

Viewing the Test Results of all ODM Models

When you select the Compare Test Results a new (worksheet) tab will open. This will display summary statistics and graphics for the summary statistics for each Oracle Data Ming model created. In the following image an ODM model was created for each In-Database Classification algorithm in the Oracle Database.

Blog odm test results 2

Here we get to see 2 of the statistical measures that I talked about in my previous blog post, the (average) Accuracy and the Overall Accuracy. We can look at and examine this in a bit more detail in a minute. A new measure that I haven't mentioned before is the Predictive Confidence.

The Predictive Confidence measure provides an estimate of the overall goodness of the model. Predictive Confidence is a number between 0 and 1. Data Miner displays Predictive Confidence as a percent.

  • If Predictive Confidence=0, then it indicates that the predictions of the model are no better than the predictions made by using the naive model.
  • If Predictive Confidence=1, then it indicates that the predictions are perfect.
  • If Predictive Confidence=0.5, then it indicates that the model has cut the error of a naive model by 50%./li>

So the higher the value for Predictive Confidence the better the model. Particularly when it is higher than 50%.

After evaluation these summary statistical measures you will want to drill down on these to see the lower level statistical measures, for example you will want to see the confusion matrix and the corresponding statistical measures. To view the confusion matrix all you need to do is to click on the Performance Matrix tab. Before you can really start evaluating the models you will need to click on the Display drop down and select 'Show Detail' from the drop down list. Another thing you will need to do is to click/check the 'Show totals and codes' check box on the lower part of the screen. This will give you some of the statistical measures that I outlined in my previous blog post.

Blog odm test results 3

When you examine the statistical measures displayed on the screen you will notice that some of the statistical measures I outlined in my previous blog post are missing. Some of these missing measures are ones that you will want to consider and use as part of your evaluation of you ODM models.

So what how do you find out what these missing statistical measures are? Well ODM does not display these so the only real option open to you is to go and calculate them yourself :-( This is not ideal but these are relatively easy to calculate and you can do this on a piece of paper or you can open your spreadsheet software and let it calculate them for you (once you have defined to formula for each). Here is an example of the completed/extended confusion matrix based on the results from the CLAS_SVM_1_59 model shown in the above image.

Blog odm test results 4

In my next blog post I will look at how you can evaluate a classification model that was developed using the in-database Oracle Data Mining algorithms (Oracle Data Miner GUI was not used). The evaluation criteria that I will show will be based on the statistical methods that I highlighted in my first blog post on this topic.

Tuesday, January 20, 2015

Evaluating Classification Results

When you are working on building classification models you will need some ways of measuring the effectiveness of each model that you will build. This measurement/evaluation is perform during the model build process.

Typically the model build process consists of 2 steps (I'm assuming all data preparation etc has been completed:

  • Build the model: During this step you will feed in a portion of your data set to the data mining algorithm. Typical this data will be a subset of your data set and will typically consist of 60% to 70% of the data. This data is used to by the data mining algorithm to build the model.
  • Test the model: After the model has been built you will need to test the model to see how efficient it is at making the predictions. This is where we use the data that was not used to build the model. For this data we already know the outcome. So after we have applied the model to this data subset we can measure the predicted values against the actual values.

Most of the data mining tools will automate these two steps, specifically the splitting the data into the build and test data sets. But if you are using a language like R, etc then you will need to manually perform these steps.

The most common way of collating the test results is to use the Confusion Matrix. This allows us to layout the correct predictions, the incorrect predictions and to perform a number of other statistical measurements.

True Positives

True Negatives

False Positives

False Negatives

The last two of the above values are also commonly referred to in statistics as Type 1 (false positive) and Type 2 (false negative) errors.

Depending on your project you will concentrate on a combination of the true and false values of either the Positives or the negatives.

For example, in Medical Diagnostics for cancer, you will be looking to keep the False Negatives to a minimum. This is where you have predicted someone does not have cancer, but actually does. The consequence of this is that the person is not brought back for addition testing and we all know what will happen. On the other hand it is OK to have a hight False Positive in this case. In this scenario you bring the person back for additional tests and discover that they are all clear :-)

Precision = How many of the selected items are relevant? (as a percentage)

Recall = How many of the relevant items are selected? (as a percentage)

Accuracy = How many did we correctly predict? (as a percentage)

The following table illustrates these measurements and tests.

Confusion Matrix

There are lots of other statistical tests that can be performed on your results. Everyone will have their own preferences. What I have highlighted here are the main statistical test for you to look at.

You cannot use one or a few of the statistical tests to make a decision on what data mining model works best for your data. It is a combination of these statistical test, your understanding of the data and you understanding of the business project that need to be considered.

In my next 2 blog posts I will show you how you can perform these tests on the results generated by the Oracle Data Miner tool and then on the Oracle Data Miner models produced using PL/SQL.

Friday, January 16, 2015

Pulling Large Database tables in R

As the volume of the data in your tables grows, particularly in the big data world, you may run into some memory issues or package restrictions with pulling down the tables to your R environment.

Some of the R packages and drivers have some recommended numbers or limits for the number of records that can be fetched.

Caveate: My laptop is a Mac and at this point in time the ROracle package is unavailable for a Mac. It is for Windows, Solaris and AIX.

In the following example I'm looking at downloading a table with 300K records from an Oracle Database. I've already setup my DB connection using the Oracle JDBC driver. But when I run the following command I get an error.

> res<-dbSendQuery(jdbcConnection, "select * from my_large_table")

> dbFetch(res)

Error in .jcall(rp, "I", "fetch", stride) :

    java.lang.OutOfMemoryError: Java heap space

I also get a similar error if I run the following command.

> train_data <- dbReadTable(jdbcConnection, "MY_LARGE_TABLE")

How can you pull down a large table in R? So that you are not restricted to memory restrictions or limits on the number of records.

One way to do this is to loop through the data, pull the records down in chunks (a certain fetch size), put these into an array, and then merge them all together into a data frame. The following code illustrates how to do this.

> res<-dbSendQuery(jdbcConnection, "select * from my_large_table")

> dbFetch(res)

> rm(result)

> result<-list()

> i=1

> result[[i]]<-dbFetch(res,n=1000)

> while(nrow(chunk <- dbFetch(res, n = 1000))>0){

+     i<-i+1

+     result[[i]]<-chunk

+ }

> train_data<-do.call(rbind,result)

The above code runs surprisingly quickly, generate no errors and I now have all the data I need in my R environment.

The fetch size in the above example is set to 1000. This is a bit small really and is only set to that for illustration purposes here. You will need to play with this size to find out what size works best for your environment.

As with all programming languages and with R too there can be many different ways of performing the same thing.

Saturday, January 10, 2015

Creating a Word Cloud of Oracle's OAA webpages in R

The following is not something new but something that I have put together this evening, and I mainly make the following available as a note to myself and what I did. If you find it useful or interesting then you are more than welcome to use and share. You will also find lots of similar solutions on the web.

This evening I was playing around the the Text Mining (tm) package in R. So I decided to create a Word Cloud of the Advanced Analytics webpages on Oracle.com. These webpages contain the Overview webpage for the Advanced Analytics webpage, the Oracle Data Mining webpages and the Oracle R Enterprise webpages.

I've broken the R code into a number of sections.

1. Setup

The first thing that you need to do is to install four R packages these are "tm", "wordcloud" , "Curl" and "XML". The first two of these packages are needed for the main part of the Text processing and generating the word cloud. The last two of these packages are needed by the function "htmlToText". You can download the htmlToText function on github.

install.packages (c ( "tm", "wordcloud", "RCurl", "XML", "SnowballC")) # install 'tm'' package

library (tm)

library (wordcloud)

library (SnowballC)

# load htmlToText

source("/Users/brendan.tierney/htmltotext.R")

2. Read in the Oracle Advanced Analytics webpages using the htmlToText function

data1 <- htmlToText("http://www.oracle.com/technetwork/database/options/advanced-analytics/overview/index.html")

data2 <- htmlToText("http://www.oracle.com/technetwork/database/options/advanced-analytics/odm/index.html")

data3 <- htmlToText("http://www.oracle.com/technetwork/database/database-technologies/r/r-technologies/overview/index.html")

data4 <- htmlToText("http://www.oracle.com/technetwork/database/database-technologies/r/r-enterprise/overview/index.html")

You will need to combine each of these webpages into one for processing in later steps.

data <- c(data1, data2)

data <- c(data, data3)

data <- c(data, data4)

3. Convert into a Corpus and perfom Data Cleaning & Transformations

To convert our web documents into a Corpus.

txt_corpus <- Corpus (VectorSource (data)) # create a corpus

We can use the summary function to get some of the details of the Corpus. We can see that we have 4 documents in the corpus.

> summary(txt_corpus)

A corpus with 4 text documents

The metadata consists of 2 tag-value pairs and a data frame

Available tags are:

    create_date creator

Available variables in the data frame are:

    MetaID

Remove the White Space in these documents

   tm_map <- tm_map (txt_corpus, stripWhitespace) # remove white space

Remove the Punctuations from the documents

   tm_map <- tm_map (tm_map, removePunctuation) # remove punctuations

Remove number from the documents

   tm_map <- tm_map (tm_map, removeNumbers) # to remove numbers

Remove the typical list of Stop Words

   tm_map <- tm_map (tm_map, removeWords, stopwords("english")) # to remove stop words(like ‘as’ ‘the’ etc….)

Apply stemming to the documents

If needed you can also apply stemming on your data. I decided to not perform this as it seemed to trunc some of the words in the word cloud.

  # tm_map <- tm_map (tm_map, stemDocument)

If you do want to perform stemming then just remove the # symbol.

Remove any addition words (would could add other words to this list)

   tm_map <- tm_map (tm_map, removeWords, c("work", "use", "java", "new", "support"))

If you want to have a look at the output of each of the above commands you can use the inspect function.

   inspect(tm_map)

4. Convert into a Text Document Matrix and Sort

   Matrix <- TermDocumentMatrix(tm_map) # terms in rows

   matrix_c <- as.matrix (Matrix)

   freq <- sort (rowSums (matrix_c)) # frequency data


   freq #to view the words and their frequencies

5. Generate the Word Cloud

   tmdata <- data.frame (words=names(freq), freq)

   wordcloud (tmdata$words, tmdata$freq, max.words=100, min.freq=3, scale=c(7,.5), random.order=FALSE, colors=brewer.pal(8, "Dark2"))

and the World Clould will look something like the following. Everything you generate the Word Cloud you will get a slightly different layout of the words.

OAA Word Cloud

Monday, December 22, 2014

2014 A review of the year as an ACED

As 2014 draws to a close I working on finishing off a number of tasks and projects. One of these tasks is an annual one for me. The task is to list all the things I've done as an Oracle ACE Director. If has been a very busy year, not just with ACE activities but also work wise too. That will explain why I have been a bit quiet on the blogging side of things in recent months.

In 2014 I one major highlight. It was the publication of my book Predictive Analytics using Oracle Data Miner by Oracle Press. Many thanks for everyone involved in writing this book, especially my family and the people in Oracle Press who gave me the opportunity.

Here is my summary.

Conferences

  • January : BIWA Summit : 2 presentations (San Francisco, USA) **

  • March : OUG Ireland Conference (Dublin, Ireland)

  • April : OUG Norway : 2 presentations (Oslo, Norway) **

  • June : OUG Finland : 2 presentations (Helsinki, Finland) **

  • June : Oracle EMEA Data Warehousing Global Leaders Forum (Dublin, Ireland)

  • August : OUG Panama : 2 presentations (Panama City, Panama) **

  • August : OUG Costa Rica : 3 presentations (San Jose, Costa Rica) **

  • August : OUG Mexico : 2 presentations (Mexico City, Mexico) **

  • September : Oracle Open World (San Francisco, USA) **

  • December : UKOUG Tech15 : 2 presentations (Liverpool, UK)

  • December : UKOUG Apps15 (Liverpool, UK)

That is 19 hours of presenting this year.

** Many thanks to the Oracle ACE Director programme for funding the flights and hotels for these conferences. All other expenses and conferences I paid for out of my own pocket.

My ODM Book

On the 8th August my book titled Predictive Analytics using Oracle Data Miner, was published by Oracle Press. It all began 12 months and 2 weeks previously. I had the book written and the technical edits done by the middle of February (2014). Between March and June the Copy edits and layouts where completed. The book is ideal for any data scientist, Oracle developer, Oracle architect and Oracle DBA, who want to use the in-databse data mining functionality. That way they can use and build upon their existing SQL and PL/SQL skills to perform predictive analytics.

The book is available on Amazon and comes in Print and eBook formats

Book Cover

Oracle Open World

This year I got to go to my second ACE Director briefing. This is held on the Thursday and Friday before OOW. At the briefing we get lots of Very senior people coming in to tell us what is happening with the products in their area and what the plans are over the next 12 to 18 months. Lots of what we are told is all under NDA. My favourite part of this briefing is when Thomas Kurian comes in and talks for about 90 minutes. No slides, no notes. The first 15 minutes is him telling us what Larry & Co are going to announce at OOW, that are the main product directions etc. Then he opens to the floor for questions. You can ask him anything about the set of Oracle products (>3000) and he will explain in detail what is happen. He even commented on the plans for the Oracle Games Console this year!!!

This year I had the opportunity to present at OOW again. It was a joint presentation with Roel Hartman and we had the pleasure of being one of the first presentations at OOW at 8:30am on the Sunday morning. Despite the early start we have really good turn out for our presentation.

Then I got to enjoy OOW with all the various activities, presentations, entertainment and hanging out at the OTN lounge with the other ACEs and ACEDs.

Blog Posts

One of the things I really like doing is playing with various features of Oracle and then writing some blog posts about them. Most of what I blog about evolves around the SQL & PL/SQL Statistics functions and the Advanced Analytics Option, comprising Oracle Data Mining and Oracle R Enterprise. In addition to these blog posts I also have posts relating to various Oracle User Group activities. So there is a good mixture of material on the blog.

In 2014 I have written 60 blog posts (including this one). This number is a little be less than previous years and perhaps the main reason for that is due to me being extremely busy with various project work this year.

OTN Articles

OTN has accepted three articles from me in 2014. I was delighted about these acceptances and I'm looking forward to writing some more articles in 2015 for them.

  • Sentiment Analysis using Oracle Data Miner
  • ROracle : How to get Started and Commands you need to Know
  • Predictive Queries in 12c

I have a few more ideas for articles and I will be writing these in 2015. We will have to wait and see if OTN will accept them.

My Oracle Magazine Collection & Reviews

You may or may not be aware that I've been collecting Oracle Magazine for over 20 years now. I have nearly the entire collection of Oracle Magazine going back to the very first edition. Check out the collection here. You will see that I'm missing a few and these are highlighted by the grey boxes. If you do have any of these and you would like to donate to my collection then please get in touch.

One of the things I like to blog about is on some of these old Oracle Magazines. If you go to my Oracle Magazine collection page you will see the past editions that I have writing a review of. Click on the links to view the blog post review an edition.

In 2014 I have written reviews of the following:

OUG Activities

The Oracle User Group in Ireland (OUG Ireland) has continued to grow this year in membership but also with the number of attendees at our events. In March of each year we have our flagship event which is our annual conference. This year we have almost 300 people and unfortunately people had to turned away at the door because we had headed the maximum limit on the number of attendees for the event. Planing has already commenced for 2015 and the call for presentations is now open. Hopefully 2015 will be bigger and better that 2014. We had a second day at the conference this year where we had Tom Kyte give a full day seminar. Again this was fully booked out for weeks/months before hand. In March 2015 we will be having a second day of the conference with Maria Colgan giving a one day workshop/seminar on the In-Memory option and the Optimiser. You cannot book your place on this seminar yet but then it does open make sure to book your place quick as I'm sure it will book out very quickly.

We also had a number of TECH and BI SIGs and the number of attendees has significantly grown over the past few years. This is fantastic and hopefully this will continue. If it does then maybe we might be able to put on more SIG events.

In the editor of Oracle Scene Magazine which is published by the UKOUG. This was my first full year as editor after spending many years as deputy editor. In 2014 we have published three editions of Oracle Scene and I would like to thanks everyone who has submitted an article. You have helped grow the quality of the contents and also grow the readership numbers. The calls for articles for the Spring edition is now open.

My Oracle Data Science newsletter & My Oracle User Group Weekly newsletter

A couple of years ago I set up a news aggregator based on Twitter feeds and on updates from certain websites. I've divided these into two different newsletters. The first is My Oracle Data Science News and as you might guess it is focused on the worlds of Data Science, Predictive Analytics and related developments with a bit of a focus towards the Oracle world. This newsletter gets published each day.

My second newsletter is focused on Oracle User Group activities around the World and is again based on the various Twitter handles of the Oracle User Groups. I'm include over 40 OUG Twitter handles in the aggregator so I should be picking up almost everyone. If you discover your OUG is not being included then drop me an email and I'll add you to the list. This newsletter goes out every Friday.

Plans for 2015 so far

The start of 2015 is already very busy and I'm already booked for 3 conferences BIWA Summit (CA, USA), OUG Norway and OUG Ireland.

Planing for OUG Ireland is under ways and we are hoping to build on the successes we have had over the past few years.

So as editor of Oracle Scene magazine we are planning for our first issue in 2015, the call for articles is open and we have been busy recruiting authors of articles on specifics.


I'm sure I've forgotten a few things, I usually do.

It has been a fun year. I've made lots of friends around the World and I look forward to meeting you all at some conference in 2015.

Tuesday, December 16, 2014

ODMr 4.1 EA1 Repository Upgrade

If you are downloading the EA1 of SQL Developer that includes Oracle Data Miner (ODMr), and you intend to use Oracle Data Miner then you will need to update the ODMr Repository.

You could do it the hard way and run the upgrade repository sql scripts that are located in the ...\sqldeveloper-4.1.0.17.29-no-jre\sqldeveloper\dataminer\scripts directory.

Or you could do it the easy way and let the inbuilt functionality in Oracle Data Miner do it for you.

To do it the easy way all you need to do is to open the ODMr Connections window and the double click on one of your ODM connections.

ODMr will check the version of the repository you have installed and if needed it will prompt you about upgrading the repository. Select Yes and you will be prompted to enter the SYS password. So talk kindly with your DBA for them to enter the password for you. Then click on the Start button. They will lick off the OMDr Repository Upgrade scripts.

NB: Make sure you have a backup of your workflows before you do this. A little think happened to me during the SQL Dev / ODMr 4.0 upgrade back in September 2013 where all my workflows disappeared. You can imagine how happy I was about that. Since then the ODMr team have added some functionality to ensure something like this doesn't happen again. But you never know.

To backup your ODMr workflows use the Export Workflow option.

When the repository upgrade has finished you will get a 'Task Complete Successfully' message in the upgrade window. Click on the close button and away you go with this updated version.

Check out this blog post for details of what is new in ODMr 4.1.

Friday, December 12, 2014

Oracle Data Miner (SQL Dev) 4.1 EA1

A few days ago the first Early Adaptor release of SQL Developer 4.1 (EA1) was made available. You can go ahead and download it from here and make sure to check out the blog post by Jeff Smith on some install and setup that is required around the latest version of Java.

I've been using SQL Developer since its very first release, so getting my hands on a new release is very exciting. There are lots and lots of new features in the tool. Again check out the blog posts by Jeff Smith and Kris Rice on some of these new features. I really like the new DBA screens :-) But this screen really needs some scroll bars and not everything fits on my screen. So Jeff and Kris if you are reading this, can you add some scroll bars.

Sqldev4 1

In addition they have been working on "new" SQL*Plus that is called SDSQL. This is a new command line tool that is supposed to be bigger and better than SQL*Plus but still gives us a command line tool to run our scripts and demos. To download and install the tool go to here.

As you know I'm a bit of an Oracle Data Miner/Mining fan. There are now new in-database features, but there are a lot of new features in the GUI tool (aka ODMr) along with some improvements and bug fixes. Here is a list of the ODMr 4.1 EA1 new and updated features (taken from the ODMr Help in SQL Dev)

JSON Data Support for Oracle Database 12.1.0.2 and above

In response to the growing popularity of JSON data and its use in Big Data configurations, Data Miner now provides an easy to use JSON Query node. The JSON Query node allows you to select and aggregate JSON data without entering any SQL commands. The JSON Query node opens up using all of the existing Data Miner features with JSON data. The enhancements include:

Data Source Node

Automatically identifies columns containing JSON data by identifying those with the IS_JSON constraint.

Generates JSON schema for any selected column that contain JSON data.

Imports a JSON schema for a given column.

JSON schema viewer.

Create Table Node

Ability to select a column to be typed as JSON.

Generates JSON schema in the same manner as the Data Source node.

JSON Data Type

Columns can be specifically typed as JSON data.

JSON Query Node

Ability to utilize any of the selection and aggregation features without having to enter SQL commands.

Ability to select data from a graphical layout of the JSON schema, making data selection as easy as it is with scalar relational data columns.

Ability to partially select JSON data as standard relational scalar data while leaving other parts of the same JSON document as JSON data.

Ability to aggregate JSON data in combination with relational data. Includes the Sub-Group By option, used to generate nested data that can be passed into mining model build nodes.

General Improvements

Improved database session management resulting in less database sessions being generated and a more responsive user interface.

Filter Columns Node

Combined primary Editor and associated advanced panel to improve usability.

Explore Data Node

Allows multiple row selection to provide group chart display.

Classification Build Node

Automatically filters out rows where the Target column contains NULLs or all Spaces. Also, issues a warning to user but continues with Model build.

Workflow

Enhanced workflows to ensure that Loading, Reloading, Stopping, Saving operations no longer block the UI.

Online Help

Revised the Online Help to adhere to topic-based framework.

Selected Bug Fixes (does not include 4.0 patch release fixes)

GLM Model Algorithm Settings: Added GLM feature identification sampling option (Oracle Database 12.1 and above).

Filter Rows Node: Custom Expression Editor not showing all possible available columns.

WebEx Display Issues: Fixed problems affecting the display of the Data Miner UI through WebEx conferencing.


Denny Wong of the ODM team in Oracle has made available a tutorial on importing JSON data for use with ODMr. Check it out here.

I've been told there will be a couple of tutorials on the new features coming out (from the ODMr team) over the next few weeks. So keep an eye out of these.


Check out my blog post on what you need to do to get started/using ODMr 4.1 EA1.

Friday, December 5, 2014

UKOUG 2015 Conferences

The UKOUG annual conferences commence on Sunday 7th December and run until Wednesday 10th.

Like previous years there are two conferences, one called TECH15 and the other is called APPS15. You might guess what each conference is about!!.

This year these conferences are being held at the same time and in the same venue. But they are separate conferences!.

This year I've been very lucky (or very unlucky) to have 3 presentations at these conferences. Two of these will on part of the TECH15 conference and one will be part of the APPS15 conference.

Just in case you interested in what I'm presenting about and you might want to attend them, here is the list with the room numbers.

Monday

10:30-11:20 : Oracle Advanced Analytics in Oracle Fusion Apps & Beyond (Apps) (Room : Ex1)

11:30-12:20 : Predictive Queries in Oracle 12c (TECH) (Room : Hall 6)

Wednesday

11:30-12:20 : What are they thinking? With APEX and Oracle Data Miner. (TECH) (Room : Ex4)

(this is a joint presentation with Roel Hartman)

Yes on the Monday I have 2 back-to-back presentation with a 10 minute gap to get from one side of the conference centre to the other side :-( I'm not looking forward to that transition, but I'm sure it will be fine.

Friday, November 14, 2014

OUG Ireland 2015 : Now open for Submissions

OUG Ireland Call for submissions is now open.

The closing date for submissions is 5th January, 2015.

and the submission webpage can be found here.

Ougire15 hp cfp v2

The OUG Ireland conference will be on Thursday 19th March. Yes it is only a one day conference :-( but we will be 5 or 6 or more streams. So there will be something for everyone and plenty of choice.

On Friday 20th March we will have Maria Colgan, formally the Optimizer Lady and now the In-Memory Queen (or something like that), giving a full day workshop on the In-Database option and the Optimizer. She will also be about for the main conference on the 19th, so you can expect a presentation or two from her on the Thursday.

Agenda selection day is the 8th January, 2015. So hopefully you will be getting the acceptance emails soon after that or during week of 12th January.

There is a committee of about 10 people who are involved in selecting presentations and setting the agenda. If it was up to me then I would accept everything/everyone. So if your presentation is not accepted this time, please don't blame me :-) I said YES to your presentation, I really, really did. I fought so hard to have your presentation included. If your presentation is not accepted then the blame is down to the other committee members :-)

The conference will be held in Croke Park, and is a 15-20 minute taxi ride from the Airport.

You can follow the Conference and other OUG Ireland events using the twitter tag #oug_ire

Wednesday, November 12, 2014

Approximate Count Distinct (12.1.0.2 new feature)

With the release of the Oracle Database 12.1.0.2 there was a number of new features and options. Most of the publicity has been around the in-Memory option. But there was lots of other features for the DBA and a few for the developer.

One of the new SQL functions is the APPROX_COUNT_DISTINCT(). This function is different to the tradition count distinct, COUNT(DISTINCT expression), in that is performs an approximate count distinct. The theory is that this approximate count is a lot more efficient than performing the full count distinct.

The APPROX_COUNT_DISTINCT() function is really only suitable when you are processing very large volumes of data and when the data set contains a large number of distinct values.

The general syntax of the function is:

... APPROX_COUNT_DISTINCT(expression) ...

and returns a Number.

The function returns the approximate number of records that contain distinct value for the expression.

SELECT approx_count_distinct(cust_id)

FROM mining_data_build_v;

The APPROX_COUNT_DISTINCT() function ignores records that contain a null value for the expression. Plus is performs less work on the sorting and aggregations. Just run and Explain Plan and you can see the differences.

In some of the material from Oracle the APPROX_COUNT_DISTINCT() function can be 5x to 50x++ times faster. But it depends on the number of distinct values and the complexity of the SQL query.

As the result / returned value from the function may not be 100% accurate, Oracle says that the functions has an accuracy of >97% (with 95% confidence).

The function cannot be used on the following data types: BFILE, BLOB, CLOB, LONG, LONG RAW and NCLOB

Friday, November 7, 2014

ODMr : Graph Node: Zooming in on Graphs

When Oracle Data Miner (ODMr) 4.0 (which is part of SQL Developer) came out back in late 2013 there was a number of new features added to the tool. One of these was a Graph node that allows us to create various graphs and charts that include Line, Scatter, Bar, Histogram and Box plot.

I've been using this node recently to produce graphs and particularly scatter plots. I've been using the scatter plots to graph the Actual values in a data set against the Predicted values that was generated by ODMr. In this scenario I had a separate data set for training my ODM data mining models and another testing data set for, well testing how well the model performed against an unseen data set.

In general the graphs produced by the Graph node look good and gives you the information that you need. But what I found was that as you increased the size of the data set, the scatter plot can look a messy. This was in part due to the size of the square used to represent a data point. As the volume of data increased then your scatter plot could just look like a coloured in area of blue squares. This is illustrated in the following image.

Graph node 1

What I discovered today is that you can zoom in on this graph to explore different regions and data point on it. This do this you need to select an data that is within the x-axis and y-axis area. When you do this you will see a box form on your graph that selects the area that you indicate by moving your mouse. After you have finished selecting the area, the Graph Node will zooms into this part of the graph and shows the data points. For example if I select the area from about 1000 on the x-axis and 1000 on the y-axis, I will get the following.

Graph node 2

Again if I select a similar are area of 350 on the x-axis and 400 on the y-axis I get the following zoomed area.

Graph node 3

You can keep zooming in on various areas.

At some point you will have finished zooming in and you will want to return to the original graph. To zoom back outward all you need to do in the graph is to click on it. When you do this you will go back to the previous step or image of the graph. You can keep doing this until you get back to the original graph. Alternatively you can zoom in and out on various parts of the graph.

Hopefully you will find this feature useful.

Wednesday, October 29, 2014

Something new in 12c: FETCH FIRST x ROWS

In this post I want to show some example of using a new feature in 12c for selecting the first X number of records from the results set of a query.

See the bottom of this post for the background and some of the reasons for this post.

Before we had the 12c Database if we only wanted to see a subset or the initial set of records from the results of a query we could add something like the following to our query

...

AND ROWNUM <= 5;

The could use the pseudo column ROWNUM to restrict the number of records that would be displayed. This was particularly useful when the results many 10s, 100s, or millions of records. It allowed us to quickly see a subset and to see if the results where what we expected.

In my book (Predictive Analytics Using Oracle Data Miner) I had lots of examples of using ROWNUM.

What I wasn't aware of when I was writing my book was that there was a new way of doing this in 12c. We now have something like the following:

...

FETCH FIRST x ROWS ONLY;

There is an example:

SELECT * FROM mining_data_build_v

FETCH FIRST 10 ROWS ONLY;

Fetch first 1

There are a number of different ways you can use the row limiting feature. Here is the syntax for it:

[ OFFSET offset { ROW | ROWS } ]

[ FETCH { FIRST | NEXT } [ { rowcount | percent PERCENT } ]

{ ROW | ROWS } { ONLY | WITH TIES } ]

In most cases you will probably use the number of rows. But there many be cases where you might what to use the PERCENT. In previous versions of the database you would have used SAMPLE to bring back a certain percentage of records.

select CUST_GENDER from mining_data_build_v

FETCH FIRST 2 PERCENT ROWS ONLY;

This will set the first 2 percent of the records.

You can also decide from what point in the result set you want the records to be displayed from. In the previous examples above the results displayed will befing with the first records. In the following example the results set will be processed to record 60 and then the first 5 records will be selected and displayed. This will be records 61, 62, 63, 64 and 65. So the first record processed will be the OFFSET record + 1.

select CUST_GENDER from mining_data_build_v

OFFSET 60 ROWS FETCH FIRST 5 ROWS ONLY;

Similar to the PERCENT example above you can use the OFFSET value, for example.

select CUST_GENDER from mining_data_build_v

OFFSET 60 ROWS FETCH FIRST 2 PERCENT ROWS ONLY;

This query will go to records 61 and return the next 2 percent of the records.


The background to this post

There are a number of reasons that I really love attending Oracle User Group conferences. One of the challenges I set myself is to go to presentations on topics that I think I know or know very well. I can list many, many reasons for this but there are 2 main points. The first is that you are getting someone elses perspective on the topic and hence you might learn something new or understand it better. The second is that you might actually learn something new, like some new command, parameter setting or something else like that.

At Oracle Open World recently I attended the EMEA 12 things about 12c set of presentations that Debra Lilly arranged during the User Group Forum on the Sunday. During these session Alex Nuijten gave an overview of some 12c new SQL features. One of these was the command FETCH FIRST x ROWS. This blog post illustrates some of the different ways of using this command.