Showing posts with label oraclebigdata. Show all posts

Thursday, July 9, 2015

Oracle Architect's Guides to Big Data

Over the past couple of years we have had a lot of information about Big Data presented to us. But one thing that still stands out is the confusion about what Big Data actually is. Depending on who you are talking to, you will get a different definition and interpretation of what Big Data is and what you can do with it.

For example there is one company I know of who are talking about their Big Data project. For them this involves processing approx. 1 million records. That is Big for them. For others that is tiny.

Oracle has recently put together a series of articles that talk about what architectural changes are needed to your technical infrastructure to support Big Data. In this case it is more about the volume of data than the different types of data, although the latter is also covered by the architecture that Oracle gives.

As part of the Oracle Enterprise Architecture section of the Oracle website, they have put together a series of articles on how you can include Big Data within your Enterprise Information Architecture.

These are a good read and a great place to get a better understanding of what you need to be considering as you move to an architecture that includes Big Data.

image

Monday, June 2, 2014

ore.parallel

In ORE there are a number of ways to get your R scripts to run in parallel in the database. One way is to enable the Parallel option in ORE, and that is what will be shown in this post. There are other methods of running various ORE commands/scripts in parallel. With these, the scripts are divided out and several parallel R processes are started on the server.

But what if you want to use the database parallel feature on some of your other ORE commands?

Why would you want to do this?

Well, the main answer is that you might want to use the parallel option of the database for the creation of objects (tables etc.) and for selecting and manipulating the data in the database.

How can you enable your ORE connection to use the in-database parallel feature?

ORE 1.4 has a new option that enables the parallel option for your ORE connection in the database. This option is called ore.parallel.

When you enable or set the ore.parallel option, it seems to be the equivalent of running the following:

ALTER SESSION ENABLE PARALLEL DDL;

ALTER SESSION ENABLE PARALLEL DML;

ALTER SESSION ENABLE PARALLEL QUERY;

The exact details are a little unclear, but it seems to be equivalent to the above commands.

The following commands illustrate some options for using the ore.parallel option.

> #

> # Check to see if the ore.parallel is enabled for your ORE connection

> options("ore.parallel")

$ore.parallel

NULL

The returned value of NULL tells us that your ORE connection does not have the Parallel option enabled. If the schema had Parallel enabled by default then we would have a response of TRUE.

The following command turns on the Parallel option for your ORE connection / schema.

> options("ore.parallel" = TRUE)

> options("ore.parallel")

$ore.parallel

[1] TRUE

When the Parallel option is enabled (TRUE above) the database will use the degree of parallelism that is set as default for the schema, or the degree of parallelism that is defined for the table when it is being used in your ORE commands.

You can change the degree of parallelism by passing the required degree as the value of the ore.parallel option. In the following, the degree of parallelism is set to 8. We then ask ORE what the degree is set to and it tells us that it is 8, so it was set correctly.

> options("ore.parallel" = 8)

> options("ore.parallel")

$ore.parallel

[1] 8

Thursday, March 27, 2014

Oracle BigDataLite version 2.5.1 is now available

Back at the end of January Oracle finally got round to releasing the updated version of the Oracle BigDataLite virtual machine. Check out my previous blog post on this.

Yesterday (27th March) I saw on Facebook that a new updated version of the BigDataLite VM was released. I must have missed the tweet and other publicity on this somewhere :-(

This is a great VM that allows you to play with the various Big Data technologies without the hassle of going through the whole install and configuration thing.

If you are interested in this then here are the details of what it contains and where you can find more details.

The following components are included on Oracle Big Data Lite Virtual Machine v 2.5:

Oracle Enterprise Linux 6.4

Oracle Database 12c Release 1 Enterprise Edition (12.1.0.1)

Cloudera’s Distribution including Apache Hadoop (CDH4.6)

Cloudera Manager 4.8.2

Cloudera Enterprise Technology, including:

   Cloudera RTQ (Impala 1.2.3)

   Cloudera RTS (Search 1.2)

Oracle Big Data Connectors 2.5

   Oracle SQL Connector for HDFS 2.3.0

   Oracle Loader for Hadoop 2.3.1

   Oracle Data Integrator 11g

   Oracle R Advanced Analytics for Hadoop 2.3.1

   Oracle XQuery for Hadoop 2.4.0

Oracle NoSQL Database Enterprise Edition 12cR1 (2.1.54)

Oracle JDeveloper 11g

Oracle SQL Developer 4.0

Oracle Data Integrator 12cR1

Oracle R Distribution 3.0.1


Go to the Oracle Big Data Lite Virtual Machine landing page on OTN to download the latest release.

Thursday, November 29, 2012

Association Rules in ODM-Part 3

This is the third part of a four part blog post on building and using Association Rules in Oracle Data Miner. The following outlines the contents of each post in the series on Association Rules

  1. This first part will focus on how to build an Association Rule model
  2. The second post will be on examining the Association Rules produced by ODM
  3. The third post will focus on using the Association Rules on your data – This blog post
  4. The final post will look at how you can do some of the above steps using the ODM SQL and PL/SQL functions.

In my previous posts I showed how you can go about setting up for Association Rule analysis in Oracle Data Miner and how to examine the rules that are generated.

This post will focus on how we can extract and use these rules in Oracle Data Miner.

Step 1 – Model Details

Association Rules are an unsupervised method of data mining. In Oracle Data Miner we cannot use the Apply node to score new data. What we have to do instead is generate the Model Details, which can then be used in turn.

The Model Details node is used when we do unsupervised learning to extract the rules that are generated.

To do this we need to click on the Model Details node in the Models section of the Component Palette and then click on our workspace, just to the right of the Association Rule node.

The Edit Model Selection window will open. Connect the Association Rule node to the Model Details node. Then Run the node. This will then generate the Association Rules in a format that we can reuse.

image

When you get the small green tick on the Model Details node you can then view what was generated.

Right click on the Model Details node and click on View Details from the menu.

image

The output is similar to what we would have seen under the Association Rule node with the addition of a few more attributes that include the schema name and model name.

We can order the rules based on the Confidence level by double clicking on the Confidence column header. You might need to do this twice to get the rules to appear in descending order of confidence.

At this point we can now look at persisting the Association Rules. See step 2 below.

We can also view the SQL that was used to generate the Association Rules that we see in the Model Details node. While still viewing the rules, click on the SQL tab.

image

Step 2 – Persisting the Association Rules

To make the rules persist and be useable outside of ODM we can persist the Association Rules in a table. The first step to do this is to create a new Table Node. This can be found under the Data section of the Component Palette. Click this Create Table or View node in the component palette and then click on the workspace, just to the right of the Model Details node.

Connect the Model Details node to the Output node by right clicking on the Model Details node, selecting Connect from the menu and then clicking on the Output node.

We can now edit the format of the Output i.e. specify what attributes are to be in our Output table. Double click on the Output node or right click and select Edit from the menu. We now get the Edit Create Table or View Node window.

image

We can give the output a meaningful name e.g. AR_OUTPUT_RULES. We can also specify what rule properties we want to export to attributes in our table.

We will need to un-tick the Auto Input Columns Selection tick box before we can remove any of the output attributes. In my case I only want to have ANTECEDENT_ITEMS, CONSEQUENT_ITEMS, ID, LENGTH, CONFIDENCE and SUPPORT in my output. So I need to select and highlight all the other attributes (holding down the Control key). After selecting all the attributes I do not want included in the final output table, I need to click on the red X icon.

image

When complete click on the OK button to go back to the workflow.

To generate the table right click on the AR_OUTPUT_RULES node and select Run from the menu. When you get the green tick mark on the AR_OUTPUT_RULES node the table has been created with records containing the details of each rule.

image

To view the contents of the AR_OUTPUT_RULES table we can right click on this node and select View Data from the menu.

image

We can now use these rules in our applications.
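As a rough illustration of what "using these rules in our applications" could look like, here is a minimal Python sketch. It assumes the persisted rules have already been read out of the table into simple in-memory records; the `Rule` type, the `recommend` function, and the confidence and support numbers are all invented for the example, and the real column format in AR_OUTPUT_RULES may well differ.

```python
from typing import FrozenSet, List, NamedTuple, Set


class Rule(NamedTuple):
    """One persisted association rule (hypothetical, simplified shape)."""
    antecedent: FrozenSet[int]  # product ids that must all be in the basket
    consequent: int             # product id the rule suggests
    confidence: float
    support: float


# Invented sample rows standing in for the contents of the rules table.
RULES = [
    Rule(frozenset({114, 118, 115}), 119, 0.97, 0.02),
    Rule(frozenset({137, 143, 128}), 144, 0.71, 0.03),
    Rule(frozenset({114}), 115, 0.55, 0.10),
]


def recommend(basket: Set[int], rules: List[Rule]) -> List[int]:
    """Return products suggested by matching rules, strongest rule first."""
    # A rule matches when every antecedent item is already in the basket
    # and the suggested product is not.
    matches = [r for r in rules
               if r.antecedent <= basket and r.consequent not in basket]
    matches.sort(key=lambda r: r.confidence, reverse=True)
    suggestions: List[int] = []
    for rule in matches:
        if rule.consequent not in suggestions:
            suggestions.append(rule.consequent)
    return suggestions


print(recommend({114, 115, 118}, RULES))
```

Ranking by confidence is only one possible choice; an application could just as easily rank by support or lift.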

 

Check out the next post in the series (Part 4) where we will look at the functionality available in the ODM SQL & PL/SQL functions to perform Association Rule analysis.

Tuesday, November 27, 2012

Association Rules in ODM–Part 2

This is the second part of a four part blog post on building and using Association Rules in Oracle Data Miner. The following outlines the contents of each post in the series on Association Rules

  1. This first part will focus on how to build an Association Rule model
  2. The second post will be on examining the Association Rules produced by ODM – This blog post
  3. The third post will focus on using the Association Rules on your data.
  4. The final post will look at how you can do some of the above steps using the ODM SQL and PL/SQL functions. 

In the previous post I looked at the steps needed to set up a data source and to set up the Association Rule node. When everything was set up we ran the workflow.

Step 1 – Viewing the Model

When the workflow has finished running we will have the green tick marks on each node. This is where we left things at the end of the previous post (Part 1). To view the model details, right click on the Association Rule node and select View Models from the menu.

image

There are 3 main concepts that are important in relation to Association Rules:

  • Support: is the proportion of transactions in the data set that contain the item set i.e. how often the rule occurs relative to all transactions
  • Confidence: is the proportion of the occurrences of the antecedent that result in the consequent e.g. how many times do we get C when we have A and B  {A, B} => C
  • Lift: indicates the strength of a rule over the random co-occurrence of the antecedent and the consequent

Support and Confidence are the primary measures that are used to assess the usefulness of an association rule.
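To make the three measures concrete, here is a small Python sketch that computes them for one rule over a toy set of transactions. The transactions and the function names are made up for illustration; this is not how ODM computes the values internally, just the textbook definitions written out.

```python
# Toy transactions: each set is the items bought together in one basket.
transactions = [
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "C"},
    {"B", "C"},
    {"A", "B", "C"},
]


def support(itemset, transactions):
    """Proportion of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Of the transactions containing the antecedent, the share that also
    contain the consequent."""
    both = support(antecedent | consequent, transactions)
    return both / support(antecedent, transactions)


def lift(antecedent, consequent, transactions):
    """Confidence relative to how often the consequent occurs anyway."""
    return (confidence(antecedent, consequent, transactions)
            / support(consequent, transactions))


# Rule {A, B} => {C}
ante, cons = {"A", "B"}, {"C"}
print("support   ", support(ante | cons, transactions))
print("confidence", confidence(ante, cons, transactions))
print("lift      ", lift(ante, cons, transactions))
```

In this toy data the lift comes out below 1, which says that C is slightly less likely when A and B are present than it is overall, even though the confidence looks respectable on its own.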

In our example we can see that the antecedent and the consequent have numbers separated by the word AND. These numbers correspond to the product numbers.

Step 2 – Examining the Model Rules

To read the antecedent and the consequent for the first rule in our example we have:

Antecedent: 137 AND 143 AND 128

Consequent: 144

To read this association rule we would say that if a Customer bought product 137 and product 143 and product 128, then they will also buy product 144, with a Confidence value of almost 71%. This is a strong association.

We can check the ordering of the rules by changing the Sort By criteria. As Confidence and Support are the main ways to evaluate the rules, we can change the Sort By criteria to be Confidence. Then click on the Query button to refresh the rules section.

image

Here we get a list of the strongest rules listed in descending order.

Below the section of the screen that has the Rules, we have the Rule Details section.

image 

Here we can see that the rule gets formatted into an IF statement. The first rule in the list has a confidence of almost 97%. As it is a simple IF statement it can be easily implemented in our applications.

We can use the information that these rules provide in a number of ways. One such consequence of these rules is that we can look at improving the ordering and distribution of these products to ensure that we have sufficient numbers of each. Another consequence is that we can enhance the front end selling mechanism to make sure that if a customer is buying product 114, 118 and 115 then we can remind the customer of product 119. We can also ensure that all these products are not located beside each other, so that the customer will have to walk past many other products in order to find them. That is why we never see milk and bread beside each other in a grocery store.

Step 3 – Applying Filters to the Model Rules

In the previous step we were able to sort our rules based on some of the measures of our Association Rules and to see how these rules are structured.

Association Rule Analysis can generate many thousands of possible rules for a small data set. In some cases similar rules can appear, and we can have lots of rules that occur so infrequently that they are perhaps meaningless.
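To get a feel for why the number of rules grows so quickly, the sketch below counts how many distinct rules are possible in principle for a given number of items, where a rule is any pair of non-empty, non-overlapping item sets. The closed form 3^d - 2^(d+1) + 1 is the standard counting result; the function names are my own, and the brute-force check is only there to confirm the formula on small inputs. Real tools prune most of these using minimum support and confidence.

```python
from itertools import product


def possible_rules(d: int) -> int:
    """Number of rules X => Y over d items, with X and Y non-empty and disjoint."""
    return 3 ** d - 2 ** (d + 1) + 1


def possible_rules_brute_force(d: int) -> int:
    """Count the same thing by enumeration: each item goes to the
    antecedent (1), the consequent (2) or neither (0)."""
    count = 0
    for assignment in product(range(3), repeat=d):
        if 1 in assignment and 2 in assignment:
            count += 1
    return count


for d in (2, 4, 6, 10):
    print(d, "items ->", possible_rules(d), "possible rules")
```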

ODM provides us with a number of filters that we can apply to the rules that enable us to look for the rules that are of most interest to us. We can access these filters by clicking on the More button, which is located just under the Query button.

We can refine our query on the rules based on the various measures and the number of items in the rule. In addition to this we can also filter based on the values of the items. This is particularly useful if we want to concentrate on specific items (in our example Products). To illustrate this, let us focus on the rules that involve Product 115. Click on the green + symbol on the right hand side of the window. Select 115 from the list provided. Next we need to decide if we want Product 115 involved in the Antecedent or the Consequent. In our example select the Consequent. This is located to the bottom right of the window. Then click the OK button and then click on the Query button to update the list of rules that correspond with the new filter.

image

We can see that we only have rules that have Product 115 in the Consequent column.
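The same kind of filter is easy to express in code once the rules are available outside the tool. The Python sketch below is only an illustration of the idea: the rule records, their values and the `filter_rules` function are invented, and it does not call any ODM API.

```python
# Invented sample rules; each side is a set of product numbers.
rules = [
    {"antecedent": {114, 118}, "consequent": {115}, "confidence": 0.82, "support": 0.04},
    {"antecedent": {137, 143, 128}, "consequent": {144}, "confidence": 0.71, "support": 0.03},
    {"antecedent": {119}, "consequent": {115}, "confidence": 0.40, "support": 0.06},
    {"antecedent": {115, 118}, "consequent": {119}, "confidence": 0.90, "support": 0.02},
]


def filter_rules(rules, item, side="consequent", min_confidence=0.0):
    """Keep rules that mention `item` on the given side ('antecedent' or
    'consequent') and meet the minimum confidence."""
    return [r for r in rules
            if item in r[side] and r["confidence"] >= min_confidence]


# Rules with Product 115 in the consequent, like the filter set up above.
for r in filter_rules(rules, 115, side="consequent"):
    print(sorted(r["antecedent"]), "=>", sorted(r["consequent"]), r["confidence"])
```

Switching `side` to "antecedent" gives the other option offered in the filter window, and `min_confidence` plays the role of the measure-based filters.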

We can also see that we have 134 rules for this scenario out of a total of 20,988 (your results might differ slightly from mine and that’s OK; it really depends on what version of the sample data you are using).

 

Check out the next post in the series (Part 3) where we will look at how you can use the Association Rules produced by ODM.

Saturday, October 20, 2012

Oracle Advanced Analytics Option in Oracle 12c

At Oracle Open World a few weeks ago there were a large number of presentations on Big Data and Analytics. Most of these were marketing type presentations, with a couple of presentations on using R and how it can now be integrated into the Oracle Database 11.2.

In addition to these there was one presentation that focused on the Oracle Advanced Analytics (OAA) Option.

The Oracle Advanced Analytics Option covers the Oracle Data Mining features and the Oracle R Enterprise features in the Database.

The purpose of this blog post is to outline and summarise what was mentioned at these presentations, and will include what changes are/may be coming in the “Next Release” of the database i.e. Oracle 12c.

Health Warning: As with all the presentations at OOW that talked about what may or may not be in the next release, there is no guarantee that the features will actually be in the release version of the database. Here is the slide that gives the Safe Harbor statement.

image

  • 12c will come with R embedded into it. So there will be no need for any configurations.
  • Oracle R client will come as part of the server install.
  • Oracle R client will be able to use the Analytics functions that exist in the database.
  • Will be able to run R code in the database.
  • The database (12c) will be able to spawn multiple R engines.
  • Will be able to emulate map-reduce style algorithms.
  • There will be a new PREDICTION function, replacing the existing (11g) functionality. This will combine a number of steps of building a model and applying it to the data to be scored into one function. But we will still need the functionality of the existing PREDICTION function that is in 11g. So it will be interesting to see how this functionality will be kept in addition to the new functionality being proposed in 12c.
  • The Oracle Data Miner tool will still exist and will have many new features, although it was also referred to as the ‘OAA Workflow’. So does this indicate a potential name change? We will have to wait and see.
  • Oracle Data Miner will come with a new additional graphing feature. This will be in addition to the Explore Node and will allow us to produce more typical attribute related graphs. From what I could see these would be similar to the type of box plot, scatter, bar chart, etc. graphs that you can get from R.
  • There will be a number of new algorithms too, including a useful One Class Support Vector Machine. This can be used when we have a data set with just one class value. This algorithm will work out what records/cases are more important than others.
  • There will be a new SQL node. This will allow us to write our own data transformation code.
  • There will be a new node to allow the calling of R code.
  • The tool also comes with a slightly modified layout and colour scheme.

Again, the points that I have given above are just my observations. They may or may not appear in 12c, or maybe I misunderstood what was being said.

It certainly looks like we will have an integrated analytics environment in 12c with full integration of R and the ODM in-database features.

Wednesday, October 17, 2012

Extracting the rules from an ODM Decision Tree model

One of the most interesting and important aspects of a Decision Tree model is that we as users can get to see what rules the machine learning algorithm has generated for our data.

I’ve given a number of examples in various blog posts over the past few years on how to generate a number of classification models. An example of the workflow is below.

image

In the Class Build node we get four models being generated. These include a Generalised Linear Model, Support Vector Machine, Naive Bayes and a Decision Tree model.

We can explore the Decision Tree model by right clicking on the Class Build Node, selecting View Models and then the Decision Tree model, which will be labelled with a ‘DT’ in the name.

image

As we explore the nodes and branches of the Decision Tree we can see the rule that was generated for a node in the lower pane of the application. So by clicking on each node we get a different rule appearing in this pane.

image

Sometimes there is a need to extract these rules so that they can be presented to a number of different types of users, to explain to them what is going on.

How can we extract the Decision Tree rules?

To do this, you will need to complete the following steps:

  • From the Models section of the Component Palette select the Model Details node.
  • Click on the Workflow pane and the Model Details node will be created
  • Connect the Class Build node to the Model Details node. To do this right click on the Class Build node and select Connect. Then move the mouse to the Model Details node and click. The two nodes should now be connected.
  • Edit the Model Details node, uncheck the Auto Settings, select Model Type to be Decision Tree, Output to be Full Tree and all the columns.

image

  • Run the Model Details node. Right click on the node and select Run. When complete you will have the little green box with a tick mark, on the top right hand corner.
  • To view the details produced, right click on the Model Details node and select View Data
  • The rules for each node will now be displayed. You will need to scroll to the right of this pane to get to the rules and you will need to expand the columns for the rules to see the full details
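For readers who want to see what "a rule per node" means outside the tool, here is a small Python sketch. It walks a hand-built tree and prints one IF/THEN rule for every node, which is roughly the idea behind the Full Tree output. The tree, the attribute names (AGE, HOUSEHOLD_SIZE) and the `extract_rules` function are all made up for illustration and are not taken from an actual ODM model.

```python
from typing import Dict, Iterator, Tuple

# A tiny hypothetical tree. Each node has the split condition that leads to
# it (None for the root), a predicted class, and optional children.
tree: Dict = {
    "condition": None,
    "prediction": 0,
    "children": [
        {"condition": "AGE <= 35", "prediction": 0, "children": []},
        {
            "condition": "AGE > 35",
            "prediction": 1,
            "children": [
                {"condition": "HOUSEHOLD_SIZE <= 2", "prediction": 0, "children": []},
                {"condition": "HOUSEHOLD_SIZE > 2", "prediction": 1, "children": []},
            ],
        },
    ],
}


def extract_rules(node: Dict, path: Tuple[str, ...] = ()) -> Iterator[str]:
    """Yield one rule per node: the conditions on the path from the root,
    joined with AND, followed by the node's prediction."""
    conditions = path + ((node["condition"],) if node["condition"] else ())
    test = " AND ".join(conditions) if conditions else "TRUE"
    yield f"IF {test} THEN prediction = {node['prediction']}"
    for child in node.get("children", []):
        yield from extract_rules(child, conditions)


for rule in extract_rules(tree):
    print(rule)
```

Restricting the output to leaf nodes only would be a one-line change (yield only when a node has no children).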

image

Friday, October 12, 2012

My Presentations on Oracle Advanced Analytics Option

I’ve recently compiled my list of presentations on the Oracle Analytics Option. All these presentations are for a 45 minute period.

I have two versions of the presentation ‘How to do Data Mining in SQL & PL/SQL’: one is for 45 minutes and the second version is for 2 hours.

I have given most of these presentations at conferences or SIGs.

Let me know if you are interested in having one of these presentations at your SIG or conference.

  • Oracle Analytics Option - 12c New Features - available 2013
  • Real-time prediction in SQL & Oracle Analytics Option - Using the 12c PREDICTION function - available 2013
  • How to do Data Mining in SQL & PL/SQL
  • From BIG Data to Small Data and Everything in Between
  • Oracle R Enterprise : How to get started
  • Oracle Analytics Option : R vs Oracle Data Mining
  • Building Predictive Analytics into your Forms Applications
  • Getting Real Business Value from OBIEE and Oracle Data Mining  (This is a cut down and merged version of the following two presentations)
  • Getting Real Business Value from OBIEE and Oracle Data Mining - Part 1 : The Oracle Data Miner part
  • Getting Real Business Value from OBIEE and Oracle Data Mining - Part 2 : The OBIEE part
  • How to Deploy and Use your Oracle Data Miner Models in Production
  • Oracle Analytics Option 101
  • From SQL Programmer to Data Scientist: evolving roles of an Oracle programmer
  • Using an Oracle Data Mining Model in SQL & PL/SQL
  • Getting Started with Oracle Data Mining
  • You don't need a PhD to do Data Mining

Check out the ‘My Presentations’ page for updates on new presentations.

Tuesday, June 26, 2012

Analytics Sessions at Oracle Open World 2012

The content catalog for Oracle Open World 2012 was made public during the week. OOW is on between 30th September and 4th October.

The following table gives a list of most of the Data Analytics type sessions that are currently scheduled.

Why did I pick these sessions? If I was able to go to OOW then these are the sessions I would like to attend. Yes there would be many more sessions I would like to attend on the core DB technology and Development streams.

Session Title | Presenters
CON6640 - Database Data Mining: Practical Enterprise R and Oracle Advanced Analytics | Husnu Sensoy
CON8688 - Customer Perspectives: Oracle Data Integrator | Gurcan Orhan - Software Architect & Senior Developer, Turkcell Technology R&D; Julien Testut - Product Manager, Oracle
HOL10089 - Oracle Big Data Analytics and R | George Lumpkin - Vice President, Product Management, Oracle
CON8655 - Tackling Big Data Analytics with Oracle Data Integrator | Mala Narasimharajan - Senior Product Marketing Manager, Oracle; Michael Eisterer - Principal Product Manager, Oracle
CON8436 - Data Warehousing and Big Data with the Latest Generation of Database Technology | George Lumpkin - Vice President, Product Management, Oracle
CON8424 - Oracle’s Big Data Platform: Settling the Debate | Martin Gubar - Director, Oracle; Kuassi Mensah - Director Product Management, Oracle
CON8423 - Finding Gold in Your Data Warehouse: Oracle Advanced Analytics | Charles Berger - Senior Director, Product Management, Data Mining and Advanced Analytics, Oracle
CON8764 - Analytics for Oracle Fusion Applications: Overview and Strategy | Florian Schouten - Senior Director, Product Management/Strategy, Oracle
CON8330 - Implementing Big Data Solutions: From Theory to Practice | Josef Pugh, Oracle
CON8524 - Oracle TimesTen In-Memory Database for Oracle Exalytics: Overview | Tirthankar Lahiri - Senior Director, Oracle
CON9510 - Oracle BI Analytics and Reporting: Where to Start? | Mauricio Alvarado - Principal Product Manager, Oracle
CON8438 - Scalable Statistics and Advanced Analytics: Using R in the Enterprise | Marcos Arancibia Coddou - Product Manager, Oracle Advanced Analytics, Oracle
CON4951 - Southwestern Energy’s Creation of the Analytical Enterprise | Jim Vick, Southwestern Energy; Richard Solari - Specialist Leader, Deloitte Consulting LLP
CON8311 - Mining Big Data with Semantic Web Technology: Discovering What You Didn’t Know | Zhe Wu - Consultant Member of Tech Staff, Oracle; Xavier Lopez - Director, Product Management, Oracle
CON8428 - Analyze This! Analytical Power in SQL, More Than You Ever Dreamt Of | Hermann Baer - Director Product Management, Oracle; Andrew Witkowski - Architect, Oracle
CON6143 - Big Data in Financial Services: Technologies, Use Cases, and Implications | Omer Trajman, Cloudera; Ambreesh Khanna - Industry Vice President, Oracle; Sunil Mathew - Senior Director, Financial Services Industry Technology, Oracle
CON8425 - Big Data: The Big Story | Jean-Pierre Dijcks - Sr. Principal Product Manager, Oracle
CON10327 - Recommendations in R: Scaling from Small to Big Data | Mark Hornick - Senior Manager, Oracle

Wednesday, June 20, 2012

Part 2 of the Leaning Tower of Pisa problem in ODM

In the previous post I gave the details of how you can use Regression in Oracle Data Miner to predict/forecast the lean of the tower in future years. This was based on building a regression model in ODM using the known lean/tilt of the tower for a range of years.

In this post I will show you how you can do the same tasks using the Oracle Data Miner functions in SQL and PL/SQL.

Step 1 – Create the table and data

The easiest way to do this is to make a copy of the PISA table we created in the previous blog post. If you haven’t completed this, then go to the blog post and complete step 1 and step 2.

create table PISA_2
as select * from PISA;

image

Step 2 – Create the ODM Settings table

We need to create a ‘settings’ table before we can use the ODM APIs in PL/SQL. The purpose of this table is to store all the configuration parameters needed for the algorithm to work. In our case we only need to set two parameters.

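-- Assumes the PISA_2_SETTINGS table already exists with the two columns
-- SETTING_NAME and SETTING_VALUE that are used below.
BEGIN
  -- Clear out any settings left over from an earlier run
  DELETE FROM pisa_2_settings;
  -- Use a Generalized Linear Model as the regression algorithm
  INSERT INTO pisa_2_settings (setting_name, setting_value) VALUES
    (dbms_data_mining.algo_name, dbms_data_mining.algo_generalized_linear_model);
  -- Turn off Automatic Data Preparation
  INSERT INTO pisa_2_settings (setting_name, setting_value) VALUES
    (dbms_data_mining.prep_auto, dbms_data_mining.prep_auto_off);
  COMMIT;
END;
/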

Step 3 – Build the Regression Model

To build the regression model we need to use the CREATE_MODEL procedure that is part of the DBMS_DATA_MINING package. When calling this procedure we need to pass in the name of the model, the mining function to use, the source data, the settings table and the target column we are interested in.

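-- PISA_2_BUILD_V is not created in this post; it is assumed to be a view over
-- PISA_2 that returns only the rows with a known TILT (the training data).
BEGIN
      DBMS_DATA_MINING.CREATE_MODEL(
        model_name          => 'PISA_REG_2',
        mining_function     => dbms_data_mining.regression,
        data_table_name     => 'pisa_2_build_v',
        case_id_column_name => null,
        target_column_name  => 'tilt',
        settings_table_name => 'pisa_2_settings');
END;
/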

After this we should have our regression model.

Step 4 – Query the Regression Model details

To find out what was produced in the previous step we can query the data dictionary.

SELECT model_name, 
       mining_function,
       algorithm,
       build_duration,
       model_size
from USER_MINING_MODELS
where model_name like 'P%';

image

select setting_name, 
       setting_value,
       setting_type
from all_mining_model_settings
where model_name like 'P%';

image

Step 5 – Apply the Regression Model to new data

Our final step would be to apply it to our new data i.e. the years that we want to know what the lean/tilt would be.

SELECT year_measured, prediction(pisa_reg_2 using *)
FROM   pisa_2_apply_v;

image

Tuesday, June 19, 2012

Using ODM Regression for the Leaning Tower of Pisa tilt problem

This blog post will look at how you can use the Regression feature in Oracle Data Miner (ODM) to predict the lean/tilt of the Leaning Tower of Pisa in the future.

This is a well known regression exercise, and it typically comes with a set of known values and the year for each of these values. There are lots of websites that contain the details of the problem. A summary of it is:

The following table gives measurements for the years 1975-1987 of the "lean" of the Leaning Tower of Pisa. The variable "lean" represents the difference between where a point on the tower would be if the tower were straight and where it actually is. The data is coded as tenths of a millimetre in excess of 2.9 meters, so that the 1975 lean, which was 2.9642 meters, is coded as 642.

Given the lean for the years 1975 to 1987, can you calculate the lean for a future date like 2000, 2009, 2012?

Step 1 – Create the table

Connect to a schema that you have set up for use with Oracle Data Miner. Create a table (PISA) with 2 attributes, YEAR_MEASURED and TILT. Both of these attributes need to have the datatype of NUMBER, as ODM will ignore any of the attributes if they are a VARCHAR or you might get an error.

CREATE TABLE PISA
  (
    YEAR_MEASURED NUMBER(4,0),
    TILT          NUMBER(9,4)
);

Step 2 – Insert the data

There are 2 sets of data that need to be inserted into this table. The first is the data from 1975 to 1987 with the known values of the lean/tilt of the tower. The second set of data is the future years where we do not know the lean/tilt and we want ODM to calculate the value based on the Regression model we want to create.

Insert into DMUSER.PISA (YEAR_MEASURED,TILT) values (1975,2.9642);
Insert into DMUSER.PISA (YEAR_MEASURED,TILT) values (1976,2.9644);
Insert into DMUSER.PISA (YEAR_MEASURED,TILT) values (1977,2.9656);
Insert into DMUSER.PISA (YEAR_MEASURED,TILT) values (1978,2.9667);
Insert into DMUSER.PISA (YEAR_MEASURED,TILT) values (1979,2.9673);
Insert into DMUSER.PISA (YEAR_MEASURED,TILT) values (1980,2.9688);
Insert into DMUSER.PISA (YEAR_MEASURED,TILT) values (1981,2.9696);
Insert into DMUSER.PISA (YEAR_MEASURED,TILT) values (1982,2.9698);
Insert into DMUSER.PISA (YEAR_MEASURED,TILT) values (1983,2.9713);
Insert into DMUSER.PISA (YEAR_MEASURED,TILT) values (1984,2.9717);
Insert into DMUSER.PISA (YEAR_MEASURED,TILT) values (1985,2.9725);
Insert into DMUSER.PISA (YEAR_MEASURED,TILT) values (1986,2.9742);
Insert into DMUSER.PISA (YEAR_MEASURED,TILT) values (1987,2.9757);
Insert into DMUSER.PISA (YEAR_MEASURED,TILT) values (1988,null);
Insert into DMUSER.PISA (YEAR_MEASURED,TILT) values (1989,null);
Insert into DMUSER.PISA (YEAR_MEASURED,TILT) values (1990,null);
Insert into DMUSER.PISA (YEAR_MEASURED,TILT) values (1995,null);
Insert into DMUSER.PISA (YEAR_MEASURED,TILT) values (2000,null);
Insert into DMUSER.PISA (YEAR_MEASURED,TILT) values (2005,null);
Insert into DMUSER.PISA (YEAR_MEASURED,TILT) values (2010,null);
Insert into DMUSER.PISA (YEAR_MEASURED,TILT) values (2009,null);
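As a quick sanity check outside the database, the known values above can be fitted with an ordinary least-squares straight line in a few lines of Python. This is only a sketch for comparison: it is plain simple linear regression written by hand, not the GLM or SVM that ODM builds, so the predictions from ODM may differ from what it produces. The function names are my own.

```python
# Known measurements copied from the insert statements above
# (the rows where TILT is not null).
known = [
    (1975, 2.9642), (1976, 2.9644), (1977, 2.9656), (1978, 2.9667),
    (1979, 2.9673), (1980, 2.9688), (1981, 2.9696), (1982, 2.9698),
    (1983, 2.9713), (1984, 2.9717), (1985, 2.9725), (1986, 2.9742),
    (1987, 2.9757),
]


def fit_line(points):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(year, slope, intercept):
    """Predicted tilt in metres for the given year."""
    return intercept + slope * year


slope, intercept = fit_line(known)
print("increase per year (m):", slope)
for year in (1988, 1990, 2000, 2009, 2010):
    print(year, round(predict(year, slope, intercept), 4))
```

A straight line extrapolated decades ahead is a strong assumption, so the later years should be read as an illustration of the technique rather than a forecast.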

Step 3 – Start ODM and Prepare the data

Open SQL Developer and open the ODM Connections tab. Connect to the schema that you have created the PISA table in. Create a new Project or use an existing one and create a new Workflow for your PISA ODM work.

Create a Data Source node in the workspace and assign the PISA table to it. You can select all the attributes.

The table contains the data that we need to build our regression model (our training data set) and the data that we will use for predicting the future lean/tilt (our apply data set).

We need to apply a filter to the PISA data source to only look at the training data set. Select the Filter Rows node and drag it to the workspace. Connect the PISA data source to the Filter Rows node. Double click on the Filter Rows node and select the Expression Builder icon. Create the where clause to select only the rows where we know the lean/tilt.

image

image
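As a rough sketch, the filter defined in the Expression Builder is equivalent to the following query against the PISA table created in Step 2. The exact expression text shown in the tool may differ; only the predicate on TILT matters here.

```sql
-- Training data set: only the years where the lean/tilt was measured
SELECT year_measured, tilt
FROM   dmuser.pisa
WHERE  tilt IS NOT NULL
ORDER  BY year_measured;
```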

Step 4 – Create the Regression model

Select the Regression Node from the Models component palette and drop it onto your workspace. Connect the Filter Rows node to the Regression Build Node.

image

Double click on the Regression Build node and set the Target to the TILT variable. You can leave the Case ID at <None>. You can also select whether you want to build a GLM regression model, an SVM regression model, or both. Uncheck the AUTO check box so that Oracle will not try to do any data processing or attribute elimination.

image

You are now ready to create your regression models.

To do this right click the Regression Build node and select Run. When everything is finished you will get a little green tick on the top right hand corner of each node.

image

Step 5 – Predict the Lean/Tilt for future years

The PISA table that we used above also contains our apply data set.

image

We need to create a new Filter Rows node on our workspace. This will be used to only look at the rows in PISA where TILT is null. Connect the PISA data source node to the new Filter Rows node and use the Expression Builder to create the where clause.

image

Next we need to create the Apply Node. This allows us to run the Regression model(s) against our Apply data set. Connect the second Filter Rows node to the Apply Node and the Regression Build node to the Apply Node.

image

Double click on the Apply Node.  Under the Apply Columns we can see that we will have 4 attributes created in the output. 3 of these attributes will be for the GLM model and 1 will be for the SVM model.

Click on the Data Columns tab and edit the data columns so that we get the YEAR_MEASURED attribute to appear in the final output.

Now run the Apply node by right clicking on it and selecting Run.

Step 6 – Viewing the results

When we get the little green tick on the Apply node we know that everything has run and completed successfully.

image

To view the predictions right click on the Apply Node and select View Data from the menu.

image

We can see that the GLM model gives the results we would expect but the SVM model does not.
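The same apply step can be sketched in SQL with the PREDICTION function, assuming the GLM model built by the workflow is available in your schema. The model name GLM_PISA_MODEL below is hypothetical; ODM generates its own model names, which you can find in the Regression Build node.

```sql
-- Apply data set: the years where TILT is unknown.
-- GLM_PISA_MODEL is a hypothetical name; substitute the name of your own model.
SELECT year_measured,
       PREDICTION(glm_pisa_model USING year_measured) AS predicted_tilt
FROM   dmuser.pisa
WHERE  tilt IS NULL
ORDER  BY year_measured;
```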

Wednesday, June 13, 2012

Data Science Is Multidisciplinary

[Update: October 2016. There appears to be some discussion about the Venn diagram I've proposed below. The central part of this diagram is not anything I came up with. It was a commonly used Venn diagram for Data Mining. Thanks to Polly Michell-Guthrie for providing the original reference for the Venn. I just added the outer ring of additional skills needed for the new area of Data Science. This was just my view of things back in 2012. Things have moved on a bit since then.]

A few weeks ago I had a blog post called Domain Knowledge + Data Skills = Data Miner.
In that blog post I was saying that to be a Data Scientist all you needed was Domain Knowledge and some Data Skills, which included Data Mining.
The reality is that the skill set of a Data Scientist will be much larger. There is a saying ‘A jack of all trades and a master of none’. When it comes to being a data scientist you need to be a bit like this but perhaps a better saying would be ‘A jack of all trades and a master of some’.
I’ve put together the following diagram, which includes most of the skills, with an outer circle of more fundamental skills. It is this outer ring of skills that is fundamental to becoming a data scientist. Most people will have some experience in one or more of the skills in the inner part of the diagram. The other skills can be developed and learned over time, depending on the type of person you are.
image
Can we train someone to become a data scientist, or are they born to be a data scientist? It is a little bit of both really, but you need to have some of the fundamental skills and the right type of personality. The learning of the other skills should be easy(ish).
What do you think? Are there skills that I’m missing?

Friday, May 11, 2012

Domain Knowledge + Data Skills = Data Miner

Over the past few weeks I have been talking to a lot of people who are looking at how data mining can be used in their organisation and for their projects, and to people who have been doing data mining for a long time.

What comes across from talking to the experienced people, and these people are not tied to a particular product, is that you need to concentrate on the business problem. Once you have this well defined then you can drill down to the deeper levels of the project. Some of these levels will include what data is needed (not what data you have), tools, algorithms, etc.

Statistics is only a very small part of a data mining project. Some people who have PhDs in statistics and who work in data mining say they do not use, or very rarely use, their statistics skills.

Some quotes that I like are:

"Focus hard on Business Question and the relevant target variable that captures the essence of the question." Dean Abbott PAW Conf April 2012

"Find me something interesting in my data is a question from hell. Analysis should be guided by business goals." Colin Shearer PAW Conf Oct 2011

There have been a lot of blog posts and articles on what the key skills are for a Data Miner and the more popular Data Scientist. What is very clear from all of these is that you will spend most of your time looking at, examining, integrating, manipulating, preparing, standardising and formatting the data. It has been quoted that all of these tasks can take up 70% to 85% of a Data Miner's/Data Scientist's time. All of these tasks are commonly performed by database developers, and in particular the developers and architects involved in Data Warehousing projects. The rest of the time is for running the data mining algorithms, examining the results, and yes, some stats too.

Very little time is spent developing algorithms!!! Why is this? Could it be that the algorithms are already developed (for a long time now, and are well tuned) and available in all the data mining tools? We can almost treat these algorithms as a black box. So one of the key abilities of a data miner/data scientist would be to know what the algorithms can do, what kind of problems they can be used for, what kind of outputs they produce, etc.

Domain knowledge is important, no matter how little it is, in preparing for and being involved in a data mining project. As we define our business problem the domain expert can bring their knowledge to the problem and allow us to separate the domain related problems from the data related problems. So the domain expertise is critical at the start of a project, but it is also critical when we have the outputs from the data mining algorithms. We can use the domain knowledge to tie the outputs from the data mining algorithms back to the original business problem we are working on and bring real meaning to them.

So what is the formula of skill sets for a data miner or data scientist? Well, it is a little like the title of this blog post:

Domain Knowledge + Data Skills + Data Mining Skills + a little bit of Machine Learning + a little bit of Stats = a Data Miner / Data Scientist

Tuesday, April 24, 2012

2 Day Oracle Data Miner course material

Last week I managed to get my hands on the training material for the 2 Day Oracle Data Miner course. This course is run by Oracle University.

Many thanks to Michael O’Callaghan, who is a BI Sales person here in Ireland, and to Oracle University for arranging this.

The 2 days are pretty packed with a mixture of lecture type material, lots of hands on exercises and some time for open discussions. In particular, day 2 will be a very busy day.

Check out the course outline and published schedule – click here

You can have this course on site at your organisation. If this is something that interests you then contact your Oracle University account manager. There is also the traditional face-to-face delivery and the newer online delivery, where people from around the world come together for the online class.

Monday, April 23, 2012

Oracle Analytics Sessions at COLLABORATE12

There are a number of Oracle Advanced Analytics and related topics taking place this week at COLLABORATE12 in Las Vegas (http://collaborate12.com).

Date | Time | Presentation | Presenter
Sun 22nd | 9:00-3pm | Oracle Business Intelligence Application Journey |
Mon 23rd | 9:45-10:45 | Managing Unstructured Data using Hadoop, Oracle 11g and Oracle Exadata Database Machine | Jim Steiner
Mon 23rd | 9:45-10:45 | Environmental Data Management and Analytics-a Real World Perspective | Angela Miller
Mon 23rd | 11-12 | Public Safety and Environmental Real-Time Analytics using Oracle Business Intelligence | Raghav Venkat, Therese Arguelles
Mon 23rd | 11-12 | BI is more than slice and dice | Peter Scott
Mon 23rd | 14:30-15:30 | In-Database Analytics: Predictive Analytics, Data Mining, Exadata & Business Intelligence | Jacek Myczkowski
Mon 23rd | 15:45-16:45 | Big Data Analytics, R you ready | Mark Hornick, Shyam Nath
Tues 24th | 10:45-11:45 | BI Analytics and Oracle NoSQL. The Future of Now | Manish Khera
Wed 25th | 8:15-9:15 | Oracle Data Mining – A Component of the Oracle Advanced Analytics Option-Hands-on Lab | Charlie Berger
Wed 25th | 9:30-10:30 | Oracle R Enterprise – A Component of the Oracle Advanced Analytics Option-Hands-on Lab | Mark Hornick

Here are the abstracts from the two main Oracle Advanced Analytics presentations by Charlie Berger and Mark Hornick.

Oracle Data Mining – A Component of the Oracle Advanced Analytics Option

This Hands-on Lab provides an introduction to Oracle Data Mining and the Oracle Data Miner GUI.

Oracle Data Mining (ODM), now part of Oracle Advanced Analytics, provides an extensive set of in-database data mining algorithms that solve a wide range of business problems. It can predict customer behavior, detect fraud, analyze market baskets, segment customers, and mine text to extract sentiments. ODM provides powerful data mining algorithms that run as native SQL functions for in-database model building and model deployment. There is no need for the time delays and security risks of data movement.

The free Oracle Data Miner GUI is an extension to Oracle SQL Developer 3.1 that enables data analysts to work directly with data inside the database, explore the data graphically, build and evaluate multiple data mining models, apply ODM models to new data, and deploy ODM’s predictions and insights throughout the enterprise. Oracle Data Miner work flows capture and document the user's analytical methodology and can be saved and shared with others to automate advanced analytical methodologies.

Oracle R Enterprise – A Component of the Oracle Advanced Analytics Option

This Hands-on Lab provides an introduction to Oracle R Enterprise.

Oracle R Enterprise, a part of the Oracle Advanced Analytics Option, makes the open source R statistical programming language and environment ready for the enterprise by integrating R with Oracle Database. R users can interactively and transparently execute R scripts for statistical and graphical analyses on data stored in Oracle Database. R scripts can be executed in Oracle Database using potentially multiple database-managed R engines - resulting in data parallel execution. ORE also provides a rich set of statistical functions and advanced analytics techniques.

In this lab, attendees will be introduced to Oracle's strategy for R, including the Oracle R Distribution, Oracle R Enterprise (ORE), and Oracle R Connector for Hadoop (ORCH). We will focus on Oracle R Enterprise with hands-on exercises exploring the transparency layer, embedded R execution, and statistics engine.

Tuesday, April 10, 2012

Oracle Advanced Analytics Video by Charlie Berger

Charlie Berger (Sr. Director Product Management, Data Mining & Advanced Analytics) has produced a video based on a recent presentation called ‘Oracle Advanced Analytics: Oracle R Enterprise & Oracle Data Mining’.

This is a 1 hour video, including some demos, of product background, product features, recent developments and new additions, examples of how Oracle is including Oracle Data Mining into their fusion applications, etc.

Oracle has 2 data mining products: the main in-database Oracle Data Mining, and the more recent extensions to R that give us Oracle R Enterprise.

Check out the video – Click here.

Check out Charlie’s blog at https://blogs.oracle.com/datamining/

Oracle University : 2 Day Oracle Data Mining training course

Friday, February 10, 2012

ODM–Attribute Importance using PL/SQL API

In a previous blog post I explained what attribute importance is and how it can be used in the Oracle Data Miner tool (click here to see blog post).

In this post I want to show you how to perform the same task using the ODM PL/SQL API.

The ODM tool makes extensive use of the Automatic Data Preparation (ADP) function. ADP performs some data transformations such as binning, normalization and outlier treatment of the data based on the requirements of each of the data mining algorithms. In addition to these transformations we can specify our own transformations. We do this by creating a settings table which will contain the settings and transformations we want the data mining algorithm to perform on the data.

ADP is automatically turned on when using the ODM tool in SQL Developer. This is not the case when using the ODM PL/SQL API. So before we can run the Attribute Importance function we need to turn on ADP.

Step 1 – Create the settings table

CREATE TABLE Att_Import_Mode_Settings (
  setting_name  VARCHAR2(30),
  setting_value VARCHAR2(30));

Step 2 – Turn on Automatic Data Preparation

BEGIN
  -- Turn on ADP for any model built with this settings table
  INSERT INTO Att_Import_Mode_Settings (setting_name, setting_value)
  VALUES (dbms_data_mining.prep_auto, dbms_data_mining.prep_auto_on);
  COMMIT;
END;
/

Step 3 – Run Attribute Importance

BEGIN
  -- Build the Attribute Importance model using the settings table from Step 1
  DBMS_DATA_MINING.CREATE_MODEL(
    model_name          => 'Attribute_Importance_Test',
    mining_function     => DBMS_DATA_MINING.ATTRIBUTE_IMPORTANCE,
    data_table_name     => 'mining_data_build_v',
    case_id_column_name => 'cust_id',
    target_column_name  => 'affinity_card',
    settings_table_name => 'Att_Import_Mode_Settings');
END;
/
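To confirm that the model was created and that the ADP setting was picked up, the data dictionary views can be queried. This is a minimal check, assuming the model was created in the current schema and that the model name is stored in upper case, as is usual for unquoted names.

```sql
-- Confirm the model exists and check which mining function it was built with
SELECT model_name, mining_function, algorithm, creation_date
FROM   user_mining_models
WHERE  model_name = 'ATTRIBUTE_IMPORTANCE_TEST';

-- List the settings used for the model, including the ADP setting
SELECT setting_name, setting_value, setting_type
FROM   user_mining_model_settings
WHERE  model_name = 'ATTRIBUTE_IMPORTANCE_TEST'
ORDER  BY setting_name;
```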

Step 4 – Select Attribute Importance results

SELECT *
FROM TABLE(DBMS_DATA_MINING.GET_MODEL_DETAILS_AI('Attribute_Importance_Test'))
ORDER BY RANK;

ATTRIBUTE_NAME       IMPORTANCE_VALUE       RANK
-------------------- ---------------- ----------
HOUSEHOLD_SIZE             .158945397          1
CUST_MARITAL_STATUS        .158165841          2
YRS_RESIDENCE              .094052102          3
EDUCATION                  .086260794          4
AGE                        .084903512          5
OCCUPATION                 .075209339          6
Y_BOX_GAMES                .063039952          7
HOME_THEATER_PACKAGE       .056458722          8
CUST_GENDER                .035264741          9
BOOKKEEPING_APPLICATION    .019204751         10
CUST_INCOME_LEVEL                   0         11
BULK_PACK_DISKETTES                 0         11
OS_DOC_SET_KANJI                    0         11
PRINTER_SUPPLIES                    0         11
COUNTRY_NAME                        0         11
FLAT_PANEL_MONITOR                  0         11