Wednesday, May 8, 2013

New website for my blog

A few days ago I moved my blog to a new domain name

www.oralytics.com

Check it out. Wait you already are if you are reading this Smile

The domain name is a merger of Oracle and Analytics, and has a familiar ring to it for those of you who know Oracle.

The old web link still works (for now)

brendantierneydatamining.blogspot.com

I’ll be look to update the look and feel over the coming months.

Thursday, April 25, 2013

Oracle Magazine-September/October 1999

The headline articles in the September/October 1999 edition of Oracle Magazine focused on how the Oracle technology can be used to educate staff and to keep their skills up to date. either on site or remote via on-demand training resources.

image

Other articles included:

  • Oracle announce that they have acquired Thinking Machine’s data mining business. This data mining product was called Darwin and is now called Oracle Data Mining. I will have a separate blog post for this announcement.
  • Oracle 8i Lite has shipped and comes with three component: Oracle Lite a single user (50K to 750K foot print), Web-to-Go allows users to access the same data and web applications both online and offline, iConnect that was a flexible architecture that enables reliable and scalable bi-directional synchronization of data and applications. Oracle 8i Lite was supported on MS Windows 95, 98 and NT, Windows CE, Palm OS and EPOC 32.
  • Oracle XML Parser for C and Oracle XML Parser for C++ are released and supports DOM and Simple API for XML (SAX) interfaces.
  • Oracle XML SQL utilities and XSQL Servlet facilitates the reading and writing of XML information from and to the Oracle database.
  • Siemens announce that they plan to build an Oracle 8i Applicance on its Primergy line of servers, based on Intel Pentium II Xeon processors.
  • Singapore Telecom’s Magix Server delivers the World’s first nationwide video on demand service. Their 12,000 subscriber were able to use a web-browser to select a video from the Megix Web side and SingTel automates the streaming of them to their computer.
  • Oracle 8i comes with some improvements in PL/SQL. These included Autonomous Transactions, Native Dynamic SQL, Invoker rights procedures, user-defined operators, new operators, bulk binds.
  • Part 2 of the article on exporting an Oracle Database to a Flat File. In this part of the article it looks at how you can use the UTL_FILE package.
  • How you can speed up query response times by using a Materialized Views. The article suggests the following steps to analyze the performance impact:
    • Configure the server parameters
    • Grant privileges to the appropriate schema
    • Create a materialized view
    • Refresh the optimizer statistics
    • Confirm that the materialized view is being used
    • Manually refresh a materialized view
  • Oracle introduces Oracle Log Miner to allow a DBA to analyze the REDO log files

Tuesday, April 23, 2013

Oracle buys Darwin back in 1999

The following is an extract from 1999 September/October edition of Oracle Magazine, about Oracle buying Thinking Machines. Their data mining software Darwin was integrated into the Oracle Database and renamed Oracle Data Miner.

Oracle Corporation’s recent acquisition of Thinking Machines’ data mining business extends Oracle’s data warehouse platform and business intelligence solution to include enterprise reporting, ad hoc query, advanced analysis and data mining software based on a common internet platform.

Oracle plans to incorporate the data mining software as an integral feature of Oracle Applications Customer Relationship Management site, which will facilitate the implementation of the e0business solutions developed by Oracle customers. In addition o the software technology, Oracle will receive rights to the domains think.com and thinkingmachines.com.

About Thinking Machines

Originally founded in 1983, Thinking Machines Corporation revolutionized high performance computing with its massively parallel supercomputing technology. The company has since evolved to focus exclusively on its Darwin data mining software for database marketing in the financial services and telecommunications industries. Darwin analyzes massive volumes of customer transaction, demographic and psychographic data, which can often amount to hundreds of millions of customer data records.

These advanced analyses help companies profile and target customers with greater accuracy, which allows companies to reduce customer attrition, assess customer profitability, cross sell to existing customers and detect fraud.

Darwin puts powerful data mining techniques in the hands of general business users and experienced analysts alike. Each to use wizards automate data mining while providing advanced users with full control over all options and parameters. The Darwin software combines advanced analytics - including neural networks, decisions trees and memory based reasoning, with impressive power and performance.

The solution’s one button model code generation, powerful scripting language and robust software development kit bring prediction capabilities to sales, call center, marking and the web.

Platforms and Languages

Darwin runs on Sun Microsystems and Hewlett-Packard servers and exports data mining models in C, C++ and Java for execution within Oracle Databases. A Microsoft Windows NT release is planned for later this year.”

Friday, April 19, 2013

Part 2–Getting start with Statistics for Oracle Data Science projects

This is the second blog on getting started with Statistics for Oracle Data Science projects.

In this blog post I will look at 3 more useful statistical functions that are available in the Oracle database. Remember these come are standard with the database. The first function I will look at is the WIDTH_BUCKET function. This can be used to create some histograms of the data. A common task in analytics projects is to produce some cross tabs of the data. Oracle has the STATS_CROSSTAB. The last function I will look the different ways you an sample the data.

Histograms using WIDTH_BUCKET

When exploring your data it is useful to group values together into a number of buckets. Typically you might want to define the width of each bucket yourself before passing the data into your data mining tools, but before you can decide what these are you need to do some exploring using a variety of widths. A good way to do this is to use the WIDTH_BUCKET function. This takes the following inputs:

Expression: This is the expression or attribute on which the you want to build the histogram.

Min Value: This is the lower or starting value of the first bucket

Max Value: This is the last or highest value for the last bucket

Num Buckets: This is the number of buckets you want created.

Typically the Min Value and the Max Value can be calculated using the MIN and MAX functions. As a starting point you generally would select 10 for the number of buckets. This is the number you will change, downwards as well as upwards, to if a particular pattern exists in the attribute.

Using the example scenario that I used in the first blog post, let us start by calculating the MIN and MAX for the AGE attribute.

image

Lets say that we wanted to create 10 buckets. This would create a bucket width of 7.3 for each bucket, giving us the following.

Bucket 1 : 17-24.3
Bucket 2: 24.3-31.6
Bucket 3: 31.6-38.8
Bucket 4: 38.8-46.1
Bucket 5: 46.1-53.4
Bucket 6: 53.4-60.7
Bucket 7: 60.7-68
Bucket 8: 68-75.3
Bucket 9: 75.3-82.6
Bucket 10: 82.6-90

These are the buckets that the WIDTH_BUCKET function gives us in the following:

SELECT cust_id,
       age,
       width_bucket(age,
                    (SELECT min(age) from mining_data_build_v),
                    (select max(age)+1 from mining_data_build_v),
                    10)  bucket
from mining_data_build_v
where rownum <=12
group by cust_id, age

image

An additional level of detail that is needed to allow us to plot the histograms for AGE, we need to aggregate up for all the records by bucket.

select intvl, count(*) freq
from (select width_bucket(salary,
(select min(salary) from employees),
(select max(salary)+1 from employees), 10) intvl
from HR.employees)
group by intvl
order by intvl;

image

We can take this code and embed it into the GATHER_DATA_STATS procedure that I gave in my Part 1 blog post.

Cross Tabs using STATS_CROSSTAB

Typically cross tabulation (or crosstabs for short) is a statistical process that summarises categorical data to create a contingency table. They provide a basic picture of the interrelation between two variables and can help find interactions between them.

Because Crosstabs creates a row for each value in one variable and a column for each value in the other, the procedure is not suitable for continuous variables that assume many values.

In Oracle we can perform crosstabs using one of their reporting tools. But if you don’t have one of these we will need to use the in-database function STATS_CROSSTAB. This function takes three parameters, the first two of these are the attributes you want to compare and the third is what test we want to perform. The tests available include:

  • CHISQ_OBS: Observed value of chi-squared
  • CHISQ_SIG: Significance of observed chi-squared
  • CHISQ_DF: Degree of freedom for chi-squared
  • PHI_COEFFICIENT: Phi coefficient
  • CRAMERS_V: Cramer’s V statistic
  • CONT_COEFFICIENT: Contingency coefficient
  • COHENS_K: Cohen’s kappa

CHISQ_SIG is the default.

Now let us look at some examples using our same data set.

image

Sampling Data

When our datasets are of relatively small size consisting of a few hundred thousand records we can explore the data is a relatively short period of time. But if your data sets are larger that that you may need to explore the data by taking a sample of it. What sampling does is that it takes a “random” selection of records from our data set up to the new number of records we have specified in the sample.

In Oracle the SAMPLE function takes a percentage figure. This is the percentage of the entire data set you want to have in the Sampled result. 

image

There is also a variant called SAMPLE BLOCK and the figure given is the percentage of records to select from each block.

image

Each time you use the SAMPLE function Oracle will generate a random seed number that it will use as a Seed for the SAMPLE function. If you omit a Seed number (like in the above examples), you will get a different result set in each case and the result set will have a slightly different number of records. If you run the sample code above over and over again you will see that the number of records returned varies by a small amount.

If you would like to have the same Sample data set returned each time then you will need to specify a Seed value. The Seed much be an integer between 0 and 4294967295.

image

In this case because we have specified the Seed we get the same “random” records being returned with each execution.

Thursday, April 11, 2013

Part 1–Getting started with Statistics for Oracle Data Science projects

With all analytics or data science projects one of the first steps typically involves the extraction of data from various sources, merging the data and then performing various statistics.

The extraction and merging of the data is well covered by lots of other people blogging about how to use Oracle Data Integration (ODI), Oracle Warehouse Builder (OWB), among many others.

What I’m going to look at in this series of blog posts will be what statistical functions you might look at using in the Oracle and how to use them.

  • This the first blog post in the series will look at the DBMS_STAT_FUNCS PL/SQL package, what it can be used for and I give some sample code on how to use it in your data science projects. I also give some sample code that I typically run to gather some additional stats.
  • The second blog post will look at some of the other statistical functions that exist in SQL that you will/may use regularly in your data science projects.
  • The third blog post will provide a summary of the other statistical functions that exist in the database.

These statistical functions can be grouped into 2 main types. The first is the descriptive statistics that are available by using the DBMS_STAT_FUNCS PL/SQL package and then there is the extensive list of other SQL stats functions.  It is worth mentioning at this point that all these stats packages and functions come as standard in the database (i.e. they are FREE, you do not have to pay for an add on option for the database to use them). So once you have you Oracle database installed you can start using them. There is no need to spend money buying another stats package to do stats. All you need to know is some SQL and what the stats functions are.

DBMS_STAT_FUNCS

One of the stats package that I use a lot is the SUMMARY function. This is part of the DBMS_STAT_FUNC PL/SQL package. This package calculates a number of common statistics for an attribute in a table. Yes that’s correct, it only gather statistics for just one attribute. So you will have to run it for all the numeric attributes in the table.

For does people who are familiar with the Oracle Data Miner tool, the explore data node produces a lot of these statistics that the SUMMARY function produces. See below for details of how to produce the Histograms.

The SUMMARY function has the following parameters

image

Although you will probably be running this this function on the data in your schema you still have to give the schema name. The table name is the name of the table where the data exists, the column name is the name of the column that contains the actual data you want to analyse, and the ‘s’ is the record that will be returned by the SUMMARY function that contains all the summary information.

An example of the basic script to run the SUMMARY function is given below. It will use the data that is available in the sample schemas and the views that where setup for the Oracle Data Mining sample schemas. The table (or in this case the view) that we are going to use is the MINING_DATA_BUILD_V. What we are going to do is to replicate some of what the Explore Node does in the Oracle Data Miner tool.

set serveroutput on

declare
   s         DBMS_STAT_FUNCS.SummaryType;
begin
 
   DBMS_STAT_FUNCS.SUMMARY('DMUSER', 'MINING_DATA_BUILD_V', 'AGE', 3, s);

   dbms_output.put_line('SUMMARY STATISTICS');
   dbms_output.put_line('Count  : '||s.count);
   dbms_output.put_line('Min    : '||s.min);
   dbms_output.put_line('Max    : '||s.max);
   dbms_output.put_line('Range  : '||s.range);
   dbms_output.put_line('Mean   : '||round(s.mean));
   dbms_output.put_line('Mode Count : '||s.cmode.count);
   dbms_output.put_line('Mode        : '||s.cmode(1));
   dbms_output.put_line('Variance    : '||round(s.variance));
   dbms_output.put_line('Stddev      : '||round(s.stddev));
   dbms_output.put_line('Quantile 5  : '||s.quantile_5);
   dbms_output.put_line('Quantile 25 : '||s.quantile_25);
   dbms_output.put_line('Median      : '||s.median);
   dbms_output.put_line('Quantile 75 : '||s.quantile_75);
   dbms_output.put_line('Quantile 95 : '||s.quantile_95);
   dbms_output.put_line('Extreme Count : '||s.extreme_values.count);
   dbms_output.put_line('Extremes      : '||s.extreme_values(1));
   dbms_output.put_line('Top 5 : '||s.top_5_values(1)||','||
                                                s.top_5_values(2)||','||
                                                s.top_5_values(3)||','||
                                                s.top_5_values(4)||','||
                                                s.top_5_values(5));
   dbms_output.put_line('Bottom 5 : '||s.bottom_5_values(5)||','||
                                                     s.bottom_5_values(4)||','||
                                                     s.bottom_5_values(3)||','||
                                                     s.bottom_5_values(2)||','||
                                                     s.bottom_5_values(1));
end;
/

image

We can compare this to what is produced by the Explore Node in ODM

image

image

We can see that the Explore Node gives us more statistics to help us with understanding the data.

What Statistics does the Explore Node produce

We can see the actual SQL code that the Explore Node runs to get the statistics that are displayed in the Explore Node View Data window. To do this you will need to right-click on the Explore Node and move the mouse down to the Deploy option. The submenu will open and select ‘SQL to Clipboard’ from the list. Open a text editor and past the code. You  will need to tidy up some of this code to point it at the actual data source you want. You will get the following

SELECT /*+ inline */  ATTR, 
       DATA_TYPE, 
       NULL_PERCENT, 
       DISTINCT_CNT, 
       DISTINCT_PERCENT, 
       MODE_VALUE,
       AVG,
       MIN,
       MAX,
       STD,
       VAR,
       SKEWNESS,
       KURTOSIS,
       HISTOGRAMS
FROM OUTPUT_1_23;

Where OUTPUT_1_23 is a working table that ODM has created to store intermediate results from some of its processing. In this case the Explore Node. You will need to change this to the ODM working table in your schema.

This query does not perform any of the statistics gathering. It just presents the results.

Creating our own Statistics gathering script – Part 1

The attribute names in the above SQL query tells us what statistics functions within Oracle that were used. We can replicate this statistics gathering task using the following script. There are four parts to this script. The first part gathers most of the common statistics for the attribute. The second and third parts calculate the Skewness and Kurtosis for the attribute. The final (fourth) part combines the first three parts and lists the outputs.

The one statistic function that we are not including at this point is the Histogram information. I will cover this in the next (second) blog post on statistics.

The following script has the data source table/view name included (MINING_DATA_BUILD_V) and the attribute we are going to use (AGE).  You will need to modify this script to run it for each attribute.

WITH
    basic_statistics AS (select (sum(CASE WHEN age IS NULL THEN 1 ELSE 0 END)/COUNT(*))*100 null_percent,
          count(*)    num_value,
          count(distinct age)   distinct_count,
          (count(distinct age)/count(*))*100     distinct_percent,
          avg(age)      avg_value,
          min(age)      min_value,
          max(age)     max_value,
          stddev(age)  std_value,
          stats_mode(age)   mode_value,
          variance(age)       var_value
        from   mining_data_build_v),
    skewness AS (select avg(SV) S_value
                 from (select power((age - avg(age) over ())/stddev(age) over (), 3) SV
                       from mining_data_build_v) ),
    kurtosis AS (select avg(KV) K_value
                 from (select power((age - avg(age) over ())/stddev(age) over (), 4) KV
                       from mining_data_build_v) )
SELECT null_percent,
       num_value,
       distinct_percent,
       avg_value,
       min_value,
       max_value,
       std_value,
       mode_value,
       var_value,
       S_value,
       K_value
from basic_statistics,
     skewness,
     kurtosis;

image

Part 2 – Lets do it for all the attributes in a table

In the code above I’ve shown how you can gather the statistics for one particular attribute of one table.But in with an analytics project you will want to gather the statistics on all the attributes.

What we can do is to take the code above and put it into a procedure. This procedure accepts a table name as input, loops through the attributes for this table and calculates the various statistics. The statistics are saved in a table called DATA_STATS (see below).

drop table data_stats;

create table DATA_STATS (
table_name VARCHAR2(30) NOT NULL,
column_name VARCHAR2(30) NOT NULL,
data_type VARCHAR2(106) NOT NULL,
data_length NUMBER,
data_percision NUMBER,
data_scale NUMBER,
num_records NUMBER,
distinct_count NUMBER,
null_percent NUMBER,
distinct_percent NUMBER,
avg_value NUMBER,
min_value NUMBER,
max_value NUMBER,
std_value NUMBER,
mode_value VARCHAR2(1000),
var_value NUMBER,
s_value NUMBER,
k_value NUMBER,
PRIMARY KEY (table_name, column_name));

This is one of the first things that I do when I start on a new project. I create the DATA_STATS table and run my procedure GATHER_DATA_STATS for each table that we will be using. By doing this it allows me to have a permanent records of the stats for each attribute and saves me time in having to rerun various stats at different points of the project. I can also use these stats to produces some additional stats or to produce some graphs.

He is the code for the GATHER_DATA_STATS procedure.

CREATE OR REPLACE PROCEDURE gather_data_stats(p_table_name IN varchar2) AS

   cursor c_attributes (c_table_name varchar2)
                       is SELECT table_name,
                                 column_name,
                                 data_type,
                                 data_length,
                                 data_precision,
                                 data_scale
                          FROM user_tab_columns
                          WHERE table_name = upper(c_table_name);

   v_sql     NUMBER;
   v_rows    NUMBER;
BEGIN
   dbms_output.put_line('Starting to gather statistics for '||upper(p_table_name)||' at '||to_char(sysdate,'DD-MON-YY HH24:MI:SS'));

   FOR r_att in c_attributes(p_table_name) LOOP
      --
      -- remove any previously generated stats
      --
      v_sql := DBMS_SQL.OPEN_CURSOR;
      DBMS_SQL.PARSE(v_sql, 'delete from DATA_STATS where table_name = '''||r_att.table_name||''' and column_name = '''||r_att.column_name||'''', DBMS_SQL.NATIVE);
      v_rows := DBMS_SQL.EXECUTE(v_sql);
--dbms_output.put_line('delete from DATA_STATS where table_name = '''||r_att.table_name||''' and column_name = '''||r_att.column_name||'''');

      IF r_att.data_type = 'NUMBER' THEN
         dbms_output.put_line(r_att.table_name||' : '||r_att.column_name||' : '||r_att.data_type);

         --
         -- setup the insert statement and execute
         --
         v_sql := DBMS_SQL.OPEN_CURSOR;
         DBMS_SQL.PARSE(v_sql, 'insert into data_stats select '''||r_att.table_name||''', '''||r_att.column_name||''', '''||r_att.data_type||''', '||r_att.data_length||', '||nvl(r_att.data_precision,0)||', '||nvl(r_att.data_scale,0)||', count(*) num_value, (sum(CASE WHEN '||r_att.column_name||' IS NULL THEN 1 ELSE 0 END)/COUNT(*))*100 null_percent, count(distinct '||r_att.column_name||') distinct_count, (count(distinct '||r_att.column_name||')/count(*))*100 distinct_percent, avg('||r_att.column_name||') avg_value, min('||r_att.column_name||') min_value, max('||r_att.column_name||') max_value, stddev('||r_att.column_name||') std_value, stats_mode('||r_att.column_name||') mode_value, variance('||r_att.column_name||') var_value, null, null from '|| r_att.table_name, DBMS_SQL.NATIVE);
         v_rows := DBMS_SQL.EXECUTE(v_sql);

      ELSIF r_att.data_type IN ('CHAR', 'VARCHAR', 'VARCHAR2') THEN
         dbms_output.put_line(r_att.table_name||' : '||r_att.column_name||' : '||r_att.data_type);

         --
         -- We need to gather a smaller number of stats for the character attributes
         --
         v_sql := DBMS_SQL.OPEN_CURSOR;
         begin

         DBMS_SQL.PARSE(v_sql, 'insert into data_stats select '''||r_att.table_name||''', '''||r_att.column_name||''', '''||r_att.data_type||''', '||r_att.data_length||', '||nvl(r_att.data_precision,0)||', '||nvl(r_att.data_scale,0)||', count(*) num_value, (sum(CASE WHEN '||r_att.column_name||' IS NULL THEN 1 ELSE 0 END)/COUNT(*))*100 null_percent, count(distinct '||r_att.column_name||') distinct_count, (count(distinct '||r_att.column_name||')/count(*))*100 distinct_percent, null, null, null, null, stats_mode('||r_att.column_name||') mode_value, null, null, null from '|| r_att.table_name, DBMS_SQL.NATIVE);
         v_rows := DBMS_SQL.EXECUTE(v_sql);

-- dbms_output.put_line('insert into data_stats select '''||r_att.table_name||''', '''||r_att.column_name||''', '''||r_att.data_type||''', '||r_att.data_length||', '||nvl(r_att.data_precision,0)||', '||nvl(r_att.data_scale,0)||', count(*) num_value, (sum(CASE WHEN '||r_att.column_name||' IS NULL THEN 1 ELSE 0 END)/COUNT(*))*100 null_percent, count(distinct '||r_att.column_name||') distinct_count, (count(distinct '||r_att.column_name||')/count(*))*100 distinct_percent, null, null, null, null, stats_mode('||r_att.column_name||') mode_value, null, null, null from '|| r_att.table_name);
         exception
         when others then
            dbms_output.put_line(v_rows);
         end;

      ELSE
         dbms_output.put_line('Unable to gather statistics for '||r_att.column_name||' with data type of '||r_att.data_type||'.');
      END IF;
   END LOOP;

   dbms_output.put_line('Finished gathering statistics for '||upper(p_table_name)||' at '||to_char(sysdate,'DD-MON-YY HH24:MI:SS'));
  
   commit;
END;

Then to run it for a table:

exec gather_data_stats('mining_data_build_v');

We can view the contents of the DATA_STATS table by executing the following in SQL*Plus or SQL Developer

select * from DATA_STATS;

image

Tuesday, April 2, 2013

OTN has links to two of my blog posts

Over the past couple of weeks I’ve noticed that I had a bit of a spike in my blog stats (I don’t check them often). In particular there was 2 groups of blog posts that were getting a lot of the hit.

After a bit of investigation I found out that it was do to referrals from one particular website. It was OTN or Oracle Technology Network, and more specifically it was from their webpage dedicated for Database Admins and Developer.

Yes OTN had links to my blog posts on Clustering in Oracle Data Miner and to my blog post on Are you a Type I and Type II Data Scientists.

image

What a surprise this was to discover!!!  and what a honour Smile

I don’t know how long they will be on the OTN webpage, but hopefully lots of people in the Oracle community will find them useful.

I’m working on my next set of Oracle Data Miner blog posts, so watch this space. Plus I’ve started work on two technical articles that I’ll be submitting to OTN over the next few weeks. So hopefully you will see these up on OTN soon.

Monday, March 25, 2013

Review of Oracle Magazine–May/June 1999

The headline articles for the May/June 1999 edition of Oracle Magazine were focused on using internet technologies to allow businesses to work together more efficiently.

image

Other articles included:

  • Oracle and Hewlett-Packard announce the a prebuild Oralce 8i appliance. This had a code-name of Raw Iron.
  • Oracle announce a $100million venture fund to promote innovation by companies developing products and services based on Oracle 8i
  • Oracle 8i comes with the an enhanced feature that automatically keeps a standby database synchronized with the production database. This is called the Automated Standby Databases (ASD) and hopes to reduce the amount of manual work DBAs need to perform.
  • Some helpful suggestions on how to go about implementing parallel DML in Oracle 8.
    • Rules for Parallel Insert
      • The insert statement must be of the form ‘insert into table_name select …’
      • The table being modified must have a specified parallel declaration or you must specify a parallel hint directive in the insert statement
      • You can perform parallel insert on non-partitioned as well as partitioned tables
      • After the parallel DML is complete no other SQL statements can access the same table until a Commit is issued.
    • Rules for Parallel Update and Delete
      • Table table must have a parallel declaration specified or you must specify a parallel hint directive in the update/delete statement
      • You can perform parallel update or delete on partitioned tables only
      • You cannot see the result of the parallel update or delete during the transaction
  • By using the parallel options, data intensive SQL statements, database recovery, and data loads can be executed by multiple processes simultaneously. All the following operations can be executed in parallel
    • table scan
    • sort merge join
    • Not In
    • select distinct
    • aggregation
    • cube
    • create table as select
    • rebuild index partition
    • move partition
    • update
    • Insert ….. select
    • Enable constraint
    • PL/SQL functions called from SQL
    • Nested loop join
    • Hash join
    • Group by
    • Union and union all
    • Order by
    • Rollup
    • Create index
    • Rebuild index
    • Split partition
    • Delete
    • Star transformation

To view the cover page and the table of contents click on the image at the top of this post or click here.

My Oracle Magazine Collection can be found here. You will find links to my blog posts on previous editions and a PDF for the very first Oracle Magazine from June 1987.

Friday, March 22, 2013

Type I and Type II Data Scientists

Over the past 18 months we have seen a significant increase in the use of the term Data Scientist. Maybe it is because the HBR and many other publications have been promoting it.

Yes the areas of statistics and predictive analytics has evolved to include a lot more techniques and technologies.

Unfortunately the term Data Scientist has been over used and a lot of people have joined in with the Marketing hype. There are reports of organisations hiring a data scientist only to fire them within a few months because they did not deliver anything useful. Data Science is not some silver bullet to an organization problems and data science may not deliver anything useful, but in the vast majority of cases it will.

One thing that has been emerging over the past few weeks is that there seems to be two main types of Data Scientist. There are the Data Scientists who perform certain tasks or are focused on specific technologies. Then there are the Data Scientists who are not as technical as the previous group but are focused on how they can use the technologies to deliver business benefit.  I like to call these Type I and Type II Data Scientist.

The Type I Data Scientist

This is perhaps to most common type of Data Scientist we see around, or the most common type of person who is calling themselves a Data Scientist. These are people who know a lot about and are really good at a technique or technology that is associated with Data Science. Some of these would be the “old school” type of people and include:

  • Statisticians
  • Data Miners
  • Predictive Modellers
  • Machine Learning
  • Data Warehousing
  • Business Intelligence & Visualization
  • Big Data
  • R / Oracle / SAS / SPSS / etc.

The people in each of these have a deep knowledge of their topic and can tell/show you lots of detail about how best to explore data in their given field.

Yes you don’t have to have a Stats background to call yourself a Data Scientist, but some knowledge of Stats would be useful (you don’t need a PhD or Master)

The Type II Data Scientist

A Type II Data Scientist is a slightly different breed of person. They would have a little bit of knowledge of some or all of the areas listed under the Type I Data Scientist, but would not have the depth of knowledge of a topic that a Type I Data Scientist would have.

The Type II Data Scientist approaches the types of problems that organisations are facing in a different way. They will concentrate on the business goals and business problems that the organisation are facing. Based on these they will identify what the data scientist project will focus on, ensuring that there is a measurable outcome and business goal. The Type II Data Scientist will be a good communicator, being able to translate between the business problem and the technical environment necessary to deliver what is needed. During the project the data science team will discovery various insight about the data. The Type II Data Scientist will prioritise these and feed them back to the various business units. Some of these insights can range from something new, verifying business knowledge beliefs, areas where better data capture is needed, improvements in applications, etc.

The Type II Data Scientist would be the Data Science team leader within the organisation that manages the Type I Data Scientists, keeping them focused on the key deliverables of delivering measurable business benefits.

image

I really like the following phrase that I have come across recently:

“We haven't learned how to handle small data well, let alone throw big data on there.”

Data Science is not about Big Data. There is much more an organization can do with Data Science without having to get involved with Big Data. This is where the skills of the Type II Data Scientist is important, as they can direct the managers of an organization to focus on their real data problems and not get carried away with some of the marketing hype. When the time is right they will look at incorporating typical big data problems within their existing analytical environment.

One thing is for sure. The definition of “what is a” Data Scientist is still evolving. But there does seem to be some consensus the corresponds to the separation of the Type I and Type II Data Scientist roles.

Thursday, March 14, 2013

Clustering in Oracle Data Miner-Part 5

This is a the fifth and final blog post on building and using Clustering in Oracle Data Miner. The following outlines the contents of each post in this series on Clustering.

  1. The first part we looked at what clustering features exist in ODM and how to setup the data that we will be using in the examples
  2. The second part will focus on how to building Clusters in ODM .
  3. The third post will focus on examining the clusters produced by ODM and how to use the Clusters to apply to new data using ODM.
  4. The fourth post will look at how you can build and evaluate a Clustering model using the ODM SQL and PL/SQL functions.
  5. The fifth and final post will look at how you can apply your Clustering model to new data using the ODM SQL and PL/SQL functions.

 

Step 1 – What Clustering models do we have

In my previous post I gave the query to retrieve the clustering models that we have in our schema. Here it is again.

column model_name format a20
column mining_function format a20
column algorithm format a20
SELECT model_name,
       mining_function,
       algorithm,
       build_duration,
       model_size
FROM ALL_MINING_MODELS
WHERE mining_function = 'CLUSTERING';

image

This time we see that we have 3 cluster models. Our new model is called CLUSTER_KMEANS_MODEL.  

column child format a40
column cluster_id format a25

select cluster_id,
       record_count,
       parent,
       tree_level,
       child
from table(dbms_data_mining.get_model_details_km('CLUS_KM_1_25'))

The following image shows all the clusters produced and we can see that we have the renamed cluster labels we set when we used the ODM tool.

image

Step 2 – Setting up the new data

There are some simple rules to consider when preparing the data for the cluster model. These really apply to all of the data mining algorithms.

- You will need to have the data prepared and in the same format as you used for building the model

- This will include the same table structure. Generally this should not be a problem. If you need to merge a number of tables to form a table with the correct format, the simplest method is to create a view.

- All the data processing for the records and each attribute needs to be completed before you run the apply function.

- Depending on the complexity of this you can either build this into the view (mentioned above), run some pl/sql procedures and create a new table with the output, etc.  I would strongly suggest that the minimum pre-processing you have to do on the data the simpler the overall process and implementation will be.

- The table or view must have one attribute for the CASE_ID. The CASE_ID is an attribute that is unique for each record. If the primary key of the table is just one attribute you can use this. If not then you will need to generate a new attribute that is unique. One way to do this is to concatenate each of the attributes that form the primary key.

Step 3 – Applying the Cluster model to new data – In Batch mode

There are two ways of using an Oracle Data Mining model in the database. In this section we will look at how you can run the cluster model to score data in a batch mode. What I mean by batch mode is that you have a table of data available and you want to score the data with what the model thinks their cluster will be.

To do this we need to run the APPLY function that is part of the DBMS_DATA_MINING package.

image

image

For clustering we do not have CASE_ID, so we can leave this parameter NULL.

One of the parameters is called RESULT_TABLE_NAME. Using the DBMS_DATA_MINING.APPLY package and function, it looks to create a new table that will contain the outputs of the cluster scoring. This table (for the KMeans and O-Cluster algorithms) will contain three attributes.

CASE_ID       VARCHAR2/NUMBER
CLUSTER_ID    NUMBER
PROBABILITY   NUMBER

The table will have the CASE_ID. This is the effectively the primary key of the table.

If we take our INSURANCE_CUST_LTV table as the table containing the new data we want to score (Yes this is the same table we used to build the cluster model) and the CLUSTER_KMEANS_MODEL as the cluster model we want to use. The following codes show the APPLY function necessary to score the data.

BEGIN  

  DBMS_DATA_MINING.APPLY(
     model_name          => 'CLUSTER_KMEANS_MODEL',
     data_table_name     => 'INSURANCE_CUST_LTV',
     case_id_column_name => 'CUSTOMER_ID',
     result_table_name   => 'CLUSTER_APPLY_RESULT');
END;

On my laptop this took 3 second to complete. This involved scoring 15,342 records, creating the table CLUSTER_APPLY_RESULT and inserting 153,420 scored records into the table CLUSTER_APPLY_RESULT.

image

Why did we get 10 times more records in our results table than we did in our source table ?

Using the batch mode i.e. using the DBMS_DATA_MINING.APPLY function it will create a record for each of the possible clusters that the record will belong too along with the probability of it belonging to that cluster. In our case we have built our clustering models based on 10 clusters.

In the following diagram we have a listing for two of the customers in our dataset, the clusters that have been assigned to them and the probability of that record/customer belonging to that cluster. We can then use this information to make various segmentation decisions based on the probabilities that each has for the various clusters.

image

Step 4 – Applying the Cluster model to new data – In Real-time mode

When we looked at applying a classification algorithm to new data we were able to use the PREDICTION SQL function. As clustering is an unsupervised data mining technique we will not be able to use the PREDICTION function.

Instead we have the functions CLUSTER_ID and CLUSTER_PROBABILITY.

CLUSTER_ID will tell us what cluster the record is most likely to belong too i.e. the cluster with the highest probability.

This is different to the bulk processing approach as we will only get one record/result being returned.

In the following example we are asking what cluster do these two customers most likely belong too.

SELECT customer_id,
       cluster_id(cluster_kmeans_model USING *) as Cluster_Num
FROM   insurance_cust_ltv
WHERE  customer_id in ('CU3141', 'CU3142');

image

Is we look back to Step 3 above we will see that the clusters listed correspond to what we have discovered.

The next function is CLUSTER_PROBABILTY. With this function we can see what the probability of customer belonging to a particular cluster. Using the results for customer CU3141 we can see what the probability is for this cluster, along with a few other clusters.

SELECT customer_id,
       cluster_probability(cluster_kmeans_model, '3' USING *) as Cluster_3_Prob,
       cluster_probability(cluster_kmeans_model, '4' USING *) as Cluster_4_Prob,
       cluster_probability(cluster_kmeans_model, '7' USING *) as Cluster_7_Prob,
       cluster_probability(cluster_kmeans_model, '9' USING *) as Cluster_9_Prob
FROM   insurance_cust_ltv
WHERE  customer_id = 'CU3141';

image

We can also combine the CLUSTER_ID and CLUSER_PROBABILITY functions in one SELECT statement.

In the following query we want to know what the most likely cluster is for two customers and the cluster probability.

SELECT customer_id,
       cluster_id(cluster_kmeans_model USING *) as Cluster_Num,
        cluster_probability(cluster_kmeans_model, cluster_id(cluster_kmeans_model USING *) USING *) as Cluster_Prob
FROM   insurance_cust_ltv
WHERE  customer_id in ('CU3141', 'CU3142');

image

Check back soon for my more blog posts on performing data mining in Oracle, using the Oracle Data Miner tool (part of SQL Developer) and the in-database SQL and PL/SQL code.

I hope you have enjoyed blog posts on Oracle Data Miner and you have found them useful. Let me know if there are specific topics you would like me to cover.

Thanks

Brendan Tierney

Monday, March 11, 2013

Oracle Magazine–March/April 1999

The headline articles for the March/April 1999 edition of Oracle Magazine were on the evolving world of the DBA. With some much new technology available in the database the role of the DBA is moving from a back office type role to one having a significant strategic influence in the organisation.
image
Other articles included:
  • Oracle releases a web based version of their Oracle Strategic Procurement application that includes three key parts: Strategic Sourcing, Internet Procurement and Process Automation.
  • Sun and Oracle announce a strategic agreement that allows both companies to enhance their product offerings by exchanging key technologies. Oracle will use the core of the Sun Solaris operating environment to deliver the industry’s first database server appliances.
  • Oracle Data Mart Suite releases version 2.5. It includes, Oracle Data Mart Builder, Oracle Data Mart Designer, Oracle 8 Enterprise Edition, Oracle Discoverer, Oracle Application Server and Oracle Reports and Reports Server.
  • New integration between Oracle Reports release 6.0 and Oracle Express Server release 6.2 to give users the ability to distribute high quality reports of information held in a multi-dimensional database across the enterprise.
  • The need for the DBA to know and understand the V$ views has been increasing during the later releases of 7.3 and 8i. The can be used for a variety of purposes, including understanding locked users, system resources, licencing and parameter settings.
  • One thing that all DBAs need to plan for is a database recovery. Planning it is one thing, but practicing it is another thing. A typical recovery plan will include, choosing a data file, create a backup, take the damaged tablespace offline, restore the damaged data file, bring the tablespace back online, recover the tablespace, bring the tablespace back online and test it.
  • Avoiding trigger errors, including Mutating and constraining table errors.
  • There is an article by Bryan Laplante on using Historgrams to Optimize Data Mart Performance.

To view the cover page and the table of contents click on the image at the top of this post or click here.
My Oracle Magazine Collection can be found here. You will find links to my blog posts on previous editions and a PDF for the very first Oracle Magazine from June 1987.

Tuesday, March 5, 2013

Clustering in Oracle Data Miner–Part 4

This is a the fourth part of a five (5) part blog post on building and using Clustering in Oracle Data Miner. The following outlines the contents of each post in this series on Clustering.
  1. The first part we looked at what clustering features exist in ODM and how to setup the data that we will be using in the examples
  2. The second part will focus on how to building Clusters in ODM .
  3. The third post will focus on examining the clusters produced by ODM and how to use the Clusters to apply to new data using ODM.
  4. The fourth post will look at how you can build and evaluate a Clustering model using the ODM SQL and PL/SQL functions.
  5. The fifth and final post will look at how you can apply your Clustering model to new data using the ODM SQL and PL/SQL functions.
With Clustering we are trying to find hidden patterns in our data. Unlike classification we a not directing the algorithms on what areas/attributes to focus on.
In my preceding post on Clustering in Oracle Data Miner I showed how you can use the Oracle Data Miner (ODM) tool, that is part of SQL Developer, to build a cluster model and to apply it to new data.
As I have shown in previous blog posts we can do all of the same steps using SQL and PL/SQL. In this blog post I will show you how to to these SQL and PL/SQL functions to generate a Clustering model and how you can evaluate the model produced.
Step 1 – What Cluster Models to we already have
Oracle Data Miner comes with a number of data dictionary tables. We can use these to see what already exists in our schema. In the example below we will have a look to see the Cluster models that were produced in the Part 2 blog post.
format model_name format a20
format mining_function format a20
format algorithm format a20

SELECT model_name,
       mining_function, 
       algorithm, 
       build_duration, 
       model_size
FROM ALL_MINING_MODELS
WHERE mining_function = 'CLUSTERING';

image
We can also look at the model settings that ODM produced. We can look at the settings of the K-Means model. This model (in my case) is called CLUS_KM_1_25.
column setting_value format a20
select setting_name,
         setting_value,
         setting_type
from  all_mining_model_settings
where model_name = 'CLUS_KM_1_25'

image
We can also look to see what attributes are used in the K-Mean model.
SELECT attribute_name, 
       attribute_type, 
       usage_type, 
       target
from  all_mining_model_attributes
where model_name = 'CLUS_KM_1_25'

image
I will show you how to use this Clustering model or the one that we are about to generate in my next blog post.
Step 2 – Creating the Setting table
When creating an Oracle Data Mining model in SQL you need to create Setting table. This will contain a record for each setting they you need to set for the algorithm you want to run. The settings table has a very basic structure and only consists of two columns. It is a good idea to create a separate Settings table for each algorithm or type of data mining you want to do. To create the settings table we run the following:
CREATE TABLE CLUSTER_SETTINGS (
Setting_Name  VARCHAR2(30),
Setting_Value VARCHAR2(4000));

The next step is to define the setting we want to use. The first of these will be the actual algorithm you want to run. Oracle has two Clustering algorithms: algo_kmeans and algo_o_cluster. In the examples we are working through we are using the K-Mean algorithm and we want to set the number of clusters to 10. We are also going to set the Automatic Data Preparation (ADP) on. Here is the code.
BEGIN
   INSERT INTO CLUSTER_SETTINGS (setting_name, setting_value)
   VALUES (dbms_data_mining.algo_name, dbms_data_mining.ALGO_KMEANS);
   INSERT INTO CLUSTER_SETTINGS (setting_name, setting_value)
   VALUES (dbms_data_mining.prep_auto, dbms_data_mining.PREP_AUTO_ON);
   INSERT INTO CLUSTER_SETTINGS (setting_name, setting_value)
   VALUES (dbms_data_mining.clus_num_clusters, 10);
  
   COMMIT;
END;

We can check to see that these records were inserted.
column setting_value format a25
select * from CLUSTER_SETTINGS;
image
The other settings will be dependent on what clustering algorithm is being used. Each one has their own setting. If you do not define any additional setting Oracle will use the in-build default settings.
To see what the default settings:
column setting_value format a35
select *
from table(dbms_data_mining.get_default_settings)
where setting_name like 'KM%';

image
If you want to use a different value to the default, just create a new record in the CLUSTER_SETTINGS table with the new value.

Step 3 – Create the new Cluster Model
We have the algorithm settings ready, so the next steps is to create the Cluster model. We will be using the DBMS_DATA_MINING.CREATE_MODEL function and pointing it to our CLUSTER_SETTINGS table.
BEGIN
   DBMS_DATA_MINING.CREATE_MODEL(
      model_name          => 'CLUSTER_KMEANS_MODEL',
      mining_function     => dbms_data_mining.clustering,
      data_table_name     => 'INSURANCE_CUST_LTV',
      case_id_column_name => 'CUSTOMER_ID',
      target_column_name  => null,
      settings_table_name => 'CLUSTER_SETTINGS');
END;

This takes a few seconds to run on my laptop. When finished we can uses queries like we used in Step 1 above to find out the details of what was produced.
Step 4 – Examining the Cluster model
The first thing that we will need to check is that the model was created.
format model_name format a20
format mining_function format a15
format algorithm format a15

SELECT model_name, 
       mining_function,
       algorithm,
       build_duration,
       model_size
FROM ALL_MINING_MODELS
WHERE mining_function = 'CLUSTERING';

image
One thing that is a bit curious is the mode size. We have basically created two K-means models using the same settings, but the final models have different sizes. See below for an explanation.
Now lets look at the algorithms settings for our new Cluster model (Cluster KMeans Model)
select setting_name,
       setting_value,
       setting_type
from  all_mining_model_settings
where model_name = 'CLUSTER_KMEANS_MODEL'

image
We can also look at the attributes used in the clusters.
SELECT attribute_name, 
       attribute_type, 
       usage_type, 
       target
from  all_mining_model_attributes
where model_name = 'CLUSTER_KMEANS_MODEL'
image
If we compare this list back to the list of attributes that were part of the model that we developed in the ODM tool, you will see that we have some extra attributes listed for the new K-means model.
The reason for these extra attributes and the bigger size (although still small) will be due to us having applied a sampling in the ODM tool. We took a subset of the data in the ODM tool and built the model based on this data. For the model we just created in SQL we took all the data (no sampling). That is why it would have looked at including more attributes because it was looking at a much bigger data set, etc.
Step 5 – Examining the Cluster Details
In this step we are going to look at some of the clusters produced by our new Cluster model.
We can produce the information relating to the clusters that were produced and what level in the hierarchy each cluster belongs. We can use this information to create/draw our own cluster hierarchy diagram.
set lines 90
column child format a40

select id, 
       record_count, 
       parent, 
       tree_level, 
       child
from table(dbms_data_mining.get_model_details_km('CLUSTER_KMEANS_MODEL'))

image
To look at the Centroid details for a particular cluster e.g. Cluster ID = 7, we can run:
column attribute_name format a25
column mode_value format a25

select t.id,
       c.attribute_name, 
       c.mean,
       c.mode_value,
       c.variance
from table (dbms_data_mining.get_model_details_KM('CLUSTER_KMEANS_MODEL')) t,
     table(t.centroid) c
where t.id = 7
order by c.attribute_name

The results will contain details of the centroid for cluster number 7. The centroid details will include the attributes that were used to form the centroid and the values for each attribute.
image
The last thing we can look at is the rule for a cluster. The following query will give you the full details of the cluster rule for Cluster 7. This will give for each attribute what the values or ranges of values that is used for that cluster. I’ll let you run this as it returns a large number of records. Run it and have a look at some of the output.
select t.id, 
       a.attribute_name, 
       a.conditional_operator,
       nvl(a.attribute_str_value,
       a.attribute_num_value) AS value,
       a.attribute_support,
       a.attribute_confidence
from TABLE(dbms_data_mining.get_model_details_km('CLUSTER_KMEANS_MODEL'))  t, 
     TABLE(t.rule.antecedent)  a
where t.id = 7
ORDER BY t.id, a.attribute_name, attribute_support, attribute_confidence desc, value;


My next blog post on Clustering in Oracle, will be the final one in the series. It will show you how to apply a Clustering model to new data in your database.