Thursday, February 25, 2016

Recoding variable values using ore.recode

Oracle R Enterprise comes with a vast array of features that not really documented anywhere. One of these features that I've recently found useful is the ore.recode() function.

The following code illustrates how you can records the values in an existing attributes or (more specifically in this example) how you can create a new attribute based on the values in another attribute.

The data set that I'm using is the White Wine data set that can be found on the UCI Machine Learning Repository Archive website. You can download this data set and load it into a table in your Oracle schema using just two commands.

> WhiteWine = read.table("http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv",
                        sep=";", header=TRUE)
> ore.create(WhiteWine, table="WHITE_WINE")

This data set has an attribute called "quality". This "quality" attribute contains values ranging from 2 to 8, and indicates the quality of the wine.

A typical task you may want to do is to relabel values into attributes to something a bit more meaningful or to group some values into a more standardised value.

To demonstrate this I want to create a new attribute that contains a description of the type of wine (and who I might share it with).

In this case, and to allow for other values in future versions of the data sets I've coded up the following:

quality  grade
-------  ----------------
1        Paint Stripper
2        Vinegar
3        Barely Drinkable
4        For the in-laws
5        For my family
6        To share with friends
7        For cooking
8        To share with my wife
9        Mine all Mine

The next step we need to perform is to gather some information about the values in the "quality" attribute. We can use the table command to quickly perform the aggregations, and then use the marplot function to graph the distributions.

> WHITE_WINE2 <- WHITE_WINE
> table(WHITE_WINE2$quality)
> barplot(table(WHITE_WINE2$quality), xlab="Wine Quality Ranking")

NewImage

Now we are ready to perform the recoding of the values using the ore.recode() function.

> WHITE_WINE2$grade <- ore.recode(WHITE_WINE2$quality, old=c(1, 2, 3, 4, 5, 6, 7, 8, 9), 
                     new=c("1-Paint Stripper", "2-Vinegar", "3-Barely Drinkable",
                           "4-For the in-laws", "5-For my family", "6-To share with friends", 
                           "7-For cooking", "8-To share with my wife", 
                           "9-Mine all Mine"))

You can now go and inspect the data, perform a frequency count and compare the values with what we had previously.

> head(WHITE_WINE2[,c("quality", "grade")]) 
> table(WHITE_WINE2$grade) 

The final step is to write the newly modified data set back to your Oracle schema into a new table. This is to ensure that the original data is modified so that it can be used or reused later.

> ore.create(WHITE_WINE2, "WHITE_WINE2")

Thursday, February 18, 2016

OUG Ireland 2016: APPs Track highlights

Today I was joined by Debra Lilly who is the APPs track lead and conference chair for OUG Ireland. Debra lets us know what we can look forward to on the APPs track at this years conference.

Check out Debra's video.

Click on the image below to get more details of the agenda and to register for this 2 day conference.

NewImage

Follow the conference and OUG Ireland on twitter using #oug_ire

Monday, February 15, 2016

OUG Ireland 2016: DBA and Developer track highlights

This morning I records a short video (1:18m) view with Simon Holt giving a quick overview and some highlights of the DBA and Developer tracks for the upcoming OUG Ireland 2016 conference.

Click on the image below to get more details of the agenda and to register for this 2 day conference.

NewImage

Follow the conference and OUG Ireland on twitter using #oug_ire

Friday, February 12, 2016

Spark versus Flink

Spark is an open source Apache project that provides a framework for multi stage in-memory analytics. Spark is based on the Hadoop platform and can interface with Cassandra OpenStack Swift, Amazon S3, Kudu and HDFS. Spark comes with a suite of analytic and machine learning algorithm allowing you to perform a wide variety of analytics on you distribute Hadoop platform. This allows you to generate data insights, data enrichment and data aggregations for storage on Hadoop and to be used on other more main stream analytics as part of your traditional infrastructure. Spark is primarily aimed at batch type analytics but it does come with a capabilities for streaming data. When data needs to be analysed it is loaded into memory and the results are then written back to Hadoop.

NewImage

Flink is another open source Apache project that provides a platform for analyzing and processing data that is in a distributed stream and/or batch data processing. Similarly to Spark, Flink comes with a set of APIs that allows for each integration in with Java, Scala and Python. The machine learning algorithms have been specifically tuned to work with streaming data specifically but can also work in batch oriented data. As Flink is focused on being able to process streaming data, it run on Yarn, works with HDFS, can be easily integrated with Kafka and can connect to various other data storage systems.

NewImage

Although both Spark and Flink can process streaming data, when you examine the underlying architecture of these tools you will find that Flink is more specifically focused for streaming data and can process this data in a more efficient manner.

There has been some suggestions in recent weeks and months that Spark is now long the tool of choice for analytics on Hadoop. Instead everyone should be using Flink or something else. Perhaps it is too early to say this. You need to consider the number of companies that have invested significant amount of time and resources building and releasing products on top of Spark. These two products provide similar-ish functionality but each product are designed to process this data in a different manner. So it really depends on what kind of data you need to process, if it is bulk or streaming will determine which of these products you should use. In some environments it may be suitable to use both.

Will these tool replace the more traditional advanced analytics tools in organisations? the simple answer is No they won't replace them. Instead they will complement each other and if you have a Hadoop environment you will will probably end up using Spark to process the data on Hadoop. All other advanced analytics that are part of your more traditional environments you will use the traditional advanced analytics tools from the more main stream vendors.

Tuesday, February 2, 2016

ORE video : Demo Code Part 5

The following is the fifth and final set of demo code from my video on using R in the Oracle Database. Check out the video before using the following code. The blog post for the video will be updated to contain links to all blog posts that have the various demo code.

The following examples illustrate how you can use the Oracle R Enterprise capabilities within SQL and PL/SQL. The following illustrate building a GLM model using the glm algorithm that comes with the R language, and then uses this mode to score or label new data that is stored in a table. The last part of the example illustrates how you can perform What-If analysis using this ORE model

-- Build & save the R script, called Demo_GLM in the DB
--  This builds a GLM  DM model in the DB
--
Begin
   sys.rqScriptDrop('Demo_GLM');
   sys.rqScriptCreate('Demo_GLM',
      'function(dat,datastore_name) {
          mod <- glm(AFFINITY_CARD ~ CUST_GENDER + AGE + CUST_MARITAL_STATUS + COUNTRY_NAME + CUST_INCOME_LEVEL + EDUCATION + HOUSEHOLD_SIZE + YRS_RESIDENCE, dat, family = binomial())
       ore.save(mod, name=datastore_name, overwrite=TRUE)   }');
end;
/

--
-- After creating the script you need to run it to create the GLM model
--
select * 	
from table(rqTableEval(
             cursor(select CUST_GENDER,
                           AGE,
                           CUST_MARITAL_STATUS,
                           COUNTRY_NAME,
                           CUST_INCOME_LEVEL,
                           EDUCATION,
                           HOUSEHOLD_SIZE,
                           YRS_RESIDENCE,
                           AFFINITY_CARD
                    from mining_data_build_v),
             cursor(select 1 as "ore.connect", 'myDatastore' as "datastore_name” from dual),
                  'XML', 'Demo_GLM' ));


--
-- There are 2 ways to use the GLM model : in Batch and in Real-Time mode
--
-- First Step : Build the in-database R script to score you new data
--
Begin
   sys.rqScriptDrop('Demo_GLM_Batch');
   sys.rqScriptCreate('Demo_GLM_Batch',
      'function(dat, datastore_name) {
      ore.load(datastore_name)
      prd <- predict(mod, newdata=dat)
      prd[as.integer(rownames(prd))] <- prd
      res <- cbind(dat, PRED = prd)
      res}');
end;
/

-- 
-- Now you can run the script to score the new data in Batch model
--   The data is located in the table MINING_DATA_APPLY
--
select * from table(rqTableEval(
              cursor(select CUST_GENDER, AGE, CUST_MARITAL_STATUS, COUNTRY_NAME, CUST_INCOME_LEVEL, EDUCATION, HOUSEHOLD_SIZE, YRS_RESIDENCE
                     from   MINING_DATA_APPLY_V
                     where rownum <= 10),
             cursor(select 1 as "ore.connect", 'myDatastore' as "datastore_name" from dual),
             'select CUST_GENDER, AGE, CUST_MARITAL_STATUS, COUNTRY_NAME, CUST_INCOME_LEVEL, EDUCATION, HOUSEHOLD_SIZE, YRS_RESIDENCE, 1 PRED from MINING_DATA_APPLY_V','Demo_GLM_Batch'))
order by 1, 2, 3;

--
-- Now let us use the Demo_GLM_Batch script to score data in Real-Time
--  The data values are passed to the GLM model
--
select * from table(rqTableEval(
              cursor(select 'M' CUST_GENDER,
                            23 AGE,
                            'Married' CUST_MARITAL_STATUS, 
                            'United States of America' COUNTRY_NAME,
                            'B: 30,000 - 49,999' CUST_INCOME_LEVEL, 
                            'Assoc-A' EDUCATION,
                            '3' HOUSEHOLD_SIZE, 
                            5 YRS_RESIDENCE
                     from dual),
              cursor(select 'myDatastore' "datastore_name", 1 "ore.connect" from dual),
                    'select CUST_GENDER, AGE, CUST_MARITAL_STATUS, COUNTRY_NAME, CUST_INCOME_LEVEL, EDUCATION, HOUSEHOLD_SIZE, YRS_RESIDENCE, 1 PRED from MINING_DATA_APPLY',
              'Demo_GLM_Batch')) order by 1, 2, 3;