Showing posts with label Data Science. Show all posts

Thursday, February 7, 2019

Ethics in the AI, Machine Learning, Data Science, etc Era

Ethics is one of those topics where everyone has a slightly different definition or view of what it means. The Oxford English Dictionary defines ethics as 'Moral principles that govern a person's behaviour or the conducting of an activity'.

As you can imagine this topic can be difficult to discuss and has many, many different aspects.

In the era of AI, Machine Learning, Data Science, etc., the topic of Ethics is finally becoming an important one. Again there are many perspectives on this. I'm not going to get into these in this blog post, because if I did I could end up writing a PhD dissertation on it.

But if you do work in the area of AI, Machine Learning, Data Science, etc., you do need to think about the ethical aspects of what you do. Most people will be working on topics where ethics doesn't really apply, for example examining log data, looking for trends, etc.

But when you start working on projects examining individuals and their behaviours then you do need to examine the ethical aspects of such work. Every day we experience adverts, web sites, marketing, etc. that have used AI, Machine Learning and Data Science to deliver certain product offerings to us.

Just because we can do something, doesn't mean we should do it.

One particular area that I will not work on is Location Based Advertising. Imagine walking down a typical high street with lots and lots of retail stores. Your phone vibrates and on the screen there is a message. The message is a special offer or promotion for one of the shops a short distance ahead of you. You are being analysed. Your previous buying patterns and behaviours are being analysed. Your location and direction of travel are being analysed. Someone, or one of many AI applications, is watching you. This is not anything new and there are lots of examples of this from around the world.
But what if this kind of Location Based Advertising was taken to another level? What if the shops had cameras that monitored the people walking up and down the street? What if those cameras were analysing you, analysing what clothes you are wearing, analysing the brands you are wearing, analysing what accessories you have, analysing your body language, etc.? They are trying to analyse if you are the kind of person they want to sell to. They then have staff who will come up to you, as you are walking down the street, and will have customised, personalised special offers on products in their store, just for you.

See the segment between 2:00 and 4:00 in this video.  This gives you an idea of what is possible.



Are you Ok with this?

As an AI, Machine Learning, Data Science professional, are you Ok with this?

The technology exists to make this kind of Location Based Marketing possible. This will be an increasing ethical consideration over the coming years for those who work in the area of AI, Machine Learning, Data Science, etc.

Just because we can, doesn't mean we should!


Monday, August 1, 2016

Why Data Science projects fail

Over the past few weeks or months (maybe even years) I've had several conversations with various people about why Data Science (or whatever you want to call it) projects fail or never really get started.

Before we go any further perhaps we need to define what 'fail' means in these conversations. Typically fail means that the project doesn't deliver what was hoped for, it got bogged down in some technical or political issues, it did not deliver useful results, and more typically it is run once (or a couple of times) and never run again. You get the idea.

The following points outline some of the most typical reasons why Data Science projects fail. This is not an exhaustive list, just some of the reasons I see most often.

  • We need Big Data: It seems like everything that you read says you need Big Data for your data science project. Firstly, what big data means to one person or company can be very different to what it means for another person/company. One possible definition is that it might include all the various social media and log type of data. If you don't have all of this data then it is no big deal. You can still do data science projects. You have lots and lots of other data: the data that you generate every day for the general running of your business. You can use that. If you have some history of this data going back over a few months or a couple of years then even better (and most of you will say Yes I have that data). Work with the data that you already have, that you already understand, that you are already using, etc. and use that data to see if you can gain extra insights that will have some value to your business (it needs to have value, otherwise what's the point). Some people call this everyday type of data you have 'Small Data'. Big Data and Small Data are really bad terms. It is just Data. Let us work with the data we already have and incrementally add in newer data (from your typical 'Big Data' sources) with each iteration of the data science project.
  • We need Big Technology: This kind of follows on from the mistake of believing we need Big Data to do our data science projects. Most companies will be working with the data that they already have, and will have various technology solutions in place to manage this data. So do we really need Big Data Technology solutions, like Hadoop and everything that goes along with it, for our Data Science projects? The simple answer is 'No You Don't'. Now don't get me wrong. These technologies are important when it comes to managing Big Data, but you don't need these to perform your data science projects. Many, many companies both large and small are performing data science projects using their existing technology solutions and have perhaps just added some analytics tools to support their project using the data that they are already managing. Most companies have databases to store and manage their data. You can use your analytics software to work with the data in these databases to analyse, model and predict. Any results that are produced can be easily integrated back into these databases and the results can then be used by various groups within your organisation. Use the technologies you have, that you understand, that you can use to the max, supplemented with some newer analytics software that works with all of these for your data science projects. (An example: one project I've worked on involved a retail organisation for one of the largest countries in the world. I was working with 3 years of sales data. Is this big data? I was able to use my laptop to perform advanced analytics on all their data.)
  • Old School Data Science: 'Give me all your data, I'll analyse it and tell you what is happening.' Unfortunately this kind of phrase is still very common. It was common, and already considered out of date, 20 years ago when I worked on my first data science project (it wasn't called data science back then). If you do come across someone saying this to you, I would question their ability to deliver anything. If it was me, I would just say 'No thank you', and move on to someone else. You as a company will already know a lot of what is happening in your business, what data is currently being used for and any potential areas where you know advanced analytics and data science can help. You will know what the focus areas should be and how good or not your data is. You need someone who can help you to identify the key areas and what data science techniques can be used to help you to gain (a possible) greater insight into what is happening.
  • No clear objective or business question/problem and no measurable outcomes: In a way this is very similar to the previous point. You don't get into your car each morning and start driving, with the eventual hope that you arrive at work on time. No, you plan what you want to do (get to work), how you are going to get there (using your car) and when you want to get there by (your work start time). Using these you then plan out the best route to get to work, in the most efficient way you can, using your knowledge and experience of the road network, supplemented by traffic reports and making adjustments as necessary, to ensure that you get to work on time. This is exactly the same for data science projects. You need a good clear objective, that can be broken down into distinct problems, that will each require a specific set of advanced analytics to generate a measurable outcome. The measurable outcomes should allow you to measure if the advanced analytics actually gives you a valuable return. For example if you predict that you can increase sales by 3%, this sounds good. But if the cost of implementing the solution is eating up any of the profit generated then you might decide that this solution is not worth continuing with.
  • Not productionalising the outcomes: This point follows on from the previous two points. A lot of what you read and a lot of what I've seen is that Data Science looks at discovering some new (and actionable) insights. But that is where the discussion ends, as if a report is produced that makes a recommendation or a list of customers to target, and that is it. What happens to your data science project then? It either gets canned or you might be told that we will come back to it in a few months (and possibly a year) from now. This is not what you really want. Why? Because when you finally remember to come back to review the project and to do another run, the people who were involved in the original project have moved on or are not available. It then becomes too difficult to start over again and that is when the data science project fails. I've used the word 'productionalising' (is that a real word?). What I mean by that is that we need to take our data science project and build it into our everyday applications and processes. For example, say we build a customer risk model for loans in a bank. This should be built into the application that captures the loan application by the customer. That way when the bank employee is entering the loan application they can be given live feedback. They can then use this live feedback to address any issues with the customer. What can be typical is that this is discovered some weeks later when the loan has already been approved. We need to automate the use of our data science work. Another example is fraud detection. I know of several companies who have fraud detection measures in place. It can take them 4-6 weeks to identify a potential fraud case that needs investigation. Using data science and building this into their transaction monitoring systems they can now detect potential fraud cases in near real time (no big data architectures being used). By automating it we get a quicker response and can take actions at the right time. The quicker we can react the more money we can make or save. This is an area that a lot of companies are now focusing on when they are looking at data science projects, as this is the way that they can get a quicker return on their investment in their data science projects.
  • Very little senior management support: I think most data science projects are supported by senior management to some extent. The more successful the data science project, the more involved the senior managers are and the more they understand of what these projects can potentially deliver. But with the ever changing and evolving world of IT most of the senior managers are very focused on the here and now, keeping the lights on, making sure their day-to-day applications are up and running, the backups and recovery processes are in place (and tested), and future proofing their applications. It is well known that very little time and resources (human and money) are available for adding new functionality. Most of what I've mentioned is very IT related and perhaps the IT managers are not the most suitable people to sponsor data science projects. I've already mentioned some of the reasons, but sometimes IT can get a bit caught up with the technology and trying to use the newest thing. Some of the most successful projects I've worked on have had senior managers from a business function as sponsors. They will not be focused on the technology but on the processes around the data science project and how the outputs of the data science project can be used. The more focused they are on this the more successful the project will be. They will then act as the key to informing (and selling) the rest of the business on the success of the project. This in turn creates more and more data science projects and will keep you busy for a long time to come.
  • Ticking the box: Unfortunately I've seen this in way too many companies. The board or the senior management team have heard about data science and all the magic that it can produce. The message is then passed down through the organisation that we need to be doing more and more of this. A business unit is chosen for the pilot project. The pilot is completed, successfully, and the good news message is fed back up the ladder. But that is when enthusiasm ends. We have done a data science project, it was successful and now let's move on to the next thing. I've seen pilot or POC projects that have proven to potentially save $10+M a year with a cost of $100K per year, being canned. Yes, I've been told this is fantastic, this is beyond our wildest dreams. Only for nothing else to happen.
  • The data is no good: You need data, you need historical data. The more you have the more useful it will be for the data science project. But what if the data is of poor quality? How can this happen? Well it can happen very frequently. You may have applications that are poorly designed, that have a very poor data model, the staff are not trained correctly to ensure that good data gets entered, etc. The list could go on and on. It is one thing for an application to capture data but if that data cannot be used for any meaningful purposes then it has very little value. Some companies have people hired that constantly inspect the data, assess the quality of the data and then feed back ideas on how to improve the quality of the data captured by the applications and also by the people inputting the data. Without good quality data there is very little a data science project can do to magically convert it into good quality data. I've been in the situation where >90% of the data was unusable. We gave them a list of the improvements they needed to make and told them to only come back to us when they had completed these and had at least 6 months of good quality data. We might be able to do something then. We never heard from them again. Also I get to talk to a lot of start ups who want to have data science built in from day one. These have very little 'real' data. Again I get to tell them to come back to me when they have 6 months of data.
  • Too much focus on descriptive analytics: Although descriptive analytics is an important step in the early stages of all data science projects, there is still a huge number of consulting and product companies who are promoting this as a data science project. Like I said descriptive analytics is an important step, but it doesn't end there. It is just the beginning. When selecting a consulting or product company to partner with on your data science projects you need to ensure that they are offering more than just descriptive analytics. In a similar way to what I've mentioned in the points above, you need to look at how you can make use of these descriptive analytics and share them with the wider community in your company. But you also need to have some control over the proliferation of various visualisation tools. Descriptive analytics and visualisations are not data science or a major output of data science. They are only one part of a data science project, and far more valuable outputs can be achieved by using one or more of the advanced analytics methods that are available to you.
  • Ignoring your BI/DW: Unfortunately when it comes to a lot of data science projects you have two very different approaches to working with the data. One approach seems to be that we will look at your data that is available in the transactional databases (and other data sources), we will then look at how to integrate and clean this data before getting onto the fun stuff of exploring and then performing the advanced analytics. This approach completely ignores the BI team and any data warehouse that might exist. If a data warehouse already exists then it probably contains all or most of the data you are going to use. Therefore you can avoid all that time spent integrating and cleaning the data. The data warehouse will have this done for you. Plus the data warehouse will have a lot more data than what the current transactional databases will contain. Please, Please, Please use the data in the data warehouse and you will find that you will save a lot of time on your data science project. In addition to the time saved you will have a lot more (possibly years of) data to work with. I always try to work with data warehouse data. When I do I can go back 5 years and build predictive models from back then. I can then roll these through various time periods and can easily measure how good the predictions I'm getting are. I also get to see if there are any changes in the data and how they affect the models. Plus I also get to see how the various algorithms and their associated models change and evolve over time. This allows me to demonstrate to the customer how the use of data science and predictive models works with their data over the past 5 years. This builds up confidence with the customer on what is being done and what can be achieved. In one case I was able to demonstrate that if they had implemented my solution 5 years ago, they would have saved $40+M in that time period. If I didn't use the data warehouse I wouldn't have been able to prove this. Needless to say the customer was very happy.
  • Make up of team is wrong: You don't need a team of PhDs: There has been lots written about what the make up of skills in your data science team should be. A few years ago all the talk was that you need to have people with PhDs in maths, stats or related areas. Plus all you needed to do was to hire one of these. We all know that this is not true but it was part of the rubbish that people were talking about. We all know that you really need a team of people and perhaps you already have some of these people employed in your company. You have database people, you have ETL people, you have data integration people, you have data analysts, you have project managers, you have business analysts, you have domain experts, etc. How many of those people have PhDs or require a PhD to do their job? But perhaps you don't have people with the skills of applying advanced analytic techniques to your data and business problems. Perhaps it is these people who you really need the most. Do these people really need to have a PhD? No they don't. You need someone who knows and understands the various techniques and most importantly how to use these to solve business problems. All too often people try to show off about using a particular technique or parameter setting, or a particular formula, or graphic technique, or using a certain language over another, or what library or package is the best. Don't engage in this. Look for people that can apply the correct technique or combination of techniques to your business problems. But despite what I said in the first two points, as your data management requirements grow you are going to need some additional people with big data technology skills.
  • Communication: Being able to explain what data science can do, what it is producing and relating that back to the business. Being able to work with the management team, end users and all involved to show and explain what the data science project can do to support their work, and how. Most technical people are not good at this. But some people are, and these are a very valuable resource as part of your data science team, or as keen supporters of what data science can do and how it can be used to help the business develop new and interesting actionable insights.
  • The output is not a report => You need to operationalise/productionalise the data science project: See the point above on productionalising your data science work. The outputs should not be a report or a list of some form. With proper planning data science can become central to all the operational systems in your company. It can help you make better and quicker decisions on how you interact with your customers, improve the efficiencies of your processes, etc. The list goes on and on. All data science projects are cyclical in nature. For example, you develop a churn prediction system. You use this to interact with your customers. You are trying to change or alter their behaviour and this in turn changes them as a customer. This in turn affects the churn prediction system. It will no longer be as effective. So you will need to update it on a semi-regular basis. This could be every 3, 4, 6, or 12 months. It all depends. You can build checks into your productionalised data science projects to detect when the predictive models need updating. This in turn helps your data science team to be more productive, with quicker turn around times for each iteration. Also with each iteration you can look to see if new data is available for you to include and use. Maybe at this point some of your big data sources are coming online with some useful data.
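The last point mentions building checks into a productionalised model to detect when it needs updating. A minimal sketch of one such check is shown here in R. The function name needs_refresh, the 0.85 baseline, the 0.05 tolerance and the example data are all made up for illustration; a real project would pick a measure and thresholds that suit its own business problem.

```r
# Hypothetical helper: flag a model for a refresh when its recent accuracy
# drops more than `tolerance` below the accuracy measured at deployment.
needs_refresh <- function(actual, predicted, baseline_accuracy, tolerance = 0.05) {
  stopifnot(length(actual) == length(predicted), length(actual) > 0)
  recent_accuracy <- mean(actual == predicted)
  list(
    recent_accuracy = recent_accuracy,
    refresh = recent_accuracy < baseline_accuracy - tolerance
  )
}

# Made-up example: 10 recent customers, 1 = churned, 0 = stayed
actual    <- c(1, 0, 0, 1, 0, 1, 0, 0, 1, 0)
predicted <- c(1, 0, 1, 0, 0, 0, 0, 1, 1, 0)

check <- needs_refresh(actual, predicted, baseline_accuracy = 0.85)
check$recent_accuracy   # share of recent predictions that were correct
check$refresh           # TRUE means it is time to rebuild the model
```

A check like this could be scheduled to run each month against the latest outcomes, so the decision to update the model is driven by the data rather than by the calendar.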

So when looking to start a Data Science project it is important to know a few things before you start. The following attempts to use the 5 W's to try to explain these.

  • what you are doing
  • why you are doing it
  • who it is for and what they will gain from it
  • where will it be used within your applications/processes
  • when you are going to commence the project and how it will fit into strategic goals of your organisation

There has been plenty written about what magic Data Science projects will produce and bring to your organisation. You need to be careful of people who only talk about the magic. You also need to understand that it may not work or deliver what you are led to believe. In most of the projects I've worked on we have had some amazing results. But in one or two projects we have had results that were only a percentage or two better than what they were already doing.

Perhaps I need to write another blog post on 'Why Data Science projects succeed', and this will only be based on what I've experienced (in the real-world).

Like I said at the beginning, this is not an exhaustive list. There are many more and I'm sure you will have a few of your own. These are the typical reasons that I've come across in my 20 years of doing these kinds of projects, starting long before the term data science existed.


Tuesday, July 28, 2015

Charting Number of R Packages over time (Part 3)

This is the third and final blog post on analysing the number of new R packages that have been submitted over time.

Check out the previous blog posts:

In this blog post I will show you how you can perform Forecasting on our data to make some predictions on the possible number of new packages over the next 12 months.

There are 2 important points to note here:

  1. Only time will tell if these predictions are correct or nearly correct, just like with any other prediction technique.
  2. You cannot use just one of the Forecasting techniques in isolation to make a prediction. You need to use a number of functions/algorithms to see which one suits your data best.

The second point above is very important with all prediction techniques. Sometimes you see people/articles talking about them only using algorithm X. They have not considered any of the other techniques/algorithms. It is their favourite or preferred method. But that does not mean it works or is suitable for all data sets and all scenarios.

In this blog post I'm going to use 3 different forecasting functions: the forecast function (from the forecast package), HoltWinters, and finally ARIMA. Yes there are many more (it is R after all) and I'll leave these for you to explore.

1. Convert data set to Time Series data format

The first thing I need to do is to convert the data I want to analyse into TimeSeries format (ts). This expects one record or instance for each data point.

So you cannot have any missing data, or in my case any missing dates. Yes (for my data set) we could have some months where we do not have any submissions. What I could do is to work out mean values (or things like that) and fill in the missing months. But I'm feeling a bit lazy and after examining the data I see that we have a continuous set of data from August 2009 onwards. This is fine as most of the growth up to that point is flat.

So I need to subset the data to only include cases greater than or equal to August 2009 and less than or equal to June 2015. I wanted to exclude July 2015 as the number for this month is incomplete.
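For anyone who would rather keep the earlier data, the alternative mentioned above (filling in the missing months) could be sketched roughly as follows. The small counts data frame is made up for illustration, it only reuses the Group.date and R_NUM column names from this series, and filling with zero is just one choice; a mean or interpolated value could be substituted.

```r
# Made-up monthly counts with two months missing (2009-03 and 2009-05)
counts <- data.frame(
  Group.date = as.Date(c("2009-01-01", "2009-02-01", "2009-04-01", "2009-06-01")),
  R_NUM      = c(2, 3, 4, 3)
)

# Every month between the first and last date in the data
all_months <- data.frame(
  Group.date = seq(min(counts$Group.date), max(counts$Group.date), by = "month")
)

# Left join: months with no submissions come back as NA
filled <- merge(all_months, counts, by = "Group.date", all.x = TRUE)
missing_months <- filled$Group.date[is.na(filled$R_NUM)]

# Treat a month with no submissions as zero new packages
filled$R_NUM[is.na(filled$R_NUM)] <- 0
filled
```

The filled data frame now has one row per month with no gaps, which is what ts() needs.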

The following code builds on the work we did in the second blog post in the series

library(forecast)
library(ggplot2)

# Subset the data
sub_data <- subset(data.sum, Group.date >= as.Date("2009-08-01", "%Y-%m-%d"))
sub_data <- subset(sub_data, Group.date <= as.Date("2015-06-01", "%Y-%m-%d"))

# Subset again to only take out the data we want to use in the time series
sub_data2 <- sub_data[,c("R_NUM")]

# Create the time series data, stating that it is monthly (12) and giving the start and end dates
ts_data <- ts(sub_data2, frequency=12, start=c(2009, 8), end=c(2015, 6))

# View the time series data
ts_data
     Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
2009                               2   3   4   3   4
2010   5   5   1  11   5   2   5   4   4   5   1   3
2011  11   4   3   6   6   5   9  15   5   8  23  18
2012  33  17  51  28  37  33  50  71  41 231  51  60
2013  75  67  76  81  76  74  77  89 111  96 111 200
2014 155 129 175 140 145 133 155 207 232 162 229 310
2015 308 343 332 378 418 558    

We now have the data prepared for input to our Forecasting functions in R.

2. Using Forecast in R

For the Forecast function all you need to do is pass in the Time Series dataset and tell the function how many steps into the future you want it to predict. In all my examples I'll ask the functions to predict for the next 12 months.

ts_forecast <- forecast(ts_data, h=12)
ts_forecast

         Point Forecast     Lo 80     Hi 80 Lo 95     Hi 95
Jul 2015       447.1965  95.67066  784.3163     0  958.6669
Aug 2015       499.4344 115.94329  873.1822     0 1069.8689
Sep 2015       551.7875 123.88426  952.0773     0 1230.4707
Oct 2015       603.7212 156.89486 1078.0395     0 1370.2069
Nov 2015       654.7647 143.29718 1179.4903     0 1603.8335
Dec 2015       704.5162 135.76829 1352.8925     0 1844.7230
Jan 2016       752.6447 151.09936 1502.6088     0 2100.9708
Feb 2016       798.8877 156.37383 1652.0847     0 2575.3715
Mar 2016       843.0474 159.67095 1848.1703     0 2888.1738
Apr 2016       884.9849 154.59456 2061.7990     0 3281.6062
May 2016       924.6136 148.04651 2325.9060     0 3891.5064
Jun 2016       961.8922 138.67935 2531.7578     0 4395.6033

plot(ts_forecast)

[Image: plot of ts_forecast]

For this we get a very large range of values and very wide prediction intervals. If we limit the y axis we can get a better picture of the actual predictions.

plot(ts_forecast, ylim=range(0:1000))
[Image: plot of ts_forecast with the y axis limited to 0-1000]

3. Using HoltWinters

For HoltWinters we can use the in-built R function. All we need to do is to pass in the Time Series data set. First we can plot the HoltWinters fit for the existing data set.

?HoltWinters
hw <- HoltWinters(ts_data)
plot(hw)
[Image: plot of the HoltWinters fit]

Now we want to predict for the next 12 months

forecast <- predict(hw, n.ahead = 12, prediction.interval = TRUE, level = 0.95)
forecast

              fit       upr      lwr
Jul 2015 519.9304  599.8097 440.0512
Aug 2015 560.1083  648.4183 471.7983
Sep 2015 601.4528  701.0163 501.8892
Oct 2015 643.9639  757.3750 530.5529
Nov 2015 681.5168  811.0727 551.9608
Dec 2015 724.7363  872.4508 577.0218
Jan 2016 773.8308  941.4768 606.1848
Feb 2016 809.8836  999.0401 620.7272
Mar 2016 847.1448 1059.2371 635.0525
Apr 2016 898.4476 1134.7795 662.1158
May 2016 933.8755 1195.6532 672.0977
Jun 2016 972.3866 1260.7376 684.0356

plot(hw, forecast)
[Image: plot of the HoltWinters fit with the 12 month forecast]

4. Using ARIMA

For ARIMA we first fit an ARIMA model to the Time Series data using auto.arima and then perform the forecast.

fc_arima <- auto.arima(ts_data)
fc_fc_arima  <- forecast(fc_arima, h=12)
fc_fc_arima

         Point Forecast    Lo 80     Hi 80    Lo 95     Hi 95
Jul 2015       524.4758 476.2203  572.7314 450.6753  598.2764
Aug 2015       567.1156 513.2301  621.0012 484.7048  649.5265
Sep 2015       609.7554 548.3239  671.1869 515.8041  703.7068
Oct 2015       652.3952 581.6843  723.1062 544.2522  760.5383
Nov 2015       695.0350 613.5293  776.5408 570.3828  819.6873
Dec 2015       737.6748 644.0577  831.2920 594.4998  880.8499
Jan 2016       780.3147 673.4319  887.1974 616.8516  943.7777
Feb 2016       822.9545 701.7797  944.1292 637.6337 1008.2752
Mar 2016       865.5943 729.2004 1001.9881 656.9978 1074.1907
Apr 2016       908.2341 755.7718 1060.6963 675.0631 1141.4050
May 2016       950.8739 781.5559 1120.1918 691.9244 1209.8233
Jun 2016       993.5137 806.6027 1180.4246 707.6581 1279.3693

plot(fc_fc_arima, ylim=range(0:800))
[Image: plot of the ARIMA forecast]

As you can see there are very different results from each of these forecasting techniques. If this was a real life project on real data then we would go about exploring a lot more of the Forecasting functions available in R. The reason for this is to identify which R function and Forecasting algorithm works best for our data.

Which Forecasting technique would you choose from the selection above?

But will this function and algorithm always work with our data? The answer is NO. As our data evolves so may the algorithm that works best for our data. This is why the data science/analytics world is iterative. We need to recheck/revalidate the functions/algorithms to see if we need to start using something else or not. When we do need to use another function/algorithm you need to ask yourself why this has happened, what has changed in the data, what has changed in the business, etc.
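One way to make that recheck concrete is to hold back the most recent months, forecast them with each candidate, and compare the errors. The sketch below uses only base R so it runs without extra packages. The counts are copied from the ts_data output earlier in the post. The choice of a 6 month hold-out, the naive benchmark and the fixed ARIMA(0,1,1) order are my assumptions for illustration (it is not the model that auto.arima selected above).

```r
# Monthly counts copied from the ts_data output above (Aug 2009 to Jun 2015)
counts <- c(2, 3, 4, 3, 4,
            5, 5, 1, 11, 5, 2, 5, 4, 4, 5, 1, 3,
            11, 4, 3, 6, 6, 5, 9, 15, 5, 8, 23, 18,
            33, 17, 51, 28, 37, 33, 50, 71, 41, 231, 51, 60,
            75, 67, 76, 81, 76, 74, 77, 89, 111, 96, 111, 200,
            155, 129, 175, 140, 145, 133, 155, 207, 232, 162, 229, 310,
            308, 343, 332, 378, 418, 558)
ts_data <- ts(counts, frequency = 12, start = c(2009, 8))

# Hold back the last 6 months as a test set
train <- window(ts_data, end = c(2014, 12))
test  <- window(ts_data, start = c(2015, 1))
h <- length(test)

# Each candidate returns h point forecasts
candidates <- list(
  naive       = function(x, h) rep(as.numeric(tail(x, 1)), h),
  holtwinters = function(x, h) as.numeric(predict(HoltWinters(x), n.ahead = h)),
  arima_011   = function(x, h) as.numeric(predict(arima(x, order = c(0, 1, 1)), n.ahead = h)$pred)
)

# Mean absolute error on the held-back months (NA if a fit fails)
mae <- sapply(candidates, function(f) {
  fc <- tryCatch(f(train, h), error = function(e) rep(NA_real_, h))
  mean(abs(as.numeric(test) - fc))
})
sort(mae)
```

Rerunning a comparison like this each time new months arrive shows whether the technique that was best last time is still the best one.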

Wednesday, July 22, 2015

Charting Number of R Packages over time (Part 2)

This is the second blog post on charting the number of new R Packages over time.

Check out the first blog post that looked at getting the data, performing some simple graphing and then researching some issues that were identified using the graph.

In this blog post I will look at how you can aggregate the data, plot it, get a regression line, then plot it using ggplot2 and we will include a trend line using the geom_smooth.

1. Prepare the data

In my previous post we extracted and aggregated the data on a daily basis. This is the plot that was shown in my previous post. This gives us a very low level graph and perhaps we might get something a little bit more useable if we aggregated the data. I have the data in an Oracle Database so it would be easy for me to write another query to perform the necessary aggregation. But let's make things a little bit trickier. I'm going to use R to do the aggregation.

Our data set is in the data frame called data. What I want to do is to aggregate it up to monthly level. The first thing I did was to create a new column that contains the values of the new aggregate level.

# rdate2 is assumed to be the vector of package dates created in the first post
data$R_MONTH <- format(rdate2, "%Y%m01")
data$R_MONTH <- as.Date(data$R_MONTH, "%Y%m%d")
data.sum <- aggregate(x = data[c("R_NUM")],
                    FUN = sum,
                    by = list(Group.date = data$R_MONTH)
)

2. Plot the Data

We now have the data aggregated at monthly level. We can now plot the graph. Ignore the last data point on the chart. This is for July 2015 and I extracted the data on the 9th of July, so we do not have a full month of data here.

plot(as.Date(data.sum$Group.date), data.sum$R_NUM, type="b", xaxt="n", cex=0.75 , ylab="Num New Packages", main="Number of New Packages by Month")
# Label the x-axis with the year and month of each data point
axis(1, as.Date(data.sum$Group.date), format(data.sum$Group.date, "%Y-%m"), cex.axis=0.5, las=1)

This gives us the following graph.

NewImage

3. Plot the data using ggplot2

The basic plot function of R is great and allows us to quickly and easily get some good graphs produced. But it is a bit limited and perhaps we want to create something that is a bit more elaborate. ggplot2 is a very popular package that can allow us to create a graph, building it up in a number of steps and layers to give something that is a lot more professional.

In the following example I've kept things simple and yes, I could have done so much more. I'll leave that as an exercise for you to go off and do.

The first step is to use the qplot function to produce a basic plot using ggplot2. This gives us something similar to what we got from the plot function.

library(ggplot2)
qplot(x=factor(Group.date), y=R_NUM, data=data.sum,
       xlab="Year/Month", ylab='Num of New Packages', asp=0.5)

This gives us the following graph.

NewImage

Now if we use the ggplot function instead of qplot then we need to specify a lot more information. Here is the equivalent plot using ggplot (with a line plot).

NewImage
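The code for this plot is not shown in the post, so here is a minimal sketch of what it might look like. It assumes data.sum has the Group.date and R_NUM columns from step 1. The small data frame below is made-up stand-in data so that the example runs on its own.

```r
library(ggplot2)

# Stand-in for the monthly data.sum data frame built in step 1
# (made-up counts, for illustration only)
data.sum <- data.frame(
  Group.date = seq(as.Date("2015-01-01"), by = "month", length.out = 6),
  R_NUM      = c(120, 135, 150, 142, 160, 171)
)

# With a factor on the x-axis, group = 1 tells geom_line to join all the points
plt <- ggplot(data.sum, aes(x = factor(Group.date), y = R_NUM, group = 1)) +
  geom_line() +
  xlab("Year/Month") +
  ylab("Num of New Packages")
plt
```

Each layer (the line, the axis labels) is added with +, which is what makes it easy to build the chart up in steps.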

4. Include a trend line

We can very easily include a trend line in a ggplot2 graph using the geom_smooth command. In the following example we have the same chart and include a linear regression line.

plt <- ggplot(data.sum, aes(x=factor(Group.date), y=R_NUM)) + geom_line(aes(group=1)) +
  theme(text = element_text(size=7),
        axis.text.x = element_text(angle=90, vjust=1)) + xlab("Year / Month") + ylab("Num of New Packages") +
  geom_smooth(method='lm', se=TRUE, size = 0.75, fullrange=TRUE, aes(group=1))
plt

NewImage

We can tell a lot from this regression plot.
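If you want the numbers behind a regression line like this, one option is to fit the model yourself with lm. This is only a sketch under some assumptions: the data frame is made-up stand-in data, and month_idx is a helper column I have added so that the slope reads as "new packages per month".

```r
# Stand-in monthly data (made-up counts, for illustration only)
data.sum <- data.frame(
  Group.date = seq(as.Date("2015-01-01"), by = "month", length.out = 6),
  R_NUM      = c(100, 112, 119, 131, 138, 152)
)

# Hypothetical helper column: 1, 2, 3, ... for each month in order
data.sum$month_idx <- seq_len(nrow(data.sum))

# Fit a straight line through the monthly counts
fit <- lm(R_NUM ~ month_idx, data = data.sum)

coef(fit)                 # intercept and slope of the regression line
summary(fit)$r.squared    # how much of the variation the line explains
```

A positive slope tells you the number of new packages is growing month on month.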

But perhaps we would like to see a trend line that follows the data more closely, something like a moving average plot. The following uses the loess method in geom_smooth to do this. Plus I've added in a bit of scaling to help with representing the data at a monthly level.

library(scales)
plt <- ggplot(data.sum, aes(x=as.POSIXct(Group.date), y=R_NUM)) + geom_line() + geom_point() +
  theme(text = element_text(size=12),
        axis.text.x = element_text(angle=90, vjust=1)) + xlab("Year / Month") + ylab("Num of New Packages") +
  geom_smooth(method='loess', se=TRUE, size = 0.75, fullrange=TRUE) +
  scale_x_datetime(breaks = date_breaks("months"), labels = date_format("%b"))
plt

NewImage
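The loess smoother is not quite the same thing as a moving average. If you want an actual moving average, here is a small sketch using the filter function from the stats package that comes with R. The counts are made up and the 3-month window is just my choice for the example.

```r
# Stand-in monthly counts (made-up values, for illustration only)
r_num <- c(10, 20, 30, 40, 50, 60)

# Centred 3-month moving average. The first and last values are NA
# because there is no full window around them.
ma3 <- stats::filter(r_num, rep(1/3, 3), sides = 2)
as.numeric(ma3)
```

You could add the result as an extra column to your data frame and plot it as another geom_line layer.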

In the third blog post on this topic I will look at how we can use some of the forecasting and predicting functions available in R. We can use these to help us visualize what the future growth patterns might be for this data. I have some interesting things to show.

Wednesday, July 15, 2015

Charting Number of R Packages over time (Part 1)

This is the first of a three part blog post on charting and analysing the number of R package submissions.

(I will update this blog post with links to the other two posts as they become available)

I'm sure most of you have heard of the R programming language. If not then perhaps it is something that you might want to go off and learn a bit about. Why? Well, it is one of the most popular languages for performing various types of statistics, advanced topics on statistics and machine learning and for generating lots of cool looking graphs.

If this is not something that you might be interested in then it is time to go to another website/blog.

In this blog post I'm going to chart the number of packages that have been submitted to R and are available for download and installation.

Why am I doing this? I got bored one day after coming back from my vacation and I thought it would be a useful thing to do. Then after doing this I decided to use these graphs somewhere else, but you will have to wait until 2016 to find out!

The R website has a listing of all the packages and the dates that they were submitted.

NewImage

There are a variety of tools available that you can use to extract the information on this webpage and there are lots of examples of R code too. I'll leave that as a little exercise for you to do.

I extracted all of this information and stored it in a table in my Oracle Database (of course I did as I work with Oracle databases day in day out). This will allow me to easily reuse this data whenever I need it plus I can update this table with new packages from time to time.

NewImage

The following R code:

  1. Sets up an ROracle connection to my schema in my database
  2. Connects to the database
  3. Sets up a query to extract the data from the table
  4. Fetches this data into an R data frame called data
  5. Reformats the date column to remove the time element
  6. Plots the data

library(ROracle)

drv <- dbDriver("Oracle")
# Create the connection string
host <- "localhost"
port <- 1521
service <- "pdb12c"
connect.string <- paste(
  "(DESCRIPTION=",
  "(ADDRESS=(PROTOCOL=tcp)(HOST=", host, ")(PORT=", port, "))",
  "(CONNECT_DATA=(SERVICE_NAME=", service, ")))", sep = "")

# Connect to the database
con <- dbConnect(drv, username = "brendan", password = "brendan", dbname = connect.string)

# Count the number of packages submitted on each date
res <- dbSendQuery(con, "select r_date, count(*) r_num from r_packages
                       group by r_date order by 1 asc")
# Fetch the results into a data frame
data <- fetch(res)

# Remove the time element from the date column
rdate <- data$R_DATE
rdate2 <- as.Date(rdate, "%d/%m/%y")

# Plot the number of new packages per day
plot(data$R_NUM~rdate2, data, type="l" , xaxt="n")
axis(1, rdate2, format(rdate2, "%b %y"), cex.axis=.7, las=1)

After I run the above code I get the following plot.

NewImage

(Yes I could have done a better job on laying out the chart with all sorts of labels and colors etc)

This chart gives us a plot of the number of new submissions by day.

There are 2 very obvious things that stand out from this graph. The easiest one to deal with is that we can see that there has been substantial growth in new submissions over the past 3 years. If you examine these a bit closer you will find that a lot of them are existing packages that have been resubmitted with updates.

There is also a very obvious peak just over half way along the chart. We really need to investigate this to understand what has happened. This peak occurs on the 29th October 2012. What happened on the 29th October 2012? This is clearly an anomaly compared with the rest of the data. Well, R version 2.15.2 was released around this date and a lot of updated packages were resubmitted.
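If you don't want to read the date of a peak like this off the chart, you can ask R for it. Here is a quick sketch using which.max. It assumes a data frame with the R_DATE and R_NUM columns returned by the query above. The values below are made up so that the example runs without a database.

```r
# Stand-in for the daily data from the query (made-up counts)
data <- data.frame(
  R_DATE = as.Date(c("2012-10-27", "2012-10-28", "2012-10-29", "2012-10-30")),
  R_NUM  = c(12, 9, 480, 15)
)

# which.max returns the row number of the largest daily count
peak <- data[which.max(data$R_NUM), ]
peak$R_DATE
```

Once you have the date you can go looking for what happened in the R world around that time.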

Check out my next two blog posts where I will explore this data in a bit more detail.

Part 2 blog post

Part 3 blog post

Wednesday, June 3, 2015

PMML in Oracle Data Mining

PMML (Predictive Model Markup Language) is an XML formatted output that defines the core elements and settings for your Predictive Models. This XML formatted output can be used to migrate your models from one data mining or predictive modelling tool to another data mining or predictive modelling tool, such as Oracle.

Using PMML to migrate your models from one tool to another allows you to use the most appropriate tools for developing your models and then allows them to be imported into another tool that will be used for deploying your predictive models in batch or real-time mode. In particular the ability to use your Predictive Model within your everyday applications enables you to work in the area of Automatic or Prescriptive Analytics. Oracle Data Mining and the Oracle Database are ideal or even the best possible tools to allow for Automatic and Prescriptive Analytics for your transa

PMML is an XML based standard specified by the Data Mining Group.

Oracle Data Mining supports the importing of PMML models that are compliant with version 3.1 of the standard and for Regression Models only. The regression models can be for linear regression or binary logistic regression.

The Data Mining Group Archive webpage has a number of sample PMML files for you to download and then to load into your Oracle database.

To load the PMML file into your Oracle Database you can use the DBMS_DATA_MINING.IMPORT_MODEL procedure. I’ve given examples of how you can use this procedure to import an Oracle Data Mining model that was exported using the EXPORT_MODEL procedure.

The syntax of the IMPORT_MODEL procedure when importing a PMML file is the following

DBMS_DATA_MINING.IMPORT_MODEL (
      model_name        IN  VARCHAR2,
      pmmldoc           IN  XMLTYPE,
      strict_check      IN  BOOLEAN DEFAULT FALSE);

The following example shows how you can load the version 3.1 Logistic Regression PMML file from the Data Mining Group archive webpage

NewImage

 

BEGIN
   dbms_data_mining.IMPORT_MODEL ('PMML_MODEL',
        XMLType (bfilename ('IMPORT_DIR', 'sas_3.1_iris_logistic_reg.xml'),
          nls_charset_id ('AL32UTF8')
        ));
END;

 

This example uses the default value of FALSE for STRICT_CHECK. In this case if there are any errors in the PMML structure then these will be ignored and the imported model may contain “features” that may make it perform in a slightly odd manner.

Thursday, April 30, 2015

Viewing Models Details for Decision Trees using SQL

When you are working with and developing Decision Trees by far the easiest way to visualise these is by using the Oracle Data Miner (ODMr) tool that is part of SQL Developer.
Developing your Decision Tree models using the ODMr allows you to explore the decision tree produced, to drill in on each of the nodes of the tree and to see all the statistics etc that relate to each node and branch of the tree.
But when you are working with the DBMS_DATA_MINING PL/SQL package and with the SQL commands for Oracle Data Mining you don't have the same luxury of the graphical tool that we have in ODMr. For example here is an image of part of a Decision Tree that I developed using ODMr.
Blog dt 1
What if we are not using the ODMr tool? In that case you will be using SQL and PL/SQL. When using these you do not have the luxury of viewing the Decision Tree.
So what can you see of the Decision Tree? Most of the model details can be used by a variety of functions that can apply the model to your data. I've covered many of these over the years on this blog.
For most of the data mining algorithms there is a PL/SQL function available in the DBMS_DATA_MINING package that allows you to see inside the models to find out the settings, rules, etc. Most of these functions have a name something like GET_MODEL_DETAILS_XXXX, where XXXX is the name of the algorithm. For example GET_MODEL_DETAILS_NB will get the details of a Naive Bayes model. But when you look through the list there doesn't seem to be one for Decision Trees.
Actually there is and it is called GET_MODEL_DETAILS_XML. This function takes one parameter, the name of the Decision Tree model and produces an XML formatted output that contains the attributes used by the model, the overall model settings, then for each node and branch the attributes and the values used and the other statistical measures required for each node/branch.
The following SQL uses this PL/SQL function to get the Decision Tree details for the model called CLAS_DT_1_59.
SELECT dbms_data_mining.get_model_details_xml('CLAS_DT_1_59')
FROM dual;

If you are using SQL Developer you will need to double click on the output column and click on the pencil icon to view the full listing.
Blog dt 2
Nothing too fancy like what we get in ODMr, but it is something that we can work with.
If you examine the XML output you will see references to PMML. This refers to the Predictive Model Markup Language (PMML) and this is defined by the Data Mining Group (www.dmg.org). I will discuss the PMML in another blog post and how you can use it with Oracle Data Mining.

Friday, April 24, 2015

Changing REVERSE Transformations in Oracle Data Miner

In my previous blog post I showed you how you can have a look at the transformations that the Automatic Data Preparation (ADP) feature of Oracle Data Mining produces. I also gave some examples of the different types of ADP that are performed for different algorithms.

One of the features of the transformations produced is that it will generate a REVERSE_EXPRESSION. This will take the scored results and apply the inverse of the transformation that was performed when the data was being prepared for input to the algorithm.

Sometimes you may want to have the scored data returned in a slightly different way or labeled in a slightly different way.

In this blog post I will show you how to define an alternative REVERSE_EXPRESSION for an attribute.

The procedure we need to use for this is ALTER_REVERSE_EXPRESSION, which is part of the DBMS_DATA_MINING package.

When we score data for a typical classification problem we typically use 0 (zero) and 1 as the target variable values. But what if we wanted the output from our classification model to label the scored data slightly differently?

In this case we can use the ALTER_REVERSE_EXPRESSION procedure to define the new values. What if we wanted the zero to be labeled as NO and the 1 as YES? In this case we can use the following.

BEGIN
    dbms_data_mining.alter_reverse_expression(
       model_name => 'CLAS_NB_1_59',
       expression => 'decode(affinity_card, ''1'', ''YES'', ''NO'')',
       attribute_name => 'AFFINITY_CARD');
END;

When we view the transformations for our data mining model we can now see the transformation.

Blog dat trans 3

Now when we score our data the predicted target variable will now have our newly defined values.

SELECT cust_id,
        PREDICTION(CLAS_NB_1_59 USING *) PRED
FROM mining_data_apply_v
FETCH FIRST 5 ROWS ONLY;

Blog dat trans 4

You can see that this is a very powerful feature and allows us to present the scored data values in a different way to make them more useful. This is particularly the case as we work towards a more Automatic type of Predictive Analytics.

Saturday, April 18, 2015

ODM : View Transformations generated by Automatic Data Preparation

A very powerful feature of Oracle Data Mining and one that I think does not get enough notice is called Automatic Data Preparation.

Data Preparation is one of the most time consuming, repetitive and boring parts of the work that a Data Miner or Data Scientist performs as part of their daily tasks. Apart from gathering the data, integrating the data and getting the data into the required format, the most interesting part of the work is the feature engineering.

Then you have all the other boring data preparation tasks of how to handle missing data, type conversion, binning, normalization, outlier treatment etc.

With Automatic Data Preparation (ADP) in Oracle Data Mining you can let Oracle work all of these things out for you, perform all the necessary coding and store all of this coding as part of the in-database data mining model.

This is fantastic. This ADP feature can save you hours and in some cases days of effort.

But (there is always a but :-) ) what if you are a bit unsure whether the transformations that are being performed are exactly what you wanted? Maybe you would like to see what Oracle is doing and, depending on this, you can do it a different way.

The first step is to examine the transformations that are generated and stored as part of the in-database data mining model. The DBMS_DATA_MINING package has a function called GET_MODEL_TRANSFORMATIONS. When you query this function, passing in the name of the data mining model, you will get back the list of transformations that have been applied for that model.

In the following example a GLM model was created using the Oracle Data Miner tool (that is part of SQL Developer). When you use Oracle Data Miner, ADP is automatically turned on.

The following query calls the GET_MODEL_TRANSFORMATIONS function with the data mining model called CLAS_GLM_1_59.

SELECT * FROM TABLE(DBMS_DATA_MINING.GET_MODEL_TRANSFORMATIONS('CLAS_GLM_1_59'));

The following image contains the output generated by this query.

Blog dat trans 1

When you look at the data under the EXPRESSION column we get to see what the ADP did to the data. In most of the cases there is just some simple data clean-up being performed and formatting for getting the data ready for input into the algorithm.

If we now look at the Naive Bayes model for the same data set we get a very different set of transformations being listed under the EXPRESSION column.

SELECT * FROM TABLE(DBMS_DATA_MINING.GET_MODEL_TRANSFORMATIONS('CLAS_NB_1_59'));

Blog dat trans 2

Now we get to see some of the data binning that ADP performs and that is required for input to the Naive Bayes algorithm. You will also notice that we have some transformations in the REVERSE_EXPRESSION column. These are the inverse or reverse of the transformation that was generated in the EXPRESSION column.

I will let you explore the data transformations that are produced by ADP for the SVM and Decision Tree algorithms.

I will show you how you can change the reverse expression in my next blog post, as there are times when you might want the data to be presented slightly differently after the model has been run to score your data.

To get more details of what Automatic Data Preparation is performed for each data mining algorithm you can check out this link in the 11g documentation. This section seems to be missing from the online 12c documentation.

Thursday, March 12, 2015

Automatic Analytics is So main stream. Not something new.

Everyone is doing advanced analytics. Right? Hmm

Everyone is talking about advanced analytics? Yes that is true.

Everyone is an expert in advanced analytics? This is so not true. Watch out for these Great Pretenders. You know what I mean! You know who I mean! Maybe you know some of them already? If not, watch out for these Great Pretenders!!!

Some people are going around talking about data mining, predictive analytics, advanced analytics, machine learning etc as if this is some new topic. Well it isn't. It isn't anything new and most of the techniques have been about for 10, 20, 30+ years.

Some people are saying you should only use language X or tool Y because everything else is basically rubbish.

What we do have is a wider understanding of how to use these techniques on our various data sources.

What we have is a lot more tools that allow us to perform these tasks a lot easier, at greater speed, with more functionality and without the need to fully understand the hard core maths that is going on behind the scenes.

What we have is a lot more languages to perform these tasks and to support the vast amount of work that goes into understanding the data and preparing the data.

Something for all of us to watch out for, when we read about these topics, is what kind of problem area they are addressing. The following table illustrates the three main types or categories of Analytics. These categories are Descriptive Analytics, Predictive Analytics and Prescriptive Analytics. I think most people would agree that the Descriptive and Predictive Analytics categories are very mature at this stage. With Prescriptive Analytics we are perhaps still evolving and a lot more work needs to be done before it becomes widespread.

Blog 1

Some people talk as if Predictive Analytics is some new and exciting topic. But it isn't all that new. It has been around for the past 30+ years. If you go back over the Gartner Hype Cycle that comes out every September, Predictive Analytics is no longer being shown on this graph. The last time it appeared on the Gartner Hype Cycle was back in 2013 and it was positioned on the far right of the graph in the section called Plateau of Productivity.

So Predictive Analytics is very mature and main stream. Part of the reason that it is main stream is that Predictive Analytics has allowed for a new category of Analytics to evolve and this is Automatic Analytics.

Automatic Analytics is where Advanced and Predictive Analytics have been built into the day to day applications that are used to run our business. We do not need the hard core type of data scientists to perform various analytics on our data. Instead these tasks, once they have been defined, can then be added to our applications to process, evaluate and make decisions all automatically. This is where we need the data scientists to be able to communicate with the business and be able to work with them to solve real world business projects. This is a different type of data scientist to the "hard" core data scientist who delves into the various statistical methods, machine learning methods, data management methods, etc.

The following table extends the table given above to include Automatic Analytics, and is my own take on how and where Automatic Analytics fits.

Blog 2

Getting an insurance quote or a health insurance quote, getting a "random" call from our Telco offering a free upgrade, getting our loyalty card statements, getting a loan from the bank, looking at or buying a book on Amazon: the list could go on and on, but these are all examples of how predictive analytics has been automated into our everyday business applications.

But this is nothing new. When I first got into data mining/predictive analytics over 16 years ago, it was considered a common thing that certain types of companies did. What has happened in the time since and particularly in the past few years is that a lot more people are seeing the value in using it.

Before I finish off this post we can have a quick look at what Oracle has been doing in this area. They have their Advanced Analytics Option and Real-Time Decisions tools to allow data scientists to do their magic. But over the past X years (nobody can give me an exact number) they have been very, very active in building lots and lots of predictive analytics into their various business applications, particularly into Fusion Apps and BI Apps.

Blog 3

A recent quote from Oracle highlights their aim with this,

" ... products designed to close the gap between data scientists and businesses."

Now with Oracle making a big push to the cloud, they are busy adding more and more Automatic (Predictive) Analytics into their Cloud Applications. What we need from Oracle is a clearer identification of where they have done this. Plus with the migration of their Apps to the cloud, their Advanced Analytics Option is a core part of their Cloud platform. As they upgrade or add new features into their Cloud Apps, you will now be able to get the benefit of these Automatic (Predictive) Analytics as they become available.

Blog 5

Wednesday, February 25, 2015

US President talks about Data Science

Check out the video of the US President talking about Data Science, in which the first Chief Data Scientist of the USA also talks about his mission.

Tuesday, January 20, 2015

Evaluating Classification Results

When you are working on building classification models you will need some ways of measuring the effectiveness of each model that you build. This measurement/evaluation is performed during the model build process.

Typically the model build process consists of 2 steps (I'm assuming all data preparation etc has been completed):

  • Build the model: During this step you will feed a portion of your data set to the data mining algorithm. This will typically consist of 60% to 70% of the data. This data is used by the data mining algorithm to build the model.
  • Test the model: After the model has been built you will need to test the model to see how effective it is at making predictions. This is where we use the data that was not used to build the model. For this data we already know the outcome. So after we have applied the model to this data subset we can measure the predicted values against the actual values.

Most of the data mining tools will automate these two steps, specifically the splitting of the data into the build and test data sets. But if you are using a language like R, etc then you will need to manually perform these steps.
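To give an idea of what doing this manually looks like in R, here is a minimal sketch of a random 70/30 split. The iris data set (which ships with R) stands in for your own data, and the names build_set and test_set are just ones I have picked for the example.

```r
set.seed(42)  # make the random split repeatable

# iris ships with R and stands in for your own data set
n         <- nrow(iris)
build_idx <- sample(n, size = round(0.7 * n))   # pick 70% of the row numbers

build_set <- iris[build_idx, ]    # used to build the model
test_set  <- iris[-build_idx, ]   # held back to test the model

c(build = nrow(build_set), test = nrow(test_set))
```

No row ends up in both sets, which is the whole point: the model is tested on data it has never seen.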

The most common way of collating the test results is to use the Confusion Matrix. This allows us to lay out the correct predictions and the incorrect predictions, and to perform a number of other statistical measurements. It is built from the following four values.

True Positives

True Negatives

False Positives

False Negatives

The last two of the above values are also commonly referred to in statistics as Type 1 (false positive) and Type 2 (false negative) errors.

Depending on your project you will concentrate on a combination of the true and false values of either the Positives or the negatives.

For example, in Medical Diagnostics for cancer, you will be looking to keep the False Negatives to a minimum. This is where you have predicted that someone does not have cancer, but they actually do. The consequence of this is that the person is not brought back for additional testing and we all know what will happen. On the other hand it is OK to have a high number of False Positives in this case. In this scenario you bring the person back for additional tests and discover that they are all clear :-)

Precision = How many of the selected items are relevant? (as a percentage)

Recall = How many of the relevant items are selected? (as a percentage)

Accuracy = How many did we correctly predict? (as a percentage)

The following table illustrates these measurements and tests.

Confusion Matrix
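Here is a small worked sketch of these measurements in R. The actual and predicted vectors are made up for the example (1 is the positive class, 0 is the negative class). In a real project they would come from applying your model to the test data set.

```r
# Made-up actual and predicted class labels (1 = positive, 0 = negative)
actual    <- c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0)
predicted <- c(1, 1, 1, 0, 1, 0, 0, 0, 0, 0)

tp <- sum(predicted == 1 & actual == 1)   # true positives
tn <- sum(predicted == 0 & actual == 0)   # true negatives
fp <- sum(predicted == 1 & actual == 0)   # false positives (Type 1 errors)
fn <- sum(predicted == 0 & actual == 1)   # false negatives (Type 2 errors)

precision <- tp / (tp + fp)                    # of those predicted positive, how many were right
recall    <- tp / (tp + fn)                    # of the actual positives, how many did we find
accuracy  <- (tp + tn) / (tp + tn + fp + fn)   # how many did we get right overall

# The confusion matrix itself, with the actual values in the rows
table(Actual = actual, Predicted = predicted)
```

With these values there are 3 true positives, 5 true negatives, 1 false positive and 1 false negative, so precision and recall are both 75% and accuracy is 80%.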

There are lots of other statistical tests that can be performed on your results. Everyone will have their own preferences. What I have highlighted here are the main statistical tests for you to look at.

You cannot use just one or a few of the statistical tests to decide which data mining model works best for your data. It is a combination of these statistical tests, your understanding of the data and your understanding of the business project that needs to be considered.

In my next 2 blog posts I will show you how you can perform these tests on the results generated by the Oracle Data Miner tool and then on the Oracle Data Miner models produced using PL/SQL.

Friday, August 8, 2014

my Oracle Data Miner Book

Some of you may be aware that I have been writing a book on Oracle Data Miner. Actually the book covers the Oracle Data Miner GUI that is part of SQL Developer, the SQL and PL/SQL functions, procedures and packages that form the Oracle Data Mining option in the database and lots of other topics for the DBA, Developer and BI/DW people.
Today is a big day for this book as it is officially released and available for purchase. See below for some links to where you can buy the book in print and e-book formats. It has been published by McGraw-Hill/Oracle Press.
The book is aimed at a variety of people and the aim of the book is to introduce them to using the Oracle Data Miner tool and how to perform various data mining and predictive analytics tasks using SQL and PL/SQL.
The book will not teach you how each of the data mining algorithms works. There is a bit of an assumption that you know a bit about these already. There are lots of books and resources around that cover that material. You can look on my book as a getting started / how-to type of book.
Below are the images of the front cover and the back cover.
Book Cover            Book Back Cover
For more details of the book and for some updates keep an eye on my ODM Book page. On this page I'm adding a FAQ section. This will be based on questions that I receive about the book.
If you buy the book then I hope you will find it helpful. If you are going to attend one of my presentations at an Oracle User Group meeting then bring the book along and I can sign it for you. Alternatively if you are at Oracle Open World 2014, come along to the Oracle Press Book Store, as I will be there to sign books on Wednesday 1st October between 13:00 and 13:30.
Where can you Buy my Oracle Data Miner book (print and e-book).
You can buy the book from the McGraw-Hill/Oracle Press website and from Amazon. Each site will offer discounts so check out which one is the best for you.
McGraw-Hill/Oracle Press
For USA locations (enter promo code Tierney to save 20% and free delivery) www.mhprofessional.com
For UK & Ireland locations (enter promo code Tierney to save 20% and free delivery) www.mcgraw-hill.co.uk/tpr
Amazon
Click here to buy it on www.amazon.com
Click here to buy it on www.amazon.co.uk

Monday, April 14, 2014

Oracle Advanced Analytics and Oracle Fusion Apps

At a recent Oracle User Group conference, I was part of a round table discussion on Apps and BI. Unfortunately most of the questions were focused on Apps and the new Fusion Applications from Oracle. When I mentioned that there was data mining functionality (using the Oracle Advanced Analytics Option) built into the Fusion Apps, it seemed to come as a surprise to the Apps people. They were not aware of this built-in functionality and these capabilities. Well, Oracle Data Mining and Oracle Advanced Analytics have been built into the following Oracle Fusion Applications.
  • Oracle Fusion HCM Workforce Predictions
  • Oracle Fusion CRM Sales Prediction Engine
  • Oracle Spend Classification
  • Oracle Sales Prospector
  • Oracle Adaptive Access Manager
Oracle Data Mining and Oracle Advanced Analytics are also being used in the following applications:
  • Oracle Airline Data Model
  • Oracle Communications Data Model
  • Oracle Retail Data Model
  • Oracle Security Governor for Healthcare
I intend to submit a presentation on this topic to future Oracle User Group conferences as a way of spreading the Advanced Analytics message within the Oracle user community. If you would like me to present on this topic at your conference or SIG drop me an email and we can make the necessary arrangements :-)

Sunday, April 6, 2014

The ORE Packages

If you are interested in using ORE, or just want to get an idea of what ORE gives you that does not already exist in one of the other R packages, then the table below lists the packages that come as part of ORE.

Before you can use them you will need to load these into your workspace. To do this you can issue the following command from the R prompt or from the prompt in RStudio.

> library(ORE)

RStudio is my preferred R interface and is widely used around the world.

ORE Installed Packages   Description
ORE                      Oracle R Enterprise
OREbase                  ORE - base
OREdm                    The ORE functions that use the in-database Oracle Data Miner algorithms
OREeda                   The ORE functions used for exploratory data analysis
OREgraphics              The ORE functions used for graphics
OREpredict               The ORE functions used for model predictions
OREstats                 The ORE stats functions
ORExml                   The ORE functions that convert R objects to XML
DBI                      R Database Interface
ROracle                  OCI based Oracle database interface for R
XML                      Tools for parsing and generating XML within R and S-Plus.
bitops                   Functions for Bitwise operations
png                      Read and write PNG images

In addition to these core ORE packages, ORE also relies on some standard R packages. The following table lists the R packages that are used by the ORE packages listed above. So make sure you have these packages installed. They should have come with your installation of R, but if something has happened then you can download them again.

R Packages used by ORE   Description
base                     The R Base Package
boot                     Bootstrap Functions (originally by Angelo Canty for S)
class                    Functions for Classification
cluster                  Cluster Analysis Extended Rousseeuw et al
codetools                Code Analysis Tools for R
compiler                 The R Compiler Package
datasets                 The R Datasets Package
foreign                  Read Data Stored by Minitab, S, SAS, SPSS, Stata, Systat, dBase, ..
graphics                 The R Graphics Package
grDevices                The R Graphics Devices and Support for Colours and Fonts
grid                     The Grid Graphics Package
KernSmooth               Functions for kernel smoothing for Wand & Jones (1995)
lattice                  Lattice Graphics
MASS                     Support Functions and Datasets for Venables and Ripley's MASS
Matrix                   Sparse and Dense Matrix Classes and Methods
methods                  Formal Methods and Classes
mgcv                     GAMs with GCV/AIC/REML smoothness estimation and GAMMs by PQL
nlme                     Linear and Nonlinear Mixed Effects Models
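
If you want a quick way to confirm that the R packages in the table above are present, something like the following rough sketch should do it. It only uses base R functions; the variable names (ore_dependencies, missing_pkgs) are just ones I made up for illustration, and the install step is left commented out because it assumes you have access to a CRAN mirror.

```r
# Packages listed in the table above as being used by ORE.
ore_dependencies <- c("base", "boot", "class", "cluster", "codetools",
                      "compiler", "datasets", "foreign", "graphics",
                      "grDevices", "grid", "KernSmooth", "lattice", "MASS",
                      "Matrix", "methods", "mgcv", "nlme")

# Names of every package installed in the current R library paths.
installed <- rownames(installed.packages())

# Anything in the list above that is not installed.
missing_pkgs <- setdiff(ore_dependencies, installed)

if (length(missing_pkgs) == 0) {
  message("All of the R packages used by ORE are installed.")
} else {
  message("Missing packages: ", paste(missing_pkgs, collapse = ", "))
  # Uncomment to download them again (assumes access to a CRAN mirror):
  # install.packages(missing_pkgs)
}
```

An empty missing_pkgs means there is nothing to download.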


I've been using R a lot over the past few years and I've had a number of projects involving R, particularly over the past 12 months. I just found out that I will now have another short duration R project in May and June.

So watch out for lots more blog posts on R and ORE. Plus the usual blog posts on using Oracle Data Mining. ORE and Oracle Data Mining are very closely linked.

Sunday, March 30, 2014

Gartner 2014 Advanced Analytics Quadrant

The Gartner 2014 Advanced Analytics Quadrant is out now. Well it is if you can find it.

Some of the companies have put it up on their websites to promote their position.

For some reason Oracle hasn't and I wonder why?

[Image: Gartner Advanced Analytics MQ, Feb 2014]

You can see that some typical technologies are missing from this, but this is to be expected. How much are companies really deploying these alternatives on real problems and in production? Perhaps the positioning of Revolution Analytics might be an indicator. At some point there might be a shift from investigative analysis into more mainstream projects and then into production.

What is still evident from this year's quadrant is that SAS and IBM (SPSS) still have very dominant positions and perhaps will have for some time to come.

It will be interesting to see how this will all play out over the next few years.

Sunday, March 23, 2014

Using the in-database ODM algorithms in ORE

Oracle R Enterprise is the version of R that Oracle has that runs in the database instead of on your laptop or desktop.

Oracle already has a significant number of data mining algorithms in the database. With ORE they have exposed these so that they can be easily called from your R (ORE) scripts.

To access these in-database data mining algorithms you will need to use the ore.odm functions, which are part of the OREdm package.

ORE is continually being developed with new functionality being added all the time. Over the past 2 years Oracle have released an updated version of ORE about every 6 months. ORE is generally not certified with the latest version of R. Instead it is usually a point release or two behind the current release. For example the current version, ORE 1.4 (released only last week), is certified for R version 3.0.1, but the current release of R is 3.0.3.

Will ORE work with the latest version of R? The simple answer is maybe. In theory it should, but it is not certified.
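
To see where your own R installation sits relative to the certified version, a small check like the one below might help. It is only a sketch using base R: the 3.0.1 value is the certified version for ORE 1.4 mentioned above, and the variable names are my own. It tells you nothing about whether ORE will actually work on a newer R, only whether you are on the certified release.

```r
# Certified R version for ORE 1.4, as stated in this post.
certified_r <- package_version("3.0.1")

# Version of R that is running this script.
running_r <- getRversion()

# Compare the two versions and describe the result.
status <- if (running_r == certified_r) {
  "certified"
} else if (running_r > certified_r) {
  "newer than certified"
} else {
  "older than certified"
}

message("Running R ", as.character(running_r), " (", status, " for ORE 1.4)")
```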

Let's get back to the ore.odm functions. The following table maps the ore.odm functions to the in-database Oracle Data Mining functions.

ORE Function        Oracle Data Mining Algorithm        What Algorithm can be used for
ore.odmAI           Minimum Description Length          Attribute Importance
ore.odmAssocRules   Apriori                             Association Rules
ore.odmDT           Decision Tree                       Classification
ore.odmGLM          Generalized Linear Model            Classification and Regression
ore.odmKMeans       k-Means                             Clustering
ore.odmNB           Naïve Bayes                         Classification
ore.odmNMF          Non-Negative Matrix Factorization   Feature Extraction
ore.odmOC           O-Cluster                           Clustering
ore.odmSVM          Support Vector Machines             Classification and Regression

As you can see we only have a subset of the in-database Oracle Data Miner algorithms. This is a pity really, but I'm sure as we get newer releases of ORE these will be added.

Wednesday, March 12, 2014

ODM: Changing the bar chart format in Explore Node

In Oracle Data Miner you can use the Explore Node to gather an initial set of statistics for your dataset. As part of this you will also get a bar chart that shows the distributions of the values contained within each attribute. The following example shows the default layout of the bar charts.

[Image: Explore1]

These graphs are very useful for presenting the initial data exploration results to your business users. In addition to these graphs you can also use the Graph node to give some additional graphical representations.

But the default bar chart that is produced by the Explore Node can appear to be a bit basic.

So what if we could change the layout to have a 3-D effect? People like 3-D bar charts.

Is this possible in Oracle Data Miner? If so then how can we do it?

Well it is possible and you can use the following steps to change your bar charts to 3-D.

To access the Explore Node settings go to the Tools menu and then select Preferences from the drop down menu.

[Image: Explore2]

When the Preferences window opens, scroll down to the Data Miner option and expand the available options.

[Image: Explore3]

The Explorer Data Viewer allows you to change the Precision settings. The second option is the Graphical Settings. Here you can change the Depth Radius setting. By default this is set to zero. By increasing this value you can change the degree of the 3-D effect of the bar charts. You can also change the colour scheme.

[Image: Explore4]

I'm not a fan of the other colour schemes that are available and my favourite is still the default, Nautical. The following bar chart is the same as the one at the top of this post but has the 3-D effect.

[Image: Explore5]

Tuesday, September 24, 2013

Adding Oracle Data Miner to OBIEE

Oracle Data Miner is a very powerful tool that provides advanced machine learning algorithms that are embedded in the Oracle database. By using Oracle Data Miner you do not have to use another tool, from another vendor, to do your data mining. You can do everything in the database, ensuring that the security of your data is maintained, and use all the performance functionality that comes with the database.

To add to the advanced insights that you can get from using ODM, you can combine ODM with your OBIEE dashboards to gain a deeper level of insight into your data. This combines data mining techniques with visualization techniques.

The purpose of this blog post is to show you the steps involved in adding an ODM model to your OBIEE dashboards. Lots of people have been asking for the details of how to do it, so here it is.

The following example is based on a presentation that I have given a few times (OUG Ireland, UKOUG, OOW) with Antony Heljula.

1. Export & Import the ODM model

If your data mining analysis and development was completed in a different database to where your OBIEE data resides then you will need to move the ODM model from the ODM/development database to the OBIEE database.

ODM provides two PL/SQL procedures to allow you to easily move your ODM model. These procedures are part of the DBMS_DATA_MINING package. To export a model you will need to use the DBMS_DATA_MINING.EXPORT_MODEL procedure. Similarly, to import your (exported) ODM model you will use the DBMS_DATA_MINING.IMPORT_MODEL procedure.

2. Create a view that uses the ODM model

You can create a view that uses the PREDICTION and PREDICTION_PROBABILITY functions to apply the imported ODM model to your data. For example the following view is used to score our customer data to make a prediction of whether they are going to churn and the probability that this prediction is correct.
SELECT st_pk,
       prediction(clas_decision_tree using *) WITHDRAW_PREDICTION,
       prediction_probability(clas_decision_tree using *) WITHDRAW_PROBABILITY
FROM   CUSTOMER_DATA;

[Image]

3. Import the view into the Physical layer of the BI Repository (RPD)

The view was then imported into the Physical layer of the BI Repository (RPD) where it was joined on primary key to the other customer tables (we had one record per customer in the view). With the tables being joined, we can use the prediction columns to filter the customer data. For example, filter for all the customers who are likely to churn: WITHDRAW_PREDICTION = 'N'

[Image]

[Image]

4. Add the new columns to the Business Model layer

The new prediction columns were then mapped into the Business Model layer where they could be incorporated into various relevant calculations, e.g. % Withdrawals Predicted, and then subsequently presented to the end users for reporting.

[Image]

5. Add to your Dashboards

The Withdraw prediction columns could then be published on the BI Dashboards where they could be used to filter the data content. In the example below, the user has chosen to show data for only those customers who are predicted to Withdraw with a probability rating of >70%.

[Image]