Monday, October 31, 2016

Data Science Is Multidisciplinary (updated October 2016)

[Update: October 2016. There appears to be some discussion about the Venn diagram I've proposed below. The central part of this diagram is not anything I came up with. It was a commonly used Venn diagram for Data Mining. Thanks to Polly Michell-Guthrie for providing the original reference for the Venn. I just added the outer ring of additional skills needed for the new area of Data Science. This was just my view of things back in 2012. Things have moved on a bit since then.]

A few weeks ago I had a blog post called Domain Knowledge + Data Skills = Data Miner.
In that blog post I was saying that to be a Data Scientist all you needed was Domain Knowledge and some Data Skills, which included Data Mining.
The reality is that the skill set of a Data Scientist will be much larger. There is a saying ‘A jack of all trades and a master of none’. When it comes to being a data scientist you need to be a bit like this but perhaps a better saying would be ‘A jack of all trades and a master of some’.
I’ve put together the following diagram, which includes most of the skills with an outer circle of more fundamental skills. It is this outer ring of skills that is fundamental to becoming a data scientist. The skills in the inner part of the diagram are ones where most people will have some experience of one or more of them. The other skills can be developed and learned over time, all depending on the type of person you are.
image
Can we train someone to become a data scientist or are they born to be a data scientist? It is a little bit of both really, but you need to have some of the fundamental skills and the right type of personality. The learning of the other skills should be easy(ish).
What do you think? Are there skills that I’m missing?

Tuesday, October 25, 2016

Oracle Text, Oracle R Enterprise and Oracle Data Mining - Part 5

In this 5th blog post in my series on using the capabilities of Oracle Text, Oracle R Enterprise and Oracle Data Mining to process documents and text, I will have a look at some of the machine learning features of Oracle Text.

Oracle Text comes with a number of machine learning algorithms. These can be divided into two types. The first is called 'Supervised Learning', where we have two machine learning algorithms for classification-type problems. The second type is called 'Unsupervised Learning', where we have the ability to use clustering machine learning algorithms to look for patterns in our text documents and to find similarities between documents based on their contents.

It is this second type of document clustering that I will work through in this blog post.

When using clustering with text documents, the machine learning algorithm will look for patterns that are common between the documents. These patterns will include the words used, the frequency of the words, the position or ordering of these words, the co-occurrence of words, etc. Yes, this is a large and complex task and that is why we need a machine learning algorithm to help us.

With Oracle Text we only have one clustering machine learning algorithm available to use. When we move on to using the Oracle Advanced Analytics Option (Oracle Data Mining and Oracle R Enterprise) we have more algorithms available to us.

With Oracle Text the clustering algorithm is called k-Means. In a way the actual algorithm is unimportant as it is the only one available to us when using Oracle Text. To use this algorithm we have the CTX_CLS.CLUSTERING procedure. This procedure takes the documents we want to compare, identifies the clusters (using hierarchical clustering) and then tells us, for each document, what clusters the document belongs to and the probability value for each. With clustering a document (or a record) can belong to many clusters. Typically in the text books we see clusters that are very distinct and are clearly separated from each other. When you work on real data this is never the case. We will have many overlapping clusters and a data point/record can belong to one or more clusters. This is why we need the probability value. We can use this to determine what cluster our record belongs to most and what other clusters it is associated with.

Using the example documents that I have been using during this series of blog posts we can use the CTX_CLS.CLUSTERING procedure to cluster and identify similarities in these documents.

We need to set up the parameters that will be used by the CTX_CLS.CLUSTERING procedure. We tell it to use the k-Means algorithm and then the number of clusters to generate. As with all Oracle Text procedures or algorithms there are a number of settings you can configure or you can just accept the default values.

exec ctx_ddl.drop_preference('Cluster_My_Documents');
exec ctx_ddl.create_preference('Cluster_My_Documents','KMEAN_CLUSTERING');
exec ctx_ddl.set_attribute('Cluster_My_Documents','CLUSTER_NUM','3');

The code above is an example of the basics of what you need to set up for clustering. Other attributes or cluster parameter settings available to you include MAX_DOCTERMS, MAX_FEATURES, THEME_ON, TOKEN_ON, STEM_ON, MEMORY_SIZE and SECTION_WEIGHT.
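
As a rough sketch, a couple of those extra attributes could be set on the same preference like this. The values are illustrative assumptions only, not tuned recommendations, so check the Oracle Text reference for the valid ranges and defaults in your release.

```sql
-- Illustrative values only (assumptions, not recommended settings).
-- Limit the number of distinct features (terms) considered across the document set.
exec ctx_ddl.set_attribute('Cluster_My_Documents','MAX_FEATURES','1000');
-- Limit the number of distinct terms used to represent each document.
exec ctx_ddl.set_attribute('Cluster_My_Documents','MAX_DOCTERMS','30');
```

These calls would go after the create_preference call and before running the clustering procedure.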

Now we can run the CTX_CLS.CLUSTERING procedure on our documents. This procedure has the following parameters.

- The Oracle Text Index Name

- Document Id Column Name

- Document Assignment (cluster assignment) Table Name. This table will be created if it doesn't already exist

- Cluster Description Table Name. This table will be created if it doesn't already exist.

- Name of the Oracle Text Preference (list)

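BEGIN
  ctx_cls.clustering(
    'MY_DOCUMENTS_OT_IDX',     -- Oracle Text index name
    'DOC_PK',                  -- document id column name
    'OT_CLUSTER_RESULTS',      -- document (cluster) assignment table
    'DOC_CLUSTER_DETAILS',     -- cluster description table
    'Cluster_My_Documents');   -- clustering preference created above
END;
/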

When the procedure has completed we can now examine the OT_CLUSTER_RESULTS and the DOC_CLUSTER_DETAILS tables. The first of these (OT_CLUSTER_RESULTS) allows us to see what documents have been clustered together. The following is what was produced for my documents.

SELECT d.doc_pk, 
       d.doc_title, 
       r.clusterid, 
       r.score 
FROM my_documents d, 
     ot_cluster_results r 
WHERE d.doc_pk = r.docid;

NewImage

We can see that two of the documents have been grouped into the same cluster (ClusterId=2). If you have a look back at what these documents are about then you can see that yes, these are very similar. For the other two documents we can see that they have been clustered into separate clusters (ClusterId=4 & 5). The clustering algorithm has said that they are different types of documents. Again, when you examine these documents you will see that they are talking about different topics. So the clustering process worked!
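
Because a document can be associated with more than one cluster, it can be handy to pick out the single best cluster for each document. The following is a minimal sketch of one way to do that. It assumes the results table has the DOCID, CLUSTERID and SCORE columns used in the query above, and that a higher score means a stronger association.

```sql
-- For each document, keep only the cluster with the highest score.
SELECT docid, clusterid, score
FROM (SELECT r.docid,
             r.clusterid,
             r.score,
             ROW_NUMBER() OVER (PARTITION BY r.docid
                                ORDER BY r.score DESC) AS rn
      FROM ot_cluster_results r)
WHERE rn = 1
ORDER BY docid;
```

Dropping the rn = 1 filter and ordering by docid and rn would instead list all the clusters for each document, strongest first.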

You can also explore the various features of the clusters by looking at the DOC_CLUSTER_DETAILS table. Although the details in this table are not overly useful, it will give you some insight into what clusters the k-Means algorithm has produced.

Hopefully I've shown you how easy it is to set up and use the clustering feature of Oracle Text.

WARNING: Before using Clustering or Classification with Oracle Text, you need to check with your local Oracle Sales representative about whether there is a licence implication. There seem to be some mentions that the algorithms used are those that come with Oracle Data Mining. Oracle Data Mining is a licence cost option for the database. So make sure you check before you go using these features.
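
A simple way to start exploring is sketched below. It only relies on the table names passed to CTX_CLS.CLUSTERING earlier and the results table columns already used in this post. The exact columns of the cluster description table are not assumed here, so it is queried with SELECT *.

```sql
-- How many documents were assigned to each cluster, and their average score.
SELECT r.clusterid,
       COUNT(*)               AS num_docs,
       ROUND(AVG(r.score), 3) AS avg_score
FROM ot_cluster_results r
GROUP BY r.clusterid
ORDER BY r.clusterid;

-- Everything Oracle Text recorded about the clusters themselves.
SELECT * FROM doc_cluster_details;
```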

Saturday, October 22, 2016

Our first OUG Ireland Meet-up

Last Thursday evening (20th October) we had our first Meet-up event for OUG Ireland.

Up to recently we have had one or two full day SIG events each year that covered both the Tech and BA/Big Data areas. But we have been finding it increasingly difficult to get speakers and attendees to take a full day out of work to attend the SIG events. This was particularly true for a SIG event we had scheduled to happen in early October. By the end of August things were not coming together for us, so it was time to try something new.

Over the past couple of years Meet-ups have been growing in numbers and in popularity. We (the OUG Ireland SIG committee) have been keeping an eye on this in the UK and Ireland.

It was time to give this concept a try.

What did it entail and what venue did you use?

The first thing we needed to do was to arrange a venue. A very popular location for Meet-ups in Ireland is one of the Bank of Ireland branches: the Bank of Ireland on Grand Canal Dock in Dublin. It is one of their enterprise centres and is open during the day for meetings and as a workspace, and it also operates as a branch. In the evenings and on Saturday mornings it is available for larger groups and Meet-ups to hold meetings. We had the venue from 6pm-8pm.

What presentations did you have?

After securing a venue we then decided that the theme of the Meet-up would be 'Updates from Oracle Open World'. Most of the SIG committee was at Oracle Open World, so it should be easy enough to put a few presentations together, and we got the local Oracle office to join us too.

NewImage

What about catering?

The venue very kindly supplies some soft drinks, tea, coffee, a few beers, along with some sandwiches and pastries. All for free!

So far we have a free venue, free catering and the committee for presenters.

NewImage

How did you advertise the event?

The only other thing we needed to do now was to advertise the event. For this we used a combination of EventBrite and Meet-up.com, along with our own contacts. Plus some of our friends helped to spread the word. This worked really well. We ended up getting roughly the same number of people registering for the event on each platform. We had 98 registrations on these websites.

Something we were warned about is that a lot of people will register, but if you get 40% of those turning up for the event then you are doing well. We got 48 attendees (about 49%). We were delighted with this. For our full day SIG events we might have had 20-25 attendees.

How much did this event cost?

It cost us zero euro/dollars/sterling, as there were no admin or advertising costs, no catering, no room hire, nothing. Well, that is not entirely true. There was a small cost and that was for our membership fee to be on Meet-up. That cost me about $30 for 6 months (unlimited plan). Yes, I've paid that myself.

What was the feedback after the event?

The feedback was fantastic. People loved the new format, loved that it was in the evening after work, liked the short length presentations, liked that it was free, etc.

I also asked people if they might be interested in presenting at a future Meet-up. Personally I had 5 people talk to me about this. The other committee members also had people talk to them about it. It seems people are interested in trying the shorter format presentations, as it is not as daunting as presenting at a conference. A conference seems to be more formal and a step up in presenting levels.

NewImage NewImage

What next?

Well we have the same venue booked for 12th January and 11th May. We have our 2 presenters for the 12th January already. We actually had those before our first meet-up. Plus for the 11th May we have some possible presenters and I just need to work with them to see who will get on the agenda.

The plan was to have 3-4 of these each year. Based on the feedback and the level of interest we might need to have a few more. But it is still early days and we need to see how things develop.

Our Meet-up was in Dublin. Ideally we want to bring this to other regions. For example we could do the same in Belfast and Cork, and possibly Limerick and Galway. This is something we are looking at and in 2017 we will definitely have a Meet-up in one (or two) of those locations. That would bring us up to 4-5 Meet-ups in 2017.

Thank you to everyone who attended and everyone who helped to make this happen.

Monday, October 17, 2016

Oracle Data Miner (ODMr) 4.2 Repository Upgrade

With each new release of the Oracle Data Miner (ODMr) tool (part of SQL Developer) an upgrade of your ODMr Repository is needed. This is because of the numerous new features in the tool. This is particularly the case with ODMr (SQLDev) 4.2.

Now most of the new features for ODMr 4.2 will not be visible until you are running a 12.2 Database. But a small number of new features are available if you are running an earlier version of the DB. Check out my blog post on some of these.

Before upgrading the ODMr repository, just like with any upgrade, make sure to do your backups. Although there is some copying of objects done during the repository upgrade (long story, but a few versions ago my ODMr repository and work got wiped during an upgrade), you should always export and save your workflows. You will need to do this using your current version of ODMr/SQL Dev before you start using ODMr 4.2.

When you have saved your workflows etc you can then start using ODMr/SQLDev 4.2.

The easiest way to do the ODMr 4.2 Repository upgrade is to let the tool do it for you. You can do this by trying to open one of your ODMr connections.

IMPORTANT: You will need to have the SYS password for the ODMr upgrade, so have your DBA do this step for you or have them on standby to enter the password for you.

NewImage

NOTE: This upgrade is being done on a CDB/PDB 12.2 DB.

When prompted enter the SYS password.

NewImage

When prompted click on the Start button.

NewImage

The progress bar will let you know how things are going.

NewImage

When complete you will get the following.

NewImage

It is always good to check the Log file/report, especially if you encounter errors!

NewImage

Job Done!

You can now start using all (well almost all) the new features of ODMr 4.2.

When the 12.2 Database is available you will get to see lots more features.

Tuesday, October 11, 2016

OTN Appreciation Day : My favourite thing from OTN #ThanksOTN

This blog post is my contribution to the OTN Appreciation Day, the brain child of Tim Hall (read his blog post here).

For my contribution, I'm going to write about something that is a bit different to what most people will be writing about. Most people will be writing about some feature of the Oracle Database or maybe their favourite tool.

I'm not going to do that. What I'm going to write about is something that OTN does for us Developers, DBAs, etc.

Basically OTN has done so much over the years to help developers in a multitude of different ways.

Apart from the support that OTN gives me as an Oracle ACE Director (Thank you!), one of my favourite things that OTN makes available to us are the VirtualBox Pre-built Developer VMs.

NewImage

These pre-built VMs allow us developers to go play with the technology, to learn how to use it, to follow tutorials, to see how various software applications work together, etc all within a virtual machine.

I bet that (almost) everyone reading this blog and taking part in the OTN Appreciation Day will have used one or more of the virtual box prebuilt VMs.

Why is this a good thing? How would you like trying to install all this software from scratch? Not me. Typically for me when performing an install I usually mess something up. If this happens often enough then you may just get frustrated with what you are trying to do and just give up on it. The result will probably be you giving a negative review to your employer.

But the pre-built VMs take away the pain of installing (sometimes) large and complex software and allow you to dive straight into using the software. I also really love that the VMs come with tutorials, decent data sets, example applications built using the software, and demonstrations on how to get each of these working together.

If you mess anything up, then you can just re-import the VM and start all over again. When you are finished using the VM and testing the software, all you need to do is to delete the VM. Your laptop, desktop or wherever you have installed the VM is left clean with no partially uninstalled files, etc.

Each of us will have our favourite VMs. For most people the Developer Day VM is fantastic. Whether you are a beginner or an experienced developer, I would bet most people will have a copy of this VM and are probably using it as their personal Oracle Database server.

For me, I'm also a regular user of the Oracle Big Data Lite VM and the OBIEE Sample Application VM.

For OTN Appreciation day, I haven't talked about a Database feature. Instead I've talked about something that OTN has done for us, the developer, DBA, etc community. I'd like to thank OTN for supporting the community by providing these VirtualBox pre-built VMs for us to use. You have saved me/us many, many, many hours/days/weeks/months over the years.

BTW. I'm looking forward to the VM with the 12.2c Database.

Monday, October 10, 2016

OUG Ireland Meet-up 20th October

Come along to the first OUG Ireland meet-up on the 20th October, in Bank of Ireland, Grand Canal Dock, Dublin, between 18:15 and 20:00.

Over the years the OUG Ireland SIG committee have organised one day SIG events once or twice a year. This is in addition to the annual OUG Ireland conference (typically held in March). Sometimes it has been a challenge to get people to attend, sometimes it has been a challenge to get enough speakers, sometimes it was a challenge to get a good venue, etc.

So we have decided to try something a little bit different. In keeping with the current trend of smaller scale events we have organised our first Meet-up. This will be a short 2 hour event to be held after work on the 20th October. So come along and join us.

This is a free and open event. You do not need to be a member of the user group to come to this meet-up.

Here are the details:

Theme for Meet-up

Updates from Oracle Open World 2016

Agenda

18:00-18:20 : Sign-in, meet and greet, and setup of space with seats etc

18:20-18:30 : Introductions & Welcome, Agenda, what is OUG Ireland

18:30-18:45 : The Oracle 12.2 Database new features (Simon Holt)

18:45-19:00 : What's new in the BI, BA, Big Data world from Oracle (Brendan Tierney)

19:00-19:15 : What's happening with Cloud (Tony Cassidy)

19:15-19:30 : Other updates from Oracle (John Caulfield, Oracle)

19:30-19:45 : Q&A session and Open discussion

Location

Bank Of Ireland

1 Grand Canal Square

Dublin

Please sign up, so that we know who is coming

There are 2 places where you can sign up. It doesn't matter which one you use but please use one of them to let us know you will be there.

Sign up on EventBrite.com

Sign up on Meetup.com


We will be looking to set up more Meet-up events, so let us know what you think of the new format and particularly if you would like to get involved with talking about a topic, project, new feature, or whatever, for 15-20 minutes (a short demo would be good).

Wednesday, October 5, 2016

Oracle Data Miner 4.2 EA : New Features

A couple of weeks ago during the madness of Oracle Open World there were some new product releases and lots of updates to existing products.

One such product was SQL Developer. They released an Early Adopter version (EA1). This is where you can try out the new version of the product, but you need to be careful as it is not the GA/Production version. So it may have some "features".

One component of SQL Developer is the Oracle Data Miner tool. This is a GUI workflow-based tool built on the Oracle Advanced Analytics option. At OOW we got to hear about the various new Oracle Data Mining features that are coming with Oracle 12.2 Database. For Oracle Data Miner (ODMr) 4.2 (EA) there are a lot of new features but most of these are hidden and will only become available when you are using the Oracle 12.2 DB.

But if you are using a 12.1 (or earlier) DB then there are some new features. I've been having a bit of a look around the EA1 release to see what is new and available to us now (while we wait for 12.2).

If you are on Oracle 12.1 DB or earlier there are two main new features. These are a new Workflow Scheduler and being able to specify in-memory options for ODMr objects. These can be easily found on the ODMr menu bar, and are highlighted in the following image.

NewImage

Let us now have a quick look at these.

ODMr Workflow Scheduler

The Workflow Scheduler allows us to take an ODMr Workflow and schedule it to run in the Oracle Database at a defined time or on a defined schedule. Previously we would have to write the SQL and PL/SQL code to enable the scheduling. Plus the ODMr workflow was output as a number of SQL scripts. So it was a little bit of a challenge to get the workflow running on a regular basis.

Now with the new in-built ODMr Scheduler we can quickly and easily do this without having to write a line of SQL or PL/SQL. The tool will look after the hard bit for us. We can schedule the entire workflow or certain parts of the workflow.
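
To give a feel for the hand-written approach, here is a minimal sketch using the database's DBMS_SCHEDULER package. The job name and the RUN_MY_ODMR_WORKFLOW procedure are hypothetical: the procedure stands in for whatever wrapper you would have written yourself around the SQL scripts generated for the workflow. This is not what ODMr generates.

```sql
-- Hypothetical example of scheduling a hand-written workflow wrapper.
BEGIN
  DBMS_SCHEDULER.CREATE_JOB(
    job_name        => 'ODMR_WORKFLOW_NIGHTLY',           -- hypothetical job name
    job_type        => 'STORED_PROCEDURE',
    job_action      => 'RUN_MY_ODMR_WORKFLOW',            -- hypothetical procedure
    start_date      => SYSTIMESTAMP,                      -- first eligible run time
    repeat_interval => 'FREQ=DAILY; BYHOUR=2; BYMINUTE=0', -- every day at 02:00
    enabled         => TRUE);
END;
/
```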

NewImage

When setting up your schedule you can pick the Start Date, how frequently you would like it run (daily, weekly, monthly or some other custom frequency), when it should end (never, after X number of runs or on a specific date). You can also re-use an existing schedule.

NewImage

For the advanced settings you can set up email notification, the job priority level, maximum run durations and limits, and the timezone to use.

NewImage

ODMr In-memory Options

To access the in-memory options you can click on the 'Performance Options' button on the ODMr menu or you can access it via the menu (Tools -> Preferences) to get the complete list of in-memory settings.

NewImage

When you use ODMr to build your data mining workflows, ODMr will create a number of objects for each of the nodes of the workflow. These are typically created as tables in your schema. The previous version of ODMr introduced the Performance Options, where you could set the degree of parallelism to use for some Nodes and the underlying SQL and PL/SQL code that is generated.

Now we can specify if the tables created should be in-memory, and avail of the significantly faster response times when you are using the data in these tables. This is particularly useful as we work with larger and larger data sets and we want lightning fast response from some of our data mining tasks.

In addition to turning on the in-memory option for certain nodes, we can also specify the in-memory configuration settings such as the level of Columnar Compression to use and the Priority Level.
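
For anyone curious what these settings correspond to at the database level, the sketch below shows the equivalent in-memory clause on an ordinary table. The table name is hypothetical, and this is not the DDL that ODMr itself generates. It assumes a database with the In-Memory option (12.1.0.2 or later).

```sql
-- Hypothetical table standing in for a table ODMr creates for a workflow node.
ALTER TABLE odmr_node_output_tab
  INMEMORY MEMCOMPRESS FOR QUERY LOW PRIORITY HIGH;

-- Check the in-memory settings now recorded for the table.
SELECT table_name, inmemory, inmemory_compression, inmemory_priority
FROM user_tables
WHERE table_name = 'ODMR_NODE_OUTPUT_TAB';
```

MEMCOMPRESS controls the level of columnar compression and PRIORITY controls how eagerly the table is populated into the in-memory column store.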

NewImage


(I've been on the 12.2 beta so I've had a chance to try out many of the new features. There is some good stuff coming and I'll have blog posts about these when 12.2 becomes GA.)