Saturday, October 20, 2018

Oracle 18c XE - Comes with in-database and R machine learning

As of today 20th October, Oracle has finally released Oracle 18c XE aka Express Edition

A very important word associated with Oracle 18c XE is the word 'FREE'

Yes it is FREE

This FREE product is backed full of features. Think of all the features that come with the Enterprise Edition of the Database. It comes with most of those features, including some of the extra add on features.

I said it comes with most features. There are a few features that don't come with XE, so go check out the full list here.

NewImage

There are a few restrictions:

  • Up to 12 GB of user data
  • Up to 2 GB of database RAM
  • Up to 2 CPU threads
  • Up to 3 Pluggable Databases

I know of so many companies and applications that easily meet the above restrictions.

For the Data Scientists and Machine Learning people, the Advanced Analytics option is now available with Oracle 18c XE. That means you can use the in-memory features for super fast analytics, use the in-database machine learning algorithms, and also use the embedded R feature called Oracle R Enterprise.

Yes you are limited to 12G of user data. That might be OK for most people but for those whose data is BIG then this isn't an option for you.

There is a phrase, "Your data isn't as big as you think", so maybe your data might fit within the 12G.

Either way this can be a great tool to allow you to try out machine learning for Free in a test lab environment.

Go download load it and give it a try.

Thursday, October 18, 2018

Creating an Autonomous Data Warehouse Cloud Service

The following outlines the steps to create a Autonomous Data Warehouse Cloud Service.
Log into your Oracle Cloud account and then follow these steps.
1. Select Autonomous Data Warehouse Cloud service from the side menu
NewImage
2. Select Create Autonomous Data Warehouse button
NewImage
3. Enter the Compartment details (Display Name, Database Name, CPU Core Count & Storage)
NewImage
4. Enter a Password for Administrator, and then click ‘Create Autonomous Data Warehouse’
NewImage
5. Wait until the ADWC is provisioned
Going from this
NewImage
to this
NewImage
And you should receive and email that looks like this
NewImage
6. Click on the name of the ADWS you created
NewImage
7. Click on the Service Console button
NewImage
8. Then click on Administration and then Download a Connection Wallet
Specify the password
NewImage
You an now use this to connect to the ADWS using SQL Developer
All done.

Monday, October 15, 2018

R vs Python vs SQL for Machine Learning (Infographic)

Next week I'll be giving several presentation on machine learning at Oracle Open World and Oracle Code One. In one of these presentation an evaluation of using R vs Python vs SQL will be given and discussed.

Check out the infographic containing the comparisons.

Click here to download the PDF version.

Info Graphic

Wednesday, October 10, 2018

OOW 2018 Chocolate Tasting

Calling all Oracle ACEs, Developer Champions and Oracle Product Managers from around the World.

Are you going to Oracle Open World or Oracle Code One?

If you are, bring some of your favourite chocolates from where you live and share them with other Oracle ACEs, Developer Champions and Oracle PMs.

Location : The Hub (Moscone West).

Date : Wednesday 24th October

Time : 3pm-4pm

All you have to do is to bring some of the best chocolate from your country or your favourite chocolate, meet with other people, talk about Oracle technologies and what you have learned during your time at Oracle Open World and Oracle Code One.

Please don't bring your typical high street, mass market type of chocolate. Bring the good stuff. Pick it up at your local chocolate shop or in the airport as you begin your travels.

Last year (2017) we had chocolate from 14 different countries. They were all very different and very tasty.

I'll have some Butlers Chocolates with me for the tasting. What chocolates will you bring?

Friday, September 7, 2018

OOW18 and Code One agendas with Date and Times

I've just received an email in from the organisers of Oracle Open World (18) and Oracle Code One (formally Java One) with details of when I will be presenting.

It's going to be a busy presenting schedule this year with 4 sessions.

It's going to be a busy presenting schedule this year with 3 sessions on the Monday.

Check out my sessions, dates and times.

Screenshot 2018 09 07 09 10 11

In addition to these sessions I'll also be helping out in the Demo area in the Developer Lounge. I'll be there on Wednesday afternoon handing out FREE beer.

Wednesday, August 29, 2018

Bringing Neural Networks to Production using GraphPipe

Machine learning is a fascinating topic. It has so much potential yet very few people talk about using machine learning in production. I've been highlighting the need for this for over 20 years now and only a very small number of machine learning languages and solutions are suitable for production use. Why? maybe it is due to the commercial aspects and as many of the languages and tools are driven by the open source community, one of the last things they get round to focusing on is production deployment. Rightly they are focused at developing more and more machine learning algorithms and features for developing models, but where the real value comes is will being able to embed machine learning model scoring in production system. Maybe this why the dominant players with machine learning in enterprises are still the big old analytics companies.

Yes that was a bit a of a rant but it is true. But over the summer and past few months there has been a number of articles about production deployment.

But this is not a new topic. For example, we have Predictive Model Markup Language (PMML) around for a long time. The aim of this was to allow the interchange of models between different languages. This would mean that the data scientist could develop their models using one language and then transfer or translate the model into another language that offers the same machine learning algorithms.

But the problem with this approach is that you may end up with different results being generated by the model in the development or lab environment versus the model being used in production. Why does this happen? Well the algorithms are developed by different people/companies and everyone has their preferences for how these algorithms are implemented.

To over come this some companies would rewrite their machine learning algorithms and models to ensure that development/lab results matched the results in production. But there is a very large cost associated with this development and ongoing maintenance as the models evolved. This would occur, maybe, every 3, 6, 9, 12 months. Somethings the time to write or rewrite each new version of the model would be longer than its lifespan.

These kind of problems have been very common and has impacted on model deployment in production.

In the era of cloud we are now seeing some machine learning cloud solutions making machine learning models available using REST services. These can, very easily, allow for machine learning models to be included in production applications. You are going to hear more about this topic over the coming year.

But, despite all the claims and wonders and benefits of cloud solutions, it isn't for everyone. Maybe at some time in the future but it mightn't be for some months or years to come.

So, how can we easily add machine learning model scoring/labeling to our production systems? Well we need some sort of middleware solutions.

Given the current enthusiasm for neural networks, and the need for GPUs, means that these cannot (easily) be deployed into production applications.

There have been some frameworks put forward for how to enable this. Once such framework is called Graphpipe. This has recently been made open source by Oracle.

Graphpipe

Graphpipe is a framework that to access and use machine learning models developed and running on different platforms. The framework allows you to perform model scoring across multiple neural networks models and create ensemble solutions based on these. Graphpipe development has been focused on performance (most other frameworks don't). It uses flatbuffers for efficient transfer of data and currently has integration with TensorFlow, PyTorch, MXNet, CNTK and via ONNX and caffe2.

Expect to have more extensions added to the framework.

Graphpipe website

Graphpipe getting started

Graphpipe blogpost

Graphpipe download

Monday, August 13, 2018

Spark docker images

Spark is a very popular environment for processing data and doing machine learning in a distributed environment.

When working in a development environment you might work on a single node. This can be your local PC or laptop, as not everyone will have access to a multi node distributed environment.

But what if you could spin up some docker images there by creating additional nodes for you to test out the scalability of your Spark code.

There are links to some Docker images that may help you to do this.

Or simply create a cloud account on the Databricks Community website to create your own Spark environment to play and learn.

Thursday, August 2, 2018

A selection of Hadoop Docker Images

When it comes to big data platforms one of the biggest challenges is getting a test environment setup where you can try out the various components. There are a few approaches to doing this this. The first is to setup your own virtual machine or some other container with the software. But this can be challenging to get just a handful of big data applications/software to work on one machine.

But there is an alternative approach. You can use one of the preconfigured environments from the likes of AWS, Google, Azure, Oracle, etc. But in most cases these come with a cost. Maybe not in the beginning but after a little us you will need to start handing over some dollars. But these require you to have access to the cloud i.e. wifi, to run these. Again not always possible!

So what if you want to have a local big data and Hadoop environment on your own PC or laptop or in your home or office test lab? There ware a lot of Virtual Machines available. But most of these have a sizeable hardware requirement. Particularly for memory, with many requiring 16+G of RAM ! Although in more recent times this might not be a problem but for many it still is. Your machines do not have that amount or your machine doesn't allow you to upgrade.

What can you do?

Have you considered using Docker? There are many different Hadoop Docker images available and these are not as resource or hardware hungry, unlike the Virtual Machines.

Here is a list of some that I've tried out and you might find them useful.

Cloudera QuickStart image

You may have tried their VM, now go dry the Cloudera QuickStart docker image.

Read about it here.

Check our Docker Hub for lots and lots of images.

Docker Hub is not the only place to get Hadoop Docker images. There are lots on GitHub Just do a quick Google search to find the many, many, many images.

These Docker Hadoop images are a great way for you to try out these Big Data platforms and environments with the minimum of resources.

Monday, July 23, 2018

Lessor known Apache Machine Learning languages

Machine learning is a very popular topic in recent times, and we keep hearing about languages such as R, Python and Spark. In addition to these we have commercially available machine learning languages and tools from SAS, IBM, Microsoft, Oracle, Google, Amazon, etc., etc. Everyone want a slice of the machine learning market!

The Apache Foundation supports the development of new open source projects in a number of areas. One such area is machine learning. If you have read anything about machine learning you will have come across Spark, and maybe you might believe that everyone is using it. Sadly this isn't true for lots of reasons, but it is very popular. Spark is one of the project support by the Apache Foundation.

But are there any other machine learning projects being supported under the Apache Foundation that are an alternative to Spark? The follow lists the alternatives and lessor know projects: (most of these are incubator/retired/graduated Apache projects)

Flink Flink is an open source system for expressive, declarative, fast, and efficient data analysis. Stratosphere combines the scalability and programming flexibility of distributed MapReduce-like platforms with the efficiency, out-of-core execution, and query optimization capabilities found in parallel databases. Flink was originally known as Stratosphere when it entered the Incubator.

Documentation

(graduated)

HORN HORN is a neuron-centric programming APIs and execution framework for large-scale deep learning, built on top of Apache Hama.

Wiki Page

(Retired)

HiveMail Hivemall is a library for machine learning implemented as Hive UDFs/UDAFs/UDTFs

Apache Hivemall offers a variety of functionalities: regression, classification, recommendation, anomaly detection, k-nearest neighbor, and feature engineering. It also supports state-of-the-art machine learning algorithms such as Soft Confidence Weighted, Adaptive Regularization of Weight Vectors, Factorization Machines, and AdaDelta. Apache Hivemall offers a variety of functionalities: regression, classification, recommendation, anomaly detection, k-nearest neighbor, and feature engineering. It also supports state-of-the-art machine learning algorithms such as Soft Confidence Weighted, Adaptive Regularization of Weight Vectors, Factorization Machines, and AdaDelta.

Documentation

(incubator)

MADlib Apache MADlib is an open-source library for scalable in-database analytics. It provides data-parallel implementations of mathematical, statistical and machine learning methods for structured and unstructured data. Key features include: Operate on the data locally in-database. Do not move data between multiple runtime environments unnecessarily; Utilize best of breed database engines, but separate the machine learning logic from database specific implementation details; Leverage MPP shared nothing technology, such as the Greenplum Database and Apache HAWQ (incubating), to provide parallelism and scalability.

Documentation

(graduated)

MXNet A Flexible and Efficient Library for Deep Learning . MXNet provides optimized numerical computation for GPUs and distributed ecosystems, from the comfort of high-level environments like Python and R MXNet automates common workflows, so standard neural networks can be expressed concisely in just a few lines of code.

Webpage

(incubator)

OpenNLP OpenNLP is a machine learning based toolkit for the processing of natural language text. OpenNLP supports the most common NLP tasks, such as tokenization, sentence segmentation, part-of-speech tagging, named entity extraction, chunking, parsing, language detection and coreference resolution.

Documentation

(graduated)

PredictionIO PredictionIO is an open source Machine Learning Server built on top of state-of-the-art open source stack, that enables developers to manage and deploy production-ready predictive services for various kinds of machine learning tasks.

Documentation

(graduated)

SAMOA SAMOA provides a collection of distributed streaming algorithms for the most common data mining and machine learning tasks such as classification, clustering, and regression, as well as programming abstractions to develop new algorithms that run on top of distributed stream processing engines (DSPEs). It features a pluggable architecture that allows it to run on several DSPEs such as Apache Storm, Apache S4, and Apache Samza.

Documentation

(incubator)

SINGA SINGA is a distributed deep learning platform. An intuitive programming model based on the layer abstraction is provided, which supports a variety of popular deep learning models. SINGA architecture supports both synchronous and asynchronous training frameworks. Hybrid training frameworks can also be customized to achieve good scalability. SINGA provides different neural net partitioning schemes for training large models.

Documentation

(incubator)

Storm Storm is a distributed, fault-tolerant, and high-performance realtime computation system that provides strong guarantees on the processing of data. Storm makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing. Storm is simple, can be used with any programming language.

Documentation

(graduated)

SystemML SystemML provides declarative large-scale machine learning (ML) that aims at flexible specification of ML algorithms and automatic generation of hybrid runtime plans ranging from single node, in-memory computations, to distributed computations such as Apache Hadoop MapReduce and Apache Spark.

Documentation

(graduated)

Big data ml

I will have a closer look that the following SQL based machine learning languages in a lager blog post:

- MADlib

- Storm

Thursday, July 12, 2018

Oracle Developer Champion

Yesterday evening I received an email titled 'Invitation to Developer Champion Program'.

What a surprise!
Oracle dev champion
The Oracle Developer Champion program was setup just a year ago and is aimed at people who are active in generating content and sharing their knowledge on new technologies including cloud, micro services, containers, Java, open source technologies, machine learning and various types of databases.
For me, I fit into the machine learning, cloud, open source technologies, a bit on chatbots and various types of databases areas. Well I think I do!

This made me look back over my activities for the past 12-18 months. As an Oracle ACE Director, we have to record all our activities. I'd been aware that the past 12-18 months had been a bit quieter than previous years. But when I looked back at all the blog posts, articles for numerous publications, books, and code contributions, etc. Even I was impressed with what I had achieved, even though it was a quiet period for me.

Membership of Oracle Developer Champion program is for one year, and the good people in Oracle Developer Community (ODC) will re-evaluate what I, and the others in the program, have been up to and will determine if you can continue for another year.

In addition to writing, contributing to projects, presenting, etc Oracle Developer Champions typically have leadership roles in user groups, answering questions on forums and providing feedback to product managers.

The list of existing Oracle Developer Champions is very impressive. I'm honoured to be joining this people.

Click on the image to go to the Oracle Developer Champion website to find out more.
Screen Shot 2018 07 12 at 17 21 32

And check out the list of existing Oracle Developer Champions.
 Oracle dev champion O ACEDirectorLogo clr

Thursday, June 28, 2018

My book on Oracle R Enterprise translated into Chinese

A couple of days ago the post man knocked on my door with a package. I hadn't ordered anything, so it was a puzzling what it might be.

When I opened the package I found 3 copies of a book in Chinese.

It was one of my books !

One of my books was translated into Chinese !

What a surprise, as I wasn't aware this was happening.

At this time I'm not sure where you can purchase the book, but I'll update this blog post when I find out.

Monday, June 18, 2018

Twitter Analytics using Python - Part 3

This is my third (of five) post on using Python to process Twitter data.

Check out my all the posts in the series.

In this post I'll have a quick look at how to save the tweets you have download. By doing this allows you to access them at a later point and to perform more analysis. You have a few instances of saving the tweets. The first of these is to save them to files and the second option is to save them to a table in a database.

Saving Tweets to files

In the previous blog post (in this series) I had converged the tweets to Pandas and then used the panda structure to perform some analysis on the data and create some charts. We have a very simple command to save to CSV.

# save tweets to a file
tweets_pd.to_csv('/Users/brendan.tierney/Dropbox/tweets.csv', sep=',')

We can inspect this file using a spreadsheet or some other app that can read CSV files and get the following.

Twitter app8

When you want to read these tweets back into your Python environment, all you need to do is the following.

# and if we want to reuse these tweets at a later time we can reload them
old_tweets = pd.read_csv('/Users/brendan.tierney/Dropbox/tweets.csv')

old_tweets

Tweet app9

That's all very easy!


Saving Tweets to a Database

There are two ways to add tweets to table in the database. There is the slow way (row-by-row) or the fast way doing a bulk insert.

Before we get started with inserting data, lets get our database connection setup and the table to store the tweets for our date. To do this we need to use the cx_oracle python library. The following codes shows the setting up of the connections details (without my actual login details), establishes the connects and then retrieves some basic connection details to prove we are connected.

# import the Oracle Python library
import cx_Oracle

# define the login details
p_username = "..."
p_password = "..."
p_host = "..."
p_service = "..."
p_port = "1521"

# create the connection
con = cx_Oracle.connect(user=p_username, password=p_password, dsn=p_host+"/"+p_service+":"+p_port)
cur = con.cursor()

# print some details about the connection and the library
print("Database version:", con.version)
print("Oracle Python version:", cx_Oracle.version)


Database version: 12.1.0.1.0
Oracle Python version: 6.3.1

Now we can create a table based on the current date.

# drop the table if it already exists
#drop_table = "DROP TABLE TWEETS_" + cur_date
#cur.execute(drop_table)

cre_table = "CREATE TABLE TWEETS_" + cur_date + " (tweet_id number, screen_name varchar2(100), place varchar2(2000), lang varchar2(20), date_created varchar2(40), fav_count number, retweet_count number, tweet_text varchar2(200))"

cur.execute(cre_table)

Now lets first start with the slow (row-by-row) approach. To do this we need to take our Panda data frame and convert it to lists that can be indexed individually.

lst_tweet_id = [item[0] for item in rows3]
lst_screen_name = [item[1] for item in rows3]
lst_lang =[item[3] for item in rows3]
lst_date_created = [item[4] for item in rows3]
lst_fav_count = [item[5] for item in rows3]
lst_retweet_count = [item[6] for item in rows3]
lst_tweet_text = [item[7] for item in rows3]

#define a cursor to use for the the inserts
cur = con.cursor()
for i in range(len(rows3)):
    #do the insert using the index. This can be very slow and should not be used on big data
    cur3.execute("insert into TWEETS_2018_06_12 (tweet_id, screen_name, lang, date_created, fav_count, retweet_count, tweet_text) values (:arg_1, :arg_2, :arg_3, :arg_4, :arg_5, :arg_6, :arg_7)",
                 {'arg_1':lst_tweet_id[i], 'arg_2':lst_screen_name[i], 'arg_3':lst_lang[i], 'arg_4':lst_date_created[i],
                  'arg_5':lst_fav_count[i], 'arg_6':lst_retweet_count[i], 'arg_7':lst_tweet_text[i]})

#commit the records to the database and close the cursor
con.commit()
cur.close()

Tweet app10

Now let us look a quicker way of doing this.

WARNING: It depends on the version of the cx_oracle library you are using. You may encounter some errors relating to the use of floats, etc. You might need to play around with the different versions of the library until you get the one that works for you. Or these issues might be fixed in the most recent versions.

The first step is to convert the panda data frame into a list.

rows = [tuple(x) for x in tweets_pd.values]
rows

Tweet app11

Now we can do some cursor setup like setting the array size. This determines how many records are sent to the database in each batch. Better to have a larger number than a single digit number.

cur = con.cursor()

cur.bindarraysize = 100

cur2.executemany("insert into TWEETS_2018_06_12 (tweet_id, screen_name, place, lang, date_created, fav_count, retweet_count, tweet_text) values (:1, :2, :3, :4, :5, :6, :7, :8)", rows)

Check out the other blog posts in this series of Twitter Analytics using Python.