Tuesday, March 31, 2015

OTech article on Predictive Queries

Last week the Spring 2015 edition of OTech Magazine was published.

Check out the magazine here.

Otech 1

I was lucky to have an article accepted and published in this edition. The topic of the article is Predictive Queries.

I've given a presentation on Predictive Queries at a few Oracle User Group conferences over the past 6 months or so, and this article covers what I talk about in that presentation.

The article covers what Predictive Queries are about and goes through some examples of how you can use them. Again, I give some of these examples in my presentation.

Now is your chance to try out Predictive Queries using the examples in the article.

Otech 2

Recently I recorded a very short video with Bob Rhubart of OTN on this topic as part of his 2 Minute Tech Tips. Check out my blog post about this video and view the video.

OTN tech tip

Thursday, March 12, 2015

Automatic Analytics is so mainstream. Not something new.

Everyone is doing advanced analytics. Right? Hmm

Everyone is talking about advanced analytics? Yes that is true.

Everyone is an expert in advanced analytics? This is so not true. Watch out for these Great Pretenders. You know what I mean! You know who I mean! Maybe you know some of them already? If not, watch out for these Great Pretenders!!!

Some people are going around talking about data mining, predictive analytics, advanced analytics, machine learning, etc. as if this is some new topic. Well it isn't. It isn't anything new and most of the techniques have been around for 10, 20, 30+ years.

Some people are saying you should only use language X or tool Y because. Everything else is basically rubbish.

What we do have is a wider understanding of how to use these techniques on our various data sources.

What we have is a lot more tools that allow us to perform these tasks a lot easier, at greater speed, with more functionality and without the need to fully understand the hard core maths that is going on behind the scenes.

What we have is a lot more languages to perform these tasks and to support the vast amount of work that goes into understanding the data and preparing the data.

Something for all of us to watch out for, when we read about these topics, is what kind of problem area they are addressing. The following table illustrates the three main types or categories of Analytics: Descriptive Analytics, Predictive Analytics and Prescriptive Analytics. I think most people would agree that the Descriptive and Predictive Analytics categories are very mature at this stage. With Prescriptive Analytics we are perhaps still evolving, and a lot more work needs to be done before it becomes widespread.

Blog 1

Some people talk as if Predictive Analytics is some new and exciting topic. But it isn't all that new. It has been around for the past 30+ years. If you go back over the Gartner Hype Cycle that comes out every September, Predictive Analytics is no longer being shown on this graph. The last time it appeared on the Gartner Hype Cycle was back in 2013 and it was positioned on the far right of the graph in the section called Plateau of Productivity.

So Predictive Analytics is very mature and mainstream. Part of the reason that it is mainstream is that Predictive Analytics has allowed for a new category of Analytics to evolve, and this is Automatic Analytics.

Automatic Analytics is where Advanced and Predictive Analytics have been built into the day to day applications that are used to run our business. We do not need the hard core type of data scientist to perform various analytics on our data. Instead these tasks, once they have been defined, can be added to our applications to process, evaluate and make decisions, all automatically. This is where we need data scientists who are able to communicate with the business and work with them to solve real world business projects. This is a different type of data scientist to the "hard" core data scientist who delves into the various statistical methods, machine learning methods, data management methods, etc.

The following table extends the table given above to include Automatic Analytics, and is my own take on how and where Automatic Analytics fits.

Blog 2

Every time we get an insurance quote or a health insurance quote, get a "random" call from our Telco offering a free upgrade, get our loyalty card statements, get a loan from the bank, or look at or buy a book on Amazon (the list could go on and on), we are seeing examples of how predictive analytics has been automated into our everyday business applications.

But this is nothing new. When I first got into data mining/predictive analytics over 16 years ago, it was considered a common thing that certain types of companies did. What has happened in the time since and particularly in the past few years is that a lot more people are seeing the value in using it.

Before I finish off this post we can have a quick look at what Oracle has been doing in this area. They have their Advanced Analytics Option and Real-Time Decisions tools to allow data scientists to do their magic. But over the past X years (nobody can give me an exact number) they have been very, very active in building lots and lots of predictive analytics into their various business applications, particularly into Fusion Apps and BI Apps.

Blog 3

A recent quote from Oracle highlights their aim with this,

" ... products designed to close the gap between data scientists and businesses."

Now with Oracle making a big push to the cloud, they are busy adding more and more Automatic (Predictive) Analytics into their Cloud Applications. What we need from Oracle is a clearer identification of where they have done this. Plus with the migration of their Apps to the cloud, their Advanced Analytics Option is a core part of their Cloud platform. As they upgrade or add new features to their Cloud Apps, you will be able to get the benefit of these Automatic (Predictive) Analytics as they become available.

Blog 5

Monday, March 9, 2015

OUG Ireland 2015 is next week

The annual OUG Ireland conference is on next week on Thursday 19th March.

If you haven't already signed up for the conference there are only a few days left to do so. Click here to go to the registration page.

Also don't forget to sign up for Maria Colgan's one day seminar on the Oracle 12c In-Memory Option.

As always there is a very full agenda with 7 streams, 47 presentations and several keynote presentations.

I'll be holding a draw for a copy of my book and I'll be giving away a few Oracle Press goodies too. Check out this blog post for the details and rules of the book draw.

The following are the presentations I'm planning on attending (so you know where to find me)

Time | Presenter | Topic
09:10-09:30 | Debra Lilley | OUG Ireland Welcome, Introduction and Opening
09:30-10:10 | Jon Paul (Oracle) | Opening Keynote by Jon Paul from Oracle
10:15-11:00 | Oracle Presentation | Oracle Big Data Strategy
11:00-11:25 | | Exhibition Hall
11:25-12:10 | Antony Heljula | Real Business Value Using Predictive BI (I've seen this before but I worked with Antony on some of what he will be talking about)
12:15-13:00 | Roel Hartman & Brendan Tierney | What Are They Thinking? With Oracle Application Express & Oracle Data Mining. (we gave this presentation at Oracle Open World back in September 2014)
12:15-13:00 | Gurcan Orhan | How to handle Dev, Test & Prod with ODI
13:00-14:00 | | Lunch (and then freaking out before I give my second presentation)
14:00-14:45 | Brendan Tierney | Predictive Queries in Oracle 12c Database (I suppose I have to turn up to my own presentation)
14:50-15:35 | Roel Hartman | Hidden APEX 5 Gems Revealed (APEX 5 is due out any day now)
15:35-16:00 | | Exhibition Hall & Coffee (and then freaking out before I give my third presentation)
16:00-16:45 | Brendan Tierney | Running R in your Oracle Database using Oracle R Enterprise (This presentation generally runs for 50 minutes)
16:50-17:35 | Maria Colgan | BI, Dev & Tech Closing Keynote: Oracle Database In-Memory-The next big thing
17:35-18:35 | | Event Social i.e. free drink :-)

As you can see it is going to be a busy, busy day.

I would love to attend lots of other presentations, but being in multiple places at the same time is not one of my talents.

NOTE: The User Group has a rule that a presenter can have a max of 2 presentations. Unfortunately we had to break this rule a week out from the conference, due to some cancellations. And that is why I've ended up with 2.5 presentations.

Friday, March 6, 2015

RIP SQL*Plus & hello SQL Command Line

Over the past couple of months Oracle has been releasing some EA (Early Adopter) versions of a new tool that is currently called SQL Command Line.

The team behind this new tool is the SQL Developer development team and they have been working on creating a new command line SQL tool that is based on some of the technology that is included in SQL Developer.

SQL Command Line is a standalone tool and all you need to do is download and unzip the file.

What I want to show in this blog post is some of the new features that are available and that I have found particularly useful. But before we get onto those commands let us first have a look at how you can get set up and running with SQL Command Line.

Download & Setup

The current download of SQL Command Line can be found under the SQL Developer 4.1 EA Download page. I'm assuming when 4.1 is formally released the download for SQL Command line will be on the main SQL Developer Download web page.

SQL CL 1

After you have downloaded the file, all you need to do is to unzip the file and then copy the unzipped directory to where you want the software to be located on your client.

Now you are ready to get started with using SQL Command Line.

Connecting to your Oracle Schema

(That) Jeff Smith and Barry McGillin have a couple of good blog posts on the different connection methods and some setup or configuration you might need to consider. Check out these links for more details.

For me I did not have to do any additional setup or configuration. I was able to use the TNS Names and the EZConnect methods without any problems.

The following shows how to connect to my (DMUSER) schema using the EZConnect method. With this method we pass in the username, password, host name, port number and service name, just like this

> sql dmuser/dmuser@localhost:1521/pdb12c

We can now have a look at the JDBC connection details.

SQL> show jdbc

-- Database Info --

Database Product Name: Oracle

Database Product Version: Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production

With the Partitioning, OLAP, Advanced Analytics and Real Application Testing options

Database Major Version: 12

Database Minor Version: 1

-- Driver Info --

Driver Name: Oracle JDBC driver

Driver Version: 12.1.0.2.0

Driver Major Version: 12

Driver Minor Version: 1

Driver URL: jdbc:oracle:thin:@localhost:1521/pdb12c

SQL>


If we have a TNSNAMES.ORA file on our computer, and the directory it is in is on the search PATH, then we can use the service names defined in the TNSNAMES.ORA file. The following example shows you how to use this in two ways. The first shows how to enter all the details when you are starting SQL CL and the other is when SQL CL prompts you for each parameter.

> sql dmuser/dmuser@pdb12c

and when we are prompted to enter the parameters, we get the following.

> sql

SQLcl: Release 4.1.0 Beta on Thu Mar 05 15:16:12 2015

Copyright (c) 1982, 2015, Oracle. All rights reserved.

SQLcl: Release 4.1.0 Beta on Thu Mar 05 15:16:14 2015

Copyright (c) 1982, 2015, Oracle. All rights reserved.

Username? (''?) dmuser

Password? (**********?) ******

Database? (''?) pdb12c

Connected to:

Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production

SQL>


As you can see these work in the same way as when we use SQL*Plus.


Now that you are connected to your schema, what else can you do? The following sections cover some useful commands.

Commands & Help

The following list of commands is by no means a complete list of the commands available in SQL Command Line. Theoretically, everything you can currently do in SQL*Plus you can also do in SQL Command Line. The commands I give examples of below are some of my favourites (so far).

You can get the list of commands by typing help at the SQL prompt.

SQL> help

Then to get help on a specific command you can just add the command after the help.

SQL> help cd

CD

---

Changes path to look for script at after startup.

(show SQLPATH shows the full search path currently:

- CD current directory setting set by last cd command

- baseURL (url for subscripts)

- topURL (top most url when starting script)

- Last Node opened (i.e. file in worksheet)

- Where last script started

- Last opened on sqlplus path related file chooser

- SQLPATH setting

- "." if in SQLDeveloper UI (included in SQLPATH in command line (sdsql))

).

SQL>

Some work is still needed on the help documentation and what is listed for each command, as the current version is missing some important details.

Alias

This is by far my favourite new feature. It allows us to take some of our most common SQL statements and create shortcuts for them.

Very soon I will not be using Oracle SQL but I will be using My SQL, as I will have created my own personalised version of SQL.

To list what aliases you have defined in your schema you can type

SQL> alias

Oracle will have a few aliases already defined in SQL CL. By having a look at some of these you can see some of what they can do and get ideas for what you might want to do with them. To list the contents of an alias, you can use the following command.

alias list {alias name}

for example

SQL> alias list tables

This command lists the query that is used for the 'tables' alias that comes with SQL CL.

I use Oracle Data Miner a lot and when you use this tool it can create a number of tables with a variety of names in your schema. Most of these you will never need to look at. So what I do is create an alias that excludes these from the list of tables in my schema.

SQL> alias tables2=select table_name from user_tables where table_name not like 'ODMR$%' and table_name not like 'DM$%' and table_name not like 'SYS_IOT%';

So now all I need to do to list my important data tables (and exclude all the Oracle Data Miner tables) is run my alias 'tables2'.

SQL> tables2

You will quickly build up a suite of commands using aliases.

info and info+

info and info+ are the new commands to replace the DESC command.

The difference between info and info+ is that info+ gives you some statistical information about the table and the attributes in the table. This is illustrated in the following examples.

Example using 'info'

Sqlcl 2

Example using 'info+'

Sqlcl 3

CTAS & DDL

If you want to get the DDL script to create a copy of a table you have two options open to you. The first of these is the DDL command. This creates a DDL statement based on the metadata for the table, just like in the following

Sqlcl 4

An alternative to this is to use the CTAS command, which gives a slightly different output to the DDL command. With CTAS we also get the CREATE TABLE .. AS SELECT ...

History

In SQL*Plus we had limited scrolling through our previous commands. The same kind of scrolling is available in SQL CL, but we can also see all our previous commands using the 'history' command. The following illustrates how you can list all your previous commands. I'm sure the list is (or will be) limited to a certain number of commands, otherwise it would become a very, very long list.

SQL> history

To find out how often each command has been run you can run

SQL> history usage

and to find out how long the query took to run the last time it was run

SQL> history time


There is lots more that I could show, but this post is way, way too long as it is. What I suggest you do is go and download SQL CL (Command Line) and start using it today.

Tuesday, March 3, 2015

Book give away at OUG Ireland

The annual Oracle User Group in Ireland conference is on the 19th March in Croke Park.

I'll be giving 2 presentations, with one each on the Development and Business Analytics tracks. Here are the details of these presentations.

Time | Room | Presentation Title / Topic
14:00-14:45 | InterConnect 681 | Predictive Queries in Oracle 12c
16:00-16:45 | Davin Suite | Running R in the Database using Oracle R Enterprise

I will be giving away a copy of my book to one lucky person :-)

How will this book give away work?

During both of my presentations I will pass around a "hat" for you to put your name or business card into. Then at the end of my last presentation we will draw one name out of the hat.

But you have to be in the room to collect the book. If you are not there then I will draw out another name (and so on) until the winner is in the room.

So by attending both of my presentations you are doubling your chances of winning my book.

(Maybe this is an attempt by me to have a good attendance at my last presentation)

Book Cover

Plus I might have a few other Oracle Press goodies to give away too.

Wednesday, February 25, 2015

US President talks about Data Science

Check out the video of the US President talking about Data Science, and of the first Chief Data Scientist of the USA talking about his mission.

Sunday, February 22, 2015

Oracle ACEs at OUG Ireland 2015

The annual Oracle User Group in Ireland Conference will be on Thursday 19th March. This year the conference will be held in the Croke Park conference centre. This conference centre is only a short taxi ride from Dublin Airport and Dublin City Centre.

If you are planning a hotel stay for the conference I would recommend staying in a hotel in the city centre and getting a taxi to/from the conference venue.

We have a large number of Oracle ACEs presenting at the conference. The following table lists the ACEs, their twitter handle and their website.


Oracle ACE | Type of ACE | Twitter Name | Blog / Web Site
Brendan Tierney | ACE Director | @brendantierney | http://www.oralytics.com
Debra Lilley | ACE Director | @debralilley | http://www.debrasoracle.blogspot.ie/
Jonathan Lewis | ACE Director | @JLOracle | http://jonathanlewis.wordpress.com/
Tim Hall | ACE Director | @oraclebase | http://oracle-base.com/
Alex Nuijten | ACE Director | @alexnuijten | http://nuijten.blogspot.com/
Dhananjay Papde | ACE Associate | |
Stewart Bryson | ACE Director | @stewartbryson | http://www.redpillanalytics.com/
Antony Heljula | ACE | @aheljula | http://www.peakindicators.com/
Gurcan Orhan | ACE Director | @gurcan_orhan | https://gurcanorhan.wordpress.com/
Heli Helskyaho | ACE Director | @HeliFromFinland | https://helifromfinland.wordpress.com/
Marco Gralike | ACE Director | @mgralike | http://www.xmldb.nl
Roel Hartman | ACE Director | @roelh | http://roelhartman.blogspot.com/
Martin Widlake | ACE | @mdwidlake | https://mwidlake.wordpress.com/
Liron Amitzi | ACE | @amitzil | http://www.dbaces.com/
David Kurtz | ACE Director | @davidmkurtz | http://www.go-faster.co.uk/
Marcin Przepiorowski | ACE | @pioro | http://oracleprof.blogspot.com/

Make sure you check out the full agenda for the conference by clicking on the following image. Plus there is a full day session on Friday 20th March with Maria Colgan on the Oracle In-Memory option.

Ougire15 hp cfp v2

Friday, February 13, 2015

My OTN 2 Minute Tech Tip: Predictive Queries

A few days ago I recorded a 2-minute tech tip with Bob Rhubart of OTN.

My topic was on Predictive Queries which are a new feature in the Oracle 12c Database.

The challenge was to talk about the topic within 2 minutes. That is a lot harder than you think. Believe me.

Check out the video on Bob's OTN 2-Minute Tech Tip channel or click on the link below.

It was fun doing this and hopefully I get a chance to do another video with Bob.

Here is a screen capture of when things were being recorded.

OTN tech tip

Friday, January 30, 2015

Evaluating Classification Models in ODM (Part 2)

In a previous blog post I talked about and showed some of the typical statistical methods to evaluate the classification models that you develop. Click to see this (first) blog post.

In this blog post I want to show you how you can go about evaluating your classification models that you develop using Oracle Data Miner (part of SQL Developer).

What I'm not going to show you here is how to develop classification models using Oracle Data Mining :-( I've had several blog posts over the years on this topic. So you can go and search for those posts, or alternatively this topic is covered in a lot more detail in my Oracle Data Miner book :-)

After you have developed your ODM models in Oracle Data Miner you have 2 levels of detail available to you. The first of these is the Compare Test Results. You can find this by right clicking on the Classification node of your ODM Workflow, as shown below.

Viewing the Test Results of all ODM Models

When you select the Compare Test Results a new (worksheet) tab will open. This will display summary statistics, and graphics for the summary statistics, for each Oracle Data Mining model created. In the following image an ODM model was created for each In-Database Classification algorithm in the Oracle Database.

Blog odm test results 2

Here we get to see 2 of the statistical measures that I talked about in my previous blog post, the (average) Accuracy and the Overall Accuracy. We can look at and examine this in a bit more detail in a minute. A new measure that I haven't mentioned before is the Predictive Confidence.

The Predictive Confidence measure provides an estimate of the overall goodness of the model. Predictive Confidence is a number between 0 and 1. Data Miner displays Predictive Confidence as a percent.

  • If Predictive Confidence=0, then it indicates that the predictions of the model are no better than the predictions made by using the naive model.
  • If Predictive Confidence=1, then it indicates that the predictions are perfect.
  • If Predictive Confidence=0.5, then it indicates that the model has cut the error of a naive model by 50%.

So the higher the value for Predictive Confidence the better the model. Particularly when it is higher than 50%.
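To make the three cases above concrete, here is a minimal R sketch. It assumes the commonly documented formulation, Predictive Confidence = max(0, 1 - model error / naive model error). The function name and the error values are made up for illustration and are not produced by Oracle Data Miner.

```r
# Sketch only: assumes Predictive Confidence = max(0, 1 - model_error / naive_error).
# 'predictive_confidence' is a hypothetical helper, not an ODM function.
predictive_confidence <- function(model_error, naive_error) {
  stopifnot(naive_error > 0)
  max(0, 1 - model_error / naive_error)
}

naive_error <- 0.40   # made-up error rate of the naive model

pc_same    <- predictive_confidence(0.40, naive_error)  # no better than the naive model
pc_half    <- predictive_confidence(0.20, naive_error)  # error of the naive model cut by half
pc_perfect <- predictive_confidence(0.00, naive_error)  # perfect predictions

print(c(same = pc_same, half = pc_half, perfect = pc_perfect))
```

A model that does worse than the naive model is floored at 0 in this sketch, which is why the measure stays between 0 and 1.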

After evaluating these summary statistical measures you will want to drill down to see the lower level statistical measures. For example, you will want to see the confusion matrix and the corresponding statistical measures. To view the confusion matrix all you need to do is click on the Performance Matrix tab. Before you can really start evaluating the models you will need to click on the Display drop down and select 'Show Detail' from the drop down list. Another thing you will need to do is click/check the 'Show totals and codes' check box on the lower part of the screen. This will give you some of the statistical measures that I outlined in my previous blog post.

Blog odm test results 3

When you examine the statistical measures displayed on the screen you will notice that some of the statistical measures I outlined in my previous blog post are missing. Some of these missing measures are ones that you will want to consider and use as part of the evaluation of your ODM models.

So how do you find out what these missing statistical measures are? Well, ODM does not display these, so the only real option open to you is to go and calculate them yourself :-( This is not ideal, but these are relatively easy to calculate and you can do this on a piece of paper, or you can open your spreadsheet software and let it calculate them for you (once you have defined the formula for each). Here is an example of the completed/extended confusion matrix based on the results from the CLAS_SVM_1_59 model shown in the above image.

Blog odm test results 4
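If you would rather not use a spreadsheet, the same kind of extended confusion matrix can be sketched in a few lines of R. The counts below are invented for illustration and are not the CLAS_SVM_1_59 results. The layout (rows = actual, columns = predicted) and the choice of "yes" as the positive class are my assumptions too.

```r
# Invented counts for illustration; rows = actual class, columns = predicted class.
cm <- matrix(c(50, 10,
                5, 35),
             nrow = 2, byrow = TRUE,
             dimnames = list(actual    = c("yes", "no"),
                             predicted = c("yes", "no")))

tp <- cm["yes", "yes"]; fn <- cm["yes", "no"]
fp <- cm["no",  "yes"]; tn <- cm["no",  "no"]

extended <- c(
  sensitivity    = tp / (tp + fn),       # true positive rate (recall)
  specificity    = tn / (tn + fp),       # true negative rate
  precision      = tp / (tp + fp),       # positive predictive value
  neg_pred_value = tn / (tn + fn),       # negative predictive value
  accuracy       = (tp + tn) / sum(cm)   # overall accuracy
)

print(addmargins(cm))        # the matrix with row and column totals added
print(round(extended, 3))    # the measures ODM leaves you to work out
```

To use it on your own model, replace the four counts with the values shown on the Performance Matrix tab.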

In my next blog post I will look at how you can evaluate a classification model that was developed using the in-database Oracle Data Mining algorithms (Oracle Data Miner GUI was not used). The evaluation criteria that I will show will be based on the statistical methods that I highlighted in my first blog post on this topic.

Tuesday, January 20, 2015

Evaluating Classification Results

When you are working on building classification models you will need some way of measuring the effectiveness of each model that you build. This measurement/evaluation is performed during the model build process.

Typically the model build process consists of 2 steps (I'm assuming all data preparation etc. has been completed):

  • Build the model: During this step you will feed a portion of your data set into the data mining algorithm. Typically this will be a subset consisting of 60% to 70% of the data. This data is used by the data mining algorithm to build the model.
  • Test the model: After the model has been built you will need to test the model to see how efficient it is at making the predictions. This is where we use the data that was not used to build the model. For this data we already know the outcome. So after we have applied the model to this data subset we can measure the predicted values against the actual values.

Most of the data mining tools will automate these two steps, specifically the splitting of the data into the build and test data sets. But if you are using a language like R, then you will need to perform these steps manually.
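As a rough idea of what the manual split looks like, here is a minimal base R sketch of a 70/30 build/test split. The built-in iris data set is used purely as a stand-in for your own data, and the 70% figure and the seed are arbitrary choices.

```r
# Sketch of a manual 70/30 build/test split in base R.
# 'iris' is only a stand-in data set; swap in your own data frame.
set.seed(42)                                   # make the random split repeatable
n         <- nrow(iris)
build_idx <- sample(seq_len(n), size = round(0.7 * n))

build_data <- iris[build_idx, ]                # used to build the model
test_data  <- iris[-build_idx, ]               # held back to test the model

cat("build rows:", nrow(build_data), " test rows:", nrow(test_data), "\n")
```

Because the test rows are simply the ones not sampled for the build set, no record ends up in both.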

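The most common way of collating the test results is to use the Confusion Matrix. This allows us to lay out the correct predictions and the incorrect predictions, and to perform a number of other statistical measurements. It is built from four values:

  • True Positives: records predicted as positive that actually are positive
  • True Negatives: records predicted as negative that actually are negative
  • False Positives: records predicted as positive that actually are negative
  • False Negatives: records predicted as negative that actually are positive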

The last two of the above values are also commonly referred to in statistics as Type 1 (false positive) and Type 2 (false negative) errors.

Depending on your project you will concentrate on a combination of the true and false values of either the Positives or the Negatives.

For example, in Medical Diagnostics for cancer, you will be looking to keep the False Negatives to a minimum. This is where you have predicted that someone does not have cancer, but they actually do. The consequence of this is that the person is not brought back for additional testing and we all know what will happen. On the other hand it is OK to have a high number of False Positives in this case. In this scenario you bring the person back for additional tests and discover that they are all clear :-)

Precision = How many of the selected items are relevant? (as a percentage)

Recall = How many of the relevant items are selected? (as a percentage)

Accuracy = How many did we correctly predict? (as a percentage)

The following table illustrates these measurements and tests.

Confusion Matrix
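Here is a small R sketch of the three measures, worked out from actual and predicted labels. The ten test records are made up, and treating "yes" as the positive class is my own assumption for the example.

```r
# Made-up actual and predicted labels for ten test records.
lv        <- c("yes", "no")
actual    <- factor(c("yes", "yes", "yes", "yes", "no", "no", "no", "no", "no", "no"), levels = lv)
predicted <- factor(c("yes", "yes", "yes", "no",  "yes", "no", "no", "no", "no", "no"), levels = lv)

cm <- table(actual, predicted)   # the confusion matrix

tp <- cm["yes", "yes"]   # predicted yes, actually yes
fn <- cm["yes", "no"]    # predicted no,  actually yes (Type 2 error)
fp <- cm["no",  "yes"]   # predicted yes, actually no  (Type 1 error)
tn <- cm["no",  "no"]    # predicted no,  actually no

precision <- tp / (tp + fp)       # how many of the selected items are relevant?
recall    <- tp / (tp + fn)       # how many of the relevant items are selected?
accuracy  <- (tp + tn) / sum(cm)  # how many did we correctly predict?

print(cm)
print(c(precision = precision, recall = recall, accuracy = accuracy))
```

With these labels there are 3 true positives, 1 false negative, 1 false positive and 5 true negatives, so precision and recall are both 0.75 and accuracy is 0.8.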

There are lots of other statistical tests that can be performed on your results. Everyone will have their own preferences. What I have highlighted here are the main statistical tests for you to look at.

You cannot use just one or a few of the statistical tests to decide which data mining model works best for your data. It is the combination of these statistical tests, your understanding of the data and your understanding of the business project that needs to be considered.

In my next 2 blog posts I will show you how you can perform these tests on the results generated by the Oracle Data Miner tool and then on the Oracle Data Miner models produced using PL/SQL.

Friday, January 16, 2015

Pulling Large Database tables in R

As the volume of the data in your tables grows, particularly in the big data world, you may run into some memory issues or package restrictions with pulling down the tables to your R environment.

Some of the R packages and drivers have some recommended numbers or limits for the number of records that can be fetched.

Caveat: My laptop is a Mac and at this point in time the ROracle package is unavailable for a Mac. It is available for Windows, Solaris and AIX.

In the following example I'm looking at downloading a table with 300K records from an Oracle Database. I've already setup my DB connection using the Oracle JDBC driver. But when I run the following command I get an error.

> res<-dbSendQuery(jdbcConnection, "select * from my_large_table")

> dbFetch(res)

Error in .jcall(rp, "I", "fetch", stride) :

    java.lang.OutOfMemoryError: Java heap space

I also get a similar error if I run the following command.

> train_data <- dbReadTable(jdbcConnection, "MY_LARGE_TABLE")

How can you pull down a large table in R so that you do not run into memory restrictions or limits on the number of records?

One way to do this is to loop through the data, pull the records down in chunks (a certain fetch size), put these into a list, and then merge them all together into a data frame. The following code illustrates how to do this.

> res <- dbSendQuery(jdbcConnection, "select * from my_large_table")

> result <- list()   # holds one data frame per chunk

> i <- 1

> result[[i]] <- dbFetch(res, n = 1000)   # first chunk

> while (nrow(chunk <- dbFetch(res, n = 1000)) > 0) {   # keep fetching until no rows are left

+     i <- i + 1

+     result[[i]] <- chunk

+ }

> dbClearResult(res)   # release the result set

> train_data <- do.call(rbind, result)   # merge the chunks into one data frame

The above code runs surprisingly quickly, generates no errors, and I now have all the data I need in my R environment.

The fetch size in the above example is set to 1000. This is a bit small really and is only set to that for illustration purposes here. You will need to play with this size to find out what size works best for your environment.
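If you want to experiment with different fetch sizes, the loop can be wrapped in a small function. The sketch below uses a stand-in "cursor" over an in-memory data frame so that it runs without a database. With a real connection you would pass something like function(n) dbFetch(res, n = n) instead. All of the names here are hypothetical helpers of mine, not part of any package.

```r
# Hypothetical helper: keeps calling fetch_chunk(n) until it returns zero rows,
# then binds all the chunks into one data frame (NULL if nothing was fetched).
fetch_all_in_chunks <- function(fetch_chunk, n = 1000) {
  chunks <- list()
  repeat {
    chunk <- fetch_chunk(n)
    if (nrow(chunk) == 0) break
    chunks[[length(chunks) + 1]] <- chunk
  }
  do.call(rbind, chunks)
}

# Stand-in for a database cursor: hands out the rows of a data frame n at a time.
make_fake_cursor <- function(df) {
  pos <- 0
  function(n) {
    if (pos >= nrow(df)) return(df[0, , drop = FALSE])   # nothing left to fetch
    idx <- (pos + 1):min(pos + n, nrow(df))
    pos <<- max(idx)                                     # remember where we got to
    df[idx, , drop = FALSE]
  }
}

big_table  <- data.frame(id = 1:2500, value = (1:2500) * 2)
train_data <- fetch_all_in_chunks(make_fake_cursor(big_table), n = 1000)
cat("rows fetched:", nrow(train_data), "\n")
```

Changing the n argument is then all you need to do to try a different fetch size.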

As with all programming languages, there are many different ways of doing the same thing in R.

Saturday, January 10, 2015

Creating a Word Cloud of Oracle's OAA webpages in R

The following is not something new but something that I have put together this evening, and I mainly make it available as a note to myself of what I did. If you find it useful or interesting then you are more than welcome to use and share it. You will also find lots of similar solutions on the web.

This evening I was playing around with the Text Mining (tm) package in R. So I decided to create a Word Cloud of the Advanced Analytics webpages on Oracle.com. These webpages contain the Overview webpage for the Advanced Analytics option, the Oracle Data Mining webpages and the Oracle R Enterprise webpages.

I've broken the R code into a number of sections.

1. Setup

The first thing that you need to do is to install five R packages: "tm", "wordcloud", "RCurl", "XML" and "SnowballC". The first two of these packages are needed for the main part of the text processing and for generating the word cloud. "RCurl" and "XML" are needed by the function "htmlToText", and "SnowballC" is only needed if you apply stemming. You can download the htmlToText function from GitHub.

install.packages (c ( "tm", "wordcloud", "RCurl", "XML", "SnowballC")) # install the required packages

library (tm)

library (wordcloud)

library (SnowballC)

# load htmlToText

source("/Users/brendan.tierney/htmltotext.R")

2. Read in the Oracle Advanced Analytics webpages using the htmlToText function

data1 <- htmlToText("http://www.oracle.com/technetwork/database/options/advanced-analytics/overview/index.html")

data2 <- htmlToText("http://www.oracle.com/technetwork/database/options/advanced-analytics/odm/index.html")

data3 <- htmlToText("http://www.oracle.com/technetwork/database/database-technologies/r/r-technologies/overview/index.html")

data4 <- htmlToText("http://www.oracle.com/technetwork/database/database-technologies/r/r-enterprise/overview/index.html")

You will need to combine each of these webpages into one for processing in later steps.

data <- c(data1, data2)

data <- c(data, data3)

data <- c(data, data4)

3. Convert into a Corpus and perform Data Cleaning & Transformations

First we convert our web documents into a Corpus.

txt_corpus <- Corpus (VectorSource (data)) # create a corpus

We can use the summary function to get some of the details of the Corpus. We can see that we have 4 documents in the corpus.

> summary(txt_corpus)

A corpus with 4 text documents

The metadata consists of 2 tag-value pairs and a data frame

Available tags are:

    create_date creator

Available variables in the data frame are:

    MetaID

Remove the White Space in these documents

   txt_corpus <- tm_map (txt_corpus, stripWhitespace) # remove white space

Remove the Punctuation from the documents

   txt_corpus <- tm_map (txt_corpus, removePunctuation) # remove punctuation

Remove numbers from the documents

   txt_corpus <- tm_map (txt_corpus, removeNumbers) # remove numbers

Remove the typical list of Stop Words

   txt_corpus <- tm_map (txt_corpus, removeWords, stopwords("english")) # remove stop words (like 'as', 'the', etc.)

Apply stemming to the documents

If needed you can also apply stemming to your data. I decided not to do this as it seemed to truncate some of the words in the word cloud.

  # txt_corpus <- tm_map (txt_corpus, stemDocument)

If you do want to perform stemming then just remove the # symbol.

Remove any additional words (you could add other words to this list)

   txt_corpus <- tm_map (txt_corpus, removeWords, c("work", "use", "java", "new", "support"))

If you want to have a look at the output of each of the above commands you can use the inspect function.

   inspect(txt_corpus)

4. Convert into a Term Document Matrix and Sort

   tdm <- TermDocumentMatrix(txt_corpus) # terms in rows

   matrix_c <- as.matrix (tdm)

   freq <- sort (rowSums (matrix_c)) # frequency data


   freq #to view the words and their frequencies
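To see what this step is doing without the tm package, here is a base R sketch that builds a tiny term document matrix by hand for two made-up sentences and then sorts the term totals. It is only an illustration. The TermDocumentMatrix function in tm has its own tokenising rules and storage format, so its output will not match this exactly.

```r
# Two made-up 'documents' standing in for the cleaned webpages.
docs <- c(doc1 = "oracle data mining in the database",
          doc2 = "oracle r enterprise runs r in the database")

tokens <- strsplit(docs, " ", fixed = TRUE)   # split each document into words
terms  <- sort(unique(unlist(tokens)))        # the distinct terms

# Rows are terms, columns are documents, cells are the counts.
tdm <- vapply(tokens,
              function(words) tabulate(match(words, terms), nbins = length(terms)),
              integer(length(terms)))
rownames(tdm) <- terms

freq <- sort(rowSums(tdm))   # total frequency of each term across all documents
print(tdm)
print(freq)
```

The sorted totals are what gets passed to the word cloud function in the next step, where the frequency controls the size of each word.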

5. Generate the Word Cloud

   tmdata <- data.frame (words=names(freq), freq)

   wordcloud (tmdata$words, tmdata$freq, max.words=100, min.freq=3, scale=c(7,.5), random.order=FALSE, colors=brewer.pal(8, "Dark2"))

and the Word Cloud will look something like the following. Every time you generate the Word Cloud you will get a slightly different layout of the words.

OAA Word Cloud