Wednesday, August 17, 2016

Oracle Text, Oracle R Enterprise and Oracle Data Mining - Part 3

This is the third blog post of a series on using Oracle Text, Oracle R Enterprise and Oracle Data Mining. Check out the first and second blog posts of the series, as the data used in this blog post was extracted, processed and stored in a database table in those posts.

This blog post is divided into 3 parts. The first part builds on what was covered in the previous blog post and expands the in-database ORE R script to include more data processing. The second part looks at how you can use SQL to call our in-database ORE R script. The third part shows how to include it in our custom applications, for example using APEX.

Part 1 - Expanding our in-database ORE R script for Text Mining

In my previous blog post we created an ORE user defined R script that is stored in the database, and this script was used to perform text mining and to create a word cloud. But the data/text to be mined was processed beforehand and passed into this script.

But what if we just wanted to say: here is the table that contains the data, go ahead and process it? To do this we need to expand our user defined R script to include the loop that merges the webpage text into one variable. The following is a new version of our ORE user defined R script.

> ore.scriptCreate("prepare_tm_data_2", function (local_data) { 
  library(tm)
  library(SnowballC)
  library(wordcloud)
  
  tm_data <-""
  for(i in 1:nrow(local_data)) {
    tm_data <- paste(tm_data, local_data[i,]$DOC_TEXT, sep=" ")
  }
    
  txt_corpus <- Corpus (VectorSource (tm_data))
  
  # data clean up (each step replaces the corpus with a cleaned copy)
  txt_corpus <- tm_map (txt_corpus, stripWhitespace) # collapse repeated white space
  txt_corpus <- tm_map (txt_corpus, removePunctuation) # remove punctuation
  txt_corpus <- tm_map (txt_corpus, removeNumbers) # remove numbers
  txt_corpus <- tm_map (txt_corpus, removeWords, stopwords("english")) # remove stop words
  txt_corpus <- tm_map (txt_corpus, removeWords, c("work", "use", "java", "new", "support")) # remove extra unwanted words

  # prepare matrix of words and frequency counts
  tdm <- TermDocumentMatrix(txt_corpus) # terms in rows
  matrix_c <- as.matrix (tdm)
  freq <- sort (rowSums (matrix_c)) # word frequencies, sorted ascending
  
  res <- data.frame(words=names(freq), freq)
  wordcloud (res$words, res$freq, max.words=100, min.freq=3, scale=c(7,.5), random.order=FALSE, colors=brewer.pal(8, "Dark2"))
} ) 
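
If you want to see what the merge loop and the frequency counting in the script are doing without a database, here is a minimal sketch in base R. It is only an illustration and not the script itself: the sample rows and the DOC_ID column are made up, and strsplit and table are used as simple stand-ins for the tm functions, so none of the clean up steps are applied.

```r
# Stand-in for the rows that would be passed to the script.
# The text values and the DOC_ID column are made-up sample data.
local_data <- data.frame(
  DOC_ID   = 1:3,
  DOC_TEXT = c("oracle text mining", "text mining with r", "oracle r enterprise"),
  stringsAsFactors = FALSE
)

# Row-by-row merge, written the same way as in the script.
tm_data <- ""
for (i in 1:nrow(local_data)) {
  tm_data <- paste(tm_data, local_data[i, ]$DOC_TEXT, sep = " ")
}

# The same merge in one call. The loop starts from an empty string,
# so its result carries one extra leading space.
merged <- paste(local_data$DOC_TEXT, collapse = " ")

# Count how often each word appears and sort ascending,
# which mirrors what sort(rowSums(...)) does on the term matrix.
words <- strsplit(merged, " ", fixed = TRUE)[[1]]
freq  <- sort(table(words))
res   <- data.frame(words = names(freq), freq = as.vector(freq))
```

Apart from the leading space, both ways of merging give the same string, and res has one row per distinct word with its count, which is the shape of data the wordcloud call works from.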

To call this R script using embedded R execution we can use the ore.tableApply function. The parameter to our new R script will now be an ORE data frame. This can be a table in the database, or we can create a subset of a table and pass it as the parameter. This means all the data processing occurs on the Oracle Database server. No data is passed to the client and no processing is performed on the client. The only data passed back to the client is the result from the function, and that is the word cloud image.

> res <- ore.tableApply(MY_DOCUMENTS, FUN.NAME="prepare_tm_data_2")
> res

Part 2 - Using SQL to perform R Text Mining

Another way you can call this ORE user defined R function is using SQL. Yes, we can use SQL to call R code and to produce an R graphic. When doing this the R graphic will be returned as a BLOB. That makes it easy to view and to include in your applications, such as APEX.

To call our ORE user defined R function, we can use the rqTableEval SQL function. You only really need to set two of the parameters to this function. The first parameter is a SELECT statement that defines the data set to be passed to the function. This is similar to what I showed above using the ore.tableApply R function, except we have easier control over what records to pass in as the data set. The fourth parameter gives the name of the ORE user defined R script. The third parameter, 'PNG' in the query below, asks for the result to be returned as an image.

select *
from table(rqTableEval( cursor(select * from MY_DOCUMENTS),
                        null,
                        'PNG',
                        'prepare_tm_data_2'));

This is the image that is produced by this SQL statement and viewed in SQL Developer.

NewImage

Part 3 - Adding our R Text Mining to APEX

Adding the SQL to call an ORE user defined script is very simple in APEX. You can create a form or a report based on a query, and this query can be the same query that is given above.

Something that I like to do is to create a view for the ORE SELECT statement. This gives me some flexibility with some potential future modifications. This could be as simple as just changing the name of the script. Also if I discover a new graphic that I want to use, all I need to do is to change the R code in my user defined R script and it will automatically be picked up and displayed in APEX. See the images below.

WARNING: Yes, I do have a slight warning. From ORE 1.4 onwards there is a slightly different security model around the use of user defined R scripts. Instead of going into the details in this blog post, I will have a separate blog post that describes the behaviour and what you need to do to allow APEX to use ORE and to call the user defined R scripts in your schema. So look out for this blog post coming really soon.

NewImage

In this blog post I showed you how to use Oracle R Enterprise and the embedded R execution features of ORE to take the text from the webpages and create a word cloud. This is a useful way to see visually which words stand out most on your webpage and whether the correct message is being put across to your customers.

5 comments:

  1. Thank you for sharing. This is great. And it has inspired me to do something similar.

    If it helps, I think you can optimize your code a bit. Maybe change..

    ore.scriptCreate("prepare_tm_data_2", function (local_data) {
    library(tm)
    library(SnowballC)
    library(wordcloud)

    tm_data <-""
    for(i in 1:nrow(local_data)) {
    tm_data <- paste(tm_data, local_data[i,]$DOC_TEXT, sep=" ")
    }

    txt_corpus <- Corpus (VectorSource (tm_data))

    into

    ore.scriptCreate("prepare_tm_data_2", function (local_data) {
    library(tm)
    library(SnowballC)
    library(wordcloud)

    # tm_data <-""
    # for(i in 1:nrow(local_data)) {
    # tm_data <- paste(tm_data, local_data[i,]$DOC_TEXT, sep=" ")
    # }

    txt_corpus <- Corpus (VectorSource (local_data$DOC_TEXT))


    The idea is that you avoid a row-by-row for loop. In my dataset, it seems to save about 30-50% of the execution time.

    Again, thanks for sharing. Cool stuff. Especially making it work with APEX.

    1. Yes you are correct. Thanks for the comment.
      It is easier and quicker to vectorize the data.

      It is good to hear you found it useful.
      What version of APEX and ORE did you use?

    2. Hi, we are using APEX 5.1.1. We want to make use of Oracle Data Mining and are very new to ODM. Do you suggest any documents or examples? Our plan is to use ORE in future too.

      Thanks,
      Ashwin

    3. I've some examples on my blog of using Oracle Data Mining with APEX, also ORE with APEX, and I've examples in my books too.
