Monday, August 8, 2016

Oracle Text, Oracle R Enterprise and Oracle Data Mining - Part 2

This is the second blog post of a series on using Oracle Text, Oracle R Enterprise and Oracle Data Mining. Check out the first blog post of the series, as the data used in this blog post was extracted, processed and stored in a databases table.

In this blog post I will show you how you use Oracle R Enterprise and the embedded R execution features of ORE to use the text from the webpages and to create a word cloud. This is a useful tool to be able to see visually what words can stand out most on your webpage and if the correct message is being put across to your customers.

Prerequisites: You will need to load the following R packages into your R environment 'tm', 'word cloud' 'SnowballC'. These are required to process the following R code segments.

install.packages (c( "tm", "wordcloud", "SnowballC"))
library (tm)
library (wordcloud)
library (SnowballC)

Select data from table and prepare: We need to select the data from the table in our schema and to merge it into one variable.

local_data <- ore.pull(MY_DOCUMENTS)

tm_data <-""
for(i in 1:nrow(local_data)) {
  tm_data <- paste(tm_data, local_data[i,]$DOC_TEXT, sep=" ")
}
tm_data

Create function to perform Text Mining: In my previous blog post on creating a word cloud I gave the R code. In order to allow for this R code to be run on the database server (using the embedded R execution of ORE) we need to package this text mining R code up into a ORE user defined R script. This is stored in the database.

ore.scriptDrop("prepare_tm_data")
ore.scriptCreate("prepare_tm_data", function (tm_data) { 
  library(tm)
  library(SnowballC)
  library(wordcloud)
  
  txt_corpus <- Corpus (VectorSource (tm_data))
  
  # data clean up
  tm_map <- tm_map (txt_corpus, stripWhitespace) # remove white space
  tm_map <- tm_map (tm_map, removePunctuation) # remove punctuations
  tm_map <- tm_map (tm_map, removeNumbers) # to remove numbers
  tm_map <- tm_map (tm_map, removeWords, stopwords("english")) # to remove stop words
  tm_map <- tm_map (tm_map, removeWords, c("work", "use", "java", "new", "support"))
  
  # prepare matrix of words and frequency counts
  Matrix <- TermDocumentMatrix(tm_map) # terms in rows
  matrix_c <- as.matrix (Matrix)
  freq <- sort (rowSums (matrix_c)) # frequency data
  
  res <- data.frame(words=names(freq), freq)
  wordcloud (res$words, res$freq, max.words=100, min.freq=3, scale=c(7,.5), random.order=FALSE, colors=brewer.pal(8, "Dark2"))
} ) 

Before we can run this user define R script, we need to ensure that we have the 'tm', 'SnowballC' and 'wordcloud' R packages installed on the Oracle Database server. On the Oracle Database server you need to rune ORE.

> library(ORE)

Then run the following command to install these R packages

> install_packages(c('tm','wordcloud', 'SnowballC'))

Run the function on the DB Server: You are now ready to run the function. In an earlier step we had gathered the data. Now we can pass this data to the in-database R script.

> res <- ore.doEval(FUN.NAME="prepare_tm_data", tm_data=tm_data)

The ore.doEval function is a general purpose ORE function. In this case we pass it two parameters. The first parameter is the neame of the user defined R script stored in the database, and the second parameter is the data. The function returns and ORE object that contains the word cloud graphic.

Display the results: You can very easily display the results.

> res

This gives us the following graphic.

NewImage

In my next blog post, of this series, I will show you how you can use the function created above and some other bits and pieces, using some other features of ORE and also in SQL.

4 comments:

  1. Hi Brendan:

    Are you using Oracle BigDataLite VM for this whole exercise? I am very interested in your thread and like to give it a try myself ... I am having Oracle BigDateLite (BDL) VM 4.7 ... it has pre-installed with Oracle R Distribution (ORD) 3.2. however, when I tried to install 'tm' package, I got an error.

    Given you are exploring "Oracle" technology (Oracle Texting, ORE, Oracle DM), I am thinking BDL VM should be the right sandbox for me to try it out.




    > install.packages("tm")
    Installing package into ‘/u01/app/oracle/product/12.1.0.2/dbhome_1/R/library’
    (as ‘lib’ is unspecified)
    --- Please select a CRAN mirror for use in this session ---
    Warning: dependency ‘slam’ is not available
    trying URL 'http://archive.linux.duke.edu/cran/src/contrib/tm_0.7-1.tar.gz'
    Content type 'application/x-gzip' length 526702 bytes (514 KB)
    ==================================================
    downloaded 514 KB

    ERROR: dependency ‘slam’ is not available for package ‘tm’
    * removing ‘/u01/app/oracle/product/12.1.0.2/dbhome_1/R/library/tm’

    The downloaded source packages are in
    ‘/tmp/Rtmp6Yvikj/downloaded_packages’
    Warning message:
    In install.packages("tm") :
    installation of package ‘tm’ had non-zero exit status


    Can you share your thoughts on this?

    Thanks!

    Kevin

    ReplyDelete
  2. Packages on R cran keep on changing and some packages are no longer available, but may be available elsewhere. There are also dependencies between each version of R and what versions of packages are supported. Based on the information you have provided it looks like there is an issue relating to 'slam'. To fix this you need to install 'slam' separately. For example use the following and then install tm
    sudo apt-get install r-cran-slam

    ReplyDelete
  3. also check out this blog post that has details of how to overcome this issue
    https://blogs.oracle.com/r/r-package-installation-with-oracle-r-enterprise

    ReplyDelete
    Replies
    1. and you have an internet connection setup and working for the VM

      Delete