Saturday, January 10, 2015

Creating a Word Cloud of Oracle's OAA webpages in R

The following is not something new, but something that I put together this evening, and I am mainly making it available as a note to myself on what I did. If you find it useful or interesting then you are more than welcome to use and share it. You will also find lots of similar solutions on the web.

This evening I was playing around with the Text Mining (tm) package in R, so I decided to create a Word Cloud of the Advanced Analytics webpages on Oracle.com. These webpages consist of the overview webpage for Oracle Advanced Analytics, the Oracle Data Mining webpages and the Oracle R Enterprise webpages.

I've broken the R code into a number of sections.

1. Setup

The first thing that you need to do is to install five R packages: "tm", "wordcloud", "SnowballC", "RCurl" and "XML". The first three of these packages are needed for the main part of the text processing and for generating the word cloud (SnowballC is only used if you apply stemming). The last two are needed by the "htmlToText" function. You can download the htmlToText function from GitHub.

install.packages (c ("tm", "wordcloud", "RCurl", "XML", "SnowballC")) # install the required packages

library (tm)

library (wordcloud)

library (SnowballC)

# load htmlToText

source("/Users/brendan.tierney/htmltotext.R")

2. Read in the Oracle Advanced Analytics webpages using the htmlToText function

data1 <- htmlToText("http://www.oracle.com/technetwork/database/options/advanced-analytics/overview/index.html")

data2 <- htmlToText("http://www.oracle.com/technetwork/database/options/advanced-analytics/odm/index.html")

data3 <- htmlToText("http://www.oracle.com/technetwork/database/database-technologies/r/r-technologies/overview/index.html")

data4 <- htmlToText("http://www.oracle.com/technetwork/database/database-technologies/r/r-enterprise/overview/index.html")

You will need to combine each of these webpages into one for processing in later steps.

data <- c(data1, data2)

data <- c(data, data3)

data <- c(data, data4)
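The three calls above could equally be written as a single call, if you prefer.

data <- c(data1, data2, data3, data4)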

3. Convert into a Corpus and perform Data Cleaning & Transformations

The following converts our web documents into a Corpus.

txt_corpus <- Corpus (VectorSource (data)) # create a corpus

We can use the summary function to get some of the details of the Corpus. We can see that we have 4 documents in the corpus.

> summary(txt_corpus)

A corpus with 4 text documents

The metadata consists of 2 tag-value pairs and a data frame

Available tags are:

    create_date creator

Available variables in the data frame are:

    MetaID

Remove the extra white space in these documents

   tm_map <- tm_map (txt_corpus, stripWhitespace) # remove white space

Remove the Punctuations from the documents

   tm_map <- tm_map (tm_map, removePunctuation) # remove punctuations

Remove numbers from the documents

   tm_map <- tm_map (tm_map, removeNumbers) # to remove numbers

Remove the typical list of Stop Words

   tm_map <- tm_map (tm_map, removeWords, stopwords("english")) # to remove stop words (like 'as', 'the', etc.)

Apply stemming to the documents

If needed you can also apply stemming to your data. I decided not to perform this, as it seemed to truncate some of the words in the word cloud.

  # tm_map <- tm_map (tm_map, stemDocument)

If you do want to perform stemming then just remove the # symbol.

Remove any additional words (you could add other words to this list)

   tm_map <- tm_map (tm_map, removeWords, c("work", "use", "java", "new", "support"))

If you want to have a look at the output of each of the above commands you can use the inspect function.

   inspect(tm_map)

4. Convert into a Term Document Matrix and Sort

   Matrix <- TermDocumentMatrix(tm_map) # terms in rows

   matrix_c <- as.matrix (Matrix)

   freq <- sort (rowSums (matrix_c)) # frequency data


   freq #to view the words and their frequencies
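If you just want to see the most frequent terms, remember that freq is sorted in ascending order, so something like the following should work.

   tail (freq, 20) # view the 20 most frequent terms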

5. Generate the Word Cloud

   tmdata <- data.frame (words=names(freq), freq)

   wordcloud (tmdata$words, tmdata$freq, max.words=100, min.freq=3, scale=c(7,.5), random.order=FALSE, colors=brewer.pal(8, "Dark2"))

and the Word Cloud will look something like the following. Every time you generate the Word Cloud you will get a slightly different layout of the words.

OAA Word Cloud
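If you would like to get the same layout each time, one option is to set the random seed before calling the wordcloud function. A small sketch (the seed value is arbitrary):

   set.seed (4321) # fix the random seed so the word placement is reproducible

   wordcloud (tmdata$words, tmdata$freq, max.words=100, min.freq=3, scale=c(7,.5), random.order=FALSE, colors=brewer.pal(8, "Dark2"))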

6 comments:

  1. Hi Brendan,

    I think you forgot to mention the conversion to lowercase (although it is apparent from the wordcloud that you have actually performed it). This could be done with the tm_map operation, as follows:

    tm_map <- tm_map (tm_map, tolower) # convert to lowercase

    which works, despite the fact that the argument 'tolower' is not included in the list of available transformations one gets with the getTransformations() command of the tm package...
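    (A note for anyone on a newer version of tm: tm_map may expect base R functions such as tolower to be wrapped with content_transformer(), so the lowercase step could also be written as follows.)

    tm_map <- tm_map (tm_map, content_transformer(tolower)) # convert to lowercase, wrapped for newer tm versions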

    Also, for reproducibility, we should include the origin of the handy htmlToText function by Tony Breyal (it is funny how many people are actually using it, and I wonder why it has not already been included in a package):

    https://github.com/tonybreyal/Blog-Reference-Functions/blob/master/R/htmlToText/htmlToText.R

    Thanks for your blog, and also for your wonderful book... :-)

    Regards

    Christos (aka @desertnaut)

  2. Hi Christos

    Thanks for the comment. I had an updated version of the post with the link to GitHub for the htmlToText function, but it looks like I forgot to press the send button :-( Yes, it is a great function and many thanks to Tony Breyal for sharing it.

    I actually deliberately left the tolower transformation out, as it is not needed. Yes it is not needed (really).

    Yes lots of other websites have the tolower transformation included in their sample code. So I wonder if they really inspected the before and after picture of their data for each command.

    If you examine the data using the inspect command you can view the data before and after each transformation.

    After applying the transformations, I still have a mixture of case in my text.

    But when you run the following command it converts the text/documents to lower case
    Matrix <- TermDocumentMatrix(tm_map) # terms in rows

    To see this run
    inspect(Matrix)

    When you scroll through the output there are no upper case letters. The TermDocumentMatrix function seems to perform a tolower automatically, so the tm_map(tm_map, tolower) transformation is not needed. But I should emphasise that it is not needed in this case/example (and for me). Depending on what text mining you are performing, you may need to apply the tolower transformation.
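    A quick way to check this, using the Matrix object created above, could be something like the following, which should return FALSE once no upper case letters remain in the terms.

    any (grepl ("[A-Z]", Terms(Matrix))) # FALSE if all terms are already lower case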


  3. Hi Brendan,

    Thanks for the response. You are right re the TermDocumentMatrix behaviour; I was surprised at first when I reproduced the result, but then, checking the documentation, I remembered why: the function inherits its settings from the termFreq() function, which by default includes tolower=TRUE.

    These settings allow also for a more compact operation: instead of all the separate tm_map() functions, after you remove the white space, you can simply use

    Matrix <- TermDocumentMatrix(tm_map, control=list(removePunctuation=TRUE,
    removeNumbers=TRUE,
    stopwords=TRUE,
    stemming=TRUE))

    (unlike 'tolower', the default value for all these arguments is FALSE)

    As always, it is a matter of taste and style...

    When I started working with the tm package, some years ago, I followed the "pipeline" as exposed in the package authors' paper in the Journal of Statistical Software (2008), where they explicitly use the tolower operation, hence I kept it...

  4. You are correct, and yes, it is a matter of taste and style. So, with thanks to your contributions, we can now show the longer way (shown in my original post) and the shorter version (given below) based on your suggestion.

    data1 <- htmlToText("http://www.oracle.com/technetwork/database/options/advanced-analytics/overview/index.html")
    data2 <- htmlToText("http://www.oracle.com/technetwork/database/options/advanced-analytics/odm/index.html")
    data3 <- htmlToText("http://www.oracle.com/technetwork/database/database-technologies/r/r-technologies/overview/index.html")
    data4 <- htmlToText("http://www.oracle.com/technetwork/database/database-technologies/r/r-enterprise/overview/index.html")

    data <- c(data1, data2)
    data <- c(data, data3)
    data <- c(data, data4)

    txt_corpus <- Corpus (VectorSource (data)) # create a corpus

    tm_map <- tm_map (txt_corpus, stripWhitespace) # remove white space
    tm_map <- tm_map (tm_map, removeWords, c("work", "use", "java", "new", "support"))

    Matrix <- TermDocumentMatrix(tm_map, control=list(removePunctuation=TRUE,
    removeNumbers=FALSE,
    stopwords=TRUE,
    stemming=FALSE))
    matrix_c <- as.matrix (Matrix)
    freq <- sort (rowSums (matrix_c)) # frequency data

    tmdata <- data.frame (words=names(freq), freq)
    wordcloud (tmdata$words, tmdata$freq, max.words=100, min.freq=3, scale=c(7,.5), random.order=FALSE, colors=brewer.pal(8, "Dark2"))

  5. That's great Brendan. You should only restore the 'removeNumbers' and 'stemming' arguments to TRUE (typos, I guess)...

    Thanks

    PS Now that I am rethinking the whole thing, I remember why it is a good (and suggested) practice to convert to lowercase *early* in the process: possible stopwords at the beginning of sentences ('And', 'We', 'About'...), as well as terms like 'Java' in your case, will *not* be removed otherwise, since the stopwords list contains only all-lowercase words. In fact, this is exactly the reason why, while you have tried to remove 'java', it is indeed present in your wordcloud (it is 'Java' in the text)! ;-)
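    So, if you do want the 'java' removal to work, one option (a sketch, using the same object names as above) is to lowercase the corpus before removing the words:

    tm_map <- tm_map (tm_map, content_transformer(tolower)) # lowercase first, so 'Java', 'And', etc. will match

    tm_map <- tm_map (tm_map, removeWords, c(stopwords("english"), "work", "use", "java", "new", "support"))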

  6. [I think my other reply got lost as I wasn't signed in...]

    I had forgotten about the htmlToText function and at some point I think I improved it by moving over to the httr package and removing any lines of converted text with fewer than 20 characters (as that stuff is usually just navigation junk).

    Anyway, I don't have R anymore but I'm sure the function was vectorised, so you could save a bit of typing and just do the following (I think):

    urls <- c("http://www.oracle.com/technetwork/database/options/advanced-analytics/overview/index.html",
    "http://www.oracle.com/technetwork/database/options/advanced-analytics/odm/index.html",
    "http://www.oracle.com/technetwork/database/database-technologies/r/r-technologies/overview/index.html",
    "http://www.oracle.com/technetwork/database/database-technologies/r/r-enterprise/overview/index.html")

    data <- htmlToText(urls)
