The following is nothing new, just something I put together this evening; I mainly make it available as a note to myself on what I did. If you find it useful or interesting then you are more than welcome to use and share it. You will also find lots of similar solutions on the web.
This evening I was playing around with the Text Mining (tm) package in R, so I decided to create a Word Cloud of the Advanced Analytics webpages on Oracle.com. These pages comprise the Advanced Analytics overview webpage, the Oracle Data Mining webpages and the Oracle R Enterprise webpages.
I've broken the R code into a number of sections.
1. Setup
The first thing that you need to do is to install five R packages: "tm", "wordcloud", "SnowballC", "RCurl" and "XML". The first three of these packages are needed for the main part of the text processing and for generating the word cloud. The last two are needed by the function "htmlToText". You can download the htmlToText function from github.
install.packages (c ("tm", "wordcloud", "RCurl", "XML", "SnowballC")) # install the required packages
library (tm)
library (wordcloud)
library (SnowballC)
# load htmlToText
source("/Users/brendan.tierney/htmltotext.R")
2. Read in the Oracle Advanced Analytics webpages using the htmlToText function
data1 <- htmlToText("http://www.oracle.com/technetwork/database/options/advanced-analytics/overview/index.html")
data2 <- htmlToText("http://www.oracle.com/technetwork/database/options/advanced-analytics/odm/index.html")
data3 <- htmlToText("http://www.oracle.com/technetwork/database/database-technologies/r/r-technologies/overview/index.html")
data4 <- htmlToText("http://www.oracle.com/technetwork/database/database-technologies/r/r-enterprise/overview/index.html")
You will need to combine each of these webpages into one for processing in later steps.
data <- c(data1, data2)
data <- c(data, data3)
data <- c(data, data4)
3. Convert into a Corpus and perform Data Cleaning & Transformations
We now convert our web documents into a Corpus.
txt_corpus <- Corpus (VectorSource (data)) # create a corpus
We can use the summary function to get some of the details of the Corpus. We can see that we have 4 documents in the corpus.
> summary(txt_corpus)
A corpus with 4 text documents
The metadata consists of 2 tag-value pairs and a data frame
Available tags are:
create_date creator
Available variables in the data frame are:
MetaID
Remove the White Space in these documents
tm_map <- tm_map (txt_corpus, stripWhitespace) # remove white space
Remove the Punctuations from the documents
tm_map <- tm_map (tm_map, removePunctuation) # remove punctuations
Remove numbers from the documents
tm_map <- tm_map (tm_map, removeNumbers) # to remove numbers
Remove the typical list of Stop Words
tm_map <- tm_map (tm_map, removeWords, stopwords("english")) # remove stop words (like 'as', 'the', etc.)
Apply stemming to the documents
If needed you can also apply stemming to your data. I decided not to perform this as it seemed to truncate some of the words in the word cloud.
# tm_map <- tm_map (tm_map, stemDocument)
If you do want to perform stemming then just remove the # symbol.
Remove any additional words (you could add other words to this list)
tm_map <- tm_map (tm_map, removeWords, c("work", "use", "java", "new", "support"))
If you want to have a look at the output of each of the above commands you can use the inspect function.
inspect(tm_map)
4. Convert into a Term Document Matrix and Sort
Matrix <- TermDocumentMatrix(tm_map) # terms in rows
matrix_c <- as.matrix (Matrix)
freq <- sort (rowSums (matrix_c)) # frequency data
freq # to view the words and their frequencies
5. Generate the Word Cloud
tmdata <- data.frame (words=names(freq), freq)
wordcloud (tmdata$words, tmdata$freq, max.words=100, min.freq=3, scale=c(7,.5), random.order=FALSE, colors=brewer.pal(8, "Dark2"))
and the Word Cloud will look something like the following. Every time you generate the Word Cloud you will get a slightly different layout of the words.
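As an aside, the layout varies because wordcloud places the words with some randomness. If you want a reproducible layout you can fix the random seed before the call; a minimal sketch with toy frequency data (these words and counts are just stand-ins for the tmdata data frame built above):

```r
library(wordcloud)
library(RColorBrewer)

# Toy frequency data standing in for the tmdata data frame built above
tmdata <- data.frame(words = c("oracle", "data", "mining", "analytics", "cloud"),
                     freq  = c(25, 20, 15, 10, 8))

set.seed(42) # fixing the seed gives the same word placement on every run
wordcloud(tmdata$words, tmdata$freq, min.freq = 1,
          random.order = FALSE, colors = brewer.pal(8, "Dark2"))
```

Re-running the last two lines together will redraw the identical cloud.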
Hi Brendan,
I think you forgot to mention the conversion to lowercase (although it is apparent from the wordcloud that you have actually performed it). This could be done with the tm_map operation, as follows:
tm_map <- tm_map (tm_map, tolower) # convert to lowercase
which works, despite the fact that the argument 'tolower' is not included in the list of available transformations one gets with the getTransformations() command of the tm package...
Also, for reproducibility, we should include the origin of the handy htmlToText function of Tony Breyal (it is funny how many people are actually using it, and why it has not been already included in a package):
https://github.com/tonybreyal/Blog-Reference-Functions/blob/master/R/htmlToText/htmlToText.R
Thanks for your blog, and also for your wonderful book... :-)
Regards
Christos (aka @desertnaut)
Hi Christos
Thanks for the comment. I had an updated version of the post with the link to github for the htmlToText function, but it looks like I forgot to press the send button :-( Yes, it is a great function and many thanks to Tony Breyal for sharing it.
I actually deliberately left the tolower transformation out, as it is not needed (really).
Lots of other websites do have the tolower transformation included in their sample code, so I wonder if they really inspected the before and after picture of their data for each command.
If you examine the data using the inspect command you can view the data before and after each transformation.
After applying the transformations, I still have a mixture of case in my text.
But when you run the following command it converts the text/documents to lower case
Matrix <- TermDocumentMatrix(tm_map) # terms in rows
To see this run
inspect(Matrix)
When you scroll through the output there are no upper case letters. The TermDocumentMatrix function seems to perform tolower automatically, so the tm_map(tm_map, tolower) transformation is not needed. But I should emphasise that it is not needed in this case/example (and for me). Depending on what text mining you are performing you may need to apply the tolower transformation.
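To see this behaviour in isolation, here is a minimal sketch on a toy two-document corpus (not the Oracle pages); the tolower control option is inherited from termFreq():

```r
library(tm)

docs <- Corpus(VectorSource(c("Oracle Data Mining", "ORACLE Advanced Analytics")))

# By default terms are lower-cased, because termFreq() defaults to tolower=TRUE
tdm_default <- TermDocumentMatrix(docs)

# Switch the lower-casing off explicitly to see the difference
tdm_mixed <- TermDocumentMatrix(docs, control = list(tolower = FALSE))

rownames(tdm_default) # all lower case, e.g. "oracle"
rownames(tdm_mixed)   # mixed case preserved, e.g. "Oracle" and "ORACLE"
```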
Hi Brendan,
Thanks for the response. You are right re the TermDocumentMatrix behaviour; I was surprised at first when I reproduced the result, but then, checking the documentation, I remembered: as explained there, the function inherits its settings from the termFreq() function, which by default includes tolower=TRUE
These settings allow also for a more compact operation: instead of all the separate tm_map() functions, after you remove the white space, you can simply use
Matrix <- TermDocumentMatrix(tm_map, control=list(removePunctuation=TRUE,
removeNumbers=TRUE,
stopwords=TRUE,
stemming=TRUE))
(contrary to 'tolower', the default value for all these arguments is FALSE)
As always, it is a matter of taste and style...
When I started working with the tm package, some years ago, I followed the "pipeline" as exposed in the package authors' paper in the Journal of Statistical Software (2008), where they explicitly use the tolower operation, hence I kept it...
You are correct, and yes, it is a matter of taste and style. So, with thanks to your contributions, we can now show the longer way (shown in my original post) and the shorter version (given below) based on your suggestion.
data1 <- htmlToText("http://www.oracle.com/technetwork/database/options/advanced-analytics/overview/index.html")
data2 <- htmlToText("http://www.oracle.com/technetwork/database/options/advanced-analytics/odm/index.html")
data3 <- htmlToText("http://www.oracle.com/technetwork/database/database-technologies/r/r-technologies/overview/index.html")
data4 <- htmlToText("http://www.oracle.com/technetwork/database/database-technologies/r/r-enterprise/overview/index.html")
data <- c(data1, data2)
data <- c(data, data3)
data <- c(data, data4)
txt_corpus <- Corpus (VectorSource (data)) # create a corpus
tm_map <- tm_map (txt_corpus, stripWhitespace) # remove white space
tm_map <- tm_map (tm_map, removeWords, c("work", "use", "java", "new", "support"))
Matrix <- TermDocumentMatrix(tm_map, control=list(removePunctuation=TRUE,
removeNumbers=FALSE,
stopwords=TRUE,
stemming=FALSE))
matrix_c <- as.matrix (Matrix)
freq <- sort (rowSums (matrix_c)) # frequency data
tmdata <- data.frame (words=names(freq), freq)
wordcloud (tmdata$words, tmdata$freq, max.words=100, min.freq=3, scale=c(7,.5), random.order=FALSE, colors=brewer.pal(8, "Dark2"))
That's great Brendan. You just need to restore the 'removeNumbers' and 'stemming' arguments to TRUE (typos, I guess)...
Thanks
PS Now that I am rethinking the whole thing, I remember why it is a good (and suggested) practice to convert to lowercase *early* in the process: possible stopwords at the beginning of sentences ('And', 'We', 'About'...), as well as terms like 'Java' in your case, will *not* be removed otherwise, since the stopwords list contains only all-lowercase words; in fact, this is exactly the reason why, while you have tried to remove 'java', it is indeed present in your wordcloud (it is 'Java' in the text)! ;-)
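Building on that point, one way to lower-case early is to do it immediately after stripping white space; a sketch on a single toy document, using content_transformer() to wrap the base tolower function (which newer versions of tm require for non-tm functions):

```r
library(tm)

docs <- Corpus(VectorSource("And We use Java with Oracle Data Mining"))

docs <- tm_map(docs, stripWhitespace)
# Lower-case early so that 'And', 'We' and 'Java' become matchable;
# content_transformer() wraps a plain R function for use with tm_map()
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removeWords, stopwords("english"))
docs <- tm_map(docs, removeWords, c("java")) # now matches; "Java" would not have
```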
[I think my other reply got lost as I wasn't signed in...]
I had forgotten about the htmlToText function and at some point I think I improved it by moving over to the httr package and removing any lines of converted text with fewer than 20 characters (as that stuff is usually just navigation junk).
Anyway, I don't have R anymore but I'm sure the function was vectorised, so you could save a bit of typing and just do the following (I think):
urls <- c("http://www.oracle.com/technetwork/database/options/advanced-analytics/overview/index.html",
"http://www.oracle.com/technetwork/database/options/advanced-analytics/odm/index.html",
"http://www.oracle.com/technetwork/database/database-technologies/r/r-technologies/overview/index.html",
"http://www.oracle.com/technetwork/database/database-technologies/r/r-enterprise/overview/index.html")
data <- htmlToText(urls)