Friday, January 16, 2015

Pulling Large Database tables in R

As the volume of the data in your tables grows, particularly in the big data world, you may run into some memory issues or package restrictions with pulling down the tables to your R environment.

Some of the R packages and drivers have some recommended numbers or limits for the number of records that can be fetched.

Caveate: My laptop is a Mac and at this point in time the ROracle package is unavailable for a Mac. It is for Windows, Solaris and AIX.

In the following example I'm looking at downloading a table with 300K records from an Oracle Database. I've already setup my DB connection using the Oracle JDBC driver. But when I run the following command I get an error.

> res<-dbSendQuery(jdbcConnection, "select * from my_large_table")

> dbFetch(res)

Error in .jcall(rp, "I", "fetch", stride) :

    java.lang.OutOfMemoryError: Java heap space

I also get a similar error if I run the following command.

> train_data <- dbReadTable(jdbcConnection, "MY_LARGE_TABLE")

How can you pull down a large table in R? So that you are not restricted to memory restrictions or limits on the number of records.

One way to do this is to loop through the data, pull the records down in chunks (a certain fetch size), put these into an array, and then merge them all together into a data frame. The following code illustrates how to do this.

> res<-dbSendQuery(jdbcConnection, "select * from my_large_table")

> dbFetch(res)

> rm(result)

> result<-list()

> i=1

> result[[i]]<-dbFetch(res,n=1000)

> while(nrow(chunk <- dbFetch(res, n = 1000))>0){

+     i<-i+1

+     result[[i]]<-chunk

+ }

> train_data<-do.call(rbind,result)

The above code runs surprisingly quickly, generate no errors and I now have all the data I need in my R environment.

The fetch size in the above example is set to 1000. This is a bit small really and is only set to that for illustration purposes here. You will need to play with this size to find out what size works best for your environment.

As with all programming languages and with R too there can be many different ways of performing the same thing.

1 comment:

  1. Just curious, Brendan:

    ROracle package has a parameter bulk_read on methods dbConnect, dbSendQuery and dbGetQuery, that seems to me to be meant to solve this issue (default set to 1000.)

    So the issue you are adressing with this workaround is specifically for using JDBC? It does not have a similar parameter?

    ReplyDelete