Compare timing with different compression levels in save()

R has a very useful save() function which allows you to save multiple objects to a single file as R objects. This means that you can save a data frame with all its column classes intact (factors stay factors, etc.) instead of writing it to an ASCII file and then worrying about reconstructing those classes. It can also save more complicated objects such as lists and arrays. These objects can be read back into R using load().
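To make this concrete, here is a minimal sketch (the file name is arbitrary) that saves a data frame with a factor column together with a list, then restores both with load():

df   <- data.frame( id = 1:3, group = factor(c("a", "b", "a")) )
info <- list( created = Sys.Date(), note = "example" )

save( df, info, file="example.rda" )   ## save both objects to one file
rm( df, info )

load( "example.rda" )   ## restores df and info under their original names
str( df$group )         ## still a factor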

There are various options to make the output file smaller, which has been useful to me. The default values for save() are compress=TRUE and ascii=FALSE. Note that compress=TRUE corresponds to gzip compression at its default compression level of 6, while specifying compress="bzip2" uses bzip2 at the highest compression level, 9. (In the test below I also set ascii=TRUE for the non-standard save, a choice revisited in the comments.)
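If you want to stay with gzip but push it beyond its default level, save() also accepts a compression_level argument (in recent versions of R); a small sketch, with an arbitrary file name:

mat <- matrix( rnorm(100*100), ncol=100 )
## gzip at its maximum level 9 instead of the default 6
save( mat, file="gzip9.rda", compress="gzip", compression_level=9 )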

I have been using the highest compression as it saves hard disk space, but until now I had not considered the time it takes to create and load objects at different compression levels. Here is my code to test this:

N <- 100
mat <- matrix( rnorm(N*N), ncol=N )

## standard save: gzip compression (level 6), binary format
system.time( save( mat, file="standard.rda" ) )
## non-standard save: bzip2 compression (level 9), ASCII format
system.time( save( mat, file="nonStandard.rda", compress="bzip2", ascii=TRUE ) )

system.time( load("standard.rda") )
system.time( load("nonStandard.rda") )

Here are the timings (in seconds) for various values of N:

               Time to save()             Time for load()
N         standard   non-standard    standard   non-standard
100           0.02           0.03          ~0           0.11
1,000         1.82           3.51        0.03           7.13
5,000        11.60          90.1         1.45         181.00

So the penalty for higher compression becomes much more noticeable with larger datasets. I should weigh this cost against the size of the file generated and how often I will need to load the dataset.
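To weigh the file size side of that trade-off, the sizes of the two files created above can be compared directly with file.info():

## sizes (in Mb) of the files written by the code above
sizes <- file.info( c("standard.rda", "nonStandard.rda") )$size
round( sizes / 1024^2, 2 )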


2 comments

  1. Adai,

    Here is an interesting example (from my experience) of when higher compression is both faster and creates much smaller files.

    N <- 5000
    ## rounded normals scaled and stored as integers compress much better
    mat <- matrix( as.integer(rnorm(N*N)*1e3), ncol=N )
    system.time( save( mat, file="standard.rda" ) )
    system.time( save( mat, file="nonStandard.rda", compress="bzip2" ) )

    • Dear Andrey, apologies for the delay as I have only just seen this message, and thank you for the suggestion. So you are saying to leave the ascii=T option out? You are correct: saving and loading without ascii is much faster, at the cost of a slightly larger file size (but still much smaller than the standard save). However, what you have also shown me is that storing the information as integers produces better compression (which can work when the data is all of one type).

        x.digits5 <- round( rnorm(5000^2), digits=5 )
        x.integer <- as.integer( 10000 * x.digits5 )   ## same values stored as integers (note: the 5th decimal is truncated)

        mat.digits5 <- matrix( x.digits5, ncol=5000 )
        mat.integer <- matrix( x.integer, ncol=5000 )
      
                                           mat.digits5                        mat.integer
                                           save time  load time  file size    save time  load time  file size
      standard                             7s         1s         104Mb        20s        <1s        65Mb
      compress="bzip2"                     18s        11s        77Mb         8s         6s         48Mb
      compress="bzip2" & ascii=TRUE        37s        60s        67Mb         16s        36s        51Mb

      I think I will use compress="bzip2" in future, especially if I have to save() once and load() only occasionally.

      If I need to save() once and load() frequently AND the data is all of a similar type (e.g. an expression data matrix), then I can multiply by an appropriate factor, store integers, and use the standard save (after all, mat.integer / 10000 takes less than a second to compute). Thank you.
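      A quick sketch of that round trip (the factor 10000 matches the code above; the file name is illustrative):

        save( mat.integer, file="expr.rda" )   ## fast standard save of the integer version
        load( "expr.rda" )
        mat.restored <- mat.integer / 10000    ## recover (approximately) the original values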

