Subsetting an object creates a bigger RDA file than the original object

This is a funny story, and I will try to tell you in 400 words how I realized I don’t know anything about R.

I work at the Bioinformatics Core at the Harvard T.H. Chan School of Public Health. People who know us, or collaborate with us, know that we mainly use bcbio to analyze sequencing data (check it out, super cool tool).

Something we are working on is loading the data into R after bcbio finishes, to ease the downstream analysis. For instance, for RNA-seq we have bcbioRnaSeq and for small RNA-seq we have bcbioSmallRna. In short, these R packages make it easy to load all the data generated by bcbio, and they wrap many methods and functions to generate the most common figures and analyses.

That is not the story. The story is that the bcbioRNADataSet object in those packages contains all the data, so that it is easy to use the different count matrices or to subset to fewer samples or genes. For instance, the object contains the DDS object (DESeq2 DESeqDataSet), so we can use the methods for this class or re-normalize at any point.

Well, when we implemented the [ method (which allows subsetting the object), we discovered that the object coming out of that method was about twice as big as the original when saved into an RDA file. For instance, if you have an object with 34K genes and 100 samples that takes 900 MB when saved, and you subset it to 1000 genes and 8 samples, you end up with a 1.6 GB file.

That is not nice, so I decided to play around. I don’t want you to read our code, because it contains a lot of information, but feel free: subset. I value your time, so here is a dummy example.

The example creates four objects:

  • dummyBig: 10K genes, 100 samples
  • dummySmall: 500 genes, 5 samples
  • dummySilly: subset of dummyBig using a silly function to subset, same size as dummySmall
  • dummySmart: subset of dummyBig using a smart function to subset, same size as dummySmall

We should expect dummySilly and dummySmart to have a size similar to dummySmall.

These are the sizes of the RDA file and of the object in memory for each of them (in Kb):

object      file            rdaSize      objSize
dummyBig    big.rda         2828.49 Kb   1588 Kb
dummySmall  micro.rda         24.19 Kb   42.6 Kb
dummySilly  microSilly.rda  4397.49 Kb   43.2 Kb
dummySmart  microSmart.rda    22.38 Kb   43.2 Kb

rdaSize is the size of the file written by save(obj, file = "fileName.rda"). objSize is the size in memory in R, as reported by format(object.size(obj)).
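If you want to take the same two measurements yourself, here is a minimal sketch in base R. The helper name sizeReport is made up for this post, and plain integer matrices stand in for the real objects, so the numbers will not match the table above.

```r
# Hypothetical helper: on-disk and in-memory size of one object, in Kb
sizeReport <- function(obj) {
    path <- tempfile(fileext = ".rda")
    on.exit(unlink(path))                 # remove the temporary file afterwards
    save(obj, file = path)
    data.frame(
        rdaSize = file.size(path) / 1024,                # size of the RDA file
        objSize = as.numeric(object.size(obj)) / 1024    # size in memory
    )
}

set.seed(1)
# Stand-ins for the dummy objects: plain count matrices (an assumption)
dummyBig <- matrix(rpois(10000 * 100, lambda = 10), nrow = 10000)
dummySmall <- dummyBig[1:500, 1:5]

report <- rbind(big = sizeReport(dummyBig), small = sizeReport(dummySmall))
print(round(report, 2))
```

For plain matrices like these, both columns shrink together when you subset. The surprise in the table is that for dummySilly only objSize shrank.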

As you can see, microSilly is about 1.5 times the size of the original file (4397 Kb versus 2828 Kb), but microSmart is the expected size.

I don’t know the exact reason, but the chunk of code that did the trick was an internal function, called inside the method, that re-generates the DESeq2 object from the smaller data.
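One possible explanation, which I have not confirmed for our packages, is that R formulas (like the design of a DESeqDataSet) keep a reference to the environment where they were created. save() writes that environment to disk, while object.size() ignores it. The sketch below reproduces that effect with base R only. The names sillySubset, smartSubset, rebuild and rdaKb are made up for illustration, and a list holding a matrix and a formula stands in for the real object.

```r
# Hypothetical helper: size in Kb of the file written by save()
rdaKb <- function(obj) {
    path <- tempfile(fileext = ".rda")
    on.exit(unlink(path))
    save(obj, file = path)
    file.size(path) / 1024
}

# Silly: the formula is created in the same frame that holds the big matrix,
# so its environment keeps `big` around and save() writes it out as well
sillySubset <- function(big, rows, cols) {
    small <- big[rows, cols]
    list(counts = small, design = ~ 1)
}

# Internal function: its frame only ever holds the small matrix
rebuild <- function(counts) {
    list(counts = counts, design = ~ 1)
}

# Smart: subset first, then re-generate the object inside the small frame
smartSubset <- function(big, rows, cols) {
    rebuild(big[rows, cols])
}

set.seed(1)
dummyBig <- matrix(rpois(10000 * 100, lambda = 10), nrow = 10000)
dummySilly <- sillySubset(dummyBig, 1:500, 1:5)
dummySmart <- smartSubset(dummyBig, 1:500, 1:5)

# File sizes differ a lot even though both objects hold the same 500 x 5 matrix
sizes <- c(silly = rdaKb(dummySilly), smart = rdaKb(dummySmart))
print(round(sizes, 2))

# What the silly formula drags along with it
print(ls(environment(dummySilly$design)))
```

If this is what happens in the real method, listing the environment of the formula (or of any function stored in the object) is a quick way to check what else ends up in the file.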

I know there should be a better way to avoid this, but this worked. If you have any comments or advice, post them here. My R session is below.

“honesty in bioinformatics: accept you develop without being an expert and that creates chaos.”

R version 3.4.1 (2017-06-30)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.6

Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  stats4    methods   stats     graphics  grDevices utils
[8] datasets  base

other attached packages:
 [1] dplyr_0.7.2.9000           DESeq2_1.16.1
 [3] SummarizedExperiment_1.6.3 DelayedArray_0.2.7
 [5] matrixStats_0.52.2         Biobase_2.36.2
 [7] GenomicRanges_1.28.4       GenomeInfoDb_1.12.2
 [9] IRanges_2.10.2             S4Vectors_0.14.3
[11] BiocGenerics_0.22.0

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.12            locfit_1.5-9.1          lattice_0.20-35
 [4] assertthat_0.2.0        digest_0.6.12           R6_2.2.2
 [7] plyr_1.8.4              backports_1.1.0         acepack_1.4.1
[10] RSQLite_2.0             highr_0.6               ggplot2_2.2.1
[13] zlibbioc_1.22.0         rlang_0.1.2.9000        lazyeval_0.2.0
[16] data.table_1.10.4       annotate_1.54.0         blob_1.1.0
[19] rpart_4.1-11            Matrix_1.2-10           checkmate_1.8.2
[22] splines_3.4.1           BiocParallel_1.10.1     geneplotter_1.54.0
[25] stringr_1.2.0           foreign_0.8-69          htmlwidgets_0.8
[28] RCurl_1.95-4.8          bit_1.1-12              munsell_0.4.3
[31] compiler_3.4.1          pkgconfig_2.0.1         base64enc_0.1-3
[34] htmltools_0.3.6         nnet_7.3-12             tibble_1.3.3.9001
[37] gridExtra_2.2.1         htmlTable_1.9           GenomeInfoDbData_0.99.0
[40] Hmisc_4.0-3             XML_3.98-1.9            bitops_1.0-6
[43] grid_3.4.1              xtable_1.8-2            gtable_0.2.0
[46] DBI_0.7                 magrittr_1.5            scales_0.4.1
[49] stringi_1.1.5           XVector_0.16.0          genefilter_1.58.1
[52] bindrcpp_0.2            latticeExtra_0.6-28     Formula_1.2-1
[55] RColorBrewer_1.1-2      tools_3.4.1             bit64_0.9-7
[58] glue_1.1.1.9000         survival_2.41-3         AnnotationDbi_1.38.2
[61] colorspace_1.3-2        cluster_2.0.6           memoise_1.1.0
[64] bindr_0.1               knitr_1.17

Director of Bioinformatics Platform

My research interests include genomics, visualization and modelling.