Subset of object creates bigger RDA file size than original object
This is a funny story, and I will try to tell you, in 400 words, how I realized I don’t know anything about R.
I work at the Bioinformatics Core at the Harvard T.H. Chan School of Public Health. People who know us, or collaborate with us, know that we mainly use bcbio to analyze sequencing data (check it out, super cool tool).
Something we are working on is loading the data into R after bcbio finishes, to ease the downstream analysis.
For instance, for RNA-seq we have bcbioRnaSeq and for small RNA-seq we have bcbioSmallRna. In short, these R packages make it easy to load all the data generated by bcbio, and they wrap many methods and functions to generate the most commonly used figures and analyses.
That is not the story. The story is that the `bcbioRNADataSet` object in those packages contains all the data, to make it easy to use different count matrices or to subset to fewer samples/genes. For instance, the object contains the DDS object (a DESeq2 `DESeqDataSet`), so we can use methods for this class or re-normalize at any point.
Well, when we implemented the `[` method (which allows subsetting the object), we discovered that the object coming from that method was almost twice as big when saved into an RDA file. For instance, if you have an object with 34K genes and 100 samples that takes 900 MB when saved, and you subset it to 1000 genes and 8 samples, you end up with a file of 1.6 GB.
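For readers who have not written a `[` method before, here is a minimal sketch of what one can look like. It assumes a hypothetical toy S4 class, `ToyDataSet`, with a single `counts` slot; it is not the real `bcbioRNADataSet` class or its actual method.

```r
library(methods)

# Hypothetical toy class standing in for the real data set object.
# Rows are genes, columns are samples.
setClass("ToyDataSet", representation(counts = "matrix"))

# Subset method: returns a new ToyDataSet with the selected genes/samples.
setMethod("[", signature(x = "ToyDataSet"),
          function(x, i, j, ..., drop = FALSE) {
            # Keep every row/column when an index is not supplied.
            if (missing(i)) i <- seq_len(nrow(x@counts))
            if (missing(j)) j <- seq_len(ncol(x@counts))
            new("ToyDataSet", counts = x@counts[i, j, drop = FALSE])
          })

toy <- new("ToyDataSet", counts = matrix(1:20, nrow = 5, ncol = 4))
small <- toy[1:2, 1:3]
dim(small@counts)
```

The real method has to subset every piece of data the object carries, which is where things can go wrong.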
That is not nice, so I decided to play around. I don’t want you to read our code, because it contains a lot of information, but feel free: subset. I value your time, so here is a dummy example.
The example creates four different objects:
- dummyBig: 10K genes, 100 samples
- dummySmall: 500 genes, 5 samples
- dummySilly: subset of dummyBig using a silly function to subset, same size as dummySmall
- dummySmart: subset of dummyBig using a smart function to subset, same size as dummySmall
We should expect `dummySilly` and `dummySmart` to have a size similar to `dummySmall`…
This is the size of RDA files for each of them (size in Kb):
object | file | rdaSize | objSize |
---|---|---|---|
dummyBig | big.rda | 2828.49 Kb | 1588 Kb |
dummySmall | micro.rda | 24.19 Kb | 42.6 Kb |
dummySilly | microSilly.rda | 4397.49 Kb | 43.2 Kb |
dummySmart | microSmart.rda | 22.38 Kb | 43.2 Kb |
rdaSize is the size of the file written by `save(obj, file = "fileName.rda")`. objSize is the size in memory in R, as reported by `format(object.size(obj))`.
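If you want to take the same two measurements yourself, something like the following should work. It is only a sketch: `sizes()` is a hypothetical helper, and the objects here are plain count matrices rather than the objects from the table, so the numbers will not match.

```r
# Hypothetical helper: size on disk after save() versus size in memory, in Kb.
sizes <- function(obj) {
  path <- tempfile(fileext = ".rda")
  on.exit(unlink(path))            # remove the temporary file afterwards
  save(obj, file = path)
  c(rdaSize = file.size(path) / 1024,
    objSize = as.numeric(object.size(obj)) / 1024)
}

set.seed(1)
# Plain matrices used as stand-ins: 10K genes x 100 samples, then 500 x 5.
dummyBig <- matrix(rpois(10000 * 100, lambda = 10), nrow = 10000)
dummySmall <- dummyBig[1:500, 1:5]

round(rbind(dummyBig = sizes(dummyBig), dummySmall = sizes(dummySmall)), 2)
```

With plain matrices both numbers shrink together, which is what you would naively expect from any subset.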
As you can see, `microSilly` is about 1.5 times the size of the file for the original data, but `microSmart` is the expected size.
I don’t know the exact reason, but the chunk of code that did the trick was to create an internal function, called inside the method, that re-generates the DESeq2 object with the smaller size.
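One mechanism that would produce exactly this pattern, and this is an assumption that has not been checked against the real code, is a formula created inside the subsetting function. A formula keeps a reference to the environment where it was created, `save()` writes that environment to the file, and `object.size()` does not count it. The sketch below uses plain lists and hypothetical functions (`subset_silly`, `subset_smart`, `build_small`) instead of DESeq2 objects.

```r
# Bytes that serialization (the machinery behind save()) writes for an object.
serialized_size <- function(obj) length(serialize(obj, NULL))

set.seed(1)
big <- matrix(runif(10000 * 100), nrow = 10000)   # 10K genes x 100 samples

# "Silly": the formula is created inside the function that also holds the
# big matrix (argument x), so it carries that whole environment along.
subset_silly <- function(x, i, j) {
  list(counts = x[i, j, drop = FALSE], design = ~ 1)
}

# "Smart": a helper that only ever sees the small matrix builds the result,
# so the environment attached to the formula holds only small data.
build_small <- function(counts) list(counts = counts, design = ~ 1)
subset_smart <- function(x, i, j) build_small(x[i, j, drop = FALSE])

silly <- subset_silly(big, 1:500, 1:5)
smart <- subset_smart(big, 1:500, 1:5)

c(big = serialized_size(big),
  silly = serialized_size(silly),
  smart = serialized_size(smart))

# In memory the two subsets look alike, because object.size() ignores
# the environments attached to the formulas.
c(silly = object.size(silly), smart = object.size(smart))
```

Run at the top level of a session, the silly subset serializes to more than the full matrix while the smart one stays small. Whether this is what actually happens inside the DESeq2 object is a guess, but it matches the table: same objSize, very different rdaSize.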
I know there should be a better way to avoid this, but this worked. If you have any comments or advice, post them here. My R session is below.
“honesty in bioinformatics: accept you develop without being an expert and that creates chaos.”
R version 3.4.1 (2017-06-30)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.6
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] parallel stats4 methods stats graphics grDevices utils
[8] datasets base
other attached packages:
[1] dplyr_0.7.2.9000 DESeq2_1.16.1
[3] SummarizedExperiment_1.6.3 DelayedArray_0.2.7
[5] matrixStats_0.52.2 Biobase_2.36.2
[7] GenomicRanges_1.28.4 GenomeInfoDb_1.12.2
[9] IRanges_2.10.2 S4Vectors_0.14.3
[11] BiocGenerics_0.22.0
loaded via a namespace (and not attached):
[1] Rcpp_0.12.12 locfit_1.5-9.1 lattice_0.20-35
[4] assertthat_0.2.0 digest_0.6.12 R6_2.2.2
[7] plyr_1.8.4 backports_1.1.0 acepack_1.4.1
[10] RSQLite_2.0 highr_0.6 ggplot2_2.2.1
[13] zlibbioc_1.22.0 rlang_0.1.2.9000 lazyeval_0.2.0
[16] data.table_1.10.4 annotate_1.54.0 blob_1.1.0
[19] rpart_4.1-11 Matrix_1.2-10 checkmate_1.8.2
[22] splines_3.4.1 BiocParallel_1.10.1 geneplotter_1.54.0
[25] stringr_1.2.0 foreign_0.8-69 htmlwidgets_0.8
[28] RCurl_1.95-4.8 bit_1.1-12 munsell_0.4.3
[31] compiler_3.4.1 pkgconfig_2.0.1 base64enc_0.1-3
[34] htmltools_0.3.6 nnet_7.3-12 tibble_1.3.3.9001
[37] gridExtra_2.2.1 htmlTable_1.9 GenomeInfoDbData_0.99.0
[40] Hmisc_4.0-3 XML_3.98-1.9 bitops_1.0-6
[43] grid_3.4.1 xtable_1.8-2 gtable_0.2.0
[46] DBI_0.7 magrittr_1.5 scales_0.4.1
[49] stringi_1.1.5 XVector_0.16.0 genefilter_1.58.1
[52] bindrcpp_0.2 latticeExtra_0.6-28 Formula_1.2-1
[55] RColorBrewer_1.1-2 tools_3.4.1 bit64_0.9-7
[58] glue_1.1.1.9000 survival_2.41-3 AnnotationDbi_1.38.2
[61] colorspace_1.3-2 cluster_2.0.6 memoise_1.1.0
[64] bindr_0.1 knitr_1.17