Subset of object creates bigger RDA file size than original object
This is a funny story, and I will try to tell you how I realized I don’t know anything about R in 400 words.
I work at the Bioinformatics Core at the Harvard T.H. Chan School of Public Health. People who know us, or collaborate with us, know that we mainly use bcbio to analyze sequencing data (check it out, super cool tool).
Something that we are working on is loading the data into R after bcbio finishes, to ease the downstream analysis. For instance, for RNA-seq we have bcbioRnaSeq and for small RNA-seq we have bcbioSmallRna. In short, these R packages make it easy to load all the data generated by bcbio, and they wrap many methods and functions to generate the most used figures and analyses.
That is not the story. The story is that the bcbioRNADataSet object in those packages contains all the data, so that it is easy to use different count matrices or to subset to fewer samples/genes. For instance, the object contains the DDS object (DESeq2), so we can use methods for this class or re-normalize at any point.
Well, when we implemented the [ method (which allows subsetting the object), we discovered that the object coming from that method was almost twice as big when saved into an RDA file. For instance, if you have an object with 34K genes and 100 samples that takes 900 MB when saved, and you subset it to 1000 genes and 8 samples, you end up with a 1.6 GB file.
The example creates four objects:
- dummyBig: 10K genes, 100 samples
- dummySmall: 500 genes, 5 samples
- dummySilly: subset of dummyBig using a silly function to subset, same size as dummySmall
- dummySmart: subset of dummyBig using a smart function to subset, same size as dummySmall
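The text above does not show what made the silly function silly, so the following is only a sketch of one well-known way this can happen in R, not necessarily what happened in the real packages: a formula created inside a function keeps a reference to that function's environment, and save() writes that environment, including the full-size input, into the file. The list-based objects and the names cleanDesign, subsetSilly, subsetSmart and rdaSize are hypothetical stand-ins for the real DESeq2-based objects.

```r
# Toy stand-in for the real S4 object: a list with a count matrix and a
# design formula. All names below are hypothetical.
set.seed(1)

cleanDesign <- function() {
  design <- ~ 1
  # Point the formula at the global environment, which save() stores as a
  # reference instead of writing out its contents.
  environment(design) <- globalenv()
  design
}

dummyBig <- list(
  counts = matrix(rpois(10000 * 100, lambda = 10), nrow = 10000),
  design = cleanDesign()
)

# "Silly" subset: the formula is created inside the function, so it keeps a
# reference to the call environment, where the full-size `x` still lives.
subsetSilly <- function(x, genes, samples) {
  list(counts = x$counts[genes, samples, drop = FALSE], design = ~ 1)
}

# "Smart" subset: same data, but the formula does not point at the call frame.
subsetSmart <- function(x, genes, samples) {
  list(counts = x$counts[genes, samples, drop = FALSE], design = cleanDesign())
}

dummySilly <- subsetSilly(dummyBig, 1:500, 1:5)
dummySmart <- subsetSmart(dummyBig, 1:500, 1:5)

# Size on disk after save(), in bytes.
rdaSize <- function(obj) {
  path <- tempfile(fileext = ".rda")
  on.exit(unlink(path))
  save(obj, file = path)
  file.size(path)
}

sizes <- c(
  big   = rdaSize(dummyBig),
  silly = rdaSize(dummySilly),
  smart = rdaSize(dummySmart)
)
print(sizes)
```

In this toy version the silly subset should come out about as large on disk as the full object, or larger, while the smart one stays small.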
We should expect dummySmart to have a similar size to dummySmall.
This is the size of RDA files for each of them (size in Kb):
|object|file|rdaSize|objSize|
|---|---|---|---|
|dummyBig|big.rda|2828.49 Kb|1588 Kb|
|dummySmall|micro.rda|24.19 Kb|42.6 Kb|
|dummySilly|microSilly.rda|4397.49 Kb|43.2 Kb|
|dummySmart|microSmart.rda|22.38 Kb|43.2 Kb|
rdaSize is the size of the file written with save(obj, file = fileName.rda); objSize is the size of the object in memory in R.
As you can see, microSilly is roughly 1.5 times the size of the original file (4397.49 Kb vs. 2828.49 Kb for big.rda), but microSmart is the expected size.
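As a rough sketch, the two size columns can be computed along these lines. The helper name sizeReport is hypothetical, object.size() from utils is an assumption for the in-memory measurement, and a plain matrix stands in for the real object.

```r
# Hypothetical helper: on-disk and in-memory size of one object, in Kb.
sizeReport <- function(obj) {
  path <- tempfile(fileext = ".rda")
  on.exit(unlink(path))  # clean up the temporary file when done
  save(obj, file = path)
  c(
    rdaSize = file.size(path) / 1024,
    objSize = as.numeric(object.size(obj)) / 1024
  )
}

# Plain matrix with the dimensions of dummySmall (500 genes, 5 samples).
dummySmall <- matrix(rpois(500 * 5, lambda = 10), nrow = 500)
print(round(sizeReport(dummySmall), 2))
```

Note that object.size() does not follow environments attached to an object, which is one reason the two numbers can disagree.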
I know there should be a better way to avoid this, but this worked. If you have any comments or advice, post them here. My R session is below.
“honesty in bioinformatics: accept you develop without being an expert and that creates chaos.”
R version 3.4.1 (2017-06-30)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.6
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
locale:
en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
parallel stats4 methods stats graphics grDevices utils datasets base
other attached packages:
dplyr_0.7.2.9000 DESeq2_1.16.1 SummarizedExperiment_1.6.3 DelayedArray_0.2.7
matrixStats_0.52.2 Biobase_2.36.2 GenomicRanges_1.28.4 GenomeInfoDb_1.12.2
IRanges_2.10.2 S4Vectors_0.14.3 BiocGenerics_0.22.0
loaded via a namespace (and not attached):
Rcpp_0.12.12 locfit_1.5-9.1 lattice_0.20-35
assertthat_0.2.0 digest_0.6.12 R6_2.2.2
plyr_1.8.4 backports_1.1.0 acepack_1.4.1
RSQLite_2.0 highr_0.6 ggplot2_2.2.1
zlibbioc_1.22.0 rlang_0.1.2.9000 lazyeval_0.2.0
data.table_1.10.4 annotate_1.54.0 blob_1.1.0
rpart_4.1-11 Matrix_1.2-10 checkmate_1.8.2
splines_3.4.1 BiocParallel_1.10.1 geneplotter_1.54.0
stringr_1.2.0 foreign_0.8-69 htmlwidgets_0.8
RCurl_1.95-4.8 bit_1.1-12 munsell_0.4.3
compiler_3.4.1 pkgconfig_2.0.1 base64enc_0.1-3
htmltools_0.3.6 nnet_7.3-12 tibble_22.214.171.12401
gridExtra_2.2.1 htmlTable_1.9 GenomeInfoDbData_0.99.0
Hmisc_4.0-3 XML_3.98-1.9 bitops_1.0-6
grid_3.4.1 xtable_1.8-2 gtable_0.2.0
DBI_0.7 magrittr_1.5 scales_0.4.1
stringi_1.1.5 XVector_0.16.0 genefilter_1.58.1
bindrcpp_0.2 latticeExtra_0.6-28 Formula_1.2-1
RColorBrewer_1.1-2 tools_3.4.1 bit64_0.9-7
glue_126.96.36.19900 survival_2.41-3 AnnotationDbi_1.38.2
colorspace_1.3-2 cluster_2.0.6 memoise_1.1.0
bindr_0.1 knitr_1.17