How to set up public dataset analysis with bcbio-nextgen
We use bcbio-nextgen to analyze sequencing data, mainly (sc)RNAseq, smallRNAseq, DNASeq and ChIPSeq. It is not unusual for collaborators to ask us to re-analyze a public dataset.
Inside bcbio, we have bcbio_prepare_samples.py, which merges multiple files that belong to the same sample into one file to make configuring bcbio easier. We extended this script to pull down data from the GEO and SRA repositories.
If you have bcbio installed, you can create an example.csv file like this:
samplenames,description
GSM3508215,HEK293T
SRR8311268,Hela
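If the accessions come from a script rather than a text editor, the same file can be generated programmatically. The following is only a minimal sketch using Python's standard csv module; the helper name write_sample_csv is hypothetical and not part of bcbio.

```python
import csv
from pathlib import Path

# Accession / description pairs, taken from the example above.
SAMPLES = [
    ("GSM3508215", "HEK293T"),
    ("SRR8311268", "Hela"),
]


def write_sample_csv(path, samples):
    """Write a two-column CSV with the samplenames,description header."""
    with open(path, "w", newline="") as handle:
        # Use plain "\n" line endings instead of the csv default "\r\n".
        writer = csv.writer(handle, lineterminator="\n")
        writer.writerow(["samplenames", "description"])
        writer.writerows(samples)
    return Path(path)


write_sample_csv("example.csv", SAMPLES)
```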
And then run:
bcbio_prepare_samples.py --csv example.csv --out fastq
This will download the data and create all the files inside the fastq folder. If a sample is paired-end, it will generate the two associated files: R1 and R2.
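To see at a glance which samples ended up with both mates, you could group the downloaded file names by sample. This is a rough sketch and rests on an assumption that is not confirmed here: that the read number appears as _R1 or _R2 right before the .fastq or .fastq.gz extension. Check the names your bcbio version actually writes and adjust the pattern. The function name group_reads and the file names in the comment are hypothetical.

```python
import re
from collections import defaultdict
from pathlib import Path

# Assumption: names look like <sample>_R1.fastq.gz / <sample>_R2.fastq.gz.
READ_PATTERN = re.compile(r"^(?P<sample>.+)_R(?P<read>[12])\.(fastq|fq)(\.gz)?$")


def group_reads(filenames):
    """Map each sample name to the sorted read numbers found for it.

    Files whose names do not match the assumed pattern are ignored.
    """
    reads = defaultdict(list)
    for name in filenames:
        match = READ_PATTERN.match(Path(name).name)
        if match:
            reads[match.group("sample")].append(int(match.group("read")))
    return {sample: sorted(found) for sample, found in reads.items()}


# Example with made-up names; on a real run you would pass
# [str(p) for p in Path("fastq").iterdir()] instead.
# group_reads(["fastq/sampleA_R1.fastq.gz", "fastq/sampleA_R2.fastq.gz"])
```

A sample mapped to [1, 2] has both mates; anything else deserves a second look.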
Cool options to use:
- --remove-source: if you want to keep only the final files.
- You can use full FTP addresses as well under the samplenames column.
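When the download is one step of a larger script, it can help to assemble the command in one place. The sketch below uses only the flags mentioned in this post (--csv, --out and --remove-source); the wrapper function build_command is hypothetical, not something bcbio provides.

```python
import shlex


def build_command(csv_path, out_dir, remove_source=False):
    """Return the argument list for a bcbio_prepare_samples.py call."""
    cmd = ["bcbio_prepare_samples.py", "--csv", str(csv_path), "--out", str(out_dir)]
    if remove_source:
        # Keep only the final files.
        cmd.append("--remove-source")
    return cmd


cmd = build_command("example.csv", "fastq", remove_source=True)
# Show the command as it would be typed in a shell.
print(shlex.join(cmd))
```

To actually run it, the list can be handed to subprocess.run(cmd, check=True) on a machine where bcbio is installed.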
Nextflow accepts these SRA IDs as well; take a look.
And sra-explorer from Phil Ewels will create a bash script to download the FASTQ files. It has a great search engine where you can use any of these terms to find your public data: GSE30567, SRP043510, PRJEB8073, ERP009109 or human liver miRNA.
The only advantage of bcbio is that when multiple files are associated with the same sample, it will merge them together. For instance, if you search in sra-explorer for the term GSM2598386, you will see that it has multiple files. bcbio will merge all of them into one and you can run your pipeline directly afterwards. That is pretty convenient if you use any of the bcbio pipelines :).
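For readers curious what merging amounts to, here is a minimal sketch of the idea. It is not bcbio's own code. It assumes the per-run files are gzip-compressed FASTQ and relies on the fact that gzip files concatenated byte for byte still form a valid gzip stream. The function names and file names are hypothetical.

```python
import gzip
import shutil
from pathlib import Path


def merge_fastq_gz(parts, merged):
    """Concatenate gzip-compressed FASTQ files, in the given order, into `merged`."""
    with open(merged, "wb") as out:
        for part in parts:
            with open(part, "rb") as handle:
                # Raw byte copy: each input stays a separate gzip member.
                shutil.copyfileobj(handle, out)
    return Path(merged)


def count_reads(path):
    """Count FASTQ records, assuming four lines per read."""
    with gzip.open(path, "rt") as handle:
        return sum(1 for _ in handle) // 4


# Hypothetical usage:
# merge_fastq_gz(["run1_R1.fastq.gz", "run2_R1.fastq.gz"], "sample_R1.fastq.gz")
```

For paired-end data, the R1 and R2 parts have to be merged in the same run order so that the mates stay in sync.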
Enjoy!
“happiness in bioinformatics: when your collaborators give you a CSV file with all the metadata for your raw data and they match.”