Downloading public data in bulk from the NCBI, ENA, and DDBJ databases has never been easier!
Workflow overview (simplified)
Download SRA file
The five most popular databases that store sequencing data are the National Center for Biotechnology Information (NCBI), the European Nucleotide Archive (ENA), the DNA Data Bank of Japan (DDBJ), the Genome Sequence Archive (GSA), and the ENCODE project (ENCODE). ENA, ENCODE, and GSA store the actual fastq files, whereas DDBJ and NCBI store the raw data (as an SRA file) from which a fastq can be derived. For this reason, for each sample on DDBJ and NCBI, seq2science first checks whether it can be downloaded from ENA as a fastq directly. Otherwise the sample is downloaded in its raw format. To convert this data to a fastq it has to be “dumped”.
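The decision described above can be sketched in a few lines. This is only an illustration and not seq2science's actual implementation: the function name `plan_download` and the `ena_has_fastq` callback are hypothetical, and the prefix-to-database mapping is taken from the accession formats listed further down this page.

```python
from typing import Callable

# Databases that serve fastq files directly, keyed by accession prefix.
DIRECT_FASTQ = {
    "CRX": "GSA",
    "ENCFF": "ENCODE",
    "ENCSR": "ENCODE",
    "ERR": "ENA",
    "ERX": "ENA",
}
# Accessions hosted by DDBJ or NCBI, which store raw SRA files.
SRA_BACKED = ("DRR", "DRX", "GSM", "SRR", "SRX")


def plan_download(accession: str, ena_has_fastq: Callable[[str], bool]) -> str:
    """Return a short description of how a sample would be fetched.

    `ena_has_fastq` is a hypothetical lookup that reports whether ENA
    offers the sample as a fastq; a real lookup would query ENA.
    """
    for prefix, database in DIRECT_FASTQ.items():
        if accession.startswith(prefix):
            return f"fastq from {database}"
    if accession.startswith(SRA_BACKED):
        if ena_has_fastq(accession):
            return "fastq from ENA"
        return "sra file, then dump to fastq"
    raise ValueError(f"unrecognised accession: {accession}")


# Made-up example: pretend only SRR123 is available on ENA.
mirrored = {"SRR123"}
for accession in ("SRR123", "DRR098", "ERR321"):
    print(accession, "->", plan_download(accession, mirrored.__contains__))
```

Passing the ENA lookup in as a function keeps the sketch runnable offline.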
Filling out the samples.tsv
Before running a workflow you will have to specify which samples you want to run the workflow on.
Each workflow starts with a samples.tsv as an example, and you should adapt it to your specific needs.
As an example, the samples.tsv could look something like this:
    sample
    CRX123      <-- GSA experiment
    DRX890      <-- DDBJ experiment
    DRR098      <-- DDBJ run
    ENCSR765    <-- ENCODE assay
    ENCFF432    <-- ENCODE fastq file
    ERX123      <-- EBI ENA experiment
    ERR321      <-- EBI ENA run
    GSMabc      <-- GEO sample
    SRX456      <-- SRA experiment
    SRRxzy      <-- SRA run
When downloading fastq files there is only one column in the samples.tsv. This is the sample column, where each sample is specified by its accession (e.g. GSM2837484). Accepted accessions start with “CRX”, “DRR”, “DRX”, “ENCFF”, “ENCSR”, “ERR”, “ERX”, “GSM”, “SRR”, or “SRX”.
When you specify an ENCODE fastq file that belongs to a paired-end sequencing run, both fastq files will be downloaded. They will be named after the sample, and R1 and R2 will correspond to the assignment on ENCODE.
Feel free to add columns to your liking (these will be ignored).
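Before starting a run it can help to check that every accession in the samples.tsv has one of the accepted prefixes. The following is a minimal sketch, not part of seq2science; the function name `read_samples` and the `note` column in the example are hypothetical. Extra columns are simply ignored, as described above.

```python
import csv
import io

# Accepted accession prefixes, as listed above.
ACCEPTED_PREFIXES = (
    "CRX", "DRR", "DRX", "ENCFF", "ENCSR", "ERR", "ERX", "GSM", "SRR", "SRX",
)


def read_samples(handle):
    """Return the accessions in the 'sample' column of a tab-separated file.

    Raises ValueError when the column is missing or an accession does not
    start with one of the accepted prefixes.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    if reader.fieldnames is None or "sample" not in reader.fieldnames:
        raise ValueError("samples.tsv needs a 'sample' column")
    samples = []
    for row in reader:
        accession = (row["sample"] or "").strip()
        if not accession:
            continue  # skip rows without a sample
        if not accession.startswith(ACCEPTED_PREFIXES):
            raise ValueError(f"unsupported accession: {accession}")
        samples.append(accession)
    return samples


# In-memory stand-in for a samples.tsv with one extra (ignored) column.
example = "sample\tnote\nGSM2837484\tmy favourite sample\nERR321\t\n"
print(read_samples(io.StringIO(example)))  # prints ['GSM2837484', 'ERR321']
```

To check a real file, pass an open file handle instead, for example `open("samples.tsv", newline="")`.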
Filling out the config.yaml
Every workflow has many configurable options, which can be set in the config.yaml. In each config.yaml we highlighted a couple of options that we think are relevant for that specific workflow, and set (we think) reasonable default values.
When a workflow starts it prints the configuration variables influencing the workflow, and (almost) all of these values can be added to the config.yaml and changed to your liking.
You can see the complete set of configurable options in the extensive docs.
Downloading with ascp
ascp is a download protocol that allows for (sometimes much) faster speeds. If you have ascp installed, you can make the workflow download through this protocol. You can specify the path to the binary in the config.yaml with the key ascp_path; the ascp key is specified through its own key in the config.
See this gist for an easy installation of ascp.
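To find out whether an ascp binary is already on your PATH, and which value to use for ascp_path, a quick check like the one below may help. It is a small convenience sketch and not part of seq2science; the function name `ascp_config_line` is hypothetical, and it only covers ascp_path, not the ascp key.

```python
import shutil


def ascp_config_line(binary="ascp"):
    """Return an 'ascp_path: ...' line if the binary is on PATH, else None."""
    path = shutil.which(binary)
    if path is None:
        return None
    return f"ascp_path: {path}"


line = ascp_config_line()
print(line if line else "ascp was not found on PATH")
```

If a line is printed, it can be copied into the config.yaml as it is.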
Filetype extensions and paired-end suffix
The pipeline will save fastq files in the fastq_dir directory, which is located in the result_dir directory by default.
People and tools have different preferences for storing their data. One of these differences is how to name the fastq.gz files. Some people/tools prefer fastq.gz, while others prefer fq.gz. By default the pipeline names files with the fastq.gz extension. However, if you prefer to change this, you can set the variable fqsuffix accordingly.
The same goes for paired-end suffix conventions. Most people prefer sample_R1, which is our default. However, you can use your preferred suffix by setting the variables fqext1 and fqext2.
    fastq_dir: ./my_first_fastq_dir
    fqsuffix: fastq
    fqext1: R1
    fqext2: R2
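To see how these settings combine into file names, here is a small sketch. It is not the pipeline's own code: the function name `fastq_names` is hypothetical, and the patterns `{sample}_{fqext}.{fqsuffix}.gz` for paired-end and `{sample}.{fqsuffix}.gz` for single-end data are assumptions based on the defaults described above.

```python
# Defaults as described above; values in the user config take precedence.
DEFAULTS = {"fqsuffix": "fastq", "fqext1": "R1", "fqext2": "R2"}


def fastq_names(sample, paired, config):
    """Return the expected fastq file name(s) for one sample.

    Assumed naming patterns (not confirmed by the pipeline itself):
    paired-end: {sample}_{fqext}.{fqsuffix}.gz
    single-end: {sample}.{fqsuffix}.gz
    """
    settings = {**DEFAULTS, **config}
    suffix = settings["fqsuffix"]
    if not paired:
        return [f"{sample}.{suffix}.gz"]
    return [
        f"{sample}_{ext}.{suffix}.gz"
        for ext in (settings["fqext1"], settings["fqext2"])
    ]


print(fastq_names("SRR123", paired=True, config={}))
print(fastq_names("SRR123", paired=True, config={"fqsuffix": "fq", "fqext1": "1", "fqext2": "2"}))
```

With the default settings the first call gives SRR123_R1.fastq.gz and SRR123_R2.fastq.gz; the second call shows the fq.gz convention with numeric suffixes. The files themselves would be stored in fastq_dir.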