RNA-seq

Preprocessing of RNA-seq has never been easier!

Workflow overview (simplified)

An example quality control report can be viewed here.

Downloading of sample(s)

Depending on whether the samples you start seq2science with is your own data, public data, or a mix, the pipeline might start with downloading samples. The downloading of samples is integrated into each workflow, so you don’t have to start a download workflow first. You control which samples are used in the samples.tsv. Background on public data can be found here.

Downloading and indexing of assembly(s)

Depending on whether the assembly and its index you align your samples against already exist seq2science will start with downloading of the assembly through genomepy.

Read trimming

The pipeline starts by trimming the reads with Trim Galore! or Fastp (the default). The trimmer will automatically trim the low quality 3’ ends of reads, and removes short reads. During quality trimming it automatically detects which sequencing adapter was used (if it wasn’t trimmed off yet), and then trims this as well. Trimming parameters for the pipeline can be set in the configuration, for example:

trimmer:
  fastp:
    trimoptions: --trim_front1 3 --trim_front2 14 --trim_poly_x

Alignment

Reads are aligned using HISAT2 or STAR (the default). Sensible defaults have been set, but can be overwritten for either (or both) the indexing and alignment by specifying them in the config.yaml.

After trimming the reads are aligned against an assembly. Currently we support bowtie2, bwa, bwa-mem2, hisat2, minimap2 and STAR as aligners. Choosing which aligner is as easy as setting the aligner variable in the config.yaml, for example: aligner: bwa. Sensible defaults have been set for every aligner, but can be overwritten for either (or both) the indexing and alignment by specifying them in the config.yaml:

aligner:
  bwa-mem:
    index: '-a bwtsw'
    align: '-M'

The pipeline will check if the assembly you specified is present in the genome_dir, and otherwise will download it for you through genomepy. All these aligners require an index to be formed first for each assembly, but don’t worry, the pipeline does this for you.

Bam sieving

After aligning the bam you can choose to remove unmapped reads, low quality mappings, duplicates, and multimappers. Again, sensible defaults have been set, but can be overwritten.

Strandedness

Most sequencing protocols at present are strand-specific. This specificity can be used to help identify pseudogenes originating from antisense DNA, or genes with overlapping regions on opposite strands without ambiguity. Strandedness is inferred automatically for all RNA-seq samples. For aligners it is inferred by RSeQC, the results of which can be reviewed in the MultiQC. RSeQC inference can be overwritten by column strandedness in the samples.tsv. This column may contain identifiers no, forward or reverse. If strandedness is unknown (for some samples), fields may be left blank or filled with nan. Setting ignore_strandedness in the config.yaml will resulting in gene counting to assume all reads are unstranded.

Gene quantification methods

RNA-seq can be performed using gene counting or abundance estimation methods. Gene counting methods require BAM files, which are generated and processed in the Alignment, Bam sieving and Strandedness steps. Gene abundances require trimmed fastqs, and are therefore not influenced by the aforementioned steps.

Gene counts (with HTSeq/featureCounts)

Gene counts are obtained from the filtered BAM files using either HTSeq or featureCounts (default HTSeq). These counts are then combined into a count matrix per assembly for use in downstream analyses.

Gene abundances (with Salmon)

Gene abundances can be estimated using Salmon. Reads are aligned against the transcriptome to obtain transcript abundances (sequence strandedness is inferred automatically by Salmon), then summarized to gene-level using tximeta. The Gene-level counts matrix are output similar to the gene counts method. Additionally, Salmon generates a gene-level TPM matrix and a SingleCellExperiment object which can be opened in R, containing the transcript- and gene-level summaries.

Differential gene expression analysis

Seq2science outputs gene counts matrices for each assembly. Additionally, it can perform differential expression analysis automatically. See the Differential gene/peak analysis page for more information!

Differential transcript usage

Quantifying with Salmon, the transcript-level summaries in the SingleCellExperiment object should be usable for differential transcript analysis with DEXseq, as described in this vignette.

Differential exon usage

Differential exon analysis by DEXseq can be automatically prepared by setting dexseq: True in the config.yaml. This will let seq2science to output an exon counts matrix per assembly, which can be loaded directly into DEXSeqDataSet().

Note: this utilizes scripts implemented by DEXseq, which are built for Ensembl genomes.

Trackhub

A UCSC compatible trackhub can be generated for this workflow. See the trackhub page for more information!

Filling out the samples.tsv

Before running a workflow you will have to specify which samples you want to run the workflow on. Each workflow starts with a samples.tsv as an example, and you should adapt it to your specific needs. As an example, the samples.tsv could look something like this:

sample    assembly    technical_replicates    descriptive_name
GSM123    GRCh38      heart_1      heart_merged        GSM234
GSM321    GRCh38      heart_1      heart_merged        GSM234
GSMabc    GRCh38      heart_2      heart_not_merged    GSM234
GSMxzy    danRer11    stage_8      stage_8             GSM234
GSM890    danRer11    stage_9      stage_9             GSM234

Sample column

If you use the pipeline on public data this should be the name of the accession (e.g. GSM2837484). Accepted formats start with “GSM”, “SRR”, “SRX”, “DRR”, “DRX”, “ERR” or “ERX”.

If you use the pipeline on local data this should be the basename of the file without the extension(s). For example:

/home/user/myfastqs/sample1.fastq.gz ——-> sample1 for single-ended data
/home/user/myfastqs/sample2_R1.fastq.gz ┬> sample2 for paired-ended data
/home/user/myfastqs/sample2_R2.fastq.gz ┘

For local data, some fastq files may have slightly different naming formats. For instance, Illumina may produce a sample named sample3_S1_L001_R1_001.fastq.gz (and the R2 fastq). Seq2science will attempt to recognize these files based on the sample name sample3.

For both local and public data, identifiers used to recognize fastq files are the fastq read extensions (R1 and R2 by default) and the fastq suffix (fastq by default). The directory where seq2science will store (or look for) fastqs is determined by the fastq_dir config option. In the example above, the fastq_dir should be set to /home/user/myfastqs. These setting can be changed in the config.yaml.

Assembly column

Here you simply add the name of the assembly you want your samples aligned against and the workflow will download it for you.

Descriptive_name column

The descriptive_name column is used for the trackhub and multiqc report. In the trackhub your tracks will be called after the descriptive name, and in the multiqc report there will be a button to rename your samples after this column. The descriptive name can not contain ‘-’ characters, but underscores ‘_’ are allowed.

technical_replicates column

Technical replicates, or any fastq file you may wish to merge on the fastq level, are set using the technical_replicates column in the samples.tsv file. All samples with the same name in the technical_replicates column will be concatenated into one file with the replicate name.

Example samples.tsv utilizing replicate merging:

sample    assembly    technical_replicates
GSM123    GRCh38      heart
GSMabc    GRCh38      heart
GSMxzy    GRCh38      stage8
GSM890    GRCh38

Using this file in the alignment workflow will output heart.bam, stage8.bam and GSM890.bam. The MultiQC will inform you of the trimming steps performed on all samples, and subsequent information of the ‘replicate’ files (of which only heart is merged).

Note: If you are working with multiple assemblies in one workflow, replicate names have to be unique between assemblies (you will receive a warning if names overlap).

keep

Replicate merging is turned on by default. It can be turned off by setting technical_replicates in the config.yaml to keep.

Colors column

If you are producing a UCSC trackhub, seq2science will assign an alternating color gradient to your samples to distinguish them. You can optionally specify the colors of each track by adding this column. Colors can be added by name (google “matplotlib colors” for the options), or RGB values. Empty fields are considered black.

Filling out the config.yaml

Every workflow has many configurable options, and can be set in the config.yaml file. In each config.yaml we highlighted a couple options that we think are relevant for that specific workflow, and set (we think) reasonable default values.

When a workflow starts it prints the configuration variables influencing the workflow, and (almost) all these values can be added in the config.yaml and changed to your liking. You can see the complete set of configurable options in the extensive docs.

Best practices

Genome assembly and gene annotation

The choice of genome assembly and gene annotation is of significant influence to the downstream analysis. To explore available options, you could use genomepy, which comes installed in the seq2science conda environment. Installing the desired genome and gene annotation in the {genomes_dir} will cause seq2science to use these files.

If the genome/annotation is missing from the {genomes_dir}, seq2science will attempt to download the named assembly from Ensembl, UCSC and the NCBI (in that order).

Aligners: STAR vs HISAT2

Both aligners have been found to perform well. Selection should be dependent on familiarity and configuration options.

Quantifier: counts vs quantification

Both methods have been found valid for differential gene expression analysis, although results vary somewhat. Caution is advised for the genes found by only one of the two methods.

Genome/Aligner/Quantifier

The most significant choice to be made is the genome assembly and gene annotation.

Reviewing the results

Unless configured not to, Seq2science makes several assumptions on your data which may be incorrect:

that strandedness of each sample can be determined automatically.
that duplicate reads are mostly caused by natural overexpression, not by library artifacts.

These assumptions can be tested by inspecting the MultiQC. Should the results be disappointing, they can be overwritten using the strandedness column in the samples.tsv and the markduplicates variable in the config.yaml respectively.