Per rule explanation

This is an automatically generated list of all supported rules, with their docstrings, commands, and virtual environments. At the start of each workflow run a list is printed of which rules will be run, and while the workflow is running it reports which rules are started and finished. This page explains to users what each rule does, and lets developers see what is and isn’t supported yet.

bam2cram

Convert a bam file to the more compressed cram format.

samtools view -T {input.genome} -C {input.bam} -@ {threads} > {output} 2> {log}
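
Note that cram is only readable with the reference at hand; a minimal sketch of reading the result back (file names hypothetical):

samtools view -T genome.fa sample.cram | head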

bam2cram environment.

bam_bigwig

Convert a bam file into a bigwig file (or two). Can output strand-specific bigwigs. The reverse-strand bigwig is created “secretly” to avoid checkpoints.

if [[ "{params.strandedness}" == "no" ]]; then
  echo "generating unstranded bam" > {log}
  bamCoverage --bam {input.bam} --outFileName {output} --numberOfProcessors {threads} {params.flags} --verbose >> {log} 2>&1
else
  echo "generating stranded bams" > {log}
  out={output}
  rev=$(echo "${{out/{wildcards.sorting}.bw/{wildcards.sorting}.rev.bw}}")
  rm -f $rev  # in case of a rerun
  bamCoverage --bam {input.bam} --outFileName $rev     --filterRNAstrand reverse --numberOfProcessors {threads} {params.flags} --verbose >> {log} 2>&1
  bamCoverage --bam {input.bam} --outFileName {output} --filterRNAstrand forward --numberOfProcessors {threads} {params.flags} --verbose >> {log} 2>&1
fi

bam_bigwig environment.

bedgraph_bigwig

Convert a bedgraph file into a bigwig.

awk -v OFS='\\t' '{{print $1, $2, $3, $4}}' {input.bedgraph} | sed '/experimental/d' |
bedSort /dev/stdin {output.tmp} > {log} 2>&1
bedGraphToBigWig {output.tmp} {input.sizes} {output.out} >> {log} 2>&1

bedgraph_bigwig environment.

bedgraphish_to_bedgraph

Convert the bedgraph-ish file generated by genrich into a true bedgraph file.

splits=$(grep -Pno "([^\/]*)(?=\.bam)" {input})
splits=($splits)
lncnt=$(wc -l {input} | echo "$(grep -Po ".*(?=\ )")":)
splits+=($lncnt)

counter=1
for split in "${{splits[@]::${{#splits[@]}}-1}}";
do
    filename=$(grep -Po "(?<=:).*" <<< $split);
    if [[ $filename =~ {wildcards.sample} ]]; then
        startnr=$(grep -Po ".*(?=\:)" <<< $split);
        endnr=$(grep -Po ".*(?=\:)" <<< ${{splits[counter]}});

        lines="NR>=startnr && NR<=endnr {{ print \$1, \$2, \$3, \$4 }}"
        lines=${{lines/startnr/$((startnr + 2))}}
        lines=${{lines/endnr/$((endnr - 1))}}

        awk "$lines" {input} > {output}
    fi
    ((counter++))
done

bedtools_slop

After combine_peaks we end up with just a bed file of summits. We extend all peaks to the same width, for a fair comparison between peaks (e.g. extending each 1 bp summit by 100 bp on both sides yields uniform 201 bp peaks).

bedtools slop -i {input.bedfile} -g {input.sizes} -b {params.slop} | uniq > {output} 2> {log}

bedtools_slop environment.

blind_clustering

Create a sample distance cluster heatmap, and various correlation cluster heatmaps.

A (blind) variance stabilizing transformation is applied to the counts, to reduce the impact of genes with low expression levels.

bowtie2_align

Align reads against a genome (index) with bowtie2, and pipe the output to the required sorter(s).

bowtie2 {params.params} --threads {threads} -x {input.index}/{wildcards.assembly} {params.input} 2> {log} | tee {output} 1> /dev/null 2>> {log}

bowtie2_align environment.

bowtie2_index

Make a genome index for bowtie2. This index is required for alignment.

mkdir -p {output}

bowtie2-build {params} --threads {threads} {input} {output}/{wildcards.assembly} > {log} 2>&1

bowtie2_index environment.

bwa_index

Make a genome index for bwa (mem). This index is required for alignment.

mkdir -p {output}

bwa index -p {params.prefix} {params.params} {input} > {log} 2>&1

bwa_index environment.

bwa_mem

Align reads against a genome (index) with bwa-mem, and pipe the output to the required sorter(s).

bwa mem {params.params} -t {threads} {params.index_dir} {input.reads} 2> {log} | tee {output} 1> /dev/null 2>> {log}

bwa_mem environment.

bwa_mem2

Align reads against a genome (index) with bwa-mem2, and pipe the output to the required sorter(s).

bwa-mem2 mem {params.params} -t {threads} {params.index_dir} {input.reads} 2> {log} | tee {output} 1> /dev/null 2>> {log}

bwa_mem2 environment.

bwa_mem2_index

Make a genome index for bwa-mem2. This index is required for alignment.

mkdir -p {output}

bwa-mem2 index -p {params.prefix} {input} > {log} 2>&1

bwa_mem2_index environment.

call_peak_genrich

Call peaks with genrich based on the pileup.

Genrich -P -f {input.log} -o {output.narrowpeak} {params} -v > {log} 2>&1

call_peak_genrich environment.

chipseeker

Make chipseeker plots, showing e.g. the percentage of peaks in promoters.

citeseqcount

ADT mapping and quantification with CITE-seq-Count, outputting a UMI/read matrix per sample.

CITE-seq-Count -R1 {input.reads[0]} -R2 {input.reads[1]} \
-t {input.tags} {params.barcode_arg} -T {threads} \
{params.options} -o {params.outdir} > {log} 2>&1

citeseqcount environment.

combine_biological_reps

Combine biological replicates by taking their mean!

combine_peaks

Uses gimmemotifs’ combine_peaks to “combine” peaks. This finds all peaks close together and takes the most significant one as the true peak.

combine_peaks --genome {input.genome} --window {params.windowsize} \
{input.summitfiles} > {output} 2> {log}

combine_peaks environment.

complement_blacklist

Take the complement of the blacklist. We need this complement to tell samtools to only keep reads that are in the complement of the blacklist.

sortBed -faidx {input.sizes} -i {input.blacklist} 2>> {log} |
complementBed -i stdin -g {input.sizes} > {output} 2>> {log}

complement_blacklist environment.

computeMatrix_gene

Pre-compute a score matrix of bigwig signal over genes with deeptools, used for the gene profile plot.

computeMatrix scale-regions -S {input.bw} {params.labels} -R {params.gtf} \
-p {threads} {params.params} -o {output} > {log} 2>&1

computeMatrix_gene environment.

computeMatrix_peak

Pre-compute a score matrix of bigwig signal over peaks with deeptools, used for the peak heatmap.

computeMatrix scale-regions -S {input.bw} {params.labels} -R {input.peaks} \
-p {threads} --binSize 5 -o {output} > {log} 2>&1

computeMatrix_peak environment.

count_matrix

Combine count tables into one count matrix per assembly.

coverage_table

Use gimmemotifs’ coverage_table to generate a count table with the number of reads under each peak per sample.

echo "# The number of reads under each peak" > {output}
coverage_table {input.peaks} {input.replicates} --window {params.peak_width} --nthreads {threads} \
2> {log} | grep -vE "^#" 2>> {log} |
awk 'BEGIN {{ FS = "@" }} NR==1{{gsub("{wildcards.assembly}-|.samtools-coordinate","",$0)}}; \
{{print $0}}' >> {output}

# overwrite sample names with descriptive/replicate names
sed -i "2s/.*/{params.names}/" {output}

coverage_table environment.

create_SNAP_object

Create a snapobject for each BAM file.

These snapobjects can be merged later using snaptools in R.

snaptools snap-pre --input-file={input.bam} --output-snap={output} --genome-name={wildcards.assembly} \
--genome-size={input.sizes} {params.params} {params.chrm} {params.mapq} > {log} 2>&1

create_SNAP_object environment.

create_bins_SNAP_object

Add a binned genome matrix to the SNAPobject, after which it is renamed and moved to the Snapfiles folder for downstream analysis in R using snaptools.

snaptools snap-add-bmat --snap-file={input} {params} > {log} 2>&1
echo "bmat added, moving file" >> {log}
mv {input} {output} >> {log} 2>&1

create_bins_SNAP_object environment.

cytoband

Generate a cytoband track for each assembly.

source: http://genomewiki.ucsc.edu/index.php/Assembly_Hubs#Cytoband_Track

cat {input.sizes} |
bedSort /dev/stdin /dev/stdout |
awk '{{print $1,0,$2,$1,"gneg"}}' > {output.cytoband_bd}

bedToBigBed -type=bed4 {output.cytoband_bd} -as={params.schema} \
{input.sizes} {output.cytoband_bb} >> {log} 2>&1

cytoband environment.

deseq2

Differential gene expression analysis with DESeq2.

dexseq_count

Count exon usage.

current_conda_env=$(conda env list | grep \* | cut -d "*" -f2-)
DEXseq_path=${{current_conda_env}}/lib/R/library/DEXSeq/python_scripts

python ${{DEXseq_path}}/dexseq_count.py -f bam -r pos {params.endedness} -s {params.strandedness} {input.gff} {input.bam} {output} > {log} 2>&1

dexseq_count environment.

dexseq_count_matrix

Combine DEXSeq counts into one count matrix per assembly, for use in the function DEXSeqDataSet().

dupRadar

Visualize the fraction of artifactual read duplication versus normal read duplication (due to natural over-sequencing of highly expressed genes).

dupRadar_combine

Combine the individual images (so we can group them nicely in the MultiQC report).

convert {params.good_example} {params.bad_example} {input} -append {output} 2> {log}

dupRadar_combine environment.

edgeR_normalization

edgeR supports three different types of normalization: TMM, RLE, and upperquartile.

TMM is the weighted trimmed mean of M-values proposed by Robinson and Oshlack (2010). RLE is the scaling factor method proposed by Anders and Huber (2010); DESeq2 normalisation is based on this method. upperquartile is the upper-quartile normalization method of Bullard et al (2010).

ena2fastq_PE

Download paired-end fastq files directly from the ENA.

ena2fastq_SE

Download single-end fastq files directly from the ENA.

export_sce_obj

Read scRNA UMI counts into a SingleCellExperiment object, add colData and export to RData format.

extend_genome

Append the given file(s) to the genome.

# extend the genome.fa
cp {input.genome} {output.genome}

for FILE in {input.extension}; do
    cat $FILE >> {output.genome}
done

extend_genome_annotation

Append the given file(s) to the genome annotation.

# extend the genome.annotation.gtf
cp {input.gtf} {output.gtf}

for FILE in {input.extension}; do
    cat $FILE >> {output.gtf}
done

# generate an extended genome.annotation.bed
gtfToGenePred {output.gtf} {output.gp}
genePredToBed {output.gp} {output.bed}

extend_genome_blacklist

Copy the blacklist to the custom genome directory.

cp {input} {output}

fastp_PE

Automated adapter detection, adapter trimming, and quality trimming through fastp (paired-end).

threads=$(( {threads} - 2  > 1 ? {threads} - 2 : 1 ))

fastp -w $threads --in1 {input[0]} --in2 {input[1]} \
--out1 {output.r1} --out2 {output.r2} -h {output.qc_html} -j {output.qc_json} \
{params.config} > {log} 2>&1

fastp_PE environment.

fastp_SE

Automated adapter detection, adapter trimming, and quality trimming through fastp (single-end).

threads=$(( {threads} - 2  > 1 ? {threads} - 2 : 1 ))

fastp -w $threads --in1 {input} --out1 {output.se} -h {output.qc_html} -j {output.qc_json} \
{params.config} > {log} 2>&1

fastp_SE environment.

fastp_qc_PE

Get quality scores for (technical) replicates.

fastp -w {threads} --in1 {input[0]} --in2 {input[1]} \
-h {output.qc_html} -j {output.qc_json} \
{params.config} > {log} 2>&1

fastp_qc_PE environment.

fastp_qc_SE

Get quality scores for (technical) replicates.

fastp -w {threads} --in1 {input} -h {output.qc_html} -j {output.qc_json} > {log} 2>&1

fastp_qc_SE environment.

fastq_pair

fastq_pair re-writes paired-end fastq files to ensure that each read has a mate, and discards singleton reads. This step is required after scRNA trimming, since we only trim the fastq containing the reads, not the barcode fastq.

gunzip -c {input.r1} > {output.intermediates1[0]} 2> {log}
gunzip -c {input.r2} > {output.intermediates1[1]} 2>> {log}
if [ {params.tused} == true ]
then
  opts="{params.options}"
else
  echo "\nsetting parameter t with the number of reads in the fastq\n" >> {log}
  opts="-p -t "$(wc -l {input.r1} | grep -Po '^\d+' | awk '{{print int($1/4)}}')
fi
fastq_pair $opts {output.intermediates1} >> {log} 2>&1
gzip -c {output.intermediates3[0]} > {output.reads[0]} 2> {log}
gzip -c {output.intermediates3[1]} > {output.reads[1]} 2>> {log}

fastq_pair environment.

fastqc

Generate quality control report for fastq files.

fastqc {input} -O {params} > {log} 2>&1

fastqc environment.

featureCounts

Use featureCounts to generate the fraction of reads in peaks score (FRiP/assigned reads). See: https://www.biostars.org/p/337872/ and https://www.biostars.org/p/228636/

# Make a custom "SAF" file which featureCounts needs:
awk 'BEGIN{{FS=OFS="\t"; print "GeneID\tChr\tStart\tEnd\tStrand"}}{{print $4, $1, $2+1, $3, "."}}' \
{input.peak} 1> {output.tmp_saf} 2> {log}

# run featureCounts
featureCounts -T {threads} -p -a {output.tmp_saf} -F SAF -o {output.real_out} {input.bam} > {log} 2>&1

# move the summary to qc directory
mv $(dirname {output.real_out})/$(basename {output.summary}) {output.summary}

featureCounts environment.

featurecounts

Summarize reads to gene level. Outputs a counts table per bam file.

featureCounts -a {input.gtf} {input.bam} {params.endedness} -s {params.strandedness} {params.user_flags} -T {threads} -o {output} > {log} 2>&1

featurecounts environment.

full_decoy_transcripts

Use the entire genome as decoy sequence.

cut -f 1 {input.sizes} > {output.decoys}
cat {input.transcripts} {input.genome} > {output.gentrome}

gcPercent

Generate a GC content track.

source: http://hgdownload.cse.ucsc.edu/goldenPath/hg19/gc5Base/

hgGcPercent -wigOut -doGaps -file=stdout -win=5 -verbose=0 {wildcards.assembly} {input.twobit} \
    | gzip > {output.gcpcvar}

wigToBigWig {output.gcpcvar} {input.sizes} {output.gcpc} >> {log} 2>&1

gcPercent environment.

gene_id2name

Parse the gtf file to generate a gene_id to gene_name conversion table.
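
The conversion happens in a script, but the essence can be sketched in shell (file names hypothetical; assumes every gene line carries both a gene_id and a gene_name attribute):

awk -F'\t' '$3 == "gene"' annotation.gtf |
grep -oP 'gene_id "\K[^"]+|gene_name "\K[^"]+' |
paste - - > gene_id2name.tsv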

genrich_pileup

Generate the pileup. We do this separately from peak-calling since these two processes have a very different computational footprint.

input=$(echo {input.reps} | tr ' ' ',')
Genrich -X -t $input -f {output.log} {params.control} -k {output.bedgraphish} \
{params.params} -v > {log} 2>&1

genrich_pileup environment.

get_effective_genome_size

Get the effective genome size for a kmer length. Some tools (e.g. macs2) require an estimation of the effective genome size to better estimate how (un)likely it is to have a certain number of reads on a position. The actual genome size is not the best indication in these cases, since reads in repetitive regions (for a certain kmer length) cannot possibly align to some locations.

unique-kmers.py {input} -k {wildcards.kmer_size} --quiet 2>&1 | grep -P -o '(?<=\.fa: ).*' > {output} 2> {log}
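
As an illustration (file name and output value hypothetical), for 100 bp reads this boils down to:

unique-kmers.py GRCh38.fa -k 100 --quiet
# prints a line like "GRCh38.fa: 2745186602"; the grep keeps only the number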

get_effective_genome_size environment.

get_genome

Download a genome through genomepy.
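
Outside the workflow, the equivalent genomepy call looks roughly like this (assembly and provider are examples; flags may differ between genomepy versions):

genomepy install GRCh38 --provider Ensembl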

get_genome_annotation

Download a gene annotation through genomepy.

get_genome_blacklist

Download a genome blacklist for a genome, if it exists, through genomepy.

get_genome_support_files

Generate supporting files for a genome.

get_transcripts

Generate transcripts.fasta using gffread.

Requires genome.fa and annotation.gtf (with matching chromosome/scaffold names).

gffread -w {output} -g {input.fa} {input.gtf} > {log} 2>&1

get_transcripts environment.

gimme_maelstrom

Gimme maelstrom is a method to infer differential motifs between two or more biological replicates.

NEW_CACHE={output.cache}
mkdir -p $NEW_CACHE
if [ -z ${{XDG_CACHE_HOME+x}} ]; then
    export XDG_CACHE_HOME=$HOME/.cache
fi
if [ -d $XDG_CACHE_HOME/gimmemotifs ]; then
    cp -r $XDG_CACHE_HOME/gimmemotifs $NEW_CACHE/
fi
export XDG_CACHE_HOME=$NEW_CACHE

gimme_maelstrom environment.

gsa_or_encode2fastq_PE

Download paired-end fastq files from the GSA or ENCODE DCC.

gsa_or_encode2fastq_SE

Download single-end fastq files from the GSA or ENCODE DCC.

hisat2_align

Align reads against a genome (index) with hisat2, and pipe the output to the required sorter(s).

hisat2 {params.params} --threads {threads} -x {input.index}/part {params.input} 2> {log} | tee {output} 1> /dev/null 2>> {log}

hisat2_align environment.

hisat2_index

Make a genome index for hisat2. This index is required for alignment.

mkdir -p {output}

hisat2-build {params} -p {threads} {input} {output}/part > {log} 2>&1

hisat2_index environment.

hisat2_splice_aware_index

Make an exon-junction and splice aware index for hisat2. This index is required for alignment and quantification of RNA-seq data.

mkdir -p {output}

hp=$(which hisat2)
python3 ${{hp}}_extract_splice_sites.py {input.gtf} > {output}/splice_sites.tsv 2>> {log}
python3 ${{hp}}_extract_exons.py {input.gtf} > {output}/exons.tsv 2>> {log}

hisat2-build {params} -p {threads} --ss {output}/splice_sites.tsv --exon {output}/exons.tsv \
{input.fasta} {output}/part >> {log} 2>&1

hisat2_splice_aware_index environment.

hmmratac

Call gappedpeaks with HMMRATAC.

HMMRATAC --bedgraph true -o {params.basename} {params.hmmratac_params} -Xmx22G -b {input.bam} -i {input.bam_index} -g {input.genome_size} > {log} 2>&1

hmmratac environment.

hmmratac_genome_info

Generate the ‘genome info’ that hmmratac requires for peak calling. https://github.com/LiuLabUB/HMMRATAC/issues/17

TODO: isn’t this just .fa.sizes?

samtools view -H {input} 2>  {log} | grep SQ 2>> {log} | cut -f 2-3 2>> {log} | cut -d ':' -f 2   2>> {log} | cut -f 1        > {output.tmp1} 2> {log}
samtools view -H {input} 2>> {log} | grep SQ 2>> {log} | cut -f 2-3 2>> {log} | cut -d ':' -f 2,3 2>> {log} | cut -d ':' -f 2 > {output.tmp2} 2> {log}
paste {output.tmp1} {output.tmp2} > {output.out} 2> {log}

hmmratac_genome_info environment.

htseq_count

Summarize reads to gene level. Outputs a counts table per bam file.

htseq-count {input.bam} {input.gtf} -r pos -s {params.strandedness} {params.user_flags} -n {threads} -c {output} > {log} 2>&1

htseq_count environment.

idr

Combine replicates based on the irreproducible discovery rate (IDR). Can only handle two replicates. For more than two replicates use Fisher’s method.

When combining narrowpeak files with IDR, the q-score and summit are set to -1 (meaning: not set). However, some downstream tools require these to be set, so we set the q-score to zero and place the summit of the peak in the middle of the peak.

if [ "{params.nr_reps}" == "1" ]; then
    cp {input} {output.true}
    touch {output.temp}
else
    idr --samples {input} {params.rank} --output-file {output.temp} {params.options} > {log} 2>&1
    if [ "{wildcards.ftype}" == "narrowPeak" ]; then
        awk 'IFS="\t",OFS="\t" {{$9=0; $10=int(($3 - $2) / 2); print}}' {output.temp} > {output.true} 2>> {log}
    else
        cp {output.temp} {output.true}
    fi
fi

idr environment.

infer_strandedness

Use RSeQC’s infer_experiment.py to determine the strandedness of a sample.

infer_experiment.py -i {input.bam} -r {input.bed} -q {params} 2> {log} | awk NF > {output}

infer_strandedness environment.

insert_size_metrics

Get insert size metrics from a (paired-end) bam. These metrics are then used by MultiQC in the report.

picard CollectInsertSizeMetrics INPUT={input} \
OUTPUT={output.tsv} H={output.pdf} > {log} 2>&1

insert_size_metrics environment.

kallistobus_count

Align reads against a transcriptome (index) with kallistobus and output a quantification file per sample.

kb count \
-i {params.basename}.idx \
-t {threads} -g {params.basename}_t2g.txt \
-o {params.outdir} {params.c1} {params.c2} \
{params.barcode_arg} {params.options} {input.reads} > {log} 2>&1
# Validate output
if grep -q 'ERROR\|bad_alloc' "{log}"; then
  exit 1
fi

kallistobus_count environment.

kallistobus_ref

Make a genome index for kallistobus. This index is required for counting.

mkdir -p {output}
kb ref \
{input.fa} {input.gtf} \
-i {params.basename}.idx -g {params.basename}_t2g.txt -f1 {params.basename}_cdna.fa \
{params.f2} {params.c1} {params.c2} \
{params.options} > {log} 2>&1

kallistobus_ref environment.

kallistobus_ref_kite

Make a mismatch index for kallistobus. This index is required to count feature barcodes, such as antibody tags.

mkdir -p {output}
kb ref  \
{input.featurebarcodes} \
{params.options} \
-i {params.basename}.idx -g {params.basename}_t2g.txt -f1 {params.basename}_cdna.fa > {log} 2>&1

kallistobus_ref_kite environment.

keep_mates

In-house script that, after alignment, removes the information that reads are paired. This can be beneficial for peak calling with macs2 when shifting and extending reads, since macs2 then only keeps the first mate in each pair.

linked_txome

Generate a linked transcriptome for tximeta.

Also creates a symlink to the gtf in an Ensembl format (required by tximeta).

Required for converting salmon output (estimated transcript abundances) to gene counts.

log_normalization

Log1p normalization of a count table (i.e. log(1 + count)).

macs2_callpeak

Call peaks using macs2. Macs2 requires a genome size, which is estimated from the number of unique kmers of the average read length.

GENSIZE=$(cat {input.genome_size})

# call peaks
macs2 callpeak --bdg -t {input.bam} {params.control} --outdir {config[result_dir]}/macs2/ -n {wildcards.assembly}-{wildcards.sample} \
{params.macs_params} -g $GENSIZE -f {params.format} >> {log} 2>&1

macs2_callpeak environment.

macs_bdgcmp

Prepare p-value files for rule macs_cmbreps.

macs2 bdgcmp -t {input.treatment} -c {input.control} -m qpois -o {output} > {log} 2>&1

macs_bdgcmp environment.

macs_cmbreps

Combine biological replicates through Fisher’s method.

(Link original peakfile in replicate_processed if there is only 1 sample for a condition)

if [ "{params.nr_reps}" == "1" ]; then
    touch {output.tmpbdg} {output.tmppeaks}
    mkdir -p $(dirname {output.peaks}); cp {input.treatment} {output.peaks}
else
    macs2 cmbreps -i {input.bdgcmp} -o {output.tmpbdg} -m fisher > {log} 2>&1
    macs2 {params.function} {params.config} -i {output.tmpbdg} -o {output.tmppeaks} >> {log} 2>&1
    cat {output.tmppeaks} | tail -n +2 > {output.peaks}
fi

macs_cmbreps environment.

maelstrom_report_preparation

This rule injects the symlinked motif logos into the html so they get rendered in the multiqc report.

mark_duplicates

Mark all duplicate reads in a bam file with picard MarkDuplicates.

picard MarkDuplicates TMP_DIR={resources.tmpdir} {params} INPUT={input} \
OUTPUT={output.bam} METRICS_FILE={output.metrics} > {log} 2>&1

mark_duplicates environment.

mean_center

Mean centering of a count table.

merge_replicates

Merge replicates (fastqs) simply by concatenating the files. We also change the name of the read headers to contain the name of the original replicate.

Must happen after trimming due to trim-galore’s automatic adapter trimming method.

If a replicate has only 1 sample in it, simply move the file.
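
Conceptually the merge looks like this sketch (hypothetical names, not the actual rule):

for rep in rep1 rep2; do
    zcat ${rep}.fastq.gz |
    # tag each read header (every 4th line) with the replicate name
    awk -v rep="$rep" '{ if (NR % 4 == 1) sub(/^@/, "@" rep ":"); print }'
done | gzip > merged_sample.fastq.gz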

merge_volcano_ma

Combine the volcano plot and MA plot resulting from the deseq2 rule into a single figure.

convert {input.maplot} {input.volcanoplot} +append {output} 2> {log}

merge_volcano_ma environment.

minimap2_align

Align reads against a genome (index) with minimap2, and pipe the output to the required sorter(s).

minimap2 -t {threads} -a {input.index} {input.reads} {params} > {output} 2> {log}

minimap2_align environment.

minimap2_index

Make a genome index for minimap2. This index is required for alignment.

mkdir -p $(dirname {output})

minimap2 -t {threads} -d {output} {input} {params} > {log} 2>&1

minimap2_index environment.

motif2factors

Create a gimme pfm/motif2factor file with the gimme motif2factors command. For human/mouse it just uses the default database; for other species it is based on TF orthologs.

mt_nuc_ratio_calculator

Estimate the amount of nuclear and mitochondrial reads in a sample. This metric is especially important in ATAC-seq experiments where mitochondrial DNA can be overrepresented. Reads are classified as mitochondrial reads if they map against either “chrM” or “MT”.

These values are aggregated and displayed in the MultiQC report.

mtnucratio {input.bam} {params.mitochondria}

mt_nuc_ratio_calculator environment.

multiBamSummary

Pre-compute a bam summary with deeptools.

multiBamSummary bins --bamfiles {input.bams} -out {output} {params.names} \
{params.params} -p {threads} > {log} 2>&1

multiBamSummary environment.

multiqc

Aggregate all the quality control metrics for every sample into a single multiqc report.

The input list can get very long (causing problems with the shell), so we write the inputs to a file and then use that file as input to multiqc (see the sketch after the command).

multiqc {input.files} -o {params.dir} -n multiqc_{wildcards.assembly}.html \
--config {input.schema}                                                    \
--config {input.header}                                                    \
--sample-names {input.sample_names}                                        \
{params.filter_buttons}                                                    \
--cl-config "extra_fn_clean_exts: [                                        \
    {{'pattern': ^.*{wildcards.assembly}-, 'type': 'regex'}},              \
    {{'pattern': {params.fqext1},          'type': 'regex'}},              \
    {{'pattern': _allsizes,                'type': 'regex'}},              \
    ]" > {log} 2>&1

multiqc environment.

multiqc_explain

Generate rename buttons for the multiqc report.

multiqc_filter_buttons

Generate filter buttons.

multiqc_header_info

Generate a multiqc header file with contact info and date of multiqc generation.

multiqc_samplesconfig

Add a section to the multiqc report that shows the samples.tsv and config.yaml used.

multiqc_schema

Generate a multiqc config schema. Used for the ordering of modules.

narrowpeak_summit

Convert a narrowpeak file to a “macs2 summits” file.

awk 'BEGIN {{OFS="\t"}} {{ print $1,$2+$10,$2+$10+1,$4,$9; }}' {input} > {output} 2> {log}

onehot_peaks

Get one-hot encodings of which peaks are found in which samples.

awk '{{print $1":"$2"-"$3}}' {input.combinedpeaks} > {output.real} 2> {log}

for brep in {input.narrowpeaks}
do
    bedtools sort -i $brep 2> {log} |
    bedtools intersect -a {input.combinedpeaks} -b stdin -c 2> {log} |
    awk '{{print $4}}' 2> {log} |
    paste {output.real} - > {output.tmp}
    mv {output.tmp} {output.real}
done

echo -e "# onehot encoding of which condition contains which peaks\\n{params.names}\n$(cat {output.real})" > {output.tmp} && cp {output.tmp} {output.real}

onehot_peaks environment.

partial_decoy_transcripts

Compute a set of decoy sequences by mapping the annotated transcripts against a hard-masked version of the organism’s genome.

sh {params.script} -j {threads} -g {input.genome} -a {input.gtf} -t {input.transcripts} -o $(dirname {output[0]}) > {log} 2>&1

partial_decoy_transcripts environment.

peak_bigpeak

Convert a narrowpeak file into a bignarrowpeak file. https://genome-source.gi.ucsc.edu/gitlist/kent.git/tree/master/src/hg/lib/

# keep first 10 columns, idr adds extra columns we do not need for our bigpeak
cut -d$'\t' -f 1-{params.columns} {input.narrowpeak} |
awk -v OFS="\t" '{{$5=$5>1000?1000:$5}} {{print}}' |
bedSort /dev/stdin {output.tmp} > {log} 2>&1

bedToBigBed -type={params.type} -as={params.schema} {output.tmp} \
{input.sizes} {output.out} >> {log} 2>&1

peak_bigpeak environment.

plotCorrelation

Calculate the correlation between bams with deepTools.

plotCorrelation --corData {input.cordata} --plotFile {output} -c {wildcards.method} \
-p heatmap --plotTitle {params.title} {params.params} > {log} 2>&1

plotCorrelation environment.

plotFingerprint

Plot the “fingerprint” of your bams, using deeptools.

plotFingerprint -b {input.bams} {params} --outRawCounts {output} -p {threads} > {log} 2>&1

plotFingerprint environment.

plotHeatmap_peak

Plot the peak heatmap using deepTools.

plotHeatmap --matrixFile {input} --outFileName {output.img} {params.params} --startLabel="-{params.slop}" --endLabel={params.slop} > {log} 2>&1

plotHeatmap_peak environment.

plotPCA

Plot a PCA between bams using deepTools.

plotPCA --corData {input} --outFileNameData {output} > {log} 2>&1

plotPCA environment.

plotProfile_gene

Plot the gene profile using deeptools.

plotProfile -m {input} --outFileName {output.img} --outFileNameData {output.file} > {log} 2>&1

plotProfile_gene environment.

prepare_DEXseq_annotation

Generate a DEXSeq annotation.gff from the annotation.gtf.

current_conda_env=$(conda env list | grep \* | cut -d "*" -f2-)
DEXseq_path=${{current_conda_env}}/lib/R/library/DEXSeq/python_scripts

python ${{DEXseq_path}}/dexseq_prepare_annotation.py {input} {output} > {log} 2>&1

prepare_DEXseq_annotation environment.

pytxi_count_matrix

Convert transcript abundance estimations to gene count estimations and merge gene counts per assembly.

If the GTF is not used, a taxonomy ID is required for myGene.info. Seq2science will try to read this ID from the genomepy README.txt or assembly database.

quantile_normalization

Quantile normalization is a type of normalization that makes the distribution between samples identical. This means that the actual count distribution within a sample changes: for example, the highest count in each sample is replaced by the mean of the highest counts across all samples, the second-highest by the mean of the second-highest counts, and so on. After quantile normalization, samples are CPM normalized (which at that point has no effect between or within samples), for a fair comparison between normalisation techniques.

See wikipedia: https://en.wikipedia.org/wiki/Quantile_normalization

random_subset_peaks

Take a random subset of all summits, and make them all the same width.

shuf -n {wildcards.nrpeaks} {input.peaks} | bedtools slop -i stdin -g {input.sizes} -b {params.slop} > {output}

random_subset_peaks environment.

run2sra

Download the SRA of a sample by its unique identifier.

# move to output dir since somehow prefetch sometimes puts files in the cwd...
# and remove the top level folder since prefetch will assume we are done otherwise
mkdir -p {params.outdir}; cd {params.outdir}; rm -r {wildcards.run}

# TODO: for loop can be removed if we dont see the debug message
# three attempts
for i in {{1..3}}
do
    # acquire a lock
    (
        flock -w 30 200 || continue
        sleep 2
    ) 200>{PYSRADB_CACHE_LOCK}

    # dump
    prefetch --max-size u --output-directory ./ --log-level debug --progress {wildcards.run} \
    >> {log} 2>&1 && break

    # retry
    echo "DEBUG: prefetch try ${{i}} of 3 failed" >> {log}
    sleep 10
done

# TODO: section can be removed if we dont see the debug message
# bug report: https://github.com/ncbi/sra-tools/issues/533
if [[ -f "{params.outdir}/{wildcards.run}.sra" ]]; then
    echo "DEBUG: moving output to correct directory" >> {log}
    mkdir -p {params.outdir}/{wildcards.run}
    mv {params.outdir}/{wildcards.run}.sra {output}
fi

# TODO: section can be removed if we dont see the debug message
# If an sralite file was downloaded instead of a sra file, just rename it
if [[ -f "{params.outdir}/{wildcards.run}.sralite" ]]; then
    echo "DEBUG: renaming SRAlite" >> {log}
    mkdir -p {params.outdir}/{wildcards.run}
    mv {params.outdir}/{wildcards.run}.sralite {output}
fi

run2sra environment.

runs2sample

Concatenate a single run or multiple runs together into one fastq file.
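
Since concatenated gzip streams are themselves valid gzip, this amounts to (hypothetical names):

cat run1.fastq.gz run2.fastq.gz > sample.fastq.gz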

salmon_index

Generate a transcriptome index for Salmon.

mkdir -p {output}

salmon index -t {input[0]} {params.decoys} -i {output} {params.params} \
--threads {threads} > {log} 2>&1

salmon_index environment.

salmon_quant

Align reads against a transcriptome (index) with Salmon (mapping-based mode) and output a quantification file per sample.

salmon quant -i {input.index} -l A {params.input} {params.params} -o $(dirname {output}) \
--threads {threads} > {log} 2>&1

salmon_quant environment.

sam2bam

Convert a file in sam format into the (binary) bam format.

samtools view -b {input} > {output}

sam2bam environment.

sambamba_sort

Sort the result of alignment or sieving with the sambamba sorter.

sambamba view --nthreads {threads} -f bam  {input} -o /dev/stdout  2> {log} |
sambamba sort --nthreads {threads} {params} /dev/stdin -o {output}  2> {log}

sambamba_sort environment.

samtools_index

Create an index of a bam/cram file which can be used for e.g. visualization.

samtools index {params} {input} {output}

samtools_index environment.

samtools_presort

(Pre)sort the result of alignment with the samtools sorter.

# we set this trap to remove temp files when prematurely ending the rule
trap "rm -f {resources.tmpdir}/{wildcards.assembly}-{wildcards.sample}.tmp*bam" INT;
rm -f {resources.tmpdir}/{wildcards.assembly}-{wildcards.sample}.tmp*bam 2> {log}

# RAM per thread in MB
memory=$((1024*{resources.mem_gb}/{threads}))M

samtools sort -@ {threads} -m $memory {input} -o {output} \
-T {resources.tmpdir}/{wildcards.assembly}-{wildcards.sample}.tmp 2> {log}

samtools_presort environment.

samtools_sort

Sort the result of alignment or sieving with the samtools sorter.

# we set this trap to remove temp files when prematurely ending the rule
trap "rm -f {resources.tmpdir}/{wildcards.assembly}-{wildcards.sample}.tmp*bam" INT;
rm -f {resources.tmpdir}/{wildcards.assembly}-{wildcards.sample}.tmp*bam 2> {log}

# RAM per thread in MB
memory=$((1024*{resources.mem_gb}/{threads}))M

samtools sort {params.sort_order} -@ {threads} -m $memory {input} -o {output} \
-T {resources.tmpdir}/{wildcards.assembly}-{wildcards.sample}.tmp 2> {log}

samtools_sort environment.

samtools_sort_allsizes

Sort the allsizes output of sieve_bam. This rule is identical to the samtools_sort rule except that the input must contain _allsizes, and the output is temporary.

# we set this trap to remove temp files when prematurely ending the rule
trap "rm -f {resources.tmpdir}/{wildcards.assembly}-{wildcards.sample}.tmp*bam" INT;
rm -f {resources.tmpdir}/{wildcards.assembly}-{wildcards.sample}.tmp*bam 2> {log}

# RAM per thread in MB
memory=$((1024*{resources.mem_gb}/{threads}))M

samtools sort {params.sort_order} -@ {threads} -m $memory {input} -o {output} \
-T {resources.tmpdir}/{wildcards.assembly}-{wildcards.sample}.tmp 2> {log}

samtools_sort_allsizes environment.

samtools_stats

Get general stats from bam files like percentage mapped.

samtools stats {input} 1> {output} 2> {log}

samtools_stats environment.

sctk_qc

Perform scRNA QC with singleCellTK, store output in SingleCellExperiment object and export to RData format.

setup_blacklist

Combine the encode blacklist with mitochondrial DNA depending on config.
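
A rough sketch of the idea (file names hypothetical): append the mitochondrial chromosome as one full-length region to the blacklist:

cp blacklist.bed combined_blacklist.bed
awk '$1 == "chrM" || $1 == "MT" { print $1 "\t0\t" $2 }' genome.fa.sizes >> combined_blacklist.bed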

sieve_bam

“Sieve” a bam.

This rule does (at least) one of these:

  • filter on minimum mapping quality

  • adjust for the tn5 shift

  • remove multimappers

  • remove reads inside the blacklist

  • remove duplicates

  • filter paired-end reads on transcript length

  • subsample to a maximum number of reads (equal across samples)

samtools view -h {params.prim_align} {params.minqual} -L {input.blacklist} \
{input.bam} | {params.atacshift} {params.sizesieve} {params.subsample}
samtools view -b > {output.bam} 2> {log}

# single-end reads never output allsizes so just touch the file when filtering on size
{params.sizesieve_touch}

sieve_bam environment.

softmask_track_1

Generate a track of all softmasked regions.

source: https://github.com/Gaius-Augustus/MakeHub/blob/master/make_hub.py

softmask_track_2

Generate a track of all softmasked regions.

source: https://github.com/Gaius-Augustus/MakeHub/blob/master/make_hub.py

bedSort {input.mask_unsorted} {output.maskbed} >> {log} 2>&1

bedToBigBed -type=bed3 {output.maskbed} {input.sizes} {output.mask} >> {log} 2>&1

softmask_track_2 environment.

sra2fastq_PE

Downloaded (raw) SRAs are converted to paired-end fastq files. Forward and reverse reads are swapped if the forward/reverse names are not lexicographically ordered.

# move to output dir since somehow parallel-fastq-dump sometimes puts files in the cwd...
mkdir -p {output.tmpdir}; cd {output.tmpdir}

# acquire the lock
(
    flock -w 30 200 || exit 1
    sleep 3
) 200>{PYSRADB_CACHE_LOCK}

# dump to tmp dir
parallel-fastq-dump -s {input} -O {output.tmpdir} \
--threads {threads} --split-3 --skip-technical --dumpbase \
--readids --clip --read-filter pass --defline-seq '@$ac.$si.$sg/$ri' \
--defline-qual '+' --gzip >> {log} 2>&1

# check if the files exist
if ! compgen -G "{output.tmpdir}/*_1*" > /dev/null ; then printf "ERROR: Couldn't find read 1.fastq after dumping! Perhaps this is not a paired-end file?\n" >> {log} 2>&1; fi
if ! compgen -G "{output.tmpdir}/*_2*" > /dev/null ; then printf "ERROR: Couldn't find read 2.fastq after dumping! Perhaps this is not a paired-end file?\n" >> {log} 2>&1; fi

# rename file and move to output dir
mv {output.tmpdir}/*_1* {output.fastq[0]} >> {log} 2>&1
mv {output.tmpdir}/*_2* {output.fastq[1]} >> {log} 2>&1

sra2fastq_PE environment.

sra2fastq_SE

Downloaded (raw) SRAs are converted to single-end fastq files.

# move to output dir since somehow parallel-fastq-dump sometimes puts files in the cwd...
mkdir -p {output.tmpdir}; cd {output.tmpdir}

# acquire the lock
(
    flock -w 30 200 || exit 1
    sleep 3
) 200>{PYSRADB_CACHE_LOCK}

# dump to tmp dir
parallel-fastq-dump -s {input} -O {output.tmpdir} \
--threads {threads} --split-spot --skip-technical --dumpbase --readids \
--clip --read-filter pass --defline-seq '@$ac.$si.$sg/$ri' \
--defline-qual '+' --gzip >> {log} 2>&1

# rename file and move to output dir
mv {output.tmpdir}/*.fastq.gz {output.fastq}

sra2fastq_SE environment.

star_align

Align reads against a genome (index) with STAR, and pipe the output to the required sorter(s).

trap "find {log} -type f ! -name Log* -exec rm {{}} \;" EXIT
mkdir -p {log}
mkdir -p {output.dir}

STAR --genomeDir {input.index} --readFilesIn {params.input} --readFilesCommand gunzip -c \
--outSAMtype BAM Unsorted --outStd BAM_Unsorted \
--outFileNamePrefix {log}/ --outTmpDir {output.dir}/STARtmp \
--runThreadN {threads} {params.params} > {output.pipe} 2> {log}/Log.stderr.out

# move all non-log files to output directory (this way the log files are kept on error)
find {log} -type f ! -name Log* -exec mv {{}} {output.dir} \;

star_align environment.

star_index

Make a genome index for STAR.

Troubleshooting:

  1. sufficient RAM & disk space?

  2. increase the RAM available (--limitGenomeGenerateRAM)

  3. reduce the number of threads (seq2science -j 5)

  4. reduce accuracy (--genomeSAsparseD 2)

In your config.yaml, add:

aligner:
  star:
    index: --limitGenomeGenerateRAM 60000000000 --genomeSAsparseD 1

function log2 {{
        local x=0
        for (( y=$1-1 ; $y > 0; y >>= 1 )) ; do
            let x=$x+1
        done
        echo $x
}}

# set genome dependent variables
NBits=""
NBases=""
GenomeLength=$(awk -F"\t" '{{x+=$2}}END{{printf "%i", x}}' {input.sizes})
NumberOfReferences=$(awk 'END{{print NR}}' {input.sizes})
if [ $NumberOfReferences -gt 5000 ]; then
    # for large genomes, --genomeChrBinNbits should be scaled to min(18,log2[max(GenomeLength/NumberOfReferences,ReadLength)])
    # ReadLength is skipped here, as it is unknown
    LpR=$(log2 $((GenomeLength / NumberOfReferences)))
    NBits="--genomeChrBinNbits $(($LpR<18 ? $LpR : 18))"
    printf "NBits: $NBits\n\n" >> {log} 2>&1
fi

if [ $GenomeLength -lt 268435456 ]; then
    # for small genomes, --genomeSAindexNbases must be scaled down to min(14, log2(GenomeLength)/2-1)
    logG=$(( $(log2 $GenomeLength) / 2 - 1 ))
    NBases="--genomeSAindexNbases $(( $logG<14 ? $logG : 14 ))"
    printf "NBases: $NBases\n\n" >> {log} 2>&1
fi

mkdir -p {output}

STAR --runMode genomeGenerate --genomeFastaFiles {input.genome} --sjdbGTFfile {input.gtf} \
--genomeDir {output} --outFileNamePrefix {output}/ \
--runThreadN {threads} $NBits $NBases {params} >> {log} 2>&1

star_index environment.

tpm_matrix

Create a TPM normalized table from a counts table. TPM (transcripts per million) scales counts by gene length and library size, so that each sample’s values sum to one million.

trackhub

Generate a UCSC track hub/assembly hub.

To view the hub, the output directory must be hosted at a web-accessible location, and the location of the hub.txt file given to the UCSC genome browser at My Data > Track Hubs > My Hubs.

trackhub_index

Generate a searchable annotation & index for each assembly.

sources: https://genome.ucsc.edu/goldenPath/help/hubQuickStartSearch.html, https://www.biostars.org/p/272649/

# filter out tiny tiny transcripts that make gtftogenepred fail
min_transcript_length=15
(awk -v minlen=$min_transcript_length 'BEGIN {{ FS = "\\t"; OFS="\\t" }}; {{ if(($3=="transcript") && ($5 - $4 <= minlen)) {{ print $9 }} }}' {input.gtf} | grep . || echo "transcript_id \\\"DONTMATCHANYTHING\\\"") | grep -oP '(?<=transcript_id ").+?(?=")' | grep -v -f - {input.gtf} > {output.gtf_min_size} 2>> {log}

# generate annotation files
gtfToGenePred -allErrors -ignoreGroupsWithoutExons -geneNameAsName2 -genePredExt {output.gtf_min_size} {output.genePred} -infoOut={output.info} >> {log} 2>&1

# if gtf has gene_names, use gene_names, else transcript_id
if grep -Fq "gene_name" {output.gtf_min_size}; then n=1; else n=0; fi

# switch columns 1 (transcript_id) and 12 (gene_name)
awk -v n=$n 'BEGIN {{ FS = "\\t"; OFS="\\t" }}; {{ if(n==1) {{ t = $1; $1 = $12; $12 = t; print; }} else {{ print; }} }}' {output.genePred} > {output.genePrednamed}

# remove lines with an empty first column (e.g. no gene names for ERCC RNA spike-in)
grep -vP "^\t" {output.genePrednamed} > {output.genePred}

# remove spaces from the genePred (error in annotation)
sed -i 's/ //g' {output.genePred}

genePredToBed {output.genePred} {output.genePredbed} >> {log} 2>&1

bedSort {output.genePredbed} {output.genePredbed} >> {log} 2>&1

bedToBigBed -extraIndex=name {output.genePredbed} {input.sizes} {output.genePredbigbed} >> {log} 2>&1

# generate searchable indexes (by 1: transcriptID, 2: geneId, 8: proteinID, 9: geneName, 10: transcriptName)
grep -v "^#" {output.info} | awk -v n=$n 'BEGIN {{ FS = "\\t"; OFS="\\t" }} ; {{ if(n==1) {{print $9, $1, $2, $8, $9, $10}} else {{print $1, $1, $2, $8, $9, $10}} }}' > {output.indexinfo}

ixIxx {output.indexinfo} {output.ix} {output.ixx} >> {log} 2>&1

trackhub_index environment.

trimgalore_PE

Automated adapter detection, adapter trimming, and quality trimming through trim galore (paired-end).

trim_galore --paired -j {threads} {params.config} -o $(dirname {output.r1}) {input.r1} {input.r2} > {log} 2>&1

# now rename to proper output
for f in $(find "$(dirname {output.r1})/" -name "{wildcards.sample}_*val_*.fq.gz"); do
    mv "$f" "$(echo "$f" | sed s/.fq/.{params.fqsuffix}/)"; done
for f in $(find "$(dirname {output.r1})/" -maxdepth 1 -name "{wildcards.sample}_*val_*.{params.fqsuffix}.gz"); do
    mv "$f" "$(echo "$f" | sed s/_val_./_trimmed/)"; done

# move the trimming reports to qc directory
for f in $(find "$(dirname {output.r1})/" -name "{wildcards.sample}_*.{params.fqsuffix}.gz_trimming_report.txt"); do
    mv "$f" "$(dirname {output.qc[0]})/$(basename $f)"; done

trimgalore_PE environment.

trimgalore_SE

Automated adapter detection, adapter trimming, and quality trimming through trim galore (single-end).

trim_galore -j {threads} {params.config} -o $(dirname {output.se}) {input} > {log} 2>&1

# now rename to proper output
if [ "{params.fqsuffix}" != "fq" ]; then
  mv "$(dirname {output.se})/{wildcards.sample}_trimmed.fq.gz" {output.se}
fi

# move the trimming report to qc directory
report=$(dirname {output.se})/{wildcards.sample}.{params.fqsuffix}.gz_trimming_report.txt
mv $report {output.qc}

trimgalore_SE environment.

twobit

Generate a 2bit file for each assembly.

faToTwoBit {input} {output} >> {log} 2>&1

twobit environment.

txi_count_matrix

Convert transcript abundance estimations to gene count estimations and merge gene counts per assembly.

Also outputs a single cell experiment object similar to ARMOR (https://github.com/csoneson/ARMOR).

Only works with Ensembl assemblies.

upset_plot_peaks

Make an upset plot which shows the distribution of peaks amongst the biological replicates.