Per-rule explanation
This is an automatically generated list of all supported rules, their docstrings, commands, and virtual environments. At the start of each workflow run, seq2science prints the list of rules that will be run, and while the workflow is running it reports which rules are starting and finishing. This page explains to users what each rule does, and lets developers see what is, and isn't yet, supported.
bam2cram
Convert a bam file to the more compressed cram format.
samtools view -T {input.genome} -C {input.bam} -@ {threads} > {output} 2> {log}
bam_bigwig
Convert a bam file into a bigwig file (or two). Can output strand-specific bigwigs. The reverse bigwig is created “secretly” to avoid checkpoints.
if [[ "{params.strandedness}" == "no" ]]; then
echo "generating unstranded bam" > {log}
bamCoverage --bam {input.bam} --outFileName {output} --numberOfProcessors {threads} {params.flags} --verbose >> {log} 2>&1
else
echo "generating stranded bams" > {log}
out={output}
rev=$(echo "${{out/{wildcards.sorting}.bw/{wildcards.sorting}.rev.bw}}")
rm -f $rev # in case of a rerun
bamCoverage --bam {input.bam} --outFileName $rev --filterRNAstrand reverse --numberOfProcessors {threads} {params.flags} --verbose >> {log} 2>&1
bamCoverage --bam {input.bam} --outFileName {output} --filterRNAstrand forward --numberOfProcessors {threads} {params.flags} --verbose >> {log} 2>&1
fi
bedgraph_bigwig
Convert a bedgraph file into a bigwig.
awk -v OFS='\\t' '{{print $1, $2, $3, $4}}' {input.bedgraph} | sed '/experimental/d' |
bedSort /dev/stdin {output.tmp} > {log} 2>&1
bedGraphToBigWig {output.tmp} {input.sizes} {output.out} >> {log} 2>&1
bedgraphish_to_bedgraph
Convert the bedgraph-ish file generated by genrich into a true bedgraph file.
# find the line number and bam basename of each section in the bedgraph-ish file (format "N:name")
splits=$(grep -Pno "([^\/]*)(?=\.bam)" {input})
splits=($splits)
# append the total line count of the file (formatted as "N:") as a sentinel
lncnt=$(wc -l {input} | echo "$(grep -Po ".*(?=\ )")":)
splits+=($lncnt)
counter=1
# loop over all sections (every element except the sentinel)
for split in "${{splits[@]::${{#splits[@]}}-1}}";
do
filename=$(grep -Po "(?<=:).*" <<< $split);
# if the section belongs to this sample, print its bedgraph columns
# (the lines between this section's header and the start of the next one)
if [[ $filename =~ {wildcards.sample} ]]; then
startnr=$(grep -Po ".*(?=\:)" <<< $split);
endnr=$(grep -Po ".*(?=\:)" <<< ${{splits[counter]}});
lines="NR>=startnr && NR<=endnr {{ print \$1, \$2, \$3, \$4 }}"
lines=${{lines/startnr/$((startnr + 2))}}
lines=${{lines/endnr/$((endnr - 1))}}
awk "$lines" {input} > {output}
fi
((counter++))
done
bedtools_slop
After combine_peaks we end up with just a bed file of summits. We extend all peaks to the same width, for a fair comparison between peaks.
bedtools slop -i {input.bedfile} -g {input.sizes} -b {params.slop} | uniq > {output} 2> {log}
blind_clustering
Create a sample distance cluster heatmap, and various correlation cluster heatmaps.
A (blind) variance stabilizing transformation is applied to the counts, to reduce the impact of genes with low expression levels.
bowtie2_align
Align reads against a genome (index) with bowtie2, and pipe the output to the required sorter(s).
bowtie2 {params.params} --threads {threads} -x {input.index}/{wildcards.assembly} {params.input} 2> {log} | tee {output} 1> /dev/null 2>> {log}
bowtie2_index
Make a genome index for bowtie2. This index is required for alignment.
mkdir -p {output}
bowtie2-build {params} --threads {threads} {input} {output}/{wildcards.assembly} > {log} 2>&1
bwa_index
Make a genome index for bwa (mem). This index is required for alignment.
mkdir -p {output}
bwa index -p {params.prefix} {params.params} {input} > {log} 2>&1
bwa_mem
Align reads against a genome (index) with bwa-mem, and pipe the output to the required sorter(s).
bwa mem {params.params} -t {threads} {params.index_dir} {input.reads} 2> {log} | tee {output} 1> /dev/null 2>> {log}
bwa_mem2
Align reads against a genome (index) with bwa-mem2, and pipe the output to the required sorter(s).
bwa-mem2 mem {params.params} -t {threads} {params.index_dir} {input.reads} 2> {log} | tee {output} 1> /dev/null 2>> {log}
bwa_mem2_index
Make a genome index for bwa-mem2. This index is required for alignment.
mkdir -p {output}
bwa-mem2 index -p {params.prefix} {input} > {log} 2>&1
call_peak_genrich
Call peaks with genrich based on the pileup.
Genrich -P -f {input.log} -o {output.narrowpeak} {params} -v > {log} 2>&1
chipseeker
Make chipseeker plots, with the percentage of peaks in e.g. promoters.
citeseqcount
Map and quantify ADTs with CITE-seq-Count, and output a UMI/read matrix per sample.
CITE-seq-Count -R1 {input.reads[0]} -R2 {input.reads[1]} \
-t {input.tags} {params.barcode_arg} -T {threads} \
{params.options} -o {params.outdir} > {log} 2>&1
combine_biological_reps
Combine biological replicates by taking their mean!
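The averaging itself is not a shell one-liner in the workflow, but the idea can be sketched. A minimal sketch, assuming two hypothetical two-column (region, count) tables with identical region order:
paste rep1.tsv rep2.tsv | awk -v OFS='\t' '{print $1, ($2 + $4) / 2}' > mean.tsv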
combine_peaks
Uses gimmemotifs’ combine_peaks to “combine” peaks. This finds all peaks close together and takes the most significant one as the true peak.
combine_peaks --genome {input.genome} --window {params.windowsize} \
{input.summitfiles} > {output} 2> {log}
complement_blacklist
Take the complement of the blacklist. We need this complement to tell samtools to only keep reads that are in the complement of the blacklist.
sortBed -faidx {input.sizes} -i {input.blacklist} 2>> {log} |
complementBed -i stdin -g {input.sizes} > {output} 2>> {log}
computeMatrix_gene
Pre-compute a signal matrix over genes with deeptools (input for the gene profile plots).
computeMatrix scale-regions -S {input.bw} {params.labels} -R {params.gtf} \
-p {threads} {params.params} -o {output} > {log} 2>&1
computeMatrix_peak
Pre-compute a signal matrix over peaks with deeptools (input for the peak heatmap plots).
computeMatrix scale-regions -S {input.bw} {params.labels} -R {input.peaks} \
-p {threads} --binSize 5 -o {output} > {log} 2>&1
count_matrix
Combine count tables into one count matrix per assembly
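Conceptually this boils down to pasting the per-sample count columns together. A minimal sketch, assuming hypothetical featureCounts outputs in which column 1 is the gene id and column 7 the count:
grep -v '^#' sample1.counts | cut -f 1 > matrix.tsv
for f in sample1.counts sample2.counts; do
    grep -v '^#' "$f" | cut -f 7 | paste matrix.tsv - > matrix.tmp && mv matrix.tmp matrix.tsv
done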
coverage_table
Use gimmemotifs' coverage_table to generate a count table with the number of reads under each peak per sample.
echo "# The number of reads under each peak" > {output}
coverage_table {input.peaks} {input.replicates} --window {params.peak_width} --nthreads {threads} \
2> {log} | grep -vE "^#" 2>> {log} |
awk 'BEGIN {{ FS = "@" }} NR==1{{gsub("{wildcards.assembly}-|.samtools-coordinate","",$0)}}; \
{{print $0}}' >> {output}
# overwrite sample names with descriptive/replicate names
sed -i "2s/.*/{params.names}/" {output}
create_SNAP_object
Create a snapobject for each BAM file.
These snapobjects can be merged later using snaptools in R.
snaptools snap-pre --input-file={input.bam} --output-snap={output} --genome-name={wildcards.assembly} \
--genome-size={input.sizes} {params.params} {params.chrm} {params.mapq} > {log} 2>&1
create_bins_SNAP_object
Add a binned genome matrix to the SNAP object, after which it is renamed and moved to the Snapfiles folder for downstream analysis in R using Snaptools.
snaptools snap-add-bmat --snap-file={input} {params} > {log} 2>&1
echo "bmat added, moving file" >> {log}
mv {input} {output} >> {log} 2>&1
cytoband
Generate a cytoband track for each assembly
source: http://genomewiki.ucsc.edu/index.php/Assembly_Hubs#Cytoband_Track
cat {input.sizes} |
bedSort /dev/stdin /dev/stdout |
awk '{{print $1,0,$2,$1,"gneg"}}' > {output.cytoband_bd}
bedToBigBed -type=bed4 {output.cytoband_bd} -as={params.schema} \
{input.sizes} {output.cytoband_bb} >> {log} 2>&1
deseq2
Differential gene expression analysis with DESeq2.
dexseq_count
Count exon usage.
current_conda_env=$(conda env list | grep \* | cut -d "*" -f2-)
DEXseq_path=${{current_conda_env}}/lib/R/library/DEXSeq/python_scripts
python ${{DEXseq_path}}/dexseq_count.py -f bam -r pos {params.endedness} -s {params.strandedness} {input.gff} {input.bam} {output} > {log} 2>&1
dexseq_count_matrix
Combine DEXSeq counts into one count matrix per assembly
for use in function DEXSeqDataSet()
dupRadar
Visualize the fraction of artifactual duplicate reads versus normal read duplication (due to natural over-sequencing of highly expressed genes).
dupRadar_combine
Combine the individual images (so we can group them nicely in the MultiQC).
convert {params.good_example} {params.bad_example} {input} -append {output} 2> {log}
edgeR_normalization
edgeR supports three different types of normalization: TMM, RLE, and upperquartile.
TMM is the weighted trimmed mean of M-values proposed by Robinson and Oshlack (2010). RLE is the scaling factor method proposed by Anders and Huber (2010); DESeq2 normalisation is based on it. Upperquartile is the upper-quartile normalization method of Bullard et al. (2010).
ena2fastq_PE
Download paired-end fastq files directly from the ENA.
ena2fastq_SE
Download single-end fastq files directly from the ENA.
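For both ena2fastq rules, the download amounts to looking up a run's fastq links in the ENA portal API and fetching them. A rough, hand-run sketch (the accession is illustrative):
acc=SRR000001  # hypothetical run accession
wget -qO- "https://www.ebi.ac.uk/ena/portal/api/filereport?accession=${acc}&result=read_run&fields=fastq_ftp" |
    grep -o '[^[:space:];]*\.fastq\.gz' | sed 's|^|ftp://|' | xargs -n 1 wget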
export_sce_obj
Read scRNA UMI counts into a SingleCellExperiment object, add colData and export to RData format.
extend_genome
Append the given file(s) to the genome.
# extend the genome.fa
cp {input.genome} {output.genome}
for FILE in {input.extension}; do
cat $FILE >> {output.genome}
done
extend_genome_annotation
Append the given file(s) to the genome annotation.
# extend the genome.annotation.gtf
cp {input.gtf} {output.gtf}
for FILE in {input.extension}; do
cat $FILE >> {output.gtf}
done
# generate an extended genome.annotation.bed
gtfToGenePred {output.gtf} {output.gp}
genePredToBed {output.gp} {output.bed}
extend_genome_blacklist
Copy blacklist to the custom genome directory
cp {input} {output}
fastp_PE
Automated adapter detection, adapter trimming, and quality trimming through fastp (paired-end).
threads=$(( {threads} - 2 > 1 ? {threads} - 2 : 1 ))
fastp -w $threads --in1 {input[0]} --in2 {input[1]} \
--out1 {output.r1} --out2 {output.r2} -h {output.qc_html} -j {output.qc_json} \
{params.config} > {log} 2>&1
fastp_SE
Automated adapter detection, adapter trimming, and quality trimming through fastp (single-end).
threads=$(( {threads} - 2 > 1 ? {threads} - 2 : 1 ))
fastp -w $threads --in1 {input} --out1 {output.se} -h {output.qc_html} -j {output.qc_json} \
{params.config} > {log} 2>&1
fastp_qc_PE
Get quality scores for (technical) replicates
fastp -w {threads} --in1 {input[0]} --in2 {input[1]} \
-h {output.qc_html} -j {output.qc_json} \
{params.config} > {log} 2>&1
fastp_qc_SE
Get quality scores for (technical) replicates
fastp -w {threads} --in1 {input} -h {output.qc_html} -j {output.qc_json} > {log} 2>&1
fastq_pair
fastq_pair rewrites paired-end fastq files to ensure that each read has a mate, and discards singleton reads. This step is required after scRNA trimming, since we only trim the fastq containing the reads, not the barcode fastq.
gunzip -c {input.r1} > {output.intermediates1[0]} 2> {log}
gunzip -c {input.r2} > {output.intermediates1[1]} 2>> {log}
if [ {params.tused} == true ]
then
opts="{params.options}"
else
echo "\nsetting parameter t with the number of reads in the fastq\n" >> {log}
opts="-p -t "$(wc -l {input.r1} | grep -Po '^\d+' | awk '{{print int($1/4)}}')
fi
fastq_pair $opts {output.intermediates1} >> {log} 2>&1
gzip -c {output.intermediates3[0]} > {output.reads[0]} 2> {log}
gzip -c {output.intermediates3[1]} > {output.reads[1]} 2>> {log}
fastqc
Generate quality control report for fastq files.
fastqc {input} -O {params} > {log} 2>&1
featureCounts
Use featureCounts to generate the fraction of reads in peaks score (FRiP / assigned reads). See: https://www.biostars.org/p/337872/ and https://www.biostars.org/p/228636/
# Make a custom "SAF" file which featureCounts needs:
awk 'BEGIN{{FS=OFS="\t"; print "GeneID\tChr\tStart\tEnd\tStrand"}}{{print $4, $1, $2+1, $3, "."}}' \
{input.peak} 1> {output.tmp_saf} 2> {log}
# run featureCounts
featureCounts -T {threads} -p -a {output.tmp_saf} -F SAF -o {output.real_out} {input.bam} > {log} 2>&1
# move the summary to qc directory
mv $(dirname {output.real_out})/$(basename {output.summary}) {output.summary}
featurecounts
Summarize reads to the gene level. Outputs a counts table per bam file.
featureCounts -a {input.gtf} {input.bam} {params.endedness} -s {params.strandedness} {params.user_flags} -T {threads} -o {output} > {log} 2>&1
full_decoy_transcripts
Use the entire genome as decoy sequence.
cut -f 1 {input.sizes} > {output.decoys}
cat {input.transcripts} {input.genome} > {output.gentrome}
gcPercent
Generate a gc content track
source: http://hgdownload.cse.ucsc.edu/goldenPath/hg19/gc5Base/
hgGcPercent -wigOut -doGaps -file=stdout -win=5 -verbose=0 {wildcards.assembly} {input.twobit} \
| gzip > {output.gcpcvar}
wigToBigWig {output.gcpcvar} {input.sizes} {output.gcpc} >> {log} 2>&1
gene_id2name
Parse the gtf file to generate a gene_id to gene_name conversion table.
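A minimal sketch of such a parse in shell, assuming GNU sed and a hypothetical annotation.gtf with gene_id and gene_name attributes:
awk -F '\t' '$3 == "gene"' annotation.gtf |
    sed -nE 's/.*gene_id "([^"]+)".*gene_name "([^"]+)".*/\1\t\2/p' |
    sort -u > gene_id2name.tsv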
genrich_pileup
Generate the pileup. We do this separately from peak-calling since these two processes have a very different computational footprint.
input=$(echo {input.reps} | tr ' ' ',')
Genrich -X -t $input -f {output.log} {params.control} -k {output.bedgraphish} \
{params.params} -v > {log} 2>&1
get_effective_genome_size
Get the effective genome size for a kmer length. Some tools (e.g. macs2) require an estimation of the effective genome size to better estimate how (un)likely it is to have a certain number of reads on a position. The actual genome size is not the best indication in these cases, since reads in repetitive regions (for a certain kmer length) cannot possibly align on some locations.
unique-kmers.py {input} -k {wildcards.kmer_size} --quiet 2>&1 | grep -P -o '(?<=\.fa: ).*' > {output} 2> {log}
get_genome
Download a genome through genomepy.
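Roughly equivalent to running genomepy by hand; the assembly name and provider below are illustrative, and flags may differ between genomepy versions:
genomepy install GRCz11 --provider Ensembl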
get_genome_annotation
Download a gene annotation through genomepy.
get_genome_blacklist
Download a genome blacklist for a genome, if it exists, through genomepy.
get_genome_support_files
Generate supporting files for a genome.
get_transcripts
Generate transcripts.fasta using gffread.
Requires genome.fa and annotation.gtf (with matching chromosome/scaffold names)
gffread -w {output} -g {input.fa} {input.gtf} > {log} 2>&1
gimme_maelstrom
Gimme maelstrom is a method to infer differential motifs between two or more biological replicates.
NEW_CACHE={output.cache}
mkdir -p $NEW_CACHE
if [ -z ${{XDG_CACHE_HOME+x}} ]; then
export XDG_CACHE_HOME=$HOME/.cache
fi
if [ -d $XDG_CACHE_HOME/gimmemotifs ]; then
cp -r $XDG_CACHE_HOME/gimmemotifs $NEW_CACHE/
fi
export XDG_CACHE_HOME=$NEW_CACHE
gsa_or_encode2fastq_PE
Download paired-end fastq files from the GSA or ENCODE DCC.
gsa_or_encode2fastq_SE
Download single-end fastq files from the GSA or ENCODE DCC.
hisat2_align
Align reads against a genome (index) with hisat2, and pipe the output to the required sorter(s).
hisat2 {params.params} --threads {threads} -x {input.index}/part {params.input} 2> {log} | tee {output} 1> /dev/null 2>> {log}
hisat2_index
Make a genome index for hisat2. This index is required for alignment.
mkdir -p {output}
hisat2-build {params} -p {threads} {input} {output}/part > {log} 2>&1
hisat2_splice_aware_index
Make an exon-junction and splice aware index for hisat2. This index is required for alignment and quantification of RNA-seq data.
mkdir -p {output}
hp=$(which hisat2)
python3 ${{hp}}_extract_splice_sites.py {input.gtf} > {output}/splice_sites.tsv 2>> {log}
python3 ${{hp}}_extract_exons.py {input.gtf} > {output}/exons.tsv 2>> {log}
hisat2-build {params} -p {threads} --ss {output}/splice_sites.tsv --exon {output}/exons.tsv \
{input.fasta} {output}/part >> {log} 2>&1
hmmratac
Call gappedpeaks with HMMRATAC.
HMMRATAC --bedgraph true -o {params.basename} {params.hmmratac_params} -Xmx22G -b {input.bam} -i {input.bam_index} -g {input.genome_size} > {log} 2>&1
hmmratac_genome_info
Generate the ‘genome info’ that hmmratac requires for peak calling. https://github.com/LiuLabUB/HMMRATAC/issues/17
TODO: isn't this just .fa.sizes?
samtools view -H {input} 2> {log} | grep SQ 2>> {log} | cut -f 2-3 2>> {log} | cut -d ':' -f 2 2>> {log} | cut -f 1 > {output.tmp1} 2> {log}
samtools view -H {input} 2>> {log} | grep SQ 2>> {log} | cut -f 2-3 2>> {log} | cut -d ':' -f 2,3 2>> {log} | cut -d ':' -f 2 > {output.tmp2} 2> {log}
paste {output.tmp1} {output.tmp2} > {output.out} 2> {log}
htseq_count
Summarize reads to the gene level. Outputs a counts table per bam file.
htseq-count {input.bam} {input.gtf} -r pos -s {params.strandedness} {params.user_flags} -n {threads} -c {output} > {log} 2>&1
idr
Combine replicates based on the irreproducible discovery rate (IDR). Can only handle two replicates; for more than two replicates use Fisher's method.
When combining narrowPeak files with IDR, the q-score and summit are set to -1 (meaning: not set). However, some downstream tools require these to be set, so we set the q-score to zero and place the summit in the middle of the peak.
if [ "{params.nr_reps}" == "1" ]; then
cp {input} {output.true}
touch {output.temp}
else
idr --samples {input} {params.rank} --output-file {output.temp} {params.options} > {log} 2>&1
if [ "{wildcards.ftype}" == "narrowPeak" ]; then
awk 'IFS="\t",OFS="\t" {{$9=0; $10=int(($3 - $2) / 2); print}}' {output.temp} > {output.true} 2>> {log}
else
cp {output.temp} {output.true}
fi
fi
infer_strandedness
Use RSeQC's infer_experiment.py to determine the strandedness of a sample.
infer_experiment.py -i {input.bam} -r {input.bed} -q {params} 2> {log} | awk NF > {output}
insert_size_metrics
Get insert size metrics from a (paired-end) bam. This score is then used by MultiQC in the report.
picard CollectInsertSizeMetrics INPUT={input} \
OUTPUT={output.tsv} H={output.pdf} > {log} 2>&1
kallistobus_count
Align reads against a transcriptome (index) with kallistobus and output a quantification file per sample.
kb count \
-i {params.basename}.idx \
-t {threads} -g {params.basename}_t2g.txt \
-o {params.outdir} {params.c1} {params.c2} \
{params.barcode_arg} {params.options} {input.reads} > {log} 2>&1
# Validate output
if grep -q 'ERROR\|bad_alloc' "{log}"; then
exit 1
fi
kallistobus_ref
Make a genome index for kallistobus. This index is required for counting.
mkdir -p {output}
kb ref \
{input.fa} {input.gtf} \
-i {params.basename}.idx -g {params.basename}_t2g.txt -f1 {params.basename}_cdna.fa \
{params.f2} {params.c1} {params.c2} \
{params.options} > {log} 2>&1
kallistobus_ref_kite
Make a mismatch index for kallistobus. This index is required to count feature barcodes, such as antibody tags.
mkdir -p {output}
kb ref \
{input.featurebarcodes} \
{params.options} \
-i {params.basename}.idx -g {params.basename}_t2g.txt -f1 {params.basename}_cdna.fa > {log} 2>&1
keep_mates
In-house script that, after alignment, removes the information that reads are paired. This can be beneficial when peak calling with macs2 when shifting + extending, since macs2 in this case only keeps the first in pair.
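The in-house script itself is not shown here, but the idea can be sketched with samtools and GNU awk (the and()/compl() bit operations are gawk extensions): clear the pairing-related FLAG bits and blank the mate fields.
samtools view -h paired.bam |
    gawk -v OFS='\t' '/^@/ {print; next}
    {$2 = and($2, compl(235)); $7 = "*"; $8 = 0; $9 = 0; print}' |
    samtools view -b > unpaired.bam
# 235 = 0x1 + 0x2 + 0x8 + 0x20 + 0x40 + 0x80: the paired, proper-pair,
# mate-unmapped, mate-reverse and first/second-in-pair bits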
linked_txome
Generate a linked transcriptome for tximeta
Also creates a symlink to the gtf in an Ensembl format (required by tximeta)
Required to convert salmon output (estimated transcript abundances) to gene counts
log_normalization
Log1p normalization of a count table.
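A minimal sketch of the transformation on a hypothetical counts table (gene ids in column 1, header on line 1); awk's log is the natural logarithm, so log($i + 1) is log1p:
awk -v OFS='\t' 'NR == 1 {print; next}
    {for (i = 2; i <= NF; i++) $i = log($i + 1); print}' counts.tsv > log_counts.tsv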
macs2_callpeak
Call peaks using macs2. Macs2 requires a genome size, which is estimated from the number of unique kmers at the average read length.
GENSIZE=$(cat {input.genome_size})
# call peaks
macs2 callpeak --bdg -t {input.bam} {params.control} --outdir {config[result_dir]}/macs2/ -n {wildcards.assembly}-{wildcards.sample} \
{params.macs_params} -g $GENSIZE -f {params.format} >> {log} 2>&1
macs_bdgcmp
Prepare p-value files for rule macs_cmbreps
macs2 bdgcmp -t {input.treatment} -c {input.control} -m qpois -o {output} > {log} 2>&1
macs_cmbreps
Combine biological replicates through Fisher’s method
(Link original peakfile in replicate_processed if there is only 1 sample for a condition)
if [ "{params.nr_reps}" == "1" ]; then
touch {output.tmpbdg} {output.tmppeaks}
mkdir -p $(dirname {output.peaks}); cp {input.treatment} {output.peaks}
else
macs2 cmbreps -i {input.bdgcmp} -o {output.tmpbdg} -m fisher > {log} 2>&1
macs2 {params.function} {params.config} -i {output.tmpbdg} -o {output.tmppeaks} >> {log} 2>&1
cat {output.tmppeaks} | tail -n +2 > {output.peaks}
fi
maelstrom_report_preparation
This rule injects the symlinked motif logos into the html so they get rendered in the multiqc report.
mark_duplicates
Mark all duplicate reads in a bam file with picard MarkDuplicates
picard MarkDuplicates TMP_DIR={resources.tmpdir} {params} INPUT={input} \
OUTPUT={output.bam} METRICS_FILE={output.metrics} > {log} 2>&1
mean_center
Mean centering of a count table.
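A minimal sketch on the same hypothetical table layout: subtract each gene's row mean from its values.
awk -v OFS='\t' 'NR == 1 {print; next}
    {s = 0; for (i = 2; i <= NF; i++) s += $i; m = s / (NF - 1)
     for (i = 2; i <= NF; i++) $i -= m; print}' counts.tsv > centered.tsv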
merge_replicates
Merge replicates (fastqs) simply by concatenating the files. We also change the name of the read headers to contain the name of the original replicate.
Must happen after trimming due to trim-galore’s automatic adapter trimming method
If a replicate has only 1 sample in it, simply move the file.
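A sketch of the renaming idea on hypothetical replicate fastqs: prefix every read header (line 1 of each 4-line record) with the replicate name while concatenating.
for rep in rep1 rep2; do
    zcat "${rep}.fastq.gz" |
        awk -v r="$rep" '{if (NR % 4 == 1) sub(/^@/, "@" r ":"); print}'
done | gzip > merged.fastq.gz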
merge_volcano_ma
Combine the volcano and maplot resulting of the deseq2 rule into a single figure.
convert {input.maplot} {input.volcanoplot} +append {output} 2> {log}
minimap2_align
Align reads against a genome (index) with minimap2, and pipe the output to the required sorter(s).
minimap2 -t {threads} -a {input.index} {input.reads} {params} > {output} 2> {log}
minimap2_index
Make a genome index for minimap2. This index is required for alignment.
mkdir -p $(dirname {output})
minimap2 -t {threads} -d {output} {input} {params} > {log} 2>&1
motif2factors
Create a gimme pfm/motif2factor file with the gimme motif2factors command. For human/mouse it just uses the default database, for other species it is based on TF orthologs.
mt_nuc_ratio_calculator
Estimate the amount of nuclear and mitochondrial reads in a sample. This metric is especially important in ATAC-seq experiments where mitochondrial DNA can be overrepresented. Reads are classified as mitochondrial reads if they map against either “chrM” or “MT”.
These values are aggregated and displayed in the MultiQC report.
mtnucratio {input.bam} {params.mitochondria}
multiBamSummary
Pre-compute a bam summary with deeptools.
multiBamSummary bins --bamfiles {input.bams} -out {output} {params.names} \
{params.params} -p {threads} > {log} 2>&1
multiqc
Aggregate all the quality control metrics for every sample into a single multiqc report.
The input list can get very long (causing problems with the shell), so we write the input to a file and use that file as input to multiqc (see the sketch after the command).
multiqc {input.files} -o {params.dir} -n multiqc_{wildcards.assembly}.html \
--config {input.schema} \
--config {input.header} \
--sample-names {input.sample_names} \
{params.filter_buttons} \
--cl-config "extra_fn_clean_exts: [ \
{{'pattern': ^.*{wildcards.assembly}-, 'type': 'regex'}}, \
{{'pattern': {params.fqext1}, 'type': 'regex'}}, \
{{'pattern': _allsizes, 'type': 'regex'}}, \
]" > {log} 2>&1
multiqc_explain
Generate rename buttons for the multiqc report.
multiqc_header_info
Generate a multiqc header file with contact info and date of multiqc generation.
multiqc_samplesconfig
Add a section in the multiqc report that reports the samples.tsv and config.yaml
multiqc_schema
Generate a multiqc config schema. Used for the ordering of modules.
narrowpeak_summit
Convert a narrowpeak file to a “macs2 summits” file.
awk 'BEGIN {{OFS="\t"}} {{ print $1,$2+$10,$2+$10+1,$4,$9; }}' {input} > {output} 2> {log}
onehot_peaks
Get one-hot encodings of which peaks are found in which samples.
awk '{{print $1":"$2"-"$3}}' {input.combinedpeaks} > {output.real} 2> {log}
for brep in {input.narrowpeaks}
do
bedtools sort -i $brep 2> {log} |
bedtools intersect -a {input.combinedpeaks} -b stdin -c 2> {log} |
awk '{{print $4}}' 2> {log} |
paste {output.real} - > {output.tmp}
mv {output.tmp} {output.real}
done
echo -e "# onehot encoding of which condition contains which peaks\\n{params.names}\n$(cat {output.real})" > {output.tmp} && cp {output.tmp} {output.real}
partial_decoy_transcripts
Compute a set of decoy sequences by mapping the annotated transcripts against a hard-masked version of the organism’s genome.
sh {params.script} -j {threads} -g {input.genome} -a {input.gtf} -t {input.transcripts} -o $(dirname {output[0]}) > {log} 2>&1
peak_bigpeak
Convert a narrowpeak file into a bignarrowpeak file. https://genome-source.gi.ucsc.edu/gitlist/kent.git/tree/master/src/hg/lib/
# keep first 10 columns, idr adds extra columns we do not need for our bigpeak
cut -d$'\t' -f 1-{params.columns} {input.narrowpeak} |
awk -v OFS="\t" '{{$5=$5>1000?1000:$5}} {{print}}' |
bedSort /dev/stdin {output.tmp} > {log} 2>&1
bedToBigBed -type={params.type} -as={params.schema} {output.tmp} \
{input.sizes} {output.out} >> {log} 2>&1
plotCorrelation
Calculate the correlation between bams with deepTools.
plotCorrelation --corData {input.cordata} --plotFile {output} -c {wildcards.method} \
-p heatmap --plotTitle {params.title} {params.params} > {log} 2>&1
plotFingerprint
Plot the “fingerprint” of your bams, using deeptools.
plotFingerprint -b {input.bams} {params} --outRawCounts {output} -p {threads} > {log} 2>&1
plotHeatmap_peak
Plot the peak heatmap using deepTools.
plotHeatmap --matrixFile {input} --outFileName {output.img} {params.params} --startLabel="-{params.slop}" --endLabel={params.slop} > {log} 2>&1
plotPCA
Plot a PCA between bams using deepTools.
plotPCA --corData {input} --outFileNameData {output} > {log} 2>&1
plotProfile_gene
Plot the gene profile using deeptools.
plotProfile -m {input} --outFileName {output.img} --outFileNameData {output.file} > {log} 2>&1
prepare_DEXseq_annotation
Generate a DEXSeq annotation.gff from the annotation.gtf.
current_conda_env=$(conda env list | grep \* | cut -d "*" -f2-)
DEXseq_path=${{current_conda_env}}/lib/R/library/DEXSeq/python_scripts
python ${{DEXseq_path}}/dexseq_prepare_annotation.py {input} {output} > {log} 2>&1
pytxi_count_matrix
Convert transcript abundance estimations to gene count estimations and merge gene counts per assembly.
If the GTF is not used, a taxonomy ID is required for myGene.info. Seq2science will try to read this ID from the genomepy README.txt or assembly database.
quantile_normalization
Quantile normalization is a type of normalization that makes the distribution between samples identical. This means that the actual count distribution within a sample changes. After quantile normalization, samples are CPM normalized (which has no effect between and within samples), for a fair comparison between normalisation techniques.
See wikipedia: https://en.wikipedia.org/wiki/Quantile_normalization
random_subset_peaks
Take a random subset of all summits, and make them all the same width.
shuf -n {wildcards.nrpeaks} {input.peaks} | bedtools slop -i stdin -g {input.sizes} -b {params.slop} > {output}
run2sra
Download the SRA of a sample by its unique identifier.
# move to output dir since somehow prefetch sometimes puts files in the cwd...
# and remove the top level folder since prefetch will assume we are done otherwise
mkdir -p {params.outdir}; cd {params.outdir}; rm -r {wildcards.run}
# TODO: for loop can be removed if we dont see the debug message
# three attempts
for i in {{1..3}}
do
# acquire a lock
(
flock -w 30 200 || continue
sleep 2
) 200>{PYSRADB_CACHE_LOCK}
# dump
prefetch --max-size u --output-directory ./ --log-level debug --progress {wildcards.run} \
>> {log} 2>&1 && break
# retry
echo "DEBUG: prefetch try ${{i}} of 3 failed" >> {log}
sleep 10
done
# TODO: section can be removed if we dont see the debug message
# bug report: https://github.com/ncbi/sra-tools/issues/533
if [[ -f "{params.outdir}/{wildcards.run}.sra" ]]; then
echo "DEBUG: moving output to correct directory" >> {log}
mkdir -p {params.outdir}/{wildcards.run}
mv {params.outdir}/{wildcards.run}.sra {output}
fi
# TODO: section can be removed if we dont see the debug message
# If an sralite file was downloaded instead of a sra file, just rename it
if [[ -f "{params.outdir}/{wildcards.run}.sralite" ]]; then
echo "DEBUG: renaming SRAlite" >> {log}
mkdir -p {params.outdir}/{wildcards.run}
mv {params.outdir}/{wildcards.run}.sralite {output}
fi
runs2sample
Concatenate a single run or multiple runs into one fastq per sample.
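Since gzip streams can be concatenated as-is, this amounts to (run names illustrative):
cat run1.fastq.gz run2.fastq.gz > sample.fastq.gz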
salmon_index
Generate a transcriptome index for Salmon.
mkdir -p {output}
salmon index -t {input[0]} {params.decoys} -i {output} {params.params} \
--threads {threads} > {log} 2>&1
salmon_quant
Align reads against a transcriptome (index) with Salmon (mapping-based mode) and output a quantification file per sample.
salmon quant -i {input.index} -l A {params.input} {params.params} -o $(dirname {output}) \
--threads {threads} > {log} 2>&1
sam2bam
Convert a file in SAM format into BAM format (binary).
samtools view -b {input} > {output}
sambamba_sort
Sort the result of alignment or sieving with the sambamba sorter.
sambamba view --nthreads {threads} -f bam {input} -o /dev/stdout 2> {log} |
sambamba sort --nthreads {threads} {params} /dev/stdin -o {output} 2> {log}
samtools_index
Create an index of a bam/cram file which can be used for e.g. visualization.
samtools index {params} {input} {output}
samtools_presort
(Pre)sort the result of alignment with the samtools sorter.
# we set this trap to remove temp files when prematurely ending the rule
trap "rm -f {resources.tmpdir}/{wildcards.assembly}-{wildcards.sample}.tmp*bam" INT;
rm -f {resources.tmpdir}/{wildcards.assembly}-{wildcards.sample}.tmp*bam 2> {log}
# RAM per thread in MB
memory=$((1024*{resources.mem_gb}/{threads}))M
samtools sort -@ {threads} -m $memory {input} -o {output} \
-T {resources.tmpdir}/{wildcards.assembly}-{wildcards.sample}.tmp 2> {log}
samtools_sort
Sort the result of alignment or sieving with the samtools sorter.
# we set this trap to remove temp files when prematurely ending the rule
trap "rm -f {resources.tmpdir}/{wildcards.assembly}-{wildcards.sample}.tmp*bam" INT;
rm -f {resources.tmpdir}/{wildcards.assembly}-{wildcards.sample}.tmp*bam 2> {log}
# RAM per thread in MB
memory=$((1024*{resources.mem_gb}/{threads}))M
samtools sort {params.sort_order} -@ {threads} -m $memory {input} -o {output} \
-T {resources.tmpdir}/{wildcards.assembly}-{wildcards.sample}.tmp 2> {log}
samtools_sort_allsizes
Sort the allsizes output of sieve_bam. This rule is identical to the samtools_sort rule except that the input must contain _allsizes, and the output is temporary.
# we set this trap to remove temp files when prematurely ending the rule
trap "rm -f {resources.tmpdir}/{wildcards.assembly}-{wildcards.sample}.tmp*bam" INT;
rm -f {resources.tmpdir}/{wildcards.assembly}-{wildcards.sample}.tmp*bam 2> {log}
# RAM per thread in MB
memory=$((1024*{resources.mem_gb}/{threads}))M
samtools sort {params.sort_order} -@ {threads} -m $memory {input} -o {output} \
-T {resources.tmpdir}/{wildcards.assembly}-{wildcards.sample}.tmp 2> {log}
samtools_stats
Get general stats from bam files like percentage mapped.
samtools stats {input} 1> {output} 2> {log}
sctk_qc
Perform scRNA QC with singleCellTK, store output in SingleCellExperiment object and export to RData format.
setup_blacklist
Combine the encode blacklist with mitochondrial DNA depending on config.
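A sketch of the idea with hypothetical inputs: start from the encode blacklist and, when configured, append the mitochondrial chromosome as a fully blacklisted region.
cp blacklist.bed combined_blacklist.bed
awk -v OFS='\t' '$1 == "chrM" || $1 == "MT" {print $1, 0, $2}' \
    genome.fa.sizes >> combined_blacklist.bed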
sieve_bam
“Sieve” a bam.
This rule does (at least) one of the following:
- filter on minimum mapping quality
- adjust reads for the tn5 shift
- remove multimappers
- remove reads inside the blacklist
- remove duplicates
- filter paired-end reads on transcript length
- subsample to a maximum number of reads (equal across samples)
# note: the shift/sieve/subsample params expand to piped commands (each ending
# in a pipe) or to nothing, so the final samtools view reads from this pipeline
samtools view -h {params.prim_align} {params.minqual} -L {input.blacklist} \
{input.bam} | {params.atacshift} {params.sizesieve} {params.subsample}
samtools view -b > {output.bam} 2> {log}
# single-end reads never output allsizes so just touch the file when filtering on size
{params.sizesieve_touch}
softmask_track_1
Generate a track of all softmasked regions
source: https://github.com/Gaius-Augustus/MakeHub/blob/master/make_hub.py
softmask_track_2
Generate a track of all softmasked regions
source: https://github.com/Gaius-Augustus/MakeHub/blob/master/make_hub.py
bedSort {input.mask_unsorted} {output.maskbed} >> {log} 2>&1
bedToBigBed -type=bed3 {output.maskbed} {input.sizes} {output.mask} >> {log} 2>&1
sra2fastq_PE
Downloaded (raw) SRAs are converted to paired-end fastq files. Forward and reverse samples will be switched if forward/reverse names are not lexicographically ordered.
# move to output dir since somehow parallel-fastq-dump sometimes puts files in the cwd...
mkdir -p {output.tmpdir}; cd {output.tmpdir}
# acquire the lock
(
flock -w 30 200 || exit 1
sleep 3
) 200>{PYSRADB_CACHE_LOCK}
# dump to tmp dir
parallel-fastq-dump -s {input} -O {output.tmpdir} \
--threads {threads} --split-3 --skip-technical --dumpbase \
--readids --clip --read-filter pass --defline-seq '@$ac.$si.$sg/$ri' \
--defline-qual '+' --gzip >> {log} 2>&1
# check if the files exist
if ! compgen -G "{output.tmpdir}/*_1*" > /dev/null ; then printf "ERROR: Couldn't find read 1.fastq after dumping! Perhaps this is not a paired-end file?\n" >> {log} 2>&1; fi
if ! compgen -G "{output.tmpdir}/*_2*" > /dev/null ; then printf "ERROR: Couldn't find read 2.fastq after dumping! Perhaps this is not a paired-end file?\n" >> {log} 2>&1; fi
# rename file and move to output dir
mv {output.tmpdir}/*_1* {output.fastq[0]} >> {log} 2>&1
mv {output.tmpdir}/*_2* {output.fastq[1]} >> {log} 2>&1
sra2fastq_SE
Downloaded (raw) SRAs are converted to single-end fastq files.
# move to output dir since somehow parallel-fastq-dump sometimes puts files in the cwd...
mkdir -p {output.tmpdir}; cd {output.tmpdir}
# acquire the lock
(
flock -w 30 200 || exit 1
sleep 3
) 200>{PYSRADB_CACHE_LOCK}
# dump to tmp dir
parallel-fastq-dump -s {input} -O {output.tmpdir} \
--threads {threads} --split-spot --skip-technical --dumpbase --readids \
--clip --read-filter pass --defline-seq '@$ac.$si.$sg/$ri' \
--defline-qual '+' --gzip >> {log} 2>&1
# rename file and move to output dir
mv {output.tmpdir}/*.fastq.gz {output.fastq}
star_align
Align reads against a genome (index) with STAR, and pipe the output to the required sorter(s).
trap "find {log} -type f ! -name Log* -exec rm {{}} \;" EXIT
mkdir -p {log}
mkdir -p {output.dir}
STAR --genomeDir {input.index} --readFilesIn {params.input} --readFilesCommand gunzip -c \
--outSAMtype BAM Unsorted --outStd BAM_Unsorted \
--outFileNamePrefix {log}/ --outTmpDir {output.dir}/STARtmp \
--runThreadN {threads} {params.params} > {output.pipe} 2> {log}/Log.stderr.out
# move all non-log files to output directory (this way the log files are kept on error)
find {log} -type f ! -name Log* -exec mv {{}} {output.dir} \;
star_index
Make a genome index for STAR.
Troubleshooting:
- sufficient RAM & disk space?
- increase the RAM available (--limitGenomeGenerateRAM)
- reduce the number of threads (seq2science -j 5)
- reduce accuracy (--genomeSAsparseD 2)
In your config.yaml, add:
aligner:
  star:
    index: --limitGenomeGenerateRAM 60000000000 --genomeSAsparseD 1
function log2 {{
local x=0
for (( y=$1-1 ; $y > 0; y >>= 1 )) ; do
let x=$x+1
done
echo $x
}}
# set genome dependent variables
NBits=""
NBases=""
GenomeLength=$(awk -F"\t" '{{x+=$2}}END{{printf "%i", x}}' {input.sizes})
NumberOfReferences=$(awk 'END{{print NR}}' {input.sizes})
if [ $NumberOfReferences -gt 5000 ]; then
# for large genomes, --genomeChrBinNbits should be scaled to min(18,log2[max(GenomeLength/NumberOfReferences,ReadLength)])
# ReadLength is skipped here, as it is unknown
LpR=$(log2 $((GenomeLength / NumberOfReferences)))
NBits="--genomeChrBinNbits $(($LpR<18 ? $LpR : 18))"
printf "NBits: $NBits\n\n" >> {log} 2>&1
fi
if [ $GenomeLength -lt 268435456 ]; then
# for small genomes, --genomeSAindexNbases must be scaled down to min(14, log2(GenomeLength)/2-1)
logG=$(( $(log2 $GenomeLength) / 2 - 1 ))
NBases="--genomeSAindexNbases $(( $logG<14 ? $logG : 14 ))"
printf "NBases: $NBases\n\n" >> {log} 2>&1
fi
mkdir -p {output}
STAR --runMode genomeGenerate --genomeFastaFiles {input.genome} --sjdbGTFfile {input.gtf} \
--genomeDir {output} --outFileNamePrefix {output}/ \
--runThreadN {threads} $NBits $NBases {params} >> {log} 2>&1
tpm_matrix
Create a TPM normalized table from a counts table.
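A sketch of the TPM math on a hypothetical single-sample table (gene, length, count): divide each count by its gene length, then scale so the values sum to one million (the per-kilobase factor cancels in the ratio).
awk -v OFS='\t' 'NR == 1 {next}
    {id[NR] = $1; rpk[NR] = $3 / $2; sum += rpk[NR]}
    END {for (i = 2; i <= NR; i++) print id[i], rpk[i] / sum * 1e6}' \
    counts_with_lengths.tsv > tpm.tsv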
trackhub
Generate a UCSC track hub/assembly hub.
To view the hub, the output directory must be hosted at a web-accessible location, and the location of the hub.txt file given to the UCSC genome browser at My Data > Track Hubs > My Hubs
trackhub_index
Generate a searchable annotation & index for each assembly
sources: https://genome.ucsc.edu/goldenPath/help/hubQuickStartSearch.html, https://www.biostars.org/p/272649/
# filter out tiny tiny transcripts that make gtftogenepred fail
min_transcript_length=15
(awk -v minlen=$min_transcript_length 'BEGIN {{ FS = "\\t"; OFS="\\t" }}; {{ if(($3=="transcript") && ($5 - $4 <= minlen)) {{ print $9 }} }}' {input.gtf} | grep . || echo "transcript_id \\\"DONTMATCHANYTHING\\\"") | grep -oP '(?<=transcript_id ").+?(?=")' | grep -v -f - {input.gtf} > {output.gtf_min_size} 2>> {log}
# generate annotation files
gtfToGenePred -allErrors -ignoreGroupsWithoutExons -geneNameAsName2 -genePredExt {output.gtf_min_size} {output.genePred} -infoOut={output.info} >> {log} 2>&1
# if gtf has gene_names, use gene_names, else transcript_id
if grep -Fq "gene_name" {output.gtf_min_size}; then n=1; else n=0; fi
# switch columns 1 (transcript_id) and 12 (gene_name)
awk -v n=$n 'BEGIN {{ FS = "\\t"; OFS="\\t" }}; {{ if(n==1) {{ t = $1; $1 = $12; $12 = t; print; }} else {{ print; }} }}' {output.genePred} > {output.genePrednamed}
# remove lines with an empty first column (e.g. no gene names for ERCC RNA spike-in)
grep -vP "^\t" {output.genePrednamed} > {output.genePred}
# remove spaces from the genePred (error in annotation)
sed -i 's/ //g' {output.genePred}
genePredToBed {output.genePred} {output.genePredbed} >> {log} 2>&1
bedSort {output.genePredbed} {output.genePredbed} >> {log} 2>&1
bedToBigBed -extraIndex=name {output.genePredbed} {input.sizes} {output.genePredbigbed} >> {log} 2>&1
# generate searchable indexes (by 1: transcriptID, 2: geneId, 8: proteinID, 9: geneName, 10: transcriptName)
grep -v "^#" {output.info} | awk -v n=$n 'BEGIN {{ FS = "\\t"; OFS="\\t" }} ; {{ if(n==1) {{print $9, $1, $2, $8, $9, $10}} else {{print $1, $1, $2, $8, $9, $10}} }}' > {output.indexinfo}
ixIxx {output.indexinfo} {output.ix} {output.ixx} >> {log} 2>&1
trimgalore_PE
Automated adapter detection, adapter trimming, and quality trimming through trim galore (paired-end).
trim_galore --paired -j {threads} {params.config} -o $(dirname {output.r1}) {input.r1} {input.r2} > {log} 2>&1
# now rename to proper output
for f in $(find "$(dirname {output.r1})/" -name "{wildcards.sample}_*val_*.fq.gz"); do
mv "$f" "$(echo "$f" | sed s/.fq/.{params.fqsuffix}/)"; done
for f in $(find "$(dirname {output.r1})/" -maxdepth 1 -name "{wildcards.sample}_*val_*.{params.fqsuffix}.gz"); do
mv "$f" "$(echo "$f" | sed s/_val_./_trimmed/)"; done
# move the trimming reports to qc directory
for f in $(find "$(dirname {output.r1})/" -name "{wildcards.sample}_*.{params.fqsuffix}.gz_trimming_report.txt"); do
mv "$f" "$(dirname {output.qc[0]})/$(basename $f)"; done
trimgalore_SE
Automated adapter detection, adapter trimming, and quality trimming through trim galore (single-end).
trim_galore -j {threads} {params.config} -o $(dirname {output.se}) {input} > {log} 2>&1
# now rename to proper output
if [ "{params.fqsuffix}" != "fq" ]; then
mv "$(dirname {output.se})/{wildcards.sample}_trimmed.fq.gz" {output.se}
fi
# move the trimming report to qc directory
report=$(dirname {output.se})/{wildcards.sample}.{params.fqsuffix}.gz_trimming_report.txt
mv $report {output.qc}
twobit
Generate a 2bit file for each assembly
faToTwoBit {input} {output} >> {log} 2>&1
txi_count_matrix
Convert transcript abundance estimations to gene count estimations and merge gene counts per assembly.
Also outputs a single cell experiment object similar to ARMOR (https://github.com/csoneson/ARMOR).
Only works with Ensembl assemblies.
upset_plot_peaks
Make an upset plot showing the distribution of peaks amongst the biological replicates.