Per-rule explanation
This is an automatically generated list of all supported rules, their docstrings, commands, and virtual environments. At the start of each workflow run, seq2science prints the list of rules that will be run, and while the workflow is running it reports which rules are starting and finishing. This page explains to users what each rule does, and lets developers see what is, and isn't yet, supported.
bam2cram
Convert a bam file to the more compressed cram format.
samtools view -T {input.genome} -C {input.bam} -@ {threads} > {output} 2> {log}
bam_bigwig
Convert a bam file into a bigwig file (or two). Can output strand-specific bigwigs. The reverse bigwig is created “secretly” to avoid checkpoints.
if [[ "{params.strandedness}" == "no" ]]; then
echo "generating unstranded bam" > {log}
bamCoverage --bam {input.bam} --outFileName {output} --numberOfProcessors {threads} {params.flags} --verbose >> {log} 2>&1
else
echo "generating stranded bams" > {log}
out={output}
rev=$(echo "${{out/{wildcards.sorting}.bw/{wildcards.sorting}.rev.bw}}")
rm -f $rev # in case of a rerun
bamCoverage --bam {input.bam} --outFileName $rev --filterRNAstrand reverse --numberOfProcessors {threads} {params.flags} --verbose >> {log} 2>&1
bamCoverage --bam {input.bam} --outFileName {output} --filterRNAstrand forward --numberOfProcessors {threads} {params.flags} --verbose >> {log} 2>&1
fi
bedgraph_bigwig
Convert a bedgraph file into a bigwig.
awk -v OFS='\\t' '{{print $1, $2, $3, $4}}' {input.bedgraph} | sed '/experimental/d' |
bedSort /dev/stdin {output.tmp} > {log} 2>&1
bedGraphToBigWig {output.tmp} {input.sizes} {output.out} >> {log} 2>&1
bedgraphish_to_bedgraph
Convert the bedgraph-ish file generated by genrich into a true bedgraph file.
# find the line number and bam basename of each section in the bedgraph-ish file (format "N:name")
splits=$(grep -Pno "([^\/]*)(?=\.bam)" {input})
splits=($splits)
# append the total line count of the file (formatted as "N:") as a sentinel
lncnt=$(wc -l {input} | echo "$(grep -Po ".*(?=\ )")":)
splits+=($lncnt)
counter=1
# loop over all sections (every element except the sentinel)
for split in "${{splits[@]::${{#splits[@]}}-1}}";
do
filename=$(grep -Po "(?<=:).*" <<< $split);
# if the section belongs to this sample, print its bedgraph columns
# (the lines between this section's header and the start of the next one)
if [[ $filename =~ {wildcards.sample} ]]; then
startnr=$(grep -Po ".*(?=\:)" <<< $split);
endnr=$(grep -Po ".*(?=\:)" <<< ${{splits[counter]}});
lines="NR>=startnr && NR<=endnr {{ print \$1, \$2, \$3, \$4 }}"
lines=${{lines/startnr/$((startnr + 2))}}
lines=${{lines/endnr/$((endnr - 1))}}
awk "$lines" {input} > {output}
fi
((counter++))
done
bedtools_slop
After combine_peaks we end up with just a bed file of summits. We extend all peaks to the same width, for a fair comparison between peaks.
bedtools slop -i {input.bedfile} -g {input.sizes} -b {params.slop} | uniq > {output} 2> {log}
blind_clustering
Create a sample distance cluster heatmap, and various correlation cluster heatmaps.
A (blind) variance stabilizing transformation is applied to the counts, to reduce the impact of genes with low expression levels.
bowtie2_align
Align reads against a genome (index) with bowtie2, and pipe the output to the required sorter(s).
bowtie2 {params.params} --threads {threads} -x {input.index}/{wildcards.assembly} {params.input} 2> {log} | tee {output} 1> /dev/null 2>> {log}
bowtie2_index
Make a genome index for bowtie2. This index is required for alignment.
mkdir -p {output}
bowtie2-build {params} --threads {threads} {input} {output}/{wildcards.assembly} > {log} 2>&1
bwa_index
Make a genome index for bwa (mem). This index is required for alignment.
mkdir -p {output}
bwa index -p {params.prefix} {params.params} {input} > {log} 2>&1
bwa_mem
Align reads against a genome (index) with bwa-mem, and pipe the output to the required sorter(s).
bwa mem {params.params} -t {threads} {params.index_dir} {input.reads} 2> {log} | tee {output} 1> /dev/null 2>> {log}
bwa_mem2
Align reads against a genome (index) with bwa-mem2, and pipe the output to the required sorter(s).
bwa-mem2 mem {params.params} -t {threads} {params.index_dir} {input.reads} 2> {log} | tee {output} 1> /dev/null 2>> {log}
bwa_mem2_index
Make a genome index for bwa-mem2. This index is required for alignment.
mkdir -p {output}
bwa-mem2 index -p {params.prefix} {input} > {log} 2>&1
call_peak_genrich
Call peaks with genrich based on the pileup.
Genrich -P -f {input.log} -o {output.narrowpeak} {params} -v > {log} 2>&1
chipseeker
Make chipseeker plots, with the percentage of peaks in e.g. promoters.
citeseqcount
Map and quantify ADTs with CITE-seq-Count, and output a UMI/read matrix per sample.
CITE-seq-Count -R1 {input.reads[0]} -R2 {input.reads[1]} \
-t {input.tags} {params.barcode_arg} -T {threads} \
{params.options} -o {params.outdir} > {log} 2>&1
combine_biological_reps
Combine biological replicates by taking their mean!
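The averaging itself is not a shell one-liner in the workflow, but the idea can be sketched. A minimal sketch, assuming two hypothetical two-column (region, count) tables with identical region order:
paste rep1.tsv rep2.tsv | awk -v OFS='\t' '{print $1, ($2 + $4) / 2}' > mean.tsv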
combine_peaks
Uses gimmemotifs’ combine_peaks to “combine” peaks. This finds all peaks close together and takes the most significant one as the true peak.
combine_peaks --genome {input.genome} --window {params.windowsize} \
{input.summitfiles} > {output} 2> {log}
complement_blacklist
Take the complement of the blacklist. We need this complement to tell samtools to only keep reads that are in the complement of the blacklist.
sortBed -faidx {input.sizes} -i {input.blacklist} 2>> {log} |
complementBed -i stdin -g {input.sizes} > {output} 2>> {log}
computeMatrix_gene
Pre-compute a signal matrix over genes with deeptools (input for the gene profile plots).
computeMatrix scale-regions -S {input.bw} {params.labels} -R {params.gtf} \
-p {threads} {params.params} -o {output} > {log} 2>&1
computeMatrix_peak
Pre-compute a signal matrix over peaks with deeptools (input for the peak heatmap plots).
computeMatrix scale-regions -S {input.bw} {params.labels} -R {input.peaks} \
-p {threads} --binSize 5 -o {output} > {log} 2>&1
count_matrix
Combine count tables into one count matrix per assembly
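Conceptually this boils down to pasting the per-sample count columns together. A minimal sketch, assuming hypothetical featureCounts outputs in which column 1 is the gene id and column 7 the count:
grep -v '^#' sample1.counts | cut -f 1 > matrix.tsv
for f in sample1.counts sample2.counts; do
    grep -v '^#' "$f" | cut -f 7 | paste matrix.tsv - > matrix.tmp && mv matrix.tmp matrix.tsv
done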
coverage_table
Use gimmemotifs' coverage_table to generate a count table with the number of reads under each peak per sample.
echo "# The number of reads under each peak" > {output}
coverage_table {input.peaks} {input.replicates} --window {params.peak_width} --nthreads {threads} \
2> {log} | grep -vE "^#" 2>> {log} |
awk 'BEGIN {{ FS = "@" }} NR==1{{gsub("{wildcards.assembly}-|.samtools-coordinate","",$0)}}; \
{{print $0}}' >> {output}
# overwrite sample names with descriptive/replicate names
sed -i "2s/.*/{params.names}/" {output}
create_SNAP_object
Create a snapobject for each BAM file.
These snapobjects can be merged later using snaptools in R.
snaptools snap-pre --input-file={input.bam} --output-snap={output} --genome-name={wildcards.assembly} \
--genome-size={input.sizes} {params.params} {params.chrm} {params.mapq} > {log} 2>&1
create_bins_SNAP_object
Add a binned genome matrix to the SNAP object, after which it is renamed and moved to the Snapfiles folder for downstream analysis in R using Snaptools.
snaptools snap-add-bmat --snap-file={input} {params} > {log} 2>&1
echo "bmat added, moving file" >> {log}
mv {input} {output} >> {log} 2>&1
cytoband
Generate a cytoband track for each assembly
source: http://genomewiki.ucsc.edu/index.php/Assembly_Hubs#Cytoband_Track
cat {input.sizes} |
bedSort /dev/stdin /dev/stdout |
awk '{{print $1,0,$2,$1,"gneg"}}' > {output.cytoband_bd}
bedToBigBed -type=bed4 {output.cytoband_bd} -as={params.schema} \
{input.sizes} {output.cytoband_bb} >> {log} 2>&1
deseq2
Differential gene expression analysis with DESeq2.
dexseq_count
Count exon usage.
current_conda_env=$(conda env list | grep \* | cut -d "*" -f2-)
DEXseq_path=${{current_conda_env}}/lib/R/library/DEXSeq/python_scripts
python ${{DEXseq_path}}/dexseq_count.py -f bam -r pos {params.endedness} -s {params.strandedness} {input.gff} {input.bam} {output} > {log} 2>&1
dexseq_count_matrix
Combine DEXSeq counts into one count matrix per assembly
for use in function DEXSeqDataSet()
dupRadar
Visualize the fraction of artifactual duplicate reads versus normal read duplication (due to natural over-sequencing of highly expressed genes).
dupRadar_combine
Combine the individual images (so we can group them nicely in the MultiQC).
convert {params.good_example} {params.bad_example} {input} -append {output} 2> {log}
edgeR_normalization
edgeR supports three different types of normalization: TMM, RLE, and upperquartile.
TMM is the weighted trimmed mean of M-values proposed by Robinson and Oshlack (2010). RLE is the scaling factor method proposed by Anders and Huber (2010); DESeq2 normalisation is based on it. Upperquartile is the upper-quartile normalization method of Bullard et al. (2010).
ena2fastq_PE
Download paired-end fastq files directly from the ENA.
ena2fastq_SE
Download single-end fastq files directly from the ENA.
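For both ena2fastq rules, the download amounts to looking up a run's fastq links in the ENA portal API and fetching them. A rough, hand-run sketch (the accession is illustrative):
acc=SRR000001  # hypothetical run accession
wget -qO- "https://www.ebi.ac.uk/ena/portal/api/filereport?accession=${acc}&result=read_run&fields=fastq_ftp" |
    grep -o '[^[:space:];]*\.fastq\.gz' | sed 's|^|ftp://|' | xargs -n 1 wget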
export_sce_obj
Read scRNA UMI counts into a SingleCellExperiment object, add colData and export to RData format.
extend_genome
Append the given file(s) to the genome.
# extend the genome.fa
cp {input.genome} {output.genome}
for FILE in {input.extension}; do
cat $FILE >> {output.genome}
done
extend_genome_annotation
Append the given file(s) to the genome annotation.
# extend the genome.annotation.gtf
cp {input.gtf} {output.gtf}
for FILE in {input.extension}; do
cat $FILE >> {output.gtf}
done
# generate an extended genome.annotation.bed
gtfToGenePred {output.gtf} {output.gp}
genePredToBed {output.gp} {output.bed}
extend_genome_blacklist
Copy blacklist to the custom genome directory
cp {input} {output}
fastp_PE
Automated adapter detection, adapter trimming, and quality trimming through fastp (paired-end).
threads=$(( {threads} - 2 > 1 ? {threads} - 2 : 1 ))
fastp -w $threads --in1 {input[0]} --in2 {input[1]} \
--out1 {output.r1} --out2 {output.r2} -h {output.qc_html} -j {output.qc_json} \
{params.config} > {log} 2>&1
fastp_SE
Automated adapter detection, adapter trimming, and quality trimming through fastp (single-end).
threads=$(( {threads} - 2 > 1 ? {threads} - 2 : 1 ))
fastp -w $threads --in1 {input} --out1 {output.se} -h {output.qc_html} -j {output.qc_json} \
{params.config} > {log} 2>&1
fastp_qc_PE
Get quality scores for (technical) replicates
fastp -w {threads} --in1 {input[0]} --in2 {input[1]} \
-h {output.qc_html} -j {output.qc_json} \
{params.config} > {log} 2>&1
fastp_qc_SE
Get quality scores for (technical) replicates
fastp -w {threads} --in1 {input} -h {output.qc_html} -j {output.qc_json} > {log} 2>&1
fastq_pair
fastq_pair rewrites paired-end fastq files to ensure that each read has a mate, and discards singleton reads. This step is required after scRNA trimming, since we only trim the fastq containing the reads, not the barcode fastq.
gunzip -c {input.r1} > {output.intermediates1[0]} 2> {log}
gunzip -c {input.r2} > {output.intermediates1[1]} 2>> {log}
if [ {params.tused} == true ]
then
opts="{params.options}"
else
echo "\nsetting parameter t with the number of reads in the fastq\n" >> {log}
opts="-p -t "$(wc -l {input.r1} | grep -Po '^\d+' | awk '{{print int($1/4)}}')
fi
fastq_pair $opts {output.intermediates1} >> {log} 2>&1
gzip -c {output.intermediates3[0]} > {output.reads[0]} 2> {log}
gzip -c {output.intermediates3[1]} > {output.reads[1]} 2>> {log}
fastqc
Generate quality control report for fastq files.
fastqc {input} -O {params} > {log} 2>&1
featureCounts
Use featureCounts to generate the fraction of reads in peaks score (FRiP / assigned reads). See: https://www.biostars.org/p/337872/ and https://www.biostars.org/p/228636/
# Make a custom "SAF" file which featureCounts needs:
awk 'BEGIN{{FS=OFS="\t"; print "GeneID\tChr\tStart\tEnd\tStrand"}}{{print $4, $1, $2+1, $3, "."}}' \
{input.peak} 1> {output.tmp_saf} 2> {log}
# run featureCounts
featureCounts -T {threads} -p -a {output.tmp_saf} -F SAF -o {output.real_out} {input.bam} > {log} 2>&1
# move the summary to qc directory
mv $(dirname {output.real_out})/$(basename {output.summary}) {output.summary}
featurecounts
Summarize reads to the gene level. Outputs a counts table per bam file.
featureCounts -a {input.gtf} {input.bam} {params.endedness} -s {params.strandedness} {params.user_flags} -T {threads} -o {output} > {log} 2>&1
full_decoy_transcripts
Use the entire genome as decoy sequence.
cut -f 1 {input.sizes} > {output.decoys}
cat {input.transcripts} {input.genome} > {output.gentrome}
gcPercent
Generate a gc content track
source: http://hgdownload.cse.ucsc.edu/goldenPath/hg19/gc5Base/
hgGcPercent -wigOut -doGaps -file=stdout -win=5 -verbose=0 {wildcards.assembly} {input.twobit} \
| gzip > {output.gcpcvar}
wigToBigWig {output.gcpcvar} {input.sizes} {output.gcpc} >> {log} 2>&1
gene_id2name
Parse the gtf file to generate a gene_id to gene_name conversion table.
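A minimal sketch of such a parse in shell, assuming GNU sed and a hypothetical annotation.gtf with gene_id and gene_name attributes:
awk -F '\t' '$3 == "gene"' annotation.gtf |
    sed -nE 's/.*gene_id "([^"]+)".*gene_name "([^"]+)".*/\1\t\2/p' |
    sort -u > gene_id2name.tsv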
genrich_pileup
Generate the pileup. We do this separately from peak-calling since these two processes have a very different computational footprint.
input=$(echo {input.reps} | tr ' ' ',')
Genrich -X -t $input -f {output.log} {params.control} -k {output.bedgraphish} \
{params.params} -v > {log} 2>&1
get_effective_genome_size
Get the effective genome size for a kmer length. Some tools (e.g. macs2) require an estimation of the effective genome size to better estimate how (un)likely it is to have a certain number of reads on a position. The actual genome size is not the best indication in these cases, since reads in repetitive regions (for a certain kmer length) cannot possibly align on some locations.
unique-kmers.py {input} -k {wildcards.kmer_size} --quiet 2>&1 | grep -P -o '(?<=\.fa: ).*' > {output} 2> {log}
get_genome
Download a genome through genomepy.
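Roughly equivalent to running genomepy by hand; the assembly name and provider below are illustrative, and flags may differ between genomepy versions:
genomepy install GRCz11 --provider Ensembl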
get_genome_annotation
Download a gene annotation through genomepy.
get_genome_blacklist
Download a genome blacklist for a genome, if it exists, through genomepy.
get_genome_support_files
Generate supporting files for a genome.
get_transcripts
Generate transcripts.fasta using gffread.
Requires genome.fa and annotation.gtf (with matching chromosome/scaffold names)
gffread -w {output} -g {input.fa} {input.gtf} > {log} 2>&1
gimme_maelstrom
Gimme maelstrom is a method to infer differential motifs between two or more biological replicates.
NEW_CACHE={output.cache}
mkdir -p $NEW_CACHE
if [ -z ${{XDG_CACHE_HOME+x}} ]; then
export XDG_CACHE_HOME=$HOME/.cache
fi
if [ -d $XDG_CACHE_HOME/gimmemotifs ]; then
cp -r $XDG_CACHE_HOME/gimmemotifs $NEW_CACHE/
fi
export XDG_CACHE_HOME=$NEW_CACHE
gsa_or_encode2fastq_PE
Download paired-end fastq files from the GSA or ENCODE DCC.
gsa_or_encode2fastq_SE
Download single-end fastq files from the GSA or ENCODE DCC.
hisat2_align
Align reads against a genome (index) with hisat2, and pipe the output to the required sorter(s).
hisat2 {params.params} --threads {threads} -x {input.index}/part {params.input} 2> {log} | tee {output} 1> /dev/null 2>> {log}
hisat2_index
Make a genome index for hisat2. This index is required for alignment.
mkdir -p {output}
hisat2-build {params} -p {threads} {input} {output}/part > {log} 2>&1
hisat2_splice_aware_index
Make an exon-junction and splice aware index for hisat2. This index is required for alignment and quantification of RNA-seq data.
mkdir -p {output}
hp=$(which hisat2)
python3 ${{hp}}_extract_splice_sites.py {input.gtf} > {output}/splice_sites.tsv 2>> {log}
python3 ${{hp}}_extract_exons.py {input.gtf} > {output}/exons.tsv 2>> {log}
hisat2-build {params} -p {threads} --ss {output}/splice_sites.tsv --exon {output}/exons.tsv \
{input.fasta} {output}/part >> {log} 2>&1
hmmratac
Call gappedpeaks with HMMRATAC.
HMMRATAC --bedgraph true -o {params.basename} {params.hmmratac_params} -Xmx22G -b {input.bam} -i {input.bam_index} -g {input.genome_size} > {log} 2>&1
hmmratac_genome_info
Generate the ‘genome info’ that hmmratac requires for peak calling. https://github.com/LiuLabUB/HMMRATAC/issues/17
TODO: isn't this just .fa.sizes?
samtools view -H {input} 2> {log} | grep SQ 2>> {log} | cut -f 2-3 2>> {log} | cut -d ':' -f 2 2>> {log} | cut -f 1 > {output.tmp1} 2> {log}
samtools view -H {input} 2>> {log} | grep SQ 2>> {log} | cut -f 2-3 2>> {log} | cut -d ':' -f 2,3 2>> {log} | cut -d ':' -f 2 > {output.tmp2} 2> {log}
paste {output.tmp1} {output.tmp2} > {output.out} 2> {log}
htseq_count
Summarize reads to the gene level. Outputs a counts table per bam file.
htseq-count {input.bam} {input.gtf} -r pos -s {params.strandedness} {params.user_flags} -n {threads} -c {output} > {log} 2>&1
idr
Combine replicates based on the irreproducible discovery rate (IDR). Can only handle two replicates; for more than two replicates use Fisher's method.
When combining narrowPeak files with IDR, the q-score and summit are set to -1 (meaning: not set). However, some downstream tools require these to be set, so we set the q-score to zero and place the summit in the middle of the peak.
if [ "{params.nr_reps}" == "1" ]; then
cp {input} {output.true}
touch {output.temp}
else
idr --samples {input} {params.rank} --output-file {output.temp} {params.options} > {log} 2>&1
if [ "{wildcards.ftype}" == "narrowPeak" ]; then
awk 'IFS="\t",OFS="\t" {{$9=0; $10=int(($3 - $2) / 2); print}}' {output.temp} > {output.true} 2>> {log}
else
cp {output.temp} {output.true}
fi
fi
infer_strandedness
Use RSeQC's infer_experiment.py to determine the strandedness of a sample.
infer_experiment.py -i {input.bam} -r {input.bed} -q {params} 2> {log} | awk NF > {output}
insert_size_metrics
Get insert size metrics from a (paired-end) bam. This score is then used by MultiQC in the report.
picard CollectInsertSizeMetrics INPUT={input} \
OUTPUT={output.tsv} H={output.pdf} > {log} 2>&1
kallistobus_count
Align reads against a transcriptome (index) with kallistobus and output a quantification file per sample.
kb count \
-i {params.basename}.idx \
-t {threads} -g {params.basename}_t2g.txt \
-o {params.outdir} {params.c1} {params.c2} \
{params.barcode_arg} {params.options} {input.reads} > {log} 2>&1
# Validate output
if grep -q 'ERROR\|bad_alloc' "{log}"; then
exit 1
fi
kallistobus_ref
Make a genome index for kallistobus. This index is required for counting.
mkdir -p {output}
kb ref \
{input.fa} {input.gtf} \
-i {params.basename}.idx -g {params.basename}_t2g.txt -f1 {params.basename}_cdna.fa \
{params.f2} {params.c1} {params.c2} \
{params.options} > {log} 2>&1
kallistobus_ref_kite
Make a mismatch index for kallistobus. This index is required to count feature barcodes, such as antibody tags.
mkdir -p {output}
kb ref \
{input.featurebarcodes} \
{params.options} \
-i {params.basename}.idx -g {params.basename}_t2g.txt -f1 {params.basename}_cdna.fa > {log} 2>&1
keep_mates
In-house script that, after alignment, removes the information that reads are paired. This can be beneficial when peak calling with macs2 when shifting + extending, since macs2 in this case only keeps the first in pair.
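The in-house script itself is not shown here, but the idea can be sketched with samtools and GNU awk (the and()/compl() bit operations are gawk extensions): clear the pairing-related FLAG bits and blank the mate fields.
samtools view -h paired.bam |
    gawk -v OFS='\t' '/^@/ {print; next}
    {$2 = and($2, compl(235)); $7 = "*"; $8 = 0; $9 = 0; print}' |
    samtools view -b > unpaired.bam
# 235 = 0x1 + 0x2 + 0x8 + 0x20 + 0x40 + 0x80: the paired, proper-pair,
# mate-unmapped, mate-reverse and first/second-in-pair bits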
linked_txome
Generate a linked transcriptome for tximeta
Also creates a symlink to the gtf in an Ensembl format (required by tximeta)
Required to convert salmon output (estimated transcript abundances) to gene counts
log_normalization
Log1p normalization of a count table.
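A minimal sketch of the transformation on a hypothetical counts table (gene ids in column 1, header on line 1); awk's log is the natural logarithm, so log($i + 1) is log1p:
awk -v OFS='\t' 'NR == 1 {print; next}
    {for (i = 2; i <= NF; i++) $i = log($i + 1); print}' counts.tsv > log_counts.tsv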
macs2_callpeak
Call peaks using macs2. Macs2 requires a genome size, which is estimated from the number of unique kmers at the average read length.
GENSIZE=$(cat {input.genome_size})
# call peaks
macs2 callpeak --bdg -t {input.bam} {params.control} --outdir {config[result_dir]}/macs2/ -n {wildcards.assembly}-{wildcards.sample} \
{params.macs_params} -g $GENSIZE -f {params.format} >> {log} 2>&1
macs_bdgcmp
Prepare p-value files for rule macs_cmbreps
macs2 bdgcmp -t {input.treatment} -c {input.control} -m qpois -o {output} > {log} 2>&1
macs_cmbreps
Combine biological replicates through Fisher’s method
(Link original peakfile in replicate_processed if there is only 1 sample for a condition)
if [ "{params.nr_reps}" == "1" ]; then
touch {output.tmpbdg} {output.tmppeaks}
mkdir -p $(dirname {output.peaks}); cp {input.treatment} {output.peaks}
else
macs2 cmbreps -i {input.bdgcmp} -o {output.tmpbdg} -m fisher > {log} 2>&1
macs2 {params.function} {params.config} -i {output.tmpbdg} -o {output.tmppeaks} >> {log} 2>&1
cat {output.tmppeaks} | tail -n +2 > {output.peaks}
fi
maelstrom_report_preparation
This rule injects the symlinked motif logos into the html so they get rendered in the multiqc report.
mark_duplicates
Mark all duplicate reads in a bam file with picard MarkDuplicates
picard MarkDuplicates TMP_DIR={resources.tmpdir} {params} INPUT={input} \
OUTPUT={output.bam} METRICS_FILE={output.metrics} > {log} 2>&1
mean_center
Mean centering of a count table.
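A minimal sketch on the same hypothetical table layout: subtract each gene's row mean from its values.
awk -v OFS='\t' 'NR == 1 {print; next}
    {s = 0; for (i = 2; i <= NF; i++) s += $i; m = s / (NF - 1)
     for (i = 2; i <= NF; i++) $i -= m; print}' counts.tsv > centered.tsv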
merge_replicates
Merge replicates (fastqs) simply by concatenating the files. We also change the name of the read headers to contain the name of the original replicate.
Must happen after trimming due to trim-galore’s automatic adapter trimming method
If a replicate has only 1 sample in it, simply move the file.
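A sketch of the renaming idea on hypothetical replicate fastqs: prefix every read header (line 1 of each 4-line record) with the replicate name while concatenating.
for rep in rep1 rep2; do
    zcat "${rep}.fastq.gz" |
        awk -v r="$rep" '{if (NR % 4 == 1) sub(/^@/, "@" r ":"); print}'
done | gzip > merged.fastq.gz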
merge_volcano_ma
Combine the volcano and maplot resulting of the deseq2 rule into a single figure.
convert {input.maplot} {input.volcanoplot} +append {output} 2> {log}
minimap2_align
Align reads against a genome (index) with minimap2, and pipe the output to the required sorter(s).
minimap2 -t {threads} -a {input.index} {input.reads} {params} > {output} 2> {log}
minimap2_index
Make a genome index for minimap2. This index is required for alignment.
mkdir -p $(dirname {output})
minimap2 -t {threads} -d {output} {input} {params} > {log} 2>&1
motif2factors
Create a gimme pfm/motif2factor file with the gimme motif2factors command. For human/mouse it just uses the default database, for other species it is based on TF orthologs.
mt_nuc_ratio_calculator
Estimate the amount of nuclear and mitochondrial reads in a sample. This metric is especially important in ATAC-seq experiments where mitochondrial DNA can be overrepresented. Reads are classified as mitochondrial reads if they map against either “chrM” or “MT”.
These values are aggregated and displayed in the MultiQC report.
mtnucratio {input.bam} {params.mitochondria}
multiBamSummary
Pre-compute a bam summary with deeptools.
multiBamSummary bins --bamfiles {input.bams} -out {output} {params.names} \
{params.params} -p {threads} > {log} 2>&1
multiqc
Aggregate all the quality control metrics for every sample into a single multiqc report.
The input list can get very long (causing problems with the shell), so we write the input to a file and use that file as input to multiqc (see the sketch after the command).
multiqc {input.files} -o {params.dir} -n multiqc_{wildcards.assembly}.html \
--config {input.schema} \
--config {input.header} \
--sample-names {input.sample_names} \
{params.filter_buttons} \
--cl-config "extra_fn_clean_exts: [ \
{{'pattern': ^.*{wildcards.assembly}-, 'type': 'regex'}}, \
{{'pattern': {params.fqext1}, 'type': 'regex'}}, \
{{'pattern': _allsizes, 'type': 'regex'}}, \
]" > {log} 2>&1
multiqc_explain
Generate rename buttons for the multiqc report.
multiqc_header_info
Generate a multiqc header file with contact info and date of multiqc generation.
multiqc_samplesconfig
Add a section in the multiqc report that reports the samples.tsv and config.yaml
multiqc_schema
Generate a multiqc config schema. Used for the ordering of modules.
narrowpeak_summit
Convert a narrowpeak file to a “macs2 summits” file.
awk 'BEGIN {{OFS="\t"}} {{ print $1,$2+$10,$2+$10+1,$4,$9; }}' {input} > {output} 2> {log}
onehot_peaks
Get one-hot encodings of which peaks are found in which samples.
awk '{{print $1":"$2"-"$3}}' {input.combinedpeaks} > {output.real} 2> {log}
for brep in {input.narrowpeaks}
do
bedtools sort -i $brep 2> {log} |
bedtools intersect -a {input.combinedpeaks} -b stdin -c 2> {log} |
awk '{{print $4}}' 2> {log} |
paste {output.real} - > {output.tmp}
mv {output.tmp} {output.real}
done
echo -e "# onehot encoding of which condition contains which peaks\\n{params.names}\n$(cat {output.real})" > {output.tmp} && cp {output.tmp} {output.real}
partial_decoy_transcripts
Compute a set of decoy sequences by mapping the annotated transcripts against a hard-masked version of the organism’s genome.
sh {params.script} -j {threads} -g {input.genome} -a {input.gtf} -t {input.transcripts} -o $(dirname {output[0]}) > {log} 2>&1
peak_bigpeak
Convert a narrowpeak file into a bignarrowpeak file. https://genome-source.gi.ucsc.edu/gitlist/kent.git/tree/master/src/hg/lib/
# keep first 10 columns, idr adds extra columns we do not need for our bigpeak
cut -d$'\t' -f 1-{params.columns} {input.narrowpeak} |
awk -v OFS="\t" '{{$5=$5>1000?1000:$5}} {{print}}' |
bedSort /dev/stdin {output.tmp} > {log} 2>&1
bedToBigBed -type={params.type} -as={params.schema} {output.tmp} \
{input.sizes} {output.out} >> {log} 2>&1
plotCorrelation
Calculate the correlation between bams with deepTools.
plotCorrelation --corData {input.cordata} --plotFile {output} -c {wildcards.method} \
-p heatmap --plotTitle {params.title} {params.params} > {log} 2>&1
plotFingerprint
Plot the “fingerprint” of your bams, using deeptools.
plotFingerprint -b {input.bams} {params} --outRawCounts {output} -p {threads} > {log} 2>&1
plotHeatmap_peak
Plot the peak heatmap using deepTools.
plotHeatmap --matrixFile {input} --outFileName {output.img} {params.params} --startLabel="-{params.slop}" --endLabel={params.slop} > {log} 2>&1
plotPCA
Plot a PCA between bams using deepTools.
plotPCA --corData {input} --outFileNameData {output} > {log} 2>&1
plotProfile_gene
Plot the gene profile using deeptools.
plotProfile -m {input} --outFileName {output.img} --outFileNameData {output.file} > {log} 2>&1
prepare_DEXseq_annotation
Generate a DEXSeq annotation.gff from the annotation.gtf.
current_conda_env=$(conda env list | grep \* | cut -d "*" -f2-)
DEXseq_path=${{current_conda_env}}/lib/R/library/DEXSeq/python_scripts
python ${{DEXseq_path}}/dexseq_prepare_annotation.py {input} {output} > {log} 2>&1
pytxi_count_matrix
Convert transcript abundance estimations to gene count estimations and merge gene counts per assembly.
If the GTF is not used, a taxonomy ID is required for myGene.info. Seq2science will try to read this ID from the genomepy README.txt or assembly database.
quantile_normalization
Quantile normalization is a type of normalization that makes the distribution between samples identical. This means that the actual count distribution within a sample changes. After quantile normalization, samples are CPM normalized (which has no effect between and within samples), for a fair comparison between normalisation techniques.
See wikipedia: https://en.wikipedia.org/wiki/Quantile_normalization
random_subset_peaks
Take a random subset of all summits, and make them all the same width.
shuf -n {wildcards.nrpeaks} {input.peaks} | bedtools slop -i stdin -g {input.sizes} -b {params.slop} > {output}
run2sra
Download the SRA of a sample by its unique identifier.
# move to output dir since somehow prefetch sometimes puts files in the cwd...
# and remove the top level folder since prefetch will assume we are done otherwise
mkdir -p {params.outdir}; cd {params.outdir}; rm -r {wildcards.run}
# TODO: for loop can be removed if we dont see the debug message
# three attempts
for i in {{1..3}}
do
# acquire a lock
(
flock -w 30 200 || continue
sleep 2
) 200>{PYSRADB_CACHE_LOCK}
# dump
prefetch --max-size u --output-directory ./ --log-level debug --progress {wildcards.run} \
>> {log} 2>&1 && break
# retry
echo "DEBUG: prefetch try ${{i}} of 3 failed" >> {log}
sleep 10
done
# TODO: section can be removed if we dont see the debug message
# bug report: https://github.com/ncbi/sra-tools/issues/533
if [[ -f "{params.outdir}/{wildcards.run}.sra" ]]; then
echo "DEBUG: moving output to correct directory" >> {log}
mkdir -p {params.outdir}/{wildcards.run}
mv {params.outdir}/{wildcards.run}.sra {output}
fi
# TODO: section can be removed if we dont see the debug message
# If an sralite file was downloaded instead of a sra file, just rename it
if [[ -f "{params.outdir}/{wildcards.run}.sralite" ]]; then
echo "DEBUG: renaming SRAlite" >> {log}
mkdir -p {params.outdir}/{wildcards.run}
mv {params.outdir}/{wildcards.run}.sralite {output}
fi
runs2sample
Concatenate a single run or multiple runs into one fastq per sample.
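Since gzip streams can be concatenated as-is, this amounts to (run names illustrative):
cat run1.fastq.gz run2.fastq.gz > sample.fastq.gz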
salmon_index
Generate a transcriptome index for Salmon.
mkdir -p {output}
salmon index -t {input[0]} {params.decoys} -i {output} {params.params} \
--threads {threads} > {log} 2>&1
salmon_quant
Align reads against a transcriptome (index) with Salmon (mapping-based mode) and output a quantification file per sample.
salmon quant -i {input.index} -l A {params.input} {params.params} -o $(dirname {output}) \
--threads {threads} > {log} 2>&1
sam2bam
Convert a file in SAM format into BAM format (binary).
samtools view -b {input} > {output}
sambamba_sort
Sort the result of alignment or sieving with the sambamba sorter.
sambamba view --nthreads {threads} -f bam {input} -o /dev/stdout 2> {log} |
sambamba sort --nthreads {threads} {params} /dev/stdin -o {output} 2> {log}
samtools_index
Create an index of a bam/cram file which can be used for e.g. visualization.
samtools index {params} {input} {output}
samtools_presort
(Pre)sort the result of alignment with the samtools sorter.
# we set this trap to remove temp files when prematurely ending the rule
trap "rm -f {resources.tmpdir}/{wildcards.assembly}-{wildcards.sample}.tmp*bam" INT;
rm -f {resources.tmpdir}/{wildcards.assembly}-{wildcards.sample}.tmp*bam 2> {log}
# RAM per thread in MB
memory=$((1024*{resources.mem_gb}/{threads}))M
samtools sort -@ {threads} -m $memory {input} -o {output} \
-T {resources.tmpdir}/{wildcards.assembly}-{wildcards.sample}.tmp 2> {log}
samtools_sort
Sort the result of alignment or sieving with the samtools sorter.
# we set this trap to remove temp files when prematurely ending the rule
trap "rm -f {resources.tmpdir}/{wildcards.assembly}-{wildcards.sample}.tmp*bam" INT;
rm -f {resources.tmpdir}/{wildcards.assembly}-{wildcards.sample}.tmp*bam 2> {log}
# RAM per thread in MB
memory=$((1024*{resources.mem_gb}/{threads}))M
samtools sort {params.sort_order} -@ {threads} -m $memory {input} -o {output} \
-T {resources.tmpdir}/{wildcards.assembly}-{wildcards.sample}.tmp 2> {log}
samtools_sort_allsizes
Sort the allsizes output of sieve_bam. This rule is identical to the samtools_sort rule except that the input must contain _allsizes, and the output is temporary.
# we set this trap to remove temp files when prematurely ending the rule
trap "rm -f {resources.tmpdir}/{wildcards.assembly}-{wildcards.sample}.tmp*bam" INT;
rm -f {resources.tmpdir}/{wildcards.assembly}-{wildcards.sample}.tmp*bam 2> {log}
# RAM per thread in MB
memory=$((1024*{resources.mem_gb}/{threads}))M
samtools sort {params.sort_order} -@ {threads} -m $memory {input} -o {output} \
-T {resources.tmpdir}/{wildcards.assembly}-{wildcards.sample}.tmp 2> {log}
samtools_stats
Get general stats from bam files like percentage mapped.
samtools stats {input} 1> {output} 2> {log}
sctk_qc
Perform scRNA QC with singleCellTK, store output in SingleCellExperiment object and export to RData format.
setup_blacklist
Combine the encode blacklist with mitochondrial DNA depending on config.
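A sketch of the idea with hypothetical inputs: start from the encode blacklist and, when configured, append the mitochondrial chromosome as a fully blacklisted region.
cp blacklist.bed combined_blacklist.bed
awk -v OFS='\t' '$1 == "chrM" || $1 == "MT" {print $1, 0, $2}' \
    genome.fa.sizes >> combined_blacklist.bed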
sieve_bam
“Sieve” a bam.
This rule does (at least) one of the following:
- filter on minimum mapping quality
- adjust reads for the tn5 shift
- remove multimappers
- remove reads inside the blacklist
- remove duplicates
- filter paired-end reads on transcript length
- subsample to a maximum number of reads (equal across samples)
# note: the shift/sieve/subsample params expand to piped commands (each ending
# in a pipe) or to nothing, so the final samtools view reads from this pipeline
samtools view -h {params.prim_align} {params.minqual} -L {input.blacklist} \
{input.bam} | {params.atacshift} {params.sizesieve} {params.subsample}
samtools view -b > {output.bam} 2> {log}
# single-end reads never output allsizes so just touch the file when filtering on size
{params.sizesieve_touch}
softmask_track_1
Generate a track of all softmasked regions
source: https://github.com/Gaius-Augustus/MakeHub/blob/master/make_hub.py
softmask_track_2
Generate a track of all softmasked regions
source: https://github.com/Gaius-Augustus/MakeHub/blob/master/make_hub.py
bedSort {input.mask_unsorted} {output.maskbed} >> {log} 2>&1
bedToBigBed -type=bed3 {output.maskbed} {input.sizes} {output.mask} >> {log} 2>&1
sra2fastq_PE
Downloaded (raw) SRAs are converted to paired-end fastq files. Forward and reverse samples will be switched if forward/reverse names are not lexicographically ordered.
# move to output dir since somehow parallel-fastq-dump sometimes puts files in the cwd...
mkdir -p {output.tmpdir}; cd {output.tmpdir}
# acquire the lock
(
flock -w 30 200 || exit 1
sleep 3
) 200>{PYSRADB_CACHE_LOCK}
# dump to tmp dir
parallel-fastq-dump -s {input} -O {output.tmpdir} \
--threads {threads} --split-3 --skip-technical --dumpbase \
--readids --clip --read-filter pass --defline-seq '@$ac.$si.$sg/$ri' \
--defline-qual '+' --gzip >> {log} 2>&1
# check if the files exist
if ! compgen -G "{output.tmpdir}/*_1*" > /dev/null ; then printf "ERROR: Couldn't find read 1.fastq after dumping! Perhaps this is not a paired-end file?\n" >> {log} 2>&1; fi
if ! compgen -G "{output.tmpdir}/*_2*" > /dev/null ; then printf "ERROR: Couldn't find read 2.fastq after dumping! Perhaps this is not a paired-end file?\n" >> {log} 2>&1; fi
# rename file and move to output dir
mv {output.tmpdir}/*_1* {output.fastq[0]} >> {log} 2>&1
mv {output.tmpdir}/*_2* {output.fastq[1]} >> {log} 2>&1
sra2fastq_SE
Downloaded (raw) SRAs are converted to single-end fastq files.
# move to output dir since somehow parallel-fastq-dump sometimes puts files in the cwd...
mkdir -p {output.tmpdir}; cd {output.tmpdir}
# acquire the lock
(
flock -w 30 200 || exit 1
sleep 3
) 200>{PYSRADB_CACHE_LOCK}
# dump to tmp dir
parallel-fastq-dump -s {input} -O {output.tmpdir} \
--threads {threads} --split-spot --skip-technical --dumpbase --readids \
--clip --read-filter pass --defline-seq '@$ac.$si.$sg/$ri' \
--defline-qual '+' --gzip >> {log} 2>&1
# rename file and move to output dir
mv {output.tmpdir}/*.fastq.gz {output.fastq}
star_align
Align reads against a genome (index) with STAR, and pipe the output to the required sorter(s).
trap "find {log} -type f ! -name Log* -exec rm {{}} \;" EXIT
mkdir -p {log}
mkdir -p {output.dir}
STAR --genomeDir {input.index} --readFilesIn {params.input} --readFilesCommand gunzip -c \
--outSAMtype BAM Unsorted --outStd BAM_Unsorted \
--outFileNamePrefix {log}/ --outTmpDir {output.dir}/STARtmp \
--runThreadN {threads} {params.params} > {output.pipe} 2> {log}/Log.stderr.out
# move all non-log files to output directory (this way the log files are kept on error)
find {log} -type f ! -name Log* -exec mv {{}} {output.dir} \;
star_index
Make a genome index for STAR.
Troubleshooting:
- sufficient RAM & disk space?
- increase the RAM available (--limitGenomeGenerateRAM)
- reduce the number of threads (seq2science -j 5)
- reduce accuracy (--genomeSAsparseD 2)
In your config.yaml, add:
aligner:
  star:
    index: --limitGenomeGenerateRAM 60000000000 --genomeSAsparseD 1
function log2 {{
local x=0
for (( y=$1-1 ; $y > 0; y >>= 1 )) ; do
let x=$x+1
done
echo $x
}}
# set genome dependent variables
NBits=""
NBases=""
GenomeLength=$(awk -F"\t" '{{x+=$2}}END{{printf "%i", x}}' {input.sizes})
NumberOfReferences=$(awk 'END{{print NR}}' {input.sizes})
if [ $NumberOfReferences -gt 5000 ]; then
# for large genomes, --genomeChrBinNbits should be scaled to min(18,log2[max(GenomeLength/NumberOfReferences,ReadLength)])
# ReadLength is skipped here, as it is unknown
LpR=$(log2 $((GenomeLength / NumberOfReferences)))
NBits="--genomeChrBinNbits $(($LpR<18 ? $LpR : 18))"
printf "NBits: $NBits\n\n" >> {log} 2>&1
fi
if [ $GenomeLength -lt 268435456 ]; then
# for small genomes, --genomeSAindexNbases must be scaled down to min(14, log2(GenomeLength)/2-1)
logG=$(( $(log2 $GenomeLength) / 2 - 1 ))
NBases="--genomeSAindexNbases $(( $logG<14 ? $logG : 14 ))"
printf "NBases: $NBases\n\n" >> {log} 2>&1
fi
mkdir -p {output}
STAR --runMode genomeGenerate --genomeFastaFiles {input.genome} --sjdbGTFfile {input.gtf} \
--genomeDir {output} --outFileNamePrefix {output}/ \
--runThreadN {threads} $NBits $NBases {params} >> {log} 2>&1
tpm_matrix
Create a TPM normalized table from a counts table.
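A sketch of the TPM math on a hypothetical single-sample table (gene, length, count): divide each count by its gene length, then scale so the values sum to one million (the per-kilobase factor cancels in the ratio).
awk -v OFS='\t' 'NR == 1 {next}
    {id[NR] = $1; rpk[NR] = $3 / $2; sum += rpk[NR]}
    END {for (i = 2; i <= NR; i++) print id[i], rpk[i] / sum * 1e6}' \
    counts_with_lengths.tsv > tpm.tsv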
trackhub
Generate a UCSC track hub/assembly hub.
To view the hub, the output directory must be hosted at a web-accessible location, and the location of the hub.txt file given to the UCSC genome browser at My Data > Track Hubs > My Hubs
trackhub_index
Generate a searchable annotation & index for each assembly
sources: https://genome.ucsc.edu/goldenPath/help/hubQuickStartSearch.html, https://www.biostars.org/p/272649/
# filter out tiny tiny transcripts that make gtftogenepred fail
min_transcript_length=15
(awk -v minlen=$min_transcript_length 'BEGIN {{ FS = "\\t"; OFS="\\t" }}; {{ if(($3=="transcript") && ($5 - $4 <= minlen)) {{ print $9 }} }}' {input.gtf} | grep . || echo "transcript_id \\\"DONTMATCHANYTHING\\\"") | grep -oP '(?<=transcript_id ").+?(?=")' | grep -v -f - {input.gtf} > {output.gtf_min_size} 2>> {log}
# generate annotation files
gtfToGenePred -allErrors -ignoreGroupsWithoutExons -geneNameAsName2 -genePredExt {output.gtf_min_size} {output.genePred} -infoOut={output.info} >> {log} 2>&1
# if gtf has gene_names, use gene_names, else transcript_id
if grep -Fq "gene_name" {output.gtf_min_size}; then n=1; else n=0; fi
# switch columns 1 (transcript_id) and 12 (gene_name)
awk -v n=$n 'BEGIN {{ FS = "\\t"; OFS="\\t" }}; {{ if(n==1) {{ t = $1; $1 = $12; $12 = t; print; }} else {{ print; }} }}' {output.genePred} > {output.genePrednamed}
# remove lines with an empty first column (e.g. no gene names for ERCC RNA spike-in)
grep -vP "^\t" {output.genePrednamed} > {output.genePred}
# remove spaces from the genePred (error in annotation)
sed -i 's/ //g' {output.genePred}
genePredToBed {output.genePred} {output.genePredbed} >> {log} 2>&1
bedSort {output.genePredbed} {output.genePredbed} >> {log} 2>&1
bedToBigBed -extraIndex=name {output.genePredbed} {input.sizes} {output.genePredbigbed} >> {log} 2>&1
# generate searchable indexes (by 1: transcriptID, 2: geneId, 8: proteinID, 9: geneName, 10: transcriptName)
grep -v "^#" {output.info} | awk -v n=$n 'BEGIN {{ FS = "\\t"; OFS="\\t" }} ; {{ if(n==1) {{print $9, $1, $2, $8, $9, $10}} else {{print $1, $1, $2, $8, $9, $10}} }}' > {output.indexinfo}
ixIxx {output.indexinfo} {output.ix} {output.ixx} >> {log} 2>&1
trimgalore_PE
Automated adapter detection, adapter trimming, and quality trimming through trim galore (paired-end).
trim_galore --paired -j {threads} {params.config} -o $(dirname {output.r1}) {input.r1} {input.r2} > {log} 2>&1
# now rename to proper output
for f in $(find "$(dirname {output.r1})/" -name "{wildcards.sample}_*val_*.fq.gz"); do
mv "$f" "$(echo "$f" | sed s/.fq/.{params.fqsuffix}/)"; done
for f in $(find "$(dirname {output.r1})/" -maxdepth 1 -name "{wildcards.sample}_*val_*.{params.fqsuffix}.gz"); do
mv "$f" "$(echo "$f" | sed s/_val_./_trimmed/)"; done
# move the trimming reports to qc directory
for f in $(find "$(dirname {output.r1})/" -name "{wildcards.sample}_*.{params.fqsuffix}.gz_trimming_report.txt"); do
mv "$f" "$(dirname {output.qc[0]})/$(basename $f)"; done
trimgalore_SE
Automated adapter detection, adapter trimming, and quality trimming through trim galore (single-end).
trim_galore -j {threads} {params.config} -o $(dirname {output.se}) {input} > {log} 2>&1
# now rename to proper output
if [ "{params.fqsuffix}" != "fq" ]; then
mv "$(dirname {output.se})/{wildcards.sample}_trimmed.fq.gz" {output.se}
fi
# move the trimming report to qc directory
report=$(dirname {output.se})/{wildcards.sample}.{params.fqsuffix}.gz_trimming_report.txt
mv $report {output.qc}
twobit
Generate a 2bit file for each assembly
faToTwoBit {input} {output} >> {log} 2>&1
txi_count_matrix
Convert transcript abundance estimations to gene count estimations and merge gene counts per assembly.
Also outputs a single cell experiment object similar to ARMOR (https://github.com/csoneson/ARMOR).
Only works with Ensembl assemblies.
upset_plot_peaks
Make an upset plot showing the distribution of peaks amongst the biological replicates.