scRNA-seq
Preprocessing of scRNA-seq has never been easier!
Workflow overview (simplified)
Downloading of sample(s)
Depending on whether the samples you start seq2science with is your own data, public data, or a mix, the pipeline might start with downloading samples. You control which samples are used in the samples.tsv. Background on public data can be found here.
Downloading and indexing of assembly(s)
Depending on whether the assembly and its index you align your samples against already exist seq2science will start with downloading of the assembly through genomepy.
Read trimming
The pipeline starts by trimming the reads with fastp. Fastp automatically detects sequencing adapters, removes short reads and performs quality filtering. The parameters of fastp for the pipeline can be set in the configuration by the variable fastp.
Note: fastp currently runs in single-end mode even if you provide paired-end fastq files.
This is due to the fact that reads, barcode and umi sequences are stored in separate fastq files.
After trimming, the fastq containing the reads, usually R2, may contains less reads then R1.
Therefore, we perform an intermediate step by running fastq-pair
to remove singleton reads before proceeding any further.
Quantification
Quantification is performed by running either the kb-python wrapper for Kallisto bustools or CITE-seq-Count for ADT tags. Kallisto bustools relies on pseudo-alignment of scRNA reads against a reference transcriptome index. The resulting count matrices can be further processed with scRNA toolkits, such as Seurat or Scanpy.
Filling out the samples.tsv
Before running a workflow you will have to specify which samples you want to run the workflow on.
Each workflow starts with a samples.tsv
as an example, and you should adapt it to your specific needs.
As an example, the samples.tsv
could look something like this:
sample assembly descriptive_name
pbmc GRCh38.p13 pbmc
Sample column
If you use the pipeline on public data this should be the name of the accession (e.g. GSM2837484). Accepted formats start with “GSM”, “SRR”, “SRX”, “DRR”, “DRX”, “ERR” or “ERX”.
If you use the pipeline on local data this should be the basename of the file without the extension(s). For example:
/home/user/myfastqs/sample1.fastq.gz
——->sample1
for single-ended data/home/user/myfastqs/sample2_R1.fastq.gz
┬>sample2
for paired-ended data
/home/user/myfastqs/sample2_R2.fastq.gz
┘
For local data, some fastq files may have slightly different naming formats.
For instance, Illumina may produce a sample named sample3_S1_L001_R1_001.fastq.gz
(and the R2
fastq).
Seq2science will attempt to recognize these files based on the sample name sample3
.
For both local and public data, identifiers used to recognize fastq files are the fastq read extensions (R1
and R2
by default) and the fastq suffix (fastq
by default).
The directory where seq2science will store (or look for) fastqs is determined by the fastq_dir
config option.
In the example above, the fastq_dir
should be set to /home/user/myfastqs
.
These setting can be changed in the config.yaml
.
Assembly column
Here you simply add the name of the assembly you want your samples aligned against and the workflow will download it for you. In case your intend to run kb-python’s kite workflow, the assembly name becomes the basename of the tab delimited feature barcode file.
Descriptive_name column
The descriptive_name column is used for the multiqc report. In the multiqc report there will be a button to rename your samples after this column.
Filling out the config.yaml
Every workflow has many configurable options, and can be set in the config.yaml file. In each config.yaml we highlighted a couple options that we think are relevant for that specific workflow, and set (we think) reasonable default values.
Custom assembly extensions
The genome and/or gene annotation can be extended with custom files, such as ERCC spike-ins for scRNA-seq.
To do so, add custom_genome_extension: path/to/spike_in.fa
and custom_annotation_extension: path/to/spike_in.gtf
to the config.
Seq2science will place the customized assembly in a separate folder in the genome_dir
.
You can control the name of the customized assembly by setting custom_assembly_suffix
in the config.
Quantification with kallistobus
After initializing your working directory and editing the samples.tsv
file, specify the desired arguments for kb-pyhon via the ref (kb ref) and count (kb count) properties except for the barcode whitelist (-w
). The path to the barcode whiltelist can be supplied via the barcodefile
property. This step is optional since kb python python provides several pre-installed whitelists for the following technologies.
10XV1
10XV2
10XV3
INDROPSV3
The white-list will be installed automatically if the appropiate technology argument is provided via the -x
parameter in short-hand syntax.
BUS (Barcode/UMI/Set) format
The -x
argument indicates the read and file positions of the UMI and barcode. Kallisto bustools should auto-detect the correct settings barcode/umi layout for the following technologies if the name is supplied:
name whitelist barcode umi cDNA
--------- --------- --------------------- ------- -----------------------
10XV1 yes 0,0,14 1,0,10 2,None,None
10XV2 yes 0,0,16 0,16,26 1,None,None
10XV3 yes 0,0,16 0,16,28 1,None,None
CELSEQ 0,0,8 0,8,12 1,None,None
CELSEQ2 0,6,12 0,0,6 1,None,None
DROPSEQ 0,0,12 0,12,20 1,None,None
INDROPSV1 0,0,11 0,30,38 0,42,48 1,None,None
INDROPSV2 1,0,11 1,30,38 1,42,48 0,None,None
INDROPSV3 yes 0,0,8 1,0,8 1,8,14 2,None,None
SCRUBSEQ 0,0,6 0,6,16 1,None,None
SURECELL 0,0,6 0,21,27 0,42,48 0,51,59 1,None,None
SMARTSEQ 0,None,None 1,None,None
`
Alternatively, the layout can be specified as a bc:umi:set
triplet. The first position indicates the read, the second position the start of the feature and the third position the end of the feature. For more information and examples on the BUS format, consider the Bus section in the Kallisto manual.
Input preparations for KITE workflow
The steps to prepare a scRNA analysis for Feature Barcoding experiments deviates slighlty from the standard seq2science workflow. In essence, we quantify the abundance of sequence features, such as antibody barcodes, rather than a set of transcripts for a certain species. Therefore, our index does not rely on a particular assembly but is build from these sequence features. Please consider the offical kite documentation for more details.
1. Prepare a two-column, tab-delimited file with your feature barcode in the first column and feature names in the second.
Example
sequence |
name |
---|---|
AACAAGACCCTTGAG |
barcode 1 |
TACCCGTAATAGCGT |
barcode 2 |
We save this file as fb.tsv.
2. Copy this file to the genome folder specified in config.yaml
where seq2science searches for assemblies.
3. Add the basename of the feature barcode tabel, in this case fb, to the assembly column in your samples.tsv file.
sample assembly
pbmc fb
An example of configuring kb-python for feature barcode analysis is shown below. Add the appropiate settings to your config.
Examples
RNA Quantification (10XV3)
quantifier:
kallistobus:
count: '-x 10xv3 --h5ad --verbose'
RNA velocity (CEL-Seq2)
quantifier:
kallistobus:
ref: '--workflow lamanno'
count: '-x 1,8,16:1,0,8:0,0,0 --h5ad --verbose --workflow lamanno'
barcodefile: "1col_barcode_384.tab"
The RNA velocity workflow produces count matrices for unspliced/spliced mRNA counts.
KITE feature barcoding (CEL-Seq2)
quantifier:
kallistobus:
ref: '--workflow kite'
count: '-x 1,8,16:1,0,8:0,0,0 --h5ad --verbose --workflow kite'
barcodefile: "1col_barcode_384.tab"
Quantification with CITE-seq-Count
CITE-seq-Count count can be used as an alternative quantifier to pre-process ADT/Cell-hashing experiments and generate read/umi count matrices. This option cannot be used in conjunction with kallistobus.
To enable quantification with CITE-Seq-count, add the following section to your config file
Example
quantifier:
citeseqcount:
count: '-cbf 9 -cbl 16 -umif 1 -umil 8 -cells 372 --max-error 1 --bc_collapsing_dist 1 --umi_collapsing_dist 1 -T 10 --debug'
barcodefile: "barcodes.tab"
scRNA-seq pre-processing and quality control
The seq2science scRNA workflow provides the option to perform automated pre-processing of raw scRNA-seq UMI count matrices. This is achieved by incorporating several quality control steps from the singleCellTK Bioconductor package, such as cell-calling, doublet detection and assessment of cell-wise mitochondrial RNA content.
The QC results are reported in comprehensive R Markdown reports and processed UMI count matrices are stored as SingleCellExperiment S4 objects. Any sample level metadata that has been added to the samples.tsv
file will be transferred to the colData
slot of the corresponding object and assigned to each cell identifier.
After running seq2science, these objects can be directly imported into R with the readRDS
function for further down-stream analysis with your favorite R packages.
To perform scRNA-seq pre-processing, add the following section to your seq2science config.yaml
. In this example, we pre-process a plate-based experiment from human tissue.
sc_preprocess:
export_sce_objects: False
run_sctk_qc: True
sctk_data_type: cell
use_alt_expr: False
alt_exp_name: ""
alt_exp_reg: ""
sctk_mito_set: human-ensembl
sctk_detect_mito: True
sctk_detect_cell: False
sctk_cell_calling: Knee
To enable the sctk_qc workflow, set run_sctk_qc=True
. Alternatively, one can skip quality control and export the UMI count matrix to a SingleCellExperiment object by setting export_sce_objects=True
. However, there is no need to set export_sce_objects
manually when run_sctk_qc
is enabled.
Next, select the type of UMI count matrix with the sctk_data_type
parameter. Valid options are either cell
or droplet
, depending on the type of experiment.
droplet
The UMI count matrix contains empty droplets. These empty droplets will be removed (cell calling) before further processing.cell
The UMI count matrix does not contain empty droplets but has not been processed yet.
We do not perform any gene/cell level filtering, except for empty droplets that are considered ambient RNA.
Advanced settings
To perform additional (optional) QC steps, consider the following parameters:
sctk_detect_mito
(default:True
)
Quantify the percentage of mitochondrial RNA for each cell.sctk_mito_set
(default:human-ensembl
)
The mitochondrial gene set to use for quantification with syntax[human,mouse]-[ensembl,entrez,symbol]
. At the moment, only human and mouse gene annotations are supported. This option is only considered whensctk_detect_mito=True
.sctk_detect_cell
(default:True
)
Perform cell-calling for droplet based experiments. Empty droplets will not be removed if set toFalse
.sctk_cell_calling
(default:Knee
)
Method used for cell calling with DropletUtils, eitherKnee
orEmptyDrops
. By default, EmptyDrops will use an FDR of 0.01 to identify empty droplets. If no option is provided, the inflection point will be used for cell calling. This option is only considered whensctk_detect_cell=True
.sctk_export_formats: (default:
["Seurat"]
)
List of file formats to export SingleCellExperiment objects. Valid options are Seurat, FlatFile and AnnData. Raw and processed sce objects will be exported to rds format by default.sctk_qc_algos: (default:
["QCMetrics", "scDblFinder", "decontX"]
)
List of QC algorithms for runCellQC’salgorithm
parameter.velo assay: (default:
spliced
)
The assay to use for exporting and QC when kb is run in velocity mode with--workflow lamanno
parameter. Valid options are eitherspliced
orunspliced
.
Alternative experiments
Information about alternative sequencing features, such as ERCC spike-ins, can be provided as an alternative experiment. An alternative experiment will be stored within the same SingleCellExperiment object as the main experiment but processed separately. To process alternative experiments, set the following options:
use_alt_expr
(default:False
)
Set toTrue
if you wish to process alternative experiments.alt_exp_name
(default:""
)
The name/title of the alternative experiment. This option is only considered ifuse_alt_expr=True
.alt_exp_reg
(default:""
)
Regular expression to filter alternative features from main experiment (.i.e,;"ERCC-*"
). This option is only considered whenuse_alt_expr=True
.
A previous addition of alternative features to the genome assembly (see section on custom assembly extensions) is a prerequisite.
Output files
After running the scRNA QC workflow, the output can be found in the following locations:
path/to/results/scrna-preprocess/{quantifier}/export
:
This folder contains the exported raw SingleCellExperiment object, without any QC applied.
path/to/results/scrna-preprocess/{quantifier}/sctk
:
This folder contains the QC reports and exported SingleCellExperiment object, divided into several subfolders and files.
export
Subfolder containing the exported SingleCellExperiment object after QC and summary statistics.SCTK_CellQC.html
Cell-level QC report generate by singleCellTK’s runCellQC.SCTK_DropletQC.html
Droplet-level QC report generate by singleCellTK’s runDropletQC.SCTK_DropletQC_figures.pdf
Barcode rank plot generated by runBarcodeRankDrops.SCTK_altexps.pdf
(optional)
Scatter plot displaying the percentage of alternative features vs. detected features.