Preprocessing of scATAC-seq has never been easier!

Does your scATAC protocol generate single cell fastq files? And do they need to be mapped and analysed? Then this pipeline is for your! This pipeline takes single cell fastq files, performs extensive (plate based) QC steps, and generates a binned SNAP-object that can be used for further downstream analysis following this tutorial.

If you have a Cellranger object instead (output 10x), it is relatively easy to directly generate a SNAPobject, look at the documentation on the SnapATAC github instead.

Alternatively, if you have a large FASTQ file containing a cell specific ID in the header, it might be possible to run this pipeline with some alterations. The ‘merge_replica’ outputs a fastq trimmed fastq file with the cell_id after the @ in the header, e.g. @sampleXcellY:NS500173:518:HTYVTBGXB:4:23609:2756:19401 2:N:0:ACTCGCTA+ATTAGACG.

Workflow overview (simplified)

Downloading of sample(s)

Depending on whether the samples you start seq2science with is your own data, public data, or a mix, the pipeline might start with downloading samples. You control which samples are used in the samples.tsv. Background on public data can be found here.

merge cell fastq files based on processing

Single cell fastq files get a barcode id corresponding to the sample name. This is added to the fastq header, after which each single cell fastq is merged to a large FASTQ file with all the technical replica’s (for example all cells from a plate). This results in a multi-QC report containing QC per technical replica’s. This makes lower plate quality easy to spot. Additional single cell QC will be available in downstream analysis of the SNAP-object. See Replicate handling for more information. See Filling out the samples.tsv and config.yaml for more information.


Make sure to take a look at our alignment workflow.

Generate and bin a SNAP_Object for downstream analysis

To find cellular heterogeneity before peak calling, SNAP-ATAC binns the genome. Using this for demensionality reduction of the data and finding cell clusters before peak calling. By default the pipeline includes bins of 5kB, but this is changable in the config.yaml file. After clustering peaks can be called on the aggregated cluster signals. This makes it easier to find peaks present only in a small cluster of cells. This does however mean that the pipeline doesnt perform peakcalling, this is done later in the snapATAC vignette.

Filling out the samples.tsv

Before running a workflow you will have to specify which samples you want to run the workflow on. Each workflow starts with a samples.tsv as an example, and you should adapt it to your specific needs. One thing you need to check before filling a sample into your sample.tsv file, is that the fastq file is not completely empty! If a fastq file has a size of 0 it will crash the pipeline. You can run bash code to automatically remove completely empty fastq files from a directory, e.g. run:

cd fastq-dir-here
find . -size 0 -delete

After removing all the completely empty fastq files, use the remainder to fill your samples.tsv file with. As an example, the samples.tsv could look something like this:

sample  assembly    technical_replicates
GSM1596256  hg38    H1ESC
GSM1596257  hg38    H1ESC
GSM1596258  hg38    H1ESC
GSM1596259  hg38    H1ESC
GSM1596260  hg38    H1ESC
GSM1596261  hg38    H1ESC
GSM1596262  hg38    H1ESC
GSM1596263  hg38    H1ESC
GSM1596351  hg38    GM
GSM1596352  hg38    GM
GSM1596353  hg38    GM
GSM1596354  hg38    GM
GSM1596355  hg38    GM
GSM1596356  hg38    GM
GSM1596357  hg38    GM
GSM1596358  hg38    GM
GSM1596359  hg38    GM
iPSC-2-KC-diff-plate2-16718m701m511 hg38    plate2
iPSC-2-KC-diff-plate2-plate2-16718m701m513  hg38    plate2
iPSC-2-KC-diff-plate2-plate2-16718m701m515  hg38    plate2
iPSC-2-KC-diff-plate2-plate2-16718m701m516  hg38    plate2
iPSC-2-KC-diff-plate2-plate2-16718m701m517  hg38    plate2

Sample column

If you use the pipeline on public data this should be the name of the accession (e.g. GSM2837484). Accepted formats start with “GSM”, “SRR”, “SRX”, “DRR”, “DRX”, “ERR” or “ERX”.

If you use the pipeline on local data this should be the basename of the file without the extension(s). For example:

  • /home/user/myfastqs/sample1.fastq.gz ——-> sample1 for single-ended data

  • /home/user/myfastqs/sample2_R1.fastq.gz ┬> sample2 for paired-ended data

For local data, some fastq files may have slightly different naming formats. For instance, Illumina may produce a sample named sample3_S1_L001_R1_001.fastq.gz (and the R2 fastq). Seq2science will attempt to recognize these files based on the sample name sample3.

For both local and public data, identifiers used to recognize fastq files are the fastq read extensions (R1 and R2 by default) and the fastq suffix (fastq by default). The directory where seq2science will store (or look for) fastqs is determined by the fastq_dir config option. In the example above, the fastq_dir should be set to /home/user/myfastqs. These setting can be changed in the config.yaml.

Assembly column

Here you simply add the name of the assembly you want your samples aligned against and the workflow will download it for you.

technical_replicates column

Here you put to which technical replica each cell corresponds. E.g. the plate from which the cells were sequenced. Per technical replica QC will be generated. For public data you can put all cells under the same technical replica name.

Filling out the config.yaml

Every workflow has many configurable options, and can be set in the config.yaml file. In each config.yaml we highlighted a couple options that we think are relevant for that specific workflow, and set (we think) reasonable default values.

When a workflow starts it prints the complete configuration, and (almost) all these values can be added in the config.yaml and changed to your liking. You can see the complete set of configurable options in the extensive docs.

Best practices

Take a look at the postprocessing example.