genomepy: Genes and Genomes at your fingertips

This manuscript (permalink) was automatically generated from vanheeringen-lab/genomepy_manuscript@0c44a46 on July 28, 2022.

Authors

Abstract

Analyzing a functional genomics experiment, such as ATAC-, ChIP- or RNA-sequencing, requires reference data including a genome assembly and gene annotation. These resources can generally be retrieved from different organizations and in different versions. Most bioinformatic workflows require the user to supply this genomic data manually, which can be a tedious and error-prone process.

Here we present genomepy, which can search, download, and preprocess the right genomic data for your analysis. Genomepy can search genomic data on NCBI, Ensembl, UCSC and GENCODE, and compare available gene annotations to enable an informed decision. The selected genome and gene annotation can be downloaded and preprocessed with sensible, yet controllable, defaults. Additional supporting data can be automatically generated or downloaded, such as aligner indexes, genome metadata and blacklists. These functionalities are available from a command line interface, aimed at ease of use and integration in automated pipelines. Extended functionality is accessible using a Python application programming interface.

Introduction

Data analysis is increasingly important in biological research. Whether you are analyzing gene expression in two samples or transcription factor binding motifs in genomic atlases, you will need external information such as a reference genome or a gene annotation. For these types of data, there are three major providers: Ensembl1, UCSC2 and NCBI3, and many model-system specific providers, such as GENCODE4, ZFIN5, FlyBase6, WormBase7, Xenbase8 and more. Providers have different approaches to compiling genome assemblies and gene annotations, which affect formats, format compliance, naming, data quality, available versions and release cycle. These differences significantly impact compatibility with research9, tools and (data based on) other genomic data.

You could try to find genomic data yourself, but there are many options. Ensembl, UCSC and NCBI each have FTP archives, web portals, and REST APIs, which you can use to search their individual databases. Alternatively, there are several tools that can be used to access some of these databases programmatically, such as ncbi-genome-download10 and ucsc-genomes-downloader11. However, none of these can search, compare or download from all major genome providers data. Furthermore, downloading and processing genomic data manually can be tedious, error-prone, and poorly reproducible. Although the latter could be remedied by a data management tool, such as iGenomes12, refGenie13 or Go Get Data14, data managers still require the user to add new data manually.

We have developed genomepy to 1) find genomic data on major providers, 2) compare gene annotations, 3) select the genomic data best suited to your analysis and 4) provide a suite of functions to peruse and manipulate the data. Selected data can be downloaded from anywhere, and is processed automatically. To ensure reproducibility, data sources and processing steps are documented, and can be enhanced further by using a data manager. Genomic data can be loaded into genomepy, which utilizes and extends on packages including pyfaidx15, pandas16 and MyGene.info17 to rapidly work with gene and genome sequences and metadata. Similarly, genomepy has been incorporated into other packages, such as pybedtools18 and CellOracle19. Genomepy can be used on command line and though the (fully documented) Python API, for a one-time analysis or integration in pipelines and workflow managers.

Features of genomepy

The key features of genomepy are 1) providing an overview of available assemblies with the search function, 2) download and processing of a selected assembly, with the install function and 3) using assembly data through the Python API.

The search function queries the databases of NCBI, Ensembl, UCSC and GENCODE (caching the metadata), for text, taxonomy identifiers or assembly accession identifiers. The input type is automatically recognized and used to find assemblies that have the text in the genome names or various description fields, matches the taxonomy identifier or (partially) matches the assembly accession. The output of the search function is a table with rows of metadata for each assembly found. The metadata contains the assembly name and accession, taxonomy identifier, and indicates whether a gene annotation can be downloaded (or which of the four UCSC annotations) (see fig. 1a). Snippets of available gene annotation(s) can be inspected with the annotation function (fig. 1b).

$ genomepy search GRCh38
name          provider  accession         tax_id  annotation  species                     other_info
                                                   n r e k    <- UCSC options (see help)
GRCh38        GENCODE   GCA_000001405.15    9606      ✓       Homo sapiens                GENCODE annotation + UCSC genome
GRCh38.p13    Ensembl   GCA_000001405.28    9606      ✓       Homo sapiens                2014-01-Ensembl/2021-08
hg38          UCSC      GCA_000001405.15    9606   ✓ ✓ ✗ ✓   Homo sapiens                Dec. 2013 (GRCh38/hg38)
GRCh38        NCBI      GCF_000001405.26    9606      ✓       Homo sapiens                Genome Reference Consortium
 ^
 Use name for genomepy install
$ genomepy annotation GRCh38.p13
12:00:00 | INFO | Ensembl
1       ensembl_havana  gene    1211340 1214153 .       -       .       gene_id "ENSG00000186827"; gene_version "11"; gene_name "TNFRSF4"; gene_source "ensembl_havana"; gene_biotype "protein_coding";
12:00:00 | INFO | NCBI
NC_000001.11    genomepy        transcript      11874   14409   .       +       .       gene_id "DDX11L1"; transcript_id "NR_046018.2";  gene_name "DDX11L1";
$ genomepy install --annotation GRCh38.p13
$ ls -1 ~/.local/share/genomes/GRCh38.p13
GRCh38.p13.fa
GRCh38.p13.fa.fai
GRCh38.p13.fa.sizes
GRCh38.p13.gaps.bed
GRCh38.p13.annotation.bed
GRCh38.p13.annotation.gtf
assembly_report.txt
README.txt
index/

An assembly name can be passed to the install function (fig. 1c). The genome FASTA file is downloaded with the desired sequence masking level20,21 and alternate sequences (softmasked and none by default, respectively). Alternate sequences reflect biological diversity and are often contained in reference assemblies. During sequence alignment however, similar reference sequences result in multiple alignment, leading to loss of data (as discussed in22). Additional filters may be passed to either include or exclude contigs (chromosomes, scaffolds, etc.) by name or regex pattern. Once processed, a genome index is generated using pyfaidx15, as well as contig sizes and contig gap sizes.

Gene annotations come in a variety of recognized formats (GFF3, GTF, BED12). The install function will download the most descriptive format available, to output the commonly used GTF and BED12 formats. Contig names of the genome and gene annotation sometimes mismatch, which makes them incompatible with tools such as splice-aware aligners. Therefore, genomepy will attempt to match the contig names of the gene annotations to those used in the genome FASTA.

The install function can be extended with postprocessing steps via plugins. The options can be inspected and toggled with the plugin function. Briefly, the blacklist plugin downloads blacklists by the Kundaje lab23 for the supported genomes. Other plugins support the generation of aligner indexes, including DNA aligner indexes for Bowtie224, BWA25, GMAP26 or Minimap227, and splice-aware aligners such as STAR28 and HISAT229.

Assemblies not present on the major providers can be processed similarly by supplying the URLs or file paths to the install function. For data provenance and reproducibility, a README file is generated during the installation process with time, source files, processing steps, and filtered contigs.

These features are available on both the command line interface and Python API. Additional features are available on the Python API, focussed around two classes. The Genome class can be used to extract exact or random sequences from the FASTA, filter the FASTA and list the contigs, contig sizes and contig gaps. The Annotation class can be used to browse and filter the BED12 or GTF files as pandas dataframes16, map gene identifiers to other types using mygene.info17, map chromosome names to naming schemes of other major providers, and create a dictionary of any two GTF columns or attribute fields (to easily convert gene identifiers to gene names for instance).

Conclusion

Obtaining suitable genomic data is a principal step in any genomics project. With genomepy, finding available assemblies becomes trivial. A genome, with the desired sequence masking, level of biological diversity, and contigs can be obtained with a single command. Gene annotations in GTF and BED12 format, and matching the genome, can similarly be obtained, with further options available in the Python API. Whatever install options you choose are logged, for reproducibility, allowing you to start your analysis with confidence.

Code availability

Genomepy is freely available under the MIT license and can be installed using Bioconda30, Pip31, or directly used in workflows with our Docker32 image or snakemake33 wrapper. Code and documentation are available on github and github-pages, respectively.

Acknowledgements

The authors would like to thank Vicky Luna Velez for designing the genomepy logo, Dohoon Lee and Jie Zhu for contributing to the genomepy code, as well as the many github users who provided feedback, ideas and suggestions.

References

1.
Yates, A. D. et al. Ensembl 2020. Nucleic Acids Research gkz966 (2019) doi:10.1093/nar/gkz966.
2.
Kent, W. J. et al. The Human Genome Browser at UCSC. Genome Res. 12, 996–1006 (2002).
3.
Karolchik, D. The UCSC Table Browser data retrieval tool. Nucleic Acids Research 32, 493D–496 (2004).
4.
Frankish, A. et al. GENCODE 2021. Nucleic Acids Research 49, D916–D923 (2020).
5.
6.
Thurmond, J. et al. FlyBase 2.0: the next generation. Nucleic Acids Research 47, D759–D765 (2018).
7.
Harris, T. W. et al. WormBase: a modern Model Organism Information Resource. Nucleic Acids Research (2019) doi:10.1093/nar/gkz920.
8.
Karimi, K. et al. Xenbase: a genomic, epigenomic and transcriptomic model organism database. Nucleic Acids Research 46, D861–D868 (2018).
9.
10.
GitHub - kblin/ncbi-genome-download: Scripts to download genomes from the NCBI FTP servers. GitHub https://github.com/kblin/ncbi-genome-download.
11.
GitHub - LucaCappelletti94/ucsc_genomes_downloader: Python package to quickly download genomes from the UCSC. GitHub https://github.com/LucaCappelletti94/ucsc_genomes_downloader.
12.
13.
Stolarczyk, M., Reuter, V. P., Smith, J. P., Magee, N. E. & Sheffield, N. C. Refgenie: a reference genome resource manager. GigaScience 9, (2020).
14.
15.
Shirley, M. D., Ma, Z., Pedersen, B. S. & Wheelan, S. J. Efficient "pythonic" access to FASTA files using pyfaidx. (2015) doi:10.7287/peerj.preprints.970v1.
16.
Reback, J. et al. pandas-dev/pandas: Pandas 1.4.3. (Zenodo, 2022). doi:10.5281/zenodo.3509134.
17.
18.
Dale, R. K., Pedersen, B. S. & Quinlan, A. R. Pybedtools: a flexible Python library for manipulating genomic datasets and annotations. Bioinformatics 27, 3423–3424 (2011).
19.
Kamimoto, K., Hoffmann, C. M. & Morris, S. A. CellOracle: Dissecting cell identity via network inference and in silico gene perturbation. (2020) doi:10.1101/2020.02.17.947416.
20.
21.
Morgulis, A., Gertz, E. M., Schäffer, A. A. & Agarwala, R. A Fast and Symmetric DUST Implementation to Mask Low-Complexity DNA Sequences. Journal of Computational Biology 13, 1028–1040 (2006).
22.
Church, D. M. et al. Extending reference assembly models. Genome Biol 16, (2015).
23.
Amemiya, H. M., Kundaje, A. & Boyle, A. P. The ENCODE Blacklist: Identification of Problematic Regions of the Genome. Sci Rep 9, (2019).
24.
Langmead, B. & Salzberg, S. L. Fast gapped-read alignment with Bowtie 2. Nat Methods 9, 357–359 (2012).
25.
Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25, 1754–1760 (2009).
26.
Wu, T. D. & Watanabe, C. K. GMAP: a genomic mapping and alignment program for mRNA and EST sequences. Bioinformatics 21, 1859–1875 (2005).
27.
Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).
28.
Dobin, A. et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29, 15–21 (2012).
29.
Kim, D., Langmead, B. & Salzberg, S. L. HISAT: a fast spliced aligner with low memory requirements. Nat Methods 12, 357–360 (2015).
30.
31.
PyPI · The Python Package Index. PyPI https://pypi.org/.
32.
33.
Köster, J. & Rahmann, S. Snakemake—a scalable bioinformatics workflow engine. Bioinformatics 34, 3600–3600 (2018).