genomepy: Genes and Genomes at your fingertips

Siebren Frölich; Maarten van der Sande; Tilman Schäfers; Simon van Heeringen

Analyzing a functional genomics experiment, such as ATAC-, ChIP- or RNA-sequencing, requires reference data including a genome assembly and gene annotation. These resources can generally be retrieved from different organizations and in different versions. Most bioinformatic workflows require the user to supply this genomic data manually, which can be a tedious and error-prone process.

Here we present genomepy, which can search, download, and preprocess the right genomic data for your analysis. Genomepy can search genomic data on NCBI, Ensembl, UCSC and GENCODE, and compare available gene annotations to enable an informed decision. The selected genome and gene annotation can be downloaded and preprocessed with sensible, yet controllable, defaults. Additional supporting data can be automatically generated or downloaded, such as aligner indexes, genome metadata and blacklists. These functionalities are available from a command line interface, aimed at ease of use and integration in automated pipelines. Extended functionality is accessible using a Python application programming interface.

Introduction

Data analysis is increasingly important in biological research. Whether you are analyzing gene expression in two samples or transcription factor binding motifs in genomic atlases, you will need external information such as a reference genome or a gene annotation. For these types of data, there are three major providers: Ensembl¹, UCSC² and NCBI³, and many model-system specific providers, such as GENCODE⁴, ZFIN⁵, FlyBase⁶, WormBase⁷, Xenbase⁸ and more. Providers have different approaches to compiling genome assemblies and gene annotations, which affect formats, format compliance, naming, data quality, available versions and release cycle. These differences significantly impact compatibility with research⁹, tools and (data based on) other genomic data.

You could try to find genomic data yourself, but there are many options. Ensembl, UCSC and NCBI each have FTP archives, web portals, and REST APIs, which you can use to search their individual databases. Alternatively, there are several tools that can be used to access some of these databases programmatically, such as ncbi-genome-download¹⁰ and ucsc-genomes-downloader¹¹. However, none of these can search, compare or download from all major genome providers data. Furthermore, downloading and processing genomic data manually can be tedious, error-prone, and poorly reproducible. Although the latter could be remedied by a data management tool, such as iGenomes¹², refGenie¹³ or Go Get Data¹⁴, data managers still require the user to add new data manually.

We have developed genomepy to 1) find genomic data on major providers, 2) compare gene annotations, 3) select the genomic data best suited to your analysis and 4) provide a suite of functions to peruse and manipulate the data. Selected data can be downloaded from anywhere, and is processed automatically. To ensure reproducibility, data sources and processing steps are documented, and can be enhanced further by using a data manager. Genomic data can be loaded into genomepy, which utilizes and extends on packages including pyfaidx¹⁵, pandas¹⁶ and MyGene.info¹⁷ to rapidly work with gene and genome sequences and metadata. Similarly, genomepy has been incorporated into other packages, such as pybedtools¹⁸ and CellOracle¹⁹. Genomepy can be used on command line and though the (fully documented) Python API, for a one-time analysis or integration in pipelines and workflow managers.

Features of genomepy

The key features of genomepy are 1) providing an overview of available assemblies with the search function, 2) download and processing of a selected assembly, with the install function and 3) using assembly data through the Python API.

The search function queries the databases of NCBI, Ensembl, UCSC and GENCODE (caching the metadata), for text, taxonomy identifiers or assembly accession identifiers. The input type is automatically recognized and used to find assemblies that have the text in the genome names or various description fields, matches the taxonomy identifier or (partially) matches the assembly accession. The output of the search function is a table with rows of metadata for each assembly found. The metadata contains the assembly name and accession, taxonomy identifier, and indicates whether a gene annotation can be downloaded (or which of the four UCSC annotations) (see fig. 1a). Snippets of available gene annotation(s) can be inspected with the annotation function (fig. 1b).

An assembly name can be passed to the install function (fig. 1c). The genome FASTA file is downloaded with the desired sequence masking level^20,21 and alternate sequences (softmasked and none by default, respectively). Alternate sequences reflect biological diversity and are often contained in reference assemblies. During sequence alignment however, similar reference sequences result in multiple alignment, leading to loss of data (as discussed in²²). Additional filters may be passed to either include or exclude contigs (chromosomes, scaffolds, etc.) by name or regex pattern. Once processed, a genome index is generated using pyfaidx¹⁵, as well as contig sizes and contig gap sizes.

Gene annotations come in a variety of recognized formats (GFF3, GTF, BED12). The install function will download the most descriptive format available, to output the commonly used GTF and BED12 formats. Contig names of the genome and gene annotation sometimes mismatch, which makes them incompatible with tools such as splice-aware aligners. Therefore, genomepy will attempt to match the contig names of the gene annotations to those used in the genome FASTA.

The install function can be extended with postprocessing steps via plugins. The options can be inspected and toggled with the plugin function. Briefly, the blacklist plugin downloads blacklists by the Kundaje lab²³ for the supported genomes. Other plugins support the generation of aligner indexes, including DNA aligner indexes for Bowtie2²⁴, BWA²⁵, GMAP²⁶ or Minimap2²⁷, and splice-aware aligners such as STAR²⁸ and HISAT2²⁹.

Assemblies not present on the major providers can be processed similarly by supplying the URLs or file paths to the install function. For data provenance and reproducibility, a README file is generated during the installation process with time, source files, processing steps, and filtered contigs.

These features are available on both the command line interface and Python API. Additional features are available on the Python API, focussed around two classes. The Genome class can be used to extract exact or random sequences from the FASTA, filter the FASTA and list the contigs, contig sizes and contig gaps. The Annotation class can be used to browse and filter the BED12 or GTF files as pandas dataframes¹⁶, map gene identifiers to other types using mygene.info¹⁷, map chromosome names to naming schemes of other major providers, and create a dictionary of any two GTF columns or attribute fields (to easily convert gene identifiers to gene names for instance).

Conclusion

Obtaining suitable genomic data is a principal step in any genomics project. With genomepy, finding available assemblies becomes trivial. A genome, with the desired sequence masking, level of biological diversity, and contigs can be obtained with a single command. Gene annotations in GTF and BED12 format, and matching the genome, can similarly be obtained, with further options available in the Python API. Whatever install options you choose are logged, for reproducibility, allowing you to start your analysis with confidence.

Code availability

Genomepy is freely available under the MIT license and can be installed using Bioconda³⁰, Pip³¹, or directly used in workflows with our Docker³² image or snakemake³³ wrapper. Code and documentation are available on github and github-pages, respectively.

Acknowledgements

The authors would like to thank Vicky Luna Velez for designing the genomepy logo, Dohoon Lee and Jie Zhu for contributing to the genomepy code, as well as the many github users who provided feedback, ideas and suggestions.

References

Yates, A. D. et al. Ensembl 2020. Nucleic Acids Research gkz966 (2019) doi:10.1093/nar/gkz966.

Kent, W. J. et al. The Human Genome Browser at UCSC. Genome Res. 12, 996–1006 (2002).

Karolchik, D. The UCSC Table Browser data retrieval tool. Nucleic Acids Research 32, 493D–496 (2004).

Frankish, A. et al. GENCODE 2021. Nucleic Acids Research 49, D916–D923 (2020).

Ruzicka, L. et al. The Zebrafish Information Network: new support for non-coding genes, richer Gene Ontology annotations and the Alliance of Genome Resources. Nucleic Acids Research 47, D867–D873 (2018).

Thurmond, J. et al. FlyBase 2.0: the next generation. Nucleic Acids Research 47, D759–D765 (2018).

Harris, T. W. et al. WormBase: a modern Model Organism Information Resource. Nucleic Acids Research (2019) doi:10.1093/nar/gkz920.

Karimi, K. et al. Xenbase: a genomic, epigenomic and transcriptomic model organism database. Nucleic Acids Research 46, D861–D868 (2018).

Zhao, S. & Zhang, B. A comprehensive evaluation of ensembl, RefSeq, and UCSC annotations in the context of RNA-seq read mapping and gene quantification. BMC Genomics 16, (2015).

10.

GitHub - kblin/ncbi-genome-download: Scripts to download genomes from the NCBI FTP servers. GitHub https://github.com/kblin/ncbi-genome-download.

11.

GitHub - LucaCappelletti94/ucsc_genomes_downloader: Python package to quickly download genomes from the UCSC. GitHub https://github.com/LucaCappelletti94/ucsc_genomes_downloader.

12.

iGenomes. https://support.illumina.com/sequencing/sequencing_software/igenome.html.

13.

Stolarczyk, M., Reuter, V. P., Smith, J. P., Magee, N. E. & Sheffield, N. C. Refgenie: a reference genome resource manager. GigaScience 9, (2020).

14.

Cormier, M. J. et al. Go Get Data (GGD) is a framework that facilitates reproducible access to genomic data. Nat Commun 12, (2021).

15.

Shirley, M. D., Ma, Z., Pedersen, B. S. & Wheelan, S. J. Efficient "pythonic" access to FASTA files using pyfaidx. (2015) doi:10.7287/peerj.preprints.970v1.

16.

Reback, J. et al. pandas-dev/pandas: Pandas 1.4.3. (Zenodo, 2022). doi:10.5281/zenodo.3509134.

17.

Xin, J. et al. High-performance web services for querying gene and variant annotation. Genome Biol 17, (2016).

18.

Dale, R. K., Pedersen, B. S. & Quinlan, A. R. Pybedtools: a flexible Python library for manipulating genomic datasets and annotations. Bioinformatics 27, 3423–3424 (2011).

19.

Kamimoto, K., Hoffmann, C. M. & Morris, S. A. CellOracle: Dissecting cell identity via network inference and in silico gene perturbation. (2020) doi:10.1101/2020.02.17.947416.

20.

RepeatMasker. http://repeatmasker.org/.

21.

Morgulis, A., Gertz, E. M., Schäffer, A. A. & Agarwala, R. A Fast and Symmetric DUST Implementation to Mask Low-Complexity DNA Sequences. Journal of Computational Biology 13, 1028–1040 (2006).

22.

Church, D. M. et al. Extending reference assembly models. Genome Biol 16, (2015).

23.

Amemiya, H. M., Kundaje, A. & Boyle, A. P. The ENCODE Blacklist: Identification of Problematic Regions of the Genome. Sci Rep 9, (2019).

24.

Langmead, B. & Salzberg, S. L. Fast gapped-read alignment with Bowtie 2. Nat Methods 9, 357–359 (2012).

25.

Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25, 1754–1760 (2009).

26.

Wu, T. D. & Watanabe, C. K. GMAP: a genomic mapping and alignment program for mRNA and EST sequences. Bioinformatics 21, 1859–1875 (2005).

27.

Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).

28.

Dobin, A. et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29, 15–21 (2012).

29.

Kim, D., Langmead, B. & Salzberg, S. L. HISAT: a fast spliced aligner with low memory requirements. Nat Methods 12, 357–360 (2015).

30.

Grüning, B. et al. Bioconda: sustainable and comprehensive software distribution for the life sciences. Nat Methods 15, 475–476 (2018).

31.

PyPI · The Python Package Index. PyPI https://pypi.org/.

32.

Dirk, M. Docker: Lightweight Linux Containers for Consistent Development and Deployment. Linux Journal 2014, 1075–3583 (2014).

33.

Köster, J. & Rahmann, S. Snakemake—a scalable bioinformatics workflow engine. Bioinformatics 34, 3600–3600 (2018).