Python API documentation (core)

This page describes the core genomepy functionality. These classes and functions can be found on the top level of the genomepy module (e.g. genomepy.search), and are made available when running from genomepy import * (we won’t judge you).

Additional functions that do not fit the core functionality, but we feel are still pretty cool, are also described.

Finding genomic data

When looking to download a new genome/gene annotation, your first step would be genomepy.search. This function will check either one, or all, providers. Advanced users may want to specify a provider for their search to speed up the process. To see which providers are available, use genomepy.list_providers or genomepy.list_online_providers:


genomepy.list_providers()

List of providers genomepy supports

Returns

names of providers

Return type

list

genomepy.list_online_providers()

List of providers genomepy supports that are online right now.

Returns

names of online providers

Return type

list

genomepy.search(term: str, provider: str = None, exact=False, size=False)

Search for term in genome names and descriptions (if term contains text; case-insensitive), assembly accession IDs (if term starts with GCA_ or GCF_), or taxonomy IDs (if term is a number).

If provider is specified, search only that specific provider, else search all providers.

Note: exact accession ID search on UCSC may return different patch levels.

Parameters
  • term (str, int) – Search term, case-insensitive, allows regex.

  • provider (str , optional) – Only search the specified provider (faster).

  • exact (bool, optional) – term must be an exact match

  • size (bool, optional) – Show absolute genome size.

Yields

list – genome name, provider and metadata
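To make the three search modes above concrete, here is a minimal sketch in plain Python. The helper names classify_term and find_genomes are hypothetical and not part of genomepy; classify_term only mirrors the rules described above and is not the library's own logic. Only genomepy.search and its documented parameters come from this page.

```python
def classify_term(term):
    """Guess which kind of search a term triggers (illustrative only)."""
    text = str(term).strip()
    if text.startswith(("GCA_", "GCF_")):
        return "accession"  # assembly accession ID
    if text.isdigit():
        return "taxonomy"   # taxonomy ID
    return "text"           # genome names and descriptions


def find_genomes(term, provider=None, exact=False):
    """Collect search results in a list (needs genomepy and internet access)."""
    import genomepy  # imported lazily, so classify_term works without it

    # genomepy.search yields one row per genome: name, provider and metadata
    return list(genomepy.search(term, provider=provider, exact=exact))
```

With genomepy installed, find_genomes("homo sapiens", provider="Ensembl") would restrict the search to a single provider, which should be faster than searching all of them.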


If you have no idea what you are looking for, you could even check out all available genomes. Be warned, genomepy.list_available_genomes is like watching the Star Wars title crawl.


genomepy.list_available_genomes(provider=None, size=False) → list

List all available genomes.

Parameters
  • provider (str, optional) – List genomes from specific provider. Genomes from all providers will be returned if not specified.

  • size (bool, optional) – Show absolute genome size.

Yields

list – tuples with genome name, provider and metadata


If we search for homo sapiens for instance, we find that GRCh38.p13 and hg38 are the latest versions. These names describe the same genome, but they are distinct assemblies with differences between them.

One of these differences is the quality of the gene annotation. Next, we can inspect these with genomepy.head_annotations:


genomepy.head_annotations(name: str, provider=None, n: int = 2)

Quickly inspect the metadata of each available annotation for the specified genome.

For UCSC, up to 4 gene annotation styles are available: “ncbiRefSeq”, “refGene”, “ensGene”, “knownGene” (respectively).

For NCBI, the chromosome names are not yet sanitized.

Parameters
  • name (str) – genome name

  • provider (str, optional) – only search the specified provider for the genome name

  • n (int, optional) – number of lines to show


Installing genomic data

Now that you have seen what’s available, it’s time to download a genome. The default parameters for genomepy.install_genome are optimized for sequence alignment and gene counting, but you have full control over them, so have a look!

genomepy won’t overwrite any files you already downloaded (unless specified), but you can review your local genomes with genomepy.list_installed_genomes.


genomepy.install_genome(name: str, provider: Optional[str] = None, genomes_dir: Optional[str] = None, localname: Optional[str] = None, mask: Optional[str] = 'soft', keep_alt: Optional[bool] = False, regex: Optional[str] = None, invert_match: Optional[bool] = False, bgzip: Optional[bool] = None, annotation: Optional[bool] = False, only_annotation: Optional[bool] = False, skip_matching: Optional[bool] = False, skip_filter: Optional[bool] = False, threads: Optional[int] = 1, force: Optional[bool] = False, **kwargs: Optional[dict]) → Genome

Install a genome (& gene annotation).

Parameters
  • name (str) – Genome name

  • provider (str , optional) – Provider name. Will try Gencode, Ensembl, UCSC and NCBI (in that order) if not specified.

  • genomes_dir (str , optional) – Where to create the output folder.

  • localname (str , optional) – Custom name for this genome.

  • mask (str , optional) – Genome masking of repetitive sequences. Options: hard/soft/none, default is soft.

  • keep_alt (bool , optional) – Some genomes contain alternative regions. These regions cause issues with sequence alignment, as they are inherently duplications of the consensus regions. Set to true to keep these alternative regions.

  • regex (str , optional) – Regular expression to select specific chromosome / scaffold names.

  • invert_match (bool , optional) – Set to True to select all chromosomes that don’t match the regex.

  • bgzip (bool , optional) – If set to True the genome FASTA file will be compressed using bgzip, and gene annotation will be compressed with gzip.

  • threads (int , optional) – Build genome index using multithreading (if supported). Default: lowest of 8/all threads.

  • force (bool , optional) – Set to True to overwrite existing files.

  • annotation (bool , optional) – If set to True, download gene annotation in BED and GTF format.

  • only_annotation (bool , optional) – If set to True, only download the gene annotation files.

  • skip_matching (bool , optional) – If set to True, contigs in the annotation not matching those in the genome will not be corrected.

  • skip_filter (bool , optional) – If set to True, the gene annotations will not be filtered to match the genome contigs.

  • kwargs (dict , optional) –

    Provider specific options.

    toplevel (bool, optional)

    Ensembl only: Always download the toplevel genome. Ignores potential primary assembly.

    version (int, optional)

    Ensembl only: Specify release version. Default is latest.

    to_annotation (text, optional)

    URL only: direct link to annotation file. Required if this is not the same directory as the fasta.

    path_to_annotation (text, optional)

    Local only: path to local annotation file. Required if this is not the same directory as the fasta.

Returns

Genome class with the installed genome

Return type

Genome

genomepy.list_installed_genomes(genomes_dir: str = None) → list

List all locally available genomes.

Parameters

genomes_dir (str, optional) – Directory with genomes installed by genomepy.

Returns

genome names

Return type

list
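As a rough sketch of a typical installation, the snippet below bundles a few of the documented install_genome options and lets you preview what the regex option would select before downloading anything. INSTALL_OPTIONS, selected and install are hypothetical names chosen for this example, the option values are assumptions, and selected uses plain re rather than genomepy's own filtering.

```python
import re

# Assumed example settings; adjust them to your own needs.
INSTALL_OPTIONS = {
    "provider": "UCSC",
    "annotation": True,          # also download the BED and GTF annotation
    "mask": "soft",              # the documented default
    "regex": r"^chr[0-9XYM]+$",  # keep only the main chromosomes
}


def selected(contigs, regex, invert_match=False):
    """Preview which contig names a regex would select (plain re, not genomepy)."""
    pattern = re.compile(regex)
    return [c for c in contigs if bool(pattern.search(c)) != invert_match]


def install(name, genomes_dir=None, **overrides):
    """Install a genome, then list what is available locally."""
    import genomepy  # needs genomepy and internet access

    options = {**INSTALL_OPTIONS, **overrides}
    genome = genomepy.install_genome(name, genomes_dir=genomes_dir, **options)
    return genome, genomepy.list_installed_genomes(genomes_dir)
```

Calling install("hg38") would then download the genome with these settings, and install("hg38", force=True) would overwrite existing files.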


If you want to download a sequence blacklist, or create an aligner index, you might wanna look at plugins! Don’t worry, you can rerun the genomepy.install_genome command, and genomepy will only run the new parts.


genomepy.manage_plugins(command: str, plugin_names: list = None)

Manage genomepy plugins

Parameters
  • command (str) –

    command to perform. Options:

    list

    show plugins and status

    enable

    enable plugins

    disable

    disable plugins

  • plugin_names (list) – plugin names for the enable/disable command
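A small wrapper can catch typos in the command before genomepy is called. This is only a sketch: the plugins function and VALID_COMMANDS tuple are hypothetical, and the three commands are taken from the option list above.

```python
VALID_COMMANDS = ("list", "enable", "disable")


def plugins(command, plugin_names=None):
    """Validate the command, then hand it to genomepy.manage_plugins."""
    if command not in VALID_COMMANDS:
        raise ValueError(f"command must be one of {VALID_COMMANDS}, got {command!r}")
    if command != "list" and not plugin_names:
        raise ValueError("enable/disable need at least one plugin name")

    import genomepy  # imported lazily; only needed for a valid call

    return genomepy.manage_plugins(command, plugin_names)
```

For example, plugins("list") shows the plugins and their status. A call such as plugins("enable", ["blacklist"]) assumes a plugin with that name exists; check the output of the list command for the actual names.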


The genome and gene annotations were installed in the genomes directory (unless specified otherwise). If you have a specific location in mind, you could set this as default in the genomepy config. To find and inspect it, use genomepy.manage_config:


genomepy.manage_config(command)

Manage the genomepy configuration

Parameters

command (str) –

command to perform. Options:

file

return config filepath

show

return config content

generate

create new config file


Errors

Did something go wrong? Oh noes! If the problem persists, clear the genomepy cache with genomepy.clean, and try again.


genomepy.clean()

Remove cached data on providers.


Using a genome

Alright, you’ve got the goods! You can browse the genome’s sequences and metadata with the genomepy.Genome class. This class builds on the pyfaidx.Fasta class to also provide you with several options to get specific sequences from your genome, and save these to file.


class genomepy.Genome(name, genomes_dir=None, *args, **kwargs)

pyfaidx Fasta object of a genome with additional attributes & methods.

Generates a genome index file, sizes file and gaps file of the genome.

Parameters
  • name (str) – Genome name

  • genomes_dir (str, optional) – Genome installation directory

Returns

An object that provides a pygr compatible interface.

Return type

pyfaidx.Fasta

Methods

close()

get_random_sequences([n, length, chroms, ...])

Return random genomic sequences.

get_seq(name, start, end[, rc])

Return a sequence by record name and interval [start, end].

get_spliced_seq(name, intervals[, rc])

Return a sequence by record name and list of intervals

items()

keys()

track2fasta(track[, fastafile, stranded, ...])

Return a list of fasta sequences as Sequence objects as directed from the track(s).

values()

Attributes

gaps

contigs and the number of Ns contained

plugin

dict of all active plugins and their properties

sizes

contigs and their lengths

genomes_dir

path to the genomepy genomes directory

name

genome name

genome_file

path to the genome fasta

genome_dir

path to the genome directory

index_file

path to the genome index

sizes_file

path to the chromosome sizes file

gaps_file

path to the chromosome gaps file

annotation_gtf_file

path to the gene annotation GTF file

annotation_bed_file

path to the gene annotation BED file

readme_file

path to the README file

annotation_bed_file

path to the gene annotation BED file

annotation_gtf_file

path to the gene annotation GTF file

assembly_accession

genome assembly accession

gaps: dict = None

contigs and the number of Ns contained

Type

contents of the gaps file

gaps_file

path to the chromosome gaps file

genome_dir

path to the genome directory

genome_file

path to the genome fasta

genomes_dir

path to the genomepy genomes directory

get_random_sequences(n=10, length=200, chroms=None, max_n=0.1, outtype='list')

Return random genomic sequences.

Parameters
  • n (int , optional) – Number of sequences to return.

  • length (int , optional) – Length of sequences to return.

  • chroms (list , optional) – Return sequences only from these chromosomes.

  • max_n (float , optional) – Maximum fraction of Ns.

  • outtype (string , optional) – return the output as list or string. Options: “list” or “string”, default: “list”.

Returns

coordinates as lists or strings: List with [chrom, start, end] genomic coordinates. String with “chrom:start-end” genomic coordinates (can be used as input for track2fasta).

Return type

list

get_seq(name, start, end, rc=False)

Return a sequence by record name and interval [start, end].

Coordinates are 1-based, closed interval. If rc is set, reverse complement will be returned.
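To illustrate what a 1-based, closed interval means, the sketch below applies the same convention to a plain Python string. fetch_closed_interval is a hypothetical stand-in written for this example, not the implementation behind get_seq.

```python
# Translation table for complementing DNA bases (upper and lower case)
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def fetch_closed_interval(sequence, start, end, rc=False):
    """Return sequence[start..end], with both ends included and counting from 1."""
    # 1-based closed [start, end] equals 0-based half-open [start - 1, end)
    subsequence = sequence[start - 1:end]
    if rc:
        # reverse complement: complement each base, then reverse the string
        subsequence = subsequence.translate(COMPLEMENT)[::-1]
    return subsequence
```

On the string "ACGTAC", the interval [2, 4] covers the second through fourth bases, so the result has length end - start + 1.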

get_spliced_seq(name, intervals, rc=False)

Return a sequence by record name and list of intervals

Interval list is an iterable of [start, end]. Coordinates are 1-based, end-exclusive. If rc is set, reverse complement will be returned.

index_file

path to the genome index

name

genome name

property plugin

dict of all active plugins and their properties

readme_file

path to the README file

sizes: dict = None

contigs and their lengths

Type

contents of the sizes file

sizes_file

path to the chromosome sizes file

tax_id

genome taxonomy identifier

track2fasta(track, fastafile=None, stranded=False, extend_up=0, extend_down=0)

Return a list of fasta sequences as Sequence objects as directed from the track(s).

Parameters
  • track (list/region file/bed file) – region(s) you wish to translate to fasta. Example input files can be found in genomepy/tests/data/regions.*

  • fastafile (bool , optional) – return Sequences as list or save to file? (default: list)

  • stranded (bool , optional) – return sequences for both strands? Requires BED6 (or higher) as input (default: False)

  • extend_up (int , optional) – extend the sequences up? (command is strand sensitive, default: 0)

  • extend_down (int , optional) – extend the sequences down? (command is strand sensitive, default: 0)


You can obtain genomic sequences from a wide variety of inputs with as_seqdict. To use the function, it must be explicitly imported with from genomepy.seq import as_seqdict.


genomepy.seq.as_seqdict(to_convert, genome=None, minsize=None)
genomepy.seq.as_seqdict(to_convert: list, genome=None, minsize=None)
genomepy.seq.as_seqdict(to_convert: TextIOWrapper, genome=None, minsize=None)
genomepy.seq.as_seqdict(to_convert: str, genome=None, minsize=None)
genomepy.seq.as_seqdict(to_convert: Fasta, genome=None, minsize=None)
genomepy.seq.as_seqdict(to_convert: ndarray, genome=None, minsize=None)

Convert input to a dictionary with name as key and sequence as value.

If the input contains genomic coordinates, the genome needs to be specified. If minsize is specified all sequences will be checked if they are not shorter than minsize. If regions (or a region file) are used as the input, the genome can optionally be specified in the region using the following format: genome@chrom:start-end.

Currently supported input types include:

  • FASTA, BED and region files.

  • List or numpy.ndarray of regions.

  • pyfaidx.Fasta object.

  • pybedtools.BedTool object.

Parameters
  • to_convert (list, str, pyfaidx.Fasta or pybedtools.BedTool) – Input to convert to FASTA-like dictionary

  • genome (str, optional) – Genomepy genome name.

  • minsize (int or None, optional) – If specified, check if all sequences have at least size minsize.

Returns

sequence names as key and sequences as value.

Return type

dict
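The region format described above, with its optional genome@ prefix, can be taken apart as in the sketch below. split_region and regions_to_seqdict are hypothetical helpers for this example; split_region is plain string handling and not how as_seqdict parses its input.

```python
def split_region(region):
    """Split 'genome@chrom:start-end' or 'chrom:start-end' into its parts."""
    genome, _, coords = region.rpartition("@")  # genome is '' without a prefix
    chrom, _, span = coords.rpartition(":")
    start, _, end = span.partition("-")
    return (genome or None), chrom, int(start), int(end)


def regions_to_seqdict(regions, genome=None, minsize=None):
    """Convert regions to a {name: sequence} dict (needs an installed genome)."""
    from genomepy.seq import as_seqdict  # the explicit import mentioned above

    return as_seqdict(regions, genome=genome, minsize=minsize)
```

With a genome installed, regions_to_seqdict(["chr1:100-200"], genome="hg38") would look up the region in that genome, where hg38 is just an example name.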


A non-core function worth mentioning is genomepy.files.filter_fasta, for when you wish to filter a fasta file by chromosome name using regex, but want the output straight to (another) fasta file.


genomepy.files.filter_fasta(infa: str, outfa: str = None, regex: str = '.*', invert_match: Optional[bool] = False) → list

Filter fasta file based on regex.

Parameters
  • infa (str) – Filename of the input fasta file.

  • outfa (str, optional) – Filename of the output fasta file. If None, infa is overwritten.

  • regex (str, optional) – Regular expression used for selecting sequences. Matches everything if left blank.

  • invert_match (bool, optional) – Select all sequences not matching regex if set.

Returns

removed contigs

Return type

list
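The filtering logic can be sketched on an in-memory dictionary of records. filter_records is a hypothetical helper, and it assumes the regex is searched anywhere in the name (re.search semantics). It is not the implementation of filter_fasta, which works on files.

```python
import re


def filter_records(records, regex=".*", invert_match=False):
    """Split {name: sequence} records into kept records and removed names."""
    pattern = re.compile(regex)
    kept, removed = {}, []
    for name, sequence in records.items():
        # keep a record if it matches, or if it does not match and invert_match is set
        if bool(pattern.search(name)) != invert_match:
            kept[name] = sequence
        else:
            removed.append(name)
    return kept, removed
```

Like filter_fasta, the sketch reports the removed contigs, which is handy for checking that a regex did not drop more than intended.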


Using a gene annotation

Similarly, the genomepy.Annotation class helps you get the genes in check. This class returns a number of neat pandas dataframes, such as the named_gtf, or an annotation with the gene or chromosome names remapped to another type. Remapping gene names to another type is also possible with Annotation.map_genes. This feature also comes as separate function genomepy.query_mygene, as it’s just so darn useful.


class genomepy.Annotation(name: str, genomes_dir: str = None, quiet: bool = False)

Manipulate genes and whole gene annotations with pandas dataframes.

Parameters
  • name (str) – Genome name/directory/fasta or gene annotation BED/GTF file.

  • genomes_dir (str, optional) – Genomes installation directory.

  • quiet (bool, optional) – Silence init warnings

Returns

attributes & methods to manipulate gene annotations

Return type

object

annotation_bed_file

path to the gene annotation BED file

annotation_contigs: list = None

Contigs found in the gene annotation BED

annotation_gtf_file

path to the gene annotation GTF file

attributes(annot: Union[str, DataFrame] = 'gtf')

list all attributes present in the GTF attribute field.

Parameters

annot (str or pd.DataFrame, optional) – any GTF in dataframe format, or the default GTF.

Returns

with attributes

Return type

list

bed: DataFrame = None

Dataframe with BED format annotation

filter_regex(annot: Union[str, DataFrame], regex: Optional[str] = '.*', invert_match: Optional[bool] = False, column: Union[str, int] = 0) → DataFrame

Filter a dataframe by any column using regex.

Parameters
  • annot (str or pd.DataFrame) – annotation to filter: “bed”, “gtf” or a pandas dataframe

  • regex (str) – regex string to match

  • invert_match (bool, optional) – keep contigs NOT matching the regex string

  • column (str or int, optional) – column name or number to filter (default: 1st, contig name)

Returns

filtered dataframe

Return type

pd.DataFrame

from_attributes(field, annot: Union[str, DataFrame] = 'gtf', check=True)

Convert the specified GTF attribute field to a pandas series

Parameters
  • field (str) – field from the GTF’s attribute column.

  • annot (str or pd.DataFrame, optional) – any GTF in dataframe format, or the default GTF.

  • check (bool, optional) – filter the GTF for rows containing field?

Returns

with the same index as the input GTF and the field column

Return type

pd.Series
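For readers unfamiliar with the GTF attribute column, the sketch below shows how quoted key/value pairs can be pulled out of it with a regular expression. parse_attributes and list_attributes are hypothetical helpers written for this example; they skip unquoted values and are not the code behind attributes or from_attributes.

```python
import re

# Matches pairs such as: gene_id "ENSG00000141510";
ATTRIBUTE = re.compile(r'(\S+)\s+"([^"]*)"')


def parse_attributes(attribute_column):
    """Turn one GTF attribute string into a {field: value} dict."""
    return dict(ATTRIBUTE.findall(attribute_column))


def list_attributes(attribute_columns):
    """Collect every field name seen across several attribute strings."""
    names = set()
    for column in attribute_columns:
        names.update(parse_attributes(column))
    return sorted(names)
```

A row that lacks the requested field simply has no entry in the parsed dict, which is the situation the check parameter above is meant to handle.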

gene_coords(genes: Iterable[str], annot: str = 'bed') → DataFrame

Retrieve gene locations.

Parameters
  • genes (Iterable) – List of gene names as found in the given annotation file type

  • annot (str, optional) – Annotation file type: ‘bed’ or ‘gtf’ (default: “bed”)

Returns

gene annotation

Return type

pandas.DataFrame

genes(annot: str = 'gtf') → list

Retrieve gene names from an annotation.

For BED files, names are taken from the ‘name’ column.

For GTF files, names are taken from the ‘gene_name’ field in the attribute column, if available.

Parameters

annot (str, optional) – Annotation file type: ‘bed’ or ‘gtf’ (default: “gtf”)

Returns

gene names

Return type

list

genome_contigs: list = None

Contigs found in the genome fasta

genome_dir

path to the genome directory

genome_file

path to the genome fasta

gtf: DataFrame = None

Dataframe with GTF format annotation

gtf_dict(key, value, string_values=True, annot: Union[str, DataFrame] = 'gtf')

Create a dictionary based on the columns or attribute fields in a GTF.

Parameters
  • key (str) – column name or attribute fields (e.g. “seqname”, “gene_name”)

  • value (str) – column name or attribute fields (e.g. “gene_id”, “transcript_name”)

  • string_values (bool, optional) – attempt to format the dict values as strings (only happens if all value lists are length 1)

  • annot (str or pd.DataFrame, optional) – annotation to filter: “gtf” or a pandas dataframe

Returns

with values as lists. If string_values is True and all lists are length 1, values will be strings.

Return type

dict
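The effect of string_values is easiest to see on a toy example. The sketch below works on a list of plain dicts instead of a GTF dataframe. build_dict is a hypothetical helper, and dropping duplicate values is an assumption made for this example, so the real gtf_dict may behave differently in that respect.

```python
def build_dict(rows, key, value, string_values=True):
    """Map each key to a list of values, optionally collapsing to strings."""
    result = {}
    for row in rows:
        k, v = row.get(key), row.get(value)
        if k is None or v is None:
            continue  # skip rows missing either field
        values = result.setdefault(k, [])
        if v not in values:  # assumption: duplicate values are dropped
            values.append(v)
    # collapse only if every key maps to exactly one value
    if string_values and all(len(values) == 1 for values in result.values()):
        return {k: values[0] for k, values in result.items()}
    return result
```

A one-to-one mapping such as gene_name to gene_id collapses to strings, while a one-to-many mapping such as gene_name to transcript_id keeps its lists.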

index_file

path to the genome index

lengths(attribute='gene_name')

Return a series with the selected GTF attribute as index, and its lengths as values.

Parameters

attribute (str) – attribute to provide lengths of. Options: gene_name, gene_id, transcript_name, transcript_id. Attribute must be present in the GTF file.

Returns

attribute indexed series named ‘length’

Return type

pd.Series

map_genes(field: str, product: str = 'protein', annot: Union[str, DataFrame] = 'bed') → DataFrame

Use mygene.info to map gene identifiers to any specified field.

Returns the dataframe with remapped “name” column. Drops missing identifiers.

Parameters
  • annot (str or pd.DataFrame) – Annotation dataframe to map (a pandas dataframe or “bed”). Is mapped to a column named “name” (required).

  • field (str, optional) – Identifier for gene annotation. Uses mygene.info to map ids. Valid fields are: ensembl.gene, entrezgene, symbol, name, refseq. Note that refseq will return the protein refseq_id by default, use product=“rna” to return the RNA refseq_id. Currently, mapping to Ensembl transcript ids is not supported.

  • product (str, optional) – Either “protein” or “rna”. Only used when field=“refseq”

Returns

remapped gene annotation

Return type

pandas.DataFrame

map_locations(annot: Union[str, DataFrame], to: str, drop=True) → Union[None, DataFrame]

Map chromosome names from one assembly to another.

Uses the NCBI assembly reports to find contigs. Drops missing contigs.

Parameters
  • annot (str or pd.DataFrame) – annotation to map: “bed”, “gtf” or a pandas dataframe.

  • to (str) – target provider (UCSC, Ensembl or NCBI)

  • drop (bool, optional) – if True, replace the chromosome column. If False, add a 2nd chromosome column.

Returns

chromosome mapping.

Return type

pandas.DataFrame

name

genome name

named_gtf: DataFrame = None

Dataframe with GTF format annotation, with gene_name as index

readme_file

path to the README file

sanitize(match=True, filter=True, overwrite=False)

Match the contig names of the gene annotations to the genome’s.

First, match the contig names if possible. Second, remove contig names not found in the genome. Third, save the results and log this in the README.

Parameters
  • match (bool, optional) – match annotation contig names to the genome contig names (default is True)

  • filter (bool, optional) – remove annotation contig names not found in the genome contig names (default is True)

  • overwrite (bool, optional) – update the annotation files on disk, and log this in the README (default is False).

Returns

updated attributes

Return type

Annotation class

sizes_file

path to the chromosome sizes file

tax_id

genome taxonomy identifier

genomepy.query_mygene(query: Iterable[str], tax_id: Union[str, int], field: str = 'genomic_pos') → DataFrame

Use mygene.info to map gene identifiers to another type.

Parameters
  • query (iterable) – a list or list-like of gene identifiers

  • tax_id (str or int) – Target genome taxonomy id

  • field (str, optional) – Target identifier to map the query genes to. Valid fields are: ensembl.gene, entrezgene, symbol, name, refseq. Note that refseq will return the protein refseq_id by default, use refseq.translation.rna to return the RNA refseq_id. Currently, mapping to Ensembl transcript ids is not supported.

Returns

mapped gene annotation.

Return type

pandas.DataFrame


Another non-core function worth mentioning is genomepy.annotation.filter_regex, which allows you to filter a dataframe by any column using regex.


genomepy.annotation.filter_regex(df: DataFrame, regex: str, invert_match: Optional[bool] = False, column: Union[str, int] = 0) → DataFrame

Filter a pandas dataframe by a column (default: 1st, contig name).

Parameters
  • df (pd.DataFrame) – annotation to filter (a pandas dataframe)

  • regex (str) – regex string to match

  • invert_match (bool, optional) – keep contigs NOT matching the regex string

  • column (str or int, optional) – column name or number to filter (default: 1st, contig name)

Returns

filtered dataframe

Return type

pd.DataFrame
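The behaviour described above can be approximated with a few lines of pandas. This is a sketch under assumptions: filter_by_regex is a hypothetical stand-in rather than genomepy's code, and it treats the regex as a search anywhere in the column value.

```python
import pandas as pd


def filter_by_regex(df, regex, invert_match=False, column=0):
    """Keep the rows whose column value matches (or does not match) the regex."""
    if isinstance(column, int):
        column = df.columns[column]  # translate a column number to its name
    matches = df[column].astype(str).str.contains(regex, regex=True)
    return df[~matches] if invert_match else df[matches]


# A tiny BED-like dataframe to try it on
bed = pd.DataFrame(
    {
        "chrom": ["chr1", "chr2", "scaffold_1"],
        "start": [0, 10, 20],
        "end": [5, 15, 25],
    }
)
main_chromosomes = filter_by_regex(bed, r"^chr")
```

Because the column can be given by name or by number, the same helper can filter on gene names or any other field, not just the contig name.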