genomepy.providers.ensembl.EnsemblProvider

class genomepy.providers.ensembl.EnsemblProvider

Bases: BaseProvider

Ensembl genome provider.

Searches both ensembl.org and ensemblgenomes.org. The bacteria division is not yet supported.

__init__()

Methods

__init__()

annotation_links(name, **kwargs)

Return available gene annotation links (http/ftp) for a genome

assembly_accession(name)

Return the assembly accession number (GCA* or GCF*) for a genome.

download_annotation(name[, genomes_dir, ...])

Download an annotation file to a specific directory

download_genome(name[, genomes_dir, ...])

Download a (gzipped) genome file to a specific directory

genome_taxid(name)

Return the genome taxonomy ID for a genome.

get_annotation_download_link(name, **kwargs)

Return a functional annotation download link.

get_annotation_download_links(name, **kwargs)

Retrieve functioning gene annotation download link(s).

get_division(name)

Retrieve the division of a genome.

get_genome_download_link(name[, mask])

Return http link to the genome sequence

get_release(is_vertebrate)

Retrieve current Ensembl or EnsemblGenomes release version.

get_releases(is_vertebrate)

Retrieve all Ensembl or EnsemblGenomes release versions.

get_version(name[, version])

Retrieve the latest Ensembl or EnsemblGenomes release version, or check if the requested release version exists.

head_annotation(name[, genomes_dir, n])

Download the first n lines of the annotation.

list_available_genomes([size])

List all available genomes.

ping()

Can the provider be reached?

releases_with_assembly(name)

List all Ensembl or EnsemblGenomes release versions with the specified genome.

search(term[, exact, size])

Search for term in genome names and descriptions, assembly accession IDs, or taxonomy IDs.

Attributes

accession_fields

Metadata fields that (can) contain the assembly's accession ID.

description_fields

Metadata fields with assembly related info.

genomes

Dictionary with assembly names as key and assembly metadata dictionary as value.

name

Name of this provider.

taxid_fields

Metadata fields that (can) contain the assembly's taxonomy ID.

accession_fields = ['assembly_accession']

Metadata fields that (can) contain the assembly’s accession ID.

annotation_links(name, **kwargs)

Return available gene annotation links (http/ftp) for a genome

Parameters

name (str) – genome name

Returns

Gene annotation links

Return type

list

assembly_accession(name: str) → str

Return the assembly accession number (GCA* or GCF*) for a genome.

Parameters

name (str) – genome name

Returns

Assembly accession number

Return type

str

description_fields = ['name', 'scientific_name', 'url_name', 'display_name']

Metadata fields with assembly related info.

download_annotation(name, genomes_dir=None, localname=None, **kwargs)

Download an annotation file to a specific directory

Parameters
  • name (str) – Genome / species name

  • genomes_dir (str, optional) – Directory to install annotation

  • localname (str, optional) – Custom name for your genome

download_genome(name: str, genomes_dir: str = None, localname: str = None, mask: str = 'soft', **kwargs)

Download a (gzipped) genome file to a specific directory

Parameters
  • name (str) – Genome / species name

  • genomes_dir (str, optional) – Directory to install genome

  • localname (str, optional) – Custom name for your genome

  • mask (str, optional) – Masking level: soft, hard or none (any other string is treated as none)
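
The mask rule above (soft, hard, and every other string meaning no masking) can be sketched as a small lookup. This is an illustration only, not genomepy's implementation: the helper name sequence_type_for_mask is hypothetical, and the labels dna_sm, dna_rm and dna are the sequence-type labels used in Ensembl FASTA file names, assumed here as the mapping target.

```python
def sequence_type_for_mask(mask: str) -> str:
    """Map a mask option to an Ensembl-style sequence type label.

    Hypothetical helper for illustration; not part of genomepy.
    """
    if mask == "soft":
        return "dna_sm"  # soft-masked: repeats shown in lowercase
    if mask == "hard":
        return "dna_rm"  # hard-masked: repeats replaced by N
    return "dna"         # any other string is treated as "none" (unmasked)


labels = {m: sequence_type_for_mask(m) for m in ("soft", "hard", "none", "anything")}
```

Falling through to the unmasked label for unknown strings mirrors the parameter description, which treats every value other than soft and hard as none.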

genome_taxid(name: str) → int

Return the genome taxonomy ID for a genome.

Parameters

name (str) – genome name

Returns

Genome Taxonomy identifier

Return type

int

genomes = {}

Dictionary with assembly names as key and assembly metadata dictionary as value.
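
As a rough illustration of how the genomes dictionary and the field lists on this page fit together, the sketch below looks up an accession and a taxonomy ID by trying each listed metadata field in order. The helper first_present and the example entry are hypothetical; only the two field lists are taken from this page, and the real lookup logic in genomepy may differ.

```python
from typing import Any, Dict, List, Optional

# Values copied from the class attributes documented on this page.
ACCESSION_FIELDS = ["assembly_accession"]
TAXID_FIELDS = ["taxonomy_id"]


def first_present(metadata: Dict[str, Any], fields: List[str]) -> Optional[Any]:
    """Return the value of the first listed field that is present and non-empty."""
    for field in fields:
        value = metadata.get(field)
        if value not in (None, ""):
            return value
    return None


# Hypothetical, hand-written entry; real metadata comes from the provider.
genomes = {
    "ExampleAssembly1": {
        "assembly_accession": "GCA_000000000.1",
        "taxonomy_id": 12345,
        "scientific_name": "Examplus fictus",
    }
}

entry = genomes["ExampleAssembly1"]
accession = first_present(entry, ACCESSION_FIELDS)
taxid = first_present(entry, TAXID_FIELDS)
```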

get_annotation_download_link(name, **kwargs)

Return a functional annotation download link.

Parameters

name (str) – genome name

Returns

http/ftp link

Return type

str

Raises

GenomeDownloadError – if no functional link was found

get_annotation_download_links(name, **kwargs)

Retrieve functioning gene annotation download link(s).

Parameters
  • name (str) – genome name

  • **kwargs (dict, optional) – version: Ensembl version to use. By default, the latest version is used.

Returns

http link(s)

Return type

list

get_division(name: str)

Retrieve the division of a genome.

get_genome_download_link(name[, mask])

Return http link to the genome sequence

Parameters
  • name (str) – Genome name. Current implementation will fail if exact name is not found.

  • mask (str, optional) – Masking level. Options: soft, hard or none. Default is soft.

Returns

http download link

Return type

str

get_release(is_vertebrate: bool) → int

Retrieve current Ensembl or EnsemblGenomes release version.

static get_releases(is_vertebrate: bool)

Retrieve all Ensembl or EnsemblGenomes release versions.

get_version(name: str, version=None) → int

Retrieve the latest Ensembl or EnsemblGenomes release version, or check if the requested release version exists.
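
The "latest, or check that the requested one exists" behaviour described above can be sketched over a plain list of release numbers. The helper resolve_version is hypothetical and takes the releases as an argument rather than querying Ensembl; raising ValueError for a missing release is an assumption of this sketch, not a statement about which exception genomepy raises.

```python
from typing import List, Optional


def resolve_version(releases: List[int], version: Optional[int] = None) -> int:
    """Pick the newest release, or validate an explicitly requested one.

    Hypothetical helper for illustration; not part of genomepy.
    """
    if not releases:
        raise ValueError("no releases available")
    if version is None:
        return max(releases)  # default: latest release
    version = int(version)
    if version not in releases:
        raise ValueError(
            f"release {version} not found; available: {sorted(releases)}"
        )
    return version


latest = resolve_version([100, 101, 102])
pinned = resolve_version([100, 101, 102], version=101)
```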

head_annotation(name: str, genomes_dir=None, n: int = 5, **kwargs)

Download the first n lines of the annotation.

The first line of the GTF is printed for review (of the gene_name field, for instance).

Parameters
  • name (str) – genome name

  • genomes_dir (str, optional) – genomes directory to install the annotation in.

  • n (int, optional) – download the annotation for n genes.
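
Taking the summary literally (the first n lines), the general technique of previewing a compressed file without reading all of it can be sketched as follows. This is not genomepy's code: head_lines is a hypothetical helper, the input is assumed to be gzip-compressed text, and an in-memory buffer stands in for a downloaded annotation.

```python
import gzip
import io
from itertools import islice
from typing import BinaryIO, List


def head_lines(stream: BinaryIO, n: int = 5) -> List[str]:
    """Return the first n text lines of a gzip-compressed binary stream."""
    with gzip.open(stream, mode="rt") as handle:
        # islice stops after n lines, so the rest is never decompressed.
        return [line.rstrip("\n") for line in islice(handle, n)]


# Hypothetical stand-in data: ten numbered lines, gzip-compressed in memory.
raw = "\n".join(f"line {i}" for i in range(1, 11)) + "\n"
buffer = io.BytesIO(gzip.compress(raw.encode()))
preview = head_lines(buffer, n=3)
```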

list_available_genomes(size=False)

List all available genomes.

Parameters

size (bool, optional) – Show absolute genome size.

Yields

genomes (list of tuples) – tuples with assembly name, accession, scientific_name, taxonomy id and description

name = 'Ensembl'

Name of this provider.

static ping()

Can the provider be reached?

releases_with_assembly(name: str)

List all Ensembl or EnsemblGenomes release versions with the specified genome.

search(term: str, exact=False, size=False)

Search for term in genome names and descriptions (if term contains text; case-insensitive), assembly accession IDs (if term starts with GCA_ or GCF_), or taxonomy IDs (if term is a number).

Note: exact accession ID search on UCSC may return different patch levels.

Parameters
  • term (str, int) – Search term, case-insensitive. Can be an assembly name (e.g. hg38), scientific name (Danio rerio), assembly accession ID (GCA_000146045), or taxonomy ID (7227).

  • exact (bool, optional) – term must be an exact match

  • size (bool, optional) – Show absolute genome size.

Yields

tuples with name and metadata
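
The three search modes described for search (taxonomy ID for numbers, accession for terms starting with GCA_ or GCF_, free text otherwise) amount to a small dispatch on the term. The sketch below shows one way such a dispatch could look; classify_term is a hypothetical helper, and matching the accession prefix case-insensitively is an assumption of this sketch.

```python
from typing import Union


def classify_term(term: Union[str, int]) -> str:
    """Decide which kind of search a term would trigger.

    Hypothetical helper for illustration; not part of genomepy.
    """
    text = str(term).strip()
    if text.isdigit():
        return "taxonomy_id"  # e.g. 7227
    if text.upper().startswith(("GCA_", "GCF_")):
        return "accession"    # e.g. GCA_000146045
    return "text"             # assembly or scientific name


kinds = [classify_term(t) for t in (7227, "GCA_000146045", "Danio rerio", "hg38")]
```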

taxid_fields = ['taxonomy_id']

Metadata fields that (can) contain the assembly’s taxonomy ID.