genomepy.providers.ensembl.EnsemblProvider
- class genomepy.providers.ensembl.EnsemblProvider
Bases: BaseProvider
Ensembl genome provider.
Searches both ensembl.org and ensemblgenomes.org. The bacteria division is not yet supported.
- __init__()
Methods
__init__()
annotation_links(name, **kwargs) – Return available gene annotation links (http/ftp) for a genome.
assembly_accession(name) – Return the assembly accession number (GCA* or GCF*) for a genome.
download_annotation(name[, genomes_dir, ...]) – Download annotation file to a specific directory.
download_genome(name[, genomes_dir, ...]) – Download a (gzipped) genome file to a specific directory.
genome_taxid(name) – Return the genome taxonomy ID for a genome.
get_annotation_download_link(name, **kwargs) – Return a functional annotation download link.
get_annotation_download_links(name, **kwargs) – Retrieve functioning gene annotation download link(s).
get_division(name) – Retrieve the division of a genome.
get_genome_download_link(name[, mask]) – Return http link to the genome sequence.
get_release(is_vertebrate) – Retrieve current Ensembl or EnsemblGenomes release version.
get_releases(is_vertebrate) – Retrieve all Ensembl or EnsemblGenomes release versions.
get_version(name[, version]) – Retrieve the latest Ensembl or EnsemblGenomes release version, or check if the requested release version exists.
head_annotation(name[, genomes_dir, n]) – Download the first n lines of the annotation.
list_available_genomes([size]) – List all available genomes.
ping() – Can the provider be reached?
releases_with_assembly(name) – List all Ensembl or EnsemblGenomes release versions with the specified genome.
search(term[, exact, size]) – Search for term in genome names and descriptions, assembly accession IDs, or taxonomy IDs.
Attributes
accession_fields – Metadata fields that (can) contain the assembly's accession ID.
description_fields – Metadata fields with assembly related info.
genomes – Dictionary with assembly names as key and assembly metadata dictionary as value.
name – Name of this provider.
taxid_fields – Metadata fields that (can) contain the assembly's taxonomy ID.
- accession_fields = ['assembly_accession']
Metadata fields that (can) contain the assembly’s accession ID.
- annotation_links(name: str, **kwargs) -> List[str]
Return available gene annotation links (http/ftp) for a genome
- Parameters
name (str) – genome name
- Returns
Gene annotation links
- Return type
list
- assembly_accession(name: str) -> str
Return the assembly accession number (GCA* or GCF*) for a genome.
- Parameters
name (str) – genome name
- Returns
Assembly accession number
- Return type
str
- description_fields = ['name', 'scientific_name', 'url_name', 'display_name']
Metadata fields with assembly related info.
- download_annotation(name, genomes_dir=None, localname=None, **kwargs)
Download annotation file to a specific directory
- Parameters
name (str) – Genome / species name
genomes_dir (str , optional) – Directory to install annotation
localname (str , optional) – Custom name for your genome
- download_genome(name: str, genomes_dir: str = None, localname: str = None, mask: str = 'soft', **kwargs)
Download a (gzipped) genome file to a specific directory
- Parameters
name (str) – Genome / species name
genomes_dir (str , optional) – Directory to install genome
localname (str , optional) – Custom name for your genome
mask (str , optional) – Masking level: soft, hard or none (all other strings are treated as none)
- genome_taxid(name: str) -> int
Return the genome taxonomy ID for a genome.
- Parameters
name (str) – genome name
- Returns
Genome Taxonomy identifier
- Return type
int
- genomes = {}
Dictionary with assembly names as key and assembly metadata dictionary as value.
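To illustrate how the genomes dictionary relates to accession_fields and taxid_fields, the following is a minimal sketch that returns the first populated field from a metadata dictionary. The helper first_populated and the sample entry are hypothetical and not part of genomepy; only the field names come from the attributes documented on this page.

```python
# Field lists as documented for EnsemblProvider on this page.
ACCESSION_FIELDS = ["assembly_accession"]
TAXID_FIELDS = ["taxonomy_id"]

# Hypothetical sample in the shape of `genomes`:
# assembly name -> assembly metadata dictionary.
genomes = {
    "GRCz11": {
        "assembly_accession": "GCA_000002035.4",
        "taxonomy_id": 7955,
        "scientific_name": "Danio rerio",
    }
}


def first_populated(metadata, fields):
    """Return the first non-empty value among `fields`, or None."""
    for field in fields:
        value = metadata.get(field)
        if value not in (None, ""):
            return value
    return None


accession = first_populated(genomes["GRCz11"], ACCESSION_FIELDS)
taxid = first_populated(genomes["GRCz11"], TAXID_FIELDS)
```

Keeping the candidate field names in a list lets different providers name their metadata keys differently while sharing the same lookup logic. Whether genomepy performs the lookup exactly this way is an assumption.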
- get_annotation_download_link(name: str, **kwargs) -> str
Return a functional annotation download link.
- Parameters
name (str) – genome name
- Returns
http/ftp link
- Return type
str
- Raises
GenomeDownloadError – if no functional link was found
- get_annotation_download_links(name, **kwargs)
Retrieve functioning gene annotation download link(s).
- Parameters
name (str) – genome name
**kwargs (dict, optional) – version: Ensembl release version to use. By default, the latest version is used.
- Returns
http link(s)
- Return type
list
- get_division(name: str)
Retrieve the division of a genome.
- get_genome_download_link(name, mask='soft', **kwargs)
Return http link to the genome sequence
- Parameters
name (str) – Genome name. Current implementation will fail if exact name is not found.
mask (str , optional) – Masking level. Options: soft, hard or none. Default is soft.
- Returns
http download link
- Return type
str
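As a sketch of how the mask option could translate into a download link, the function below maps each mask value to the sequence-type token that appears in Ensembl FASTA file names (dna_sm, dna_rm, dna). The function name mask_to_sequence_type is hypothetical, and whether the provider performs the mapping in exactly this way is an assumption.

```python
def mask_to_sequence_type(mask="soft"):
    """Map a mask option to the sequence-type token in Ensembl FASTA file names."""
    if mask == "soft":
        return "dna_sm"  # soft-masked: repeats in lowercase
    if mask == "hard":
        return "dna_rm"  # hard-masked: repeats replaced by N
    return "dna"  # unmasked; any other string falls through to here
```

For example, a soft-masked file name would contain `.dna_sm.` between the assembly name and the sequence level.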
- get_release(is_vertebrate: bool) -> int
Retrieve current Ensembl or EnsemblGenomes release version.
- static get_releases(is_vertebrate: bool)
Retrieve all Ensembl or EnsemblGenomes release versions.
- get_version(name: str, version=None) -> int
Retrieve the latest Ensembl or EnsemblGenomes release version, or check if the requested release version exists.
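The "latest or validated" behaviour described above can be sketched as follows. The helper resolve_version, its releases argument, and the use of ValueError are hypothetical; the library may obtain the release list and report errors differently.

```python
def resolve_version(releases, version=None):
    """Return the latest release, or validate a requested one.

    `releases` is a list of integer release versions (hypothetical input).
    """
    if not releases:
        raise ValueError("no releases available")
    if version is None:
        # No version requested: fall back to the most recent release.
        return max(releases)
    version = int(version)
    if version not in releases:
        raise ValueError(f"release {version} not found")
    return version
```

Here the list of releases would play the role of the output of releases_with_assembly(name), though that pairing is an assumption.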
- head_annotation(name: str, genomes_dir=None, n: int = 5, **kwargs)
Download the first n lines of the annotation.
The first line of the GTF is printed for review (of the gene_name field, for instance).
- Parameters
name (str) – genome name
genomes_dir (str, optional) – genomes directory to install the annotation in.
n (int, optional) – download the annotation for n genes.
- list_available_genomes(size=False)
List all available genomes.
- Parameters
size (bool, optional) – Show absolute genome size.
- Yields
genomes (list of tuples) – tuples with assembly name, accession, scientific_name, taxonomy id and description
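Because list_available_genomes yields rows lazily, results can be filtered while iterating. The sketch below collects assembly names for one taxonomy ID; it relies on the tuple order documented above. The helper assemblies_for_taxid is hypothetical and accepts any object with a list_available_genomes() method, so it can be tried without network access.

```python
def assemblies_for_taxid(provider, taxid):
    """Collect assembly names whose taxonomy id equals `taxid`.

    Relies on the documented row order:
    (assembly name, accession, scientific_name, taxonomy id, description).
    """
    names = []
    for row in provider.list_available_genomes():
        # Compare as strings so that 7955 and "7955" both match.
        if str(row[3]) == str(taxid):
            names.append(row[0])
    return names
```

With genomepy installed and the provider reachable, an instance of genomepy.providers.ensembl.EnsemblProvider could be passed as provider.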
- name = 'Ensembl'
Name of this provider.
- static ping()
Can the provider be reached?
- releases_with_assembly(name: str)
List all Ensembl or EnsemblGenomes release versions with the specified genome.
- search(term: str, exact=False, size=False)
Search for term in genome names and descriptions (if term contains text; case-insensitive), assembly accession IDs (if term starts with GCA_ or GCF_), or taxonomy IDs (if term is a number).
Note: exact accession ID search on UCSC may return different patch levels.
- Parameters
term (str, int) – Search term, case-insensitive. Can be an assembly name (e.g. hg38), scientific name (Danio rerio), assembly accession ID (GCA_000146045), or taxonomy ID (7227).
exact (bool, optional) – term must be an exact match
size (bool, optional) – Show absolute genome size.
- Yields
tuples with name and metadata
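The dispatch rules described for term can be sketched as a small classifier. The function classify_term and its return labels are hypothetical, and treating the GCA_/GCF_ prefix as case-insensitive is an assumption; this only illustrates the three documented cases, not the library's implementation.

```python
def classify_term(term):
    """Classify a search term as 'accession', 'taxonomy_id' or 'text'."""
    text = str(term).strip()
    # Assembly accession IDs start with GCA_ or GCF_.
    if text.upper().startswith(("GCA_", "GCF_")):
        return "accession"
    # Taxonomy IDs are plain numbers (passed as int or str).
    if text.isdigit():
        return "taxonomy_id"
    # Anything else is matched against names and descriptions.
    return "text"
```

Using the example terms from the parameter description: `GCA_000146045` is classified as an accession, `7227` as a taxonomy ID, and `hg38` or `Danio rerio` as text.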
- taxid_fields = ['taxonomy_id']
Metadata fields that (can) contain the assembly’s taxonomy ID.