genomepy.providers.ucsc.UcscProvider
- class genomepy.providers.ucsc.UcscProvider
Bases: BaseProvider
UCSC genome provider.
The UCSC API REST server is used to search and list genomes. The UCSC MySQL database is used to find metadata and annotations.
- __init__()
Methods
__init__()
annotation_links(name, **kwargs) – Return a sorted list of available gene annotation types for a genome.
assembly_accession(name) – Return the assembly accession (GCA_/GCF_) for a genome.
download_annotation(name[, genomes_dir, ...]) – Download the UCSC genePred via their MySQL database, and convert to annotations.
download_genome(name[, genomes_dir, ...]) – Download a (gzipped) genome file to a specific directory.
genome_taxid(name) – Return the genome taxonomy ID for a genome.
get_annotation_download_link(name, **kwargs) – Return an available annotation type.
get_annotation_download_links(name, **kwargs) – Return available gene annotation table(s) from the UCSC MySQL database.
get_genome_download_link(name[, mask]) – Return UCSC http link to genome sequence.
head_annotation(name[, genomes_dir, n]) – Download the first n genes of each UCSC annotation type.
list_available_genomes([size]) – List all available genomes.
ping() – Can the provider be reached?
search(term[, exact, size]) – Search for term in genome names and descriptions, assembly accession IDs, or taxonomy IDs.
Attributes
accession_fields – Metadata fields that (can) contain the assembly's accession ID.
description_fields – Metadata fields with assembly related info.
genomes – Dictionary with assembly names as key and assembly metadata dictionary as value.
name – Name of this provider.
taxid_fields – Metadata fields that (can) contain the assembly's taxonomy ID.
- accession_fields = ['assembly_accession']
Metadata fields that (can) contain the assembly’s accession ID.
- annotation_links(name, **kwargs) → List[str]
Return a sorted list of available gene annotation types for a genome
- Parameters
name (str) – genome name
- Returns
Gene annotation types
- Return type
list
- assembly_accession(name: str) → str
Return the assembly accession (GCA_/GCF_) for a genome.
Some accession IDs can be retrieved from the UCSC MySQL hgFixed database. For others, the accession IDs can sometimes be scraped from the readme.html. If not, any linked NCBI assembly pages can also be scraped.
- Parameters
name (str) – genome name
- Returns
Assembly accession.
- Return type
str
- description_fields = ['description', 'scientificName']
Metadata fields with assembly related info.
- download_annotation(name, genomes_dir=None, localname=None, **kwargs)
Download the UCSC genePred via their MySQL database, and convert to annotations.
- download_genome(name: str, genomes_dir: str = None, localname: str = None, mask: str = 'soft', **kwargs)
Download a (gzipped) genome file to a specific directory
- Parameters
name (str) – Genome / species name
genomes_dir (str, optional) – Directory to install genome
localname (str, optional) – Custom name for your genome
mask (str, optional) – Masking: soft, hard, or none (any other string)
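The mask rule above (soft, hard, and everything else counting as none) can be pictured as a small normalization step. This is a minimal sketch, not genomepy's implementation: `normalize_mask` is a hypothetical helper, and the case-insensitive matching is an assumption the documentation does not confirm.

```python
def normalize_mask(mask: str = "soft") -> str:
    """Reduce a mask argument to 'soft', 'hard' or 'none' (hypothetical helper)."""
    # Assumption: matching is case-insensitive; the documentation does not say.
    value = str(mask).lower()
    # Per the parameter description, all other strings mean no masking.
    return value if value in ("soft", "hard") else "none"


for example in ("soft", "hard", "none", "unmasked"):
    print(example, "->", normalize_mask(example))
```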
- genome_taxid(name: str) → int
Return the genome taxonomy ID for a genome.
- Parameters
name (str) – genome name
- Returns
Genome Taxonomy identifier
- Return type
int
- genomes = {}
Dictionary with assembly names as key and assembly metadata dictionary as value.
- get_annotation_download_link(name: str, **kwargs) → str
Return an available annotation type.
- Parameters
name (str) – genome name
**kwargs (dict, optional) – ucsc_annotation_type: specific annotation type to download.
- Returns
http/ftp link
- Return type
str
- Raises
GenomeDownloadError – if no functional link was found
FileNotFoundError – if the specified annotation type is unavailable
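Because an unavailable annotation type raises FileNotFoundError, a caller may want to retry without the `ucsc_annotation_type` keyword. The sketch below shows one way to do that. `annotation_or_default` is a hypothetical wrapper, and `provider` is assumed to be any object exposing `get_annotation_download_link` with the signature documented above. A GenomeDownloadError is deliberately left to propagate.

```python
def annotation_or_default(provider, name, annotation_type=None):
    """Request a specific annotation type, else let the provider choose.

    Hypothetical wrapper; `provider` only needs the documented
    get_annotation_download_link(name, **kwargs) method.
    """
    if annotation_type is not None:
        try:
            return provider.get_annotation_download_link(
                name, ucsc_annotation_type=annotation_type
            )
        except FileNotFoundError:
            # The requested type is unavailable for this genome: fall through.
            pass
    # No specific type requested (or it was unavailable): use the default.
    return provider.get_annotation_download_link(name)
```

The second call can still raise, so callers that need a hard guarantee should handle errors from the wrapper as well.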
- get_annotation_download_links(name, **kwargs)
Return available gene annotation table(s) from the UCSC MySQL database.
Available tables were retrieved on init.
- Parameters
name (str) – genome name
- Returns
annotation types
- Return type
list
- get_genome_download_link(name, mask='soft', **kwargs)
Return UCSC http link to genome sequence
- Parameters
name (str) – Genome name. Current implementation will fail if exact name is not found.
mask (str, optional) – Masking level. Options: soft, hard or none. Default is soft.
- Returns
http/ftp download link
- Return type
str
- head_annotation(name, genomes_dir=None, n: int = 5, **kwargs)
Download the first n genes of each UCSC annotation type.
The first line of the GTF is printed for review (of the gene_name field, for instance).
- Parameters
name (str) – genome name
genomes_dir (str, optional) – genomes directory to install the annotation in.
n (int, optional) – download the annotation for n genes.
kwargs (dict, optional) – annotations (list): specify which UCSC annotation types to download. Downloads all available if left blank.
- list_available_genomes(size=False)
List all available genomes.
- Parameters
size (bool, optional) – Show absolute genome size.
- Yields
genomes (list of tuples) – tuples with assembly name, accession, scientific_name, taxonomy id and description
- name = 'UCSC'
Name of this provider.
- static ping()
Can the provider be reached?
- search(term: str, exact=False, size=False)
Search for term in genome names and descriptions (if term contains text; case-insensitive), assembly accession IDs (if term starts with GCA_ or GCF_), or taxonomy IDs (if term is a number).
Note: exact accession ID search on UCSC may return different patch levels.
- Parameters
term (str, int) – Search term, case-insensitive. Can be an assembly name (e.g. hg38), scientific name (Danio rerio), assembly accession ID (GCA_000146045), or taxonomy ID (7227).
exact (bool, optional) – term must be an exact match
size (bool, optional) – Show absolute genome size.
- Yields
tuples with name and metadata
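The three search modes described above amount to a dispatch on the shape of the term. The following sketch mirrors only the documented rules. `classify_term` is a hypothetical helper, not part of genomepy, and treating the GCA_/GCF_ prefix case-insensitively is an assumption.

```python
def classify_term(term) -> str:
    """Guess which search mode applies to a term (hypothetical helper).

    Mirrors the documented rules only; genomepy's own logic may differ.
    """
    text = str(term).strip()
    if text.isdigit():
        return "taxonomy"   # e.g. 7227
    # Assumption: the accession prefix is matched case-insensitively.
    if text.upper().startswith(("GCA_", "GCF_")):
        return "accession"  # e.g. GCA_000146045
    return "text"           # assembly names and descriptions, e.g. hg38


for example in ("hg38", "Danio rerio", "GCA_000146045", 7227):
    print(example, "->", classify_term(example))
```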
- taxid_fields = ['taxId']
Metadata fields that (can) contain the assembly’s taxonomy ID.
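As a rough illustration of how the `accession_fields` and `taxid_fields` lists relate to the `genomes` dictionary, the sketch below looks up the first populated field in a metadata entry. `first_present` is a hypothetical helper, and the `exampleAsm1` entry is a made-up placeholder rather than real UCSC metadata. genomepy's actual lookup may work differently.

```python
from typing import Any, Dict, List, Optional

# Values copied from the class attributes documented above.
ACCESSION_FIELDS = ["assembly_accession"]
TAXID_FIELDS = ["taxId"]


def first_present(metadata: Dict[str, Any], fields: List[str]) -> Optional[Any]:
    """Return the first non-empty value among `fields` (hypothetical helper)."""
    for field in fields:
        value = metadata.get(field)
        if value not in (None, ""):
            return value
    return None


# Placeholder entry shaped like the `genomes` dictionary; the values are made up.
genomes = {
    "exampleAsm1": {
        "description": "placeholder description",
        "scientificName": "Examplus placeholderus",
        "taxId": 12345,
    }
}

meta = genomes["exampleAsm1"]
print(first_present(meta, TAXID_FIELDS))      # the taxonomy ID, if present
print(first_present(meta, ACCESSION_FIELDS))  # None: no accession field here
```

A missing accession, as in the placeholder entry, corresponds to the case where assembly_accession has to look beyond the stored metadata.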