genomepy.providers.ucsc.UcscProvider

class genomepy.providers.ucsc.UcscProvider

Bases: BaseProvider

UCSC genome provider.

The UCSC REST API server is used to search and list genomes. The UCSC MySQL database is used to find metadata and annotations.

__init__()

Methods

__init__()

annotation_links(name, **kwargs)

Return a sorted list of available gene annotation types for a genome.

assembly_accession(name)

Return the assembly accession (GCA_/GCF_) for a genome.

download_annotation(name[, genomes_dir, ...])

Download the UCSC genePred via their MySQL database, and convert to annotations.

download_genome(name[, genomes_dir, ...])

Download a (gzipped) genome file to a specific directory.

genome_taxid(name)

Return the genome taxonomy ID for a genome.

get_annotation_download_link(name, **kwargs)

Return an available annotation type.

get_annotation_download_links(name, **kwargs)

Return available gene annotation table(s) from the UCSC MySQL database.

get_genome_download_link(name[, mask])

Return the UCSC http link to the genome sequence.

head_annotation(name[, genomes_dir, n])

Download the first n genes of each UCSC annotation type.

list_available_genomes([size])

List all available genomes.

ping()

Can the provider be reached?

search(term[, exact, size])

Search for term in genome names and descriptions, assembly accession IDs, or taxonomy IDs.

Attributes

accession_fields

Metadata fields that (can) contain the assembly's accession ID.

description_fields

Metadata fields with assembly related info.

genomes

Dictionary with assembly names as key and assembly metadata dictionary as value.

name

Name of this provider.

taxid_fields

Metadata fields that (can) contain the assembly's taxonomy ID.

accession_fields = ['assembly_accession']

Metadata fields that (can) contain the assembly’s accession ID.

annotation_links(name, **kwargs)

Return a sorted list of available gene annotation types for a genome.

Parameters

name (str) – genome name

Returns

Gene annotation types

Return type

list

assembly_accession(name: str) → str

Return the assembly accession (GCA_/GCF_) for a genome.

Some accession IDs can be retrieved from the UCSC MySQL hgFixed database. For others, the accession IDs can sometimes be scraped from the readme.html. If not, any linked NCBI assembly pages can also be scraped.

Parameters

name (str) – genome name

Returns

Assembly accession.

Return type

str
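The lookup order described above (database first, then readme.html, then linked NCBI pages) can be sketched as a simple fallback chain. This is a minimal illustration, not genomepy's implementation: `find_accession`, `first_accession`, and the three stand-in sources are hypothetical names, and the regular expression assumes accessions of the form GCA_/GCF_ plus nine digits and an optional version suffix.

```python
import re
from typing import Callable, Iterable, Optional

# Assumption: assembly accessions look like GCA_/GCF_ followed by nine digits,
# optionally with a version suffix such as ".2".
ACCESSION_PATTERN = re.compile(r"GC[AF]_\d{9}(?:\.\d+)?")


def find_accession(text: str) -> Optional[str]:
    """Return the first accession-like string found in text, or None."""
    match = ACCESSION_PATTERN.search(text)
    return match.group(0) if match else None


def first_accession(sources: Iterable[Callable[[], str]]) -> Optional[str]:
    """Try each source in order and stop at the first one that yields an accession."""
    for fetch in sources:
        accession = find_accession(fetch())
        if accession:
            return accession
    return None


# Hypothetical stand-ins for the three lookups. Real lookups would need network
# access, so fixed strings are used here.
sources = [
    lambda: "",                                  # database lookup found nothing
    lambda: "<p>Assembly: GCA_000146045.2</p>",  # text scraped from readme.html
    lambda: "GCF_000146045.2",                   # linked NCBI page, never reached
]
print(first_accession(sources))
```

Because the chain stops at the first hit, later and slower sources are only consulted when the earlier ones return nothing.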

description_fields = ['description', 'scientificName']

Metadata fields with assembly related info.

download_annotation(name, genomes_dir=None, localname=None, **kwargs)

Download the UCSC genePred via their MySQL database, and convert to annotations.

download_genome(name: str, genomes_dir: str = None, localname: str = None, mask: str = 'soft', **kwargs)

Download a (gzipped) genome file to a specific directory.

Parameters
  • name (str) – Genome / species name

  • genomes_dir (str, optional) – Directory to install genome

  • localname (str, optional) – Custom name for your genome

  • mask (str, optional) – Masking level: soft, hard or none (any other string is treated as none). Default is soft.
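The mask rule above (soft, hard, and everything else meaning no masking) can be sketched as a small normalization step. `normalize_mask` is a hypothetical helper written for illustration, and lowercasing the input is an assumption that the description does not confirm.

```python
def normalize_mask(mask: str) -> str:
    """Map a user-supplied mask value to 'soft', 'hard' or 'none'."""
    value = mask.lower()  # assumption: matching is case-insensitive
    if value in ("soft", "hard"):
        return value
    # Per the parameter description, every other string means no masking.
    return "none"


for raw in ("soft", "HARD", "unmasked"):
    print(raw, "->", normalize_mask(raw))
```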

genome_taxid(name: str) → int

Return the genome taxonomy ID for a genome.

Parameters

name (str) – genome name

Returns

Genome Taxonomy identifier

Return type

int

genomes = {}

Dictionary with assembly names as key and assembly metadata dictionary as value.
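As a rough illustration of how the genomes dictionary and the three field lists could fit together, the sketch below reads metadata through the field lists. The field names are taken from this page; the entry `exampleAsm1` and all of its values are made up, and `first_value` and `summarize` are hypothetical helpers rather than part of genomepy.

```python
from typing import Any, Dict, Iterator, List, Optional, Tuple

# Field lists as documented for this provider.
accession_fields = ["assembly_accession"]
taxid_fields = ["taxId"]
description_fields = ["description", "scientificName"]

# Hypothetical entry with placeholder values, shaped as
# {assembly name: {metadata field: value}}.
genomes: Dict[str, Dict[str, Any]] = {
    "exampleAsm1": {
        "assembly_accession": "GCA_000000000.0",
        "taxId": 12345,
        "description": "Example assembly description",
        "scientificName": "Examplus speciesus",
    }
}


def first_value(metadata: Dict[str, Any], fields: List[str]) -> Optional[Any]:
    """Return the value of the first listed field that is present and non-empty."""
    for field in fields:
        value = metadata.get(field)
        if value not in (None, ""):
            return value
    return None


def summarize(all_genomes: Dict[str, Dict[str, Any]]) -> Iterator[Tuple]:
    """Yield (name, accession, taxonomy id, description) for every assembly."""
    for name, metadata in all_genomes.items():
        yield (
            name,
            first_value(metadata, accession_fields),
            first_value(metadata, taxid_fields),
            first_value(metadata, description_fields),
        )


for row in summarize(genomes):
    print(row)
```

Keeping the field names in lists means a lookup does not need to know which key a particular data source uses for a given piece of metadata.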

get_annotation_download_link(name, **kwargs)

Return an available annotation type.

Parameters
  • name (str) – genome name

  • **kwargs (dict, optional) – ucsc_annotation_type: specific annotation type to download.

Returns

http/ftp link

Return type

str

Raises
  • GenomeDownloadError – if no functional link was found

  • FileNotFoundError – if the specified annotation type is unavailable
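The selection behaviour described above can be sketched as follows. `pick_annotation` is a hypothetical helper, not the library's code. Falling back to the first available type when none is requested, and matching the requested type case-insensitively, are both assumptions. The type names in the usage lines are example values only.

```python
from typing import List, Optional


def pick_annotation(available: List[str], requested: Optional[str] = None) -> str:
    """Return the requested annotation type, or the first available one."""
    if not available:
        raise FileNotFoundError("no annotation types available")
    if requested is None:
        # Assumption: without a specific request, the first available type is used.
        return available[0]
    # Assumption: the requested type is matched case-insensitively.
    lookup = {annotation.lower(): annotation for annotation in available}
    try:
        return lookup[requested.lower()]
    except KeyError:
        raise FileNotFoundError(
            f"annotation type {requested!r} is unavailable, options: {available}"
        ) from None


available = ["ncbiRefSeq", "refGene", "knownGene"]  # example values
print(pick_annotation(available))
print(pick_annotation(available, "refgene"))
```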

get_annotation_download_links(name, **kwargs)

Return available gene annotation table(s) from the UCSC MySQL database.

Available tables were retrieved on init.

Parameters

name (str) – genome name

Returns

annotation types

Return type

list

get_genome_download_link(name, mask='soft')

Return the UCSC http link to the genome sequence.

Parameters
  • name (str) – Genome name. Current implementation will fail if exact name is not found.

  • mask (str, optional) – Masking level. Options: soft, hard or none. Default is soft.

Returns

http/ftp download link

Return type

str

head_annotation(name, genomes_dir=None, n: int = 5, **kwargs)

Download the first n genes of each UCSC annotation type.

The first line of the GTF is printed for review (of the gene_name field, for instance).

Parameters
  • name (str) – genome name

  • genomes_dir (str, optional) – genomes directory to install the annotation in.

  • n (int, optional) – download the annotation for n genes.

  • kwargs (dict, optional) – annotations (list): specify which UCSC annotation types to download. Downloads all available if left blank.

list_available_genomes(size=False)

List all available genomes.

Parameters

size (bool, optional) – Show absolute genome size.

Yields

genomes (list of tuples) – tuples with assembly name, accession, scientific_name, taxonomy id and description

name = 'UCSC'

Name of this provider.

static ping()

Can the provider be reached?

search(term: str, exact=False, size=False)

Search for term in genome names and descriptions (case-insensitive, if term contains text), assembly accession IDs (if term starts with GCA_ or GCF_), or taxonomy IDs (if term is a number).

Note: exact accession ID search on UCSC may return different patch levels.

Parameters
  • term (str, int) – Search term, case-insensitive. Can be an assembly name (e.g. hg38), scientific name (Danio rerio), assembly accession ID (GCA_000146045), or taxonomy ID (7227).

  • exact (bool, optional) – term must be an exact match

  • size (bool, optional) – Show absolute genome size.

Yields

tuples with name and metadata
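The three kinds of search term described above can be told apart with a short dispatch function. This is a sketch under stated assumptions: `classify_term` and its return labels are hypothetical, and checking for a number before checking the accession prefix is one possible ordering, not necessarily the one genomepy uses.

```python
from typing import Union


def classify_term(term: Union[str, int]) -> str:
    """Classify a search term as 'taxonomy_id', 'accession' or 'text'."""
    text = str(term).strip()
    if text.isdigit():
        return "taxonomy_id"  # the term is a number
    if text.upper().startswith(("GCA_", "GCF_")):
        return "accession"    # the term starts with GCA_ or GCF_
    return "text"             # anything else is matched against names and descriptions


for term in ("hg38", "Danio rerio", "GCA_000146045", 7227):
    print(repr(term), "->", classify_term(term))
```

The example terms are the ones listed in the parameter description for `term`.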

taxid_fields = ['taxId']

Metadata fields that (can) contain the assembly’s taxonomy ID.