genomepy.providers.gencode.GencodeProvider

class genomepy.providers.gencode.GencodeProvider

Bases: BaseProvider

GENCODE genome provider.

GENCODE sports superb annotations for human and mouse with UCSC-style chromosome names. Genomes on this provider are unmasked, so we use the UCSC genomes instead.

Note: this combination lacks scaffolds and alternate haplotype sequences (their names don’t match between GENCODE and UCSC).
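The note above can be illustrated with a small sketch that keeps only the contig names shared by a genome and an annotation. The helper `shared_contigs` and the example name lists are assumptions made for illustration; they are not part of genomepy.

```python
def shared_contigs(genome_contigs, annotation_contigs):
    """Return the contig names present in both inputs, in genome order."""
    annotation_set = set(annotation_contigs)
    return [c for c in genome_contigs if c in annotation_set]


# Illustrative names only: the same scaffold written in two naming styles.
genome = ["chr1", "chr2", "chrM", "chrUn_GL000195v1"]
annotation = ["chr1", "chr2", "chrM", "GL000195.1"]

common = shared_contigs(genome, annotation)
print(common)
```

Because the scaffold is named differently in the two lists, only the chromosomes survive the comparison.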

__init__()

Methods

__init__()

annotation_links(name, **kwargs)

Return available gene annotation links (http/ftp) for a genome

assembly_accession(name)

Return the assembly accession number (GCA* or GCF*) for a genome.

download_annotation(name[, genomes_dir, ...])

Download the annotation file to a specific directory.

download_genome(name[, genomes_dir, ...])

Download genomes from UCSC, as the GENCODE genomes aren't masked.

genome_taxid(name)

Return the genome taxonomy ID for a genome.

get_annotation_download_link(name, **kwargs)

Return a functional annotation download link.

get_annotation_download_links(name, **kwargs)

Retrieve functioning gene annotation download link(s).

get_genome_download_link(name[, mask])

Return UCSC http link to genome sequence

head_annotation(name[, genomes_dir, n])

Download the first n lines of the annotation.

list_available_genomes([size])

List all available genomes.

ping()

Can the provider be reached?

search(term[, exact, size])

Search for term in genome names and descriptions (if term contains text).

Attributes

accession_fields

Metadata fields that (can) contain the assembly's accession ID.

description_fields

Metadata fields with assembly related info.

genomes

Dictionary with assembly names as key and assembly metadata dictionary as value.

name

Name of this provider.

taxid_fields

Metadata fields that (can) contain the assembly's taxonomy ID.

accession_fields = ['assembly_accession']

Metadata fields that (can) contain the assembly’s accession ID.

annotation_links(name, **kwargs)

Return available gene annotation links (http/ftp) for a genome.

Parameters

name (str) – genome name

Returns

Gene annotation links

Return type

list

assembly_accession(name: str) → str

Return the assembly accession number (GCA* or GCF*) for a genome.

Parameters

name (str) – genome name

Returns

Assembly accession number

Return type

str

description_fields = ['species', 'other_info', 'text_search']

Metadata fields with assembly related info.

download_annotation(name, genomes_dir=None, localname=None, **kwargs)

Download the annotation file to a specific directory.

Parameters
  • name (str) – Genome / species name

  • genomes_dir (str, optional) – Directory to install annotation

  • localname (str, optional) – Custom name for your genome

download_genome(name: str, genomes_dir: str = None, localname: str = None, mask: str = 'soft', **kwargs)

Download genomes from UCSC, as the GENCODE genomes aren’t masked. Contigs between the UCSC genome and GENCODE annotations match.

genome_taxid(name: str) → int

Return the genome taxonomy ID for a genome.

Parameters

name (str) – genome name

Returns

Genome Taxonomy identifier

Return type

int
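As a minimal sketch of how the two lookups above might be combined, the helper below gathers the accession and taxonomy ID for one genome. `describe_genome` and `StubProvider` are hypothetical names, and the stub returns placeholder values so the example runs offline.

```python
def describe_genome(provider, name):
    """Collect the accession and taxonomy ID via the two documented lookups."""
    return {
        "name": name,
        "accession": provider.assembly_accession(name),
        "taxid": provider.genome_taxid(name),
    }


class StubProvider:
    """Hypothetical stand-in for a provider; the values are placeholders."""

    def assembly_accession(self, name: str) -> str:
        return "GCA_placeholder"

    def genome_taxid(self, name: str) -> int:
        return 0


info = describe_genome(StubProvider(), "example_genome")
print(info)
```

With genomepy installed and network access, a `GencodeProvider()` instance could presumably be passed in place of the stub, using a genome name that the provider lists.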

genomes = {}

Dictionary with assembly names as key and assembly metadata dictionary as value.

get_annotation_download_link(name, **kwargs)

Return a functional annotation download link.

Parameters

name (str) – genome name

Returns

http/ftp link

Return type

str

Raises

GenomeDownloadError – if no functional link was found
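The behaviour described above (return one working link, or raise if none works) can be sketched as follows. This is an illustration under assumptions, not genomepy's implementation: `first_functional_link`, the `is_functional` callback, the local `GenomeDownloadError` class, and the placeholder URLs are all hypothetical.

```python
class GenomeDownloadError(Exception):
    """Local stand-in for the exception named above."""


def first_functional_link(links, is_functional):
    """Return the first link for which is_functional(link) is true."""
    for link in links:
        if is_functional(link):
            return link
    raise GenomeDownloadError("no functional annotation link found")


# Placeholder URLs and a fake checker keep the example offline.
candidates = ["ftp://example.org/a.gtf.gz", "http://example.org/b.gtf.gz"]
link = first_functional_link(candidates, lambda url: url.startswith("http://"))
print(link)
```

Passing the check as a callback keeps the selection logic separate from whatever network test decides that a link is functional.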

get_annotation_download_links(name, **kwargs)

Retrieve functioning gene annotation download link(s).

Parameters

name (str) – genome name

Returns

http/ftp link(s)

Return type

list

get_genome_download_link(name[, mask])

Return UCSC http link to genome sequence.

Parameters
  • name (str) – Genome name.

  • mask (str, optional) – Masking level. Options: soft, hard or none. Default is soft.

Returns

http/ftp link to genome.

Return type

str

head_annotation(name: str, genomes_dir=None, n: int = 5, **kwargs)

Download the first n lines of the annotation.

The first line of the GTF is printed for review (of the gene_name field, for instance).

Parameters
  • name (str) – genome name

  • genomes_dir (str, optional) – genomes directory to install the annotation in.

  • n (int, optional) – download the annotation for n genes.

list_available_genomes(size=False)

List all available genomes.

Parameters

size (bool, optional) – Show absolute genome size.

Yields

genomes (list of tuples) – tuples with assembly name, accession, scientific_name, taxonomy id and description

name = 'GENCODE'

Name of this provider.

static ping()

Can the provider be reached?

search(term: str, exact=False, size=False)

Search for term in genome names and descriptions (if term contains text; case-insensitive), assembly accession IDs (if term starts with GCA_ or GCF_), or taxonomy IDs (if term is a number).

Note: exact accession ID search on UCSC may return different patch levels.

Parameters
  • term (str, int) – Search term, case-insensitive. Can be an assembly name (e.g. hg38), scientific name (Danio rerio), assembly accession ID (GCA_000146045), or taxonomy ID (7227).

  • exact (bool, optional) – term must be an exact match

  • size (bool, optional) – Show absolute genome size.

Yields

tuples with name and metadata
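The three kinds of search term described above can be told apart with a short classifier. This is a sketch of the stated rules, not genomepy's code: `classify_term` is a hypothetical helper, and treating the GCA_/GCF_ prefix as case-insensitive is an assumption.

```python
def classify_term(term):
    """Guess which kind of search a term would trigger, per the rules above."""
    text = str(term).strip()
    if text.isdigit():
        return "taxonomy_id"
    # Assumption: the accession prefix is matched regardless of case.
    if text.upper().startswith(("GCA_", "GCF_")):
        return "accession"
    return "text"


# Example terms taken from the parameter description above.
kinds = {term: classify_term(term) for term in ("hg38", "GCA_000146045", 7227)}
print(kinds)
```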

taxid_fields = ['taxonomy_id']

Metadata fields that (can) contain the assembly’s taxonomy ID.