Python API documentation (core)

This page described the core genomepy functionality. These classes and functions can be found on the top level of the genomepy module (e.g. genomepy.search), and are made available when running from genomepy import * (we won’t judge you).

Additional functions that do not fit the core functionality, but we feel are still pretty cool, are also described.

Finding genomic data

When looking to download a new genome/gene annotation, your first step would be genomepy.search. This function will check either one, or all, providers. Advanced users may want to specify a provider for their search to speed up the process. To see which providers are available, use genomepy.list_providers or genomepy.list_online_providers:

genomepy.list_providers()

List of providers genomepy supports

Returns: names of providers
Return type: list

genomepy.list_online_providers()

List of providers genomepy supports that are online right now.

Returns: names of online providers
Return type: list

genomepy.search(term: str, provider: str = None, exact=False, size=False)

Search for term in genome names and descriptions (if term contains text. Case-insensitive), assembly accession IDs (if term starts with GCA_ or GCF_), or taxonomy IDs (if term is a number).

If provider is specified, search only that specific provider, else search all providers.

Note: exact accession ID search on UCSC may return different patch levels.

Parameters

term (str, int) – Search term, case-insensitive, allows regex.
provider (str , optional) – Only search the specified provider (faster).
exact (bool, optional) – term must be an exact match
size (bool, optional) – Show absolute genome size.

Yields

list – genome name, provider and metadata

If you have no idea what you are looking for, you could even check out all available genomes. Be warned, genomepy.list_available_genomes is like watching the Star Wars title crawl.

genomepy.list_available_genomes(provider=None, size=False) → list

List all available genomes.

Parameters

provider (str, optional) – List genomes from specific provider. Genomes from all providers will be returned if not specified.
size (bool, optional) – Show absolute genome size.

Yields

list – tuples with genome name, provider and metadata

If we search for homo sapiens for instance, we find that GRCh3.p13 and hg38 are the latest versions. These names describe the same genome, but different assemblies, with differences between them.

One of these differences is the quality of the gene annotation. Next, we can inspect these with genomepy.head_annotations:

genomepy.head_annotations(name: str, provider=None, n: int = 2)

Quickly inspect the metadata of each available annotation for the specified genome.

For UCSC, up to 4 gene annotation styles are available: “ncbiRefSeq”, “refGene”, “ensGene”, “knownGene” (respectively).

For NCBI, the chromosome names are not yet sanitized.

Parameters

name (str) – genome name
provider (str, optional) – only search the specified provider for the genome name
n (int, optional) – number of lines to show

Installing genomic data

Now that you have seen whats available, its time to download a genome. The default parameter for genomepy.install_genome are optimized for sequence alignment and gene counting, but you have full control over them, so have a look!

genomepy won’t overwrite any files you already downloaded (unless specified), but you can review your local genomes with genomepy.list_installed_genomes.

genomepy.install_genome(name: str, provider: Optional[str] = None, genomes_dir: Optional[str] = None, localname: Optional[str] = None, mask: Optional[str] = 'soft', keep_alt: Optional[bool] = False, regex: Optional[str] = None, invert_match: Optional[bool] = False, bgzip: Optional[bool] = None, annotation: Optional[bool] = False, only_annotation: Optional[bool] = False, skip_matching: Optional[bool] = False, skip_filter: Optional[bool] = False, threads: Optional[int] = 1, force: Optional[bool] = False, **kwargs: Optional[dict]) → Genome

Install a genome (& gene annotation).

Parameters

name (str) – Genome name
provider (str , optional) – Provider name. will try Gencode, Ensembl, UCSC and NCBI (in that order) if not specified.
genomes_dir (str , optional) – Where to create the output folder.
localname (str , optional) – Custom name for this genome.
mask (str , optional) – Genome masking of repetitive sequences. Options: hard/soft/none, default is soft.
keep_alt (bool , optional) – Some genomes contain alternative regions. These regions cause issues with sequence alignment, as they are inherently duplications of the consensus regions. Set to true to keep these alternative regions.
regex (str , optional) – Regular expression to select specific chromosome / scaffold names.
invert_match (bool , optional) – Set to True to select all chromosomes that don’t match the regex.
bgzip (bool , optional) – If set to True the genome FASTA file will be compressed using bgzip, and gene annotation will be compressed with gzip.
threads (int , optional) – Build genome index using multithreading (if supported). Default: lowest of 8/all threads.
force (bool , optional) – Set to True to overwrite existing files.
annotation (bool , optional) – If set to True, download gene annotation in BED and GTF format.
only_annotation (bool , optional) – If set to True, only download the gene annotation files.
skip_matching (bool , optional) – If set to True, contigs in the annotation not matching those in the genome will not be corrected.
skip_filter (bool , optional) – If set to True, the gene annotations will not be filtered to match the genome contigs.
kwargs (dict , optional) –
Provider specific options.

toplevelbool , optional
Ensembl only: Always download the toplevel genome. Ignores potential primary assembly.

versionint , optional
Ensembl only: Specify release version. Default is latest.

to_annotationtext , optional
URL only: direct link to annotation file. Required if this is not the same directory as the fasta.

path_to_annotationtext, optional
Local only: path to local annotation file. Required if this is not the same directory as the fasta.

Returns

Genome class with the installed genome

Return type

Genome

genomepy.list_installed_genomes(genomes_dir: str = None) → list

List all locally available genomes.

Parameters: genomes_dir (str, optional) – Directory with genomes installed by genomepy.
Returns: genome names
Return type: list

If you want to download a sequence blacklist, or create an aligner index, you might wanna look at plugins! Don’t worry, you can rerun the genome.install_genome command, and genomepy will only run the new parts.

genomepy.manage_plugins(command: str, plugin_names: list = None)

Manage genomepy plugins

Parameters

command (str) –
command to perform. Options:

list
show plugins and status

enable
enable plugins

disable
disable plugins
plugin_names (list) – plugin names for the enable/disable command

The genome and gene annotations were installed in the genomes directory (unless specified otherwise). If you have a specific location in mind, you could set this as default in the genomepy config. To find and inspect it, use genomepy.manage_config:

genomepy.manage_config(command)

Manage the genomepy configuration

Parameters

command (str) –

command to perform. Options:

file: return config filepath
show: return config content
generate: create new config file

Errors

Did something go wrong? Oh noes! If the problem persists, clear the genomepy cache with genomepy.clean, and try again.

genomepy.clean(): Remove cached data on providers.

Using a genome

Alright, you’ve got the goods! You can browse the genome’s sequences and metadata with the genomepy.Genome class. This class builds on the pyfaidx.Fasta class to also provide you with several options to get specific sequences from your genome, and save these to file.

class genomepy.Genome(name, genomes_dir=None, *args, **kwargs)

pyfaidx Fasta object of a genome with additional attributes & methods.

Generates a genome index file, sizes file and gaps file of the genome.

Parameters

name (str) – Genome name
genomes_dir (str, optional) – Genome installation directory

Returns

An object that provides a pygr compatible interface.

Return type

pyfaidx.Fasta

Methods

`close`()
`get_random_sequences`([n, length, chroms, ...])	Return random genomic sequences.
`get_seq`(name, start, end[, rc])	Return a sequence by record name and interval [start, end].
`get_spliced_seq`(name, intervals[, rc])	Return a sequence by record name and list of intervals
`items`()
`keys`()
`track2fasta`(track[, fastafile, stranded, ...])	Return a list of fasta sequences as Sequence objects as directed from the track(s).
`values`()

Attributes

`gaps`	contigs and the number of Ns contained
`plugin`	dict of all active plugins and their properties
`sizes`	contigs and their lengths
`genomes_dir`	path to the genomepy genomes directory
`name`	genome name
`genome_file`	path to the genome fasta
`genome_dir`	path to the genome directory
`index_file`	path to the genome index
`sizes_file`	path to the chromosome sizes file
`gaps_file`	path to the chromosome gaps file
`annotation_gtf_file`	path to the gene annotation GTF file
`annotation_bed_file`	path to the gene annotation BED file
`readme_file`	path to the README file

annotation_bed_file: path to the gene annotation BED file

annotation_gtf_file: path to the gene annotation GTF file

assembly_accession: genome assembly accession

gaps: dict = None

contigs and the number of Ns contained

Type: contents of the gaps file

gaps_file: path to the chromosome gaps file

genome_dir: path to the genome directory

genome_file: path to the genome fasta

genomes_dir: path to the genomepy genomes directory

get_random_sequences(n=10, length=200, chroms=None, max_n=0.1, outtype='list')

Return random genomic sequences.

Parameters

n (int , optional) – Number of sequences to return.
length (int , optional) – Length of sequences to return.
chroms (list , optional) – Return sequences only from these chromosomes.
max_n (float , optional) – Maximum fraction of Ns.
outtype (string , optional) – return the output as list or string. Options: “list” or “string”, default: “list”.

Returns

coordinates as lists or strings: List with [chrom, start, end] genomic coordinates. String with “chrom:start-end” genomic coordinates (can be used as input for track2fasta).

Return type

list

get_seq(name, start, end, rc=False)

Return a sequence by record name and interval [start, end].

Coordinates are 1-based, closed interval. If rc is set, reverse complement will be returned.

get_spliced_seq(name, intervals, rc=False)

Return a sequence by record name and list of intervals

Interval list is an iterable of [start, end]. Coordinates are 1-based, end-exclusive. If rc is set, reverse complement will be returned.

index_file: path to the genome index

name: genome name

property plugin: dict of all active plugins and their properties

readme_file: path to the README file

sizes: dict = None

contigs and their lengths

Type: contents of the sizes file

sizes_file: path to the chromosome sizes file

tax_id: genome taxonomy identifier

track2fasta(track, fastafile=None, stranded=False, extend_up=0, extend_down=0)

Return a list of fasta sequences as Sequence objects as directed from the track(s).

Parameters

track (list/region file/bed file) – region(s) you wish to translate to fasta. Example input files can be found in genomepy/tests/data/regions.*
fastafile (bool , optional) – return Sequences as list or save to file? (default: list)
stranded (bool , optional) – return sequences for both strands? Required BED6 (or higher) as input (default: False)
extend_up (int , optional) – extend the sequences up? (command is strand sensitive, default: 0)
extend_down (int , optional) – extend the sequences down? (command is strand sensitive, default: 0)

You can obtain genomic sequences from a wide variety of inputs with as_seqdict. To use the function, it must be explicitly imported with from genomepy.seq import as_seqdict.

genomepy.seq.as_seqdict(to_convert, genome=None, minsize=None)

genomepy.seq.as_seqdict(to_convert: list, genome=None, minsize=None)

genomepy.seq.as_seqdict(to_convert: TextIOWrapper, genome=None, minsize=None)

genomepy.seq.as_seqdict(to_convert: str, genome=None, minsize=None)

genomepy.seq.as_seqdict(to_convert: Fasta, genome=None, minsize=None)

genomepy.seq.as_seqdict(to_convert: ndarray, genome=None, minsize=None)

Convert input to a dictionary with name as key and sequence as value.

If the input contains genomic coordinates, the genome needs to be specified. If minsize is specified all sequences will be checked if they are not shorter than minsize. If regions (or a region file) are used as the input, the genome can optionally be specified in the region using the following format: genome@chrom:start-end.

Current supported input types include: * FASTA, BED and region files. * List or numpy.ndarray of regions. * pyfaidx.Fasta object. * pybedtools.BedTool object.

Parameters

to_convert (list, str, pyfaidx.Fasta or pybedtools.BedTool) – Input to convert to FASTA-like dictionary
genome (str, optional) – Genomepy genome name.
minsize (int or None, optional) – If specified, check if all sequences have at least size minsize.

Returns

sequence names as key and sequences as value.

Return type

dict

A non-core function worth mentioning is genomepy.files.filter_fasta, for when you wish to filter a fasta file by chromosome name using regex, but want the output straight to (another) fasta file.

genomepy.files.filter_fasta(infa: str, outfa: str = None, regex: str = '.*', invert_match: Optional[bool] = False) → list

Filter fasta file based on regex.

Parameters

infa (str) – Filename of the input fasta file.
outfa (str, optional) – Filename of the output fasta file. If None, infa is overwritten.
regex (str, optional) – Regular expression used for selecting sequences. Matches everything if left blank.
invert_match (bool, optional) – Select all sequence not matching regex if set.

Returns

removed contigs

Return type

list

Using a gene annotation

Similarly, the genomepy.Annotation class helps you get the genes in check. This class returns a number of neat pandas dataframes, such as the named_gtf, or an annotation with the gene or chromosome names remapped to another type. Remapping gene names to another type is also possible with Annotation.map_genes. This feature also comes as separate function genomepy.query_mygene, as it’s just so darn useful.

class genomepy.Annotation(name: str, genomes_dir: str = None, quiet: bool = False)

Manipulate genes and whole gene annotations with pandas dataframes.

Parameters

name (str) – Genome name/directory/fasta or gene annotation BED/GTF file.
genomes_dir (str, optional) – Genomes installation directory.
quiet (bool, optional) – Silence init warnings

Returns

attributes & methods to manipulate gene annotations

Return type

object

annotation_bed_file: path to the gene annotation BED file

annotation_contigs: list = None: Contigs found in the gene annotation BED

annotation_gtf_file: path to the gene annotation GTF file

attributes(annot: Union[str, DataFrame] = 'gtf')

list all attributes present in the GTF attribute field.

Parameters: annot (str or pd.Dataframe, optional) – any GTF in dataframe format, or the default GTF.
Returns: with attributes
Return type: list

bed: DataFrame = None: Dataframe with BED format annotation

filter_regex(annot: Union[str, DataFrame], regex: Optional[str] = '.*', invert_match: Optional[bool] = False, column: Union[str, int] = 0) → DataFrame

Filter a dataframe by any column using regex.

Parameters

annot (str or pd.Dataframe) – annotation to filter: “bed”, “gtf” or a pandas dataframe
regex (str) – regex string to match
invert_match (bool, optional) – keep contigs NOT matching the regex string
column (str or int, optional) – column name or number to filter (default: 1st, contig name)

Returns

filtered dataframe

Return type

pd.DataFrame

from_attributes(field, annot: Union[str, DataFrame] = 'gtf', check=True)

Convert the specified GTF attribute field to a pandas series

Parameters

field (str) – field from the GTF’s attribute column.
annot (str or pd.Dataframe, optional) – any GTF in dataframe format, or the default GTF.
check (bool, optional) – filter the GTF for rows containing field?

Returns

with the same index as the input GTF and the field column

Return type

pd.Series

gene_coords(genes: Iterable[str], annot: str = 'bed') → DataFrame

Retrieve gene locations.

Parameters

genes (Iterable) – List of gene names as found in the given annotation file type
annot (str, optional) – Annotation file type: ‘bed’ or ‘gtf’ (default: “bed”)

Returns

gene annotation

Return type

pandas.DataFrame

genes(annot: str = 'gtf') → list

Retrieve gene names from an annotation.

For BED files, names are taken from the ‘name’ columns.

For GTF files, names are taken from the ‘gene_name’ field in the attribute column, if available.

Parameters: annot (str, optional) – Annotation file type: ‘bed’ or ‘gtf’ (default: “gtf”)
Returns: gene names
Return type: list

genome_contigs: list = None: Contigs found in the genome fasta

genome_dir: path to the genome directory

genome_file: path to the genome fasta

gtf: DataFrame = None: Dataframe with GTF format annotation

gtf_dict(key, value, string_values=True, annot: Union[str, DataFrame] = 'gtf')

Create a dictionary based on the columns or attribute fields in a GTF.

Parameters

key (str) – column name or attribute fields (e.g. “seqname”, “gene_name”)
value (str) – column name or attribute fields (e.g. “gene_id”, “transcript_name”)
string_values (bool, optional) – attempt to format the dict values as strings (only happens if all value lists are length 1)
annot (str or pd.Dataframe, optional) – annotation to filter: “gtf” or a pandas dataframe

Returns

with values as lists. If string_values is True and all lists are length 1, values will be strings.

Return type

dict

index_file: path to the genome index

lengths(attribute='gene_name')

Return a series with the selected GTF attribute as index, and its lengths as values.

Parameters: attribute (str) – attribute to provide lengths of. Options: gene_name, gene_id, transcript_name, transcript_id. Attribute must be present in the GTF file.
Returns: attribute indexed series named ‘length’
Return type: pd.Series

map_genes(field: str, product: str = 'protein', annot: Union[str, DataFrame] = 'bed') → DataFrame

Use mygene.info to map gene identifiers to any specified field.

Returns the dataframe with remapped “name” column. Drops missing identifiers.

Parameters

annot (str or pd.Dataframe) – Annotation dataframe to map (a pandas dataframe or “bed”). Is mapped to a column named “name” (required).
field (str, optional) – Identifier for gene annotation. Uses mygene.info to map ids. Valid fields are: ensembl.gene, entrezgene, symbol, name, refseq, entrezgene. Note that refseq will return the protein refseq_id by default, use product=”rna” to return the RNA refseq_id. Currently, mapping to Ensembl transcript ids is not supported.
product (str, optional) – Either “protein” or “rna”. Only used when field=”refseq”

Returns

remapped gene annotation

Return type

pandas.DataFrame

map_locations(annot: Union[str, DataFrame], to: str, drop=True) → Union[None, DataFrame]

Map chromosome mapping from one assembly to another.

Uses the NCBI assembly reports to find contigs. Drops missing contigs.

Parameters

annot (str or pd.Dataframe) – annotation to map: “bed”, “gtf” or a pandas dataframe.
to (str) – target provider (UCSC, Ensembl or NCBI)
drop (bool, optional) – if True, replace the chromosome column. If False, add a 2nd chromosome column.

Returns

chromosome mapping.

Return type

pandas.DataFrame

name: genome name

named_gtf: DataFrame = None: Dataframe with GTF format annotation, with gene_name as index

readme_file: path to the README file

sanitize(match=True, filter=True, overwrite=False)

Match the contigs names of the gene annotations to the genome’s.

First, match the contig names if possible. Second, remove contig names not found in the genome. Third, save the results and log this in the README.

Parameters

match (bool, optional) – match annotation contig names to the genome contig names (default is True)
filter (bool, optional) – remove annotation contig names not found in the genome contig names (default is True)
overwrite (bool, optional) – update the annotation files on disk, and log this in the README (default is False).

Returns

updated attributes

Return type

Annotation class

sizes_file: path to the chromosome sizes file

tax_id: genome taxonomy identifier

genomepy.query_mygene(query: Iterable[str], tax_id: Union[str, int], field: str = 'genomic_pos') → DataFrame

Use mygene.info to map gene identifiers to another type.

Parameters

query (iterable) – a list or list-like of gene identifiers
tax_id (str or int) – Target genome taxonomy id
field (str, optional) – Target identifier to map the query genes to. Valid fields are: ensembl.gene, entrezgene, symbol, name, refseq, entrezgene. Note that refseq will return the protein refseq_id by default, use refseq.translation.rna to return the RNA refseq_id. Currently, mapping to Ensembl transcript ids is not supported.

Returns

mapped gene annotation.

Return type

pandas.DataFrame

Another non-core function worth mentioning is genomepy.annotation.filter_regex, which allows you to filter a dataframe by any columns using regex.

genomepy.annotation.filter_regex(df: DataFrame, regex: str, invert_match: Optional[bool] = False, column: Union[str, int] = 0) → DataFrame

Filter a pandas dataframe by a column (default: 1st, contig name).

Parameters

df (pd.Dataframe) – annotation to filter (a pandas dataframe)
regex (str) – regex string to match
invert_match (bool, optional) – keep contigs NOT matching the regex string
column (str or int, optional) – column name or number to filter (default: 1st, contig name)

Returns

filtered dataframe

Return type

pd.DataFrame