Python API documentation (core)
This page described the core genomepy functionality.
These classes and functions can be found on the top level of the genomepy module (e.g. genomepy.search
),
and are made available when running from genomepy import *
(we won’t judge you).
Additional functions that do not fit the core functionality, but we feel are still pretty cool, are also described.
Finding genomic data
When looking to download a new genome/gene annotation, your first step would be genomepy.search
.
This function will check either one, or all, providers.
Advanced users may want to specify a provider for their search to speed up the process.
To see which providers are available, use genomepy.list_providers
or genomepy.list_online_providers
:
- genomepy.list_providers()
List of providers genomepy supports
- Returns
names of providers
- Return type
list
- genomepy.list_online_providers()
List of providers genomepy supports that are online right now.
- Returns
names of online providers
- Return type
list
- genomepy.search(term: str, provider: str = None, exact=False, size=False)
Search for term in genome names and descriptions (if term contains text. Case-insensitive), assembly accession IDs (if term starts with GCA_ or GCF_), or taxonomy IDs (if term is a number).
If provider is specified, search only that specific provider, else search all providers.
Note: exact accession ID search on UCSC may return different patch levels.
- Parameters
term (str, int) – Search term, case-insensitive, allows regex.
provider (str , optional) – Only search the specified provider (faster).
exact (bool, optional) – term must be an exact match
size (bool, optional) – Show absolute genome size.
- Yields
list – genome name, provider and metadata
If you have no idea what you are looking for, you could even check out all available genomes.
Be warned, genomepy.list_available_genomes
is like watching the Star Wars title crawl.
- genomepy.list_available_genomes(provider=None, size=False) list
List all available genomes.
- Parameters
provider (str, optional) – List genomes from specific provider. Genomes from all providers will be returned if not specified.
size (bool, optional) – Show absolute genome size.
- Yields
list – tuples with genome name, provider and metadata
If we search for homo sapiens for instance, we find that GRCh3.p13
and hg38
are the latest versions.
These names describe the same genome, but different assemblies
, with differences between them.
One of these differences is the quality of the gene annotation.
Next, we can inspect these with genomepy.head_annotations
:
- genomepy.head_annotations(name: str, provider=None, n: int = 2)
Quickly inspect the metadata of each available annotation for the specified genome.
For UCSC, up to 4 gene annotation styles are available: “ncbiRefSeq”, “refGene”, “ensGene”, “knownGene” (respectively).
For NCBI, the chromosome names are not yet sanitized.
- Parameters
name (str) – genome name
provider (str, optional) – only search the specified provider for the genome name
n (int, optional) – number of lines to show
Installing genomic data
Now that you have seen whats available, its time to download a genome.
The default parameter for genomepy.install_genome
are optimized for sequence alignment and gene counting,
but you have full control over them, so have a look!
genomepy won’t overwrite any files you already downloaded (unless specified),
but you can review your local genomes with genomepy.list_installed_genomes
.
- genomepy.install_genome(name: str, provider: Optional[str] = None, genomes_dir: Optional[str] = None, localname: Optional[str] = None, mask: Optional[str] = 'soft', keep_alt: Optional[bool] = False, regex: Optional[str] = None, invert_match: Optional[bool] = False, bgzip: Optional[bool] = None, annotation: Optional[bool] = False, only_annotation: Optional[bool] = False, skip_matching: Optional[bool] = False, skip_filter: Optional[bool] = False, threads: Optional[int] = 1, force: Optional[bool] = False, **kwargs: Optional[dict]) Genome
Install a genome (& gene annotation).
- Parameters
name (str) – Genome name
provider (str , optional) – Provider name. will try Gencode, Ensembl, UCSC and NCBI (in that order) if not specified.
genomes_dir (str , optional) – Where to create the output folder.
localname (str , optional) – Custom name for this genome.
mask (str , optional) – Genome masking of repetitive sequences. Options: hard/soft/none, default is soft.
keep_alt (bool , optional) – Some genomes contain alternative regions. These regions cause issues with sequence alignment, as they are inherently duplications of the consensus regions. Set to true to keep these alternative regions.
regex (str , optional) – Regular expression to select specific chromosome / scaffold names.
invert_match (bool , optional) – Set to True to select all chromosomes that don’t match the regex.
bgzip (bool , optional) – If set to True the genome FASTA file will be compressed using bgzip, and gene annotation will be compressed with gzip.
threads (int , optional) – Build genome index using multithreading (if supported). Default: lowest of 8/all threads.
force (bool , optional) – Set to True to overwrite existing files.
annotation (bool , optional) – If set to True, download gene annotation in BED and GTF format.
only_annotation (bool , optional) – If set to True, only download the gene annotation files.
skip_matching (bool , optional) – If set to True, contigs in the annotation not matching those in the genome will not be corrected.
skip_filter (bool , optional) – If set to True, the gene annotations will not be filtered to match the genome contigs.
kwargs (dict , optional) –
Provider specific options.
- toplevelbool , optional
Ensembl only: Always download the toplevel genome. Ignores potential primary assembly.
- versionint , optional
Ensembl only: Specify release version. Default is latest.
- to_annotationtext , optional
URL only: direct link to annotation file. Required if this is not the same directory as the fasta.
- path_to_annotationtext, optional
Local only: path to local annotation file. Required if this is not the same directory as the fasta.
- Returns
Genome class with the installed genome
- Return type
- genomepy.list_installed_genomes(genomes_dir: str = None) list
List all locally available genomes.
- Parameters
genomes_dir (str, optional) – Directory with genomes installed by genomepy.
- Returns
genome names
- Return type
list
If you want to download a sequence blacklist, or create an aligner index, you might wanna look at plugins!
Don’t worry, you can rerun the genome.install_genome
command, and genomepy will only run the new parts.
- genomepy.manage_plugins(command: str, plugin_names: list = None)
Manage genomepy plugins
- Parameters
command (str) –
command to perform. Options:
- list
show plugins and status
- enable
enable plugins
- disable
disable plugins
plugin_names (list) – plugin names for the enable/disable command
The genome and gene annotations were installed in the genomes directory (unless specified otherwise).
If you have a specific location in mind, you could set this as default in the genomepy config.
To find and inspect it, use genomepy.manage_config
:
- genomepy.manage_config(command)
Manage the genomepy configuration
- Parameters
command (str) –
command to perform. Options:
- file
return config filepath
- show
return config content
- generate
create new config file
Errors
Did something go wrong? Oh noes!
If the problem persists, clear the genomepy cache with genomepy.clean
, and try again.
- genomepy.clean()
Remove cached data on providers.
Using a genome
Alright, you’ve got the goods!
You can browse the genome’s sequences and metadata with the genomepy.Genome
class.
This class builds on the pyfaidx.Fasta
class to also provide you with several options
to get specific sequences from your genome, and save these to file.
- class genomepy.Genome(name, genomes_dir=None, *args, **kwargs)
pyfaidx Fasta object of a genome with additional attributes & methods.
Generates a genome index file, sizes file and gaps file of the genome.
- Parameters
name (str) – Genome name
genomes_dir (str, optional) – Genome installation directory
- Returns
An object that provides a pygr compatible interface.
- Return type
pyfaidx.Fasta
Methods
close
()get_random_sequences
([n, length, chroms, ...])Return random genomic sequences.
get_seq
(name, start, end[, rc])Return a sequence by record name and interval [start, end].
get_spliced_seq
(name, intervals[, rc])Return a sequence by record name and list of intervals
items
()keys
()track2fasta
(track[, fastafile, stranded, ...])Return a list of fasta sequences as Sequence objects as directed from the track(s).
values
()Attributes
gaps
contigs and the number of Ns contained
plugin
dict of all active plugins and their properties
sizes
contigs and their lengths
genomes_dir
path to the genomepy genomes directory
name
genome name
genome_file
path to the genome fasta
genome_dir
path to the genome directory
index_file
path to the genome index
sizes_file
path to the chromosome sizes file
gaps_file
path to the chromosome gaps file
annotation_gtf_file
path to the gene annotation GTF file
annotation_bed_file
path to the gene annotation BED file
readme_file
path to the README file
- annotation_bed_file
path to the gene annotation BED file
- annotation_gtf_file
path to the gene annotation GTF file
- assembly_accession
genome assembly accession
- gaps: dict = None
contigs and the number of Ns contained
- Type
contents of the gaps file
- gaps_file
path to the chromosome gaps file
- genome_dir
path to the genome directory
- genome_file
path to the genome fasta
- genomes_dir
path to the genomepy genomes directory
- get_random_sequences(n=10, length=200, chroms=None, max_n=0.1, outtype='list')
Return random genomic sequences.
- Parameters
n (int , optional) – Number of sequences to return.
length (int , optional) – Length of sequences to return.
chroms (list , optional) – Return sequences only from these chromosomes.
max_n (float , optional) – Maximum fraction of Ns.
outtype (string , optional) – return the output as list or string. Options: “list” or “string”, default: “list”.
- Returns
coordinates as lists or strings: List with [chrom, start, end] genomic coordinates. String with “chrom:start-end” genomic coordinates (can be used as input for track2fasta).
- Return type
list
- get_seq(name, start, end, rc=False)
Return a sequence by record name and interval [start, end].
Coordinates are 1-based, closed interval. If rc is set, reverse complement will be returned.
- get_spliced_seq(name, intervals, rc=False)
Return a sequence by record name and list of intervals
Interval list is an iterable of [start, end]. Coordinates are 1-based, end-exclusive. If rc is set, reverse complement will be returned.
- index_file
path to the genome index
- name
genome name
- property plugin
dict of all active plugins and their properties
- readme_file
path to the README file
- sizes: dict = None
contigs and their lengths
- Type
contents of the sizes file
- sizes_file
path to the chromosome sizes file
- tax_id
genome taxonomy identifier
- track2fasta(track, fastafile=None, stranded=False, extend_up=0, extend_down=0)
Return a list of fasta sequences as Sequence objects as directed from the track(s).
- Parameters
track (list/region file/bed file) – region(s) you wish to translate to fasta. Example input files can be found in genomepy/tests/data/regions.*
fastafile (bool , optional) – return Sequences as list or save to file? (default: list)
stranded (bool , optional) – return sequences for both strands? Required BED6 (or higher) as input (default: False)
extend_up (int , optional) – extend the sequences up? (command is strand sensitive, default: 0)
extend_down (int , optional) – extend the sequences down? (command is strand sensitive, default: 0)
You can obtain genomic sequences from a wide variety of inputs with as_seqdict
.
To use the function, it must be explicitly imported with from genomepy.seq import as_seqdict
.
- genomepy.seq.as_seqdict(to_convert, genome=None, minsize=None)
- genomepy.seq.as_seqdict(to_convert: list, genome=None, minsize=None)
- genomepy.seq.as_seqdict(to_convert: TextIOWrapper, genome=None, minsize=None)
- genomepy.seq.as_seqdict(to_convert: str, genome=None, minsize=None)
- genomepy.seq.as_seqdict(to_convert: Fasta, genome=None, minsize=None)
- genomepy.seq.as_seqdict(to_convert: ndarray, genome=None, minsize=None)
Convert input to a dictionary with name as key and sequence as value.
If the input contains genomic coordinates, the genome needs to be specified. If minsize is specified all sequences will be checked if they are not shorter than minsize. If regions (or a region file) are used as the input, the genome can optionally be specified in the region using the following format: genome@chrom:start-end.
Current supported input types include: * FASTA, BED and region files. * List or numpy.ndarray of regions. * pyfaidx.Fasta object. * pybedtools.BedTool object.
- Parameters
to_convert (list, str, pyfaidx.Fasta or pybedtools.BedTool) – Input to convert to FASTA-like dictionary
genome (str, optional) – Genomepy genome name.
minsize (int or None, optional) – If specified, check if all sequences have at least size minsize.
- Returns
sequence names as key and sequences as value.
- Return type
dict
A non-core function worth mentioning is genomepy.files.filter_fasta
,
for when you wish to filter a fasta file by chromosome name using regex,
but want the output straight to (another) fasta file.
- genomepy.files.filter_fasta(infa: str, outfa: str = None, regex: str = '.*', invert_match: Optional[bool] = False) list
Filter fasta file based on regex.
- Parameters
infa (str) – Filename of the input fasta file.
outfa (str, optional) – Filename of the output fasta file. If None, infa is overwritten.
regex (str, optional) – Regular expression used for selecting sequences. Matches everything if left blank.
invert_match (bool, optional) – Select all sequence not matching regex if set.
- Returns
removed contigs
- Return type
list
Using a gene annotation
Similarly, the genomepy.Annotation
class helps you get the genes in check.
This class returns a number of neat pandas dataframes, such as the named_gtf
,
or an annotation with the gene or chromosome names remapped to another type.
Remapping gene names to another type is also possible with Annotation.map_genes
.
This feature also comes as separate function genomepy.query_mygene
, as it’s just so darn useful.
- class genomepy.Annotation(name: str, genomes_dir: str = None, quiet: bool = False)
Manipulate genes and whole gene annotations with pandas dataframes.
- Parameters
name (str) – Genome name/directory/fasta or gene annotation BED/GTF file.
genomes_dir (str, optional) – Genomes installation directory.
quiet (bool, optional) – Silence init warnings
- Returns
attributes & methods to manipulate gene annotations
- Return type
object
- annotation_bed_file
path to the gene annotation BED file
- annotation_contigs: list = None
Contigs found in the gene annotation BED
- annotation_gtf_file
path to the gene annotation GTF file
- attributes(annot: Union[str, DataFrame] = 'gtf')
list all attributes present in the GTF attribute field.
- Parameters
annot (str or pd.Dataframe, optional) – any GTF in dataframe format, or the default GTF.
- Returns
with attributes
- Return type
list
- bed: DataFrame = None
Dataframe with BED format annotation
- filter_regex(annot: Union[str, DataFrame], regex: Optional[str] = '.*', invert_match: Optional[bool] = False, column: Union[str, int] = 0) DataFrame
Filter a dataframe by any column using regex.
- Parameters
annot (str or pd.Dataframe) – annotation to filter: “bed”, “gtf” or a pandas dataframe
regex (str) – regex string to match
invert_match (bool, optional) – keep contigs NOT matching the regex string
column (str or int, optional) – column name or number to filter (default: 1st, contig name)
- Returns
filtered dataframe
- Return type
pd.DataFrame
- from_attributes(field, annot: Union[str, DataFrame] = 'gtf', check=True)
Convert the specified GTF attribute field to a pandas series
- Parameters
field (str) – field from the GTF’s attribute column.
annot (str or pd.Dataframe, optional) – any GTF in dataframe format, or the default GTF.
check (bool, optional) – filter the GTF for rows containing field?
- Returns
with the same index as the input GTF and the field column
- Return type
pd.Series
- gene_coords(genes: Iterable[str], annot: str = 'bed') DataFrame
Retrieve gene locations.
- Parameters
genes (Iterable) – List of gene names as found in the given annotation file type
annot (str, optional) – Annotation file type: ‘bed’ or ‘gtf’ (default: “bed”)
- Returns
gene annotation
- Return type
pandas.DataFrame
- genes(annot: str = 'gtf') list
Retrieve gene names from an annotation.
For BED files, names are taken from the ‘name’ columns.
For GTF files, names are taken from the ‘gene_name’ field in the attribute column, if available.
- Parameters
annot (str, optional) – Annotation file type: ‘bed’ or ‘gtf’ (default: “gtf”)
- Returns
gene names
- Return type
list
- genome_contigs: list = None
Contigs found in the genome fasta
- genome_dir
path to the genome directory
- genome_file
path to the genome fasta
- gtf: DataFrame = None
Dataframe with GTF format annotation
- gtf_dict(key, value, string_values=True, annot: Union[str, DataFrame] = 'gtf')
Create a dictionary based on the columns or attribute fields in a GTF.
- Parameters
key (str) – column name or attribute fields (e.g. “seqname”, “gene_name”)
value (str) – column name or attribute fields (e.g. “gene_id”, “transcript_name”)
string_values (bool, optional) – attempt to format the dict values as strings (only happens if all value lists are length 1)
annot (str or pd.Dataframe, optional) – annotation to filter: “gtf” or a pandas dataframe
- Returns
with values as lists. If string_values is True and all lists are length 1, values will be strings.
- Return type
dict
- index_file
path to the genome index
- lengths(attribute='gene_name')
Return a series with the selected GTF attribute as index, and its lengths as values.
- Parameters
attribute (str) – attribute to provide lengths of. Options: gene_name, gene_id, transcript_name, transcript_id. Attribute must be present in the GTF file.
- Returns
attribute indexed series named ‘length’
- Return type
pd.Series
- map_genes(field: str, product: str = 'protein', annot: Union[str, DataFrame] = 'bed') DataFrame
Use mygene.info to map gene identifiers to any specified field.
Returns the dataframe with remapped “name” column. Drops missing identifiers.
- Parameters
annot (str or pd.Dataframe) – Annotation dataframe to map (a pandas dataframe or “bed”). Is mapped to a column named “name” (required).
field (str, optional) – Identifier for gene annotation. Uses mygene.info to map ids. Valid fields are: ensembl.gene, entrezgene, symbol, name, refseq, entrezgene. Note that refseq will return the protein refseq_id by default, use product=”rna” to return the RNA refseq_id. Currently, mapping to Ensembl transcript ids is not supported.
product (str, optional) – Either “protein” or “rna”. Only used when field=”refseq”
- Returns
remapped gene annotation
- Return type
pandas.DataFrame
- map_locations(annot: Union[str, DataFrame], to: str, drop=True) Union[None, DataFrame]
Map chromosome mapping from one assembly to another.
Uses the NCBI assembly reports to find contigs. Drops missing contigs.
- Parameters
annot (str or pd.Dataframe) – annotation to map: “bed”, “gtf” or a pandas dataframe.
to (str) – target provider (UCSC, Ensembl or NCBI)
drop (bool, optional) – if True, replace the chromosome column. If False, add a 2nd chromosome column.
- Returns
chromosome mapping.
- Return type
pandas.DataFrame
- name
genome name
- named_gtf: DataFrame = None
Dataframe with GTF format annotation, with gene_name as index
- readme_file
path to the README file
- sanitize(match=True, filter=True, overwrite=False)
Match the contigs names of the gene annotations to the genome’s.
First, match the contig names if possible. Second, remove contig names not found in the genome. Third, save the results and log this in the README.
- Parameters
match (bool, optional) – match annotation contig names to the genome contig names (default is True)
filter (bool, optional) – remove annotation contig names not found in the genome contig names (default is True)
overwrite (bool, optional) – update the annotation files on disk, and log this in the README (default is False).
- Returns
updated attributes
- Return type
Annotation class
- sizes_file
path to the chromosome sizes file
- tax_id
genome taxonomy identifier
- genomepy.query_mygene(query: Iterable[str], tax_id: Union[str, int], field: str = 'genomic_pos') DataFrame
Use mygene.info to map gene identifiers to another type.
- Parameters
query (iterable) – a list or list-like of gene identifiers
tax_id (str or int) – Target genome taxonomy id
field (str, optional) – Target identifier to map the query genes to. Valid fields are: ensembl.gene, entrezgene, symbol, name, refseq, entrezgene. Note that refseq will return the protein refseq_id by default, use refseq.translation.rna to return the RNA refseq_id. Currently, mapping to Ensembl transcript ids is not supported.
- Returns
mapped gene annotation.
- Return type
pandas.DataFrame
Another non-core function worth mentioning is genomepy.annotation.filter_regex
,
which allows you to filter a dataframe by any columns using regex.
- genomepy.annotation.filter_regex(df: DataFrame, regex: str, invert_match: Optional[bool] = False, column: Union[str, int] = 0) DataFrame
Filter a pandas dataframe by a column (default: 1st, contig name).
- Parameters
df (pd.Dataframe) – annotation to filter (a pandas dataframe)
regex (str) – regex string to match
invert_match (bool, optional) – keep contigs NOT matching the regex string
column (str or int, optional) – column name or number to filter (default: 1st, contig name)
- Returns
filtered dataframe
- Return type
pd.DataFrame