genomepy.annotation.Annotation

class genomepy.annotation.Annotation(name: str, genomes_dir: str = None, quiet: bool = False)

Bases: object

Manipulate genes and whole gene annotations with pandas dataframes.

Parameters

name (str) – Genome name/directory/fasta or gene annotation BED/GTF file.
genomes_dir (str, optional) – Genomes installation directory.
quiet (bool, optional) – Silence warnings during initialization.

Returns

attributes & methods to manipulate gene annotations

Return type

object
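A minimal usage sketch, assuming genomepy is installed and a genome with a gene annotation has already been downloaded. The helper `describe` and the genome name in the comment are hypothetical. Only the constructor signature and the attribute names come from this page. A stand-in object is used so the snippet runs without any genome files.

```python
from types import SimpleNamespace


def describe(annotation):
    """Collect a few documented attributes into a plain dict.

    `annotation` is assumed to be a genomepy.annotation.Annotation, but any
    object exposing the same attribute names works. Missing ones become None.
    """
    fields = (
        "name",
        "genome_dir",
        "annotation_bed_file",
        "annotation_gtf_file",
        "tax_id",
    )
    return {field: getattr(annotation, field, None) for field in fields}


# With genomepy installed, the object would be created roughly like this
# ("GRCz11" is a placeholder genome name, not taken from this page):
#   from genomepy.annotation import Annotation
#   ann = Annotation("GRCz11", quiet=True)
#   print(describe(ann))

# Stand-in object so the sketch runs without genomepy.
stand_in = SimpleNamespace(name="demo", genome_dir="/tmp/demo", tax_id=7955)
summary = describe(stand_in)
print(summary)
```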

__init__(name: str, genomes_dir: str = None, quiet: bool = False)

Methods

`__init__`(name[, genomes_dir, quiet])
`attributes`([annot])	List all attributes present in the GTF attribute field.
`filter_regex`(annot[, regex, invert_match, ...])	Filter a dataframe by any column using regex.
`from_attributes`(field[, annot, check])	Convert the specified GTF attribute field to a pandas series.
`gene_coords`(genes[, annot])	Retrieve gene locations.
`genes`([annot])	Retrieve gene names from an annotation.
`gtf_dict`(key, value[, string_values, annot])	Create a dictionary based on the columns or attribute fields in a GTF.
`lengths`([attribute])	Return a series with the selected GTF attribute as index, and its lengths as values.
`map_genes`(field[, product, annot])	Use mygene.info to map gene identifiers to any specified field.
`map_locations`(annot, to[, drop])	Map chromosome names from one assembly to another.
`sanitize`([match, filter, overwrite])	Match the contig names of the gene annotations to the genome's.

Attributes

`annotation_contigs`	Contigs found in the gene annotation BED
`bed`	Dataframe with BED format annotation
`genome_contigs`	Contigs found in the genome fasta
`gtf`	Dataframe with GTF format annotation
`named_gtf`	Dataframe with GTF format annotation, with gene_name as index
`name`	genome name
`genome_dir`	path to the genome directory
`annotation_bed_file`	path to the gene annotation BED file
`annotation_gtf_file`	path to the gene annotation GTF file
`genome_file`	path to the genome fasta
`readme_file`	path to the README file
`index_file`	path to the genome index
`sizes_file`	path to the chromosome sizes file
`tax_id`	genome taxonomy identifier

annotation_bed_file: path to the gene annotation BED file

annotation_contigs: list = None: Contigs found in the gene annotation BED

annotation_gtf_file: path to the gene annotation GTF file

attributes(annot: Union[str, DataFrame] = 'gtf')

List all attributes present in the GTF attribute field.

Parameters: annot (str or pd.DataFrame, optional) – any GTF in dataframe format, or the default GTF.
Returns: attributes present in the GTF attribute field
Return type: list
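To make "attributes present in the GTF attribute field" concrete, here is a small standalone sketch. It is not genomepy's implementation. It assumes the usual GTF convention of `key "value";` pairs in the attribute column, and the function name `list_attributes` is hypothetical.

```python
import re

# Matches one `key "value"` pair inside a GTF attribute string.
PAIR = re.compile(r'(\S+)\s+"([^"]*)"')


def list_attributes(attribute_column):
    """Return the attribute keys found across all rows, in first-seen order."""
    seen = {}
    for row in attribute_column:
        for key, _value in PAIR.findall(row):
            seen.setdefault(key, None)  # dict keys preserve insertion order
    return list(seen)


rows = [
    'gene_id "g1"; gene_name "abc";',
    'gene_id "g1"; transcript_id "t1"; gene_name "abc";',
]
keys = list_attributes(rows)
print(keys)
```

The order of the list returned by the real method is not specified on this page. First-seen order is only a choice made for this sketch.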

bed: DataFrame = None: Dataframe with BED format annotation

filter_regex(annot: Union[str, DataFrame], regex: Optional[str] = '.*', invert_match: Optional[bool] = False, column: Union[str, int] = 0) → DataFrame

Filter a dataframe by any column using regex.

Parameters

annot (str or pd.DataFrame) – annotation to filter: “bed”, “gtf” or a pandas dataframe
regex (str) – regex string to match
invert_match (bool, optional) – keep contigs NOT matching the regex string
column (str or int, optional) – column name or number to filter (default: 1st, contig name)

Returns

filtered dataframe

Return type

pd.DataFrame
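The interaction between `regex`, `invert_match` and `column` can be sketched without pandas as follows. This is an illustration under assumptions, not the library's code: rows are plain dicts, the function name `filter_rows` is hypothetical, and `re.search` is used because this page does not say whether the pattern is anchored.

```python
import re


def filter_rows(rows, regex=".*", invert_match=False, column=0):
    """Keep rows whose `column` value matches `regex`, or the rest if inverted.

    `rows` is a list of dicts keyed by column name. An int `column` selects
    the column by position, mirroring the documented default (0, the contig).
    """
    pattern = re.compile(regex)
    kept = []
    for row in rows:
        key = list(row)[column] if isinstance(column, int) else column
        matched = pattern.search(str(row[key])) is not None
        if matched != invert_match:
            kept.append(row)
    return kept


bed = [
    {"chrom": "chr1", "start": 100, "end": 500, "name": "geneA"},
    {"chrom": "chrUn_KI270302v1", "start": 10, "end": 90, "name": "geneB"},
    {"chrom": "chr2", "start": 300, "end": 900, "name": "geneC"},
]

# Keep only numbered chromosomes, then the complement of that selection.
main_chroms = filter_rows(bed, regex=r"^chr\d+$")
scaffolds = filter_rows(bed, regex=r"^chr\d+$", invert_match=True)
print([row["name"] for row in main_chroms])
print([row["name"] for row in scaffolds])
```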

from_attributes(field, annot: Union[str, DataFrame] = 'gtf', check=True)

Convert the specified GTF attribute field to a pandas series.

Parameters

field (str) – field from the GTF’s attribute column.
annot (str or pd.DataFrame, optional) – any GTF in dataframe format, or the default GTF.
check (bool, optional) – if True, filter the GTF for rows containing field.

Returns

series with the same index as the input GTF, containing the values of the field

Return type

pd.Series

gene_coords(genes: Iterable[str], annot: str = 'bed') → DataFrame

Retrieve gene locations.

Parameters

genes (Iterable) – List of gene names as found in the given annotation file type
annot (str, optional) – Annotation file type: ‘bed’ or ‘gtf’ (default: “bed”)

Returns

gene annotation

Return type

pandas.DataFrame

genes(annot: str = 'gtf') → list

Retrieve gene names from an annotation.

For BED files, names are taken from the ‘name’ column.

For GTF files, names are taken from the ‘gene_name’ field in the attribute column, if available.

Parameters: annot (str, optional) – Annotation file type: ‘bed’ or ‘gtf’ (default: “gtf”)
Returns: gene names
Return type: list

genome_contigs: list = None: Contigs found in the genome fasta

genome_dir: path to the genome directory

genome_file: path to the genome fasta

gtf: DataFrame = None: Dataframe with GTF format annotation

gtf_dict(key, value, string_values=True, annot: Union[str, DataFrame] = 'gtf')

Create a dictionary based on the columns or attribute fields in a GTF.

Parameters

key (str) – column name or attribute fields (e.g. “seqname”, “gene_name”)
value (str) – column name or attribute fields (e.g. “gene_id”, “transcript_name”)
string_values (bool, optional) – attempt to format the dict values as strings (only happens if all value lists are length 1)
annot (str or pd.DataFrame, optional) – annotation to use: “gtf” or a pandas dataframe

Returns

dictionary with values as lists. If string_values is True and all lists are length 1, the values are strings instead.

Return type

dict
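The `string_values` rule described above can be sketched as follows. This is a standalone illustration rather than the library's code. The function name `pairs_to_dict` is hypothetical, and dropping duplicate values per key is an assumption made for this sketch.

```python
from collections import defaultdict


def pairs_to_dict(pairs, string_values=True):
    """Group (key, value) pairs into a dict of lists.

    If `string_values` is True and every list has exactly one element,
    the lists are unwrapped to plain values.
    """
    grouped = defaultdict(list)
    for key, value in pairs:
        if value not in grouped[key]:  # assumption: skip duplicate values
            grouped[key].append(value)
    if string_values and all(len(values) == 1 for values in grouped.values()):
        return {key: values[0] for key, values in grouped.items()}
    return dict(grouped)


# One value per key: lists collapse to strings.
one_to_one = pairs_to_dict([("g1", "abc"), ("g2", "xyz")])

# At least one key has several values: every value stays a list.
one_to_many = pairs_to_dict([("g1", "t1"), ("g1", "t2"), ("g2", "t3")])

print(one_to_one)
print(one_to_many)
```

Note that a single key with more than one value is enough to keep every value as a list, so callers get a consistent value type within one dictionary.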

index_file: path to the genome index

lengths(attribute='gene_name')

Return a series with the selected GTF attribute as index, and its lengths as values.

Parameters: attribute (str) – attribute to provide lengths of. Options: gene_name, gene_id, transcript_name, transcript_id. Attribute must be present in the GTF file.
Returns: attribute indexed series named ‘length’
Return type: pd.Series

map_genes(field: str, product: str = 'protein', annot: Union[str, DataFrame] = 'bed') → DataFrame

Use mygene.info to map gene identifiers to any specified field.

Returns the dataframe with remapped “name” column. Drops missing identifiers.

Parameters

annot (str or pd.DataFrame, optional) – Annotation to map: “bed” or a pandas dataframe. The dataframe must contain a column named “name”, which is the column that gets remapped.
field (str) – Identifier type to map the gene annotation to. Uses mygene.info to map IDs. Valid fields are: ensembl.gene, entrezgene, symbol, name, refseq. Note that refseq returns the protein refseq_id by default; use product="rna" to return the RNA refseq_id. Mapping to Ensembl transcript IDs is currently not supported.
product (str, optional) – Either “protein” or “rna”. Only used when field="refseq".

Returns

remapped gene annotation

Return type

pandas.DataFrame

map_locations(annot: Union[str, DataFrame], to: str, drop=True) → Union[None, DataFrame]

Map chromosome names from one assembly to another.

Uses the NCBI assembly reports to find contigs. Drops missing contigs.

Parameters

annot (str or pd.DataFrame) – annotation to map: “bed”, “gtf” or a pandas dataframe.
to (str) – target provider (UCSC, Ensembl or NCBI)
drop (bool, optional) – if True, replace the chromosome column. If False, add a 2nd chromosome column.

Returns

chromosome mapping.

Return type

pandas.DataFrame
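The effect of `drop` can be sketched with a toy mapping table. The real method reads NCBI assembly reports. Here the mapping is hard-coded, rows are plain dicts, and both the function name `remap_contigs` and the extra column name `chrom_mapped` are hypothetical.

```python
def remap_contigs(rows, mapping, drop=True):
    """Translate the contig name of each row using `mapping`.

    Rows whose contig is missing from the mapping are dropped. With
    drop=True the contig column is replaced; with drop=False the original
    is kept and the translated name goes into a second column.
    """
    remapped = []
    for row in rows:
        target = mapping.get(row["chrom"])
        if target is None:
            continue  # contig not found in the mapping: row is dropped
        new_row = dict(row)
        if drop:
            new_row["chrom"] = target
        else:
            new_row["chrom_mapped"] = target  # hypothetical column name
        remapped.append(new_row)
    return remapped


# Toy Ensembl-style to UCSC-style table, for illustration only.
ensembl_to_ucsc = {"1": "chr1", "MT": "chrM"}
annotation = [
    {"chrom": "1", "start": 100, "end": 500},
    {"chrom": "MT", "start": 5, "end": 70},
    {"chrom": "unplaced_42", "start": 1, "end": 9},
]

replaced = remap_contigs(annotation, ensembl_to_ucsc)
side_by_side = remap_contigs(annotation, ensembl_to_ucsc, drop=False)
print(replaced)
print(side_by_side)
```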

name: genome name

named_gtf: DataFrame = None: Dataframe with GTF format annotation, with gene_name as index

readme_file: path to the README file

sanitize(match=True, filter=True, overwrite=False)

Match the contig names of the gene annotations to the genome’s.

First, match the contig names if possible. Second, remove contig names not found in the genome. Third, save the results and log this in the README.

Parameters

match (bool, optional) – match annotation contig names to the genome contig names (default is True)
filter (bool, optional) – remove annotation contig names not found in the genome contig names (default is True)
overwrite (bool, optional) – update the annotation files on disk, and log this in the README (default is False).

Returns

updated attributes

Return type

Annotation class
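The first two steps (match, then filter) can be sketched on plain lists of contig names. The matching rule used here, adding or removing a `chr` prefix, is purely illustrative, since this page does not describe how the library matches names. The function names and the `drop_missing` parameter (standing in for `filter`) are hypothetical.

```python
def candidates(contig):
    """Alternative spellings of a contig name (illustrative rule only)."""
    if contig.startswith("chr"):
        return [contig, contig[3:]]
    return [contig, "chr" + contig]


def sanitize_contigs(annotation_contigs, genome_contigs,
                     match=True, drop_missing=True):
    """Rename annotation contigs to genome spellings, then drop unknown ones."""
    genome = set(genome_contigs)
    result = []
    for contig in annotation_contigs:
        name = contig
        if match:
            # Step 1: use the first spelling that exists in the genome.
            name = next((c for c in candidates(contig) if c in genome), contig)
        if drop_missing and name not in genome:
            continue  # Step 2: remove contigs not found in the genome.
        result.append(name)
    return result


genome_contigs = ["chr1", "chr2", "chrM"]
annotation_contigs = ["1", "2", "scaffold_7"]

cleaned = sanitize_contigs(annotation_contigs, genome_contigs)
unfiltered = sanitize_contigs(annotation_contigs, genome_contigs,
                              drop_missing=False)
print(cleaned)
print(unfiltered)
```

The third step, writing the result to disk and logging it in the README, is left out of the sketch. In the documented method it only happens when `overwrite` is True.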

sizes_file: path to the chromosome sizes file

tax_id: genome taxonomy identifier