genomepy.annotation.Annotation

class genomepy.annotation.Annotation(name: str, genomes_dir: str = None, quiet: bool = False)

Bases: object

Manipulate genes and whole gene annotations with pandas dataframes.

Parameters
  • name (str) – Genome name/directory/fasta or gene annotation BED/GTF file.

  • genomes_dir (str, optional) – Genomes installation directory.

  • quiet (bool, optional) – Silence warnings during initialization.

Returns

attributes & methods to manipulate gene annotations

Return type

object

__init__(name: str, genomes_dir: str = None, quiet: bool = False)

Methods

__init__(name[, genomes_dir, quiet])

attributes([annot])

List all attributes present in the GTF attribute field.

filter_regex(annot[, regex, invert_match, ...])

Filter a dataframe by any column using regex.

from_attributes(field[, annot, check])

Convert the specified GTF attribute field to a pandas Series.

gene_coords(genes[, annot])

Retrieve gene locations.

genes([annot])

Retrieve gene names from an annotation.

gtf_dict(key, value[, string_values, annot])

Create a dictionary based on the columns or attribute fields in a GTF.

lengths([attribute])

Return a series with the selected GTF attribute as index, and its lengths as values.

map_genes(field[, product, annot])

Use mygene.info to map gene identifiers to any specified field.

map_locations(annot, to[, drop])

Map chromosome names from one assembly to another.

sanitize([match, filter, overwrite])

Match the contig names of the gene annotations to the genome's.

Attributes

annotation_contigs

Contigs found in the gene annotation BED

bed

Dataframe with BED format annotation

genome_contigs

Contigs found in the genome fasta

gtf

Dataframe with GTF format annotation

named_gtf

Dataframe with GTF format annotation, with gene_name as index

name

genome name

genome_dir

path to the genome directory

annotation_bed_file

path to the gene annotation BED file

annotation_gtf_file

path to the gene annotation GTF file

genome_file

path to the genome fasta

readme_file

path to the README file

index_file

path to the genome index

sizes_file

path to the chromosome sizes file

tax_id

genome taxonomy identifier

annotation_bed_file

path to the gene annotation BED file

annotation_contigs: list = None

Contigs found in the gene annotation BED

annotation_gtf_file

path to the gene annotation GTF file

attributes(annot: Union[str, DataFrame] = 'gtf')

List all attributes present in the GTF attribute field.

Parameters

annot (str or pd.DataFrame, optional) – any GTF in dataframe format, or the default GTF.

Returns

list of the attributes found in the GTF attribute field

Return type

list
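As a rough illustration of what "listing attributes" means, the sketch below collects the distinct keys from a few GTF attribute strings. It is not genomepy's implementation: the sample strings, the function name list_attribute_keys and the regex are assumptions made for this example, and it uses plain Python lists rather than a dataframe.

```python
import re

# Hypothetical GTF attribute strings (the 9th column of a GTF file).
ATTRIBUTE_COLUMN = [
    'gene_id "G1"; gene_name "abc";',
    'gene_id "G1"; transcript_id "T1"; gene_name "abc";',
]


def list_attribute_keys(attribute_column):
    """Collect the distinct attribute keys, in first-seen order."""
    keys = []
    for entry in attribute_column:
        # Each attribute is assumed to look like: key "value";
        for key in re.findall(r'(\w+) "[^"]*";', entry):
            if key not in keys:
                keys.append(key)
    return keys


keys = list_attribute_keys(ATTRIBUTE_COLUMN)
print(keys)
```

With an Annotation instance, the equivalent call would presumably be attributes() on the default GTF.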

bed: DataFrame = None

Dataframe with BED format annotation

filter_regex(annot: Union[str, DataFrame], regex: Optional[str] = '.*', invert_match: Optional[bool] = False, column: Union[str, int] = 0) DataFrame

Filter a dataframe by any column using regex.

Parameters
  • annot (str or pd.DataFrame) – annotation to filter: “bed”, “gtf” or a pandas dataframe

  • regex (str) – regex string to match

  • invert_match (bool, optional) – keep contigs NOT matching the regex string

  • column (str or int, optional) – column name or number to filter (default: 1st, contig name)

Returns

filtered dataframe

Return type

pd.DataFrame
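The interplay of regex, invert_match and column can be sketched on plain tuples. This is a minimal stand-in, not the library's code: the rows, the helper name filter_rows and the use of re.search (rather than a full match) are assumptions.

```python
import re

# Hypothetical BED-like rows: (chrom, start, end, name).
ROWS = [
    ("chr1", 100, 200, "geneA"),
    ("chr2", 300, 400, "geneB"),
    ("chrUn_random", 10, 50, "geneC"),
]


def filter_rows(rows, regex=".*", invert_match=False, column=0):
    """Keep rows whose value in `column` matches `regex`.

    With invert_match=True, keep the rows that do NOT match instead.
    """
    pattern = re.compile(regex)
    kept = []
    for row in rows:
        matched = pattern.search(str(row[column])) is not None
        if matched != invert_match:
            kept.append(row)
    return kept


# Keep numbered chromosomes only (column 0 is the contig name).
canonical = filter_rows(ROWS, regex=r"^chr\d+$")
# Keep everything else by inverting the same pattern.
unplaced = filter_rows(ROWS, regex=r"^chr\d+$", invert_match=True)
print(canonical)
print(unplaced)
```

Filtering on another column only requires changing column, for example column=3 to match on the gene name in this sketch.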

from_attributes(field, annot: Union[str, DataFrame] = 'gtf', check=True)

Convert the specified GTF attribute field to a pandas Series.

Parameters
  • field (str) – field from the GTF’s attribute column.

  • annot (str or pd.DataFrame, optional) – any GTF in dataframe format, or the default GTF.

  • check (bool, optional) – if True, filter the GTF for rows containing field.

Returns

series with the same index as the input GTF, holding the values of the field

Return type

pd.Series
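A minimal sketch of pulling one field out of the attribute column, keyed by row index so the result lines up with the input. The data, the helper name extract_field and the behaviour for check=False (a None placeholder for rows lacking the field) are assumptions; the library may handle such rows differently.

```python
import re

# Hypothetical attribute strings, keyed by row index.
ATTRIBUTES = {
    0: 'gene_id "G1"; gene_name "abc";',
    1: 'gene_id "G2";',
    2: 'gene_id "G3"; gene_name "xyz";',
}


def extract_field(attributes, field, check=True):
    """Return {row index: field value} for the given attribute field.

    check=True  -> rows without the field are left out.
    check=False -> rows without the field are kept with a None value
                   (an assumption made for this sketch).
    """
    pattern = re.compile(rf'\b{re.escape(field)} "([^"]*)"')
    result = {}
    for index, entry in attributes.items():
        match = pattern.search(entry)
        if match:
            result[index] = match.group(1)
        elif not check:
            result[index] = None
    return result


checked = extract_field(ATTRIBUTES, "gene_name")
unchecked = extract_field(ATTRIBUTES, "gene_name", check=False)
print(checked)
print(unchecked)
```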

gene_coords(genes: Iterable[str], annot: str = 'bed') DataFrame

Retrieve gene locations.

Parameters
  • genes (Iterable) – List of gene names as found in the given annotation file type

  • annot (str, optional) – Annotation file type: ‘bed’ or ‘gtf’ (default: “bed”)

Returns

gene annotation

Return type

pandas.DataFrame

genes(annot: str = 'gtf') list

Retrieve gene names from an annotation.

For BED files, names are taken from the ‘name’ column.

For GTF files, names are taken from the ‘gene_name’ field in the attribute column, if available.

Parameters

annot (str, optional) – Annotation file type: ‘bed’ or ‘gtf’ (default: “gtf”)

Returns

gene names

Return type

list

genome_contigs: list = None

Contigs found in the genome fasta

genome_dir

path to the genome directory

genome_file

path to the genome fasta

gtf: DataFrame = None

Dataframe with GTF format annotation

gtf_dict(key, value, string_values=True, annot: Union[str, DataFrame] = 'gtf')

Create a dictionary based on the columns or attribute fields in a GTF.

Parameters
  • key (str) – column name or attribute fields (e.g. “seqname”, “gene_name”)

  • value (str) – column name or attribute fields (e.g. “gene_id”, “transcript_name”)

  • string_values (bool, optional) – attempt to format the dict values as strings (only happens if all value lists are length 1)

  • annot (str or pd.DataFrame, optional) – annotation to use: “gtf” or a pandas dataframe

Returns

dictionary with values as lists. If string_values is True and all lists are length 1, values will be strings.

Return type

dict
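The string_values rule can be illustrated with key/value pairs taken from two hypothetical columns. This is a sketch under assumptions: the helper name build_dict is invented here, and deduplicating repeated values is a choice made for the example, not confirmed library behaviour.

```python
def build_dict(pairs, string_values=True):
    """Group values per key; collapse to strings only if every list has length 1."""
    mapping = {}
    for key, value in pairs:
        values = mapping.setdefault(key, [])
        if value not in values:  # assumption: repeated values are stored once
            values.append(value)
    if string_values and all(len(v) == 1 for v in mapping.values()):
        return {k: v[0] for k, v in mapping.items()}
    return mapping


# gene_name -> gene_id: one value per key, so values collapse to strings.
one_to_one = build_dict([("abc", "G1"), ("xyz", "G2"), ("abc", "G1")])
# gene_name -> transcript_id: one key has two values, so all values stay lists.
one_to_many = build_dict([("abc", "T1"), ("abc", "T2"), ("xyz", "T3")])
print(one_to_one)
print(one_to_many)
```

Note that a single key with several values keeps every value as a list, including keys that have only one.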

index_file

path to the genome index

lengths(attribute='gene_name')

Return a series with the selected GTF attribute as index, and its lengths as values.

Parameters

attribute (str) – attribute to provide lengths of. Options: gene_name, gene_id, transcript_name, transcript_id. Attribute must be present in the GTF file.

Returns

attribute indexed series named ‘length’

Return type

pd.Series

map_genes(field: str, product: str = 'protein', annot: Union[str, DataFrame] = 'bed') DataFrame

Use mygene.info to map gene identifiers to any specified field.

Returns the dataframe with remapped “name” column. Drops missing identifiers.

Parameters
  • annot (str or pd.DataFrame, optional) – Annotation dataframe to map (a pandas dataframe or “bed”). Must contain a column named “name”, which is the column that gets remapped.

  • field (str) – Identifier for gene annotation. Uses mygene.info to map ids. Valid fields are: ensembl.gene, entrezgene, symbol, name, refseq. Note that refseq will return the protein refseq_id by default; use product="rna" to return the RNA refseq_id. Currently, mapping to Ensembl transcript ids is not supported.

  • product (str, optional) – Either “protein” or “rna”. Only used when field="refseq"

Returns

remapped gene annotation

Return type

pandas.DataFrame

map_locations(annot: Union[str, DataFrame], to: str, drop=True) Union[None, DataFrame]

Map chromosome names from one assembly to another.

Uses the NCBI assembly reports to find contigs. Drops missing contigs.

Parameters
  • annot (str or pd.DataFrame) – annotation to map: “bed”, “gtf” or a pandas dataframe.

  • to (str) – target provider (UCSC, Ensembl or NCBI)

  • drop (bool, optional) – if True, replace the chromosome column. If False, add a 2nd chromosome column.

Returns

chromosome mapping.

Return type

pandas.DataFrame
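The effect of drop can be sketched with a hand-written contig mapping. Everything here is illustrative: the mapping table, the rows, the helper name map_contigs and the name of the added column are assumptions, whereas the library derives its mapping from the NCBI assembly reports.

```python
# Hypothetical contig name mapping (Ensembl-style -> UCSC-style).
MAPPING = {"1": "chr1", "2": "chr2", "MT": "chrM"}

ROWS = [
    {"chrom": "1", "start": 100, "end": 200},
    {"chrom": "MT", "start": 5, "end": 50},
    {"chrom": "KI270728.1", "start": 1, "end": 9},  # absent from the mapping
]


def map_contigs(rows, mapping, to, drop=True):
    """Map contig names; rows whose contig has no mapping are dropped.

    drop=True  -> replace the chromosome column.
    drop=False -> keep it and add a second column named after `to`
                  (the column name is an assumption of this sketch).
    """
    mapped = []
    for row in rows:
        target = mapping.get(row["chrom"])
        if target is None:
            continue  # missing contig: skip the row
        new_row = dict(row)
        if drop:
            new_row["chrom"] = target
        else:
            new_row[to] = target
        mapped.append(new_row)
    return mapped


replaced = map_contigs(ROWS, MAPPING, to="ucsc")
both = map_contigs(ROWS, MAPPING, to="ucsc", drop=False)
print(replaced)
print(both)
```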

name

genome name

named_gtf: DataFrame = None

Dataframe with GTF format annotation, with gene_name as index

readme_file

path to the README file

sanitize(match=True, filter=True, overwrite=False)

Match the contig names of the gene annotations to the genome’s.

First, match the contig names if possible. Second, remove contig names not found in the genome. Third, save the results and log this in the README.

Parameters
  • match (bool, optional) – match annotation contig names to the genome contig names (default is True)

  • filter (bool, optional) – remove annotation contig names not found in the genome contig names (default is True)

  • overwrite (bool, optional) – update the annotation files on disk, and log this in the README (default is False).

Returns

updated attributes

Return type

Annotation class
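The match and filter steps can be sketched on two lists of contig names. This is only a stand-in: toggling a "chr" prefix is a simplified heuristic chosen for the example and is not how the library matches contigs, and the helper name sanitize_contigs and its filter_missing parameter (standing in for filter) are hypothetical.

```python
def sanitize_contigs(annotation_contigs, genome_contigs, match=True, filter_missing=True):
    """Align annotation contig names with the genome's, then drop unknown ones."""
    genome = set(genome_contigs)
    result = []
    for contig in annotation_contigs:
        name = contig
        if match and name not in genome:
            # Stand-in heuristic: try adding or removing a "chr" prefix.
            candidate = name[3:] if name.startswith("chr") else "chr" + name
            if candidate in genome:
                name = candidate
        if filter_missing and name not in genome:
            continue  # contig not present in the genome
        result.append(name)
    return result


GENOME = ["chr1", "chr2"]
ANNOTATION = ["1", "2", "MT"]

matched_and_filtered = sanitize_contigs(ANNOTATION, GENOME)
matched_only = sanitize_contigs(ANNOTATION, GENOME, filter_missing=False)
filtered_only = sanitize_contigs(ANNOTATION, GENOME, match=False)
print(matched_and_filtered)
print(matched_only)
print(filtered_only)
```

The last call shows why the order matters: filtering without matching first removes every contig in this example, because none of the original names occur in the genome.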

sizes_file

path to the chromosome sizes file

tax_id

genome taxonomy identifier