genomepy.annotation.Annotation
- class genomepy.annotation.Annotation(name: str, genomes_dir: str = None, quiet: bool = False)
Bases: object
Manipulate genes and whole gene annotations with pandas dataframes.
- Parameters
name (str) – Genome name/directory/fasta or gene annotation BED/GTF file.
genomes_dir (str, optional) – Genomes installation directory.
quiet (bool, optional) – Silence init warnings.
- Returns
attributes & methods to manipulate gene annotations
- Return type
object
- __init__(name: str, genomes_dir: str = None, quiet: bool = False)
Methods
__init__(name[, genomes_dir, quiet])
attributes([annot]) – List all attributes present in the GTF attribute field.
filter_regex(annot[, regex, invert_match, ...]) – Filter a dataframe by any column using regex.
from_attributes(field[, annot, check]) – Convert the specified GTF attribute field to a pandas series.
gene_coords(genes[, annot]) – Retrieve gene locations.
genes([annot]) – Retrieve gene names from an annotation.
gtf_dict(key, value[, string_values, annot]) – Create a dictionary based on the columns or attribute fields in a GTF.
lengths([attribute]) – Return a series with the selected GTF attribute as index, and its lengths as values.
map_genes(field[, product, annot]) – Use mygene.info to map gene identifiers to any specified field.
map_locations(annot, to[, drop]) – Map chromosome names from one assembly to another.
sanitize([match, filter, overwrite]) – Match the contig names of the gene annotations to the genome's.
Attributes
annotation_contigs – Contigs found in the gene annotation BED
bed – Dataframe with BED format annotation
genome_contigs – Contigs found in the genome fasta
gtf – Dataframe with GTF format annotation
named_gtf – Dataframe with GTF format annotation, with gene_name as index
name – genome name
genome_dir – path to the genome directory
annotation_bed_file – path to the gene annotation BED file
annotation_gtf_file – path to the gene annotation GTF file
genome_file – path to the genome fasta
readme_file – path to the README file
index_file – path to the genome index
sizes_file – path to the chromosome sizes file
tax_id – genome taxonomy identifier
- annotation_bed_file
path to the gene annotation BED file
- annotation_contigs: list = None
Contigs found in the gene annotation BED
- annotation_gtf_file
path to the gene annotation GTF file
- attributes(annot: Union[str, DataFrame] = 'gtf')
List all attributes present in the GTF attribute field.
- Parameters
annot (str or pd.DataFrame, optional) – any GTF in dataframe format, or the default GTF.
- Returns
attribute names found in the GTF attribute field
- Return type
list
- bed: DataFrame = None
Dataframe with BED format annotation
- filter_regex(annot: Union[str, DataFrame], regex: Optional[str] = '.*', invert_match: Optional[bool] = False, column: Union[str, int] = 0) -> DataFrame
Filter a dataframe by any column using regex.
- Parameters
annot (str or pd.DataFrame) – annotation to filter: “bed”, “gtf” or a pandas dataframe
regex (str, optional) – regex string to match (default: '.*', which matches everything)
invert_match (bool, optional) – keep only the rows NOT matching the regex string
column (str or int, optional) – column name or number to filter (default: 1st, contig name)
- Returns
filtered dataframe
- Return type
pd.DataFrame
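The effect of regex and invert_match can be illustrated without pandas. The following is a minimal plain-Python sketch, not genomepy's implementation: the helper filter_rows and the toy rows are hypothetical, and it uses re.search with an explicitly anchored pattern because this reference does not state whether genomepy anchors the regex.

```python
import re

def filter_rows(rows, regex=".*", invert_match=False, column=0):
    """Keep rows whose value in `column` matches `regex` (or does not, if inverted)."""
    pattern = re.compile(regex)
    kept = []
    for row in rows:
        matched = pattern.search(str(row[column])) is not None
        # Keep matching rows normally, non-matching rows when inverted.
        if matched != invert_match:
            kept.append(row)
    return kept

# Toy BED-like rows: (contig, start, end, name)
rows = [
    ("chr1", 100, 200, "geneA"),
    ("chr2", 300, 400, "geneB"),
    ("chrUn_random", 10, 50, "geneC"),
]

main_contigs = filter_rows(rows, regex=r"^chr\d+$")
other_contigs = filter_rows(rows, regex=r"^chr\d+$", invert_match=True)
```

With the real class, the equivalent call would pass the same regex and invert_match arguments to filter_regex, with annot set to “bed”, “gtf” or a dataframe.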
- from_attributes(field, annot: Union[str, DataFrame] = 'gtf', check=True)
Convert the specified GTF attribute field to a pandas series.
- Parameters
field (str) – field from the GTF’s attribute column.
annot (str or pd.DataFrame, optional) – any GTF in dataframe format, or the default GTF.
check (bool, optional) – if True, first filter the GTF for rows containing field.
- Returns
values of field, with the same index as the input GTF
- Return type
pd.Series
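Both attributes and from_attributes operate on the free-text attribute column of a GTF, where each entry is a series of key "value"; pairs. The sketch below shows one way such a column could be parsed in plain Python. The helpers parse_attributes and field_values are hypothetical, and the handling of check=False (missing fields become None) is an assumption rather than documented genomepy behaviour.

```python
import re

def parse_attributes(attribute):
    """Parse a GTF attribute string such as 'gene_id "G1"; gene_name "A";' into a dict."""
    return dict(re.findall(r'(\S+) "([^"]*)"', attribute))

def field_values(attribute_column, field, check=True):
    """Return {row_index: value} for `field`.

    With check=True, rows lacking the field are skipped.
    With check=False (assumed behaviour), they are kept with value None.
    """
    values = {}
    for index, attribute in enumerate(attribute_column):
        parsed = parse_attributes(attribute)
        if field in parsed:
            values[index] = parsed[field]
        elif not check:
            values[index] = None
    return values

# Toy attribute column of three GTF rows.
attribute_column = [
    'gene_id "G1"; gene_name "alpha";',
    'gene_id "G2";',
    'gene_id "G3"; gene_name "gamma"; transcript_id "T3";',
]

# All attribute keys present anywhere in the column (what attributes() lists).
all_keys = sorted({key for attr in attribute_column for key in parse_attributes(attr)})

names_checked = field_values(attribute_column, "gene_name")
names_unchecked = field_values(attribute_column, "gene_name", check=False)
```

Note that the original row positions are preserved as keys, mirroring the statement that the returned series shares its index with the input GTF.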
- gene_coords(genes: Iterable[str], annot: str = 'bed') -> DataFrame
Retrieve gene locations.
- Parameters
genes (Iterable) – List of gene names as found in the given annotation file type
annot (str, optional) – Annotation file type: ‘bed’ or ‘gtf’ (default: “bed”)
- Returns
gene annotation
- Return type
pandas.DataFrame
- genes(annot: str = 'gtf') -> list
Retrieve gene names from an annotation.
For BED files, names are taken from the ‘name’ column.
For GTF files, names are taken from the ‘gene_name’ field in the attribute column, if available.
- Parameters
annot (str, optional) – Annotation file type: ‘bed’ or ‘gtf’ (default: “gtf”)
- Returns
gene names
- Return type
list
- genome_contigs: list = None
Contigs found in the genome fasta
- genome_dir
path to the genome directory
- genome_file
path to the genome fasta
- gtf: DataFrame = None
Dataframe with GTF format annotation
- gtf_dict(key, value, string_values=True, annot: Union[str, DataFrame] = 'gtf')
Create a dictionary based on the columns or attribute fields in a GTF.
- Parameters
key (str) – column name or attribute field (e.g. “seqname”, “gene_name”)
value (str) – column name or attribute field (e.g. “gene_id”, “transcript_name”)
string_values (bool, optional) – attempt to format the dict values as strings (only happens if all value lists are length 1)
annot (str or pd.DataFrame, optional) – annotation to use: “gtf” or a pandas dataframe
- Returns
with values as lists. If string_values is True and all lists are length 1, values will be strings.
- Return type
dict
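The string_values rule is easiest to see on toy data. The following plain-Python sketch is an illustration only: build_dict is a hypothetical helper, and dropping duplicate values per key is an assumption that this reference does not confirm.

```python
from collections import defaultdict

def build_dict(pairs, string_values=True):
    """Group values by key; collapse to plain strings only if every list has length 1."""
    grouped = defaultdict(list)
    for key, value in pairs:
        # Assumption: repeated (key, value) pairs are stored once.
        if value not in grouped[key]:
            grouped[key].append(value)
    result = dict(grouped)
    if string_values and all(len(values) == 1 for values in result.values()):
        result = {key: values[0] for key, values in result.items()}
    return result

# One value per key: lists collapse to strings.
one_to_one = build_dict([("G1", "alpha"), ("G2", "beta")])

# At least one key has several values: every value stays a list.
one_to_many = build_dict([("G1", "T1"), ("G1", "T2"), ("G2", "T3")])
```

Because a single multi-valued key keeps all values as lists, callers that need a consistent value type may prefer string_values=False.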
- index_file
path to the genome index
- lengths(attribute='gene_name')
Return a series with the selected GTF attribute as index, and its lengths as values.
- Parameters
attribute (str) – attribute to provide lengths of. Options: gene_name, gene_id, transcript_name, transcript_id. Attribute must be present in the GTF file.
- Returns
attribute indexed series named ‘length’
- Return type
pd.Series
- map_genes(field: str, product: str = 'protein', annot: Union[str, DataFrame] = 'bed') -> DataFrame
Use mygene.info to map gene identifiers to any specified field.
Returns the dataframe with remapped “name” column. Drops missing identifiers.
- Parameters
annot (str or pd.DataFrame, optional) – Annotation dataframe to map (a pandas dataframe or “bed”). Must contain a column named “name”, which is the column that gets remapped.
field (str) – Identifier type to map the gene names to. Uses mygene.info to map ids. Valid fields are: ensembl.gene, entrezgene, symbol, name, refseq. Note that refseq returns the protein refseq_id by default; use product=“rna” to return the RNA refseq_id. Mapping to Ensembl transcript ids is currently not supported.
product (str, optional) – Either “protein” or “rna”. Only used when field=“refseq”
- Returns
remapped gene annotation
- Return type
pandas.DataFrame
- map_locations(annot: Union[str, DataFrame], to: str, drop=True) -> Union[None, DataFrame]
Map chromosome names from one assembly to another.
Uses the NCBI assembly reports to find contigs. Drops missing contigs.
- Parameters
annot (str or pd.DataFrame) – annotation to map: “bed”, “gtf” or a pandas dataframe.
to (str) – target provider (UCSC, Ensembl or NCBI)
drop (bool, optional) – if True, replace the chromosome column. If False, add a 2nd chromosome column.
- Returns
chromosome mapping.
- Return type
pandas.DataFrame
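The difference between drop=True and drop=False can be sketched on toy rows. This is a plain-Python illustration under stated assumptions, not genomepy's code: map_contigs and the hard-coded ensembl_to_ucsc table are hypothetical (the real method takes its mapping from the NCBI assembly reports), and the position of the second chromosome column is an arbitrary choice here.

```python
def map_contigs(rows, mapping, drop=True):
    """Rename the contig (first field) of each row using `mapping`."""
    mapped = []
    for row in rows:
        contig, rest = row[0], tuple(row[1:])
        if contig not in mapping:
            continue  # contigs without a known counterpart are dropped
        if drop:
            # Replace the original contig name.
            mapped.append((mapping[contig],) + rest)
        else:
            # Keep the original name next to the mapped one.
            mapped.append((mapping[contig], contig) + rest)
    return mapped

# Hypothetical mapping from Ensembl-style to UCSC-style names.
ensembl_to_ucsc = {"1": "chr1", "2": "chr2"}

rows = [
    ("1", 100, 200, "geneA"),
    ("2", 300, 400, "geneB"),
    ("scaffold_9", 5, 50, "geneC"),
]

replaced = map_contigs(rows, ensembl_to_ucsc)
side_by_side = map_contigs(rows, ensembl_to_ucsc, drop=False)
```

In both modes the unmapped scaffold disappears, which corresponds to the "Drops missing contigs" behaviour described above.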
- name
genome name
- named_gtf: DataFrame = None
Dataframe with GTF format annotation, with gene_name as index
- readme_file
path to the README file
- sanitize(match=True, filter=True, overwrite=False)
Match the contig names of the gene annotations to the genome’s.
First, match the contig names if possible. Second, remove contig names not found in the genome. Third, save the results and log this in the README.
- Parameters
match (bool, optional) – match annotation contig names to the genome contig names (default is True)
filter (bool, optional) – remove annotation contig names not found in the genome contig names (default is True)
overwrite (bool, optional) – update the annotation files on disk, and log this in the README (default is False).
- Returns
updated attributes
- Return type
Annotation class
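The first two sanitize steps (match, then filter) can be sketched in plain Python. This is an illustration under assumptions, not the library's implementation: sanitize_rows is hypothetical, the aliases table stands in for whatever genomepy uses to relate annotation and genome contig names, the parameter is called filter_missing to avoid shadowing Python's built-in filter, and the third step (writing files and logging to the README) is omitted.

```python
def sanitize_rows(rows, genome_contigs, aliases, match=True, filter_missing=True):
    """Align annotation contig names with the genome, then drop unknown contigs."""
    genome = set(genome_contigs)
    result = []
    for row in rows:
        contig = row[0]
        # Step 1: rename the contig if it is absent from the genome but has a known alias.
        if match and contig not in genome and aliases.get(contig) in genome:
            contig = aliases[contig]
        # Step 2: drop rows whose contig is still not found in the genome.
        if filter_missing and contig not in genome:
            continue
        result.append((contig,) + tuple(row[1:]))
    return result

genome_contigs = ["chr1", "chr2"]
aliases = {"1": "chr1", "2": "chr2"}  # hypothetical alias table

rows = [
    ("1", 100, 200, "geneA"),     # renamed to chr1
    ("chr2", 300, 400, "geneB"),  # already matches
    ("MT", 5, 50, "geneC"),       # not in the genome
]

sanitized = sanitize_rows(rows, genome_contigs, aliases)
unfiltered = sanitize_rows(rows, genome_contigs, aliases, filter_missing=False)
```

Running the match step before the filter step matters: filtering first would discard the row on contig "1" before it could be renamed.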
- sizes_file
path to the chromosome sizes file
- tax_id
genome taxonomy identifier