12. Visualization¶
In this Pandas tutorial you will use Pandas to explore ChIP-seq and DNase-seq footprinting datasets for 3 different celltypes from the Roadmap data:
E081: Fetal Brain Male
E082: Fetal Brain Female
E083: Fetal Heart
Before we start, import the appropriate modules
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
# Set directory for the data files
DATA_DIR = "/vol/cursus/CFB/visual"
Note that by doing this:
we import three modules:
pandas
, a module that allows us to work with DataFramesnumpy
, a module for scientific calculationspyplot
functionality from the modulematplotlib
seaborn
, a module for making pretty plots
we make sure that
pandas
commands can be called withpd.
numpy
commands can be called withnp.
pyplot
commands can be called withplt.
seaborn
commands can be called withsns.
%matplotlib inline
: we make sure that all plotting is going to be shown “inline” with our commands (in Jupyter)
12.1. H3K4me1 and H3K4me3 marking in promoters and enhancers in Fetal Brain¶
Next, we can import the data for E082 into pandas DataFrames, using the read_csv()
function. Note: in earlier versions of pandas you could use the read_table()
function, however, in newer versions of pandas this function has been deprecated and will be removed.
E082_prom_K4me1 = pd.read_csv(f"{DATA_DIR}/hg19_RefSeq_TSS500__E082-H3K4me1.counts.gz",
sep="\t", # The files are tab-separated
names=['chr', 'start', 'end', 'reads'])
E082_enh_K4me1 = pd.read_csv(f"{DATA_DIR}/regions_enh_E081_E082_E083__E082-H3K4me1.counts.gz",
sep="\t", # The files are tab-separated
names=['chr', 'start', 'end', 'reads'])
E082_prom_K4me3 = pd.read_csv(f"{DATA_DIR}/hg19_RefSeq_TSS500__E082-H3K4me3.counts.gz",
sep="\t", # The files are tab-separated
names=['chr', 'start', 'end', 'reads'])
E082_enh_K4me3 = pd.read_csv(f"{DATA_DIR}/regions_enh_E081_E082_E083__E082-H3K4me3.counts.gz",
sep="\t", # The files are tab-separated
names=['chr', 'start', 'end', 'reads'])
For each DataFrame, make columns with the corresponding annotation
E082_prom_K4me1['elem'] = 'prom'
E082_enh_K4me1['elem'] = 'enh'
E082_prom_K4me3['elem'] = 'prom'
E082_enh_K4me3['elem'] = 'enh'
E082_prom_K4me1['mark'] = 'H3K4me1'
E082_enh_K4me1['mark'] = 'H3K4me1'
E082_prom_K4me3['mark'] = 'H3K4me3'
E082_enh_K4me3['mark'] = 'H3K4me3'
Check the result by doing a head()
on some of the DataFrames ( df.head()
, where df
is the object name of the DataFrame )
Concatenate the DataFrames into one DataFrame (all DataFrame rows will be added to one big DataFrame)
E082 = pd.concat([E082_prom_K4me1, E082_enh_K4me1, E082_prom_K4me3, E082_enh_K4me3])
Make a boxplot of H3K4me1, H3K4me3 in promoters and enhancers
sns.boxplot(x="elem", y="reads", hue="mark", data=E082)
<AxesSubplot:xlabel='elem', ylabel='reads'>
Note how the synthax for boxplot
is assembled:
Use the column with name
elem
as a grouping of boxes along the x-axisUse the column with name
reads
as the value to plot along the y-axisUse the column with name
mark
to differentially color the boxes
Make sure you understand how this works.
There are lots of outliers (dots stacked on top of each box/whisker) which do not contribute much to the plot
It is probably better to not plot the outliers
sns.boxplot(x="elem", y="reads", hue="mark", fliersize=0, data=E082)
<AxesSubplot:xlabel='elem', ylabel='reads'>
This is better already, but the boxplots are all at the bottom of the graph, this does not help much for clarity.
If you do a data transformation, the plot may look better.
For readcounts (integers ranging from 0 up to …) a squareroot transformation can work well since it handles zeros well (log-transformation does not handle the zeros).
Repeat the boxplot on the transformed data.
E082['reads_sqrt'] = np.sqrt(E082['reads'])
sns.boxplot(x="elem", y="reads_sqrt", hue="mark", fliersize=0, data=E082)
<AxesSubplot:xlabel='elem', ylabel='reads_sqrt'>
Knowing that the regions in the DataFrame are not of the same length, we may need to correct the transformed readcounts for the length of the regions
E082['reads_sqrt_bp'] = E082['reads_sqrt']/(E082['end']-E082['start'])
sns.boxplot(x="elem", y="reads_sqrt_bp", hue="mark", fliersize=0, data=E082)
<AxesSubplot:xlabel='elem', ylabel='reads_sqrt_bp'>
This is definitely better, but the plot may even better if we adapt the y-axis scale
plt.ylim(0,0.05)
sns.boxplot(x="elem", y="reads_sqrt_bp", hue="mark", fliersize=0, data=E082)
<AxesSubplot:xlabel='elem', ylabel='reads_sqrt_bp'>
Great, this is a nice boxplot.
What can you conclude for H3K4me3 and H3K4me1 in promoters and enhancers?
12.2. H3K4me1 and H3K4me3 marking in promoters and enhancers: Fetal Brain-specific enhancers only¶
In the previous section we used all enhancers that were called in Fetal Brain (male and Female) and Fetal Heart, but maybe we should include only those enhancers that were really called as an enhancer in E082 (Fetal Brain Female)
This information is in the file regions_enh_E081_E082_E083.bed
, which you find in the /vol/cursus/CFB/visual
directory.
This file has the same order as all files above that are called regions_enh_E081_E082_E083__xxxxx
.
Read the file into a pandas DataFrame.
enh = pd.read_csv(f"{DATA_DIR}/regions_enh_E081_E082_E083.bed.gz",
sep="\t",
names=['chr', 'start', 'end', 'celltype'])
Do a head to see what the file looks like.
To select all the regions that are called as enhancer in E082, we will create a boolean that consists of True
/ False
indicating which line (=region, =enhancer) in the enh
DataFrame refers to an enhancer in E082.
cond = enh['celltype'].str.contains('E082')
Remember that the order of the enh
DataFrame was the same as for all enhancer DataFrames.
Now we take the individual enhancer DataFrames for E082 (not the concatenated one) and select from those only the E082-specific enhancer regions
E082_enh_K4me1_spec = E082_enh_K4me1[cond]
E082_enh_K4me3_spec = E082_enh_K4me3[cond]
Now that we have selected the E082-specific enhancers only, we can concatenate the DataFrames like we did above
E082_spec = pd.concat([E082_prom_K4me1, E082_enh_K4me1_spec, E082_prom_K4me3, E082_enh_K4me3_spec])
Like above, we do the sqrt transformation and length-correction
E082_spec['reads_sqrt'] = np.sqrt(E082_spec['reads'])
E082_spec['reads_sqrt_bp'] = E082_spec['reads_sqrt']/(E082_spec['end']-E082_spec['start'])
We can now make the boxplot as we did above
plt.ylim(0,0.05)
sns.boxplot(x="elem", y="reads_sqrt_bp", hue="mark", fliersize=0, data=E082_spec)
<AxesSubplot:xlabel='elem', ylabel='reads_sqrt_bp'>
Do you see a difference with the boxplot above?
Does this make sense (H3K4me is a typical enhancer mark) ?
12.3. Enhancer marking in different cell types¶
Now let us look at H3K4me1 data only, in enhancers only, but in different cell types
We will add data from a different cell type, E081 (Fetal Brain Male), E082 (Fetal Brain Female) and E083 (Fetal Heart)
First, read in the data into a pandas DataFrame again
E081_enh_K4me1 = pd.read_csv(f"{DATA_DIR}/regions_enh_E081_E082_E083__E081-H3K4me1.counts.gz",
sep="\t",
names=['chr', 'start', 'end', 'reads'])
E082_enh_K4me1 = pd.read_csv(f"{DATA_DIR}/regions_enh_E081_E082_E083__E082-H3K4me1.counts.gz",
sep="\t",
names=['chr', 'start', 'end', 'reads'])
E083_enh_K4me1 = pd.read_csv(f"{DATA_DIR}/regions_enh_E081_E082_E083__E083-H3K4me1.counts.gz",
sep="\t",
names=['chr', 'start', 'end', 'reads'])
In each DataFrame, annotate from which celltype the data comes in a column called celltype
.
E081_enh_K4me1['celltype'] = "E081"
E082_enh_K4me1['celltype'] = "E082"
E083_enh_K4me1['celltype'] = "E083"
In each DataFrame, annotate the enhancers that were called for each cell type.
We create a column called enh
which denotes whether a region was called as enhancer in E081, E082, or E083.
This data we take from the DataFrame enh
, from which we know that the order is the same as in our DataFrames.
E081_enh_K4me1['enh'] = enh['celltype']
E082_enh_K4me1['enh'] = enh['celltype']
E083_enh_K4me1['enh'] = enh['celltype']
Like above, we make a concatenated DataFrame m
using concat
.
m = pd.concat([E081_enh_K4me1, E082_enh_K4me1, E083_enh_K4me1])
Like above, we transform the readcounts using sqrt
, and correct for region length.
m['reads_sqrt'] = np.sqrt(m['reads'])
m['reads_sqrt_bp'] = m['reads_sqrt']/(m['end']-m['start'])
Now we can make the boxplot.
Note that we add an extra argument called hue
which colors the boxplot according the information in the column called enh
.
sns.boxplot(x="celltype", y="reads_sqrt_bp", hue="enh", fliersize=0, data=m)
# Move legend outside the plot
plt.legend(bbox_to_anchor=(1, 1), loc=2)
# Adapt the y-axis
plt.ylim(0, 0.08)
(0.0, 0.08)
Carefully look at the boxplot
Does this make sense?
12.4. Promoter marking vs. gene expression¶
You are going to plot the gene expression (RNA-seq, so RPKM) as a function of the active promoter mark H3K4me3 at the corresponding genes, for E082.
For this you need the file hg19_RefSeq_TSS500_ENSEMBL_RPKM
from the /vol/cursus/CFB/visual
directory.
This file has the same ordering as all files named hg19_RefSeq_TSS500__xxxxx
.
First, read in the expression data into a pandas DataFrame.
rpkm = pd.read_csv(f"{DATA_DIR}/hg19_RefSeq_TSS500_ENSEMBL_RPKM.gz", sep="\t")
Check with a head()
what this file looks like. Note that we did not specify column names, since these are already present in the file. `read_csv by default assumes that this first line is a header line
Note that head()
does not show you all columns, so you can use the following to see all column names
rpkm.columns
Index(['chr', 'start', 'end', 'refseq', 'gene', 'strand', 'ensembl', 'E000',
'E003', 'E004', 'E005', 'E006', 'E007', 'E011', 'E012', 'E013', 'E016',
'E024', 'E027', 'E028', 'E037', 'E038', 'E047', 'E050', 'E053', 'E054',
'E055', 'E056', 'E057', 'E058', 'E059', 'E061', 'E062', 'E065', 'E066',
'E070', 'E071', 'E079', 'E082', 'E084', 'E085', 'E087', 'E094', 'E095',
'E096', 'E097', 'E098', 'E100', 'E104', 'E105', 'E106', 'E109', 'E112',
'E113', 'E114', 'E116', 'E117', 'E118', 'E119', 'E120', 'E122', 'E123',
'E127', 'E128'],
dtype='object')
The DataFrame rpkm
has many columns, you need to paste the column called E082
to a DataFrame called E082_prom_K4me3
E082_prom_K4me3 = pd.read_csv(f"{DATA_DIR}/hg19_RefSeq_TSS500__E082-H3K4me3.counts.gz",
sep="\t",
names=['chr', 'start', 'end', 'reads'])
Like above, do the sqrt transformation on the H3K4me3 readcounts.
E082_prom_K4me3['reads_sqrt'] = np.sqrt(E082_prom_K4me3['reads'])
In this case there is no need to normalize for region length since all promoters here have the same length. However if you do normalize, it does not matter
Since the order of the regions (=lines) in the rpkm
DataFrame is the same as in the E082_prom_K4me3
DataFrame, we can paste the RPKM column E082
from the rpkm
DataFrame into the E082_prom_K4me3
DataFrame.
E082_prom_K4me3['rpkm'] = rpkm['E082']
Now we can make a scatterplot, in which each dot represent a promoter and the RPKM of the corresponding gene.
We use the function lmplot()
from seaborn.
sns.lmplot(x='reads_sqrt', y='rpkm', data=E082_prom_K4me3, fit_reg=False)
<seaborn.axisgrid.FacetGrid at 0x7f42ba9f0310>
Note that lmplot()
by default plots a regression line, which you can turn off by fit_reg=False
.
Looking at the plot, we can say that this does not look fantastic….
Most dots are at the bottom of the graph, and we cannot see whether there is any correlation.
For gene expression it is usually good to plot it in log-scale.
In this case we choose the log2, but a log10 would be fine as well.
E082_prom_K4me3['rpkm_log'] = np.log2(E082_prom_K4me3['rpkm'])
sns.lmplot(x='reads_sqrt', y='rpkm_log', data=E082_prom_K4me3, fit_reg=False)
/opt/hostedtoolcache/Python/3.7.12/x64/lib/python3.7/site-packages/pandas/core/arraylike.py:364: RuntimeWarning: divide by zero encountered in log2
result = getattr(ufunc, method)(*inputs, **kwargs)
<seaborn.axisgrid.FacetGrid at 0x7f42ba759d90>
This looks much better.
Although the relationship is not clear-cut, there is definitely a trend of higher expression for genes which promoters have higher H3K4me3.
We can think about a control for this analysis.
To check whether the observed trend has some biological meaning, we can plot the same gene expression from E082 as a function of H3K4me3 in E083 (the ‘wrong’ celltype).
The expectation is that the trend you observed above is less clear here.
Read the promoter H3K4me3 data into a pandas DataFrame as we did above
E083_prom_K4me3 = pd.read_csv(f"{DATA_DIR}/hg19_RefSeq_TSS500__E083-H3K4me3.counts.gz",
sep="\t",
names=['chr', 'start', 'end', 'reads'])
Do the sqrt transformation of the H3K4me3 readcounts, as above.
Paste the RPKMs into the H3K4me3 DataFrame of E083, as above.
Do the RPKM log2 transformation, as above.
E083_prom_K4me3['reads_sqrt'] = np.sqrt(E083_prom_K4me3['reads'])
E083_prom_K4me3['rpkm'] = rpkm['E082']
E083_prom_K4me3['rpkm_log'] = np.log2(E083_prom_K4me3['rpkm'])
/opt/hostedtoolcache/Python/3.7.12/x64/lib/python3.7/site-packages/pandas/core/arraylike.py:364: RuntimeWarning: divide by zero encountered in log2
result = getattr(ufunc, method)(*inputs, **kwargs)
And then the scatterplot.
sns.lmplot(x='reads_sqrt', y='rpkm_log', data=E083_prom_K4me3, fit_reg=False)
<seaborn.axisgrid.FacetGrid at 0x7f42ba4f0f50>
As an alternative to the scatterplot approach above, we can classify our H3K4me3 levels into ‘high’ or ‘low’. But, what is ‘high’ and what is ‘low’? Where should we put the threshold for ‘high’?
To get an idea, we can plot the distribution of the H3K4me3 signal using a histogram/densityplot. For this we use the distplot
function from seaborn.
sns.distplot(E082_prom_K4me3['reads_sqrt'])
/opt/hostedtoolcache/Python/3.7.12/x64/lib/python3.7/site-packages/seaborn/distributions.py:2619: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms).
warnings.warn(msg, FutureWarning)
<AxesSubplot:xlabel='reads_sqrt', ylabel='Density'>
By looking at this histogram, we could conclude that the regions with ‘high’ H3K4me3 are the ones that roughly have signals higher than 15 or so.
To assign ‘high’ and ‘low’ to H3K4me3 levels we add a column to the E082 DataFrame called reads_sqrt_bin
.
For this we use the where
function from numpy, in which we state: if the values in column x are higher than 15, return “high H3K4me3”, else, return “low H3K4me3”.
E082_prom_K4me3['reads_sqrt_bin'] = np.where(E082_prom_K4me3['reads_sqrt'] > 15, "high K4me3", "low K4me3")
Do a head()
on the DataFrame to see what happened.
Now we can use a boxplot instead of a scatterplot.
In this boxplot the RPKM is plotted in boxes, as a function of “high K4me3” or “low K4me3”.
sns.boxplot(x="reads_sqrt_bin", y="rpkm_log", fliersize=0, data=E082_prom_K4me3)
<AxesSubplot:xlabel='reads_sqrt_bin', ylabel='rpkm_log'>
In this plot it is already much more clear that gener with high H3K4me3 at the promoter have higher expression.
Rather than assigning two “bins” (“high” and “low”) by ourselves, we can specify multiple bins.
For this we use the cut
method of pandas, for which we can specify the number of bins we want.
E082_prom_K4me3['reads_sqrt_bin'] = pd.cut(E082_prom_K4me3['reads_sqrt'], 3, labels=False)
Note that we used labels=False
, which causes the bins to be numbered.
We can now make a similar boxplot as above, with the RPKM plotted in boxes, as a function of the different bins we created.
sns.boxplot(x="reads_sqrt_bin", y="rpkm_log", fliersize=0, data=E082_prom_K4me3)
<AxesSubplot:xlabel='reads_sqrt_bin', ylabel='rpkm_log'>
12.5. Exercises¶
Exercise 12.1 Visualization using pandas and seaborn
In this exercise you are going to compare the different ChIP-seq and DNase-seq datasets from
E081: Fetal Brain Male
E082: Fetal Brain Female
E083: Fetal Heart
The final goal is to see how these marks correlate, and how these marks classify the different cell types. You will visualize this in a heatmap that shows pairwise correlations, and clusters the datasets that are most similar.
To get to this, the exercise is divided into different questions that will lead you through the exercise. By the way, for help you are allowed (=encouraged) to use Google as well!
Exercise outline
The strategy is outlined here, a detailed description follows after this:
Make a dictionary with pandas DataFrames, so that object names can be used to access the DataFrames
Q1. Create a definition to read files into pandas DataFrames
Q2. Create lists with filenames and their corresponding object names
Q3. Create the dictionary with DataFrames
Q4. Make one merged DataFrame containing the readcounts as columns, one for each datasets.
Q5. Make a DataFrame with all pairwise correlations
Q6. Plot the heatmap of pairwise correlations
Question 1. Function for reading a file into a pandas DataFrame
Write a definition (function) that:
uses a filename as input (e.g.
dirname_xxxxx/filename_xxxxx
)returns the corresponding pandas DataFrame
this should work for files with four columns (e.g. named ‘chr’, ‘start’, ‘end’, ‘reads’)
Question 2. Create lists with filenames and object names
The directory /vol/cursus/CFB/visual
contains:
readcounts for promoters (
hg19_RefSeq_TSS500__xxxxx.counts.gz
)readcounts for enhancers (
regions_enh_E081_E082_E083__xxxxx.counts.gz
)
Create 4 lists:
filenames (e.g.
dirname_xxxxx/filename_xxxxx
) for all promoter files withDNase
H3K4me3
H3K27me3
filenames (e.g.
dirname_xxxxx/filename_xxxxx
) for all enhancer files withDNase
H3K4me1
H3K27me3
object names for all promoter files of list 1. (e.g.
"E082-H3K27me3"
)object names for all enhancer files of list 2. (e.g.
"E082-H3K4me1"
)
Note: these filenames should contain the full path to the file
Note: it is OK to create the lists with filenames manually, but the lists with object names can be derived from the former, using python.
Question 3. Make a dictionary with DataFrames
Make a dictionary with DataFrames with
object names as keys (e.g.
"E081-H3K4me1"
, use the list with object names created above)the corresponding pandas DataFrame as values (use the list with filenames created above)
Make one such dictionary for promoters, and one dictionary for enhancers
Question 4. Make a merged DataFrame with readcounts
Make a new pandas DataFrame in which the columns contain the readcounts from each DataFrame in your dictionary
one DataFrame for the promoters
one DataFrame for the enhancers
To do this, you can use a loop over your dictionaries and create a new DataFrame column by column.
Question 5. Make a DataFrame with pairwise correlations
Make a pandas DataFrame containing the pairwise correlations between the different datasets
one DataFrame for the promoters
one DataFrame for the enhancers
Hint: you can do this by using the corr()
method for a pandas DataFrame
Question 6. Plot a heatmap displaying the pairwise correlations
Plot a heatmap that displays the correlation coefficients obtained in the DataFrame above.
Make one heatmap for the promoters
Make one heatmap for the enhancers
Hint: you can do this by using the clustermap()
function for a pandas DataFrame, which will not only draw the heatmap, but also cluster the heatmap
Which of the two, promoters or enhancers, are best for classifying the different celltypes?