11. Introduction to pandas DataFrames¶
In this tutorial you are going to use the module pandas
.
We will focus on DataFrames, a convenient object type used in pandas.
Note
This tutorial has no exercises, but serves as an introduction to pandas. It will illustrate what you can do with pandas DataFrames.
If you want more information on pandas, a (quick) tutorial is found here:
https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html
11.1. Importing the pandas module¶
We start by importing the pandas
module, for which we use the following command:
import pandas as pd
Note that we imported the pandas
module and renamed it to pd
. This allows us to use all pandas
functionality using pd
(instead of pandas
). This is very common: it saves typing since pd
is shorter than pandas
.
11.2. Creating a pandas DataFrame from lists¶
Here we will create a pandas DataFrame containing the following columns:
chrom = ['chr6', 'chr11', 'chr3', 'chr2', 'chr17', 'chr17']
start = [143889976, 77774406, 14988512, 203102822, 7590368, 26903451]
end = [143890976, 77775406, 14989512, 203103822, 7591368, 26904451]
refseq =['NR_027113', 'NM_003251', 'NR_046253', 'NM_003352', 'NM_001126112', 'NM_005165']
In this case we create an empty DataFrame first, and we fill it up with the individual columns.
df_peaks = pd.DataFrame()
df_peaks['chrom'] = chrom
df_peaks['start'] = start
df_peaks['end'] = end
df_peaks['nearest_gene'] = refseq
You can check what your DataFrame looks like using print()
, or simply type df_peaks
. The latter only works in a Jupyter notebook, but the advantage is that you get a nicely formatted table!
print(df_peaks)
chrom start end nearest_gene
0 chr6 143889976 143890976 NR_027113
1 chr11 77774406 77775406 NM_003251
2 chr3 14988512 14989512 NR_046253
3 chr2 203102822 203103822 NM_003352
4 chr17 7590368 7591368 NM_001126112
5 chr17 26903451 26904451 NM_005165
df_peaks
chrom | start | end | nearest_gene | |
---|---|---|---|---|
0 | chr6 | 143889976 | 143890976 | NR_027113 |
1 | chr11 | 77774406 | 77775406 | NM_003251 |
2 | chr3 | 14988512 | 14989512 | NR_046253 |
3 | chr2 | 203102822 | 203103822 | NM_003352 |
4 | chr17 | 7590368 | 7591368 | NM_001126112 |
5 | chr17 | 26903451 | 26904451 | NM_005165 |
We make a second DataFrame, which corresponds to a second experiment (‘exp2’)
chr_exp2 = ["chr6_dbb_hap3", "chr7", "chr9", "chrX", "chrX", "chrX"]
start_exp2 = [4324541, 98029927, 90497271, 71496641, 73506544, 152162171]
end_exp2 = [4325541, 98030927, 90498271, 71497641, 73507544, 152163171]
refseq_exp2 = ["NM_002121", "NM_018842", "NM_178828", "NM_001007", "NR_030258", "NM_001184924"]
df_peaks_exp2 = pd.DataFrame()
df_peaks_exp2['chrom'] = chr_exp2
df_peaks_exp2['start'] = start_exp2
df_peaks_exp2['end'] = end_exp2
df_peaks_exp2['nearest_gene'] = refseq_exp2
Look at both DataFrames. The head()
function shows the first five rows of a DataFrame.
They have the same column layout (chrom
, start
, end
, refseq
).
df_peaks.head()
chrom | start | end | nearest_gene | |
---|---|---|---|---|
0 | chr6 | 143889976 | 143890976 | NR_027113 |
1 | chr11 | 77774406 | 77775406 | NM_003251 |
2 | chr3 | 14988512 | 14989512 | NR_046253 |
3 | chr2 | 203102822 | 203103822 | NM_003352 |
4 | chr17 | 7590368 | 7591368 | NM_001126112 |
df_peaks_exp2.head()
chrom | start | end | nearest_gene | |
---|---|---|---|---|
0 | chr6_dbb_hap3 | 4324541 | 4325541 | NM_002121 |
1 | chr7 | 98029927 | 98030927 | NM_018842 |
2 | chr9 | 90497271 | 90498271 | NM_178828 |
3 | chrX | 71496641 | 71497641 | NM_001007 |
4 | chrX | 73506544 | 73507544 | NR_030258 |
They both represent peaks of a ChIP-seq experiment.
For both DataFrames we can add an additional column indicating from which experiment the data comes (‘exp1’ or ‘exp2’)
df_peaks['exp'] = 'exp1'
df_peaks_exp2['exp'] = 'exp2'
Take a look at the two DataFrames, and note how we did this:
We used
[ ]
to define a new colum, with the column name between[ ]
We used only one string (
'exp1'
or'exp2'
), but this was automatically expanded to the whole column
11.3. Creating a pandas DataFrame using a dictionary
¶
We can also create a pandas DataFrame from a dictionary. The dictionary should contain column names as keys and lists as values for the columns. To show this in action, we will use the three following lists:
refseq = ["NM_001007", "NM_001098638", "NM_001110221", "NM_001126112", "NM_001142599",
"NM_001164283", "NM_001164386", "NM_001184924", "NM_001290043","NM_002121",
"NM_002441", "NM_003251", "NM_003352", "NM_004263", "NM_005165", "NM_006017",
"NM_007145", "NM_016522", "NM_018842", "NM_024097", "NM_147181", "NM_173856",
"NM_178828"]
ensembl = ["ENSG00000198034", "ENSG00000166439", "ENSG00000121742", "ENSG00000141510",
"ENSG00000072682", "ENSG00000056972", "ENSG00000158987", "ENSG00000198883",
"ENSG00000225967", "ENSG00000237710", "ENSG00000235569", "ENSG00000151365",
"ENSG00000116030", "ENSG00000135622", "ENSG00000109107", "ENSG00000007062",
"ENSG00000167635", "ENSG00000182667", "ENSG00000006453", "ENSG00000164008",
"ENSG00000185774", "ENSG00000196131", "ENSG00000177992"]
genename = ["RPS4X", "RNF169", "GJB6", "TP53", "P4HA2", "TRAF3IP2", "RAPGEF6", "PNMA5",
"TAP2", "HLA-DPB1", "MSH5", "THRSP", "SUMO1", "SEMA4F", "ALDOC", "PROM1",
"ZNF146", "NTM", "BAIAP2L1", "C1orf50", "KCNIP4", "VN1R2", "SPATA31E1"]
First, we use these lists to create a dictionary
dict_genes = {
'refseq': refseq,
'ensembl': ensembl,
'genename': genename
}
Now that we have the dictionary, we can create the DataFrame.
df_genes = pd.DataFrame(dict_genes)
Look at the resulting DataFrame called df_genes
. This DataFrame will help us to map RefSeq to ENSEMBL gene identifiers.
The first three rows.
df_genes.head(3)
refseq | ensembl | genename | |
---|---|---|---|
0 | NM_001007 | ENSG00000198034 | RPS4X |
1 | NM_001098638 | ENSG00000166439 | RNF169 |
2 | NM_001110221 | ENSG00000121742 | GJB6 |
The last three rows.
df_genes.tail(3)
refseq | ensembl | genename | |
---|---|---|---|
20 | NM_147181 | ENSG00000185774 | KCNIP4 |
21 | NM_173856 | ENSG00000196131 | VN1R2 |
22 | NM_178828 | ENSG00000177992 | SPATA31E1 |
Or three random rows.
df_genes.sample(3)
refseq | ensembl | genename | |
---|---|---|---|
12 | NM_003352 | ENSG00000116030 | SUMO1 |
5 | NM_001164283 | ENSG00000056972 | TRAF3IP2 |
10 | NM_002441 | ENSG00000235569 | MSH5 |
Next, we create another DataFrame with gene expression values
Here, we will use the dictionary method to create a DataFrame
without explicitly creating the dictionary.
ensembl_1 = ["ENSG00000237710", "ENSG00000182667", "ENSG00000121742", "ENSG00000158987",
"ENSG00000196131", "ENSG00000151365", "ENSG00000056972", "ENSG00000235569",
"ENSG00000006453", "ENSG00000198883", "ENSG00000225967", "ENSG00000166439",
"ENSG00000135622", "ENSG00000007062", "ENSG00000109107", "ENSG00000167635",
"ENSG00000164008", "ENSG00000141510", "ENSG00000177992"]
sample_1 = [None, 1.686, 0.063, 4.222, 0.021, 0.026, 8.169, None, 16.159, 0.08, None,
3.972, 0.844, 26.208, 18.218, 91.049, 3.828, 58.697, 0.024]
sample_2 = [None, 1.159, 0.011, 4.291, 0.004, 0.013, 0.306, None, 12.671, 0, None,
6.983, 5.352, 14.336, 5.333, 60.28, 5.137, 47.569, 0]
sample_3 = [None, 6.865, 0, 4.727, 0.039, 0.023, 10.459, None, 0.225, 0.011, None,
7.007, 3.402, 6.076, 18.445, 28.716, 4.128, 34.299, 0]
sample_4 = [None, 15.691, 2.16, 5.284, 0, 0.181, 1.428, None, 0.184, 0.395, None,
11.842, 4.186, 6.01, 131.013, 25.126, 9.933, 28.18, 0]
# The next code could all be written on one line. However, to improve readability
# we write in this manner.
df_expr = pd.DataFrame(
{
'ensembl': ensembl_1,
'sample1': sample_1,
'sample2': sample_2,
'sample3': sample_3,
'sample4': sample_4
}
)
Look at the DataFrames using print()
or, in Jupyter, by just typing the DataFrame name followed by <Enter>
.
In the case where you would have very large DataFrames, you can always use df.head()
to show the first rows, or df.tail()
to show the last rows, where df
is your DataFrame name.
11.4. Merging pandas DataFrames¶
11.4.1. Merging DataFrames on column values¶
We now have 3 types of data in our 3 DataFrames:
ChIP-seq peaks (
df_peaks
)A RefSeq to ENSEMBL gene identifier mapping (
df_genes
)Expression values of genes in 4 samples, with ENSEMBL identifiers (
df_expr
)
This is a typical example for any (bioinformatical) analysis in which you want to incorporate different data types, often from different sources. You need to merge them if you want to do anything useful. Pandas DataFrames can be of big help here.
So let us try to merge these pieces of data.
11.4.2. Merging DataFrames: concat
¶
Using the command concat()
we can concatenate multiple DataFrames.
Let us do this with the 2 ChIP-seq peaks DataFrames:
df_all_peaks = pd.concat([df_peaks, df_peaks_exp2])
df_all_peaks
chrom | start | end | nearest_gene | exp | |
---|---|---|---|---|---|
0 | chr6 | 143889976 | 143890976 | NR_027113 | exp1 |
1 | chr11 | 77774406 | 77775406 | NM_003251 | exp1 |
2 | chr3 | 14988512 | 14989512 | NR_046253 | exp1 |
3 | chr2 | 203102822 | 203103822 | NM_003352 | exp1 |
4 | chr17 | 7590368 | 7591368 | NM_001126112 | exp1 |
5 | chr17 | 26903451 | 26904451 | NM_005165 | exp1 |
0 | chr6_dbb_hap3 | 4324541 | 4325541 | NM_002121 | exp2 |
1 | chr7 | 98029927 | 98030927 | NM_018842 | exp2 |
2 | chr9 | 90497271 | 90498271 | NM_178828 | exp2 |
3 | chrX | 71496641 | 71497641 | NM_001007 | exp2 |
4 | chrX | 73506544 | 73507544 | NR_030258 | exp2 |
5 | chrX | 152162171 | 152163171 | NM_001184924 | exp2 |
Check out the result, and note how the exp
column still allows us to distinguish the two experiments.
You can use concat()
to combine two DataFrames vertically (axis=0
) or horizontally (axis=1
).
pd.concat([df_peaks, df_peaks_exp2], axis=0)
chrom | start | end | nearest_gene | exp | |
---|---|---|---|---|---|
0 | chr6 | 143889976 | 143890976 | NR_027113 | exp1 |
1 | chr11 | 77774406 | 77775406 | NM_003251 | exp1 |
2 | chr3 | 14988512 | 14989512 | NR_046253 | exp1 |
3 | chr2 | 203102822 | 203103822 | NM_003352 | exp1 |
4 | chr17 | 7590368 | 7591368 | NM_001126112 | exp1 |
5 | chr17 | 26903451 | 26904451 | NM_005165 | exp1 |
0 | chr6_dbb_hap3 | 4324541 | 4325541 | NM_002121 | exp2 |
1 | chr7 | 98029927 | 98030927 | NM_018842 | exp2 |
2 | chr9 | 90497271 | 90498271 | NM_178828 | exp2 |
3 | chrX | 71496641 | 71497641 | NM_001007 | exp2 |
4 | chrX | 73506544 | 73507544 | NR_030258 | exp2 |
5 | chrX | 152162171 | 152163171 | NM_001184924 | exp2 |
pd.concat([df_peaks, df_peaks_exp2], axis=1)
chrom | start | end | nearest_gene | exp | chrom | start | end | nearest_gene | exp | |
---|---|---|---|---|---|---|---|---|---|---|
0 | chr6 | 143889976 | 143890976 | NR_027113 | exp1 | chr6_dbb_hap3 | 4324541 | 4325541 | NM_002121 | exp2 |
1 | chr11 | 77774406 | 77775406 | NM_003251 | exp1 | chr7 | 98029927 | 98030927 | NM_018842 | exp2 |
2 | chr3 | 14988512 | 14989512 | NR_046253 | exp1 | chr9 | 90497271 | 90498271 | NM_178828 | exp2 |
3 | chr2 | 203102822 | 203103822 | NM_003352 | exp1 | chrX | 71496641 | 71497641 | NM_001007 | exp2 |
4 | chr17 | 7590368 | 7591368 | NM_001126112 | exp1 | chrX | 73506544 | 73507544 | NR_030258 | exp2 |
5 | chr17 | 26903451 | 26904451 | NM_005165 | exp1 | chrX | 152162171 | 152163171 | NM_001184924 | exp2 |
The documentation contains more examples.
11.4.3. Merging DataFrames: merge
¶
In the above case, we want to simply combine two DataFrames by concatenating, i.e. simply adding the rows of two DataFrames together. You can also join two DataFrames in other ways. The df_peaks
DataFrame has a nearest_gene
column. This is a RefSeq gene identifier that is also present in the df_genes
DataFrame. If we would like to annotate the genes in df_peaks
, we would only need the relevant columns for the genes that are in df_peaks
.
df_genes_peaks = df_peaks.merge(df_genes, left_on="nearest_gene", right_on="refseq")
df_genes_peaks
chrom | start | end | nearest_gene | exp | refseq | ensembl | genename | |
---|---|---|---|---|---|---|---|---|
0 | chr11 | 77774406 | 77775406 | NM_003251 | exp1 | NM_003251 | ENSG00000151365 | THRSP |
1 | chr2 | 203102822 | 203103822 | NM_003352 | exp1 | NM_003352 | ENSG00000116030 | SUMO1 |
2 | chr17 | 7590368 | 7591368 | NM_001126112 | exp1 | NM_001126112 | ENSG00000141510 | TP53 |
3 | chr17 | 26903451 | 26904451 | NM_005165 | exp1 | NM_005165 | ENSG00000109107 | ALDOC |
Check the resulting DataFrame. For every row in df_peaks
, all the columns from df_genes
have been added if the refseq
identifier matches the identifier specified in nearest_gene
.
Our DataFrame now has the ENSEMBL gene identifiers, so we can also merge the df_expr
.
df_genes_peaks_expr = df_genes_peaks.merge(df_expr, on='ensembl')
Note that here the name of the columns that will be used for merging is the same in df_genes_peaks
and df_exp
, so we can use the on='ensembl'
argument.
For convenience, we will use a shorter name for our merged DataFrame
df_all = df_genes_peaks_expr
11.5. Tab completion and help¶
Now that we have merged our data, we can do several neat things.
Because df_all
is a pandas DataFrame, it automatically inherits DataFrame methods.
You can get an idea if you type:
df_all.
And then hit the left <Tab>
key.
Notice that you get a lot of options….
Suppose you want to know how the sort_values()
method for a DataFrame works. You can type:
df_all.sort_values?
This HELP menu gives you information about the function. type q
or <ESC>
to leave the HELP.
11.6. Sorting a DataFrame¶
Let us try it out:
df_all.sort_values(by='chrom')
chrom | start | end | nearest_gene | exp | refseq | ensembl | genename | sample1 | sample2 | sample3 | sample4 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | chr11 | 77774406 | 77775406 | NM_003251 | exp1 | NM_003251 | ENSG00000151365 | THRSP | 0.026 | 0.013 | 0.023 | 0.181 |
1 | chr17 | 7590368 | 7591368 | NM_001126112 | exp1 | NM_001126112 | ENSG00000141510 | TP53 | 58.697 | 47.569 | 34.299 | 28.180 |
2 | chr17 | 26903451 | 26904451 | NM_005165 | exp1 | NM_005165 | ENSG00000109107 | ALDOC | 18.218 | 5.333 | 18.445 | 131.013 |
or
df_all.sort_values(by='start')
chrom | start | end | nearest_gene | exp | refseq | ensembl | genename | sample1 | sample2 | sample3 | sample4 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | chr17 | 7590368 | 7591368 | NM_001126112 | exp1 | NM_001126112 | ENSG00000141510 | TP53 | 58.697 | 47.569 | 34.299 | 28.180 |
2 | chr17 | 26903451 | 26904451 | NM_005165 | exp1 | NM_005165 | ENSG00000109107 | ALDOC | 18.218 | 5.333 | 18.445 | 131.013 |
0 | chr11 | 77774406 | 77775406 | NM_003251 | exp1 | NM_003251 | ENSG00000151365 | THRSP | 0.026 | 0.013 | 0.023 | 0.181 |
Note that these commands did not replace the original DataFrame df_all
!
11.7. Using [ ]
to obtain DataFrame columns or rows¶
Using [ ]
we can obtain specific parts of the DataFrame.
df_all['genename']
0 THRSP
1 TP53
2 ALDOC
Name: genename, dtype: object
## or, for multiple columns:
df_all[['genename', 'exp']]
genename | exp | |
---|---|---|
0 | THRSP | exp1 |
1 | TP53 | exp1 |
2 | ALDOC | exp1 |
Using iloc
we can do the same using indexes
One index by default points to rows of a DataFrame
To get the 2nd row, type:
df_all.iloc[2]
chrom chr17
start 26903451
end 26904451
nearest_gene NM_005165
exp exp1
refseq NM_005165
ensembl ENSG00000109107
genename ALDOC
sample1 18.218
sample2 5.333
sample3 18.445
sample4 131.013
Name: 2, dtype: object
Or, for multiple rows:
df_all.iloc[2:5]
chrom | start | end | nearest_gene | exp | refseq | ensembl | genename | sample1 | sample2 | sample3 | sample4 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
2 | chr17 | 26903451 | 26904451 | NM_005165 | exp1 | NM_005165 | ENSG00000109107 | ALDOC | 18.218 | 5.333 | 18.445 | 131.013 |
This works very much like indexes of a list
If you use two indices, you can obtain specific rows and columns.
The first index is always rows, the second columns.
df_all.iloc[2:5, 4:11]
exp | refseq | ensembl | genename | sample1 | sample2 | sample3 | |
---|---|---|---|---|---|---|---|
2 | exp1 | NM_005165 | ENSG00000109107 | ALDOC | 18.218 | 5.333 | 18.445 |
11.8. Removing missing values or NaN
s.¶
In many cases DataFrames contains NaN
, missing values, and you often have to remove them.
gene_info = pd.DataFrame({"full_name":["tumor protein p53", "aldolase, fructose-bisphosphate C"]},
index=["TP53", "ALDOC"])
df_all = df_all.join(gene_info, on="genename")
df_all
chrom | start | end | nearest_gene | exp | refseq | ensembl | genename | sample1 | sample2 | sample3 | sample4 | full_name | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | chr11 | 77774406 | 77775406 | NM_003251 | exp1 | NM_003251 | ENSG00000151365 | THRSP | 0.026 | 0.013 | 0.023 | 0.181 | NaN |
1 | chr17 | 7590368 | 7591368 | NM_001126112 | exp1 | NM_001126112 | ENSG00000141510 | TP53 | 58.697 | 47.569 | 34.299 | 28.180 | tumor protein p53 |
2 | chr17 | 26903451 | 26904451 | NM_005165 | exp1 | NM_005165 | ENSG00000109107 | ALDOC | 18.218 | 5.333 | 18.445 | 131.013 | aldolase, fructose-bisphosphate C |
We can obtain a boolean
(True
/False
) for the column called full_name
like this:
df_all['full_name'].isnull()
0 True
1 False
2 False
Name: full_name, dtype: bool
By using the ~
symbol we can invert this boolean
column. The ~
acts as not
for a pandas Series.
~df_all['full_name'].isnull()
0 False
1 True
2 True
Name: full_name, dtype: bool
We can use this strategy to remove all rows that have NaN
in the column full_name
.
df_all[~df_all['full_name'].isnull()]
chrom | start | end | nearest_gene | exp | refseq | ensembl | genename | sample1 | sample2 | sample3 | sample4 | full_name | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | chr17 | 7590368 | 7591368 | NM_001126112 | exp1 | NM_001126112 | ENSG00000141510 | TP53 | 58.697 | 47.569 | 34.299 | 28.180 | tumor protein p53 |
2 | chr17 | 26903451 | 26904451 | NM_005165 | exp1 | NM_005165 | ENSG00000109107 | ALDOC | 18.218 | 5.333 | 18.445 | 131.013 | aldolase, fructose-bisphosphate C |
Make sure you understand how this worked!
11.9. Basic calculations¶
There are a lot of methods that allows you to do basic calculation on DataFrames.
As an illustration we will calculate the pairwise correlations between the columns sample1
, sample2
, sample3
, and sample4
.
We take the appropriate columns first
df_1 = df_all[['sample1', 'sample2', 'sample3', 'sample4']]
Now we can use the pandas corr
method to calculate all pairwise correlations:
df_1.corr()
sample1 | sample2 | sample3 | sample4 | |
---|---|---|---|---|
sample1 | 1.000000 | 0.978971 | 0.966614 | -0.011258 |
sample2 | 0.978971 | 1.000000 | 0.894017 | -0.215006 |
sample3 | 0.966614 | 0.894017 | 1.000000 | 0.245337 |
sample4 | -0.011258 | -0.215006 | 0.245337 | 1.000000 |