{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Introduction to pandas DataFrames\n",
"\n",
"In this tutorial you are going to use the module `pandas`.\n",
"We will focus on DataFrames, a convenient object type used in pandas.\n",
"\n",
"```{note}\n",
"This tutorial has no exercises, but serves as an introduction to pandas.\n",
"It will illustrate what you can do with pandas DataFrames. \n",
"```\n",
"\n",
"If you want more information on pandas, a (quick) tutorial is found here:\n",
"\n",
"[https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html](https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html)\n",
"\n",
"## Importing the pandas module\n",
"\n",
"We start by importing the `pandas` module, for which we use the following command:\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that we imported the `pandas` module and renamed it to `pd`. This allows us to use all `pandas` functionality using `pd` (instead of `pandas`). This is very common: it saves typing since `pd` is shorter than `pandas`.\n",
"\n",
"## Creating a pandas DataFrame from lists\n",
"\n",
"Here we will create a pandas DataFrame containing the following columns:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"chrom = ['chr6', 'chr11', 'chr3', 'chr2', 'chr17', 'chr17']\n",
"start = [143889976, 77774406, 14988512, 203102822, 7590368, 26903451]\n",
"end = [143890976, 77775406, 14989512, 203103822, 7591368, 26904451]\n",
"refseq =['NR_027113', 'NM_003251', 'NR_046253', 'NM_003352', 'NM_001126112', 'NM_005165']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this case we create an empty DataFrame first, and we fill it up with the individual columns."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"df_peaks = pd.DataFrame()\n",
"\n",
"df_peaks['chrom'] = chrom\n",
"df_peaks['start'] = start\n",
"df_peaks['end'] = end\n",
"df_peaks['nearest_gene'] = refseq"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can check what your DataFrame looks like using `print()`, or simply type `df_peaks`. The latter only works in a Jupyter notebook, but the advantage is that you get a nicely formatted table!"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" chrom start end nearest_gene\n",
"0 chr6 143889976 143890976 NR_027113\n",
"1 chr11 77774406 77775406 NM_003251\n",
"2 chr3 14988512 14989512 NR_046253\n",
"3 chr2 203102822 203103822 NM_003352\n",
"4 chr17 7590368 7591368 NM_001126112\n",
"5 chr17 26903451 26904451 NM_005165\n"
]
}
],
"source": [
"print(df_peaks)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" chrom | \n",
" start | \n",
" end | \n",
" nearest_gene | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" chr6 | \n",
" 143889976 | \n",
" 143890976 | \n",
" NR_027113 | \n",
"
\n",
" \n",
" 1 | \n",
" chr11 | \n",
" 77774406 | \n",
" 77775406 | \n",
" NM_003251 | \n",
"
\n",
" \n",
" 2 | \n",
" chr3 | \n",
" 14988512 | \n",
" 14989512 | \n",
" NR_046253 | \n",
"
\n",
" \n",
" 3 | \n",
" chr2 | \n",
" 203102822 | \n",
" 203103822 | \n",
" NM_003352 | \n",
"
\n",
" \n",
" 4 | \n",
" chr17 | \n",
" 7590368 | \n",
" 7591368 | \n",
" NM_001126112 | \n",
"
\n",
" \n",
" 5 | \n",
" chr17 | \n",
" 26903451 | \n",
" 26904451 | \n",
" NM_005165 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" chrom start end nearest_gene\n",
"0 chr6 143889976 143890976 NR_027113\n",
"1 chr11 77774406 77775406 NM_003251\n",
"2 chr3 14988512 14989512 NR_046253\n",
"3 chr2 203102822 203103822 NM_003352\n",
"4 chr17 7590368 7591368 NM_001126112\n",
"5 chr17 26903451 26904451 NM_005165"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_peaks"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We make a second DataFrame, which corresponds to a second experiment ('exp2')\n"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"chr_exp2 = [\"chr6_dbb_hap3\", \"chr7\", \"chr9\", \"chrX\", \"chrX\", \"chrX\"]\n",
"start_exp2 = [4324541, 98029927, 90497271, 71496641, 73506544, 152162171]\n",
"end_exp2 = [4325541, 98030927, 90498271, 71497641, 73507544, 152163171]\n",
"refseq_exp2 = [\"NM_002121\", \"NM_018842\", \"NM_178828\", \"NM_001007\", \"NR_030258\", \"NM_001184924\"]\n",
"\n",
"df_peaks_exp2 = pd.DataFrame()\n",
"df_peaks_exp2['chrom'] = chr_exp2\n",
"df_peaks_exp2['start'] = start_exp2\n",
"df_peaks_exp2['end'] = end_exp2\n",
"df_peaks_exp2['nearest_gene'] = refseq_exp2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Look at both DataFrames. The `head()` function shows the first five rows of a DataFrame.\n",
"\n",
"They have the same column layout (`chrom`, `start`, `end`, `refseq`)."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" chrom | \n",
" start | \n",
" end | \n",
" nearest_gene | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" chr6 | \n",
" 143889976 | \n",
" 143890976 | \n",
" NR_027113 | \n",
"
\n",
" \n",
" 1 | \n",
" chr11 | \n",
" 77774406 | \n",
" 77775406 | \n",
" NM_003251 | \n",
"
\n",
" \n",
" 2 | \n",
" chr3 | \n",
" 14988512 | \n",
" 14989512 | \n",
" NR_046253 | \n",
"
\n",
" \n",
" 3 | \n",
" chr2 | \n",
" 203102822 | \n",
" 203103822 | \n",
" NM_003352 | \n",
"
\n",
" \n",
" 4 | \n",
" chr17 | \n",
" 7590368 | \n",
" 7591368 | \n",
" NM_001126112 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" chrom start end nearest_gene\n",
"0 chr6 143889976 143890976 NR_027113\n",
"1 chr11 77774406 77775406 NM_003251\n",
"2 chr3 14988512 14989512 NR_046253\n",
"3 chr2 203102822 203103822 NM_003352\n",
"4 chr17 7590368 7591368 NM_001126112"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_peaks.head()"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" chrom | \n",
" start | \n",
" end | \n",
" nearest_gene | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" chr6_dbb_hap3 | \n",
" 4324541 | \n",
" 4325541 | \n",
" NM_002121 | \n",
"
\n",
" \n",
" 1 | \n",
" chr7 | \n",
" 98029927 | \n",
" 98030927 | \n",
" NM_018842 | \n",
"
\n",
" \n",
" 2 | \n",
" chr9 | \n",
" 90497271 | \n",
" 90498271 | \n",
" NM_178828 | \n",
"
\n",
" \n",
" 3 | \n",
" chrX | \n",
" 71496641 | \n",
" 71497641 | \n",
" NM_001007 | \n",
"
\n",
" \n",
" 4 | \n",
" chrX | \n",
" 73506544 | \n",
" 73507544 | \n",
" NR_030258 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" chrom start end nearest_gene\n",
"0 chr6_dbb_hap3 4324541 4325541 NM_002121\n",
"1 chr7 98029927 98030927 NM_018842\n",
"2 chr9 90497271 90498271 NM_178828\n",
"3 chrX 71496641 71497641 NM_001007\n",
"4 chrX 73506544 73507544 NR_030258"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_peaks_exp2.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"They both represent peaks of a ChIP-seq experiment.\n",
"\n",
"For both DataFrames we can add an additional column indicating from which experiment the data comes ('exp1' or 'exp2')\n"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"df_peaks['exp'] = 'exp1'\n",
"df_peaks_exp2['exp'] = 'exp2'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Take a look at the two DataFrames, and note how we did this:\n",
"\n",
"* We used `[ ]` to define a new colum, with the column name between `[ ]`\n",
"* We used only one string (`'exp1'` or `'exp2'`), but this was automatically expanded to the whole column\n",
"\n",
"\n",
"## Creating a pandas DataFrame using a `dictionary`\n",
"\n",
"We can also create a pandas DataFrame from a dictionary. The dictionary should contain *column names as keys* and *lists as values* for the columns. To show this in action, we will use the three following lists:\n"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"refseq = [\"NM_001007\", \"NM_001098638\", \"NM_001110221\", \"NM_001126112\", \"NM_001142599\", \n",
" \"NM_001164283\", \"NM_001164386\", \"NM_001184924\", \"NM_001290043\",\"NM_002121\",\n",
" \"NM_002441\", \"NM_003251\", \"NM_003352\", \"NM_004263\", \"NM_005165\", \"NM_006017\",\n",
" \"NM_007145\", \"NM_016522\", \"NM_018842\", \"NM_024097\", \"NM_147181\", \"NM_173856\",\n",
" \"NM_178828\"]\n",
"\n",
"ensembl = [\"ENSG00000198034\", \"ENSG00000166439\", \"ENSG00000121742\", \"ENSG00000141510\",\n",
" \"ENSG00000072682\", \"ENSG00000056972\", \"ENSG00000158987\", \"ENSG00000198883\",\n",
" \"ENSG00000225967\", \"ENSG00000237710\", \"ENSG00000235569\", \"ENSG00000151365\",\n",
" \"ENSG00000116030\", \"ENSG00000135622\", \"ENSG00000109107\", \"ENSG00000007062\",\n",
" \"ENSG00000167635\", \"ENSG00000182667\", \"ENSG00000006453\", \"ENSG00000164008\",\n",
" \"ENSG00000185774\", \"ENSG00000196131\", \"ENSG00000177992\"]\n",
"\n",
"genename = [\"RPS4X\", \"RNF169\", \"GJB6\", \"TP53\", \"P4HA2\", \"TRAF3IP2\", \"RAPGEF6\", \"PNMA5\", \n",
" \"TAP2\", \"HLA-DPB1\", \"MSH5\", \"THRSP\", \"SUMO1\", \"SEMA4F\", \"ALDOC\", \"PROM1\",\n",
" \"ZNF146\", \"NTM\", \"BAIAP2L1\", \"C1orf50\", \"KCNIP4\", \"VN1R2\", \"SPATA31E1\"]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First, we use these lists to create a dictionary\n"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"dict_genes = {\n",
" 'refseq': refseq,\n",
" 'ensembl': ensembl,\n",
" 'genename': genename\n",
"}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that we have the dictionary, we can create the DataFrame.\n"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"df_genes = pd.DataFrame(dict_genes)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Look at the resulting DataFrame called `df_genes`. This DataFrame will help us to map RefSeq to ENSEMBL gene identifiers."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The first three rows."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" refseq | \n",
" ensembl | \n",
" genename | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" NM_001007 | \n",
" ENSG00000198034 | \n",
" RPS4X | \n",
"
\n",
" \n",
" 1 | \n",
" NM_001098638 | \n",
" ENSG00000166439 | \n",
" RNF169 | \n",
"
\n",
" \n",
" 2 | \n",
" NM_001110221 | \n",
" ENSG00000121742 | \n",
" GJB6 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" refseq ensembl genename\n",
"0 NM_001007 ENSG00000198034 RPS4X\n",
"1 NM_001098638 ENSG00000166439 RNF169\n",
"2 NM_001110221 ENSG00000121742 GJB6"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_genes.head(3)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The last three rows."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" refseq | \n",
" ensembl | \n",
" genename | \n",
"
\n",
" \n",
" \n",
" \n",
" 20 | \n",
" NM_147181 | \n",
" ENSG00000185774 | \n",
" KCNIP4 | \n",
"
\n",
" \n",
" 21 | \n",
" NM_173856 | \n",
" ENSG00000196131 | \n",
" VN1R2 | \n",
"
\n",
" \n",
" 22 | \n",
" NM_178828 | \n",
" ENSG00000177992 | \n",
" SPATA31E1 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" refseq ensembl genename\n",
"20 NM_147181 ENSG00000185774 KCNIP4\n",
"21 NM_173856 ENSG00000196131 VN1R2\n",
"22 NM_178828 ENSG00000177992 SPATA31E1"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_genes.tail(3)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Or three random rows."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" refseq | \n",
" ensembl | \n",
" genename | \n",
"
\n",
" \n",
" \n",
" \n",
" 8 | \n",
" NM_001290043 | \n",
" ENSG00000225967 | \n",
" TAP2 | \n",
"
\n",
" \n",
" 7 | \n",
" NM_001184924 | \n",
" ENSG00000198883 | \n",
" PNMA5 | \n",
"
\n",
" \n",
" 11 | \n",
" NM_003251 | \n",
" ENSG00000151365 | \n",
" THRSP | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" refseq ensembl genename\n",
"8 NM_001290043 ENSG00000225967 TAP2\n",
"7 NM_001184924 ENSG00000198883 PNMA5\n",
"11 NM_003251 ENSG00000151365 THRSP"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_genes.sample(3)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, we create another DataFrame with gene expression values\n",
"\n",
"Here, we will use the dictionary method to create a `DataFrame` without explicitly creating the dictionary."
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [],
"source": [
"ensembl_1 = [\"ENSG00000237710\", \"ENSG00000182667\", \"ENSG00000121742\", \"ENSG00000158987\", \n",
" \"ENSG00000196131\", \"ENSG00000151365\", \"ENSG00000056972\", \"ENSG00000235569\", \n",
" \"ENSG00000006453\", \"ENSG00000198883\", \"ENSG00000225967\", \"ENSG00000166439\", \n",
" \"ENSG00000135622\", \"ENSG00000007062\", \"ENSG00000109107\", \"ENSG00000167635\", \n",
" \"ENSG00000164008\", \"ENSG00000141510\", \"ENSG00000177992\"]\n",
"\n",
"sample_1 = [None, 1.686, 0.063, 4.222, 0.021, 0.026, 8.169, None, 16.159, 0.08, None, \n",
" 3.972, 0.844, 26.208, 18.218, 91.049, 3.828, 58.697, 0.024]\n",
"\n",
"sample_2 = [None, 1.159, 0.011, 4.291, 0.004, 0.013, 0.306, None, 12.671, 0, None, \n",
" 6.983, 5.352, 14.336, 5.333, 60.28, 5.137, 47.569, 0]\n",
"\n",
"sample_3 = [None, 6.865, 0, 4.727, 0.039, 0.023, 10.459, None, 0.225, 0.011, None, \n",
" 7.007, 3.402, 6.076, 18.445, 28.716, 4.128, 34.299, 0]\n",
"\n",
"sample_4 = [None, 15.691, 2.16, 5.284, 0, 0.181, 1.428, None, 0.184, 0.395, None, \n",
" 11.842, 4.186, 6.01, 131.013, 25.126, 9.933, 28.18, 0]"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [],
"source": [
"# The next code could all be written on one line. However, to improve readability\n",
"# we write in this manner.\n",
"df_expr = pd.DataFrame(\n",
" {\n",
" 'ensembl': ensembl_1, \n",
" 'sample1': sample_1, \n",
" 'sample2': sample_2, \n",
" 'sample3': sample_3, \n",
" 'sample4': sample_4\n",
" }\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Look at the DataFrames using `print()` or, in Jupyter, by just typing the DataFrame name followed by ``.\n",
"\n",
"In the case where you would have very large DataFrames, you can always use `df.head()` to show the first rows, or `df.tail()` to show the last rows, where `df` is your DataFrame name."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Merging pandas DataFrames\n",
"\n",
"\n",
"### Merging DataFrames on column values\n",
"\n",
"We now have 3 types of data in our 3 DataFrames:\n",
"\n",
"* ChIP-seq peaks (`df_peaks`)\n",
"* A RefSeq to ENSEMBL gene identifier mapping (`df_genes`)\n",
"* Expression values of genes in 4 samples, with ENSEMBL identifiers (`df_expr`)\n",
"\n",
"This is a typical example for any (bioinformatical) analysis in which you want to incorporate different data types, often from different sources. You need to merge them if you want to do anything useful. Pandas DataFrames can be of big help here.\n",
"\n",
"So let us try to merge these pieces of data.\n",
"\n",
"### Merging DataFrames: `concat`\n",
"\n",
"Using the command `concat()` we can concatenate multiple DataFrames.\n",
"Let us do this with the 2 ChIP-seq peaks DataFrames:"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" chrom | \n",
" start | \n",
" end | \n",
" nearest_gene | \n",
" exp | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" chr6 | \n",
" 143889976 | \n",
" 143890976 | \n",
" NR_027113 | \n",
" exp1 | \n",
"
\n",
" \n",
" 1 | \n",
" chr11 | \n",
" 77774406 | \n",
" 77775406 | \n",
" NM_003251 | \n",
" exp1 | \n",
"
\n",
" \n",
" 2 | \n",
" chr3 | \n",
" 14988512 | \n",
" 14989512 | \n",
" NR_046253 | \n",
" exp1 | \n",
"
\n",
" \n",
" 3 | \n",
" chr2 | \n",
" 203102822 | \n",
" 203103822 | \n",
" NM_003352 | \n",
" exp1 | \n",
"
\n",
" \n",
" 4 | \n",
" chr17 | \n",
" 7590368 | \n",
" 7591368 | \n",
" NM_001126112 | \n",
" exp1 | \n",
"
\n",
" \n",
" 5 | \n",
" chr17 | \n",
" 26903451 | \n",
" 26904451 | \n",
" NM_005165 | \n",
" exp1 | \n",
"
\n",
" \n",
" 0 | \n",
" chr6_dbb_hap3 | \n",
" 4324541 | \n",
" 4325541 | \n",
" NM_002121 | \n",
" exp2 | \n",
"
\n",
" \n",
" 1 | \n",
" chr7 | \n",
" 98029927 | \n",
" 98030927 | \n",
" NM_018842 | \n",
" exp2 | \n",
"
\n",
" \n",
" 2 | \n",
" chr9 | \n",
" 90497271 | \n",
" 90498271 | \n",
" NM_178828 | \n",
" exp2 | \n",
"
\n",
" \n",
" 3 | \n",
" chrX | \n",
" 71496641 | \n",
" 71497641 | \n",
" NM_001007 | \n",
" exp2 | \n",
"
\n",
" \n",
" 4 | \n",
" chrX | \n",
" 73506544 | \n",
" 73507544 | \n",
" NR_030258 | \n",
" exp2 | \n",
"
\n",
" \n",
" 5 | \n",
" chrX | \n",
" 152162171 | \n",
" 152163171 | \n",
" NM_001184924 | \n",
" exp2 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" chrom start end nearest_gene exp\n",
"0 chr6 143889976 143890976 NR_027113 exp1\n",
"1 chr11 77774406 77775406 NM_003251 exp1\n",
"2 chr3 14988512 14989512 NR_046253 exp1\n",
"3 chr2 203102822 203103822 NM_003352 exp1\n",
"4 chr17 7590368 7591368 NM_001126112 exp1\n",
"5 chr17 26903451 26904451 NM_005165 exp1\n",
"0 chr6_dbb_hap3 4324541 4325541 NM_002121 exp2\n",
"1 chr7 98029927 98030927 NM_018842 exp2\n",
"2 chr9 90497271 90498271 NM_178828 exp2\n",
"3 chrX 71496641 71497641 NM_001007 exp2\n",
"4 chrX 73506544 73507544 NR_030258 exp2\n",
"5 chrX 152162171 152163171 NM_001184924 exp2"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_all_peaks = pd.concat([df_peaks, df_peaks_exp2])\n",
"df_all_peaks"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Check out the result, and note how the `exp` column still allows us to distinguish the two experiments. \n",
"\n",
"You can use `concat()` to combine two DataFrames vertically (`axis=0`) or horizontally (`axis=1`)."
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" chrom | \n",
" start | \n",
" end | \n",
" nearest_gene | \n",
" exp | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" chr6 | \n",
" 143889976 | \n",
" 143890976 | \n",
" NR_027113 | \n",
" exp1 | \n",
"
\n",
" \n",
" 1 | \n",
" chr11 | \n",
" 77774406 | \n",
" 77775406 | \n",
" NM_003251 | \n",
" exp1 | \n",
"
\n",
" \n",
" 2 | \n",
" chr3 | \n",
" 14988512 | \n",
" 14989512 | \n",
" NR_046253 | \n",
" exp1 | \n",
"
\n",
" \n",
" 3 | \n",
" chr2 | \n",
" 203102822 | \n",
" 203103822 | \n",
" NM_003352 | \n",
" exp1 | \n",
"
\n",
" \n",
" 4 | \n",
" chr17 | \n",
" 7590368 | \n",
" 7591368 | \n",
" NM_001126112 | \n",
" exp1 | \n",
"
\n",
" \n",
" 5 | \n",
" chr17 | \n",
" 26903451 | \n",
" 26904451 | \n",
" NM_005165 | \n",
" exp1 | \n",
"
\n",
" \n",
" 0 | \n",
" chr6_dbb_hap3 | \n",
" 4324541 | \n",
" 4325541 | \n",
" NM_002121 | \n",
" exp2 | \n",
"
\n",
" \n",
" 1 | \n",
" chr7 | \n",
" 98029927 | \n",
" 98030927 | \n",
" NM_018842 | \n",
" exp2 | \n",
"
\n",
" \n",
" 2 | \n",
" chr9 | \n",
" 90497271 | \n",
" 90498271 | \n",
" NM_178828 | \n",
" exp2 | \n",
"
\n",
" \n",
" 3 | \n",
" chrX | \n",
" 71496641 | \n",
" 71497641 | \n",
" NM_001007 | \n",
" exp2 | \n",
"
\n",
" \n",
" 4 | \n",
" chrX | \n",
" 73506544 | \n",
" 73507544 | \n",
" NR_030258 | \n",
" exp2 | \n",
"
\n",
" \n",
" 5 | \n",
" chrX | \n",
" 152162171 | \n",
" 152163171 | \n",
" NM_001184924 | \n",
" exp2 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" chrom start end nearest_gene exp\n",
"0 chr6 143889976 143890976 NR_027113 exp1\n",
"1 chr11 77774406 77775406 NM_003251 exp1\n",
"2 chr3 14988512 14989512 NR_046253 exp1\n",
"3 chr2 203102822 203103822 NM_003352 exp1\n",
"4 chr17 7590368 7591368 NM_001126112 exp1\n",
"5 chr17 26903451 26904451 NM_005165 exp1\n",
"0 chr6_dbb_hap3 4324541 4325541 NM_002121 exp2\n",
"1 chr7 98029927 98030927 NM_018842 exp2\n",
"2 chr9 90497271 90498271 NM_178828 exp2\n",
"3 chrX 71496641 71497641 NM_001007 exp2\n",
"4 chrX 73506544 73507544 NR_030258 exp2\n",
"5 chrX 152162171 152163171 NM_001184924 exp2"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.concat([df_peaks, df_peaks_exp2], axis=0)"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" chrom | \n",
" start | \n",
" end | \n",
" nearest_gene | \n",
" exp | \n",
" chrom | \n",
" start | \n",
" end | \n",
" nearest_gene | \n",
" exp | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" chr6 | \n",
" 143889976 | \n",
" 143890976 | \n",
" NR_027113 | \n",
" exp1 | \n",
" chr6_dbb_hap3 | \n",
" 4324541 | \n",
" 4325541 | \n",
" NM_002121 | \n",
" exp2 | \n",
"
\n",
" \n",
" 1 | \n",
" chr11 | \n",
" 77774406 | \n",
" 77775406 | \n",
" NM_003251 | \n",
" exp1 | \n",
" chr7 | \n",
" 98029927 | \n",
" 98030927 | \n",
" NM_018842 | \n",
" exp2 | \n",
"
\n",
" \n",
" 2 | \n",
" chr3 | \n",
" 14988512 | \n",
" 14989512 | \n",
" NR_046253 | \n",
" exp1 | \n",
" chr9 | \n",
" 90497271 | \n",
" 90498271 | \n",
" NM_178828 | \n",
" exp2 | \n",
"
\n",
" \n",
" 3 | \n",
" chr2 | \n",
" 203102822 | \n",
" 203103822 | \n",
" NM_003352 | \n",
" exp1 | \n",
" chrX | \n",
" 71496641 | \n",
" 71497641 | \n",
" NM_001007 | \n",
" exp2 | \n",
"
\n",
" \n",
" 4 | \n",
" chr17 | \n",
" 7590368 | \n",
" 7591368 | \n",
" NM_001126112 | \n",
" exp1 | \n",
" chrX | \n",
" 73506544 | \n",
" 73507544 | \n",
" NR_030258 | \n",
" exp2 | \n",
"
\n",
" \n",
" 5 | \n",
" chr17 | \n",
" 26903451 | \n",
" 26904451 | \n",
" NM_005165 | \n",
" exp1 | \n",
" chrX | \n",
" 152162171 | \n",
" 152163171 | \n",
" NM_001184924 | \n",
" exp2 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" chrom start end nearest_gene exp chrom start \\\n",
"0 chr6 143889976 143890976 NR_027113 exp1 chr6_dbb_hap3 4324541 \n",
"1 chr11 77774406 77775406 NM_003251 exp1 chr7 98029927 \n",
"2 chr3 14988512 14989512 NR_046253 exp1 chr9 90497271 \n",
"3 chr2 203102822 203103822 NM_003352 exp1 chrX 71496641 \n",
"4 chr17 7590368 7591368 NM_001126112 exp1 chrX 73506544 \n",
"5 chr17 26903451 26904451 NM_005165 exp1 chrX 152162171 \n",
"\n",
" end nearest_gene exp \n",
"0 4325541 NM_002121 exp2 \n",
"1 98030927 NM_018842 exp2 \n",
"2 90498271 NM_178828 exp2 \n",
"3 71497641 NM_001007 exp2 \n",
"4 73507544 NR_030258 exp2 \n",
"5 152163171 NM_001184924 exp2 "
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.concat([df_peaks, df_peaks_exp2], axis=1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html) contains more examples.\n",
"\n",
"### Merging DataFrames: `merge`\n",
"\n",
"In the above case, we want to simply combine two DataFrames by concatenating, i.e. simply adding the rows of two DataFrames together. You can also join two DataFrames in other ways. The `df_peaks` DataFrame has a `nearest_gene` column. This is a RefSeq gene identifier that is also present in the `df_genes` DataFrame. If we would like to annotate the genes in `df_peaks`, we would only need the relevant columns for the genes that are in `df_peaks`. "
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" chrom | \n",
" start | \n",
" end | \n",
" nearest_gene | \n",
" exp | \n",
" refseq | \n",
" ensembl | \n",
" genename | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" chr11 | \n",
" 77774406 | \n",
" 77775406 | \n",
" NM_003251 | \n",
" exp1 | \n",
" NM_003251 | \n",
" ENSG00000151365 | \n",
" THRSP | \n",
"
\n",
" \n",
" 1 | \n",
" chr2 | \n",
" 203102822 | \n",
" 203103822 | \n",
" NM_003352 | \n",
" exp1 | \n",
" NM_003352 | \n",
" ENSG00000116030 | \n",
" SUMO1 | \n",
"
\n",
" \n",
" 2 | \n",
" chr17 | \n",
" 7590368 | \n",
" 7591368 | \n",
" NM_001126112 | \n",
" exp1 | \n",
" NM_001126112 | \n",
" ENSG00000141510 | \n",
" TP53 | \n",
"
\n",
" \n",
" 3 | \n",
" chr17 | \n",
" 26903451 | \n",
" 26904451 | \n",
" NM_005165 | \n",
" exp1 | \n",
" NM_005165 | \n",
" ENSG00000109107 | \n",
" ALDOC | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" chrom start end nearest_gene exp refseq \\\n",
"0 chr11 77774406 77775406 NM_003251 exp1 NM_003251 \n",
"1 chr2 203102822 203103822 NM_003352 exp1 NM_003352 \n",
"2 chr17 7590368 7591368 NM_001126112 exp1 NM_001126112 \n",
"3 chr17 26903451 26904451 NM_005165 exp1 NM_005165 \n",
"\n",
" ensembl genename \n",
"0 ENSG00000151365 THRSP \n",
"1 ENSG00000116030 SUMO1 \n",
"2 ENSG00000141510 TP53 \n",
"3 ENSG00000109107 ALDOC "
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_genes_peaks = df_peaks.merge(df_genes, left_on=\"nearest_gene\", right_on=\"refseq\")\n",
"df_genes_peaks"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Check the resulting DataFrame. For every row in `df_peaks`, all the columns from `df_genes` have been added if the `refseq` identifier matches the identifier specified in `nearest_gene`.\n",
"\n",
"Our DataFrame now has the ENSEMBL gene identifiers, so we can also merge the `df_expr`."
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [],
"source": [
"df_genes_peaks_expr = df_genes_peaks.merge(df_expr, on='ensembl')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that here the name of the columns that will be used for merging is the same in `df_genes_peaks` and `df_exp`, so we can use the `on='ensembl'` argument."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For convenience, we will use a shorter name for our merged DataFrame"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [],
"source": [
"df_all = df_genes_peaks_expr"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Tab completion and help\n",
"\n",
"Now that we have merged our data, we can do several neat things.\n",
"\n",
"Because `df_all` is a pandas DataFrame, it automatically inherits DataFrame methods.\n",
"\n",
"You can get an idea if you type:\n",
" \n",
" df_all.\n",
" \n",
"And then hit the left `` key.\n",
"\n",
"Notice that you get a lot of options....\n",
"\n",
"Suppose you want to know how the `sort_values()` method for a DataFrame works. You can type:\n",
"\n",
" df_all.sort_values?\n",
" \n",
"This HELP menu gives you information about the function. type `q` or `` to leave the HELP.\n",
"\n",
"## Sorting a DataFrame\n",
"\n",
"Let us try it out:"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" chrom | \n",
" start | \n",
" end | \n",
" nearest_gene | \n",
" exp | \n",
" refseq | \n",
" ensembl | \n",
" genename | \n",
" sample1 | \n",
" sample2 | \n",
" sample3 | \n",
" sample4 | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" chr11 | \n",
" 77774406 | \n",
" 77775406 | \n",
" NM_003251 | \n",
" exp1 | \n",
" NM_003251 | \n",
" ENSG00000151365 | \n",
" THRSP | \n",
" 0.026 | \n",
" 0.013 | \n",
" 0.023 | \n",
" 0.181 | \n",
"
\n",
" \n",
" 1 | \n",
" chr17 | \n",
" 7590368 | \n",
" 7591368 | \n",
" NM_001126112 | \n",
" exp1 | \n",
" NM_001126112 | \n",
" ENSG00000141510 | \n",
" TP53 | \n",
" 58.697 | \n",
" 47.569 | \n",
" 34.299 | \n",
" 28.180 | \n",
"
\n",
" \n",
" 2 | \n",
" chr17 | \n",
" 26903451 | \n",
" 26904451 | \n",
" NM_005165 | \n",
" exp1 | \n",
" NM_005165 | \n",
" ENSG00000109107 | \n",
" ALDOC | \n",
" 18.218 | \n",
" 5.333 | \n",
" 18.445 | \n",
" 131.013 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" chrom start end nearest_gene exp refseq \\\n",
"0 chr11 77774406 77775406 NM_003251 exp1 NM_003251 \n",
"1 chr17 7590368 7591368 NM_001126112 exp1 NM_001126112 \n",
"2 chr17 26903451 26904451 NM_005165 exp1 NM_005165 \n",
"\n",
" ensembl genename sample1 sample2 sample3 sample4 \n",
"0 ENSG00000151365 THRSP 0.026 0.013 0.023 0.181 \n",
"1 ENSG00000141510 TP53 58.697 47.569 34.299 28.180 \n",
"2 ENSG00000109107 ALDOC 18.218 5.333 18.445 131.013 "
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_all.sort_values(by='chrom')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"or"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" chrom | \n",
" start | \n",
" end | \n",
" nearest_gene | \n",
" exp | \n",
" refseq | \n",
" ensembl | \n",
" genename | \n",
" sample1 | \n",
" sample2 | \n",
" sample3 | \n",
" sample4 | \n",
"
\n",
" \n",
" \n",
" \n",
" 1 | \n",
" chr17 | \n",
" 7590368 | \n",
" 7591368 | \n",
" NM_001126112 | \n",
" exp1 | \n",
" NM_001126112 | \n",
" ENSG00000141510 | \n",
" TP53 | \n",
" 58.697 | \n",
" 47.569 | \n",
" 34.299 | \n",
" 28.180 | \n",
"
\n",
" \n",
" 2 | \n",
" chr17 | \n",
" 26903451 | \n",
" 26904451 | \n",
" NM_005165 | \n",
" exp1 | \n",
" NM_005165 | \n",
" ENSG00000109107 | \n",
" ALDOC | \n",
" 18.218 | \n",
" 5.333 | \n",
" 18.445 | \n",
" 131.013 | \n",
"
\n",
" \n",
" 0 | \n",
" chr11 | \n",
" 77774406 | \n",
" 77775406 | \n",
" NM_003251 | \n",
" exp1 | \n",
" NM_003251 | \n",
" ENSG00000151365 | \n",
" THRSP | \n",
" 0.026 | \n",
" 0.013 | \n",
" 0.023 | \n",
" 0.181 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" chrom start end nearest_gene exp refseq \\\n",
"1 chr17 7590368 7591368 NM_001126112 exp1 NM_001126112 \n",
"2 chr17 26903451 26904451 NM_005165 exp1 NM_005165 \n",
"0 chr11 77774406 77775406 NM_003251 exp1 NM_003251 \n",
"\n",
" ensembl genename sample1 sample2 sample3 sample4 \n",
"1 ENSG00000141510 TP53 58.697 47.569 34.299 28.180 \n",
"2 ENSG00000109107 ALDOC 18.218 5.333 18.445 131.013 \n",
"0 ENSG00000151365 THRSP 0.026 0.013 0.023 0.181 "
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_all.sort_values(by='start')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that these commands did not replace the original DataFrame `df_all`!\n",
"\n",
"\n",
"## Using `[ ]` to obtain DataFrame columns or rows\n",
"\n",
"Using `[ ]` we can obtain specific parts of the DataFrame.\n"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 THRSP\n",
"1 TP53\n",
"2 ALDOC\n",
"Name: genename, dtype: object"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_all['genename']"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" genename | \n",
" exp | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" THRSP | \n",
" exp1 | \n",
"
\n",
" \n",
" 1 | \n",
" TP53 | \n",
" exp1 | \n",
"
\n",
" \n",
" 2 | \n",
" ALDOC | \n",
" exp1 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" genename exp\n",
"0 THRSP exp1\n",
"1 TP53 exp1\n",
"2 ALDOC exp1"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"## or, for multiple columns:\n",
"df_all[['genename', 'exp']]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Using `iloc` we can do the same using indexes\n",
"\n",
"One index by default points to **rows** of a DataFrame\n",
"\n",
"To get the 2nd row, type:\n"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"chrom chr17\n",
"start 26903451\n",
"end 26904451\n",
"nearest_gene NM_005165\n",
"exp exp1\n",
"refseq NM_005165\n",
"ensembl ENSG00000109107\n",
"genename ALDOC\n",
"sample1 18.218\n",
"sample2 5.333\n",
"sample3 18.445\n",
"sample4 131.013\n",
"Name: 2, dtype: object"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_all.iloc[2]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Or, for multiple rows:"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" chrom | \n",
" start | \n",
" end | \n",
" nearest_gene | \n",
" exp | \n",
" refseq | \n",
" ensembl | \n",
" genename | \n",
" sample1 | \n",
" sample2 | \n",
" sample3 | \n",
" sample4 | \n",
"
\n",
" \n",
" \n",
" \n",
" 2 | \n",
" chr17 | \n",
" 26903451 | \n",
" 26904451 | \n",
" NM_005165 | \n",
" exp1 | \n",
" NM_005165 | \n",
" ENSG00000109107 | \n",
" ALDOC | \n",
" 18.218 | \n",
" 5.333 | \n",
" 18.445 | \n",
" 131.013 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" chrom start end nearest_gene exp refseq ensembl \\\n",
"2 chr17 26903451 26904451 NM_005165 exp1 NM_005165 ENSG00000109107 \n",
"\n",
" genename sample1 sample2 sample3 sample4 \n",
"2 ALDOC 18.218 5.333 18.445 131.013 "
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_all.iloc[2:5]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This works very much like indexes of a `list`\n",
"\n",
"If you use two indices, you can obtain specific rows *and* columns.\n",
"\n",
"The first index is always *rows*, the second *columns*.\n"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" exp | \n",
" refseq | \n",
" ensembl | \n",
" genename | \n",
" sample1 | \n",
" sample2 | \n",
" sample3 | \n",
"
\n",
" \n",
" \n",
" \n",
" 2 | \n",
" exp1 | \n",
" NM_005165 | \n",
" ENSG00000109107 | \n",
" ALDOC | \n",
" 18.218 | \n",
" 5.333 | \n",
" 18.445 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" exp refseq ensembl genename sample1 sample2 sample3\n",
"2 exp1 NM_005165 ENSG00000109107 ALDOC 18.218 5.333 18.445"
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_all.iloc[2:5, 4:11]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Removing missing values or `NaN`s.\n",
"\n",
"In many cases DataFrames contains `NaN`, missing values, and you often have to remove them.\n",
"\n",
"\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" chrom | \n",
" start | \n",
" end | \n",
" nearest_gene | \n",
" exp | \n",
" refseq | \n",
" ensembl | \n",
" genename | \n",
" sample1 | \n",
" sample2 | \n",
" sample3 | \n",
" sample4 | \n",
" full_name | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" chr11 | \n",
" 77774406 | \n",
" 77775406 | \n",
" NM_003251 | \n",
" exp1 | \n",
" NM_003251 | \n",
" ENSG00000151365 | \n",
" THRSP | \n",
" 0.026 | \n",
" 0.013 | \n",
" 0.023 | \n",
" 0.181 | \n",
" NaN | \n",
"
\n",
" \n",
" 1 | \n",
" chr17 | \n",
" 7590368 | \n",
" 7591368 | \n",
" NM_001126112 | \n",
" exp1 | \n",
" NM_001126112 | \n",
" ENSG00000141510 | \n",
" TP53 | \n",
" 58.697 | \n",
" 47.569 | \n",
" 34.299 | \n",
" 28.180 | \n",
" tumor protein p53 | \n",
"
\n",
" \n",
" 2 | \n",
" chr17 | \n",
" 26903451 | \n",
" 26904451 | \n",
" NM_005165 | \n",
" exp1 | \n",
" NM_005165 | \n",
" ENSG00000109107 | \n",
" ALDOC | \n",
" 18.218 | \n",
" 5.333 | \n",
" 18.445 | \n",
" 131.013 | \n",
" aldolase, fructose-bisphosphate C | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" chrom start end nearest_gene exp refseq \\\n",
"0 chr11 77774406 77775406 NM_003251 exp1 NM_003251 \n",
"1 chr17 7590368 7591368 NM_001126112 exp1 NM_001126112 \n",
"2 chr17 26903451 26904451 NM_005165 exp1 NM_005165 \n",
"\n",
" ensembl genename sample1 sample2 sample3 sample4 \\\n",
"0 ENSG00000151365 THRSP 0.026 0.013 0.023 0.181 \n",
"1 ENSG00000141510 TP53 58.697 47.569 34.299 28.180 \n",
"2 ENSG00000109107 ALDOC 18.218 5.333 18.445 131.013 \n",
"\n",
" full_name \n",
"0 NaN \n",
"1 tumor protein p53 \n",
"2 aldolase, fructose-bisphosphate C "
]
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"gene_info = pd.DataFrame({\"full_name\":[\"tumor protein p53\", \"aldolase, fructose-bisphosphate C\"]},\n",
" index=[\"TP53\", \"ALDOC\"])\n",
"df_all = df_all.join(gene_info, on=\"genename\")\n",
"df_all"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can obtain a `boolean` (`True`/`False`) for the column called `full_name` like this:"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 True\n",
"1 False\n",
"2 False\n",
"Name: full_name, dtype: bool"
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_all['full_name'].isnull()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"By using the `~` symbol we can invert this `boolean` column. The `~` acts as `not` for a pandas Series."
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 False\n",
"1 True\n",
"2 True\n",
"Name: full_name, dtype: bool"
]
},
"execution_count": 33,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"~df_all['full_name'].isnull()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can use this strategy to remove all rows that have `NaN` in the column `full_name`."
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" chrom | \n",
" start | \n",
" end | \n",
" nearest_gene | \n",
" exp | \n",
" refseq | \n",
" ensembl | \n",
" genename | \n",
" sample1 | \n",
" sample2 | \n",
" sample3 | \n",
" sample4 | \n",
" full_name | \n",
"
\n",
" \n",
" \n",
" \n",
" 1 | \n",
" chr17 | \n",
" 7590368 | \n",
" 7591368 | \n",
" NM_001126112 | \n",
" exp1 | \n",
" NM_001126112 | \n",
" ENSG00000141510 | \n",
" TP53 | \n",
" 58.697 | \n",
" 47.569 | \n",
" 34.299 | \n",
" 28.180 | \n",
" tumor protein p53 | \n",
"
\n",
" \n",
" 2 | \n",
" chr17 | \n",
" 26903451 | \n",
" 26904451 | \n",
" NM_005165 | \n",
" exp1 | \n",
" NM_005165 | \n",
" ENSG00000109107 | \n",
" ALDOC | \n",
" 18.218 | \n",
" 5.333 | \n",
" 18.445 | \n",
" 131.013 | \n",
" aldolase, fructose-bisphosphate C | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" chrom start end nearest_gene exp refseq \\\n",
"1 chr17 7590368 7591368 NM_001126112 exp1 NM_001126112 \n",
"2 chr17 26903451 26904451 NM_005165 exp1 NM_005165 \n",
"\n",
" ensembl genename sample1 sample2 sample3 sample4 \\\n",
"1 ENSG00000141510 TP53 58.697 47.569 34.299 28.180 \n",
"2 ENSG00000109107 ALDOC 18.218 5.333 18.445 131.013 \n",
"\n",
" full_name \n",
"1 tumor protein p53 \n",
"2 aldolase, fructose-bisphosphate C "
]
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_all[~df_all['full_name'].isnull()]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"Make sure you understand how this worked!\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Basic calculations\n",
"\n",
"There are a lot of methods that allows you to do basic calculation on DataFrames.\n",
"\n",
"As an illustration we will calculate the pairwise correlations between the columns `sample1`, `sample2`, `sample3`, and `sample4`.\n",
"\n",
"We take the appropriate columns first\n"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [],
"source": [
"df_1 = df_all[['sample1', 'sample2', 'sample3', 'sample4']]\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we can use the pandas `corr` method to calculate all pairwise correlations:"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" sample1 | \n",
" sample2 | \n",
" sample3 | \n",
" sample4 | \n",
"
\n",
" \n",
" \n",
" \n",
" sample1 | \n",
" 1.000000 | \n",
" 0.978971 | \n",
" 0.966614 | \n",
" -0.011258 | \n",
"
\n",
" \n",
" sample2 | \n",
" 0.978971 | \n",
" 1.000000 | \n",
" 0.894017 | \n",
" -0.215006 | \n",
"
\n",
" \n",
" sample3 | \n",
" 0.966614 | \n",
" 0.894017 | \n",
" 1.000000 | \n",
" 0.245337 | \n",
"
\n",
" \n",
" sample4 | \n",
" -0.011258 | \n",
" -0.215006 | \n",
" 0.245337 | \n",
" 1.000000 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" sample1 sample2 sample3 sample4\n",
"sample1 1.000000 0.978971 0.966614 -0.011258\n",
"sample2 0.978971 1.000000 0.894017 -0.215006\n",
"sample3 0.966614 0.894017 1.000000 0.245337\n",
"sample4 -0.011258 -0.215006 0.245337 1.000000"
]
},
"execution_count": 36,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_1.corr()"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.3"
}
},
"nbformat": 4,
"nbformat_minor": 4
}