{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Introduction to pandas DataFrames\n", "\n", "In this tutorial you are going to use the module `pandas`.\n", "We will focus on DataFrames, a convenient object type used in pandas.\n", "\n", "```{note}\n", "This tutorial has no exercises, but serves as an introduction to pandas.\n", "It will illustrate what you can do with pandas DataFrames. \n", "```\n", "\n", "If you want more information on pandas, a (quick) tutorial is found here:\n", "\n", "[https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html](https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html)\n", "\n", "## Importing the pandas module\n", "\n", "We start by importing the `pandas` module, for which we use the following command:\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that we imported the `pandas` module and renamed it to `pd`. This allows us to use all `pandas` functionality using `pd` (instead of `pandas`). This is very common: it saves typing since `pd` is shorter than `pandas`.\n", "\n", "## Creating a pandas DataFrame from lists\n", "\n", "Here we will create a pandas DataFrame containing the following columns:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "chrom = ['chr6', 'chr11', 'chr3', 'chr2', 'chr17', 'chr17']\n", "start = [143889976, 77774406, 14988512, 203102822, 7590368, 26903451]\n", "end = [143890976, 77775406, 14989512, 203103822, 7591368, 26904451]\n", "refseq =['NR_027113', 'NM_003251', 'NR_046253', 'NM_003352', 'NM_001126112', 'NM_005165']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this case we create an empty DataFrame first, and we fill it up with the individual columns." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "df_peaks = pd.DataFrame()\n", "\n", "df_peaks['chrom'] = chrom\n", "df_peaks['start'] = start\n", "df_peaks['end'] = end\n", "df_peaks['nearest_gene'] = refseq" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can check what your DataFrame looks like using `print()`, or simply type `df_peaks`. The latter only works in a Jupyter notebook, but the advantage is that you get a nicely formatted table!" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " chrom start end nearest_gene\n", "0 chr6 143889976 143890976 NR_027113\n", "1 chr11 77774406 77775406 NM_003251\n", "2 chr3 14988512 14989512 NR_046253\n", "3 chr2 203102822 203103822 NM_003352\n", "4 chr17 7590368 7591368 NM_001126112\n", "5 chr17 26903451 26904451 NM_005165\n" ] } ], "source": [ "print(df_peaks)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
chromstartendnearest_gene
0chr6143889976143890976NR_027113
1chr117777440677775406NM_003251
2chr31498851214989512NR_046253
3chr2203102822203103822NM_003352
4chr1775903687591368NM_001126112
5chr172690345126904451NM_005165
\n", "
" ], "text/plain": [ " chrom start end nearest_gene\n", "0 chr6 143889976 143890976 NR_027113\n", "1 chr11 77774406 77775406 NM_003251\n", "2 chr3 14988512 14989512 NR_046253\n", "3 chr2 203102822 203103822 NM_003352\n", "4 chr17 7590368 7591368 NM_001126112\n", "5 chr17 26903451 26904451 NM_005165" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_peaks" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We make a second DataFrame, which corresponds to a second experiment ('exp2')\n" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "chr_exp2 = [\"chr6_dbb_hap3\", \"chr7\", \"chr9\", \"chrX\", \"chrX\", \"chrX\"]\n", "start_exp2 = [4324541, 98029927, 90497271, 71496641, 73506544, 152162171]\n", "end_exp2 = [4325541, 98030927, 90498271, 71497641, 73507544, 152163171]\n", "refseq_exp2 = [\"NM_002121\", \"NM_018842\", \"NM_178828\", \"NM_001007\", \"NR_030258\", \"NM_001184924\"]\n", "\n", "df_peaks_exp2 = pd.DataFrame()\n", "df_peaks_exp2['chrom'] = chr_exp2\n", "df_peaks_exp2['start'] = start_exp2\n", "df_peaks_exp2['end'] = end_exp2\n", "df_peaks_exp2['nearest_gene'] = refseq_exp2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Look at both DataFrames. The `head()` function shows the first five rows of a DataFrame.\n", "\n", "They have the same column layout (`chrom`, `start`, `end`, `refseq`)." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
chromstartendnearest_gene
0chr6143889976143890976NR_027113
1chr117777440677775406NM_003251
2chr31498851214989512NR_046253
3chr2203102822203103822NM_003352
4chr1775903687591368NM_001126112
\n", "
" ], "text/plain": [ " chrom start end nearest_gene\n", "0 chr6 143889976 143890976 NR_027113\n", "1 chr11 77774406 77775406 NM_003251\n", "2 chr3 14988512 14989512 NR_046253\n", "3 chr2 203102822 203103822 NM_003352\n", "4 chr17 7590368 7591368 NM_001126112" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_peaks.head()" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
chromstartendnearest_gene
0chr6_dbb_hap343245414325541NM_002121
1chr79802992798030927NM_018842
2chr99049727190498271NM_178828
3chrX7149664171497641NM_001007
4chrX7350654473507544NR_030258
\n", "
" ], "text/plain": [ " chrom start end nearest_gene\n", "0 chr6_dbb_hap3 4324541 4325541 NM_002121\n", "1 chr7 98029927 98030927 NM_018842\n", "2 chr9 90497271 90498271 NM_178828\n", "3 chrX 71496641 71497641 NM_001007\n", "4 chrX 73506544 73507544 NR_030258" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_peaks_exp2.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "They both represent peaks of a ChIP-seq experiment.\n", "\n", "For both DataFrames we can add an additional column indicating from which experiment the data comes ('exp1' or 'exp2')\n" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "df_peaks['exp'] = 'exp1'\n", "df_peaks_exp2['exp'] = 'exp2'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Take a look at the two DataFrames, and note how we did this:\n", "\n", "* We used `[ ]` to define a new colum, with the column name between `[ ]`\n", "* We used only one string (`'exp1'` or `'exp2'`), but this was automatically expanded to the whole column\n", "\n", "\n", "## Creating a pandas DataFrame using a `dictionary`\n", "\n", "We can also create a pandas DataFrame from a dictionary. The dictionary should contain *column names as keys* and *lists as values* for the columns. To show this in action, we will use the three following lists:\n" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "refseq = [\"NM_001007\", \"NM_001098638\", \"NM_001110221\", \"NM_001126112\", \"NM_001142599\", \n", " \"NM_001164283\", \"NM_001164386\", \"NM_001184924\", \"NM_001290043\",\"NM_002121\",\n", " \"NM_002441\", \"NM_003251\", \"NM_003352\", \"NM_004263\", \"NM_005165\", \"NM_006017\",\n", " \"NM_007145\", \"NM_016522\", \"NM_018842\", \"NM_024097\", \"NM_147181\", \"NM_173856\",\n", " \"NM_178828\"]\n", "\n", "ensembl = [\"ENSG00000198034\", \"ENSG00000166439\", \"ENSG00000121742\", \"ENSG00000141510\",\n", " \"ENSG00000072682\", \"ENSG00000056972\", \"ENSG00000158987\", \"ENSG00000198883\",\n", " \"ENSG00000225967\", \"ENSG00000237710\", \"ENSG00000235569\", \"ENSG00000151365\",\n", " \"ENSG00000116030\", \"ENSG00000135622\", \"ENSG00000109107\", \"ENSG00000007062\",\n", " \"ENSG00000167635\", \"ENSG00000182667\", \"ENSG00000006453\", \"ENSG00000164008\",\n", " \"ENSG00000185774\", \"ENSG00000196131\", \"ENSG00000177992\"]\n", "\n", "genename = [\"RPS4X\", \"RNF169\", \"GJB6\", \"TP53\", \"P4HA2\", \"TRAF3IP2\", \"RAPGEF6\", \"PNMA5\", \n", " \"TAP2\", \"HLA-DPB1\", \"MSH5\", \"THRSP\", \"SUMO1\", \"SEMA4F\", \"ALDOC\", \"PROM1\",\n", " \"ZNF146\", \"NTM\", \"BAIAP2L1\", \"C1orf50\", \"KCNIP4\", \"VN1R2\", \"SPATA31E1\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, we use these lists to create a dictionary\n" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "dict_genes = {\n", " 'refseq': refseq,\n", " 'ensembl': ensembl,\n", " 'genename': genename\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we have the dictionary, we can create the DataFrame.\n" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "df_genes = pd.DataFrame(dict_genes)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Look at the resulting DataFrame called `df_genes`. This DataFrame will help us to map RefSeq to ENSEMBL gene identifiers." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The first three rows." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
refseqensemblgenename
0NM_001007ENSG00000198034RPS4X
1NM_001098638ENSG00000166439RNF169
2NM_001110221ENSG00000121742GJB6
\n", "
" ], "text/plain": [ " refseq ensembl genename\n", "0 NM_001007 ENSG00000198034 RPS4X\n", "1 NM_001098638 ENSG00000166439 RNF169\n", "2 NM_001110221 ENSG00000121742 GJB6" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_genes.head(3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The last three rows." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
refseqensemblgenename
20NM_147181ENSG00000185774KCNIP4
21NM_173856ENSG00000196131VN1R2
22NM_178828ENSG00000177992SPATA31E1
\n", "
" ], "text/plain": [ " refseq ensembl genename\n", "20 NM_147181 ENSG00000185774 KCNIP4\n", "21 NM_173856 ENSG00000196131 VN1R2\n", "22 NM_178828 ENSG00000177992 SPATA31E1" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_genes.tail(3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Or three random rows." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
refseqensemblgenename
8NM_001290043ENSG00000225967TAP2
7NM_001184924ENSG00000198883PNMA5
11NM_003251ENSG00000151365THRSP
\n", "
" ], "text/plain": [ " refseq ensembl genename\n", "8 NM_001290043 ENSG00000225967 TAP2\n", "7 NM_001184924 ENSG00000198883 PNMA5\n", "11 NM_003251 ENSG00000151365 THRSP" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_genes.sample(3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we create another DataFrame with gene expression values\n", "\n", "Here, we will use the dictionary method to create a `DataFrame` without explicitly creating the dictionary." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "ensembl_1 = [\"ENSG00000237710\", \"ENSG00000182667\", \"ENSG00000121742\", \"ENSG00000158987\", \n", " \"ENSG00000196131\", \"ENSG00000151365\", \"ENSG00000056972\", \"ENSG00000235569\", \n", " \"ENSG00000006453\", \"ENSG00000198883\", \"ENSG00000225967\", \"ENSG00000166439\", \n", " \"ENSG00000135622\", \"ENSG00000007062\", \"ENSG00000109107\", \"ENSG00000167635\", \n", " \"ENSG00000164008\", \"ENSG00000141510\", \"ENSG00000177992\"]\n", "\n", "sample_1 = [None, 1.686, 0.063, 4.222, 0.021, 0.026, 8.169, None, 16.159, 0.08, None, \n", " 3.972, 0.844, 26.208, 18.218, 91.049, 3.828, 58.697, 0.024]\n", "\n", "sample_2 = [None, 1.159, 0.011, 4.291, 0.004, 0.013, 0.306, None, 12.671, 0, None, \n", " 6.983, 5.352, 14.336, 5.333, 60.28, 5.137, 47.569, 0]\n", "\n", "sample_3 = [None, 6.865, 0, 4.727, 0.039, 0.023, 10.459, None, 0.225, 0.011, None, \n", " 7.007, 3.402, 6.076, 18.445, 28.716, 4.128, 34.299, 0]\n", "\n", "sample_4 = [None, 15.691, 2.16, 5.284, 0, 0.181, 1.428, None, 0.184, 0.395, None, \n", " 11.842, 4.186, 6.01, 131.013, 25.126, 9.933, 28.18, 0]" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "# The next code could all be written on one line. However, to improve readability\n", "# we write in this manner.\n", "df_expr = pd.DataFrame(\n", " {\n", " 'ensembl': ensembl_1, \n", " 'sample1': sample_1, \n", " 'sample2': sample_2, \n", " 'sample3': sample_3, \n", " 'sample4': sample_4\n", " }\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Look at the DataFrames using `print()` or, in Jupyter, by just typing the DataFrame name followed by ``.\n", "\n", "In the case where you would have very large DataFrames, you can always use `df.head()` to show the first rows, or `df.tail()` to show the last rows, where `df` is your DataFrame name." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Merging pandas DataFrames\n", "\n", "\n", "### Merging DataFrames on column values\n", "\n", "We now have 3 types of data in our 3 DataFrames:\n", "\n", "* ChIP-seq peaks (`df_peaks`)\n", "* A RefSeq to ENSEMBL gene identifier mapping (`df_genes`)\n", "* Expression values of genes in 4 samples, with ENSEMBL identifiers (`df_expr`)\n", "\n", "This is a typical example for any (bioinformatical) analysis in which you want to incorporate different data types, often from different sources. You need to merge them if you want to do anything useful. Pandas DataFrames can be of big help here.\n", "\n", "So let us try to merge these pieces of data.\n", "\n", "### Merging DataFrames: `concat`\n", "\n", "Using the command `concat()` we can concatenate multiple DataFrames.\n", "Let us do this with the 2 ChIP-seq peaks DataFrames:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
chromstartendnearest_geneexp
0chr6143889976143890976NR_027113exp1
1chr117777440677775406NM_003251exp1
2chr31498851214989512NR_046253exp1
3chr2203102822203103822NM_003352exp1
4chr1775903687591368NM_001126112exp1
5chr172690345126904451NM_005165exp1
0chr6_dbb_hap343245414325541NM_002121exp2
1chr79802992798030927NM_018842exp2
2chr99049727190498271NM_178828exp2
3chrX7149664171497641NM_001007exp2
4chrX7350654473507544NR_030258exp2
5chrX152162171152163171NM_001184924exp2
\n", "
" ], "text/plain": [ " chrom start end nearest_gene exp\n", "0 chr6 143889976 143890976 NR_027113 exp1\n", "1 chr11 77774406 77775406 NM_003251 exp1\n", "2 chr3 14988512 14989512 NR_046253 exp1\n", "3 chr2 203102822 203103822 NM_003352 exp1\n", "4 chr17 7590368 7591368 NM_001126112 exp1\n", "5 chr17 26903451 26904451 NM_005165 exp1\n", "0 chr6_dbb_hap3 4324541 4325541 NM_002121 exp2\n", "1 chr7 98029927 98030927 NM_018842 exp2\n", "2 chr9 90497271 90498271 NM_178828 exp2\n", "3 chrX 71496641 71497641 NM_001007 exp2\n", "4 chrX 73506544 73507544 NR_030258 exp2\n", "5 chrX 152162171 152163171 NM_001184924 exp2" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_all_peaks = pd.concat([df_peaks, df_peaks_exp2])\n", "df_all_peaks" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Check out the result, and note how the `exp` column still allows us to distinguish the two experiments. \n", "\n", "You can use `concat()` to combine two DataFrames vertically (`axis=0`) or horizontally (`axis=1`)." ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
chromstartendnearest_geneexp
0chr6143889976143890976NR_027113exp1
1chr117777440677775406NM_003251exp1
2chr31498851214989512NR_046253exp1
3chr2203102822203103822NM_003352exp1
4chr1775903687591368NM_001126112exp1
5chr172690345126904451NM_005165exp1
0chr6_dbb_hap343245414325541NM_002121exp2
1chr79802992798030927NM_018842exp2
2chr99049727190498271NM_178828exp2
3chrX7149664171497641NM_001007exp2
4chrX7350654473507544NR_030258exp2
5chrX152162171152163171NM_001184924exp2
\n", "
" ], "text/plain": [ " chrom start end nearest_gene exp\n", "0 chr6 143889976 143890976 NR_027113 exp1\n", "1 chr11 77774406 77775406 NM_003251 exp1\n", "2 chr3 14988512 14989512 NR_046253 exp1\n", "3 chr2 203102822 203103822 NM_003352 exp1\n", "4 chr17 7590368 7591368 NM_001126112 exp1\n", "5 chr17 26903451 26904451 NM_005165 exp1\n", "0 chr6_dbb_hap3 4324541 4325541 NM_002121 exp2\n", "1 chr7 98029927 98030927 NM_018842 exp2\n", "2 chr9 90497271 90498271 NM_178828 exp2\n", "3 chrX 71496641 71497641 NM_001007 exp2\n", "4 chrX 73506544 73507544 NR_030258 exp2\n", "5 chrX 152162171 152163171 NM_001184924 exp2" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.concat([df_peaks, df_peaks_exp2], axis=0)" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
chromstartendnearest_geneexpchromstartendnearest_geneexp
0chr6143889976143890976NR_027113exp1chr6_dbb_hap343245414325541NM_002121exp2
1chr117777440677775406NM_003251exp1chr79802992798030927NM_018842exp2
2chr31498851214989512NR_046253exp1chr99049727190498271NM_178828exp2
3chr2203102822203103822NM_003352exp1chrX7149664171497641NM_001007exp2
4chr1775903687591368NM_001126112exp1chrX7350654473507544NR_030258exp2
5chr172690345126904451NM_005165exp1chrX152162171152163171NM_001184924exp2
\n", "
" ], "text/plain": [ " chrom start end nearest_gene exp chrom start \\\n", "0 chr6 143889976 143890976 NR_027113 exp1 chr6_dbb_hap3 4324541 \n", "1 chr11 77774406 77775406 NM_003251 exp1 chr7 98029927 \n", "2 chr3 14988512 14989512 NR_046253 exp1 chr9 90497271 \n", "3 chr2 203102822 203103822 NM_003352 exp1 chrX 71496641 \n", "4 chr17 7590368 7591368 NM_001126112 exp1 chrX 73506544 \n", "5 chr17 26903451 26904451 NM_005165 exp1 chrX 152162171 \n", "\n", " end nearest_gene exp \n", "0 4325541 NM_002121 exp2 \n", "1 98030927 NM_018842 exp2 \n", "2 90498271 NM_178828 exp2 \n", "3 71497641 NM_001007 exp2 \n", "4 73507544 NR_030258 exp2 \n", "5 152163171 NM_001184924 exp2 " ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.concat([df_peaks, df_peaks_exp2], axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html) contains more examples.\n", "\n", "### Merging DataFrames: `merge`\n", "\n", "In the above case, we want to simply combine two DataFrames by concatenating, i.e. simply adding the rows of two DataFrames together. You can also join two DataFrames in other ways. The `df_peaks` DataFrame has a `nearest_gene` column. This is a RefSeq gene identifier that is also present in the `df_genes` DataFrame. If we would like to annotate the genes in `df_peaks`, we would only need the relevant columns for the genes that are in `df_peaks`. " ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
chromstartendnearest_geneexprefseqensemblgenename
0chr117777440677775406NM_003251exp1NM_003251ENSG00000151365THRSP
1chr2203102822203103822NM_003352exp1NM_003352ENSG00000116030SUMO1
2chr1775903687591368NM_001126112exp1NM_001126112ENSG00000141510TP53
3chr172690345126904451NM_005165exp1NM_005165ENSG00000109107ALDOC
\n", "
" ], "text/plain": [ " chrom start end nearest_gene exp refseq \\\n", "0 chr11 77774406 77775406 NM_003251 exp1 NM_003251 \n", "1 chr2 203102822 203103822 NM_003352 exp1 NM_003352 \n", "2 chr17 7590368 7591368 NM_001126112 exp1 NM_001126112 \n", "3 chr17 26903451 26904451 NM_005165 exp1 NM_005165 \n", "\n", " ensembl genename \n", "0 ENSG00000151365 THRSP \n", "1 ENSG00000116030 SUMO1 \n", "2 ENSG00000141510 TP53 \n", "3 ENSG00000109107 ALDOC " ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_genes_peaks = df_peaks.merge(df_genes, left_on=\"nearest_gene\", right_on=\"refseq\")\n", "df_genes_peaks" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Check the resulting DataFrame. For every row in `df_peaks`, all the columns from `df_genes` have been added if the `refseq` identifier matches the identifier specified in `nearest_gene`.\n", "\n", "Our DataFrame now has the ENSEMBL gene identifiers, so we can also merge the `df_expr`." ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "df_genes_peaks_expr = df_genes_peaks.merge(df_expr, on='ensembl')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that here the name of the columns that will be used for merging is the same in `df_genes_peaks` and `df_exp`, so we can use the `on='ensembl'` argument." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For convenience, we will use a shorter name for our merged DataFrame" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "df_all = df_genes_peaks_expr" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Tab completion and help\n", "\n", "Now that we have merged our data, we can do several neat things.\n", "\n", "Because `df_all` is a pandas DataFrame, it automatically inherits DataFrame methods.\n", "\n", "You can get an idea if you type:\n", " \n", " df_all.\n", " \n", "And then hit the left `` key.\n", "\n", "Notice that you get a lot of options....\n", "\n", "Suppose you want to know how the `sort_values()` method for a DataFrame works. You can type:\n", "\n", " df_all.sort_values?\n", " \n", "This HELP menu gives you information about the function. type `q` or `` to leave the HELP.\n", "\n", "## Sorting a DataFrame\n", "\n", "Let us try it out:" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
chromstartendnearest_geneexprefseqensemblgenenamesample1sample2sample3sample4
0chr117777440677775406NM_003251exp1NM_003251ENSG00000151365THRSP0.0260.0130.0230.181
1chr1775903687591368NM_001126112exp1NM_001126112ENSG00000141510TP5358.69747.56934.29928.180
2chr172690345126904451NM_005165exp1NM_005165ENSG00000109107ALDOC18.2185.33318.445131.013
\n", "
" ], "text/plain": [ " chrom start end nearest_gene exp refseq \\\n", "0 chr11 77774406 77775406 NM_003251 exp1 NM_003251 \n", "1 chr17 7590368 7591368 NM_001126112 exp1 NM_001126112 \n", "2 chr17 26903451 26904451 NM_005165 exp1 NM_005165 \n", "\n", " ensembl genename sample1 sample2 sample3 sample4 \n", "0 ENSG00000151365 THRSP 0.026 0.013 0.023 0.181 \n", "1 ENSG00000141510 TP53 58.697 47.569 34.299 28.180 \n", "2 ENSG00000109107 ALDOC 18.218 5.333 18.445 131.013 " ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_all.sort_values(by='chrom')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "or" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
chromstartendnearest_geneexprefseqensemblgenenamesample1sample2sample3sample4
1chr1775903687591368NM_001126112exp1NM_001126112ENSG00000141510TP5358.69747.56934.29928.180
2chr172690345126904451NM_005165exp1NM_005165ENSG00000109107ALDOC18.2185.33318.445131.013
0chr117777440677775406NM_003251exp1NM_003251ENSG00000151365THRSP0.0260.0130.0230.181
\n", "
" ], "text/plain": [ " chrom start end nearest_gene exp refseq \\\n", "1 chr17 7590368 7591368 NM_001126112 exp1 NM_001126112 \n", "2 chr17 26903451 26904451 NM_005165 exp1 NM_005165 \n", "0 chr11 77774406 77775406 NM_003251 exp1 NM_003251 \n", "\n", " ensembl genename sample1 sample2 sample3 sample4 \n", "1 ENSG00000141510 TP53 58.697 47.569 34.299 28.180 \n", "2 ENSG00000109107 ALDOC 18.218 5.333 18.445 131.013 \n", "0 ENSG00000151365 THRSP 0.026 0.013 0.023 0.181 " ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_all.sort_values(by='start')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that these commands did not replace the original DataFrame `df_all`!\n", "\n", "\n", "## Using `[ ]` to obtain DataFrame columns or rows\n", "\n", "Using `[ ]` we can obtain specific parts of the DataFrame.\n" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 THRSP\n", "1 TP53\n", "2 ALDOC\n", "Name: genename, dtype: object" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_all['genename']" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
genenameexp
0THRSPexp1
1TP53exp1
2ALDOCexp1
\n", "
" ], "text/plain": [ " genename exp\n", "0 THRSP exp1\n", "1 TP53 exp1\n", "2 ALDOC exp1" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "## or, for multiple columns:\n", "df_all[['genename', 'exp']]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using `iloc` we can do the same using indexes\n", "\n", "One index by default points to **rows** of a DataFrame\n", "\n", "To get the 2nd row, type:\n" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "chrom chr17\n", "start 26903451\n", "end 26904451\n", "nearest_gene NM_005165\n", "exp exp1\n", "refseq NM_005165\n", "ensembl ENSG00000109107\n", "genename ALDOC\n", "sample1 18.218\n", "sample2 5.333\n", "sample3 18.445\n", "sample4 131.013\n", "Name: 2, dtype: object" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_all.iloc[2]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Or, for multiple rows:" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
chromstartendnearest_geneexprefseqensemblgenenamesample1sample2sample3sample4
2chr172690345126904451NM_005165exp1NM_005165ENSG00000109107ALDOC18.2185.33318.445131.013
\n", "
" ], "text/plain": [ " chrom start end nearest_gene exp refseq ensembl \\\n", "2 chr17 26903451 26904451 NM_005165 exp1 NM_005165 ENSG00000109107 \n", "\n", " genename sample1 sample2 sample3 sample4 \n", "2 ALDOC 18.218 5.333 18.445 131.013 " ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_all.iloc[2:5]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This works very much like indexes of a `list`\n", "\n", "If you use two indices, you can obtain specific rows *and* columns.\n", "\n", "The first index is always *rows*, the second *columns*.\n" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
exprefseqensemblgenenamesample1sample2sample3
2exp1NM_005165ENSG00000109107ALDOC18.2185.33318.445
\n", "
" ], "text/plain": [ " exp refseq ensembl genename sample1 sample2 sample3\n", "2 exp1 NM_005165 ENSG00000109107 ALDOC 18.218 5.333 18.445" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_all.iloc[2:5, 4:11]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Removing missing values or `NaN`s.\n", "\n", "In many cases DataFrames contains `NaN`, missing values, and you often have to remove them.\n", "\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
chromstartendnearest_geneexprefseqensemblgenenamesample1sample2sample3sample4full_name
0chr117777440677775406NM_003251exp1NM_003251ENSG00000151365THRSP0.0260.0130.0230.181NaN
1chr1775903687591368NM_001126112exp1NM_001126112ENSG00000141510TP5358.69747.56934.29928.180tumor protein p53
2chr172690345126904451NM_005165exp1NM_005165ENSG00000109107ALDOC18.2185.33318.445131.013aldolase, fructose-bisphosphate C
\n", "
" ], "text/plain": [ " chrom start end nearest_gene exp refseq \\\n", "0 chr11 77774406 77775406 NM_003251 exp1 NM_003251 \n", "1 chr17 7590368 7591368 NM_001126112 exp1 NM_001126112 \n", "2 chr17 26903451 26904451 NM_005165 exp1 NM_005165 \n", "\n", " ensembl genename sample1 sample2 sample3 sample4 \\\n", "0 ENSG00000151365 THRSP 0.026 0.013 0.023 0.181 \n", "1 ENSG00000141510 TP53 58.697 47.569 34.299 28.180 \n", "2 ENSG00000109107 ALDOC 18.218 5.333 18.445 131.013 \n", "\n", " full_name \n", "0 NaN \n", "1 tumor protein p53 \n", "2 aldolase, fructose-bisphosphate C " ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gene_info = pd.DataFrame({\"full_name\":[\"tumor protein p53\", \"aldolase, fructose-bisphosphate C\"]},\n", " index=[\"TP53\", \"ALDOC\"])\n", "df_all = df_all.join(gene_info, on=\"genename\")\n", "df_all" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can obtain a `boolean` (`True`/`False`) for the column called `full_name` like this:" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 True\n", "1 False\n", "2 False\n", "Name: full_name, dtype: bool" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_all['full_name'].isnull()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By using the `~` symbol we can invert this `boolean` column. The `~` acts as `not` for a pandas Series." ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 False\n", "1 True\n", "2 True\n", "Name: full_name, dtype: bool" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "~df_all['full_name'].isnull()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can use this strategy to remove all rows that have `NaN` in the column `full_name`." ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
chromstartendnearest_geneexprefseqensemblgenenamesample1sample2sample3sample4full_name
1chr1775903687591368NM_001126112exp1NM_001126112ENSG00000141510TP5358.69747.56934.29928.180tumor protein p53
2chr172690345126904451NM_005165exp1NM_005165ENSG00000109107ALDOC18.2185.33318.445131.013aldolase, fructose-bisphosphate C
\n", "
" ], "text/plain": [ " chrom start end nearest_gene exp refseq \\\n", "1 chr17 7590368 7591368 NM_001126112 exp1 NM_001126112 \n", "2 chr17 26903451 26904451 NM_005165 exp1 NM_005165 \n", "\n", " ensembl genename sample1 sample2 sample3 sample4 \\\n", "1 ENSG00000141510 TP53 58.697 47.569 34.299 28.180 \n", "2 ENSG00000109107 ALDOC 18.218 5.333 18.445 131.013 \n", "\n", " full_name \n", "1 tumor protein p53 \n", "2 aldolase, fructose-bisphosphate C " ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_all[~df_all['full_name'].isnull()]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Make sure you understand how this worked!\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Basic calculations\n", "\n", "There are a lot of methods that allows you to do basic calculation on DataFrames.\n", "\n", "As an illustration we will calculate the pairwise correlations between the columns `sample1`, `sample2`, `sample3`, and `sample4`.\n", "\n", "We take the appropriate columns first\n" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [], "source": [ "df_1 = df_all[['sample1', 'sample2', 'sample3', 'sample4']]\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can use the pandas `corr` method to calculate all pairwise correlations:" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sample1sample2sample3sample4
sample11.0000000.9789710.966614-0.011258
sample20.9789711.0000000.894017-0.215006
sample30.9666140.8940171.0000000.245337
sample4-0.011258-0.2150060.2453371.000000
\n", "
" ], "text/plain": [ " sample1 sample2 sample3 sample4\n", "sample1 1.000000 0.978971 0.966614 -0.011258\n", "sample2 0.978971 1.000000 0.894017 -0.215006\n", "sample3 0.966614 0.894017 1.000000 0.245337\n", "sample4 -0.011258 -0.215006 0.245337 1.000000" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_1.corr()" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" } }, "nbformat": 4, "nbformat_minor": 4 }