{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Introduction to pandas DataFrames\n",
    "\n",
    "In this tutorial you are going to use the module `pandas`.\n",
    "We will focus on DataFrames, a convenient object type used in pandas.\n",
    "\n",
    "```{note}\n",
    "This tutorial has no exercises, but serves as an introduction to pandas.\n",
    "It will illustrate what you can do with pandas DataFrames. \n",
    "```\n",
    "\n",
    "If you want more information on pandas, a (quick) tutorial is found here:\n",
    "\n",
    "[https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html](https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html)\n",
    "\n",
    "## Importing the pandas module\n",
    "\n",
    "We start by importing the `pandas` module, for which we use the following command:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that we imported the `pandas` module and renamed it to `pd`. This allows us to use all `pandas` functionality using `pd` (instead of `pandas`). This is very common: it saves typing since `pd` is shorter than `pandas`.\n",
    "\n",
    "## Creating a pandas DataFrame from lists\n",
    "\n",
    "Here we will create a pandas DataFrame containing the following columns:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "chrom = ['chr6', 'chr11', 'chr3', 'chr2', 'chr17', 'chr17']\n",
    "start = [143889976, 77774406, 14988512, 203102822, 7590368, 26903451]\n",
    "end = [143890976, 77775406, 14989512, 203103822, 7591368, 26904451]\n",
    "refseq =['NR_027113', 'NM_003251', 'NR_046253', 'NM_003352', 'NM_001126112', 'NM_005165']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this case we create an empty DataFrame first, and we fill it up with the individual columns."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_peaks = pd.DataFrame()\n",
    "\n",
    "df_peaks['chrom'] = chrom\n",
    "df_peaks['start'] = start\n",
    "df_peaks['end'] = end\n",
    "df_peaks['nearest_gene'] = refseq"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can check what your DataFrame looks like using  `print()`, or simply type `df_peaks`. The latter only works in a Jupyter notebook, but the advantage is that you get a nicely formatted table!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "   chrom      start        end  nearest_gene\n",
      "0   chr6  143889976  143890976     NR_027113\n",
      "1  chr11   77774406   77775406     NM_003251\n",
      "2   chr3   14988512   14989512     NR_046253\n",
      "3   chr2  203102822  203103822     NM_003352\n",
      "4  chr17    7590368    7591368  NM_001126112\n",
      "5  chr17   26903451   26904451     NM_005165\n"
     ]
    }
   ],
   "source": [
    "print(df_peaks)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>chrom</th>\n",
       "      <th>start</th>\n",
       "      <th>end</th>\n",
       "      <th>nearest_gene</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>chr6</td>\n",
       "      <td>143889976</td>\n",
       "      <td>143890976</td>\n",
       "      <td>NR_027113</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>chr11</td>\n",
       "      <td>77774406</td>\n",
       "      <td>77775406</td>\n",
       "      <td>NM_003251</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>chr3</td>\n",
       "      <td>14988512</td>\n",
       "      <td>14989512</td>\n",
       "      <td>NR_046253</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>chr2</td>\n",
       "      <td>203102822</td>\n",
       "      <td>203103822</td>\n",
       "      <td>NM_003352</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>chr17</td>\n",
       "      <td>7590368</td>\n",
       "      <td>7591368</td>\n",
       "      <td>NM_001126112</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>chr17</td>\n",
       "      <td>26903451</td>\n",
       "      <td>26904451</td>\n",
       "      <td>NM_005165</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   chrom      start        end  nearest_gene\n",
       "0   chr6  143889976  143890976     NR_027113\n",
       "1  chr11   77774406   77775406     NM_003251\n",
       "2   chr3   14988512   14989512     NR_046253\n",
       "3   chr2  203102822  203103822     NM_003352\n",
       "4  chr17    7590368    7591368  NM_001126112\n",
       "5  chr17   26903451   26904451     NM_005165"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_peaks"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We make a second DataFrame, which corresponds to a second experiment ('exp2')\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "chr_exp2 = [\"chr6_dbb_hap3\", \"chr7\", \"chr9\", \"chrX\", \"chrX\", \"chrX\"]\n",
    "start_exp2 = [4324541, 98029927, 90497271, 71496641, 73506544, 152162171]\n",
    "end_exp2 = [4325541, 98030927, 90498271, 71497641, 73507544, 152163171]\n",
    "refseq_exp2 = [\"NM_002121\", \"NM_018842\", \"NM_178828\", \"NM_001007\", \"NR_030258\", \"NM_001184924\"]\n",
    "\n",
    "df_peaks_exp2 = pd.DataFrame()\n",
    "df_peaks_exp2['chrom'] = chr_exp2\n",
    "df_peaks_exp2['start'] = start_exp2\n",
    "df_peaks_exp2['end'] = end_exp2\n",
    "df_peaks_exp2['nearest_gene'] = refseq_exp2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Look at both DataFrames. The `head()` function shows the first five rows of a DataFrame.\n",
    "\n",
    "They have the same column layout (`chrom`, `start`, `end`, `refseq`)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>chrom</th>\n",
       "      <th>start</th>\n",
       "      <th>end</th>\n",
       "      <th>nearest_gene</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>chr6</td>\n",
       "      <td>143889976</td>\n",
       "      <td>143890976</td>\n",
       "      <td>NR_027113</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>chr11</td>\n",
       "      <td>77774406</td>\n",
       "      <td>77775406</td>\n",
       "      <td>NM_003251</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>chr3</td>\n",
       "      <td>14988512</td>\n",
       "      <td>14989512</td>\n",
       "      <td>NR_046253</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>chr2</td>\n",
       "      <td>203102822</td>\n",
       "      <td>203103822</td>\n",
       "      <td>NM_003352</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>chr17</td>\n",
       "      <td>7590368</td>\n",
       "      <td>7591368</td>\n",
       "      <td>NM_001126112</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   chrom      start        end  nearest_gene\n",
       "0   chr6  143889976  143890976     NR_027113\n",
       "1  chr11   77774406   77775406     NM_003251\n",
       "2   chr3   14988512   14989512     NR_046253\n",
       "3   chr2  203102822  203103822     NM_003352\n",
       "4  chr17    7590368    7591368  NM_001126112"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_peaks.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>chrom</th>\n",
       "      <th>start</th>\n",
       "      <th>end</th>\n",
       "      <th>nearest_gene</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>chr6_dbb_hap3</td>\n",
       "      <td>4324541</td>\n",
       "      <td>4325541</td>\n",
       "      <td>NM_002121</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>chr7</td>\n",
       "      <td>98029927</td>\n",
       "      <td>98030927</td>\n",
       "      <td>NM_018842</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>chr9</td>\n",
       "      <td>90497271</td>\n",
       "      <td>90498271</td>\n",
       "      <td>NM_178828</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>chrX</td>\n",
       "      <td>71496641</td>\n",
       "      <td>71497641</td>\n",
       "      <td>NM_001007</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>chrX</td>\n",
       "      <td>73506544</td>\n",
       "      <td>73507544</td>\n",
       "      <td>NR_030258</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "           chrom     start       end nearest_gene\n",
       "0  chr6_dbb_hap3   4324541   4325541    NM_002121\n",
       "1           chr7  98029927  98030927    NM_018842\n",
       "2           chr9  90497271  90498271    NM_178828\n",
       "3           chrX  71496641  71497641    NM_001007\n",
       "4           chrX  73506544  73507544    NR_030258"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_peaks_exp2.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "They both represent peaks of a ChIP-seq experiment.\n",
    "\n",
    "For both DataFrames we can add an additional column indicating from which experiment the data comes ('exp1' or 'exp2')\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_peaks['exp'] = 'exp1'\n",
    "df_peaks_exp2['exp'] = 'exp2'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Take a look at the two DataFrames, and note how we did this:\n",
    "\n",
    "* We used `[ ]` to define a new colum, with the column name between `[ ]`\n",
    "* We used only one string (`'exp1'` or `'exp2'`), but this was automatically expanded to the whole column\n",
    "\n",
    "\n",
    "## Creating a pandas DataFrame using a `dictionary`\n",
    "\n",
    "We can also create a pandas DataFrame from a dictionary. The dictionary should contain *column names as keys* and *lists as values* for the columns. To show this in action, we will use the three following lists:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "refseq = [\"NM_001007\", \"NM_001098638\", \"NM_001110221\", \"NM_001126112\", \"NM_001142599\", \n",
    "          \"NM_001164283\", \"NM_001164386\", \"NM_001184924\", \"NM_001290043\",\"NM_002121\",\n",
    "          \"NM_002441\", \"NM_003251\", \"NM_003352\", \"NM_004263\", \"NM_005165\", \"NM_006017\",\n",
    "          \"NM_007145\", \"NM_016522\", \"NM_018842\", \"NM_024097\", \"NM_147181\", \"NM_173856\",\n",
    "          \"NM_178828\"]\n",
    "\n",
    "ensembl = [\"ENSG00000198034\", \"ENSG00000166439\", \"ENSG00000121742\", \"ENSG00000141510\",\n",
    "           \"ENSG00000072682\", \"ENSG00000056972\", \"ENSG00000158987\", \"ENSG00000198883\",\n",
    "           \"ENSG00000225967\", \"ENSG00000237710\", \"ENSG00000235569\", \"ENSG00000151365\",\n",
    "           \"ENSG00000116030\", \"ENSG00000135622\", \"ENSG00000109107\", \"ENSG00000007062\",\n",
    "           \"ENSG00000167635\", \"ENSG00000182667\", \"ENSG00000006453\", \"ENSG00000164008\",\n",
    "           \"ENSG00000185774\", \"ENSG00000196131\", \"ENSG00000177992\"]\n",
    "\n",
    "genename = [\"RPS4X\", \"RNF169\", \"GJB6\", \"TP53\", \"P4HA2\", \"TRAF3IP2\", \"RAPGEF6\", \"PNMA5\", \n",
    "            \"TAP2\", \"HLA-DPB1\", \"MSH5\", \"THRSP\", \"SUMO1\", \"SEMA4F\", \"ALDOC\", \"PROM1\",\n",
    "            \"ZNF146\", \"NTM\", \"BAIAP2L1\", \"C1orf50\", \"KCNIP4\", \"VN1R2\", \"SPATA31E1\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First, we use these lists to create a dictionary\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "dict_genes = {\n",
    "    'refseq': refseq,\n",
    "    'ensembl': ensembl,\n",
    "    'genename': genename\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now that we have the dictionary, we can create the DataFrame.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_genes = pd.DataFrame(dict_genes)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Look at the resulting DataFrame called `df_genes`. This DataFrame will help us to map RefSeq to ENSEMBL gene identifiers."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The first three rows."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>refseq</th>\n",
       "      <th>ensembl</th>\n",
       "      <th>genename</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>NM_001007</td>\n",
       "      <td>ENSG00000198034</td>\n",
       "      <td>RPS4X</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>NM_001098638</td>\n",
       "      <td>ENSG00000166439</td>\n",
       "      <td>RNF169</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>NM_001110221</td>\n",
       "      <td>ENSG00000121742</td>\n",
       "      <td>GJB6</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "         refseq          ensembl genename\n",
       "0     NM_001007  ENSG00000198034    RPS4X\n",
       "1  NM_001098638  ENSG00000166439   RNF169\n",
       "2  NM_001110221  ENSG00000121742     GJB6"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_genes.head(3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The last three rows."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>refseq</th>\n",
       "      <th>ensembl</th>\n",
       "      <th>genename</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>20</th>\n",
       "      <td>NM_147181</td>\n",
       "      <td>ENSG00000185774</td>\n",
       "      <td>KCNIP4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>21</th>\n",
       "      <td>NM_173856</td>\n",
       "      <td>ENSG00000196131</td>\n",
       "      <td>VN1R2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>22</th>\n",
       "      <td>NM_178828</td>\n",
       "      <td>ENSG00000177992</td>\n",
       "      <td>SPATA31E1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       refseq          ensembl   genename\n",
       "20  NM_147181  ENSG00000185774     KCNIP4\n",
       "21  NM_173856  ENSG00000196131      VN1R2\n",
       "22  NM_178828  ENSG00000177992  SPATA31E1"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_genes.tail(3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Or three random rows."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>refseq</th>\n",
       "      <th>ensembl</th>\n",
       "      <th>genename</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>NM_001290043</td>\n",
       "      <td>ENSG00000225967</td>\n",
       "      <td>TAP2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>NM_001184924</td>\n",
       "      <td>ENSG00000198883</td>\n",
       "      <td>PNMA5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>NM_003251</td>\n",
       "      <td>ENSG00000151365</td>\n",
       "      <td>THRSP</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "          refseq          ensembl genename\n",
       "8   NM_001290043  ENSG00000225967     TAP2\n",
       "7   NM_001184924  ENSG00000198883    PNMA5\n",
       "11     NM_003251  ENSG00000151365    THRSP"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_genes.sample(3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, we create another DataFrame with gene expression values\n",
    "\n",
    "Here, we will use the dictionary method to create a `DataFrame` without explicitly creating the dictionary."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "ensembl_1 = [\"ENSG00000237710\", \"ENSG00000182667\", \"ENSG00000121742\", \"ENSG00000158987\", \n",
    "             \"ENSG00000196131\", \"ENSG00000151365\", \"ENSG00000056972\", \"ENSG00000235569\", \n",
    "             \"ENSG00000006453\", \"ENSG00000198883\", \"ENSG00000225967\", \"ENSG00000166439\", \n",
    "             \"ENSG00000135622\", \"ENSG00000007062\", \"ENSG00000109107\", \"ENSG00000167635\", \n",
    "             \"ENSG00000164008\", \"ENSG00000141510\", \"ENSG00000177992\"]\n",
    "\n",
    "sample_1 = [None, 1.686, 0.063, 4.222, 0.021, 0.026, 8.169, None, 16.159, 0.08, None, \n",
    "            3.972, 0.844, 26.208, 18.218, 91.049, 3.828, 58.697, 0.024]\n",
    "\n",
    "sample_2 = [None, 1.159, 0.011, 4.291, 0.004, 0.013, 0.306, None, 12.671, 0, None, \n",
    "            6.983, 5.352, 14.336, 5.333, 60.28, 5.137, 47.569, 0]\n",
    "\n",
    "sample_3 = [None, 6.865, 0, 4.727, 0.039, 0.023, 10.459, None, 0.225, 0.011, None, \n",
    "            7.007, 3.402, 6.076, 18.445, 28.716, 4.128, 34.299, 0]\n",
    "\n",
    "sample_4 = [None, 15.691, 2.16, 5.284, 0, 0.181, 1.428, None, 0.184, 0.395, None, \n",
    "            11.842, 4.186, 6.01, 131.013, 25.126, 9.933, 28.18, 0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [],
   "source": [
    "# The next code could all be written on one line. However, to improve readability\n",
    "# we write in this manner.\n",
    "df_expr = pd.DataFrame(\n",
    "    {\n",
    "        'ensembl': ensembl_1, \n",
    "        'sample1': sample_1, \n",
    "        'sample2': sample_2, \n",
    "        'sample3': sample_3, \n",
    "        'sample4': sample_4\n",
    "    }\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Look at the DataFrames using `print()` or, in Jupyter, by just typing the DataFrame name followed by `<Enter>`.\n",
    "\n",
    "In the case where you would have very large DataFrames, you can always use `df.head()` to show the first rows, or `df.tail()` to show the last rows, where `df` is your DataFrame name."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Merging pandas DataFrames\n",
    "\n",
    "\n",
    "### Merging DataFrames on column values\n",
    "\n",
    "We now have 3 types of data in our 3 DataFrames:\n",
    "\n",
    "* ChIP-seq peaks (`df_peaks`)\n",
    "* A RefSeq to ENSEMBL gene identifier mapping (`df_genes`)\n",
    "* Expression values of genes in 4 samples, with ENSEMBL identifiers (`df_expr`)\n",
    "\n",
    "This is a typical example for any (bioinformatical) analysis in which you want to incorporate different data types, often from different sources. You need to merge them if you want to do anything useful. Pandas DataFrames can be of big help here.\n",
    "\n",
    "So let us try to merge these pieces of data.\n",
    "\n",
    "### Merging DataFrames: `concat`\n",
    "\n",
    "Using the command `concat()` we can concatenate multiple DataFrames.\n",
    "Let us do this with the 2 ChIP-seq peaks DataFrames:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>chrom</th>\n",
       "      <th>start</th>\n",
       "      <th>end</th>\n",
       "      <th>nearest_gene</th>\n",
       "      <th>exp</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>chr6</td>\n",
       "      <td>143889976</td>\n",
       "      <td>143890976</td>\n",
       "      <td>NR_027113</td>\n",
       "      <td>exp1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>chr11</td>\n",
       "      <td>77774406</td>\n",
       "      <td>77775406</td>\n",
       "      <td>NM_003251</td>\n",
       "      <td>exp1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>chr3</td>\n",
       "      <td>14988512</td>\n",
       "      <td>14989512</td>\n",
       "      <td>NR_046253</td>\n",
       "      <td>exp1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>chr2</td>\n",
       "      <td>203102822</td>\n",
       "      <td>203103822</td>\n",
       "      <td>NM_003352</td>\n",
       "      <td>exp1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>chr17</td>\n",
       "      <td>7590368</td>\n",
       "      <td>7591368</td>\n",
       "      <td>NM_001126112</td>\n",
       "      <td>exp1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>chr17</td>\n",
       "      <td>26903451</td>\n",
       "      <td>26904451</td>\n",
       "      <td>NM_005165</td>\n",
       "      <td>exp1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>chr6_dbb_hap3</td>\n",
       "      <td>4324541</td>\n",
       "      <td>4325541</td>\n",
       "      <td>NM_002121</td>\n",
       "      <td>exp2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>chr7</td>\n",
       "      <td>98029927</td>\n",
       "      <td>98030927</td>\n",
       "      <td>NM_018842</td>\n",
       "      <td>exp2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>chr9</td>\n",
       "      <td>90497271</td>\n",
       "      <td>90498271</td>\n",
       "      <td>NM_178828</td>\n",
       "      <td>exp2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>chrX</td>\n",
       "      <td>71496641</td>\n",
       "      <td>71497641</td>\n",
       "      <td>NM_001007</td>\n",
       "      <td>exp2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>chrX</td>\n",
       "      <td>73506544</td>\n",
       "      <td>73507544</td>\n",
       "      <td>NR_030258</td>\n",
       "      <td>exp2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>chrX</td>\n",
       "      <td>152162171</td>\n",
       "      <td>152163171</td>\n",
       "      <td>NM_001184924</td>\n",
       "      <td>exp2</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "           chrom      start        end  nearest_gene   exp\n",
       "0           chr6  143889976  143890976     NR_027113  exp1\n",
       "1          chr11   77774406   77775406     NM_003251  exp1\n",
       "2           chr3   14988512   14989512     NR_046253  exp1\n",
       "3           chr2  203102822  203103822     NM_003352  exp1\n",
       "4          chr17    7590368    7591368  NM_001126112  exp1\n",
       "5          chr17   26903451   26904451     NM_005165  exp1\n",
       "0  chr6_dbb_hap3    4324541    4325541     NM_002121  exp2\n",
       "1           chr7   98029927   98030927     NM_018842  exp2\n",
       "2           chr9   90497271   90498271     NM_178828  exp2\n",
       "3           chrX   71496641   71497641     NM_001007  exp2\n",
       "4           chrX   73506544   73507544     NR_030258  exp2\n",
       "5           chrX  152162171  152163171  NM_001184924  exp2"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_all_peaks = pd.concat([df_peaks, df_peaks_exp2])\n",
    "df_all_peaks"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Check out the result, and note how the `exp` column still allows us to distinguish the two experiments. \n",
    "\n",
    "You can use `concat()` to combine two DataFrames vertically (`axis=0`) or horizontally (`axis=1`)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>chrom</th>\n",
       "      <th>start</th>\n",
       "      <th>end</th>\n",
       "      <th>nearest_gene</th>\n",
       "      <th>exp</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>chr6</td>\n",
       "      <td>143889976</td>\n",
       "      <td>143890976</td>\n",
       "      <td>NR_027113</td>\n",
       "      <td>exp1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>chr11</td>\n",
       "      <td>77774406</td>\n",
       "      <td>77775406</td>\n",
       "      <td>NM_003251</td>\n",
       "      <td>exp1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>chr3</td>\n",
       "      <td>14988512</td>\n",
       "      <td>14989512</td>\n",
       "      <td>NR_046253</td>\n",
       "      <td>exp1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>chr2</td>\n",
       "      <td>203102822</td>\n",
       "      <td>203103822</td>\n",
       "      <td>NM_003352</td>\n",
       "      <td>exp1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>chr17</td>\n",
       "      <td>7590368</td>\n",
       "      <td>7591368</td>\n",
       "      <td>NM_001126112</td>\n",
       "      <td>exp1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>chr17</td>\n",
       "      <td>26903451</td>\n",
       "      <td>26904451</td>\n",
       "      <td>NM_005165</td>\n",
       "      <td>exp1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>chr6_dbb_hap3</td>\n",
       "      <td>4324541</td>\n",
       "      <td>4325541</td>\n",
       "      <td>NM_002121</td>\n",
       "      <td>exp2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>chr7</td>\n",
       "      <td>98029927</td>\n",
       "      <td>98030927</td>\n",
       "      <td>NM_018842</td>\n",
       "      <td>exp2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>chr9</td>\n",
       "      <td>90497271</td>\n",
       "      <td>90498271</td>\n",
       "      <td>NM_178828</td>\n",
       "      <td>exp2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>chrX</td>\n",
       "      <td>71496641</td>\n",
       "      <td>71497641</td>\n",
       "      <td>NM_001007</td>\n",
       "      <td>exp2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>chrX</td>\n",
       "      <td>73506544</td>\n",
       "      <td>73507544</td>\n",
       "      <td>NR_030258</td>\n",
       "      <td>exp2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>chrX</td>\n",
       "      <td>152162171</td>\n",
       "      <td>152163171</td>\n",
       "      <td>NM_001184924</td>\n",
       "      <td>exp2</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "           chrom      start        end  nearest_gene   exp\n",
       "0           chr6  143889976  143890976     NR_027113  exp1\n",
       "1          chr11   77774406   77775406     NM_003251  exp1\n",
       "2           chr3   14988512   14989512     NR_046253  exp1\n",
       "3           chr2  203102822  203103822     NM_003352  exp1\n",
       "4          chr17    7590368    7591368  NM_001126112  exp1\n",
       "5          chr17   26903451   26904451     NM_005165  exp1\n",
       "0  chr6_dbb_hap3    4324541    4325541     NM_002121  exp2\n",
       "1           chr7   98029927   98030927     NM_018842  exp2\n",
       "2           chr9   90497271   90498271     NM_178828  exp2\n",
       "3           chrX   71496641   71497641     NM_001007  exp2\n",
       "4           chrX   73506544   73507544     NR_030258  exp2\n",
       "5           chrX  152162171  152163171  NM_001184924  exp2"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd.concat([df_peaks, df_peaks_exp2], axis=0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>chrom</th>\n",
       "      <th>start</th>\n",
       "      <th>end</th>\n",
       "      <th>nearest_gene</th>\n",
       "      <th>exp</th>\n",
       "      <th>chrom</th>\n",
       "      <th>start</th>\n",
       "      <th>end</th>\n",
       "      <th>nearest_gene</th>\n",
       "      <th>exp</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>chr6</td>\n",
       "      <td>143889976</td>\n",
       "      <td>143890976</td>\n",
       "      <td>NR_027113</td>\n",
       "      <td>exp1</td>\n",
       "      <td>chr6_dbb_hap3</td>\n",
       "      <td>4324541</td>\n",
       "      <td>4325541</td>\n",
       "      <td>NM_002121</td>\n",
       "      <td>exp2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>chr11</td>\n",
       "      <td>77774406</td>\n",
       "      <td>77775406</td>\n",
       "      <td>NM_003251</td>\n",
       "      <td>exp1</td>\n",
       "      <td>chr7</td>\n",
       "      <td>98029927</td>\n",
       "      <td>98030927</td>\n",
       "      <td>NM_018842</td>\n",
       "      <td>exp2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>chr3</td>\n",
       "      <td>14988512</td>\n",
       "      <td>14989512</td>\n",
       "      <td>NR_046253</td>\n",
       "      <td>exp1</td>\n",
       "      <td>chr9</td>\n",
       "      <td>90497271</td>\n",
       "      <td>90498271</td>\n",
       "      <td>NM_178828</td>\n",
       "      <td>exp2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>chr2</td>\n",
       "      <td>203102822</td>\n",
       "      <td>203103822</td>\n",
       "      <td>NM_003352</td>\n",
       "      <td>exp1</td>\n",
       "      <td>chrX</td>\n",
       "      <td>71496641</td>\n",
       "      <td>71497641</td>\n",
       "      <td>NM_001007</td>\n",
       "      <td>exp2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>chr17</td>\n",
       "      <td>7590368</td>\n",
       "      <td>7591368</td>\n",
       "      <td>NM_001126112</td>\n",
       "      <td>exp1</td>\n",
       "      <td>chrX</td>\n",
       "      <td>73506544</td>\n",
       "      <td>73507544</td>\n",
       "      <td>NR_030258</td>\n",
       "      <td>exp2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>chr17</td>\n",
       "      <td>26903451</td>\n",
       "      <td>26904451</td>\n",
       "      <td>NM_005165</td>\n",
       "      <td>exp1</td>\n",
       "      <td>chrX</td>\n",
       "      <td>152162171</td>\n",
       "      <td>152163171</td>\n",
       "      <td>NM_001184924</td>\n",
       "      <td>exp2</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   chrom      start        end  nearest_gene   exp          chrom      start  \\\n",
       "0   chr6  143889976  143890976     NR_027113  exp1  chr6_dbb_hap3    4324541   \n",
       "1  chr11   77774406   77775406     NM_003251  exp1           chr7   98029927   \n",
       "2   chr3   14988512   14989512     NR_046253  exp1           chr9   90497271   \n",
       "3   chr2  203102822  203103822     NM_003352  exp1           chrX   71496641   \n",
       "4  chr17    7590368    7591368  NM_001126112  exp1           chrX   73506544   \n",
       "5  chr17   26903451   26904451     NM_005165  exp1           chrX  152162171   \n",
       "\n",
       "         end  nearest_gene   exp  \n",
       "0    4325541     NM_002121  exp2  \n",
       "1   98030927     NM_018842  exp2  \n",
       "2   90498271     NM_178828  exp2  \n",
       "3   71497641     NM_001007  exp2  \n",
       "4   73507544     NR_030258  exp2  \n",
       "5  152163171  NM_001184924  exp2  "
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd.concat([df_peaks, df_peaks_exp2], axis=1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html) contains more examples.\n",
    "\n",
    "### Merging DataFrames: `merge`\n",
    "\n",
    "In the above case, we want to simply combine two DataFrames by concatenating, i.e. simply adding the rows of two DataFrames together. You can also join two DataFrames in other ways. The `df_peaks` DataFrame has a `nearest_gene` column. This is a RefSeq gene identifier that is also present in the `df_genes` DataFrame. If we would like to annotate the genes in `df_peaks`, we would only need the relevant columns for the genes that are in `df_peaks`. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>chrom</th>\n",
       "      <th>start</th>\n",
       "      <th>end</th>\n",
       "      <th>nearest_gene</th>\n",
       "      <th>exp</th>\n",
       "      <th>refseq</th>\n",
       "      <th>ensembl</th>\n",
       "      <th>genename</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>chr11</td>\n",
       "      <td>77774406</td>\n",
       "      <td>77775406</td>\n",
       "      <td>NM_003251</td>\n",
       "      <td>exp1</td>\n",
       "      <td>NM_003251</td>\n",
       "      <td>ENSG00000151365</td>\n",
       "      <td>THRSP</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>chr2</td>\n",
       "      <td>203102822</td>\n",
       "      <td>203103822</td>\n",
       "      <td>NM_003352</td>\n",
       "      <td>exp1</td>\n",
       "      <td>NM_003352</td>\n",
       "      <td>ENSG00000116030</td>\n",
       "      <td>SUMO1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>chr17</td>\n",
       "      <td>7590368</td>\n",
       "      <td>7591368</td>\n",
       "      <td>NM_001126112</td>\n",
       "      <td>exp1</td>\n",
       "      <td>NM_001126112</td>\n",
       "      <td>ENSG00000141510</td>\n",
       "      <td>TP53</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>chr17</td>\n",
       "      <td>26903451</td>\n",
       "      <td>26904451</td>\n",
       "      <td>NM_005165</td>\n",
       "      <td>exp1</td>\n",
       "      <td>NM_005165</td>\n",
       "      <td>ENSG00000109107</td>\n",
       "      <td>ALDOC</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   chrom      start        end  nearest_gene   exp        refseq  \\\n",
       "0  chr11   77774406   77775406     NM_003251  exp1     NM_003251   \n",
       "1   chr2  203102822  203103822     NM_003352  exp1     NM_003352   \n",
       "2  chr17    7590368    7591368  NM_001126112  exp1  NM_001126112   \n",
       "3  chr17   26903451   26904451     NM_005165  exp1     NM_005165   \n",
       "\n",
       "           ensembl genename  \n",
       "0  ENSG00000151365    THRSP  \n",
       "1  ENSG00000116030    SUMO1  \n",
       "2  ENSG00000141510     TP53  \n",
       "3  ENSG00000109107    ALDOC  "
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_genes_peaks = df_peaks.merge(df_genes, left_on=\"nearest_gene\", right_on=\"refseq\")\n",
    "df_genes_peaks"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Check the resulting DataFrame. For every row in `df_peaks`, all the columns from `df_genes` have been added if the `refseq` identifier matches the identifier specified in `nearest_gene`.\n",
    "\n",
    "Our DataFrame now has the ENSEMBL gene identifiers, so we can also merge the `df_expr`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_genes_peaks_expr = df_genes_peaks.merge(df_expr, on='ensembl')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that here the name of the columns that will be used for merging is the same in `df_genes_peaks` and `df_exp`, so we can use the `on='ensembl'` argument."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For convenience, we will use a shorter name for our merged DataFrame"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_all = df_genes_peaks_expr"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Tab completion and help\n",
    "\n",
    "Now that we have merged our data, we can do several neat things.\n",
    "\n",
    "Because `df_all` is a pandas DataFrame, it automatically inherits DataFrame methods.\n",
    "\n",
    "You can get an idea if you type:\n",
    "    \n",
    "    df_all.\n",
    "    \n",
    "And then hit the left `<Tab>` key.\n",
    "\n",
    "Notice that you get a lot of options....\n",
    "\n",
    "Suppose you want to know how the `sort_values()` method for a DataFrame works. You can type:\n",
    "\n",
    "    df_all.sort_values?\n",
    "    \n",
    "This HELP menu gives you information about the function. type `q` or `<ESC>` to leave the HELP.\n",
    "\n",
    "## Sorting a DataFrame\n",
    "\n",
    "Let us try it out:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>chrom</th>\n",
       "      <th>start</th>\n",
       "      <th>end</th>\n",
       "      <th>nearest_gene</th>\n",
       "      <th>exp</th>\n",
       "      <th>refseq</th>\n",
       "      <th>ensembl</th>\n",
       "      <th>genename</th>\n",
       "      <th>sample1</th>\n",
       "      <th>sample2</th>\n",
       "      <th>sample3</th>\n",
       "      <th>sample4</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>chr11</td>\n",
       "      <td>77774406</td>\n",
       "      <td>77775406</td>\n",
       "      <td>NM_003251</td>\n",
       "      <td>exp1</td>\n",
       "      <td>NM_003251</td>\n",
       "      <td>ENSG00000151365</td>\n",
       "      <td>THRSP</td>\n",
       "      <td>0.026</td>\n",
       "      <td>0.013</td>\n",
       "      <td>0.023</td>\n",
       "      <td>0.181</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>chr17</td>\n",
       "      <td>7590368</td>\n",
       "      <td>7591368</td>\n",
       "      <td>NM_001126112</td>\n",
       "      <td>exp1</td>\n",
       "      <td>NM_001126112</td>\n",
       "      <td>ENSG00000141510</td>\n",
       "      <td>TP53</td>\n",
       "      <td>58.697</td>\n",
       "      <td>47.569</td>\n",
       "      <td>34.299</td>\n",
       "      <td>28.180</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>chr17</td>\n",
       "      <td>26903451</td>\n",
       "      <td>26904451</td>\n",
       "      <td>NM_005165</td>\n",
       "      <td>exp1</td>\n",
       "      <td>NM_005165</td>\n",
       "      <td>ENSG00000109107</td>\n",
       "      <td>ALDOC</td>\n",
       "      <td>18.218</td>\n",
       "      <td>5.333</td>\n",
       "      <td>18.445</td>\n",
       "      <td>131.013</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   chrom     start       end  nearest_gene   exp        refseq  \\\n",
       "0  chr11  77774406  77775406     NM_003251  exp1     NM_003251   \n",
       "1  chr17   7590368   7591368  NM_001126112  exp1  NM_001126112   \n",
       "2  chr17  26903451  26904451     NM_005165  exp1     NM_005165   \n",
       "\n",
       "           ensembl genename  sample1  sample2  sample3  sample4  \n",
       "0  ENSG00000151365    THRSP    0.026    0.013    0.023    0.181  \n",
       "1  ENSG00000141510     TP53   58.697   47.569   34.299   28.180  \n",
       "2  ENSG00000109107    ALDOC   18.218    5.333   18.445  131.013  "
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_all.sort_values(by='chrom')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "or"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>chrom</th>\n",
       "      <th>start</th>\n",
       "      <th>end</th>\n",
       "      <th>nearest_gene</th>\n",
       "      <th>exp</th>\n",
       "      <th>refseq</th>\n",
       "      <th>ensembl</th>\n",
       "      <th>genename</th>\n",
       "      <th>sample1</th>\n",
       "      <th>sample2</th>\n",
       "      <th>sample3</th>\n",
       "      <th>sample4</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>chr17</td>\n",
       "      <td>7590368</td>\n",
       "      <td>7591368</td>\n",
       "      <td>NM_001126112</td>\n",
       "      <td>exp1</td>\n",
       "      <td>NM_001126112</td>\n",
       "      <td>ENSG00000141510</td>\n",
       "      <td>TP53</td>\n",
       "      <td>58.697</td>\n",
       "      <td>47.569</td>\n",
       "      <td>34.299</td>\n",
       "      <td>28.180</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>chr17</td>\n",
       "      <td>26903451</td>\n",
       "      <td>26904451</td>\n",
       "      <td>NM_005165</td>\n",
       "      <td>exp1</td>\n",
       "      <td>NM_005165</td>\n",
       "      <td>ENSG00000109107</td>\n",
       "      <td>ALDOC</td>\n",
       "      <td>18.218</td>\n",
       "      <td>5.333</td>\n",
       "      <td>18.445</td>\n",
       "      <td>131.013</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>chr11</td>\n",
       "      <td>77774406</td>\n",
       "      <td>77775406</td>\n",
       "      <td>NM_003251</td>\n",
       "      <td>exp1</td>\n",
       "      <td>NM_003251</td>\n",
       "      <td>ENSG00000151365</td>\n",
       "      <td>THRSP</td>\n",
       "      <td>0.026</td>\n",
       "      <td>0.013</td>\n",
       "      <td>0.023</td>\n",
       "      <td>0.181</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   chrom     start       end  nearest_gene   exp        refseq  \\\n",
       "1  chr17   7590368   7591368  NM_001126112  exp1  NM_001126112   \n",
       "2  chr17  26903451  26904451     NM_005165  exp1     NM_005165   \n",
       "0  chr11  77774406  77775406     NM_003251  exp1     NM_003251   \n",
       "\n",
       "           ensembl genename  sample1  sample2  sample3  sample4  \n",
       "1  ENSG00000141510     TP53   58.697   47.569   34.299   28.180  \n",
       "2  ENSG00000109107    ALDOC   18.218    5.333   18.445  131.013  \n",
       "0  ENSG00000151365    THRSP    0.026    0.013    0.023    0.181  "
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_all.sort_values(by='start')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that these commands did not replace the original DataFrame `df_all`!\n",
    "\n",
    "\n",
    "## Using `[ ]` to obtain DataFrame columns or rows\n",
    "\n",
    "Using `[ ]` we can obtain specific parts of the DataFrame.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    THRSP\n",
       "1     TP53\n",
       "2    ALDOC\n",
       "Name: genename, dtype: object"
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_all['genename']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>genename</th>\n",
       "      <th>exp</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>THRSP</td>\n",
       "      <td>exp1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>TP53</td>\n",
       "      <td>exp1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>ALDOC</td>\n",
       "      <td>exp1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  genename   exp\n",
       "0    THRSP  exp1\n",
       "1     TP53  exp1\n",
       "2    ALDOC  exp1"
      ]
     },
     "execution_count": 27,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "## or, for multiple columns:\n",
    "df_all[['genename', 'exp']]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Using `iloc` we can do the same using indexes\n",
    "\n",
    "One index by default points to **rows** of a DataFrame\n",
    "\n",
    "To get the 2nd row, type:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "chrom                     chr17\n",
       "start                  26903451\n",
       "end                    26904451\n",
       "nearest_gene          NM_005165\n",
       "exp                        exp1\n",
       "refseq                NM_005165\n",
       "ensembl         ENSG00000109107\n",
       "genename                  ALDOC\n",
       "sample1                  18.218\n",
       "sample2                   5.333\n",
       "sample3                  18.445\n",
       "sample4                 131.013\n",
       "Name: 2, dtype: object"
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_all.iloc[2]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Or, for multiple rows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>chrom</th>\n",
       "      <th>start</th>\n",
       "      <th>end</th>\n",
       "      <th>nearest_gene</th>\n",
       "      <th>exp</th>\n",
       "      <th>refseq</th>\n",
       "      <th>ensembl</th>\n",
       "      <th>genename</th>\n",
       "      <th>sample1</th>\n",
       "      <th>sample2</th>\n",
       "      <th>sample3</th>\n",
       "      <th>sample4</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>chr17</td>\n",
       "      <td>26903451</td>\n",
       "      <td>26904451</td>\n",
       "      <td>NM_005165</td>\n",
       "      <td>exp1</td>\n",
       "      <td>NM_005165</td>\n",
       "      <td>ENSG00000109107</td>\n",
       "      <td>ALDOC</td>\n",
       "      <td>18.218</td>\n",
       "      <td>5.333</td>\n",
       "      <td>18.445</td>\n",
       "      <td>131.013</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   chrom     start       end nearest_gene   exp     refseq          ensembl  \\\n",
       "2  chr17  26903451  26904451    NM_005165  exp1  NM_005165  ENSG00000109107   \n",
       "\n",
       "  genename  sample1  sample2  sample3  sample4  \n",
       "2    ALDOC   18.218    5.333   18.445  131.013  "
      ]
     },
     "execution_count": 29,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_all.iloc[2:5]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This works very much like indexes of a `list`\n",
    "\n",
    "If you use two indices, you can obtain specific rows *and* columns.\n",
    "\n",
    "The first index is always *rows*, the second *columns*.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>exp</th>\n",
       "      <th>refseq</th>\n",
       "      <th>ensembl</th>\n",
       "      <th>genename</th>\n",
       "      <th>sample1</th>\n",
       "      <th>sample2</th>\n",
       "      <th>sample3</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>exp1</td>\n",
       "      <td>NM_005165</td>\n",
       "      <td>ENSG00000109107</td>\n",
       "      <td>ALDOC</td>\n",
       "      <td>18.218</td>\n",
       "      <td>5.333</td>\n",
       "      <td>18.445</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    exp     refseq          ensembl genename  sample1  sample2  sample3\n",
       "2  exp1  NM_005165  ENSG00000109107    ALDOC   18.218    5.333   18.445"
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_all.iloc[2:5, 4:11]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Removing missing values or `NaN`s.\n",
    "\n",
    "In many cases DataFrames contains `NaN`, missing values, and you often have to remove them.\n",
    "\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>chrom</th>\n",
       "      <th>start</th>\n",
       "      <th>end</th>\n",
       "      <th>nearest_gene</th>\n",
       "      <th>exp</th>\n",
       "      <th>refseq</th>\n",
       "      <th>ensembl</th>\n",
       "      <th>genename</th>\n",
       "      <th>sample1</th>\n",
       "      <th>sample2</th>\n",
       "      <th>sample3</th>\n",
       "      <th>sample4</th>\n",
       "      <th>full_name</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>chr11</td>\n",
       "      <td>77774406</td>\n",
       "      <td>77775406</td>\n",
       "      <td>NM_003251</td>\n",
       "      <td>exp1</td>\n",
       "      <td>NM_003251</td>\n",
       "      <td>ENSG00000151365</td>\n",
       "      <td>THRSP</td>\n",
       "      <td>0.026</td>\n",
       "      <td>0.013</td>\n",
       "      <td>0.023</td>\n",
       "      <td>0.181</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>chr17</td>\n",
       "      <td>7590368</td>\n",
       "      <td>7591368</td>\n",
       "      <td>NM_001126112</td>\n",
       "      <td>exp1</td>\n",
       "      <td>NM_001126112</td>\n",
       "      <td>ENSG00000141510</td>\n",
       "      <td>TP53</td>\n",
       "      <td>58.697</td>\n",
       "      <td>47.569</td>\n",
       "      <td>34.299</td>\n",
       "      <td>28.180</td>\n",
       "      <td>tumor protein p53</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>chr17</td>\n",
       "      <td>26903451</td>\n",
       "      <td>26904451</td>\n",
       "      <td>NM_005165</td>\n",
       "      <td>exp1</td>\n",
       "      <td>NM_005165</td>\n",
       "      <td>ENSG00000109107</td>\n",
       "      <td>ALDOC</td>\n",
       "      <td>18.218</td>\n",
       "      <td>5.333</td>\n",
       "      <td>18.445</td>\n",
       "      <td>131.013</td>\n",
       "      <td>aldolase, fructose-bisphosphate C</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   chrom     start       end  nearest_gene   exp        refseq  \\\n",
       "0  chr11  77774406  77775406     NM_003251  exp1     NM_003251   \n",
       "1  chr17   7590368   7591368  NM_001126112  exp1  NM_001126112   \n",
       "2  chr17  26903451  26904451     NM_005165  exp1     NM_005165   \n",
       "\n",
       "           ensembl genename  sample1  sample2  sample3  sample4  \\\n",
       "0  ENSG00000151365    THRSP    0.026    0.013    0.023    0.181   \n",
       "1  ENSG00000141510     TP53   58.697   47.569   34.299   28.180   \n",
       "2  ENSG00000109107    ALDOC   18.218    5.333   18.445  131.013   \n",
       "\n",
       "                           full_name  \n",
       "0                                NaN  \n",
       "1                  tumor protein p53  \n",
       "2  aldolase, fructose-bisphosphate C  "
      ]
     },
     "execution_count": 31,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "gene_info = pd.DataFrame({\"full_name\":[\"tumor protein p53\", \"aldolase, fructose-bisphosphate C\"]},\n",
    "                         index=[\"TP53\", \"ALDOC\"])\n",
    "df_all = df_all.join(gene_info, on=\"genename\")\n",
    "df_all"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can obtain a `boolean` (`True`/`False`) for the column called `full_name` like this:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0     True\n",
       "1    False\n",
       "2    False\n",
       "Name: full_name, dtype: bool"
      ]
     },
     "execution_count": 32,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_all['full_name'].isnull()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "By using the `~` symbol we can invert this `boolean` column. The `~` acts as `not` for a pandas Series."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    False\n",
       "1     True\n",
       "2     True\n",
       "Name: full_name, dtype: bool"
      ]
     },
     "execution_count": 33,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "~df_all['full_name'].isnull()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can use this strategy to remove all rows that have `NaN` in the column `full_name`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>chrom</th>\n",
       "      <th>start</th>\n",
       "      <th>end</th>\n",
       "      <th>nearest_gene</th>\n",
       "      <th>exp</th>\n",
       "      <th>refseq</th>\n",
       "      <th>ensembl</th>\n",
       "      <th>genename</th>\n",
       "      <th>sample1</th>\n",
       "      <th>sample2</th>\n",
       "      <th>sample3</th>\n",
       "      <th>sample4</th>\n",
       "      <th>full_name</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>chr17</td>\n",
       "      <td>7590368</td>\n",
       "      <td>7591368</td>\n",
       "      <td>NM_001126112</td>\n",
       "      <td>exp1</td>\n",
       "      <td>NM_001126112</td>\n",
       "      <td>ENSG00000141510</td>\n",
       "      <td>TP53</td>\n",
       "      <td>58.697</td>\n",
       "      <td>47.569</td>\n",
       "      <td>34.299</td>\n",
       "      <td>28.180</td>\n",
       "      <td>tumor protein p53</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>chr17</td>\n",
       "      <td>26903451</td>\n",
       "      <td>26904451</td>\n",
       "      <td>NM_005165</td>\n",
       "      <td>exp1</td>\n",
       "      <td>NM_005165</td>\n",
       "      <td>ENSG00000109107</td>\n",
       "      <td>ALDOC</td>\n",
       "      <td>18.218</td>\n",
       "      <td>5.333</td>\n",
       "      <td>18.445</td>\n",
       "      <td>131.013</td>\n",
       "      <td>aldolase, fructose-bisphosphate C</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   chrom     start       end  nearest_gene   exp        refseq  \\\n",
       "1  chr17   7590368   7591368  NM_001126112  exp1  NM_001126112   \n",
       "2  chr17  26903451  26904451     NM_005165  exp1     NM_005165   \n",
       "\n",
       "           ensembl genename  sample1  sample2  sample3  sample4  \\\n",
       "1  ENSG00000141510     TP53   58.697   47.569   34.299   28.180   \n",
       "2  ENSG00000109107    ALDOC   18.218    5.333   18.445  131.013   \n",
       "\n",
       "                           full_name  \n",
       "1                  tumor protein p53  \n",
       "2  aldolase, fructose-bisphosphate C  "
      ]
     },
     "execution_count": 34,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_all[~df_all['full_name'].isnull()]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "Make sure you understand how this worked!\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Basic calculations\n",
    "\n",
    "There are a lot of methods that allows you to do basic calculation on DataFrames.\n",
    "\n",
    "As an illustration we will calculate the pairwise correlations between the columns `sample1`, `sample2`, `sample3`, and `sample4`.\n",
    "\n",
    "We take the appropriate columns first\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_1 = df_all[['sample1', 'sample2', 'sample3', 'sample4']]\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we can use the pandas `corr` method to calculate all pairwise correlations:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>sample1</th>\n",
       "      <th>sample2</th>\n",
       "      <th>sample3</th>\n",
       "      <th>sample4</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>sample1</th>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.978971</td>\n",
       "      <td>0.966614</td>\n",
       "      <td>-0.011258</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>sample2</th>\n",
       "      <td>0.978971</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.894017</td>\n",
       "      <td>-0.215006</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>sample3</th>\n",
       "      <td>0.966614</td>\n",
       "      <td>0.894017</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.245337</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>sample4</th>\n",
       "      <td>-0.011258</td>\n",
       "      <td>-0.215006</td>\n",
       "      <td>0.245337</td>\n",
       "      <td>1.000000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "          sample1   sample2   sample3   sample4\n",
       "sample1  1.000000  0.978971  0.966614 -0.011258\n",
       "sample2  0.978971  1.000000  0.894017 -0.215006\n",
       "sample3  0.966614  0.894017  1.000000  0.245337\n",
       "sample4 -0.011258 -0.215006  0.245337  1.000000"
      ]
     },
     "execution_count": 36,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_1.corr()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}