9. Scripts

9.1. Exercises: scripts

Exercise 9.1: Copying the files required for the exercises

Open a terminal from the Jupyter “Home” page in your web browser (the view of your home directory), by clicking on “Terminal” in the “New” pull down menu.

If you haven’t already done so, first create a folder/directory named cfb_2021 in your network folder for this course:

cd                # go to root of your home directory
mkdir cfb_2021    # mkdir creates a directory
cd cfb_2021       # change directory
pwd               # print current working directory. 
                  # In my case, the result is: /home/kalbers/cfb_2021

Now we are going to copy the files and scripts needed for these exercises to the “cfb_2021” directory. The Linux command “cp” copies files and directories.

cp -rp /vol/cursus/CFB/scripts_and_debugging ~/cfb_2021/

Note

The ~ refers to the root of your home directory (whatever your username is), and the directory cfb_2021 has to exist for this particular copy command to work.

Now change directory (using cd) to the directory ~/cfb_2021/scripts_and_debugging. Validate that you are in the correct directory using the command pwd as above.

Exercise 9.2: Comparing the Jupyter notebook with running scripts from the command line

Open a new Python3 Jupyter-notebook for these exercises. Copy the following code into a cell and execute the cell:

print("Hello")
my_number = 1
my_number

Questions:

  1. Describe the exact output.

  2. Which of the three lines in the above code produce output?

  3. Is this output formatted in the same way?

Now create a new text file by clicking “Text File” in the “New” drop-down menu in the file-browser tab-page of your Jupyter notebook. Make sure to first go to the correct subdirectory (~/cfb_2021/scripts_and_debugging) by clicking on the respective subdirectories in the Jupyter file browser window. Put the same three lines of code in this text file. Then, rename the text file to “exercise2.py” by clicking on the file name at the top of the screen. (The default file name is “untitled.txt”.)

If everything went well, the word “print” should now have the color green, and the text “Hello” should be colored red. This is called syntax highlighting.

Now, run the script from the command line in the terminal window as follows:

python exercise2.py

Questions:

  1. Is the output of the script printed to the terminal window exactly the same as the output printed to screen when executing the cell in a Jupyter notebook?

  2. How does a Jupyter notebook differ from a script in terms of its output, i.e. the information that is written to the screen?

Exercise 9.3: Permanence of Python variables in memory

Copy the following code to a cell in a Jupyter notebook and execute the cell.

my_number = 42

Copy the following code to a new cell below the previous one in the same Jupyter notebook, and execute the cell:

print("my_number:", my_number)

Question:

  1. What is the output of the second cell in the Jupyter notebook?

Next, create a Python script called exercise3a.py with exactly the same code as the first cell (i.e. my_number = 42). Create a second Python script called exercise3b.py with exactly the same code as the second cell (print("my_number:", my_number)).

Now go to your Terminal and execute the first script and the second script in that order:

python exercise3a.py
python exercise3b.py

Questions:

  1. Does executing the first script produce an error message? If not, what is the output printed to the screen? If yes, why?

  2. Does executing the second script produce an error message? If not, what is the output printed to the screen? If yes, why?

  3. What is the state of the Python memory when it starts to execute a Python script?

9.2. Using scripts with variable input

Create a file called module_using_sys.py with the following contents:

import sys

print('The command line arguments are:')
for i in sys.argv:
    print(i)

Let’s run this program from the shell.

$ python module_using_sys.py we are arguments
The command line arguments are:
module_using_sys.py
we
are
arguments

The argv variable in the sys module is a list of strings. It contains the name of the script followed by the command line arguments, i.e. the arguments passed to your program on the command line.

Here, when we execute python module_using_sys.py we are arguments, we run the module module_using_sys.py with the python command and the other things that follow are arguments passed to the program. Python stores the command line arguments in the sys.argv variable for us to use.

Remember, the name of the script running is always the first argument in the sys.argv list. So, in this case we will have 'module_using_sys.py' as sys.argv[0], 'we' as sys.argv[1], 'are' as sys.argv[2] and 'arguments' as sys.argv[3].
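Because every element of sys.argv is a string, a numeric argument has to be converted before it can be used as a number. The following is a minimal sketch of that idea, not part of the course material: the function name repeat_word and the script name repeat.py are hypothetical, chosen only for illustration.

```python
import sys


def repeat_word(argv):
    """Return the word in argv[1] repeated argv[2] times (hypothetical example)."""
    usage = "Usage: python repeat.py WORD COUNT"
    # argv[0] is the script name, so two user arguments give a list of length 3
    if len(argv) != 3:
        return usage
    word = argv[1]
    try:
        # Command line arguments are always strings; convert explicitly
        count = int(argv[2])
    except ValueError:
        return usage
    return " ".join([word] * count)


if __name__ == "__main__":
    print(repeat_word(sys.argv))
```

If this were saved as repeat.py, running python repeat.py hi 3 would pass the list ['repeat.py', 'hi', '3'] to the function.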

9.3. Exercises using sys.argv

Exercise 9.4: Reading the command line arguments

Make a script called nuc_arg.py that prints the nucleotide content of a sequence that you give as argument on the command line. For example, this command:

$ python nuc_arg.py TGACTCA

should print the following output:

2 2 1 2

Hint: if you want, you can use the nucleotide count function from the lecture.

Now you can edit the script nuc_arg.py.

Exercise 9.5: FASTA statistics

Write a script called nuc_fasta.py. This script should accept the name of a FASTA file as argument, read the FASTA sequences in that file and print the nucleotide content of all the sequences. The script should print a header line, followed by one line per sequence containing the FASTA id (the sequence name) and the nucleotide content of that sequence. The output should be tab-separated. The nucleotide content should be given as a fraction of the sequence length, rounded to two decimal places, in the order A, C, G and T. When you run the script nuc_fasta.py on the input file /vol/cursus/CFB/scripts_debugging/sequences.fa the output should exactly match the following:

name	A	C	G	T
chr14:89352059-89352259	0.31	0.28	0.20	0.22
chr5:74264624-74264824	0.34	0.20	0.20	0.26
chr2:132500203-132500403	0.23	0.12	0.21	0.43
chr6:30630663-30630863	0.28	0.23	0.27	0.22
chr15_KI270905v1_alt:1999423-1999623	0.35	0.22	0.15	0.28
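The formatting requirements above (tab-separated fields, two decimal places) can be met with a format specifier and str.join. This is only a sketch of the formatting step under assumed names: format_row and example_seq are hypothetical, and reading the FASTA file and counting the nucleotides are left to you.

```python
def format_row(name, fractions):
    """Join a name and a list of fractions into one tab-separated line."""
    # ".2f" formats a float with exactly two digits after the decimal point
    fields = [name] + [f"{value:.2f}" for value in fractions]
    return "\t".join(fields)


# Header line followed by one example row with made-up fractions
print("\t".join(["name", "A", "C", "G", "T"]))
print(format_row("example_seq", [2 / 7, 2 / 7, 1 / 7, 2 / 7]))
```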

9.4. Using available modules: argparse

The Python standard library, which is included with every Python version, contains a lot of useful modules (see here). In addition, there are many third-party modules available. Some of these we will use in this course, such as pandas for data analysis and matplotlib for making figures. However, it is not possible to exhaustively cover all these modules in this course. Therefore, it is very useful to be able to search for modules, and for examples and tutorials on how to use these modules. With the subjects covered so far, you should have enough understanding of basic Python principles to do so.

Let’s take an example, the argparse module. This is a very useful module for scripts that take (a lot of) command-line arguments. Some examples on how to use this module:

https://docs.python.org/3/howto/argparse.html

Have a look, Google it, try it out! We will use the argparse module in our final assignment for today.
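To give an impression of what argparse looks like in practice, here is a minimal sketch with one required positional argument and one optional argument. The argument names (name, --times) and the function build_parser are hypothetical and chosen only for this example. The argument list is passed explicitly to parse_args here so the example runs anywhere; in a real script you would call parse_args() without arguments so that it reads sys.argv.

```python
import argparse


def build_parser():
    """Create a parser with one positional and one optional argument."""
    parser = argparse.ArgumentParser(description="Greet someone a number of times.")
    parser.add_argument("name", help="name of the person to greet")
    parser.add_argument("-n", "--times", type=int, default=1,
                        help="how many times to print the greeting (default: 1)")
    return parser


# Equivalent to typing: python greet.py Ada --times 2
args = build_parser().parse_args(["Ada", "--times", "2"])
for _ in range(args.times):
    print(f"Hello, {args.name}!")
```

Note that type=int makes argparse convert the value for you, and that a help message (-h) is generated automatically from the help strings.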

9.5. Exercises: putting it all together

Exercise 9.6: motif scanning

Write a command-line Python script that scans a FASTA file with an IUPAC consensus sequence. As an optional argument to the script, a user should be able to specify the number of mismatches that is allowed. It should be possible to specify these arguments on the command line; use the argparse module for this. As output, the script should print, for every match, the ID of the sequence and the position of the match within the sequence.

Take care of the following:

  • Comment your code

  • Use functions

  • Follow the style guide

  • Test your code

Start small. It is better to have a working script that has limited functionality but works well, than a script that tries to do everything and fails.
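As one possible small starting point, you could first write and test a function that compares a consensus with a single stretch of sequence of the same length. The sketch below is an assumption about how you might structure this, not the required solution: the names IUPAC and count_mismatches are hypothetical. The dictionary lists the standard IUPAC nucleotide codes. Scanning along a whole sequence, reading the FASTA file and handling the command line are not covered.

```python
# IUPAC nucleotide codes mapped to the bases they stand for
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def count_mismatches(consensus, window):
    """Count positions where window does not fit the IUPAC consensus."""
    if len(consensus) != len(window):
        raise ValueError("consensus and window must have the same length")
    mismatches = 0
    for symbol, base in zip(consensus.upper(), window.upper()):
        # A base fits if it is one of the bases the IUPAC symbol allows
        if base not in IUPAC[symbol]:
            mismatches += 1
    return mismatches


print(count_mismatches("CACGTG", "CACGTG"))  # identical: 0 mismatches
print(count_mismatches("CANNTG", "CAGCTG"))  # N allows any base: 0 mismatches
print(count_mismatches("CACGTG", "CATGTG"))  # one position differs: 1 mismatch
```

In this sketch an ambiguous base in the sequence itself (for example an N in the FASTA file) always counts as a mismatch; whether that is the behaviour you want is a design decision for your own script.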

Test data

Two examples of consensus sequences and FASTA files (for c-Myc and STAT3 from Chen et al. 2008) are located in /vol/cursus/CFB/consensus_motif.

  • consensus.txt

  • c-Myc.fa

  • STAT3.fa

Extension (optional!)

Implement the ‘Match’ algorithm from Kel et al. 2003 to scan for matches to a positional weight matrix.

The log function can be imported from the math module. See the definition of this function here.