10. Debugging¶
10.1. Exercises: debugging¶
Exercise 10.1
The purpose of the following code is to count how often each TF in the network is a target of a TF in the network (it also counts if a TF regulates itself). In other words, how many TFs in the network bind in the promoter of a given TF.
The below code is provided in the script
exercise4_with_errors.py
.Use the terminal and command line to copy the file to a new file named
my_exercise4.py
.Edit the new file by clicking on it in the Jupyter browser.
Every time you change the file, save it.
Execute the Python script from the command line with
python my_exercise4.py
.
Questions
What is the purpose of this code? (This should be obvious from the short description given above and the code itself)
What is the input and what is the expected output? Answer this question before running or changing the code, but make use of the network as defined in the dictionary
hsc_network
.Identify the source of the runtime errors, and fix the mistake into the code so that the script runs without error. In that case, the last line of the output should be “Finished analysis of HSC network”.
After you fixed the errors in the code, does the code produce the expected output?
What is the output (list each TF and its target frequency)?
How would you solve this counting problem more elegantly? Hint: e.g. by using a single dictionary to replace the list
list_of_tfs
and dictionarytf_to_index
.
Advice: make liberal use of print()
statements to understand
What each line in the code is trying to achieve
How to make the code function correctly, and make sure that it indeed functions correctly.
Hint
There are multiple bugs in the code below. Some are obvious mistakes (a typo). However, some are mistakes that do not produce runtime errors that produce an incorrect result.
# define HSC network from Bonzanni et al, Bioinformatics 2013
hsc_network = { # each key is a source TF, the value for each key is a list of target TFs
"Scl" : ["Scl","Gata1","Gata2","Hhex","Zfpm1","Fli1","Erg","Smad6","Runx1","Eto2"],
"Pu.1" : ["Pu.1","Runx1","Gata1"],
"Runx1" : ["Runx1","Pu.1","Erg"],
"Smad6" : ["Runx1"],
"Eto2" : ["Erg"],
"Gata1" : ["Gata1","Scl","Pu.1","Gata2"],
"Fli1" : ["Fli1","Runx1","Pu.1","Scl","Gata2","Hhex","Erg","Smad6"],
"Erg" : ["Smad6","Runx1","Erg","Hhex","Gata2","Fli1"],
"Zfpm1" : ["Gata2"],
"Gata2" : ["Zfpm1","Runx1","Smad6","Eto2","Scl","Gata2","Hhex","Fli1","Erg"],
"Hhex" : ["Gata2"]
}
target_frequencies = [0,0,0,0,0,0,0,0] # How often is each TF a target of another TF?
list_of_tfs = [] # list of transcription factors in the network
tf_to_index = {} # dictionary linking TF name to numerical index in list
index = 0 # initialize index to zero
for tf in hsc_network:
list_of_tfs.append(tf) # append tf to the list of transcription factors in the network
index = index + 1 # update index
tf_to_index[tf] = index # associate the transcription factor tf with the index
for source_tf in hsc_network: # iterate over TFs in the network
list_of_target_tfs = hsc_network[source_tf]
for target_tf in list_of_target_tfs: # iterate over all TFs that are targeted by source_tf
index_in_list = tf_to_index[target_tf] # retrieve list index associated with this TF
target_frequencies[index_in_list] = target_frequency[index_in_list] + 1
for index in range(len(list_of_tfs)):
tf = list_of_tfs[index]
target_frequency_tf = target_frequencies[ index ]
print("Transcription factor",tf,"is",target_frequency_tf,"times a target of another TF")
print("Finished analysis of HSC network")
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-2-e83b46195bf2> in <module>
31 for target_tf in list_of_target_tfs: # iterate over all TFs that are targeted by source_tf
32 index_in_list = tf_to_index[target_tf] # retrieve list index associated with this TF
---> 33 target_frequencies[index_in_list] = target_frequency[index_in_list] + 1
34
35 for index in range(len(list_of_tfs)):
NameError: name 'target_frequency' is not defined
Exercise 10.2
The following code (provided in the file exercise5.py
)
tests whether a particular TF of interest directly regulates itself (in other words, if a TF binds in its own promoter).
The name of the TF of interest is assigned to the variable my_tf_of_interest
.
The code should print
Transcription factor my_tf_of_interest regulates itself
or
Transcription factor my_tf_of_interest does not regulate itself
to screen (with “my_tf_of_interest” replaced by the actual TF name), depending on whether there is a self-regulation relationship encoded in the
hsc_network
dictionary variable or not.
Questions
This script contains a bug (otherwise it wouldn’t be a question here). Can you identify the bug without running the code (try for at least 10 minutes)?
Run the code as a script from the command line in a terminal. Does the code run without error?
Does the code provide the correct output for the TF named ‘Scl’?
Does the code provide the correct output for all TFs? If not, how would you adapt the code (extend or change it) so that it does produce the desired output?
# Does a TF regulate itself?
# each key is a source TF, the value for each key is a list of target TFs
hsc_network = {
"Scl" : ["Scl","Gata1","Gata2","Hhex","Zfpm1","Fli1","Erg","Smad6","Runx1","Eto2"],
"Pu.1" : ["Pu.1","Runx1","Gata1"],
"Runx1" : ["Runx1","Pu.1","Erg"],
"Smad6" : ["Runx1"],
"Eto2" : ["Erg"],
"Gata1" : ["Gata1","Scl","Pu.1","Gata2"],
"Fli1" : ["Fli1","Runx1","Pu.1","Scl","Gata2","Hhex","Erg","Smad6"],
"Erg" : ["Smad6","Runx1","Erg","Hhex","Gata2","Fli1"],
"Zfpm1" : ["Gata2"],
"Gata2" : ["Zfpm1","Runx1","Smad6","Eto2","Scl","Gata2","Hhex","Fli1","Erg"],
"Hhex" : ["Gata2"]
}
my_tf_of_interest = 'Scl'
for target_tf in hsc_network[my_tf_of_interest]:
if target_tf == my_tf_of_interest:
my_tf_regulates_itself = True
if my_tf_regulates_itself:
print("Transcription factor",my_tf_of_interest,"regulates itself")
else:
print("Transcription factor",my_tf_of_interest,"does not regulate itself")
Transcription factor Scl regulates itself
Exercise 10.3
This exercise focuses on problems that may arise when an input file does not have the formatting you expect, contains errors or otherwise unexpected events.
For small networks it is feasible to type your network directly as a dictionary, like in the example above. In other realistic scenarios, the network file is produced by another analysis (either your own Python script or another program), and you would like to analyze the network produced by that program.
The purpose of the below code is to determine which TFs are provided in a flat text file, and count how many target genes each TF has. To keep things managable, the text file contains only a small network.
The setting is as follows:
We have a flat text input file containing the network links in the following format:
source_tf_1 target_1 target_2 target_3 source_tf_2 target_4 target_5
We want to parse the input file and count for each source_tf that occurs how many targets it has.
To test the correctness of our code, we provide you with an input file that has several issues that cause the program to misbehave.
The following code is contained in the script exercise6.py
.
It reads a network from the file network_exercise_6.txt
.
The code contained in it is as follows:
target_tfs_dict = {} # will contain a list of the target genes for each source gene
input_file = open("/vol/cursus/CFB/scripts_debugging/network_exercise_6.txt",'r') # change path/filename if you want to run it on a different file
for line in input_file:
entries = line.strip().split()
source_tf = entries[0] # source TF is always the first string in a line
targets = entries[1:] # the strings after the first one are the target genes
target_tfs_dict[source_tf] = targets
input_file.close()
for tf in sorted(target_tfs_dict):
print("Source TF:\t",tf,"\tNumber of targets:\t",len(target_tfs_dict[tf]))
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-5-b5ec3a33a1ca> in <module>
5 for line in input_file:
6 entries = line.strip().split()
----> 7 source_tf = entries[0] # source TF is always the first string in a line
8 targets = entries[1:] # the strings after the first one are the target genes
9 target_tfs_dict[source_tf] = targets
IndexError: list index out of range
Run the script from the command line in a terminal. You can view the network file by clicking on it in the Jupyter file browser.
Expected output
If the script functions correctly, the desired output is:
Source TF: Erg Number of targets: 6
Source TF: Eto2 Number of targets: 1
Source TF: Fli1 Number of targets: 8
Source TF: Gata1 Number of targets: 4
Source TF: Gata2 Number of targets: 9
Source TF: Hhex Number of targets: 1
Source TF: Pu.1 Number of targets: 3
Source TF: Runx1 Number of targets: 3
Source TF: Scl Number of targets: 10
Source TF: Smad6 Number of targets: 1
Source TF: Zfpm1 Number of targets: 1
Question/assignment:
The script exercise6.py
makes three assumptions about the input file that are not valid (the input file violates the assumption).
Describe at least three assumptions that this script makes about the input file and provide a script that produces the desired output.
For example, describe the assumptions as specifically as in the following examples (this is just an example, not relevant to this script):
The script assumes that each line has precisely 20 characters
The script assumes that each line starts with a capital letter
Note
You have the change the Python script; you are NOT allowed to change the input file!
Hints:
Use
if
statements to check whether a specific assumption that you require for correct execution is satisfied. For instance, you can use anif
statement to check the length of a list, or whether a key is present in the dictionary.Add
print()
statements to understand what each line is doing and what the value is of a variable.You are allowed to change the input file to understand where the problem is, but you have to provide code that produces the correct output for the original input file.