7. Assembling the pieces

This chapter introduces no new Python topics. Instead, you will practice by writing functions that combine the topics covered so far.

7.1. Exercises

Exercise 7.1: Translation

Write a function mrna2protein that translates an mRNA sequence (written in the A, C, G, T alphabet) into a protein sequence. You can use the codon table from exercise 6.9.

def mrna2protein(seq):
    # Adapt this function with your own code
    prot = ""  # placeholder so the stub runs; build the protein sequence here
    return prot
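
As a possible starting point, the sketch below shows one way to cut a sequence into codons and look each of them up. The names split_codons and TOY_CODON_TABLE are hypothetical, and the table holds only four codons for illustration; your own solution would need the full codon table from exercise 6.9.

```python
# Hypothetical toy table with only four codons; the real exercise needs
# the complete codon table from exercise 6.9.
TOY_CODON_TABLE = {"ATG": "M", "GCG": "A", "ACT": "T", "TAG": "*"}


def split_codons(seq):
    """Return the complete codons of seq, ignoring a trailing partial codon."""
    seq = seq.upper()
    # Step through the sequence three letters at a time.
    return [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]


codons = split_codons("atggcgactta")
print(codons)  # ['ATG', 'GCG', 'ACT']
print("".join(TOY_CODON_TABLE[c] for c in codons))  # MAT
```

The trailing "ta" is dropped because it does not form a complete codon.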

Exercise 7.2: Open reading frames

Write a function longest_orfs that takes a DNA sequence as input and returns, for each of the three reading frames, the protein sequence encoded by the longest ORF in that frame. The return value should be a dictionary with the frame number as the key and the protein sequence as the value, so it will contain three key:value pairs. The keys are 1, 2 and 3, and the corresponding values are the translations of the longest ORF in frame 1, frame 2 and frame 3.

The stop codon (* in the amino acid table) should not be included in the translation of the ORF. For this exercise, an open reading frame does not need to start with ATG; it can start with any codon except a stop codon. For example, take the sequence "actgcgtagagagctggagagattaggc". The translation of the full sequence in frame 1 looks like this:

TA*RAGEIR

In this example, the longest ORF is RAGEIR.

You can re-use code that you have written for exercise 4.8 and exercise 7.1.

def longest_orfs(seq):
    # Adapt this function with your own code
    orfs = {}  # placeholder so the stub runs; fill in keys 1, 2 and 3
    return orfs
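
One building block that may help here is splitting a translated sequence at its stop symbols and keeping the longest piece. The sketch below uses the frame 1 translation from the example; the helper name longest_segment is hypothetical, and producing the translation for each frame is left to your own code.

```python
def longest_segment(protein):
    """Return the longest stretch of protein between stop symbols ('*')."""
    # split("*") drops the stop symbols and yields the stretches between them.
    segments = protein.split("*")
    # On ties, max with key=len returns the first segment of maximal length.
    return max(segments, key=len)


print(longest_segment("TA*RAGEIR"))  # RAGEIR

# Frames 1, 2 and 3 correspond to slicing offsets 0, 1 and 2.
seq = "actgcgtagagagctggagagattaggc"
for frame in (1, 2, 3):
    print(frame, seq[frame - 1:])
```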

Exercise 7.3: Hamming distance

Write a function called hamming that takes two strings s and t of equal length as arguments and computes the number of differences between them. The function should return the number of symbols that differ in s and t.

def hamming(s, t):
    # adapt this function with your own code
    return 0

print(hamming("ACTG", "ACTG"))
print(hamming("ACTG", "GTCA"))
print(hamming("AACC", "AATT") + hamming("CTGA", "TCGA"))
0
0
0

With the placeholder function above, every call prints 0. If your function is defined correctly, running the code should instead produce exactly the following output:

0
4
4
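
If you are unsure how to compare two strings position by position, the sketch below shows how zip pairs up the symbols of two equal-length strings. The strings and variable names are made up for illustration; turning the result into a count is left as part of the exercise.

```python
s = "GATTACA"
t = "GACTATA"

# zip pairs the first symbols, then the second symbols, and so on.
pairs = list(zip(s, t))
print(pairs[:3])  # [('G', 'G'), ('A', 'A'), ('T', 'C')]

# Indices (0-based) at which the two strings disagree.
mismatches = [i for i, (a, b) in enumerate(pairs) if a != b]
print(mismatches)  # [2, 5]
```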

Exercise 7.4: Find exact motif matches

Given two strings s and t, t is a substring of s if t occurs in s as a contiguous run of symbols (so t can be no longer than s). Write a function called find_match that takes two arguments, s and t, and returns a list of all positions at which the substring t starts in s. Positions are counted from 1, and overlapping matches are all reported.

# define your function here

When you have finished the function, run the code below:

# Don't change anything here, but update the function definition above!
print(find_match("GATATATGCATATACTT", "ATAT"))
print(find_match("AUGCUUCAGAAAGGUCUUACG", "U"))
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
/tmp/ipykernel_1834/874657765.py in <module>
      1 # Don't change anything here, but update the function definition above!
----> 2 print(find_match("GATATATGCATATACTT", "ATAT"))
      3 print(find_match("AUGCUUCAGAAAGGUCUUACG", "U"))

NameError: name 'find_match' is not defined

Once find_match is defined, the code should no longer raise a NameError, and its output should exactly match the following:

[2, 4, 10] 
[2, 5, 6, 15, 17, 18] 
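
One way to locate overlapping matches is to call str.find repeatedly, each time resuming one character after the previous hit. The sketch below is a hint rather than a finished solution: the helper name find_all_zero_based is hypothetical, and it reports Python's 0-based indices, whereas the expected output above counts positions from 1.

```python
def find_all_zero_based(text, pattern):
    """Return the 0-based start index of every (possibly overlapping) match."""
    positions = []
    start = text.find(pattern)  # find returns -1 when there is no match
    while start != -1:
        positions.append(start)
        # Resume one character further so overlapping matches are found.
        start = text.find(pattern, start + 1)
    return positions


print(find_all_zero_based("AAAA", "AA"))  # [0, 1, 2]
```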

Exercise 7.5: Motif conversion

Write a function that converts a consensus sequence into a positional weight matrix. The consensus sequence can be composed of symbols from the IUPAC DNA code, and the function should ignore any non-IUPAC character. The positional weight matrix should be a two-dimensional list: each element of the outer list is itself a list containing the relative frequencies of A, C, G and T, in that order, which together sum to 1.0. The function should work regardless of whether the input is upper-case, lower-case or a mix.
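
The sketch below illustrates one possible way to turn a single IUPAC symbol into a row of the matrix. It assumes that an ambiguous symbol divides the frequency equally among the nucleotides it stands for, which the exercise text does not state explicitly. The names IUPAC and symbol_to_row are hypothetical, and skipping non-IUPAC characters and assembling the full matrix are left to your own function.

```python
# IUPAC DNA codes mapped to the nucleotides they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def symbol_to_row(symbol):
    """Return [freq_A, freq_C, freq_G, freq_T] for one IUPAC symbol."""
    bases = IUPAC[symbol.upper()]
    # Assumption: each allowed nucleotide gets an equal share, the rest 0.0.
    return [1 / len(bases) if nt in bases else 0.0 for nt in "ACGT"]


print(symbol_to_row("a"))  # [1.0, 0.0, 0.0, 0.0]
print(symbol_to_row("R"))  # [0.5, 0.0, 0.5, 0.0]
print(symbol_to_row("N"))  # [0.25, 0.25, 0.25, 0.25]
```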

Exercise 7.6: Analysis of a regulatory network

For this exercise, use the regulatory network represented in the following dictionary:

hsc_network = {
    "Scl" : ["Scl","Gata1","Gata2","Hhex","Zfpm1","Fli1","Erg","Smad6","Runx1","Eto2"],
    "Pu.1" : ["Pu.1","Runx1","Gata1"],
    "Runx1" : ["Runx1","Pu.1","Erg"],
    "Smad6" : ["Runx1"],
    "Eto2" : ["Erg"],
    "Gata1" : ["Gata1","Scl","Pu.1","Gata2"],
    "Fli1" : ["Fli1","Runx1","Pu.1","Scl","Gata2","Hhex","Erg","Smad6"],
    "Erg" : ["Smad6","Runx1","Erg","Hhex","Gata2","Fli1"],
    "Zfpm1" : ["Gata2"],
    "Gata2" : ["Zfpm1","Runx1","Smad6","Eto2","Scl","Gata2","Hhex","Fli1","Erg"],
    "Hhex" : ["Gata2"]
}

Each key is a transcription factor and each value is a list of target genes.

Write a function common_targets that accepts three arguments: a TF-target dictionary and two TF names. The function should return a list of all the targets that are shared between the two TFs, in alphabetical order. When there are no common targets, the function should return an empty list. With the network above, the following code:

print(common_targets(hsc_network, "Scl", "Gata2"))
print(common_targets(hsc_network, "Smad6", "Eto2"))
print(common_targets(hsc_network, "Zfpm1", "Hhex"))

should result in:

['Erg', 'Eto2', 'Fli1', 'Gata2', 'Hhex', 'Runx1', 'Scl', 'Smad6', 'Zfpm1']
[]
['Gata2']
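
Finding shared elements of two lists is a natural fit for Python sets. The sketch below demonstrates the idea on a small made-up network (toy_network and its gene names are hypothetical and unrelated to hsc_network); wrapping it in a function with the requested arguments is left as the exercise.

```python
# Hypothetical toy network, unrelated to hsc_network.
toy_network = {
    "TfA": ["GeneZ", "GeneX", "GeneY"],
    "TfB": ["GeneY", "GeneW", "GeneX"],
    "TfC": ["GeneV"],
}

# The & operator keeps only the elements present in both sets.
shared = set(toy_network["TfA"]) & set(toy_network["TfB"])
# sorted() returns a new, alphabetically ordered list.
print(sorted(shared))  # ['GeneX', 'GeneY']

# With no overlap the intersection is empty, and sorted() gives [].
no_overlap = set(toy_network["TfA"]) & set(toy_network["TfC"])
print(sorted(no_overlap))  # []
```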