8. Files¶
8.1. Introduction to files¶
Thus far we have used Python without reading input from files and writing output to files. We have created objects like lists and dictionaries, but nothing has been saved persistently on a disk, meaning that if we close Python, everything is gone. Usually you want to open a file, read its contents (line by line), analyze it, and write it to another file. To work with files we need the Python file
object. file
objects serve as a link to a file residing on a disk of a computer.
To open a file, write some contents to it and close it:
myfile = open('testfile.txt', 'w')
myfile.write('some text\n')
myfile.close()
Note the following:
myfile
is the name of the Pythonfile
object'testfile.txt'
is the name of the file written to the disk of the computerextension is not important, but suggests that it contains text
The
'w'
argument to theopen()
function opens the file for writingIf the file
'testfile.txt'
already exists, you will overwrite its contentsIf not, the file
'testfile.txt'
will be createdYou can also use
'a'
instead, which would append any output to the existing file called'testfile.txt'
If you use
'r'
instead, you only open the file for only reading (it has to exist of course)write()
is a method belonging to thefile
objectwrite()
does not add a newline character to the end of a line, that is why we add\n
to our textclose()
is a method belonging to thefile
object
Many files used for our analyses are large. We could open them, store their contents in the memory of computer (RAM) and work on the complete contents. However, in most cases we want to avoid loading all contents into memory, and as most of the files we work with contain lines of text, we rather open the file, process its contents line-by-line, and write output to another file.
To open a file and read its contents line-by-line we first open the file using open()
. To step through the file line-by-line we use a for
or while
loop to ‘iterate’ over the resulting file
object. Each iteration returns a line of the file, from top to bottom.
The for
or while
loop automatically ends as soon as there are no lines anymore (=after the last line).
At the end of the code that you execute for each line, you should close the file.
infile = open('/vol/cursus/CFB/list-dict-loop/example.bed', 'r')
for line in infile:
print(line)
infile.close()
---------------------------------------------------------------------------
FileNotFoundError Traceback (most recent call last)
/tmp/ipykernel_1853/2966629597.py in <module>
----> 1 infile = open('/vol/cursus/CFB/list-dict-loop/example.bed', 'r')
2 for line in infile:
3 print(line)
4 infile.close()
FileNotFoundError: [Errno 2] No such file or directory: '/vol/cursus/CFB/list-dict-loop/example.bed'
The file /vol/cursus/CFB/list-dict-loop/example.bed
is an example BED file. BED files are used to store genomic coordinates along with additional information. BED files are for instance used to store genomic coordinates of ChIP-seq peaks. The first three fields (columns) contain the chromosome, start, and end of the features, respectively. The file /vol/cursus/CFB/list-dict-loop/example.bed
also contains ‘header’ lines that are used when the file is displayed in the UCSC genome browser. This is not always the case for BED files.
While the code above functions just fine, it is nowadays more common in Python to open files with the with
statement.
with open('/vol/cursus/CFB/list-dict-loop/example.bed', 'r') as infile:
for line in infile:
print(line)
browser position chr7:127471196-127495720
browser hide all
track name="example" description="example_bed_file" visibility=2 itemRgb="On"
chr7 127471196 127472363 Pos1 0 + 127471196 127472363 255,0,0
chr7 127472363 127473530 Pos2 0 + 127472363 127473530 255,0,0
chr7 127473530 127474697 Pos3 0 + 127473530 127474697 255,0,0
chr7 127474697 127475864 Pos4 0 + 127474697 127475864 255,0,0
chr7 127475864 127477031 Neg1 0 - 127475864 127477031 0,0,255
chr7 127477031 127478198 Neg2 0 - 127477031 127478198 0,0,255
chr7 127478198 127479365 Neg3 0 - 127478198 127479365 0,0,255
chr7 127479365 127480532 Pos5 0 + 127479365 127480532 255,0,0
chr7 127480532 127481699 Neg4 0 - 127480532 127481699 0,0,255
You see that the code is very similar. However, you don’t have to explicitly close the file. As soon as the code exits the with
block, the file is closed automatically.
Note
The with
statement is the preferred way to work with files.
In the next few examples we will see some analysis of this file in action. Each example adds some steps and builds from the previous example.
Example 1:
open the
/vol/cursus/CFB/list-dict-loop/example.bed
fileread the file line-by-line
ignore header lines starting with ‘browser’ or ‘track’
split each line into a list based on the
<tab>
characterprint each line
print the values
with open('/vol/cursus/CFB/list-dict-loop/example.bed', 'r') as infile:
for line in infile: # use a variable called 'line' for each line
line = line.strip() # take away the '\\n' character at the end of each line
# only proceed with lines lacking 'browser' or 'track' at the beginning
if not line.startswith('browser') and not line.startswith('track'):
# split the line (`string`) into a `list`, with the '\\t' as a delimiter
vals = line.split('\t')
print(vals)
['chr7', '127471196', '127472363', 'Pos1', '0', '+', '127471196', '127472363', '255,0,0']
['chr7', '127472363', '127473530', 'Pos2', '0', '+', '127472363', '127473530', '255,0,0']
['chr7', '127473530', '127474697', 'Pos3', '0', '+', '127473530', '127474697', '255,0,0']
['chr7', '127474697', '127475864', 'Pos4', '0', '+', '127474697', '127475864', '255,0,0']
['chr7', '127475864', '127477031', 'Neg1', '0', '-', '127475864', '127477031', '0,0,255']
['chr7', '127477031', '127478198', 'Neg2', '0', '-', '127477031', '127478198', '0,0,255']
['chr7', '127478198', '127479365', 'Neg3', '0', '-', '127478198', '127479365', '0,0,255']
['chr7', '127479365', '127480532', 'Pos5', '0', '+', '127479365', '127480532', '255,0,0']
['chr7', '127480532', '127481699', 'Neg4', '0', '-', '127480532', '127481699', '0,0,255']
Note how we used startswith()
to check if a string starts with ‘browser’ or ‘track’
Example 2:
assign the first three elements of this list to the variables
chrom
,start
, andend
with open('/vol/cursus/CFB/list-dict-loop/example.bed', 'r') as infile:
for line in infile:
line = line.strip()
if not line.startswith('browser') and not line.startswith('track'):
vals = line.split('\t')
chrom, start, end = vals[0:3]
print(chrom, start, end)
chr7 127471196 127472363
chr7 127472363 127473530
chr7 127473530 127474697
chr7 127474697 127475864
chr7 127475864 127477031
chr7 127477031 127478198
chr7 127478198 127479365
chr7 127479365 127480532
chr7 127480532 127481699
Example 3:
convert the
start
andend
variables to an integerint
(they are strings at first)calculate the width (size in bases) of each feature
with open('/vol/cursus/CFB/list-dict-loop/example.bed', 'r') as infile:
for line in infile:
line = line.strip()
if not line.startswith('browser') and not line.startswith('track'):
line = line.split('\t')
chrom, start, end = line[0:3]
start = int(start) # convert start to `int`
end = int(end) # convert end to `int`
width = end - start
print(width)
1167
1167
1167
1167
1167
1167
1167
1167
1167
Example 4:
calculate how many features with strand ‘+’ and strand ‘-‘ are present
plus = 0
minus = 0
with open('/vol/cursus/CFB/list-dict-loop/example.bed', 'r') as infile:
for line in infile:
line = line.strip()
if not line.startswith('browser') and not line.startswith('track'):
vals = line.split('\t')
strand = vals[5]
if strand == '+':
plus = plus + 1
elif strand == '-':
minus = minus + 1
print(plus, minus)
5 4
Thus far we have used the print()
statement to print output to the screen. But if you want to be more explicit in the formatting of the output, you should use format()
, which creates a string
that can either be printed or written to a file. The standard way of using format()
is like this: 'some text {} more text {}'.format(var1, var2)
. See the following example in which format()
is used:
cell_lines = ['HCT116', 'HeLa', 'HEK293']
for i in range(3):
args = 'Cell line {} is {}'.format(i + 1, cell_lines[i])
print(args)
Cell line 1 is HCT116
Cell line 2 is HeLa
Cell line 3 is HEK293
Note that the print()
statement by default adds a newline character "\n"
. If you write to a file you need to add the newline character "\n"
to the string:
with open('cell_lines.txt', 'w') as outfile:
cell_lines = ['HCT116', 'HeLa', 'HEK293']
for i in range(3):
args = 'Cell line {} is\t{}\n'.format(i + 1, cell_lines[i])
outfile.write(args)
Note how we also used "\t"
here. This is the character that specifies a <tab>
.
8.2. Glossary of file terms and functions¶
open()
with
for
while
write()
startswith()
format()
8.3. Exercises: files¶
Exercise 8.1
In this exercise you are going to analyze data from ChIP-seq of several transcription factors in mouse embryonic stem cells (Chen et al. 2008, 13;133(6):1106-17, PMID 18555785). Please have a look at Figure 1 and 2 of this paper. Figure 1 shows the ChIP-seq profiles for the transcription factors.
The ChIP-seq peaks that you see represent in vivo binding sites of these transcription factors. In Figure 2 the authors have analyzed the DNA sequence underneath these peaks, and identified specific DNA sequences (motifs) for each transcription factor.
You are going to analyze these sequences as well, to see which motifs you can identify. You will count k-mers (short sequences, e.g. 8 bases in length) in the sequences underneath the ChIP-seq peaks of a specific transcription factor, and analyze which k-mers are overrepresented.
DNA (or amino acid) sequence files are usually in FASTA format. Check out https://en.wikipedia.org/wiki/FASTA_format for a FASTA description and an example. Just read the first part.
Before you start off using the FASTA files from the study mentioned above, you will work on a file called testfile.fasta
(present in /vol/cursus/CFB/list-dict-loop
), which is a small FASTA file. This allows you to make and test your code quicker.
Provide the Python code that allows you to read the FASTA file while ignoring the description lines containing
>
. For now it is OK to print the output.Extend your code with the following: For each sequence, extract all possible k-mers of length 4. Skip sequences that are shorter than 4 bases. Print the output. Make sure your code allows you to easily run the same analysis with different k-mer lengths.
Extend your code with the following: Count the number of occurrences for each k-mer that is encountered (in all sequences together).
Extend your code with the following: Write the top 10 most abundant k-mer sequences to an output file, along with the number of times they occur.
Now that you have assembled your code (you can remove the printing statements you used for testing) adapt your code to count kmers of length 8. Run your code on 2 files present in the
/vol/cursus/CFB/list-dict-loop
directory:
GSM288346_ES_Oct4_mm9.peaks.fasta
GSM288356_ES_c-Myc_mm9.peaks.fasta
GSM288353_ES_Stat3_mm9.peaks.fasta
GSM288354_ES_Klf4_mm9.peaks.fasta
For the top most occurring 8-mers that you obtained for each file, do these resemble the motifs for the corresponding transcription factors as displayed in Figure 2 of the paper?
Write a function called
top_kmers()
. The function should have three arguments:fname
,k
,n_top
. Thefname
represents the FASTA filename,k
the length of the k-mer andn_top
the number of most abundant k-mers to return. The function should not print the k-mers, but return them as a list. Thek
andn_top
arguments should be optional, with default values of 8 and 10, respectively.
Exercise 8.2: FASTA files
The FASTA format is a widely used text format to represent nucleotide or peptide sequences.
Write a function read_fasta
that reads a multiple sequence FASTA file.
The function should take the file name as an argument and return a dictionary
with the sequence identifier as key and the actual sequence as value. Note that the >
symbol is not part of the sequence identifier.
Create files for all your test cases.
You can create text files from the jupyter window (New
-> Text File
) or use the Upload
button to upload files you created locally.
Exercise 8.3: convert gene annotation file
The file /vol/cursus/CFB/files/genes.txt
contains gene annotation. Each line represents one gene. The file is
tab-separated and contains the following columns:
1 - chromosome
2 - gene start
3 - gene end
4 - strand
5 - gene name
6 - number of exons
7 - a comma-separated list of exon sizes
8 - a comma-separated list of exon starts, relative to the gene start
Write a function called gene2exons
that reads the gene annotation from this file and converts it to a file with exon information. The function should accept two arguments, the input_file
and the output_file
. The output should be tab-separated and will contain 5 columns:
1 - chromosome
2 - exon start
3 - exon end
4 - exon name: gene name followed by a . and the exon number
5 - strand
For instance, for the gene SALL4 the output should look like this:
chr20 51782330 51784684 SALL4.1 -
chr20 51788860 51789141 SALL4.2 -
chr20 51790021 51792352 SALL4.3 -
chr20 51802278 51802520 SALL4.4 -
The gene SALL4 has four exons, which are numbered from 1 to 4. Exon numbers are not dependent on the strand, they are numbered from low start position to high start position.