Ten Strategies

Pick up the pieces

If a program fails, you can troubleshoot it by querying its components in the shell window (see more help on IDLE). If you access the shell before any other program is run, you can investigate the objects created by the program to see if they match expectations. For example, suppose you wrote a program that parses a file into my_list_1 and then builds my_dict_1, which is used to write a new file. You run it and find that the output file is empty. To fix this problem you need to find the place where the program failed. Start by querying my_list_1, entering:

>>> len(my_list_1)

If you expected a million items in the list, yet the list is only 1 item long, you need look no further. If the number is as expected, the next thing to do is examine the structure of the list items:

>>> my_list_1[:10]
(ten list items are printed)

Note that the list is sliced to display only the first 10 items. If you type “my_list_1” alone, the shell may attempt to display millions of items, stalling the process. If the items’ structure is as expected, then the problem is downstream and you should examine the dict by a similar approach.
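
The same check can be wrapped in a small helper that is safe to call on very large objects. This is only a sketch: the function name peek and the sample data are made up for illustration and are not part of the program described above.

```python
from itertools import islice

def peek(obj, n=10):
    """Return the size of a list or dict and its first n items."""
    if isinstance(obj, dict):
        head = list(islice(obj.items(), n))
    else:
        head = list(islice(obj, n))
    return len(obj), head

# Hypothetical stand-ins for my_list_1 and my_dict_1
my_list_1 = ['read_%d' % i for i in range(100000)]
my_dict_1 = dict((name, len(name)) for name in my_list_1)

size, head = peek(my_list_1, 3)
print(size, head)
size, head = peek(my_dict_1, 2)
print(size, head)
```

Because only the first few items are ever turned into text, the shell stays responsive no matter how large the object is.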

Debug with print statements

You launch a program, sit back and wait, and wait… Would it not be nice to know what the program is doing? You can extract that information by placing print statements at strategic places. The printed message will appear in the shell window when the program reaches that point. It could just say “Hey, you are at this place” or provide more information, such as the length of a list. Example:

>>> print('made list2')

By following the messages in the shell window you can identify steps that are slow or that hang.
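
A slightly richer variant adds the elapsed time to each message, which makes slow steps easier to spot. This is a sketch, not a fixed recipe: the helper name checkpoint and the list being built are assumptions chosen for the example.

```python
import sys
import time

def checkpoint(message, start):
    """Print a progress message with the seconds elapsed since start."""
    elapsed = time.time() - start
    text = '[%.1f s] %s' % (elapsed, message)
    print(text)
    sys.stdout.flush()  # show the message right away, even if output is buffered
    return text

start = time.time()
list2 = [n * n for n in range(100000)]  # stand-in for a slow step
checkpoint('made list2, %d items' % len(list2), start)
```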

Watch your memory

Again: you launch a program, sit back and wait, and wait… If you followed the tip above you may even know where your program is hanging. But why is it doing so? Is it because your code is inefficient and perhaps contains an infinite loop, or because it is doing something that simply takes time, or could it be your memory? Python, like any program, will claim RAM as needed. Loading large datasets into memory can exceed the RAM available in your computer. At that point the system falls back on virtual memory, essentially swapping data to and from the hard drive, which is very slow. You can follow your RAM use with a utility such as iStat Menus, which puts a little RAM widget at the top of your computer screen. A free alternative is to open the Terminal and launch “top”. If the RAM indicator maxes out, you can try closing other programs, redesigning your Python program, or buying more memory.
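
Memory use can also be checked from inside the program. The sketch below assumes Python 3.4 or later, where the standard library includes the tracemalloc module; it only counts memory allocated by Python objects, so the numbers will be lower than what “top” reports for the whole process. The list named data is a made-up stand-in for a real dataset.

```python
import tracemalloc

tracemalloc.start()
data = [str(n) for n in range(200000)]  # stand-in for a large dataset
current, peak = tracemalloc.get_traced_memory()  # both values are in bytes
tracemalloc.stop()

print('current: %.1f MB, peak: %.1f MB' % (current / 1e6, peak / 1e6))
```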


When designing a program, it is advantageous to set up a simplified but essentially similar dataset to work on. For example, instead of running your program on a genome-sized file of reads, you can make a very small and simple proxy. A very small input set can be dealt with interactively, i.e. in the shell, and new strategies can be tested rapidly.
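
One simple way to build such a proxy is to copy the first few lines of the big file into a new one. The following is a sketch under stated assumptions: the function name make_proxy and the file names are hypothetical, and the demonstration writes its own throwaway input so it can be run anywhere.

```python
import os
import tempfile
from itertools import islice

def make_proxy(big_path, small_path, n_lines=1000):
    """Copy the first n_lines of big_path into small_path; return lines written."""
    written = 0
    with open(big_path) as big, open(small_path, 'w') as small:
        for line in islice(big, n_lines):
            small.write(line)
            written += 1
    return written

# Throwaway file standing in for a genome-sized file of reads
tmp_dir = tempfile.mkdtemp()
big_path = os.path.join(tmp_dir, 'all_reads.txt')
small_path = os.path.join(tmp_dir, 'proxy_reads.txt')
with open(big_path, 'w') as handle:
    for i in range(5000):
        handle.write('read_%d\tACGT\n' % i)

print(make_proxy(big_path, small_path, n_lines=8))
```

For formats that spread one record over several lines, pick n_lines as a multiple of the record length so the proxy does not end mid-record.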

Work line by line

This is a well-known strategy in programming, but it is worth repeating. Suppose you are processing a very large file, such as a 2 GB Illumina read file. Once the file has been opened, you can read it all at once or line by line. The first takes a lot of memory, the second hardly any. The syntax to process one line at a time goes like this:

for line in my_file:
    process_line(line)  # stand-in for your own processing code
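
As a concrete illustration, the sketch below counts lines and characters while holding only one line in memory at a time. The function name count_lines_and_chars and the tiny test file are assumptions made for the example; a real read file would be handled the same way.

```python
import os
import tempfile

def count_lines_and_chars(path):
    """Read a file one line at a time, keeping only running totals in memory."""
    n_lines = 0
    n_chars = 0
    with open(path) as my_file:
        for line in my_file:
            n_lines += 1
            n_chars += len(line.rstrip('\n'))
    return n_lines, n_chars

# Small throwaway file standing in for a large read file
handle, path = tempfile.mkstemp(suffix='.txt')
os.close(handle)
with open(path, 'w') as out:
    out.write('ACGT\nGGCC\nTTA\n')

print(count_lines_and_chars(path))
```

By contrast, calling my_file.read() or my_file.readlines() would load the entire file into RAM before any processing starts.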

Take advantage of specialized libraries

In addition to the regular Python distribution there are a number of useful “add-ons”, i.e. packages of Python-based software modules that let you do fancy stuff with relative ease. For example, want to graph some data and Excel is frustrating you? Install Matplotlib and NumPy and you will be able to make very nice charts and graphs. If installing them seems daunting, here is a possible fix: go to Enthought and download their Python distribution. If you are in academia, it is free. Install the package and take the Matplotlib tutorial and any other that seems appropriate. For an example of what you can do with Matplotlib see GC_Plotter.

Make a time counter

Sometimes I write a program, launch it and… wait. What is happening? I like to see that the program is working. One strategy I find useful is explained above and involves the use of print statements. These help, but they are not easy to use inside a long loop that processes a file. I recently experimented with a simple counter that lets me know how far the loop has gone. It works as follows:

import time
start = time.time()  # wall-clock start time
line_counter = 0
print('time, no. lines')
for line in sam_file:  # sam_file is an already opened file
    line_counter += 1
    # past the first million lines, report at every full million
    if line_counter > 1000000 and line_counter % 1000000 == 0:
        elap_time = time.time() - start
        print('%.2f %d' % (elap_time, line_counter))
    # process the line as planned

# example output:
# time, no. lines
# 17.67 2000000
# 25.66 3000000
# 33.81 4000000
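
If several programs need the same counter, one option is to wrap it in a generator so the main loop stays uncluttered. This is a sketch of that idea rather than the counter used above: the name with_progress and its every parameter are made up for the example, and it reports at every multiple of every, including the first.

```python
import time

def with_progress(lines, every=1000000):
    """Yield each line unchanged, printing elapsed time every `every` lines."""
    start = time.time()
    for line_counter, line in enumerate(lines, 1):
        if line_counter % every == 0:
            print('%.2f %d' % (time.time() - start, line_counter))
        yield line

# Stand-in for an open file: any iterable of lines works
fake_file = ('line %d\n' % i for i in range(25))
total = 0
for line in with_progress(fake_file, every=10):
    total += 1  # process the line as planned
print(total)
```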

Coming soon: another tip
