This page is an invitation to learn Python and apply it to bioinformatics. It is designed for people with no programming experience who are curious about learning how to program. It uses my limited experience in this area to demonstrate that even if you have never been exposed to any programming, it is possible to learn enough Python to write programs that analyze sequence data and produce results. In addition, programming is fun, and it beats Sudoku and crossword puzzles as a constructive brain teaser and pastime. -Luca Comai
What is Python?
Python is a modern programming language that is easy to code and use. It resembles another very common language called Perl. Python code is easy to understand because the language resembles natural language and because it uses indentation to separate blocks of code and to convey their hierarchy. Python is very powerful and is used by major private and public institutions such as Google and NASA.
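To make the point about indentation concrete, here is a minimal sketch (the sequence is made up for illustration). Each indented line belongs to the statement above it, so the layout of the code shows its hierarchy.

```python
# Count the G and C bases in a short, made-up DNA sequence.
sequence = "ATGCGCGTTA"

gc_count = 0
for base in sequence:          # the indented lines below belong to this loop
    if base in "GC":           # inside the loop
        gc_count += 1          # inside the "if", two levels deep

# Back at the left margin: the loop has ended.
gc_fraction = gc_count / len(sequence)
print(gc_fraction)
```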
How I learned
I first tried Perl, another programming language, and became frustrated by my inability to understand the syntax. Now that I know a little bit about programming, I am not sure why this was the case, because I can understand (kind of) Perl and it is really not difficult. This just goes to show that learning programming may seem difficult at first. When that happened, I dropped the effort. A few months later, I decided to try again. Victor Missirian, a bioinformatician who worked with me, showed me the Python tutorial written by the Python inventor, Guido van Rossum. It appeared simple to follow, so I gave it a go. I downloaded the Python installation package from python.org, bought a couple of books and never looked back. I started writing a program that would count and report all the restriction fragments in a genome. Having a specific objective helped me focus and stay motivated. Since then I have written dozens of little programs to do all kinds of things, such as parsing Illumina sequencing files, grading my class, performing in silico comparative genomic hybridization, analyzing the genomic outcome of crosses, and so on. Now, do not get me wrong: I would not call myself an expert, and there is a lot left for me to learn. But this is really the good news. You do not need a degree in computer science to have fun and be productive.
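To give a flavor of what such a first project might look like, here is a minimal sketch of counting restriction fragments. It is not my original program. The function name, the toy sequence, and the choice of enzyme are assumptions made for illustration. It uses EcoRI, which recognizes GAATTC and cuts after the G, and it treats the sequence as a single linear molecule.

```python
def restriction_fragments(sequence, site="GAATTC", cut_offset=1):
    """Return fragment lengths after cutting a linear sequence at every site.

    cut_offset is the number of bases into the site where the cut falls
    (1 for EcoRI, which cuts G^AATTC). The site is palindromic, so
    searching one strand is enough.
    """
    sequence = sequence.upper()
    cut_positions = []
    start = sequence.find(site)
    while start != -1:                           # find() returns -1 when no site is left
        cut_positions.append(start + cut_offset)
        start = sequence.find(site, start + 1)
    boundaries = [0] + cut_positions + [len(sequence)]
    # Each fragment runs from one boundary to the next.
    return [end - begin for begin, end in zip(boundaries, boundaries[1:])]


toy_genome = "AAAGAATTCCCCCGAATTCTT"                # hypothetical sequence with two EcoRI sites
fragments = restriction_fragments(toy_genome)
print(len(fragments), fragments)
```

A real genome would be read from a FASTA file one chromosome at a time, but the logic of finding sites and measuring the distance between cuts stays the same.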
How you can learn
Apple computers come with a version of Python installed. It is useful, however, to download Python as part of a software package manager called Anaconda. You could get a Python installer from the official Python website, but Anaconda is handy because it installs a set of auxiliary programs for data science. Python comes in two versions: the old one is 2.x, the new one is 3.x. I suggest you download the latest 3.x, such as 3.8 (June 2020). Anaconda provides an integrated development environment (IDE) called Spyder, in which you can write a program and execute it. There are many other IDEs available. Another way to write and execute a program is to use a code editor. It is similar to an IDE, but more sophisticated in handling text. A very good one is BBEdit (free version available). Last, you can run Python inside a notebook, a very useful method that will be discussed in a different post.
With Python open in an IDE or in the Terminal (for Apple computers), follow the Python tutorial. Once I learned the basics, I found two books very useful. The first is Learning Python by Mark Lutz. The second is Python Cookbook by David Beazley and Brian K. Jones. Beyond the books, the Internet has an answer for just about any problem you may encounter: just google it. By the way, while you are working on your skills it would be good to also learn basic Unix commands: see this tutorial.
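Once Python is open, a quick first check is to confirm which version is running. The short sketch below uses only the standard `sys` module. The wording of the messages is my own.

```python
import sys

# sys.version_info holds the version of the Python that is running this code.
major, minor = sys.version_info[:2]
print("Running Python {}.{}".format(major, minor))

if major < 3:
    print("This is the old 2.x line; consider installing a 3.x version.")
```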
Examples of programs
To download a program, go to the program page using the link below, then click the download link. Depending on how you click, one of two outcomes is possible.
- If your browser takes you to a new page, select all the text in the page and copy it into an IDE or TextWrangler. Save it with a “.py” extension. The editor will then color the code to distinguish annotation from code.
- If you right-click on the download link, you should be able to download the program as a “.py” file.
Note that the program contains both “live” code and annotation. The latter is any text flanked by triple apostrophes (''' ... ''') and any line that starts with “#”. The annotation is there to explain the program or to remind the author or user of the function of each line of code.
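The difference between annotation and live code looks like this in practice. This is a made-up fragment, not taken from any of the programs below.

```python
'''
This block is flanked by triple apostrophes. Python treats it as a
piece of text and does not run it as instructions.
'''

# This line starts with "#", so Python ignores it.
sequence = "ATGGCC"        # live code; a comment can also follow code on the same line
length = len(sequence)     # live code: count the bases
print(length)
```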
Barcode generator produces DNA barcodes of desired length and sequence distance for use in adapters for sequencing libraries.
CNV Mapping uses mapped sequencing reads from related individuals in a population to derive linkage between copy number variable loci. This program is written as a notebook: it is divided into cells within a program called Jupyter Notebook, also part of the Anaconda installation. The notebook format has great utility, which I will explain in another post.
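To illustrate the kind of task the Barcode generator addresses, here is a minimal sketch of one possible approach. It is not the downloadable program itself. The function names are hypothetical, "sequence distance" is assumed here to mean the number of mismatched positions (Hamming distance), and the greedy strategy is simply the easiest one to read.

```python
from itertools import product


def hamming_distance(a, b):
    """Count the positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def make_barcodes(length, min_distance):
    """Greedily collect barcodes that differ from each other at
    min_distance or more positions."""
    accepted = []
    for bases in product("ACGT", repeat=length):   # every possible sequence of this length
        candidate = "".join(bases)
        # Keep the candidate only if it is far enough from all barcodes kept so far.
        if all(hamming_distance(candidate, other) >= min_distance
               for other in accepted):
            accepted.append(candidate)
    return accepted


barcodes = make_barcodes(length=4, min_distance=3)
print(len(barcodes), barcodes[:5])
```

Trying every possible sequence is only practical for short barcodes, because the number of candidates grows as 4 to the power of the length. A real design would probably also filter on properties such as base composition.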
The research experience on which the Python pages are based is funded by the National Science Foundation Plant Genome grant DBI-0733857 (Functional Genomics of Polyploids), NSF Plant Genome award DBI-0822383, TRPGR: Efficient identification of induced mutations in crop species by ultra-high-throughput DNA sequencing, and National Institutes of Health R01 GM076103-01A1 (Dosage dependent regulation in hybridization) to LC.
The persimmon genome reveals clues to the evolution of a lineage-specific sex determination system in plants
We developed a reference genome for Diospyros lotus (diploid persimmon) to further investigate the evolutionary steps that led to dioecy in this system. Using this new genome assembly, we were able to identify a species-specific whole-genome duplication (WGD) event, as well as add to our understanding of the sex determination system in persimmon (published previously here in Science), by identifying a third homolog of MeGI and OGI, called Sister-of-MeGI (SiMeGI), originating from a local duplication event. Our data suggest that this WGD event initiated the evolution of the current sex determination system and further highlight the role of gene duplications in the transition between sexual systems.
In this article geared towards middle-school students, we introduce the different kinds of sexual systems in plants, their advantages and disadvantages, how scientists can learn more about them, and why it is important to understand how they work.
In this collaboration with the International Potato Center (CIP), we investigated potato lines with an unusual property: when used to cross-pollinate another potato, their chromosomes sometimes disappear from the ensuing progeny, resulting in haploid plants. More than just a genetic oddity, this phenomenon can speed up crop improvement by several years. To better understand how this chromosome disappearing act works, we used whole-genome sequencing to ask whether the genomes of over 150 potato haploids raised at CIP over a decade ago contained residual pieces of pollinator chromosomes.
Trends in Genetics Volume 35, Issue 11, November 2019 Pages 791-803
Advances in DNA sequencing and genome analysis enable both reinterpretation of historical data as well as discovery of plant genome instability in the field and in experimental systems. Genome instability, which in animals is associated with cancer, can be triggered in plants by multiple causes, including crosses between parents with incompatible genomes. Mechanisms leading to instability are common across plant and animal kingdoms, involving failure in chromosome partitioning between dividing cells, DNA breaks, and faulty repair.
Haploid induction, an important tool in plant breeding, can result from alteration of a chromatin protein that determines centromeres and promotes genome instability. Plant tolerance to genomic imbalance and to aneuploidy may provide increased opportunity for evolutionary success of karyotypic novelty generated by genome instability.
A comprehensive genomic scan reveals gene dosage balance impacts on quantitative traits in Populus trees
In this collaboration with Heloise Bastiaanse and Andrew Groover (USFS), we are following up on our poplar gamma-irradiated mutant population (http://www.plantcell.org/content/27/9/2370.long) to describe the effect of dosage variation on quantitative traits. Hundreds of sibling hybrid trees carrying large-scale deletions and insertions were established in a replicated field in Placerville, CA and characterized for a variety of phenotypic traits. In this article, we introduce the concept of dosage QTL, the association between gene dosage and trait values, and discuss the effect of gene dosage variation on phenology and biomass-related traits.
M Fossi, K Amundson, S Kuppu, A Britt, L Comai Plant physiology 180 (1), 78-86
Plants have the remarkable ability to regenerate an entire plant from a single cell. Researchers have long known that this regeneration introduces genetic and phenotypic variation, but the underlying causes remain elusive. Here, we used whole-genome sequencing to document the types and extent of changes to chromosome number and structure that can occur in potato plants that have been regenerated from single cells.
This review looks at the theoretical paths leading to dioecy, or the presence of separate sexes, and examines recent molecular and genomic evidence from the few plant species where those mechanisms have been uncovered. We discuss whether the evidence fits previously proposed theoretical models, and potential common mechanisms between the few known cases so far, including the role of duplication in allowing the development of new functions.
Chromothripsis is a relatively newly discovered phenomenon in plants, in which part of the genome is dramatically reorganized, akin to what is seen in certain cancerous cells in humans. It has so far gone unrecognized, partly for lack of methods to detect it. In this protocol, we present a simple method for the detection and characterization of chromothripsis in plant genomic data.