Bioinformatics Programming

Unfortunately, the thorough Python coverage proved to be too thorough: long descriptions of language features and lists of library methods, with few examples of applications to biology problems. This exposition-heavy style makes the book more challenging than introductory books such as Mark Lutz's Learning Python or even more advanced but example-rich books like Lutz's Programming Python or Mark Pilgrim's Dive Into Python. Most importantly, given the target audience, the presentation does not take advantage of the rich analogies available between biology and object-oriented programming, and the sparse examples do not reveal the idiosyncrasies particular to biological data.

Three main examples run through the book: translating RNA sequences into protein sequences, searching for restriction enzyme cleavage sites, and parsing protein and nucleotide (DNA or RNA) sequence files (FASTA, a simple format for named sequences, and GenBank, a format allowing more detailed mark-up of sequence features).

The translation example is useful, giving methods for six-frame translation and a clever regular expression for finding ATG-initiated open reading frames (RNA segments that can be translated into bacterial proteins), but omits two practical points. First, although the book takes the physically accurate approach of translating from RNA sequences ([UAGC]+), it is more convenient to handle RNA internally as cDNA ([TAGC]+) because DNA is the more common input format. Second, the protein sequences are handled as three-letter rather than one-letter codes. In addition to being easier to handle as strings, one-letter codes are much more common, and a familiarity with these codes and the corresponding amino acid properties is a prerequisite for sequence analysis. (Table 2-2 of O'Reilly's BLAST book gives a good overview. For a more complete primer, Branden and Tooze's "Introduction to Protein Structure" can be read in about a week and is great for developing a feel for the relationship between protein sequence and biological function.)
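The two practical points above can be sketched in a few lines. This is a minimal illustration, not the book's code: the names `translate`, `six_frame_translation`, and `ORF` are hypothetical, and the ORF pattern is a simple stand-in rather than the regular expression the book uses. Input is normalized to cDNA, and proteins come back as one-letter codes with `*` marking a stop.

```python
import re

# Standard genetic code as one-letter amino acids, keyed by DNA codons.
# The string follows the conventional TCAG ordering; '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate in frame 0; a trailing partial codon is dropped.

    Codons containing anything other than A, C, G, T become 'X'.
    """
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, usable, 3))


def six_frame_translation(sequence):
    """Map frame labels ('+1'..'+3', '-1'..'-3') to one-letter protein strings."""
    dna = sequence.upper().replace("U", "T")  # accept RNA, store as cDNA
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


# Stand-in ORF pattern: ATG, then whole codons, up to the first in-frame stop.
ORF = re.compile(r"ATG(?:[ACGT]{3})*?(?:TAA|TAG|TGA)")

if __name__ == "__main__":
    for label, protein in six_frame_translation("AUGGCCUAA").items():
        print(label, protein)
```

Working in cDNA means the same functions accept sequences pasted from GenBank or FASTA files without a conversion step, and the one-letter output can be searched with ordinary string methods.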

The restriction enzyme example is more problematic. It does not cover the common cases of searching both strands of the DNA molecule (necessary when the restriction site is not palindromic) or searching a circular DNA molecule (necessary when cloning a gene into a plasmid, the most common application of restriction enzymes). When a restriction digest is used as an example for a function returning multiple values (pg 62), the digest product is returned as two strings, which is useful only for the rare case of restriction enzymes that produce "blunt" DNA ends. Most restriction enzymes produce "sticky" ends, in which one strand of DNA overshoots the other and is free to recognize a DNA molecule with a complementary sticky end. This is the basis for most gene cloning, and exploring a class to correctly implement sticky ends would have made for a richer programming lesson. As a final quibble, the string example on page 4 gives the amino acid sequences of "some unusually small bacterial restriction enzymes", including "MNKMDLVADVAEKTDLSKAKATEVIDAVFA", which is much too short to be plausible as an independently folding protein, much less an enzyme. A quick check against the REBASE link in the book shows that this sequence is the entry for Aor13HI. The abstract for the literature reference given by REBASE (Biosci. Biotechnol. Biochem. 57:1716-1721) shows that this sequence is only the N-terminal 30 amino acids of a much larger 30 kDa (~270 aa) protein (this kind of short read is common for proteins sequenced by Edman degradation). This is a minor point, but relevant given that one of the most important steps in drawing a conclusion from a hit in a bioinformatics database is following the link to the primary reference in the scientific literature (i.e., the reliability and interpretation of a given record need to be judged based on the experiment that it was drawn from). The only example in the book to touch on this is the GenBank parser on page 149, which skips the literature fields as part of its skip_intro function.
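To make the missing cases concrete, here is a rough sketch of a site search that covers both strands and circular molecules, plus a small class that records where each strand is cut so that sticky ends can be described. None of this comes from the book: the `Enzyme` class, its fields, and `find_sites` are hypothetical names, and the cut offsets are expressed in top-strand coordinates as an assumption of this example.

```python
from dataclasses import dataclass

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


@dataclass(frozen=True)
class Enzyme:
    name: str
    site: str        # recognition sequence, 5'->3' on the top strand
    top_cut: int     # cut position on the top strand, as an offset from the site start
    bottom_cut: int  # cut position on the bottom strand, in the same top-strand coordinates

    def end_type(self):
        """Classify the ends left by a cut."""
        if self.top_cut == self.bottom_cut:
            return "blunt"
        return "5' overhang" if self.top_cut < self.bottom_cut else "3' overhang"

    def overhang(self, dna, start):
        """Single-stranded bases (read on the top strand) for a '+' hit at `start`."""
        low, high = sorted((self.top_cut, self.bottom_cut))
        return dna[start + low:start + high]


def find_sites(dna, enzyme, circular=False):
    """Return sorted (start, strand) hits; '-' means the site lies on the bottom strand.

    Assumes the molecule is at least as long as the recognition site.
    """
    dna = dna.upper()
    # For a circular molecule, wrap the first len(site) - 1 bases onto the end
    # so that sites spanning the origin are found.
    search = dna + dna[:len(enzyme.site) - 1] if circular else dna
    patterns = {"+": enzyme.site}
    flipped = reverse_complement(enzyme.site)
    if flipped != enzyme.site:  # non-palindromic: the other strand needs its own search
        patterns["-"] = flipped
    hits = []
    for strand, pattern in patterns.items():
        start = search.find(pattern)
        while start != -1:
            hits.append((start, strand))
            start = search.find(pattern, start + 1)
    return sorted(hits)


if __name__ == "__main__":
    ecori = Enzyme("EcoRI", "GAATTC", top_cut=1, bottom_cut=5)  # G^AATTC
    plasmid = "TTCAAAAGAA"  # the only site spans the origin
    print(find_sites(plasmid, ecori))                 # linear search
    print(find_sites(plasmid, ecori, circular=True))  # circular search
    print(ecori.end_type(), ecori.overhang("AAGAATTCAA", 2))
```

Keeping the two cut positions on the enzyme, rather than returning two bare strings from a digest, is what lets a program ask whether two fragments have compatible ends.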

FASTA and GenBank are the most likely file formats that a student will have to deal with, and the example GenBank parsers are particularly good, giving a link to the corresponding NCBI documentation and building up a complex parser from simple methods, using stub implementations with small, easily parsed data chunks along the way. Likewise, the regular-expression-based FASTA parser on page 275 is concise and robust, but many of the other parsers (some of the FASTA parsers, the REBASE parser, and a screen-scraper for NCBI searches) take an ad hoc approach based on the data at hand (rather than documentation on the format), and lack strong checks on the assumptions made by the parser. In particular, the NCBI scraper misses the fact that there is a well-documented API available for retrieving these search results in an easy-to-parse XML format (this would have made a good example for the section on urllib, in place of the PyPI example in the book). While it is true that undocumented file formats are common in bioinformatics, we are making steady progress towards well-documented standards, and teaching students to find and write to those standards is an important element of their propagation.
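As an illustration of what "strong checks on the assumptions made by the parser" might look like, here is a small FASTA reader that fails loudly instead of guessing. It is a sketch under stated assumptions, not the parser from page 275: the name `parse_fasta`, the header pattern, and the set of accepted residue characters (letters, `*`, and `-`) are all choices made for this example.

```python
import re

HEADER = re.compile(r">(\S+)\s*(.*)")    # '>name optional description'
RESIDUES = re.compile(r"[A-Za-z*-]+")    # assumed alphabet: letters, stop, gap


def _record(name, description, chunks):
    """Assemble one record, refusing headers that have no sequence lines."""
    if not chunks:
        raise ValueError(f"record {name!r} has no sequence")
    return name, description, "".join(chunks)


def parse_fasta(lines):
    """Yield (name, description, sequence) tuples from an iterable of lines.

    Raises ValueError on anything that violates the assumptions above.
    """
    name = description = None
    chunks = []
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        if line.startswith(">"):
            if name is not None:
                yield _record(name, description, chunks)
            match = HEADER.fullmatch(line)
            if match is None:
                raise ValueError(f"line {number}: header has no name")
            name, description = match.groups()
            chunks = []
        elif name is None:
            raise ValueError(f"line {number}: sequence data before first header")
        elif RESIDUES.fullmatch(line) is None:
            raise ValueError(f"line {number}: unexpected characters in sequence")
        else:
            chunks.append(line.upper())
    if name is not None:
        yield _record(name, description, chunks)


if __name__ == "__main__":
    example = [">seq1 first record", "ACGT", "acgt", "", ">seq2", "MKV*"]
    for record in parse_fasta(example):
        print(record)
```

Because the function takes any iterable of lines, it works equally on an open file object or on a list of strings in a test.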

A more critical omission in the parsers is the lack of motivation for why one would want to parse a file in the first place. A GenBank record is parsed into a GenBankEntry object composed of GenBankFeature objects, but then nothing is done with it. Without an application, it is difficult to delve into the details of the parser. An application that cared about which bits of a DNA sequence corresponded to a transcribed RNA would have to handle GenBank's tricky coordinate format, while an application for finding evolutionarily related sequences could simply jump to the sequence entry at the end of the file. Likewise, the design decisions for the GenBankFeature class, particularly whether it should be subclassed for different feature types, are missed without the context of how the class will be used. An obvious example would have been to combine the GenBank parser with the previous restriction enzyme code to generate a restriction map for the GenBank sequence (a typical first step in cloning a gene). This would also have been a useful application of the CGI interface to REBASE developed in the web programming and relational database chapters (as written, it supports querying by name, species, cut-site, etc., but not the typical application of searching a given DNA sequence against all available enzymes).
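To show why the coordinate format matters once an application needs it, here is a sketch that handles only the most common location forms: a single base, a `start..end` span, `join(...)`, and an outer `complement(...)`, with `<` and `>` partial markers ignored. The function names are hypothetical and this is not the book's GenBankFeature code. Anything outside that subset (for example `order(...)`, `123^124`, or references to other accessions) raises an error rather than being silently misread.

```python
import re

COMPLEMENT = str.maketrans("ACGT", "TGCA")
SPAN = re.compile(r"<?(\d+)(?:\.\.>?(\d+))?")  # '467', '12..78', '<1..>200'


def parse_location(location):
    """Return (strand, spans) with spans as 0-based, half-open (start, end) pairs.

    GenBank coordinates are 1-based and inclusive, so '12..78' becomes (11, 78).
    """
    text = location.replace(" ", "")
    strand = "+"
    if text.startswith("complement(") and text.endswith(")"):
        strand = "-"
        text = text[len("complement("):-1]
    if text.startswith("join(") and text.endswith(")"):
        text = text[len("join("):-1]
    spans = []
    for part in text.split(","):
        match = SPAN.fullmatch(part)
        if match is None:
            raise ValueError(f"unsupported location: {location!r}")
        start = int(match.group(1))
        end = int(match.group(2) or start)  # a single base spans one position
        spans.append((start - 1, end))
    return strand, spans


def extract(dna, strand, spans):
    """Join the spans in order, then reverse-complement if on the '-' strand."""
    joined = "".join(dna[start:end] for start, end in spans)
    if strand == "-":
        joined = joined.translate(COMPLEMENT)[::-1]
    return joined


if __name__ == "__main__":
    strand, spans = parse_location("complement(join(1..2,5..6))")
    print(strand, spans, extract("AACCGGTT", strand, spans))
```

The output of `extract` is exactly the kind of string that a restriction site search could take as input, which is the sort of end-to-end application the parser chapters lack.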

The decision to base the book on Python 3 is forward-thinking, but precludes using code that depends on the numpy module. In particular, this means that the structured graphics chapter can reference PyChart but not matplotlib, and that the section on tree data structures cannot reference the excellent NetworkX graph library. Other notable bioinformatics modules with numpy dependencies include BioPython and the Python interface to Cluster3. On the other hand, this choice simplifies some of the math examples and is a good step towards encouraging scientists to forward-port their code. I found the Python 3 details to be the most useful part of the book, and I would recommend it as a supplement to Mark Pilgrim's Dive Into Python 3 for programmers making the 2 to 3 transition.

On a positive note, the final chapter on structured graphics gives a good overview of creating plots with the Tkinter GUI toolkit or the SVG file format. The choice of a dotplot for the Tkinter example is especially good. Dotplots are an often overlooked tool for comparing two sequences and are useful for revealing features such as repeats and inversions that can be difficult to see by other methods. They are also a good foundation for discussing sequence alignment algorithms (e.g., the methods in chapter 3 of the O'Reilly BLAST book). There is a good discussion of the different patterns that can occur in a dotplot, with a link to a more extensive review on the web. A useful exercise for the reader would be to add bindings to the Tkinter canvas so that clicking on a location in the dotplot would display the corresponding sequence alignment (as with the DOTTER program from AceDB).
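For readers who have not met dotplots before, the underlying computation is small enough to sketch without a GUI. The version below is an assumption-laden illustration rather than the book's Tkinter example: `dotplot` and `render` are hypothetical names, matches are exact over a fixed window, and the plot is drawn as text.

```python
def dotplot(seq_a, seq_b, window=3):
    """Return the set of (i, j) where seq_a[i:i+window] == seq_b[j:j+window]."""
    # Index every window of seq_b once, so each window of seq_a is a dict lookup.
    index = {}
    for j in range(len(seq_b) - window + 1):
        index.setdefault(seq_b[j:j + window], []).append(j)
    return {
        (i, j)
        for i in range(len(seq_a) - window + 1)
        for j in index.get(seq_a[i:i + window], ())
    }


def render(seq_a, seq_b, window=3):
    """Draw the plot as text: rows follow seq_a, columns follow seq_b."""
    dots = dotplot(seq_a, seq_b, window)
    columns = range(len(seq_b) - window + 1)
    rows = (
        "".join("*" if (i, j) in dots else "." for j in columns)
        for i in range(len(seq_a) - window + 1)
    )
    return "\n".join(rows)


if __name__ == "__main__":
    # A direct repeat in the first sequence appears as two parallel diagonals.
    print(render("ACGTTGCAACGTTGCA", "ACGTTGCA", window=4))
```

Each `(i, j)` pair in the set is also the natural thing to look up when a click on the canvas needs to be mapped back to a pair of sequence positions. Inversions would additionally require comparing against the reverse complement of one sequence, which this sketch leaves out.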

In conclusion, reading this book did not change my default recommendation of Lutz's Learning Python and Programming Python as a starting point for biologists without prior programming experience (and the O'Reilly BLAST book and "Biological Sequence Analysis" by Durbin, Eddy, Krogh, and Mitchison for programmers interested in bioinformatics), but I would recommend several sections of this book, particularly the last chapter, as useful supplemental reading. This would be an easy recommendation to make for students at schools with on-line access to the O'Reilly catalog via the Safari Bookshelf.