Sunday, October 11, 2009

Systems biology and administration

Remember the mid-nineties when they told all of us computer-enabled folks that
jobs would be raining on us as soon as we got out of university?

In some fields it is still true. Notably, there is a significant subset of
bioinformatics that ought to be (and in all fairness, often is) consulting
work and concerns itself with building the kind of operational application
every software designer in the nineties developed for a private company with
too much money on its hands: integrating the "old" database format with the
"new", converting image files to a decades-old printer driver format,
converting between VisiCalc and MS Excel, and so on.

What gets published in bioinformatics is generally a) research work leading to
new methods (i.e. algorithms) that involve proofs and new mathematical
abstractions for common problems, or b) theoretical molecular biology advances
where bioinformatics is only the 'methods' section. Every once in a while, the
methods themselves get their own paper, which is generally abysmal. The
"cooler" folks put their scripts on their blogs or homepages.

In biology, there is a lot of information to memorize, which is why biology is
always so hard in high school. As academics are no better at rote memorization
than teenagers, but much less likely to take it upon themselves to do it, a
need for organized information in biology arose, and with it came databases
and Web services playing the role of giant bio-repositories. Then, researchers
got tired of looking through this
information and linking it together themselves and tried to get computers to
think in their place. So was born the field of systems biology.

We have, on one hand, an economic need for software jobs requiring little
innovation and a lot of specifics. On the other, a social need for thinking
machines that do your job. Combine the two, and a plethora of applications for
individual needs comes into existence. As with everything in academia, the
rewards are supposed to be exposure (that is, publication) rather than
paychecks. How do we reward software writers who are in it for the résumé?
Enter BMC
Bioinformatics.

Moutselos et al. bring us academics a brand new tool for converting XML into
XML, with a tweak. The tool is written in Java with Java's XML and DOM
libraries.

The authors identified a need for converting the KEGG database format because
"existing tools" (KEGG2SBML) only run under Unix (thus we learn that Python,
Perl, Qt and Graphviz require Unix to run, which must come as a surprise to
everyone involved in their development and use).

To sum up twenty pages of absurdly long technical writing, the authors'
program merges (Unix cat) database files obtained from the Internet (Unix
wget), changes one XML format into another (XSLT), replaces (Unix sed) IDs
with equivalents that can be stored in a database (SQLite) and eliminates
duplicate entries (a bit of creativity with sed, or two lines' worth of
Python). Unsurprisingly, they even cite a bash script that does exactly that.
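
For the curious, here is roughly what that pipeline amounts to in plain
Python rather than a served Java application. This is my own sketch, not the
authors' code: the URLs, the stylesheet name and the element and attribute
names are placeholders.

    # My own sketch of the pipeline described above, not the authors' code.
    # The URLs, the stylesheet 'pathway.xsl' and the element/attribute names
    # are placeholders.
    import glob
    import sqlite3
    import urllib.request

    from lxml import etree  # XML parsing plus an XSLT engine

    # 1. Fetch the pathway files (the "wget" step).
    urls = ["http://example.org/kegg/hsa00010.xml",
            "http://example.org/kegg/hsa00020.xml"]
    for url in urls:
        urllib.request.urlretrieve(url, url.rsplit("/", 1)[-1])

    # 2. Convert each file into the target XML dialect (the XSLT step).
    transform = etree.XSLT(etree.parse("pathway.xsl"))
    records = []
    for path in glob.glob("hsa*.xml"):
        converted = transform(etree.parse(path))
        for entry in converted.iter("entry"):  # placeholder element name
            # 3. Rewrite IDs into database-friendly keys (the "sed" step).
            clean_id = entry.get("id", "").replace(":", "_")
            records.append((clean_id, etree.tostring(entry)))

    # 4. Store everything in SQLite; a UNIQUE constraint plus INSERT OR IGNORE
    #    takes care of the duplicate entries (the "two lines of Python" step).
    db = sqlite3.connect("pathways.db")
    db.execute("CREATE TABLE IF NOT EXISTS entries (id TEXT UNIQUE, xml BLOB)")
    db.executemany("INSERT OR IGNORE INTO entries VALUES (?, ?)", records)
    db.commit()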

The contribution from Moutselos et al.? Their program isn't a script, it's a
full-length Java program that needs a server interface to run in a browser
(unlike, say, XSLT).

Knowing Windows users, especially academics, they probably have a point:
opening a text editor is hard, as is executing a file. Ready-made scripts,
therefore, lack the user interface necessary for the target users.

Let's assume a Web interface is needed: what's wrong with XSLT and JavaScript?
It's obviously too easy to code and to summarize in fewer than twenty pages.
It doesn't even require people to download a JRE. Java in this case is clearly
over-engineering (or a result of computer science courses pushing Java so hard
the students like it when it hurts).

In conclusion, this is yet another instance of inadequate software earning
people citations (and fame). As variation is an important component of Nature,
next up is an adequate contribution from biologists in completely wet
experiments.

References
Moutselos K, Kanaris I, Chatziioannou A, Maglogiannis I, Kolisis FN.
BMC Bioinformatics, doi:10.1186/1471-2105-10-324 (2009)

Saturday, October 10, 2009

Genome generation: no engineering allowed

Once upon a time, there was a small group of researchers who won the Nobel
Prize. This could be just about anyone, but let's say they are James Watson,
Francis Crick and Maurice Wilkins and they all but invented DNA, considering
how neglected it was before then.
Fast forward twenty years (if you know anything about molecular biology,
advances always take twenty years - it works like fashion and you have to wait
for it to come full circle). Double Nobel Prize winner Fred Sanger made
millions on a method to 'decode' DNA, and at the end of the Cold War, you can
bet decoding is on people's minds (right about five years prior, we officially
knew about RSA encryption). Fast forward another twenty years, as expected,
and Sanger has launched both a crowd of scientists following in his footsteps
and the main activity of several sequencing companies, which immediately
caught on to what the war is supposed to have taught America: decoding
information, especially information hidden from you for millennia, will make
you powerful, and power will make you wealthy. Watson, who isn't to be
forgotten, is still alive and kick-starting the 'human genome project', the
most megalomaniacal human
enterprise since the Roman Empire. Even then, at least Caesar believed in
Nature.

We are in the 2000s and the economy is down. What about DNA? Society still
needs it, in the form of bioinformatics, the information era's scavengers
hunting for leftover funding. The fundamental paradigm of molecular biology is
almost dead. As we now know, there is no single cancer gene, no single obesity
gene, no single terrorism gene. What we know is that, between genomes and
proteins, there are several steps. Between genomes and reality? Why, says
genomics, it's a slightly distorted mirror image: genome variation means
variation in observations (I bet you didn't see this one coming).
So we have genes, and we have research aims (curing cancer, making life with
HIV possible, saving the ecosystem...); there needs to be a way to bridge
them. And there is one. Computers. If you've seen any movie made between the
'80s and now, computers can do just about anything, including destroying the
human race. Why not enhance it, ask scientists at the dawn of the new
millennium.

This is how the following is funded.

"The following" is a paper by Polish researcher J. Blazewicz. It doesn't have
an impact comparable to the previously-cited works, but is a good representant
of its class: a paper about software by non-software engineers, that addresses
the problem of turning DNA into analyzable files (ASCII, that is, text).
What software does: take a boring human task and automate it. What assemblers
do: take the sequencer's output and render it usable. If you've ever written
software, you will realize something crucial: this means an assembler should
be bundled with every sequencer. Roche/454 agrees, and gave us Newbler, a fine
piece of software that does exactly that, in record time, though only on their
own technology.
What do Blazewicz et al. do? Get Newbler out of the way, argue that it doesn't
parse its own output format, and persuade biologists to adopt ill-maintained
user-made solutions instead of the proprietary, perfectly fine vendor-supplied
ones.
However, this is only ridiculous from a software engineering point of view.
Bioinformatics is not software engineering -- notably, it doesn't address the
problems of target users so much as the problems of specific researchers,
encouraging homebrew scripts and methods much more than general software
engineering does, and it has incredibly specific problems that often can't be
generalized. That makes inelegant solutions very common and unavoidable.

The main problem in the paper is a staggering lack of clarity, so the reader
needs some hand-holding there. Blazewicz and colleagues start their exposition
by mentioning that they use a DNA graph, which is a graph that has the de
Bruijn property (every vertex is a word of the same length, and any two
vertices connected by an edge overlap by exactly that length minus one) but
isn't a complete graph (it only contains information available from the reads
given by the sequencer). This is a common approach (known as the de Bruijn
approach) in genome assembly, except it isn't the approach used by Blazewicz
et al. What they are doing is taking the DNA graph, removing the de Bruijn
property, and considering variable-length overlaps, while still calling it a
DNA graph. It's on par with calling a stochastic process a "Markovian process
influenced by the last 1000 states" when Markov's property is precisely that
only the last state matters.
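
To make the distinction concrete, here is a minimal sketch (mine, not the
paper's) of a DNA graph with the de Bruijn property: every vertex is a k-mer
taken from the reads, and an edge requires a suffix/prefix overlap of exactly
k minus one bases. Blazewicz et al. keep the vertices but let the overlap
length vary, which is precisely what the de Bruijn property forbids.

    # A minimal sketch of a DNA graph with the de Bruijn property (my own
    # illustration, not code from the paper): vertices are the k-mers present
    # in the reads, and an edge means an overlap of exactly k - 1 bases.
    from collections import defaultdict

    def dna_graph(reads, k):
        """Build the graph over the k-mers actually observed in the reads."""
        kmers = {read[i:i + k] for read in reads
                 for i in range(len(read) - k + 1)}
        edges = defaultdict(set)
        for a in kmers:
            for b in kmers:
                if a != b and a[1:] == b[:-1]:  # suffix/prefix overlap of k - 1
                    edges[a].add(b)
        return edges

    # Three overlapping toy reads; with k = 4 the walk
    # ATGG -> TGGC -> GGCA -> GCAT spells the original sequence ATGGCAT.
    print(dict(dna_graph(["ATGGC", "TGGCA", "GGCAT"], k=4)))
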
From the engineering standpoint, the software's main flaw is that it has no
sensible defaults. Actually, it has no defaults at all, and the user is
supposed to know which values of two arbitrary parameters are correct. These
parameters are overlap parameters (minimum overlap length and acceptable
error, the usual), which is all the more confusing given that the article
keeps referencing de Bruijn approaches, where the overlap is fixed.
The algorithm is pretty classic with some tweaks: first, it relies on the two
aforementioned parameters with no sensible defaults, which isn't very
intelligent, and second, it neglects the number of repeated nucleotides in the
alignment, keeping only one, which is smarter given that "homopolymeric
regions", stretches of the same base more than 5 or 6 nucleotides long, are
routinely mistreated by the 454 technology.
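
What "neglecting the number of repeated nucleotides" boils down to, in a
couple of lines of my own (again, not the authors' code), is collapsing every
homopolymer run to a single base before comparing reads, so that a typical 454
over- or under-call in a long run no longer counts as a mismatch:

    # My own illustration of homopolymer-insensitive comparison, not the
    # authors' code: collapse each run of identical bases to a single base.
    from itertools import groupby

    def collapse_homopolymers(seq):
        """'ACGGGGT' -> 'ACGT': keep one base per run of identical bases."""
        return "".join(base for base, _ in groupby(seq))

    read_a = "ACGGGGTTA"  # four Gs
    read_b = "ACGGGTTA"   # three Gs: a typical 454 homopolymer miscall
    # Both collapse to 'ACGTA', so the discrepancy is ignored.
    print(collapse_homopolymers(read_a) == collapse_homopolymers(read_b))
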
They mention some version of the Traveling Salesman Problem they made up,
citing themselves along the way, but nowhere is it explained, nor is its
relevance discussed. We can safely assume that would have been too hard.

The algorithm is overlap-layout-consensus under another name, with no default
parameters and a homopolymer-insensitive alignment. End of story.

The results are, unfortunately, where it all falls apart like an apple that's
only bad on the inside. The coverage goes down as the number of reads goes
up, which is surprising given that the two should be proportional (more reads
mean more copies of the genome being sequenced). Words are tricky beasts,
especially scientific words, which often get redefined beyond recognition.
When most assembler writers say coverage, they mean the percentage of the
full-size genome covered by the assembly output. When these authors say
coverage, they mean the percentage of the assembly output covered by any
contig of itself. How is that a measure of the assembly's quality? The ways of
abstruse science are impenetrable, but they might have meant that, at least,
their assembly was consistent: when there was only one contig, it covered the
whole of itself. This is a creative new proof in support of the concept of
identity!
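
For anyone keeping score, here is the difference in toy numbers (mine, under
one plausible reading of each definition, not the paper's data):

    # Toy contrast between the two meanings of "coverage" discussed above.
    # The contig and genome lengths are made up, and "self coverage" is one
    # plausible reading of the paper's definition, not a quote from it.
    def genome_coverage(contig_lengths, genome_length):
        """Usual meaning: fraction of the reference genome covered by the
        assembly (assuming non-overlapping contig placement)."""
        return min(sum(contig_lengths), genome_length) / genome_length

    def self_coverage(contig_lengths):
        """Paper-style meaning: fraction of the assembly output covered by
        its own largest contig."""
        return max(contig_lengths) / sum(contig_lengths)

    contigs = [40000, 25000, 10000]
    print(genome_coverage(contigs, genome_length=1800000))  # ~0.04: informative
    print(self_coverage(contigs))                           # ~0.53: self-referential
    # With a single contig, self_coverage([n]) is always 1.0, which is the
    # "consistency" result mocked above.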

More importantly, we have no idea what these contigs contain. While the
described algorithm is sound, the data is carefully anonymized, likely to
protect the bacterium Prochlorococcus marinus, which might not like its genome
being exposed to all and sundry.

The use of undetailed "expert finishing" sweeps the whole procedure under the
rug. Obviously, any assembler does even better with carefully curated
information added manually by professionals. The point of software is not to
complete a task with assistance (see above). It is to cut out the need for
assistance entirely.

In summary, the authors have no idea what software is (even less what it
should be). Their knowledge of the field is adequate, and so is their
understanding of current algorithms. This is a correct review of the field of
genome assembly, presenting both the de Bruijn and the overlap-layout-consensus
paradigms. Know what it isn't? Original research or development.

References

Blazewicz et al., Computational Biology and Chemistry,
doi:10.1016/j.compbiolchem.2009.04.005 (2009)