Saturday, October 10, 2009

Genome generation: no engineering allowed

Once upon a time, there was a small group of researchers who won the Nobel
Prize. This could be just about anyone, but let's say they are James Watson,
Francis Crick and Maurice Wilkins, and that they all but invented DNA,
considering how neglected it was before then.
Fast forward twenty years (if you know anything about molecular biology,
advances always take twenty years; it works like fashion and you have to wait
for it to come full circle). Double Nobel Prize winner Fred Sanger made
millions on a method to 'decode' DNA, and at the end of the Cold War, you can
bet decoding is on people's minds (only about five years earlier, we officially
learned about RSA encryption). Fast forward another twenty years, right on
schedule, and Sanger has launched a generation of scientists following in his
footsteps, along with the main line of business of several sequencing
companies, who immediately caught on to what the war is supposed to have taught
America: decoding information, especially information hidden from you for
millennia, will make you powerful, and power will make you wealthy. Watson, who
isn't to be forgotten, is still alive and kick-starting the 'human genome
project', the most megalomaniac human enterprise since the Roman Empire. Even
then, at least Caesar believed in Nature.

We are in the 2000s and the economy is down. What about DNA? Society still
needs it, in the form of bioinformatics, the information era's scavengers
hunting for leftover funding. The fundamental paradigm of molecular biology is
almost dead. As we now know, there is no single cancer gene,
no single obesity gene, no single terrorism gene. What we do know is that,
between genomes and proteins, there are several steps. Between genomes and
reality? Why, says genomics, it's a slightly distorted mirror image -- genome
variation means variation in observations (I bet you didn't see this one
coming). So we have genes, we have research aims (curing cancer, making life
with HIV possible, saving the ecosystem...) -- there needs to be a way to
bridge them. And there is one. Computers. If you've seen any movie made between
the 80s and now, computers can do just about anything, including destroying the
human race. Why not enhance it instead, ask scientists at the dawn of the new
millennium.

This is how the following is funded.

"The following" is a paper by Polish researcher J. Blazewicz. It doesn't have
an impact comparable to the previously-cited works, but is a good representant
of its class: a paper about software by non-software engineers, that addresses
the problem of turning DNA into analyzable files (ASCII, that is, text).
What software does: take a boring human task and automate it. What
assemblers do: take the sequencer's output and render it usable. If you've
ever written software, you will realize something crucial: this means an
assembler should be bundled with every sequencer. Roche/454 agrees, and gave
us Newbler, a fine piece of software that does exactly that, in record time,
though only on their own technology.
What do Blazewicz et al. do? Get Newbler out of the way, argue that it doesn't
parse its own output format, and persuade biologists to adopt ill-maintained,
user-made solutions instead of the proprietary, perfectly fine vendor-supplied
ones.
However, this is only ridiculous from a software engineering point of view.
Bioinformatics is not software engineering -- notably, it addresses the
problems of specific researchers more than the problems of target users, it
encourages homebrew scripts and methods far more than general software
engineering does, and it has incredibly specific problems that often can't be
generalized. That makes inelegant solutions very common, and unavoidable.

The main problem in the paper is a staggering lack of clarity, so the reader
needs some hand-holding here. Blazewicz and colleagues start their exposition
by mentioning their use of a DNA graph, which is a graph that has the de Bruijn
property (every vertex is a sequence of the same length, and it overlaps by
exactly that length minus one with every vertex it is connected to by an edge)
but isn't a complete graph (it only contains the information available from the
reads given by the sequencer). This is a common approach (known as the de
Bruijn approach) in genome assembly, except it isn't the approach used by
Blazewicz et al. What they are doing is taking the DNA graph, removing the de
Bruijn property by allowing variable-length overlaps, and still calling it a
DNA graph.
It's on par with calling a stochastic process a "Markovian process influenced
by the last 1000 states" when Markov's property is precisely that only the last
state matters.
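
To make the graph distinction concrete, here is a minimal Python sketch (mine,
not the paper's code): in a de Bruijn-style DNA graph every vertex is a k-mer
and an edge means a fixed overlap of exactly k-1 bases, while dropping the
property means testing overlaps of arbitrary length between reads.

def debruijn_edges(kmers, k):
    # De Bruijn property: edge u -> v iff the (k-1)-suffix of u equals the (k-1)-prefix of v.
    by_prefix = {}
    for v in kmers:
        by_prefix.setdefault(v[:k - 1], []).append(v)
    return [(u, v) for u in kmers for v in by_prefix.get(u[1:], [])]

def variable_overlap(a, b, min_len):
    # No de Bruijn property: longest suffix/prefix match of any length >= min_len.
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-n:] == b[:n]:
            return n
    return 0

print(debruijn_edges(["ATGC", "TGCA", "GCAT"], k=4))   # fixed 3-base overlaps only
print(variable_overlap("ATGCGT", "GCGTTA", min_len=3)) # 4-base overlap, any length allowed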
From the engineering standpoint, the software's main flaw is that it has no
sensible defaults. Actually, it has no defaults at all, and the user is
supposed to know which values of two arbitrary parameters are correct. These
parameters are overlap parameters (minimum length and acceptable error, the
usual), which is all the more confusing when the article keeps referencing de
Bruijn approaches, where overlaps are fixed.
The algorithm is pretty classic, with some tweaks: first, it leans on the two
aforementioned parameters with no sensible defaults, which isn't very
intelligent, and second, it neglects the number of repeated nucleotides in the
alignment, keeping only one, which is smarter, given that "homopolymeric
regions" -- stretches of the same base more than 5 or 6 nucleotides long -- are
routinely mishandled by the 454 technology.
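
My reading of that homopolymer trick, as a hedged sketch rather than the
authors' implementation: collapse each run of identical bases to a single base
before comparing sequences, so a 454 miscount of a long run (say, AAAAAA read
as AAAAA) no longer breaks a match.

from itertools import groupby

def collapse_homopolymers(seq):
    # 'GATTTTACA' -> 'GATACA': keep one base per run of identical bases.
    return "".join(base for base, _run in groupby(seq))

def homopolymer_insensitive_match(a, b):
    return collapse_homopolymers(a) == collapse_homopolymers(b)

print(homopolymer_insensitive_match("GAAAAATC", "GAAAAAATC"))  # True: run length ignored
print(homopolymer_insensitive_match("GAAAAATC", "GACAATC"))    # False: a real difference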
They mention some version of the Traveling Salesman Problem they made up,
citing their own earlier work along the way, but it is explained nowhere, nor
is its relevance discussed. We can safely assume it would have been too hard.

The algorithm is overlap-layout-consensus under another name, shipped with no
defaults for its parameters and with the homopolymer-insensitive alignment. End
of story.
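
For readers who haven't met it, here is the bare overlap-layout-consensus shape
being described, as a greedy Python toy of my own (exact matching, no error
tolerance) rather than the paper's algorithm; the parameters the paper leaves
without defaults would correspond to min_overlap and an error threshold.

def best_overlap(a, b, min_overlap):
    # Longest suffix of a equal to a prefix of b, if at least min_overlap long.
    for n in range(min(len(a), len(b)), min_overlap - 1, -1):
        if a[-n:] == b[:n]:
            return n
    return 0

def greedy_olc(reads, min_overlap):
    contigs = list(reads)
    while True:
        # Overlap phase: score every ordered pair of contigs.
        best = (0, None, None)
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = best_overlap(a, b, min_overlap)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            return contigs  # layout is finished: these are the contigs
        # Layout + (trivial) consensus: merge the best-overlapping pair.
        merged = contigs[i] + contigs[j][n:]
        contigs = [c for k, c in enumerate(contigs) if k not in (i, j)] + [merged]

print(greedy_olc(["ATGCGTAC", "GTACCGTA", "CGTATTTG"], min_overlap=4))
# ['ATGCGTACCGTATTTG']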

The results are, unfortunately, where it all falls apart, like an apple that's
only bad on the inside. The coverage goes down as the number of reads goes up,
which is surprising given that the two should in theory grow together (more
reads means more copies of the genome being sequenced). Words are tricky
beasts, especially scientific words, which often get redefined beyond
recognition.
When assembler writers say coverage, they usually mean the percentage of the
full-size genome covered by the assembly output. When these authors say
coverage, they mean the percentage of the assembly output covered by a single
contig of that same output. How is that a measure of the assembly's quality?
The ways of abstruse science are impenetrable, but they might have meant that,
at least, their assembly was consistent: when there was only one contig, it
covered the whole of itself. This is a creative new proof in support of the
concept of identity!
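
To make the two readings of "coverage" concrete, a simplified sketch of my own
(lengths only, ignoring alignment; the single contig above is read as the
largest one):

def genome_coverage(contig_lengths, genome_length):
    # Usual sense: fraction of the true genome accounted for by the assembly.
    return min(sum(contig_lengths), genome_length) / genome_length

def self_coverage(contig_lengths):
    # The paper's sense: fraction of the assembly covered by its largest contig.
    return max(contig_lengths) / sum(contig_lengths)

contigs = [1200000, 300000, 150000]
print(genome_coverage(contigs, genome_length=1750000))  # ~0.94 of the genome
print(self_coverage(contigs))                           # ~0.73 of the assembly
print(self_coverage([500000]))                          # 1.0: one contig always covers itself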

More importantly, we have no idea what these contigs contain. While the
described algorithm is sound, the data is carefully anonymized, likely to
protect the bacterium Prochlorococcus marinus, which might not like its genome
being exposed to all and sundry.

The use of undetailed "expert finishing" sweeps the whole procedure under the
rug. Clearly, every assembler does even better with carefully curated
information added manually by professionals. The point of software is not to
complete a task with assistance (see above). It is to cut out the need for
assistance entirely.

In summary, the authors have no idea what software is (even less what it should
be). Their knowledge of the field is adequate, and so is their understanding of
current algorithms. This is a correct review of the field of genome assembly,
presenting both the de Bruijn and the overlap-layout-consensus paradigms. Know
what it isn't? Original research or development.

References

Blazewicz et al., Computational Biology and Chemistry,
doi:10.1016/j.compbiolchem.2009.04.005 (2009)
