First of all, I want to state quite clearly that I am not a code Nazi.  I don’t care about your coding practices.  Good architecture, an elegant object model, a stable API, version control, efficient code reuse, efficient code etc etc.  I don’t care.  Write all the unit tests you like, because if they fail, I’ll just force the install anyway.  I don’t care if you used extreme programming, whether or not you involved Tibetan monks and had your github repository blessed by the Dalai Lama.  Maybe you ensured the planets were aligned before you released version 0.1, or made sure all of your code monkeys had perfect Feng Shui in their bedsits.  I don’t care.  That don’t impress me much.

However, I do care that your code goddamn works.

As a scientist, I think that if I take some published code, it should work.  Not much to ask, is it?  Sure, a readme.txt or a manual.pdf would be nice too, but first and foremost, it has to just do the eff-ing job it’s supposed to.

Very recently, I’ve had the huge displeasure of downloading, installing and trying to run a published bioinformatics tool – and believe me, this one is a doozy.

I’m not going to name names (see below) but here are a few features that make me angry:

  • This thing has been published twice, once in 2010, once in 2012, both times in journals that are pretty much a dream for most bioinformaticians, both with double-digit impact factors
  • There is no manual, no README, no INSTALL.txt – nothing
  • The main code calls another script, that calls a few more other scripts, that call other scripts etc etc etc.  This is not easy to debug.
  • It outputs a number of text formats, all of which are completely novel and completely undocumented.
  • There is some help printed to the command-line, but that help contradicts itself
  • There are hard-coded URLs and filenames in the code
  • The code relies on filename and unique identifier conventions that have long since been retired
  • The vast majority of comments in the code are lines of code that have been commented out
  • It doesn’t work on its own example data
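
For what it’s worth, checking that last point before release needn’t take long.  Below is a minimal sketch in Python, assuming the tool is run from the command line and is supposed to write a single output file.  The function name, the command and the paths are all hypothetical placeholders of mine, not the tool I’m complaining about.

```python
import subprocess
from pathlib import Path


def smoke_test(command, expected_output, timeout=600):
    """Run `command` and check it exits cleanly and writes a non-empty file.

    `command` is a list of arguments (hypothetical; substitute the tool's
    real invocation on its bundled example data). `expected_output` is the
    file the tool is supposed to produce. Returns (passed, message).
    """
    try:
        result = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        # The tool could not be started at all, or it hung.
        return False, f"could not run: {exc}"

    if result.returncode != 0:
        # A non-zero exit code means the tool itself reported a failure.
        return False, f"exit code {result.returncode}: {result.stderr.strip()}"

    out = Path(expected_output)
    if not out.is_file() or out.stat().st_size == 0:
        # The tool claimed success but produced nothing.
        return False, f"no output written to {out}"

    return True, "ok"
```

This only shows that the thing runs and produces something on its own example data; it says nothing about whether the answers are right, which is a separate (and more important) question.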

*sigh*

The thing is, I really want this code to work.  I have some data, this code would help, and damn it but I don’t have time to fix the thing!

I know what you’re going to say….

If everyone cared about good coding practice, and if everyone adhered to “the rules”, then we wouldn’t be in this position, would we?  Well, yes and no.  I take your point, we should *all* follow rules for writing good code.  However, let’s not forget, we have full-time jobs as scientists.  Being a good coder is also a full-time job.  It’s fairly difficult to hold down two full-time jobs, and so often something has to give.  My attitude is that as long as the code does the job it’s supposed to do, then that is the higher priority.  Some of these thoughts came out in a discussion with C Titus Brown on his blog.  What I was trying to say is that, given a choice between spending two weeks streamlining code and making it more efficient, or spending two weeks interpreting the information that your code has produced in a biological context, I’d choose the latter.  Why?  Because that’s why we are here – to do science.  I’ve met far too many bioinformaticians who’ve forgotten they’re also supposed to be biologists, not just coders.

I know many of you will disagree, but it would be boring if we all had the same opinion, right?

So why won’t I name and shame?

Well, it seems like a personal attack, and I don’t want that.  This group is by no means alone, and why would I single one group out when many are bad at it?  We’re all in the same boat.  We’re all being pressured to publish, REF2014 is on the horizon, and I feel a lot of kinship with my fellow PIs.  I want their papers to get published, I want them to have high impact papers, I want them to feel secure and happy.  Who knows why the code is so bad?  We’ve all had post-docs who couldn’t tie their own shoe-laces, never mind write papers/code, and we still have to publish.  I happen to know the PI involved; they’re not a bad scientist, they’re just responding to the environment in which they exist.

Why am I bothering to write this then?

It’s cathartic.  I am angry at the world and the way everything is set up to allow, no, to encourage this type of work.  I’m not angry at the PI – they’re just responding to the pressures we all feel.

A completely unworkable solution

This wouldn’t be a blog post if I didn’t propose a completely unworkable solution that assumes a perfect world full of nice fluffy people.  Anyway, here it is: the bioinformatics police.

The bioinformatics police are a voluntary organisation of kindly, helpful, experienced bioinformaticians.  Every month, the entire bioinformatics community votes for a piece of code or software tool that they would like the bioinformatics police to investigate.  When one is selected, three members of the bioinformatics police independently try to download, install and run the code/tool on a vanilla Ubuntu install.  If they are successful, they document how it is done.  If they are not successful, they contact the PI and work with them to improve the code until it is ready for the public to enjoy.

How does that sound?  I bet you thought I was going to suggest retraction, hanging or public humiliation, didn’t you?

Here’s to a perfect world!

Update – 19th January 2013

Just a quick update on the above, which caused quite a stir in certain circles.  I thought it was quite obvious that I wrote the above whilst very angry, and, given my mention of the Dalai Lama and Feng Shui, that I was being somewhat sarcastic.  It’s really important that you don’t completely miss the point, though 🙂

For the record, I’m not for one moment suggesting that good coding practices aren’t important.  Of course they are, and in an ideal world, everyone would use them.  My point is that they are not the most important aspects of scientific software.

Consider the following fictitious situation.  Software A is developed according to all of the rules of good coding practice, it is efficient and well documented…. but, it turns out that the algorithm used is flawed and more often than not, the answers the code produces are wrong.  In contrast, software B is a hacky perl script which reads entire genomic datasets into a hash array, uses tons of RAM and has no comments in the code to speak of.  However, the algorithm in this script is much better, and more often than not, the answers it produces are correct.

Which do you want to use on your data?  Which should get published?  On the latter point, I’m betting that most peer reviewers judge whether or not software reaches the correct answer, and pay little attention to the underlying coding practices.  I have no problem if you want to change this paradigm, and include good coding practice in the peer review process – but priority number one has to be the science.  It is absurd to suggest otherwise, as we may then see well engineered software being published that serves up incorrect answers.  (Again, you might say “better than poorly engineered software that gets the answers wrong”, and I would agree – that’s what this post is about :))

Some of you may point out that well engineered and tested code is less likely to be buggy and less likely to make mistakes, and I would agree, up to a point.  But no amount of good coding can make up for bad biology, and so I maintain my point: in scientific software, the science is more important than the coding practices.  The latter, of course, are still important 🙂