This post is a response to Titus and Daniel’s blog posts on whether published software should be re-useable, and covers some elements of an argument I had with Aylwyn on Twitter.

What is the purpose of an academic paper?

It’s a good question.  Reading up on the history of academic publishing, it seems that initially academic papers were little more than stories, with “trade secrets” kept by the authors and not revealed.  This did not work very well, and (this is unreferenced!) I like this sentence from the above Wiki page: “The Royal Society was steadfast in its not yet popular belief that science could only move forward through a transparent and open exchange of ideas backed by experimental evidence”.

Adding some of my own thoughts, when describing methods in a paper: is the point of a publication to show that you can do something?  Or is the point of a publication to show everyone else how to do it?

And what do your funders want?  I think I can guess.  Most government funding bodies want to drive the knowledge economy because it stimulates job creation and economic growth.  It also contributes to general health, happiness and well-being.  Sometimes this is best achieved by protecting intellectual property and commercialising discovery; however, most often it is best achieved through the sharing of ideas, knowledge and expertise – the academic publication route.

To put it bluntly – any data and techniques which you choose to keep exclusively for your own group’s use and advantage act as a barrier to discovery, because no matter how good you are with them, you plus hundreds of others would achieve far more.

So – the point of an academic paper is to share ideas, knowledge and expertise.  It’s not an exercise in showing off (“look what I can do!”).

Software: What is the point of your paper?

Titus mentions single-use software and I have some sympathy here.  If the point of your paper is not the software, then documenting, testing, writing tutorials, releasing and supporting can be very onerous and is something of a distraction.  For example, if your paper is about a genome and how members of that species evolved, but to assemble that genome you wrote some scripts that hack a few different assembly packages together, then you should definitely (i) release the scripts and (ii) describe the method in some detail, but I can see that this is single-use software, not meant for re-use, and it doesn’t need to have all the bells and whistles attached to it.  So I think Titus and I agree on that one.  Release it, mark it up as “not for re-use” and move on.

Publishing a theory paper

This is a grey area.  I’d like to draw an analogy to laboratory methods here.  Let’s say some bright spark has an idea.  Maybe they look at CRISPR and Cas and they think “hey, theoretically, that could be used to edit genomes”.  They write up their theory and publish it.  Fine.  It’s a theory, but without an implementation it is pretty low impact.  It’s all a bit Elon Musk.  Then imagine the bright spark instead shows that they can edit a genome using CRISPR-Cas, but the method will only work for one gene in one organism.  We have the theory, and a limited-use implementation.  Again, publishable, but not very high impact.  Finally, imagine the same bright spark has the theory and an implementation which can be used for most genes in most organisms.  That’s the wow paper right there.

We can apply this to theory papers in computational biology.  There are plenty of journals in which to publish theories.  If you have a theoretical method, publish it there.  The next stage is a theory with a software implementation that is not intended for re-use.  It’s fine, but it’s low impact and should be treated as such.  There are plenty of software tools that work in a single, given situation, but cannot and should not be applied elsewhere.  There are plenty of models that are over-trained and over-fitted.  Single-use software is not necessarily good evidence for your theory.  Finally, the gold standard – theory, and software that is generally applicable to many instances of a problem.  Clearly the last of these is the highest impact, but more importantly, the first two cases are (in my opinion) of very limited use to the community.  I’ll give you an example: I have a theory (and many others do too!) that uncertainty in sequence reads should be represented as a graph, and that we should align read graphs to reference graphs to extract maximum information content in genomics.  Shall I just sit back and wait for the phone call from Nature?  I think not.  Ah but here, I have a Perl script that takes a single read-as-a-graph and aligns it perfectly to the human genome.  Again, Nature?  Maybe Science this time?  Hmmm?

Software/method publication by stealth

The problem with accepting a paper in which the authors present a novel method but a software implementation that cannot be re-used is that this is software publication by stealth.  Because the software will be re-used – by the authors.  This is a way in which computational biologists can publish a method without enabling others to easily carry out that method; it’s back to the “trade secrets” publication style of the 1700s.  It’s saying “I want the paper, but I am going to keep the software to myself, so that we have an advantage over everyone else”.  In short, it is the absolute antithesis of what I think a publication is for.

Software publications

It’s clear to me that publications focused on the release of software should of course make the software re-useable.  To jot down a few bullet points:

  1. The source code should be readable
  2. It should compile/run on readily available platforms
  3. The authors should list all dependencies
  4. The authors should state all assumptions the software makes
  5. The authors should state when the software should be used and when it shouldn’t be used
  6. There should be a tutorial with test datasets
  7. The authors should demonstrate use on both simulated and real data
  8. There should be methods by which users can get help
  9. If command line, the software should print out help when run with no parameters
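
To make point 9 concrete, here is a minimal sketch in Python of a command-line tool that prints its help when run with no parameters. The tool name "mytool", its single "input" argument and the line-counting job are all made up for illustration; the only real machinery is the standard library's argparse module. Treat it as one possible way to do this, not the way.

```python
import argparse
import sys


def build_parser():
    # "mytool" and its "input" argument are hypothetical, for illustration only.
    parser = argparse.ArgumentParser(
        prog="mytool",
        description="Toy example: count the lines in a text file.",
    )
    parser.add_argument("input", help="path to the input text file")
    return parser


def main(argv=None):
    # Use the real command line unless a list of arguments is passed in.
    argv = sys.argv[1:] if argv is None else argv
    parser = build_parser()

    # Point 9: no parameters at all means "show me how to use this".
    if not argv:
        parser.print_help()
        return 1  # non-zero, since nothing useful was done (a design choice)

    args = parser.parse_args(argv)
    with open(args.input) as handle:
        line_count = sum(1 for _ in handle)
    print(line_count)
    return 0
```

To turn this into a script, call sys.exit(main()) under the usual if __name__ == "__main__": guard.  Returning a non-zero exit code for the no-parameter case is my assumption about what is friendliest to pipelines; some tools exit with zero there instead.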

What do you think?