Opiniomics

bioinformatics, genomes, biology etc. "I don't mean to sound angry and cynical, but I am, so that's how it comes across"

On publishing software

This post is a response to Titus’s and Daniel’s blog posts on whether published software should be re-useable, and covers some elements of an argument I had with Aylwyn on Twitter.

What is the purpose of an academic paper?

It’s a good question.  Reading up on the history of academic publishing, it seems that initially academic papers were little more than stories, with “trade secrets” kept by the authors and not revealed.  This did not work very well, and (this is unreferenced!) I like this sentence from the Wikipedia page on academic publishing: “The Royal Society was steadfast in its not yet popular belief that science could only move forward through a transparent and open exchange of ideas backed by experimental evidence”.

Adding some of my own thoughts on describing methods in a paper: is the point of a publication to show that you can do something?  Or is the point of a publication to show everyone else how to do it?

And what do your funders want?  I think I can guess.  Most government funding bodies want to drive the knowledge economy because it stimulates job creation and economic growth.  It also contributes to general health, happiness and well-being.  Sometimes this is best achieved by protecting intellectual property and commercialising discovery; however, most often it is best achieved through the sharing of ideas, knowledge and expertise – the academic publication route.

To put it bluntly – any data and techniques which you choose to keep exclusively for your own group’s use and advantage act as a barrier to discovery, because no matter how good you are with them, you plus hundreds of others would achieve far more.

So – the point of an academic paper is to share ideas, knowledge and expertise.  It’s not an exercise in showing off (“look what I can do!”).

Software: What is the point of your paper?

Titus mentions single-use software and I have some sympathy here.  If the point of your paper is not the software, then documenting, testing, writing tutorials, releasing and supporting can be very onerous and somewhat of a distraction.  For example, if your paper is about a genome and how members of that species evolved, but to assemble that genome you wrote some scripts that hacked a few different assembly packages together, then you should definitely (i) release the scripts and (ii) describe the method in some detail; but I can see that this is single-use software not meant for re-use, and it doesn’t need to have all the bells and whistles attached.  So I think Titus and I agree on that one.  Release it, mark it up as “not for re-use” and move on.

Publishing a theory paper

This is a grey area.  I’d like to draw an analogy to laboratory methods here.  Let’s say some bright spark has an idea.  Maybe they look at CRISPR and Cas and they think “hey, theoretically that could be used to edit genomes”.  They write up their theory and publish it.  Fine.  It’s a theory, but without an implementation, it’s pretty low impact.  It’s all a bit Elon Musk.  Then imagine the bright spark instead shows that they can edit a genome using CRISPR-Cas, but the method will only work for one gene in one organism.  We have the theory, and a limited-use implementation.  Again, publishable, but not very high impact.  Finally, imagine the same bright spark: they have the theory, and they have an implementation which can be used for most genes in most organisms.  That’s the wow paper right there.

We can apply this to theory papers in computational biology.  There are plenty of journals in which to publish theories.  If you have a theoretical method, publish it there.  The next stage is a theory with a software implementation that is not intended for re-use.  It’s fine, but it’s low impact and should be treated as such.  There are plenty of software tools that work in a single, given situation, but cannot and should not be applied elsewhere.  There are plenty of models that are over-trained and over-fitted.  Single-use software is not necessarily good evidence for your theory.  Finally, the gold standard – theory, and software that is generally applicable to many instances of a problem.  Clearly the latter is the highest impact, but more importantly, the first two cases are (in my opinion) of very limited use to the community.  I’ll give you an example: I have a theory (and many others do too!) that uncertainty in sequence reads should be represented as a graph, and that we should align read graphs to reference graphs to extract maximum information content in genomics.  Shall I just sit back and wait for the phone call from Nature?  I think not.  Ah, but here, I have a Perl script that takes a single read-as-a-graph and aligns it perfectly to the human genome.  Again, Nature?  Maybe Science this time?  Hmmm?

Software/method publication by stealth

The problem with accepting a paper where the authors present a novel method but a software implementation that cannot be re-used is that this is software publication by stealth.  The software will be re-used – by the authors.  This is a way in which computational biologists can publish a method without enabling others to easily carry out that method; it’s back to the “trade secrets” publication style of the 1700s.  It’s saying “I want the paper, but I am going to keep the software to myself, so that we have an advantage over everyone else”.  In short, it is the absolute antithesis of what I think a publication is for.

Software publications

It’s clear to me that publications focused on the release of software should of course make the software re-useable.  To jot down a few bullet points:

  1. The source code should be readable
  2. It should compile/run on readily available platforms
  3. The authors should list all dependencies
  4. The authors should state all assumptions the software makes
  5. The authors should state when the software should be used and when it shouldn’t be used
  6. There should be a tutorial with test datasets
  7. The authors should demonstrate use on both simulated and real data
  8. There should be methods by which users can get help
  9. If command line, the software should print out help when run with no parameters
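
Point 9 costs almost nothing to implement.  A minimal sketch using Python’s argparse – the tool name, arguments and defaults here are hypothetical, purely for illustration:

```python
import argparse

def run(argv):
    # Hypothetical single-use tool; the name and options are illustrative only.
    parser = argparse.ArgumentParser(
        prog="motif_screen",
        description="Screen sequences in a FASTA file for a motif.")
    parser.add_argument("fasta", help="input FASTA file")
    parser.add_argument("--motif", default="GATC",
                        help="motif to search for (default: GATC)")
    if not argv:
        # Point 9: when run with no parameters, print the help text rather
        # than a stack trace or silence, and signal failure with a non-zero
        # exit code.
        parser.print_help()
        return 1
    args = parser.parse_args(argv)
    print(f"Screening {args.fasta} for motif {args.motif}")
    return 0

# In a real script the entry point would be: sys.exit(run(sys.argv[1:]))
```

Run with no arguments, this prints the usage message and exits non-zero instead of crashing with a traceback; points 3–5 can then live in the help text and the documentation.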

What do you think?

11 Comments

  1. Great post! I like the reasoning. Question: would you require an OSS license allowing remixing (modification and redistribution), or not?

    • Thank you, I am glad you like it!

      I chose my words carefully: “the source code should be readable”. Strictly speaking, there is no need to allow remixing as part of an academic publication; though I would argue it limits the impact, and any editor should take that into account.

      I don’t think binaries are acceptable as not enough of the method is revealed.

  2. Hi Mick, thanks for your post on this.

    You almost lost me at the bullet point list at the end, until I reread the introductory paragraph saying that this is about software papers; for those, I agree with that list.
    As far as I can tell, you’re missing a class of papers that I run into a lot as an “embedded bioinformatician”, which is the “wet lab paper with some bioinformatics attached”. These usually focus on some wet lab experiments and the results of said experiments, but possibly some software was developed to assist the initial experimental design.

    In a specific example that comes to my mind, I’ve developed a small algorithm that pre-screened UTRs in microbial genomes for some sequence motifs, did a quick-and-dirty attempt at folding the identified sequences (as the usual algorithms didn’t work too well), and then basically printed out a list of interesting targets. These lists could be combined, used to create a blast database, and in a second iteration hits could also be ranked by “also found this in x many other organisms”. All in all, less than 1k lines of code. Given the nature of the paper, and the fact that it was published in a journal targeted at wet-lab scientists, I’m sure we would have gotten by even without mentioning the fact that we used some software to pre-screen our potential targets. After all, a handful of PhD students staring at sequences for half a year would have gotten us to the same results. We instead chose to explain the algorithm used and discuss implementation details in the “experimental methods” section(*1).

    Now, is science better off with us publishing software that was (for all we cared) just useful for this specific kind of experiment, or is it better off with keeping this software unpublished because it’s not up to some arbitrary standards of documentation quality and reusability that we’re currently brainstorming about in this and the various other blog posts referenced above? As I said, the software isn’t doing anything we couldn’t have solved the old-fashioned way (throwing a couple of grad students at it). We’ve used the software (without any code changes, if I remember) to plan some follow-up experiments, and I’m not aware of anybody else having used it. In my mind, this is as single-use as it gets, and being faced with having to fulfil the bullet list for software papers would probably have led to another of these “identified by an in-house script” publications, as anything else would have meant fighting over word-count allocations(*2) with the meat of the paper, which was the wet lab experiments.

    You people might be in the position to mainly worry about “pure bioinformatics”, but there’s also an applied side of our field, and it’s much harder to justify the word-count for things that co-authors, editors and reviewers all find irrelevant.

    Cheers,
    Kai

    footnotes:
    (*1) We even have the source code available under an OSI-approved license on GitHub; the code comes with unit tests, and the command line parsing supports "--help". But that’s just me being unable to shake my software engineering reflexes; it’s not even mentioned in the paper.
    (*2) The question of how relevant word-count limits are in this digital day and age is probably a full blog post in itself, likely a very ranty one.

    • Hi Kai

      It sounds like this might fall under the class of software in my section “Software: What is the point of your paper?”. Describe it, release it and move on. Things are not black and white. I think if you publish ten more papers, and each time refer back to the “wet lab” paper when you mention the software used, then things might change.

      Cheers
      Mick

    • Kai, you say your software was under 1K lines, but I spend most of my life with projects like this, and they often have much more code than that. For example, one project I’m working on at the moment is an equal collaboration between wet and dry, and has about 15K lines of code associated with it.

      Yet I wouldn’t really call any of that code “software” in the sense of it being tools that might be useful to others. The trouble is, at 15K lines of code, it’s almost impossible for anyone else to make head or tail of it, or to verify its correctness, without documentation and testing. It’s a difficult problem.

  3. As always in these things, the extreme cases are obvious: of course if your publication is about a piece of software you’ve written, it must meet all the criteria you mention. And if your “code” is just a makefile that runs a bunch of tools in a particular order, then clearly, although it has to be available, it’s not your duty to make that makefile work with everyone’s vaguely similar data. But it’s the grey areas that make things difficult.
    I think there is a lot of code written that doesn’t fit neatly into either of these categories, and I think about the difference between those whose focus is the biology and those whose focus is the method or the software.
    If my paper’s central message is “this piece of biology is the case”, that is different from “this method is the best way to do x”. I suppose it’s the difference between those that study computational methods and those that use computational methods. But even those that use computational methods sometimes write something more than just glue code, without wanting to go to the lengths of rewriting it as a fully documented, fully tested user tool. Sure, if the point of the publication is “you can do this”, then just presenting single-use code isn’t much good. But if it’s about the biology, then surely the minimum is enough code that people believe the biological claim, not enough that they can do a similar thing on their own data (although obviously that is better).

    • Absolutely, but what’s interesting is that I think this affects the impact of the publication, and editors need to take that into account. Methods, wet and dry, should be sufficient for someone else to repeat what you have done. If some of those methods are software, often it isn’t enough to just provide the code; that code has to run.

      • There is a difference between the code being available and able to run, and it being fully documented, tested and generalizable. I don’t necessarily think that lacking the latter affects the impact of the publication, if that impact derives from its biological findings.

        If I publish a paper saying I can cure cancer, the important thing is for me to demonstrate that I have indeed cured cancer. It’s nice if you can use my methods to also cure Alzheimer’s, but my paper is going to be pretty high impact nonetheless.

        • What if your paper shows that you can accurately classify cancer/tumour type? And that leads to an early cure. But your software only works on that one dataset….

  4. I think that many of the people here are confusing a “method” with a program. The CRISPR example is a good one—one can introduce a new method and describe it well without having to manufacture and give away kits for any one to duplicate the method easily.

    In many cases, the true test of a good method is that many different implementations of it can be built (such as the Burrows-Wheeler Transform), not that there is one implementation that anyone can use freely.

    Duplication of research is important, but that sometimes means rewriting the code, not running the same (buggy) code over and over.

© 2017 Opiniomics
