Opiniomics

bioinformatics, genomes, biology etc. "I don't mean to sound angry and cynical, but I am, so that's how it comes across"

The five habits of bad bioinformaticians

Whenever I see bad bioinformatics, a little bit of me dies inside, because I know there is ultimately no reason for it to have happened.  As a community, bioinformaticians are wonderfully open, collaborative and helpful.  I genuinely believe that most problems can be fixed by appealing to a local expert or the wider community.  As Nick and I pointed out in our article, help is out there in the form of SeqAnswers and Biostars.

I die even more when I find that some poor soul has been doing bad bioinformatics for a long time.  This most often happens with isolated bioinformatics staff, often young, and either on their own or completely embedded within a wet-lab research group.  I wrote a blog post about pet bioinformaticians in 2013 and much of the advice I gave there still stands today.

There are so many aspects to bioinformatics that many wet-lab PIs are simply incapable of managing them all.  This isn’t a criticism.  There are very few superheroes; few people who can competently span the gap between wet- and dry-lab science.  So I thought I’d write down the five bad habits of bad bioinformaticians.  If you are a pet bioinformatician, read these and figure out whether you’re doing them.  If you’re a wet-lab PI managing bioinformatics staff, try to find out whether they have any of these habits!

Using inappropriate software

This can take many forms, and the most common is out-of-date software.  Perhaps they still use Maq, when BWA and other tools are now far more accurate; perhaps the software is a major version out of date (e.g. Bowtie instead of Bowtie2).  New releases come out for a reason, and the major reasons are (i) new or better functionality and (ii) fixes for bugs in old code.  If you run out-of-date software, you are probably introducing errors into your workflow, and you may be missing out on more accurate methods.
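
One cheap guard against silently running stale tools is to have your pipeline check versions up front.  Here is a minimal sketch (the tool names and minimum versions are purely illustrative, and it assumes samtools and bowtie2 are on your PATH):

```python
# Minimal sketch: warn if key tools on the PATH are older than an expected
# minimum version. Tool names and minimum versions below are illustrative.
import re
import subprocess

EXPECTED = {"samtools": (1, 9), "bowtie2": (2, 3)}  # illustrative minimums

def installed_version(tool):
    """Return (major, minor) parsed from `tool --version`, or None if missing."""
    try:
        out = subprocess.run([tool, "--version"], capture_output=True, text=True)
    except FileNotFoundError:
        return None
    match = re.search(r"(\d+)\.(\d+)", out.stdout + out.stderr)
    return (int(match.group(1)), int(match.group(2))) if match else None

for tool, minimum in EXPECTED.items():
    version = installed_version(tool)
    if version is None:
        print(f"WARNING: {tool} not found on PATH")
    elif version < minimum:
        print(f"WARNING: {tool} {version} is older than expected minimum {minimum}")
```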

Another major cause of inappropriate software use is that people often use the software they can install rather than the software that is best for the job.  Sometimes there is a good reason for this, but most often there isn’t.  It is worth persevering: if the community says a tool is the correct one to use, don’t give up on it just because you can’t get it to install.

Finally, there are simple mistakes: using a non-spliced aligner for RNA-Seq, or software that assumes the wrong statistical model (e.g. techniques that assume a normal distribution applied to count data), and so on.
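
To make the count-data point concrete, here is a minimal sketch (simulated data, purely illustrative) of the kind of quick check that shows why normality assumptions break down on raw counts:

```python
# Minimal sketch: RNA-Seq-style counts are overdispersed (variance >> mean),
# which violates normal (and even Poisson) assumptions. Simulated data only.
import numpy as np

rng = np.random.default_rng(42)
# Counts for one gene across 20 samples, drawn from a negative binomial,
# a common model for sequencing count data.
counts = rng.negative_binomial(n=5, p=0.05, size=20)

mean, var = counts.mean(), counts.var(ddof=1)
print(f"mean={mean:.1f}, variance={var:.1f}, variance/mean={var / mean:.1f}")
# A ratio far above 1 means the counts are overdispersed, so methods that
# assume a normal distribution on raw counts are a poor fit; count-aware
# models (e.g. negative binomial) are more appropriate.
```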

These aren’t minor annoyances: they can embed serious errors in your data and lead to bad science being published.  Systematic errors in analysis pipelines can often look like real results.

Not keeping up to date with the literature

Bioinformatics is a fast-moving field and it is important to keep up to date with the latest developments.  A good recent example is RNA-Seq: for a long time, the workflow has been “align to the genome, quantify against annotated genes”.  However, there is increasing evidence that the alignment stage introduces a lot of noise and error, and there are new alignment-free tools that are both faster and more accurate.  That’s not to say you must always work with the most bleeding-edge software tools, but there is a happy medium where new tools are compared to existing tools and shown to be superior.

Look, for example, at this paper suggesting that the use of SAMtools for indel discovery may come with a 60% false discovery rate.  60%!  Of course, that was written in 2013, and in 2014 a comparison of a more recent version of SAMtools showed better performance (though still an inflated false positive rate).

Bioinformatics is a research subject.  It’s complicated.

All of this feeds back to point (1) above.  Keeping up to date with the literature is essential, otherwise you will use inappropriate software and introduce errors.

Writing terrible documentation, or not writing any at all

Bioinformaticians don’t document things in the same way wet-lab scientists do.  Wet-lab scientists keep lab books, which have many flaws; however, they are a real, physical thing that people are used to dealing with.  Lab books are reviewed and signed off regularly, and that process makes it easy to tell when things have not been documented well.

How do bioinformaticians document things?  Well, often using things like readme.txt files, and web-based media such as wikis or GitHub.  My experience is that many bioinformaticians, especially young and inexperienced ones, keep either (i) terrible or (ii) no notes on what they have done.  The two are equally bad.  Keeping notes (version-controlled notes) on what happened, in what order, what was done and by whom, is essential for reproducible research.  It is essential for good science.  If you don’t keep good notes, you will forget what you did pretty quickly, and if you don’t know what you did, no-one else has a chance.
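
Even a tiny amount of automation helps here.  Below is a minimal sketch of recording who ran what, when, and against which commit; the log format, file names and example command are illustrative, not a prescribed standard:

```python
# Minimal sketch: append a timestamped record of each analysis step to a
# plain-text log kept under version control alongside the project.
import datetime
import getpass
import subprocess

def current_commit():
    try:
        out = subprocess.run(["git", "rev-parse", "--short", "HEAD"],
                             capture_output=True, text=True)
        return out.stdout.strip() or "not a git repository"
    except FileNotFoundError:
        return "git not available"

def log_step(description, command, logfile="analysis_log.md"):
    """Record what was done, by whom, when, and at which commit."""
    with open(logfile, "a") as fh:
        fh.write(f"## {datetime.datetime.now().isoformat(timespec='seconds')}\n")
        fh.write(f"- who: {getpass.getuser()}\n")
        fh.write(f"- commit: {current_commit()}\n")
        fh.write(f"- what: {description}\n")
        fh.write(f"- command: `{command}`\n\n")

# Illustrative file names only.
log_step("Aligned trimmed reads to the reference",
         "bwa mem ref.fa sample_R1.fq.gz sample_R2.fq.gz > sample.sam")
```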

Not writing tests

Tests are essential.  They are the controls (positive and negative) of bioinformatics research, and they don’t just apply to software development.  Bad bioinformaticians don’t write tests.  As a really simple example, if you are converting a column of counts to a column of percentages, you should check that the percentages sum to 100.  Simple, but it will catch errors.  You might also determine the sex of all of the samples you are processing and make sure it matches the signal from the sex chromosomes.  There are all sorts of internal tests you can carry out when performing any analysis, and you must implement them, otherwise errors creep in and you won’t know about them.
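
As a minimal sketch of what those two checks might look like in practice (the sex-inference threshold, data layout and sample values are illustrative assumptions, not a prescription):

```python
# Minimal sketch of the two internal controls described above.
import math

def counts_to_percentages(counts):
    total = sum(counts)
    return [100.0 * c / total for c in counts]

def check_percentages(percentages):
    # Control: percentages derived from counts must sum to 100.
    assert math.isclose(sum(percentages), 100.0, abs_tol=1e-6), \
        f"Percentages sum to {sum(percentages)}, not 100"

def check_sex(recorded_sex, chrY_fraction, threshold=0.001):
    # Control: the recorded sex of a sample should agree with the fraction
    # of reads mapping to the Y chromosome (threshold is illustrative).
    inferred = "M" if chrY_fraction > threshold else "F"
    assert inferred == recorded_sex, \
        f"Recorded sex {recorded_sex} disagrees with chrY signal ({chrY_fraction})"

check_percentages(counts_to_percentages([10, 20, 70]))
check_sex("M", chrY_fraction=0.004)
check_sex("F", chrY_fraction=0.0002)
```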

Rewriting things that exist already

As Nick and I said, “Someone has already done this, find them!”.  No matter what problem you are working on, 99.9999% of the time someone else has already encountered it and either (i) solved it, or (ii) is working on it.  Don’t re-invent the wheel.  Find the work that has already been done and either use it or extend it.  Try not to reject it because it’s been written in the “wrong” programming language.

My experience is that most bioinformaticians, left to their own devices, will rewrite everything themselves in their preferred programming language and with their own quirks.  This is not only a waste of time, it is dangerous, because they may introduce errors or may not consider problems other groups have already encountered and solved.  Working together in a community brings safety; it brings multiple minds to the table to consider and solve problems.

—————————————————

There we are.  Those are my five habits of bad bioinformaticians.  Are you doing any of these?  STOP.  Do you manage a bioinformatician?  SEND THEM THIS BLOG POST, and see what they say 😉

12 Comments

  1. Hi Mick. I think the analyses from http://www.genomebiology.com/2015/16/1/177 are not conclusive. 1000 reads is too few considering that most of the transcripts are greater than 1000nt in length and that you are using just one transcript per gene. Also, the cancer link is somewhat expected, as cancer genes tend to have more exons, span longer genomic regions, have more transcripts, etc. I think the point should be that one should not jump to new methods or discard existing ones until proper evaluations have been made. As an example, arrays were thought to have been rendered obsolete by sequencing, and here we are still discussing how to measure expression with sequencing data. In any case I agree that the literature is full of errors due to the use of outdated or wrong software and that one should be careful about it.

  2. I like this.

    Re: Number 5. When IS it acceptable to start again from scratch? When the existing approach is simply wrong? When the existing solution needs extending, but isn’t open source (and what if it doesn’t need extending, but isn’t open source)? When you need an API call, but all you’re given is an executable? When the code isn’t version controlled? Or isn’t documented? Or tested? When you need it to fit a standard interface? Or you simply don’t trust the code? When you think the solution is trivial and it would make a good exercise for a student?

    I’ve tried to arrange these in descending order. Obviously if the whole approach is wrong, you’ll want to start again. And at the other end, you probably wouldn’t trust your new big paper to code you gave a student to write as an exercise. I think the closed-source and non-version-controlled questions are interesting though.

    Several of these cases can be fixed by extending the existing code. This is of course where language choice comes in. I’d be very wary of contributing to a tool written in a language I’m not an expert in.

  3. In the first month of my PhD I made a very silly mistake while writing a piece of code. A simple slip caused a hidden error in the underlying dataset for a long time. All because I didn’t do a simple verification test (“because my coding task was so easy”). I now test absolutely everything. Great point.

  4. I agree with this in principle but have had trouble applying it in practice. Projects are long enough in duration that they can start with data analyzed using tool X and finish after tool X++ is published. Scientists are reluctant to re-analyze data upon which many biological hypotheses have been made, even if they don’t change things that dramatically. It’s been frustrating for sure, but I think it might be good to distinguish ‘bad bioinformatics’ from ‘best bioinformatics given the situation’.

  5. 4 (why should it be a pleasant surprise when the documentation is good?) and 5 are recurring problems. I would expand 5 (call it 5b?) to include not complying with, or worse, re-inventing data standards when they already exist. For example, want to describe a genome variant? Don’t re-invent the wheel, go here: http://www.hgvs.org/mutnomen/

  6. Hi Eduardo

    Indeed the analyses in our paper are flawed, but only in the same sense that all other RNA-Seq analyses are flawed, because there is no perfect test/simulated dataset. What’s interesting about our analysis is that it detected exactly what we would expect i.e.:

    – multimap reads are a problem
    – gene families with high sequence homology are a problem
    – cufflinks in default mode over-estimates expression of short genes
    – mapping to short exons is a problem
    – htseq has a false negative problem

    So whilst the methodology might seem flawed, the results are valid. We also repeated the analysis many times, with different simulated datasets, with variable expression, reads with errors, variable subsets of genes expressed etc – and the results were stable. So the conclusions would have been the same had we used a more realistic simulation 🙂

    Cheers
    Mick

  7. Hi. Thanks for the reply. Well, a flawed methodology at least puts the conclusions, or part of them, on hold until more thorough analysis is performed. I would even add that “finding what we expect” is a dangerous epistemological assertion :). Your results mainly imply that there is a considerable number of genes for which there is no unique k-mer (of a size comparable to a read) in any of their transcripts, or, if there is, it never appears in the RNA-Seq. This is testable and would provide a somewhat more accurate description of a “worst case” set. Salmon and other tools are not just counting k-mers, so the actual quantification results will be better than that. I believe similar tests are underway.

  8. Hi Eduardo

    As I said, no RNA-Seq simulation is perfect, and I’d argue ours is better because we know exactly where the biases are. It is a non sequitur to say that if the dataset is not perfect then the results are invalid. We use known results to validate what we see, which is just science. Many human genes have no or few unique k-mers. This is possible to find analytically and has been done, but of course that tells us nothing about the behavior and assumptions of aligners or quantification tools. It is quite lazy to dismiss results because you don’t like the methods. If you feel the results are wrong, write a paper and demonstrate it 🙂

    Cheers

    Mick

  9. Some of these are not accurate at all — Bowtie2 is not a new version of Bowtie, it’s a different algorithm; Bowtie1 is perfectly fine (in fact even better suited) for certain applications.

    Alignment-free RNA-Seq is not better than alignment-based quantification; it is good enough to make the trade-off between computational expense and accuracy worthwhile, which is an important distinction.

    Etc.

  10. I partially disagree with 5. It is better to rewrite something and KNOW WHAT IT DOES (remembering point 4!) than to use a black box that may or may not do what you want – the problem of point 1 caused by points 3 & 4. I like your post but perhaps it should be retitled: “Five habits of bad bioinformaticians in an ideal world.” (And consider being charitable and changing “bad” to “bad/overworked”.) If it’s quicker to do (5) than avoid (1) and (2) when it’s not your main field, sometimes that is reason enough – just not for hard problems. (They won’t be quicker.) Maybe I’m just a bad bioinformatician! :op (Points 3 & 4 are universals, though, and cause lots of issues.)

  11. Also it would be great if pipeline developers included at least version information for ALL their dependencies (ideally a copy of the sources of all the dependencies they used for pipeline development).
    For example, try running the antismash pipeline with the latest version of the diamond alignment tool… (hint: diamond is in active development and its output has changed a few times during the last 3 years).
    Or try reproducing some NCBI BLAST workflow with blast+ (see the blastedbio blog for all the hidden pitfalls encountered).
    Or finally try running some Python 2.5 code under Python 3 or Python 2.7…
    For some reason, 20-year-old Perl 4 code usually works, but 5–6-year-old Python code tends to fail…
