We need to stop making this simple f*cking mistake

I’m not perfect. Not in any way. I am sure that if anyone were so inclined, they could work their way through my research with clinical, forensic attention to detail and uncover all sorts of mistakes. The same will be true for any other scientist, I expect. We’re human and we make mistakes.

However, there is one mistake in bioinformatics that is so common, and which has been around for so long, that it’s really annoying when it keeps happening:

It turns out the Carp genome is full of Illumina adapters.

One of the first things we teach people in our NGS courses is how to remove adapters. It’s not hard: we use Cutadapt, but many other tools exist. It’s simple, but really important. With de Bruijn graph assemblers, you will get paths through the graph converging on k-mers from adapters, and with OLC assemblers you will get spurious overlaps. With gap-fillers, it’s possible to fill the gaps with sequences ending in adapters, and this may be what happened in the Carp genome.
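To make the idea concrete, here is a minimal sketch of 3' adapter trimming in Python. It is an illustration only, not how Cutadapt works internally: it uses exact string matching, whereas real trimmers do error-tolerant alignment. The function name and the `min_overlap` parameter are my own assumptions for the example; the adapter shown is the common prefix shared by Illumina TruSeq adapters.

```python
# Common prefix of Illumina TruSeq adapters (assumed here as the default).
ADAPTER = "AGATCGGAAGAGC"


def trim_3prime_adapter(read, adapter=ADAPTER, min_overlap=3):
    """Remove a 3' adapter from a read using exact matching only.

    Hypothetical helper for illustration; real tools tolerate mismatches.
    """
    # Case 1: the full adapter occurs in the read. Everything from its
    # first occurrence onwards is adapter or beyond, so cut it off.
    pos = read.find(adapter)
    if pos != -1:
        return read[:pos]

    # Case 2: the read runs into the adapter but ends before it is complete.
    # Look for the longest adapter prefix that is also a suffix of the read.
    longest = min(len(adapter) - 1, len(read))
    for k in range(longest, min_overlap - 1, -1):
        if read.endswith(adapter[:k]):
            return read[:-k]

    # No adapter found: return the read unchanged.
    return read


if __name__ == "__main__":
    for r in ("ACGTACGTAGATCGGAAGAGCTTTT", "ACGTACGTAGATC", "ACGTACGT"):
        print(r, "->", trim_3prime_adapter(r))
```

In practice you would run a dedicated trimmer on the FASTQ files rather than roll your own; the point is only that the core operation is this simple.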

Why then are we finding such elementary mistakes in such important papers?  Why aren’t reviewers picking up on this?  It’s frustrating.

This is a separate, but related, issue to genomic contamination: the Wheat genome has PhiX in it; tons of bacterial genomes do too; and lots of bacterial genes were problematically included in the Tardigrade genome and declared as horizontal gene transfer.

Genomic contamination can be hard to find, but sequencing adapters are not. Who isn’t adapter trimming in 2016?!
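As a rough illustration of how little it takes to check a finished assembly, the sketch below scans contigs for exact copies of an adapter on both strands. The function names and the in-memory `contigs` dictionary are assumptions made for the example, and the adapter is the common prefix of Illumina TruSeq adapters; a real check would read the FASTA file and would more likely use an alignment tool that tolerates mismatches.

```python
# Common prefix of Illumina TruSeq adapters (assumed here as the default).
ADAPTER = "AGATCGGAAGAGC"

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_adapter_hits(contigs, adapter=ADAPTER):
    """Report exact adapter matches as (contig name, 0-based start, strand).

    `contigs` maps contig names to sequences. Hypothetical helper for
    illustration; exact matching will miss adapters containing errors.
    """
    probes = {"+": adapter.upper(), "-": reverse_complement(adapter.upper())}
    hits = []
    for name, seq in contigs.items():
        seq = seq.upper()  # assemblies are often soft-masked in lowercase
        for strand, probe in probes.items():
            start = seq.find(probe)
            while start != -1:
                hits.append((name, start, strand))
                start = seq.find(probe, start + 1)
    return hits


if __name__ == "__main__":
    toy_assembly = {
        "contig1": "TTTTAGATCGGAAGAGCAAAA",
        "contig2": "GGGCTCTTCCGATCTGG",
        "contig3": "ACGTACGT",
    }
    for hit in find_adapter_hits(toy_assembly):
        print(hit)
```

Any hit at all in a published assembly deserves a closer look, since a 13-base exact match is short enough to occur by chance in a large genome but clusters of hits at contig ends or gap boundaries are a strong hint of untrimmed reads.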


  1. I think genomic contamination is not hard to find either. It is equally unacceptable to leave bacterial genomes or PhiX in the assembly. Within minutes, you can have an estimate of the number of bacterial genomes in your assembly using bacterial single-copy genes:


    This method predicted at least 10 genomes in the tardigrade assembly you mentioned, and we recovered 3 near-complete bacterial genomes from it:


  Romko Perevernykruchenko

    31st March 2016 at 9:07 am

    IMHO, there are several causes.

    1. The journals do not require full disclosure of methods that would make everything reproducible. I don’t know why the method section even exists in the main text of papers submitted to Nature. Standards for SI are much much lower, so you can get away with “custom python scripts” most of the time. Many steps/parameters are skipped as if they are common knowledge.

    2. Work is often done by students only once in their academic lifetime before they are pushed out into industry as spent material. There are very few professionals (and their number is dwindling) who would do a similar task repeatedly while continuously improving their skills and pipelines. PIs are too busy chasing grants to pay close attention. After all, the current academic system encourages marketing rather than reproducible research. The paradigm is to do it quickly, then make it look pretty with a catchy title. Finally, there is not enough money for diligent QA and too much pressure to publish everything as soon as possible. As a result, you get what you are paying for.

    3. Cheap sequencing makes it possible for anybody to do the assembly. Often, a lab would be interested in one or two organisms. Thus, assembly is done by somebody with little or no experience in bioinformatics. Of course, this does not apply to Nature-level papers but many draft assemblies are just like that.

  Bede Constantinides

    11th April 2016 at 2:38 pm

    I contacted the authors of the paper in question in mid-2015, and their response stated that updated sequences were already available from CarpBase. That fails to address the issue of public database contamination and therefore somewhat misses the point. However, they did also vaguely insinuate that EBI was dragging its heels, which would be disappointing if true.

    Thanks for posting – it’s frustrating that this *still* hasn’t been resolved.

