A friend and a colleague of mine once said about me “he’s a details man”, and it was after we had discussed the fact some of my papers consist solely in pointing out the errors other people ignore – in RNA-Seq for example, or in genome assemblies (I have another under review!).

By now, those of you familiar with my earlier work will be jumping up and shouting

“Of course!  Even as far back as 2012, he was warning us about the dangers of mis-annotated pseudogenes!  Just look at Figure 2 in his review of bacterial genome annotation!”

Well, let’s dig into that a bit more.

For the uninitiated, the vast majority of bases in most bacterial genomes fall within open-reading frames (ORFs) that are eventually translated into proteins.  Bacterial evolution, and often niche specificity, is characterised by the psedogen-isation of these ORFs – basically, many replication errors in ORFs (especially insertions and deletions) can introduce a top codon, the transcript and resulting protein are truncated, they are often no longer functional, and you have a pseudogene.  This happens when the ORFs are not under positive selection.

HOWEVER.  Sequencing errors can also introduce errors, and you end up with incorrectly annotated pseudogenes.

And OH NOES we just so happen to be flooded with data from long read technologies that have an indel problem.

Let’s assume you have a shiny new bacterial genome, sequenced on either PacBio or Nanopore, and you want to figure out if it has an indel problem.  Here is my simple and quick test:

First predict proteins using a gene finder.  Then map those proteins to UniProt (using blastp or diamond).  The ratio of the query length to the length of the top hit should be a tight and normal distribution around 1.

OK so what does this look like in practice?

Let’s look at a genome we know if OK, E coli MG1655:


On the left, we have the raw histogram, and on the right we have zoomed in so the y-axis ends at 500.  Generally speaking, this has worked I think – the vast majority of predicted proteins have a 1:1 ratio with their top hit in UniProt.

Here is Salmonella gallinarum which we know “has undergone extensive degradation through deletion and pseudogene formation”


Well, would you look at that!  There they are, all those pseudogenes appearing as predicted proteins that are shorter than their top hit.  For those of you saying “It’s not that obvious”, in the plots above, MG1655 has 157 proteins < 90% of the length of their top hit, whereas for Salmonella gallinarum the figure is 432.   By my recollection, gallinarum has over 200 annotated pseudogenes.

If you’re convinced that this actually might be working, then read on!

So here are a couple of PacBio genomes.  I genuinely chose these at random from recent Genome Announcements.  Generally speaking, all had high PacBio coverage, were assembled using Canu or HGAP, and polished using Pilon.

In no particular order:


(237 proteins < 90% of top hit)


(246 proteins < 90% of top hit)


(268 proteins < 90% of top hit)

Now, these are not terrible numbers, in fact they are quite good, but those long tails are a tiny bit of a worry, and are definitely worth checking out.  Bear in mind, if 200 pseudogenes in Salmonella gallinarum is “extensive degradation through deletion and pseudogene formation“, then 200 incorrect indel pseudogenes in your genome is a problem.  In the above, I can’t possibly tell you whether these  numbers of predicted short/interrupted ORFs are real or not because I don’t know about the biology.  However, I can say that given they were created with a technology known to have an indel problem, they are worth further investigation.

Indel sequencing errors are not the only problem, so are fragmented assemblies.  In fragmented assemblies, the contig ends also tend to be in ORFs and again, you get short/interrupted proteins.  Here is one of our MAGs from our recent rumen paper:


(310 proteins < 90% of top hit)

It’s not all bad news though, some of the MAGs are OK:


(126 proteins < 90% of top hit)

And at least other peoples’ MAGs are just as bad (or worse):


(265 proteins < 90% of top hit)

My absolute favourite way to assemble genomes is a hybrid approach, creating contigs with Illumina and then stitching together with a small amount of nanopore data.  We did this with B fragilis and the results look good:


(63 proteins < 90% of top hit)

So are there any examples of where it goes really badly hideously wrong?  Well the answer is yes.  We are assembling data from metagenomic samples using both PacBio and Nanopore.   The issue with metagenomes is that you may not have enough coverage to correct errors using the consensus approach – there simply aren’t enough bases at each position to get an accurate reflection.  We are seeing the same pattern for both PacBio and Nanopore, so I won’t name which technology, but…. well scroll down for the horror show.













Are you ready?  Scroll down….











A 3Mb contig from a Canu assembly:


(2635 proteins < 90% of top hit)

Now, the only work this has had done on it is that Canu performs read correction before assembly.  We did one round.  Sweet holy Jesus that’s bad isn’t it??

Can it be corrected?  To a degree.  Here it is after two rounds of Pilon and some technology-specific polishing:


(576 proteins < 90% of top hit).

So less bad, but still not great.  I’d say this assembly has problems, which is why we haven’t published yet.

What’s the take home message here?  Well there are a few:

  1. Errors matter.  Pay attention to them.
  2. Both long read technologies have indel problems and these probably cause frameshifts in your ORFs
  3. Polishing and consensus and Illumina correction helps but it doesn’t catch everything
  4. The problem will inevitably be worse if you have low coverage
  5. Comparing predicted proteins with their top hits in UniProt can identify where you have errors in your ORFs (or real pseudogenes!)

Code for doing this type of analysis will appear here.