I’m in real danger of sounding like a stuck record, but readers of the blog will know I have a bee in my bonnet about researchers who (unwittingly I’m sure) publish long-read assemblies with uncorrected indel errors in them. If you are new to the blog, please read about my simple method for detecting indels, a deeper dive into errors in single molecule bacterial genomes, and my analyses of problems with published and unpublished single molecule assemblies of the human genome.
If you can’t be bothered reading, then the summary is:
- For BOTH single molecule sequencing technologies (PacBio and Nanopore), the major error mode is insertions/deletions (indels)
- Once a genome is assembled, some of these errors remain in the assembly
- If left uncorrected, indels in protein-coding regions cause frameshifts, which typically introduce premature stop codons
- It’s not that these errors can’t be corrected; it’s that, outside of the top assembly groups in the world, people mostly don’t
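To make the frameshift point concrete, here is a toy sketch, using an entirely made-up 18 bp coding sequence (not from any real genome), showing how deleting a single base shifts the reading frame and brings a premature stop codon into frame:

```python
STOPS = {"TAA", "TAG", "TGA"}

def first_stop(cds):
    """Return the codon index of the first in-frame stop codon, or None."""
    for i in range(0, len(cds) - 2, 3):
        if cds[i:i + 3] in STOPS:
            return i // 3
    return None

cds = "ATGAAACTGACCGGGTAA"   # toy CDS: the only in-frame stop is the final codon
mutated = cds[:3] + cds[4:]  # a 1 bp deletion just after the start codon

print(first_stop(cds))      # 5 -- full-length toy protein
print(first_stop(mutated))  # 2 -- the frameshift brings a TGA into frame early
```

In a real assembly the same thing happens silently: a gene annotated across an uncorrected indel is called short, or split into fragments.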
Latest off the line is this cool paper – High-Quality Whole-Genome Sequences for 77 Shiga Toxin-Producing Escherichia coli Strains Generated with PacBio Sequencing. Being a man with a pipeline, I can just slot these right in and press go. That’s what I did.
The methods section is remarkably brief:
> The sequence reads were then filtered and assembled de novo using Falcon, Canu, or the PacBio Hierarchical Genome Assembly Process version
However, the GenBank accession numbers are all there, and the records contain more detail on which genome was assembled with which software tool.
Importantly, though, there is no mention of whether Quiver, Arrow, Racon or Pilon were used to correct the assemblies. I know that Canu and HGAP have read error correction “built in”, but we have found this is often not enough, and a second or third round of correction is needed. Whether this occurred on these genomes I have no idea.
My basic approach is to take each genome, predict the protein sequences using Prodigal, search those against UniProt TREMBL using Diamond, then compare the length of the predicted protein with the length of the top hit. What we should see is a distribution tightly clustered around 1 i.e. the predicted protein should be about the same length as its top hit from the database. We then look at the number of proteins that are less than 90% of the length of the top hit.
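The counting step can be sketched in a few lines. This is a minimal illustration, not the actual pipeline; it assumes Diamond was run with a custom tabular output that includes query and subject lengths, along the lines of `diamond blastp -q proteins.faa -d trembl.dmnd --max-target-seqs 1 --outfmt 6 qseqid sseqid qlen slen -o hits.tsv` (file names here are hypothetical):

```python
import csv

def short_protein_count(hits_tsv, cutoff=0.9):
    """Count predicted proteins shorter than `cutoff` x their top database hit.

    Expects a tab-separated file with columns: qseqid, sseqid, qlen, slen.
    """
    ratios = {}
    with open(hits_tsv) as fh:
        for qseqid, sseqid, qlen, slen in csv.reader(fh, delimiter="\t"):
            # setdefault keeps only the first (i.e. top) hit for each query
            ratios.setdefault(qseqid, int(qlen) / int(slen))
    return sum(r < cutoff for r in ratios.values())
```

An intact genome should give a ratio distribution tightly clustered around 1, so the count of proteins below 0.9 is a cheap proxy for uncorrected indels.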
Here are those data for 73 of the E. coli genomes:
| Accession | Software | Version | # proteins < 0.9 | Coverage | Strain | Serotype | # contigs | Length |
|---|---|---|---|---|---|---|---|---|
The key column is perhaps “# proteins < 0.9”, which is the number of proteins we predict have a premature stop codon. Many of these could be genuine pseudogenes, a known mode of adaptation in bacterial genomes, but an excess indicates a problem: in my experience, one does not usually see more than 100-200 annotated pseudogenes in any bacterial genome. As can be seen here, 4 of the E. coli isolates have over 1000 genes that are predicted to be shorter than they should be!
Let’s look at this graphically. Here is CP027462, the “best” genome with only 137 predicted short genes:
And zoomed in:
This looks pretty good: a nice histogram centred around 1, which is what we expect.
What about the worst? Here is CP027640, which has 1799 predicted short genes:
And zoomed in:
As you can see, there are way more proteins here that appear to be shorter than they should be.
I note there is no mention of pseudogenes, insertions or deletions in the paper, nor is there mention of error correction. At this point in time I would treat these genomes with care!
I wanted to see if there was any relationship between the number of short genes and the coverage. There is a trend, but it is not statistically significant:
```
Call:
lm(formula = d$coverage ~ d$num_short)

Residuals:
    Min      1Q  Median      3Q     Max 
-66.671 -29.329  -8.588  19.749 119.103 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 118.31888    7.93060   14.919
```
If you remove the genomes with over 1000 short genes, then the line is basically flat, which suggests to me coverage is not the main issue here.
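The refit is easy to reproduce. Here is the same idea in Python rather than R, with entirely made-up numbers purely to show the mechanics (these are not the paper's data): fit coverage against the number of short genes, then drop the assemblies with over 1000 short genes and fit again.

```python
import numpy as np

# Made-up illustrative values: three "clean" assemblies and two bad ones.
num_short = np.array([137, 300, 600, 1200, 1800])
coverage = np.array([100, 100, 100, 60, 40])

slope_all, _ = np.polyfit(num_short, coverage, 1)

keep = num_short <= 1000  # exclude genomes with > 1000 short genes
slope_kept, _ = np.polyfit(num_short[keep], coverage[keep], 1)

print(slope_all, slope_kept)  # the refit slope is ~0 for these toy values
```

With numbers like these, the two outlying assemblies drive the whole trend, and the line over the remaining points is essentially flat, which is the pattern I saw in the real data.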
I’ll repeat what I said above – it’s not that you can’t correct errors in these assemblies, it’s that people don’t.
It’s hard and takes care and attention, which is difficult to do at scale.
If you are the authors of the above study, I’m sorry, this isn’t an attack on you. If you need help with these assemblies, there are many, many groups who could help you and would be willing to do so. Please get in touch if you want me to put you in touch with some of them.
However, at the moment, I am sorry but these genomes look like they should be treated with great care. And I’m sorry to sound like a stuck record.