I’m in real danger of sounding like a stuck record, but readers of the blog will know I have a bee in my bonnet about researchers who (unwittingly I’m sure) publish long-read assemblies with uncorrected indel errors in them.   If you are new to the blog, please read about my simple method for detecting indels, a deeper dive into errors in single molecule bacterial genomes, and my analyses of problems with published and unpublished single molecule assemblies of the human genome.

If you can’t be bothered reading, then the summary is:

  • BOTH single molecule sequencing technologies (PacBio and Nanopore), their major error mode is insertions / deletions
  • Once a genome is assembled, some of these errors remain in the assembly
  • If they are uncorrected, they inevitably cause a frameshift or premature stop codon in protein-coding regions
  • It’s not that you can’t correct these errors, it’s that mostly, outside of the top assembly groups in the world, people don’t

Latest off the line is this cool paper – High-Quality Whole-Genome Sequences for 77 Shiga Toxin-Producing Escherichia coli Strains Generated with PacBio Sequencing.  Being a man with a pipeline, I can just slot these right in and press go.  That’s what I did.

The methods section is remarkably brief:

The sequence reads were then filtered and assembled de novo using Falcon, Canu, or the PacBio Hierarchical Genome Assembly Process version

However, the GenBank accession numbers are all there and they have more details on which genome was assembled with which software tool.

Importantly, though, there is no mention of whether Quiver, Arrow, Racon or Pilon were used to correct the assemblies.  I know that Canu and HGAP have read error correction “built in”, but we have found this is often not enough, and a second or third round of correction is needed.   Whether this occurred on these genomes I have no idea.

My basic approach is to take each genome, predict the protein sequences using Prodigal, search those against UniProt TREMBL using Diamond, then compare the length of the predicted protein with the length of the top hit.  What we should see is a distribution tightly clustered around 1 i.e. the predicted protein should be about the same length as its top hit from the database.  We then look at the number of proteins that are less than 90% of the length of the top hit.

Here are those data for 73 of the E coli genomes:

Accession Software Version # proteins < 0.9 Coverage Strain Serotype # contigs Length
CP027640 HGAP v.3 1799 70.725 2014C-4705b O112:H21 2 5,329,029
CP027484 HGAP v.3 1636 57.246 2013C-4390 O76:H19 2 5,353,719
CP027675 HGAP v.3 1113 58.817 88-3510b O172:H25 2 5,140,386
CP027672 FALCON v0.3.0 1089 128.628 2014C-3003b O76:H19 3 5,234,640
CP027376 HGAP v.3 714 166.017 2013C-4404 O91:H14 4 5,009,822
CP027363 HGAP v.3 608 121.382 88-3001 O165:H25 2 5,195,753
CP027445 HGAP v.3 608 93.11 2013C-3492b O172:H25 2 5,196,105
CP027325 HGAP v.3 607 122.932 2013C-4830 O165:H25 3 5,135,675
CP027459 HGAP v.3 591 185.92 90-3040b O172:H25 2 5,253,712
CP027763 HGAP v.3 544 83.737 2015C-3125b O145:H28 3 5,471,132
CP027591 HGAP v.3 521 142.9 2014C-3011b O177:H25 4 5,168,350
CP027597 HGAP v.3 516 151.359 86-3153b O5:H9 2 5,342,528
CP027344 HGAP v.3 487 39.545 2014C-3946 O111:H8 3 5,264,938
CP027587 HGAP v.3 480 96.558 2013C-4974b O5:H9 2 5,235,560
CP027544 HGAP v.3 471 79.656 2013C-3264b O103:H25 2 5,486,407
CP027331 HGAP v.3 416 44.034 2013C-3277 O26:H11 4 5,438,694
CP027548 HGAP v.3 396 67.785 2014C-3061b O156:H25 2 5,303,935
CP027362 HGAP v.3 388 121.374 95-3192 O145:H28 1 5,385,516
CP027338 HGAP v.3 381 92.732 2014C-3051 O71:H11 2 5,597,475
CP027340 HGAP v.3 380 64.661 2015C-3121 O91:H14 2 5,366,577
CP027552 HGAP v.3 364 118.255 2015C-4498b O117:H8 2 5,434,442
CP027335 HGAP v.3 341 134.69 2014C-3716 O26:H11 3 5,568,215c
CP027472 HGAP v.3 328 47.07 2014C-3050b O118:H16 2 5,671,594
CP027454 HGAP v.3 322 69.53 2014C-4423b O121:H19 3 5,338,915
CP027351 HGAP v.3 321 55.972 2014C-3655 O121:H19 2 5,442,537
CP027317 HGAP v.3 320 120.949 2015C-3107 O121:H19 2 5,388,260
CP027368 HGAP v.3 320 126.946 2014C-3307 O178:H19 3 4,965,987
CP027555 HGAP v.3 310 121.607 2013C-3513b O186:H11 3 5,584,939
CP027766 HGAP v.3 308 119.094 2013C-3342 O117:H8 2 5,489,451
CP027361 HGAP v.3 303 98.329 2014C-4639 O26:H11 3 5,325,246
CP027593 HGAP v.3 303 112.231 2013C-3304 O71:H8 4 5,309,950c
CP027572 HGAP v.3 301 91.841 2013C-3996 O26:H11 2 5,858,766c
CP027352 HGAP v.3 296 80.59 2012C-4606 O26:H11 3 5,647,195
CP027310 HGAP v.3 285 110.963 2014C-4135 O113:H21 2 4,949,048
CP027355 HGAP v.3 285 96.531 2013C-4991 O80:H2 4 5,367,251
CP027435 HGAP v.3 285 119.123 2014C-3599 O121:H19 2 5,400,138
CP027550 HGAP v.3 285 78.081 2015C-4136CT1b O145:H34 2 4,836,918
CP027328 Canu v.1.6 282 197.393 2014C-3741 O174:H8 3 5,394,679c
CP027586 HGAP v.3 278 169.219 2012EL-2448b O91:H14 1 5,272,286
CP027347 HGAP v.3 276 81.415 2013C-4361 O111:H8 2 5,317,846
CP027599 HGAP v.3 274 113.492 97-3250 O26:H11 3 5,942,969
CP027390 HGAP v.3 272 86.476 2015C-4944 O26:H11 2 5,802,748
CP027452 HGAP v.3 271 44.21 2014C-3338b O183:H18 2 4,799,014
CP027437 HGAP v.3 269 94.01 2012C-4221b O101:H6 3 5,012,557
CP027319 HGAP v.3 259 103.5 2014C-3084 O145:H28 4 4,717,123
CP027442 HGAP v.3 257 81.78 2013C-3252 O69:H11 3 5,636,732
CP027387 HGAP v.3 254 81.261 2014C-3057 O26:H11 2 5,645,983
CP027380 HGAP v.3 251 98.093 2013C-3250 O111:H8 6 5,401,672
CP027221 HGAP v.3 248 70.272 2015C-3101 O111:H8 3 5,313,278
CP027573 HGAP v.3 247 136.013 2013C-4081b O111:H8 4 5,411,943
CP027323 HGAP v.3 243 230.753 2013C-3033 O146:H21 2 5,426,201
CP027577 HGAP v.3 243 76.725 2013C-4225b O103:H11 2 5,646,446
CP027546 HGAP v.3 241 69.884 2013C-4187b O71:H11 2 5,509,931
CP027464 HGAP v.3 239 157.835 2013C-4248 O186:H2 8 5,243,827
CP027582 HGAP v.3 232 94.362 2013C-4538b O118:H16 2 5,680,428
CP027312 HGAP v.3 230 110.444 2013C-3181 O113:H21 1 5,167,951
CP027307 HGAP v.3 229 75.219 2015C-3108 O111:H8 3 5,364,442
CP027219 HGAP v.3 228 73.109 2015C-3163 O103:H2 2 5,500,189
CP027366 HGAP v.3 227 99.281 89-3156 O174:H21 2 5,065,883
CP027388 HGAP v.3 226 100.369 2011C-4251 O45:H2 2 5,440,026
CP027313 HGAP v.3 225 104.403 2014C-3550 O118:H16 4 5,549,395
CP027520 HGAP v.3 225 84.93 89-3506b O126:H27 3 5,178,386c
CP027461 HGAP v.3 222 149.21 95-3322b O22:H5 1 5,095,223
CP027342 HGAP v.3 215 89.272 2014C-4587 OUND:H19 2 5,040,163
CP027449 HGAP v.3 193 60.29 2014C-3097b O181:H49 3 5,077,228
CP027371 HGAP v.3 186 185.611 2015C-3905 O181:H49 2 4,901,620
CP027584 HGAP v.3 181 186.121 00-3076b O113:H21 2 4,997,979
CP027373 HGAP v.3 167 169.936 05-3629 O8:H16 3 4,904,151
CP027440 HGAP v.3 158 127.87 2012C-4502 O185:H28 2 4,892,666
CP027579 HGAP v.3 155 98.187 2013C-4282b O77:H45 3 5,030,044
CP027457 HGAP v.3 149 222.41 88-3493b O137:H41 2 5,001,754
CP027447 HGAP v.3 141 157.52 2014C-3075 O36:H42 2 5,168,620
CP027462 FALCON v0.3.0 137 120.31 07-4299b O130:H11 2 4,847,172

 

The key column is perhaps “# proteins < 0.9″ which is the number of proteins we predict have a premature stop codon.  Many of these could be due to genuine pseudogenes, a known method of adaptation in bacterial genomes, however an excess indicates a problem.  However, in my experience, one does not usually see more than a 100-200 pseudogenes annotation in any bacterial genome.  As can be seen here, 4 of the E coli isolates have over 1000 genes that are predicted to be shorter than they should be!

Let’s look at this graphically.  Here is CP027462, the “best” genome with only 137 predicted short genes:

CP027462

And zoomed in:

CP027462.500

This looks pretty good, nice histogram centred around 1 which is what we expect.

What about the worst?  Here is CP027640, which has 1799 predicted short genes:

CP027640

And zoomed in:

CP027640.500

As you can see, there are way more proteins here that appear to be shorter than they should be.

I note there is no mention of pseudogenes, insertions or deletions in the paper, nor is there mention of error correction.  At this point in time I would treat these genomes with care!

I wanted to see if there was any relationship between the number of short genes and the coverage, and there is but it’s not significant:

lm

Call:
lm(formula = d$coverage ~ d$num_short)

Residuals:
    Min      1Q  Median      3Q     Max 
-66.671 -29.329  -8.588  19.749 119.103 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 118.31888    7.93060  14.919

 

If you remove the genomes with over 1000 short genes, then the line is basically flat, which suggests to me coverage is not the main issue here.



I’ll repeat what I said above – it’s not that you can’t correct errors in these assemblies, it’s that people don’t.

It’s hard and takes care and attention, which is difficult to do at scale.

If you are the authors of the above study, I’m sorry, this isn’t an attack on you.  If you need help with these assemblies, there are many, many groups who could help you and would be willing to do so.  Please get in touch if you want me to put you in touch with some of them.

However, at the moment, I am sorry but these genomes look like they should be treated with great care.  And I’m sorry to sound like a stuck record.