bioinformatics, genomes, biology etc. "I don't mean to sound angry and cynical, but I am, so that's how it comes across"

Category: sequencing (page 1 of 5)

Massive scale assembly of genomes from metagenomes in ruminants

Hopefully some of you will have seen our recent paper published in Nature Biotech:

Compendium of 4,941 rumen metagenome-assembled genomes for rumen microbiome biology and enzyme discovery
Robert D. Stewart, Marc D. Auffret, Amanda Warr, Alan W. Walker, Rainer Roehe & Mick Watson

Nature Biotechnology volume 37, pages 953‚Äď961 (2019)

In this paper we analyse samples from 283 beef cattle, generate between 50-60M PE150 reads from each, perform single sample assemblies, perform co-assemblies, bin the whole lot and talk about the 4941 metagenome-assembled-genomes that are at least 80% complete and at most 10% contaminated.

We also ran three MinION flowcells on one sample and generated lots of contigs >1Mb in length, and at least three whole bacterial chromosomes from novel bacterial species.

This was a real labour of love, and the entire dataset is available at DOI https://doi.org/10.7488/ds/2470 and at ENA accession PRJEB31266.

I just wanted to highlight a few things from the paper you may have missed in the details ūüėÄ

We have released all of our data

And I mean all of it. This includes the primary raw data, the single sample metagenome assemblies, the metagenome co-assemblies, the bin layer (more of what that means below), the 4941 MAGs, the annotations and predicted proteins, DIAMOND search hits against KEGG and UniProt – literally everything. If there is something you want that’s missing let us know. Some of it may not have uploaded to ENA yet, but it should all be there very soon. If it is not in the ENA yet, it should be in at the Edinburgh DataShare DOI https://doi.org/10.7488/ds/2470.

There are more MAGs than you think….

Recent massive scale re-analysis of the human microbiome data by the Segata, Finn and Kyrpides labs revealed 10,000s of new genomes from all human body sites. In theory these studies dwarf our own, but they used a cut-off of 50% complete, whereas our 4941 are all 80% complete.

Well, the new ENA model for metagenomic assemblies includes both a bin layer and a MAG layer. The MAGs are your final product – the fully analysed, cleaned and annotated genomes. In our case this is 4941. The bin layer is the layer of bins from which the MAGs derive. In our case we have chosen to put all bins that were at least 50% complete and no more than 10% contaminated into the bin layer. So in actual fact, whilst we do not talk about this in the paper, we are releasing 20,469 new MAGs from the rumen microbiome. These should all be available from the ENA soon.

We got a (near) complete bacterial chromosome from an Illumina metagenome assembly!!

Almost unheard of, in my experience, but we assembled a near complete genome of a novel Proteobacteria species as a single contig direct from the Illumina data using IDBA-UD. We are not sure why the stars aligned and enabled us to do it,  but we were certainly presently surprised.

These small Proteobacteria are numerous and abundant in our rumen datasets and have also been found in other metagenomic datasets. They do not appear to have been cultured and sequenced, and with their small genomes, we have to speculate that they depend on other microbes to survive, so can only be isolated from communities. Nevertheless, we know their CAZyme profile now and one of my students is going to have a crack at culturing one of these in a placement with Alan Walker later this year.

MAGs can be very high quality

I know there are MAG-sceptics out there, but I would just like to point them to Supplementary Figure 13.¬† The bottom left shows whole-genome alignments between an Illumina MAG (vertical axis) and a Nanopore whole-chromosome. These came from different samples. SO what is the take home? Well these are from two different samples, sequenced using two different technologies and assembled using completely different software and methods… yet they came up with the same answer. In other words, the Illumina MAG is of such high quality that it agrees, almost completely, with a single, whole-chromosome assembly from the Nanopore data. Remarkable.

This paper is a reference for IDEEL

I have blogged at great length about indels in long-read assembled genomes, and we created a simple Snakemake pipeline (IDEEL – Indels are not ideal) to demonstrate these were a problem. I was pleased when this pipeline got a shout out from Kat Holt on Twitter

Until now, IDEEL has not had a formal publication, and as we used it extensively in this paper, and mention it explicitly by name, then I propose that our new paper is the formal citation for IDEEL.

The real story is the proteins

Whilst the figures of 4941 MAGs and the single chromosome assemblies will take the headlines, for me the real story here is in the proteins. When clustering all known rumen miccrobial proteins, using a 90% identity cut-off, we ended up with 6.24million clusters. However, 3.67million of them are singletons that only contain proteins from our new MAGs. This is an incredible amount of new protein diversity, and figuring out what their function is will be the next major challenge for metagenomics researchers.

We also compared CAZymes to CAZy and find that CAZy poorly represents rumen CAZymes.

All microbiome work ends up with the need to culture

I increasingly believe we have to bring in to culture many of these microbes if we are to take the next step in microbiome research. We predict over 2000 novel species of bacteria and archaea. We cannot possibly know what they do, how they behave and interact, without bringing them in to culture.

There is a myth that 95% of rumen bacteria are uncultureable. This myth needs to die. It already killed one of our grant applications ūüôĀ There is nothing inherently uncultureable with the majority of these bugs, and even with the small Proteobacteria I mention above, it should be possible with a bit of work and the right co-culturing conditions. So please let this myth die!! All bugs are cultureable, some are more difficult than others.

Onwards and upwards ūüôā

On stuck records and indel errors; or “stop publishing bad genomes”

I’m in real danger of sounding like a stuck record, but readers of the blog will know I have a bee in my bonnet about researchers who (unwittingly I’m sure) publish long-read assemblies with uncorrected indel errors in them.¬† ¬†If you are new to the blog, please read about my simple method for detecting indels, a deeper dive into errors in single molecule bacterial genomes, and my analyses of problems with published and unpublished single molecule assemblies of the human genome.

If you can’t be bothered reading, then the summary is:

  • BOTH single molecule sequencing technologies (PacBio and Nanopore), their major error mode is insertions / deletions
  • Once a genome is assembled, some of these errors remain in the assembly
  • If they are uncorrected, they inevitably cause a frameshift or premature stop codon in protein-coding regions
  • It’s not that you¬†can’t correct these errors, it’s that mostly, outside of the top assembly groups in the world, people don’t

Latest off the line is this cool paper –¬†High-Quality Whole-Genome Sequences for 77 Shiga Toxin-Producing Escherichia coli Strains Generated with PacBio Sequencing.¬† Being a man with a pipeline, I can just slot these right in and press go.¬† That’s what I did.

The methods section is remarkably brief:

The sequence reads were then filtered and assembled de novo using Falcon, Canu, or the PacBio Hierarchical Genome Assembly Process version

However, the GenBank accession numbers are all there and they have more details on which genome was assembled with which software tool.

Importantly, though, there is no mention of whether Quiver, Arrow, Racon or Pilon were used to correct the assemblies.¬† I know that Canu and HGAP have read error correction “built in”, but we have found this is often not enough, and a second or third round of correction is needed.¬† ¬†Whether this occurred on these genomes I have no idea.

My basic approach is to take each genome, predict the protein sequences using Prodigal, search those against UniProt TREMBL using Diamond, then compare the length of the predicted protein with the length of the top hit.  What we should see is a distribution tightly clustered around 1 i.e. the predicted protein should be about the same length as its top hit from the database.  We then look at the number of proteins that are less than 90% of the length of the top hit.

Here are those data for 73 of the E coli genomes:

Accession Software Version # proteins < 0.9 Coverage Strain Serotype # contigs Length
CP027640 HGAP v.3 1799 70.725 2014C-4705b O112:H21 2 5,329,029
CP027484 HGAP v.3 1636 57.246 2013C-4390 O76:H19 2 5,353,719
CP027675 HGAP v.3 1113 58.817 88-3510b O172:H25 2 5,140,386
CP027672 FALCON v0.3.0 1089 128.628 2014C-3003b O76:H19 3 5,234,640
CP027376 HGAP v.3 714 166.017 2013C-4404 O91:H14 4 5,009,822
CP027363 HGAP v.3 608 121.382 88-3001 O165:H25 2 5,195,753
CP027445 HGAP v.3 608 93.11 2013C-3492b O172:H25 2 5,196,105
CP027325 HGAP v.3 607 122.932 2013C-4830 O165:H25 3 5,135,675
CP027459 HGAP v.3 591 185.92 90-3040b O172:H25 2 5,253,712
CP027763 HGAP v.3 544 83.737 2015C-3125b O145:H28 3 5,471,132
CP027591 HGAP v.3 521 142.9 2014C-3011b O177:H25 4 5,168,350
CP027597 HGAP v.3 516 151.359 86-3153b O5:H9 2 5,342,528
CP027344 HGAP v.3 487 39.545 2014C-3946 O111:H8 3 5,264,938
CP027587 HGAP v.3 480 96.558 2013C-4974b O5:H9 2 5,235,560
CP027544 HGAP v.3 471 79.656 2013C-3264b O103:H25 2 5,486,407
CP027331 HGAP v.3 416 44.034 2013C-3277 O26:H11 4 5,438,694
CP027548 HGAP v.3 396 67.785 2014C-3061b O156:H25 2 5,303,935
CP027362 HGAP v.3 388 121.374 95-3192 O145:H28 1 5,385,516
CP027338 HGAP v.3 381 92.732 2014C-3051 O71:H11 2 5,597,475
CP027340 HGAP v.3 380 64.661 2015C-3121 O91:H14 2 5,366,577
CP027552 HGAP v.3 364 118.255 2015C-4498b O117:H8 2 5,434,442
CP027335 HGAP v.3 341 134.69 2014C-3716 O26:H11 3 5,568,215c
CP027472 HGAP v.3 328 47.07 2014C-3050b O118:H16 2 5,671,594
CP027454 HGAP v.3 322 69.53 2014C-4423b O121:H19 3 5,338,915
CP027351 HGAP v.3 321 55.972 2014C-3655 O121:H19 2 5,442,537
CP027317 HGAP v.3 320 120.949 2015C-3107 O121:H19 2 5,388,260
CP027368 HGAP v.3 320 126.946 2014C-3307 O178:H19 3 4,965,987
CP027555 HGAP v.3 310 121.607 2013C-3513b O186:H11 3 5,584,939
CP027766 HGAP v.3 308 119.094 2013C-3342 O117:H8 2 5,489,451
CP027361 HGAP v.3 303 98.329 2014C-4639 O26:H11 3 5,325,246
CP027593 HGAP v.3 303 112.231 2013C-3304 O71:H8 4 5,309,950c
CP027572 HGAP v.3 301 91.841 2013C-3996 O26:H11 2 5,858,766c
CP027352 HGAP v.3 296 80.59 2012C-4606 O26:H11 3 5,647,195
CP027310 HGAP v.3 285 110.963 2014C-4135 O113:H21 2 4,949,048
CP027355 HGAP v.3 285 96.531 2013C-4991 O80:H2 4 5,367,251
CP027435 HGAP v.3 285 119.123 2014C-3599 O121:H19 2 5,400,138
CP027550 HGAP v.3 285 78.081 2015C-4136CT1b O145:H34 2 4,836,918
CP027328 Canu v.1.6 282 197.393 2014C-3741 O174:H8 3 5,394,679c
CP027586 HGAP v.3 278 169.219 2012EL-2448b O91:H14 1 5,272,286
CP027347 HGAP v.3 276 81.415 2013C-4361 O111:H8 2 5,317,846
CP027599 HGAP v.3 274 113.492 97-3250 O26:H11 3 5,942,969
CP027390 HGAP v.3 272 86.476 2015C-4944 O26:H11 2 5,802,748
CP027452 HGAP v.3 271 44.21 2014C-3338b O183:H18 2 4,799,014
CP027437 HGAP v.3 269 94.01 2012C-4221b O101:H6 3 5,012,557
CP027319 HGAP v.3 259 103.5 2014C-3084 O145:H28 4 4,717,123
CP027442 HGAP v.3 257 81.78 2013C-3252 O69:H11 3 5,636,732
CP027387 HGAP v.3 254 81.261 2014C-3057 O26:H11 2 5,645,983
CP027380 HGAP v.3 251 98.093 2013C-3250 O111:H8 6 5,401,672
CP027221 HGAP v.3 248 70.272 2015C-3101 O111:H8 3 5,313,278
CP027573 HGAP v.3 247 136.013 2013C-4081b O111:H8 4 5,411,943
CP027323 HGAP v.3 243 230.753 2013C-3033 O146:H21 2 5,426,201
CP027577 HGAP v.3 243 76.725 2013C-4225b O103:H11 2 5,646,446
CP027546 HGAP v.3 241 69.884 2013C-4187b O71:H11 2 5,509,931
CP027464 HGAP v.3 239 157.835 2013C-4248 O186:H2 8 5,243,827
CP027582 HGAP v.3 232 94.362 2013C-4538b O118:H16 2 5,680,428
CP027312 HGAP v.3 230 110.444 2013C-3181 O113:H21 1 5,167,951
CP027307 HGAP v.3 229 75.219 2015C-3108 O111:H8 3 5,364,442
CP027219 HGAP v.3 228 73.109 2015C-3163 O103:H2 2 5,500,189
CP027366 HGAP v.3 227 99.281 89-3156 O174:H21 2 5,065,883
CP027388 HGAP v.3 226 100.369 2011C-4251 O45:H2 2 5,440,026
CP027313 HGAP v.3 225 104.403 2014C-3550 O118:H16 4 5,549,395
CP027520 HGAP v.3 225 84.93 89-3506b O126:H27 3 5,178,386c
CP027461 HGAP v.3 222 149.21 95-3322b O22:H5 1 5,095,223
CP027342 HGAP v.3 215 89.272 2014C-4587 OUND:H19 2 5,040,163
CP027449 HGAP v.3 193 60.29 2014C-3097b O181:H49 3 5,077,228
CP027371 HGAP v.3 186 185.611 2015C-3905 O181:H49 2 4,901,620
CP027584 HGAP v.3 181 186.121 00-3076b O113:H21 2 4,997,979
CP027373 HGAP v.3 167 169.936 05-3629 O8:H16 3 4,904,151
CP027440 HGAP v.3 158 127.87 2012C-4502 O185:H28 2 4,892,666
CP027579 HGAP v.3 155 98.187 2013C-4282b O77:H45 3 5,030,044
CP027457 HGAP v.3 149 222.41 88-3493b O137:H41 2 5,001,754
CP027447 HGAP v.3 141 157.52 2014C-3075 O36:H42 2 5,168,620
CP027462 FALCON v0.3.0 137 120.31 07-4299b O130:H11 2 4,847,172


The key column is perhaps “# proteins < 0.9″¬†which is the number of proteins we predict have a premature stop codon.¬† Many of these could be due to genuine pseudogenes, a known method of adaptation in bacterial genomes, however an excess indicates a problem.¬† However, in my experience, one does not usually see more than a 100-200 pseudogenes annotation in any bacterial genome.¬† As can be seen here, 4 of the E coli isolates have over 1000 genes that are predicted to be shorter than they should be!

Let’s look at this graphically.¬† Here is¬†CP027462, the “best” genome with only 137 predicted short genes:


And zoomed in:


This looks pretty good, nice histogram centred around 1 which is what we expect.

What about the worst?  Here is CP027640, which has 1799 predicted short genes:


And zoomed in:


As you can see, there are way more proteins here that appear to be shorter than they should be.

I note there is no mention of pseudogenes, insertions or deletions in the paper, nor is there mention of error correction.  At this point in time I would treat these genomes with care!

I wanted to see if there was any relationship between the number of short genes and the coverage, and there is but it’s not significant:


lm(formula = d$coverage ~ d$num_short)

    Min      1Q  Median      3Q     Max 
-66.671 -29.329  -8.588  19.749 119.103 

             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 118.31888    7.93060  14.919


If you remove the genomes with over 1000 short genes, then the line is basically flat, which suggests to me coverage is not the main issue here.

I’ll repeat what I said above – it’s not that you can’t correct errors in these assemblies, it’s that people don’t.

It’s hard and takes care and attention, which is difficult to do at scale.

If you are the authors of the above study, I’m sorry, this isn’t an attack on you.¬† If you need help with these assemblies, there are many, many groups who could help you and would be willing to do so.¬† Please get in touch if you want me to put you in touch with some of them.

However, at the moment, I am sorry but these genomes look like they should be treated with¬†great care.¬† And I’m sorry to sound like a stuck record.

How accurate is the nanopore-only assembly of GM12878?

Yes, I still blog!

A quick history.¬† ¬†As many of you will be aware, Jain¬†et al published a fantastic paper where they produced the first de novo genome assembly of a human genome using the MinION, a portable DNA sequencer that uses nanopores to detect the sequence of single molecules.¬† ¬†After fixing errors with Illumina, they report an accuracy of 99.8% (more on how useful that is later….)

Despite the fact I loved this paper, I was slightly frustrated that they didn’t tackle, head on, the major issues with single molecule assemblies – insertion and deletion errors.¬† In response, I wrote my own paper, showing that indel errors are ubiquitous in both PacBio and nanopore published single molecule assemblies.

The analysis I carried out in the linked bioRxiv is fairly basic to say the least, so we have now carried out a more complex analysis using full length CDS alignments with¬†splign.¬† We also introduce the concept of control assemblies.¬† GM12878 is a cell line, and as such, has accrued over time mutations and indels in areas of the genome it doesn’t use, including many genes.¬† However, Illumina-only assemblies of NA12878 exist e.g.¬†GCA_000185165.1 , so we can use this as a control – in other words, we can look for genes with indels in the single molecule assemblies that do not have indels in the appropriate Illumina-only assembly.

Serge Koren recently published a blog post detailing a new version of the nanopore assembly of NA12878 using¬†only the nanopore data, in other words they didn’t use Pilon/Illumina at all, simply polished with nanopore data using the in-built canu read-correction and signal-level polishing with nanopolish.¬† This is apparently 99.76% accurate.

So how do the two nanopore assemblies compare?

# transcripts with indels # genes with indels
Nanopore + pilon 9051 4282
Nanopore only 22440 10111

A couple of things to point out:

  • We assayed around 46000 CDS transcript sequences we had evidence might be problematic
  • We looked at the best, longest alignment for each CDS that was predicted to produce a protein
  • We only looked at alignments including >80% of the CDS
  • The above numbers are transcripts/genes that have indels in the nanopore assembly but not in the illumina assembly

We can see that both assemblies still have indel issues.  The polishing with pilon has removed many, but 4282 genes remain affected.  The nanopore only assembly, polished with nanopolish, has over 10,000 protein coding genes with indels.

You are shouting – show me an example!

Here is an example of a transcript aligned against an illumina assembly of NA12878: link

Here is the same transcript aligned against the nanopore-only assembly of NA12878: link

Finally, here is the same transcript in the nanpore+pilon assembly of NA12878: link

As you can see, there is no evidence from Illumina data that NA12878 has problems with this transcript.  There are indels in the nanopore-only assembly, that have been fixed in the nanopore+pilon assembly.

I’m not trying to attack anyone or any technology, but we can’t fix problems if we don’t talk about them.

I remain concerned that people are publishing pacbio and nanopore assemblies without paying sufficient attention to indel errors, and our work repeatedly demonstrates that both PacBio and Nanopore assemblies suffer from the problem, even after polishing.¬† Our own solution to this has been to fix the remaining errors manually.¬† Do not assume that you shouldn’t be doing this!

Nanopore is a fantastic technology, but we should not overstate its accuracy nor ignore its problems.

For me, statements that entire, complex, 3Gb assemblies are “99.8%” accurate are at best completely pointless and at worst misleading.¬† I don’t blame authors for using them, reviewers almost certainly ask for them, but they are genuinely pointless statistics.

I hope you enjoyed this blog post!




A simple test for uncorrected insertions and deletions (indels) in bacterial genomes

A friend and a colleague of mine once said about me “he’s a details man”, and it was after we had discussed the fact some of my papers consist solely in pointing out the errors other people ignore – in RNA-Seq for example, or in genome assemblies (I have another under review!).

By now, those of you familiar with my earlier work will be jumping up and shouting

“Of course!¬† Even as far back as 2012, he was warning us about the dangers of mis-annotated pseudogenes!¬† Just look at Figure 2 in his review of bacterial genome annotation!”

Well, let’s dig into that a bit more.

For the uninitiated, the vast majority of bases in most bacterial genomes fall within open-reading frames (ORFs) that are eventually translated into proteins.  Bacterial evolution, and often niche specificity, is characterised by the psedogen-isation of these ORFs Рbasically, many replication errors in ORFs (especially insertions and deletions) can introduce a top codon, the transcript and resulting protein are truncated, they are often no longer functional, and you have a pseudogene.  This happens when the ORFs are not under positive selection.

HOWEVER.  Sequencing errors can also introduce errors, and you end up with incorrectly annotated pseudogenes.

And OH NOES we just so happen to be flooded with data from long read technologies that have an indel problem.

Let’s assume you have a shiny new bacterial genome, sequenced on either PacBio or Nanopore, and you want to figure out if it has an indel problem.¬† Here is my simple and quick test:

First predict proteins using a gene finder.  Then map those proteins to UniProt (using blastp or diamond).  The ratio of the query length to the length of the top hit should be a tight and normal distribution around 1.

OK so what does this look like in practice?

Let’s look at a genome we know if OK, E coli MG1655:


On the left, we have the raw histogram, and on the right we have zoomed in so the y-axis ends at 500.  Generally speaking, this has worked I think Рthe vast majority of predicted proteins have a 1:1 ratio with their top hit in UniProt.

Here is¬†Salmonella gallinarum which we know “has undergone extensive degradation through deletion and pseudogene formation”


Well, would you look at that!¬† There they are, all those pseudogenes appearing as predicted proteins that are shorter than their top hit.¬†¬†For those of you saying “It’s not that obvious”, in the plots above, MG1655 has 157 proteins < 90% of the length of their top hit, whereas for¬†Salmonella¬†gallinarum the figure is 432.¬† ¬†By my recollection,¬†gallinarum has over 200 annotated pseudogenes.

If you’re convinced that this actually might be working, then read on!

So here are a couple of PacBio genomes.  I genuinely chose these at random from recent Genome Announcements.  Generally speaking, all had high PacBio coverage, were assembled using Canu or HGAP, and polished using Pilon.

In no particular order:


(237 proteins < 90% of top hit)


(246 proteins < 90% of top hit)


(268 proteins < 90% of top hit)

Now, these are not terrible numbers, in fact they are quite good, but those long tails are a tiny bit of a worry, and are definitely worth checking out.¬† Bear in mind, if 200 pseudogenes in¬†Salmonella gallinarum is “extensive degradation through deletion and pseudogene formation“, then 200 incorrect indel pseudogenes in your genome is a problem.¬† In the above,¬†I can’t possibly tell you whether these¬† numbers of predicted short/interrupted ORFs are real or not because I don’t know about the biology.¬† However, I can say that given they were created with a technology known to have an indel problem, they are worth further investigation.

Indel sequencing errors are not the only problem, so are fragmented assemblies.  In fragmented assemblies, the contig ends also tend to be in ORFs and again, you get short/interrupted proteins.  Here is one of our MAGs from our recent rumen paper:


(310 proteins < 90% of top hit)

It’s not all bad news though, some of the MAGs are OK:


(126 proteins < 90% of top hit)

And at least other peoples’ MAGs are just as bad (or worse):


(265 proteins < 90% of top hit)

My absolute favourite way to assemble genomes is a hybrid approach, creating contigs with Illumina and then stitching together with a small amount of nanopore data.  We did this with B fragilis and the results look good:


(63 proteins < 90% of top hit)

So are there any examples of where it goes really badly hideously wrong?¬† Well the answer is yes.¬† We are assembling data from metagenomic samples using both PacBio and Nanopore.¬† ¬†The issue with metagenomes is that you may not have enough coverage to correct errors using the consensus approach – there simply aren’t enough bases at each position to get an accurate reflection.¬† We are seeing the same pattern for both PacBio and Nanopore, so I won’t name which technology, but…. well¬†scroll down for the horror show.













Are you ready?¬† Scroll down….











A 3Mb contig from a Canu assembly:


(2635 proteins < 90% of top hit)

Now,¬†the only work this has had done on it is that Canu performs read correction before assembly.¬† We did one round.¬† Sweet holy Jesus that’s bad isn’t it??

Can it be corrected?  To a degree.  Here it is after two rounds of Pilon and some technology-specific polishing:


(576 proteins < 90% of top hit).

So less bad, but still not great.¬† I’d say this assembly has problems, which is why we haven’t published yet.

What’s the take home message here?¬† Well there are a few:

  1. Errors matter.  Pay attention to them.
  2. Both long read technologies have indel problems and these probably cause frameshifts in your ORFs
  3. Polishing and consensus and Illumina correction helps but it doesn’t catch everything
  4. The problem will inevitably be worse if you have low coverage
  5. Comparing predicted proteins with their top hits in UniProt can identify where you have errors in your ORFs (or real pseudogenes!)

Code for doing this type of analysis will appear here.

Judge rules in favour of Oxford Nanopore in patent dispute with PacBio

Forgive me if I get any of the details wrong, I am not a lawyer, but the title of this post is my take on a judgement passed down in the patent infringement case PacBio brought against ONT.

To get your hands on the documentation, you need to register and log in to EDIS, click “Advanced Search”, do a search for “single molecule sequencing” and click the top hit.

My interpretation of the¬†documentation is that the judge has massively limited the scope of the patents in question by expanding on the definition of “single molecule sequencing”. ¬†ONT argued that in the patents in question, “single molecule sequencing” referred only to “sequencing of a single molecule by template-dependent synthesis”, and the judge agreed with this definition.


All claims are then subsequently limited to template-dependent synthesis, which of course is NOT what Oxford Nanopore do.


The document then goes into an area that would make all biological ontologists rejoice – THEY TRY AND DEFINE THE TERM “SEQUENCE”. ¬†I can almost hear the voices shouting “I told you so!” coming out of Manchester and Cambridge as I write ūüėČ

HiSeq move over, here comes Nova! A first look at Illumina NovaSeq

Illumina have announced NovaSeq, an entirely new sequencing system that completely disrupts their existing HiSeq user-base. ¬†In my opinion, if you have a HiSeq and you are NOT currently engaged in planning to migrate to NovaSeq, then you will be out of business in 1-2 years time. ¬†It’s not quite the death knell for HiSeqs, but it’s pretty close and moving to NovaSeq over the next couple of years is now the only viable option if you see Illumina as an important part of your offering.

Illumina have done this before, it’s what they do, so no-one should be surprised.

The stats

I’ve taken the stats from the spec sheet linked above and produced the following. ¬†If there are any mistakes let me know.

There are two machines Рthe NovaSeq 5000 and 6000 Рand 4 flowcell types РS1, S2, S3 and S4.  The 6000 will run all four flowcell types and the 5000 will only run the first two.  Not all flowcell types are immediately available, with S4 scheduled for 2018 (See below)

S1 S2 S3 S4 2500 HO 4000 X
Reads per flowcell (billion) 1.6 3.3 6.6 10 2 2.8 3.44
Lanes per flowcell 2 2 4 4 8 8 8
Reads per lane (million) 800 1650 1650 2500 250 350 430
Throughput per lane (Gb) 240 495 495 750 62.5 105 129
Throughput per flowcell (Gb) 480 990 1980 3000 500 840 1032
Total Lanes 4 4 8 8 16 16 16
Total Flowcells 2 2 2 2 2 2 2
Run Throughput (Gb) 960 1980 3960 6000 1000 1680 2064
Run Time (days) 2-2.5 2-2.5 2-2.5 2-2.5 6 3.5 3

For X Ten, simply mutiply X figures by 10.  These are maximum figures, and assume maximum read lengths.

Read lengths available on NovaSeq 2×50, 2×100 and 2x150bp. ¬†This is unfortunate as the sweet spot for RNA-Seq and exomes is 2x75bp.

As you can see from the stats, the massive innovation here is the cluster density, which has hugely increased. We also have shorter run times.

So what does this all mean?

Well let’s put this to bed straight away – HiSeq X installations are still viable. ¬†This from an Illumina tech on Twitter:


We learn two things from this – first, that HiSeq X is still going to be cheaper for human genomes until S4 comes out, and S4 won’t be out until 2018.

So Illumina won’t sell any more HiSeq X, but current installations are still viable and still the cheapest way to sequence genomes.

I also have this from an un-named source:

speculation from Illumina rep “X’s will be king for awhile. Cost per GB on those will likely be adjusted to keep them competitive for a long time.”

So X is OK, for a while.

What about HiSeq 4000? Well to understand this, you need to understand 4000 and X.

The HiSeq 4000 and HiSeq X

First off, the HiSeq X IS NOT¬†a human genome only machine. ¬†It is a genome-only machine. ¬†You have been able to do non-human genomes for about a year now. ¬†Anything you like as long as it’s a whole genome and it’s 30X or above. ¬†The 4000 is reserved for everything else because you cannot do exomes, RNA-Seq, ChIP-Seq etc on the HiSeq X. ¬†HiSeq 4000 reagents are more expensive, which means that per-Gb every assay is more expensive than genome sequencing on Illumina.

However, no such restrictions exist on the NovaSeq Рwhich means that every assay will now cost the same on NovaSeq.   This is what led me to say this on Twitter:

At Edinburgh Genomics, roughly speaking, we charge approx. 2x as much for a 4000 lane as we do for an X lane.  Therefore, per Gb, RNA-Seq is approx. twice as expensive as genome sequencing.  NovaSeq promises to make this per-Gb cost the same, so does that mean RNA-Seq will be half price?  Not quite.  Of course no-one does a whole lane of RNA-Seq, we multiplex multiple samples in one lane.  When you do this, library prep costs begin to dominate, and for most of my own RNA-Seq samples, library prep is about 50% of the per-sample cost, and 50% is sequencing.  NovaSeq promises to half the sequencing costs, which means the per-sample cost will come down by 25%.

These are really rough numbers, but they will do for now.  To be honest, I think this will make a huge difference to some facilities, but not for others.  Larger centers will absolutely need to grab that 25% reduction to remain competitive, but smaller, boutique facilities may be able to ignore it for a while.

Capital outlay

Expect to get pay $985k for a NovaSeq 6000 and $850k for a 5000.

Time issues

One supposedly big advantage is that NovaSeq takes 40 hours to run, compared to the existing 3 days for a HiSeq X. ¬† Comparing like with like that’s 40 hours vs 72 hours. ¬†This might be important in the clinical space, but not for much else.

Putting this in context, when you send your samples to a facility, they will be QC-ed first, then put in library prep queue, then put in sequencing queue, then QC-ed bioinformatically before finally being delivered. ¬†Let’s be generous and say this takes 2 weeks. ¬†Out of that sequencing time is 3 days. ¬†So instead of waiting 14 days, you’re waiting 13 days. ¬†Who cares?

Clinically having the answer 1 day earlier may be important, but let’s not forget, even on our ¬£1M cluster, at scale the BWA+GATK pipeline itself takes 3 days. ¬†So again you’re looking at 5 days vs 6 days. ¬†Is that a massive advantage? ¬†I’m not sure. ¬†Of course you could buy one of the super-fast bioinformatics solutions, and maybe then the 40 hour run time will count.

Colours and quality

NovaSeq marks a switch from the traditional HiSeq 4 colour chemistry to the quicker NextSeq 2 colour chemistry.  As Brian Bushnell has noted on this blog, NextSeq data quality is quite a lot worse than HiSeq 2500, so we may see a dip in data quality, though Illumina claim 85% above Q30.


Is the long read sequencing war already over?

My enthusiasm for nanopore sequencing is well known; we have some awesome software for working with the data; we won a grant to support this work; and we successfully assembled a tricky bacterial genome.  This all led to Nick and I writing an editorial for Nature Methods.

So, clearly some bias towards ONT from me.

Having said all of that, when PacBio announced the Sequel, I was genuinely excited.   Why?  Well, revolutionary and wonderful as the MinION was at the time, we were getting ~100Mb runs.  Amazing technology, mobile sequencer, tri-corder, just incredible engineering Рbut 100Mb was never going to change the world.  Some uses, yes; but for other uses we need more data.  Enter Sequel.

However, it turns out Sequel isn’t really delivering on promises. ¬†Rather than 10Gb runs, folk are getting between 3 and 5Gb from the Sequel:

At the same time, MinION has been coming along great guns:

Whilst we are right to be skeptical about ONT’s claims about their own sequencer, other people who use the MinION have backed up these claims and say they regularly get figures similar to this. If you don’t believe me, go get some of the World’s first Nanopore human data here.

PacBio also released some data for Sequel here.

So how do they stack up against one another? ¬†I won’t deal with accuracy here, but we can look at #reads, read length and throughput.

To be clear, we are comparing “rel2-nanopore-wgs-216722908-FAB42316.fastq.gz”¬†a fairly middling run from the NA12878 release, m54113_160913_184949.subreads.bam¬†and one of the Sequel SMRT cell datasets released.

Read length histograms:


As you can see, the longer reads are roughly equivalent in length, but MinION has far more reads at shorter read lengths.  I know the PacBio samples were size selected on Blue Pippin, but unsure about the MinION data.

The MinION dataset includes 466,325 reads, over twice as many as the Sequel dataset at 208,573 reads.

In terms of throughput, MinION again came out on top, with 2.4Gbases of data compared to just 2Gbases for the Sequel.

We can limit to reads >1000bp, and see a bit more detail:


  • The MinION data has¬†326,466 reads greater than 1000bp summing to 2.37Gb.
  • The Sequel data has¬†192,718 reads greater than 1000bp, summing to 2Gb.

Finally, for reads over 10,000bp:

  • The MinION data has 84,803 reads greater than 10000bp summing to 1.36Gb.
  • The Sequel data has 83,771¬†reads greater than 10000bp, summing to 1.48Gb.

These are very interesting stats!

This is pretty bad news for PacBio.  If you add in the low cost of entry for MinION, and the £300k cost of the Sequel, the fact that MinION is performing as well as, if not better, than Sequel is incredible.  Both machines have a long way to go РPacBio will point to their roadmap, with longer reads scheduled and improvements in chemistry and flowcells.  In response, ONT will point to the incredible development path of MinION, increased sequencing speeds and bigger flowcells.  And then there is PromethION.

So is the war already over?   Not quite yet.  But PacBio are fighting for their lives.

Building a Kraken database with new FTP structure and no GI numbers

Progress is usually good, except when a reliable resource completely changes their data and file structures and mess up everything you have been trying to do. ¬†The NCBI used to arrange bacterial genomes in such a very easy way to understand and download; now it’s a bit tougher. ¬†So tough in fact that they had to create a new FAQ.

At the same time as moving around genome data, they also decided to retire GI numbers (thanks to Andreas for the link!).

This is a problem for Kraken, a metagenomic profiler, because it relies both on the old style genomic data structures and the GI numbers that come with them.  Below are my attempts to build a Kraken database, using the new FTP structure and no GIs.

Downloading the data

It is pretty easy to follow the FAQ above, but I have created some perl scripts here that should help.

You will need:

  • Perl
  • BioPerl
  • Git
  • A linux shell (bash) that has wget
  • Internet access
  • A folder that you have write access to
  • Maybe 40Gb of space

The magic in these scripts is that they download the data and then add the necessary¬†“|kraken:taxid|xxxx” text that replaces GI number for building what Kraken refers to as “custom databases”

Note also that the script only downloads fasta for assemblies classed as “Complete Genome”. ¬†Other options are:

  • Chromosome
  • Complete Genome
  • Contig
  • Scaffold

If you want these others then edit the scripts appropriately.

# clone the git repo
git clone https://github.com/mw55309/Kraken_db_install_scripts.git

# either put the scripts in your path or run them from that directory

# make download directory
mkdir downloads
cd downloads

# run for each branch of life you wish to download
perl download_bacteria.pl
perl download_archaea.pl
perl download_fungi.pl
perl download_protozoa.pl
perl download_viral.pl

# you should now have five directories, one for each branch

# build a new database 
# download taxonomy
kraken-build --download-taxonomy --db kraken_bvfpa_080416

# for each branch, add all fna in the directory to the database
for dir in fungi protozoa archaea viral bacteria; do
        for fna in `ls $dir/*.fna`; do
                kraken-build --add-to-library $fna --db kraken_bvfpa_080416

# build the actual database
kraken-build --build --db kraken_bvfpa_080416

I haven’t tested the above database yet, but the output of the build script ends with:

Creating GI number to seqID map (step 4 of 6)...
GI number to seqID map created. [14.935s]
Creating seqID to taxID map (step 5 of 6)...
7576 sequences mapped to taxa. [1.188s]
Setting LCAs in database (step 6 of 6)...
Database LCAs set. [21m24.805s]
Database construction complete. [Total: 1h1m28.108s]

Which certainly makes it look like things have worked.

I’ll update the post with info from testing in due course.


Update 13/04/2016

I simulated some reads from a random set of fastas in the database, then ran Kraken, and it worked (plus/minus a few false positives)


We need to stop making this simple f*cking mistake

I’m not perfect. ¬†Not in any way. ¬†I am sure if anyone was so inclined, they could work their way through my research with clinical forensic attention-to-detail and uncover all sorts of mistakes. ¬†The same will be true for any other scientist, I expect. ¬†We’re human and we make mistakes.

However, there is one mistake in bioinformatics that is so common, and which has been around for so long, that it’s really annoying when¬†it keeps happening:

It turns out the Carp genome is full of Illumina adapters.

One of the first things we teach people in our NGS courses is how to remove adapters. ¬†It’s not hard – we use CutAdapt, but many other tools exist. ¬† It’s simple, but really important – with De Bruijn graphs you will get paths through the graphs converging on kmers from adapters; and with OLC assemblers you will get spurious overlaps. ¬†With gap-fillers, it’s possible to fill the gaps with sequences ending in adapters, and this may be what happened in the Carp genome.

Why then are we finding such elementary mistakes in such important papers? ¬†Why aren’t reviewers picking up on this? ¬†It’s frustrating.

This is a separate, but related issue, to genomic contamination Рthe Wheat genome has PhiX in it; tons of bacterial genomes do too; and lots of bacterial genes were problematically included in the Tardigrade genome and declared as horizontal gene transfer.

Genomic contamination can be hard to find, but sequence adapters are not. ¬†Who isn’t adapter trimming in 2016?!

Fast parallel access to data within MinION FAST5 files

As readers of this blog will know, my group was one of the first groups to publish software for the MinION sequencer, and I have blogged a couple of times on how to extract data from FAST5 files using poRe. ¬†Well BBSRC, my main funder, were kind enough to give us a grant to develop the software, and one thing we said we’d do is implement parallel processing.

Below is a summary of what we’ve done so far.

Parallel processing in R

One of the huge benefits of R, besides the amazing suite of statistical and visualization tools, and the thousands of add-on packages, is the fact that parallelisation is really easy.  In this context, what I mean by parallelisation is the use of multiple cores on your machine to run a single bit of code.  By using multiple cores, the code is speeded up, often linearly relative to the number of cores you use.

There are quite a lot of packages for R that help you run parallel code, including doParallel, foreach, snow etc. but we’re going to focus on the built in package, parallel. ¬†To show the benefits of this, we have run various toolkits for the extraction of 2D FASTQ data from 25484 FAST5 files kindly shared by Nick Loman.

Benchmarking FASTQ extraction

Now, as Matt Loose has pointed out to me, focusing on FASTQ extraction is a little pointless as almost certainly this will be built in to future ONT base callers; however, it serves as a useful example and there are plenty of FAST5 files out there already that you may wish to extract FASTQ from.

The tools we compared are extract2D from poRe, poretools, fast5extract from c_fast5 (a simple C implementation) and extract2DParallel, a parallel version of extract2D (the parallel scripts are in the “misc” folder in GitHub). ¬†We ran each tool 4 times on 1000, 5000, 10000, 15000, 20000, and 25000 randomly selected FAST5 files, clearing the cache completely between runs to ensure a fair comparison.


As can be seen, all tools perform linearly with the number of FAST5 files; poretools is the slowest, then extract2D, then c_fast5.  However, the best result is clearly from extract2DParallel (using 4 cores) which outperforms everything else by some distance.

Of course, this is not surprising, and doesn’t come without a cost – if you only have 4 cores then using all 4 will mean all other processes perform really slowly. ¬†However, there are also benefits – if your machine has 8 or 24 cores, you can speed up processing even more.

The scripts on GitHub should be treated as an early release and have only been tested on one dataset, so I welcome your feedback!

Extracting FASTQ on Windows

We haven’t wrapped up the functionality into poRe yet, but we expect to in the next release. ¬†Meanwhile for the impatient…


# find out how many cores you have
# [1] 4

I have 4 cores, and I probably don’t want to use all of them! I now need to redefine one of poRe’s internal functions so that it works with parallel:

printfastq <- function (f5, f5path = "/Analyses/Basecall_2D_000/BaseCalled_2D/", out="out.fastq")
    fid = tryCatch({
    }, warning = function(w) w, error = function(e) e)
    if (typeof(fid) == "S4" && H5Lexists(fid, f5path)) {
        did <- H5Dopen(fid, paste(f5path, "Fastq", sep = ""))
        fq <- (H5Dread(did))
        four <- strsplit(fq, "\n")[[1]]
        fq <- paste(paste(four[1], f5, sep = " "), four[2], four[3],
            four[4], sep = "\n")
        cat(fq, file = out, sep = "\n", fill = FALSE, append=TRUE)
    if (typeof(fid) == "S4") {

Now let's create a vector of the fast5 files I want to extract:

my.dir <- "C:/MinION/FAST5/" # use your own directory here!

f5files <- dir(path = my.dir, pattern = "\\.fast5$", full.names = TRUE)

Here is the interesting part, where we set up our parallel "cluster", which is then used to execute the code. Each core within the cluster should be treated as a separate instance of R, so we need to load libraries and export variables to the cluster:

# use two cores
cl <- makeCluster(2)
# load poRe on the cluster
c <- clusterEvalQ(cl, suppressWarnings(suppressMessages(library(poRe))))

# export the printfastq function to the cluster
clusterExport(cl, "printfastq")

# run the code and stop the cluster
l <- parApply(cl, array(f5files), 1, printfastq, out="myfastq.fastq")

Extract metadata

Absolutely one of the slowest aspects of MinION data analysis and QC is extracting metadata, but we can also parallelise this!  For those of you who like the Linux command line, there is an extractMetaParallel script in the GitHub repo (again - use with caution and feedback!), but for those of you who want to use this within R:

# find out how many cores you have
# [1] 4

# get a list of files to extract data from
my.dir <- "C:/MinION/FAST5/" # use your own directory here!
f5files <- dir(path = my.dir, pattern = "\\.fast5$", full.names = TRUE)

# use two cores
cl <- makeCluster(2)
# load poRe on the cluster
c <- clusterEvalQ(cl, suppressWarnings(suppressMessages(library(poRe))))

# variables to set paths to template, complement and 2D data
# NOTE you will need to change these for the new FAST5 format
path.t <- "/Analyses/Basecall_2D_000/"
path.c <- path.t
path2d <- path.t

# run the code and stop the cluster
l <- parLapply(cl, 
          get_meta_info, path2d, path.t, path.c)

# convert the resulting vector into a matrix with the correct column names
m <- matrix(unlist(l), ncol=12, byrow=TRUE)
colnames(m) <- c("filename","channel_num","read_num","read_start_time","status","tlen","clen","len2d","run_id","read_id","barcode","exp_start") 


Parallelisation is so easy within R it's almost criminal not to do it. We can see that just a bit of parallelisation speeds up the code incredibly - and we can also see that poRe is a pretty fast toolkit when used in the correct way, which it wasn't in this paper.

NOTE I am fairly new to benchmarking and parallelisation, if I have done anything wrong, I apologize - and please get in touch!

Older posts

© 2022 Opiniomics

Theme by Anders NorenUp ↑