bioinformatics, genomes, biology etc. "I don't mean to sound angry and cynical, but I am, so that's how it comes across"

With great power comes great responsibility

Recently I published a blog post about a fairly simple test to find out whether you have “short” protein predictions in your bacterial genomes, and predicted that some of these short peptides may be the result of unresolved errors in long-read, single molecule assemblies.

Perhaps not surprisingly, there was a reaction from the PacBio community over this, and here is my response.

Before I begin, I just want to say that, whilst most people see me as some kind of Nanopore fan-boy, the reality is I am a fan of cool technology and that includes PacBio.  The hard facts are that I have spent around £200k on PacBio sequencing over the last 18 months, and about £20k on nanopore in the same time period.  I also encouraged our core to buy a PacBio Sequel.  So I am not anti-PacBio.  I am, however, anti-bullshit 😉

In addition, my blog post wasn’t about problems with technology per se, it was about problems with people.  If you don’t know errors are there, you might not correct them, and if you believe company hype (for example that PacBio data is Q50 after polishing) you might believe your assembly is perfect.

It isn’t.  They never are.

Let’s dive into some data.  The three PacBio genomes I chose in the last blog post are:

AP018165	Mycobacterium stephanolepidis 
CP025317	Escherichia albertii 
ERS530415	Yersinia enterocolitica

If you search GenBank genomes for these, there is only one Mycobacterium stephanolepidis genome, but there are 39 Escherichia albertii genomes and 176 Yersinia enterocolitica genomes.  Many of these will have been sequenced on different technologies, which allows us to do a comparison within a species.

Escherichia albertii 

I downloaded the 39 genomes from GenBank, and ran the process I described last week.  For the complete genomes, this is what the data look like, ordered from lowest number of short peptides, to greatest:

Accession        Technology        Contigs              # proteins  # short
GCA_001549955.1  Sanger+454        1                    4299        19
GCA_002285455.1  Sanger+454        1                    4157        19
GCA_002285475.1  Sanger+454        1                    4295        28
GCA_002741375.1  PacBio            1 (plus 6 plasmids)  5126        79
GCA_002895205.1  Illumina+454+Ion  1                    4423        83
GCA_000512125.1  PacBio            1                    4482        106
GCA_002872455.1  PacBio            1 (plus 3 plasmids)  4984        124

Sure is unlucky that PacBio assemblies are at the bottom of the table isn’t it?  Of course, Sanger is the gold standard, and many will be asking what the Illumina assemblies look like.
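To put a number on that, here is a minimal Python sketch that groups the complete genomes in the table above by technology and averages the short-protein counts.  The figures are copied straight from the table; the names `COMPLETE_GENOMES` and `mean_short_by_technology` are made up for illustration, and this is not the script used for the original analysis.

```python
from collections import defaultdict
from statistics import mean

# (accession, technology, n_proteins, n_short), copied from the table above
COMPLETE_GENOMES = [
    ("GCA_001549955.1", "Sanger+454", 4299, 19),
    ("GCA_002285455.1", "Sanger+454", 4157, 19),
    ("GCA_002285475.1", "Sanger+454", 4295, 28),
    ("GCA_002741375.1", "PacBio", 5126, 79),
    ("GCA_002895205.1", "Illumina+454+Ion", 4423, 83),
    ("GCA_000512125.1", "PacBio", 4482, 106),
    ("GCA_002872455.1", "PacBio", 4984, 124),
]


def mean_short_by_technology(rows):
    """Return {technology: mean number of short proteins per genome}."""
    grouped = defaultdict(list)
    for _accession, technology, _n_proteins, n_short in rows:
        grouped[technology].append(n_short)
    return {tech: mean(counts) for tech, counts in grouped.items()}


summary = mean_short_by_technology(COMPLETE_GENOMES)
# Print technologies from fewest to most short proteins on average
for tech, value in sorted(summary.items(), key=lambda item: item[1]):
    print(f"{tech}\t{value:.1f}")
```

With only seven genomes this is a sanity check rather than a statistical test, but it makes the gap between the Sanger+454 and PacBio assemblies easy to see.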

Let’s look at the next ten genomes, which are not complete, but are the least fragmented:

Accession        Technology  Contigs  # proteins  # short
GCA_002109845.1  454         25       4758        93
GCA_001514945.1  Illumina    43       4237        82
GCA_002563295.1  Ion         44       4531        143
GCA_001514965.1  Illumina    50       4390        86
GCA_001515045.1  Illumina    53       4480        110
GCA_001515005.1  Illumina    59       4963        117
GCA_001514645.1  Illumina    63       4470        84
GCA_001514685.1  Illumina    70       4209        68
GCA_001514925.1  Illumina    73       4464        84
GCA_001514905.1  Illumina    78       4622        113

What I think is worth pointing out here is that the PacBio genomes in the first table, which are complete, have about the same number of short proteins as the Illumina and 454 assemblies in the second table, which are fragmented.  We would usually expect fragmented assemblies to have more short proteins because the contig ends would interrupt ORFs.  Indeed, compared to Sanger complete assemblies, the fragmented assemblies do have more short proteins.

They just don’t have more than the PacBio complete assemblies.  Odd that.  If there were no uncorrected errors in the PacBio assemblies, they would be more like the Sanger assemblies than the Illumina ones.

Yersinia enterocolitica

Of the 176 GenBank genomes for this species, my automated script could only detect sequencing technology for 35 of them.  Here are the 21 that have a claim to be complete or near-complete (<5 contigs):

Accession        Technology       Contigs  # proteins  # short
GCA_000834195.1  Illumina+454     2        4161        17
GCA_000987925.1  PacBio           2        4340        19
GCA_001304755.1  PacBio           2        4344        20
GCA_002082275.2  PacBio+Illumina  1        4067        31
GCA_002554625.2  PacBio+Illumina  1        4384        35
GCA_001305635.1  Illumina         2        4162        35
GCA_000834795.1  PacBio+Illumina  2        4259        36
GCA_000755045.1  Illumina+454     2        4226        44
GCA_000834735.1  PacBio+Illumina  2        4138        48
GCA_000755055.1  Illumina+454     1        4095        53
GCA_000754975.1  Illumina+454     2        4083        53
GCA_000597945.2  Illumina+454     3        4640        59
GCA_000754985.1  Illumina+454     3        4282        65
GCA_002083285.2  PacBio+Illumina  4        4198        76
GCA_002082245.2  PacBio+Illumina  2        4521        106
GCA_001708575.1  PacBio           3        4673        221
GCA_001708615.1  PacBio           3        4615        224
GCA_001708635.1  PacBio           3        4659        224
GCA_001708655.1  PacBio           3        4654        226
GCA_001708555.1  PacBio           3        4674        230
GCA_001708595.1  PacBio           2        4733        235

Slightly different story here, in that the PacBio (and PacBio hybrid) genomes appear to have some of the lowest number of predicted short proteins.  This is what people now expect when they see PacBio bacterial genomes.  However, there are also six PacBio genomes at the bottom of the table, and so I don’t think you can really look at this data and think there isn’t a problem.  It’s possible that those six just happen to be the strains of Yersinia enterocolitica that have undergone the most pseudogenisation, but I don’t think so.

Let’s get some things straight

  • I know this is an incomplete analysis and obviously more work needs to be done
  • If I personally wanted a perfect microbial genome, I would probably use PacBio+Illumina
  • I have nothing against PacBio
  • Nanopore assemblies aren’t in this list because there are none for the species I chose, but I am sure they would also have significant problems


As I said above, it’s not that PacBio has a problem per se, it’s that people have a problem.  Yes, many errors are correctable with Quiver, Arrow and Pilon; often multiple rounds are necessary and you still won’t catch everything.

But not everyone knows that, and it’s clear to me from the above that many people are still pumping out poor quality, uncorrected, indel-ridden PacBio genomes.

The same is true for Nanopore, I have no doubt.

Let’s stop bullshitting.  These technologies have problems.  It doesn’t mean they are bad technologies, anything but – both PacBio and Nanopore have been transformational.  There is no need for bullshit.

“With great power comes great responsibility” – in this case, the responsibility is two-fold.  One, stop contaminating public databases with sh*t assemblies; and two, stop bullshitting that this isn’t a problem for your favourite technology.  That helps no-one.


A simple test for uncorrected insertions and deletions (indels) in bacterial genomes

A friend and a colleague of mine once said about me “he’s a details man”, and it was after we had discussed the fact some of my papers consist solely in pointing out the errors other people ignore – in RNA-Seq for example, or in genome assemblies (I have another under review!).

By now, those of you familiar with my earlier work will be jumping up and shouting

“Of course!  Even as far back as 2012, he was warning us about the dangers of mis-annotated pseudogenes!  Just look at Figure 2 in his review of bacterial genome annotation!”

Well, let’s dig into that a bit more.

For the uninitiated, the vast majority of bases in most bacterial genomes fall within open-reading frames (ORFs) that are eventually translated into proteins.  Bacterial evolution, and often niche specificity, is characterised by the pseudogenisation of these ORFs – basically, many replication errors in ORFs (especially insertions and deletions) can introduce a stop codon, the transcript and resulting protein are truncated, they are often no longer functional, and you have a pseudogene.  This happens when the ORFs are not under positive selection.

HOWEVER.  Sequencing errors can also introduce frameshifts and premature stop codons, and you end up with incorrectly annotated pseudogenes.

And OH NOES we just so happen to be flooded with data from long read technologies that have an indel problem.

Let’s assume you have a shiny new bacterial genome, sequenced on either PacBio or Nanopore, and you want to figure out if it has an indel problem.  Here is my simple and quick test:

First, predict proteins using a gene finder.  Then map those proteins to UniProt (using blastp or diamond).  The ratio of the query length to the length of the top hit should follow a tight, roughly normal distribution centred on 1.
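As a rough illustration of the ratio calculation, here is a minimal Python sketch.  It assumes the search was run with tabular output whose first four columns are query ID, subject ID, query length and subject length (for example `--outfmt 6 qseqid sseqid qlen slen` in diamond); that column order, the function name `length_ratios` and the example rows are all my assumptions, not the code promised at the end of this post.

```python
import csv
import io


def length_ratios(tsv_handle):
    """Yield (query_id, query_length / top_hit_length) for each query.

    Assumes tab-separated columns qseqid, sseqid, qlen, slen (extra
    columns are ignored) and that hits for each query are listed
    best-first, so the first row seen for a query is its top hit.
    """
    seen = set()
    for row in csv.reader(tsv_handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        query_id, qlen, slen = row[0], int(row[2]), int(row[3])
        if query_id in seen:
            continue  # keep only the top hit per query
        seen.add(query_id)
        yield query_id, qlen / slen


# Hypothetical example rows, purely for illustration
example = io.StringIO(
    "protA\thit1\t300\t300\n"
    "protA\thit2\t300\t450\n"
    "protB\thit3\t100\t400\n"
)
ratios = dict(length_ratios(example))
print(ratios)
```

In this toy input protA matches its top hit at full length, while protB is only a quarter of the length of its top hit, which is the kind of protein the test is designed to flag.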

OK so what does this look like in practice?

Let’s look at a genome we know is OK, E. coli MG1655:


On the left, we have the raw histogram, and on the right we have zoomed in so the y-axis ends at 500.  Generally speaking, this has worked I think – the vast majority of predicted proteins have a 1:1 ratio with their top hit in UniProt.

Here is Salmonella gallinarum which we know “has undergone extensive degradation through deletion and pseudogene formation”


Well, would you look at that!  There they are, all those pseudogenes appearing as predicted proteins that are shorter than their top hit.  For those of you saying “It’s not that obvious”, in the plots above, MG1655 has 157 proteins < 90% of the length of their top hit, whereas for Salmonella gallinarum the figure is 432.   By my recollection, gallinarum has over 200 annotated pseudogenes.
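The “< 90% of the length of their top hit” figure is just a count over those ratios.  A minimal sketch, assuming you already have one query-to-top-hit length ratio per predicted protein (the function name and the example values are hypothetical):

```python
def count_short(ratios, threshold=0.9):
    """Return (number, fraction) of proteins shorter than `threshold`
    times the length of their top hit."""
    ratios = list(ratios)
    n_short = sum(1 for ratio in ratios if ratio < threshold)
    fraction = n_short / len(ratios) if ratios else 0.0
    return n_short, fraction


# Hypothetical ratios for five predicted proteins
example_ratios = [1.0, 0.98, 0.45, 1.02, 0.89]
n_short, fraction = count_short(example_ratios)
print(f"{n_short} proteins < 90% of top hit ({fraction:.0%} of predictions)")
```

Reporting the fraction alongside the raw count makes it easier to compare genomes with very different numbers of predicted proteins.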

If you’re convinced that this actually might be working, then read on!

So here are a few PacBio genomes.  I genuinely chose these at random from recent Genome Announcements.  Generally speaking, all had high PacBio coverage, were assembled using Canu or HGAP, and polished using Pilon.

In no particular order:


(237 proteins < 90% of top hit)


(246 proteins < 90% of top hit)


(268 proteins < 90% of top hit)

Now, these are not terrible numbers, in fact they are quite good, but those long tails are a tiny bit of a worry, and are definitely worth checking out.  Bear in mind, if 200 pseudogenes in Salmonella gallinarum is “extensive degradation through deletion and pseudogene formation“, then 200 incorrect indel pseudogenes in your genome is a problem.  In the above, I can’t possibly tell you whether these  numbers of predicted short/interrupted ORFs are real or not because I don’t know about the biology.  However, I can say that given they were created with a technology known to have an indel problem, they are worth further investigation.

Indel sequencing errors are not the only problem; fragmented assemblies are one too.  In fragmented assemblies, the contig ends also tend to be in ORFs and again, you get short/interrupted proteins.  Here is one of our MAGs from our recent rumen paper:


(310 proteins < 90% of top hit)

It’s not all bad news though, some of the MAGs are OK:


(126 proteins < 90% of top hit)

And at least other people’s MAGs are just as bad (or worse):


(265 proteins < 90% of top hit)

My absolute favourite way to assemble genomes is a hybrid approach, creating contigs with Illumina and then stitching them together with a small amount of nanopore data.  We did this with B. fragilis and the results look good:


(63 proteins < 90% of top hit)

So are there any examples of where it goes really badly hideously wrong?  Well the answer is yes.  We are assembling data from metagenomic samples using both PacBio and Nanopore.   The issue with metagenomes is that you may not have enough coverage to correct errors using the consensus approach – there simply aren’t enough bases at each position to get an accurate reflection.  We are seeing the same pattern for both PacBio and Nanopore, so I won’t name which technology, but…. well scroll down for the horror show.













Are you ready?  Scroll down….











A 3Mb contig from a Canu assembly:


(2635 proteins < 90% of top hit)

Now, the only work this has had done on it is that Canu performs read correction before assembly.  We did one round.  Sweet holy Jesus that’s bad isn’t it??

Can it be corrected?  To a degree.  Here it is after two rounds of Pilon and some technology-specific polishing:


(576 proteins < 90% of top hit).

So less bad, but still not great.  I’d say this assembly has problems, which is why we haven’t published yet.

What’s the take home message here?  Well there are a few:

  1. Errors matter.  Pay attention to them.
  2. Both long read technologies have indel problems and these probably cause frameshifts in your ORFs
  3. Polishing, consensus and Illumina correction help, but they don’t catch everything
  4. The problem will inevitably be worse if you have low coverage
  5. Comparing predicted proteins with their top hits in UniProt can identify where you have errors in your ORFs (or real pseudogenes!)

Code for doing this type of analysis will appear here.

Let’s keep saying it, and say it louder: REVIEWERS ARE UNPAID

The arguments over peer review and whether we are obliged to do it generally fall on two sides – either it is or isn’t an implicit part of your job.   Stephen Heard falls into the former group, arguing that there are many things which are implicitly part of our jobs as academics, and which don’t make it into our job description.

I agree wholeheartedly with Stephen that peer-reviewing is not in my job description.  In fact it is nowhere to be seen.  It’s not in my employment contract either, nor has it ever been in any of my lists of annual objectives, after 16 years in academia.  In fact, I don’t think it’s ever been discussed, either in annual appraisal meetings or in my annual meetings with the institute director.  It doesn’t make it on to my CV either.

Here is a key point: I could never write another peer review for the rest of my career and my career would not suffer.  Not one bit.

It is not part of my job and I am not paid to do it.

(for the record, I do peer reviews! For free!)

In an increasingly pressurised environment where the only factors that influence my career progression are papers published and grants won; with over 6500 e-mails in my inbox which I have lost control of; and with a to-do list I have no chance of ever finishing, prioritisation is essential.  The key for any task is to be important enough to get into the “action” zone on my to do list.  Does peer review manage that?  Occasionally, but not often.

Would it get there more often if I was paid to do it?  Absolutely.  Why?  Because I have bills to pay and a small family and every little helps.  I imagine this is even more true for post-docs and early career researchers.  Why should they do something for free, often for profit-making organisations, when it doesn’t affect their career prospects one tiny bit?  The answer is simple: they shouldn’t.

A common argument is this: if everyone stopped peer reviewing, science would grind to a halt.  Well unfortunately that ignores reality.  If delivery drivers stopped driving, would there be no food on the shelves?  No.  Because we wouldn’t and couldn’t let that happen.  We’d find whatever incentive needed to be found, and make sure the drivers still drove.  The same is true of peer review.  If you are struggling to find people, there is a simple solution, one that is as old as money itself: pay them.

(note: I do many peer reviews and I will continue to do so; free.  However, I believe it is time to re-think incentives, and yes, it is time to pay people for peer reviews)

It’s a bit POINTLESS if you forget the POINTLESS in POINTLESS ADMIN

Rather predictably, a Guardian article railing against pointless admin has stimulated a response from the university admin community.   Unfortunately, the response kind of misses the point – academics are annoyed at POINTLESS ADMIN, not at administrators in general (though the two are not entirely unrelated).

Here’s the thing – there is both pointless and essential administration, and at the same time, there are both excellent and poor administrators.  An attack on pointless admin is not an attack on good administrators.   We all can recognise groups of excellent administrators without whom the place would fall apart.  No-one is attacking them.

So what is pointless admin?  There is so much of it I don’t know where to start.  Perhaps the best example is the recording of academic outputs.  My University insists I put them in PURE; Wellcome Trust insist I have an ORCID account; and BBSRC insist I use ResearchFish.  You can accept the motivation for this whilst at the same time recognising that it is genuinely pointless replication of effort.  Academics are angry not because this happens once or twice, but because it happens all the time and it is increasing in frequency.

So what makes a good administrator?  Well, I have always said “A bad administrator reminds you to do something, a good administrator does it for you”.  There are caveats, but it’s a good starting point.  A quick guide for administrators might be something like this: “Is this a tick-box exercise?  Can you tick the box?  Then tick the box.  Please.  Thank you!   Does it really need the academic’s input?  Can they send bullet points and you do the rest?  Let’s do that then!  Otherwise are there any parts you can fill in?  I’d love for you to fill them in!  Is it possible to visit and not send yet another e-mail?  That would be awesome.  You can ask if a follow-up email would help and then send one afterwards.  Is there a possibility that one system/form can be filled in and then the same information copied to other systems/forms?  Can you do the copying?  Awesome!  And, as an aside, can we maybe have fewer systems/forms if they contain the same information?  Amazing”.

The obvious response to all of this is “why should the admins have to do all of that?  You arrogant ****!  Do your own admin!”.  This misses an unfortunate point – we are drowning.  If academia was a near-death experience, then academics are 200 yards off shore, in a rip-tide, barely breathing with our arms in the air, waving for help.   We need help.  Our job is to teach, to publish papers and to win research grants, research grants with overheads, all of which go some way to sustaining the very institution we all work for.   Every minute doing admin is a minute we are not doing the thing that keeps the lights on.

Please.  Help us.  Remove POINTLESS ADMIN.  And be a good administrator, not a bad one.

Judge rules in favour of Oxford Nanopore in patent dispute with PacBio

Forgive me if I get any of the details wrong, I am not a lawyer, but the title of this post is my take on a judgement passed down in the patent infringement case PacBio brought against ONT.

To get your hands on the documentation, you need to register and log in to EDIS, click “Advanced Search”, do a search for “single molecule sequencing” and click the top hit.

My interpretation of the documentation is that the judge has massively limited the scope of the patents in question by expanding on the definition of “single molecule sequencing”.  ONT argued that in the patents in question, “single molecule sequencing” referred only to “sequencing of a single molecule by template-dependent synthesis”, and the judge agreed with this definition.


All claims are then subsequently limited to template-dependent synthesis, which of course is NOT what Oxford Nanopore do.


The document then goes into an area that would make all biological ontologists rejoice – THEY TRY AND DEFINE THE TERM “SEQUENCE”.  I can almost hear the voices shouting “I told you so!” coming out of Manchester and Cambridge as I write 😉

Beautiful boxplots in base R

As many of you will be aware, I like to post some R code, and I especially like to post base R versions of ggplot2 things!

Well these amazing boxplots turned up on github – go and check them out!

So I did my own version in base R – check out the code here and the result below.  Enjoy!


HiSeq move over, here comes Nova! A first look at Illumina NovaSeq

Illumina have announced NovaSeq, an entirely new sequencing system that completely disrupts their existing HiSeq user-base.  In my opinion, if you have a HiSeq and you are NOT currently engaged in planning to migrate to NovaSeq, then you will be out of business in 1-2 years time.  It’s not quite the death knell for HiSeqs, but it’s pretty close and moving to NovaSeq over the next couple of years is now the only viable option if you see Illumina as an important part of your offering.

Illumina have done this before, it’s what they do, so no-one should be surprised.

The stats

I’ve taken the stats from the spec sheet linked above and produced the following.  If there are any mistakes let me know.

There are two machines – the NovaSeq 5000 and 6000 – and 4 flowcell types – S1, S2, S3 and S4.  The 6000 will run all four flowcell types and the 5000 will only run the first two.  Not all flowcell types are immediately available, with S4 scheduled for 2018 (See below)

Metric                        S1     S2     S3     S4     2500 HO  4000  X
Reads per flowcell (billion)  1.6    3.3    6.6    10     2        2.8   3.44
Lanes per flowcell            2      2      4      4      8        8     8
Reads per lane (million)      800    1650   1650   2500   250      350   430
Throughput per lane (Gb)      240    495    495    750    62.5     105   129
Throughput per flowcell (Gb)  480    990    1980   3000   500      840   1032
Total Lanes                   4      4      8      8      16       16    16
Total Flowcells               2      2      2      2      2        2     2
Run Throughput (Gb)           960    1980   3960   6000   1000     1680  2064
Run Time (days)               2-2.5  2-2.5  2-2.5  2-2.5  6        3.5   3

For X Ten, simply multiply X figures by 10.  These are maximum figures, and assume maximum read lengths.
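If you want to check the arithmetic behind the table, the throughput rows follow from reads per lane, bases per read pair and lane counts.  Here is a minimal Python sketch; note that the bases-per-pair values are my inference from the table (2x150bp gives 300 bases for NovaSeq, the 4000 and X figures are also consistent with 300, and the 2500 HO figure is consistent with 250, i.e. 2x125bp), not something taken from the spec sheet.

```python
# name: (reads per lane in millions, bases per read pair, lanes per flowcell)
# Bases per pair are inferred from the table, not quoted from a spec sheet.
PLATFORMS = {
    "S1": (800, 300, 2),
    "S2": (1650, 300, 2),
    "S3": (1650, 300, 4),
    "S4": (2500, 300, 4),
    "2500 HO": (250, 250, 8),
    "4000": (350, 300, 8),
    "X": (430, 300, 8),
}
FLOWCELLS_PER_RUN = 2


def lane_throughput_gb(reads_per_lane_million, bases_per_pair):
    """Gb per lane = million reads x bases per read pair / 1000."""
    return reads_per_lane_million * bases_per_pair / 1000


def run_throughput_gb(reads_per_lane_million, bases_per_pair, lanes_per_flowcell):
    """Gb per run = Gb per lane x lanes per flowcell x flowcells per run."""
    per_lane = lane_throughput_gb(reads_per_lane_million, bases_per_pair)
    return per_lane * lanes_per_flowcell * FLOWCELLS_PER_RUN


run_totals = {name: run_throughput_gb(*spec) for name, spec in PLATFORMS.items()}
for name, total in run_totals.items():
    print(f"{name}\t{total:g} Gb per run")
```

The same function makes it easy to see what a shorter read length does to the numbers: for example, swapping 300 for 100 (2x50bp) cuts every NovaSeq figure to a third.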

Read lengths available on NovaSeq are 2×50, 2×100 and 2×150bp.  This is unfortunate as the sweet spot for RNA-Seq and exomes is 2×75bp.

As you can see from the stats, the massive innovation here is the cluster density, which has hugely increased. We also have shorter run times.

So what does this all mean?

Well let’s put this to bed straight away – HiSeq X installations are still viable.  This from an Illumina tech on Twitter:


We learn two things from this – first, that HiSeq X is still going to be cheaper for human genomes until S4 comes out, and second, that S4 won’t be out until 2018.

So Illumina won’t sell any more HiSeq X, but current installations are still viable and still the cheapest way to sequence genomes.

I also have this from an un-named source:

speculation from Illumina rep “X’s will be king for awhile. Cost per GB on those will likely be adjusted to keep them competitive for a long time.”

So X is OK, for a while.

What about HiSeq 4000? Well to understand this, you need to understand 4000 and X.

The HiSeq 4000 and HiSeq X

First off, the HiSeq X IS NOT a human genome only machine.  It is a genome-only machine.  You have been able to do non-human genomes for about a year now.  Anything you like as long as it’s a whole genome and it’s 30X or above.  The 4000 is reserved for everything else because you cannot do exomes, RNA-Seq, ChIP-Seq etc on the HiSeq X.  HiSeq 4000 reagents are more expensive, which means that per-Gb every assay is more expensive than genome sequencing on Illumina.

However, no such restrictions exist on the NovaSeq – which means that every assay will now cost the same on NovaSeq.   This is what led me to say this on Twitter:

At Edinburgh Genomics, roughly speaking, we charge approx. 2x as much for a 4000 lane as we do for an X lane.  Therefore, per Gb, RNA-Seq is approx. twice as expensive as genome sequencing.  NovaSeq promises to make this per-Gb cost the same, so does that mean RNA-Seq will be half price?  Not quite.  Of course no-one does a whole lane of RNA-Seq, we multiplex multiple samples in one lane.  When you do this, library prep costs begin to dominate, and for most of my own RNA-Seq samples, library prep is about 50% of the per-sample cost, and 50% is sequencing.  NovaSeq promises to halve the sequencing costs, which means the per-sample cost will come down by 25%.
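The 25% figure is simple arithmetic, sketched here in Python so you can plug in your own split between library prep and sequencing.  The function name and parameters are mine, for illustration only.

```python
def per_sample_saving(sequencing_share, sequencing_price_factor):
    """Fractional reduction in per-sample cost.

    sequencing_share: fraction of the current per-sample cost that is
        sequencing (the rest is library prep, assumed unchanged).
    sequencing_price_factor: new sequencing cost divided by old
        sequencing cost (0.5 means sequencing halves in cost).
    """
    new_total = (1 - sequencing_share) + sequencing_share * sequencing_price_factor
    return 1 - new_total


# 50% library prep, 50% sequencing, sequencing cost halves
saving = per_sample_saving(sequencing_share=0.5, sequencing_price_factor=0.5)
print(f"Per-sample cost falls by {saving:.0%}")
```

The more heavily you multiplex, the smaller the sequencing share becomes, and the less a cheaper sequencer moves the per-sample cost.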

These are really rough numbers, but they will do for now.  To be honest, I think this will make a huge difference to some facilities, but not for others.  Larger centers will absolutely need to grab that 25% reduction to remain competitive, but smaller, boutique facilities may be able to ignore it for a while.

Capital outlay

Expect to pay $985k for a NovaSeq 6000 and $850k for a 5000.

Time issues

One supposedly big advantage is that NovaSeq takes 40 hours to run, compared to the existing 3 days for a HiSeq X.   Comparing like with like that’s 40 hours vs 72 hours.  This might be important in the clinical space, but not for much else.

Putting this in context, when you send your samples to a facility, they will be QC-ed first, then put in the library prep queue, then put in the sequencing queue, then QC-ed bioinformatically before finally being delivered.  Let’s be generous and say this takes 2 weeks.  Out of that, sequencing time is 3 days.  So instead of waiting 14 days, you’re waiting 13 days.  Who cares?

Clinically having the answer 1 day earlier may be important, but let’s not forget, even on our £1M cluster, at scale the BWA+GATK pipeline itself takes 3 days.  So again you’re looking at 5 days vs 6 days.  Is that a massive advantage?  I’m not sure.  Of course you could buy one of the super-fast bioinformatics solutions, and maybe then the 40 hour run time will count.

Colours and quality

NovaSeq marks a switch from the traditional HiSeq 4 colour chemistry to the quicker NextSeq 2 colour chemistry.  As Brian Bushnell has noted on this blog, NextSeq data quality is quite a lot worse than HiSeq 2500, so we may see a dip in data quality, though Illumina claim 85% above Q30.


University of California makes legal move against Roger Chen (and Genia)

The relationship between sequencing companies is frosty and beset by legal issues, which I’ve covered before here and here.  Keith Robison tends to cover in more detail 😉

Most recently, PacBio moved against Oxford Nanopore, we think claiming that ONT’s 2D technology violated their patent on CCR (link).

Well now the absolute latest is a filing by the University of California against Roger Chen and therefore Genia. If you click through to the documents (requires registration) you’ll see that UC claim Chen, with others, produced key inventions whilst at UC that he later assigned to Genia, but which should have automatically been assigned to UC according to UC’s “oath of allegiance”, which Chen signed as a UC employee.

It remains to be seen how important this is and no doubt Chen/Genia/Roche will fight tooth and nail; however if the courts decide in UC’s favour it could spell the end of Genia, and at the very least see a large cash settlement with UC.

Fascinating times!

Is the long read sequencing war already over?

My enthusiasm for nanopore sequencing is well known; we have some awesome software for working with the data; we won a grant to support this work; and we successfully assembled a tricky bacterial genome.  This all led to Nick and me writing an editorial for Nature Methods.

So, clearly some bias towards ONT from me.

Having said all of that, when PacBio announced the Sequel, I was genuinely excited.   Why?  Well, revolutionary and wonderful as the MinION was at the time, we were getting ~100Mb runs.  Amazing technology, mobile sequencer, tri-corder, just incredible engineering – but 100Mb was never going to change the world.  Some uses, yes; but for other uses we need more data.  Enter Sequel.

However, it turns out Sequel isn’t really delivering on promises.  Rather than 10Gb runs, folk are getting between 3 and 5Gb from the Sequel:

At the same time, MinION has been coming along great guns:

Whilst we are right to be skeptical about ONT’s claims about their own sequencer, other people who use the MinION have backed up these claims and say they regularly get figures similar to this. If you don’t believe me, go get some of the World’s first Nanopore human data here.

PacBio also released some data for Sequel here.

So how do they stack up against one another?  I won’t deal with accuracy here, but we can look at #reads, read length and throughput.

To be clear, we are comparing “rel2-nanopore-wgs-216722908-FAB42316.fastq.gz”, a fairly middling run from the NA12878 release, with “m54113_160913_184949.subreads.bam”, one of the Sequel SMRT cell datasets released.

Read length histograms:


As you can see, the longer reads are roughly equivalent in length, but MinION has far more reads at shorter read lengths.  I know the PacBio samples were size selected on Blue Pippin, but unsure about the MinION data.

The MinION dataset includes 466,325 reads, over twice as many as the Sequel dataset at 208,573 reads.

In terms of throughput, MinION again came out on top, with 2.4Gbases of data compared to just 2Gbases for the Sequel.

We can limit to reads >1000bp, and see a bit more detail:


  • The MinION data has 326,466 reads greater than 1000bp summing to 2.37Gb.
  • The Sequel data has 192,718 reads greater than 1000bp, summing to 2Gb.

Finally, for reads over 10,000bp:

  • The MinION data has 84,803 reads greater than 10000bp summing to 1.36Gb.
  • The Sequel data has 83,771 reads greater than 10000bp, summing to 1.48Gb.
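For anyone wanting to reproduce this kind of summary, here is a minimal Python sketch that counts reads above a length cut-off and sums their bases.  It assumes a plain four-line-per-record FASTQ (optionally gzipped); the Sequel subreads are in BAM format and would need a BAM reader, which is not shown.  The function names and the toy read lengths are hypothetical, and this is not the code used to produce the figures above.

```python
import gzip


def read_lengths_from_fastq(path):
    """Yield the length of every read in a FASTQ file.

    Assumes exactly four lines per record (no wrapped sequences);
    files ending in .gz are opened with gzip.
    """
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        for line_number, line in enumerate(handle):
            if line_number % 4 == 1:  # second line of each record is the sequence
                yield len(line.strip())


def summarise_lengths(lengths, min_length=0):
    """Return (number of reads, total bases) for reads longer than min_length."""
    kept = [length for length in lengths if length > min_length]
    return len(kept), sum(kept)


# Toy read lengths, purely for illustration
toy_lengths = [500, 1500, 12000]
for cutoff in (0, 1000, 10000):
    n_reads, n_bases = summarise_lengths(toy_lengths, min_length=cutoff)
    print(f">{cutoff}bp: {n_reads} reads, {n_bases} bases")
```

On a real run you would pass `read_lengths_from_fastq("your_reads.fastq.gz")` in place of the toy list (the file name is a placeholder).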

These are very interesting stats!

This is pretty bad news for PacBio.  If you add in the low cost of entry for MinION, and the £300k cost of the Sequel, the fact that MinION is performing as well as, if not better than, Sequel is incredible.  Both machines have a long way to go – PacBio will point to their roadmap, with longer reads scheduled and improvements in chemistry and flowcells.  In response, ONT will point to the incredible development path of MinION, increased sequencing speeds and bigger flowcells.  And then there is PromethION.

So is the war already over?   Not quite yet.  But PacBio are fighting for their lives.

People are wrong about sequencing costs on the internet again

People are wrong about sequencing costs on the internet again and it hurts my face, so I had to write a blog post.

Phil Ashton, whom I like very much, posted this blog:

But the words are all wrong 😉  I’ll keep this short:

  • COST is what it COSTS to do something.  It includes all COSTS.  The clue is in the name.  COST. It’s right there.
  • PRICE is what a consumer pays for something.

These are not the same thing.

As a service provider, if the PRICE you charge to users is lower than your COST, then you are either SUBSIDISED or LOSING MONEY, and are probably NOT SUSTAINABLE.

COST, amongst other things, includes:

  • Reagents
  • Staff time
  • Capital cost or replacement cost (sequencer and compute)
  • Service and maintenance fees
  • Overheads
  • Rent

Someone is paying these, even if it’s not the consumer.  So please – when discussing sequencing – distinguish between PRICE and COST.

Thank you 


© 2018 Opiniomics
