Opiniomics

bioinformatics, genomes, biology etc. "I don't mean to sound angry and cynical, but I am, so that's how it comes across"


HiSeq move over, here comes Nova! A first look at Illumina NovaSeq

Illumina have announced NovaSeq, an entirely new sequencing system that completely disrupts their existing HiSeq user-base.  In my opinion, if you have a HiSeq and you are NOT currently planning to migrate to NovaSeq, then you will be out of business in 1-2 years’ time.  It’s not quite the death knell for HiSeqs, but it’s pretty close, and moving to NovaSeq over the next couple of years is now the only viable option if you see Illumina as an important part of your offering.

Illumina have done this before, it’s what they do, so no-one should be surprised.

The stats

I’ve taken the stats from the spec sheet linked above and produced the following.  If there are any mistakes let me know.

There are two machines – the NovaSeq 5000 and 6000 – and four flowcell types – S1, S2, S3 and S4.  The 6000 will run all four flowcell types; the 5000 will only run the first two.  Not all flowcell types are immediately available, with S4 scheduled for 2018 (see below).

                               S1     S2     S3     S4     2500 HO  4000   X
Reads per flowcell (billion)   1.6    3.3    6.6    10     2        2.8    3.44
Lanes per flowcell             2      2      4      4      8        8      8
Reads per lane (million)       800    1650   1650   2500   250      350    430
Throughput per lane (Gb)       240    495    495    750    62.5     105    129
Throughput per flowcell (Gb)   480    990    1980   3000   500      840    1032
Total lanes                    4      4      8      8      16       16     16
Total flowcells                2      2      2      2      2        2      2
Run throughput (Gb)            960    1980   3960   6000   1000     1680   2064
Run time (days)                2-2.5  2-2.5  2-2.5  2-2.5  6        3.5    3

For X Ten, simply multiply the X figures by 10.  These are maximum figures, and assume maximum read lengths.

Read lengths available on NovaSeq are 2×50, 2×100 and 2×150bp.  This is unfortunate, as the sweet spot for RNA-Seq and exomes is 2×75bp.

As you can see from the stats, the massive innovation here is the cluster density, which has hugely increased. We also have shorter run times.

So what does this all mean?

Well let’s put this to bed straight away – HiSeq X installations are still viable.  This from an Illumina tech on Twitter:

 

We learn two things from this: first, that HiSeq X is still going to be cheaper for human genomes until S4 comes out; and second, that S4 won’t be out until 2018.

So Illumina won’t sell any more HiSeq X, but current installations are still viable and still the cheapest way to sequence genomes.

I also have this from an un-named source:

Speculation from an Illumina rep: “X’s will be king for awhile. Cost per GB on those will likely be adjusted to keep them competitive for a long time.”

So X is OK, for a while.

What about the HiSeq 4000? Well, to understand this, you need to understand the 4000 and the X.

The HiSeq 4000 and HiSeq X

First off, the HiSeq X IS NOT a human-genome-only machine.  It is a genome-only machine.  You have been able to do non-human genomes on it for about a year now – anything you like, as long as it’s a whole genome and it’s 30X or above.  The 4000 is reserved for everything else, because you cannot do exomes, RNA-Seq, ChIP-Seq etc on the HiSeq X.  HiSeq 4000 reagents are more expensive, which means that, per Gb, every other assay is more expensive than genome sequencing on Illumina.

However, no such restrictions exist on the NovaSeq – which means that every assay will now cost the same per Gb.   This is what led me to say this on Twitter:

At Edinburgh Genomics, roughly speaking, we charge approx. 2x as much for a 4000 lane as we do for an X lane.  Therefore, per Gb, RNA-Seq is approx. twice as expensive as genome sequencing.  NovaSeq promises to make this per-Gb cost the same, so does that mean RNA-Seq will be half price?  Not quite.  Of course, no-one does a whole lane of RNA-Seq; we multiplex multiple samples in one lane.  When you do this, library prep costs begin to dominate, and for most of my own RNA-Seq samples, library prep is about 50% of the per-sample cost and sequencing is the other 50%.  NovaSeq promises to halve the sequencing costs, which means the per-sample cost will come down by 25%.
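
To spell the arithmetic out (the 50/50 split is just my rough figure from above):

# rough per-sample RNA-Seq cost: 50% library prep + 50% sequencing
# if NovaSeq halves the sequencing component:
#   0.5 (prep) + 0.5 x 0.5 (sequencing) = 0.75 of the old cost, i.e. roughly 25% off
awk 'BEGIN { prep = 0.5; seq = 0.5; printf "new cost = %.2f of the old cost\n", prep + seq/2 }'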

These are really rough numbers, but they will do for now.  To be honest, I think this will make a huge difference to some facilities, but not to others.  Larger centres will absolutely need to grab that 25% reduction to remain competitive, but smaller, boutique facilities may be able to ignore it for a while.

Capital outlay

Expect to pay $985k for a NovaSeq 6000 and $850k for a 5000.

Time issues

One supposedly big advantage is that NovaSeq takes 40 hours to run, compared to the existing 3 days for a HiSeq X.   Comparing like with like, that’s 40 hours vs 72 hours.  This might be important in the clinical space, but not for much else.

Putting this in context: when you send your samples to a facility, they will be QC-ed first, then put in the library prep queue, then the sequencing queue, then QC-ed bioinformatically before finally being delivered.  Let’s be generous and say this takes 2 weeks.  Out of that, sequencing time is 3 days.  So instead of waiting 14 days, you’re waiting 13 days.  Who cares?

Clinically, having the answer 1 day earlier may be important, but let’s not forget that even on our £1M cluster, at scale, the BWA+GATK pipeline itself takes 3 days.  So again you’re looking at 5 days vs 6 days.  Is that a massive advantage?  I’m not sure.  Of course, you could buy one of the super-fast bioinformatics solutions, and maybe then the 40-hour run time will count.

Colours and quality

NovaSeq marks a switch from the traditional HiSeq four-colour chemistry to the quicker NextSeq two-colour chemistry.  As Brian Bushnell has noted on this blog, NextSeq data quality is quite a lot worse than HiSeq 2500, so we may see a dip in data quality, though Illumina claim 85% of bases above Q30.

 

Is the long read sequencing war already over?

My enthusiasm for nanopore sequencing is well known; we have some awesome software for working with the data; we won a grant to support this work; and we successfully assembled a tricky bacterial genome.  This all led to Nick and I writing an editorial for Nature Methods.

So, clearly some bias towards ONT from me.

Having said all of that, when PacBio announced the Sequel, I was genuinely excited.   Why?  Well, revolutionary and wonderful as the MinION was at the time, we were getting ~100Mb runs.  Amazing technology, mobile sequencer, tri-corder, just incredible engineering – but 100Mb was never going to change the world.  Some uses, yes; but for other uses we need more data.  Enter Sequel.

However, it turns out Sequel isn’t really delivering on promises.  Rather than 10Gb runs, folk are getting between 3 and 5Gb from the Sequel:

At the same time, the MinION has been going great guns:

Whilst we are right to be sceptical about ONT’s claims about their own sequencer, other people who use the MinION have backed up these claims and say they regularly get figures similar to this. If you don’t believe me, go get some of the world’s first nanopore human data here.

PacBio also released some data for Sequel here.

So how do they stack up against one another?  I won’t deal with accuracy here, but we can look at #reads, read length and throughput.

To be clear, we are comparing “rel2-nanopore-wgs-216722908-FAB42316.fastq.gz”, a fairly middling run from the NA12878 release, with “m54113_160913_184949.subreads.bam”, one of the released Sequel SMRT cell datasets.

Read length histograms:

[Figure: read-length histograms, MinION vs PacBio Sequel]

As you can see, the longer reads are roughly equivalent in length, but the MinION has far more reads at shorter read lengths.  I know the PacBio samples were size-selected on a Blue Pippin, but I am unsure about the MinION data.

The MinION dataset includes 466,325 reads, over twice as many as the Sequel dataset at 208,573 reads.

In terms of throughput, MinION again came out on top, with 2.4Gbases of data compared to just 2Gbases for the Sequel.
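
If you want to check numbers like these yourself, here is a quick sketch using standard shell tools. The FASTQ name is the NA12878 file mentioned above; the Sequel subreads BAM would need converting to FASTQ first (e.g. with samtools fastq):

# count reads and total bases in a gzipped FASTQ
zcat rel2-nanopore-wgs-216722908-FAB42316.fastq.gz | \
  awk 'NR % 4 == 2 { reads++; bases += length($0) }
       END { printf "%d reads, %.2f Gbases\n", reads, bases/1e9 }'

# for the Sequel data, convert the subreads BAM first:
# samtools fastq m54113_160913_184949.subreads.bam > sequel.fastq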

We can limit to reads >1000bp, and see a bit more detail:

[Figure: read-length histograms for reads over 1,000bp]

  • The MinION data has 326,466 reads greater than 1000bp summing to 2.37Gb.
  • The Sequel data has 192,718 reads greater than 1000bp, summing to 2Gb.

Finally, for reads over 10,000bp:

  • The MinION data has 84,803 reads greater than 10000bp summing to 1.36Gb.
  • The Sequel data has 83,771 reads greater than 10000bp, summing to 1.48Gb.

These are very interesting stats!


This is pretty bad news for PacBio.  If you add in the low cost of entry for the MinION, and the £300k cost of the Sequel, the fact that the MinION is performing as well as, if not better than, the Sequel is incredible.  Both machines have a long way to go – PacBio will point to their roadmap, with longer reads scheduled and improvements in chemistry and flowcells.  In response, ONT will point to the incredible development path of the MinION, increased sequencing speeds and bigger flowcells.  And then there is PromethION.

So is the war already over?   Not quite yet.  But PacBio are fighting for their lives.

Building a Kraken database with new FTP structure and no GI numbers

Progress is usually good, except when a reliable resource completely changes its data and file structures and messes up everything you have been trying to do.  The NCBI used to arrange bacterial genomes in a way that was very easy to understand and download from; now it’s a bit tougher.  So tough, in fact, that they had to create a new FAQ.

At the same time as moving around genome data, they also decided to retire GI numbers (thanks to Andreas for the link!).

This is a problem for Kraken, a metagenomic profiler, because it relies both on the old style genomic data structures and the GI numbers that come with them.  Below are my attempts to build a Kraken database, using the new FTP structure and no GIs.

Downloading the data

It is pretty easy to follow the FAQ above, but I have created some perl scripts here that should help.

You will need:

  • Perl
  • BioPerl
  • Git
  • A linux shell (bash) that has wget
  • Internet access
  • A folder that you have write access to
  • Maybe 40Gb of space

The magic in these scripts is that they download the data and then add the necessary “|kraken:taxid|xxxx” text that replaces the GI number when building what Kraken refers to as “custom databases”.
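
In case you are wondering what that looks like, Kraken’s custom-database format embeds the taxid in each FASTA header. The accession and taxid below are purely illustrative (the perl scripts handle this for you):

# header before and after (illustrative accession/taxid only):
#   >NC_000913.3 Escherichia coli str. K-12 substr. MG1655, complete genome
#   >NC_000913.3|kraken:taxid|511145 Escherichia coli str. K-12 substr. MG1655, complete genome
# the same edit as a one-liner, for a single genome with a known taxid:
sed 's/^>\([^ ]*\)/>\1|kraken:taxid|511145/' genome.fna > genome.taxid.fna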

Note also that the script only downloads fasta for assemblies classed as “Complete Genome”.  Other options are:

  • Chromosome
  • Complete Genome
  • Contig
  • Scaffold

If you want these others then edit the scripts appropriately.


# clone the git repo
git clone https://github.com/mw55309/Kraken_db_install_scripts.git

# either put the scripts in your path or run them from that directory

# make download directory
mkdir downloads
cd downloads

# run for each branch of life you wish to download
perl download_bacteria.pl
perl download_archaea.pl
perl download_fungi.pl
perl download_protozoa.pl
perl download_viral.pl

# you should now have five directories, one for each branch

# build a new database 
# download taxonomy
kraken-build --download-taxonomy --db kraken_bvfpa_080416

# for each branch, add all fna in the directory to the database
for dir in fungi protozoa archaea viral bacteria; do
        for fna in `ls $dir/*.fna`; do
                kraken-build --add-to-library $fna --db kraken_bvfpa_080416
        done
done

# build the actual database
kraken-build --build --db kraken_bvfpa_080416


I haven’t tested the above database yet, but the output of the build script ends with:


Creating GI number to seqID map (step 4 of 6)...
GI number to seqID map created. [14.935s]
Creating seqID to taxID map (step 5 of 6)...
7576 sequences mapped to taxa. [1.188s]
Setting LCAs in database (step 6 of 6)...
Database LCAs set. [21m24.805s]
Database construction complete. [Total: 1h1m28.108s]

Which certainly makes it look like things have worked.

I’ll update the post with info from testing in due course.

 

Update 13/04/2016

I simulated some reads from a random set of fastas in the database, then ran Kraken, and it worked (plus/minus a few false positives)

[Figure: Kraken results for the simulated test reads]

We need to stop making this simple f*cking mistake

I’m not perfect.  Not in any way.  I am sure if anyone was so inclined, they could work their way through my research with clinical forensic attention-to-detail and uncover all sorts of mistakes.  The same will be true for any other scientist, I expect.  We’re human and we make mistakes.

However, there is one mistake in bioinformatics that is so common, and which has been around for so long, that it’s really annoying when it keeps happening:

It turns out the Carp genome is full of Illumina adapters.

One of the first things we teach people in our NGS courses is how to remove adapters.  It’s not hard – we use CutAdapt, but many other tools exist.  It’s simple, but really important – with de Bruijn graphs you will get paths through the graph converging on kmers from adapters; and with OLC assemblers you will get spurious overlaps.  With gap-fillers, it’s possible to fill the gaps with sequences ending in adapters, and this may be what happened in the Carp genome.
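
For the record, a typical paired-end CutAdapt command looks something like this (the adapter sequence and file names are only examples; use whatever adapters your library prep actually used):

# trim the standard Illumina adapter from both reads of a pair (example only)
cutadapt -a AGATCGGAAGAGC -A AGATCGGAAGAGC \
         -o trimmed_R1.fastq.gz -p trimmed_R2.fastq.gz \
         reads_R1.fastq.gz reads_R2.fastq.gz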

Why then are we finding such elementary mistakes in such important papers?  Why aren’t reviewers picking up on this?  It’s frustrating.

This is a separate, but related, issue to genomic contamination – the Wheat genome has PhiX in it; tons of bacterial genomes do too; and lots of bacterial genes were problematically included in the Tardigrade genome and declared as horizontal gene transfer.

Genomic contamination can be hard to find, but sequence adapters are not.  Who isn’t adapter trimming in 2016?!

Fast parallel access to data within MinION FAST5 files

As readers of this blog will know, my group was one of the first to publish software for the MinION sequencer, and I have blogged a couple of times on how to extract data from FAST5 files using poRe.  Well, BBSRC, my main funder, were kind enough to give us a grant to develop the software, and one thing we said we’d do is implement parallel processing.

Below is a summary of what we’ve done so far.

Parallel processing in R

One of the huge benefits of R, besides the amazing suite of statistical and visualization tools and the thousands of add-on packages, is the fact that parallelisation is really easy.  In this context, what I mean by parallelisation is the use of multiple cores on your machine to run a single bit of code.  Using multiple cores speeds the code up, often linearly with the number of cores you use.

There are quite a lot of packages for R that help you run parallel code, including doParallel, foreach, snow etc., but we’re going to focus on the built-in package, parallel.  To show the benefits of this, we have run various toolkits for the extraction of 2D FASTQ data from 25,484 FAST5 files kindly shared by Nick Loman.

Benchmarking FASTQ extraction

Now, as Matt Loose has pointed out to me, focusing on FASTQ extraction is a little pointless as almost certainly this will be built in to future ONT base callers; however, it serves as a useful example and there are plenty of FAST5 files out there already that you may wish to extract FASTQ from.

The tools we compared are extract2D from poRe, poretools, fast5extract from c_fast5 (a simple C implementation) and extract2DParallel, a parallel version of extract2D (the parallel scripts are in the “misc” folder in GitHub).  We ran each tool 4 times on 1000, 5000, 10000, 15000, 20000, and 25000 randomly selected FAST5 files, clearing the cache completely between runs to ensure a fair comparison.

[Figure: run time of each FASTQ extraction tool against number of FAST5 files]

As can be seen, all tools scale linearly with the number of FAST5 files; poretools is the slowest, then extract2D, then c_fast5.  However, the best result is clearly from extract2DParallel (using 4 cores), which outperforms everything else by some distance.

Of course, this is not surprising, and doesn’t come without a cost – if you only have 4 cores then using all 4 will mean all other processes perform really slowly.  However, there are also benefits – if your machine has 8 or 24 cores, you can speed up processing even more.

The scripts on GitHub should be treated as an early release and have only been tested on one dataset, so I welcome your feedback!

Extracting FASTQ on Windows

We haven’t wrapped up the functionality into poRe yet, but we expect to in the next release.  Meanwhile for the impatient…


library(poRe)
library(parallel)

# find out how many cores you have
detectCores()
# [1] 4


I have 4 cores, and I probably don’t want to use all of them! I now need to redefine one of poRe’s internal functions so that it works with parallel:


printfastq <- function (f5, f5path = "/Analyses/Basecall_2D_000/BaseCalled_2D/", out="out.fastq")
{
    fid = tryCatch({
        H5Fopen(f5)
    }, warning = function(w) w, error = function(e) e)
    if (typeof(fid) == "S4" && H5Lexists(fid, f5path)) {
        did <- H5Dopen(fid, paste(f5path, "Fastq", sep = ""))
        fq <- (H5Dread(did))
        four <- strsplit(fq, "\n")[[1]]
        fq <- paste(paste(four[1], f5, sep = " "), four[2], four[3],
            four[4], sep = "\n")
        cat(fq, file = out, sep = "\n", fill = FALSE, append=TRUE)
        H5Dclose(did)
    }
    if (typeof(fid) == "S4") {
        H5Fclose(fid)
    }
}

Now let's create a vector of the fast5 files I want to extract:


my.dir <- "C:/MinION/FAST5/" # use your own directory here!

f5files <- dir(path = my.dir, pattern = "\\.fast5$", full.names = TRUE)

Here is the interesting part, where we set up our parallel "cluster", which is then used to execute the code. Each core within the cluster should be treated as a separate instance of R, so we need to load libraries and export variables to the cluster:


# use two cores
cl <- makeCluster(2)
 
# load poRe on the cluster
c <- clusterEvalQ(cl, suppressWarnings(suppressMessages(library(poRe))))

# export the printfastq function to the cluster
clusterExport(cl, "printfastq")

# run the code and stop the cluster
l <- parApply(cl, array(f5files), 1, printfastq, out="myfastq.fastq")
stopCluster(cl)

Extract metadata

Absolutely one of the slowest aspects of MinION data analysis and QC is extracting metadata, but we can also parallelise this!  For those of you who like the Linux command line, there is an extractMetaParallel script in the GitHub repo (again - use with caution and feedback!), but for those of you who want to use this within R:



library(poRe)
library(parallel)
# find out how many cores you have
detectCores()
# [1] 4

# get a list of files to extract data from
my.dir <- "C:/MinION/FAST5/" # use your own directory here!
f5files <- dir(path = my.dir, pattern = "\\.fast5$", full.names = TRUE)

# use two cores
cl <- makeCluster(2)
 
# load poRe on the cluster
c <- clusterEvalQ(cl, suppressWarnings(suppressMessages(library(poRe))))

# variables to set paths to template, complement and 2D data
# NOTE you will need to change these for the new FAST5 format
path.t <- "/Analyses/Basecall_2D_000/"
path.c <- path.t
path2d <- path.t

# run the code and stop the cluster
l <- parLapply(cl, 
          f5files, 
          get_meta_info, path2d, path.t, path.c)
stopCluster(cl)

# convert the resulting vector into a matrix with the correct column names
m <- matrix(unlist(l), ncol=12, byrow=TRUE)
colnames(m) <- c("filename","channel_num","read_num","read_start_time","status","tlen","clen","len2d","run_id","read_id","barcode","exp_start") 

Summary

Parallelisation is so easy within R it's almost criminal not to do it. We can see that just a bit of parallelisation speeds up the code incredibly - and we can also see that poRe is a pretty fast toolkit when used in the correct way, which it wasn't in this paper.

NOTE I am fairly new to benchmarking and parallelisation, if I have done anything wrong, I apologize - and please get in touch!

Shots fired – nanopore wars part II

For the record, and to try and avoid accusations of bias, I am a fan of both Illumina and Oxford Nanopore.  I have won grants to purchase, and I have encouraged our facility to purchase, millions of pounds worth of Illumina equipment; my lab has spent well over £100,000 on Illumina sequencing.   So there’s that.  Yet I am also a massive fan boy of Oxford Nanopore’s technology – anyone who reads my blog or my Twitter feed will realise that I see nanopore technology as the future and often get incredibly excited by the revolutionary nature of the MinION mobile sequencer.

So I am incredibly disappointed in this – Illumina Sues Oxford Nanopore for Patent Infringement.   It’s like having two of your friends fighting.

There are fractious relationships in genomics of course – that between Illumina and BGI perhaps most famous – and I have written before about the relationship between Illumina and ONT, and how they entered into arbitration in 2011/12.

Illumina suing ONT is the latest act in the “war” between the two companies, and comes after Illumina licensed technology from the Gundlach lab in 2013.  Key for me is what Illumina’s intentions were when they did this:

  1. Illumina wanted to develop nanopore technology, and they now have something close to market.  This situation would put Illumina in the best light.  In this case suing ONT is probably something they have to do to try and create space in the market for their technology.  Not great for ONT, but it seems this is how business is done.
  2. Illumina wanted to develop nanopore technology, they’ve tried and failed.  This makes Illumina look worse – if they don’t have a competing technology, it’s a bit spiteful to try and spike a competitor anyway, though certainly not uncommon.  And in this situation at least they tried to develop tech.
  3. Illumina never wanted to develop nanopore technology, and they bought the Gundlach patents just to kill ONT.  Again not entirely uncommon, but this is essentially patent trolling and really doesn’t make Illumina look good at all.  This would be a business driven decision to the detriment of science and the detriment of humanity.  Not good.

I am guessing the only way we will ever know which of these is true is if Illumina eventually release nanopore technology – we’d then know/suspect (1) is true.

The timing of this really stinks too – given ONT are set to float on the stock market, it’s hard not to see this as a deliberate attempt to scupper that.

My view is that ONT have a pretty strong position – my understanding is that they have patents, stretching back many years, that cover most or all of their instruments.  Some of this may come down to the pore – Gundlach’s patents cover MspA, and Hagan Bayley’s work, on which ONT was originally based, involved alpha-hemolysin, a different pore – though we don’t know if ONT have switched pores.  In addition to all of that, the pores in ONT’s technologies have been engineered, mutated and developed a huge amount – so when does a pore stop being the original molecule and become something else?  Do the patents cover this?

These are very interesting times, but this is not a welcome development.

Did you notice the paradigm shift in DNA sequencing last week?

There is a very funny tweet by Matthew Hankins about the over-use of “paradigm shift” in articles in Scopus between 1962 and 2014, which clearly suggests we over-use the term:

However, last week a genuine shift in DNA sequencing was published on bioRxiv by Matt Loose: “Real time selective sequencing using nanopore technology”.  So what makes this paper special?  Why is this a genuine paradigm shift?

Well, this is the first example, ever, of a DNA sequencer selecting regions of the input genome to sequence.   To be more accurate, Matt demonstrates the MinION sequencing DNA molecules and analyzing them in real time; testing whether they come from a region of the genome he is interested in; and, if they do not, rejecting them (“spitting them out”) and making the nanopore available for the next read.

This has huge applications from human health through to pathogen genomics and microbiome research.  There are many applications in genomics where you are only interested in a subset of the DNA molecules in your sample.  For example, exome sequencing, or other sequence capture experiments – at the moment a separate capture reaction needs to take place and this can actually be more expensive than the sequencing.   Or pathogen diagnostics – here most of what you sequence would be host DNA, which you could ask the sequencer to reject, leaving only the pathogen to sequence; or flip it round – human saliva samples can have up to 40% oral microbiome DNA, which is wasted if what you want to sequence is the host.  I have worked with piRNA and siRNA datasets from insects where less than 1% of the Illumina reads are those I actually want.  If you want to know if a particular bacterial infection is resistant to antibiotics, you only actually need to sequence the antibiotic-resistance genes/islands/plasmids, not the entire genome. etc etc etc

Of course, we still have a bit of a way to go – Matt demonstrates so-called “read until” on Lambda, a relatively small, simple genome – but this is an amazing proof-of-principle of an incredible idea – that you can tell your sequencer what to sequence.  The paper deserves your attention.  Please read it and fire up your imagination!

Doing the maths on PromethION throughput

In case you don’t know, the PromethION is the bigger brother of the MinION, Oxford Nanopore’s revolutionary mobile single-molecule DNA sequencer.  The PromethION is about to go into early access, where I believe the access fee is just $75k, most of which can subsequently be spent on reagents/flowcells in the ONT shop.  The hidden costs are of course the bioinformatics, but that’s not what this post is about.

What I wanted to do is predict PromethION throughput using current MinION stats.  So let’s do that.

I will be deliberately conservative with the MinION throughput stats.  It’s a matter of record, I think, that a MinION 48hr run is easily capable of producing 100-200Mbase of 2D reads, most of which are ~8Kb in length.  I’m going to use this figure as a basic MinION throughput, though many will argue that SQK-MAP-006 runs produce far more.

I want you to focus on three parameters: the rate at which DNA translocates through the pores, currently ~70b/s on MinION; the number of channels per flowcell (512 on MinION); and the number of flowcells per run (1 on MinION).   So 1 flowcell with 512 channels running at 70b/s will produce 100-200Mbase of high quality 2D sequence data.

PromethION flowcells are 4 times the size of MinION flowcells, with 2048 channels.  You can also run 48 flowcells in parallel.  Therefore, if all PromethION ends up being is 48 large MinIONs running in parallel, we can predict 4 * 48 * 100-200Mbase == between 19.2 and 38.4Gbase per run.

However, let’s go back to that translocation speed.   I believe the aim is for the PromethION to run at 400b/s.  There is a huge caveat here in that we haven’t seen any data from ONT running at 400b/s; however, logically the faster the DNA moves through the pore, the more sequence data you will get.  The potential is therefore for PromethION to produce 5 times the above – so between 96Gbase and 192Gbase of high quality 2D reads (note – many more template and complement reads will be produced). (2nd note – there is no guarantee the 2D protocol will be around long term, the aim is to produce 1D data at a  much higher quality, which will also increase throughput)
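
Spelling the sums out, using the figures above and rounding the 70 to 400 b/s speed-up to 5x:

# back-of-envelope PromethION throughput from current MinION figures
awk 'BEGIN {
  low = 0.1; high = 0.2;   # Gbase of 2D data per MinION run (conservative)
  size = 4;                # PromethION flowcell ~4x a MinION flowcell (2048 vs 512 channels)
  n = 48;                  # flowcells run in parallel
  speed = 5;               # ~70 b/s to 400 b/s, rounded down
  printf "at current speed: %.1f-%.1f Gbase per run\n", low*size*n, high*size*n
  printf "at 400 b/s:       %.0f-%.0f Gbase per run\n", low*size*n*speed, high*size*n*speed
}'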

These are obviously “back of an envelope” calculations, and almost certainly wrong in some respect, but hopefully this puts the PromethION in context with the other sequencers on the market.

Update 1: via Nick Loman, the F1000 data was SQK-MAP-005 and ran at 30b/s, not 70

Update 2: via Matt Loose, PromethION flowcells are 3000 channels not 2048

Update 3: via Nick Loman, PoreCamp (SQK-MAP-006) run produced 460Mbase 2D data

 

* bad maths is a product of lack of sleep

** the blog is called opiniomics for a reason

Bioinformatics error responsible for retraction in Science

Update: sorry, my mistake – the paper has not been retracted; an Erratum has been issued.

If you follow the same kind of feeds and blogs that I do, you can’t have missed the recent retraction in Science that was due to a simple bioinformatics error.  Others have covered this in way more detail, but the essential points are that the authors sequenced an ancient genome from an individual they called “Mota”, and compared that genome to a bunch of other human genomic datasets to put it in context – except when they compared to a reference panel of SNPs, a necessary conversion script wasn’t run, which meant that a huge number of SNPs that Mota shared with Europeans were dropped, making Mota appear less European than he actually was.

The thing that strikes me most about this story is that this could have been so easily avoided.

There are a few things I tell my post-docs and students over and over.  One of them, perhaps the most important, is to check your work.  A second and related point is that if you get wonderful results, or unusual and striking results, they are probably wrong and you should check your work.  We tried to encapsulate some of this advice in our paper – So you want to be a computational biologist?  Honestly, I know it’s boring, but double- and triple-checking your work is so important.   Whatever you find, at the very least just take a few data points and check again to make sure you get the same result.

In the above example, what I personally would have done is to write a script to make sure that the “255,922 SNPs out of 256,540” were genuinely missing from the samtools output.  This may sound daunting, but it really isn’t – possibly 10 or 20 lines of Python or Perl could have saved the day.
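
As a concrete (entirely hypothetical) sketch, something along these lines would do it with standard shell tools; the file names and formats are my assumptions, since I don’t know exactly what the pipeline produced:

# panel_snps.txt : chromosome and position (tab-separated) of the 256,540 panel SNPs
# calls.vcf      : the variant calls produced for Mota
grep -v '^#' calls.vcf | cut -f1,2 | sort -u > called_positions.txt
sort -u panel_snps.txt > panel_positions.txt
comm -12 panel_positions.txt called_positions.txt | wc -l  # panel SNPs present in the calls
comm -23 panel_positions.txt called_positions.txt | wc -l  # panel SNPs genuinely missing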

I do wonder if this was a case of a lonely bioinformatician – I genuinely don’t know who did what on the paper, but so frequently errors creep in when wet- and dry-lab scientists communicate in ever-so-slightly different languages.

Whatever the issue – the take home is this: please, for the love of everything that is good in the world, CHECK YOUR RESULTS!

Which sequencer should you buy?

My aim for this post is a quick, pithy review of available sequencers, what they’re good for, what they’re not and under which circumstances you should buy one.   However, your first thought should be: do I need to buy a sequencer?  I know it’s great to have your own box (I love all of ours), but they are expensive, difficult to run, temperamental and time-consuming.  So think on that before you spend millions of pounds of your institute’s money – running the sequencers may cost millions more.  My blog post on choosing an NGS supplier is still valid today.

Illumina HiSeq X Ten

To paraphrase Quentin Tarantino, you buy one of these “When you absolutely, positively got to sequence every person in the room, accept no substitutes”.  The HiSeq X Ten is actually ten HiSeq X instruments; each instrument can run two flowcells and each flowcell has 8 lanes.  Each lane will produce 380-450 million 150PE reads (120Gbase of data, or 40X of a human genome).  Runs take 3 days.  Expect to pay an extra £1M on computing to cope with the data streams.  Ordered flowcells are quite difficult to deal with and can result in up to 30% “optical duplicates” (actually spill-over from one well to adjacent wells).  You can produce 160 genomes every 3 days.  Essentially now used as a very cheap, whole-genome genotyping system; cost per genome is currently £850+VAT at Edinburgh Genomics.  Limited to 30X (or greater) genome sequencing.  I have checked with Illumina and this is definitely still true.
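
For what it’s worth, the arithmetic behind those figures (assuming a 3Gbase human genome and counting 150PE reads as pairs):

# sanity check on the HiSeq X Ten numbers
awk 'BEGIN {
  pairs = 400e6; readlen = 2*150;   # ~400 million 150PE read pairs per lane
  gb = pairs*readlen/1e9;           # Gbase per lane
  printf "per lane: %.0f Gbase, ~%.0fX of a human genome\n", gb, gb/3
  printf "genomes per 3 days: %d\n", 10*2*8    # 10 instruments x 2 flowcells x 8 lanes
}'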

Illumina HiSeq X Five

Do not buy one of these.  The reagents are 40% more expensive for no good reason.  Simply outsource to someone with an X Ten.

Illumina HiSeq 4000

The baby brother of the HiSeq X, I actually think it’s the same machine, except with smaller flowcells (possibly a different camera or laser).  Expect 300million 150PE reads per lane (same setup, two flowcells, each with 8 lanes, 3.5 day run time).  That’s 90Gbase per lane.  Same caveats apply – ordered flowcells are tricky and it’s easy to generate lots of “optical duplicates”.  No limitations, so you can run anything you like on this.  The new workhorse of the Illumina stable.  Buy one of these if you have lots and lots of things to sequence, and you want to run a variety of different protocols.

Illumina HiSeq 2500

One of the most reliable machines Illumina has ever produced.  Two modes: high output has the classic 2 flowcell, 8 lane set-up and takes 6 days; rapid is 2 flowcells, 2 lanes and takes less time (all run times depend on read length).  High output V4 is capable of 250 million 125PE reads, and rapid of 120 million 250PE reads.   The increased throughput of the 4000 makes the 2500 more expensive per Gb, so only buy a 2500 if you can get a good one cheap second-hand, or get offered a really great deal on a new one.  Even then, outsourced 4000 data is likely to be cheaper than generating your own data on a 2500.

Illumina NextSeq 500

I’ve never really seen the point – small projects can go on a MiSeq, and medium to large projects fit better, and are cheaper, as a subset of lanes on the 2500/4000.  The machine only measures 3 bases, with the 4th base being an absence of signal.  This means the runs are ~25% quicker.  I am told V1 data was terrible, but V2 data is much improved.  NextSeq flowcells are massive, the size of an iPad mini, and have four lanes, each capable of 100 million 150PE reads.  Illumina claim these are good for “Everyday exome, transcriptome, and targeted resequencing”, but realistically these would all be better and cheaper run multiplexed on a 4000.

Illumina MiSeq

A great little machine, with one lane per run.  V2 is capable of 12 million 250PE reads per run; V3 claims 25 million 300PE reads, but good luck getting those – there has been a problem with V3 300PE for as long as I can remember; it just doesn’t work well.  Great for small genomes and 16S.

Illumina MiniSeq

I suspect Illumina will sell tons of these as they are so cheap (< $50k), but no-one yet knows how well it will run.  Supposedly capable of 25 million 150PE reads per run, that’s 7.5Gbase of data.  You could just about run a single RNA-Seq sample on there, but why would you?  A possible replacement for the MiSeq if they get the read length up.  Could be good for small genomes and possibly 16S.  Illumina claim it’s for targeted DNA and RNA samples, so it could work well with e.g. gene panels for rapid diagnostics.


 

One interesting downside of Illumina machines is that you have to fill the whole flowcell before you can run the machine.  What this means is that, despite the fact that Illumina’s cost per Gb is smaller, small projects can be cheaper and faster on other platforms.


Ion Torrent and Ion Proton

The people I meet who are happy with their Ion* platforms are generally diagnostic labs, where run time is really important (they are faster than Illumina machines) and where absolute base-pair accuracy is not.   No-one I know who works in genomics research uses Ion* data – it’s just not good enough.  Major indel problem, and Illumina data is cheaper and better.

PacBio Sequel

No-one has seen any data, but this looks like an impressive machine.   There are 1 million ZMWs per SMRT cell and about 30-40% will return useable data.  Useable data will be 10-20Kb reads at 85% raw accuracy, correctable to 99% accuracy.  Output at launch is 5-10Gbase per SMRT cell, and PacBio expect to produce 20Kb and 30Kb library protocols in 2016.  Great for genome assembly and structural variation; not quite quantitative for RNA-Seq, but fantastic for gene discovery.  Link this up to NimbleGen’s long-fragment capture kits and you can target difficult areas of the genome with long reads.  Machine cost is £300k, so good value compared to the RSII.  These will fly off the shelf.

PacBio RSII

The previous workhorse of PacBio, capable of 2Gbase of long reads per SMRT cell.  Cool machine, but overshadowed by the Sequel – I wouldn’t recommend buying one.

Oxford Nanopore MinION

The coolest sequencer on the planet: a $1000 access fee gets you a USB sequencer the size of an office stapler.   Each run on the mark I MinION can produce several hundred Mb of 2D data, and fast mode (in limited early access) promises to push this into the Gbases.  Read lengths are a mean of 10Kb, with raw 2D accuracy at 85% and a range of options for correction to 98-99% accuracy.  We use it for scaffolding bacterial genomes, and also for pathogen detection.  Should you buy one?  You should have one already!

Oxford Nanopore PromethION

The big brother of the MinION, this is best imagined as 48 bigger, fast-mode MinIONs run in parallel.  If a fast-mode MinION can produce 1Gbase per run, the PromethION will produce 300Gbase per run.  This machine is in limited early access, but offers the possibility of long-read, population-scale sequencing.  The access fee is $75,000, but expect to spend ten times that on compute to deal with the data.  Get one if you can deal with the data.

 
