Opiniomics

bioinformatics, genomes, biology etc. "I don't mean to sound angry and cynical, but I am, so that's how it comes across"

Category: sequencing

Did you notice the paradigm shift in DNA sequencing last week?

There is a very funny tweet by Matthew Hankins about the over-use of “paradigm shift” in articles indexed in Scopus between 1962 and 2014, which clearly suggests we over-use the term.

However, last week a genuine shift in DNA sequencing was published on bioRxiv by Matt Loose: “Real time selective sequencing using nanopore technology“.  So what makes this paper special?  Why is this a genuine paradigm shift?

Well, this is the first example, ever, of a DNA sequencer selecting regions of the input genome to sequence.   To be more accurate, Matt demonstrates the MinION sequencing DNA molecules and analyzing them in real time; testing whether they come from a region of the genome he is interested in; and if they are not, rejecting them (“spitting them out”) and making the nanopore available for the next read.

This has huge applications from human health through to pathogen genomics and microbiome research.  There are many applications in genomics where you are only interested in a subset of the DNA molecules in your sample.  For example, exome sequencing, or other sequence capture experiments – at the moment a separate capture reaction needs to take place and this can actually be more expensive than the sequencing.   Or pathogen diagnostics – here most of what you sequence would be host DNA, which you could ask the sequencer to reject, leaving only the pathogen to sequence; or flip it round – human saliva samples can have up to 40% oral microbiome DNA, which is wasted if what you want to sequence is the host.  I have worked with piRNA and siRNA datasets from insects where less than 1% of the Illumina reads are those I actually want.  If you want to know if a particular bacterial infection is resistant to antibiotics, you only actually need to sequence the antibiotic-resistance genes/islands/plasmids, not the entire genome. etc etc etc

Of course, we still have a bit of a way to go – Matt demonstrates so-called “read until” on Lambda, a relatively small, simple genome – but this is an amazing proof-of-principle of an incredible idea – that you can tell your sequencer what to sequence.  The paper deserves your attention.  Please read it and fire up your imagination!
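To make the idea concrete, below is a minimal sketch in Python of the decision loop any “read until” implementation has to make.  Everything in it is hypothetical and simplified – a toy random genome, a naive exact-match “mapper” and a fixed prefix length stand in for the real-time signal matching Matt actually performs – but the logic is the same: map the start of each molecule, and eject it if it doesn’t come from the region you care about.

import random

random.seed(42)
GENOME = "".join(random.choice("ACGT") for _ in range(100_000))
TARGET = (10_000, 20_000)  # the region of the genome we are interested in

def map_prefix(prefix, genome):
    # Toy "mapper": exact string search; the real method matches the raw
    # nanopore signal against the reference in real time
    return genome.find(prefix)

def decide(read, prefix_len=250):
    """Sequence the molecule if its prefix maps inside TARGET, else eject it."""
    pos = map_prefix(read[:prefix_len], GENOME)
    if pos >= 0 and TARGET[0] <= pos < TARGET[1]:
        return "sequence"
    return "reject"  # reverse the voltage, spit the molecule out, free the pore

# Simulate a few molecules arriving at a pore
for _ in range(5):
    start = random.randrange(len(GENOME) - 5_000)
    print(start, decide(GENOME[start:start + 5_000]))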

Doing the maths on PromethION throughput

In case you don’t know, the PromethION is the bigger brother of the MinION, Oxford Nanopore’s revolutionary mobile single-molecule DNA sequencer.  The PromethION is about to go into early access, where I believe the access fee is just $75k, most of which can subsequently be spent on reagents/flowcells in the ONT shop.  The hidden costs are of course the bioinformatics, but that’s not what this post is about.

What I wanted to do is predict PromethION throughput using current MinION stats.  So let’s do that.

I will be deliberately conservative with the MinION throughput stats.  It’s a matter of record, I think, that a MinION 48hr run is easily capable of producing 100-200Mbase of 2D reads, most of which are ~8Kb in length.  I’m going to use this figure as a basic MinION throughput, though many will argue that SQK-MAP-006 runs produce far more.

I want you to focus on three parameters: the rate at which DNA translocates through the pores, currently ~70b/s on MinION; the number of channels per flowcell (512 on MinION); and the number of flowcells per run (1 on MinION).   So 1 flowcell with 512 channels running at 70b/s will produce 100-200Mbase of high quality 2D sequence data.

PromethION flowcells are 4 times the size of MinION flowcells, with 2048 channels.  You can also run 48 flowcells in parallel.  Therefore, if all PromethION ends up being is 48 large MinIONs running in parallel, we can predict 4 * 48 * 100-200Mbase == between 19.2 and 38.4Gbase per run.

However, let’s go back to that translocation speed.   I believe the aim is for the PromethION to run at 400b/s.  There is a huge caveat here in that we haven’t seen any data from ONT running at 400b/s; however, logically the faster the DNA moves through the pore, the more sequence data you will get.  The potential is therefore for PromethION to produce 5 times the above – so between 96Gbase and 192Gbase of high quality 2D reads (note – many more template and complement reads will be produced). (2nd note – there is no guarantee the 2D protocol will be around long term, the aim is to produce 1D data at a  much higher quality, which will also increase throughput)
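For what it’s worth, here is that back-of-the-envelope arithmetic as a few lines of Python, using the figures quoted above (and note the corrections in the updates below):

# Figures as quoted in the text; see the updates below for corrections
minion_2d_mbase = (100, 200)  # Mbase of 2D data per 48hr MinION run
flowcell_scale = 4            # PromethION vs MinION flowcell size (Update 2: really 3000 vs 512 channels)
n_flowcells = 48
speed_scale = 5               # 400b/s vs ~70b/s is nearer 5.7x; rounded down to 5 here

for mbase in minion_2d_mbase:
    parallel_gbase = mbase * flowcell_scale * n_flowcells / 1000
    print(f"{parallel_gbase:.1f} Gbase at 70b/s; {parallel_gbase * speed_scale:.0f} Gbase at 400b/s")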

These are obviously “back of an envelope” calculations, and almost certainly wrong in some respect, but hopefully this puts the PromethION in context with the other sequencers on the market.

Update 1: via Nick Loman, the F1000 data was SQK-MAP-005 and ran at 30b/s, not 70

Update 2: via Matt Loose, PromethION flowcells are 3000 channels not 2048

Update 3: via Nick Loman, PoreCamp (SQK-MAP-006) run produced 460Mbase 2D data

 

* bad maths is a product of lack of sleep

** the blog is called opiniomics for a reason

Bioinformatics error responsible for retraction in Science

Update: sorry, my mistake – the paper has not been retracted; an Erratum has been issued

If you follow the same kind of feeds and blogs that I do, you can’t have missed the recent retraction in Science that was due to a simple bioinformatics error.  Others have covered this in way more detail, but the essential points are that the authors sequenced an ancient genome from an individual they called “Mota”, and compared that genome to a bunch of other human genomic datasets to put it in context – except when they compared to a reference panel of SNPs, a necessary conversion script wasn’t run, which meant that a huge number of SNPs that Mota shared with Europeans were dropped, making Mota appear less European than he actually was.

The thing that strikes me most about this story is that this could have been so easily avoided.

There are a few things I tell my post-docs and students over and over.  One of them, perhaps the most important, is to check your work.  A second and related point is that if you get wonderful results, or unusual and striking results, they are probably wrong and you should check your work.  We tried to encapsulate some of this advice in our paper – So you want to be a computational biologist?  Honestly, I know it’s boring, but double- and triple-checking your work is so important.  Whatever you find, at the very least take a few data points and check again to make sure you get the same result.

In the above example, what I personally would have done is write a script to make sure that the “255,922 SNPs out of 256,540” were genuinely missing from the samtools output.  This may sound daunting but it really isn’t – possibly 10 or 20 lines of Python or Perl could have saved the day.
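As an illustration of what I mean (and only an illustration – the file names and column layout below are hypothetical, not those from the actual paper), the whole check boils down to set arithmetic on (chromosome, position) pairs:

def positions(path, chrom_col=0, pos_col=1):
    # Collect (chrom, pos) pairs from a tab/whitespace-delimited file,
    # skipping header lines that start with "#"
    snps = set()
    with open(path) as fh:
        for line in fh:
            if line.startswith("#"):
                continue
            fields = line.split()
            snps.add((fields[chrom_col], int(fields[pos_col])))
    return snps

panel = positions("reference_panel_snps.txt")   # the 256,540 panel SNPs
called = positions("mota_samtools_output.vcf")  # what samtools actually reported

missing = panel - called
print(f"{len(missing)} of {len(panel)} panel SNPs are absent from the samtools output")
# If that prints anything like 255,922, stop and work out why before you publish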

I do wonder if this was a case of a lonely bioinformatician – I genuinely don’t know who did what on the paper, but so frequently errors creep in when wet and dry lab scientists communicate in ever-so-slightly different languages.

Whatever the issue – the take home is this: please, for the love of everything that is good in the world, CHECK YOUR RESULTS!

Which sequencer should you buy?

My aim for this post is a quick, pithy review of available sequencers, what they’re good for, what they’re not and under which circumstances you should buy one.   However, your first thought should be: do I need to buy a sequencer?  I know it’s great to have your own box (I love all of ours), but they are expensive, difficult to run, temperamental and time-consuming.  So think on that before you spend millions of pounds of your institute’s money – running the sequencers may cost millions more.  My blog post on choosing an NGS supplier is still valid today.

Illumina HiSeq X Ten

To paraphrase Quentin Tarantino, you buy one of these “When you absolutely, positively got to sequence every person in the room, accept no substitutes”.  The HiSeq X Ten is actually ten HiSeq X instruments; each instrument can run two flowcells and each flowcell has 8 lanes.  Each lane will produce 380-450million 150PE reads (120Gbase of data, or 40X of a human genome).  Runs take 3 days.  Expect to pay an extra £1M on computing to cope with the data streams.  Ordered flowcells are quite difficult to deal with and can result in up to 30% “optical duplicates” (actually spill-over from one well to adjacent wells).  You can produce 160 genomes every 3 days.  Essentially now used as a very cheap, whole-genome genotyping system; cost per genome is currently £850+VAT at Edinburgh Genomics.  Limited to 30X (or greater) genome sequencing.  I have checked with Illumina and this is definitely still true.

Illumina HiSeq X Five

Do not buy one of these.  The reagents are 40% more expensive for no good reason.  Simply outsource to someone with an X Ten.

Illumina HiSeq 4000

The baby brother of the HiSeq X – I actually think it’s the same machine, except with smaller flowcells (possibly a different camera or laser).  Expect 300million 150PE reads per lane (same setup: two flowcells, each with 8 lanes; 3.5 day run time).  That’s 90Gbase per lane.  Same caveats apply – ordered flowcells are tricky and it’s easy to generate lots of “optical duplicates”.  No limitations, so you can run anything you like on this.  The new workhorse of the Illumina stable.  Buy one of these if you have lots and lots of things to sequence, and you want to run a variety of different protocols.

Illumina HiSeq 2500

One of the most reliable machines Illumina has ever produced.  Two modes: high output has the classic 2 flowcell, 8 lane set-up and takes 6 days; rapid is 2 flowcell, 2 lanes and takes less time (all run times depend on read length).  High output V4 is capable of 250million 125PE reads, and rapid is capable of 120million 250PE reads.  The increased throughput of the 4000 makes the 2500 more expensive per Gb, so only buy a 2500 if you can get a good one cheap second-hand, or get offered a really great deal on a new one.  Even then, outsourced 4000 data is likely to be cheaper than generating your own data on a 2500.

Illumina NextSeq 500

I’ve never really seen the point – small projects can go on MiSeq, and medium to large projects fit better and are cheaper as a subset of lanes on the 2500/4000.  The machine only measures 3 bases, with the 4th base being an absence of signal.  This means the runs are ~25% quicker.  I am told V1 data was terrible, but V2 data much improved.  NextSeq flowcells are massive, the size of an iPad mini, and have four lanes, each capable of 100million 150PE reads.  Illumina claim these are good for “Everyday exome, transcriptome, and targeted resequencing“, but realistically these would all be better and cheaper run multiplexed on a 4000.

Illumina MiSeq

A great little machine, one lane per run.  V2 is capable of 12million 250PE reads per run; V3 claims 25million 300PE reads, but good luck getting those – there has been a problem with V3 300PE for as long as I can remember: it just doesn’t work well.  Great for small genomes and 16S.

Illumina MiniSeq

I suspect Illumina will sell tons of these as they are so cheap (< $50k), but no-one yet knows how well they will run.  Supposedly capable of 25million 150PE reads per run; that’s 7.5Gbase of data.  You could just about run a single RNA-Seq sample on there, but why would you?  A possible replacement for the MiSeq if they get the read length up.  Could be good for small genomes and possibly 16S.  Illumina claim it’s for targeted DNA and RNA samples, so could work well with e.g. gene panels for rapid diagnostics.


 

One interesting downside of Illumina machines is that you have to fill the whole flowcell before you can run the machine.  What this means is that despite the fact that Illumina’s cost per Gb is lower, small projects can be cheaper and faster on other platforms.


Ion Torrent and Ion Proton

The people who I meet who are happy with their Ion* platforms are generally diagnostic labs, where run time is really important (they are faster than Illumina machines) and where absolute base-pair accuracy is not important.  No-one I know who works in genomics research uses Ion* data – it’s just not good enough.  There is a major indel problem, and Illumina data is cheaper and better.

PacBio Sequel

No-one has seen any data but this looks like an impressive machine.  There are 1 million ZMWs per SMRT cell and about 30-40% will return useable data.  Useable data will be 10-20Kb reads at 85% raw accuracy, correctable to 99% accuracy.  Output at launch is 5-10Gbase per SMRT cell, and PacBio expect to produce 20Kb and 30Kb library protocols in 2016.  Great for genome assembly and structural variation, not quite quantitative for RNA-Seq but fantastic for gene discovery.  Link this up to NimbleGen’s long-fragment capture kits and you can target difficult areas of the genome with long reads.  Machine cost is £300k, so good value compared to the RSII.  These will fly off the shelf.

PacBio RSII

The previous workhorse of PacBio, capable of 2Gbase of long reads per SMRT cell.  Cool machine, but over-shadowed by the Sequel; I wouldn’t recommend buying one.

Oxford Nanopore MinION

The coolest sequencer on the planet: a $1000 access fee gets you a USB sequencer the size of an office stapler.  Each run on the mark I MinION can produce several hundred Mb of 2D data, and fast mode (in limited early access) promises to push this into the Gbases.  Read lengths have a mean of 10Kb, with raw 2D accuracy at 85% and a range of options for correction to 98-99% accuracy.  We use it for scaffolding bacterial genomes, and also for pathogen detection.  Should you buy one?  You should have one already!

Oxford Nanopore PromethION

The big brother of the MinION, this is best imagined as 48 bigger, fast-mode MinIONs run in parallel.  If fast mode MinION can produce 1Gbase per run, the PromethION will produce 300Gbase per run.  This machine is in limited early access, but offers the possibility of long-read, population scale sequencing.  Access fee is $75,000 but expect to spend ten times that on compute to deal with the data.  Get one if you can deal with the data.

 

Generate a single contig, hybrid assembly of E coli using MiSeq and MinION data

If you have MiSeq and ONT MinION data, it is now very easy to combine them into a single-contig, high quality assembly.  Here I will start with a base Ubuntu 14.04 LTS distribution and take you through every step, including software installation.  I am starting with an Amazon EC2 m4.xlarge AMI with 4 vCPUs and 16Gb RAM.  If you want to repeat the work below, you don’t need to use EC2, any computer running Ubuntu and with enough disk/RAM should work.

For MinION data, we will use Nick’s SQK-MAP-006 data at the EBI.  Using a simple trick, and because I know the internal structure of the tar archive, we can download only the base-called data:


curl ftp://ftp.sra.ebi.ac.uk/vol1/ERA540/ERA540530/oxfordnanopore_native/MAP006-1.tar | tar -xf - MAP006-1/MAP006-1_downloads &

We can also get some MiSeq data from the ENA for the same strain, E coli K-12 MG1655:


wget -q ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR262/005/SRR2627175/SRR2627175_2.fastq.gz &
wget -q ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR262/005/SRR2627175/SRR2627175_1.fastq.gz &

Now, while that is downloading, we need to install some software on the server:


sudo apt-get update
sudo apt-get install r-base-core git samtools

This installs R (which we will use, via poRe, to extract the MinION data), plus git and samtools.  Just reply “Y” to any questions.

We now need to install the poRe dependencies in R, which is very easy:


R
source("http://www.bioconductor.org/biocLite.R")
biocLite("rhdf5")
install.packages(c("shiny","bit64","data.table","svDialogs"))
q()

R may ask if you want to install into a local library; just say Y and accept the defaults.  We need to download poRe from SourceForge; we are using version 0.16.

Once downloaded, and back at the Linux command line:


R CMD INSTALL poRe_0.16.tar.gz

The fastq extraction scripts for poRe are on GitHub, so let’s go get those:


git clone https://github.com/mw55309/poRe_scripts.git

We will assemble using SPAdes, so let’s go get that:


wget http://spades.bioinf.spbau.ru/release3.6.2/SPAdes-3.6.2-Linux.tar.gz
gunzip < SPAdes-3.6.2-Linux.tar.gz | tar xvf -

Now, we are ready to go.  First off, let’s extract the 2D sequence data as FASTQ from the MinION data.  Nick’s SQK-MAP-006 data are in the old FAST5 format so we use the script in “old_format”:


./poRe_scripts/old_format/extract2D MAP006-1/MAP006-1_downloads/pass/ > minion.pass.2D.fastq &

Finally we can use SPAdes to assemble the data.  SPAdes is a swiss-army knife of genome assembly tools, and by default includes read correction.  This takes up lots of RAM, so we are going to skip it.  We will also only use 3 kmers to save time:


./SPAdes-3.6.2-Linux/bin/spades.py --only-assembler \
                                   -t 4 -k 21,51,71 \
                                   -1 SRR2627175_1.fastq.gz \
                                   -2 SRR2627175_2.fastq.gz \
                                   --nanopore minion.pass.2D.fastq \
                                   -o SPAdes_hybrid &

Use head to see the name of the top contig, and samtools faidx to extract it:


head -n 1 SPAdes_hybrid/contigs.fasta
samtools faidx SPAdes_hybrid/contigs.fasta
samtools faidx SPAdes_hybrid/contigs.fasta NODE_1_length_4620446_cov_135.169_ID_22238 > single_contig.fa
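One caveat: the NODE_1_length_4620446_cov_135.169_ID_22238 name above is specific to my assembly, and yours will almost certainly differ.  If you would rather not hard-code the contig name, here is a small stdlib-only Python sketch (my addition, not part of the original walkthrough) that pulls out the longest contig whatever it is called:

def read_fasta(path):
    # Yield (name, sequence) tuples from a FASTA file
    name, seq = None, []
    with open(path) as fh:
        for line in fh:
            if line.startswith(">"):
                if name is not None:
                    yield name, "".join(seq)
                name, seq = line[1:].split()[0], []
            else:
                seq.append(line.strip())
        if name is not None:
            yield name, "".join(seq)

name, seq = max(read_fasta("SPAdes_hybrid/contigs.fasta"), key=lambda rec: len(rec[1]))
with open("single_contig.fa", "w") as out:
    out.write(">%s\n%s\n" % (name, seq))
print(name, len(seq))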

Finally, a quick comparison to the reference:


sudo apt-get install mummer 
curl -s "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=NC_000913.3&rettype=fasta&retmode=txt" > NC_000913.3.fa
nucmer NC_000913.3.fa single_contig.fa
mummerplot -png out.delta
display out.png &

[Image: mummerplot of the single-contig hybrid assembly aligned against the E coli K-12 reference, NC_000913.3]

And that is a single global alignment between reference and query 🙂

Simple huh?

How to extract FASTQ from the new MinION FAST5 format using poRe

A little while ago I demonstrated how to extract FASTQ from MinION FAST5 files using Rscript and the Linux command line.

In that article, I described how to extract the different FASTQ data using a parameter:

# 2D is the default
invisible(apply(array(f5files), 1, printfastq))
# template
invisible(apply(array(f5files), 1, printfastq, f5path="/Analyses/Basecall_2D_000/BaseCalled_template"))
# complement
invisible(apply(array(f5files), 1, printfastq, f5path="/Analyses/Basecall_2D_000/BaseCalled_complement"))

The f5path parameter is the address of the FASTQ data within the FAST5 file, and that’s been changed recently by Oxford Nanopore.  Well, we can very easily edit our scripts to cope with the new format:

# 2D is the default
invisible(apply(array(f5files), 1, printfastq))
# template
invisible(apply(array(f5files), 1, printfastq, f5path="/Analyses/Basecall_1D_000/BaseCalled_template"))
# complement
invisible(apply(array(f5files), 1, printfastq, f5path="/Analyses/Basecall_1D_000/BaseCalled_complement"))

And that’s it!  No new version of poRe required, simply a couple of edits to scripts – example scripts can be found on GitHub.

We definitely do not need a “Steve Jobs” in biology

If you follow me on Twitter, you cannot help but know that I am not Steve Jobs’ biggest fan – in fact I have frequently called him the most over-rated man in history.  I stand by that.  Steve Jobs is just a common-or-garden billionaire capitalist.  No better or worse than any other billionaire capitalist.  As such he deserves some respect and I give him that – well done, Steve, for taking things and selling enough of them to make yourself a billionaire.  It’s a feat that deserves respect – I couldn’t do it.  Why then is he idolized?  I don’t have all of the answers, but I suspect it has something to do with the fact he ran a company that designed cool-looking consumer devices that, for a minority, defined them and defined their entire lives.

So why am I writing about Steve Jobs?

Because Eric Schadt said the biggest need in biology today is a “Steve Jobs”.

I was pretty astounded when I read that, because Eric is an intelligent man.  So why the ridiculous statement?  Let’s look at this a bit more.

What did Steve Jobs actually do?

Steve Wozniak, who designed the technology behind early Apple devices, has said that Steve Jobs wasn’t interested in technology, he just wanted to be important.  Sour grapes?  Perhaps; though perhaps Wozniak was the real creative genius, and resents the fact that Jobs got all of the credit?

Others have analysed what Steve Jobs actually did (many ask the question), and I quote from Ian McCullough below:

He was central to the mass marketing and consumerization of the personal computer — but he by no means invented it. That credit most directly goes to people like Ivan Sutherland, Alan Kay, Douglas Engelbart, and the people who worked at Xerox PARC and the Stanford Research Institute during the 1960’s and 1970’s. Let us also remember that he didn’t invent the portable digital music player. Looking past the Sony Walkman which first met the general desire to carry music with you, the first known mention of a digital music player is Kane Kramer’s IXI in 1981. In the US, the Diamond Rio was the first widely available portable MP3 player in the late ’90s. The iPod didn’t come out until 2001. How about multi-touch screens? It’s a big “no” there too. That was Bent Stumpe & Frank Beck and later Bell Labs at Murray Hill, with some early experimentation from musicians like Hugh Le Caine and Bob Moog. And what of Pixar? He actually bought what became known as Pixar from none other than George Lucas in 1986. It was Ed Catmull that drove the Pixar technology side and ultimately John Lasseter that drove the creative side. (Steve does get credit there for cultivating the business, building the relationship with The Walt Disney Company, and – according to all third-party reports – smartly staying out of the way.) To be completely fair: as far as I know he never actually laid claim to these accomplishments, but he is so tightly associated with them in the popular mind that many people make the errors.

Steve Jobs was a salesman of unparalleled natural talent, an astute curator, and a very tough editor. It is clear that he made the ideas and products that he came into contact with significantly better and extremely desirable to consumers, but he did not originate those things.

So, from what I can gather, Steve Jobs took what other people invented, made it cool and desirable and sold it for great profit.  Remind me: why do we need that in biology?

What did Eric Schadt actually say?

Eric’s vision, and it is one that I and many others share, is that if you collect enough relevant information – both scientific data and patient data – and you include “additional” data feeds such as microbiome and the local environment, then we will be able to build predictive models that improve the treatment of disease in humans.  The models can and will be refined by experimental validation, which will feed back into and improve the models.

That’s cool, and as I said, many people share this vision.  In fact, in many ways this describes molecular biology research of the last 10-15 years.

Eric goes on to say that we need to put these models in the hands of doctors who can work with them, who don’t need to understand the models, but who need to be able to use them.  And that’s where he brings Steve Jobs into the equation, and here is what he said:

“I think the number one need in biology today is a Steve Jobs. Where is the Steve Jobs of biology who can lead the design of amazing, intuitive interfaces into these complex data?”

Wait, wut?  Firstly, what amazing intuitive interface is he talking about?  Angry Birds?  The iPod wheel?  Touch screens (which neither Steve Jobs nor Apple invented)?  A high definition screen?  Aluminium?  Icons?  Is he talking about iOS icons?  Does Eric think that what biology needs is a set of cartoonish icons designed for toddlers?

*breathe*

Back to what Eric said: do we need these models?  Absolutely, of course we do!  Do we need doctors to be able to use them?  Damn right!  Do we need an amazing intuitive interface to them?  Absolutely.  Sign me up!  Did Steve Jobs ever invent such an interface?  Of course he fucking didn’t, he’s a fucking salesman!

Do we need an amazing salesman who steals your personal data in biology/medicine?  FUCK NO!

What else did Eric Schadt say?

He said a bunch of stuff that I agree with.  He’s hiring intelligent people from machine learning, high-performance computing, network modelling etc., and I think that’s a great idea (though hiring some dude from Facebook also raised my hackles slightly).  He talks about openness and data sharing, and the problems he faces with a sense of data ownership.

I mean, all in all I agree with Eric – apart from that dumb reference to Steve Jobs!

An inconvenient truth

There are several inconvenient truths in biology today that might prevent Eric’s and others’ vision from coming true.  However, here are the main two:

  1. We cannot accurately measure everything we need to measure.  We don’t survey the whole of the genome.  We are not able to measure or quantify proteins and other small molecules accurately.  Our knowledge of epigenetics, 3D genome structure, the action of enhancers, ncRNA, microbiome etc is not refined enough.  Put simply, we don’t have all of the data that the model needs, nor can we easily get it.
  2. Once we do have all of the data, the most intelligent machine on the planet, the human brain, will be incapable of understanding it all.  It’s high-dimensional data that we as humans simply can’t understand.  Jun Wang (or Wang Jun as I know him) is right when he says that artificial intelligence is the future of genomics – the only thing that can genuinely understand the complexity that is a living cell, or living organism, is an intelligence far greater than our own.  The problem is AI is nowhere near this capability yet.

I’m not saying that machine learning, models and “big data” won’t advance knowledge and insight; it may even provide us with additional treatments.  But 1000s of diseases will remain uncured until such time as “we” can understand biology, and we’re quite a long way from that at the moment.  Credit to Eric for trying.

A last word on Steve Jobs

Imagine a world with two billionaires, both of whom made their money by revolutionizing the computing and electronic devices world.  One of them sets up the largest transparent private foundation in the world, which aims to enhance healthcare and reduce extreme poverty globally.  The other one dies.

Which one do we need in biology?

What I know about the expansion of HiSeq X to non-human genomes

Update 15:31:


Well, it isn’t very much…..

Illumina announced that they will now allow researchers to sequence non-human genomes on the HiSeq X and, well, we have HiSeq X systems, so we have a tiny bit of knowledge gleaned from our communications with them.

What they say is:

The updated rights of use will allow for market expansion and population-scale sequencing of non-human species in a variety of markets, including plants and livestock in agricultural research and model organisms in pharmaceutical research. Previously, it has been cost prohibitive to sequence non-human genomes at high coverage.

This is a bit vague, but there are a few keywords in there – “genomes” and “high coverage”.

My opinion is that this was done after pressure from the mouse and rat model-organism communities, who want to do thousands of 3Gb 30X genomes, just not human genomes.  We (as in Roslin) have also been putting on pressure to enable us to do farm animal genomes.

It’s only genomes: so not RNA-Seq, ChIP-Seq, exomes, amplicons etc – none of that.  Just genomes.

It’s only high coverage: 30X or above.  So no 10X, 5X or 1X genomes.

We asked about metagenomics (WGS): maybe.  (Update 08/10/2015: metagenomics is not supported.)

We asked about smaller genomes (100Mb): (Update 12/10/2015) small genomes are allowed; the issue is that only 12 indexes are available per lane – each lane producing around 120Gbase of data.  That’s 10Gbase per index (or 2000-fold coverage of a 5Mb bacterial genome).

We asked about medium-sized genomes (1Gb): maybe (how a 30X 1Gb genome is different to a 10X 3Gb genome is anyone’s guess, but apparently, maybe, it is).

Bisulfite sequencing: my understanding is that this is supported/allowed (Update 08/10/2015)

Having said all that – my opinion is that what Illumina want to stimulate are massive, large-scale genomics projects.  Think on a scale of 1000s of 30X human genomes (90 terabases).  If you have fly, yeast, C elegans, metagenomics or small genome projects that are on that scale, I should think Illumina would be very happy to talk to you about getting them sequenced on an X somewhere.  However, if you’re smaller scale than that, well – we have the HiSeq 4000 systems at Edinburgh Genomics, which offer incredible value for money.

 

What does the PacBio Sequel mean for the future of sequencing?

PacBio have responded to intense competition in the sequencing market by releasing the Sequel, a machine that promises 7 times the throughput of their last machine (the RSII) at a price of $350k (the RSII cost in the region of $750k).  So they have a cheaper, higher-throughput machine (though I haven’t seen cost-per-Gb figures).  That’s around 7 Gigabases of reads averaging around 15Kb in length.  Without doubt this is a very interesting platform and they will sell as many of them as they can produce.  A sweet spot for assembly is still 50-60X for PacBio, so think 12 SMRT cells to get a human genome: let’s say £12k per genome.  Edit 01/10/2015: my maths at 7am was not that good!  A 50X human genome is 150Gb, so that’s 21 SMRT cells and £21k per human genome.  Much more expensive than Illumina’s $1000 genome, but far, far better.
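The corrected sums, sketched in Python (the per-cell yield I use is the mid-range of the launch spec above, and the implied ~£1k-per-SMRT-cell cost is my assumption backed out of the £21k figure, not published pricing):

genome_gbase = 3.0           # human genome size
coverage = 50                # assembly sweet spot quoted above
cell_yield_gbase = 7.0       # mid-range of the 5-10Gbase-per-SMRT-cell launch spec
cost_per_cell_gbp = 1000.0   # assumption implied by the figures above

cells = round(coverage * genome_gbase / cell_yield_gbase)
print(f"{cells} SMRT cells, ~£{cells * cost_per_cell_gbp / 1000:.0f}k per 50X human genome")
# -> 21 SMRT cells, ~£21k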

I just want to say that at times I have been accused of being an Illumina- and a nanopore- fanboy; I am neither and both.  I am just a fan of cool technology, from microarray in the 90s to sequencing now.

In long reads, let’s be clear, we are talking about the promise of Oxford Nanopore vs the proven technology of PacBio.  And the Sequel changes the dynamics.  However, the MinION fast mode is capable of throughput in the region of 7Gb (like the Sequel) and the PromethION is capable of throughput on Illumina-scale.   Therefore, Oxford Nanopore are far from dead – though they need to respond.

So how does the new PacBio Sequel change the market?  A lot of initial reactions I have had are that the Sequel is a real threat to Oxford Nanopore.  It certainly ramps up the competition in the long read space, which is a really good thing.  But actually, high-throughput long read machines like the Sequel and the PromethION don’t spell the end for one another – they actually spell the beginning of the end for Illumina – as a sequencing platform.

As soon as you have high-throughput, cheap long reads, it is in fact Illumina who face a problem.  I love Illumina.  When I first arrived at Roslin, I walked into our lab and (honestly!) stroked our Illumina GAIIx.  Illumina have revolutionised biology.  However, short reads have limitations – they are bad for genome assembly, they are bad at complex genomes, they’re actually quite bad at RNA-Seq, they are pretty bad for structural variation, they are bad at haplotypes and SNP phasing, and they are not that great at metagenomics.  What has made Illumina the platform of choice for those applications is scale – but as soon as long read technologies reach a similar scale, Illumina looks like a poor choice.

The Sequel (and the PromethION) actually challenge Illumina – because in an era of cheap, long read sequencing, Illumina becomes a genotyping platform, not a sequencing platform.

Extracting MinION fastq on the command line using poRe

Hopefully most of you will be aware of our software for handling MinION data, poRe, published in Bioinformatics this year:

Watson M, Thomson M, Risse J, Talbot R, Santoyo-Lopez J, Gharbi K, Blaxter M. (2015) 
poRe: an R package for the visualization and analysis of nanopore sequencing data. 
Bioinformatics 31(1):114-5.

poRe is based on R; many people consider R to be an interactive package, but it is very easy to write command-line scripts for R in very much the same way as you write Perl or Python scripts.  Below is a script that can be used in conjunction with R and poRe to extract 2D fastq data from a directory full of fast5 files:

#!/usr/bin/env Rscript

# get command line arguments as an array
args <- commandArgs(trailingOnly = TRUE)

# if we don't have any command line arguments
# print out usage
if (length(args) == 0) {
        stop("Please provide a directory containing fast5 files as the first (and only) argument")
}

# the first argument is the directory, stop if it doesn't exist
my.dir <- args[1]
if (! file.exists(my.dir)) {
        stop(paste(my.dir, " does not exist"))
}

# get a list of all of the files with a .fast5 file name extension
f5files <- dir(path = my.dir, pattern = "\\.fast5$", full.names = TRUE)

# print to STDERR how many files we are going to attempt
write(paste("Extracting 2D fastq from", length(f5files), "files"), stderr())

# load poRe
suppressMessages(library(poRe))

# apply printfastq to every entry in f5files
# printfastq tests for 2D fastq and if found prints to STDOUT
invisible(apply(array(f5files), 1, printfastq))

The suppressMessages() function means that none of the start-up output is printed when the poRe library is loaded, and the invisible() function suppresses the natural output of apply(), which would otherwise return an array of NULLs the same length as f5files (and which we definitely do not want).

The above is based on R 3.1.2 and poRe 0.9.

We could extract template and complement fastq (respectively) by substituting in the following lines:

# template
invisible(apply(array(f5files), 1, printfastq, f5path="/Analyses/Basecall_2D_000/BaseCalled_template"))
# complement
invisible(apply(array(f5files), 1, printfastq, f5path="/Analyses/Basecall_2D_000/BaseCalled_complement"))

 
