Opiniomics

bioinformatics, genomes, biology etc. "I don't mean to sound angry and cynical, but I am, so that's how it comes across"


Why nanopore sequencing errors aren’t actually errors

Let me start by saying that what I’ve written below are opinions formed after many conversations with many different people, including some from ONT, over the last 2-3 years.  I don’t want anyone to think I am stealing their ideas – but I think it’s important to get them out there.

Oxford Nanopore’s technology, currently available in the MinION device, is the only sequencing technology to directly sequence an actual strand of DNA.  Every other technology (including Sanger, Illumina, Ion and PacBio) sequences incorporation events into a template strand – and invariably those incorporation events involve the canonical bases, A, G, C and T.  However, in nature, DNA is much more complicated than that – there are base modifications and base analogues.  The MinION (and other ONT sequencers) is the only sequencer that directly detects those modifications and analogues.  However, what we do when we “base call” the raw ONT data is to compress it from natural-DNA-space into canonical-AGCT-space, and whenever we compress we lose signal, and that lost signal turns up in the system as noise – as so-called “error”.  People talk about the error rate of the MinION being around 10% – but not all of it is error; some of it is signal, signal for something we don’t quite understand yet, so we’re interpreting it (wrongly) as AGCT.

That’s one of the exciting things about nanopore sequencing that no-one is talking about.  Oxford Nanopore’s sequencers could possibly reveal a whole new alphabet for DNA that we didn’t know about before – and my bet is that it’ll explain some of what we currently struggle to understand in genetics and genomics.

Assembling B fragilis from MinION and Illumina data

You may have seen our bioRxiv preprint about the sequencing and assembly of B fragilis using Illumina + MinION sequence data.  Well, here is how to do it yourself.

First get the data:

# MinION data (raw fast5 data; needs to be extracted)
wget ftp://ftp.sra.ebi.ac.uk/vol1/ERA463/ERA463589/oxfordnanopore_native/FAA37759_GB2974_MAP005_20150423__2D_basecalling_v1.14_2D.tar.gz
mkdir fragilis_minion
tar -xzf FAA37759_GB2974_MAP005_20150423__2D_basecalling_v1.14_2D.tar.gz -C fragilis_minion
rm fragilis_minion/*.md5

# MiSeq data (fastq data)
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR973/ERR973713/ERR973713_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR973/ERR973713/ERR973713_2.fastq.gz

Then, within R:

library(poRe)
extract.run.fasta(dir="fragilis_minion")

This will extract all sequences as FASTA into fragilis_minion/extracted.  Let’s put all the 2D reads into one file:

cat fragilis_minion/extracted/*.2D.fasta > fragilis_minion.2D.fasta

And finally we are ready to assemble it:

spades.py -o spades_fragilis -1 ERR973713_1.fastq.gz -2 ERR973713_2.fastq.gz --nanopore fragilis_minion.2D.fasta

That’s as far as I can go on my Ubuntu laptop; will update when I get to work!

4 predictions about nanopore sequencing in the next 12 months

I have been lucky enough to be involved in Oxford Nanopore’s MinION access programme since it kicked off, and we have an active grant to develop poRe, software we are writing to help scientists access and use MinION sequence data.

Because of the above, I have been lucky enough to work with some amazing people: incredibly driven, intelligent scientists.  Here is my prediction about what ONT, and those intelligent, driven scientists, are going to achieve in the next 12 months (probably sooner):

1. We will see the first full human genome sequenced using only Oxford Nanopore data.  The cost will be comparable to current techniques.

2. Genotyping and consensus accuracy will be very high, more than capable of accurately calling SNVs  (arguably we are there already), and better than other technologies at calling structural variation

3. Nanopore will become the default platform for calling base modifications (5mC, 5hmC etc)

4. All of the above will be possible without seeing a single A, G, C, or T (i.e. it will all be possible without base-calling the data)

Exciting times!

Why FPKM makes sense

Simulating the perfect scenario

Let’s get our hands dirty in R.  I’m going to simulate 10 genes, each between 200 and 10000 nucleotides long and with between 0 and 50000 molecules in the mix:


# note: no random seed is set, so your simulated values will differ from those shown below
gmat <- data.frame(id=paste("G",1:10,sep=""),
                   length=as.integer(runif(10, min=200, max=10001)),
                   nmol=as.integer(runif(10, min=0, max=50001)),
                   stringsAsFactors=FALSE)

This is what my data look like

> gmat
    id length  nmol
1   G1   5432  3206
2   G2   7177 43621
3   G3   6189 48670
4   G4   8529   635
5   G5   2576 27919
6   G6   8462 33280
7   G7   4316 18735
8   G8   3600 41457
9   G9   4973 21935
10 G10   9742 26207

What I want to do now is calculate the total amount of sequence space each gene represents, and use that to calculate a probability of RNA-Seq reads coming from that gene.  A simple example: if there are 10 copies of a gene in the mix, each of 1000 nucleotides long, then the sequence space of that gene is 10,000 nucleotides.

gmat$sspace <- gmat$length * gmat$nmol
ssum <- sum(gmat$sspace)
gmat$p <- gmat$sspace/ssum

My data now look like this:

> gmat
    id length  nmol    sspace          p
1   G1   5432  3206  17414992 0.01098634
2   G2   7177 43621 313067917 0.19750063
3   G3   6189 48670 301218630 0.19002544
4   G4   8529   635   5415915 0.00341666
5   G5   2576 27919  71919344 0.04537072
6   G6   8462 33280 281615360 0.17765861
7   G7   4316 18735  80860260 0.05101114
8   G8   3600 41457 149245200 0.09415216
9   G9   4973 21935 109082755 0.06881546
10 G10   9742 26207 255308594 0.16106284

My probabilities should sum to one, and they do:

> sum(gmat$p)
[1] 1

Now, let’s assume that we have 30 million RNA-Seq reads, and that those reads have an equal chance of coming from any of the copies of any of the genes in the mix.  Therefore, 30 million multiplied by my probabilities will give me my expected RNA-Seq read counts:

gmat$nreads <- round(30000000 * gmat$p)

And I can now calculate my FPKM values:

gmat$fpkm <- (gmat$nreads / (gmat$length / 1000)) / 30
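
For reference, here is the same calculation wrapped in a small general helper (my own sketch, not from the original post – the helper name fpkm and its arguments are mine; the trailing 30 in the line above is just the library size in millions):

# hypothetical helper: FPKM = (read count / transcript length in kb) / (total reads in millions)
fpkm <- function(counts, lengths, total_reads) {
  (counts / (lengths / 1000)) / (total_reads / 1e6)
}

# reproduces the column computed above:
# gmat$fpkm <- fpkm(gmat$nreads, gmat$length, 30e6)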

My data now look like this:

> gmat
    id length  nmol    sspace          p  nreads       fpkm
1   G1   5432  3206  17414992 0.01098634  329590  2022.5209
2   G2   7177 43621 313067917 0.19750063 5925019 27518.5509
3   G3   6189 48670 301218630 0.19002544 5700763 30703.7388
4   G4   8529   635   5415915 0.00341666  102500   400.5941
5   G5   2576 27919  71919344 0.04537072 1361121 17612.8500
6   G6   8462 33280 281615360 0.17765861 5329758 20994.8719
7   G7   4316 18735  80860260 0.05101114 1530334 11819.0767
8   G8   3600 41457 149245200 0.09415216 2824565 26153.3805
9   G9   4973 21935 109082755 0.06881546 2064464 13837.8180
10 G10   9742 26207 255308594 0.16106284 4831885 16532.8309

Satisfyingly, I have a nice linear relationship between the number of molecules in the mix, and my calculated FPKM:

[Plot: FPKM against number of molecules in the mix]

We can calculate the relationship between these (forcing the model through the origin):


> lm(nmol ~ 0 + fpkm, data = gmat)

Call:
lm(formula = nmol ~ 0 + fpkm, data = gmat)

Coefficients:
 fpkm  
1.585  
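
As an aside, that 1.585 isn’t arbitrary in this simulation.  Since nreads = 30e6 * length * nmol / ssum, substituting into the FPKM formula gives fpkm = 1e9 * nmol / ssum, so the expected slope of nmol on fpkm is simply ssum / 1e9.  A quick check using the ssum computed earlier:

# expected slope of nmol ~ 0 + fpkm, from the construction of the simulation
ssum / 1e9
# [1] 1.585149  (for the particular simulated values shown above)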

So, in our model, nmol = 1.585 * fpkm, and we can add this line to the plot:

# p is a ggplot object, e.g. p <- ggplot(gmat, aes(x=fpkm, y=nmol)) with library(ggplot2) loaded
p + geom_point() + geom_abline(intercept=0, slope=1.585, col="red")

[Plot: FPKM against number of molecules, with the fitted line overlaid]

(please do not use this formula more generically – it really only works in this simple simulation!)

So for perfect simulated data, the number of molecules (transcripts) in the original mix correlates perfectly with FPKM.

(unless I have made any huge errors or incorrect assumptions, which is possible)

Complete Genomics Revolocity and the future of genome sequencing

I’m a bit late to this, please accept my apologies – I’ve been on holiday!  If you are into genome sequencing then you cannot have missed the announcement of the Revolocity system from Complete Genomics.  So what does this all mean?  Let’s put it all in context.

Revolocity

Key features appear to be:

  • The system costs $12M and occupies 1500 square feet.
  • Automated sequencing from biological samples, including blood and saliva
  • Capable of 10,000 WGS at 50X coverage.  To be increased to 30,000 WGS (no hint of how, though I expect read length to increase)
  • Capable of exomes
  • Has necessary compute and bioinformatics built in (possibly through BGI’s cloud service?)
  • 2x28bp reads apparently covering 96% of the human genome
  • 8 day sequencing time

It’s not just the sequencer, though….

It is a good thing that Illumina have a competitor in this space, as competition drives innovation and brings the cost of sequencing down.  The Revolocity is obviously a competitor for Illumina’s HiSeq X Ten and X Five systems.  However, Illumina have a few key advantages:

  • A history of supporting machines in the field (Complete have always been a sequence-as-a-service model)
  • An international support network of field engineers
  • An international distribution network for spare parts
  • A global production line for flowcells and reagents
  • The administrative support required for selling and supporting all of those things
  • A global network of scientists working with Illumina data for many years
  • A huge ecosystem of free and commercial bioinformatics software

CG will need to reproduce all of the above to genuinely compete.  A key piece of missing knowledge about Revolocity is: how often does it fail and how quickly will CG be able to fix it?  Illumina are not perfect in this, but they are a known entity.  Can Revolocity match or even beat Illumina’s “up time”?

Comparison to HiSeq X

CG have the advantage of automated sample prep from blood/saliva; 50X coverage; bioinformatics apparently included; exome enabled.

The HiSeq X also has automation tie-ins with the NeoPrep and their collaboration with Hamilton; they also have BaseSpace for bioinformatics needs.  So not too different from Revolocity.  We need more details to do a true comparison.

The HiSeq X wins on sequencing time (3 days) and read length (2x150bp).  The latter is key.  Despite CG claiming that they will cover 96% of the genome, figure 1 from this paper suggests that 56bp will get you about 92% of the human genome, whereas 300bp (HiSeq X) gets you about 97.5%.

How will Illumina respond?

I have no doubt whatsoever that Illumina will reduce the cost of human genomes further; they will do this in January, at the JP Morgan conference.  I’m guessing a figure closer to $600-800 per genome.

I am less sure, but I think they may also open up the system to human exomes and RNA-Seq.

Cost?

People have said Revolocity hits the $1000 genome mark, but I have yet to see evidence.  If they can, they are doing incredibly well.  However, we need to see more than press releases. Having been involved in the set-up of a HiSeq X system at Edinburgh Genomics, I can confidently say that the $1200 genome is genuinely achievable – by that I mean you provide us with DNA, we sequence to 30X, align, variant call and provide you with FASTQ, BAM/CRAM and GVCF, and it costs around $1200/£800.  If you are interested in Edinburgh Genomics’ service at that price, please get in touch!

Edit 14/06/2015

Revolocity cost was announced at ESHG.

The Future

I have spoken to a few people and we all agree – Illumina will remain market leader for at least 18-24 months.  That’s a long time in sequencing!  However, it is the PromethION that we think will disrupt Illumina, not Revolocity.  If the PromethION can first match and then surpass MinION’s quality levels, tied to massive increases in throughput, then it will become a serious competitor to Illumina.  Why will this not happen sooner?  It will take time for the new platform to develop, and yet more time for the market to react (a market heavily invested in Illumina).  If/when PromethION does become a threat, it will be fascinating to see how Illumina respond.

Analysing MinION data with no bioinformatics expertise nor infrastructure

It has been a long standing aim of mine to enable easy analysis of Oxford Nanopore MinION data, preferably by people who don’t have a ton of bioinformatics skills.  This was one of the reasons we started to develop poRe, which is really easy to install, runs on any platform and is fairly easy to use.

Below is a proof-of-concept method for analysing MinION data without the need for bioinformatics infrastructure – I am doing this on my Windows laptop, with only R and an internet connection.  We are going to use the EBI web-services to do the heavy lifting.  Please do not abuse this!  EBI probably don’t want you to submit 20,000 jobs.  As I said, this is “proof-of-concept”.  It means you’re not supposed to actually use it in practice 😉

Some polite things to do:

  1. Use your own, real e-mail address when/if you run the code below
  2. Please stick to a small number of jobs! (unless EBI tell you it’s OK)

We’re going to do everything in R.

Make a directory on your computer where we will do the work.  You will need approx. 15Gb of space (I know, the data are big!).   My directory is going to be “C:/Users/Mick/Documents/MinION”.  We’re going to use data from Nick Loman’s group, published here and here.

# go to the working directory
setwd("C:/Users/Mick/Documents/MinION")

# load the necessary packages
library(poRe)
library(RCurl)

# download one of Nick's runs and unpack it
url <- "ftp://ftp.sra.ebi.ac.uk/vol1/ERA362/ERA362836/oxfordnanopore_native/Ecoli_R7_ONI_flowcell_17.tar.gz"
dest <- "Ecoli_R7_ONI_flowcell_17.tar.gz"
download.file(url=url, destfile=dest)
untar("Ecoli_R7_ONI_flowcell_17.tar.gz")

This will take some time, as there is about 10Gb of data there.  If I had a smaller dataset, I would use it.

Now, we use the poRe read.fast5.info() function to read statistics from the several thousand MinION reads, which again takes some time.  We then just select the reads that have a 2D sequence:

# read information from all .fast5 files in the current working directory
finfo <- read.fast5.info(dir=getwd())

# get a list of fast5 files where the "len2d" is greater than zero
reads.2d <- rownames(finfo)[finfo$len2d>0]

Now, we are going to use the EBI REST API for NCBI BLAST.  You can read about it at the link provided, but essentially it is about building and querying custom URLs to submit BLAST jobs and fetch the result.

We are only going to do this for the first ten fast5 sequences.  Please do not submit more, or the EBI may block you.  If you want to use this in anger, I suggest contacting the EBI directly and asking them politely.  The code blasts each sequence with default parameters against the EMBL bacterial database, saves the BLAST report to your current working directory, and prints out the top hit.

Here goes:

# iterate over the first ten 2D reads
for (f5 in reads.2d[1:10]) {

    # use poRe function get_fasta() to get the 2D sequence
    fasta.2d <- get_fasta(f5)$'2D'

    # build a list of parameters for the EBI BLAST service
    # (remember to use your own, real e-mail address)
    myl <- list(sequence=fasta.2d,
                stype="dna",
                program="blastn",
                email="yourownaddress@email.com",
                database="em_rel_pro")

    # build the submission URL and submit the job
    runurl <- "http://www.ebi.ac.uk/Tools/services/rest/ncbiblast/run"
    jobid <- postForm(runurl, .params=myl)

    # poll the status URL for the job until the status
    # is no longer "RUNNING"
    staturl <- "http://www.ebi.ac.uk/Tools/services/rest/ncbiblast/status/"
    surl <- paste(staturl, jobid, sep="")
    status <- "RUNNING"
    while(status == "RUNNING") {
        Sys.sleep(5)    # be polite to the EBI: wait a few seconds between polls
        status <- getURL(surl)
    }

    # build the output URL and fetch the results
    outurl <- "http://www.ebi.ac.uk/Tools/services/rest/ncbiblast/result/"
    ourl <- paste(outurl, jobid, "/out", sep="")
    out <- getURL(ourl)

    # write the output to a file in the current working directory
    bloutfile <- gsub(".fast5", ".blast", f5)
    write(out, bloutfile)

    # print out the top hit (line 22 of the default BLAST report)
    print(strsplit(out, "\n")[[1]][22])

}

Now, it’s not quick – it probably takes about 2-3 minutes to complete on my laptop.  It also isn’t very responsive (in that it looks a bit like it may not be doing anything!).  However, eventually the R terminal will show the top hits:

[Screenshot: top BLAST hits printed in the R terminal]

You can see from the above that the first few reads are a bit dodgy and don’t really hit anything, but pretty quickly the reads start hitting E coli K12, which is what they sequenced.

Also, in your directory you will find a whole bunch of BLAST reports:

[Screenshot: BLAST report files in the working directory]

Note, as “proof-of-concept” code, there is very little (i.e. nothing) in the way of error-checking in the above code!  I mean, come on, it’s only a blog post!

Enjoy!

The cost of sequencing is still going down

There’s been a bit of chatter on Twitter about how the cost of sequencing is not going down anymore, with reference to that genome.gov graph.  I realise that sensationalist statements get more retweets, but I am sorry, this one is complete crap – the cost of sequencing is still on a long-term downward trend.

By way of qualification, I have been helping to run our University’s next-generation genomics facility, Edinburgh Genomics, since 2010.  We are an academic facility running an FEC model – which means we recover all of our costs (reagents, staff, lab space etc) but do not make a profit.  If you are interested in Illumina sequencing, especially after reading below, you should contact us.

You can read some relatively up-to-date references here, here and here.

What I am going to talk about below are the medium- to long- term trends in sequencing prices on the Illumina platform.  There are fluctuations in the price in any given period (for example Illumina put prices up across the board a few weeks ago), but these fluctuations are very small in the context of the wider, global trend of cost reduction.

HiSeq V3

Back in 2013/14, the most cost-effective method of sequencing genomes was on the HiSeq 2000/2500 using V3 chemistry.  At its best, a lane of sequencing would produce 180 million paired 100bp reads, or 36 gigabases of sequence data.  I am going to use this as our baseline.

HiSeq V4

After HiSeq V3 came HiSeq V4, which was introduced last year.  All new 2500s could offer this and many new-ish 2500s could be upgraded.  At its best, V4 produces 250 million paired 125bp reads, or 62.5 gigabases of sequence data.

Of course, Illumina charge more for V4 reagents than they do for V3 reagents, but crucially, the price increase is proportionally smaller than the increase in throughput.  So, at Edinburgh Genomics, the cost of a V4 lane was approx. 1.2 times the cost of a V3 lane, but you get 1.7 times the data.  Therefore, the cost of sequencing decreased.

HiSeq X

This is rather trivial I think!  By my calculations, a lane of the HiSeq X will produce around 110 gigabases of sequence data, which is 3 times as much data as HiSeq V3, and the cost has come down to about 0.4 times.  Therefore, the cost of sequencing decreased.

HiSeq 4000

The HiSeq 4000 is a bit of an unknown quantity at present as we haven’t seen one in the wild (yet) and nor have we come up with a detailed costing model.  However, my back-of-a-post-it calculations tell me the HiSeq 4000 lanes will produce around 93 gigabases of data (about 2.6 times as much data as HiSeq V3) at about 1.4 times the cost.   Therefore, the cost of sequencing decreased.
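
To make those comparisons explicit, here is a quick sketch using only the rough multipliers quoted above (HiSeq V3 lane = 1.0 cost unit and 36 gigabases; these are estimates, not quotes):

# relative cost per gigabase, with a HiSeq V3 lane as the baseline (cost 1.0, 36 Gb)
lanes <- data.frame(
  platform = c("HiSeq V3", "HiSeq V4", "HiSeq X", "HiSeq 4000"),
  rel_cost = c(1.0, 1.2, 0.4, 1.4),   # lane cost relative to a V3 lane
  gb       = c(36, 62.5, 110, 93)     # approximate gigabases per lane
)
lanes$rel_cost_per_gb <- round(lanes$rel_cost / (lanes$gb / 36), 2)
lanes
# anything below 1 in rel_cost_per_gb is cheaper per gigabase than HiSeq V3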

Drawing a new graph

Here it is.  You’re welcome.

[Graph: relative cost of Illumina sequencing per gigabase, HiSeq V3 through HiSeq X, continuing to fall]

My numbers are accurate, despite what you may hear

It came to my attention last year that some facilities might be offering HiSeq V3 lanes with over 200 million read-pairs/clusters.  It is possible to go that high, but often only through a process known as over-clustering.  This is a bad thing.  It not only reduces sequence quality, but also produces lots of PCR and optical duplicates which can negatively affect downstream analysis.

Other platforms

I haven’t spoken about Ion Torrent or Ion Proton for obvious reasons.  I also haven’t included the NextSeq 500 or MiSeq – to be honest, though these are very popular, they are not cost-effective ways of producing sequence data (even Illumina will admit that) and if you told your director that they were, well, shame on you 😉

PacBio?  Well, it seems the throughput has increased 3 times in the last year, despite the need for an expensive concrete block.  So I can’t really see the cost of PacBio sequencing going up either.

MinION and PromethION – costs currently unknown, but very promising platforms and likely to bring the cost of sequencing down further.

Complete Genomics – well, as I said in 2013, they claimed to be able to increase throughput by 30 times.

There is also the BGISEQ-1000, which apparently can do 1 million human genomes per year. Apparently.

All of which means – the cost of sequencing is going to keep coming down.

So why is the genome.gov graph incorrect?

I don’t know for sure, but I have an idea.  Firstly, the data only go to July 2014; and secondly, the cost per genome is listed as $4905, which is obviously crap in the era of HiSeq X.

Can we stop this now?

Putting the HiSeq 4000 in context

Illumina have done it again: disrupted their own market with no competition and produced some wonderful new machines with higher throughput and lower run times.  Below is a brief summary of what I have learned so far.

HiSeq X 5

Pretty basic: this is half of an X Ten, but the reagents etc are going to be more expensive.  $6 million capital for an X5, and the headline figure appears to be $1400 per 30X human genome.  The headline figure for the X10 is $1000 per genome, so the X5 may be 40% more expensive.

HiSeq 3000/4000

The 3000 is to the 4000 as the 1000 was to the 2000 and the 1500 to the 2500 – it’s a 4000 that can only run one flowcell instead of two.  I expect it to be as popular as the 1000/1500s were – i.e. not very.  No-one goes to a funder for capital investment and says “Give me millions of dollars so I can buy the second best machine”.

Details are scarce, but the 4000 (judging by the stats) will have 2 flowcells with 8 lanes each, do 2x150bp sequencing, and deliver around 312 million clusters per lane in 3.5 days.

Here is how it stacks up against the other HiSeq systems:

System           Clusters per lane   Read length   Lanes   Days   Gb per lane   Gb total   Gb per day
V1 rapid         150,000,000         2×150         4       2      45            180        90
V2 rapid         150,000,000         2×250         4       2.5    75            300        120
V3 high output   180,000,000         2×100         16      11     36            576        52
V4 high output   250,000,000         2×125         16      6      62.5          1000       167
HiSeq 4000       312,000,000         2×150         16      3.5    93.6          1500       428
HiSeq X          375,000,000         2×150         16      3      112.5         1800       600

These are headline figures and contain some guesses.  How the machines behave in reality might differ.

If any of my figures are wrong, please leave a comment!
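
The derived columns are simple arithmetic, so here is a small sketch that recomputes them from the headline clusters, read lengths, lanes and run times (same caveats – headline figures and guesses):

# recompute Gb per lane, Gb total and Gb per day from the headline figures
hiseq <- data.frame(
  system   = c("V1 rapid", "V2 rapid", "V3 high output", "V4 high output",
               "HiSeq 4000", "HiSeq X"),
  clusters = c(150e6, 150e6, 180e6, 250e6, 312e6, 375e6),  # clusters per lane
  readlen  = c(150, 250, 100, 125, 150, 150),              # length of each read in the pair
  lanes    = c(4, 4, 16, 16, 16, 16),
  days     = c(2, 2.5, 11, 6, 3.5, 3)
)
hiseq$gb_per_lane <- hiseq$clusters * 2 * hiseq$readlen / 1e9
hiseq$gb_total    <- hiseq$gb_per_lane * hiseq$lanes
hiseq$gb_per_day  <- round(hiseq$gb_total / hiseq$days)
hiseq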

UPDATE: there appears to be some confusion over the exact config of the HiSeq 4000.  The spec sheet says that 5 billion reads per run pass filter.  The RNA-Seq dataset has 378 million reads “from one lane”.  5 billion / 378 million is ~ 13 (lanes).  My contact at Illumina says there are 8 lanes per flowcell.  5 billion clusters / 16 lanes would give us 312 million reads per lane.  Possibly the RNA-Seq dataset is overclustered!
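
The arithmetic behind that update, as a tiny sketch:

# 5 billion clusters passing filter per run, across two flowcells of 8 lanes each
reads_per_run <- 5e9
lanes_per_run <- 2 * 8
reads_per_run / lanes_per_run    # ~312.5 million clusters per lane
378e6 * lanes_per_run            # ~6 billion if every lane yielded as much as that RNA-Seq lane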

A 387 million read-pair RNA-Seq data set is here.

Edinburgh and Glasgow invest in revolutionary HiSeq X clinical genomics technology

I don’t think I’m exaggerating when I say that the ability to sequence a person’s genome for less than £750 will turn out to be one of the biggest breakthroughs in human medicine this century.  This one technology gives us an unprecedented ability to understand and diagnose genetic disease, and to enable personalised medicine and pharmacogenomics.

Today, the Universities of Edinburgh and Glasgow announce the purchase of 15 HiSeq X instruments, one of the largest installs in the world.  This, tied together with Scotland and the UK’s world-leading supercomputing infrastructure, much of which is based at the University of Edinburgh, will make Scotland a global leader in clinical genomics.  In Edinburgh, the installation is backed by three world-leading institutions: The Roslin Institute (my employer), the Institute for Genetics and Molecular Medicine and Edinburgh’s School of Biological Sciences.  The Edinburgh node will be placed within Edinburgh Genomics.

One nuance of this announcement that should not be underestimated is the fact that Scotland has excellent electronic national health records, representing a huge advantage.  As anyone working in genomics knows – in the land of cheap sequencing, the person with good metadata is king.

This announcement represents both the beginning and the end of two separate but related stories.  The first began at PAG XXII last year (it doesn’t escape me that I am writing this at PAG XXIII this year), when Illumina announced the HiSeq X system.  This morning I re-read a 3-hour e-mail conversation from January last year between Mark Blaxter (director of Edinburgh Genomics), Karim Gharbi (head of genomics), Richard Talbot (head of laboratory) and myself (head of bioinformatics).  As Illumina announced the HiSeq X system, we started off in confusion, moved swiftly to wonder and amazement, and ended with Mark asking “Do you think we need to invest in this technology?”.  He received 3 resounding answers: “Yes!”.  Three hours to decide we needed to push for a clinical genomics revolution in Scotland, driven by a HiSeq X Ten install.  The last year has seen some incredibly hard work, not least from Mark Blaxter, Tim Aitman, David Hume and countless others.  A huge amount of respect has to go to all three, especially Mark, who really began to drive these discussions at the start of last year.  No doubt countless others worked incredibly hard, and everyone involved deserves credit.

That’s the end of one story, and the beginning of another – sequencing 1000s of human genomes from Scotland and beyond.  I am incredibly excited by what that represents.  One of my first jobs was at GlaxoWellcome in the 90s, and around 1996-98 I recall the buzz around pharmacogenomics, about personalised medicine, about delivering the right drug to the right patient based on genetic information.  That’s nearly 20 years ago now, which demonstrates that this is not a new idea, nor is it one that has been quick to realise.

Within Edinburgh, the HiSeq X systems will be run by a new clinical genomics division of Edinburgh Genomics, directed by Tim Aitman.  All other sequencing, including agrigenomics, environmental and non-genomic human sequencing, will be run by the existing Edinburgh Genomics infrastructure, Edinburgh Genomics Genome Sciences, directed by Mark Blaxter.  Existing customers, and those who work in fields other than human genomics, should rest assured that we remain committed to our existing facility and will continue to deliver high-quality data and projects across the biological sciences using our current install of HiSeq 2500 and MiSeq sequencers.

These are very, very exciting times!

On Mars, Venus, Monkeys and Microbiomes

You would have to be dead not to realise that the major story today is the contamination paper, Salter et al, available here.  Read more about it here and here, including quotes from me.  I’ve been banging on about this paper for a while now, and arguably I think it’s more important than the authors do!

As I said in Ed Yong’s Nat Geo piece, Salter et al isn’t just about contamination in microbiome studies, it is about how we behave as scientists.  Did we learn nothing from the XMRV fiasco?  Apparently not, because we are still finding probable contaminants and insisting they’re implicated in varied diseases such as HIV, cancer, sleep apnoea and stupidity.

My point is this: in any experiment (not just microbiome), and especially experiments involving deep sequencing, if you find something incredible, then it probably is exactly that – in the dictionary sense of “unconvincing; far-fetched; hard to believe; scarcely credible”.

Yesterday, I presented at NGS Sheffield, and this is what I said about contamination:

If you travel to Mars and, upon your return, feel ill, and a sequencing test then suggests you are infected with a virus from Venus, the conclusion should not be “Venusian virus found on Mars!”.  Rather, the questions should be “Has anyone in my lab been to Venus?” and “Could any of my reagents include anything Venusian?”

In other words, when you find something that flies in the face of all previous evidence, or which defies belief, then the onus is on you to prove it is not a false positive.  Sometimes I think researchers are too quick to pick up the phone to Nature and Science – because incredible results get high impact papers, and to hell with the idea that it might not actually be true.

To torture another metaphor, the infinite monkey theorem states that an infinite number of monkeys hitting keys at random on a keyboard will eventually type the complete works of Shakespeare.  Well, with current high-throughput sequencing technologies, we are approaching the equivalent of infinite monkeys, which means occasionally you’re going to find something that looks like Shakespeare.  It isn’t though, and we all need to be big enough to admit that our perfect results might turn out to be complete crap.

 
