bioinformatics, genomes, biology etc. "I don't mean to sound angry and cynical, but I am, so that's how it comes across"


Building a Kraken database with new FTP structure and no GI numbers

Progress is usually good, except when a reliable resource completely changes its data and file structures and messes up everything you have been trying to do.  The NCBI used to arrange bacterial genomes in a way that was very easy to understand and download; now it’s a bit tougher.  So tough, in fact, that they had to create a new FAQ.

At the same time as moving around genome data, they also decided to retire GI numbers (thanks to Andreas for the link!).

This is a problem for Kraken, a metagenomic profiler, because it relies both on the old style genomic data structures and the GI numbers that come with them.  Below are my attempts to build a Kraken database, using the new FTP structure and no GIs.

Downloading the data

It is pretty easy to follow the FAQ above, but I have created some Perl scripts here that should help.

You will need:

  • Perl
  • BioPerl
  • Git
  • A Linux shell (bash) with wget
  • Internet access
  • A folder that you have write access to
  • Maybe 40GB of disk space

The magic in these scripts is that they download the data and then add the necessary “|kraken:taxid|xxxx” text that replaces the GI number when building what Kraken refers to as “custom databases”.
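As a sketch of the header rewrite the scripts perform (the taxid here, 511145 for E. coli K-12 MG1655, is hard-coded purely for illustration; the scripts look the real one up per assembly):

```shell
# A RefSeq FASTA header carries an accession but no GI number:
printf '>NC_000913.3 Escherichia coli str. K-12\nACGTACGT\n' > genome.fna

# Embed a taxid in the header so kraken-build can map the sequence
# without a GI.  Taxid 511145 is hard-coded here only for illustration.
sed 's/^>\([^ ]*\)/>\1|kraken:taxid|511145/' genome.fna > tagged.fna

head -n 1 tagged.fna
# >NC_000913.3|kraken:taxid|511145 Escherichia coli str. K-12
```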

Note also that the script only downloads fasta for assemblies classed as “Complete Genome”.  Other options are:

  • Chromosome
  • Complete Genome
  • Contig
  • Scaffold

If you want the others, edit the scripts appropriately.
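The filtering the scripts do essentially boils down to one awk over NCBI’s assembly_summary.txt, keeping rows whose assembly_level column says “Complete Genome” and downloading their ftp_path column.  A toy sketch with a two-row dummy summary (in the real file, column 12 is assembly_level and column 20 is ftp_path):

```shell
# Miniature assembly_summary.txt: only the two columns the filter
# uses are populated (12 = assembly_level, 20 = ftp_path)
printf 'GCF_1\t\t\t\t\t\t\t\t\t\t\tComplete Genome\t\t\t\t\t\t\t\tftp://example/GCF_1\n' >  assembly_summary.txt
printf 'GCF_2\t\t\t\t\t\t\t\t\t\t\tContig\t\t\t\t\t\t\t\tftp://example/GCF_2\n'          >> assembly_summary.txt

# Keep only assemblies classed "Complete Genome"; print their FTP paths
awk -F '\t' '$12 == "Complete Genome" { print $20 }' assembly_summary.txt
# ftp://example/GCF_1
```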

# clone the git repo
git clone https://github.com/mw55309/Kraken_db_install_scripts.git

# either put the scripts in your path or run them from that directory

# make download directory
mkdir downloads
cd downloads

# run for each branch of life you wish to download
perl download_bacteria.pl
perl download_archaea.pl
perl download_fungi.pl
perl download_protozoa.pl
perl download_viral.pl

# you should now have five directories, one for each branch

# build a new database 
# download taxonomy
kraken-build --download-taxonomy --db kraken_bvfpa_080416

# for each branch, add all fna in the directory to the database
for dir in fungi protozoa archaea viral bacteria; do
        for fna in `ls $dir/*.fna`; do
                kraken-build --add-to-library $fna --db kraken_bvfpa_080416
        done
done
# build the actual database
kraken-build --build --db kraken_bvfpa_080416

I haven’t tested the above database yet, but the output of the build script ends with:

Creating GI number to seqID map (step 4 of 6)...
GI number to seqID map created. [14.935s]
Creating seqID to taxID map (step 5 of 6)...
7576 sequences mapped to taxa. [1.188s]
Setting LCAs in database (step 6 of 6)...
Database LCAs set. [21m24.805s]
Database construction complete. [Total: 1h1m28.108s]

Which certainly makes it look like things have worked.

I’ll update the post with info from testing in due course.


Update 13/04/2016

I simulated some reads from a random set of fastas in the database, then ran Kraken, and it worked (plus or minus a few false positives).


We need to stop making this simple f*cking mistake

I’m not perfect.  Not in any way.  I am sure if anyone was so inclined, they could work their way through my research with clinical forensic attention-to-detail and uncover all sorts of mistakes.  The same will be true for any other scientist, I expect.  We’re human and we make mistakes.

However, there is one mistake in bioinformatics that is so common, and which has been around for so long, that it’s really annoying when it keeps happening:

It turns out the Carp genome is full of Illumina adapters.

One of the first things we teach people in our NGS courses is how to remove adapters.  It’s not hard – we use Cutadapt, but many other tools exist.   It’s simple, but really important – with de Bruijn graphs you will get paths through the graph converging on k-mers from adapters; and with OLC assemblers you will get spurious overlaps.  With gap-fillers, it’s possible to fill gaps with sequences ending in adapters, and this may be what happened in the Carp genome.

Why then are we finding such elementary mistakes in such important papers?  Why aren’t reviewers picking up on this?  It’s frustrating.

This is a separate, but related, issue to genomic contamination – the Wheat genome has PhiX in it; tons of bacterial genomes do too; and lots of bacterial genes were problematically included in the Tardigrade genome and declared to be horizontal gene transfer.

Genomic contamination can be hard to find, but sequence adapters are not.  Who isn’t adapter trimming in 2016?!
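And checking really is easy: the near-ubiquitous Illumina TruSeq adapter begins AGATCGGAAGAGC, and a simple grep over an assembly will flag any contig containing it (toy FASTA below; a real check would also scan the reverse complement):

```shell
# Toy "assembly" with an adapter baked into contig1, the kind of thing
# that appears to have happened in the Carp genome
# (AGATCGGAAGAGC = Illumina TruSeq adapter prefix)
printf '>contig1\nACGTAGATCGGAAGAGCACGT\n>contig2\nTTTTTTTT\n' > asm.fa

# Count lines containing adapter -- should be zero in a clean assembly
grep -c 'AGATCGGAAGAGC' asm.fa
# 1
```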

Fast parallel access to data within MinION FAST5 files

As readers of this blog will know, my group was one of the first to publish software for the MinION sequencer, and I have blogged a couple of times on how to extract data from FAST5 files using poRe.  Well, BBSRC, my main funder, were kind enough to give us a grant to develop the software, and one thing we said we’d do is implement parallel processing.

Below is a summary of what we’ve done so far.

Parallel processing in R

One of the huge benefits of R, besides the amazing suite of statistical and visualization tools and the thousands of add-on packages, is the fact that parallelisation is really easy.  In this context, what I mean by parallelisation is the use of multiple cores on your machine to run a single piece of code.  By using multiple cores, the code is sped up, often linearly with the number of cores you use.

There are quite a few packages for R that help you run parallel code, including doParallel, foreach and snow, but we’re going to focus on the built-in package, parallel.  To show the benefits, we have run various toolkits for the extraction of 2D FASTQ data from 25484 FAST5 files kindly shared by Nick Loman.

Benchmarking FASTQ extraction

Now, as Matt Loose has pointed out to me, focusing on FASTQ extraction is a little pointless as almost certainly this will be built into future ONT base callers; however, it serves as a useful example and there are plenty of FAST5 files out there already that you may wish to extract FASTQ from.

The tools we compared are extract2D from poRe, poretools, fast5extract from c_fast5 (a simple C implementation) and extract2DParallel, a parallel version of extract2D (the parallel scripts are in the “misc” folder in GitHub).  We ran each tool 4 times on 1000, 5000, 10000, 15000, 20000, and 25000 randomly selected FAST5 files, clearing the cache completely between runs to ensure a fair comparison.


As can be seen, all tools perform linearly with the number of FAST5 files; poretools is the slowest, then extract2D, then c_fast5.  However, the best result is clearly from extract2DParallel (using 4 cores) which outperforms everything else by some distance.

Of course, this is not surprising, and doesn’t come without a cost – if you only have 4 cores then using all 4 will mean all other processes perform really slowly.  However, there are also benefits – if your machine has 8 or 24 cores, you can speed up processing even more.

The scripts on GitHub should be treated as an early release and have only been tested on one dataset, so I welcome your feedback!

Extracting FASTQ on Windows

We haven’t wrapped up the functionality into poRe yet, but we expect to in the next release.  Meanwhile for the impatient…


# find out how many cores you have
library(parallel)
detectCores()
# [1] 4

I have 4 cores, and I probably don’t want to use all of them! I now need to redefine one of poRe’s internal functions so that it works with parallel:

printfastq <- function (f5, f5path = "/Analyses/Basecall_2D_000/BaseCalled_2D/", out="out.fastq") {
    # open the FAST5 file, catching warnings/errors from invalid files
    fid = tryCatch({
        H5Fopen(f5)
    }, warning = function(w) w, error = function(e) e)
    # only proceed if the file opened and the FASTQ path exists
    if (typeof(fid) == "S4" && H5Lexists(fid, f5path)) {
        did <- H5Dopen(fid, paste(f5path, "Fastq", sep = ""))
        fq <- (H5Dread(did))
        # prepend the filename to the FASTQ header and append to the output file
        four <- strsplit(fq, "\n")[[1]]
        fq <- paste(paste(four[1], f5, sep = " "), four[2], four[3],
            four[4], sep = "\n")
        cat(fq, file = out, sep = "\n", fill = FALSE, append=TRUE)
        H5Dclose(did)
    }
    if (typeof(fid) == "S4") {
        H5Fclose(fid)
    }
}

Now let's create a vector of the fast5 files I want to extract:

my.dir <- "C:/MinION/FAST5/" # use your own directory here!

f5files <- dir(path = my.dir, pattern = "\\.fast5$", full.names = TRUE)

Here is the interesting part, where we set up our parallel "cluster", which is then used to execute the code. Each core within the cluster should be treated as a separate instance of R, so we need to load libraries and export variables to the cluster:

# use two cores
cl <- makeCluster(2)
# load poRe on the cluster
c <- clusterEvalQ(cl, suppressWarnings(suppressMessages(library(poRe))))

# export the printfastq function to the cluster
clusterExport(cl, "printfastq")

# run the code and stop the cluster
l <- parApply(cl, array(f5files), 1, printfastq, out="myfastq.fastq")
stopCluster(cl)

Extract metadata

Absolutely one of the slowest aspects of MinION data analysis and QC is extracting metadata, but we can also parallelise this!  For those of you who like the Linux command line, there is an extractMetaParallel script in the GitHub repo (again - use with caution and feedback!), but for those of you who want to use this within R:

# find out how many cores you have
library(parallel)
detectCores()
# [1] 4

# get a list of files to extract data from
my.dir <- "C:/MinION/FAST5/" # use your own directory here!
f5files <- dir(path = my.dir, pattern = "\\.fast5$", full.names = TRUE)

# use two cores
cl <- makeCluster(2)
# load poRe on the cluster
c <- clusterEvalQ(cl, suppressWarnings(suppressMessages(library(poRe))))

# variables to set paths to template, complement and 2D data
# NOTE you will need to change these for the new FAST5 format
path.t <- "/Analyses/Basecall_2D_000/"
path.c <- path.t
path2d <- path.t

# run the code and stop the cluster
l <- parLapply(cl, f5files, get_meta_info, path2d, path.t, path.c)
stopCluster(cl)

# convert the resulting vector into a matrix with the correct column names
m <- matrix(unlist(l), ncol=12, byrow=TRUE)
colnames(m) <- c("filename","channel_num","read_num","read_start_time","status","tlen","clen","len2d","run_id","read_id","barcode","exp_start") 


Parallelisation is so easy within R it's almost criminal not to do it. We can see that just a bit of parallelisation speeds up the code incredibly - and we can also see that poRe is a pretty fast toolkit when used in the correct way, which it wasn't in this paper.

NOTE I am fairly new to benchmarking and parallelisation, so if I have done anything wrong, I apologize – please get in touch!

Shots fired – nanopore wars part II

For the record, and to try and avoid accusations of bias, I am a fan of both Illumina and Oxford Nanopore.  I have won grants to purchase, and I have encouraged our facility to purchase, millions of pounds worth of Illumina equipment; my lab has spent well over £100,000 on Illumina sequencing.   So there’s that.  Yet I am also a massive fan boy of Oxford Nanopore’s technology – anyone who reads my blog or my Twitter feed will realise that I see nanopore technology as the future and often get incredibly excited by the revolutionary nature of the MinION mobile sequencer.

So I am incredibly disappointed in this – Illumina Sues Oxford Nanopore for Patent Infringement.   It’s like having two of your friends fighting.

There are fractious relationships in genomics of course – that between Illumina and BGI perhaps most famous – and I have written before about the relationship between Illumina and ONT, and how they entered into arbitration in 2011/12.

Illumina suing ONT is the latest act in the “war” between the two companies and comes after Illumina licensed technology from the Gundlach lab in 2013.  Key for me is what Illumina’s intentions were when they did this:

  1. Illumina wanted to develop nanopore technology, and they now have something close to market.  This situation would put Illumina in the best light.  In this case suing ONT is probably something they have to do to try and create space in the market for their technology.  Not great for ONT, but it seems this is how business is done.
  2. Illumina wanted to develop nanopore technology, they’ve tried and failed.  This makes Illumina look worse – if they don’t have a competing technology, it’s a bit spiteful to try and spike a competitor anyway, though certainly not uncommon.  And in this situation at least they tried to develop tech.
  3. Illumina never wanted to develop nanopore technology, and they bought the Gundlach patents just to kill ONT.  Again not entirely uncommon, but this is essentially patent trolling and really doesn’t make Illumina look good at all.  This would be a business driven decision to the detriment of science and the detriment of humanity.  Not good.

I am guessing the only way we will ever know which of these is true is if Illumina eventually release nanopore technology – we’d then know/suspect (1) is true.

The timing of this really stinks too – given ONT are set to float on the stock market, it’s hard not to see this as a deliberate attempt to scupper that.

My view is that ONT have a pretty strong position – my understanding is that they have patents, stretching back many years, that cover most or all of their instruments.  Some of this may come down to the pore – Gundlach’s patents cover MspA; and Hagan Bayley’s work, on which ONT was originally based, involved alpha-hemolysin, a different pore.  Though we don’t know if ONT switched pores.  In addition to all of that, the pores in ONTs technologies have been engineered, mutated and developed a huge amount – so when does a pore stop being the original molecule and become something else?  Do the patents cover this?

These are very interesting times, but this is not a welcome development.

Did you notice the paradigm shift in DNA sequencing last week?

There is a very funny tweet by Matthew Hankins about the over-use of “paradigm shift” in articles in Scopus between 1962 and 2014, which clearly suggests we over-use the term:

However, last week there was a genuine shift in DNA sequencing published in bioRxiv by Matt Loose called “Real time selective sequencing using nanopore technology“.  So what makes this paper special?  Why is this a genuine paradigm shift?

Well, this is the first example, ever, of a DNA sequencer selecting regions of the input genome to sequence.   To be more accurate, Matt demonstrates the MinION sequencing DNA molecules and analyzing them in real time; testing whether they come from a region of the genome he is interested in; and if they do not, rejecting them (“spitting them out”) and making the nanopore available for the next read.

This has huge applications from human health through to pathogen genomics and microbiome research.  There are many applications in genomics where you are only interested in a subset of the DNA molecules in your sample.  For example, exome sequencing, or other sequence capture experiments – at the moment a separate capture reaction needs to take place and this can actually be more expensive than the sequencing.   Or pathogen diagnostics – here most of what you sequence would be host DNA, which you could ask the sequencer to reject, leaving only the pathogen to sequence; or flip it round – human saliva samples can have up to 40% oral microbiome DNA, which is wasted if what you want to sequence is the host.  I have worked with piRNA and siRNA datasets from insects where less than 1% of the Illumina reads are those I actually want.  If you want to know if a particular bacterial infection is resistant to antibiotics, you only actually need to sequence the antibiotic-resistance genes/islands/plasmids, not the entire genome. etc etc etc

Of course, we still have a bit of a way to go – Matt demonstrates so-called “read until” on Lambda, a relatively small, simple genome – but this is an amazing proof-of-principle of an incredible idea – that you can tell your sequencer what to sequence.  The paper deserves your attention.  Please read it and fire up your imagination!

Bioinformatics error responsible for retraction in Science

Update: sorry, my mistake, the paper has not been retracted, an Erratum has been issued

If you follow the same kind of feeds and blogs that I do, you can’t have missed the recent retraction in Science that was due to a simple bioinformatics error.  Others have covered this in way more detail, but the essential points are that the authors sequenced an ancient genome from an individual they called “Mota”, and compared that genome to a bunch of other human genomic datasets to put it in context – except when they compared to a reference panel of SNPs, a necessary conversion script wasn’t run, which meant that a huge number of SNPs that Mota shared with Europeans were dropped, making Mota appear less European than he actually was.

The thing that strikes me most about this story is that this could have been so easily avoided.

There are a few things I tell my post-docs and students over and over.  One of them, perhaps the most important, is to check your work.  A second and related point is that if you get wonderful results, or unusual and striking results, they are probably wrong and you should check your work.  We tried to encapsulate some of this advice in our paper – So you want to be a computational biologist?  Honestly, I know it’s boring, but double- and triple-checking your work is so important.   Whatever you find, at the very least take a few data points and check again to make sure you get the same result.

In the above example, what I personally would have done, is to write a script to make sure that the “255,922 SNPs out of 256,540” were genuinely missing from the samtools output.  This may sound daunting but it really isn’t – possibly 10 or 20 lines of Python or Perl could have saved the day.
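It doesn’t even need Python or Perl; as a sketch of the kind of sanity check I mean (the file names and SNP IDs below are made up), a couple of lines of shell will count how many panel SNPs are genuinely absent from the output:

```shell
# Toy stand-ins for the real files: the reference panel's SNP IDs, and
# the IDs actually present in the samtools output (sorted for comm)
printf 'rs1\nrs2\nrs3\nrs4\n' | sort > panel_snps.txt
printf 'rs2\nrs4\n'           | sort > samtools_snps.txt

# SNPs in the panel but missing from the output -- if this number is
# enormous, something upstream (like a conversion step) has gone wrong
comm -23 panel_snps.txt samtools_snps.txt | wc -l
# 2
```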

I do wonder if this was a case of a lonely bioinformatician – I genuinely don’t know who did what on the paper, but so frequently errors creep in when wet and dry lab scientists communicate in ever-so-slightly different languages.

Whatever the issue – the take home is this: please, for the love of everything that is good in the world, CHECK YOUR RESULTS!

Generate a single contig, hybrid assembly of E coli using MiSeq and MinION data

If you have MiSeq and ONT MinION data, it is now very easy to combine them into a single-contig, high quality assembly.  Here I will start with a base Ubuntu 14.04 LTS distribution and take you through every step, including software installation.  I am starting with an Amazon EC2 m4.xlarge AMI with 4 vCPUs and 16Gb RAM.  If you want to repeat the work below, you don’t need to use EC2, any computer running Ubuntu and with enough disk/RAM should work.

For MinION data, we will use Nick’s SQK-MAP-006 data at the EBI.  Using a simple trick, and because I know the internal structure of the tar archive, we can download only the base-called data:

curl ftp://ftp.sra.ebi.ac.uk/vol1/ERA540/ERA540530/oxfordnanopore_native/MAP006-1.tar | tar -xf - MAP006-1/MAP006-1_downloads &
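The trick works because tar will happily read an archive from a stream and extract only the members you name, skipping everything else as it flows past, so the multi-gigabyte archive never needs to hit the disk in full.  A toy demonstration of the same pattern:

```shell
# Build a small archive with two members
mkdir -p demo/keep demo/skip
echo data > demo/keep/file.txt
echo junk > demo/skip/file.txt
tar -cf demo.tar demo
rm -rf demo

# Stream the archive into tar and extract only the member we name --
# the same pattern as the curl | tar pipeline above
cat demo.tar | tar -xf - demo/keep

ls demo
# keep
```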

We can also get some MiSeq data from ENA from the same strain, E coli K-12 MG1655:

wget -q ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR262/005/SRR2627175/SRR2627175_2.fastq.gz &
wget -q ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR262/005/SRR2627175/SRR2627175_1.fastq.gz &

Now, while that is downloading, we need to install some software on the server:

sudo apt-get update
sudo apt-get install r-base-core git samtools

This installs R which we will use to extract the MinION data (using poRe), git and samtools.  Just reply “Y” to any questions.

We now need to install the poRe dependencies in R, which is very easy:


R may ask if you want to install into a local library; just say Y and accept the defaults.  We need to download poRe from SourceForge; we are using version 0.16.

Once downloaded, and back at the Linux command line:

R CMD INSTALL poRe_0.16.tar.gz

The fastq extraction scripts for poRe are in github, so let’s go get those:

git clone https://github.com/mw55309/poRe_scripts.git

We will assemble using SPAdes, so let’s go get that:

wget http://spades.bioinf.spbau.ru/release3.6.2/SPAdes-3.6.2-Linux.tar.gz
gunzip < SPAdes-3.6.2-Linux.tar.gz | tar xvf -

Now, we are ready to go.  First off, let’s extract the 2D sequence data as FASTQ from the MinION data.  Nick’s SQK-MAP-006 data are in the old FAST5 format so we use the script in “old_format”:

./poRe_scripts/old_format/extract2D MAP006-1/MAP006-1_downloads/pass/ > minion.pass.2D.fastq &

Finally we can use SPAdes to assemble the data.  SPAdes is a swiss-army knife of genome assembly tools, and by default includes read correction.  This takes up lots of RAM, so we are going to skip it.  We will also only use 3 kmers to save time:

./SPAdes-3.6.2-Linux/bin/spades.py --only-assembler \
				   -t 4 -k 21,51,71 \
				   -1 SRR2627175_1.fastq.gz \
				   -2 SRR2627175_2.fastq.gz \
				   --nanopore minion.pass.2D.fastq \
				   -o SPAdes_hybrid &

Use samtools to extract the top contig:

head -n 1 SPAdes_hybrid/contigs.fasta
samtools faidx SPAdes_hybrid/contigs.fasta
samtools faidx SPAdes_hybrid/contigs.fasta NODE_1_length_4620446_cov_135.169_ID_22238 > single_contig.fa

Finally, a quick comparison to the reference:

sudo apt-get install mummer 
curl -s "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=NC_000913.3&rettype=fasta&retmode=txt" > NC_000913.3.fa
nucmer NC_000913.3.fa single_contig.fa
mummerplot -png out.delta
display out.png &


And that is a single global alignment between reference and query 🙂

Simple huh?

How to extract FASTQ from the new MinION FAST5 format using poRe

A little while ago I demonstrated how to extract FASTQ from MinION FAST5 files using Rscript and the Linux command line.

In that article, I described how to extract the different FASTQ data using a parameter:

# 2D is the default
invisible(apply(array(f5files), 1, printfastq))
# template
invisible(apply(array(f5files), 1, printfastq, f5path="/Analyses/Basecall_2D_000/BaseCalled_template"))
# complement
invisible(apply(array(f5files), 1, printfastq, f5path="/Analyses/Basecall_2D_000/BaseCalled_complement"))

The f5path parameter is the address of the FASTQ data within the FAST5 file, and that’s been changed recently by Oxford Nanopore.  Well, we can very easily edit our scripts to cope with the new format:

# 2D is the default
invisible(apply(array(f5files), 1, printfastq))
# template
invisible(apply(array(f5files), 1, printfastq, f5path="/Analyses/Basecall_1D_000/BaseCalled_template"))
# complement
invisible(apply(array(f5files), 1, printfastq, f5path="/Analyses/Basecall_1D_000/BaseCalled_complement"))

And that’s it! No new version of poRe required, just a couple of edits to the scripts – example scripts can be found on GitHub.

The five habits of bad bioinformaticians

Whenever I see bad bioinformatics, a little bit of me dies inside, because I know there is ultimately no reason for it to have happened.  As a community, bioinformaticians are wonderfully open, collaborative and helpful.  I genuinely believe that most problems can be fixed by appealing to a local expert or the wider community.  As Nick and I pointed out in our article, help is out there in the form of SeqAnswers and Biostars.

I die even more when I find that some poor soul has been doing bad bioinformatics for a long time.  This most often happens with isolated bioinformatics staff, often young, and either on their own or completely embedded within a wet-lab research group.  I wrote a blog post about pet bioinformaticians in 2013 and much of the advice I gave there still stands today.

There are so many aspects to bioinformatics that many wet lab PIs are simply incapable of managing them all.  This isn’t a criticism.  There are very few superheroes; few people who can actually span the gap between wet and dry science competently.  So I thought I’d write down the five bad habits of bad bioinformaticians.  If you are a pet bioinformatician, read these and figure out if you’re doing them.  If you’re a wet lab PI managing bioinformatics staff, try and find out if they have any of these habits!

Using inappropriate software

This can take many forms, and the most common is out-of-date software.  Perhaps they still use Maq, when BWA and other tools are now far more accurate; perhaps the software is a major version out of date (e.g. Bowtie instead of Bowtie2).  New releases come out for a reason, and major reasons are (i) new/better functionality; (ii) fixing bugs in old code.  If you run out of date software, you are probably introducing errors into your workflow; and you may be missing out on more accurate methods.

Another major cause of inappropriate software use is that people often use the software they can install rather than the software that is best for the job.  Sometimes this is a good reason, but most often it isn’t.  It is worth persevering – if the community says that a tool is the correct one to use, don’t give up on it because it won’t install.

Finally, there are just simple mistakes – using a non-spliced aligner for RNA-Seq, software that assumes the wrong statistical model (e.g. techniques that assume a normal distribution used on counts data) etc etc etc.

These aren’t minor annoyances, they can result in serious mistakes, and embed serious errors in your data which can result in bad science being published.  Systematic errors in analysis pipelines can often look like real results.

Not keeping up to date with the literature

Bioinformatics is a fast moving field and it is important to keep up to date with the latest developments.  A good recent example is RNA-Seq – for a long time, the workflow has been “align to the genome, quantify against annotated genes”.  However, there is increasing evidence that the alignment stage introduces a lot of noise/error, and there are new alignment free tools that are both faster and more accurate.  That’s not to say that you must always work with the most bleeding edge software tools, but there is a happy medium where new tools are compared to existing tools and shown to be superior.

Look, for example, here: a paper suggesting that the use of SAMtools for indel discovery may come with a 60% false discovery rate.  60%!  Wow…  Of course, that was written in 2013, and in 2014 a comparison of a more recent version of SAMtools showed better performance (though still an inflated false positive rate).

Bioinformatics is a research subject.  It’s complicated.

All of this feeds back to point (1) above.  Keeping up to date with the literature is essential, otherwise you will use inappropriate software and introduce errors.

Writing terrible documentation, or not writing any at all

Bioinformaticians don’t document things in the same way wet lab scientists do.  Wet lab scientists keep lab books, which have many flaws; however they are a real physical thing that people are used to dealing with.  Lab books are reviewed and signed off regularly.  You can tell if things have not been documented well using this process.

How do bioinformaticians document things?  Well, often using things like readme.txt files, and web-based media such as wikis or GitHub.  My experience is that many bioinformaticians, especially young and inexperienced ones, will keep either (i) terrible or (ii) no notes on what they have done.  The two are equally bad.  Keeping notes – version-controlled notes – on what happened in what order, what was done and by whom, is essential for reproducible research.  It is essential for good science.  If you don’t keep good notes, then you will forget what you did pretty quickly, and if you don’t know what you did, no-one else has a chance.

Not writing tests

Tests are essential.  They are the controls (positive and negative) of bioinformatics research, and they don’t just apply to software development.  Bad bioinformaticians don’t write tests.  As a really simple example, if you are converting a column of counts to a column of percentages, you may want to check that the percentages sum to 100.  Simple, but it will catch errors.  You may want to find out the sex of all of the samples you are processing and make sure it maps appropriately to the signal from the sex chromosomes.  There are all sorts of internal tests that you can carry out when performing any analysis, and you must implement them, otherwise errors creep in and you won’t know about it.
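The counts-to-percentages control above fits in a couple of lines of awk (file names and data are made up):

```shell
# Toy counts table: feature<TAB>count
printf 'a\t20\nb\t30\nc\t50\n' > counts.txt

# Two-pass awk: the first pass sums the counts, the second prints percentages
awk 'NR==FNR { total += $2; next } { printf "%s\t%.1f\n", $1, 100*$2/total }' \
    counts.txt counts.txt > pct.txt

# The control: the percentages must sum to (approximately) 100
awk '{ s += $2 } END { if (s > 99.9 && s < 100.1) print "OK"; else print "FAIL" }' pct.txt
# OK
```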

Rewriting things that exist already

As Nick and I said, “Someone has already done this, find them!”.  No matter what problem you are working on, 99.9999% of the time someone else has already encountered it and either (i) solved it, or (ii) is working on it.  Don’t re-invent the wheel.  Find the work that has already been done and either use it or extend it.  Try not to reject it because it’s been written in the “wrong” programming language.

My experience is that most bioinformaticians, left to their own devices, will rewrite everything themselves in their preferred programming language and with their own quirks.  This is not only a waste of time, it is dangerous, because they may introduce errors or may not consider problems other groups have encountered and solved.  Working together in a community brings safety, it brings multiple minds to the table to consider and solve problems.


There we are.  That’s my five habits of bad bioinformaticians.  Are you doing any of these?  STOP.  Do you manage a bioinformatician?  SEND THEM THIS BLOG POST, and see what they say 😉

What does the PacBio Sequel mean for the future of sequencing?

PacBio have responded to intense competition in the sequencing market by releasing the Sequel, a machine that promises 7 times the throughput of their last machine (the RSII) at a price of $350k (the RSII cost in the region of $750k). So they have a cheaper, higher-throughput machine (though I haven’t seen cost-per-Gb figures).  That’s around 7 gigabases of reads averaging around 15Kb in length.  Without doubt this is a very interesting platform and they will sell as many of them as they can produce.  A sweet spot for assembly is still 50-60X for PacBio, so think 12 SMRT cells to get a human genome: let’s say £12k per genome. Edit 01/10/2015 my maths at 7am was not that good!  50X human genome is 150Gb, so that’s 21 SMRT cells and £21k per human genome.   Much more expensive than Illumina’s $1000 genome, but far, far better.

I just want to say that at times I have been accused of being an Illumina- and a nanopore- fanboy; I am neither and both.  I am just a fan of cool technology, from microarray in the 90s to sequencing now.

In long reads, let’s be clear, we are talking about the promise of Oxford Nanopore vs the proven technology of PacBio.  And the Sequel changes the dynamics.  However, the MinION fast mode is capable of throughput in the region of 7Gb (like the Sequel) and the PromethION is capable of throughput on Illumina-scale.   Therefore, Oxford Nanopore are far from dead – though they need to respond.

So how does the new PacBio Sequel change the market?  A lot of initial reactions I have had are that the Sequel is a real threat to Oxford Nanopore.  It certainly ramps up the competition in the long read space, which is a really good thing.  But actually, high-throughput long read machines like the Sequel and the PromethION don’t spell the end for one another – they actually spell the beginning of the end for Illumina – as a sequencing platform.

As soon as you have high-throughput, cheap long reads, it is in fact Illumina who face a problem.  I love Illumina.  When I first arrived at Roslin, I walked into our lab and (honestly!) stroked our Illumina GAIIx.  Illumina have revolutionised biology.  However, short reads have limitations – they are bad for genome assembly, they are bad at complex genomes, they’re actually quite bad at RNA-Seq, they are pretty bad for structural variation, they are bad at haplotypes and SNP phasing, and they are not that great at metagenomics.  What has made Illumina the platform of choice for those applications is scale – but as soon as long read technologies reach a similar scale, Illumina looks like a poor choice.

The Sequel (and the PromethION) actually challenge Illumina – because in an era of cheap, long read sequencing, Illumina becomes a genotyping platform, not a sequencing platform.


© 2019 Opiniomics
