
Did you notice the paradigm shift in DNA sequencing last week?

There is a very funny tweet by Matthew Hankins about the over-use of “paradigm shift” in articles indexed in Scopus between 1962 and 2014, which clearly suggests we over-use the term.

However, last week a genuine paradigm shift in DNA sequencing was published on bioRxiv by Matt Loose: “Real time selective sequencing using nanopore technology“.  So what makes this paper special?  Why is this a genuine paradigm shift?

Well, this is the first example, ever, of a DNA sequencer selecting which regions of the input genome to sequence. To be more accurate, Matt demonstrates the MinION sequencing DNA molecules and analyzing them in real time, testing whether they come from a region of the genome he is interested in, and, if they do not, rejecting them (“spitting them out”) and making the nanopore available for the next read.
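
To make the idea concrete, here is a tiny, purely illustrative Python sketch of that decision loop. It is not the method in the paper (which matches raw squiggle data in real time rather than base-called sequence), and none of the names or numbers below come from the MinION API; they are just placeholders to simulate “look at the start of each molecule, keep it if it maps to the target region, otherwise eject it”.

```python
import random

random.seed(42)

# Toy genome and a target region within it; all numbers are illustrative.
genome = "".join(random.choice("ACGT") for _ in range(50_000))
target = genome[10_000:15_000]          # the region we actually want to sequence

def maps_to_target(prefix, target):
    """Stand-in for the real-time squiggle matching: here, just an exact
    substring check on the first few hundred bases of the read."""
    return prefix in target

kept = rejected = 0
for _ in range(1_000):
    start = random.randrange(len(genome) - 250)
    prefix = genome[start:start + 250]   # all we "see" before deciding
    if maps_to_target(prefix, target):
        kept += 1        # let the molecule sequence to completion
    else:
        rejected += 1    # reverse the voltage, eject it, free the pore

print(f"kept {kept} reads, rejected {rejected}")
```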

This has huge applications, from human health through to pathogen genomics and microbiome research.  There are many applications in genomics where you are only interested in a subset of the DNA molecules in your sample.  For example, exome sequencing or other sequence-capture experiments – at the moment a separate capture reaction needs to take place, and this can actually be more expensive than the sequencing itself.  Or pathogen diagnostics – here most of what you sequence would be host DNA, which you could ask the sequencer to reject, leaving only the pathogen to sequence; or flip it round – human saliva samples can contain up to 40% oral microbiome DNA, which is wasted if what you want to sequence is the host.  I have worked with piRNA and siRNA datasets from insects where less than 1% of the Illumina reads are the ones I actually want.  And if you want to know whether a particular bacterial infection is resistant to antibiotics, you only need to sequence the antibiotic-resistance genes/islands/plasmids, not the entire genome.  The list goes on.

Of course, we still have a bit of a way to go – Matt demonstrates so-called “read until” on Lambda, a relatively small, simple genome – but this is an amazing proof-of-principle of an incredible idea – that you can tell your sequencer what to sequence.  The paper deserves your attention.  Please read it and fire up your imagination!

6 Comments

  1. We’ve been looking at this paper in the last few days. It is extremely cool; if we were working on amplicons in small genomes we would be all over it, and it looks like it might work well to pick out a small number of genes (splice forms, for example) from cDNA. But just to think ahead, we’re working on insects (~300 Mb genomes) and hoping to target regions on the scale of 100 kb-1 Mb. So we’re trying to figure out when this approach will be viable for this scale. What needs to happen to make this viable for eukaryotes?

    From Matt’s paper, it appears to take 15 seconds to accept or reject a 256-event squiggle (Supplementary Figure 2H) using an algorithm of O(query size * target size); it looks like this is based on a target of the whole lambda genome (page 4) but it’s not clear. But let’s call it a 50kb target, and a 256-event query. At 70bp/s, this means something like the first 1kb of each 1D template is sequenced before rejection.
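
    (Spelling that back-of-the-envelope calculation out in Python, using only the 70 bp/s and 15 s figures quoted above:)

    ```python
    speed_bp_per_s = 70        # assumed translocation speed
    decision_time_s = 15       # time to accept/reject a 256-event squiggle (reading of Fig S2H)

    bases_before_rejection = speed_bp_per_s * decision_time_s
    print(bases_before_rejection)   # 1050, i.e. roughly the first 1 kb of each 1D template
    ```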

    The ratio of reference target to total genome size becomes important here. With a 50kb reference and, say, a 100kb genome, we get 1kb of rejected sequence for every full read of accepted sequence (call it 5kb for a 2D read). But with a 100Mb genome, we get ~2Mb of rejected sequence for every 2D read. For human-scale genomes, we have to sequence ~60 Mb for every 2D read.
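
    (The same ratio argument as a quick script, assuming a 50 kb target, reads starting uniformly at random, and ~1 kb sequenced per rejected read; the genome sizes are just the examples above.)

    ```python
    target_bp = 50_000
    wasted_per_rejected_read_bp = 1_000   # ~1 kb sequenced before each rejection

    for genome_bp in (100_000, 100_000_000, 3_000_000_000):
        off_target_per_on_target = (genome_bp - target_bp) / target_bp
        wasted_bp = off_target_per_on_target * wasted_per_rejected_read_bp
        print(f"{genome_bp:>13,} bp genome: ~{wasted_bp:,.0f} bp rejected per accepted read")
    ```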

    This 15s time to rejection is probably wrong – Matt says something at the end of his talk from the New York MinION Community Meeting about placing reads in lambda in 0.3 s (20:15 into the video), so presumably the 15s from the figure is way off – but what’s the difference? Is the quicker time for the 5kb or 2kb targets in the paper? If so, that’s not great, because we’d really like to increase the size of the target if possible.

    So we either have to reduce the number of events, or improve the algorithm in general. Reducing the events could have a meaningful impact but it’s not quite enough. Supplementary Figure 2G shows that over 80% of reads can be correctly assigned with only 64 events, which would cut down the rejection time to less than 5s (by Fig S2H), and so reduce the extraneous sequencing by at least two thirds, at a cost of missing around 1/5th of the genuine reads – which might be worth it. But it looks like the rejection algorithm needs to get faster before this really becomes viable for larger sequences. Nevertheless, the most exciting paper I’ve read for a while.
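
    (A rough version of that trade-off, using the decision times read off Fig S2H above; the ~20% of missed target reads comes from Fig S2G.)

    ```python
    speed_bp_per_s = 70
    decision_256_s, decision_64_s = 15, 5      # decision times for 256- vs 64-event queries

    waste_256 = speed_bp_per_s * decision_256_s   # ~1050 bp per rejected read
    waste_64 = speed_bp_per_s * decision_64_s     # ~350 bp per rejected read

    reduction = 1 - waste_64 / waste_256
    print(f"extraneous sequencing cut by ~{reduction:.0%}, missing ~20% of genuine target reads")
    ```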

  2. Hi John,

    OK – good questions here – I’ll try and answer them all. I’ve highlighted a few points to make it clear which bits I’m addressing – I might deal with them out of order!

    “From Matt’s paper, it appears to take 15 seconds to accept or reject a 256-event squiggle (Supplementary Figure 2H) using an algorithm of O(query size * target size);”

    OK – so this is probably something we need to clarify in the supplementary – that time figure is not for a single read. I need to check the raw data folder for exactly the number of reads being processed to generate those plots, but it will be of the order of 300-400 – so each read is being mapped in around 0.3-0.4 seconds. Actually I have improved that further since the preprint, so we are around the 0.2 second mark per read at 256 events. At 128 events we are down to 0.1 seconds per read.

    “But let’s call it a 50kb target, and a 256-event query. At 70bp/s, this means something like the first 1kb of each 1D template is sequenced before rejection.”

    Remember we are doing all this in parallel, but let’s consider a single read for a moment. In order to skip the leader sequences we ignore the first 50 events from a read and collect the following 250. So it takes 4.3 s at 70 b/s to capture that data. Actually the capturing is across 512 channels and gets batched up a bit, so there is probably an extra second of lag before we ‘see’ the data. So that’s around 370 events. We then need to process and map that data and send a rejection message back to the sequencer. For a single channel it takes us 0.4 s (in the paper) to map and around 0.4 s to send the rejection message – so that is an extra 56 events. So our total amount of DNA sequenced before rejection would be 426 events.

    We actually run a queue to process the reads, so data can sit in that queue for a period of time. We run a time-out on that queue – if we don’t process a read in time, we know it isn’t worth processing it. Calculating this value depends on what you are doing. If your average read lengths are long then the queue time-out can be quite generous – you will see a relative benefit because rejecting a 10 kb sequence 3 kb into its sequencing (a lag of 42 seconds) still saves you 242 seconds (assuming a 2D read) over not rejecting it. For a 2 kb amplicon a 42 second delay would be too long. For reference, on the 11-amplicon sequencing problem we are using a 10 second queue (which equates to 700 bases), but we rarely overshoot that even on a very active flow cell.
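
    (For anyone wanting to check the arithmetic, here is the same timing budget as a few lines of Python; all the inputs are simply the figures quoted above.)

    ```python
    SPEED = 70                 # events (~bases) per second

    skip_events = 50           # leader sequence ignored
    collect_events = 250       # events collected for matching
    batch_lag_s = 1.0          # extra lag before we "see" the data
    map_s = 0.4                # time to map one read
    reject_s = 0.4             # time to send the rejection message

    events_before_rejection = skip_events + collect_events + SPEED * (batch_lag_s + map_s + reject_s)
    print(events_before_rejection)      # 426.0 events sequenced before rejection

    # Queue time-out example: rejecting a 10 kb molecule 3 kb into its sequencing
    read_len_bp, rejected_at_bp = 10_000, 3_000
    lag_s = rejected_at_bp / SPEED                        # lag before the rejection lands
    saved_s = (2 * read_len_bp - rejected_at_bp) / SPEED  # still saved on a 2D read
    print(round(lag_s, 1), round(saved_s, 1))             # ~42.9 s and ~242.9 s (42 and 242 above)
    ```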

    There are also benefits because the channels become increasingly asynchronous over time – so your queue grows at a similar rate to that at which you can process it, and you catch up really quickly, so the queue is effectively very short.

    All of this means that on a good run we see our rejected reads in the 450-500 bp range – which is pretty close to our 426 theoretical minimum and suggests a processing time of around 1-2 seconds per read including the queueing times.

    “The ratio of reference target to total genome size becomes important here. With a 50kb reference and, say, a 100kb genome, we get 1kb of rejected sequence for every full read of accepted sequence (call it 5kb for a 2D read). But with a 100Mb genome, we get ~2Mb of rejected sequence for every 2D read. For human-scale genomes, we have to sequence ~60 Mb for every 2D read.”

    I think the specific details of the numbers you quote are a little out, but the general point is correct. You have to think carefully about implementing read until type experiments. If you want to fish out small regions of a genome you need to a) match fast and b) have the longest possible read lengths. But fishing out 1% of a genome is going to lead to a lot of ‘short coverage’ of the rest of the genome. Of course you also deplete the sequences from the sequencing pool – a read can’t be sequenced twice – so over time you will see an increased chance of sequencing what you want. Finally on this, I see no negative impact on flow cell performance of sequencing huge numbers of reads – so all that happens if you spit out reads is you see huge read counts!

    “, so presumably the 15s from the figure is way off – but what’s the difference? Is the quicker time for the 5kb or 2kb targets in the paper? If so, that’s not great, because we’d really like to increase the size of the target if possible.”

    OK – so the 15s figure is wrong 🙂 but… the method works really well for amplicons. The ideal experiment would be the longest amplicons you can use. Essentially if you know the start and end points of each amplicon then you can optimise your search space significantly. For reads with random start points in the reference you can’t cheat the search space so we need faster matching methods. So – for amplicons increasing the size of the target is easy (dependent on PCR I guess). For random start sites it gets harder.
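
    (One way to picture why fixed amplicon boundaries help is to count how many candidate start positions need checking. The numbers below are purely illustrative – a lambda-sized reference, the 11-amplicon experiment mentioned above, and my own assumption of a factor of two for both strands.)

    ```python
    reference_bp = 48_500        # roughly the lambda reference
    n_amplicons = 11             # the 11-amplicon experiment

    # Random start sites: a read could begin at (almost) any offset in the reference.
    random_start_anchors = reference_bp

    # Amplicons: reads can only start at a known primer site (assume both strands).
    amplicon_anchors = 2 * n_amplicons

    print(f"~{random_start_anchors / amplicon_anchors:,.0f}x fewer anchors to test for amplicons")
    ```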

    ” Supplementary Figure 2G shows that over 80% of reads can be correctly assigned with only 64 events, which would cut down the rejection time to less than 5s (by Fig S2H), ”

    So – just to clear this up – if we cut the query down to 64 events we would get our matching time down to 0.05 s – but this is just for the optimal amplicon-matching experiment. For larger genomes the matching time is (much) longer.

    ” But it looks like the rejection algorithm needs to get faster before this really becomes viable for larger sequences.”

    ABSOLUTELY! And I hope this comes across in the manuscript – right now what we are showing will work really well for amplicons and for viral-sized genomes. Another method I didn’t put in the manuscript, but which is looking really good at the moment, is barcoding – so load 12 different barcoded samples on your MinION and normalise the samples, or even better choose your desired coverage depth per sample.

    Jumping back through your post:

    “So we’re trying to figure out when this approach will be viable for this scale. What needs to happen to make this viable for eukaryotes?”

    We need faster matching algorithms and we need long libraries. I have a few other methods we’re working on (not quite ready to show those yet!) and am collaborating with some really excellent people to look at this more too. But I think what we have shown here is a good start. We also need to spend a lot of time thinking really hard about the experiments that would best benefit from this technology. As with everything, exploiting this to the full will require carefully thought out and well designed experiments.

    “Nevertheless, the most exciting paper I’ve read for a while.”

    Thanks! It has been one of the most exciting things to be involved with and do too!

    • Thanks Matt, that clears things up a lot. Balancing barcodes would be great!

      Just to check I understand, is the need for longer libraries in order to reduce the queueing? I.e. with longer reads, more channels are occupied for longer with the fragments we want, so there are fewer channels to check for rejections?

      If so, that might make using this to do things like pulling out genes from transcriptomes a bit more difficult, as the fragments will all be relatively short. But it still seems likely to work pretty well.

      On using the starts/ends of amplicons, rather than random start sites – does this explain the differences in the template read distributions in Figures 3B and 3C? It looks like 3B is using random start sites and 3C is only using ends (except maybe amplicon 6?) – is that right? What causes the difference here?

      • Hi John,

        Good questions again!

        Longer read libraries are a way to maximise the benefit of read until, in the sense that if it takes you 250 bases to get enough data to map a read, then another second of all-round processing time to send the reject message to the sequencer, and a further second for the rejection to take place, you will have sequenced 400 bases in the time taken to ‘process’ that read. If your reads are only 400 bases in length then you are only saving yourself 10 seconds of sequencing time – the time it takes to sequence the complement strand. On a 4 kb read, you would save yourself 108 seconds (at 70 b/s sequencing). Does that make sense? For transcriptome-type problems, matching speed is an issue – BUT this is a computational problem. More cores and – hopefully – faster algorithms than I have used will be able to address these problems.
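
        (The saving as a tiny function, using the 70 b/s speed and ~400 bases sequenced before rejection from the explanation above; the exact per-read overheads are of course approximate.)

        ```python
        SPEED = 70                        # bases per second
        SEQUENCED_BEFORE_REJECTION = 400  # bases through the pore before the reject lands

        def seconds_saved_2d(read_len_bp):
            """Time saved by rejecting, vs sequencing template + complement in full."""
            return (2 * read_len_bp - SEQUENCED_BEFORE_REJECTION) / SPEED

        print(round(seconds_saved_2d(400)))    # ~6 s (called ~10 s above, allowing some overhead)
        print(round(seconds_saved_2d(4_000)))  # ~109 s, i.e. the ~108 s quoted above
        ```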

        The second issue – the difference between panels 3B and 3C. Good spot! There is no difference in the amplicon starts here – it’s the same library being sequenced on the same flow cell. However, panel B was generated relatively early in the life of the flow cell – many pores and fast sequencing. The rejection times are slow, so the rejected reads end up being quite long – embarrassingly so, really – and the ends of the rejected reads overlap. Panel C was run a couple of hours later – fewer pores, so more processing time available per read, and the reads are rejected quickly – thus you get those spikes on the template of around 600 bp. This gives a round-trip time of about 8 seconds from the read starting to being rejected. The code has improved somewhat over what was used for this figure and the mapping is a bit faster.
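
        (Just to connect the numbers: the ~600 bp template spikes in panel 3C and the 70 b/s speed imply roughly that 8-second round trip.)

        ```python
        rejected_read_bp = 600   # the template spikes in panel 3C
        speed_bp_per_s = 70

        print(round(rejected_read_bp / speed_bp_per_s, 1))   # ~8.6 s from read start to rejection
        ```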

        Sorry for the delay in replying – hope this makes sense!

        Matt

  3. I picked up the article via tweets about AGBT (I think) and it was great to read, indeed.
    As I am in human genetics, an exon-panel approach was my first thought. But then RNA-seq for relative quantification popped up as my second thought. I was thinking that a short, few-cycle PCR amplification after random or polyT priming would give a favourable enrichment, combined with the ‘short’ reference map of the genes I need a profile of.

    Any thoughts on the feasibility of that?

    PS I would love to work on that as a sideline, but have no access.

    • Hi Jasper,

      This is something we are actively working on. It should be possible – but it depends on the number of reference genes you are after. It’s worth thinking about the maths here a bit – if you want to pull 1 gene out of 30,000 but it’s expressed at a very low level, that’s a lot of rejections. It should be possible – but I’ll post more when we know!
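
      (A rough feel for “a lot of rejections”: if the transcript you want makes up a fraction p of the molecules in the library, you expect about (1 - p) / p rejections for every read you keep. The fractions below are purely illustrative.)

      ```python
      for p in (1 / 30_000, 1 / 3_000, 1 / 300):
          rejections_per_kept_read = (1 - p) / p
          print(f"target fraction {p:.5f}: ~{rejections_per_kept_read:,.0f} rejections per kept read")
      ```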

      Matt
