
Fast parallel access to data within MinION FAST5 files

As readers of this blog will know, my group was one of the first to publish software for the MinION sequencer, and I have blogged a couple of times on how to extract data from FAST5 files using poRe.  Well, BBSRC, my main funder, were kind enough to give us a grant to develop the software, and one thing we said we’d do is implement parallel processing.

Below is a summary of what we’ve done so far.

Parallel processing in R

One of the huge benefits of R, besides the amazing suite of statistical and visualization tools and the thousands of add-on packages, is that parallelisation is really easy.  In this context, what I mean by parallelisation is using multiple cores on your machine to run a single bit of code.  By using multiple cores, the code is sped up, often roughly linearly with the number of cores you use.

There are quite a lot of packages for R that help you run parallel code, including doParallel, foreach and snow, but we’re going to focus on the built-in package, parallel.  To show the benefits, we have run various toolkits for the extraction of 2D FASTQ data from 25,484 FAST5 files kindly shared by Nick Loman.
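
Before getting to that benchmark, here is a toy sketch (nothing to do with FAST5 files, just squaring numbers with an artificial delay) to show how little code the parallel package needs:

library(parallel)

# a deliberately slow toy function
slow.square <- function(x) {
    Sys.sleep(0.1)
    x * x
}

# serial: roughly 2 seconds for 20 calls
system.time(lapply(1:20, slow.square))

# parallel: roughly half that on two cores
cl <- makeCluster(2)
system.time(parLapply(cl, 1:20, slow.square))
stopCluster(cl)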

Benchmarking FASTQ extraction

Now, as Matt Loose has pointed out to me, focusing on FASTQ extraction is a little pointless, as this will almost certainly be built into future ONT base callers; however, it serves as a useful example, and there are plenty of FAST5 files out there already that you may wish to extract FASTQ from.

The tools we compared are extract2D from poRe, poretools, fast5extract from c_fast5 (a simple C implementation), and extract2DParallel, a parallel version of extract2D (the parallel scripts are in the “misc” folder on GitHub).  We ran each tool 4 times on 1000, 5000, 10000, 15000, 20000 and 25000 randomly selected FAST5 files, clearing the cache completely between runs to ensure a fair comparison.
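
The timing itself is nothing clever; a rough sketch of a single run in R (the path and the poretools command line below are only illustrative, and the cache clearing happened outside of R) looks something like this:

# all FAST5 files in the run (placeholder path)
f5files <- dir(path = "/path/to/fast5", pattern = "\\.fast5$", full.names = TRUE)

# copy a random subset of n files into a scratch directory
n <- 1000
scratch <- "bench_tmp"
dir.create(scratch, showWarnings = FALSE)
file.copy(sample(f5files, n), scratch)

# time one of the extractors on that subset, e.g. poretools
cmd <- paste("poretools fastq --type 2D", scratch, "> bench.fastq")
elapsed <- system.time(system(cmd))["elapsed"]
print(c(n, elapsed))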

[Figure: elapsed time vs. number of FAST5 files for poretools, extract2D, c_fast5 and extract2DParallel]

As can be seen, all tools scale linearly with the number of FAST5 files; poretools is the slowest, then extract2D, then c_fast5.  However, the best result is clearly extract2DParallel (using 4 cores), which outperforms everything else by some distance.

Of course, this is not surprising, and doesn’t come without a cost – if you only have 4 cores then using all 4 will mean all other processes perform really slowly.  However, there are also benefits – if your machine has 8 or 24 cores, you can speed up processing even more.
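
One small habit that helps on a shared machine: rather than hard-coding the number of cores, leave at least one free for everything else, for example:

library(parallel)

# use all but one of the available cores, but never fewer than one
n.cores <- max(1, detectCores() - 1)
cl <- makeCluster(n.cores)
# ... run parLapply/parApply as in the examples below ...
stopCluster(cl)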

The scripts on GitHub should be treated as an early release and have only been tested on one dataset, so I welcome your feedback!

Extracting FASTQ on Windows

We haven’t wrapped this functionality up into poRe yet, but we expect to in the next release.  Meanwhile, for the impatient…


library(poRe)
library(parallel)

# find out how many cores you have
detectCores()
# [1] 4


I have 4 cores, and I probably don’t want to use all of them! I now need to redefine one of poRe’s internal functions so that it works with parallel; the version below appends each read to an output file, rather than printing it to the console:


printfastq <- function (f5, f5path = "/Analyses/Basecall_2D_000/BaseCalled_2D/", out="out.fastq")
{
    # open the FAST5 file, catching errors/warnings from unreadable files
    fid = tryCatch({
        H5Fopen(f5)
    }, warning = function(w) w, error = function(e) e)
    # only proceed if the file opened and actually contains 2D basecalls
    if (typeof(fid) == "S4" && H5Lexists(fid, f5path)) {
        did <- H5Dopen(fid, paste(f5path, "Fastq", sep = ""))
        fq <- (H5Dread(did))
        # the FASTQ record is stored as a single string; split it into its four lines
        four <- strsplit(fq, "\n")[[1]]
        # add the filename to the read header, then reassemble the record
        fq <- paste(paste(four[1], f5, sep = " "), four[2], four[3],
            four[4], sep = "\n")
        # append the record to the output file
        cat(fq, file = out, sep = "\n", fill = FALSE, append=TRUE)
        H5Dclose(did)
    }
    if (typeof(fid) == "S4") {
        H5Fclose(fid)
    }
}

Now let's create a vector of the FAST5 files I want to extract:


my.dir <- "C:/MinION/FAST5/" # use your own directory here!

f5files <- dir(path = my.dir, pattern = "\\.fast5$", full.names = TRUE)
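
It is worth checking the redefined function on a single file before going parallel (the output filename here is just an example):

# how many files do we have, and does printfastq work on one of them?
length(f5files)
printfastq(f5files[1], out = "test.fastq")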

Here is the interesting part, where we set up our parallel "cluster", which is then used to execute the code. Each core within the cluster should be treated as a separate instance of R, so we need to load libraries and export variables to the cluster:


# use two cores
cl <- makeCluster(2)
 
# load poRe on the cluster
c <- clusterEvalQ(cl, suppressWarnings(suppressMessages(library(poRe))))

# export the printfastq function to the cluster
clusterExport(cl, "printfastq")

# run the code and stop the cluster
l <- parApply(cl, array(f5files), 1, printfastq, out="myfastq.fastq")
stopCluster(cl)
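
As a quick sanity check, each FASTQ record is four lines, so counting the reads that were written is simply:

# number of 2D reads written to the output file
length(readLines("myfastq.fastq")) / 4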

Extracting metadata

Absolutely one of the slowest aspects of MinION data analysis and QC is extracting metadata, but we can parallelise this too!  For those of you who like the Linux command line, there is an extractMetaParallel script in the GitHub repo (again: use with caution, and please send feedback!), but for those of you who want to do this within R:



library(poRe)
library(parallel)
# find out how many cores you have
detectCores()
# [1] 4

# get a list of files to extract data from
my.dir <- "C:/MinION/FAST5/" # use your own directory here!
f5files <- dir(path = my.dir, pattern = "\\.fast5$", full.names = TRUE)

# use two cores
cl <- makeCluster(2)
 
# load poRe on the cluster
c <- clusterEvalQ(cl, suppressWarnings(suppressMessages(library(poRe))))

# variables to set paths to template, complement and 2D data
# NOTE you will need to change these for the new FAST5 format
path.t <- "/Analyses/Basecall_2D_000/"
path.c <- path.t
path2d <- path.t

# run the code and stop the cluster
# (get_meta_info comes from poRe, which we loaded on each worker above;
#  the three paths are passed through as extra arguments)
l <- parLapply(cl, 
          f5files, 
          get_meta_info, path2d, path.t, path.c)
stopCluster(cl)

# convert the resulting vector into a matrix with the correct column names
m <- matrix(unlist(l), ncol=12, byrow=TRUE)
colnames(m) <- c("filename","channel_num","read_num","read_start_time","status","tlen","clen","len2d","run_id","read_id","barcode","exp_start") 
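
Everything in that matrix is character data, so for QC it is convenient to turn it into a data frame and make the length columns numeric, for example:

# convert to a data frame and make the read-length columns numeric
meta <- as.data.frame(m, stringsAsFactors = FALSE)
for (col in c("tlen", "clen", "len2d")) {
    meta[[col]] <- as.numeric(meta[[col]])
}

# e.g. summarise the 2D read lengths
summary(meta$len2d)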

Summary

Parallelisation is so easy within R it's almost criminal not to do it. We can see that just a bit of parallelisation speeds up the code incredibly - and we can also see that poRe is a pretty fast toolkit when used in the correct way, which it wasn't in this paper.

NOTE: I am fairly new to benchmarking and parallelisation, so if I have done anything wrong, I apologise - and please get in touch!

3 Comments

  1. Try:

    find . -name "*.fast5" | parallel poretools fastq

  2. How fast does c_fast5 do it when you run one process per core?

    Scaling up cores will eventually hit an IO bottleneck, depending on your storage, and probably sooner with more, smaller, fast5s…

    • biomickwatson, 1st March 2016 at 3:49 pm:

      I’m not sure I understand the question… it’s just a compiled C program, not multi-threaded.
