bioinformatics, genomes, biology etc. "I don't mean to sound angry and cynical, but I am, so that's how it comes across"

Category: R (page 2 of 2)

Extracting MinION fastq on the command line using poRe

Hopefully most of you will be aware of our software for handling MinION data, poRe, published in bioinformatics this year:

Watson M, Thomson M, Risse J, Talbot R, Santoyo-Lopez J, Gharbi K, Blaxter M. (2015) 
poRe: an R package for the visualization and analysis of nanopore sequencing data. 
Bioinformatics 31(1):114-5.

poRe is based on R; many people consider R to be an interactive package, but it is very easy to write command-line scripts for R in very much the same way as you write Perl or Python scripts.  Below is a script that can be used in conjunction with R and poRe to extract 2D fastq data from a directory full of fast5 files:

#!/usr/bin/env Rscript

# get command line arguments as an array
args <- commandArgs(trailingOnly = TRUE)

# if we don't have any command line arguments
# print out usage
if (length(args) == 0) {
        stop("Please provide a directory containing fast5 files as the first (and only) argument")

# the first argument is the directory, stop if it doesn't exist
my.dir <- args[1]
if (! file.exists(my.dir)) {
        stop(paste(my.dir, " does not exist"))

# get a list of all of the files with a .fast5 file name extension
f5files <- dir(path = my.dir, pattern = "\.fast5$", full.names = TRUE)

# print to STDERR how many files we are going to attempt
write(paste("Extracting 2D fastq from", length(f5files), "files"), stderr())

# load poRe

# apply printfastq to every entry in f5files
# printfastq tests for 2D fastq and if found prints to STDOUT
invisible(apply(array(f5files), 1, printfastq))

The suppressMessages() function means that none of the output is printed when the poRe library is loaded, and the invisible() function suppresses the natural output of apply() (which would otherwise return an array of undefineds the same length as f5files (which we definitely do not want).

The above is based on R 3.1.2 and poRe 0.9.

We could extract template and complement fastq (respectively) by substituting in the following lines:

# template
invisible(apply(array(f5files), 1, printfastq, f5path="/Analyses/Basecall_2D_000/BaseCalled_template"))
# complement
invisible(apply(array(f5files), 1, printfastq, f5path="/Analyses/Basecall_2D_000/BaseCalled_complement"))


Analysing MinION data with no bioinformatics expertise nor infrastructure

It has been a long standing aim of mine to enable easy analysis of Oxford Nanopore MinION data, preferably by people who don’t have a ton of bioinformatics skills.  This was one of the reasons we started to develop poRe, which is really easy to install, runs on any platform and is fairly easy to use.

Below is a proof-of-concept method for analysing MinION data without the need for bioinformatics infrastructure – I am doing this on my Windows laptop, with only R and an internet connection.  We are going to use the EBI web-services to do the heavy lifting.  Please do not abuse this!  EBI probably don’t want you to submit 20,000 jobs.  As I said, this is “proof-of-concept”.  It means you’re not supposed to actually use it in practice 😉

Some polite things to do:

  1. Use your own, real e-mail address when/if you run the code below
  2. Please stick to a small number of jobs! (unless EBI tell you it’s OK)

We’re going to do everything in R.

Make a directory on your computer where we will do the work.  You will need approx. 15Gb of space (I know, the data are big!).   My directory is going to be “C:/Users/Mick/Documents/MinION”.  We’re going to use data from Nick Loman’s group, published here and here.

# go to the working directory

# load the necessary packages

# download one of Nick's runs and unpack it
url <- "ftp://ftp.sra.ebi.ac.uk/vol1/ERA362/ERA362836/oxfordnanopore_native/Ecoli_R7_ONI_flowcell_17.tar.gz"
dest <- "Ecoli_R7_ONI_flowcell_17.tar.gz"
download.file(url=url, destfile=dest)

This will take some time, as there is about 10Gb of data there.  If I had a smaller dataset, I would use it.

Now, we use the poRe get_fast5_info() function to read statistics from the several thousand MinION reads, which again takes some time.  We then just select the reads that have a 2D sequence:

# read information from all .fast5 files in the current working directory
finfo <- read.fast5.info(dir=getwd())

# get a list of fast5 files where the "len2d" is greater than zero
reads.2d <- rownames(finfo)[finfo$len2d>0]

Now, we are going to use the EBI REST API for NCBI BLAST.  You can read about it at the link provided, but essentially it is about building and querying custom URLs to submit BLAST jobs and fetch the result.

We are only going to do this for the first ten fast5 sequences.  Please do not submit more, or the EBI may block you.  If you want to use this in anger, I suggest contacting the EBI directly and asking them politely.  The code blasts each sequence with default parameters against the EMBL bacterial database, saves the BLAST report to your current working directory, and prints out the top hit.

Here goes:

# iterate over the first ten 2D reads
for (f5 in reads.2d[1:10]) {

    # use poRe function get_fasta() to get the 2D sequence
    	fasta.2d <- get_fasta(f5)$'2D'
    # build a list of parameters for the EBI BLAST service
    	myl <- list(sequence=fasta.2d,

    # build the submission URL and submit the job
    runurl <- "http://www.ebi.ac.uk/Tools/services/rest/ncbiblast/run"
    	jobid <- postForm(runurl, .params=myl)

    # query the status URL for the job until the status
    # is no longer "RUNNING"
    staturl <- "http://www.ebi.ac.uk/Tools/services/rest/ncbiblast/status/"
    	surl <- paste(staturl, jobid, sep="")
    	status <- "RUNNING"
    	while(status == "RUNNING") {
          		status <- getURL(surl)

    # build the output URL and fetch the results
    outurl <- "http://www.ebi.ac.uk/Tools/services/rest/ncbiblast/result/"
    	ourl <- paste(outurl, jobid, "/out", sep="")
    	out <- getURL(ourl)

    # write the output to a file in the current working directory
    	bloutfile <- gsub(".fast5",".blast",f5)
    	write(out, bloutfile)

    # print out the top hit
    	print(strsplit(out, "n")[[1]][22])


Now, it’s not quick – it probably takes about 2-3 minutes to complete on my laptop.  It also isn’t very responsive (in that it looks a bit like it may not be doing anything!).  However, eventually the R terminal will show the top hits:


You can see from the above that the first few reads are a bit dodgy and don’t really hit anything, but pretty quickly the reads start hitting E coli K12, which is what they sequenced.

Also, in your directory you will find a whole bunch of BLAST reports:


Note, as “proof-of-concept” code, there is very little (i.e. nothing) in the way of error-checking in the above code!  I mean, come on, it’s only a blog post!


Recreating a famous visualisation

Many of you will be aware of the Project Tycho visualisation(s) about vaccination that were featured in the Wall Street Journal – if you’re not, you should be.  They’re beautiful, and almost everything a good visualisation should be:

wsj_projecttychoThe plots shown on the WSJ website are actually interactive and use Javascript, which is a bit beyond the scope of this blog post, but I wanted to show how you can create something similar.

The Data

I can’t give you the data due to T&Cs, but you can download it yourself.  Go to Project Tycho, register, log in, choose level 2 data, click “search and retrieve data”, choose state rather than city, add all of the states to the selection box, unclick graph and Submit query.  There is then an option to download in Excel.  The Excel is actually CSV and we can import this into R quite easily.  This is what (a subset of) my data look like in R:

1096 1930    1       7      0              0       4      196        178       18          64
1097 1930    2      24      0              0       0        2        442       69          62
1098 1930    3      28      0              0       2        9        490       26          44
1099 1930    4      21      0              0       1        3        628       40          23
1100 1930    5      47      0              0       5        7        864      101          25
1101 1930    6      63      0              0       6        6        943      101          24
1102 1930    7     108      0              0       5        5        954       65          20
1103 1930    8     167      0              0       2        9       1151      170          13
1104 1930    9     191      0              0       7       15       1433      150          23
1105 1930   10     242      0              0       5        8       1514      256          39

We need to do a few things – limit data to beyond 1930, substitute the “-” for zeros, convert everything to numeric, and sum over weeks:

d[d=="-"] <- 0
for (i in 2:61) {
   d[,i] <- as.numeric(d[,i])
d <- d[d$YEAR>=1930,]
yd <- aggregate(d[,3:61], by=list(year=d[,1]), sum)
yd <- yd[order(yd$year),]

My data now look like this:

1  1930    4389      0              0    2107      996      43585    11821        1978      264
2  1931    8934      0              0    2135      849      27807     4787       12869     2428
3  1932     270      0              0      86       99      12618     2376        5701       39
4  1933    1735      0              0    1261     5438      26551      297        5430      205
5  1934   15849      0              0    1022     7222      25650    12946        4659     3063
6  1935    7214      0              0     586     1518      28799    12786       24149     1040
7  1936     572      0              0    2378      107      49092      604        3987     1490
8  1937     620      0              0    3793      322       5107     1027       10181     1796
9  1938   13511      0              0     604     6690      19452     8403        1388      492
10 1939    4381      0              0     479     1724      67180     5182       15817      159

Creating the visualisation

It’s important to note that great visualisations don’t just happen; they take work; they evolve over time; there is trial-and-error until you get what you want.  Don’t be scared of this, embrace it!

I’m going to use the heatmap.2() function from gplots in R.  Let’s look at a basic, raw heatmap (with some things turned off that I always turn off in heatmap.2()):

heatmap.2(as.matrix(yd[,2:60]), trace="none", key=FALSE)

basic_rawOK… I don’t actually want to cluster years, nor do I want to cluster states, so I should turn off the dendrograms.  I also have the data the wrong way round, so I need to transpose the matrix:

heatmap.2(as.matrix(t(yd[,2:60])), Rowv=NULL, Colv=NULL, 
          dendrogram="none", trace="none", key=FALSE)

trans_rawOK, we start to see a patter where there are more yellow squares to the left of the diagram; yellow in this context means high, and left means early in the period we’re looking at – so we begin to see patterns.  Still a terrible heatmap.  There is lots of wasted space; there are no year labels; and the state names are being cut-off.  Let’s deal with those:

heatmap.2(as.matrix(t(yd[,2:60])), Rowv=NULL, Colv=NULL, 
        dendrogram="none", trace="none", key=FALSE,
        labCol=yd$year, lhei=c(0.1,1), lwid=c(0.1,1), margins=c(5,12))

I get an error about the figure margins being too large, but this is quite common and can often be ignored

better_rawWe’re getting somewhere now.  It’s time to think about colours.


Let’s get something awkward out of the way.  The guys who created this visualisation cheated a little bit.  Let’s look at the colour scale:

wsj_colorsThis is not a natural colour scale.  White-through-to-blue is fine, but blue does not naturally flow into yellow and then red.  So this is a false color scheme which has been designed for maximum impact – low numbers of measles cases get nice colours such as white, blue and green; high numbers get danger colours such as yellow and red.  Is this acceptable?  I’ll let you decide.

In addition, note the WSJ figures have been scaled per 100,000 population.  I do not have the historical population figures for American states going back to 1930, so we are plotting the raw data (raw number of cases) not scaled for population.  So we can’t completely recreate the WSJ figure, but we can get close.

Colours and breaks

In heatmaps, we have the concept of colours and breaks.  Breaks are provided as a vector of numbers, and essentially, we use each colour to represent values that fall between each subsequent pair of breaks.  In other words, the colours fit “between” pairs of breaks.  Perhaps best explained by example.  Let’s say our breaks vector is 0, 100, 200, and our colours are “green” and “red”.  Therefore, any value between 0 and 100 will be green in the heatmap, and any value between 100 and 200 will be red.   As colours fit in the middle between breaks, there should be one less colour than there are breaks.  I often find it useful to set my own breaks and colours in heatmaps – the function can decide for you, but how do you know what assumptions it is making?


If I want to recreate the colours of the WSJ plot, I need low values between white and blue, and medium-to-high values between yellow and red.  My minimum and maximum values are 0 and 126221:


I’m going to use colorRampPallette() to create ten colours between white and blue; then 30 colours between yellow and red.  I’m also going to game the breaks, just so I can get something similar to the WSJ look:

cols <- c(colorRampPalette(c("white", "cornflowerblue"))(10),
          colorRampPalette(c("yellow", "red"))(30))

I now need to set my breaks.  I want white to be zero, so my first two breaks should be 0 and 0 (i.e. only values between 0 and 0 will be white; in other words, only 0 will be white).  I’m then going to scale up to 700 in steps of 100; and the rest I am going to split equally between 800 and the maximum value (which I’ll set at 127000).

bks <- c(0,0,100,200,300,400,500,600,700,seq(800,127000,length=32))

NOTE: this is a bit dodgy.  Really, my bins should be the same size, but it’s the only way I could recreate the WSJ look with this data.

Now the heatmap:

heatmap.2(as.matrix(t(yd[,2:60])), Rowv=NULL, Colv=NULL, 
        dendrogram="none", trace="none", key=FALSE,
        labCol=yd$year, lhei=c(0.1,1), lwid=c(0.1,1), margins=c(5,12),
        col=cols, breaks=bks)

better_coloursOK!  Now we’re getting somewhere!  So, the WSJ heatmap has gaps between each row and between each column, so let’s add those in:

heatmap.2(as.matrix(t(yd[,2:60])), Rowv=NULL, Colv=NULL, 
        dendrogram="none", trace="none", key=FALSE,
        labCol=yd$year, lhei=c(0.1,1), lwid=c(0.1,1), margins=c(5,12),
        col=cols, breaks=bks, colsep=1:72, rowsep=1:57, sepcolor="white")

datasepWe want a line at 1961, which after a few bits of trial-and-error, turns out to be at x=32:

heatmap.2(as.matrix(t(yd[,2:60])), Rowv=NULL, Colv=NULL, 
        dendrogram="none", trace="none", key=FALSE,
        labCol=yd$year, lhei=c(0.1,1), lwid=c(0.1,1), margins=c(5,12),
        col=cols, breaks=bks, colsep=1:72, rowsep=1:57, sepcolor="white",

datasep_lineNow let’s tidy up those year labels:

row.labels <- rep("", 72)
row.labels[c(1,11,21,31,41,51,61,71)] <- c("1930","1940","1950","1960","1970",
heatmap.2(as.matrix(t(yd[,2:60])), Rowv=NULL, Colv=NULL, 
        dendrogram="none", trace="none", key=FALSE,
        labCol=row.labels, cexCol=1, lhei=c(0.1,1), lwid=c(0.1,1), margins=c(5,12),
        col=cols, breaks=bks, colsep=1:72, srtCol=0, rowsep=1:57, sepcolor="white",

tidy_labelsWe really should add a title, and we’ll need to go back and make room for it using the lhei parameter:

heatmap.2(as.matrix(t(yd[,2:60])), Rowv=NULL, Colv=NULL, 
        dendrogram="none", trace="none", key=FALSE,
        labCol=row.labels, cexCol=1, lhei=c(0.15,1), lwid=c(0.1,1), margins=c(5,12),
        col=cols, breaks=bks, colsep=1:72, srtCol=0, rowsep=1:57, sepcolor="white",
        main='Measles cases in US states 1930-2001nVaccine introduced 1961')

titleAnd there we have it!  I think this is about as close to the WSJ viualisation as I can get!

The only thing I would say is that I am disappointed that I had to game the break points to get here.  It doesn’t sit well with me.  However, it was the only way I could figure out how to get so much blue into the left hand side of the graph, whilst still keeping the right side white.

The simple fact is, that keeping the break points a similar size (100), the graph looks better:

cols <- c(colorRampPalette(c("white", "cornflowerblue"))(10),
          colorRampPalette(c("yellow", "red"))(1261))
bks <- c(0,0,100,200,300,400,500,600,700,seq(800,127000,by=100))
heatmap.2(as.matrix(t(yd[,2:60])), Rowv=NULL, Colv=NULL, 
        dendrogram="none", trace="none", key=FALSE,
        labCol=row.labels, cexCol=1, lhei=c(0.15,1), lwid=c(0.1,1), margins=c(5,12),
        col=cols, breaks=bks, colsep=1:72, srtCol=0, rowsep=1:57, sepcolor="white",
        main='Measles cases in US states 1930-2001nVaccine introduced 1961
             n(data from Project Tycho)')


So, there we are 🙂  A final (larger version) can be found here.

Newer posts

© 2022 Opiniomics

Theme by Anders NorenUp ↑