## Opiniomics

### bioinformatics, genomes, biology etc. "I don't mean to sound angry and cynical, but I am, so that's how it comes across"

#### Tag: R

As many of you will be aware, I like to post some R code, and I especially like to post base R versions of ggplot2 things!

Well these amazing boxplots turned up on github – go and check them out!

So I did my own version in base R – check out the code here and the result below.  Enjoy!

Reading data from Google Sheets into R is something you imagine should be really simple, but it often is anything but. However, the googlesheets package goes a long way towards solving this problem.

Let’s crack on with an example.

First, install the software:

```
install.packages("googlesheets")
```

We then need an example sheet to work with, and I’m going to use one from Britain Elects:

So go here and add this to your own collection (yes you’ll need a google account!) using the little “Add to my drive” icon to the right of the title, just next to the star.

Right, let’s play around a bit!

```
# load package
library(googlesheets)

# may ask you to authenticate in a browser!
gs_ls()

# get the Britain Elects google sheet
be <- gs_title("Britain Elects / Public Opinion")
```

This should show you all of the sheets you have access to, and also select the one we want.

Now we can see which worksheets exist within the sheet:

```
# list worksheets
gs_ws_ls(be)
```

```
# get Westminster voting
west <- gs_read(ss=be, ws = "Westminster voting intentions", skip=1)

# convert to data.frame
wdf <- as.data.frame(west)
```

And hey presto, we have the data. Simples!

Now let's plot it:

```
# reverse so that data are forward-sorted by time
wdf <- wdf[nrow(wdf):1,]

# treat all dates as a single point on the x-axis
dates <- 1:nrow(wdf)

# smoothing parameter
fs <- 0.04

# plot Conservative
plot(lowess(dates, wdf$Con, f=fs), type="l", col="blue", lwd=5, ylim=c(0,65), xaxt="n", xlab="", ylab="%", main="Polls: Westminster voting intention")

axis(side=1, at=dates[seq(1, length(dates), by=40)], labels=paste(wdf$"Fieldwork end date", wdf$Y)[seq(1, length(dates), by=40)], las=2, cex.axis=0.8)

# plot Labour and LibDem
lines(lowess(dates, wdf$Lab, f=fs), col="red", lwd=5)
lines(lowess(dates, wdf$LDem, f=fs), col="orange", lwd=5)

# add UKIP, treating absent values as -50 for plotting purposes
ukip <- wdf$UKIP
ukip[is.na(ukip)] <- -50
lines(lowess(dates, ukip, f=fs), col="purple", lwd=5)

legend(1, 65, legend=c("Con","Lab","LibDem","UKIP"), fill=c("blue","red","orange","purple"))
```

You have a matrix in R, and you want to visualise it – say, for example, with each cell coloured according to the value in the cell.  Not a heatmap, per se, as that requires clustering; just a simple, visual image.

Well, the answer is image() – however, there is a slightly bizarre design choice buried in the help page:

Notice that `image` interprets the `z` matrix as a table of `f(x[i], y[j])` values, so that the x axis corresponds to row number and the y axis to column number, with column 1 at the bottom, i.e. a 90 degree counter-clockwise rotation of the conventional printed layout of a matrix.

Let’s look at that:

```
mat <- matrix(c(1,1,1,0,0,0,1,0,1), nrow=3, byrow=TRUE)
mat
#      [,1] [,2] [,3]
# [1,]    1    1    1
# [2,]    0    0    0
# [3,]    1    0    1

image(mat)
```

This produces an image in which we can clearly see that the 1,0,1 row is now the right-hand column, i.e. the matrix has in fact been rotated 90 degrees anti-clockwise.

To counteract this we need to rotate the input matrix clockwise before passing to image:

```
rotate <- function(x) t(apply(x, 2, rev))

image(rotate(mat))
```

This now produces what we might expect.  Easy 🙂
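To convince ourselves that rotate() really is a clockwise rotation (and not just a fix that happens to work for this particular matrix), here is a quick check on a non-square matrix – a small sketch of my own, not from the original post:

```
rotate <- function(x) t(apply(x, 2, rev))

m <- matrix(1:6, nrow=2, byrow=TRUE)
m
#      [,1] [,2] [,3]
# [1,]    1    2    3
# [2,]    4    5    6

rotate(m)
#      [,1] [,2]
# [1,]    4    1
# [2,]    5    2
# [3,]    6    3
```

Each row of the original becomes a column, with the top row ending up on the right – exactly a 90 degree clockwise turn, so image(rotate(mat)) displays the matrix in its conventional printed orientation.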

I came across a really cool tweet a few days ago containing an animated GIF demonstrating rise in global temperatures since 1880:

Link to the blog is here.

I wanted to see if I could recreate this GIF in R, and below are some instructions on how to get close.

The first step is to try and see if I can replicate the original, final image.  The code below comes close enough for this demo, but strangely I am approx. 1 degree C out.  I’ve left a comment on the original blog to ask why – it’s not clear how the exact figures are calculated.

Anyway, here is how to get close:

```
# read the GISTEMP monthly data
# (the source file/URL was not shown in this version of the post)
d <- read.table("...",
     skip=7, header=TRUE, nrows=143, colClasses = "character")

# clean data
d <- d[-grep("Year", d$Year),]
rownames(d) <- d[,1]
d <- d[,-1]
d <- d[,-13:-19]
d[d=="****"] <- NA

# convert to numeric and celsius
for (i in 1:12) {
    d[,i] <- as.numeric(d[,i]) / 100
}

# seasonal anomaly (s is read in separately; not shown in this version of the post)
sa <- as.numeric(s$seas_anom)

# create colours from blue through to red
colours <- colorRampPalette(c("grey","blue","red"))(nrow(d))

# create initial plot
mmin <- -3
mmax <- 3
par(mar=c(2,2,2,0))
plot(1:12, d[1,]+sa, type="l", ylim=c(mmin,mmax),
     yaxt="n", xaxt="n", bty="n")

axis(side=1, at=1:12,
     labels=c("Jan","Feb","Mar",
              "Apr","May","Jun",
              "Jul","Aug","Sep",
              "Oct","Nov","Dec"),
     lwd=0, lwd.ticks=1, cex.axis=0.8)

axis(side=2, at=-3:3, labels=-3:3,
     lwd=0, lwd.ticks=1, las=2, cex.axis=0.8)

title(main=expression(paste("Temperature ","Anomaly ","(",degree,"C)")),
      adj=0, cex.main=1, line=1)
title(main="(Difference from 1980-2015 annual mean)", adj=0, cex.main=0.6,
      line=0, font.main=1)

# horizontal grid lines
for (j in -3:3) {
    lines(1:12, rep(j, 12), lty=3)
}

# draw one line per year
for (j in 1:nrow(d)) {
    lines(1:12, as.numeric(d[j,])+sa, col=colours[j])
}
```

This produces the following image, which is close enough!

So how do we create the animation?  The old skool way of doing this is to create every image in R and create an animated GIF with ImageMagick.  So we'll do that 🙂

The slightly adjusted code to create every image is here (creates 137 PNGs in working directory!):

```
#
# Assumes you have run the code above to create d and sa
#

doplot <- function(i) {

    x   <- d[i,] + sa
    col <- colours[i]

    mmin <- -3
    mmax <- 3
    par(mar=c(2.5,2.5,2,0))
    plot(1:12, x, type="l", ylim=c(mmin,mmax),
         yaxt="n", xaxt="n", bty="n", col=col)
    axis(side=1, at=1:12, labels=c("Jan","Feb","Mar",
                                   "Apr","May","Jun",
                                   "Jul","Aug","Sep",
                                   "Oct","Nov","Dec"),
         lwd=0, lwd.ticks=1, cex.axis=1.5)

    axis(side=2, at=-3:3, labels=-3:3, lwd=0,
         lwd.ticks=1, las=2, cex.axis=1.5)

    title(main=expression(paste("Temperature ","Anomaly ","(",degree,"C)")),
          adj=0, cex.main=1.5, line=1)
    title(main="(Difference from 1980-2015 annual mean)",
          adj=0, cex.main=1, line=0, font.main=1)

    # horizontal grid lines
    for (j in -3:3) {
        lines(1:12, rep(j, 12), lty=3)
    }

    # draw all years up to and including year i
    for (k in 1:i) {
        lines(1:12, d[k,]+sa, col=colours[k])
    }

    # label the current year
    text(7, 0.8, rownames(d)[i], col=colours[i], font=2, cex=2)

}

# create plots/PNGs
for (j in 1:nrow(d)) {
    name <- paste("plot", sprintf("%03d", j), ".png", sep="")
    png(name, width=800, height=600)
    doplot(j)
    dev.off()
}
```
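One small detail worth noting: the sprintf() format used to build the PNG filenames zero-pads the frame number, so the shell glob *.png presents frames to ImageMagick in the right order (plot002.png sorts before plot010.png as a plain string). For example:

```
# zero-padded frame names sort correctly as plain strings
sprintf("plot%03d.png", c(1, 42, 137))
# [1] "plot001.png" "plot042.png" "plot137.png"
```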

Then in ImageMagick, we use convert to create the animated GIF:

```
convert *.png -delay 3 -loop 1 globaltemp.gif
convert globaltemp.gif \( +clone -set delay 500 \) +swap +delete globaltempwpause.gif
```

Final result here: Done 🙂

I have to admit to being a bit of a snob when it comes to graphs and charts in scientific papers and presentations.  It’s not like I think I am particularly good at it – I’m OK – it’s just that I know what’s bad.  I’ve seen folk screenshot multiple Excel graphs so they can paste them into a powerpoint table to create multi-panel plots… and it kind of makes me want to scream.   I’m sorry, I really am, but when I see Excel plots in papers I judge the authors, and I don’t mean in a good way.  I can’t help it.  Plotting good graphs is an art, and sticking with the metaphor, Excel is paint-by-numbers and R is a blank canvas, waiting for something beautiful to be created; Excel is limiting, whereas R sets you free.

Readers of this blog will know that I like to take plots that I find which are fabulous and recreate them.  Well let’s do that again 🙂

I saw this Tweet by Trevor Branch on Twitter and found it intriguing:

It shows two plots of the same data: an Excel plot, and a multi-panel plot.  You’re clearly supposed to think the latter is better, and I do; perhaps disappointingly, though, the top graph would be easy to produce in Excel, while I’m guessing most people would find it impossible to create the bottom one (in Excel or otherwise).

Well, I’m going to show you how to create both, in R. All code now in Github!

The Excel Graph

Now, I’ve shown you how to create Excel-like graphs in R before, and we’ll use some of the same tricks again.

First we set up the data:

```
# set up the data
df <- data.frame(Circulatory=c(32,26,19,16,14,13,11,11),
                 Mental=c(11,11,18,24,23,24,26,23),
                 Musculoskeletal=c(17,18,13,16,12,18,20,26),
                 Cancer=c(10,15,15,14,16,16,14,14))

rownames(df) <- seq(1975,2010,by=5)

df
```

Now let's plot the graph:

```
# set up colours and points
cols <- c("darkolivegreen3","darkcyan","mediumpurple2","coral3")
pch <- c(17,18,8,15)

# we have one point on the x-axis for each row of df (nrow(df))
# we then add 2.5 to make room for the legend
xmax <- nrow(df) + 2.5

# make the borders smaller
par(mar=c(3,3,0,0))

# plot an empty graph
plot(1:nrow(df), 1:nrow(df), pch="",
     xlab=NA, ylab=NA, xaxt="n", yaxt="n",
     ylim=c(0,35), bty="n", xlim=c(1,xmax))

# horizontal grid lines
for (i in seq(0,35,by=5)) {
    lines(1:nrow(df), rep(i,nrow(df)), col="grey")
}

# for each dataset, plot points and lines
for (i in 1:ncol(df)) {

    points(1:nrow(df), df[,i], pch=pch[i],
           col=cols[i], cex=1.5)

    lines(1:nrow(df), df[,i], col=cols[i],
          lwd=4)

}

axis(side=1, at=1:nrow(df), tick=FALSE,
     labels=rownames(df))

axis(side=1, at=seq(-0.5,8.5,by=1),
     tick=TRUE, labels=NA)

axis(side=2, at=seq(0,35,by=5), tick=TRUE,
     las=TRUE, labels=paste(seq(0,35,by=5),"%",sep=""))

legend(8.5,25,legend=colnames(df), pch=pch,
       col=cols, cex=1.5, bty="n", lwd=3, lty=1)
```

And here is the result.

The multi-plot

Plotting multi-panel figures in R is sooooooo easy!  Here we go for the alternate multi-plot.  We use the same data.

```
# split into 2 rows and 2 cols
split.screen(c(2,2))

# keep track of which screen we are
# plotting to
scr <- 1

# iterate over columns
for (i in 1:ncol(df)) {

    # select screen
    screen(scr)

    # reduce margins
    par(mar=c(3,2,1,1))

    # empty plot
    plot(1:nrow(df), 1:nrow(df), pch="", xlab=NA,
         ylab=NA, xaxt="n", yaxt="n", ylim=c(0,35),
         bty="n")

    # plot all data in grey
    for (j in 1:ncol(df)) {
        lines(1:nrow(df), df[,j],
              col="grey", lwd=3)
    }

    # plot selected column in blue
    lines(1:nrow(df), df[,i], col="blue4", lwd=4)

    points(c(1,nrow(df)), c(df[1,i], df[nrow(df),i]),
           pch=16, cex=2, col="blue4")

    # label first and last values in the margins
    mtext(df[1,i], side=2, at=df[1,i], las=2)
    mtext(df[nrow(df),i], side=4, at=df[nrow(df),i],
          las=2)

    title(colnames(df)[i])

    # add axes if we are one of
    # the bottom two plots
    if (scr >= 3) {
        axis(side=1, at=1:nrow(df), tick=FALSE,
             labels=rownames(df))
    }

    # next screen
    scr <- scr + 1
}

# close multi-panel image
close.screen(all=TRUE)
```

And here is the result – and there we have it.
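As an aside, for a plain 2x2 grid you could also use par(mfrow=c(2,2)) rather than split.screen(); split.screen() gives finer control over individual panels, but the mfrow route is a one-liner. A minimal sketch of my own (recreating df, and writing to a PDF device so it runs anywhere):

```
# df as set up earlier in the post
df <- data.frame(Circulatory=c(32,26,19,16,14,13,11,11),
                 Mental=c(11,11,18,24,23,24,26,23),
                 Musculoskeletal=c(17,18,13,16,12,18,20,26),
                 Cancer=c(10,15,15,14,16,16,14,14))
rownames(df) <- seq(1975,2010,by=5)

# same four panels via par(mfrow) instead of split.screen()
pdf("multipanel.pdf")
par(mfrow=c(2,2), mar=c(3,2,1,1))
for (i in 1:ncol(df)) {
    plot(1:nrow(df), df[,i], type="l", col="blue4", lwd=4,
         ylim=c(0,35), bty="n", main=colnames(df)[i])
}
dev.off()
```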

So which do you prefer?

Use of the Pearson correlation coefficient is common in genomics and bioinformatics, which is OK as far as it goes (I have used it extensively myself), but it has some major drawbacks – the main one being that Pearson is highly sensitive to outliers: a handful of very large measurements can produce a large coefficient even when the rest of the data are uncorrelated.

This is best shown via example in R:

```
# let's correlate some random data
g1 <- rnorm(50)
g2 <- rnorm(50)

cor(g1, g2)
#  -0.1486646
```

So we get a small, negative correlation from correlating two sets of 50 random values. If we ran this 1000 times we would get a distribution around zero, as expected.
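That claim is easy to check with a quick simulation (mine, not from the original post): correlate 1000 pairs of unrelated random vectors and look at the distribution of coefficients:

```
# distribution of Pearson r for 1000 pairs of unrelated vectors (n=50)
set.seed(42)
cors <- replicate(1000, cor(rnorm(50), rnorm(50)))

mean(cors)    # very close to zero
range(cors)   # roughly -0.4 to 0.4
hist(cors)    # symmetric around zero
```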

Let's add in a single, large value:

```
# let's correlate some random data with the addition of a single, large value
g1 <- c(g1, 10)
g2 <- c(g2, 11)

cor(g1, g2)
#  0.6040776
```

Holy smokes, all of a sudden my random datasets are positively correlated with r>=0.6!

It's also significant.

```
> cor.test(g1,g2, method="pearson")

        Pearson's product-moment correlation

data:  g1 and g2
t = 5.3061, df = 49, p-value = 2.687e-06
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.3941015 0.7541199
sample estimates:
      cor
0.6040776
```

So if you have used Pearson in large datasets, you will almost certainly have some of these spurious correlations in your data.

How can you solve this? By using Spearman, of course:

```
> cor(g1, g2, method="spearman")
 -0.0961086
> cor.test(g1, g2, method="spearman")

        Spearman's rank correlation rho

data:  g1 and g2
S = 24224, p-value = 0.5012
alternative hypothesis: true rho is not equal to 0
sample estimates:
       rho
-0.0961086
```
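The reason Spearman shrugs off the outlier is that it is simply Pearson computed on ranks – ranking turns the huge value into nothing more than the largest rank. A quick demonstration (my own sketch, recreating data of the same shape as above):

```
# random data plus one large value in each vector, as before
set.seed(1)
g1 <- c(rnorm(50), 10)
g2 <- c(rnorm(50), 11)

# Spearman's rho is exactly Pearson computed on the ranked data
cor(g1, g2, method="spearman")
cor(rank(g1), rank(g2))        # same value
```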

If you have MiSeq and ONT MinION data, it is now very easy to combine them into a single-contig, high quality assembly.  Here I will start with a base Ubuntu 14.04 LTS distribution and take you through every step, including software installation.  I am starting with an Amazon EC2 m4.xlarge AMI with 4 vCPUs and 16Gb RAM.  If you want to repeat the work below, you don’t need to use EC2, any computer running Ubuntu and with enough disk/RAM should work.

For MinION data, we will use Nick’s SQK-MAP-006 data at the EBI.  Using a simple trick, and because I know the internal structure of the tar archive, we can download only the base-called data:

```
# (download command not shown)
```

We can also get some MiSeq data from ENA from the same strain, E coli K-12 MG1655:

```
wget -q ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR262/005/SRR2627175/SRR2627175_2.fastq.gz &
wget -q ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR262/005/SRR2627175/SRR2627175_1.fastq.gz &
```

```
sudo apt-get update
sudo apt-get install r-base-core git samtools
```

This installs R which we will use to extract the MinION data (using poRe), git and samtools.  Just reply “Y” to any questions.

We now need to install the poRe dependencies in R, which is very easy:

```
R
source("http://www.bioconductor.org/biocLite.R")
biocLite("rhdf5")
install.packages(c("shiny","bit64","data.table","svDialogs"))
q()
```

R may ask if you want to install into a local library; just say Y and accept the defaults.  We then need to download poRe from SourceForge – we are using version 0.16:

```
R CMD INSTALL poRe_0.16.tar.gz
```

The fastq extraction scripts for poRe are in github, so let’s go get those:

```
git clone https://github.com/mw55309/poRe_scripts.git
```

We will assemble using SPAdes, so let’s go get that:

```
gunzip < SPAdes-3.6.2-Linux.tar.gz | tar xvf -
```

Now, we are ready to go.  First off, let’s extract the 2D sequence data as FASTQ from the MinION data.  Nick’s SQK-MAP-006 data are in the old FAST5 format so we use the script in “old_format”:

```
# (extraction command not shown)
```

Finally we can use SPAdes to assemble the data.  SPAdes is a swiss-army knife of genome assembly tools, and by default includes read correction.  This takes up lots of RAM, so we are going to skip it.  We will also only use 3 kmers to save time:

```
SPAdes-3.6.2-Linux/bin/spades.py -o SPAdes_hybrid \
        --only-assembler \
        -t 4 -k 21,51,71 \
        -1 SRR2627175_1.fastq.gz \
        -2 SRR2627175_2.fastq.gz \
        --nanopore minion.pass.2D.fastq
```

Use samtools to extract the top contig:

```
samtools faidx SPAdes_hybrid/contigs.fasta NODE_1_length_4620446_cov_135.169_ID_22238 > single_contig.fa
```

Finally, a quick comparison to the reference:

```
sudo apt-get install mummer
curl -s "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=NC_000913.3&rettype=fasta&retmode=txt" > NC_000913.3.fa
nucmer NC_000913.3.fa single_contig.fa
mummerplot -png out.delta
display out.png &
```

And that is a single global alignment between reference and query 🙂

Simple huh?

Hopefully most of you will be aware of our software for handling MinION data, poRe, published in Bioinformatics this year:

```
Watson M, Thomson M, Risse J, Talbot R, Santoyo-Lopez J, Gharbi K, Blaxter M. (2015)
poRe: an R package for the visualization and analysis of nanopore sequencing data.
Bioinformatics 31(1):114-5.
```

poRe is based on R; many people consider R to be an interactive package, but it is very easy to write command-line scripts for R in very much the same way as you write Perl or Python scripts.  Below is a script that can be used in conjunction with R and poRe to extract 2D fastq data from a directory full of fast5 files:

```
#!/usr/bin/env Rscript

# get command line arguments as an array
args <- commandArgs(trailingOnly = TRUE)

# if we don't have any command line arguments
# print out usage
if (length(args) == 0) {
    stop("Please provide a directory containing fast5 files as the first (and only) argument")
}

# the first argument is the directory; stop if it doesn't exist
my.dir <- args[1]
if (! file.exists(my.dir)) {
    stop(paste(my.dir, "does not exist"))
}

# get a list of all of the files with a .fast5 file name extension
f5files <- dir(path = my.dir, pattern = "\\.fast5$", full.names = TRUE)

# print to STDERR how many files we are going to attempt
write(paste("Extracting 2D fastq from", length(f5files), "files"), stderr())

# load poRe quietly
suppressMessages(library(poRe))

# apply printfastq to every entry in f5files
# printfastq tests for 2D fastq and if found prints to STDOUT
invisible(apply(array(f5files), 1, printfastq))
```

The suppressMessages() function means that none of the start-up output is printed when the poRe library is loaded, and the invisible() function suppresses the natural output of apply(), which would otherwise return (and auto-print) a vector the same length as f5files – which we definitely do not want.
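The effect of invisible() is easiest to see with a toy example (hypothetical, not part of the poRe script). When run via Rscript, any visible top-level value is auto-printed; invisible() suppresses exactly that, while leaving side effects such as printing to STDOUT alone:

```
# a function with a side effect (cat) and a return value
double_it <- function(x) { cat("processing", x, "\n"); x * 2 }

# at the top level of an Rscript this prints the "processing"
# messages AND the returned vector c(2, 4, 6)
apply(array(1:3), 1, double_it)

# wrapped in invisible(), only the side effects are printed
invisible(apply(array(1:3), 1, double_it))
```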

The above is based on R 3.1.2 and poRe 0.9.

We could extract template and complement fastq (respectively) by substituting in the following lines:

```
# template
invisible(apply(array(f5files), 1, printfastq, f5path="/Analyses/Basecall_2D_000/BaseCalled_template"))

# complement
invisible(apply(array(f5files), 1, printfastq, f5path="/Analyses/Basecall_2D_000/BaseCalled_complement"))
```

It has been a long standing aim of mine to enable easy analysis of Oxford Nanopore MinION data, preferably by people who don’t have a ton of bioinformatics skills.  This was one of the reasons we started to develop poRe, which is really easy to install, runs on any platform and is fairly easy to use.

Below is a proof-of-concept method for analysing MinION data without the need for bioinformatics infrastructure – I am doing this on my Windows laptop, with only R and an internet connection.  We are going to use the EBI web-services to do the heavy lifting.  Please do not abuse this!  EBI probably don’t want you to submit 20,000 jobs.  As I said, this is “proof-of-concept”.  It means you’re not supposed to actually use it in practice 😉

Some polite things to do:

1. Use your own, real e-mail address when/if you run the code below
2. Please stick to a small number of jobs! (unless EBI tell you it’s OK)

We’re going to do everything in R.

Make a directory on your computer where we will do the work.  You will need approx. 15Gb of space (I know, the data are big!).   My directory is going to be “C:/Users/Mick/Documents/MinION”.  We’re going to use data from Nick Loman’s group, published here and here.

```
# go to the working directory
setwd("C:/Users/Mick/Documents/MinION")

# load packages
library(poRe)
library(RCurl)

# download the data (about 10Gb) and unpack
url <- "ftp://ftp.sra.ebi.ac.uk/vol1/ERA362/ERA362836/oxfordnanopore_native/Ecoli_R7_ONI_flowcell_17.tar.gz"
dest <- "Ecoli_R7_ONI_flowcell_17.tar.gz"
download.file(url, dest)
untar("Ecoli_R7_ONI_flowcell_17.tar.gz")
```

This will take some time, as there is about 10Gb of data there.  If I had a smaller dataset, I would use it.

Now, we use the poRe get_fast5_info() function to read statistics from the several thousand MinION reads, which again takes some time.  We then just select the reads that have a 2D sequence:

```
# read information from all .fast5 files in the current working directory

# get a list of fast5 files where the "len2d" is greater than zero
```

Now, we are going to use the EBI REST API for NCBI BLAST.  You can read about it at the link provided, but essentially it is about building and querying custom URLs to submit BLAST jobs and fetch the result.

We are only going to do this for the first ten fast5 sequences.  Please do not submit more, or the EBI may block you.  If you want to use this in anger, I suggest contacting the EBI directly and asking them politely.  The code blasts each sequence with default parameters against the EMBL bacterial database, saves the BLAST report to your current working directory, and prints out the top hit.

Here goes:

```
# iterate over the first ten 2D reads
# (f5files is the list of 2D-containing fast5 files from the previous step)
for (f5 in f5files[1:10]) {

    # use poRe function get_fasta() to get the 2D sequence
    fasta.2d <- get_fasta(f5)$'2D'

    # build a list of parameters for the EBI BLAST service
    # (remember: use your own, real e-mail address!)
    myl <- list(sequence=fasta.2d,
                stype="dna",
                program="blastn",
                email="your@email.address",
                database="em_rel_pro")

    # build the submission URL and submit the job
    runurl <- "http://www.ebi.ac.uk/Tools/services/rest/ncbiblast/run"
    jobid <- postForm(runurl, .params=myl)

    # query the status URL for the job until the status
    # is no longer "RUNNING"
    staturl <- "http://www.ebi.ac.uk/Tools/services/rest/ncbiblast/status/"
    surl <- paste(staturl, jobid, sep="")
    status <- "RUNNING"
    while(status == "RUNNING") {
        status <- getURL(surl)
    }

    # build the output URL and fetch the results
    outurl <- "http://www.ebi.ac.uk/Tools/services/rest/ncbiblast/result/"
    ourl <- paste(outurl, jobid, "/out", sep="")
    out <- getURL(ourl)

    # write the output to a file in the current working directory
    bloutfile <- gsub(".fast5",".blast",f5)
    write(out, bloutfile)

    # print the first few lines of the report (which include the top hit)
    print(head(strsplit(out, "\n")[[1]], 30))

}
```

Now, it’s not quick – it probably takes about 2-3 minutes to complete on my laptop.  It also isn’t very responsive (in that it looks a bit like it may not be doing anything!).  However, eventually the R terminal will show the top hits.  You can see from the output that the first few reads are a bit dodgy and don’t really hit anything, but pretty quickly the reads start hitting E coli K12, which is what they sequenced.

Also, in your working directory you will find a whole bunch of BLAST reports.  Note that, as “proof-of-concept” code, there is very little (i.e. nothing) in the way of error-checking above!  I mean, come on, it’s only a blog post!

Enjoy!