## Opiniomics

### bioinformatics, genomes, biology etc. "I don't mean to sound angry and cynical, but I am, so that's how it comes across"

#### Category: R (page 1 of 2)

As many of you will be aware, I like to post some R code, and I especially like to post base R versions of ggplot2 things!

Well these amazing boxplots turned up on github – go and check them out!

So I did my own version in base R – check out the code here and the result below.  Enjoy!

Reading data from Google Sheets into R is something you imagine should be really simple, but it is often anything but. However, the googlesheets package goes a long way to solving this problem.

Let’s crack on with an example.

First, install the software:

```
install.packages("googlesheets")
```

We then need an example sheet to work with, and I’m going to use one from Britain Elects:

So go here and add this to your own collection (yes you’ll need a google account!) using the little “Add to my drive” icon to the right of the title, just next to the star.

Right, let’s play around a bit!

```
# load package
library(googlesheets)

# may ask you to authenticate in a browser!
gs_ls()

# get the Britain Elects google sheet
be <- gs_title("Britain Elects / Public Opinion")
```

This should show you all of the sheets you have access to, and also select the one we want.

Now we can see which worksheets exist within the sheet:

```
# list worksheets
gs_ws_ls(be)
```

```
# get Westminster voting
west <- gs_read(ss=be, ws = "Westminster voting intentions", skip=1)

# convert to data.frame
wdf <- as.data.frame(west)
```

And hey presto, we have the data. Simples!

Now let's plot it:

```
# reverse so that data are forward-sorted by time
wdf <- wdf[nrow(wdf):1,]

# treat all dates as a single point on the x-axis
dates <- 1:nrow(wdf)

# lowess smoothing parameter
fs <- 0.04

# plot Conservative
plot(lowess(dates, wdf$Con, f=fs), type="l", col="blue", lwd=5,
     ylim=c(0,65), xaxt="n", xlab="", ylab="%",
     main="Polls: Westminster voting intention")

axis(side=1, at=dates[seq(1, nrow(wdf), by=40)],
     labels=paste(wdf$"Fieldwork end date", wdf$Y)[seq(1, nrow(wdf), by=40)],
     las=2, cex.axis=0.8)

# plot Labour and LibDem
lines(lowess(dates, wdf$Lab, f=fs), col="red", lwd=5)
lines(lowess(dates, wdf$LDem, f=fs), col="orange", lwd=5)

# add UKIP, treating absent values as -50 for plotting purposes
ukip <- wdf$UKIP
ukip[is.na(ukip)] <- -50
lines(lowess(dates, ukip, f=fs), col="purple", lwd=5)

legend(1, 65, legend=c("Con","Lab","LibDem","UKIP"),
       fill=c("blue","red","orange","purple"))
```

You have a matrix in R, and you want to visualise it – say, for example, with each cell coloured according to the value in the cell.  Not a heatmap, per se, as that requires clustering; just a simple, visual image.

Well, the answer is image() – however, there is a slightly bizarre coding choice buried in the help:

Notice that `image` interprets the `z` matrix as a table of `f(x[i], y[j])` values, so that the x axis corresponds to row number and the y axis to column number, with column 1 at the bottom, i.e. a 90 degree counter-clockwise rotation of the conventional printed layout of a matrix.

Let’s look at that:

```
mat <- matrix(c(1,1,1,0,0,0,1,0,1), nrow=3, byrow=TRUE)
mat
#      [,1] [,2] [,3]
# [1,]    1    1    1
# [2,]    0    0    0
# [3,]    1    0    1

image(mat)
```

This produces an image in which the 1,0,1 row is now the right-hand column, i.e. the matrix has indeed been rotated 90 degrees anti-clockwise.

To counteract this we need to rotate the input matrix clockwise before passing to image:

```
rotate <- function(x) t(apply(x, 2, rev))

image(rotate(mat))
```

This now produces what we might expect. Easy 🙂
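As a quick sanity check, here is a minimal sketch confirming that `rotate` really is a clockwise rotation – the bottom row of the input becomes the left-hand column of the output, and four rotations restore the original matrix:

```r
# clockwise rotation: transpose of the column-reversed matrix
rotate <- function(x) t(apply(x, 2, rev))

mat <- matrix(c(1,1,1,0,0,0,1,0,1), nrow=3, byrow=TRUE)

# the bottom row of mat becomes the first (left-hand) column after rotation
all(rotate(mat)[,1] == mat[3,])
# TRUE

# four 90-degree rotations restore the original matrix
identical(rotate(rotate(rotate(rotate(mat)))), mat)
# TRUE
```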

I came across a really cool tweet a few days ago containing an animated GIF demonstrating rise in global temperatures since 1880:

Link to the blog is here.

I wanted to see if I could recreate this GIF in R, and below are some instructions on how to get close.

The first step is to try and see if I can replicate the original, final image.  The code below comes close enough for this demo, but strangely I am approx. 1 degree C out.  I’ve left a comment on the original blog to ask why – it’s not clear how the exact figures are calculated.

Anyway, here is how to get close:

```
# read the monthly temperature data (the file path was elided in the
# original post)
d <- read.table("...", skip=7, header=TRUE, nrows=143,
                colClasses = "character")

# clean data
d <- d[-grep("Year", d$Year),]
rownames(d) <- d[,1]
d <- d[,-1]
d <- d[,-13:-19]
d[d=="****"] <- NA

# convert to numeric and celsius
for (i in 1:12) {
    d[,i] <- as.numeric(d[,i]) / 100
}

# read the seasonal anomaly data (path again elided)
s <- read.table("...", skip=3, header=TRUE, colClasses = "character")
sa <- as.numeric(s$seas_anom)

# create colours from blue through to red
colours <- colorRampPalette(c("grey","blue","red"))(nrow(d))

# create initial plot
mmin <- -3
mmax <- 3
par(mar=c(2,2,2,0))
plot(1:12, as.numeric(d[1,])+sa, type="l", ylim=c(mmin,mmax),
     yaxt="n", xaxt="n", bty="n")

axis(side=1, at=1:12,
     labels=c("Jan","Feb","Mar",
              "Apr","May","Jun",
              "Jul","Aug","Sep",
              "Oct","Nov","Dec"),
     lwd=0, lwd.ticks=1, cex.axis=0.8)

axis(side=2, at=-3:3, labels=-3:3,
     lwd=0, lwd.ticks=1, las=2, cex.axis=0.8)

title(main=expression(paste("Temperature ","Anomaly ","(",degree,"C)")), adj=0)
title(main="(Difference from 1980-2015 annual mean)", adj=0, cex.main=0.6,
      line=0, font.main=1)

# add horizontal lines
for (j in -3:3) {
    lines(1:12, rep(j, 12), lty=3)
}

# add yearly temperature lines
for (j in 1:nrow(d)) {
    lines(1:12, as.numeric(d[j,])+sa, col=colours[j])
}
```

This produces the final image – close enough!

So how do we create the animation?  The old skool way of doing this is to create every image in R and create an animated GIF with ImageMagick.  So we'll do that 🙂

The slightly adjusted code to create every image is here (creates 137 PNGs in working directory!):

```
#
# Assumes you have run the code above to create d and sa
#

doplot <- function(i) {

    x   <- as.numeric(d[i,]) + sa
    col <- colours[i]

    mmin <- -3
    mmax <- 3
    par(mar=c(2.5,2.5,2,0))
    plot(1:12, x, type="l", ylim=c(mmin,mmax),
         yaxt="n", xaxt="n", bty="n", col=col)
    axis(side=1, at=1:12, labels=c("Jan","Feb","Mar",
                                   "Apr","May","Jun",
                                   "Jul","Aug","Sep",
                                   "Oct","Nov","Dec"),
         lwd=0, lwd.ticks=1, cex.axis=1.5)

    axis(side=2, at=-3:3, labels=-3:3, lwd=0,
         lwd.ticks=1, las=2, cex.axis=1.5)

    title(main=expression(paste("Temperature ","Anomaly ","(",degree,"C)")), adj=0)
    title(main="(Difference from 1980-2015 annual mean)",
          adj=0, cex.main=1, line=-0.5, font.main=1)

    # add horizontal lines
    for (j in -3:3) {
        lines(1:12, rep(j, 12), lty=3)
    }

    # add other years
    for (k in 1:i) {
        lines(1:12, as.numeric(d[k,])+sa, col=colours[k])
    }

    # add year label
    text(7, 0.8, rownames(d)[i], col=colours[i], font=2, cex=2)

}

# create plots/PNGs
for (j in 1:nrow(d)) {
    name <- paste("plot", sprintf("%03d", j), ".png", sep="")
    png(name, width=800, height=600)
    doplot(j)
    dev.off()
}
```

Then in ImageMagick, we use convert to create the animated GIF:

```
convert -delay 3 -loop 1 *.png globaltemp.gif
convert globaltemp.gif \( +clone -set delay 500 \) +swap +delete globaltempwpause.gif
```

Final result here. Done 🙂

I have to admit to being a bit of a snob when it comes to graphs and charts in scientific papers and presentations.  It’s not like I think I am particularly good at it – I’m OK – it’s just that I know what’s bad.  I’ve seen folk screenshot multiple Excel graphs so they can paste them into a powerpoint table to create multi-panel plots… and it kind of makes me want to scream.   I’m sorry, I really am, but when I see Excel plots in papers I judge the authors, and I don’t mean in a good way.  I can’t help it.  Plotting good graphs is an art, and sticking with the metaphor, Excel is paint-by-numbers and R is a blank canvas, waiting for something beautiful to be created; Excel is limiting, whereas R sets you free.

Readers of this blog will know that I like to take plots that I find which are fabulous and recreate them.  Well let’s do that again 🙂

I saw this tweet by Trevor Branch and found it intriguing:

It shows two plots of the same data: an Excel chart, and a multi-panel plot.  You're clearly supposed to think the latter is better, and I do; however, perhaps disappointingly, the top graph would be easy to plot in Excel, but I'm guessing most people would find it impossible to create the bottom one (in Excel or otherwise).

Well, I’m going to show you how to create both in R.  All the code is now on GitHub!

The Excel Graph

Now, I’ve shown you how to create Excel-like graphs in R before, and we’ll use some of the same tricks again.

First we set up the data:

```
# set up the data
df <- data.frame(Circulatory=c(32,26,19,16,14,13,11,11),
                 Mental=c(11,11,18,24,23,24,26,23),
                 Musculoskeletal=c(17,18,13,16,12,18,20,26),
                 Cancer=c(10,15,15,14,16,16,14,14))

rownames(df) <- seq(1975,2010,by=5)

df
```

Now let's plot the graph:

```
# set up colours and points
cols <- c("darkolivegreen3","darkcyan","mediumpurple2","coral3")
pch <- c(17,18,8,15)

# we have one point on the x-axis for each row of df (nrow(df))
# we then add 2.5 to make room for the legend
xmax <- nrow(df) + 2.5

# make the borders smaller
par(mar=c(3,3,0,0))

# plot an empty graph
plot(1:nrow(df), 1:nrow(df), pch="",
     xlab=NA, ylab=NA, xaxt="n", yaxt="n",
     ylim=c(0,35), bty="n", xlim=c(1,xmax))

# add horizontal lines
for (i in seq(0,35,by=5)) {
    lines(1:nrow(df), rep(i,nrow(df)), col="grey")
}

# add points and lines for each dataset
for (i in 1:ncol(df)) {
    points(1:nrow(df), df[,i], pch=pch[i],
           col=cols[i], cex=1.5)
    lines(1:nrow(df), df[,i], col=cols[i],
          lwd=4)
}

# add bottom axes
axis(side=1, at=1:nrow(df), tick=FALSE,
     labels=rownames(df))

axis(side=1, at=seq(-0.5,8.5,by=1),
     tick=TRUE, labels=NA)

# add left axis
axis(side=2, at=seq(0,35,by=5), tick=TRUE,
     las=TRUE, labels=paste(seq(0,35,by=5),"%",sep=""))

legend(8.5, 25, legend=colnames(df), pch=pch,
       col=cols, cex=1.5, bty="n", lwd=3, lty=1)
```

And here is the result.  Not bad, eh?  Actually, yes, very bad; but also very Excel!

The multi-plot

Plotting multi-panel figures in R is sooooooo easy!  Here we go for the alternate multi-plot.  We use the same data.

```
# split into 2 rows and 2 cols
split.screen(c(2,2))

# keep track of which screen we are plotting to
scr <- 1

# iterate over columns
for (i in 1:ncol(df)) {

    # select screen
    screen(scr)

    # reduce margins
    par(mar=c(3,2,1,1))

    # empty plot
    plot(1:nrow(df), 1:nrow(df), pch="", xlab=NA,
         ylab=NA, xaxt="n", yaxt="n", ylim=c(0,35),
         bty="n")

    # plot all data in grey
    for (j in 1:ncol(df)) {
        lines(1:nrow(df), df[,j],
              col="grey", lwd=3)
    }

    # plot the selected column in blue
    lines(1:nrow(df), df[,i], col="blue4", lwd=4)

    points(c(1,nrow(df)), c(df[1,i], df[nrow(df),i]),
           pch=16, cex=2, col="blue4")

    mtext(df[1,i], side=2, at=df[1,i], las=2)
    mtext(df[nrow(df),i], side=4, at=df[nrow(df),i],
          las=2)

    title(colnames(df)[i])

    # add axes if we are one of the bottom two plots
    if (scr >= 3) {
        axis(side=1, at=1:nrow(df), tick=FALSE,
             labels=rownames(df))
    }

    # next screen
    scr <- scr + 1
}

# close multi-panel image
close.screen(all=TRUE)
```

And here is the result – and there we have it.

So which do you prefer?

Very quick one this!  Really interesting paper from Ioannidis et al about citation indices.

I wanted to recreate figure 1.  The closest I could get (code here) differs most in NS, where they find all negative correlations but most of mine are positive.

Source data are Table S1 Data.

Am I doing something wrong?  Or is the paper wrong?

UPDATE 9th July 2016

Using Spearman gets us closer, but it’s still not quite correct (updated code too).

Due to the unspeakable horror of the EU referendum, I have to find something to make me feel better.  This poll of polls usually does so, though it is way too close for comfort.

Anyway, I took their data and plotted it for myself.  Data and script are on github, and all you need is R.

Enjoy! #voteRemain

Use of the Pearson correlation coefficient is common in genomics and bioinformatics, which is OK as far as it goes (I have used it extensively myself), but it has some major drawbacks – the main one being that Pearson can produce large coefficients in the presence of a few very large measurements.

This is best shown via example in R:

```
# let's correlate some random data
g1 <- rnorm(50)
g2 <- rnorm(50)

cor(g1, g2)
#  -0.1486646
```

So we get a small negative correlation from correlating two sets of 50 random values. If we ran this 1000 times, we would get a distribution around zero, as expected.
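That claim is easy to check – a quick, seeded sketch repeating the exercise 1000 times and looking at the spread of coefficients:

```r
set.seed(42)

# correlate 1000 pairs of independent random vectors of length 50
rs <- replicate(1000, cor(rnorm(50), rnorm(50)))

# the coefficients centre on zero and rarely stray far from it
mean(rs)
range(rs)
```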

Let's add in a single, large value:

```
# let's correlate the same random data with the addition of a single, large value
g1 <- c(g1, 10)
g2 <- c(g2, 11)

cor(g1, g2)
#  0.6040776
```

Holy smokes, all of a sudden my random datasets are positively correlated with r>=0.6!

It's also significant.

```
> cor.test(g1, g2, method="pearson")

        Pearson's product-moment correlation

data:  g1 and g2
t = 5.3061, df = 49, p-value = 2.687e-06
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.3941015 0.7541199
sample estimates:
      cor
0.6040776
```

So if you have used Pearson in large datasets, you will almost certainly have some of these spurious correlations in your data.
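To put a rough number on "almost certainly", here is a seeded simulation estimating how often a single large shared measurement pushes two otherwise-unrelated vectors past |r| > 0.5:

```r
set.seed(1)

# 1000 random pairs, each with one large shared measurement added
spurious <- replicate(1000, {
    x <- c(rnorm(50), 10)
    y <- c(rnorm(50), 11)
    abs(cor(x, y)) > 0.5
})

# fraction of pairs showing a 'strong' (spurious) correlation
mean(spurious)
```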

How can you solve this? By using Spearman, of course:

```
> cor(g1, g2, method="spearman")
[1] -0.0961086
> cor.test(g1, g2, method="spearman")

        Spearman's rank correlation rho

data:  g1 and g2
S = 24224, p-value = 0.5012
alternative hypothesis: true rho is not equal to 0
sample estimates:
       rho
-0.0961086
```

If you have MiSeq and ONT MinION data, it is now very easy to combine them into a single-contig, high-quality assembly.  Here I will start with a base Ubuntu 14.04 LTS distribution and take you through every step, including software installation.  I am starting with an Amazon EC2 m4.xlarge AMI with 4 vCPUs and 16GB RAM.  If you want to repeat the work below, you don’t need to use EC2; any computer running Ubuntu with enough disk/RAM should work.

For MinION data, we will use Nick’s SQK-MAP-006 data at the EBI.  Using a simple trick, and because I know the internal structure of the tar archive, we can download only the base-called data:

```
curl ftp://ftp.sra.ebi.ac.uk/vol1/ERA540/ERA540530/oxfordnanopore_native/MAP006-1.tar | tar -xf - MAP006-1/MAP006-1_downloads &
```

We can also get some MiSeq data from the ENA for the same strain, E. coli K-12 MG1655:

```
wget -q ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR262/005/SRR2627175/SRR2627175_2.fastq.gz &
wget -q ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR262/005/SRR2627175/SRR2627175_1.fastq.gz &
```

Now, while that is downloading, we need to install some software on the server:

```
sudo apt-get update
sudo apt-get install r-base-core git samtools
```

This installs R (which we will use to extract the MinION data via poRe), git, and samtools.  Just reply “Y” to any questions.

We now need to install the poRe dependencies in R, which is very easy:

```
R
source("http://www.bioconductor.org/biocLite.R")
biocLite("rhdf5")
install.packages(c("shiny","bit64","data.table","svDialogs"))
q()
```

R may ask if you want to install into a local library; just say Y and accept the defaults.  We also need to download poRe from sourceforge – we are using version 0.16.

Once downloaded, and back at the Linux command line:

```
R CMD INSTALL poRe_0.16.tar.gz
```

The fastq extraction scripts for poRe are in github, so let’s go get those:

```
git clone https://github.com/mw55309/poRe_scripts.git
```

We will assemble using SPAdes, so let’s go get that:

```
gunzip < SPAdes-3.6.2-Linux.tar.gz | tar xvf -
```

Now, we are ready to go.  First off, let’s extract the 2D sequence data as FASTQ from the MinION data.  Nick’s SQK-MAP-006 data are in the old FAST5 format so we use the script in “old_format”:

```
# (command elided in the original post: run the 2D extraction script from
#  poRe_scripts/old_format over MAP006-1/MAP006-1_downloads, writing
#  minion.pass.2D.fastq)
```

Finally we can use SPAdes to assemble the data.  SPAdes is a swiss-army knife of genome assembly tools, and by default includes read correction.  This takes up lots of RAM, so we are going to skip it.  We will also only use 3 kmers to save time:

```
SPAdes-3.6.2-Linux/bin/spades.py -o SPAdes_hybrid --only-assembler \
        -t 4 -k 21,51,71 \
        -1 SRR2627175_1.fastq.gz \
        -2 SRR2627175_2.fastq.gz \
        --nanopore minion.pass.2D.fastq
```

Use samtools to extract the top contig:

```
samtools faidx SPAdes_hybrid/contigs.fasta NODE_1_length_4620446_cov_135.169_ID_22238 > single_contig.fa
```

Finally, a quick comparison to the reference:

```
sudo apt-get install mummer
curl -s "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=NC_000913.3&rettype=fasta&retmode=txt" > NC_000913.3.fa
nucmer NC_000913.3.fa single_contig.fa
mummerplot -png out.delta
display out.png &
```

And that is a single global alignment between reference and query 🙂

Simple huh?

A little while ago I demonstrated how to extract FASTQ from MinION FAST5 files using Rscript and the Linux command line.

In that article, I described how to extract the different FASTQ data using a parameter:

```
# 2D is the default
invisible(apply(array(f5files), 1, printfastq))
# template
invisible(apply(array(f5files), 1, printfastq, f5path="/Analyses/Basecall_2D_000/BaseCalled_template"))
# complement
invisible(apply(array(f5files), 1, printfastq, f5path="/Analyses/Basecall_2D_000/BaseCalled_complement"))
```

The f5path parameter is the address of the FASTQ data within the FAST5 file, and that has recently been changed by Nanopore.  Well, we can simply edit our scripts to cope with the new format:

```
# 2D is the default
invisible(apply(array(f5files), 1, printfastq))
# template
invisible(apply(array(f5files), 1, printfastq, f5path="/Analyses/Basecall_1D_000/BaseCalled_template"))
# complement
invisible(apply(array(f5files), 1, printfastq, f5path="/Analyses/Basecall_1D_000/BaseCalled_complement"))
```

And that’s it! No new version of poRe required – simply a couple of edits to the scripts.  Example scripts can be found on github.
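The change boils down to one group name in the path.  A small helper (hypothetical – not part of poRe) makes the difference between the old and new layouts explicit:

```r
# hypothetical helper, not part of poRe: build the FAST5 group path for a
# given read type under the old (Basecall_2D_000) or new (Basecall_1D_000)
# file layout
f5path_for <- function(read_type = c("template", "complement"),
                       new_format = TRUE) {
    read_type <- match.arg(read_type)
    group <- if (new_format) "Basecall_1D_000" else "Basecall_2D_000"
    sprintf("/Analyses/%s/BaseCalled_%s", group, read_type)
}

f5path_for("template", new_format = FALSE)
# "/Analyses/Basecall_2D_000/BaseCalled_template"
```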