Opiniomics

bioinformatics, genomes, biology etc. "I don't mean to sound angry and cynical, but I am, so that's how it comes across"

Tag: R

Beautiful boxplots in base R

As many of you will be aware, I like to post some R code, and I especially like to post base R versions of ggplot2 things!

Well, these amazing boxplots turned up on GitHub – go and check them out!

So I did my own version in base R – check out the code here and the result below.  Enjoy!

[Image: probability]
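The code itself lives in the linked repo, but here is a minimal base R sketch of the general idea – boxes with the raw, jittered data drawn on top. This is an illustration, not the code from the linked repo:

# a minimal sketch: boxplots with jittered raw points on top
set.seed(42)
d <- data.frame(group = rep(c("A", "B", "C"), each = 50),
                value = c(rnorm(50, 5), rnorm(50, 6), rnorm(50, 4.5)))

# draw the boxes, hiding outliers (we plot all the points anyway)
boxplot(value ~ group, data = d, outline = FALSE, frame.plot = FALSE,
        col = "grey90", border = "grey40")

# overlay the raw data as jittered, semi-transparent points
points(jitter(as.numeric(factor(d$group)), amount = 0.15), d$value,
       pch = 16, col = rgb(0, 0, 1, 0.3))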

Reading data from google sheets into R

Reading data from Google Sheets into R is something you imagine should be really simple, but it is often anything but. However, the googlesheets package goes a long way to solving this problem.

Let’s crack on with an example.

First, install the software:

install.packages("googlesheets")

We then need an example sheet to work with, and I’m going to use one from Britain Elects:

So go here and add this to your own collection (yes, you’ll need a Google account!) using the little “Add to my drive” icon to the right of the title, just next to the star.

Right, let’s play around a bit!

# load package
library(googlesheets)

# which google sheets do you have access to?
# may ask you to authenticate in a browser!
gs_ls()

# get the Britain Elects google sheet
be <- gs_title("Britain Elects / Public Opinion")

This should show you all of the sheets you have access to, and also select the one we want.

Now we can see which worksheets exist within the sheet:

# list worksheets
gs_ws_ls(be)

We can "download" one of the sheets using gs_read()

# get Westminster voting
west <- gs_read(ss=be, ws = "Westminster voting intentions", skip=1)

# convert to data.frame
wdf <- as.data.frame(west)

And hey presto, we have the data. Simples!
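Before plotting, it's worth a quick sanity check that the download worked (the column names here are the ones used in the plotting code below):

# dimensions, plus a peek at the columns we are about to plot
dim(wdf)
head(wdf[, c("Fieldwork end date", "Con", "Lab", "LDem", "UKIP")])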

Now let's plot it:

# reverse so that data are forward-sorted by time
# (2769 was the number of rows in the sheet at the time of writing;
# nrow(wdf) is the safer choice if the sheet has grown since)
wdf <- wdf[2769:1,]

# treat all dates as a single point on x-axis
dates <- 1:2769

# smooth parameter
fs <- 0.04

# plot conservative
plot(lowess(dates, wdf$Con, f=fs), type="l", col="blue", lwd=5, ylim=c(0,65), xaxt="n", xlab="", ylab="%", main="Polls: Westminster voting intention")

# add labels
axis(side=1, at=dates[seq(1, 2769, by=40)], labels=paste(wdf$"Fieldwork end date", wdf$Y)[seq(1, 2769, by=40)], las=2, cex.axis=0.8)

# plot labour and libdem
lines(lowess(dates, wdf$Lab, f=fs), col="red", lwd=5)
lines(lowess(dates, wdf$LDem, f=fs), col="orange", lwd=5)

# add UKIP; lowess() doesn't cope with NAs, so we set absent values
# to -50, which pushes the smoothed line off the bottom of the visible
# plot (ylim starts at 0) for the years before UKIP was polled
ukip <- wdf$UKIP
ukip[is.na(ukip)] <- -50
lines(lowess(dates, ukip, f=fs), col="purple", lwd=5)

# add legend
legend(1, 65, legend=c("Con","Lab","LibDem","UKIP"), fill=c("blue","red","orange","purple"))


[Image: westminster]

Creating an image of a matrix in R using image()

You have a matrix in R, and you want to visualise it – say, for example, with each cell coloured according to the value in the cell.  Not a heatmap, per se, as that requires clustering; just a simple, visual image.

Well, the answer is image() – however, there is a slightly bizarre coding choice buried in the help:

Notice that image interprets the z matrix as a table of f(x[i], y[j]) values, so that the x axis corresponds to row number and the y axis to column number, with column 1 at the bottom, i.e. a 90 degree counter-clockwise rotation of the conventional printed layout of a matrix.

Let’s look at that:



mat <- matrix(c(1,1,1,0,0,0,1,0,1), nrow=3, byrow=TRUE)
mat
     [,1] [,2] [,3]
[1,]    1    1    1
[2,]    0    0    0
[3,]    1    0    1

image(mat)

This produces:

[Image: mat1]

We can clearly see that the 1,0,1 row is now the right-hand column, i.e. the matrix has in fact been rotated 90 degrees anti-clockwise.
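If you want to convince yourself of the mapping, overlay each cell's value at the position where image() draws it:

# label each cell with its value; as.vector() walks mat
# column-by-column, matching x varying fastest in image()
xy <- seq(0, 1, length.out = 3)
image(mat)
text(rep(xy, 3), rep(xy, each = 3), labels = as.vector(mat))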

To counteract this, we need to rotate the input matrix clockwise before passing it to image():


rotate <- function(x) t(apply(x, 2, rev))

image(rotate(mat))

This now produces what we might expect:

[Image: mat2]
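As a quick check, this is what rotate(mat) looks like before image() gets hold of it – the clockwise rotation sends the original top row to the right-hand column, and image()'s anti-clockwise turn then puts everything back:

rotate(mat)
#      [,1] [,2] [,3]
# [1,]    1    0    1
# [2,]    0    0    1
# [3,]    1    0    1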

Easy 🙂

How to produce an animated gif of rise in global temperature in R

I came across a really cool tweet a few days ago containing an animated GIF demonstrating rise in global temperatures since 1880:

Link to the blog is here.

I wanted to see if I could recreate this GIF in R, and below are some instructions on how to get close.

The first step is to try and see if I can replicate the original, final image.  The code below comes close enough for this demo, but strangely I am approx. 1 degree C out.  I’ve left a comment on the original blog to ask why – it’s not clear how the exact figures are calculated.

Anyway, here is how to get close:


# download data
d <- read.table("http://data.giss.nasa.gov/gistemp/tabledata_v3/GLB.Ts.txt", 
                skip=7, header=TRUE, nrows=143, colClasses = "character")

# clean data
d <- d[-grep("Year", d$Year),]
rownames(d) <- d[,1]
d <- d[,-1]
d <- d[,-13:-19]
d[d=="****"] <- NA

# convert to numeric and celsius
for (i in 1:12) {
	d[,i] <- as.numeric(d[,i]) / 100
}	

# download seasonal adjustments
s <- read.table("http://data.giss.nasa.gov/gistemp/faq/merra2_seas_anom.txt", 
                skip=3, header=TRUE, colClasses = "character")
sa <- as.numeric(s$seas_anom)

# create colours from blue through to red
colours <- colorRampPalette(c("grey","blue","red"))(nrow(d))

# create initial plot
mmin <- -3
mmax <- 3
par(mar=c(2,2,2,0))
plot(1:12, d[1,]+sa, type="l", ylim=c(mmin,mmax), 
     yaxt="n", xaxt="n", bty="n")

# add axes
axis(side=1, at=1:12, 
     labels=c("Jan","Feb","Mar",
              "Apr","May","Jun",
              "Jul","Aug","Sep",
              "Oct","Nov","Dec"), 
     lwd=0, lwd.ticks=1, cex.axis=0.8)

axis(side=2, at=-3:3, labels=-3:3, 
     lwd=0, lwd.ticks=1, las=2, cex.axis=0.8)

# add title
title(main=expression(paste("Temperature ","Anomaly ","(",degree,"C)")), 
      adj=0, cex.main=0.8)
title(main="(Difference from 1980-2015 annual mean)", adj=0, cex.main=0.6, 
      line=0, font.main=1)

# add horizontal lines
for (j in -3:3) {
	lines(1:12, rep(j, 12), lty=3)
}

# add yearly temperature lines
for (j in 1:nrow(d)) {
	lines(1:12, as.numeric(d[j,])+sa, col=colours[j])
}

This produces the following image:

[Image: final]

This is close enough!

So how do we create the animation?  The old skool way of doing this is to create every image in R and create an animated GIF with ImageMagick.  So we'll do that 🙂

The slightly adjusted code to create every image is here (creates 137 PNGs in working directory!):


#
# Assumes you have run the code above to create d and sa
#

doplot <- function(i) {

	x   <- d[i,] + sa
	col <- colours[i]

	mmin <- -3
	mmax <- 3
	par(mar=c(2.5,2.5,2,0))
	plot(1:12, x, type="l", ylim=c(mmin,mmax), 
             yaxt="n", xaxt="n", bty="n", col=col)
	axis(side=1, at=1:12, labels=c("Jan","Feb","Mar",
                                       "Apr","May","Jun",
                                       "Jul","Aug","Sep",
                                       "Oct","Nov","Dec"), 
             lwd=0, lwd.ticks=1, cex.axis=1.5)

	axis(side=2, at=-3:3, labels=-3:3, lwd=0, 
             lwd.ticks=1, las=2, cex.axis=1.5)

	title(main=expression(paste("Temperature ","Anomaly ","(",degree,"C)")), 
              adj=0, cex.main=1.5)
	title(main="(Difference from 1980-2015 annual mean)", 
              adj=0, cex.main=1, line=-0.5, font.main=1)

	# add horizontal lines
	for (j in -3:3) {
		lines(1:12, rep(j, 12), lty=3)
	}

	# add other years
	for (k in 1:i) {
		lines(1:12, d[k,]+sa, col=colours[k])
	}

	# add year label
	text(7, 0.8, rownames(d)[i], col=colours[i], font=2, cex=2)

}

# create plots/PNGs
for (j in 1:nrow(d)) {
	
	name <- sprintf("plot%03d.png", j)
	png(name, width=800, height=600)
	doplot(j)
	dev.off()
}

Then in ImageMagick, we use convert to create the animated GIF:


# options must come before the input images they apply to;
# -delay 3 = 3/100ths of a second per frame
convert -delay 3 -loop 1 *.png globaltemp.gif
# clone the last frame, give it a 5-second delay, and swap it back in
# so that the GIF pauses before looping
convert globaltemp.gif \( +clone -set delay 500 \) +swap +delete globaltempwpause.gif
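If you'd rather stay inside R for this step, the magick package (R bindings to ImageMagick) can assemble the GIF too – a sketch, assuming you have the package installed:

# read the PNGs created above and animate them from within R
library(magick)
pngs   <- sort(list.files(pattern = "^plot\\d+\\.png$"))
frames <- image_read(pngs)
anim   <- image_animate(frames, fps = 25)
image_write(anim, "globaltemp.gif")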

Final result here:

[Image: globaltempwpause]

Done 🙂

Plotting cool graphs in R

I have to admit to being a bit of a snob when it comes to graphs and charts in scientific papers and presentations.  It’s not like I think I am particularly good at it – I’m OK – it’s just that I know what’s bad.  I’ve seen folk screenshot multiple Excel graphs so they can paste them into a PowerPoint table to create multi-panel plots… and it kind of makes me want to scream.  I’m sorry, I really am, but when I see Excel plots in papers I judge the authors, and I don’t mean in a good way.  I can’t help it.  Plotting good graphs is an art, and sticking with the metaphor, Excel is paint-by-numbers while R is a blank canvas, waiting for something beautiful to be created; Excel is limiting, whereas R sets you free.

Readers of this blog will know that I like to take fabulous plots that I find and recreate them.  Well, let’s do that again 🙂

I saw this Tweet by Trevor Branch on Twitter and found it intriguing:

It shows two plots of the same data.  The Excel plot:

[Image: excel]

And the multi plot:

[Image: multi]

You’re clearly supposed to think the latter is better, and I do; however, perhaps disappointingly, the top graph would be easy to plot in Excel, but I’m guessing most people would find it impossible to create the bottom one (in Excel or otherwise).

Well, I’m going to show you how to create both, in R. All the code is now on GitHub!

The Excel Graph

Now, I’ve shown you how to create Excel-like graphs in R before, and we’ll use some of the same tricks again.

First we set up the data:


# set up the data
df <- data.frame(Circulatory=c(32,26,19,16,14,13,11,11),
		 Mental=c(11,11,18,24,23,24,26,23),
		 Musculoskeletal=c(17,18,13,16,12,18,20,26),
		 Cancer=c(10,15,15,14,16,16,14,14))

rownames(df) <- seq(1975,2010,by=5)

df

Now let's plot the graph:


# set up colours and points
cols <- c("darkolivegreen3","darkcyan","mediumpurple2","coral3")
pch <- c(17,18,8,15)

# we have one point on X axis for each row of df (nrow(df))
# we then add 2.5 to make room for the legend
xmax <- nrow(df) + 2.5

# make the borders smaller
par(mar=c(3,3,0,0))

# plot an empty graph
plot(1:nrow(df), 1:nrow(df), pch="", 
		xlab=NA, ylab=NA, xaxt="n", yaxt="n", 
		ylim=c(0,35), bty="n", xlim=c(1,xmax))

# add horizontal lines
for (i in seq(0,35,by=5)) {
	lines(1:nrow(df), rep(i,nrow(df)), col="grey")
}

# add points and lines 
# for each dataset
for (i in 1:ncol(df)) {

	points(1:nrow(df), df[,i], pch=pch[i], 
		col=cols[i], cex=1.5)

	lines(1:nrow(df), df[,i], col=cols[i], 
		lwd=4)


}

# add bottom axes
axis(side=1, at=1:nrow(df), tick=FALSE, 
		labels=rownames(df))

axis(side=1, at=seq(-0.5,8.5,by=1), 
		tick=TRUE, labels=NA)

# add left axis
axis(side=2, at=seq(0,35,by=5), tick=TRUE, 
		las=TRUE, labels=paste(seq(0,35,by=5),"%",sep=""))

# add legend
legend(8.5,25,legend=colnames(df), pch=pch, 
		col=cols, cex=1.5, bty="n",  lwd=3, lty=1)

And here is the result:

[Image: excel_plot]

Not bad eh?  Actually, yes, very bad; but also very Excel!

The multi-plot

Plotting multi-panel figures in R is sooooooo easy!  Here we go with the alternative multi-plot, using the same data.


# split into 2 rows and 2 cols
split.screen(c(2,2))

# keep track of which screen we are
# plotting to
scr <- 1

# iterate over columns
for (i in 1:ncol(df)) {

	# select screen
	screen(scr)

	# reduce margins
	par(mar=c(3,2,1,1))

	# empty plot
	plot(1:nrow(df), 1:nrow(df), pch="", xlab=NA, 
		ylab=NA, xaxt="n", yaxt="n", ylim=c(0,35), 
		bty="n")

	# plot all data in grey
	for (j in 1:ncol(df)) {
		lines(1:nrow(df), df[,j], 
		col="grey", lwd=3)

	}	

	# plot selected in blue
	lines(1:nrow(df), df[,i], col="blue4", lwd=4)

	# add blobs
	points(c(1,nrow(df)), c(df[1,i], df[nrow(df),i]), 
		pch=16, cex=2, col="blue4")

	# add numbers
	mtext(df[1,i], side=2, at=df[1,i], las=2)
	mtext(df[nrow(df),i], side=4, at=df[nrow(df),i], 
		las=2)	

	# add title
	title(colnames(df)[i])

	# add axes if we are one of
	# the bottom two plots
	if (scr >= 3) {
		axis(side=1, at=1:nrow(df), tick=FALSE, 
			labels=rownames(df))
	}

	# next screen
	scr <- scr + 1
}

# close multi-panel image
close.screen(all=TRUE)

And here is the result:

[Image: multi_plot]

And there we have it.
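As an aside, for a regular grid like this, par(mfrow=c(2,2)) works just as well as split.screen() – here is a minimal sketch of the same layout:

# the same 2x2 layout using par(mfrow) instead of split.screen
par(mfrow=c(2,2), mar=c(3,2,1,1))
for (i in 1:ncol(df)) {
	plot(1:nrow(df), df[,i], type="l", col="blue4", lwd=4,
		ylim=c(0,35), bty="n", xlab=NA, ylab=NA, xaxt="n")
	title(colnames(df)[i])
}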

So which do you prefer?

Your strongly correlated data is probably nonsense

Use of the Pearson correlation coefficient is common in genomics and bioinformatics, which is OK as far as it goes (I have used it extensively myself), but it has some major drawbacks – chief among them that Pearson can produce large coefficients in the presence of a few very large measurements.

This is best shown via example in R:


# let's correlate some random data
g1 <- rnorm(50)
g2 <- rnorm(50)

cor(g1, g2)
# [1] -0.1486646

So we get a small negative correlation from correlating two sets of 50 random values. If we ran this 1,000 times we would get a distribution of coefficients centred on zero, as expected.
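You can check that quickly enough:

# a quick check: 1,000 correlations of pairs of random 50-value vectors
cors <- replicate(1000, cor(rnorm(50), rnorm(50)))
summary(cors)
hist(cors, breaks = 40, main = "Pearson's r for random data")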

Let's add in a single, large value:


# let's correlate some random data with the addition of a single, large value
g1 <- c(g1, 10)
g2 <- c(g2, 11)
 
cor(g1, g2)
# [1] 0.6040776

Holy smokes, all of a sudden my random datasets are positively correlated with r>=0.6!

It's also significant.


> cor.test(g1,g2, method="pearson")

        Pearson's product-moment correlation

data:  g1 and g2
t = 5.3061, df = 49, p-value = 2.687e-06
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.3941015 0.7541199
sample estimates:
      cor 
0.6040776 

So if you have used Pearson in large datasets, you will almost certainly have some of these spurious correlations in your data.

How can you solve this? By using Spearman, of course:


> cor(g1, g2, method="spearman")
[1] -0.0961086
> cor.test(g1, g2, method="spearman")

        Spearman's rank correlation rho

data:  g1 and g2
S = 24224, p-value = 0.5012
alternative hypothesis: true rho is not equal to 0
sample estimates:
       rho 
-0.0961086 

Generate a single contig, hybrid assembly of E coli using MiSeq and MinION data

If you have MiSeq and ONT MinION data, it is now very easy to combine them into a single-contig, high-quality assembly.  Here I will start with a base Ubuntu 14.04 LTS distribution and take you through every step, including software installation.  I am starting with an Amazon EC2 m4.xlarge instance with 4 vCPUs and 16Gb RAM.  If you want to repeat the work below, you don’t need to use EC2; any computer running Ubuntu and with enough disk/RAM should work.

For MinION data, we will use Nick’s SQK-MAP-006 data at the EBI.  Using a simple trick, and because I know the internal structure of the tar archive, we can download only the base-called data:


# stream the tar archive over FTP and extract only the base-called
# subdirectory, rather than downloading the whole thing first
curl ftp://ftp.sra.ebi.ac.uk/vol1/ERA540/ERA540530/oxfordnanopore_native/MAP006-1.tar | tar -xf - MAP006-1/MAP006-1_downloads &

We can also get some MiSeq data from ENA for the same strain, E coli K-12 MG1655:


wget -q ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR262/005/SRR2627175/SRR2627175_2.fastq.gz &
wget -q ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR262/005/SRR2627175/SRR2627175_1.fastq.gz &

Now, while that is downloading, we need to install some software on the server:


sudo apt-get update
sudo apt-get install r-base-core git samtools

This installs R (which we will use, via poRe, to extract the MinION data), git and samtools.  Just reply “Y” to any questions.

We now need to install the poRe dependencies in R, which is very easy:


R
source("http://www.bioconductor.org/biocLite.R")
biocLite("rhdf5")
install.packages(c("shiny","bit64","data.table","svDialogs"))
q()

R may ask if you want to install into a local library; just say Y and accept the defaults.  We then need to download poRe from SourceForge – we are using version 0.16.
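If you prefer to script that step from within R, something like this works (the URL is illustrative – grab the real 0.16 tarball link from the SourceForge project page):

# illustrative URL only – check SourceForge for the real poRe 0.16 link
download.file("https://sourceforge.net/projects/rpore/files/poRe_0.16.tar.gz/download",
              destfile = "poRe_0.16.tar.gz")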

Once downloaded, and back at the Linux command line:


R CMD INSTALL poRe_0.16.tar.gz

The fastq extraction scripts for poRe are on GitHub, so let’s go and get those:


git clone https://github.com/mw55309/poRe_scripts.git

We will assemble using SPAdes, so let’s go get that:


wget http://spades.bioinf.spbau.ru/release3.6.2/SPAdes-3.6.2-Linux.tar.gz
gunzip < SPAdes-3.6.2-Linux.tar.gz | tar xvf -

Now, we are ready to go.  First off, let’s extract the 2D sequence data as FASTQ from the MinION data.  Nick’s SQK-MAP-006 data are in the old FAST5 format so we use the script in “old_format”:


./poRe_scripts/old_format/extract2D MAP006-1/MAP006-1_downloads/pass/ > minion.pass.2D.fastq &

Finally we can use SPAdes to assemble the data.  SPAdes is a swiss-army knife of genome assembly tools, and by default includes read correction.  This takes up lots of RAM, so we are going to skip it.  We will also only use 3 kmers to save time:


./SPAdes-3.6.2-Linux/bin/spades.py --only-assembler \
				   -t 4 -k 21,51,71 \
				   -1 SRR2627175_1.fastq.gz \
				   -2 SRR2627175_2.fastq.gz \
				   --nanopore minion.pass.2D.fastq \
				   -o SPAdes_hybrid &

Use samtools to extract the top contig:


# the first line of contigs.fasta is the header of the longest contig
head -n 1 SPAdes_hybrid/contigs.fasta
# index the assembly, then pull out that contig by name
# (the NODE_1 name will differ between runs)
samtools faidx SPAdes_hybrid/contigs.fasta
samtools faidx SPAdes_hybrid/contigs.fasta NODE_1_length_4620446_cov_135.169_ID_22238 > single_contig.fa

Finally, a quick comparison to the reference:


sudo apt-get install mummer 
curl -s "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=NC_000913.3&rettype=fasta&retmode=txt" > NC_000913.3.fa
nucmer NC_000913.3.fa single_contig.fa
mummerplot -png out.delta
display out.png &

[Image: ecoli_mummer]

And that is a single global alignment between reference and query 🙂

Simple huh?

Extracting MinION fastq on the command line using poRe

Hopefully most of you will be aware of our software for handling MinION data, poRe, published in Bioinformatics this year:

Watson M, Thomson M, Risse J, Talbot R, Santoyo-Lopez J, Gharbi K, Blaxter M. (2015) 
poRe: an R package for the visualization and analysis of nanopore sequencing data. 
Bioinformatics 31(1):114-5.

poRe is based on R; many people think of R as a purely interactive environment, but it is very easy to write command-line scripts in R, in much the same way as you write Perl or Python scripts.  Below is a script that can be used in conjunction with R and poRe to extract 2D fastq data from a directory full of fast5 files:

#!/usr/bin/env Rscript

# get command line arguments as an array
args <- commandArgs(trailingOnly = TRUE)

# if we don't have any command line arguments
# print out usage
if (length(args) == 0) {
        stop("Please provide a directory containing fast5 files as the first (and only) argument")
}

# the first argument is the directory, stop if it doesn't exist
my.dir <- args[1]
if (! file.exists(my.dir)) {
        stop(paste(my.dir, " does not exist"))
}

# get a list of all of the files with a .fast5 file name extension
f5files <- dir(path = my.dir, pattern = "\\.fast5$", full.names = TRUE)

# print to STDERR how many files we are going to attempt
write(paste("Extracting 2D fastq from", length(f5files), "files"), stderr())

# load poRe
suppressMessages(library(poRe))

# apply printfastq to every entry in f5files
# printfastq tests for 2D fastq and if found prints to STDOUT
invisible(apply(array(f5files), 1, printfastq))

The suppressMessages() function means that none of the start-up output is printed when the poRe library is loaded, and the invisible() function suppresses the natural output of apply(), which would otherwise print a list of return values the same length as f5files (which we definitely do not want).

The above is based on R 3.1.2 and poRe 0.9.

We could extract template and complement fastq (respectively) by substituting in the following lines:

# template
invisible(apply(array(f5files), 1, printfastq, f5path="/Analyses/Basecall_2D_000/BaseCalled_template"))
# complement
invisible(apply(array(f5files), 1, printfastq, f5path="/Analyses/Basecall_2D_000/BaseCalled_complement"))

Analysing MinION data with no bioinformatics expertise or infrastructure

It has been a long standing aim of mine to enable easy analysis of Oxford Nanopore MinION data, preferably by people who don’t have a ton of bioinformatics skills.  This was one of the reasons we started to develop poRe, which is really easy to install, runs on any platform and is fairly easy to use.

Below is a proof-of-concept method for analysing MinION data without the need for bioinformatics infrastructure – I am doing this on my Windows laptop, with only R and an internet connection.  We are going to use the EBI web-services to do the heavy lifting.  Please do not abuse this!  EBI probably don’t want you to submit 20,000 jobs.  As I said, this is “proof-of-concept”.  It means you’re not supposed to actually use it in practice 😉

Some polite things to do:

  1. Use your own, real e-mail address when/if you run the code below
  2. Please stick to a small number of jobs! (unless EBI tell you it’s OK)

We’re going to do everything in R.

Make a directory on your computer where we will do the work.  You will need approx. 15Gb of space (I know, the data are big!).   My directory is going to be “C:/Users/Mick/Documents/MinION”.  We’re going to use data from Nick Loman’s group, published here and here.

# go to the working directory
setwd("C:/Users/Mick/Documents/MinION")

# load the necessary packages
library(poRe)
library(RCurl)

# download one of Nick's runs and unpack it
url <- "ftp://ftp.sra.ebi.ac.uk/vol1/ERA362/ERA362836/oxfordnanopore_native/Ecoli_R7_ONI_flowcell_17.tar.gz"
dest <- "Ecoli_R7_ONI_flowcell_17.tar.gz"
download.file(url=url, destfile=dest)
untar("Ecoli_R7_ONI_flowcell_17.tar.gz")

This will take some time, as there is about 10Gb of data there.  If I had a smaller dataset, I would use it.

Now, we use poRe’s read.fast5.info() function to read statistics from the several thousand MinION reads, which again takes some time.  We then just select the reads that have a 2D sequence:

# read information from all .fast5 files in the current working directory
finfo <- read.fast5.info(dir=getwd())

# get a list of fast5 files where the "len2d" is greater than zero
reads.2d <- rownames(finfo)[finfo$len2d>0]
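Before hitting the web service, it's worth checking how many 2D reads we actually have:

# how many reads have a 2D basecall?
length(reads.2d)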

Now, we are going to use the EBI REST API for NCBI BLAST.  You can read about it at the link provided, but essentially it is about building and querying custom URLs to submit BLAST jobs and fetch the result.

We are only going to do this for the first ten fast5 sequences.  Please do not submit more, or the EBI may block you.  If you want to use this in anger, I suggest contacting the EBI directly and asking them politely.  The code blasts each sequence with default parameters against the EMBL bacterial database, saves the BLAST report to your current working directory, and prints out the top hit.

Here goes:

# iterate over the first ten 2D reads
for (f5 in reads.2d[1:10]) {

    # use poRe function get_fasta() to get the 2D sequence
    fasta.2d <- get_fasta(f5)$'2D'

    # build a list of parameters for the EBI BLAST service
    myl <- list(sequence=fasta.2d,
                stype="dna",
                program="blastn",
                email="yourownaddress@email.com",
                database="em_rel_pro")

    # build the submission URL and submit the job
    runurl <- "http://www.ebi.ac.uk/Tools/services/rest/ncbiblast/run"
    jobid  <- postForm(runurl, .params=myl)

    # query the status URL for the job until the status
    # is no longer "RUNNING", pausing between polls so as
    # not to hammer the service
    staturl <- "http://www.ebi.ac.uk/Tools/services/rest/ncbiblast/status/"
    surl    <- paste(staturl, jobid, sep="")
    status  <- "RUNNING"
    while (status == "RUNNING") {
        Sys.sleep(3)
        status <- getURL(surl)
    }

    # build the output URL and fetch the results
    outurl <- "http://www.ebi.ac.uk/Tools/services/rest/ncbiblast/result/"
    ourl   <- paste(outurl, jobid, "/out", sep="")
    out    <- getURL(ourl)

    # write the output to a file in the current working directory
    bloutfile <- gsub(".fast5", ".blast", f5)
    write(out, bloutfile)

    # print out the top hit (line 22 of the BLAST report)
    print(strsplit(out, "\n")[[1]][22])

}

Now, it’s not quick – it probably takes about 2-3 minutes to complete on my laptop.  It also isn’t very responsive (in that it looks a bit like it may not be doing anything!).  However, eventually the R terminal will show the top hits:

[Image: blast_top_hit]

You can see from the above that the first few reads are a bit dodgy and don’t really hit anything, but pretty quickly the reads start hitting E coli K12, which is what they sequenced.

Also, in your directory you will find a whole bunch of BLAST reports:

[Image: blast_files]

Note, as “proof-of-concept” code, there is very little (i.e. nothing) in the way of error-checking in the above code!  I mean, come on, it’s only a blog post!

Enjoy!

© 2019 Opiniomics
