bioinformatics, genomes, biology etc. "I don't mean to sound angry and cynical, but I am, so that's how it comes across"

Category: fun stuff (page 1 of 3)

Come and work with us

We have an amazing post-doc position in our group:

Please apply and please do so via the University system!

Below are all the reasons why you MUST COME AND WORK WITH US!  These are the the kind of thing we can’t fit in the job advert. Please apply, even if you think you don’t yet have the skills – I am sure we can work on that for the right person.

1. Institutional environment

You will be joining The Roslin Institute, a world-famous research institute in Edinburgh which shocked the world in 1996 by producing the first cloned animal from an adult cell, Dolly the Sheep. We continue to carry out world class reseach in biotechnology to this day, but we are now fully embedded within the University of Edinburgh, consistently ranked in the top 20 universities worldwide.

But enough of that “we’re excellent” nonsense 😉

The institute is housed in a brand new building put up in 2011, and is across the road from another new building, the Royal (Dick) Vet School. We have two restaurants on site, and a gym. The site is served by several buses several times a day, and it really is a pleasant place to work. We’re just a little outside the city centre and the views of the local countryside are stunning.

Roslin has a very relaxed atmosphere. There are over 100 PhD students,  and we also have a lively set of post-docs. This means it is easy to make friends and there is always something social going on – for example, Friday is “cake day”, and if you join the cake club you can go and enjoy someone’s cake in the restaurant with all of the other members. You can read the post-doc handbook here!

Honestly, it’s a fantastic place to work and I think you’d love it.

2. Group environment

My aim is to run a relaxed, supportive, open and inclusive group, with a focus on helping individuals realise their objectives, whatever they may be. We have a weekly lab meeting, Thursdays at 9:30am, where we talk about science and people are encouraged to talk about any problems/issues they have, or whatever they have been thinking about (this is not mandatory though, for those of you who are introverts). My door is always open for one-to-one meetings.

My attitude is that I don’t expect you to succeed, but I do expect you to try. Failure is perfectly fine – we can deal with that and I, or other members of the group, can help. All I expect of you is to put the effort in and try. As long as I see sufficient effort on the project you are working on, as long as there is progress, and as long as you make up the time somehow, I am flexible about how, where and when you work.

I encourage group members NOT to work too hard and to take all of their annual leave. If you need flexible working I am 100% OK with that; need Fridays off? No problem. Need to leave every day at 3 to collect your kids? No problem. Want to work from home? No problem. Having two young children myself, I am particularly sensitive to the needs of young families, and I am also committed to helping young parents get back in to work. I also have no age bias – please, if you are interested in one of our positions, apply!

Obviously it’s about balance – it’s good for you to come in and see the other members of the group and to see me, but I am relaxed about this, and will leave it up to you to decide.

3. Research

Our work largely involves analysis of huge amounts of sequence data, either for functional genomics or metagenomics/microbiome. As a lab I encourage everyone to use Snakemake, which has conda and Docker support built in for dealing with software installation issues. However, we are branching out into NextFlow, and one of the jobs will explicitly produce NextFlow pipelines. We have access to a 4000+ core University high performance computing system, and we have begun working with cloud technologies such as Google Cloud Platform and Kubernetes. Most of the group code in Python, though there is a significant amount of R and Perl hanging around. I am an R guru, if you’d like to learn it 😉

I am always happy for folk to learn on the job – if you don’t think you have the skills, but would like to learn them (by doing!) then apply anyway – if you’re the right person and the right fit for the job, we can fill in the technical skills later. Honestly, the technical skills might be the least important. As long as you are comfortable in Linux, everything else can be learned quickly.

Our focus is on discovery biology fuelled by reproducible science. What can we discover that is novel about biological systems using computational biology? We also write and maintain novel software workflows to ensure that (i) our science is reproducible and (ii) others can follow our work and apply it to their own problems.

The group is always full of ideas of what we can do, and we try to keep ourselves at the cutting edge of technology.

4. Edinburgh

Edinburgh is one of the World’s most beautiful cities, with historical buildings, huge green open spaces, and a long golden beach (yes really!). There are too many attractions to list them here, but needless to say, it is an incredible place to live and you will never be short of things to do here. Living expenses are affordable compared to many other cities in the UK and Europe, and decent accomodation at all levels can be found throughout the city; alternatively, many people at Roslin choose to settle just outside the city, in one of the many local villages and towns. There are plenty of options!

5. Funding

Unfortunately I cannot change the academic funding model of the UK single handedly, and so many of the posts in the groupe are 2 or 3 year contracts. However, what I can say is that we have a track record of success in funding and personally I have been consistently funded since 2008, and have current grants that run to 2022 – that will be 14 years of funding from various research agencies with no break. I am confident that I will have positions available for you when you run to the end of your contract, should you wish to stay; if not, there are hundreds of opportunities elsewhere in the University, for all sorts of career paths; and computational biologists are always in demand.

6. Travel

Generally speaking I support travel and attendance at conferences and workshops (within reason, and within budget). Attendance at the UK’s Genome Science conference, which I help organise, is always encouraged and many of the team members attend PAG in San Diego every January. We have an active project with collaborators in Nairobi too, and so travel there is also a possibility. The H2020 project we have is a pan European project, and I expect the post holder to attend meetings throughout Europe in that time. However, if you don’t wish to travel, or can’t, for whatever reason, that’s also fine and we can always discuss the best way to handle meetings that are necessary for the project you’re working on.

7. Salary

Unfortunately the one thing I am not in control of is salary – I am so sorry! I would pay you more if I could! The University sets the pay band, and even worse than that, each project will have been funded with a specific salary band – I have no room to negotiate on this. I wish I did! Hopefully there is enough above to convince you to come and work for me.


That’s it! We have a great group, and we are doing fun and important things. Please come!

Voice is dead, will email be far behind?

I have a not-so-secret secret – I never answer the phone. With landlines, it is 100% of the time, because I can’t see who is calling me.  Despite the fact my work landline has a messages facility, I don’t listen to those either. Not in 8 years since I moved to Edinburgh. I must have 100+ voice messages on there – I will never listen to them. I won’t.

Guess what? Nothing bad happened. No-one died. Projects finished. I’m doing pretty well and the University is still standing. So, maybe, just maybe, we don’t need voice calls anymore?

My mobile is slightly different – I answer calls from four “people”: my wife, my parents, my brother and my friends. I do check the messages though. If it’s an unknown number, i just ignore it – if it’s important, I’ll pick up the message and call back later.

Voice is dead. It’s so dead. Dead as a dodo, deader than a dodo because it’s possible George Church might bring dodos back. Why? I learned that 99.9% of the calls I get are calls I don’t want. Cold calls, or calls about stuff I don’t want to know about or hear about, or calls to remind me to do something I have already decided I don’t want to do, or have already scheduled. I don’t need calls. I could 100% turn off incoming calls for the rest of my life and I would be soooo happy.

Without doubt the biggest culprits are cold call companies trying to sell me stuff. The vast majority of calls are these type of calls and they have killed voice. It’s dead. Stone dead, we do not need it anymore. I do not need incoming calls. I no longer require this feature, please turn it off.

The question is, will email go the same way?

What with journal and conference spam, benign spam from my University, and the endless growth of cc- and bcc-, I’d say the percentage of emails I have zero interest in reading is above 90%, possibly even 95%. It is approaching the level at which voice died.  Spam filters don’t help, especially with benign spam (yesterday’s was an email-to-all about a dying relative’s fundraiser, today’s (it’s only 08:30am) was about a bumper new crop of vegetables).

Spam within an organisation, deliberate, stupid, ridiculous spam driven by cc- and bcc- cancer and reply-all mailing lists is possibly the most harmful self-mutilation large organisations can do, as they watch productivity grind to a halt and then wonder “why?” possibly sending out a few all-staff surveys to find out.

Here is what I would love, please tell me if this exists – I want e-mail to be like ssh and have public private keys. Like, you only get to contact me if you have my public key, and it matches the current private key in my inbox. You want to contact me? You have to figure out how to get my public key. Hint: you won’t be able to get it by calling me.


Beautiful boxplots in base R

As many of you will be aware, I like to post some R code, and I especially like to post base R versions of ggplot2 things!

Well these amazing boxplots turned up on github – go and check them out!

So I did my own version in base R – check out the code here and the result below.  Enjoy!


Plotting cool graphs in R

I have to admit to being a bit of a snob when it comes to graphs and charts in scientific papers and presentations.  It’s not like I think I am particularly good at it – I’m OK – it’s just that I know what’s bad.  I’ve seen folk screenshot multiple Excel graphs so they can paste them into a powerpoint table to create multi-panel plots… and it kind of makes me want to scream.   I’m sorry, I really am, but when I see Excel plots in papers I judge the authors, and I don’t mean in a good way.  I can’t help it.  Plotting good graphs is an art, and sticking with the metaphor, Excel is paint-by-numbers and R is a blank canvas, waiting for something beautiful to be created; Excel is limiting, whereas R sets you free.

Readers of this blog will know that I like to take plots that I find which are fabulous and recreate them.  Well let’s do that again 🙂

I saw this Tweet by Trevor Branch on Twitter and found it intriguing:

It shows two plots of the same data.  The Excel plot:


And the multi plot:


You’re clearly supposed to think the latter is better, and I do; however perhaps disappointingly, the top graph would be easy to plot in Excel but I’m guessing most people would find it impossible to create the bottom one (in Excel or otherwise).

Well, I’m going to show you how to create both, in R. All code now in Github!

The Excel Graph

Now, I’ve shown you how to create Excel-like graphs in R before, and we’ll use some of the same tricks again.

First we set up the data:

# set up the data
df <- data.frame(Circulatory=c(32,26,19,16,14,13,11,11),

rownames(df) <- seq(1975,2010,by=5)


Now let's plot the graph

# set up colours and points
cols <- c("darkolivegreen3","darkcyan","mediumpurple2","coral3")
pch <- c(17,18,8,15)

# we have one point on X axis for each row of df (nrow(df))
# we then add 2.5 to make room for the legend
xmax <- nrow(df) + 2.5

# make the borders smaller

# plot an empty graph
plot(1:nrow(df), 1:nrow(df), pch="", 
		xlab=NA, ylab=NA, xaxt="n", yaxt="n", 
		ylim=c(0,35), bty="n", xlim=c(1,xmax))

# add horizontal lines
for (i in seq(0,35,by=5)) {
	lines(1:nrow(df), rep(i,nrow(df)), col="grey")

# add points and lines 
# for each dataset
for (i in 1:ncol(df)) {

	points(1:nrow(df), df[,i], pch=pch[i], 
		col=cols[i], cex=1.5)

	lines(1:nrow(df), df[,i], col=cols[i], 


# add bottom axes
axis(side=1, at=1:nrow(df), tick=FALSE, 

axis(side=1, at=seq(-0.5,8.5,by=1), 
		tick=TRUE, labels=NA)

# add left axis
axis(side=2, at=seq(0,35,by=5), tick=TRUE, 
		las=TRUE, labels=paste(seq(0,35,by=5),"%",sep=""))

# add legend
legend(8.5,25,legend=colnames(df), pch=pch, 
		col=cols, cex=1.5, bty="n",  lwd=3, lty=1)

And here is the result:


Not bad eh?  Actually, yes, very bad; but also very Excel!

The multi-plot

Plotting multi-panel figures in R is sooooooo easy!  Here we go for the alternate multi-plot.  We use the same data.

# split into 2 rows and 2 cols

# keep track of which screen we are
# plotting to
scr <- 1

# iterate over columns
for (i in 1:ncol(df)) {

	# select screen

	# reduce margins

	# empty plot
	plot(1:nrow(df), 1:nrow(df), pch="", xlab=NA, 
		ylab=NA, xaxt="n", yaxt="n", ylim=c(0,35), 

	# plot all data in grey
	for (j in 1:ncol(df)) {
		lines(1:nrow(df), df[,j], 
		col="grey", lwd=3)


	# plot selected in blue
	lines(1:nrow(df), df[,i], col="blue4", lwd=4)

	# add blobs
	points(c(1,nrow(df)), c(df[1,i], df[nrow(df),i]), 
		pch=16, cex=2, col="blue4")

	# add numbers
	mtext(df[1,i], side=2, at=df[1,i], las=2)
	mtext(df[nrow(df),i], side=4, at=df[nrow(df),i], 

	# add title

	# add axes if we are one of
	# the bottom two plots
	if (scr >= 3) {
		axis(side=1, at=1:nrow(df), tick=FALSE, 

	# next screen
	scr <- scr + 1

# close multi-panel image

And here is the result:



And there we have it.

So which do you prefer?

Plot your own EU referendum poll results

Due to the unspeakable horror of the EU referendum, I have to find something to make me feel better.  This poll of polls usually does so, though it is way too close for comfort.

Anyway, I took their data and plotted it for myself.  Data and script are on github, and all you need is R.

Enjoy! #voteRemain


Travel tips for highly organised and important people

Karthik Ram (@_inundata on Twitter) wrote down recently some of his “travel hacks“, and so I thought I would repeat the kind of sage advice he offers here. Here it is:

1. Luggage: choose the bag that smells the least bad. Remember that the cat likes to pee in them so check that first; also, the baby might have puked in there. The best bag is one that still contains all the stuff from your last trip that you can simply use again (to prevent the use of inappropriate clothes, simply never ever dress seasonally).  Also, pack at the very last minute, to make sure you don’t waste any precious time.

2. Power: go to the drawer that contains all of the screwdrivers, old keys and travel adapters and put every single one of the travel adapters you own in the bag.  Every. Single. One. Then buy another one at the airport.  Take the external battery you bought, but is never charged.  Load up at least 3 broken USB cables, and make sure to put it in your checked bag so you can’t use them in the airport, even though you really want to.  Simply buy a whole new set of chargers in every airport you go to.

3. Sleep: never, ever refuse the opportunity to have an espresso in the airport, no matter what time of day or night, so that you’re wired and sweating for the entire flight.  Do not watch sad movies.  Remember that sad things in movies come back to you late at night when you can’t sleep, and you imagine them happening to you and your family and you cry into your pillow (and that really didn’t work out that time in China when the pillow was made of rice).

4. WiFi: always run up at least a £100 roaming 3G bill on your personal mobile phone by checking your emails and Twitter constantly.  Never take your work mobile, it is two models out of date and soooo uncool.

5. Health: ensure you maintain your health by somehow sitting next to the person who has the worst cold you have ever witnessed.  Every. Single. Time.

6. Accessories: buy those tiny expensive toiletries at the airport and never claim them on expenses.  They’re so cute!  They look exactly like the grown up ones, but teeny – squee!


And with that, I will wish you a happy new year!




Finally, a definition for bioinformatics

Following the trend for ultra-short-but-to-the-point blog posts, I have decided to finally define bioinformatics:

bio-informatics (bi-oh-in-foh-shit-I-don’t-understand-how-that-works)

From the word bio, meaning “of or related to biology” and informatics, meaning “absolutely anything your collaborators or boss don’t understand about maths, statistics or computing, including why they can’t print and how the internet works”

Happy to finally put that one to bed!

101 bioinformatics facts

I am trying to crowd-source 101 fun bioinformatics facts!  Please contribute in the comments and I will add them below:

  1. BLAST is so fast, the authors had to deliberately slow down the code so it doesn’t overheat the servers
  2. GCG, the old bioinformatics package, was named after the authors kept high-fiving each other, shouting “good code guys!”
  3. Bowtie is named so because “it is almost impossible to tie”, referring to code to avoid a “race condition” when using multiple processors
  4. TopHat is named do because it was the first spliced RNA-Seq aligner, and when it worked first time, the authors shouted “Top that!”
  5. Over 1 billion people have searched the NCBI protein database for their own name
  6. The EBI is an elaborate front-end to NCBI services
  7. The SRA (short read archive) is the best known of the archives, and not many people know or use the MRA (medium read archive), the KLRA (kinda long read archive) and the LRA (long read archive)
  8. Europe PubMed Central has only ever been accessed by people accidentally clicking on links.  100% of visitors immediately bounce to pubmed.com
  9. There are now more journals than papers
  10. The HGAP assembler is actually an elaborate front-end hiding three thousand slave labourers all running GAP4 (via @IanGoodhead)
  11. HPC actually means “Homunculus Powered Computing”, and all servers are actually just mechanical turks full of leprechauns (via @froggleston)
  12. The biosemantics.org group of LUMC is doing psipred protein folding on 1kWh household radiators (128 cores each) https://vimeo.com/122893200 (via Eric Feliksik)
  13. The ‘p’ in p-value actually stands for p-otentially interesting! (via )
  14. Velvet is so named because @dzerbino wore velvet gloves when coding it (via @pathogenomenick)
  15. The @PacBio machines are so large because inside’s an Illumina machine + a bioinformatician running assemblies (via )
  16. The Cloud is actually just a cloud. That’s it. A real cloud (via @froggleston)
  17. If you plug in a @nanopore MinION and hit left,right,up,up,A,B,down, it’ll transform into a lifesize statue of @Clive_G_Brown (via @froggleston)
  18. DDBJ have their data centre in a volcano, and are basically a front for Osato Chemicals (SPECTRE) (via @SCEdmunds)
  19. Hidden Markov Models were initially developed to find Waldo shawnhymel.com/portfolio/413/ (via )
  20. 99.5% of people who cite Altschul et al have never read the paper
  21. Bioinformatics Applications Notes have to be automatically generated by the software they describe
  22. BGI exclusively publish in Nature journals because their papers are first rejected by Gigascience
  23. BGI actually only have one HiSeq but made to look like hundreds by a set up of mirrors, like that bit in Enter the Dragon (via @froggleston)
  24. the consumption rate of coffee (+ beer 🍻) among Bioinformaticians from around the world is increasing every year. TRUE FACT! (via @NazeefaFatima)
  25. EBI” actually stands for “European bureau of investigation”. It’s a front of the EU secret service, collecting genomic info (via @klmr)
  26. There are only 3 facts in 101 (via @mcaccamo)
  27. If all you have is a hmmer everything looks like it can be resolved with Viterbi (via @mcaccamo)
  28. Hidden Markov Models are like the recipe for Kentucky Fried Chicken.  There are only three people in the world who understand small parts of how HMMs work, and only when they get together do they know the full picture
  29. The “e” in e-value stands for “excellent”, as in “that’s an excellent BLAST hit”
  30. The Burrows-Wheeler transform, used in BWA and Bowtie, saves memory by transforming the DNA sequence data into a parallel dimension, meaning it ceases to exist in 4D space/time in this Universe
  31. Base qualities are called “Phred” scores in honour of Fred Sanger who developed DNA sequencing. #101bioinfofunfacts (via @tostenseemann)
  32. In a recent public survey of the 100 most desirable jobs, bioinformatician was a close second to astronaut (via @dynomics)
  33. Heng Li writes all his code in x86 assembly language, and uses a C decompiler before releasing it. @lh3lh3 (via @torstenseemann)
  34. The EBI secretly funds the Perl Foundation to ensure its legacy internal software infrastructure won’t collapse (via @torstenseemann)
  35. Illumina reads are short as before the development of Basespace they were delivered via Twitter (via @RoyChaudhuri)
  36. Pet Bioinformaticians are paid with cuddling #101bioinfofunfacts (via )
  37. Python was conceived in the 1980’s by @gvanrossum & named after his favourite British comedy, Monty Python’s Flying Circus (via )
  38. the word “ELVIS” appears 35 times in human peps (GRch38). “ELVISLIVES” appears 0 times. The king has left the genome #slowday (via @rdemes)
  39. Tuxedo suit is so named that only ‘privileged’ know how to use it ! #bioinformaticsfun (via )
  40. It’s easy! You only have to download this database in which all the genes have only one ID and you can retrieve the IDs in the most important databases (via @jorjial)
  41. If you stand in front of a mirror and say ‘HiSeq’ 3 times, Illumina staff member will show up holding the HiSeq X Ten system (via @nazeetafatima)
  42. @BenLangmead wrote Bowtie while wearing a tuxedo but he did all the testing in zip-up onesie batman pajamas (via @coletrapnell)
  43. Spike-ins are like gold (via @nomad421)
  44. Do you need more hard disk space to store and do the analysis? Sure! Let’s buy 10 hard disk of 3 TB in the supermarket (via @jorjial)
  45. This could be the basis for 10.1 papers in PLOS Comp. Biol. (via @kbradnam)
  46. All bioinformatics problems can be solved through the medium of twitter, snide and ranting 😉 (via @guyleonard)
  47. Installing TopHat with option –reverse will install HotTap, a program that spews vapid results on a random science hot topic (via @CamLBerthelot)
  48. SOLiD sequencers generated colour-space sequence using an algorithm based on the once popular “Simon Says” hand held game (via @iandcalling)
  49. CriMap was called CriMap because users do an awful lot of crying before they get a half decent map (via @dj_de_koning)
  50. A single anonymous donor, RP11, accounts for 72 percent of the human reference genome (via CanGenom)
  51. If you amass the de-bugging tears of a bioinformatician it is enough to fill an Olympic size swimming pool annually (via @paulhoskisson)
  52. FASTA 80 character line wrapping was invented to standardise data sharing using MS Word (via @IanGoodhead)
  53. nine out of ten Bioinformaticians prefer Excel (via
  54. if you’ve never shown the NIH sequencing costs plot in talk/lecture you’re not a real bioinformatician pic.twitter.com/jQzG7MGosd (via @AliciaOshlack)
  55. Illumina is short for Illuminati, the shadowy organisation that controls sequencing worldwide. (via @neilfws)
  56. Every time you run a closed source bioinformatics tool, a PhD student’s soul is sacrificed to the Blood God. (via @froggleston)
  57. The number of replicates needed for your RNA-seq experiment equals the impact factor of the journal you want to publish in (via @torstenseemann)
  58. NCBI’s bacterial annotation takes 6 weeks because it’s done manually by work experience students pasting ORFs into web BLAST (via @torstenseemann)
  59. The majority of bioinformaticians can’t pronounce “de Bruijn” properly (see also thegenomefactory.blogspot.sg/2013/08/how-to… @torstenseemann) (via @rvaerle)
  60. Oxford Nanopore plans to introduce a new FASTQ encoding scheme using an ASCII offset of 48 with optional emoji (via @torstenseemann)
  61. The HMMer package was so named when someone asked how it worked, and the developers said “Hmmmm… errr….” (via @mgollery)
  62. 63% of Bioinformaticists were Biologists to start with, but they realized that the cold room is really COLD! (via @mgollery)
  63. It has been calculated that there are twice as many data formats as there are Bioinformaticians (via @mgollery)

Analysing MinION data with no bioinformatics expertise nor infrastructure

It has been a long standing aim of mine to enable easy analysis of Oxford Nanopore MinION data, preferably by people who don’t have a ton of bioinformatics skills.  This was one of the reasons we started to develop poRe, which is really easy to install, runs on any platform and is fairly easy to use.

Below is a proof-of-concept method for analysing MinION data without the need for bioinformatics infrastructure – I am doing this on my Windows laptop, with only R and an internet connection.  We are going to use the EBI web-services to do the heavy lifting.  Please do not abuse this!  EBI probably don’t want you to submit 20,000 jobs.  As I said, this is “proof-of-concept”.  It means you’re not supposed to actually use it in practice 😉

Some polite things to do:

  1. Use your own, real e-mail address when/if you run the code below
  2. Please stick to a small number of jobs! (unless EBI tell you it’s OK)

We’re going to do everything in R.

Make a directory on your computer where we will do the work.  You will need approx. 15Gb of space (I know, the data are big!).   My directory is going to be “C:/Users/Mick/Documents/MinION”.  We’re going to use data from Nick Loman’s group, published here and here.

# go to the working directory

# load the necessary packages

# download one of Nick's runs and unpack it
url <- "ftp://ftp.sra.ebi.ac.uk/vol1/ERA362/ERA362836/oxfordnanopore_native/Ecoli_R7_ONI_flowcell_17.tar.gz"
dest <- "Ecoli_R7_ONI_flowcell_17.tar.gz"
download.file(url=url, destfile=dest)

This will take some time, as there is about 10Gb of data there.  If I had a smaller dataset, I would use it.

Now, we use the poRe get_fast5_info() function to read statistics from the several thousand MinION reads, which again takes some time.  We then just select the reads that have a 2D sequence:

# read information from all .fast5 files in the current working directory
finfo <- read.fast5.info(dir=getwd())

# get a list of fast5 files where the "len2d" is greater than zero
reads.2d <- rownames(finfo)[finfo$len2d>0]

Now, we are going to use the EBI REST API for NCBI BLAST.  You can read about it at the link provided, but essentially it is about building and querying custom URLs to submit BLAST jobs and fetch the result.

We are only going to do this for the first ten fast5 sequences.  Please do not submit more, or the EBI may block you.  If you want to use this in anger, I suggest contacting the EBI directly and asking them politely.  The code blasts each sequence with default parameters against the EMBL bacterial database, saves the BLAST report to your current working directory, and prints out the top hit.

Here goes:

# iterate over the first ten 2D reads
for (f5 in reads.2d[1:10]) {

    # use poRe function get_fasta() to get the 2D sequence
    	fasta.2d <- get_fasta(f5)$'2D'
    # build a list of parameters for the EBI BLAST service
    	myl <- list(sequence=fasta.2d,

    # build the submission URL and submit the job
    runurl <- "http://www.ebi.ac.uk/Tools/services/rest/ncbiblast/run"
    	jobid <- postForm(runurl, .params=myl)

    # query the status URL for the job until the status
    # is no longer "RUNNING"
    staturl <- "http://www.ebi.ac.uk/Tools/services/rest/ncbiblast/status/"
    	surl <- paste(staturl, jobid, sep="")
    	status <- "RUNNING"
    	while(status == "RUNNING") {
          		status <- getURL(surl)

    # build the output URL and fetch the results
    outurl <- "http://www.ebi.ac.uk/Tools/services/rest/ncbiblast/result/"
    	ourl <- paste(outurl, jobid, "/out", sep="")
    	out <- getURL(ourl)

    # write the output to a file in the current working directory
    	bloutfile <- gsub(".fast5",".blast",f5)
    	write(out, bloutfile)

    # print out the top hit
    	print(strsplit(out, "n")[[1]][22])


Now, it’s not quick – it probably takes about 2-3 minutes to complete on my laptop.  It also isn’t very responsive (in that it looks a bit like it may not be doing anything!).  However, eventually the R terminal will show the top hits:


You can see from the above that the first few reads are a bit dodgy and don’t really hit anything, but pretty quickly the reads start hitting E coli K12, which is what they sequenced.

Also, in your directory you will find a whole bunch of BLAST reports:


Note, as “proof-of-concept” code, there is very little (i.e. nothing) in the way of error-checking in the above code!  I mean, come on, it’s only a blog post!


Excited by bacterial DNA in sweet potatoes? Check this out!

Cool paper – turns out that the genome of the sweet potato has segments of the Agrobacterium genome in it! Genuinely cool stuff.

However, I found something even more exciting!  A tiny virus genome present in huge numbers of sequenced bacteria.  Check it out:

  1. GO here: http://blast.ncbi.nlm.nih.gov/Blast.cgi
  2. Click on Nucleotide Blast
  3. Where it says “Enter accession number, gi or FASTA”, simply enter the number: 9626372
  4. Where it says “Database” make sure you choose “others”
  5. Just under there, where it says “Organism (optional)” start typing the word Bacteria, and choose “Bacteria (taxid: 2)”
  6. Click the Blast button at the bottom

OMG!  This tiny viral genome is in E coli, it’s in Acinetobacter, Desulfitobacterium, Sphingobacterium….


Older posts

© 2022 Opiniomics

Theme by Anders NorenUp ↑