bioinformatics, genomes, biology etc. "I don't mean to sound angry and cynical, but I am, so that's how it comes across"

Author: biomickwatson

University of California makes legal move against Roger Chen (and Genia)

The relationship between sequencing companies is frosty and beset by legal issues, which I’ve covered before here and here.  Keith Robison tends to cover these in more detail 😉

Most recently, PacBio moved against Oxford Nanopore, we think claiming that ONT’s 2D technology violated their patent on CCR (link).

Well now the absolute latest is a filing by the University of California against Roger Chen and therefore Genia. If you click through to the documents (requires registration) you’ll see that UC claim Chen, with others, produced key inventions whilst at UC that he later assigned to Genia, but which should have automatically been assigned to UC according to UC’s “oath of allegiance”, which Chen signed as a UC employee.

It remains to be seen how important this is, and no doubt Chen/Genia/Roche will fight tooth and nail; however, if the courts decide in UC’s favour it could spell the end of Genia, and at the very least see a large cash settlement with UC.

Fascinating times!

Is the long read sequencing war already over?

My enthusiasm for nanopore sequencing is well known; we have some awesome software for working with the data; we won a grant to support this work; and we successfully assembled a tricky bacterial genome.  This all led to Nick and I writing an editorial for Nature Methods.

So, clearly some bias towards ONT from me.

Having said all of that, when PacBio announced the Sequel, I was genuinely excited.  Why?  Well, revolutionary and wonderful as the MinION was at the time, we were getting ~100Mb runs.  Amazing technology, mobile sequencer, tri-corder, just incredible engineering – but 100Mb was never going to change the world.  Some uses, yes; but for other uses we need more data.  Enter Sequel.

However, it turns out Sequel isn’t really delivering on promises.  Rather than 10Gb runs, folk are getting between 3 and 5Gb from the Sequel:

At the same time, MinION has been coming along great guns:

Whilst we are right to be skeptical about ONT’s claims about their own sequencer, other people who use the MinION have backed up these claims and say they regularly get figures similar to this.  If you don’t believe me, go get some of the world’s first nanopore human data here.

PacBio also released some data for Sequel here.

So how do they stack up against one another?  I won’t deal with accuracy here, but we can look at #reads, read length and throughput.

To be clear, we are comparing “rel2-nanopore-wgs-216722908-FAB42316.fastq.gz”, a fairly middling run from the NA12878 release, with “m54113_160913_184949.subreads.bam”, one of the released Sequel SMRT cell datasets.

Read length histograms:


As you can see, the longer reads are roughly equivalent in length, but MinION has far more reads at shorter read lengths.  I know the PacBio samples were size selected on Blue Pippin, but unsure about the MinION data.

The MinION dataset includes 466,325 reads, over twice as many as the Sequel dataset at 208,573 reads.

In terms of throughput, MinION again came out on top, with 2.4Gbases of data compared to just 2Gbases for the Sequel.

We can limit to reads >1000bp, and see a bit more detail:


  • The MinION data has 326,466 reads greater than 1000bp, summing to 2.37Gb.
  • The Sequel data has 192,718 reads greater than 1000bp, summing to 2Gb.

Finally, for reads over 10,000bp:

  • The MinION data has 84,803 reads greater than 10000bp, summing to 1.36Gb.
  • The Sequel data has 83,771 reads greater than 10000bp, summing to 1.48Gb.

These are very interesting stats!
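If you want to compute the same summaries for your own runs, here is a minimal base-R sketch.  Note that `fastq_stats` is just a name I've made up for this post, not a function from any package, and it assumes plain four-line FASTQ records:

```r
# Count reads and total bases in a FASTQ file, optionally keeping only
# reads longer than min_len (as in the >1000bp and >10000bp comparisons above)
fastq_stats <- function(path, min_len = 0) {
  con <- file(path, "r")
  lens <- integer(0)
  repeat {
    rec <- readLines(con, n = 4)    # one FASTQ record = 4 lines
    if (length(rec) < 4) break
    lens <- c(lens, nchar(rec[2]))  # line 2 of each record is the sequence
  }
  close(con)
  lens <- lens[lens > min_len]
  list(n_reads = length(lens), total_bases = sum(lens))
}

# e.g. fastq_stats("rel2-nanopore-wgs-216722908-FAB42316.fastq", min_len = 1000)
```

For the Sequel data you would first need to convert the subreads BAM to FASTQ (e.g. with samtools) before running something like this over it.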

This is pretty bad news for PacBio.  If you add in the low cost of entry for MinION, and the £300k cost of the Sequel, the fact that MinION is performing as well as, if not better than, Sequel is incredible.  Both machines have a long way to go – PacBio will point to their roadmap, with longer reads scheduled and improvements in chemistry and flowcells.  In response, ONT will point to the incredible development path of MinION, increased sequencing speeds and bigger flowcells.  And then there is PromethION.

So is the war already over?   Not quite yet.  But PacBio are fighting for their lives.

People are wrong about sequencing costs on the internet again

People are wrong about sequencing costs on the internet again and it hurts my face, so I had to write a blog post.

Phil Ashton, whom I like very much, posted this blog:

But the words are all wrong 😉  I’ll keep this short:

  • COST is what it COSTS to do something.  It includes all COSTS.  The clue is in the name.  COST.  It’s right there.
  • PRICE is what a consumer pays for something.

These are not the same thing.

As a service provider, if the PRICE you charge to users is lower than your COST, then you are either SUBSIDISED or LOSING MONEY, and are probably NOT SUSTAINABLE.

COST, amongst other things, includes:

  • Reagents
  • Staff time
  • Capital cost or replacement cost (sequencer and compute)
  • Service and maintenance fees
  • Overheads
  • Rent

Someone is paying these, even if it’s not the consumer.  So please – when discussing sequencing – distinguish between PRICE and COST.
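To make the distinction concrete, here's a toy costing in R.  Every number below is invented purely for illustration (your reagent, staff, capital and overhead figures will differ):

```r
# Hypothetical full COST of a single sequencing run (all figures made up)
reagents  <- 1000                      # consumables per run
staff     <- 8 * 50                    # 8 hours of staff time at 50/hour
capital   <- 300000 / 1000             # sequencer amortised over 1000 runs
service   <- 30000 / 250               # annual service contract over 250 runs/year
overheads <- 0.4 * (reagents + staff)  # 40% overhead rate on direct costs

cost <- reagents + staff + capital + service + overheads
cost  # 2380 -- well above the reagent-only figure of 1000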

Thank you 

Reading data from google sheets into R

Reading data from google sheets into R is something you imagine should be really simple, but often is anything but. However, package googlesheets goes a long way to solving this problem.

Let’s crack on with an example.

First, install the software:

install.packages("googlesheets")
We then need an example sheet to work with, and I’m going to use one from Britain Elects:

So go here and add this to your own collection (yes you’ll need a google account!) using the little “Add to my drive” icon to the right of the title, just next to the star.

Right, let’s play around a bit!

# load package
library(googlesheets)

# which google sheets do you have access to?
# may ask you to authenticate in a browser!
gs_ls()

# get the Britain Elects google sheet
be <- gs_title("Britain Elects / Public Opinion")

This should show you all of the sheets you have access to, and also select the one we want.

Now we can see which worksheets exist within the sheet:

# list worksheets
gs_ws_ls(be)

We can “download” one of the sheets using gs_read():

# get Westminster voting
west <- gs_read(ss=be, ws = "Westminster voting intentions", skip=1)

# convert to data.frame
wdf <- as.data.frame(west)

And hey presto, we have the data. Simples!

Now let's plot it:

# reverse so that data are forward-sorted by time
wdf <- wdf[nrow(wdf):1,]

# treat all dates as a single point on x-axis
dates <- 1:nrow(wdf)

# smooth parameter
fs <- 0.04

# plot conservative
plot(lowess(dates, wdf$Con, f=fs), type="l", col="blue", lwd=5, ylim=c(0,65), xaxt="n", xlab="", ylab="%", main="Polls: Westminster voting intention")

# add labels
axis(side=1, at=dates[seq(1, length(dates), by=40)], labels=paste(wdf$"Fieldwork end date", wdf$Y)[seq(1, length(dates), by=40)], las=2, cex.axis=0.8)

# plot labour and libdem
lines(lowess(dates, wdf$Lab, f=fs), col="red", lwd=5)
lines(lowess(dates, wdf$LDem, f=fs), col="orange", lwd=5)

# add UKIP, and we treat absent values as -50 for plotting purposes
ukip <- wdf$UKIP
ukip[is.na(ukip)] <- -50
lines(lowess(dates, ukip, f=fs), col="purple", lwd=5)

# add legend
legend(1, 65, legend=c("Con","Lab","LibDem","UKIP"), fill=c("blue","red","orange","purple"))


From there to here

This is going to be quite a self-indulgent post – in summary, I am now a full Professor at the University of Edinburgh, and this is the story of my early life and career up until now.  Why would you want to read it?  Well, this is just some stuff I want to say; it’s a personal thing.  You might find it boring, or you may not.  There are some aspects of my career that are non-traditional, so perhaps that will be inspiring to others; to learn that there is another way and that there are multiple routes to academic success.

Anyway, let’s get it over with 😉

Early life

I was born and bred in Newcastle.  I come from a working class background (both of my grandfathers were coal miners) and, growing up, most of my extended family lived in council houses.  I went to a fairly average comprehensive (i.e. state) school, which I pretty much hated.  Something to endure rather than enjoy.  Academic ability was not celebrated – I don’t know the exact figures, but I’d guess less than 5% of my fellow pupils ended up in higher education.  Being able to fight and play football were what made you popular, and I could do neither 🙂  Choosing to work extra hours so I could learn German didn’t improve my popularity much…

My parents both worked hard – really hard – and I had everything I needed, but not everything I wanted.  Looking back, this was a good thing.  My Dad especially is a bit of a hero.  He started off as a “painter and decorator” – going to other people’s houses and doing DIY, painting, putting up shelves, wallpaper, laying carpets etc.  As if doing that, having a small family (I have an older brother) and buying a house weren’t enough, he studied part-time for a degree in sociology at Newcastle Polytechnic, famously going off to do painting and decorating jobs between lectures, and at the weekends.  After graduating, he went on to have a successful career in adult social care, and finished his working life as a lecturer at a further education college.  My mother also worked in social care, and together they supported me through University (BSc and MSc).  Just amazing, hard working parents.  I attribute my own work ethic to both of them; they set the example, and both my brother and I followed.


I did a BSc hons in Biology at the University of York.  Some bits I loved, some bits I didn’t, and I came out with a 2.1.

One practical sticks in my mind – in pairs, we had 3 hours to write a program in BASIC (on a BBC B!) to recreate a graph (yes – we had to recreate an ASCII graph) based on an ecological formula (I don’t remember which).  I finished it in ten minutes and my partner didn’t even touch the keyboard.  Writing this now reminds me of two things – firstly, hours sat with my Mum patiently punching type-in programs into our Vic 20 so I could play space invaders (written in, you guessed it, BASIC); and secondly, hacking one of my favourite ZX Spectrum games, United, so I could have infinite money to spend on players.  Happy days!

At the end of my undergraduate degree, I didn’t know what else to do, so I took the MSc in Biological Computation, also at the University of York.  This was awesome and I loved every minute (previous graduates include Clive Brown, but more about that later).  The prospectus was so wide-ranging – we covered programming (in C), mathematical modelling, statistics (lots and lots of statistics), GIS and bioinformatics, among many other courses.  It was hard work but wonderful.  It really taught me independence; that I could start the day knowing nothing, and end the day having completed major tasks, using nothing but books and the internet.

At the end of the course I had three job offers and offers to do a PhD – looking back, this was a pivotal decision, but I didn’t realise it at the time.  I chose a job in the pharmaceutical industry.

Early career

My first job was a 12 month contract at GlaxoWellcome in Stevenage.  GW had invested pretty heavily in gene expression arrays.  Now, these were very different to modern microarrays – this was pre-genome, so we didn’t know how many genes were in the human genome, nor what they did.  What we had were cDNA libraries taken from various tissues and normalised in various ways.  These would be spotted on to nylon membranes, naively – we didn’t know what was on each spot.  We then performed experiments using radioactively labelled mRNA, and any differentially expressed spots were identified.  We could then go back to the original plate, pick the clone, sequence it and find out what it was.

It’s now obvious that having a genome is really, really useful 😉

I was in a team responsible for all aspects of the bioinformatics of this process.  We manufactured the arrays, so we wrote and managed the LIMS; and then we handled the resulting data, including statistical analysis.  I worked under Clive Brown (now CTO at Oxford Nanopore).  He asked me in interview whether I could code in Perl, and I had never heard of it!  How times have changed…  He was very much into Perl in those days, and Microsoft – we wrote a lot of stuff in Visual Basic.  Yes – bioinformatics in Visual Basic.  It can be done…

I spent about four years there – working under Francesco Falciani for a time – then left to find new challenges.  I spent an uneventful year at Incyte Genomics working for Tim Cutts (now at Sanger) and David Townley (now at Illumina), before we were all unceremoniously made redundant.  With the genome being published, Incyte’s core business disappeared and they were shutting everything down.

Remember this when you discuss the lack of job security in academia – there is no job security in industry either.  The notion of job security disappeared with the baby boomers, I’m afraid.

I managed to get a job at Paradigm Therapeutics, also in Cambridge, a small spin-out focused on mouse knockouts and phenotyping.  I set up and ran a local Ensembl server (I think version 12) including the full genome annotation pipeline.  This was really my first experience of Linux sys/admin and running my own LAMP server.  It was fun and I enjoyed it, but again I stayed less than a year.  Having been made redundant from Incyte, my CV was on a few recruitment websites, and from there an agent spotted me and invited me to apply for my first senior role – Head of Bioinformatics at the Institute for Animal Health (now Pirbright).


So, my first job in academia.  It’s 2002.  I am 28.  I have never been in academia before.  I have no PhD.  I have never written a paper, nor written a grant (I naively asked if we got the opportunity to present our grant ideas in person in front of the committee).  I am a group leader/PI, so I am expected to do both – I need to win money to build up a bioinformatics group.  What the institute needed was bioinformatics support; but they wanted me to win research grants to do it.

This job was really, really tough, and I was hopelessly ill-equipped to do it.

My first grant application (to create and manage pathway genome databases for the major bacterial pathogens we worked on) was absolutely annihilated by one reviewer, who wrote a two page destruction of all of my arguments.  Other reviewers were OK, but this one killed it.  Disaster.  I figured out that no-one knew who I was; I needed a footprint, a track record, and I needed to publish.  (I’d still like to find out who that reviewer was…)

In 2005 I published my first paper, and in 2006 my second.  You’ll note I am the sole author on both.  Take from that what you will.  Meanwhile I was also supporting other PIs at the institute, carrying out data analyses and building collaborations, and these also led to papers; and so slowly but surely I began building my academic life.  I got involved in European networks such as EADGENE, which eventually funded a post-doc in my lab.  I won a BBSRC grant from the BEP-II initiative to investigate co-expression networks across animal gene expression studies; and I successfully applied for and won a PhD studentship via an internal competition.  With papers and money coming in, the institute agreed to provide me with a post-doc from their core support and, hey presto, I was up and running.  I had a research group publishing papers, and providing support to the rest of the institute.  It only took me five or six years of hard work; good luck; and some excellent support from one of my managers, Fiona Tomley.  During my time at IAH I had 7 managers in 8 years; Fiona was the only good one, and she provided excellent support and advice I’ll never forget.

So what are the lessons here?  I think this was an awful job and, had I known better, I wouldn’t have taken it.  At the beginning there was little or no support, and I was expected to do things I had no chance of achieving.  However, I loved it, every minute of it.  I made some great friends, and we did some excellent work.  I forged collaborations that still exist today.  I worked really, really hard and it was quite stressful – at one point my dentist advised gum shields as I was grinding my teeth – but ultimately, through hard work and good luck, I did it.

It’s worth pointing out here that I believe this was only possible in bioinformatics.  My skills were in such short supply that a place like IAH had no choice but to take on an inexperienced kid with no PhD.  This couldn’t have happened in any other discipline, in my opinion.  It’s sad, because ultimately I showed that if you invest in youth they can and will make a success of it.  Being older, or having a PhD, is no guarantee of success; and being young and inexperienced is no guarantee of failure.


And so, in 2010, to Roslin.  There was quite an exodus from IAH to Roslin as the former shrank and the latter grew.  I was lucky enough to be part of that exodus.  My role at Roslin was to build a research group and to help manage their sequencing facility, ARK-Genomics.  I certainly don’t claim all of the credit, but when I arrived ARK-Genomics had a single Illumina GAIIx, and when we merged to form Edinburgh Genomics, we had three HiSeq 2000/2500s.  We also had significantly better computing infrastructure, and I’m proud that we supported well over 200 academic publications.  My research is also going really well – we have, or have had, grants from TSB, BBSRC and industry, and we’re publishing plenty of papers too.  The focus is on improved methods for understanding big sequencing datasets and how they can contribute to improved farm animal health and production.  Scientifically, we are tackling questions in the gut microbiome and host-pathogen interactions.

Roslin is a great place to work and has been very supportive; the group enjoys a small amount of core support through which we help deliver Roslin’s core strategic programmes; and it is an incredibly rich, dynamic environment with over 70 research groups and over 100 PhD students.  You should come join us!

In 2015, I was awarded a PhD by research publication – “Bioinformatic analysis of genome-scale data reveals insights into host-pathogen interactions in farm animals” (I don’t think it’s online yet) – and in 2016 I was promoted to a Personal Chair in bioinformatics and computational biology.  Happy days!

A note on privilege

Clearly, being a white male born in the UK has provided me with advantages that, in today’s society, are simply not available to others.  I want readers of this blog to know I am aware of this.  I wish we lived in a more equal society, and if you have suggestions about ways in which I could help make that happen, please do get in contact.

So what next?

More of the same – I love what I do, and am lucky to do it!  More specifically, I want to be a champion of bioinformatics as a science and as an area of research; I want to champion young and early career researchers; I want to continue to train and mentor scientists entering or surviving in the crazy world of academia; and I want to keep pushing for open science.  Mostly, I want to keep being sarcastic on Twitter and pretending to be grumpy.

Cheers 🙂


Tips for PhD students and early-career researchers

As we enter October, here at Roslin it is the start of the academic year and many new PhD students begin.  We have over 100 PhD students here at any one time; it’s a very exciting and dynamic environment.  However, for some (many?) a PhD is one of the most stressful things that will ever happen to them (ed: interesting choice of language).  So how can we get round that?  Below are my tips for new PhD students.  Note my experience is in bio, so if you’re in another field, be aware of that.  Enjoy.

1. Lead

PhD projects, and PhD supervisors, come in all shapes and sizes, and work in many different ways.  For some, there will be a very detailed plan for the whole 3/4 years, with well defined objectives/chapters etc; others will be little more than a collection of ideas that may or may not work; and many will be between these two extremes.  Whichever project/supervisor you have, you the student are responsible for making it all happen.  This will be difficult for many; some people are not “natural born” leaders, and even those who are may not have had much chance to practise.  However, we have to recognize that a PhD is not a taught course; it is a project whereby a student learns how to carry out their own research, to investigate their own ideas, to plan and execute research.  That doesn’t happen if someone tells you what to do at every stage.  So, lead.  Take the lead.  This is your project – if you have ideas that go beyond what’s written in the original project plan, then you now have the opportunity to explore them.  Of course take advice; speak to your supervisor; speak to other experts; figure out whether your ideas are good or not; do things by the book and be healthy and safe.  But if your ideas are good and they are worth exploring, get on and do it.  If there is a predefined plan, execute it.  Don’t wait; don’t ask; don’t sit nervously waiting for your supervisor to ask if there’s anything you want to explore – get on and do it.  Lead.

2. Read

Read the literature.  In a fast paced field such as genomics, papers will be out of date within ~2 years.  This means that to be on top of your game, you will have to read a lot of papers.  This is something you need to just bite the bullet and do.  Hopefully, if you’re in a field you love, this won’t be too arduous.  Combine with Twitter (see below) and news feeds so you can figure out which papers to prioritize.  As I have said before, you want to be the student sending a mail to your supervisor saying “hey, have you seen this cool new paper?” rather than the student receiving those mails.  Take a coffee, a bunch of papers, go somewhere quiet and read them.

3. Write

Write.  For the love of all that is holy, write.  Learn to write quickly and well.  Science is not only pipettes, tubes, plates, labs and computers; it is about writing.  As a scientist, it’s what I spend my time doing more than any other activity.  Grants, papers, reviews, reports, plans, emails etc.  Being asked to put together a coherent (referenced!) 3-page document in 24 hours is not unheard of; being asked to do the same for a “one pager” in a few hours is even more common.  Writing is so important.  I can’t emphasize this enough.  If you hate writing, then perhaps science isn’t for you; honestly, genuinely, I mean that.  Think about it.  Being a scientist means being a writer.  If all the results are in, papers shouldn’t take months to write; posters should take no more than a few hours.

4. Engage

I am not the first to say this and I won’t be the last, so here it is again – science is a social activity. Engage with people. Talk to people. Go to meetings, conferences, workshops and be active in them. Talk to people about yourself and about your project, ask them what their interests are. Much of success is about luck, about being in the right place at the right time. Go out and talk to people about what you do. Don’t be shy, don’t think they aren’t interested. You might also consider blogging and social media. The more you are out there, the more people know about you and what you’re doing, the higher chance they might want to work with you.

5. Record

Keep a record of everything you do. In my group, we use a wiki; others use Github; obviously lab-based students have a lab-book (but this isn’t always sufficient!). It doesn’t matter what you use, the basic requirement is to keep a record of what you’ve done to such a standard that the work can easily be understood and repeated by someone else. This will help you in future, and it may very well serve as a record for protecting intellectual property.

6. Tweet

It’s just possible that if I had to name the single thing that has had the biggest impact on my career, that thing might be joining Twitter. I can say with great confidence that people on every continent know who I am and what I do. I’m not sure that would have been true going on just my scientific record alone. Twitter has enabled me to meet, and engage with, 1000s of wonderful people, scientific leaders in their field. Twitter is an online community of (mostly liberal) forward-thinking scientists. If you’re not on Twitter, you don’t get it; I was once a person who didn’t get Twitter, I actually joined just to drum up applications for a job I had advertised. However, it has been a transformational experience. Now – it’s hard. When you first join, you have no followers, and no-one is listening. You have to work hard to get followers, to get retweets, and it’ll take years to get 1000s of followers. But it’s worth it, I promise you!

7. Learn

Find out what skills are in demand and try and focus your research on those. Bioinformatics is a good example – learn Unix, learn to code and learn some stats. If you have those, you will always be employable and in demand. Try and look for trends – at the moment CRISPR is very hot, as is single-cell work – if you’re in the lab, can you integrate these into your project and therefore practice these techniques? Seriously, anyone can do DNA preps and PCR; find out the skills and techniques that are in demand and learn them, if you can.

8. Plan

I want my PhD students to have a thesis plan by 3 months. You don’t have to stick to it, but it’s good to have an idea. What are the results chapters? What are the likely papers? If you are a post-doc, then if you have 3 years, what are you going to do with them? If you’re a PI, again, plan – what questions? Which experiments? Papers? Where is this going? What will be your first grant and how do you get to the stage where you are ready to be funded? Plan.

9. Speak

At meetings, conferences, workshops. Submit abstracts and talk about your research. If you are not confident at public speaking, then practice until you are. Don’t submit poster abstracts, submit speaking abstracts. People remember amazing talks they have seen more than they remember amazing posters. It’s quite common, I find, for young scientists to default to posters as they don’t feel ready or willing to speak. You must get over this. I know it’s hard. I used to be nervous as hell before speaking at conferences. However, it has to be done and it has to be done well. Practice is key. Get out there, overcome your fears, and do it.

10. Realise

You have an amazing job and an amazing position. You get to push back the boundaries of human knowledge. That’s your job. You come in to work and by the end of the day we know more about the world we live in than we did before. It is an amazing, incredible, privileged position to be in. Yes it’s hard; yes there are barriers and it can be stressful. However, if you can get by those, and you have a good supportive team around you, then you have the most amazing job/position in the world. Enjoy it!

And I’ll leave it there. As always, I look forward to your comments!

Creating an image of a matrix in R using image()

You have a matrix in R, and you want to visualise it – say, for example, with each cell coloured according to the value in the cell.  Not a heatmap, per se, as that requires clustering; just a simple, visual image.

Well, the answer is image() – however, there is a slightly bizarre design choice buried in the help:

Notice that image interprets the z matrix as a table of f(x[i], y[j]) values, so that the x axis corresponds to row number and the y axis to column number, with column 1 at the bottom, i.e. a 90 degree counter-clockwise rotation of the conventional printed layout of a matrix.

Let’s look at that:

mat <- matrix(c(1,1,1,0,0,0,1,0,1), nrow=3, byrow=TRUE)
mat
     [,1] [,2] [,3]
[1,]    1    1    1
[2,]    0    0    0
[3,]    1    0    1

image(mat)

This produces:


We can clearly see the 1,0,1 row is now the right-hand column i.e. the matrix has in fact been rotated 90 degrees anti-clockwise.

To counteract this we need to rotate the input matrix clockwise before passing to image:

rotate <- function(x) t(apply(x, 2, rev))
image(rotate(mat))


This now produces what we might expect:


Easy 🙂
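If you'd rather check the rotation numerically than by eye, you can print the rotated matrix directly (this just re-uses mat and rotate from above):

```r
mat <- matrix(c(1,1,1,0,0,0,1,0,1), nrow=3, byrow=TRUE)
rotate <- function(x) t(apply(x, 2, rev))

# after the clockwise rotation, the old bottom row (1,0,1) becomes the
# new left column; image() then rotates anti-clockwise, restoring the
# conventional printed layout
rotate(mat)
#      [,1] [,2] [,3]
# [1,]    1    0    1
# [2,]    0    0    1
# [3,]    1    0    1
```

Four clockwise rotations get you back to the original matrix, which is a handy sanity check for rotate().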

How to produce an animated gif of rise in global temperature in R

I came across a really cool tweet a few days ago containing an animated GIF demonstrating rise in global temperatures since 1880:

Link to the blog is here.

I wanted to see if I could recreate this GIF in R, and below are some instructions on how to get close.

The first step is to try and see if I can replicate the original, final image.  The code below comes close enough for this demo, but strangely I am approx. 1 degree C out.  I’ve left a comment on the original blog to ask why – it’s not clear how the exact figures are calculated.

Anyway, here is how to get close:

# download data
d <- read.table("http://data.giss.nasa.gov/gistemp/tabledata_v3/GLB.Ts.txt", 
                skip=7, header=TRUE, nrows=143, colClasses = "character")

# clean data
d <- d[-grep("Year", d$Year),]
rownames(d) <- d[,1]
d <- d[,-1]
d <- d[,-13:-19]
d[d=="****"] <- NA

# convert to numeric and celsius
for (i in 1:12) {
	d[,i] <- as.numeric(d[,i]) / 100
}

# download seasonal adjustments
s <- read.table("http://data.giss.nasa.gov/gistemp/faq/merra2_seas_anom.txt", 
                skip=3, header=TRUE, colClasses = "character")
sa <- as.numeric(s$seas_anom)

# create colours from blue through to red
colours <- colorRampPalette(c("grey","blue","red"))(nrow(d))

# create initial plot
mmin <- -3
mmax <- 3
plot(1:12, d[1,]+sa, type="l", ylim=c(mmin,mmax), 
     yaxt="n", xaxt="n", bty="n")

# add axes
axis(side=1, at=1:12, 
     lwd=0, lwd.ticks=1, cex.axis=0.8)

axis(side=2, at=-3:3, labels=-3:3, 
     lwd=0, lwd.ticks=1, las=2, cex.axis=0.8)

# add title
title(main=expression(paste("Temperature ","Anomaly ","(",degree,"C)")), 
      adj=0, cex.main=0.8)
title(main="(Difference from 1980-2015 annual mean)", adj=0, cex.main=0.6, 
      line=0, font.main=1)

# add horizontal lines
for (j in -3:3) {
	lines(1:12, rep(j, 12), lty=3)
}

# add yearly temperature lines
for (j in 1:nrow(d)) {
	lines(1:12, as.numeric(d[j,])+sa, col=colours[j])
}

This produces the following image:


This is close enough!

So how do we create the animation?  The old skool way of doing this is to create every image in R and create an animated GIF with ImageMagick.  So we'll do that 🙂

The slightly adjusted code to create every image is here (creates 137 PNGs in working directory!):

# Assumes you have run the code above to create d and sa

doplot <- function(i) {

	x   <- d[i,] + sa
	col <- colours[i]

	mmin <- -3
	mmax <- 3
	plot(1:12, x, type="l", ylim=c(mmin,mmax), 
             yaxt="n", xaxt="n", bty="n", col=col)
	axis(side=1, at=1:12, labels=c("Jan","Feb","Mar","Apr","May","Jun",
             "Jul","Aug","Sep","Oct","Nov","Dec"),
             lwd=0, lwd.ticks=1, cex.axis=1.5)

	axis(side=2, at=-3:3, labels=-3:3, lwd=0, 
             lwd.ticks=1, las=2, cex.axis=1.5)

	title(main=expression(paste("Temperature ","Anomaly ","(",degree,"C)")), 
              adj=0, cex.main=1.5)
	title(main="(Difference from 1980-2015 annual mean)", 
              adj=0, cex.main=1, line=-0.5, font.main=1)

	# add horizontal lines
	for (j in -3:3) {
		lines(1:12, rep(j, 12), lty=3)
	}

	# add other years
	for (k in 1:i) {
		lines(1:12, d[k,]+sa, col=colours[k])
	}

	# add year label
	text(7, 0.8, rownames(d)[i], col=colours[i], font=2, cex=2)
}

# create plots/PNGs
for (j in 1:nrow(d)) {
	name <- paste("plot", sprintf("%03.f", j), ".png", sep="")
	png(name, width=800, height=600)
	doplot(j)
	dev.off()
}

Then in ImageMagick, we use convert to create the animated GIF:

convert -delay 3 -loop 1 *.png globaltemp.gif
convert globaltemp.gif \( +clone -set delay 500 \) +swap +delete globaltempwpause.gif

Final result here:


Done 🙂

Plotting cool graphs in R

I have to admit to being a bit of a snob when it comes to graphs and charts in scientific papers and presentations.  It's not like I think I am particularly good at it – I'm OK – it's just that I know what's bad.  I've seen folk screenshot multiple Excel graphs so they can paste them into a PowerPoint table to create multi-panel plots… and it kind of makes me want to scream.  I'm sorry, I really am, but when I see Excel plots in papers I judge the authors, and I don't mean in a good way.  I can't help it.  Plotting good graphs is an art and, sticking with the metaphor, Excel is paint-by-numbers whereas R is a blank canvas, waiting for something beautiful to be created; Excel is limiting, whereas R sets you free.

Readers of this blog will know that I like to take fabulous plots I find and recreate them.  Well, let's do that again 🙂

I saw this tweet by Trevor Branch and found it intriguing:

It shows two plots of the same data.  The Excel plot:


And the multi plot:


You're clearly supposed to think the latter is better, and I do; however, perhaps disappointingly, the top graph would be easy to plot in Excel but I'm guessing most people would find it impossible to create the bottom one (in Excel or otherwise).

Well, I’m going to show you how to create both, in R. All code now in Github!

The Excel Graph

Now, I’ve shown you how to create Excel-like graphs in R before, and we’ll use some of the same tricks again.

First we set up the data:

# set up the data
# NOTE: only the first column survived in the archived post; the other
# three columns below are illustrative placeholders, not the real data
df <- data.frame(Circulatory=c(32,26,19,16,14,13,11,11),
                 Cancer=c(20,21,22,23,24,24,23,22),
                 Respiratory=c(13,12,11,10,9,8,8,7),
                 Other=c(35,41,48,51,53,55,58,60))

rownames(df) <- seq(1975,2010,by=5)


Now let's plot the graph

# set up colours and points
cols <- c("darkolivegreen3","darkcyan","mediumpurple2","coral3")
pch <- c(17,18,8,15)

# we have one point on X axis for each row of df (nrow(df))
# we then add 2.5 to make room for the legend
xmax <- nrow(df) + 2.5

# make the borders smaller
# (the original margin values were lost; these are a guess)
par(mar=c(3,4,1,1))

# plot an empty graph
plot(1:nrow(df), 1:nrow(df), pch="", 
		xlab=NA, ylab=NA, xaxt="n", yaxt="n", 
		ylim=c(0,35), bty="n", xlim=c(1,xmax))

# add horizontal lines
for (i in seq(0,35,by=5)) {
	lines(1:nrow(df), rep(i,nrow(df)), col="grey")
}

# add points and lines 
# for each dataset
for (i in 1:ncol(df)) {

	points(1:nrow(df), df[,i], pch=pch[i], 
		col=cols[i], cex=1.5)

	lines(1:nrow(df), df[,i], col=cols[i], 
		lwd=3)
}

# add bottom axes
axis(side=1, at=1:nrow(df), tick=FALSE, 
		labels=rownames(df))

axis(side=1, at=seq(-0.5,8.5,by=1), 
		tick=TRUE, labels=NA)

# add left axis
axis(side=2, at=seq(0,35,by=5), tick=TRUE, 
		las=TRUE, labels=paste(seq(0,35,by=5),"%",sep=""))

# add legend
legend(8.5,25,legend=colnames(df), pch=pch, 
		col=cols, cex=1.5, bty="n",  lwd=3, lty=1)

And here is the result:


Not bad eh?  Actually, yes, very bad; but also very Excel!

The multi-plot

Plotting multi-panel figures in R is sooooooo easy!  Here we go for the alternate multi-plot.  We use the same data.

# split into 2 rows and 2 cols
split.screen(c(2,2))

# keep track of which screen we are
# plotting to
scr <- 1

# iterate over columns
for (i in 1:ncol(df)) {

	# select screen
	screen(scr)

	# reduce margins
	# (the original margin values were lost; these are a guess)
	par(mar=c(2,2,2,2))

	# empty plot
	plot(1:nrow(df), 1:nrow(df), pch="", xlab=NA, 
		ylab=NA, xaxt="n", yaxt="n", ylim=c(0,35), 
		bty="n")
	# plot all data in grey
	for (j in 1:ncol(df)) {
		lines(1:nrow(df), df[,j], 
		col="grey", lwd=3)
	}


	# plot selected in blue
	lines(1:nrow(df), df[,i], col="blue4", lwd=4)

	# add blobs
	points(c(1,nrow(df)), c(df[1,i], df[nrow(df),i]), 
		pch=16, cex=2, col="blue4")

	# add numbers
	mtext(df[1,i], side=2, at=df[1,i], las=2)
	mtext(df[nrow(df),i], side=4, at=df[nrow(df),i], 
		las=2)

	# add title
	title(main=colnames(df)[i])
	# add axes if we are one of
	# the bottom two plots
	if (scr >= 3) {
		axis(side=1, at=1:nrow(df), tick=FALSE, 
			labels=rownames(df))
	}

	# next screen
	scr <- scr + 1
}

# close multi-panel image
close.screen(all=TRUE)
And here is the result:



And there we have it.

So which do you prefer?

The unbearable madness of microbiome

This is my attempt to collate the literature on how easy it is to introduce bias into microbiome studies.  I hope to crowd-source papers and add them below under each category.  PLEASE GET INVOLVED.  If this works well we can turn it into a comprehensive review and publish 🙂 Add new papers in the comments or Tweet me 🙂

Special mention to these blogs:

  1. Microbiomdigest’s page on Sample Storage
  2. Microbe.net’s Best practices for sample processing and storage prior to microbiome DNA analysis: freeze? buffer? process?
  3. The-Scientist.com’s Spoiler Alert


UPDATE 7th August 2016


For posterity the original blog post is below, but I am now trying to manage this through a ZOTERO GROUP LIBRARY.  Please continue to contribute – join the group, get involved!  I think you can join the group here.  I am trying to avoid just listing software papers, BTW; I would prefer to focus on papers that specifically demonstrate sources of bias.

I won’t be updating the text below.


  • “We propose that best practice should include the use of a proofreading polymerase and a highly processive polymerase, that sequencing primers that overlap with amplification primers should not be used, that template concentration should be optimized, and that the number of PCR cycles should be minimized.” (doi:10.1038/nbt.3601)

Sample collection and sample storage

  • “Samples frozen with and without glycerol as cryoprotectant indicated a major loss of Bacteroidetes in unprotected samples” (24161897)
  • “No significant differences occurred in the culture-based analysis between the fresh, snap or -80°C frozen samples.” (25748176)
  • “In seven of nine cases, the Firmicutes to Bacteroidetes 16S rRNA gene ratio was significantly higher in fecal samples that had been frozen compared to identical samples that had not….. The results demonstrate that storage conditions of fecal samples may adversely affect the determined Firmicutes to Bacteroidetes ratio, which is a frequently used biomarker in gut microbiology.” (22325006)
  • “Our results indicate that environmental factors and biases in molecular techniques likely confer greater amounts of variation to microbial communities than do differences in short-term storage conditions, including storage for up to 2 weeks at room temperature” (10.1111/j.1574-6968.2010.01965.x)
  • “The trichloroacetic acid preserved sample showed significant loss of protein band integrity on the SDS-PAGE gel. The RNAlater-preserved sample showed the highest number of protein identifications (103% relative to the control; 520 ± 31 identifications in RNAlater versus 504 ± 4 in the control), equivalent to the frozen control. Relative abundances of individual proteins in the RNAlater treatment were quite similar to that of the frozen control (average ratio of 1.01 ± 0.27 for the 50 most abundant proteins), while the SDS-extraction buffer, ethanol, and B-PER all showed significant decreases in both number of identifications and relative abundances of individual proteins.” (10.3389/fmicb.2011.00215)
  • “Optimized Cryopreservation of Mixed Microbial Communities for Conserved Functionality and Diversity” (10.1371/journal.pone.0099517)
  • “A previously known bias in FTA(®) cards that results in lower recovery of pure cultures of Gram-positive bacteria was also detected in mixed community samples. There appears to be a uniform bias across all five preservation methods against microorganisms with high G + C DNA. Overall, the liquid-based preservatives (DNAgard(™), RNAlater(®), and DESS) outperformed the card-based methods.” (22974342)
  • “Microbial composition of frozen and ethanol samples were most similar to fresh samples. FTA card and RNAlater-preserved samples had the least similar microbial composition and abundance compared to fresh samples.” (10.1016/j.mimet.2015.03.021)
  • “Bray-Curtis dissimilarity and (un)weighted UniFrac showed a significant higher distance between fecal swabs and -80°C versus the other methods and -80°C samples (p<0.009). The relative abundance of Ruminococcus and Enterobacteriaceae did not differ between the storage methods versus -80°C, but was higher in fecal swabs (p<0.05)” (10.1371/journal.pone.0126685)
  • “We experimentally determined that the bacterial taxa varied with room temperature storage beyond 15 minutes and beyond three days storage in a domestic frost-free freezer. While freeze thawing only had an effect on bacterial taxa abundance beyond four cycles, the use of samples stored in RNAlater should be avoided as overall DNA yields were reduced as well as the detection of bacterial taxa.” (10.1371/journal.pone.0134802)
  • “A key assumption in many studies is the stability of samples stored long term at −80 °C prior to extraction. After 2 years, we see relatively few changes: increased abundances of lactobacilli and bacilli and a reduction in the overall OTU count. Where samples cannot be frozen, we find that storing samples at room temperature does lead to significant changes in the microbial community after 2 days.” (10.1186/s40168-016-0186-x)

DNA extraction

  • “Caution should be paid when the intention is to pool and analyse samples or data from studies which have used different DNA extraction methods.” (27456340)
  • “Samples clustered according to the type of extracted DNA due to considerable differences between iDNA and eDNA bacterial profiles, while storage temperature and cryoprotectants additives had little effect on sample clustering” (24125910)
  • “Bifidobacteria were only well represented among amplified 16S rRNA gene sequences when mechanical disruption (bead-beating) procedures for DNA extraction were employed together with optimised “universal” PCR primers” (26120470)
  • Qiagen DNA stool kit is biased (misses bifidobacteria) (26120470)
  • “Bead-beating has a major impact on the determined composition of the human stool microbiota. Different bead-beating instruments from the same producer gave a 3-fold difference in the Bacteroidetes to Firmicutes ratio” (10.1016/j.mimet.2016.08.005)
  • “We observed that using different DNA extraction kits can produce dramatically different results but bias is introduced regardless of the choice of kit.” (10.1186/s12866-015-0351-6)

Sequencing strategy

  • “Bifidobacteria were only well represented among amplified 16S rRNA gene sequences … with optimised “universal” PCR primers. These primers incorporate degenerate bases at positions where mismatches to bifidobacteria and other bacterial taxa occur” (26120470)
  • Anything other than 2x250bp sequencing of V4 region (approx 250bp in length) inflates number of OTUs (23793624)
  • “This study demonstrates the potential for differential bias in bacterial community profiles resulting from the choice of sequencing platform alone.” (10.1128/AEM.02206-14)
  • “The effects of DNA extraction and PCR amplification for our protocols were much larger than those due to sequencing and classification” (10.1186/s12866-015-0351-6)
  • “Nested PCR introduced bias in estimated diversity and community structure. The bias was more significant for communities with relatively higher diversity and when more cycles were applied in the first round of PCR” (10.1371/journal.pone.0132253)
  • “pyrosequencing errors can lead to artificial inflation of diversity estimates” (19725865)
  • “Our findings suggest that when alternative sequencing approaches are used for microbial molecular profiling they can perform with good reproducibility, but care should be taken when comparing small differences between distinct methods” (25421243)

Data analysis strategy


Bioinformatics / database issues


Contamination in kits

  • “Reagent and laboratory contamination can critically impact sequence-based microbiome analyses” (25387460)
  • “Due to contamination of DNA extraction reagents, false-positive results can occur when applying broad-range real-time PCR based on bacterial 16S rDNA” (15722157)
  • “Relatively high initial densities of planktonic bacteria (10(2) to 10(3) bacteria per ml) were seen within [operating ultrapure water treatment systems intended for laboratory use]” (8517737)
  • “Sensitive, real-time PCR detects low-levels of contamination by Legionella pneumophila in commercial reagents” (16632318)
  • “Taq polymerase contains bacterial DNA of unknown origin” (2087233)


Commercial stuff








© 2019 Opiniomics
