bioinformatics, genomes, biology etc. "I don't mean to sound angry and cynical, but I am, so that's how it comes across"

Month: August 2019

The three technologies bioinformaticians need to be using right now

I’m old. Sometimes I feel really old. I’m old enough to remember when you had to install things manually, and that’s if you were allowed to; worse was having to depend on an IT department to install software for you. I am old enough to remember when changing servers meant reinstalling everything from scratch. I am old enough to have been frightened by dependency hell, where you do not know all the dependencies of the software you need to install, or whether they will conflict with other software you also need to install. It’s not too long ago that my “pipelines” were simple Perl or bash scripts, with for() and while() loops in them, whose purpose was to run things in the correct order and maybe do some basic checking to ensure things ran.

My bone-chilling fear is that many of you, maybe even most of you, still do this. My fear is that if you are a pet bioinformatician, you don’t have anyone around to tell you that it needn’t be like this anymore. How do I sleep? Sometimes very badly.

Look, this isn’t a tutorial, so here it is. These are the three technologies that, as a bioinformatician, you need to be engaging with right now. Not next week, or next month. Now. They are not new technologies. Now is not the time to put this off; it’s time to realise that if you don’t use them, you are already out of date, and the longer you leave it, the worse things will become.

Here they are:

1. Software environments and/or containers

Software environments can be thought of as a folder where you install all the software you need to run a particular analysis. “But I already have one of those!” I hear you cry. Ah, but there’s more. Software environments don’t require root access, they handle all software dependencies, they are exportable, shareable and re-useable, they can be used with multiple different flavours of Linux/Unix, and in the vast majority of cases they make installing and maintaining software simple and pain-free. You can have multiple environments (e.g. one for each pipeline you run), and you can also version them. The one I am most familiar with is conda (https://docs.conda.io/en/latest/), which has an associated Bioconda project (https://bioconda.github.io/) with many bioinformatics tools set up and ready to go.
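To make “exportable, shareable and versioned” a bit more concrete, here is a minimal sketch in Python that describes an environment as data and renders it as the text of a conda environment.yml file. The environment name, the tool list and the version pins are made up for illustration, and render_environment is a hypothetical helper, not part of conda; conda-forge and bioconda are the real channel names.

```python
def render_environment(name, channels, dependencies):
    """Return the text of a conda environment.yml file.

    Hypothetical helper for illustration only; it just formats strings.
    """
    lines = [f"name: {name}", "channels:"]
    lines += [f"  - {channel}" for channel in channels]
    lines.append("dependencies:")
    lines += [f"  - {dep}" for dep in dependencies]
    return "\n".join(lines) + "\n"


# Illustrative values: one environment per pipeline, with pinned versions.
env_text = render_environment(
    name="mapping-pipeline",
    channels=["conda-forge", "bioconda"],
    dependencies=["bwa=0.7.17", "samtools=1.9"],
)
print(env_text)
```

Save the printed text as environment.yml and anyone can rebuild the same environment with `conda env create -f environment.yml`, then switch to it with `conda activate mapping-pipeline`. Put the file under version control next to your pipeline and the environment is versioned too.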

Containers are similar in that they are a place where you install all of the software you need to run a pipeline. Containers can be thought of as a form of lightweight VM, and the idea is that you describe once how to install software into a container, and then you use that container multiple times, i.e. each time you run your pipeline. No changes are made to your local machine; the software simply downloads the container and runs your pipeline inside of it. Again you can set up and use multiple containers, and version them. The most common software tools for containers are Docker and Singularity, and there is a BioContainers project (https://biocontainers.pro/#/).

Installing software is now so simple in the vast majority of cases, using either or both of the above. Please use them.

2. Workflow management systems

It’s a law of the universe that every bioinformatician will say they have a “pipeline”, but often I find this is actually a bash, Perl or Python script. They were great for their time, but it is time to move on.

Workflow management systems describe true pipelines. Each has its own, often simple, language; the language describes jobs, and each job will have input and output. The input of one job can be the output of another job, and the jobs will be executed in the correct order to make sure all of the inputs are present before running. They handle wild cards and scale to tens of thousands of input files. They run on your local computer, they integrate with common HPC submission systems, they can submit jobs to the cloud, and they often also integrate with software environments and containers. Yes, so each job in your workflow can be run within its own environment or container. They track which jobs succeeded and which failed, they can re-submit those that failed, they can restart where the pipeline finished last time, they clean up when things fail, and you can add more input files at a later date and they will only re-run the jobs they need to. There are hundreds of these, but those I am most familiar with are Snakemake (https://snakemake.readthedocs.io/en/stable/) and Nextflow (https://www.nextflow.io/).
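If that sounds abstract, here is a toy sketch in Python of the core idea: jobs declare inputs and outputs, and the system works out which jobs to run, and in what order, to produce the file you asked for. This is not how Snakemake or Nextflow are implemented; the Job class, the plan function and the file names are all invented for illustration, and real systems also look at timestamps, failures, cluster submission and much more.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    name: str
    inputs: tuple
    outputs: tuple


def plan(jobs, targets, existing):
    """Return job names in an order where every input exists before it is needed.

    jobs:     iterable of Job
    targets:  files we want at the end
    existing: set of files already present (raw data, earlier results)
    """
    producer = {out: job for job in jobs for out in job.outputs}
    order, done = [], set()

    def need(path, chain=()):
        if path in existing:
            return  # already there, nothing to run
        if path not in producer:
            raise ValueError(f"no job produces {path!r} and it does not exist")
        job = producer[path]
        if job.name in done:
            return  # already scheduled
        if job.name in chain:
            raise ValueError(f"circular dependency involving {job.name!r}")
        for inp in job.inputs:
            need(inp, chain + (job.name,))
        done.add(job.name)
        order.append(job.name)

    for target in targets:
        need(target)
    return order


# Hypothetical three-step pipeline.
jobs = [
    Job("map", ("sample.fastq", "ref.fa"), ("sample.bam",)),
    Job("sort", ("sample.bam",), ("sample.sorted.bam",)),
    Job("call", ("sample.sorted.bam", "ref.fa"), ("sample.vcf",)),
]

fresh = plan(jobs, ["sample.vcf"], {"sample.fastq", "ref.fa"})
resumed = plan(jobs, ["sample.vcf"], {"sample.fastq", "ref.fa", "sample.bam"})
print(fresh)    # ['map', 'sort', 'call']
print(resumed)  # ['sort', 'call']
```

The second call is the “restart where the pipeline finished last time” behaviour in miniature: the mapping output already exists, so only the downstream jobs are scheduled.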

Honestly, adopting (1) and (2) will change your life. Do it.

3. Cloud

The third is optional, and also not new, but it is now so much easier to access. No matter how big your University cluster, nothing will match the size of the compute clouds of Amazon, Microsoft and Google. These things are vast, they run Netflix and Dropbox, and once you start using the cloud, you realise there are essentially no computational limits to your research.

What’s amazing about cloud is that the vast majority of it is sat around doing nothing most of the time, and companies sell off this spare capacity (“spot” instances on Amazon, “preemptible” instances on Google) at a massive discount. The downside is that if there is a surge of Netflix then they may kill your jobs, but the upside is you only pay 10-20% of the advertised cost. And if you use a workflow management system, it can pick up where you left off anyway, so….

Snakemake can submit jobs to Google Cloud (via Kubernetes) and Nextflow can submit to Amazon. The Broad adopted Google Cloud for GATK workflows and got the cost down to $5 per sample. Remember that the next time you pay your University HPC bill. There are now essentially no computational limits to your research – when you realise that, it frees your mind and you begin to think much bigger than you did before.

And before you say “I don’t have a credit card”, I am here to tell you that both Amazon and Google will send monthly invoices. It’s all good.


There we have it. None of this is new and apologies if I sound patronising; but if you are a bioinformatician and you’re not using any of the above, please, I beg of you, take a few days out to learn about them all, and start using them. It will change your life, for the better. If you are a student or a post-doc looking for a job, these are the technologies you will need to talk about in your interview. The future is here – in fact it arrived a few years ago – the time to change is now.

Massive scale assembly of genomes from metagenomes in ruminants

Hopefully some of you will have seen our recent paper published in Nature Biotech:

Compendium of 4,941 rumen metagenome-assembled genomes for rumen microbiome biology and enzyme discovery
Robert D. Stewart, Marc D. Auffret, Amanda Warr, Alan W. Walker, Rainer Roehe & Mick Watson

Nature Biotechnology volume 37, pages 953–961 (2019)

In this paper we analyse samples from 283 beef cattle, generate between 50 and 60M PE150 reads from each, perform single sample assemblies, perform co-assemblies, bin the whole lot and talk about the 4941 metagenome-assembled genomes that are at least 80% complete and at most 10% contaminated.

We also ran three MinION flowcells on one sample and generated lots of contigs >1Mb in length, and at least three whole bacterial chromosomes from novel bacterial species.

This was a real labour of love, and the entire dataset is available at DOI https://doi.org/10.7488/ds/2470 and at ENA accession PRJEB31266.

I just wanted to highlight a few things from the paper you may have missed in the details 😀

We have released all of our data

And I mean all of it. This includes the primary raw data, the single sample metagenome assemblies, the metagenome co-assemblies, the bin layer (more of what that means below), the 4941 MAGs, the annotations and predicted proteins, DIAMOND search hits against KEGG and UniProt – literally everything. If there is something you want that’s missing, let us know. Some of it may not have uploaded to ENA yet, but it should all be there very soon. If it is not in the ENA yet, it should be at the Edinburgh DataShare DOI https://doi.org/10.7488/ds/2470.

There are more MAGs than you think….

Recent massive scale re-analysis of the human microbiome data by the Segata, Finn and Kyrpides labs revealed 10,000s of new genomes from all human body sites. In theory these studies dwarf our own, but they used a cut-off of 50% complete, whereas our 4941 are all at least 80% complete.

Well, the new ENA model for metagenomic assemblies includes both a bin layer and a MAG layer. The MAGs are your final product – the fully analysed, cleaned and annotated genomes. In our case this is 4941. The bin layer is the layer of bins from which the MAGs derive. In our case we have chosen to put all bins that were at least 50% complete and no more than 10% contaminated into the bin layer. So in actual fact, whilst we do not talk about this in the paper, we are releasing 20,469 new MAGs from the rumen microbiome. These should all be available from the ENA soon.
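The two sets of thresholds are easy to mix up, so here is a small Python sketch of the filtering logic. The bin names and their completeness/contamination values are invented, the function names are hypothetical, and I am assuming the MAG layer is simply the stricter subset of the bin layer; the real process also involved cleaning and annotation, which this ignores.

```python
def in_bin_layer(completeness, contamination):
    """Bin layer: at least 50% complete and no more than 10% contaminated."""
    return completeness >= 50.0 and contamination <= 10.0


def in_mag_layer(completeness, contamination):
    """MAG layer: at least 80% complete and no more than 10% contaminated."""
    return completeness >= 80.0 and contamination <= 10.0


# Invented example bins: name -> (completeness %, contamination %).
bins = {
    "bin_001": (95.2, 1.3),   # passes both layers
    "bin_002": (62.0, 4.8),   # bin layer only
    "bin_003": (88.1, 12.5),  # too contaminated for either layer
    "bin_004": (41.7, 0.9),   # too incomplete for either layer
}

bin_layer = [name for name, stats in bins.items() if in_bin_layer(*stats)]
mag_layer = [name for name, stats in bins.items() if in_mag_layer(*stats)]
print(bin_layer)  # ['bin_001', 'bin_002']
print(mag_layer)  # ['bin_001']
```

Under this reading, every MAG is also in the bin layer, which is why the bin layer count (20,469) is so much bigger than the MAG count (4941).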

We got a (near) complete bacterial chromosome from an Illumina metagenome assembly!!

Almost unheard of, in my experience, but we assembled a near complete genome of a novel Proteobacteria species as a single contig direct from the Illumina data using IDBA-UD. We are not sure why the stars aligned and enabled us to do it, but we were certainly pleasantly surprised.

These small Proteobacteria are numerous and abundant in our rumen datasets and have also been found in other metagenomic datasets. They do not appear to have been cultured and sequenced, and with their small genomes, we have to speculate that they depend on other microbes to survive, so can only be isolated from communities. Nevertheless, we know their CAZyme profile now and one of my students is going to have a crack at culturing one of these in a placement with Alan Walker later this year.

MAGs can be very high quality

I know there are MAG-sceptics out there, but I would just like to point them to Supplementary Figure 13. The bottom left shows whole-genome alignments between an Illumina MAG (vertical axis) and a Nanopore whole-chromosome assembly. So what is the take home? Well, these are from two different samples, sequenced using two different technologies and assembled using completely different software and methods… yet they came up with the same answer. In other words, the Illumina MAG is of such high quality that it agrees, almost completely, with a single, whole-chromosome assembly from the Nanopore data. Remarkable.

This paper is a reference for IDEEL

I have blogged at great length about indels in long-read assembled genomes, and we created a simple Snakemake pipeline (IDEEL – Indels are not ideal) to demonstrate these were a problem. I was pleased when this pipeline got a shout out from Kat Holt on Twitter.

Until now, IDEEL has not had a formal publication. As we used it extensively in this paper, and mention it explicitly by name, I propose that our new paper is the formal citation for IDEEL.

The real story is the proteins

Whilst the figures of 4941 MAGs and the single chromosome assemblies will take the headlines, for me the real story here is in the proteins. When clustering all known rumen microbial proteins, using a 90% identity cut-off, we ended up with 6.24 million clusters. However, 3.67 million of them are singletons that only contain proteins from our new MAGs. This is an incredible amount of new protein diversity, and figuring out what their function is will be the next major challenge for metagenomics researchers.
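For anyone who wants to do the same kind of bookkeeping on their own clustering output, here is a rough Python sketch. The cluster contents and source labels are invented, novel_only_clusters is a hypothetical helper, and it counts any cluster made up entirely of new-MAG proteins, which may not be exactly how the paper defines its singletons. The only real numbers are the two totals quoted above.

```python
def novel_only_clusters(clusters, new_sources):
    """Return ids of clusters whose members all come from the new MAGs."""
    return [
        cluster_id
        for cluster_id, members in clusters.items()
        if members and all(source in new_sources for _, source in members)
    ]


# Invented example: cluster id -> list of (protein id, source) pairs.
clusters = {
    "c1": [("p1", "new_mag"), ("p2", "reference")],
    "c2": [("p3", "new_mag")],
    "c3": [("p4", "reference")],
    "c4": [("p5", "new_mag"), ("p6", "new_mag")],
}

novel = novel_only_clusters(clusters, {"new_mag"})
print(novel)  # ['c2', 'c4']

# Share of all clusters that are new, using the totals quoted in the text.
share = 3.67 / 6.24
print(f"{share:.1%}")  # 58.8%
```

So on the quoted figures, roughly 59% of the protein clusters exist only because of the new MAGs.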

We also compared our CAZymes to CAZy and found that CAZy poorly represents rumen CAZymes.

All microbiome work ends up with the need to culture

I increasingly believe we have to bring many of these microbes into culture if we are to take the next step in microbiome research. We predict over 2000 novel species of bacteria and archaea. We cannot possibly know what they do, or how they behave and interact, without bringing them into culture.

There is a myth that 95% of rumen bacteria are unculturable. This myth needs to die. It already killed one of our grant applications 🙁 There is nothing inherently unculturable about the majority of these bugs, and even with the small Proteobacteria I mention above, it should be possible with a bit of work and the right co-culturing conditions. So please let this myth die!! All bugs are culturable; some are just more difficult than others.

Onwards and upwards 🙂
