Hopefully some of you will have seen our recent paper published in Nature Biotech:
Compendium of 4,941 rumen metagenome-assembled genomes for rumen microbiome biology and enzyme discovery
Robert D. Stewart, Marc D. Auffret, Amanda Warr, Alan W. Walker, Rainer Roehe & Mick Watson
Nature Biotechnology volume 37, pages 953–961 (2019)
In this paper we analyse samples from 283 beef cattle, generate 50-60M PE150 reads from each, perform single-sample assemblies and co-assemblies, bin the whole lot, and talk about the 4941 metagenome-assembled genomes (MAGs) that are at least 80% complete and at most 10% contaminated.
We also ran three MinION flowcells on one sample and generated lots of contigs >1Mb in length, and at least three whole bacterial chromosomes from novel bacterial species.
I just wanted to highlight a few things from the paper you may have missed in the details 😀
We have released all of our data
And I mean all of it. This includes the primary raw data, the single sample metagenome assemblies, the metagenome co-assemblies, the bin layer (more on what that means below), the 4941 MAGs, the annotations and predicted proteins, DIAMOND search hits against KEGG and UniProt – literally everything. If there is something you want that’s missing, let us know. Some of it may not have uploaded to ENA yet, but it should all be there very soon. If it is not in the ENA yet, it should be at the Edinburgh DataShare DOI https://doi.org/10.7488/ds/2470.
There are more MAGs than you think….
Recent massive scale re-analysis of the human microbiome data by the Segata, Finn and Kyrpides labs revealed 10,000s of new genomes from all human body sites. In theory these studies dwarf our own, but they used a cut-off of 50% complete, whereas our 4941 are all 80% complete.
Well, the new ENA model for metagenomic assemblies includes both a bin layer and a MAG layer. The MAGs are your final product – the fully analysed, cleaned and annotated genomes. In our case this is 4941. The bin layer is the layer of bins from which the MAGs derive. In our case we have chosen to put all bins that were at least 50% complete and no more than 10% contaminated into the bin layer. So in actual fact, whilst we do not talk about this in the paper, we are releasing 20,469 new MAGs from the rumen microbiome. These should all be available from the ENA soon.
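To make the two sets of cut-offs concrete, here is a minimal Python sketch that applies only the completeness and contamination thresholds quoted in this post. The `GenomeBin` class, its field names and the example values are hypothetical and purely for illustration; this is not the code used in the paper, and the real MAGs were also cleaned and annotated, which the sketch does not model.

```python
from dataclasses import dataclass

# Thresholds as quoted in this post (percentages).
BIN_MIN_COMPLETENESS = 50.0
MAG_MIN_COMPLETENESS = 80.0
MAX_CONTAMINATION = 10.0


@dataclass
class GenomeBin:
    name: str             # hypothetical bin identifier
    completeness: float   # estimated completeness, in percent
    contamination: float  # estimated contamination, in percent


def in_bin_layer(b: GenomeBin) -> bool:
    """At least 50% complete and no more than 10% contaminated."""
    return (b.completeness >= BIN_MIN_COMPLETENESS
            and b.contamination <= MAX_CONTAMINATION)


def meets_mag_thresholds(b: GenomeBin) -> bool:
    """At least 80% complete and no more than 10% contaminated."""
    return (b.completeness >= MAG_MIN_COMPLETENESS
            and b.contamination <= MAX_CONTAMINATION)


# Made-up example values, for illustration only.
example_bins = [
    GenomeBin("bin_a", 95.2, 1.3),   # passes both sets of thresholds
    GenomeBin("bin_b", 62.0, 4.8),   # bin layer only
    GenomeBin("bin_c", 88.0, 14.0),  # too contaminated for either
]
bin_layer = [b.name for b in example_bins if in_bin_layer(b)]
mag_candidates = [b.name for b in example_bins if meets_mag_thresholds(b)]
```

Note that anything passing the 80% threshold also passes the 50% one, which matches the idea that the MAGs derive from the bin layer.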
We got a (near) complete bacterial chromosome from an Illumina metagenome assembly!!
Almost unheard of, in my experience, but we assembled a near complete genome of a novel Proteobacteria species as a single contig direct from the Illumina data using IDBA-UD. We are not sure why the stars aligned and enabled us to do it, but we were certainly pleasantly surprised.
These small Proteobacteria are numerous and abundant in our rumen datasets and have also been found in other metagenomic datasets. They do not appear to have been cultured and sequenced, and with their small genomes, we have to speculate that they depend on other microbes to survive, so can only be isolated from communities. Nevertheless, we know their CAZyme profile now and one of my students is going to have a crack at culturing one of these in a placement with Alan Walker later this year.
MAGs can be very high quality
I know there are MAG-sceptics out there, but I would just like to point them to Supplementary Figure 13. The bottom left shows a whole-genome alignment between an Illumina MAG (vertical axis) and a Nanopore whole-chromosome assembly. So what is the take-home? Well, these are from two different samples, sequenced using two different technologies and assembled using completely different software and methods… yet they came up with the same answer. In other words, the Illumina MAG is of such high quality that it agrees, almost completely, with a single, whole-chromosome assembly from the Nanopore data. Remarkable.
This paper is a reference for IDEEL
I have blogged at great length about indels in long-read assembled genomes, and we created a simple Snakemake pipeline (IDEEL – Indels are not ideal) to demonstrate that these were a problem. I was pleased when this pipeline got a shout-out from Kat Holt on Twitter:
Re spotting low quality ‘finished’ genomes in NCBI – recommend using @BioMickWatson tool ideel to check for overabundance of truncated ORFS. Dead giveaway of homopolymer problems or poor basecalling in ONT-only assemblies https://t.co/EnL716vOSm #ABPHM19
— Kat Holt (@DrKatHolt) June 6, 2019
Until now, IDEEL has not had a formal publication. As we used it extensively in this paper, and mention it explicitly by name, I propose that our new paper is the formal citation for IDEEL.
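For anyone wondering what "checking for an overabundance of truncated ORFs" looks like in practice, here is a rough Python sketch of the general idea: compare the length of each predicted protein with the length of its top database hit, and see how many look too short. This is not the IDEEL code itself. The four-column layout (query id, subject id, query length, subject length) is an assumption, roughly what a tabular DIAMOND run asking for `qseqid sseqid qlen slen` would give you, and the 0.9 cut-off is an arbitrary illustrative value.

```python
import csv
import io


def length_ratios(hits_tsv: str) -> list[float]:
    """Return query length / subject length for the first hit of each query.

    Assumes tab-separated rows of: qseqid, sseqid, qlen, slen,
    with the best hit for each query listed first.
    """
    seen = set()
    ratios = []
    for row in csv.reader(io.StringIO(hits_tsv), delimiter="\t"):
        if not row:
            continue
        qseqid, _sseqid, qlen, slen = row[:4]
        if qseqid in seen:
            continue  # keep only the top hit per query
        seen.add(qseqid)
        ratios.append(int(qlen) / int(slen))
    return ratios


def fraction_truncated(ratios: list[float], cutoff: float = 0.9) -> float:
    """Fraction of proteins shorter than `cutoff` times their top hit."""
    if not ratios:
        return 0.0
    return sum(r < cutoff for r in ratios) / len(ratios)


# Made-up hits, for illustration only.
example_hits = "p1\tu1\t100\t100\np1\tu2\t100\t200\np2\tu3\t50\t100\n"
example_ratios = length_ratios(example_hits)
example_fraction = fraction_truncated(example_ratios)
```

In a good assembly most ratios should sit close to 1; a large pile of ratios well below 1 suggests indels are breaking reading frames.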
The real story is the proteins
Whilst the figures of 4941 MAGs and the single chromosome assemblies will take the headlines, for me the real story here is in the proteins. When clustering all known rumen microbial proteins, using a 90% identity cut-off, we ended up with 6.24 million clusters. However, 3.67 million of them are singletons that only contain proteins from our new MAGs. This is an incredible amount of new protein diversity, and figuring out what their function is will be the next major challenge for metagenomics researchers.
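To put those numbers in proportion, 3.67 million out of 6.24 million is roughly 59% of all clusters. The Python sketch below shows one way such a count could be made from a cluster table. It is an illustration under stated assumptions rather than our actual analysis code: the dictionary layout, the protein identifiers and the `singletons_only` option are all hypothetical.

```python
def count_new_only_clusters(
    clusters: dict[str, list[str]],
    new_mag_proteins: set[str],
    singletons_only: bool = False,
) -> int:
    """Count clusters whose members all come from the new MAGs.

    clusters maps a representative id to the ids of its members.
    With singletons_only=True, only one-member clusters are counted.
    """
    count = 0
    for members in clusters.values():
        if not members:
            continue  # ignore empty clusters
        if singletons_only and len(members) != 1:
            continue
        if all(m in new_mag_proteins for m in members):
            count += 1
    return count


# Made-up clusters, for illustration only.
example_clusters = {
    "c1": ["new_1"],             # one new protein on its own
    "c2": ["new_2", "known_1"],  # mixed: not counted
    "c3": ["new_3", "new_4"],    # new proteins only
    "c4": ["known_2"],           # previously known protein only
}
example_new = {"new_1", "new_2", "new_3", "new_4"}

new_only = count_new_only_clusters(example_clusters, example_new)
new_singletons = count_new_only_clusters(
    example_clusters, example_new, singletons_only=True
)

# Proportion quoted in the post: 3.67 million of 6.24 million clusters.
proportion = 3.67 / 6.24
```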
We also compared our CAZymes to CAZy and found that CAZy poorly represents rumen CAZymes.
All microbiome work ends up with the need to culture
I increasingly believe we have to bring many of these microbes into culture if we are to take the next step in microbiome research. We predict over 2000 novel species of bacteria and archaea. We cannot possibly know what they do, or how they behave and interact, without bringing them into culture.
There is a myth that 95% of rumen bacteria are unculturable. This myth needs to die. It already killed one of our grant applications 🙁 There is nothing inherently unculturable about the majority of these bugs, and even with the small Proteobacteria I mention above, it should be possible with a bit of work and the right co-culturing conditions. So please let this myth die!! All bugs are culturable; some are just more difficult than others.
Onwards and upwards 🙂