Just a quick note, as this is very preliminary data.  I plan to blog about these genomes as the analysis continues.

We have five new bacterial genomes, all sampled from a novel microbiome, so at present we have no idea what they are.  We have sequenced them all in a single lane of our MiSeq using 2x250bp reads and an insert size around 700bp.

Now, being totally honest, this was a bad run: Illumina admitted the reagents weren’t up to scratch and we are having them replaced.  Even so, we still had enough read pairs that were Q30 to at least 200bp in each read to give between 40x and 90x coverage per genome (assuming each genome is 5Mb).
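For anyone who wants to check the arithmetic, here is a rough sketch of the coverage estimate in Python.  The function names are mine, and I assume exactly 200bp of usable sequence per read (a lower bound) and a 5Mb genome; the read-pair counts are simply back-calculated from the 40x and 90x figures above, not measured values.

```python
import math


def coverage(read_pairs: int, usable_bp_per_read: int, genome_size_bp: int) -> float:
    """Expected fold coverage from paired-end reads (two reads per pair)."""
    return read_pairs * 2 * usable_bp_per_read / genome_size_bp


def pairs_needed(target_coverage: int, usable_bp_per_read: int, genome_size_bp: int) -> int:
    """Number of read pairs required to reach a target fold coverage."""
    return math.ceil(target_coverage * genome_size_bp / (2 * usable_bp_per_read))


GENOME_SIZE = 5_000_000  # assumed genome size (5Mb), as in the text
USABLE_BP = 200          # assumed usable bases per read (lower bound)

for target in (40, 90):
    pairs = pairs_needed(target, USABLE_BP, GENOME_SIZE)
    print(f"{target}x needs about {pairs:,} read pairs")
```

Under these assumptions, 40x corresponds to roughly half a million usable read pairs per genome, and 90x to a little over a million.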

We used the excellent VelvetOptimiser script, and here I plot the vanilla N50 for a range of kmers.  Note, there is no optimisation for coverage here – these are just the initial “vanilla” velvetg runs.  I have the final N50s, after optimisation, and in some cases they are 4-5 times higher than those reported here.  More on this in future posts.
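For anyone unfamiliar with the metric, N50 is the contig length such that contigs of that length or longer contain at least half of the assembled bases.  A minimal sketch of the calculation follows; the function name and the contig lengths are made up for illustration and are not taken from these assemblies or from Velvet's own code.

```python
def n50(contig_lengths: list[int]) -> int:
    """Return the N50 of a set of contig lengths.

    Walk the contigs from longest to shortest and return the length
    at which the running total first reaches half the assembly size.
    """
    if not contig_lengths:
        raise ValueError("no contigs supplied")
    half_total = sum(contig_lengths) / 2
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running >= half_total:
            return length
    raise AssertionError("unreachable for non-empty input")


# Illustrative contig lengths only (total = 32, so half = 16).
example = [2, 2, 2, 3, 3, 4, 8, 8]
print(n50(example))
```

With the toy input above, the two longest contigs (8 + 8) already cover half of the 32 bases, so the N50 is 8.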

What we see here is that the best N50, in all cases, is obtained at a kmer above 100bp.  A kmer cannot be longer than the reads it is drawn from, so this showcases very nicely the potential benefits of longer reads.

I must add some caveats here: these may be local optima, as I have not done a complete search of kmer/coverage space; there may be mis-assemblies; and there may be better optimisation criteria than N50.  As I said, this is a work in progress.

However, there is certainly evidence here that read length matters.