Opiniomics

bioinformatics, genomes, biology etc. "I don't mean to sound angry and cynical, but I am, so that's how it comes across"


With great power comes great responsibility

Recently I published a blog post about a fairly simple test to find out whether you have “short” protein predictions in your bacterial genomes, and predicted that some of these short peptides may be the result of unresolved errors in long-read, single-molecule assemblies.

Perhaps not surprisingly, there was a reaction from the PacBio community over this, and here is my response.

Before I begin, I just want to say that, whilst most people see me as some kind of Nanopore fan-boy, the reality is I am a fan of cool technology, and that includes PacBio.  The hard facts are that I have spent around £200k on PacBio sequencing over the last 18 months, and about £20k on nanopore in the same time period.  I also encouraged our core to buy a PacBio Sequel.  So I am not anti-PacBio.  I am, however, anti-bullshit 😉

In addition, my blog post wasn’t about problems with the technology per se, it was about problems with people.  If you don’t know errors are there, you might not correct them, and if you believe company hype (for example that PacBio data is Q50 after polishing – that’s one error in every 100,000 bases) you might believe your assembly is perfect.

It isn’t.  They never are.

Let’s dive into some data.  The three PacBio genomes I chose in the last blog post are:

AP018165	Mycobacterium stephanolepidis 
CP025317	Escherichia albertii 
ERS530415	Yersinia enterocolitica

If you search GenBank genomes for these, there is only one Mycobacterium stephanolepidis genome, but there are 39 Escherichia albertii genomes and 176 Yersinia enterocolitica genomes.  Many of these will have been sequenced on different technologies, which allows us to do a comparison within a species.

Escherichia albertii 

I downloaded the 39 genomes from GenBank, and ran the process I described last week.  For the complete genomes, this is what the data look like, ordered from lowest number of short peptides, to greatest:

Accession  Technology  Contigs  # proteins  # short proteins
GCA_001549955.1 Sanger+454 1 4299 19
GCA_002285455.1 Sanger+454 1 4157 19
GCA_002285475.1 Sanger+454 1 4295 28
GCA_002741375.1 PacBio 1 (plus 6 plasmids) 5126 79
GCA_002895205.1 Illumina+454+Ion 1 4423 83
GCA_000512125.1 PacBio 1 4482 106
GCA_002872455.1 PacBio 1 (plus 3 plasmids) 4984 124

Sure is unlucky that the PacBio assemblies are at the bottom of the table, isn’t it?  Of course, Sanger is the gold standard, and many will be asking what the Illumina assemblies look like.

Let’s look at the next ten genomes, which are not complete, but are the least fragmented:

Accession  Technology  Contigs  # proteins  # short proteins
GCA_002109845.1 454 25 4758 93
GCA_001514945.1 Illumina 43 4237 82
GCA_002563295.1 Ion 44 4531 143
GCA_001514965.1 Illumina 50 4390 86
GCA_001515045.1 Illumina 53 4480 110
GCA_001515005.1 Illumina 59 4963 117
GCA_001514645.1 Illumina 63 4470 84
GCA_001514685.1 Illumina 70 4209 68
GCA_001514925.1 Illumina 73 4464 84
GCA_001514905.1 Illumina 78 4622 113

What I think is worth pointing out here is that the PacBio genomes in the first table, which are complete, have about the same number of short proteins as the Illumina and 454 assemblies in the second table, which are fragmented.  We would usually expect fragmented assemblies to have more short proteins, because contig ends interrupt ORFs.  Indeed, compared to the complete Sanger assemblies, the fragmented assemblies do have more short proteins.

They just don’t have more than the PacBio complete assemblies.  Odd that.  If there were no uncorrected errors in the PacBio assemblies, they would be more like the Sanger assemblies than the Illumina ones.

Yersinia enterocolitica

Of the 176 GenBank genomes for this species, my automated script could only detect the sequencing technology for 35 of them.  Here are the 21 that have a claim to be complete or near-complete (<5 contigs):

Accession  Technology  Contigs  # proteins  # short proteins
GCA_000834195.1 Illumina+454 2 4161 17
GCA_000987925.1 PacBio 2 4340 19
GCA_001304755.1 PacBio 2 4344 20
GCA_002082275.2 PacBio+Illumina 1 4067 31
GCA_002554625.2 PacBio+Illumina 1 4384 35
GCA_001305635.1 Illumina 2 4162 35
GCA_000834795.1 PacBio+Illumina 2 4259 36
GCA_000755045.1 Illumina+454 2 4226 44
GCA_000834735.1 PacBio+Illumina 2 4138 48
GCA_000755055.1 Illumina+454 1 4095 53
GCA_000754975.1 Illumina+454 2 4083 53
GCA_000597945.2 Illumina+454 3 4640 59
GCA_000754985.1 Illumina+454 3 4282 65
GCA_002083285.2 PacBio+Illumina 4 4198 76
GCA_002082245.2 PacBio+Illumina 2 4521 106
GCA_001708575.1 PacBio 3 4673 221
GCA_001708615.1 PacBio 3 4615 224
GCA_001708635.1 PacBio 3 4659 224
GCA_001708655.1 PacBio 3 4654 226
GCA_001708555.1 PacBio 3 4674 230
GCA_001708595.1 PacBio 2 4733 235

Slightly different story here, in that the PacBio (and PacBio hybrid) genomes have some of the lowest numbers of predicted short proteins.  This is what people now expect when they see PacBio bacterial genomes.  However, there are also six PacBio genomes at the bottom of the table, so I don’t think you can really look at these data and conclude there isn’t a problem.  It’s possible that those six just happen to be the strains of Yersinia enterocolitica that have undergone the most pseudogenisation, but I don’t think so.



Let’s get some things straight

  • I know this is an incomplete analysis and obviously more work needs to be done
  • If I personally wanted a perfect microbial genome, I would probably use PacBio+Illumina
  • I have nothing against PacBio
  • Nanopore assemblies aren’t in this list because there were none for the species I chose, but I am sure they would also have significant problems

 

As I said above, it’s not that PacBio has a problem per se, it’s that people have a problem.  Yes, many errors are correctable with Quiver, Arrow and Pilon; often multiple rounds are necessary and you still won’t catch everything.

But not everyone knows that, and it’s clear to me from the above that many people are still pumping out poor quality, uncorrected, indel-ridden PacBio genomes.

The same is true for Nanopore, I have no doubt.

Let’s stop bullshitting.  These technologies have problems.  It doesn’t mean they are bad technologies, anything but – both PacBio and Nanopore have been transformational.  There is no need for bullshit.

“With great power comes great responsibility” – in this case, the responsibility is two-fold.  One, stop contaminating public databases with sh*t assemblies; and two, stop bullshitting that this isn’t a problem for your favourite technology.  That helps no-one.

 

A simple test for uncorrected insertions and deletions (indels) in bacterial genomes

A friend and colleague of mine once said of me “he’s a details man”, after we had discussed the fact that some of my papers consist solely of pointing out the errors other people ignore – in RNA-Seq, for example, or in genome assemblies (I have another under review!).

By now, those of you familiar with my earlier work will be jumping up and shouting

“Of course!  Even as far back as 2012, he was warning us about the dangers of mis-annotated pseudogenes!  Just look at Figure 2 in his review of bacterial genome annotation!”

Well, let’s dig into that a bit more.

For the uninitiated, the vast majority of bases in most bacterial genomes fall within open reading frames (ORFs) that are eventually translated into proteins.  Bacterial evolution, and often niche specificity, is characterised by the pseudogenisation of these ORFs – basically, replication errors in ORFs (especially insertions and deletions) can introduce a stop codon; the transcript and resulting protein are truncated, often no longer functional, and you have a pseudogene.  This happens when the ORFs are no longer under selective constraint.

HOWEVER.  Sequencing errors can also introduce indels that truncate ORFs, and you end up with incorrectly annotated pseudogenes.

And OH NOES we just so happen to be flooded with data from long read technologies that have an indel problem.

Let’s assume you have a shiny new bacterial genome, sequenced on either PacBio or Nanopore, and you want to figure out if it has an indel problem.  Here is my simple and quick test:

First, predict proteins using a gene finder.  Then map those proteins to UniProt (using blastp or diamond).  The ratio of the query length to the length of the top hit should form a tight, normal-looking distribution around 1.
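As a minimal sketch in R (the file names and the diamond command in the comments are illustrative assumptions, not a fixed recipe):


# proteins predicted with e.g. Prodigal, then mapped with e.g.:
#   diamond blastp -q proteins.faa -d uniprot --max-target-seqs 1 \
#     --outfmt 6 qseqid sseqid qlen slen -o hits.tsv

# read the tabular hits
hits <- read.table("hits.tsv", sep="\t",
                   col.names=c("query","subject","qlen","slen"))

# ratio of predicted protein length to length of its top UniProt hit
ratio <- hits$qlen / hits$slen

# should be a tight distribution around 1
hist(ratio, breaks=100, xlab="query/top-hit length ratio", main="")

# count suspiciously short predictions (< 90% of the top hit)
sum(ratio < 0.9)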

OK so what does this look like in practice?

Let’s look at a genome we know is OK, E. coli MG1655:


[Figure: MG1655.combined]

On the left we have the raw histogram, and on the right we have zoomed in so that the y-axis ends at 500.  Generally speaking, I think this has worked – the vast majority of predicted proteins have a 1:1 ratio with their top hit in UniProt.

Here is Salmonella gallinarum, which we know “has undergone extensive degradation through deletion and pseudogene formation”:


[Figure: gallinarum.combined]

Well, would you look at that!  There they are, all those pseudogenes appearing as predicted proteins that are shorter than their top hit.  For those of you saying “It’s not that obvious”, in the plots above, MG1655 has 157 proteins < 90% of the length of their top hit, whereas for Salmonella gallinarum the figure is 432.   By my recollection, gallinarum has over 200 annotated pseudogenes.

If you’re convinced that this actually might be working, then read on!

So here are a couple of PacBio genomes.  I genuinely chose these at random from recent Genome Announcements.  Generally speaking, all had high PacBio coverage, were assembled using Canu or HGAP, and polished using Pilon.

In no particular order:


[Figure: AP018165.combined]

(237 proteins < 90% of top hit)


[Figure: CP025317.combined]

(246 proteins < 90% of top hit)


[Figure: ERS530415.combined]

(268 proteins < 90% of top hit)

Now, these are not terrible numbers – in fact they are quite good – but those long tails are a bit of a worry, and are definitely worth checking out.  Bear in mind, if 200 pseudogenes in Salmonella gallinarum is “extensive degradation through deletion and pseudogene formation”, then 200 incorrect indel pseudogenes in your genome is a problem.  In the above, I can’t possibly tell you whether these numbers of predicted short/interrupted ORFs are real or not, because I don’t know the biology.  However, I can say that, given they were created with a technology known to have an indel problem, they are worth further investigation.

Indel sequencing errors are not the only problem; so are fragmented assemblies.  In fragmented assemblies, contig ends also tend to fall within ORFs and, again, you get short/interrupted proteins.  Here is one of our MAGs from our recent rumen paper:


[Figure: RUG001.combined]

(310 proteins < 90% of top hit)

It’s not all bad news though – some of the MAGs are OK:


[Figure: RUG045.combined]

(126 proteins < 90% of top hit)

And at least other people’s MAGs are just as bad (or worse):


[Figure: UBA2753_genomic.combined]

(265 proteins < 90% of top hit)

My absolute favourite way to assemble genomes is a hybrid approach, creating contigs with Illumina data and then stitching them together with a small amount of nanopore data.  We did this with B. fragilis and the results look good:


[Figure: fragilis_be1.combined]

(63 proteins < 90% of top hit)


So are there any examples of where it goes really, hideously wrong?  Well, the answer is yes.  We are assembling data from metagenomic samples using both PacBio and Nanopore.  The issue with metagenomes is that you may not have enough coverage to correct errors using the consensus approach – there simply aren’t enough bases at each position to get an accurate reflection.  We are seeing the same pattern for both PacBio and Nanopore, so I won’t name which technology, but… well, scroll down for the horror show.

 

 

 

 

 

 

 

 

 

 

 

 

Are you ready?  Scroll down….

 

 

 

 

 

 

 

 

 

 

A 3Mb contig from a Canu assembly:


[Figure: tig00000022.combined]

(2635 proteins < 90% of top hit)

Now, the only work this has had done on it is that Canu performs read correction before assembly.  We did one round.  Sweet holy Jesus that’s bad isn’t it??

Can it be corrected?  To a degree.  Here it is after two rounds of Pilon and some technology-specific polishing:


[Figure: tig00000022.pilon2.combined]

(576 proteins < 90% of top hit)

So less bad, but still not great.  I’d say this assembly has problems, which is why we haven’t published yet.


What’s the take home message here?  Well there are a few:

  1. Errors matter.  Pay attention to them.
  2. Both long read technologies have indel problems and these probably cause frameshifts in your ORFs
  3. Polishing and consensus and Illumina correction helps but it doesn’t catch everything
  4. The problem will inevitably be worse if you have low coverage (see the toy sketch below this list)
  5. Comparing predicted proteins with their top hits in UniProt can identify where you have errors in your ORFs (or real pseudogenes!)
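On point 4, here is a toy illustration in R of why consensus polishing needs depth – a gross simplification (a simple per-base majority vote, ignoring indels and base qualities), but it makes the point:


# probability that a majority vote calls the right base, given
# read depth cov and per-base error rate err (10% is roughly the
# right ballpark for raw long reads)
p_correct <- function(cov, err=0.1) pbinom(ceiling(cov/2) - 1, cov, err)

sapply(c(3, 5, 10, 20, 50), p_correct)

# at 3x depth roughly 3% of positions come out wrong - tens of
# thousands of errors per megabase - while at 50x essentially none do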

Code for doing this type of analysis will appear here.

Judge rules in favour of Oxford Nanopore in patent dispute with PacBio

Forgive me if I get any of the details wrong – I am not a lawyer – but the title of this post is my take on a judgement passed down in the patent infringement case PacBio brought against ONT.

To get your hands on the documentation, you need to register and log in to EDIS, click “Advanced Search”, do a search for “single molecule sequencing” and click the top hit.

My interpretation of the documentation is that the judge has massively limited the scope of the patents in question by expanding on the definition of “single molecule sequencing”.  ONT argued that in the patents in question, “single molecule sequencing” referred only to “sequencing of a single molecule by template-dependent synthesis”, and the judge agreed with this definition.

[Figure: sms]

All claims are then limited to template-dependent synthesis, which of course is NOT what Oxford Nanopore do.

[Figure: tds]

The document then goes into an area that would make all biological ontologists rejoice – THEY TRY AND DEFINE THE TERM “SEQUENCE”.  I can almost hear the voices shouting “I told you so!” coming out of Manchester and Cambridge as I write 😉

Beautiful boxplots in base R

As many of you will be aware, I like to post some R code, and I especially like to post base R versions of ggplot2 things!

Well, these amazing boxplots turned up on GitHub – go and check them out!

So I did my own version in base R – check out the code here and the result below.  Enjoy!

[Figure: probability]

Plotting cool graphs in R

I have to admit to being a bit of a snob when it comes to graphs and charts in scientific papers and presentations.  It’s not that I think I am particularly good at it – I’m OK – it’s just that I know what’s bad.  I’ve seen folk screenshot multiple Excel graphs so they can paste them into a PowerPoint table to create multi-panel plots… and it kind of makes me want to scream.  I’m sorry, I really am, but when I see Excel plots in papers I judge the authors, and I don’t mean in a good way.  I can’t help it.  Plotting good graphs is an art and, sticking with the metaphor, Excel is paint-by-numbers whereas R is a blank canvas, waiting for something beautiful to be created; Excel is limiting, whereas R sets you free.

Readers of this blog will know that I like to take fabulous plots that I find and recreate them.  Well, let’s do that again 🙂

I saw this Tweet by Trevor Branch on Twitter and found it intriguing:

It shows two plots of the same data.  The Excel plot:

[Figure: excel]

And the multi plot:

[Figure: multi]

You’re clearly supposed to think the latter is better, and I do; however, perhaps disappointingly, the top graph would be easy to plot in Excel, whereas I’m guessing most people would find it impossible to create the bottom one (in Excel or otherwise).

Well, I’m going to show you how to create both in R.  All the code is now on GitHub!

The Excel Graph

Now, I’ve shown you how to create Excel-like graphs in R before, and we’ll use some of the same tricks again.

First we set up the data:


# set up the data
df <- data.frame(Circulatory=c(32,26,19,16,14,13,11,11),
		 Mental=c(11,11,18,24,23,24,26,23),
		 Musculoskeletal=c(17,18,13,16,12,18,20,26),
		 Cancer=c(10,15,15,14,16,16,14,14))

rownames(df) <- seq(1975,2010,by=5)

df

Now let’s plot the graph:


# set up colours and points
cols <- c("darkolivegreen3","darkcyan","mediumpurple2","coral3")
pch <- c(17,18,8,15)

# we have one point on X axis for each row of df (nrow(df))
# we then add 2.5 to make room for the legend
xmax <- nrow(df) + 2.5

# make the borders smaller
par(mar=c(3,3,0,0))

# plot an empty graph
plot(1:nrow(df), 1:nrow(df), pch="", 
		xlab=NA, ylab=NA, xaxt="n", yaxt="n", 
		ylim=c(0,35), bty="n", xlim=c(1,xmax))

# add horizontal lines
for (i in seq(0,35,by=5)) {
	lines(1:nrow(df), rep(i,nrow(df)), col="grey")
}

# add points and lines 
# for each dataset
for (i in 1:ncol(df)) {

	points(1:nrow(df), df[,i], pch=pch[i], 
		col=cols[i], cex=1.5)

	lines(1:nrow(df), df[,i], col=cols[i], 
		lwd=4)


}

# add bottom axes
axis(side=1, at=1:nrow(df), tick=FALSE, 
		labels=rownames(df))

axis(side=1, at=seq(-0.5,8.5,by=1), 
		tick=TRUE, labels=NA)

# add left axis
axis(side=2, at=seq(0,35,by=5), tick=TRUE, 
		las=TRUE, labels=paste(seq(0,35,by=5),"%",sep=""))

# add legend
legend(8.5,25,legend=colnames(df), pch=pch, 
		col=cols, cex=1.5, bty="n",  lwd=3, lty=1)

And here is the result:

[Figure: excel_plot]

Not bad eh?  Actually, yes, very bad; but also very Excel!

The multi-plot

Plotting multi-panel figures in R is sooooooo easy!  Here we go for the alternate multi-plot.  We use the same data.


# split into 2 rows and 2 cols
split.screen(c(2,2))

# keep track of which screen we are
# plotting to
scr <- 1

# iterate over columns
for (i in 1:ncol(df)) {

	# select screen
	screen(scr)

	# reduce margins
	par(mar=c(3,2,1,1))

	# empty plot
	plot(1:nrow(df), 1:nrow(df), pch="", xlab=NA, 
		ylab=NA, xaxt="n", yaxt="n", ylim=c(0,35), 
		bty="n")

	# plot all data in grey
	for (j in 1:ncol(df)) {
		lines(1:nrow(df), df[,j], 
		col="grey", lwd=3)

	}	

	# plot selected in blue
	lines(1:nrow(df), df[,i], col="blue4", lwd=4)

	# add blobs
	points(c(1,nrow(df)), c(df[1,i], df[nrow(df),i]), 
		pch=16, cex=2, col="blue4")

	# add numbers
	mtext(df[1,i], side=2, at=df[1,i], las=2)
	mtext(df[nrow(df),i], side=4, at=df[nrow(df),i], 
		las=2)	

	# add title
	title(colnames(df)[i])

	# add axes if we are one of
	# the bottom two plots
	if (scr >= 3) {
		axis(side=1, at=1:nrow(df), tick=FALSE, 
			labels=rownames(df))
	}

	# next screen
	scr <- scr + 1
}

# close multi-panel image
close.screen(all=TRUE)

And here is the result:

[Figure: multi_plot]

 


And there we have it.

So which do you prefer?

I can’t recreate a graph from Ioannidis et al – can you?

Very quick one, this!  Really interesting paper from Ioannidis et al. about citation indices.

I wanted to recreate figure 1, which is:

[Figure: journal.pbio.1002501.g001]

Closest I could get (code here) is this:

[Figure: plos_weird]

Biggest difference is in NS, where they find all negative correlations, but most of mine are positive.

Source data are Table S1 Data.

Am I doing something wrong?  Or is the paper wrong?

 

UPDATE 9th July 2016

Using Spearman gets us closer but it’s still not quite correct (updated code too)

[Figure: results_spearman]

Which reference manager do you use?

So I sent out this tweet yesterday and it produced a bit of a response, so I thought it would be good to get an idea of how people reference when writing papers and grants:

Here is how I do it in Word and Mendeley.

1) Create a new group in Mendeley Desktop

[Figure: blog1]

2) Find a paper I want to cite in pubmed

[Figure: blog2]

3) Click on the Mendeley Chrome plug-in and save it to my new group

[Figure: blog3]

4) Click “insert a citation in Word”:

[Figure: blog4]

5) Search and add the citation in the Mendeley pop-up:

[Figure: blog5]

6) Change the style to something I want….

[Figure: blog6]

7) Here choosing “Genome Biology”:
[Figure: blog7]

8) Add my bibliography by clicking “Insert Bibliography” in Word:

[Figure: blog8]

 

9) Rinse and repeat – I generally add publications iteratively as I write 🙂

 

In an ideal world this would spawn many other blog posts where people show how they use alternative reference managers 🙂

Your strongly correlated data is probably nonsense

Use of the Pearson correlation coefficient is common in genomics and bioinformatics, which is OK as far as it goes (I have used it extensively myself), but it has some drawbacks – the major one being that Pearson can produce large coefficients in the presence of a few very large measurements.

This is best shown via example in R:


# let's correlate some random data
g1 <- rnorm(50)
g2 <- rnorm(50)

cor(g1, g2)
# [1] -0.1486646

So we get a small, negative correlation from correlating two sets of 50 random values.  If we ran this 1000 times we would get a distribution around zero, as expected.
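That’s easy to check (a quick sketch):


# repeat the experiment 1000 times; correlations of pure noise
# should be centred on zero
cors <- replicate(1000, cor(rnorm(50), rnorm(50)))
hist(cors, breaks=50)
mean(cors)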

Now let’s add a single pair of large values:


# add a single pair of large values to the same random data
g1 <- c(g1, 10)
g2 <- c(g2, 11)
 
cor(g1, g2)
# [1] 0.6040776

Holy smokes, all of a sudden my random datasets are positively correlated with r>=0.6!

It's also significant.


> cor.test(g1,g2, method="pearson")

        Pearson's product-moment correlation

data:  g1 and g2
t = 5.3061, df = 49, p-value = 2.687e-06
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.3941015 0.7541199
sample estimates:
      cor 
0.6040776 

So if you have used Pearson in large datasets, you will almost certainly have some of these spurious correlations in your data.

How can you solve this? By using Spearman, of course:


> cor(g1, g2, method="spearman")
[1] -0.0961086
> cor.test(g1, g2, method="spearman")

        Spearman's rank correlation rho

data:  g1 and g2
S = 24224, p-value = 0.5012
alternative hypothesis: true rho is not equal to 0
sample estimates:
       rho 
-0.0961086 

Building a Kraken database with new FTP structure and no GI numbers

Progress is usually good, except when a reliable resource completely changes its data and file structures and messes up everything you have been trying to do.  The NCBI used to arrange bacterial genomes in a way that was very easy to understand and download; now it’s a bit tougher.  So tough, in fact, that they had to create a new FAQ.

At the same time as moving around genome data, they also decided to retire GI numbers (thanks to Andreas for the link!).

This is a problem for Kraken, a metagenomic profiler, because it relies both on the old-style genomic data structures and on the GI numbers that come with them.  Below are my attempts to build a Kraken database using the new FTP structure and no GIs.

Downloading the data

It is pretty easy to follow the FAQ above, but I have created some Perl scripts here that should help.

You will need:

  • Perl
  • BioPerl
  • Git
  • A linux shell (bash) that has wget
  • Internet access
  • A folder that you have write access to
  • Maybe 40Gb of space

The magic in these scripts is that they download the data and then add the “|kraken:taxid|xxxx” text that replaces the GI number when building what Kraken refers to as “custom databases”.
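For example (the accession and taxid below are purely illustrative), a downloaded header such as:

>NC_000913.3 Escherichia coli str. K-12 substr. MG1655, complete genome

becomes:

>NC_000913.3|kraken:taxid|511145 Escherichia coli str. K-12 substr. MG1655, complete genome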

Note also that the script only downloads fasta for assemblies classed as “Complete Genome”.  Other options are:

  • Chromosome
  • Complete Genome
  • Contig
  • Scaffold

If you want these others then edit the scripts appropriately.


# clone the git repo
git clone https://github.com/mw55309/Kraken_db_install_scripts.git

# either put the scripts in your path or run them from that directory

# make download directory
mkdir downloads
cd downloads

# run for each branch of life you wish to download
perl download_bacteria.pl
perl download_archaea.pl
perl download_fungi.pl
perl download_protozoa.pl
perl download_viral.pl

# you should now have five directories, one for each branch

# build a new database 
# download taxonomy
kraken-build --download-taxonomy --db kraken_bvfpa_080416

# for each branch, add all fna in the directory to the database
for dir in fungi protozoa archaea viral bacteria; do
        for fna in `ls $dir/*.fna`; do
                kraken-build --add-to-library $fna --db kraken_bvfpa_080416
        done
done

# build the actual database
kraken-build --build --db kraken_bvfpa_080416


I haven’t tested the above database yet, but the output of the build script ends with:


Creating GI number to seqID map (step 4 of 6)...
GI number to seqID map created. [14.935s]
Creating seqID to taxID map (step 5 of 6)...
7576 sequences mapped to taxa. [1.188s]
Setting LCAs in database (step 6 of 6)...
Database LCAs set. [21m24.805s]
Database construction complete. [Total: 1h1m28.108s]

Which certainly makes it look like things have worked.

I’ll update the post with info from testing in due course.

 

Update 13/04/2016

I simulated some reads from a random set of fastas in the database, then ran Kraken, and it worked (plus or minus a few false positives).
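For the record, the simulation needn’t be fancy – something like this sketch in R (error-free 150bp substrings drawn at random with Biostrings; the file names are made up) is enough for a sanity check:


# draw 1000 error-free 150bp "reads" from random positions in a genome
library(Biostrings)

genome <- readDNAStringSet("some_genome.fna")[[1]]
starts <- sample(length(genome) - 149, 1000)
reads  <- DNAStringSet(sapply(starts, function(s)
              as.character(subseq(genome, start=s, width=150))))
names(reads) <- paste0("read", 1:1000)
writeXStringSet(reads, "simulated_reads.fasta")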

[Figure: kraken_test_png]

We need to stop making this simple f*cking mistake

I’m not perfect.  Not in any way.  I am sure if anyone was so inclined, they could work their way through my research with clinical forensic attention-to-detail and uncover all sorts of mistakes.  The same will be true for any other scientist, I expect.  We’re human and we make mistakes.

However, there is one mistake in bioinformatics that is so common, and which has been around for so long, that it’s really annoying when it keeps happening:

It turns out the Carp genome is full of Illumina adapters.

One of the first things we teach people in our NGS courses is how to remove adapters.  It’s not hard – we use Cutadapt, but many other tools exist.  It’s simple, but really important – with De Bruijn graphs you will get paths through the graph converging on k-mers from adapters; and with OLC assemblers you will get spurious overlaps.  With gap-fillers, it’s possible to fill the gaps with sequences ending in adapters, and this may be what happened in the Carp genome.
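If you want a quick check of a genome you care about, here is a sketch in R (it assumes the Biostrings package; the adapter shown is the common Illumina TruSeq prefix, and the file name is made up):


# count contigs containing the common Illumina TruSeq adapter prefix
library(Biostrings)

contigs <- readDNAStringSet("assembly.fasta")
adapter <- DNAString("AGATCGGAAGAGC")

# matches per contig, searching both strands
hits <- vcountPattern(adapter, contigs) +
        vcountPattern(reverseComplement(adapter), contigs)

sum(hits > 0)   # contigs with at least one adapter match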

Why then are we finding such elementary mistakes in such important papers?  Why aren’t reviewers picking up on this?  It’s frustrating.

This is a separate, but related, issue to genomic contamination – the Wheat genome has PhiX in it; tons of bacterial genomes do too; and lots of bacterial genes were problematically included in the Tardigrade genome and declared to be horizontal gene transfer.

Genomic contamination can be hard to find, but sequence adapters are not.  Who isn’t adapter trimming in 2016?!

