bioinformatics, genomes, biology etc. "I don't mean to sound angry and cynical, but I am, so that's how it comes across"


University of California makes legal move against Roger Chen (and Genia)

The relationship between sequencing companies is frosty and beset by legal issues, which I’ve covered before here and here.  Keith Robison tends to cover in more detail 😉

Most recently, PacBio moved against Oxford Nanopore, claiming (we think) that ONT’s 2D technology violated their patent on CCR (link).

Well now the absolute latest is a filing by the University of California against Roger Chen and therefore Genia. If you click through to the documents (requires registration) you’ll see that UC claim Chen, with others, produced key inventions whilst at UC that he later assigned to Genia, but which should have automatically been assigned to UC according to UC’s “oath of allegiance”, which Chen signed as a UC employee.

It remains to be seen how important this is, and no doubt Chen/Genia/Roche will fight tooth and nail; however, if the courts decide in UC’s favour it could spell the end of Genia, or at the very least see a large cash settlement with UC.

Fascinating times!

People are wrong about sequencing costs on the internet again

People are wrong about sequencing costs on the internet again and it hurts my face, so I had to write a blog post.

Phil Ashton, whom I like very much, posted this blog:

But the words are all wrong 😉  I’ll keep this short:

  • COST is what it COSTS to do something.  It includes all COSTS.  The clue is in the name.  COST. It’s right there.
  • PRICE is what a consumer pays for something.

These are not the same thing.

As a service provider, if the PRICE you charge to users is lower than your COST, then you are either SUBSIDISED or LOSING MONEY, and are probably NOT SUSTAINABLE.

COST, amongst other things, includes:

  • Reagents
  • Staff time
  • Capital cost or replacement cost (sequencer and compute)
  • Service and maintenance fees
  • Overheads
  • Rent

Someone is paying these, even if it’s not the consumer.  So please – when discussing sequencing – distinguish between PRICE and COST.
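To make the distinction concrete, here is a minimal sketch in Python.  Every number in it is invented purely for illustration (the annual figures, the sample throughput and the price are all assumptions, not real facility data); the only point is that COST is the sum of everything in the list above, divided by what you actually produce, and PRICE is a separate number.

```python
# All figures are hypothetical, chosen only to illustrate the arithmetic.
ANNUAL_COSTS = {
    "reagents": 400_000,
    "staff_time": 150_000,
    "capital_replacement": 100_000,
    "service_and_maintenance": 50_000,
    "overheads": 60_000,
    "rent": 40_000,
}


def cost_per_sample(annual_costs, samples_per_year):
    """Full cost of one sample: every annual cost line, shared across all samples."""
    return sum(annual_costs.values()) / samples_per_year


def margin_per_sample(price, annual_costs, samples_per_year):
    """Price minus cost; a negative value means subsidised or losing money."""
    return price - cost_per_sample(annual_costs, samples_per_year)


SAMPLES_PER_YEAR = 1_000  # hypothetical throughput
PRICE = 500               # hypothetical price charged to the user

cost = cost_per_sample(ANNUAL_COSTS, SAMPLES_PER_YEAR)
margin = margin_per_sample(PRICE, ANNUAL_COSTS, SAMPLES_PER_YEAR)

print(f"cost per sample:   {cost:.2f}")
print(f"price per sample:  {PRICE:.2f}")
print(f"margin per sample: {margin:.2f}")
```

With these made-up numbers the price sits well below the cost, so the margin is negative: someone else is picking up the difference.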

Thank you 

The unbearable madness of microbiome

This is my attempt to collate the literature on how easy it is to introduce bias into microbiome studies.  I hope to crowd-source papers and add them below under each category.  PLEASE GET INVOLVED.  If this works well we can turn it into a comprehensive review and publish 🙂 Add new papers in the comments or Tweet me 🙂

Special mention to these blogs:

  1. Microbiome Digest’s page on Sample Storage
  2. Microbe.net’s Best practices for sample processing and storage prior to microbiome DNA analysis freeze? buffer? process?
  3. The-Scientist.com’s Spoiler Alert


UPDATE 7th August 2016


For posterity the original blog post is below, but I am now trying to manage this through a ZOTERO GROUP LIBRARY.  Please continue to contribute – join the group, get involved!  I think you can join the group here.  I am trying to avoid just listing software papers BTW; I would prefer to focus on papers that specifically demonstrate sources of bias.

I won’t be updating the text below.


  • “We propose that best practice should include the use of a proofreading polymerase and a highly processive polymerase, that sequencing primers that overlap with amplification primers should not be used, that template concentration should be optimized, and that the number of PCR cycles should be minimized.” (doi:10.1038/nbt.3601)

Sample collection and sample storage

  • “Samples frozen with and without glycerol as cryoprotectant indicated a major loss of Bacteroidetes in unprotected samples” (24161897)
  • “No significant differences occurred in the culture-based analysis between the fresh, snap or -80°C frozen samples.” (25748176)
  • “In seven of nine cases, the Firmicutes to Bacteroidetes 16S rRNA gene ratio was significantly higher in fecal samples that had been frozen compared to identical samples that had not… The results demonstrate that storage conditions of fecal samples may adversely affect the determined Firmicutes to Bacteroidetes ratio, which is a frequently used biomarker in gut microbiology.” (22325006)
  • “Our results indicate that environmental factors and biases in molecular techniques likely confer greater amounts of variation to microbial communities than do differences in short-term storage conditions, including storage for up to 2 weeks at room temperature” (10.1111/j.1574-6968.2010.01965.x)
  • “The trichloroacetic acid preserved sample showed significant loss of protein band integrity on the SDS-PAGE gel. The RNAlater preserved sample showed the highest number of protein identifications (103% relative to the control; 520 ± 31 identifications in RNAlater versus 504 ± 4 in the control), equivalent to the frozen control. Relative abundances of individual proteins in the RNAlater treatment were quite similar to that of the frozen control (average ratio of 1.01 ± 0.27 for the 50 most abundant proteins), while the SDS-extraction buffer, ethanol, and B-PER all showed significant decreases in both number of identifications and relative abundances of individual proteins.” (10.3389/fmicb.2011.00215)
  • “Optimized Cryopreservation of Mixed Microbial Communities for Conserved Functionality and Diversity” (10.1371/journal.pone.0099517)
  • “A previously known bias in FTA(®) cards that results in lower recovery of pure cultures of Gram-positive bacteria was also detected in mixed community samples. There appears to be a uniform bias across all five preservation methods against microorganisms with high G + C DNA. Overall, the liquid-based preservatives (DNAgard(™), RNAlater(®), and DESS) outperformed the card-based methods.” (22974342)
  • “Microbial composition of frozen and ethanol samples were most similar to fresh samples. FTA card and RNAlater-preserved samples had the least similar microbial composition and abundance compared to fresh samples.” (10.1016/j.mimet.2015.03.021)
  • “Bray-Curtis dissimilarity and (un)weighted UniFrac showed a significant higher distance between fecal swabs and -80°C versus the other methods and -80°C samples (p<0.009). The relative abundance of Ruminococcus and Enterobacteriaceae did not differ between the storage methods versus -80°C, but was higher in fecal swabs (p<0.05)” (10.1371/journal.pone.0126685)
  • “We experimentally determined that the bacterial taxa varied with room temperature storage beyond 15 minutes and beyond three days storage in a domestic frost-free freezer. While freeze thawing only had an effect on bacterial taxa abundance beyond four cycles, the use of samples stored in RNAlater should be avoided as overall DNA yields were reduced as well as the detection of bacterial taxa.” (10.1371/journal.pone.0134802)
  • “A key assumption in many studies is the stability of samples stored long term at −80 °C prior to extraction. After 2 years, we see relatively few changes: increased abundances of lactobacilli and bacilli and a reduction in the overall OTU count. Where samples cannot be frozen, we find that storing samples at room temperature does lead to significant changes in the microbial community after 2 days.” (10.1186/s40168-016-0186-x)

DNA extraction

  • “Caution should be paid when the intention is to pool and analyse samples or data from studies which have used different DNA extraction methods.” (27456340)
  • “Samples clustered according to the type of extracted DNA due to considerable differences between iDNA and eDNA bacterial profiles, while storage temperature and cryoprotectants additives had little effect on sample clustering” (24125910)
  • “Bifidobacteria were only well represented among amplified 16S rRNA gene sequences when mechanical disruption (bead-beating) procedures for DNA extraction were employed together with optimised “universal” PCR primers” (26120470)
  • Qiagen DNA stool kit is biased (misses bifidobacteria) (26120470)
  • “Bead-beating has a major impact on the determined composition of the human stool microbiota.  Different bead-beating instruments from the same producer gave a 3-fold difference in the Bacteroidetes to Firmicutes ratio” (10.1016/j.mimet.2016.08.005)
  • “We observed that using different DNA extraction kits can produce dramatically different results but bias is introduced regardless of the choice of kit.” (10.1186/s12866-015-0351-6)

Sequencing strategy

  • “Bifidobacteria were only well represented among amplified 16S rRNA gene sequences … with optimised “universal” PCR primers. These primers incorporate degenerate bases at positions where mismatches to bifidobacteria and other bacterial taxa occur” (26120470)
  • Anything other than 2x250bp sequencing of V4 region (approx 250bp in length) inflates number of OTUs (23793624)
  • “This study demonstrates the potential for differential bias in bacterial community profiles resulting from the choice of sequencing platform alone.” (10.1128/AEM.02206-14)
  • “The effects of DNA extraction and PCR amplification for our protocols were much larger than those due to sequencing and classification” (10.1186/s12866-015-0351-6)
  • “Nested PCR introduced bias in estimated diversity and community structure. The bias was more significant for communities with relatively higher diversity and when more cycles were applied in the first round of PCR” (10.1371/journal.pone.0132253)
  • “pyrosequencing errors can lead to artificial inflation of diversity estimates” (19725865)
  • “Our findings suggest that when alternative sequencing approaches are used for microbial molecular profiling they can perform with good reproducibility, but care should be taken when comparing small differences between distinct methods” (25421243)

Data analysis strategy


Bioinformatics / database issues


Contamination in kits

  • “Reagent and laboratory contamination can critically impact sequence-based microbiome analyses” (25387460)
  • “Due to contamination of DNA extraction reagents, false-positive results can occur when applying broad-range real-time PCR based on bacterial 16S rDNA” (15722157)
  • “Relatively high initial densities of planktonic bacteria (10(2) to 10(3) bacteria per ml) were seen within [operating ultrapure water treatment systems intended for laboratory use]” (8517737)
  • “Sensitive, real-time PCR detects low-levels of contamination by Legionella pneumophila in commercial reagents” (16632318)
  • “Taq polymerase contains bacterial DNA of unknown origin” (2087233)


Commercial stuff







Not providing feedback on rejected research proposals is a betrayal of PIs, especially early career researchers

Those of you who follow UK research priorities can’t have missed the creation of the Global Challenges Research Fund (GCRF).  Over the last few months the research councils in the UK have been running the first GCRF competition, a two-stage process where preliminary applications are sifted and only certain chosen proposals are allowed to go through to the second stage.  We put one in, and like many others didn’t make the cut.  I don’t mind this, rejection is part of the process; however I do worry about this phrase in the rejection email:

Please note specific feedback on individual outlines will not be provided.

Before I go on, I want to launch an impassioned defense of the research councils themselves.  Overall I think they do a very good job of a very complex task.  They must receive many hundreds of applications, perhaps thousands, every year and they ensure each is reviewed, discussed at committee and a decision taken.  Feedback is provided and clearly UK science punches above its weight in global terms, so they must be doing something right.  They are funding the right things.

I’m also aware that they have had their budgets cut to the bone over the last decade and by all accounts (anecdotal so I can’t provide links) the Swindon office has been cut to the bare minimum needed to have a functional system.  In other words, in the face of cuts they have protected research spending.  Good work 🙂

I kind of knew GCRF was in trouble when I heard there had been 1400 preliminary applications.  A £40M pot with expected grants of £600k means around 66 will be funded.  That’s quite a sift!
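For anyone who wants to check my sums, here is the back-of-the-envelope arithmetic as a short Python sketch.  It uses only the three figures quoted above, and it assumes (my simplification) that every award is exactly the expected £600k.

```python
# Figures quoted in the text; the "every grant is exactly 600k" part is an assumption.
POT_GBP = 40_000_000
EXPECTED_GRANT_GBP = 600_000
PRELIMINARY_APPLICATIONS = 1400

# Number of whole grants the pot can pay for.
funded = POT_GBP // EXPECTED_GRANT_GBP

# Chance that any one preliminary application ends up funded.
success_rate = funded / PRELIMINARY_APPLICATIONS

print(f"grants funded: {funded}")
print(f"overall success rate: {success_rate:.1%}")
```

That works out at 66 grants and an overall success rate of a little under 5%, which is why the sift has to be so brutal.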

The argument will go that, with that sheer number of applications there is no way the research councils can provide feedback.  Besides, it was a preliminary application anyway, so it matters less.

I couldn’t disagree more, on both counts.

First of all let’s deal with the “preliminary” application thing.  Here is what had to happen to get our preliminary application together:

  • Initial exchange of ideas via e-mail, meetings held, coffee drunk
  • Discussions with overseas collaborators in ODA country via Skype
  • 4-page concept note submitted to Roslin Science Management Group (SMG)
  • SMG discussed at weekly meeting, feedback provided
  • Costings form submitted and acted on by Roslin finance
  • Quote for sequencing obtained from Edinburgh Genomics
  • Costings provided by two external partners, including partner in ODA country
  • Drafts circulated, commented on, acted on.
  • BBSRC form filled in (almost 2000 words)
  • Je-S form filled in, CVs gathered, formatted and attached, form submitted

In actual fact this is quite a lot of work.  I wouldn’t want to guess at the cost.

Do we deserve some feedback?  Yes.   Of course we do.

When my collaborators ask me why this was rejected, what do I tell them?  “I don’t know”?  Really?

Secondly, let’s deal with the “there were too many applications for us to provide feedback” issue.  I have no idea how these applications were judged internally.  I am unsure of the process.  However, someone somewhere read it; they judged it; they scored it; forms were filled in; bullet points written; e-mails sent; meetings had; a ranked list of applications was created; somewhere, somehow, information about the quality of each proposal was created – why can we not have access to that information?  Paste it into an e-mail and click send.  I know it takes a bit of work, but we put in a lot of work too, as did 1400 other applicants.  We deserve feedback.

At the moment we just don’t know – was the science poor?  Not ODA enough?  Not applied enough?  Too research-y?  Too blue sky?  Wrong partners?  We are floundering here and we need help.

Feedback on failed proposals is essential.  It is essential for us to improve, especially for young and mid-career researchers, the ones who haven’t got secure positions, who are being judged on their ability to bring in money.  Failure is never welcome, but feedback always is.  It helps us understand the decision making processes of committees so we can do better next time.  We are always being told “request feedback” so when it doesn’t come it feels like a betrayal.  How can we do better if we don’t know how and why we failed?

So come on research councils; yes you do a great job; yes you fund great science, in the face of some savage budget cuts.  But please, think again on your decision to not provide feedback.  We need it.

Open analysis of ZiBRA project MinION data

The ZiBRA project aims to travel around Brazil, collecting Zika samples as they go and sequencing them in “real time” using Oxford Nanopore’s MinION.  They released the first data yesterday, and I have put together some quick scripts to analyse that data and produce a consensus sequence.

First we get the data:

# get Zika reference
curl -s "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=NC_012532.1&rettype=fasta&retmode=txt" > zika_genome.fa

# get the data
wget -q http://s3.climb.ac.uk/nanopore/Zika_MP1751_PHE_Long_R9_2D.tgz

# surprise, it's not zipped
tar xvf Zika_MP1751_PHE_Long_R9_2D.tgz

Extract 2D FASTQ/A, calculate read lengths, extract metadata:

# extract 2D FASTQ
extract2D Zika_MP1751_PHE_Long_R9_2D/downloads_rnn_wf839/pass/ > zika.pass.2D.fastq 
extract2D Zika_MP1751_PHE_Long_R9_2D/downloads_rnn_wf839/fail/ > zika.fail.2D.fastq 

# convert to FASTA
porefq2fa zika.pass.2D.fastq > zika.pass.2D.fasta
porefq2fa zika.fail.2D.fastq > zika.fail.2D.fasta

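# read lengths (the sequence is line 2 of each 4-line FASTQ record;
# counting by line number avoids miscounting when read headers contain spaces)
awk 'NR % 4 == 2 {print length($0)}' zika.pass.2D.fastq > zika.pass.2D.readlengths.txt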

# extract metadata
extractMeta Zika_MP1751_PHE_Long_R9_2D/downloads_rnn_wf839/pass/ > zika.pass.meta.txt
extractMeta Zika_MP1751_PHE_Long_R9_2D/downloads_rnn_wf839/fail/ > zika.fail.meta.txt
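If you would rather stay out of awk, the read-length step can be sketched in Python as well.  This is a minimal sketch, not part of the original scripts: the function name `fastq_read_lengths` and the two toy reads are made up for illustration, and it assumes strict 4-line FASTQ records (no wrapped sequence lines), which is the same assumption the shell one-liner makes.

```python
import io


def fastq_read_lengths(handle):
    """Return the length of every read in a 4-line-per-record FASTQ stream."""
    lengths = []
    for line_number, line in enumerate(handle):
        # Lines 0, 4, 8... are headers; lines 1, 5, 9... are the sequences.
        if line_number % 4 == 1:
            lengths.append(len(line.rstrip("\r\n")))
    return lengths


# Toy data (invented): two reads, the first with a space in its header.
example = io.StringIO(
    "@read1 some description\nACGTACGT\n+\n!!!!!!!!\n"
    "@read2\nACG\n+\n!!!\n"
)
lengths = fastq_read_lengths(example)
print(lengths)
```

To run it on the real data you would pass an open file instead of the `StringIO` object, e.g. `open("zika.pass.2D.fastq")`.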

Align to reference genome using BWA

# bwa index the genome
bwa index zika_genome.fa

# samtools index the genome
samtools faidx zika_genome.fa

# run bwa and pipe straight to samtools to create BAM
bwa mem -M -x ont2d zika_genome.fa zika.pass.2D.fastq | \
        samtools view -T zika_genome.fa -bS - | \
        samtools sort -T zika.pass_vs_zika_genome.bwa -o zika.pass_vs_zika_genome.bwa.bam -

samtools index zika.pass_vs_zika_genome.bwa.bam

Align to reference genome using LAST

# create a LAST db of the reference genome
lastdb zika_genome zika_genome.fa

# align high quality reads to reference genome with LAST
lastal -q 1 -a 1 -b 1 zika_genome zika.pass.2D.fasta > zika.pass_vs_zika_genome.maf

# convert the MAF to BAM with complete CIGAR (matches and mismatches)
python nanopore-scripts/maf-convert.py sam zika.pass_vs_zika_genome.maf | \
    samtools view -T zika_genome.fa -bS - | \
    samtools sort -T zika.pass_vs_zika_genome.last -o zika.pass_vs_zika_genome.last.bam -

samtools index zika.pass_vs_zika_genome.last.bam

Count errors using Aaron Quinlan’s excellent script:

# count errors
python nanopore-scripts/count-errors.py zika.pass_vs_zika_genome.last.bam > zika.pass_vs_zika_genome.last.txt

Produce a consensus sequence:

# produce consensus
samtools mpileup -vf zika_genome.fa zika.pass_vs_zika_genome.bwa.bam | bcftools call -m -O z - > allsites.vcf.gz
bcftools index allsites.vcf.gz
bcftools consensus -f zika_genome.fa allsites.vcf.gz > zika.minion.consensus.fasta

And finally, produce some QC images:

# QC images

This is all in a GitHub repo; I’m pretty sure the results produced here aren’t going to be used for anything serious, but this serves as a simple, exemplar, self-contained and reproducible analysis.

Converting from GenBank format to FASTA using Microsoft Excel

Should be a popular one this 🙂

First of all we need an example.  I’m using R23456, which we can download from NCBI.  On that page, look towards the top-right, click “Send To”, choose “File”, leave format as “GenBank (full)” and click “Create File”.  This should download to your computer.

In Excel, click File -> Open, navigate to the folder you downloaded the GenBank sequence to, make sure “All files (*.*)” is chosen in the File Open dialogue box and choose the recently downloaded file (it is almost certainly called sequence.gb)

This should immediately open up the Excel “Text Import Wizard”.  Just hit Finish.  You should see something like:


Now, scroll down to the sequence data, which in my version is in cells A55 to A61.  Select all of those cells.

Select the Data menu and then click “Text to columns”


This brings up another wizard, but again you can just click finish.  The sequence data is now in B55 through E61.

Now click in cell I55 and type “=B55&C55&D55&E55&F55&G55”.  Hit enter.  Now click on the little green square and drag down to I61 so we have sequence data for all of the cells.

Now we’re getting close!

Click cell A1 and then repeat the “Text to Columns” trick.

Now go back down to cell I54 and type “=">"&B1” and hit enter.  We now have something that looks a lot like FASTA!

Final step – select all cells in the range I54:I61, hit Ctrl-C, then hit the “+” at the bottom to add a new sheet, and in the new sheet hover over cell A1, right-click your mouse and choose Paste Special.

In the “Paste Special” dialogue click the radio button next to “Values” and hit OK.

Now click File -> Save As, navigate to a suitable folder, make sure “Save as type” is set to “Text (Tab delimited) (*.txt)”, give it a filename and hit Save.  Click “OK” and “Yes” to Excel’s next two questions.

Go look at the file, it’s FASTA!


Awesome!  Who needs bioinformaticians eh?

(and if you use this, please never, ever speak to me.  Thanks 🙂)
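If you would prefer we stayed on speaking terms, the same conversion can be sketched in a few lines of Python.  This is a minimal sketch under some assumptions: it handles a single-record GenBank flat file, it takes the name from the LOCUS line (the same value the Excel recipe pulls out of cell B1), and the function name `genbank_to_fasta` and the toy record are made up for illustration.  For real work a proper parser is the safer choice.

```python
import io


def genbank_to_fasta(handle, width=60):
    """Convert one GenBank flat-file record to a FASTA string."""
    name = None
    in_sequence = False
    chunks = []
    for line in handle:
        if line.startswith("LOCUS"):
            fields = line.split()
            if len(fields) > 1:
                name = fields[1]          # locus name follows the LOCUS keyword
        elif line.startswith("ORIGIN"):
            in_sequence = True            # sequence lines start after ORIGIN
        elif line.startswith("//"):
            break                         # end-of-record marker
        elif in_sequence:
            # Sequence lines look like "       61 acgtacgtac gtacgtacgt ...";
            # drop the leading coordinate and keep the blocks of bases.
            chunks.extend(line.split()[1:])
    if name is None:
        raise ValueError("no LOCUS line found")
    sequence = "".join(chunks)
    wrapped = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">" + name + "\n" + "\n".join(wrapped) + "\n"


# Toy record (invented) just to show the shape of the input.
toy_record = io.StringIO(
    "LOCUS       TOY1                      12 bp    DNA     linear   UNA 01-JAN-1900\n"
    "DEFINITION  Made-up example record.\n"
    "ORIGIN\n"
    "        1 acgtacgtac gt\n"
    "//\n"
)
fasta = genbank_to_fasta(toy_record)
print(fasta, end="")
```

For the downloaded file you would call it as `genbank_to_fasta(open("sequence.gb"))` and write the returned string to a `.fasta` file.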

© 2019 Opiniomics
