bioinformatics, genomes, biology etc. "I don't mean to sound angry and cynical, but I am, so that's how it comes across"

Category: bioinformatics (page 3 of 8)

How to extract FASTQ from the new MinION FAST5 format using poRe

A little while ago I demonstrated how to extract FASTQ from MinION FAST5 files using Rscript and the Linux command line.

In that article, I described how to extract the different FASTQ data using a parameter:

# 2D is the default
invisible(apply(array(f5files), 1, printfastq))
# template
invisible(apply(array(f5files), 1, printfastq, f5path="/Analyses/Basecall_2D_000/BaseCalled_template"))
# complement
invisible(apply(array(f5files), 1, printfastq, f5path="/Analyses/Basecall_2D_000/BaseCalled_complement"))

The f5path parameter is the address of the FASTQ data within the FAST5 file, and that’s been changed recently by nanopore.  Well, we can very easily simply edit our scripts to cope with the new format:

# 2D is the default
invisible(apply(array(f5files), 1, printfastq))
# template
invisible(apply(array(f5files), 1, printfastq, f5path="/Analyses/Basecall_1D_000/BaseCalled_template"))
# complement
invisible(apply(array(f5files), 1, printfastq, f5path="/Analyses/Basecall_1D_000/BaseCalled_complement"))

And that’s it! No new version of poRe required, simply a couple of edits to scripts – example scripts can be found on github.

The five habits of bad bioinformaticians

When ever I see bad bioinformatics, a little bit of me dies inside, because I know there is ultimately no reason for it to have happened.  As a community, bioinformaticians are wonderfully open, collaborative and helpful.  I genuinely believe that most problems can be fixed by appealing to a local expert or the wider community.  As Nick and I pointed out in our article, help is out there in the form of SeqAnswers and Biostars.

I die even more when I find that some poor soul has been doing bad bioinformatics for a long time.  This most often happens with isolated bioinformatics staff, often young, and either on their own or completely embedded within a wet-lab research group.  I wrote a blog post about pet bioinformaticians in 2013 and much of the advice I gave there still stands today.

There are so many aspects of bioinformatics that many wet lab PIs are simply incapable of managing them.  This isn’t a criticism.  There are very few superheros; few people who can actually span the gap of wet- and dry- science competently.  So I thought I’d write down the 5 bad habits of bad bioinformaticians.  If you are a pet bioinformatician, read these and figure out if you’re doing them.  If you’re a wet lab PI managing bioinformatics staff, try and find out if they have any of these habits!

Using inappropriate software

This can take many forms, and the most common is out-of-date software.  Perhaps they still use Maq, when BWA and other tools are now far more accurate; perhaps the software is a major version out of date (e.g. Bowtie instead of Bowtie2).  New releases come out for a reason, and major reasons are (i) new/better functionality; (ii) fixing bugs in old code.  If you run out of date software, you are probably introducing errors into your workflow; and you may be missing out on more accurate methods.

Another major cause of inappropriate software use is that people often use the software they can install rather than the software that is best for the job.  Sometimes this is a good reason, but most often it isn’t.  It is worth persevering – if the community says that a tool is the correct one to use, don’t give up on it because it won’t install.

Finally, there are just simple mistakes – using a non-spliced aligner for RNA-Seq, software that assumes the wrong statistical model (e.g. techniques that assume a normal distribution used on counts data) etc etc etc.

These aren’t minor annoyances, they can result in serious mistakes, and embed serious errors in your data which can result in bad science being published.  Systematic errors in analysis pipelines can often look like real results.

Not keeping up to date with the literature

Bioinformatics is a fast moving field and it is important to keep up to date with the latest developments.  A good recent example is RNA-Seq – for a long time, the workflow has been “align to the genome, quantify against annotated genes”.  However, there is increasing evidence that the alignment stage introduces a lot of noise/error, and there are new alignment free tools that are both faster and more accurate.  That’s not to say that you must always work with the most bleeding edge software tools, but there is a happy medium where new tools are compared to existing tools and shown to be superior.

Look for example here, a paper suggesting that the use of SAMtools for indel discovery may come with a 60% false discovery rate.  60%!  Wow….. of course that was written in 2013, and in 2014 a comparison of a more recent version of SAMtools shows a better performance (though still an inflated false positive rate).

Bioinformatics is a research subject.  It’s complicated.

All of this feeds back to point (1) above.  Keeping up to date with the literature is essential, otherwise you will use inappropriate software and introduce errors.

Writing terrible documentation, or not writing any at all

Bioinformaticians don’t document things in the same way wet lab scientists do.  Wet lab scientists keep lab books, which have many flaws; however they are a real physical thing that people are used to dealing with.  Lab books are reviewed and signed off regularly.  You can tell if things have not been documented well using this process.

How do bioinformaticians document things?  Well, often using things like readme.txt files, and on web-based media such as wikis or github.  My experience is that many bioinformaticians, especially young and inexperienced ones, will keep either (i) terrible or (ii) no notes on what they have done.  They’re equally as bad as one another.  Keeping notes, version controlled notes, on what happened in what order, what was done and by whom, is essential for reproducible research.  It is essential for good science.  If you don’t keep good notes, then you will forget what you did pretty quickly, and if you don’t know what you did, no-one else has a chance.

Not writing tests

Tests are essential.  They are the controls (positive and negative) of bioinformatics research and don’t just apply to software development.  Bad bioinformaticians don’t write tests.  As a really simple example, if you are converting a column of counts to a column of percentages you may want to sum the percentages to make sure they sum to 100.  Simple but it will catch errors.  You may want to find out the sex of all of the samples you are processing and make sure they map appropriately to the signal from sex chromosomes.  There are all sorts of internal tests that you can carry out when performing any analysis, and you must implement them, otherwise errors creep in and you won’t know about it.

Rewriting things that exist already

As Nick and I said, “Someone has already done this, find them!”.  No matter what problem you are working on, 99.9999% of the time someone else has already encountered it and either (i) solved it, or (ii) is working on it.  Don’t re-invent the wheel.  Find the work that has already been done and either use it or extend it.  Try not to reject it because it’s been written in the “wrong” programming language.

My experience is that most bioinformaticians, left to their own devices, will rewrite everything themselves in their preferred programming language and with their own quirks.  This is not only a waste of time, it is dangerous, because they may introduce errors or may not consider problems other groups have encountered and solved.  Working together in a community brings safety, it brings multiple minds to the table to consider and solve problems.


There we are.  That’s my five habits of bad bioinformaticians.  Are you doing any of these?  STOP.  Do you manage a bioinformatician?  SEND THEM THIS BLOG POST, and see what they say 😉

What does the PacBio Sequel mean for the future of sequencing?

PacBio have responded to intense competition in the sequencing market by releasing the Sequel, a machine that promises 7 times the throughput of their last machine (the RSII) at a price of $350k (the RSII cost more in the region of $750k). So they have a cheaper, higher throughput machine (though I haven’t seen cost-per-Gb figures).  That’s around 7 Gigabases of reads averaging around 15Kb in length.  Without doubt this is a very interesting platform and they will sell as many of them as they can produce.  A sweet spot for assembly is still 50-60X for PacBio, so think 12 SMRT cells to get a human genome: let’s say £12k per genome. Edit 01/10/2015 my maths at 7am was not that good!  50X human genome is 150Gb, so that’s 21 SMRT cells and £21K per human genome.   Much more expensive than Illumina’s $1000 genome, but far, far better.

I just want to say that at times I have been accused of being an Illumina- and a nanopore- fanboy; I am neither and both.  I am just a fan of cool technology, from microarray in the 90s to sequencing now.

In long reads, let’s be clear, we are talking about the promise of Oxford Nanopore vs the proven technology of PacBio.  And the Sequel changes the dynamics.  However, the MinION fast mode is capable of throughput in the region of 7Gb (like the Sequel) and the PromethION is capable of throughput on Illumina-scale.   Therefore, Oxford Nanopore are far from dead – though they need to respond.

So how does the new PacBio Sequel change the market?  A lot of initial reactions I have had are that the Sequel is a real threat to Oxford Nanopore.  It certainly ramps up the competition in the long read space, which is a really good thing.  But actually, high-throughput long read machines like the Sequel and the PromethION don’t spell the end for one another – they actually spell the beginning of the end for Illumina – as a sequencing platform.

As soon as you have high-throughput, cheap long reads, it is in fact Illumina who face a problem.  I love Illumina.  When I first arrived at Roslin, I walked into our lab and (honestly!) stroked our Illumina GAIIx.  Illumina have revolutionised biology.  However, short reads have limitations – they are bad for genome assembly, they are bad at complex genomes, they’re actually quite bad at RNA-Seq, they are pretty bad for structural variation, they are bad at haplotypes and SNP phasing, and they are not that great at metagenomics.  What has made Illumina the platform of choice for those applications is scale – but as soon as long read technologies reach a similar scale, Illumina looks like a poor choice.

The Sequel (and the PromethION) actually challenge Illumina – because in an era of cheap, long read sequencing, Illumina becomes a genotyping platform, not a sequencing platform.

Extracting MinION fastq on the command line using poRe

Hopefully most of you will be aware of our software for handling MinION data, poRe, published in bioinformatics this year:

Watson M, Thomson M, Risse J, Talbot R, Santoyo-Lopez J, Gharbi K, Blaxter M. (2015) 
poRe: an R package for the visualization and analysis of nanopore sequencing data. 
Bioinformatics 31(1):114-5.

poRe is based on R; many people consider R to be an interactive package, but it is very easy to write command-line scripts for R in very much the same way as you write Perl or Python scripts.  Below is a script that can be used in conjunction with R and poRe to extract 2D fastq data from a directory full of fast5 files:

#!/usr/bin/env Rscript

# get command line arguments as an array
args <- commandArgs(trailingOnly = TRUE)

# if we don't have any command line arguments
# print out usage
if (length(args) == 0) {
        stop("Please provide a directory containing fast5 files as the first (and only) argument")

# the first argument is the directory, stop if it doesn't exist
my.dir <- args[1]
if (! file.exists(my.dir)) {
        stop(paste(my.dir, " does not exist"))

# get a list of all of the files with a .fast5 file name extension
f5files <- dir(path = my.dir, pattern = "\.fast5$", full.names = TRUE)

# print to STDERR how many files we are going to attempt
write(paste("Extracting 2D fastq from", length(f5files), "files"), stderr())

# load poRe

# apply printfastq to every entry in f5files
# printfastq tests for 2D fastq and if found prints to STDOUT
invisible(apply(array(f5files), 1, printfastq))

The suppressMessages() function means that none of the output is printed when the poRe library is loaded, and the invisible() function suppresses the natural output of apply() (which would otherwise return an array of undefineds the same length as f5files (which we definitely do not want).

The above is based on R 3.1.2 and poRe 0.9.

We could extract template and complement fastq (respectively) by substituting in the following lines:

# template
invisible(apply(array(f5files), 1, printfastq, f5path="/Analyses/Basecall_2D_000/BaseCalled_template"))
# complement
invisible(apply(array(f5files), 1, printfastq, f5path="/Analyses/Basecall_2D_000/BaseCalled_complement"))


Finally, a definition for bioinformatics

Following the trend for ultra-short-but-to-the-point blog posts, I have decided to finally define bioinformatics:

bio-informatics (bi-oh-in-foh-shit-I-don’t-understand-how-that-works)

From the word bio, meaning “of or related to biology” and informatics, meaning “absolutely anything your collaborators or boss don’t understand about maths, statistics or computing, including why they can’t print and how the internet works”

Happy to finally put that one to bed!

Why nanopore sequencing errors aren’t actually errors

Let me start by saying that what I’ve written below are opinions formed after many conversations with many different people, including some from ONT, over the last 2-3 years.  I don’t want anyone to think I am stealing their ideas – but I think it’s important to get them out there.

Oxford Nanopore’s technology, currently available in the MinION device, is the only sequencing technology to directly sequence an actual strand of DNA.  Every other technology sequences incorporation events into a template strand (including Sanger, Illumina, Ion and PacBio) – and invariably those incorporation events involve the canonical bases, A, G, C and T.  However, in nature, DNA is much more complicated than that – there are base modifications and base analogues.  The MinION (and other ONT sequencers) is the only sequencer that directly detects those modifications and analogues.  However, what we do when we “base call” the raw ONT data is to compress it from natural-DNA-space into canonical-AGCT-space, and whenever we compress we lose signal, and that lost signal turns up in the system as noise – as so-called “error”.  People talk about the error rate of the MinION being around 10% – but not all of it is error, some of it is signal, signal for something we don’t quite understand yet so we’re interpreting it (wrongly) as AGCT.

That’s one of the exciting things about nanopore sequencing that no-one is talking about.  Oxford Nanopore’s sequencers could possibly reveal a whole new alphabet for DNA that we didn’t know about before – and my bet is that it’ll explain some of what we currently struggle to understand in genetics and genomics.

Assembling B fragilis from MinION and Illumina data

You may have seen our bioRxiv preprint about the sequencing and assembly of B fragilis using Illumina + MinION sequence data.  Well, here is how to do it yourself.

First get the data:

# MinION data (raw dast5 data; needs to be extracted)
wget ftp://ftp.sra.ebi.ac.uk/vol1/ERA463/ERA463589/oxfordnanopore_native/FAA37759_GB2974_MAP005_20150423__2D_basecalling_v1.14_2D.tar.gz
mkdir fragilis_minion
tar -xzf FAA37759_GB2974_MAP005_20150423__2D_basecalling_v1.14_2D.tar.gz -C fragilis_minion
rm fragilis_minion/*.md5

# MiSeq data (fastq data)
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR973/ERR973713/ERR973713_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR973/ERR973713/ERR973713_2.fastq.gz

Then, within R:


This will extract all sequences as FASTA into fragilis_minion/extracted.  Let’s put all the 2D reads into one file:

cat fragilis_minion/extracted/*.2D.fasta > fragilis_minion.2D.fasta

And finally we are ready to assemble it:

spades.py -o spades_fragilis -1 ERR973713_1.fastq.gz -2 ERR973713_2.fastq.gz --nanopore fragilis_minion.2D.fasta

That’s as far as I can go on my Ubuntu laptop, wiil update when I get to work!

Why FPKM makes sense

Simulating the perfect scenario

Let’s get our hands dirty in R.  I’m going to simulate 10 genes, each between 200 and 10000 nucleotides long and with between 0 and 50000 molecules in the mix:

gmat <- data.frame(id=paste("G",1:10,sep=""),
 length=as.integer(runif(10, min=200, max=10001)), 
 nmol=as.integer(runif(10, min=0, max=50001)),

This is what my data look like

> gmat
    id length  nmol
1   G1   5432  3206
2   G2   7177 43621
3   G3   6189 48670
4   G4   8529   635
5   G5   2576 27919
6   G6   8462 33280
7   G7   4316 18735
8   G8   3600 41457
9   G9   4973 21935
10 G10   9742 26207

What I want to do now is calculate the total amount of sequence space each gene represents, and use that to calculate a probability of RNA-Seq reads coming from that gene.  A simple example: if there are 10 copies of a gene in the mix, each of 1000 nucleotides long, then the sequence space of that gene is 10,000 nucleotides.

gmat$sspace <- gmat$length * gmat$nmol
ssum <- sum(gmat$sspace)
gmat$p <- gmat$sspace/ssum

My data now look like this:

> gmat
    id length  nmol    sspace          p
1   G1   5432  3206  17414992 0.01098634
2   G2   7177 43621 313067917 0.19750063
3   G3   6189 48670 301218630 0.19002544
4   G4   8529   635   5415915 0.00341666
5   G5   2576 27919  71919344 0.04537072
6   G6   8462 33280 281615360 0.17765861
7   G7   4316 18735  80860260 0.05101114
8   G8   3600 41457 149245200 0.09415216
9   G9   4973 21935 109082755 0.06881546
10 G10   9742 26207 255308594 0.16106284

My p-values should sum to one, and they do:

> sum(gmat$p)
[1] 1

Now, let’s assume that we have 30 million RNA-Seq reads, and that those reads have an equal chance of coming from any of the copies of any of the genes in the mix.  Therefore, 30million multiplied by my p-values will give me my RNA-Seq counts:

gmat$nreads <- round(30000000 * gmat$p)

And I can now calculate my FPKM values:

gmat$fpkm <- (gmat$nreads / (gmat$length / 1000)) / 30

My data now look like this:

> gmat
    id length  nmol    sspace          p  nreads       fpkm
1   G1   5432  3206  17414992 0.01098634  329590  2022.5209
2   G2   7177 43621 313067917 0.19750063 5925019 27518.5509
3   G3   6189 48670 301218630 0.19002544 5700763 30703.7388
4   G4   8529   635   5415915 0.00341666  102500   400.5941
5   G5   2576 27919  71919344 0.04537072 1361121 17612.8500
6   G6   8462 33280 281615360 0.17765861 5329758 20994.8719
7   G7   4316 18735  80860260 0.05101114 1530334 11819.0767
8   G8   3600 41457 149245200 0.09415216 2824565 26153.3805
9   G9   4973 21935 109082755 0.06881546 2064464 13837.8180
10 G10   9742 26207 255308594 0.16106284 4831885 16532.8309

Satisfyingly, I have a nice linear relationship between the number of molecules in the mix, and my calculated FPKM:


We can calculate the relationship between these (forcing the model through the origin):

> lm(nmol ~ 0 + fpkm, data = gmat)

lm(formula = nmol ~ 0 + fpkm, data = gmat)


So, in our model, nmol = 1.585 * fpkm, and we can add this line to the plot:

p + geom_point() + geom_abline(intercept=0, slope=1.585, col="red")


(please do not use this formula more generically – it really only works in this simple simulation!)

So for perfect simulated data, the number of molecules (transcripts) in the original mix correlates perfectly with FPKM.

(unless I have made any huge errors or incorrect assumptions, which is possible)

101 bioinformatics facts

I am trying to crowd-source 101 fun bioinformatics facts!  Please contribute in the comments and I will add them below:

  1. BLAST is so fast, the authors had to deliberately slow down the code so it doesn’t overheat the servers
  2. GCG, the old bioinformatics package, was named after the authors kept high-fiving each other, shouting “good code guys!”
  3. Bowtie is named so because “it is almost impossible to tie”, referring to code to avoid a “race condition” when using multiple processors
  4. TopHat is named do because it was the first spliced RNA-Seq aligner, and when it worked first time, the authors shouted “Top that!”
  5. Over 1 billion people have searched the NCBI protein database for their own name
  6. The EBI is an elaborate front-end to NCBI services
  7. The SRA (short read archive) is the best known of the archives, and not many people know or use the MRA (medium read archive), the KLRA (kinda long read archive) and the LRA (long read archive)
  8. Europe PubMed Central has only ever been accessed by people accidentally clicking on links.  100% of visitors immediately bounce to pubmed.com
  9. There are now more journals than papers
  10. The HGAP assembler is actually an elaborate front-end hiding three thousand slave labourers all running GAP4 (via @IanGoodhead)
  11. HPC actually means “Homunculus Powered Computing”, and all servers are actually just mechanical turks full of leprechauns (via @froggleston)
  12. The biosemantics.org group of LUMC is doing psipred protein folding on 1kWh household radiators (128 cores each) https://vimeo.com/122893200 (via Eric Feliksik)
  13. The ‘p’ in p-value actually stands for p-otentially interesting! (via )
  14. Velvet is so named because @dzerbino wore velvet gloves when coding it (via @pathogenomenick)
  15. The @PacBio machines are so large because inside’s an Illumina machine + a bioinformatician running assemblies (via )
  16. The Cloud is actually just a cloud. That’s it. A real cloud (via @froggleston)
  17. If you plug in a @nanopore MinION and hit left,right,up,up,A,B,down, it’ll transform into a lifesize statue of @Clive_G_Brown (via @froggleston)
  18. DDBJ have their data centre in a volcano, and are basically a front for Osato Chemicals (SPECTRE) (via @SCEdmunds)
  19. Hidden Markov Models were initially developed to find Waldo shawnhymel.com/portfolio/413/ (via )
  20. 99.5% of people who cite Altschul et al have never read the paper
  21. Bioinformatics Applications Notes have to be automatically generated by the software they describe
  22. BGI exclusively publish in Nature journals because their papers are first rejected by Gigascience
  23. BGI actually only have one HiSeq but made to look like hundreds by a set up of mirrors, like that bit in Enter the Dragon (via @froggleston)
  24. the consumption rate of coffee (+ beer 🍻) among Bioinformaticians from around the world is increasing every year. TRUE FACT! (via @NazeefaFatima)
  25. EBI” actually stands for “European bureau of investigation”. It’s a front of the EU secret service, collecting genomic info (via @klmr)
  26. There are only 3 facts in 101 (via @mcaccamo)
  27. If all you have is a hmmer everything looks like it can be resolved with Viterbi (via @mcaccamo)
  28. Hidden Markov Models are like the recipe for Kentucky Fried Chicken.  There are only three people in the world who understand small parts of how HMMs work, and only when they get together do they know the full picture
  29. The “e” in e-value stands for “excellent”, as in “that’s an excellent BLAST hit”
  30. The Burrows-Wheeler transform, used in BWA and Bowtie, saves memory by transforming the DNA sequence data into a parallel dimension, meaning it ceases to exist in 4D space/time in this Universe
  31. Base qualities are called “Phred” scores in honour of Fred Sanger who developed DNA sequencing. #101bioinfofunfacts (via @tostenseemann)
  32. In a recent public survey of the 100 most desirable jobs, bioinformatician was a close second to astronaut (via @dynomics)
  33. Heng Li writes all his code in x86 assembly language, and uses a C decompiler before releasing it. @lh3lh3 (via @torstenseemann)
  34. The EBI secretly funds the Perl Foundation to ensure its legacy internal software infrastructure won’t collapse (via @torstenseemann)
  35. Illumina reads are short as before the development of Basespace they were delivered via Twitter (via @RoyChaudhuri)
  36. Pet Bioinformaticians are paid with cuddling #101bioinfofunfacts (via )
  37. Python was conceived in the 1980’s by @gvanrossum & named after his favourite British comedy, Monty Python’s Flying Circus (via )
  38. the word “ELVIS” appears 35 times in human peps (GRch38). “ELVISLIVES” appears 0 times. The king has left the genome #slowday (via @rdemes)
  39. Tuxedo suit is so named that only ‘privileged’ know how to use it ! #bioinformaticsfun (via )
  40. It’s easy! You only have to download this database in which all the genes have only one ID and you can retrieve the IDs in the most important databases (via @jorjial)
  41. If you stand in front of a mirror and say ‘HiSeq’ 3 times, Illumina staff member will show up holding the HiSeq X Ten system (via @nazeetafatima)
  42. @BenLangmead wrote Bowtie while wearing a tuxedo but he did all the testing in zip-up onesie batman pajamas (via @coletrapnell)
  43. Spike-ins are like gold (via @nomad421)
  44. Do you need more hard disk space to store and do the analysis? Sure! Let’s buy 10 hard disk of 3 TB in the supermarket (via @jorjial)
  45. This could be the basis for 10.1 papers in PLOS Comp. Biol. (via @kbradnam)
  46. All bioinformatics problems can be solved through the medium of twitter, snide and ranting 😉 (via @guyleonard)
  47. Installing TopHat with option –reverse will install HotTap, a program that spews vapid results on a random science hot topic (via @CamLBerthelot)
  48. SOLiD sequencers generated colour-space sequence using an algorithm based on the once popular “Simon Says” hand held game (via @iandcalling)
  49. CriMap was called CriMap because users do an awful lot of crying before they get a half decent map (via @dj_de_koning)
  50. A single anonymous donor, RP11, accounts for 72 percent of the human reference genome (via CanGenom)
  51. If you amass the de-bugging tears of a bioinformatician it is enough to fill an Olympic size swimming pool annually (via @paulhoskisson)
  52. FASTA 80 character line wrapping was invented to standardise data sharing using MS Word (via @IanGoodhead)
  53. nine out of ten Bioinformaticians prefer Excel (via
  54. if you’ve never shown the NIH sequencing costs plot in talk/lecture you’re not a real bioinformatician pic.twitter.com/jQzG7MGosd (via @AliciaOshlack)
  55. Illumina is short for Illuminati, the shadowy organisation that controls sequencing worldwide. (via @neilfws)
  56. Every time you run a closed source bioinformatics tool, a PhD student’s soul is sacrificed to the Blood God. (via @froggleston)
  57. The number of replicates needed for your RNA-seq experiment equals the impact factor of the journal you want to publish in (via @torstenseemann)
  58. NCBI’s bacterial annotation takes 6 weeks because it’s done manually by work experience students pasting ORFs into web BLAST (via @torstenseemann)
  59. The majority of bioinformaticians can’t pronounce “de Bruijn” properly (see also thegenomefactory.blogspot.sg/2013/08/how-to… @torstenseemann) (via @rvaerle)
  60. Oxford Nanopore plans to introduce a new FASTQ encoding scheme using an ASCII offset of 48 with optional emoji (via @torstenseemann)
  61. The HMMer package was so named when someone asked how it worked, and the developers said “Hmmmm… errr….” (via @mgollery)
  62. 63% of Bioinformaticists were Biologists to start with, but they realized that the cold room is really COLD! (via @mgollery)
  63. It has been calculated that there are twice as many data formats as there are Bioinformaticians (via @mgollery)

Analysing MinION data with no bioinformatics expertise nor infrastructure

It has been a long standing aim of mine to enable easy analysis of Oxford Nanopore MinION data, preferably by people who don’t have a ton of bioinformatics skills.  This was one of the reasons we started to develop poRe, which is really easy to install, runs on any platform and is fairly easy to use.

Below is a proof-of-concept method for analysing MinION data without the need for bioinformatics infrastructure – I am doing this on my Windows laptop, with only R and an internet connection.  We are going to use the EBI web-services to do the heavy lifting.  Please do not abuse this!  EBI probably don’t want you to submit 20,000 jobs.  As I said, this is “proof-of-concept”.  It means you’re not supposed to actually use it in practice 😉

Some polite things to do:

  1. Use your own, real e-mail address when/if you run the code below
  2. Please stick to a small number of jobs! (unless EBI tell you it’s OK)

We’re going to do everything in R.

Make a directory on your computer where we will do the work.  You will need approx. 15Gb of space (I know, the data are big!).   My directory is going to be “C:/Users/Mick/Documents/MinION”.  We’re going to use data from Nick Loman’s group, published here and here.

# go to the working directory

# load the necessary packages

# download one of Nick's runs and unpack it
url <- "ftp://ftp.sra.ebi.ac.uk/vol1/ERA362/ERA362836/oxfordnanopore_native/Ecoli_R7_ONI_flowcell_17.tar.gz"
dest <- "Ecoli_R7_ONI_flowcell_17.tar.gz"
download.file(url=url, destfile=dest)

This will take some time, as there is about 10Gb of data there.  If I had a smaller dataset, I would use it.

Now, we use the poRe get_fast5_info() function to read statistics from the several thousand MinION reads, which again takes some time.  We then just select the reads that have a 2D sequence:

# read information from all .fast5 files in the current working directory
finfo <- read.fast5.info(dir=getwd())

# get a list of fast5 files where the "len2d" is greater than zero
reads.2d <- rownames(finfo)[finfo$len2d>0]

Now, we are going to use the EBI REST API for NCBI BLAST.  You can read about it at the link provided, but essentially it is about building and querying custom URLs to submit BLAST jobs and fetch the result.

We are only going to do this for the first ten fast5 sequences.  Please do not submit more, or the EBI may block you.  If you want to use this in anger, I suggest contacting the EBI directly and asking them politely.  The code blasts each sequence with default parameters against the EMBL bacterial database, saves the BLAST report to your current working directory, and prints out the top hit.

Here goes:

# iterate over the first ten 2D reads
for (f5 in reads.2d[1:10]) {

    # use poRe function get_fasta() to get the 2D sequence
    	fasta.2d <- get_fasta(f5)$'2D'
    # build a list of parameters for the EBI BLAST service
    	myl <- list(sequence=fasta.2d,

    # build the submission URL and submit the job
    runurl <- "http://www.ebi.ac.uk/Tools/services/rest/ncbiblast/run"
    	jobid <- postForm(runurl, .params=myl)

    # query the status URL for the job until the status
    # is no longer "RUNNING"
    staturl <- "http://www.ebi.ac.uk/Tools/services/rest/ncbiblast/status/"
    	surl <- paste(staturl, jobid, sep="")
    	status <- "RUNNING"
    	while(status == "RUNNING") {
          		status <- getURL(surl)

    # build the output URL and fetch the results
    outurl <- "http://www.ebi.ac.uk/Tools/services/rest/ncbiblast/result/"
    	ourl <- paste(outurl, jobid, "/out", sep="")
    	out <- getURL(ourl)

    # write the output to a file in the current working directory
    	bloutfile <- gsub(".fast5",".blast",f5)
    	write(out, bloutfile)

    # print out the top hit
    	print(strsplit(out, "n")[[1]][22])


Now, it’s not quick – it probably takes about 2-3 minutes to complete on my laptop.  It also isn’t very responsive (in that it looks a bit like it may not be doing anything!).  However, eventually the R terminal will show the top hits:


You can see from the above that the first few reads are a bit dodgy and don’t really hit anything, but pretty quickly the reads start hitting E coli K12, which is what they sequenced.

Also, in your directory you will find a whole bunch of BLAST reports:


Note, as “proof-of-concept” code, there is very little (i.e. nothing) in the way of error-checking in the above code!  I mean, come on, it’s only a blog post!


Older posts Newer posts

© 2021 Opiniomics

Theme by Anders NorenUp ↑