Opiniomics

bioinformatics, genomes, biology etc. "I don't mean to sound angry and cynical, but I am, so that's how it comes across"

Plot your own EU referendum poll results

Due to the unspeakable horror of the EU referendum, I have to find something to make me feel better.  This poll of polls usually does so, though it is way too close for comfort.

Anyway, I took their data and plotted it for myself.  Data and script are on github, and all you need is R.

Enjoy! #voteRemain

voteRemain

Open analysis of ZiBRA project MinION data

The ZiBRA project aims to travel around Brazil, collecting Zika samples as they go and sequencing them in “real time” using Oxford Nanopore’s MinION.  They released the first data yesterday, and I have put together some quick scripts to analyse that data and produce a consensus sequence.

First we get the data:


# get Zika reference
curl -s "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=NC_012532.1&rettype=fasta&retmode=txt" > zika_genome.fa

# get the data
wget -q http://s3.climb.ac.uk/nanopore/Zika_MP1751_PHE_Long_R9_2D.tgz

# surprise, it's not zipped
tar xvf Zika_MP1751_PHE_Long_R9_2D.tgz

Extract 2D FASTQ/A, calculate read lengths, extract metadata:


# extract 2D FASTQ
extract2D Zika_MP1751_PHE_Long_R9_2D/downloads_rnn_wf839/pass/ > zika.pass.2D.fastq 
extract2D Zika_MP1751_PHE_Long_R9_2D/downloads_rnn_wf839/fail/ > zika.fail.2D.fastq 

# convert to FASTA
porefq2fa zika.pass.2D.fastq > zika.pass.2D.fasta
porefq2fa zika.fail.2D.fastq > zika.fail.2D.fasta

# read lengths
cat zika.pass.2D.fastq | paste - - - - | awk '{print $3}' | awk '{print length($1)}' > zika.pass.2D.readlengths.txt

# extract metadata
extractMeta Zika_MP1751_PHE_Long_R9_2D/downloads_rnn_wf839/pass/ > zika.pass.meta.txt
extractMeta Zika_MP1751_PHE_Long_R9_2D/downloads_rnn_wf839/fail/ > zika.fail.meta.txt

Align to reference genome using BWA


# bwa index the genome
bwa index zika_genome.fa

# samtools index the genome
samtools faidx zika_genome.fa

# run bwa and pipe straight to samtools to create BAM
bwa mem -M -x ont2d zika_genome.fa zika.pass.2D.fastq | \
        samtools view -T zika_genome.fa -bS - | \
        samtools sort -T zika.pass_vs_zika_genome.bwa -o zika.pass_vs_zika_genome.bwa.bam -

samtools index zika.pass_vs_zika_genome.bwa.bam

Align to reference genome using LAST


# create a LAST db of the reference genome
lastdb zika_genome zika_genome.fa

# align high quaklity reads to reference genome with LAST
lastal -q 1 -a 1 -b 1 zika_genome zika.pass.2D.fasta > zika.pass_vs_zika_genome.maf

# convert the MAF to BAM with complete CIGAR (matches and mismatches)
python nanopore-scripts/maf-convert.py sam zika.pass_vs_zika_genome.maf | \
    samtools view -T zika_genome.fa -bS - | \
    samtools sort -T zika.pass_vs_zika_genome.last -o zika.pass_vs_zika_genome.last.bam -

samtools index zika.pass_vs_zika_genome.last.bam

Count errors using Aaron Quinlan’s excellent script:


# count errors
python nanopore-scripts/count-errors.py zika.pass_vs_zika_genome.last.bam > zika.pass_vs_zika_genome.last.txt

Produce a consensus sequence:


# produce consensus
samtools mpileup -vf zika_genome.fa zika.pass_vs_zika_genome.bwa.bam | bcftools call -m -O z - > allsites.vcf.gz
bcftools index allsites.vcf.gz
bcftools consensus -f zika_genome.fa allsites.vcf.gz > zika.minion.consensus.fasta

And finally, produce some QC images:


# QC images
./scripts/qc.R

This is all in a GitHub repo; I’m pretty sure the results produced here aren’t going to be used for anything serious, but this serves as a simple, exemplar, self-contained and reproducible analysis.

Converting from GenBank format to FASTA using Microsoft Excel

Should be a popular one this 🙂

First of all we need an example.  I’m using R23456, which we can download from NCBI.  On that page, look towards the top-right, click “Send To”, choose “File”, leave format as “GenBank (full)” and click “Create File”.  This should download to your computer.

In Excel, click File -> Open, navigate to the folder you downloaded the GenBank sequence to, make sure “All files (*.*)” is chosen in the File Open dialogue box and choose the recently downloaded file (it is almost certainly called sequence.gb)

This should immediately open up the Excel “Text Import Wizard”.  Just hit Finish.  You should see something like:

excel

Now, scroll down to the sequence data, which in my version is in cells A55 to A61.  Select all of those cells.

Select the Data menu and then click “Text to columns”

test2columns

This brings up another wizard, but again you can just click finish.  The sequence data is now in B55 through E61.

Now click in cell I55 and type “=B55&C55&D55&E55&F55&G55”.  Hit enter.  Now click on the little green square and drag down to I61 so we have sequence data for all of the cells.

Now we’re getting close!

Click cell A1 and then repeat the “Text to Columns” trick.

Now go back to down to cell I54 and type “=’>’&B1” and hit enter.  We now have something that looks a lot like FASTA!

Final step – select all cells in the range I54:I61, hit Ctrl-C, then hit the “+” at the bottom to add a new sheet, and in the new sheet hover over cell A1, right-click your mouse and choose Paste Special.

In the “Paste Special” dialogue click the radio button next to “Values” and hit OK.

Now click File -> Save As, navigate to a suitable folder, make sure “Save as type” is set to “Text (Tab delimited) (*.txt)”, give it a filename and hit Save.  Click “OK” and “Yes” to Excel’s next two questions.

Go look at the file, it’s FASTA!

fasta

Awesome!  Who needs bioinformaticians eh?

(and if you use this, please never, ever speak to me.  Thanks 🙂)

You don’t need to work 80 hours a week in academia…. but you do need to succeed

I’ve been thinking a lot lately about academic careers, chiefly because I happen to be involved in some way with three fellowship applications at the moment.  For those of you unfamiliar with the academic system at work here, the process is: PhD -> Post-Doc -> Fellowship -> PI (group leader) -> Professorship (Chair).  So getting a fellowship is that crucial jump from post-doc to PI and represents a person’s first chance to drive their own research programme.  Sounds grand doesn’t it?  “Drive your own research programme”.  Wow.  Who wouldn’t want to do that?

Well, be careful what you wish for.  Seriously.  I love my job; I love science and I love computers and I get paid to do both.  It’s amazing and possibly (for me) one of the best jobs in the world.  However, it comes with huge pressures; job insecurity; unparalleled and relentless criticism; failures, both of your ideas and your experiments; and occasionally the necessity of working with and for people who act like awful human beings.  It also requires a lot of hard work, and even then, that isn’t enough.  This THE article states very clearly and eloquently that very few people actually work an 80 hour week in academia, and you do not need to in order to succeed.   I would tentatively agree, though I have pointed out some of the things you need to do to succeed in the UK system, and one of them is working hard.

It’s true, you don’t need to work 80 hours a week in academia…. but you do need to succeed.

What does success look like?

Unfortunately, science lends itself very well to metrics: number of papers published; amount of money won in external grant funding; number of PhD students trained; feedback scores from students you teach; citation indices; journal indices.  And probably many more.  All of these count, I’m sorry but they do.  We may wish for a better world, but we don’t yet live in one, so believe me – these numbers count.

To succeed as a PI, even a new one such as a fellow, you will need to win at least one external grant.  Grant success rates are published: here they are for BBSRC, NERC and MRC.  I skim-read these statistics and the success rate for standard “response mode” grants seems to be somewhere between 10 and 25%.  However, bear in mind that this includes established professors and people with far better track records and reputations than new fellows have.   Conservatively, I would half those success rates for new PIs, taking your chances of success to between 5 and 12%.  What that means is you’re going to have to write somewhere between 8 and 20 grants just to win one.   I couldn’t find statistics for the UK, but the average age a researcher in the US gets their first R01 grant is 42.  Just take a moment and think about that.

It’s not all doom and gloom – there are “new investigator” schemes specifically designed for new PIs.  The statistics look better – success between 20-30% for NERC, and similar for BBSRC.   However, note the NERC grants are very small – £1.3M over 20 awards is an average of £65k per award, and that probably covers you for about 8 months at FEC costing rates.  The BBSRC new investigator awards have no upper limit, so there is a tiny speck of light at the end of the tunnel.  The statistics say that you will still need to write between 3 and 5 of these just to win one though.

What do grant applications look like?

I am most familiar with BBSRC, so what’s below may be specific to them, but I imagine other councils are similar.  Grant applications consist of the following documents:

  1. JeS form
  2. Case for support
  3. Justification of resources
  4. Pathways to impact
  5. Data management plan
  6. CV
  7. Diagrammatic workplan (optional)
  8. Letters of support (optional)

The JeS form is an online form containing several text sections: Objectives, Summary, Technical Summary, Academic Beneficiaries and Impact Statement.  I haven’t done a word count because they are PDFs, but that’s probably around 1000 words.

The case for support is the major description of the research project and stretches to at least 8 pages, depending on how much money you’re asking for.   Word counts for my last 4 are 4450, 4171, 3666, and 3830.

The JoR, DMP and PtI are relatively short, 1-2 pages, and mine are typically 300-500 words each, so let’s say 1000 words in total.

Therefore, each grant is going to need 6000 words (properly referenced, properly argued) over 5 documents.  They need to be coherent, they need to tell a story and they need to convince reviewers that this is something worth funding.

Given the success rates I mentioned above, there is every possibility that you need to write between 5 and 10 of these in any given period to be deemed a success.   In other words, for success, you’re going to need to write often, write quickly and write well.   Don’t come into academia if you don’t like writing.

(by the way, there is such a thing as a LoLa which stands for “longer, larger”.  These are, as you may guess, longer and larger grants – the last one I was involved in, the case for support was 24 pages and 15,400 words – about half a PhD thesis)

Failure is brutal

I’ll take you through a few of my failures so you can get a taste….

In 2013 the BBSRC put out a call for research projects in metagenomics.  We had been working on this since 2011, looking to discover novel enzymes of industrial relevance from metagenomics data.  What we found when we assembled such data was that we had lots of fragmented/incomplete genes.  I had a bunch of ideas about how to solve this problem, including  targeted assembly of specific genes, something we were not alone in thinking about.    Reviews were generally good (Exceptional, Excellent and Very Good scores), but we had one comment about the danger of mis-assemblies.  Now, I had an entire section in the proposal dealing with this, basically stating that we would use reads mapped back to the assembly to find and remove mis-assembled contigs.  This is a tried, tested, and established method for finding bad regions of assemblies, and we have used it very successfully in other circumstances.   Besides which, mis-assembled contigs in metagenomic assemblies are very rare, probably around 1-3%.  I explained all this and didn’t think anything of it.  Mis-assemblies really aren’t a problem, and we have a method for dealing with it anyway.

The grant was rejected.  I asked for feedback from the committee (which can take 3 months by the way, and is often just a few sentences).   The feedback was that we had a problem with mis-assemblies and we didn’t have a method for dealing with it.  Apparently, the method we proposed (a tried and tested method!) represented a “circular argument” i.e. using the same reads to both assembly and validation was wrong.   Anyone working in this area can see that argument doesn’t make sense.  So our grant was rejected, because of a problem that isn’t important, which we had a method for dealing with, by someone who demonstrated a complete lack of understanding of that problem.   Frustrating?  I had to take a long walk, believe me.

In 2015 I wrote a grant to the BBSRC FLIP scheme for a small amount of money (~£150k) to get various bioinformatics software tools and pipleines (e.g. BWA+GATK) working on Archer, the UK’s national supercomputer.   It’s a cray supercomputer, massively parallel but with relatively low RAM per core, and jobs absolutely limited to 48 hours.   The grant was rejected, with feedback containing such gems as “the PI is not a software developer” and “Roslin is not an appropriate place to do software development”.   It’s over a year ago and I am still angry.

The last LoLa I was involved in was the highest scoring LoLa from the committee that assessed it.  They fully expected it to be funded.  It wasn’t, killed at a higher level committee.  So even getting through review and committee approval, you can still lose out.  One of the reviewer’s comments was that better assembled and annotated animal genomes will only represent a “1% improvement” over SNP chips.   I can’t even….

Our Meta4 paper was initially rejected for being “just a bunch of Perl scripts”; our viRome paper similarly rejected for being “a collection of simple R functions”; our paper on identifying poor regions of the pig genome assembly got “it seems a bunch of bioinformaticians ran some algorithms with no understanding of biology”; whilst our poRe paper was initially rejected without review because it “contains no data” (at the time I knew the poretools paper was under review at the same journal and also contained no data).

What point am I trying to make?   That failure is common, criticism is brutal and often you will fail because of comments that are either incorrect, unfair or both.  And there is often no appeal.

Lack of success may mean lack of a job

It’s more and more common now for academic institutions to apply funding criteria when assessing the success of their PIs: there have been redundancies at the University of Birmingham, as expectations on grant income were set; staff at Warwick have been given financial targets; Dr Alison Hayman was sacked for not winning enough grants;  and the awful, tragic death of Stephen Grimm after Imperial set targets of £200k per annum.

To put that in context, the average size of a BBSRC grant is £240k.  So Imperial are asking their PIs to win approx. one RCUK grant per year.  Do the maths using the success rates I mention above.

Is the 80 hour week a myth?

Yes it is; but the 60 hour week is not.  You may have a family, mouths to feed, bills to pay, a mortgage.  To do all of that you need a job, and to keep that job you need to win grants.  Maybe you haven’t won one in a while.  Tell me, under those circumstances, how many hours are you working?

Working in academia (for me) is wonderful.  I absolutely love it and wouldn’t change it for anything else.  However, it’s also highly competitive and at times brutal.  There are nights I don’t sleep.  A few years ago, my dentist told me I had to stop grinding my teeth.

It’s a wonderful, wonderful job – but in the current system, believe me, it’s not for everyone.  I recommend you choose your career wisely.  You don’t need to work 80 hours a week, but you do need to succeed.

 

 

 

Your strongly correlated data is probably nonsense

Use of the Pearson correlation co-efficient is common in genomics and bioinformatics, which is OK as it goes (I have used it extensively myself), but it has some major drawbacks – the major one being that Pearson can produce large coefficients in the presence of very large measurements.

This is best shown via example in R:


# let's correlate some random data
g1 <- rnorm(50)
g2 <- rnorm(50)

cor(g1, g2)
# [1] -0.1486646

So we get a small, -ve correlation from correlating two sets of 50 random values. If we ran this 1000 times we would get a distribution around zero, as expected.

Let's add in a single, large value:


# let's correlate some random data with the addition of a single, large value
g1 <- c(g1, 10)
g2 <- c(g2, 11)
 
cor(g1, g2)
# [1] 0.6040776

Holy smokes, all of a sudden my random datasets are positively correlated with r>=0.6!

It's also significant.


> cor.test(g1,g2, method="pearson")

        Pearsons product-moment correlation

data:  g1 and g2
t = 5.3061, df = 49, p-value = 2.687e-06
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.3941015 0.7541199
sample estimates:
      cor 
0.6040776 

So if you have used Pearson in large datasets, you will almost certainly have some of these spurious correlations in your data.

How can you solve this? By using Spearman, of course:


> cor(g1, g2, method="spearman")
[1] -0.0961086
> cor.test(g1, g2, method="spearman")

        Spearmans rank correlation rho

data:  g1 and g2
S = 24224, p-value = 0.5012
alternative hypothesis: true rho is not equal to 0
sample estimates:
       rho 
-0.0961086 

Building a Kraken database with new FTP structure and no GI numbers

Progress is usually good, except when a reliable resource completely changes their data and file structures and mess up everything you have been trying to do.  The NCBI used to arrange bacterial genomes in such a very easy way to understand and download; now it’s a bit tougher.  So tough in fact that they had to create a new FAQ.

At the same time as moving around genome data, they also decided to retire GI numbers (thanks to Andreas for the link!).

This is a problem for Kraken, a metagenomic profiler, because it relies both on the old style genomic data structures and the GI numbers that come with them.  Below are my attempts to build a Kraken database, using the new FTP structure and no GIs.

Downloading the data

It is pretty easy to follow the FAQ above, but I have created some perl scripts here that should help.

You will need:

  • Perl
  • BioPerl
  • Git
  • A linux shell (bash) that has wget
  • Internet access
  • A folder that you have write access to
  • Maybe 40Gb of space

The magic in these scripts is that they download the data and then add the necessary “|kraken:taxid|xxxx” text that replaces GI number for building what Kraken refers to as “custom databases”

Note also that the script only downloads fasta for assemblies classed as “Complete Genome”.  Other options are:

  • Chromosome
  • Complete Genome
  • Contig
  • Scaffold

If you want these others then edit the scripts appropriately.


# clone the git repo
git clone https://github.com/mw55309/Kraken_db_install_scripts.git

# either put the scripts in your path or run them from that directory

# make download directory
mkdir downloads
cd downloads

# run for each branch of life you wish to download
perl download_bacteria.pl
perl download_archaea.pl
perl download_fungi.pl
perl download_protozoa.pl
perl download_viral.pl

# you should now have five directories, one for each branch

# build a new database 
# download taxonomy
kraken-build --download-taxonomy --db kraken_bvfpa_080416

# for each branch, add all fna in the directory to the database
for dir in fungi protozoa archaea viral bacteria; do
        for fna in `ls $dir/*.fna`; do
                kraken-build --add-to-library $fna --db kraken_bvfpa_080416
        done
done

# build the actual database
kraken-build --build --db kraken_bvfpa_080416


I haven’t tested the above database yet, but the output of the build script ends with:


Creating GI number to seqID map (step 4 of 6)...
GI number to seqID map created. [14.935s]
Creating seqID to taxID map (step 5 of 6)...
7576 sequences mapped to taxa. [1.188s]
Setting LCAs in database (step 6 of 6)...
Database LCAs set. [21m24.805s]
Database construction complete. [Total: 1h1m28.108s]

Which certainly makes it look like things have worked.

I’ll update the post with info from testing in due course.

 

Update 13/04/2016

I simulated some reads from a random set of fastas in the database, then ran Kraken, and it worked (plus/minus a few false positives)

kraken_test_png

We need to stop making this simple f*cking mistake

I’m not perfect.  Not in any way.  I am sure if anyone was so inclined, they could work their way through my research with clinical forensic attention-to-detail and uncover all sorts of mistakes.  The same will be true for any other scientist, I expect.  We’re human and we make mistakes.

However, there is one mistake in bioinformatics that is so common, and which has been around for so long, that it’s really annoying when it keeps happening:

It turns out the Carp genome is full of Illumina adapters.

One of the first things we teach people in our NGS courses is how to remove adapters.  It’s not hard – we use CutAdapt, but many other tools exist.   It’s simple, but really important – with De Bruijn graphs you will get paths through the graphs converging on kmers from adapters; and with OLC assemblers you will get spurious overlaps.  With gap-fillers, it’s possible to fill the gaps with sequences ending in adapters, and this may be what happened in the Carp genome.

Why then are we finding such elementary mistakes in such important papers?  Why aren’t reviewers picking up on this?  It’s frustrating.

This is a separate, but related issue, to genomic contamination – the Wheat genome has PhiX in it; tons of bacterial genomes do too; and lots of bacterial genes were problematically included in the Tardigrade genome and declared as horizontal gene transfer.

Genomic contamination can be hard to find, but sequence adapters are not.  Who isn’t adapter trimming in 2016?!

Fast parallel access to data within MinION FAST5 files

As readers of this blog will know, my group was one of the first groups to publish software for the MinION sequencer, and I have blogged a couple of times on how to extract data from FAST5 files using poRe.  Well BBSRC, my main funder, were kind enough to give us a grant to develop the software, and one thing we said we’d do is implement parallel processing.

Below is a summary of what we’ve done so far.

Parallel processing in R

One of the huge benefits of R, besides the amazing suite of statistical and visualization tools, and the thousands of add-on packages, is the fact that parallelisation is really easy.  In this context, what I mean by parallelisation is the use of multiple cores on your machine to run a single bit of code.  By using multiple cores, the code is speeded up, often linearly relative to the number of cores you use.

There are quite a lot of packages for R that help you run parallel code, including doParallel, foreach, snow etc. but we’re going to focus on the built in package, parallel.  To show the benefits of this, we have run various toolkits for the extraction of 2D FASTQ data from 25484 FAST5 files kindly shared by Nick Loman.

Benchmarking FASTQ extraction

Now, as Matt Loose has pointed out to me, focusing on FASTQ extraction is a little pointless as almost certainly this will be built in to future ONT base callers; however, it serves as a useful example and there are plenty of FAST5 files out there already that you may wish to extract FASTQ from.

The tools we compared are extract2D from poRe, poretools, fast5extract from c_fast5 (a simple C implementation) and extract2DParallel, a parallel version of extract2D (the parallel scripts are in the “misc” folder in GitHub).  We ran each tool 4 times on 1000, 5000, 10000, 15000, 20000, and 25000 randomly selected FAST5 files, clearing the cache completely between runs to ensure a fair comparison.

paratest

As can be seen, all tools perform linearly with the number of FAST5 files; poretools is the slowest, then extract2D, then c_fast5.  However, the best result is clearly from extract2DParallel (using 4 cores) which outperforms everything else by some distance.

Of course, this is not surprising, and doesn’t come without a cost – if you only have 4 cores then using all 4 will mean all other processes perform really slowly.  However, there are also benefits – if your machine has 8 or 24 cores, you can speed up processing even more.

The scripts on GitHub should be treated as an early release and have only been tested on one dataset, so I welcome your feedback!

Extracting FASTQ on Windows

We haven’t wrapped up the functionality into poRe yet, but we expect to in the next release.  Meanwhile for the impatient…


library(poRe)
library(parallel)

# find out how many cores you have
detectCores()
# [1] 4


I have 4 cores, and I probably don’t want to use all of them! I now need to redefine one of poRe’s internal functions so that it works with parallel:


printfastq <- function (f5, f5path = "/Analyses/Basecall_2D_000/BaseCalled_2D/", out="out.fastq")
{
    fid = tryCatch({
        H5Fopen(f5)
    }, warning = function(w) w, error = function(e) e)
    if (typeof(fid) == "S4" && H5Lexists(fid, f5path)) {
        did <- H5Dopen(fid, paste(f5path, "Fastq", sep = ""))
        fq <- (H5Dread(did))
        four <- strsplit(fq, "\n")[[1]]
        fq <- paste(paste(four[1], f5, sep = " "), four[2], four[3],
            four[4], sep = "\n")
        cat(fq, file = out, sep = "\n", fill = FALSE, append=TRUE)
        H5Dclose(did)
    }
    if (typeof(fid) == "S4") {
        H5Fclose(fid)
    }
}

Now let's create a vector of the fast5 files I want to extract:


my.dir <- "C:/MinION/FAST5/" # use your own directory here!

f5files <- dir(path = my.dir, pattern = "\\.fast5$", full.names = TRUE)

Here is the interesting part, where we set up our parallel "cluster", which is then used to execute the code. Each core within the cluster should be treated as a separate instance of R, so we need to load libraries and export variables to the cluster:


# use two cores
cl <- makeCluster(2)
 
# load poRe on the cluster
c <- clusterEvalQ(cl, suppressWarnings(suppressMessages(library(poRe))))

# export the printfastq function to the cluster
clusterExport(cl, "printfastq")

# run the code and stop the cluster
l <- parApply(cl, array(f5files), 1, printfastq, out="myfastq.fastq")
stopCluster(cl)

Extract metadata

Absolutely one of the slowest aspects of MinION data analysis and QC is extracting metadata, but we can also parallelise this!  For those of you who like the Linux command line, there is an extractMetaParallel script in the GitHub repo (again - use with caution and feedback!), but for those of you who want to use this within R:



library(poRe)
library(parallel)
# find out how many cores you have
detectCores()
# [1] 4

# get a list of files to extract data from
my.dir <- "C:/MinION/FAST5/" # use your own directory here!
f5files <- dir(path = my.dir, pattern = "\\.fast5$", full.names = TRUE)

# use two cores
cl <- makeCluster(2)
 
# load poRe on the cluster
c <- clusterEvalQ(cl, suppressWarnings(suppressMessages(library(poRe))))

# variables to set paths to template, complement and 2D data
# NOTE you will need to change these for the new FAST5 format
path.t <- "/Analyses/Basecall_2D_000/"
path.c <- path.t
path2d <- path.t

# run the code and stop the cluster
l <- parLapply(cl, 
          f5files, 
          get_meta_info, path2d, path.t, path.c)
stopCluster(cl)

# convert the resulting vector into a matrix with the correct column names
m <- matrix(unlist(l), ncol=12, byrow=TRUE)
colnames(m) <- c("filename","channel_num","read_num","read_start_time","status","tlen","clen","len2d","run_id","read_id","barcode","exp_start") 

Summary

Parallelisation is so easy within R it's almost criminal not to do it. We can see that just a bit of parallelisation speeds up the code incredibly - and we can also see that poRe is a pretty fast toolkit when used in the correct way, which it wasn't in this paper.

NOTE I am fairly new to benchmarking and parallelisation, if I have done anything wrong, I apologize - and please get in touch!

Shots fired – nanopore wars part II

For the record, and to try and avoid accusations of bias, I am a fan of both Illumina and Oxford Nanopore.  I have won grants to purchase, and I have encouraged our facility to purchase, millions of pounds worth of Illumina equipment; my lab has spent well over £100,000 on Illumina sequencing.   So there’s that.  Yet I am also a massive fan boy of Oxford Nanopore’s technology – anyone who reads my blog or my Twitter feed will realise that I see nanopore technology as the future and often get incredibly excited by the revolutionary nature of the MinION mobile sequencer.

So I am incredibly disappointed in this – Illumina Sues Oxford Nanopore for Patent Infringement.   It’s like having two of your friends fighting.

There are fractious relationships in genomics of course – that between Illumina and BGI perhaps most famous – and I have written before about the relationship between Illumina and ONT, and how they entered into arbitration in 2011/12.

Illumina suing ONT is the latest act in the “war” between the two companies and comes after Illimina licensed technology from the Gundlach lab in 2013.  Key for me is what Illumina’s intentions were when they did this:

  1. Illumina wanted to develop nanopore technology, and they now have something close to market.  This situation would put Illumina in the best light.  In this case suing ONT is probably something they have to do to try and create space in the market for their technology.  Not great for ONT, but it seems this is how business is done.
  2. Illumina wanted to develop nanopore technology, they’ve tried and failed.  This makes Illumina look worse – if they don’t have a competing technology, it’s a bit spiteful to try and spike a competitor anyway, though certainly not uncommon.  And in this situation at least they tried to develop tech.
  3. Illumina never wanted to develop nanopore technology, and they bought the Gundlach patents just to kill ONT.  Again not entirely uncommon, but this is essentially patent trolling and really doesn’t make Illumina look good at all.  This would be a business driven decision to the detriment of science and the detriment of humanity.  Not good.

I am guessing the only way we will ever know which of these is true is if Illumina eventually release nanopore technology – we’d then know/suspect (1) is true.

The timing of this really stinks too – given ONT are set to float on the stock market, it’s hard not to see this as a deliberate attempt to scupper that.

My view is that ONT have a pretty strong position – my understanding is that they have patents, stretching back many years, that cover most or all of their instruments.  Some of this may come down to the pore – Gundlach’s patents cover MspA; and Hagan Bayley’s work, on which ONT was originally based, involved alpha-hemolysin, a different pore.  Though we don’t know if ONT switched pores.  In addition to all of that, the pores in ONTs technologies have been engineered, mutated and developed a huge amount – so when does a pore stop being the original molecule and become something else?  Do the patents cover this?

These are very interesting times, but this is not a welcome development.

On preprints, open access and generational change

A bunch of things are happening/happened recently that are all tied together in my head so I thought writing some of these things down would be useful (for me at least!).  The “things” are:

Let me try and get this straight 😉

Generational change

Generational change is both inevitable and necessary.  Each new generation comes along, takes a look at the system, identifies problems with that system, and takes measures to fix those problems.  I don’t mean just in science, I mean across life in general – a good example might be our treatment of the environment.  Twenty years ago, no-one cared about the environment; in twenty years time pretty much everyone will care.  This is generational change in action, and often it has to involve the disruption of existing power structures.

The problem with disruption of power structures is that those in power don’t like it; they want to hold on to those structures, because they are the source of that power.  However, this only serves to slow down progress – change is inevitable, and the best thing those in power could do is to get out of the way and help enable the change to happen.   However, crucially, they cannot own that change; those in power cannot and should not take credit for it.  It doesn’t belong to them – the old system is the one that belonged to them, they reaped the benefits.  The new system belongs to and should be driven by the next generation.

This is important – it is important that we empower our younger generations, rather than taking their ideas and pretending they are our own.

Blogs and social media as the democratisation of opinions and power

Let me paint you a picture.  A young graduate says to an established professor “Hey, I love science and I want to be a researcher.  I have some great ideas about research, but I also want to influence how research is done.  How do I get in to it?”.  The answer is simple.  “First you need to do a PhD, which may mean you are effectively used as cheap labour to carry out some of your supervisor’s ideas that they couldn’t get funding to do elsewhere.  After four years, you will need to get a post-doc in a good lab, and probably 90% of people will drop out at that stage.  As soon as you are a post-doc be sure to publish in high impact journals (Nature/Science/Cell etc) because you will need those to get a second post-doc or fellowship – though you won’t have much influence on where you publish as your PI will decide that.   To be a PI/group leader you will need to apply for and win one of a very few, highly competitive fellowships.  Finally, as a fellow, you will be given a small budget and have the chance to explore ideas of your own.  You will have 5 years to prove you can ‘cut it’ as a PI i.e. win a grant.  If you win enough grants as a fellow, you can be a junior group leader.  However, this is not a secure position – you will have to constantly publish and win grants for many years before finally you will be given tenure.  Then people might start listening to you – you may get to be an expert on grant panels, you may get to have some tiny influence on strategy and the type of science that gets funded.  You’ll probably be at least 50 by then”

Are we surprised that young people might take one look at that and say “Fuck that” ?

Jingmai O’Connor’s assertion that critiques of published papers should only happen via similarly published papers means that probably 90% of the scientific work force would be unable to critique her work, because only PIs get to decide what gets published and when by their research groups.

Is anyone else looking at this system and thinking “the young have no voice”?  Is it any wonder that the next generation have taken to blogs and social media to find that voice?

Don’t get me wrong, blogs and social media are still biased – if a Nobel winner starts a blog, it’ll be read way more than if a PhD student starts one – but they are still a far more level playing field than the academic system – because if someone starts a blog or joins social media, by-and-large, if they say something interesting, engaging or useful, they will build up a following and they will become known, they will become influential in their own way, and this is an incredibly good thing.   It is a form of empowerment of the younger generation in a system that almost completely lacks it.

Anything that improves access to research outputs is a good thing

I must say I have the utmost respect for Mike Eisen.  He has been passionate about open access from the start, and now he is passionate about preprints.  You will find no criticism of him here.  I am 100% an open access advocate, and I believe preprints are an excellent idea.

However, Andy Fraser makes an excellent point:

As soon as established, superstar scientists adopt something, the story is no longer the story, the story is the superstar.  Take a look here:

This is the very same Randy Schekman who published countless papers in pay-walled glamour mags, but then started telling everyone to publish open access.  Well, the open access movement isn’t about Randy.

Is it a good thing that Novel laureates are putting out preprints and supporting open access?  Of course it is!

Does it annoy me that they are getting tons of credit and attention for something (open access) that I and others have been doing our entire careers?  Of course it does.  It annoys the shit out of me.  Because the story of revolution in academic publishing doesn’t belong to the guys who made the old system and then changed their minds; the story belongs to the people who made the new system – the Mike Eisens of the open access world and the countless PIs, post-docs and students who have never been anything other than open.

I am glad some established professors are changing their mind, but the credit and attention for the OA movement has to go to those who’ve committed their entire career to open access.

“They can’t win”

The obvious response to Andy’s tweet is that the established professors can’t win; they are damned if they do, damned if they don’t.  It’s an argument Mike made too:

I see the argument, I really do, but the point is that the established professors have already won.  They have tenure, they have funding, they have established reputations.  Don’t say they can’t win, because they already did win.

Of course it’s great that the establishment are embracing open access, and preprints, but somehow they need to make the story not about them.  They need to make the story about the people who drove the change – perhaps it was a student or post-doc who persuaded them to put up a preprint, or to submit to an OA journal.  If that’s the case, make the story about the student/post-doc.   Perhaps they just had an epiphany?  Well if that’s the case, a bit of humility wouldn’t go amiss.  Don’t ride in on your white horse and take all the credit for winning the war; instead, fall on your sword and apologize that you once fought for the other side.

The revolution in academic publishing isn’t about established professors, it’s about generational change and empowerment of a new generation of scientists.  Let’s not lose sight of that.  And let’s not take something away from the younger generation who have so little to begin with.

« Older posts

© 2016 Opiniomics

Theme by Anders NorenUp ↑