Opiniomics

bioinformatics, genomes, biology etc. "I don't mean to sound angry and cynical, but I am, so that's how it comes across"

You don’t need to work 80 hours a week in academia…. but you do need to succeed

I’ve been thinking a lot lately about academic careers, chiefly because I happen to be involved in some way with three fellowship applications at the moment.  For those of you unfamiliar with the academic system at work here, the process is: PhD -> Post-Doc -> Fellowship -> PI (group leader) -> Professorship (Chair).  So getting a fellowship is that crucial jump from post-doc to PI and represents a person’s first chance to drive their own research programme.  Sounds grand doesn’t it?  “Drive your own research programme”.  Wow.  Who wouldn’t want to do that?

Well, be careful what you wish for.  Seriously.  I love my job; I love science and I love computers and I get paid to do both.  It’s amazing and possibly (for me) one of the best jobs in the world.  However, it comes with huge pressures; job insecurity; unparalleled and relentless criticism; failures, both of your ideas and your experiments; and occasionally the necessity of working with and for people who act like awful human beings.  It also requires a lot of hard work, and even then, that isn’t enough.  This THE article states very clearly and eloquently that very few people actually work an 80 hour week in academia, and you do not need to in order to succeed.   I would tentatively agree, though I have pointed out some of the things you need to do to succeed in the UK system, and one of them is working hard.

It’s true, you don’t need to work 80 hours a week in academia…. but you do need to succeed.

What does success look like?

Unfortunately, science lends itself very well to metrics: number of papers published; amount of money won in external grant funding; number of PhD students trained; feedback scores from students you teach; citation indices; journal indices.  And probably many more.  All of these count, I’m sorry but they do.  We may wish for a better world, but we don’t yet live in one, so believe me – these numbers count.

To succeed as a PI, even a new one such as a fellow, you will need to win at least one external grant.  Grant success rates are published: here they are for BBSRC, NERC and MRC.  I skim-read these statistics and the success rate for standard “response mode” grants seems to be somewhere between 10 and 25%.  However, bear in mind that this includes established professors and people with far better track records and reputations than new fellows have.  Conservatively, I would halve those success rates for new PIs, taking your chances of success to between 5 and 12%.  What that means is you’re going to have to write somewhere between 8 and 20 grants just to win one.  I couldn’t find statistics for the UK, but the average age at which a researcher in the US gets their first R01 grant is 42.  Just take a moment and think about that.

It’s not all doom and gloom – there are “new investigator” schemes specifically designed for new PIs.  The statistics look better – success between 20-30% for NERC, and similar for BBSRC.   However, note the NERC grants are very small – £1.3M over 20 awards is an average of £65k per award, and that probably covers you for about 8 months at FEC costing rates.  The BBSRC new investigator awards have no upper limit, so there is a tiny speck of light at the end of the tunnel.  The statistics say that you will still need to write between 3 and 5 of these just to win one though.
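For anyone who wants to check the arithmetic, here is a rough sketch of where "8 to 20" and "3 to 5" come from.  It assumes every application is an independent attempt with the same success rate, which is a simplification of mine and not something the funders publish; the function names are made up for illustration.

```python
# Back-of-envelope grant arithmetic, assuming independent applications
# with a fixed success rate (a simplifying assumption, not a funder statistic).

def expected_applications(success_rate):
    """Average number of applications written per grant won (1 / p)."""
    return 1.0 / success_rate

def chance_of_at_least_one(success_rate, n_applications):
    """Probability that at least one of n independent applications succeeds."""
    return 1.0 - (1.0 - success_rate) ** n_applications

# Response mode: 10-25% overall, halved for new PIs as in the text.
new_pi_low, new_pi_high = 0.10 / 2, 0.25 / 2
print(expected_applications(new_pi_low), expected_applications(new_pi_high))

# New investigator schemes: 20-30%.
print(expected_applications(0.20), expected_applications(0.30))

# NERC new investigator awards: £1.3M shared over 20 awards.
average_nerc_award = 1_300_000 / 20
print(average_nerc_award)

# Even writing the "expected" 20 applications at 5% is no guarantee of a win.
print(chance_of_at_least_one(new_pi_low, 20))
```

Note that under these assumptions the expected number of applications is only an average: writing that many still leaves a real chance of winning nothing.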

What do grant applications look like?

I am most familiar with BBSRC, so what’s below may be specific to them, but I imagine other councils are similar.  Grant applications consist of the following documents:

  1. JeS form
  2. Case for support
  3. Justification of resources
  4. Pathways to impact
  5. Data management plan
  6. CV
  7. Diagrammatic workplan (optional)
  8. Letters of support (optional)

The JeS form is an online form containing several text sections: Objectives, Summary, Technical Summary, Academic Beneficiaries and Impact Statement.  I haven’t done a word count because they are PDFs, but that’s probably around 1000 words.

The case for support is the major description of the research project and stretches to at least 8 pages, depending on how much money you’re asking for.   Word counts for my last 4 are 4450, 4171, 3666, and 3830.

The JoR, DMP and PtI are relatively short, 1-2 pages, and mine are typically 300-500 words each, so let’s say 1000 words in total.

Therefore, each grant is going to need 6000 words (properly referenced, properly argued) over 5 documents.  They need to be coherent, they need to tell a story and they need to convince reviewers that this is something worth funding.

Given the success rates I mentioned above, there is every possibility that you need to write between 5 and 10 of these in any given period to be deemed a success.   In other words, for success, you’re going to need to write often, write quickly and write well.   Don’t come into academia if you don’t like writing.

(by the way, there is such a thing as a LoLa which stands for “longer, larger”.  These are, as you may guess, longer and larger grants – the last one I was involved in, the case for support was 24 pages and 15,400 words – about half a PhD thesis)

Failure is brutal

I’ll take you through a few of my failures so you can get a taste….

In 2013 the BBSRC put out a call for research projects in metagenomics.  We had been working on this since 2011, looking to discover novel enzymes of industrial relevance from metagenomics data.  What we found when we assembled such data was that we had lots of fragmented/incomplete genes.  I had a bunch of ideas about how to solve this problem, including  targeted assembly of specific genes, something we were not alone in thinking about.    Reviews were generally good (Exceptional, Excellent and Very Good scores), but we had one comment about the danger of mis-assemblies.  Now, I had an entire section in the proposal dealing with this, basically stating that we would use reads mapped back to the assembly to find and remove mis-assembled contigs.  This is a tried, tested, and established method for finding bad regions of assemblies, and we have used it very successfully in other circumstances.   Besides which, mis-assembled contigs in metagenomic assemblies are very rare, probably around 1-3%.  I explained all this and didn’t think anything of it.  Mis-assemblies really aren’t a problem, and we have a method for dealing with it anyway.

The grant was rejected.  I asked for feedback from the committee (which can take 3 months by the way, and is often just a few sentences).  The feedback was that we had a problem with mis-assemblies and we didn’t have a method for dealing with it.  Apparently, the method we proposed (a tried and tested method!) represented a “circular argument” i.e. using the same reads for both assembly and validation was wrong.  Anyone working in this area can see that argument doesn’t make sense.  So our grant was rejected, because of a problem that isn’t important, which we had a method for dealing with, by someone who demonstrated a complete lack of understanding of that problem.  Frustrating?  I had to take a long walk, believe me.

In 2015 I wrote a grant to the BBSRC FLIP scheme for a small amount of money (~£150k) to get various bioinformatics software tools and pipelines (e.g. BWA+GATK) working on Archer, the UK’s national supercomputer.  It’s a Cray supercomputer, massively parallel but with relatively low RAM per core, and jobs absolutely limited to 48 hours.  The grant was rejected, with feedback containing such gems as “the PI is not a software developer” and “Roslin is not an appropriate place to do software development”.  It’s over a year ago and I am still angry.

The last LoLa I was involved in was the highest scoring LoLa from the committee that assessed it.  They fully expected it to be funded.  It wasn’t, killed at a higher level committee.  So even getting through review and committee approval, you can still lose out.  One of the reviewer’s comments was that better assembled and annotated animal genomes will only represent a “1% improvement” over SNP chips.   I can’t even….

Our Meta4 paper was initially rejected for being “just a bunch of Perl scripts”; our viRome paper similarly rejected for being “a collection of simple R functions”; our paper on identifying poor regions of the pig genome assembly got “it seems a bunch of bioinformaticians ran some algorithms with no understanding of biology”; whilst our poRe paper was initially rejected without review because it “contains no data” (at the time I knew the poretools paper was under review at the same journal and also contained no data).

What point am I trying to make?   That failure is common, criticism is brutal and often you will fail because of comments that are either incorrect, unfair or both.  And there is often no appeal.

Lack of success may mean lack of a job

It’s more and more common now for academic institutions to apply funding criteria when assessing the success of their PIs: there have been redundancies at the University of Birmingham, as expectations on grant income were set; staff at Warwick have been given financial targets; Dr Alison Hayman was sacked for not winning enough grants; and the awful, tragic death of Stefan Grimm after Imperial set targets of £200k per annum.

To put that in context, the average size of a BBSRC grant is £240k.  So Imperial are asking their PIs to win approx. one RCUK grant per year.  Do the maths using the success rates I mention above.
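Here is one way to do that maths, as a sketch.  It reuses the halved response-mode success rates from earlier (5% and 12.5%) and assumes each application is worth the average grant size; both are simplifications for illustration.

```python
# How many applications per year does a £200k/year income target imply?
TARGET_INCOME_PER_YEAR = 200_000   # Imperial's target, from the text
AVERAGE_GRANT = 240_000            # average BBSRC grant, from the text

grants_per_year = TARGET_INCOME_PER_YEAR / AVERAGE_GRANT

def applications_per_year(success_rate):
    """Applications needed per year, on average, to hit the income target."""
    return grants_per_year / success_rate

for rate in (0.05, 0.125):
    print(f"success rate {rate:.1%}: "
          f"{applications_per_year(rate):.1f} applications per year")
```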

Is the 80 hour week a myth?

Yes it is; but the 60 hour week is not.  You may have a family, mouths to feed, bills to pay, a mortgage.  To do all of that you need a job, and to keep that job you need to win grants.  Maybe you haven’t won one in a while.  Tell me, under those circumstances, how many hours are you working?

Working in academia (for me) is wonderful.  I absolutely love it and wouldn’t change it for anything else.  However, it’s also highly competitive and at times brutal.  There are nights I don’t sleep.  A few years ago, my dentist told me I had to stop grinding my teeth.

It’s a wonderful, wonderful job – but in the current system, believe me, it’s not for everyone.  I recommend you choose your career wisely.  You don’t need to work 80 hours a week, but you do need to succeed.

We need to stop making this simple f*cking mistake

I’m not perfect.  Not in any way.  I am sure if anyone was so inclined, they could work their way through my research with clinical forensic attention-to-detail and uncover all sorts of mistakes.  The same will be true for any other scientist, I expect.  We’re human and we make mistakes.

However, there is one mistake in bioinformatics that is so common, and which has been around for so long, that it’s really annoying when it keeps happening:

It turns out the Carp genome is full of Illumina adapters.

One of the first things we teach people in our NGS courses is how to remove adapters.  It’s not hard – we use CutAdapt, but many other tools exist.   It’s simple, but really important – with De Bruijn graphs you will get paths through the graphs converging on kmers from adapters; and with OLC assemblers you will get spurious overlaps.  With gap-fillers, it’s possible to fill the gaps with sequences ending in adapters, and this may be what happened in the Carp genome.
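To show how little is involved, here is a toy sketch of 3' adapter trimming.  This is not how CutAdapt works internally (real trimmers tolerate sequencing errors, and this one only does exact matching), and the adapter string is the commonly quoted start of the Illumina adapter, included here as an assumption; check the right sequence for your own library prep.

```python
# Toy 3' adapter trimmer: exact matching only, for illustration.
# ADAPTER is an assumed example sequence; use the one for your library kit.
ADAPTER = "AGATCGGAAGAGC"

def trim_adapter(read, adapter=ADAPTER, min_overlap=3):
    """Remove the adapter and everything after it from the 3' end of a read."""
    # Case 1: the whole adapter occurs somewhere in the read.
    idx = read.find(adapter)
    if idx != -1:
        return read[:idx]
    # Case 2: the read ends part-way through the adapter, so look for the
    # longest adapter prefix (at least min_overlap bases) at the read's end.
    for k in range(min(len(adapter), len(read)) - 1, min_overlap - 1, -1):
        if read.endswith(adapter[:k]):
            return read[:-k]
    # No adapter found: return the read unchanged.
    return read

insert = "ACGTACGTTTGACC"
print(trim_adapter(insert + ADAPTER + "ACAC"))  # full adapter plus trailing bases
print(trim_adapter(insert + ADAPTER[:5]))       # partial adapter at the 3' end
print(trim_adapter("ACGTACGT"))                 # nothing to trim
```

In practice, use an established tool rather than anything like the above; the point is only that the check is cheap.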

Why then are we finding such elementary mistakes in such important papers?  Why aren’t reviewers picking up on this?  It’s frustrating.

This is a separate but related issue to genomic contamination – the Wheat genome has PhiX in it; tons of bacterial genomes do too; and lots of bacterial genes were problematically included in the Tardigrade genome and declared as horizontal gene transfer.

Genomic contamination can be hard to find, but sequence adapters are not.  Who isn’t adapter trimming in 2016?!

On preprints, open access and generational change

A bunch of things are happening/happened recently that are all tied together in my head so I thought writing some of these things down would be useful (for me at least!).  The “things” are:

Let me try and get this straight 😉

Generational change

Generational change is both inevitable and necessary.  Each new generation comes along, takes a look at the system, identifies problems with that system, and takes measures to fix those problems.  I don’t mean just in science, I mean across life in general – a good example might be our treatment of the environment.  Twenty years ago, no-one cared about the environment; in twenty years’ time pretty much everyone will care.  This is generational change in action, and often it has to involve the disruption of existing power structures.

The problem with disruption of power structures is that those in power don’t like it; they want to hold on to those structures, because they are the source of that power.  However, this only serves to slow down progress – change is inevitable, and the best thing those in power could do is to get out of the way and help enable the change to happen.   However, crucially, they cannot own that change; those in power cannot and should not take credit for it.  It doesn’t belong to them – the old system is the one that belonged to them, they reaped the benefits.  The new system belongs to and should be driven by the next generation.

This is important – it is important that we empower our younger generations, rather than taking their ideas and pretending they are our own.

Blogs and social media as the democratisation of opinions and power

Let me paint you a picture.  A young graduate says to an established professor “Hey, I love science and I want to be a researcher.  I have some great ideas about research, but I also want to influence how research is done.  How do I get into it?”.  The answer is simple.  “First you need to do a PhD, which may mean you are effectively used as cheap labour to carry out some of your supervisor’s ideas that they couldn’t get funding to do elsewhere.  After four years, you will need to get a post-doc in a good lab, and probably 90% of people will drop out at that stage.  As soon as you are a post-doc be sure to publish in high impact journals (Nature/Science/Cell etc) because you will need those to get a second post-doc or fellowship – though you won’t have much influence on where you publish as your PI will decide that.  To be a PI/group leader you will need to apply for and win one of a very few, highly competitive fellowships.  Finally, as a fellow, you will be given a small budget and have the chance to explore ideas of your own.  You will have 5 years to prove you can ‘cut it’ as a PI i.e. win a grant.  If you win enough grants as a fellow, you can be a junior group leader.  However, this is not a secure position – you will have to constantly publish and win grants for many years before finally you will be given tenure.  Then people might start listening to you – you may get to be an expert on grant panels, you may get to have some tiny influence on strategy and the type of science that gets funded.  You’ll probably be at least 50 by then.”

Are we surprised that young people might take one look at that and say “Fuck that” ?

Jingmai O’Connor’s assertion that critiques of published papers should only happen via similarly published papers means that probably 90% of the scientific work force would be unable to critique her work, because only PIs get to decide what gets published and when by their research groups.

Is anyone else looking at this system and thinking “the young have no voice”?  Is it any wonder that the next generation have taken to blogs and social media to find that voice?

Don’t get me wrong, blogs and social media are still biased – if a Nobel winner starts a blog, it’ll be read way more than if a PhD student starts one – but they are still a far more level playing field than the academic system – because if someone starts a blog or joins social media, by-and-large, if they say something interesting, engaging or useful, they will build up a following and they will become known, they will become influential in their own way, and this is an incredibly good thing.   It is a form of empowerment of the younger generation in a system that almost completely lacks it.

Anything that improves access to research outputs is a good thing

I must say I have the utmost respect for Mike Eisen.  He has been passionate about open access from the start, and now he is passionate about preprints.  You will find no criticism of him here.  I am 100% an open access advocate, and I believe preprints are an excellent idea.

However, Andy Fraser makes an excellent point:

As soon as established, superstar scientists adopt something, the story is no longer the story, the story is the superstar.  Take a look here:

This is the very same Randy Schekman who published countless papers in pay-walled glamour mags, but then started telling everyone to publish open access.  Well, the open access movement isn’t about Randy.

Is it a good thing that Nobel laureates are putting out preprints and supporting open access?  Of course it is!

Does it annoy me that they are getting tons of credit and attention for something (open access) that I and others have been doing our entire careers?  Of course it does.  It annoys the shit out of me.  Because the story of revolution in academic publishing doesn’t belong to the guys who made the old system and then changed their minds; the story belongs to the people who made the new system – the Mike Eisens of the open access world and the countless PIs, post-docs and students who have never been anything other than open.

I am glad some established professors are changing their mind, but the credit and attention for the OA movement has to go to those who’ve committed their entire career to open access.

“They can’t win”

The obvious response to Andy’s tweet is that the established professors can’t win; they are damned if they do, damned if they don’t.  It’s an argument Mike made too:

I see the argument, I really do, but the point is that the established professors have already won.  They have tenure, they have funding, they have established reputations.  Don’t say they can’t win, because they already did win.

Of course it’s great that the establishment are embracing open access, and preprints, but somehow they need to make the story not about them.  They need to make the story about the people who drove the change – perhaps it was a student or post-doc who persuaded them to put up a preprint, or to submit to an OA journal.  If that’s the case, make the story about the student/post-doc.   Perhaps they just had an epiphany?  Well if that’s the case, a bit of humility wouldn’t go amiss.  Don’t ride in on your white horse and take all the credit for winning the war; instead, fall on your sword and apologize that you once fought for the other side.

The revolution in academic publishing isn’t about established professors, it’s about generational change and empowerment of a new generation of scientists.  Let’s not lose sight of that.  And let’s not take something away from the younger generation who have so little to begin with.

Did you notice the paradigm shift in DNA sequencing last week?

There is a very funny tweet by Matthew Hankins about the use of “paradigm shift” in articles in Scopus between 1962-2014, which clearly suggests we over-use the term:

However, last week there was a genuine shift in DNA sequencing, published on bioRxiv by Matt Loose, called “Real time selective sequencing using nanopore technology”.  So what makes this paper special?  Why is this a genuine paradigm shift?

Well, this is the first example, ever, of a DNA sequencer selecting regions of the input genome to sequence.   To be more accurate, Matt demonstrates the MinION sequencing DNA molecules and analyzing them in real time; testing whether they come from a region of the genome he is interested in; and if they are not, rejecting them (“spitting them out”) and making the nanopore available for the next read.

This has huge applications from human health through to pathogen genomics and microbiome research.  There are many applications in genomics where you are only interested in a subset of the DNA molecules in your sample.  For example, exome sequencing, or other sequence capture experiments – at the moment a separate capture reaction needs to take place and this can actually be more expensive than the sequencing.   Or pathogen diagnostics – here most of what you sequence would be host DNA, which you could ask the sequencer to reject, leaving only the pathogen to sequence; or flip it round – human saliva samples can have up to 40% oral microbiome DNA, which is wasted if what you want to sequence is the host.  I have worked with piRNA and siRNA datasets from insects where less than 1% of the Illumina reads are those I actually want.  If you want to know if a particular bacterial infection is resistant to antibiotics, you only actually need to sequence the antibiotic-resistance genes/islands/plasmids, not the entire genome. etc etc etc

Of course, we still have a bit of a way to go – Matt demonstrates so-called “read until” on Lambda, a relatively small, simple genome – but this is an amazing proof-of-principle of an incredible idea – that you can tell your sequencer what to sequence.  The paper deserves your attention.  Please read it and fire up your imagination!
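To make the idea concrete, here is a toy sketch of the accept/reject decision described above: look at the first part of a read, decide whether it comes from a region of interest, and eject it if not.  To be clear, this is not the method used in the paper or on the MinION; matching k-mers in base-space is swapped in purely for illustration, and every name, the value of k and the threshold are hypothetical.

```python
# Toy "read until" decision: keep sequencing a molecule only if the start of
# the read looks like it comes from a target region. Illustrative only.

def kmers(seq, k):
    """All k-mers in a sequence, as a set."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def keep_sequencing(read_prefix, target_kmers, k=8, min_fraction=0.5):
    """Return True to carry on reading, False to reject the molecule."""
    observed = kmers(read_prefix, k)
    if not observed:
        return True  # too short to judge yet, so keep reading
    hits = sum(1 for kmer in observed if kmer in target_kmers)
    return hits / len(observed) >= min_fraction

target_region = "ACGTTGCAAGGCTTAACCGGTTAGCATCGA"   # made-up target sequence
target_index = kmers(target_region, 8)

on_target = target_region[5:20]
off_target = "TTTTTTTTTTTTTTTT"

print(keep_sequencing(on_target, target_index))    # read from the target
print(keep_sequencing(off_target, target_index))   # read from elsewhere
```

The design point is that the decision has to be made on a partial read, quickly enough that rejecting the molecule still saves pore time.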

Doing the maths on PromethION throughput

In case you don’t know, the PromethION is the bigger brother of the MinION, Oxford Nanopore’s revolutionary mobile single-molecule DNA sequencer.  The PromethION is about to go into early access, where I believe the access fee is just $75k, most of which can subsequently be spent on reagents/flowcells in the ONT shop.  The hidden costs are of course the bioinformatics, but that’s not what this post is about.

What I wanted to do is predict PromethION throughput using current MinION stats.  So let’s do that.

I will be deliberately conservative with the MinION throughput stats.  It’s a matter of record, I think, that a MinION 48hr run is easily capable of producing 100-200Mbase of 2D reads, most of which are ~8Kb in length.  I’m going to use this figure as a basic MinION throughput, though many will argue that SQK-MAP-006 runs produce far more.

I want you to focus on three parameters: the rate at which DNA translocates through the pores, currently ~70b/s on MinION; the number of channels per flowcell (512 on MinION); and the number of flowcells per run (1 on MinION).   So 1 flowcell with 512 channels running at 70b/s will produce 100-200Mbase of high quality 2D sequence data.

PromethION flowcells are 4 times the size of MinION flowcells, with 2048 channels.  You can also run 48 flowcells in parallel.  Therefore, if all PromethION ends up being is 48 large MinIONs running in parallel, we can predict 4 * 48 * 100-200Mbase == between 19.2 and 38.4Gbase per run.

However, let’s go back to that translocation speed.   I believe the aim is for the PromethION to run at 400b/s.  There is a huge caveat here in that we haven’t seen any data from ONT running at 400b/s; however, logically the faster the DNA moves through the pore, the more sequence data you will get.  The potential is therefore for PromethION to produce 5 times the above – so between 96Gbase and 192Gbase of high quality 2D reads (note – many more template and complement reads will be produced). (2nd note – there is no guarantee the 2D protocol will be around long term, the aim is to produce 1D data at a  much higher quality, which will also increase throughput)

These are obviously “back of an envelope” calculations, and almost certainly wrong in some respect, but hopefully this puts the PromethION in context with the other sequencers on the market.
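The same envelope, written out so you can swap in your own numbers.  The factor of 5 for translocation speed is the rounding used above (400/70 is nearer 5.7), and the variable names are mine.

```python
# Back-of-envelope PromethION throughput, scaled from MinION figures in the text.
MINION_2D_MBASE = (100, 200)   # conservative 48hr MinION 2D yield: low, high
MINION_CHANNELS = 512
PROMETHION_CHANNELS = 2048
FLOWCELLS_PER_RUN = 48
SPEED_FACTOR = 5               # the post's rounding of 400 b/s versus ~70 b/s

CHANNEL_FACTOR = PROMETHION_CHANNELS // MINION_CHANNELS  # 4x the channels

def promethion_mbase(minion_mbase, speed_factor=1):
    """Scale one MinION run up to a full PromethION run, in Mbase."""
    return minion_mbase * CHANNEL_FACTOR * FLOWCELLS_PER_RUN * speed_factor

same_speed = tuple(promethion_mbase(m) for m in MINION_2D_MBASE)
faster = tuple(promethion_mbase(m, SPEED_FACTOR) for m in MINION_2D_MBASE)

print("48 big MinIONs:", [m / 1000 for m in same_speed], "Gbase")
print("at 400 b/s:    ", [m / 1000 for m in faster], "Gbase")
```

Changing `PROMETHION_CHANNELS` or `MINION_2D_MBASE` lets you redo the estimate when the input figures are revised.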

Update 1: via Nick Loman, the F1000 data was SQK-MAP-005 and ran at 30b/s, not 70

Update 2: via Matt Loose, PromethION flowcells are 3000 channels not 2048

Update 3: via Nick Loman, PoreCamp (SQK-MAP-006) run produced 460Mbase 2D data

* bad maths is a product of lack of sleep

** the blog is called opiniomics for a reason

Dear NEJM…

Dear NEJM

After your terribly judged piece on research parasites, it was inevitable that you would have to apologize, though a retraction would have been better.

Here’s the thing – you came so close!  You came so close to making your apology meaningful!  Paragraphs one and three are just perfect and I cannot find a single thing I disagree with.  Those paragraphs are great.  The one in the middle?  Not so great…

Here is a reminder of what you said:

In the process of formulating our policy, we spoke to clinical trialists around the world. Many were concerned that data sharing would require them to commit scarce resources with little direct benefit. Some of them spoke pejoratively in describing data scientists who analyze the data of others.3 To make data sharing successful, it is important to acknowledge and air those concerns.3 In our view, however, researchers who analyze data collected by others can substantially improve human health.

First of all, there was no indication in the original piece that, when you referred to data scientists as “research parasites”, you were simply reflecting the concerns of others.  However, what’s really wrong with the above paragraph is that you do not say that those concerns should be tackled, that they should be rebutted, with evidence where appropriate; no – you simply say they should be acknowledged and aired.  Well let me just say that there are a lot of pejorative opinions expressed by humanity, on the internet and otherwise, and a large number of those definitely do not deserve to be acknowledged or aired.

If you genuinely feel, as you state, that data scientists make a valuable contribution, then you need to go back to those clinical trialists, the ones with the pejorative opinions, and say “I acknowledge your opinion, but you are wrong”.

Instead, you wrote an editorial that justified them.  Please make this right.

Cheers

Mick

The age of the dinosaurs is almost over; NEJM publishes one of their last dying roars

Only this morning I was chatting to my wife in the car and we were discussing how important generational change is, how it is a force for good.  Take climate change – every single child is now being taught about the environment, about climate change, about how we are destroying our own planet.  Sure, the old white men that are in power now are raping the planet and ripping out every last drop of oil they can lay their hands on; but they’ll die soon, and the children of today will grow up and take their place, and be more careful with our planet.   I expect similar stories for sexism and racism – they will disappear as long as we teach our kids the right values.

Of course, our kids will do some awful things too, things that our grandkids will need to fix, but let’s not get into that.

How serendipitous then, that a pair of out-of-date, soon to be extinct dinosaurs vomited up this stinking editorial in NEJM about data sharing.  Had it been April 1st I’d be convinced it was a parody; alas no, these guys genuinely seem to believe that others who download and use your data for science are “parasites”, and even express fear (yes, that’s what it is, fear) that someone might use their own data to disprove their hypotheses.  The whole piece is dripping with disdain for these “parasites”, and I think this editorial (if it remains up!) will be remembered as a dark day for NEJM.

Sympathy for the devil?

Is there any way we can empathize with the authors?  Where does this feeling of data ownership come from?  It is of course the product of a broken system.  We are judged on the impact and number of our papers, which is weakly correlated with the amount of money we spend on our research; the amount we spend is strongly correlated with the amount of money we win in grants, which in turn is dependent on those papers being published.  So our entire system of reward is based on getting those publications so we can win more money and publish more papers.  Naively, there seem to be few incentives to share data, and many incentives to keep data private so we can wring out every last publication for ourselves.

Wrong on many levels

Of course the reasoning above doesn’t stand up – as I explained here, the data do not belong to the scientist, they belong to the funder and should be released as soon as possible, for the good of science and the good of the public.  It is extremely arrogant to assume that you, as data generator, have the ability to extract all possible useful information from the data.  Others of course will have ideas, many better than your own, and those ideas deserve your data, they deserve life.  Perhaps most importantly, rather than damaging your reputation, releasing data does the opposite and will almost certainly result in more collaborations, better science and more impact for the data generator.   Contrary to the NEJM bile, open science is far better – for you, for science and for humanity in general.

The last roar of dying dinosaurs

The views in the piece are so outdated it’s almost funny.  Attitudes have been changing for years and will continue to do so.  Treat the article with the disdain it deserves – this is your grandparents moaning about how TV and the internet have spoiled everything and life was so much better in their day; this is the biplane telling everyone that spaceflight is bad; it’s the bicycle desperately hoping that no-one wants to travel 100mph in racing cars.

It’s a pair of dinosaurs looking at an approaching comet and thinking: “oh shit!”

Which sequencer should you buy?

My aim for this post is a quick, pithy review of available sequencers, what they’re good for, what they’re not and under which circumstances you should buy one.   However, your first thought should be: do I need to buy a sequencer?  I know it’s great to have your own box (I love all of ours), but they are expensive, difficult to run, temperamental and time-consuming.  So think on that before you spend millions of pounds of your institute’s money – running the sequencers may cost millions more.  My blog post on choosing an NGS supplier is still valid today.

Illumina HiSeq X Ten

To paraphrase Quentin Tarantino, you buy one of these “When you absolutely, positively got to sequence every person in the room, accept no substitutes”.  The HiSeq X Ten is actually ten HiSeq X instruments; each instrument can run two flowcells and each flowcell has 8 lanes.  Each lane will produce 380-450million 150PE reads (120Gbase of data, or 40X of a human genome).  Runs take 3 days.  Expect to pay an extra £1M on computing to cope with the data streams.  Ordered flowcells are quite difficult to deal with and can result in up to 30% “optical duplicates” (actually spill over from one well to adjacent wells).  You can produce 160 genomes every 3 days.  Essentially now used as a very cheap, whole-genome genotyping system, cost per genome is currently £850+VAT at Edinburgh Genomics.  Limited to 30X (or greater) genome sequencing.  I have checked with Illumina and this is definitely still true.
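If you want to see where the 120Gbase, 40X and 160 genomes come from, here is the arithmetic as a sketch.  The 400 million read pairs per lane is an assumed round figure inside the quoted 380-450 million range, and the 3 Gbase human genome size is also my assumption; one genome per lane is what the numbers above imply.

```python
# HiSeq X Ten arithmetic from the figures quoted in the text.
INSTRUMENTS = 10
FLOWCELLS_PER_INSTRUMENT = 2
LANES_PER_FLOWCELL = 8
READ_PAIRS_PER_LANE = 400_000_000   # assumed, within the quoted 380-450 million
READ_LENGTH = 150                   # 150PE: two reads of 150 bases per pair
HUMAN_GENOME_BASES = 3_000_000_000  # assumed approximate genome size

lanes_per_run = INSTRUMENTS * FLOWCELLS_PER_INSTRUMENT * LANES_PER_FLOWCELL
bases_per_lane = READ_PAIRS_PER_LANE * 2 * READ_LENGTH
coverage_per_lane = bases_per_lane / HUMAN_GENOME_BASES

print(f"{lanes_per_run} lanes per run, so one genome per lane is {lanes_per_run} genomes")
print(f"{bases_per_lane / 1e9:.0f} Gbase per lane, about {coverage_per_lane:.0f}X human")
```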

Illumina HiSeq X Five

Do not buy one of these.  The reagents are 40% more expensive for no good reason.  Simply outsource to someone with an X Ten.

Illumina HiSeq 4000

The baby brother of the HiSeq X, I actually think it’s the same machine, except with smaller flowcells (possibly a different camera or laser).  Expect 300million 150PE reads per lane (same setup, two flowcells, each with 8 lanes, 3.5 day run time).  That’s 90Gbase per lane.  Same caveats apply – ordered flowcells are tricky and it’s easy to generate lots of “optical duplicates”.  No limitations, so you can run anything you like on this.  The new workhorse of the Illumina stable.  Buy one of these if you have lots and lots of things to sequence, and you want to run a variety of different protocols.

Illumina HiSeq 2500

One of the most reliable machines Illumina has ever produced.  Two modes: high output has the classic 2 flowcell, 8 lane set-up and takes 6 days; rapid is 2 flowcell, 2 lanes and takes less time (all run times depend on read length).  High output V4 is capable of 250 million 125PE reads, and rapid is capable of 120 million 250PE reads.   The increased throughput of the 4000 makes the 2500 more expensive per Gb, so only buy a 2500 if you can get a good one cheap second-hand, or get offered a really great deal on a new one.  Even then, outsourced 4000 data is likely to be cheaper than generating your own data on a 2500.

Illumina NextSeq 500

I’ve never really seen the point – small projects can go on MiSeq, and medium to large projects fit better and are cheaper as a subset of lanes on the 2500/4000.  The machine only measures 3 bases, with the 4th base being an absence of signal.  This means the runs are ~25% quicker.  I am told V1 data was terrible, but V2 data is much improved.  NextSeq flowcells are massive, the size of an iPad mini, and have four lanes, each capable of 100 million 150PE reads.  Illumina claim these are good for “Everyday exome, transcriptome, and targeted resequencing”, but realistically these would all be better and cheaper run multiplexed on a 4000.

Illumina MiSeq

A great little machine, one lane per run, V2 is capable of 12million 250PE reads per run; V3 claims 25million 300PE reads but good luck getting those, there has been a problem with V3 300PE for as long as I can remember – it just doesn’t work well.  Great for small genomes and 16S.

Illumina MiniSeq

I suspect Illumina will sell tons of these as they are so cheap (< $50k), but no-one yet knows how well it will run.  Supposedly capable of 25 million 150PE reads per run, that’s 7.5Gbase of data.  You could just about run a single RNA-Seq sample on there, but why would you?  A possible replacement for MiSeq if they get the read length up.  Could be good for small genomes and possibly 16S.  Illumina claim it’s for targeted DNA and RNA samples, so it could work well with e.g. gene panels for rapid diagnostics.


 

One interesting downside of Illumina machines is that you have to fill the whole flowcell before you can run the machine.  What this means is that despite the fact Illumina’s cost-per-Gb is smaller, small projects can be cheaper and faster on other platforms.


Ion Torrent and Ion Proton

The people I meet who are happy with their Ion* platforms are generally diagnostic labs, where run time is really important (they are faster than Illumina machines) and where absolute base-pair accuracy is not important.   No-one I know who works in genomics research uses Ion* data – it’s just not good enough.  There is a major indel problem, and Illumina data is cheaper and better.

PacBio Sequel

No-one has seen any data but this looks like an impressive machine.   There are 1 million ZMWs per SMRT cell and about 30-40% will return useable data.  Useable data will be 10-20Kb reads at 85% raw accuracy, but correctable to 99% accuracy.  Output at launch is 5-10Gbase per SMRT cell, and PacBio expect to produce 20Kb and 30Kb library protocols in 2016.  Great for genome assembly and structural variation, not quite quantitative for RNA-Seq but fantastic for gene discovery.  Link this up to NimbleGen’s long fragment capture kits and you can target difficult areas of the genome with long reads.  Machine cost is £300k so good value compared to the RSII.  These will fly off the shelf.

PacBio RSII

The previous workhorse of PacBio, capable of 2Gbase of long reads per SMRT cell.  Cool machine, but over-shadowed by Sequel, I wouldn’t recommend buying one.

Oxford Nanopore MinION

The coolest sequencer on the planet, a $1000 access fee gets you a USB sequencer the size of an office stapler.   Each run on the mark I MinION can produce several hundred Mb of 2D data, and fast mode (in limited early access) promises to push this into the Gbases.  Read lengths are a mean of 10Kb with raw 2D accuracy at 85% and a range of options for correction to 98-99% accuracy.  We use it for scaffolding bacterial genomes, and also for pathogen detection.  Should you buy one?  You should have one already!

Oxford Nanopore PromethION

The big brother of the MinION, this is best imagined as 48 bigger, fast-mode MinIONs run in parallel.  If fast mode MinION can produce 1Gbase per run, the PromethION will produce 300Gbase per run.  This machine is in limited early access, but offers the possibility of long-read, population scale sequencing.  Access fee is $75,000 but expect to spend ten times that on compute to deal with the data.  Get one if you can deal with the data.

 

Travel tips for highly organised and important people

Karthik Ram (@_inundata on Twitter) recently wrote down some of his “travel hacks”, and so I thought I would repeat the kind of sage advice he offers here. Here it is:

1. Luggage: choose the bag that smells the least bad. Remember that the cat likes to pee in them so check that first; also, the baby might have puked in there. The best bag is one that still contains all the stuff from your last trip that you can simply use again (to prevent the use of inappropriate clothes, simply never ever dress seasonally).  Also, pack at the very last minute, to make sure you don’t waste any precious time.

2. Power: go to the drawer that contains all of the screwdrivers, old keys and travel adapters and put every single one of the travel adapters you own in the bag.  Every. Single. One. Then buy another one at the airport.  Take the external battery you bought, but which is never charged.  Load up at least 3 broken USB cables, and make sure to put them in your checked bag so you can’t use them in the airport, even though you really want to.  Simply buy a whole new set of chargers in every airport you go to.

3. Sleep: never, ever refuse the opportunity to have an espresso in the airport, no matter what time of day or night, so that you’re wired and sweating for the entire flight.  Do not watch sad movies.  Remember that sad things in movies come back to you late at night when you can’t sleep, and you imagine them happening to you and your family and you cry into your pillow (and that really didn’t work out that time in China when the pillow was made of rice).

4. WiFi: always run up at least a £100 roaming 3G bill on your personal mobile phone by checking your emails and Twitter constantly.  Never take your work mobile, it is two models out of date and soooo uncool.

5. Health: ensure you maintain your health by somehow sitting next to the person who has the worst cold you have ever witnessed.  Every. Single. Time.

6. Accessories: buy those tiny expensive toiletries at the airport and never claim them on expenses.  They’re so cute!  They look exactly like the grown up ones, but teeny – squee!

7. Apps: NO NEVER USE YOUR PHONE THE BATTERY WILL RUN OUT, STOP STOP STOP SWITCH IT OFF TO SAVE BATTERY.  YES I KNOW THE SEAT IN FRONT HAS A USB PORT BUT YOU DON’T HAVE THE RIGHT CABLE!

And with that, I will wish you a happy new year!

Mick

 

 

The five habits of bad bioinformaticians

Whenever I see bad bioinformatics, a little bit of me dies inside, because I know there is ultimately no reason for it to have happened.  As a community, bioinformaticians are wonderfully open, collaborative and helpful.  I genuinely believe that most problems can be fixed by appealing to a local expert or the wider community.  As Nick and I pointed out in our article, help is out there in the form of SeqAnswers and Biostars.

I die even more when I find that some poor soul has been doing bad bioinformatics for a long time.  This most often happens with isolated bioinformatics staff, often young, and either on their own or completely embedded within a wet-lab research group.  I wrote a blog post about pet bioinformaticians in 2013 and much of the advice I gave there still stands today.

There are so many aspects of bioinformatics that many wet lab PIs are simply incapable of managing them.  This isn’t a criticism.  There are very few superheroes; few people who can actually span the gap between wet and dry science competently.  So I thought I’d write down the 5 bad habits of bad bioinformaticians.  If you are a pet bioinformatician, read these and figure out if you’re doing them.  If you’re a wet lab PI managing bioinformatics staff, try and find out if they have any of these habits!

Using inappropriate software

This can take many forms, and the most common is out-of-date software.  Perhaps they still use Maq, when BWA and other tools are now far more accurate; perhaps the software is a major version out of date (e.g. Bowtie instead of Bowtie2).  New releases come out for a reason, and major reasons are (i) new/better functionality; (ii) fixing bugs in old code.  If you run out of date software, you are probably introducing errors into your workflow; and you may be missing out on more accurate methods.

Another major cause of inappropriate software use is that people often use the software they can install rather than the software that is best for the job.  Sometimes this is a good reason, but most often it isn’t.  It is worth persevering – if the community says that a tool is the correct one to use, don’t give up on it because it won’t install.

Finally, there are just simple mistakes – using a non-spliced aligner for RNA-Seq, software that assumes the wrong statistical model (e.g. techniques that assume a normal distribution used on count data) etc etc etc.
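One quick way to see why a normal-based method can be a poor fit for counts is to compare the variance with the mean across replicates.  Below is a minimal sketch using only the Python standard library and made-up replicate counts for a single hypothetical gene.  Under a simple Poisson model the ratio would be close to 1, and a value well above that is a hint that the data need a count-aware model.

```python
from statistics import mean, variance


def dispersion_index(counts):
    """Sample variance divided by the mean (about 1 for Poisson data)."""
    m = mean(counts)
    if m == 0:
        raise ValueError("mean count is zero; index is undefined")
    return variance(counts) / m


# Made-up replicate counts for one gene, for illustration only.
replicates = [12, 30, 7, 55, 21, 3]
print(f"dispersion index: {dispersion_index(replicates):.1f}")
```

This is only a sanity check, not a substitute for using a tool built for count data in the first place.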

These aren’t minor annoyances, they can result in serious mistakes, and embed serious errors in your data which can result in bad science being published.  Systematic errors in analysis pipelines can often look like real results.

Not keeping up to date with the literature

Bioinformatics is a fast moving field and it is important to keep up to date with the latest developments.  A good recent example is RNA-Seq – for a long time, the workflow has been “align to the genome, quantify against annotated genes”.  However, there is increasing evidence that the alignment stage introduces a lot of noise/error, and there are new alignment free tools that are both faster and more accurate.  That’s not to say that you must always work with the most bleeding edge software tools, but there is a happy medium where new tools are compared to existing tools and shown to be superior.

Look for example here, a paper suggesting that the use of SAMtools for indel discovery may come with a 60% false discovery rate.  60%!  Wow….. of course that was written in 2013, and in 2014 a comparison of a more recent version of SAMtools shows a better performance (though still an inflated false positive rate).

Bioinformatics is a research subject.  It’s complicated.

All of this feeds back to the first habit above.  Keeping up to date with the literature is essential, otherwise you will use inappropriate software and introduce errors.

Writing terrible documentation, or not writing any at all

Bioinformaticians don’t document things in the same way wet lab scientists do.  Wet lab scientists keep lab books, which have many flaws; however they are a real physical thing that people are used to dealing with.  Lab books are reviewed and signed off regularly.  You can tell if things have not been documented well using this process.

How do bioinformaticians document things?  Well, often using things like readme.txt files, and on web-based media such as wikis or github.  My experience is that many bioinformaticians, especially young and inexperienced ones, will keep either (i) terrible or (ii) no notes on what they have done.  Both are as bad as each other.  Keeping notes, version controlled notes, on what happened in what order, what was done and by whom, is essential for reproducible research.  It is essential for good science.  If you don’t keep good notes, then you will forget what you did pretty quickly, and if you don’t know what you did, no-one else has a chance.
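Note-keeping of this kind can be very lightweight.  As a minimal sketch (one possible approach, not a standard), the helper below appends a timestamped, attributed line to a plain-text notes file that you could then commit alongside your scripts.  The file name, the `log_step` function and the example command string are all hypothetical.

```python
import getpass
from datetime import datetime, timezone
from pathlib import Path


def log_step(notes_file, command, comment="", who=None):
    """Append one timestamped, attributed entry to a plain-text notes file."""
    if who is None:
        try:
            who = getpass.getuser()
        except Exception:
            who = "unknown"  # no user name available in this environment
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S UTC")
    entry = f"[{stamp}] {who}: {command}"
    if comment:
        entry += f"  # {comment}"
    with Path(notes_file).open("a", encoding="utf-8") as handle:
        handle.write(entry + "\n")
    return entry


# Hypothetical file name and command, for illustration only.
log_step("NOTES.txt", "bwa mem ref.fa reads_1.fq reads_2.fq > aln.sam",
         comment="first-pass alignment of sample A")
```

Because the file is append-only plain text, the order of steps is preserved and version control shows who added what and when.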

Not writing tests

Tests are essential.  They are the controls (positive and negative) of bioinformatics research and don’t just apply to software development.  Bad bioinformaticians don’t write tests.  As a really simple example, if you are converting a column of counts to a column of percentages you may want to sum the percentages to make sure they sum to 100.  Simple but it will catch errors.  You may want to find out the sex of all of the samples you are processing and make sure they map appropriately to the signal from sex chromosomes.  There are all sorts of internal tests that you can carry out when performing any analysis, and you must implement them, otherwise errors creep in and you won’t know about it.
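The two checks described above can be sketched in a few lines of Python.  This is a minimal illustration with made-up values: the function names are hypothetical, and the chrY threshold is an arbitrary placeholder that would need tuning on real data.

```python
def to_percentages(counts):
    """Convert a column of counts to percentages of the column total."""
    total = sum(counts)
    if total == 0:
        raise ValueError("counts sum to zero; percentages are undefined")
    return [100.0 * c / total for c in counts]


def check_percentages(percentages, tolerance=1e-6):
    """Internal control: the column must sum to 100 (within rounding)."""
    total = sum(percentages)
    assert abs(total - 100.0) <= tolerance, f"percentages sum to {total}"


def check_sex(sample, recorded_sex, y_fraction, threshold=0.001):
    """Flag samples whose Y-chromosome signal disagrees with the metadata.

    y_fraction is the fraction of reads mapping to chrY; the threshold is
    an arbitrary placeholder, not a validated cut-off.
    """
    inferred = "M" if y_fraction > threshold else "F"
    assert inferred == recorded_sex, (
        f"{sample}: recorded {recorded_sex}, chrY signal suggests {inferred}")


counts = [10, 30, 60]                          # made-up counts
check_percentages(to_percentages(counts))
check_sex("sample_01", "M", y_fraction=0.004)  # made-up values
print("all internal checks passed")
```

Neither check is clever, and that is the point: they cost almost nothing to run on every analysis and they fail loudly when something has gone wrong upstream.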

Rewriting things that exist already

As Nick and I said, “Someone has already done this, find them!”.  No matter what problem you are working on, 99.9999% of the time someone else has already encountered it and either (i) solved it, or (ii) is working on it.  Don’t re-invent the wheel.  Find the work that has already been done and either use it or extend it.  Try not to reject it because it’s been written in the “wrong” programming language.

My experience is that most bioinformaticians, left to their own devices, will rewrite everything themselves in their preferred programming language and with their own quirks.  This is not only a waste of time, it is dangerous, because they may introduce errors or may not consider problems other groups have encountered and solved.  Working together in a community brings safety, it brings multiple minds to the table to consider and solve problems.

—————————————————

There we are.  That’s my five habits of bad bioinformaticians.  Are you doing any of these?  STOP.  Do you manage a bioinformatician?  SEND THEM THIS BLOG POST, and see what they say 😉


© 2019 Opiniomics
