bioinformatics, genomes, biology etc. "I don't mean to sound angry and cynical, but I am, so that's how it comes across"

Ten things to consider when choosing an NGS supplier

Top ten lists have taken a bit of a battering in recent times, and so I kind of apologise for adding another one, but I’m not really that sorry.

Some of you will be aware that I manage a genomics facility, ARK-Genomics.  I choose my words carefully – I don’t run it, the excellent staff who work in the facility do that.  If I dropped off the face of the planet, the facility would still “run”.  No, I “manage” the facility, which at the end of the day means that I write documents and submit them to various people – our management team, our funders, our finance team etc.

I may come across as cynical at times, but it’s only because I care.  A lot.  I love science.  I love genomics.  I want to see it done well and I get angry when it isn’t.  So I care about ARK-Genomics.  I care about the data we produce, and I care about the customers who come to us.

This post is really about caring, and why you should take care when you choose an NGS supplier.

As a quick aside, you may be surprised at my use of the word “customer”.  You shouldn’t be.  ARK-Genomics is an entirely academic operation, we do not make any profit, and we prefer to work as a scientific collaborator.  But make no mistake: the relationship benefits if we think of our collaborators as customers.  They have money, we have genomics expertise.  They pay us and they have expectations.

Meeting those expectations is what keeps me awake at night!

So here are my top ten things to consider when choosing an NGS supplier:

1. Quality – not all sequencing is equal.

Illumina data from one facility won’t necessarily be as good as Illumina data from another.  These machines are temperamental.  It’s not just a case of doing a bit of pipetting followed by the push of a button.  There is a huge amount of skill, expertise and experience involved in QC-ing and preparing NGS libraries for sequencing.  Then, when the sequencer is running, it needs to be monitored pretty closely – Illumina have a habit of producing pokey batches of reagents, and if a problem develops, you need to act fast.  So lots of skill is involved in the process.

Below is a quality plot from a publicly available dataset.  Would you be happy with that?  You shouldn’t be.

[Figure: bad_qual]

How about this one?  This isn’t a public dataset, nor is it one of ours, but it is one I have come across.  Someone I know paid for the sequencing before checking the quality of the data.  Do I have your attention now? 🙂


So not all data is the same, not all facilities are the same.  Some are better than others, quite frankly.  So choose well!
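If you want to check quality yourself before you pay, a few lines of code will get you a rough version of the plots above. This is only a minimal sketch: it assumes plain four-line FASTQ records and Phred+33 quality encoding, and the function name and the two example reads are made up for illustration. A dedicated QC tool will tell you far more.

```python
import io
from collections import defaultdict


def mean_quality_per_position(handle, phred_offset=33):
    """Return the mean Phred score at each read position of a FASTQ stream.

    Assumes four lines per record and Phred+33 encoding (both assumptions).
    """
    totals = defaultdict(int)
    counts = defaultdict(int)
    for line_number, line in enumerate(handle):
        if line_number % 4 != 3:  # the quality string is the 4th line of each record
            continue
        for pos, char in enumerate(line.rstrip("\n")):
            totals[pos] += ord(char) - phred_offset
            counts[pos] += 1
    return [totals[p] / counts[p] for p in sorted(counts)]


# Two invented reads: the second one degrades badly towards its 3' end.
example = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\nII5#\n")
means = mean_quality_per_position(example)
print(means)  # [40.0, 40.0, 30.0, 21.0]
```

A mean that drops steeply along the read, as in the toy data here, is the kind of thing you want to spot before the invoice is settled.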

2.  Delivery – can they do it?

This is similar to point (1) above.  Sequencing DNA is different to mRNA, which is different again from microRNA.  ChIP-Seq, MeDIP-Seq, Bisulfite sequencing, mate-pairs etc etc etc.  They all take skill.  They’re all different.  Just because a facility has a certain sequencing technology, it doesn’t mean they’ll be able to do the type of sequencing you want.  Once a facility buys a machine, it can take them years to perfect the various different library preparation techniques.  Ask yourself, and ask them – do they have a track record in producing the type of data you want?  If not, then your project might be the one where they find out they can’t make that type of library after all.  This isn’t necessarily a problem, but if it ends up being a “development project” for the facility, you should know about it first!

3. Technology – the sequence is not the DNA

Not all sequencing is equal, and not all technologies are equal.  The sequence you get is the result of a long series of steps involving QC, PCR, library prep, sequencing and bioinformatics.  Each technology has pros and cons: PacBio have high raw indel rates; SOLiD data is not great for genome assembly; 454 and Ion Torrent tend to have homopolymer errors; Illumina sequencing can also produce artefacts.  So, think about what you want to do, and choose the appropriate technology.  Be prepared for the fact that the cheapest per-base technology might not be the one you need.

4. Bioinformatics

Bioinformatics is now the most rate-limiting step in genomics research.  Can you handle the data yourself?  If not, what kind of bioinformatics support is available and how much does it cost?  You might find the bioinformatics component costs more than the sequencing – did you budget for that?  Note, not all bioinformaticians are equal, either.  Just because they know Linux and can get the tools to run, it doesn’t mean they are good bioinformaticians.  There is more to bioinformatics than running a few tools and e-mailing you the results to interpret.  Does your chosen facility have a good research track record of producing bioinformatics publications?  If so, it’s probably a good sign.

In my very early NGS years (in my previous role) we commissioned some Illumina sequencing, and the bioinformatics support was “Buy Lasergene”. Hmmmmm, this was not good advice!

5. Time

This is a fairly obvious one – how quickly will your project be run?  And where does it fit in with their priorities?  Is your project ranked as highly as, perhaps, their internal projects?  Will your £5k project be treated the same as the £5million project they just won to sequence 10,000 frightened badgers?  Or will your project be silently forgotten as they chase the next big grant?

6. Customer service

You probably won’t hear this term mentioned by many academic facilities, but this is important.  If you have an issue, or a question, who answers your call or your e-mail?  Does it get answered at all?  Or is it forgotten about/ignored?  What happens when you have a complaint?  At ARK-Genomics, we have a complaints procedure, and I encourage anyone who has an issue to e-mail me directly.  How many facilities encourage you to e-mail the director when things go wrong?

This is really important. You won the grant, you have the money, you have every right to expect your queries and complaints to be dealt with rapidly and professionally.

7. Responsiveness

So you want to know how much something costs, or you want advice on experimental design or choice of technology.  Who do you ask?  How long does it take for an answer?  Who answers you and do they know what they’re talking about?  Is it a sales person, who will basically tell you what you want to hear to make the sale?  Or a scientist, who is honest with you about the limitations?  Which would you prefer, honesty or the hard sell?  Will people who have a £1million consumables budget get a quicker response than someone who has £10k?

8. Processes

Some call this quality assurance, others call them SOPs (standard operating procedures).  Essentially they are talking about the same thing: processes, what they are, and documenting them.  I believe everything should have an SOP – project management, data management, customer complaints, bioinformatics, lab techniques, etc, etc:  everything.  It ensures that you get a good, standard product, no matter who carries out your project.  At ARK-Genomics, we are continuously developing our SOPs, and plan on obtaining ISO-9001 accreditation this year.  Does your chosen facility have processes and QA?  They should do!  If you’re working on human subjects, you may need CLIA accreditation, or you might want to insist on GLP.  These are all the same thing (albeit with different demands) – defining and documenting processes to ensure a consistent standard.

9. Knowledge, research, collaboration

All you need is a sequencing provider that knows how to sequence, right?  Right?! Wrong! Why not choose an NGS supplier that knows about your research interests too?

I’ll give you an example.  Take the pig genome, recently published in Nature.  The first authors are Martien Groenen and Alan Archibald, both internationally recognized experts in pig research and pig genomics.  Martien works at Wageningen University in the Netherlands, and Alan is my boss, and works at The Roslin Institute – home of ARK-Genomics.  You won’t be surprised to know that both institutions run next-generation sequencing facilities.

We know the pig genome well.  We’ve studied it for years.  We helped design the Illumina porcine SNP50 chip and the Affymetrix Porcine expression array (Alan is name-dropped here), and we recently published a Porcine Gene Expression atlas.

So if you’re doing research into pigs, the only question is why you wouldn’t do it in collaboration with a sequencing provider that knows the pig genome inside out?  Why would you choose anyone else?  Other providers might not know crucial details – for example, they might not be aware that the IFITM gene cluster, which encodes flu resistance, is poorly assembled, or that the growth regulating IGF2 is missing from the assembly entirely.  If you choose a provider who knows about the biology you are interested in, that can only benefit your research.

This is starting to sound like an advert, so I’ll mention others.  Mark Blaxter is a World-renowned nematode and evolutionary biologist, and he also happens to run an NGS facility – the GenePool.  If you’re into worms and you’re not doing your NGS with Mark, then you’re doing it wrong.  Likewise, Neil Hall is last author on the Wheat genome paper.  He also runs Liverpool’s Centre for Genomic Research.  Can you guess what I’m going to suggest?!

10. Comparing like with like

Show me a quote, show me any quote, and I’ll beat it.  It’s easy.  I’ll just do single-end instead of paired-end; or I’ll give you fewer reads; or shorter read lengths.  Actually, I won’t, because I have ethics, but you get the general idea.  As an example, for RNA-Seq, we recommend a minimum of 30M reads per sample, and 100bp paired-end sequencing as standard (the longer reads are better at defining splice junctions, and the pairs help reconstruct transcripts).  Now this experiment will incur a certain cost, and any other NGS facility can undercut our quote if they offer you 50bp paired-end, or 36bp single-end etc.  They may miss out crucial details, such as that TopHat, the most popular spliced aligner, is optimized for reads longer than 75bp.  Alternatively, they might give you 15M reads per sample instead of 30M, failing to inform you that ENCODE recommend 30M as a minimum.

You get my point, though – when choosing an NGS supplier, make sure you compare like with like!
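One way to put two quotes on the same footing is to work out how much sequence each one actually delivers. The sketch below is illustrative only: the `Quote` class, the facility names and the prices are all invented, and it assumes the quoted read count means fragments (read pairs for paired-end runs).

```python
from dataclasses import dataclass


@dataclass
class Quote:
    name: str
    price: float            # price per sample, in any one currency (hypothetical figures)
    reads_millions: float   # fragments per sample; assumed to mean pairs for paired-end
    read_length: int        # bases per read
    paired_end: bool

    def gigabases(self) -> float:
        """Total sequence delivered per sample, in gigabases."""
        ends = 2 if self.paired_end else 1
        return self.reads_millions * 1e6 * self.read_length * ends / 1e9

    def price_per_gigabase(self) -> float:
        return self.price / self.gigabases()


# Invented quotes: B looks half the price, but delivers a quarter of the data.
quote_a = Quote("Facility A", price=300.0, reads_millions=30, read_length=100, paired_end=True)
quote_b = Quote("Facility B", price=150.0, reads_millions=15, read_length=50, paired_end=True)

for q in (quote_a, quote_b):
    print(f"{q.name}: {q.gigabases():.1f} Gb, {q.price_per_gigabase():.0f} per Gb")
```

Price per gigabase is still not the whole story, since a short read is not simply half of a long one when it comes to splice junctions, but it does stop the headline price doing all the talking.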


So there we have it, my top ten things to consider when choosing an NGS supplier 🙂

The big, fat elephant in the room is what I haven’t mentioned – price.  To be honest, if price is more important to you than any of the above issues, then you’re doing it wrong – step away from the science and come back when you’ve thought about it a little bit more.  Sure, price is important up to a point, no-one likes to be ripped off and we all love a bargain, but in my opinion, it comes a long way down the list compared to the issues mentioned above.

Because good science and good data is the highest priority, right?

Update 11:31am 21/01/13

I should have said, the above points are in no particular order.


  1. Good post. Covers pretty much all the aspects I try to instill in our collaborators when they ask where to get sequencing done. Sadly, they often still go with your number 11 – price 🙁

    As a bioinformatician, I find points 6 & 7 to be most important as they’re the ones I don’t have much control over. Point 5 is key, but often quoted artificially low. A centre that shall not be mentioned has more or less black-listed itself here due to very poor estimates of timescales.

    Final point, regarding your RNA-seq recommended minimum I presume you’re talking about higher eukaryote/mammalian samples. For bacterial or yeast samples 30M 100bp PE reads is going to be serious overkill.

  2. Nice post Mick. I’d add one more thing though too:

    11. Experimental Design. Will they help you design the experiment you want to do so that it actually answers the question you want answered, or will they just sequence whatever you give them?

    Like you said, the sequencing centre have the expertise and getting them onside in the experimental design can only help IMHO.

  3. Yes, we see this a lot (RE price). We are competitive on price, I make sure of that, but we are not the cheapest – in part because certain centres do not recover their full costs, and so it is not a level playing field. ARK-Genomics has been around for 10 years though, and runs as a sustainable operation, whereas those not recovering costs won’t be here in two years’ time.

    We often see customers go elsewhere with cheaper quotes and then come back when they need help, or when things didn’t work out.

    I digress, however….

    One thing I hope came across is that these machines are temperamental. I know facilities that lost their sequencing capability for almost a year when the HiSeq 2000 was introduced. This was Illumina doing their testing in the field. A lot of reputations took a big hit, and Illumina made a big profit. More often than not, whichever facility you’re working with is doing the best they can.

    And yes, you are right RE RNA-Seq 🙂

  4. Thanks, I’m glad you liked it.

    Experimental design is a funny one – often people ask (great!), and sometimes they already have an experimental design in mind and don’t take too kindly to being told it’s not a good one….

  5. Great post, and I want to add experimental design as the 11th one. Totally agree that experimental design is still the most neglected aspect of NGS. A while ago I blogged on good experimental design for NGS experiments here: http://nextgenseek.com/2012/10/tips-for-next-gen-sequencing-experiment-design-randomization/ Readers of this post might find it useful.

  6. Thanks! So, experimental design is becoming a bit of a theme here 🙂 I agree, whole-heartedly, but we really see a split in our collaborators – some welcome the advice and really appreciate it, others – well, let’s just say that price can get in the way of good experimental design!

  7. The problem with bioinformatics is top down ineptitude. I can think of a few places with great publication track records in bioinformatics who I wouldn’t ask to compute my Tesco shopping list. Harvard comes to mind as top of the ten (George is in the news again) but it would be a tough one as there are many groups of ontologists and pipeline knowledge flow computational groups who could fight it out. I have seen too many talks about how algorithm X is the best thing ever on the synthetic model data – oh but it does not work in the real world (an ISMB favorite). The best test is do they say they will have the analysis done for tomorrow? If they do go somewhere else.

  8. This is a great post, Mick. I’ll add a couple more things that I’ve found to be crucial as a bioinformatician. Apologies up front; this has ended up being a fairly long comment/rant 🙂

    Most of my points relate to my experience with analysing exome-sequencing data in human and mouse, but many points are applicable to RNA-seq, CHIP-seq etc. as well.

    Sample Checks
    This is the one I find the scariest – are the sequencing data you get back from the sample you think they are from? I’ve encountered problems with this at all stages of the process. Not all of these are the fault of the NGS supplier; for example, sample swaps and mixups can occur when pulling samples out of storage, before the DNA is even sent to be sequenced. However, some of these errors certainly are the fault of the supplier. For instance, I was analysing some human exome-sequencing data for a family with a rare Mendelian disease. I downloaded the FASTQ files from the NGS supplier’s website. About one week after downloading the files (and having nearly completed my analysis) I received an email from the supplier saying in effect, “Oops, those samples you downloaded aren’t what you think they are. In fact, they don’t even belong to you”. Holy crap.

    It’s now standard for our lab, where possible, to compare genotypes called from the exome-sequencing data to that from SNP chip data on the same sample. We generally have such SNP chip data as most of our samples have previously been genotyped on SNP chips for linkage analysis. If there is a discrepancy between the NGS genotypes and SNP chip genotypes then that should set all kinds of alarm bells ringing.
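The comparison described above can be sketched in a few lines. This is an illustration under stated assumptions, not the commenter's pipeline: the function name is made up, genotypes are assumed to be two-letter unphased strings keyed by SNP ID, both sets of calls are assumed to be on the same strand, and the SNP IDs and calls are invented.

```python
def genotype_concordance(ngs_calls, chip_calls):
    """Return (fraction of matching genotypes, number of SNPs compared).

    Only SNPs called in both datasets are compared. Genotypes are treated
    as unphased, so "AG" and "GA" count as a match.
    """
    shared = [
        snp for snp in ngs_calls
        if ngs_calls[snp] is not None and chip_calls.get(snp) is not None
    ]
    if not shared:
        return None, 0
    matches = sum(
        1 for snp in shared if sorted(ngs_calls[snp]) == sorted(chip_calls[snp])
    )
    return matches / len(shared), len(shared)


# Invented calls: rs4 has no NGS call and rs5 is chip-only, so three SNPs are compared.
ngs = {"rs1": "AG", "rs2": "CC", "rs3": "TT", "rs4": None}
chip = {"rs1": "GA", "rs2": "CC", "rs3": "CT", "rs5": "AA"}

concordance, n_compared = genotype_concordance(ngs, chip)
print(f"{concordance:.2f} concordant over {n_compared} SNPs")
```

With real data you would compare thousands of sites, and a sample that is what it claims to be should agree at nearly all of them; where exactly to set the alarm threshold is a judgment call for each lab.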

    In another case the FASTQ files I downloaded for a single sample contained data from __two__ samples. About 80% of the data was from the intended sample but I still have no idea where the other 20% came from. This was discovered due to my second point: Check the read IDs and barcodes in your FASTQ files.

    Check the Read IDs and Barcodes in FASTQ Files
    I was alerted to the issue of multiple samples in the FASTQ file because there were multiple combinations of instrument name/run ID/flowcell ID/lane ID/index sequence in the FASTQ file. Many (most?) NGS suppliers will multiplex samples, particularly for exome-sequencing and RNA-seq on the HiSeq, and then de-multiplex the reads by parsing the @SEQ_ID of each read in the FASTQ file. Unfortunately, this isn’t always done properly.

    We initially introduced a check of @SEQ_IDs in the FASTQ files into our analysis pipeline due to an NGS supplier sequencing our samples across multiple runs without ever bothering to inform us! There are well known batch effects in NGS data and by sequencing across multiple runs these effects are exacerbated. Of course, sometimes sequencing across multiple runs or lanes of a flowcell is inevitable and I have no issue with that. However, if it is necessary, then the customer should be informed, at a bare minimum, and preferably informed before the sequencing is run as this is effectively changing the experimental design.
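A check along these lines can be sketched as follows. It assumes Illumina-style read headers of the form `@instrument:run:flowcell:lane:tile:x:y read:filtered:control:index` and four-line FASTQ records; the function name and the example reads are made up, and headers from other platforms or older pipelines would need different parsing.

```python
import io
from collections import Counter


def summarise_read_origins(handle):
    """Count reads per (instrument, run, flowcell, lane, index) combination.

    Header lines are picked by position (every 4th line) rather than by a
    leading "@", because a quality string can also start with "@".
    """
    origins = Counter()
    for line_number, line in enumerate(handle):
        if line_number % 4 != 0:
            continue
        name, _, comment = line.rstrip("\n")[1:].partition(" ")
        instrument, run, flowcell, lane = name.split(":")[:4]
        index = comment.split(":")[-1] if comment else ""
        origins[(instrument, run, flowcell, lane, index)] += 1
    return origins


# Invented reads: two from one run/lane/index, one from somewhere else entirely.
example = io.StringIO(
    "@M1:12:FCA:1:1101:1000:2000 1:N:0:ACGTAC\nACGT\n+\nIIII\n"
    "@M1:12:FCA:1:1101:1001:2001 1:N:0:ACGTAC\nACGT\n+\n@III\n"
    "@M1:15:FCB:3:1101:1000:2000 1:N:0:TTGCAA\nACGT\n+\nIIII\n"
)
origins = summarise_read_origins(example)
for origin, n_reads in origins.most_common():
    print(origin, n_reads)
```

More than one combination in a file that is supposed to hold a single sample from a single lane is not proof of a mix-up, but it is a good reason to e-mail the supplier.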

    Clarify the Capture Technology for Exome Sequencing
    You may prefer one capture platform over another for a particular project because it better targets a region you are particularly interested in. Make sure you articulate this to your NGS supplier because they are likely to have their own preferred capture platform for their own reasons, such as owing to a simpler lab protocol or a better deal from the manufacturer.

    Minimum Coverage
    Many exome-sequencing providers quote prices for a given coverage, e.g. 40x. Clarify whether this is a theoretical pre-alignment coverage based on the number of reads generated or on an empirical post-alignment coverage based on the number of aligned/uniquely aligned/duplicate-free reads. Also, clarify how they compute this coverage – is it based on the CDS regions or on the regions targeted by the capture platform? Is it the average or median coverage? These small tweaks can result in drastically different coverages. For example, I’ve received figures of 40x average coverage based on aligned data. “Fantastic”, I thought, only to discover that 80% of these reads were PCR and optical duplicates.
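The arithmetic behind that disappointment is simple enough to sketch. The function names, the read count and the 50 Mb target size below are assumptions chosen for illustration; the 80% duplicate fraction is the figure from the story above. Off-target reads are ignored here, which flatters the result further.

```python
def mean_coverage(n_reads, read_length, target_bases):
    """Theoretical mean coverage: total sequenced bases divided by target size."""
    return n_reads * read_length / target_bases


def usable_coverage(aligned_coverage, duplicate_fraction):
    """Mean coverage left after discarding PCR and optical duplicates."""
    return aligned_coverage * (1.0 - duplicate_fraction)


# Hypothetical run: 20M reads of 100bp against a 50 Mb capture target.
quoted = mean_coverage(n_reads=20_000_000, read_length=100, target_bases=50_000_000)

# The case from the comment: 80% of the aligned reads were duplicates.
after_dedup = usable_coverage(quoted, duplicate_fraction=0.8)

print(f"quoted: {quoted:.0f}x, after duplicate removal: {after_dedup:.0f}x")
```

So a quoted 40x can shrink to roughly 8x of usable data, which is why it pays to ask exactly which reads and which regions are being counted.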

    Know Your Rights and Get Them in Writing
    This relates to your points (6) and (7). Whilst many of the above sound like horror stories, some of these errors aren’t necessarily the fault of the NGS supplier – NGS isn’t perfect and sequencing runs do go bad through no-one’s particular fault. It’s how they are dealt with by the NGS supplier that can make or break a business relationship. Get your quote in writing in as much detail as possible so that if something does go wrong you know where you stand. Many of the bad sequencing runs I’ve described above were re-done at no cost by the NGS supplier because we had this written into the contract. Treat your customers well and they are likely to forgive mistakes. Treat them badly and they’ll quickly jump ship to another supplier.

  9. “Oops, those samples you downloaded aren’t what you think they are. In fact, they don’t even belong to you”… wow. Just, wow.

  10. Yep – and I am not naming names, but we too have had a situation where we sent one thing to an NGS supplier and what we got sent back was most definitely not what we sent.

    I see this from both sides, really. What happens in the lab is a very complex process. Things are taken out of original tubes, put into other tubes, or into plates, PCR-ed, diluted, concentrated, adapters added, pooled etc etc. There is a lot of scope for problems to occur. This is why (8) is so, so important. Then (4) is very important to catch any potential errors.

  11. I like the blog; it is an eye-opener for researchers involved in NGS.
    From personal experience, I agree on the importance of timely and accurate data delivery. When you request sequence from organism X, do you get X in return, or X plus 50% control reads, or reads from organism Y? This can definitely delay your analysis by 6 months or even more in this competitive world of science.


© 2018 Opiniomics
