bioinformatics, genomes, biology etc. "I don't mean to sound angry and cynical, but I am, so that's how it comes across"

HPC and biology; a marriage made in hell

A quick micro-post! I was just having coffee with a guy from EPCC (Edinburgh Parallel Computing Centre), and he said: “We have lots of compute power, and you guys have lots of data, it’s a marriage made in heaven!” (when he said “you guys”, he meant biologists)

And of course he is right; but also sadly wrong.  This is how conversations tend to go between HPC and biologists:

  • Biologist: I have lots of data, please help!  Can I use your HPC?
  • HPC: Yes, of course you can.  What resources do you need?  How much RAM?  How many processors?
  • Biologist: Erm, I don’t know, and I won’t know until I run the software and complete the analysis
  • HPC: Hmmm, well you can’t have access until you know how much you want…

Note: This is not a criticism of EPCC!  I have had the above conversation about 30 times with different HPC providers, going back to when the cloud was called the grid.  It’s a perpetual problem.

So, what’s the solution?  Why isn’t this a marriage made in heaven?  Comments please!


  1. When the “biologists” (i.e. us) reply to HPC’s questions, we should clearly just say “all of it”. Job done!

  2. If by HPC you mean in-memory, MPI-based SPMD or MPMD applications and the infrastructures optimized for them, then I think you will continue to run into that problem, as the game for such HPC infrastructures is to maximize processor performance. However, the big data universe starts from the other end: it tries to maximize I/O concurrency, and there the questions are exactly the opposite, of the ilk “How much storage will you need?”

    The HPC community needs to do what it needs to do, since most of the effort there is in the capability computing universe. The big data community is a better starting point for computational biology, as data management is the key operational aspect of big data efforts, and compute requirements are discovered as the programs run. The interesting and somewhat sad observation is that the big data folks are rediscovering all the processor optimization tricks that the HPC community has discovered over the past thirty years, but since schedulers and programs are so different, they get laid bare in different ways. This is to say that the big data community is integrating the HPC learning of the past thirty years to make their computes more efficient. The reverse is not true: the HPC community is not integrating big data’s data management techniques, because they don’t help its core applications, which need to run mostly in-memory to truly compute at supercomputing levels.

    One final, somewhat self-serving, observation: with computational biology there is also an opportunity to leverage ‘computing-at-the-leaves’. The reason for this is that a lot of genomics and proteomics is driven by instrument-based data collection, and the data reduction that can be had by computing at the instrument can reduce the overall cost by three orders of magnitude. This requires a workflow that is foreign to both the HPC and big data communities, but is starting to take hold in the mobile and IoT communities.

  3. One solution is to give users trial access for free but queue their jobs at the lowest priority so they’re essentially using idle resources without disrupting the paying customers. This allows them to get a sense of how much resource they would need to complete their analysis in a reasonable time. Obviously, this depends on how much idle capacity you have on your cluster.

    If you don’t know beforehand, can’t the provider just bill you on a PAYG basis for the RAM and CPU you actually use? If your usage is putting a strain on the available capacity, they can politely suggest that you buy some more upfront.
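For what it's worth, the "get a sense of how much resource you need" step can be roughly sketched in a few lines of Python. This is only an illustration, assuming a Unix-like machine and a command you can run on a small test dataset; the function name `profile_command` and the toy command in the usage example are made up for this sketch and do not come from any particular HPC provider or scheduler.

```python
import resource
import subprocess
import sys
import time


def profile_command(argv):
    """Run a command to completion and report wall time, CPU time and peak memory.

    Hypothetical helper for illustration; Unix-only (uses the `resource` module).
    """
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    start = time.perf_counter()
    subprocess.run(argv, check=True)  # waits for the child to finish
    wall_s = time.perf_counter() - start
    after = resource.getrusage(resource.RUSAGE_CHILDREN)

    # User + system CPU seconds consumed by the child we just ran.
    cpu_s = (after.ru_utime - before.ru_utime) + (after.ru_stime - before.ru_stime)

    # ru_maxrss is the largest peak among all children waited for so far
    # (not a per-run delta), so profile one command per Python process.
    # It is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    peak_mib = after.ru_maxrss / divisor

    return {"wall_s": wall_s, "cpu_s": cpu_s, "peak_mib": peak_mib}


if __name__ == "__main__":
    # Toy stand-in for a real analysis command run on a small test dataset.
    stats = profile_command([sys.executable, "-c", "sum(range(1_000_000))"])
    print(stats)
```

Numbers from a small test run are only a starting point: memory and run time do not necessarily scale linearly with input size, so they give a first answer to "how much RAM, how many processors?", not a guarantee.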

  4. Maybe I’ve always had to come at it from a biologist angle, but my experience has always been that the compute folks are rarely either willing or able (or both) to take the time to understand the problems and the software that the biologists run. Sure, you could easily put it around the other way, and say the biologists are ignorant of what they’re running or the details of how it could be done efficiently. But my data was never “BIG” big data, so after a conversation or three I tended to throw up my hands and let my stuff run on an 8-core desktop. It took a week instead of a few hours, but I had grants, papers, and lectures to write anyway.

    So, there really needs to be staff in place who can bridge disciplines and understand the problems and how to solve them. Unfortunately, that’s a very tall order today.

  5. I don’t think that’s a very tall order at all. Where I work, it’s considered to be part of my job to help biologists use the cluster. Obviously it depends on me having the time to do it but here it’s an accepted (and expected) part of the role. I think you’ve got to have that kind of support in place to get the best value from a cluster. It’s not just about the hardware.


© 2018 Opiniomics
