What with the REF looming, and the rather ridiculous notion that I need to choose 4 papers, and those 4 papers will be used to judge the last 4 years of my career, my thoughts have been turning to impact, and how we judge the impact of our research.

Is bioinformatics any different to other biological research, I wonder?


The traditional method of assessing the impact of any research is the number of citations a paper attracts, and of course we can do that for bioinformatics papers too.

My immediate thought is that bioinformatics papers can attract two types of citation:

  1. Other papers that use your software / method to gain a biological insight
  2. Other software tools and methods that claim to be better than yours

Intuitively, more of (1) and fewer of (2) seems better, though of course it is not necessarily a bad thing that other tools come out that are better than your own – that’s simply progress.

There are similarities in traditional biological research too – your paper on e.g. cancer genomics may be cited many times because it is an excellent basis for further research (similar to (1) above) or it may be cited because it is completely incorrect and the results are not reproducible (similar to (2)).

I have a major gripe about citations, from personal experience, and I know others have suffered from the same. Software we published in 2009, CORNA, was used extensively in the zebra finch genome paper and was cited in the supplementary material, but PubMed and various other citation trackers do not track citations in supplementary material.  Doh!

Overall, the number of citations appears to be a very poor proxy for actual impact.

Access statistics

On publisher websites such as BioMed Central, it’s possible to access statistics about how many times your paper has been read/downloaded.  Obviously these numbers far outstrip the number of citations, but I imagine they are strongly correlated.  It would be interesting to identify papers that have been read lots, but cited little, to see what insight we can gain.
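As a rough illustration of that last idea, here is a minimal Python sketch that picks out papers with many accesses but few citations. Everything in it is an assumption for the sake of the example: the paper titles and numbers are made up, and the thresholds (`min_accesses`, `max_citations`) are arbitrary cut-offs, not anything a publisher provides.

```python
# Hypothetical per-paper figures, invented for illustration only.
papers = [
    {"title": "Paper A", "accesses": 12000, "citations": 40},
    {"title": "Paper B", "accesses": 9000, "citations": 3},
    {"title": "Paper C", "accesses": 800, "citations": 25},
]


def read_but_rarely_cited(papers, min_accesses=5000, max_citations=5):
    """Return papers with many accesses but few citations, most-read first."""
    hits = [
        p for p in papers
        if p["accesses"] >= min_accesses and p["citations"] <= max_citations
    ]
    return sorted(hits, key=lambda p: p["accesses"], reverse=True)


for p in read_but_rarely_cited(papers):
    print(p["title"], p["accesses"], p["citations"])
```

In practice the access and citation counts would have to be collected by hand or exported from whatever the publisher and citation tracker offer, and sensible thresholds would depend on the field and the age of the paper.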

For bioinformatics, though, I imagine there are better statistics we can access.

Software downloads

This has to be one of the best ways of assessing the impact of a particular bioinformatics software tool – after all, every time the software is downloaded, that’s because someone wants to use it and therefore the tool is having a very real impact.  Not every use will result in a citation, and I would argue that this kind of metric is far more important in bioinformatics than the number of citations.

Of course, many people might download the tool and never install it; or install it and never use it; or use it once and decide it’s awful.  It’s more difficult to assess those kinds of use cases without spying on your users.

Tracking downloads is fairly easy on sites such as SourceForge, but one of my main gripes with GitHub is that “clone” events are not tracked, therefore 1000s of people may clone your repository and you have no way of knowing.  Trust me, when you want to apply for funding to keep your software going, or you need to submit a grant report, those statistics are going to be important.  I struggle to understand the obsession with GitHub for just this reason.

Software use

Of course, tracking this would be one of the best ways of assessing bioinformatics impact – but how to do that?  You can ask your users if it’s OK to collect usage statistics and send those home to the “mother ship” periodically, but what if they say no?  And often, for privacy reasons, they will.  Alternatively, they may be behind a firewall which forbids communication of the results.
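To make the “mother ship” idea concrete, here is a minimal Python sketch of opt-in usage reporting. It is only one possible shape, under stated assumptions: `STATS_URL` is a hypothetical endpoint, `report_usage` and its helpers are names invented for this example, and the record deliberately contains nothing that identifies the user. Reporting happens only with consent, and a blocked connection (for example behind a firewall) is swallowed so the tool itself keeps working.

```python
import json
import platform
import urllib.request

# Hypothetical collection endpoint; replace with a server you control.
STATS_URL = "https://example.org/usage"


def build_report(tool, version):
    """Build a minimal, non-identifying usage record."""
    return {"tool": tool, "version": version, "os": platform.system()}


def post_json(url, payload, timeout=2):
    """POST the payload as JSON using only the standard library."""
    data = json.dumps(payload).encode("utf-8")
    req = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=timeout):
        pass


def report_usage(tool, version, consent, send=post_json):
    """Send a usage record only with consent; never let reporting break the tool.

    Returns True if a report was sent, False otherwise.
    """
    if not consent:
        return False
    try:
        send(STATS_URL, build_report(tool, version))
        return True
    except (OSError, ValueError):
        # Network errors (including urllib's URLError, a subclass of OSError)
        # are ignored: a firewall should not stop the analysis from running.
        return False
```

A tool would call something like `report_usage("mytool", "1.0", consent=user_said_yes)` once per run. The `send` argument is there so the network call can be swapped out in tests. Note that this does nothing about the users who say no, so any count gathered this way is a lower bound on real use.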

I’d be very interested to hear from people who have ideas on how to collect statistics on how many times a bioinformatics tool is used.

Number of users/support requests/mailing list size

The above are all proxies for software use, right?  If you know you have 1000 users, your software is probably being used more than if you have 10.  Similarly the size of the mailing list would surely correlate with usage.  Number of support requests might be a little biased though – well-documented and well-written tools will get fewer requests for support than poorly documented, buggy tools.

Perhaps the best way to track and prove impact in this way is to release a really essential tool that has no documentation, and track the support requests you receive? 😉

Website uses

Those who deliver bioinformatics tools over the web have it easiest, as it is incredibly simple to track use of your tool using web logs and various other statistics you can build into the actual web code.  Easy and simple.  I wish it was so easy for every other tool!
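As a small illustration of the web log approach, here is a Python sketch that counts successful requests to a tool's URL and the number of distinct clients making them. It assumes the server writes logs in the Common Log Format; the path `/tool/run`, the sample lines, and the function name `summarise_tool_use` are all made up for this example.

```python
import re
from collections import Counter

# Matches the start of a Common Log Format line:
# client, timestamp, request method and path, status code.
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+)[^"]*" (\d{3})')


def summarise_tool_use(lines, tool_path):
    """Count successful requests to tool_path, in total and by distinct client."""
    per_client = Counter()
    for line in lines:
        m = LOG_LINE.match(line)
        if not m:
            continue  # skip lines that are not in the expected format
        client, _when, _method, path, status = m.groups()
        # Ignore any query string and count only 2xx responses.
        if path.split("?")[0] == tool_path and status.startswith("2"):
            per_client[client] += 1
    return sum(per_client.values()), len(per_client)


# Invented sample lines, using documentation-only IP addresses.
sample = [
    '192.0.2.1 - - [10/Oct/2012:13:55:36 +0000] "GET /tool/run?id=1 HTTP/1.1" 200 512',
    '192.0.2.1 - - [10/Oct/2012:13:56:02 +0000] "POST /tool/run HTTP/1.1" 200 2048',
    '192.0.2.7 - - [10/Oct/2012:14:01:10 +0000] "GET /tool/run HTTP/1.1" 200 512',
    '192.0.2.9 - - [10/Oct/2012:14:02:00 +0000] "GET /index.html HTTP/1.1" 200 100',
    '192.0.2.9 - - [10/Oct/2012:14:03:00 +0000] "GET /tool/run HTTP/1.1" 500 0',
]
runs, clients = summarise_tool_use(sample, "/tool/run")
print(runs, clients)  # 3 successful runs from 2 distinct clients
```

Distinct client addresses are only a rough stand-in for distinct users, since many people can sit behind one institutional proxy and one person can appear from several addresses.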

Website visits

Whether or not your software is delivered over the web, or simply downloaded, tracking how many people visit your website, how long they stay for and what they read, can all be good measures of impact and will probably correlate very well with software use.

This is why I think a lot of people host their own software page on their own website (rather than use SourceForge or GitHub etc) – they just have more control over what access statistics they can track.


So what have I missed? Probably lots!

The list goes on and on…!

Overall, I have to say, I find the whole REF process leaves a bad taste in my mouth.  As a bioinformatician (specifically) and as a researcher (generally) I do so much that has impact that is not tracked in any way by REF.  I haven’t even mentioned training, or support, e.g. the number of times a student comes into my office and I explain something, or show them something.  I’ve found picking out just those 4 papers to be quite demoralising, to be honest – and I’m sure the main aim of REF is not to demoralise the entire UK academic sector? (sometimes I wonder)

So – should we be proposing better ways of tracking bioinformatics impact?