
Assessing bioinformatics impact

What with the REF looming, and the rather ridiculous notion that I need to choose 4 papers, and that those 4 papers will be used to judge the last 4 years of my career, my thoughts have been turning to impact, and how we judge the impact of our research.

Is bioinformatics any different to other biological research, I wonder?


Citations

The traditional method of assessing the impact of any research is the number of citations a paper attracts, and of course we can do that for bioinformatics papers too.

My immediate thought is that bioinformatics papers can attract two types of citation:

  1. Other papers that use your software / method to gain a biological insight
  2. Other software tools and methods that claim to be better than yours

Intuitively, more of (1) and less of (2) seems better, though of course it is not necessarily a bad thing that other tools come out that are better than your own – that’s simply progress.

There are similarities in traditional biological research too – your paper on e.g. cancer genomics may be cited many times because it is an excellent basis for further research (similar to (1) above) or it may be cited because it is completely incorrect and the results are not reproducible (similar to (2)).

I have a major gripe about citations – from personal experience, and I know others have suffered the same.  Software we published in 2009, CORNA, was used extensively in the Zebra finch genome paper and was cited in the supplementary material – but PubMed and various other citation trackers do not track citations in supplementary material.  Doh!

Overall, the number of citations appears to be a very poor proxy for actual impact.

Access statistics

On publisher websites such as BioMedCentral, it’s possible to access statistics on how many times your paper has been read or downloaded.  Obviously these numbers far outstrip the number of citations, but I imagine the two are strongly correlated.  It would be interesting to identify papers that have been read a lot but cited little, to see what insight we can gain.

For bioinformatics, though, I imagine there are better statistics we can access.

Software downloads

This has to be one of the best ways of assessing the impact of a particular bioinformatics software tool – after all, every time the software is downloaded, it’s because someone wants to use it, and therefore the tool is having a very real impact.  Not every use will result in a citation, and I would argue that this kind of metric is far more important in bioinformatics than the number of citations.

Of course, many people might download the tool and never install it; or install it and never use it; or use it once and decide it’s awful.  It’s more difficult to assess those kinds of use cases without spying on your users.

Tracking downloads is fairly easy on sites such as SourceForge, but one of my main gripes with GitHub is that “clone” events are not tracked, so thousands of people may clone your repository and you have no way of knowing.  Trust me, when you want to apply for funding to keep your software going, or you need to submit a grant report, those statistics are going to be important.  I struggle to understand the obsession with GitHub for just this reason.

Software Use

Of course, tracking this would be one of the best ways of assessing bioinformatics impact – but how?  You can ask your users if it’s OK to collect usage statistics and send them home to the “mother ship” periodically, but what if they say no?  And often, for privacy reasons, they will.  Alternatively, they may be behind a firewall that blocks the connection.
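As a sketch of what the opt-in version might look like, something like the following could sit in a tool’s startup code.  The endpoint URL, the environment variable name and the payload fields are all hypothetical – the point is that nothing is sent unless the user explicitly opts in, and a failed connection (e.g. behind a firewall) never breaks the tool:

```python
import json
import os
import platform
import urllib.request

# Hypothetical endpoint and opt-in variable -- adjust for your own tool.
STATS_URL = "https://example.org/mytool/usage"
OPT_IN_VAR = "MYTOOL_SEND_STATS"

def build_ping(tool, version):
    """Assemble an anonymous usage record: no usernames, paths or input data."""
    return {
        "tool": tool,
        "version": version,
        "platform": platform.system(),
        "python": platform.python_version(),
    }

def maybe_phone_home(tool="mytool", version="1.0"):
    """Send a usage ping only if the user has explicitly opted in."""
    if os.environ.get(OPT_IN_VAR) != "1":
        return None  # no opt-in: send nothing at all
    payload = json.dumps(build_ping(tool, version)).encode()
    req = urllib.request.Request(
        STATS_URL, data=payload,
        headers={"Content-Type": "application/json"},
    )
    try:
        # Fail silently: statistics must never break the tool for the user.
        urllib.request.urlopen(req, timeout=2)
    except OSError:
        pass
    return payload
```

The asymmetry is deliberate: opting out costs the user nothing, and the network call can only ever fail quietly.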

I’d be very interested to hear from people who have ideas on how to collect statistics on how many times a bioinformatics tool is used.

Number of users/support requests/mailing list size

The above are all proxies for software use, right?  If you have 1,000 users, your software is probably being used more than if you have 10.  Similarly, the size of the mailing list should correlate with usage.  The number of support requests might be a little biased, though – well-documented, well-written tools will get fewer requests for support than poorly documented, buggy ones.

Perhaps the best way to track and prove impact in this way is to release a really essential tool that has no documentation, and track the support requests you receive? 😉

Website usage

Those who deliver bioinformatics tools over the web have it easiest, as it is incredibly simple to track use of your tool using web logs and various other statistics you can build into the web code itself.  I wish it were so easy for every other tool!
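For a web-delivered tool, a minimal sketch of the kind of statistic you can pull straight from the access logs – here counting job submissions to a hypothetical `/run` endpoint, plus unique client addresses, from Apache/nginx “combined”-style log lines:

```python
import re
from collections import namedtuple

# Matches the start of an Apache/nginx "combined" log line:
# client IP, [timestamp], then the quoted request method and path.
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+)')

def count_tool_usage(log_lines, endpoint="/run"):
    """Return (job submissions, unique client IPs) for the given endpoint."""
    ips = set()
    jobs = 0
    for line in log_lines:
        m = LOG_LINE.match(line)
        if not m:
            continue  # skip malformed lines rather than crash
        ip, _timestamp, method, path = m.groups()
        if method == "POST" and path.startswith(endpoint):
            jobs += 1
            ips.add(ip)
    return jobs, len(ips)
```

The same loop extends naturally to per-day counts or geolocation lookups; the endpoint name and the POST-only rule are assumptions about how the tool accepts jobs.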

Website visits

Whether your software is delivered over the web or simply downloaded, tracking how many people visit your website, how long they stay and what they read can all be good measures of impact, and will probably correlate very well with software use.

This is why I think a lot of people host their software on their own website (rather than use SourceForge or GitHub etc) – they simply have more control over what access statistics they can track.


So what have I missed? Probably lots!

The list goes on and on….!

Overall, I have to say, the whole REF process leaves a bad taste in my mouth.  As a bioinformatician specifically, and as a researcher generally, I do so much that has impact that is not tracked in any way by REF.  I haven’t even mentioned training or support – e.g. the number of times a student comes into my office and I explain or show them something.  I’ve found picking out just those 4 papers quite demoralising, to be honest – and I’m sure the main aim of REF is not to demoralise the entire UK academic sector? (Sometimes I wonder.)

So – should we be proposing better ways of tracking bioinformatics impact?


  1. Interesting point about highly read but low-cited articles. It shouldn’t be too hard to get some data. What might be some testable and objectively answerable questions about those?

    Here’s some suggestions:

    Time since publication. A trivial hypothesis: reads can accumulate much faster than citations, so highly profiled articles in particular would be expected to have a high read-to-citation ratio shortly after publication.

    Software articles: since you mentioned it, are software-presenting articles in fact likely to have higher read-to-citation ratios than other articles?

    How about retracted publications? Do those tend to have a different read-to-citation ratio?

    Is there a pattern to the read/citation ratio with respect to the number of citations (i.e. is it constant, or do highly cited articles get disproportionately more or fewer reads)?

    Any other? Thoughts?

  2. PyPI (the Python Package Index) recently stopped counting downloads. They reason that it’s a meaningless number, and they are correct. GitHub clones would also be meaningless. Some downloads are automated, which greatly inflates your numbers. OTOH, there may be third-party sources for your software (if it becomes widely used, there will be packages for specific operating systems/distributions).

    I still agree with you that it’s good to have a meaningless high number to plug in, but it’s all pretty meaningless.

  3. I take your point, but I don’t think we can equate “inaccurate” with “meaningless” – those numbers reflect actual events, and it’s better to track them than not, IMO.

  4. You could always make your script/binary report anonymous usage statistics every time it is run. Probably most people would be happy with this if you explained why you did it, and it didn’t affect the running of the software.

  5. Also, I wonder how many people download and use a piece of software without ever viewing the associated publication and just rely on the documentation for information?

    How about collecting email addresses at download time and then firing off a feedback link a (short) time later, asking whether they found the software useful and would use it again? Or would that be too intrusive?

  6. Great Post. Lots to think about. I’ve started throwing up some of your blog posts on my bioinformatics newsletter on scoop.it for my bioinformatics colleagues. http://www.scoop.it/t/bioinformatics-by-mel-melendrez-vallard

  7. Richard Emes @rdemes

    10th July 2013 at 2:41 pm

    Worth keeping in mind that REF (flawed as it may be – I could also highlight the inequalities of core-funded institutes being directly “scored” against relatively poorly funded universities, but probably better to keep that separate) is about institutions, not individuals, so it doesn’t matter how many people use our software tools as long as they are used by someone to get science done. Preferably this science would also get done in our own institutes, and those researchers would publish in a nice big Nature paper (edit for your traditional high-impact journal of choice). That way they keep the funds rolling in to the institute, keep the lights on, etc.
    I think the problem becomes more acute in promotion panels etc., where you are directly up against someone from a more traditional field with a more tractable measure of impact?

    Just realised I am building your Klout score as I type. Delete my post immediately 😉

  8. So I was thinking – take for example a paper that presents a LIMS – that may be read lots, as we all need LIMS – but not often cited, because, well who cites their LIMS?

  9. People hate this, though! What they want is to just clone from GitHub or download from SourceForge with little or no effort.

  10. Weeeeeell, the University gets money as part of the REF exercise – is that not akin to “core” funding? The fact that you don’t see it, and that it goes into the University coffers, is something you might want to take up with your managers 😉 But more seriously, yes, institutes do have a cushier number, but they are not usually assessed as part of the REF – we are, because we were moved into a University. Others (e.g. Pirbright) will not undergo REF; they have the IAE (institute assessment exercise). We have both (Doh!).

    I have a klout score?

  11. What we really need is (1) code with publications and (2) automatic detection of when the code uses somebody else’s code. This would be a code citation!

    You import my library, I get a code citation. Run my binary in your script, another code citation!
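As a toy sketch of what automatic “code citation” detection might look like for Python code – the registry mapping module names to authors is entirely hypothetical, and a real system would need a community-maintained database of tools and their papers:

```python
import ast

def code_citations(source, known_authors):
    """Scan a script's imports and report which 'citable' libraries it uses.

    known_authors is a toy registry mapping a top-level module name to
    whoever should receive the code citation.
    """
    tree = ast.parse(source)
    imported = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            # e.g. "import corna.stats" cites the top-level package "corna"
            imported.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            # e.g. "from corna import enrich"
            imported.add(node.module.split(".")[0])
    return {mod: known_authors[mod] for mod in imported if mod in known_authors}
```

Run over a pipeline’s scripts, this would turn “you import my library” into a countable event – though only for source that can be statically parsed.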

  12. For example, I just realised that every time I push to GitHub, the automated testing environment at Travis (which monitors my projects) does 3 clones (one for each supported version of Python; I’ll probably add another once version 3.4 comes out). Every pull request by others does the same. Over time, these add up to thousands and thousands of clones. They are not new users.

  13. Well, they don’t have to be new users to be counted as an impact.

    The fact is that if someone clones your repo for Python 2.7.3 and then clones it again for 3.3, that is impact: (i) they liked it so much at version 2.7.3 that as soon as 3.3 came out they cloned it again, because they’re still using it; or (ii) the code is so useful that users of multiple Python versions need to be supported; or (iii) your effort in making your code available is of higher impact than that of someone who releases code that will only work in COBOL.

  14. A category of bioinformatics tools has a simple and effective way to assess usage: web servers. You can collect access statistics, compare them against the number of jobs actually launched, or ask for an email upon job launch to track unique users, as well as geographical location and all sorts of metadata; it is also particularly handy for debugging purposes, as you may be able to inspect crashes and corner cases that come up with strange inputs you hadn’t thought of.

  15. I totally agree – one problem though – I don’t trust web server code 😉

  16. And you are of course right in doing so. One thing that may change your mind is web servers running open-source code – though even then there’s no guarantee the server runs exactly that code, or doesn’t track something by other means.

    It’s a matter of trust; it depends on who is running the server (which reminds me that I should publish the code running our only web server :p).

  17. I don’t think server-logged downloads are useful measures of impact, for two reasons.

    Firstly, I download most software packages I use many times over. For example, I’ve cloned the khmer repo probably over 40 times on different VMs and installs. I’ve got scripts that set up temporary environments which download bowtie2 and eXpress – inflating the download count but not the impact. I suspect many people who use cloud instances have similar setups.

    Secondly, downloads are massively inflated as part of the normal testing process (for anyone using automated code intel). My transrate gem (https://rubygems.org/gems/transrate) has been downloaded about 900 times, but I know there are at least three downloads for every commit I make, and another for every time someone makes a commit on a gem that requires mine. The same goes for any language’s package repository. Picking the user download count out of that is hard.

    Phoning home is really the only reliable way to measure actual usage. I think the way to manage that without scaring the user is to make it useful. For instance, I plan to eventually make my assemblotron gem (https://github.com/Blahah/assemblotron) optionally collect usage statistics and use them to improve its own performance. The user gets something in return for sending back anonymised data.

    Another useful metric could be user-group members. If you have an email list or a Google Group for support, you could use membership as a proxy for usage. Although this might penalise making good documentation 🙂


© 2018 Opiniomics
