Biologists: this is why bioinformaticians hate you…

The Wellcome Trust have released a data set showing article processing charges paid in 2012-2013, and you can download the results here.  I’d like to join with everyone else in congratulating the Wellcome Trust on collecting and releasing these figures, and I think MRC, BBSRC, NERC and other funding bodies should follow.

Having said that, this dataset represents everything that’s bad about science – just not in the way you might think.  Biologists: this is why bioinformaticians hate you:

1) the text in the spreadsheet is not enclosed by quotation marks, except one of the paper titles, on line 18.  Why?!!!  Why must you do this?!!

2) To add insult to injury, the title of the paper on line 1127 only contains one set of quotes!  ONE!  Do you realise how many problems that creates?!!! Gah!

3) Spelling: oh my…. where do we start on this point….!

There are 4 entries for the American Society for Biochemistry and Molecular Biology:

17 American Soc for Biochemistry and Molecular Biology 1100.00
18 American Society for Biochemistry and Molecular Biolgy 2259.64
19 American Society for Biochemistry and Molecular Biology 24404.45
20 American Society for Biochemistry and Molecular Biology 1166.60

(yes the last two look identical, but the bottom entry has an inexplicable (and inexcusable) space at the end of it)
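
To be fair, the trailing space is the one problem here that is easy to fix in code. A minimal sketch in Python, using the four rows above; the `total_by_publisher` helper is my own invention for illustration, not part of any Wellcome Trust tooling:

```python
from collections import defaultdict

# The four rows listed above. The trailing space on the last name is deliberate.
rows = [
    ("American Soc for Biochemistry and Molecular Biology", 1100.00),
    ("American Society for Biochemistry and Molecular Biolgy", 2259.64),
    ("American Society for Biochemistry and Molecular Biology", 24404.45),
    ("American Society for Biochemistry and Molecular Biology ", 1166.60),
]


def total_by_publisher(rows, clean=lambda name: name):
    """Sum costs per publisher, after passing each name through `clean`."""
    totals = defaultdict(float)
    for name, cost in rows:
        totals[clean(name)] += cost
    return dict(totals)


raw = total_by_publisher(rows)                        # names used exactly as typed
stripped = total_by_publisher(rows, clean=str.strip)  # surrounding whitespace removed

print(len(raw), "publishers before stripping,", len(stripped), "after")
```

Stripping whitespace merges the last two rows into one, but the abbreviation ("Soc") and the typo ("Biolgy") still count as separate publishers.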

There are 6 entries for BioMed Central (7 if you count BMC)!

44 Biomed Central 10645.11
45 BioMed central 4348.80
46 BioMed Central 56561.05
47 BioMed Central 10891.04
48 BioMed Central Limited 8650.00
49 BioMed Central Ltd 2529.30
54 BMC 33933.62

You have no idea how to classify the British Medical Journal 🙁

55 BMJ 44872.80
56 BMJ 3540.00
57 BMJ group 4080.00
58 BMJ Group 28230.00
59 BMJ Group 1700.00
60 BMJ Journals 2040.00
61 BMJ Publishing Group 16215.00
63 BMJ Publishing Group Ltd 12000.00
64 BMJ Publishing Group Ltd & British Thoracic Society 2340.00

Or Elsevier (I’m crying now)

98 Elsevier 936601.48
99 ELSEVIER 12083.68
100 Elsevier 18629.66
101 Elsevier (Cell Press) 7830.62
102 Elsevier / Cell Science 3895.64
103 Elsevier B.V. 5322.79
104 Elsevier Ltd 1428.68
105 Elsevier/Cell Press 4226.04

There’s no need to shout (or spell correctly)

263 The company of Biolgists 3444.00
264 The company of Biologists 1044.00
265 The Company of Biologists 4020.00
267 The Company of Biologists Ltd 1620.00

And don’t get me started on Wiley old Wiley!

139 John Wiley 9329.79
140 John Wiley & Sons 13394.41
141 JOHN WILEY & SONS 4555.12
142 John Wiley & Sons Inc 4581.55
143 John Wiley & Sons Ltd 14591.34
144 John Wiley & Sons, Inc. 1870.74
145 John Wiley and Sons 1852.01
146 John Wiley and Sons Ltd 2868.08
280 Wiley 264427.26
281 Wiley-Blackwell 121650.59
282 Wiley-Blackwell, John Wiley & Sons 1900.70
283 Wiley-VCH 12084.92
284 Wiley 22170.21
285 Wiley & Son 1800.00
286 Wiley Blackwell 9308.41
287 Wiley Online Library 1896.64
288 Wiley Subscription Services 17572.08
289 Wiley Subscription Services Inc. 23880.24
290 Wiley Subscription Services Inc 3724.65
291 Wiley Subscription Serviices Inc 1533.71
292 Wiley VCH 1635.00
293 Wiley/Blackwell 1513.73
294 Wliey-Blackwell 2400.00

I now have no hair left; I’ve torn it all out.  My teeth are just stumps from excessive gnashing.  My faith in humanity has been destroyed!

Dear Science

Every time you annotate things using free text and without using structured vocabularies, a kitten bioinformatician dies.  Every time you make a spelling mistake, computers explode in silent fury.  Your brain is amazing, it can recognise patterns in words and associate them together; alas there is no computer in existence that can match that ability.

Please – for the love of all that is good in the world – stop using free text and start using structured vocabularies, drop-down lists and ontologies.

Many thanks




  1. While I agree with the gist of this article, the quotes are actually a red herring: quotes are delimiters, and they are completely redundant as such in the XLSX file format. Text in here should not be quoted. Rather, the quotes should be added when converting the file to a CSV or similar.
    Concerning the different spellings of journal names, this is truly irritating and has caused me to add a data normalisation step in front of many of my analyses. It’s actually not too much hassle, it’s just irritating because it’s so unnecessary – and of course it creates work not just for one person but for every user of the data.

  2. Indeed, the quotes are redundant, which raises the question: why are they in the xlsx document?

    I struggle to imagine any “normalisation” step that can fix those journal names, other than a poor bioinformatician sitting there and fixing them by hand….

  3. It’s not just bioinformaticians, it’s also other biologists!
    Wouldn’t it be easier if they just stuck a PubMed ID down too? Then we could all ignore the terrible typos and just pull the data from one common source with a standardised ontology (either programmatically or manually)

    And of course you forgot to mention the American English spellings, e.g. standardised vs standardized, foetus vs fetus, oestrogen vs estrogen, etc., which double up any literature searches…

  4. At best it’s going to require a load of regexes. Ugh.

  5. The good news is that they did include PubMed IDs. The bad news is that they’re as unreliable as the rest. Many are empty fields.

  6. Mick, you’re being overly harsh on ‘Biologists’. This is not specific to them, it is typical of any human-generated data that is free text. Give anyone the chance to write something not quite computer parseable and they will.

    As mentioned on Twitter, OpenRefine is your friend here. It corrects most of these issues, short acronyms (e.g. BMJ) excepted.

    You may want to have a look at my corrected list on figShare:

  7. Yes, Regex, or outright dictionaries with mappings from all possible forms to the canonical form. In principle, if one just assumes spelling errors, a basic spell check (there are libraries for that) could also be applied. Of course the existence of technical solutions for this particular problem does not make this sloppiness acceptable.

  8. “Stop using Excel as a database” should be the message.

  9. I completely agree. Maybe something like https://www.sysmo-db.org/rightfield would be helpful if more people used it? (http://bioinformatics.oxfordjournals.org/content/27/14/2021.long)

  10. It’s worrying that even something as simple as the formatting of these IDs can’t seem to be standardised/standardized. Could someone explain the logic behind having different PMID and PMCID numbers for the same reference? Isn’t that a step backwards from having a ‘unified’ system, or am I missing something?

  11. From now on I’m using a different convention every time I have to put metadata in, just to upset the statisticians.

  12. Even a bioinformatician should realise biological data is seldom clean. Glad to see we’ve been keeping you sharp!
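
For anyone wondering what the "normalisation step" discussed in the comments might look like, here is a rough Python sketch. It combines three ideas from the thread: a regex to drop company suffixes, a mapping to canonical names, and a spell-check-style fuzzy match from the standard library's `difflib`. The key function is loosely inspired by the kind of key-based clustering OpenRefine offers, but it is not OpenRefine's implementation. The `CANONICAL` list, the suffix pattern and the 0.9 cutoff are all assumptions made for illustration; a real list would have to be curated by hand.

```python
import re
import string
from difflib import get_close_matches

# Hypothetical canonical names; a real list would be curated by hand.
CANONICAL = ["BioMed Central", "The Company of Biologists", "Wiley-Blackwell", "Elsevier"]

# Company suffixes to ignore (an assumption; extend as needed).
SUFFIXES = re.compile(r"\b(ltd|limited|inc)\b")

# Translation table that turns every ASCII punctuation character into a space.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def fingerprint(name):
    """Key that ignores case, punctuation, company suffixes and extra whitespace."""
    key = name.strip().lower().translate(PUNCT_TO_SPACE)
    key = SUFFIXES.sub(" ", key)
    return " ".join(key.split())


# Map each canonical fingerprint back to its preferred spelling.
KEYS = {fingerprint(c): c for c in CANONICAL}


def normalise(name, cutoff=0.9):
    """Return the canonical name, or None if a human needs to look at it."""
    key = fingerprint(name)
    if key in KEYS:
        return KEYS[key]
    # Fall back to a fuzzy match to catch simple typos.
    close = get_close_matches(key, list(KEYS), n=1, cutoff=cutoff)
    return KEYS[close[0]] if close else None


for raw_name in ["Biomed Central", "BioMed Central Ltd", "The company of Biolgists",
                 "Wliey-Blackwell", "Wiley/Blackwell", "ELSEVIER", "BMC"]:
    print(repr(raw_name), "->", normalise(raw_name))
```

Case, punctuation and suffix variants collapse onto the same key, and single typos such as "Biolgists" or "Wliey" are caught by the fuzzy match. Short acronyms such as "BMC" come back as `None`, so they still need a hand-written alias or a human, which matches the caveat about acronyms above.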

