bioinformatics, genomes, biology etc. "I don't mean to sound angry and cynical, but I am, so that's how it comes across"

How many hours have been wasted parsing unstructured text in biological databases?

I cannot possibly do this subject justice, and to be honest, I am just letting off steam.

Around 2008 my group won funding to extract public datasets from ArrayExpress and look at co-expression across various experiments in farm animal genomics and genetics.  Far and away the biggest problem was turning essentially unstructured metadata into something a computer could understand.

Now, 5 years later, we’re looking at doing the same with SRA data, and guess what?  Here are just a few simple examples of where we have failed completely as scientists:

Describing tissues as:

  • embryo vs embryonic vs embyonal
  • hippocampus vs hippocampal
  • stem cell vs stem_cell
  • renal_cell vs renal vs renal cell vs kidney vs kidney cell vs kidney_cell etc
  • leukemia vs leukaemia
  • fetal vs foetal
  • etc etc etc

This is just a tiny, tiny selection 🙁
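To make that concrete, here is a minimal sketch of the sort of hand-built synonym table you end up writing to collapse these variants. Everything in it is an assumption for illustration: the `SYNONYMS` table, the `normalise_tissue` function and the canonical labels are invented here and are not taken from ArrayExpress, SRA or any real ontology.

```python
import re

# Hypothetical, hand-curated table: free-text variant -> canonical label.
# The labels are made up for this example; a real table would point at
# identifiers from an actual ontology. Mapping "renal cell" to "kidney"
# is a deliberately crude choice that blurs tissue and cell type.
SYNONYMS = {
    "embryo": "embryo",
    "embryonic": "embryo",
    "embryonal": "embryo",
    "hippocampus": "hippocampus",
    "hippocampal": "hippocampus",
    "stem cell": "stem cell",
    "renal": "kidney",
    "renal cell": "kidney",
    "kidney": "kidney",
    "kidney cell": "kidney",
    "leukemia": "leukaemia",
    "leukaemia": "leukaemia",
    "fetal": "fetal",
    "foetal": "fetal",
}


def normalise_tissue(raw):
    """Return the canonical label for a free-text tissue term, or None if unknown."""
    # Lower-case, then treat underscores, hyphens and runs of whitespace
    # as a single space so "stem_cell" and "Stem  Cell" look the same.
    key = re.sub(r"[\s_\-]+", " ", raw.strip().lower())
    return SYNONYMS.get(key)
```

With this table, `normalise_tissue("kidney_cell")` and `normalise_tissue("renal")` both come back as `"kidney"`, but the misspelt `"embyonal"` returns `None` because nobody thought to add it. Every new variant means another row, which is exactly why this approach does not scale.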

Here’s the thing – every experiment in ArrayExpress or SRA is understandable individually by a human being.  However, try and automate the analysis of 100s or 1000s of datasets and you run bang into the problem of unstructured text and a lack of controlled vocabularies and ontologies.

Just imagine what a powerful dataset SRA and ArrayExpress could have been.  Oh, and here comes personalised genomics – and we’ll make the same mistakes all over again.

*big sigh*


  1. The optimistic view is that pulling these data together, at least in a simplistic fashion, is actually not that tricky when the resources exist to do so. However, the resources often don’t exist: very few projects are funded to build and maintain useful applications around controlled vocabularies, and getting buy-in from the researchers who actually deposit the data is hard. Getting the bioscience community (i.e. not the ontology generators themselves) to actively engage in creating, promoting and using ontologies is harder still.

    Metadata attribution is seen as a time sink for researchers. We may all talk about how the open science revolution has been a long time coming and how wonderful it is (not that I disagree at all), but until the value of describing data is seen at the deposition level, nothing will change.

  2. You make good points, but we have to start somewhere. For me this starts with the database, the repository, it *has* to. So the DB enforces use of controlled vocabularies, and where those appear deficient, they are expanded. We cannot expect communities to rock up at the EBI or NCBI and say “here are our ontologies, can you please use them?”. That’s never going to happen, so it has to start with the DB provider.

  3. How familiar this is. On starting one of my jobs, I was given the task of “mining GEO”. Great, I thought, there’s this thing called “MIAME”, should be quite easy.

    I assumed, for example, that samples from a particular experiment might have clear, simple labels such as “normal” and “tumour”. That it would be clear whether reported values were normalized or log-scaled and if so, how. That probeset identifiers would map to genes. It would simply be a case of studying a few records to get the general idea, then writing appropriate code.

    How naive I was; the truth of course is that GEO (and everyone else) merely encourages compliance, rather than enforcing it. Any kind of automated mining + analysis is more or less hopeless. Not to mention grave concerns regarding data QC, batch effects etc.

    I gave up and moved on. Other groups (e.g. the Butte lab) have had more success. Frankly, I wonder how they do it.

  4. Hi Neil

    I’ve met about 4 different people now who claim to have tried the same thing – mine ArrayExpress and GEO. All have ended in a nightmare of regular expressions in some desperate attempt to create something useful.

    We live and (don’t) learn


  5. Have they tried something more advanced than plain regular expressions? Just curious which methods failed; there are plenty of them.
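As one example of "something more advanced", here is a rough sketch of approximate string matching against a small vocabulary, using `difflib.get_close_matches` from the Python standard library. The `VOCAB` list, the `closest_term` function and the 0.8 cutoff are assumptions made up for illustration; nothing here is a method any of the commenters say they used.

```python
import difflib

# Hypothetical controlled vocabulary; a real one would come from an ontology.
VOCAB = ["embryo", "hippocampus", "kidney", "leukaemia", "fetal", "stem cell"]


def closest_term(raw, vocab=VOCAB, cutoff=0.8):
    """Return the vocabulary term most similar to a free-text label, or None."""
    # Light clean-up first: trim, lower-case, underscores to spaces.
    key = raw.strip().lower().replace("_", " ")
    # get_close_matches ranks candidates by similarity ratio and drops
    # anything scoring below the cutoff.
    matches = difflib.get_close_matches(key, vocab, n=1, cutoff=cutoff)
    return matches[0] if matches else None
```

This copes with spelling variants: `closest_term("leukemia")` gives `"leukaemia"` and `closest_term("foetal")` gives `"fetal"`. It does nothing for true synonyms, though. `closest_term("renal")` returns `None`, because no amount of string similarity will tell a computer that "renal" means "kidney". That link has to come from a curated vocabulary or ontology.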


© 2017 Opiniomics
