I cannot possibly do this subject justice, and to be honest, I am just letting off steam.

Around 2008 my group won funding to extract public datasets from ArrayExpress and look at co-expression across various experiments in farm animal genomics and genetics.  Far and away the biggest problem was turning essentially unstructured metadata into something a computer could understand.

Now, 5 years later, we’re looking at doing the same with SRA data, and guess what?  Here are just a few simple examples of where we have failed completely as scientists:

Describing tissues as:

  • embryo vs embryonic vs embyonal
  • hippocampus vs hippocampal
  • stem cell vs stem_cell
  • renal_cell vs renal vs renal cell vs kidney vs kidney cell vs kidney_cell etc
  • leukemia vs leukaemia
  • fetal vs foetal
  • etc etc etc

This is just a tiny, tiny selection 🙁

Here’s the thing – every experiment in ArrayExpress or SRA is understandable individually by a human being.  However, try to automate the analysis of hundreds or thousands of datasets and you run bang into the problem of unstructured text and a lack of controlled vocabularies and ontologies.
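
To make the problem concrete, here is a minimal sketch of the sort of clean-up code this forces on anyone doing a meta-analysis. Everything in it is an assumption for illustration: the `SYNONYMS` table and the `normalise_term` function are hypothetical and hand-built from the examples above, not part of any ArrayExpress or SRA tooling, and a real pipeline would map terms to an established ontology rather than to an ad hoc dictionary.

```python
import re

# Hypothetical, hand-built synonym table (variant -> canonical term).
# Illustration only: this is not a real controlled vocabulary.
SYNONYMS = {
    "embryonic": "embryo",
    "hippocampal": "hippocampus",
    "renal": "kidney",
    "renal cell": "kidney cell",
    "leukaemia": "leukemia",
    "foetal": "fetal",
}


def normalise_term(raw):
    """Lower-case the term, unify separators, then look up a canonical form."""
    # Collapse runs of whitespace, underscores and hyphens into one space,
    # so that "stem_cell" and "stem cell" compare equal.
    cleaned = re.sub(r"[\s_\-]+", " ", raw.strip().lower())
    # Unknown terms are returned in their cleaned form, not rejected.
    return SYNONYMS.get(cleaned, cleaned)


if __name__ == "__main__":
    for raw in ["Stem_Cell", "renal_cell", "Foetal", "hippocampal"]:
        print(raw, "->", normalise_term(raw))
```

Note what this does not handle: a misspelling such as "embyonal" is not in the table and falls straight through unchanged, and every new dataset brings variants nobody anticipated. A lookup table like this only ever catches up with the mess; it cannot prevent it, which is the job of a controlled vocabulary applied at submission time.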

Just imagine what a powerful dataset SRA and ArrayExpress could have been.  Oh, and here comes personalised genomics – and we’ll make the same mistakes all over again.

*big sigh*