I cannot possibly do this subject justice, and to be honest, I am just letting off steam.
Around 2008 my group won funding to extract public datasets from ArrayExpress and look at co-expression across various experiments in farm animal genomics and genetics. Far and away the biggest problem was turning essentially unstructured metadata into something a computer could understand.
Now, 5 years later, we’re looking at doing the same with SRA data, and guess what? Here are just a few simple examples of where we have failed completely as scientists:
Describing tissues as:
- embryo vs embryonic vs embyonal
- hippocampus vs hippocampal
- stem cell vs stem_cell
- renal_cell vs renal vs renal cell vs kidney vs kidney cell vs kidney_cell etc
- leukemia vs leukaemia
- fetal vs foetal
- etc etc etc
This is just a tiny, tiny selection 🙁
Here’s the thing: every experiment in ArrayExpress or SRA is understandable individually by a human being. However, try to automate the analysis of hundreds or thousands of datasets and you run bang into the problem of unstructured text and a lack of controlled vocabularies and ontologies.
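To give a flavour of the clean-up this forces on anyone doing automated analysis, here is a minimal Python sketch that folds the spelling variants listed above into one label each. The `SYNONYMS` table, its canonical labels and the `normalise` and `unmapped` helpers are all made up for illustration. They are not taken from any real ontology or from either repository, and treating "renal" and "kidney" as the same thing is a judgment call a curator would have to make.

```python
import re

# Hypothetical hand-curated map from variant spellings to one canonical label.
# The labels are illustrative only, not identifiers from a real ontology.
SYNONYMS = {
    "embryo": "embryo",
    "embryonic": "embryo",
    "embryonal": "embryo",
    "embyonal": "embryo",  # misspelling, as in the list above
    "hippocampus": "hippocampus",
    "hippocampal": "hippocampus",
    "stem cell": "stem cell",
    "renal": "kidney",
    "kidney": "kidney",
    "renal cell": "kidney cell",
    "kidney cell": "kidney cell",
    "leukemia": "leukemia",
    "leukaemia": "leukemia",
    "fetal": "fetal",
    "foetal": "fetal",
}


def normalise(term):
    """Return the canonical label for a free-text term, or None if unknown."""
    # Lowercase, then collapse underscores, hyphens and whitespace to one space,
    # so "stem_cell", "Stem Cell" and "stem-cell" all become "stem cell".
    key = re.sub(r"[\s_\-]+", " ", term.strip().lower())
    return SYNONYMS.get(key)


def unmapped(terms):
    """Return the sorted set of terms that have no canonical label yet."""
    return sorted({t for t in terms if normalise(t) is None})


if __name__ == "__main__":
    for raw in ["Stem_Cell", "foetal", " renal_cell ", "liver"]:
        print(repr(raw), "->", normalise(raw))
```

Unknown terms come back as `None` rather than a guess, so they can be collected with `unmapped` and handed to a human. That is exactly the manual curation step that a controlled vocabulary enforced at submission time would have made unnecessary.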
Just imagine what a powerful dataset SRA and ArrayExpress could have been. Oh, and here comes personalised genomics – and we’ll make the same mistakes all over again.
*big sigh*