I wanted to tell the story of our new software, poRe, which is an R package written to help users deal with Oxford Nanopore’s MinION sequencer. We recently put the pre-print on bioRxiv, and it is getting plenty of attention. So let’s get into it!
The MinION device
Whichever way you look at it, the MinION is a revolutionary device. About 6 inches long, it plugs into the USB port of a Windows laptop (Windows is mandatory for the device to run properly). As you will know, the MinION measures single strands of DNA as they pass through protein nanopores, and is capable of ultra-long reads: reads of over 100 kb have been reported.
It is very easy to imagine applications of this device which involve mobile sequencing – for example, picture a vet sat in a barn next to a sick animal, taking blood and sequencing DNA in real time to discover which pathogen is causing the illness.
However, as an early-access user of the MinION, I quickly realised that we would need to develop some software to begin to approach this vision. The MinION is clearly aimed at non-experts: biologists who want to use it for rapid sequencing “in the field”. To enable this, we need to give them software they can use, and that software needs to work on the Windows laptop itself. Relying on uploading the data to a server and analysing it there is not a solution that can work for “in the field” sequencing. (People have said to me that you can access the internet anywhere now; they are often city folk, who don’t realise that 3G/4G coverage doesn’t extend to huge swathes of the countryside, never mind third-world countries that don’t have the infrastructure.)
The MinION workflow
Once a sample has been applied to the MinION, and DNA molecules pass through nanopores on the flowcell, data collection begins. If a channel passes a number of QC metrics, each sequence read is written to a file. Due to a hairpin adapter, each DNA molecule can be read twice, termed “template” and “complement” reads. Whether the molecule is read once or twice, the raw data are written to a file on the laptop’s hard disk drive.
The directory holding these files is monitored by an agent, and once new files are discovered, they are queued for upload to metrichor, a cloud-based base-caller which takes the raw data and calls the nucleotide sequence. The base-called files are downloaded and stored in a sub-folder.
The organisation challenge
Here the challenge begins. All files produced by the MinION are in HDF5 format, a binary hierarchical data format, and carry the .fast5 extension. These files require specialised software to open. In addition, the MinION dumps all data files from multiple runs into a single directory, with the run information embedded inside each file. Each run can produce 10,000-30,000 files, so users are very quickly presented with a directory containing hundreds of thousands of files and no way of organising them into run folders. Finally, there have been, and continue to be, multiple versions of metrichor (the base-caller), so each data file can be base-called multiple times. The result is quite a large and complex data set.
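To make the organisation problem concrete, here is a minimal Python sketch of sorting a flat directory of .fast5 files into run folders, with one sub-folder per base-caller version. It is not poRe's code or API. The function name `organise_by_run` and the `read_metadata` callback are hypothetical; the callback stands in for whatever HDF5 reader pulls the run identifier and base-caller version out of each file, which is not shown here.

```python
import shutil
from pathlib import Path
from typing import Callable, Dict, Tuple


def organise_by_run(
    data_dir: Path,
    read_metadata: Callable[[Path], Tuple[str, str]],
) -> Dict[str, Path]:
    """Move each .fast5 file into data_dir/<run_id>/<basecaller_version>/.

    read_metadata is a hypothetical callback returning (run_id, version)
    for a file; in practice it would read those values from the HDF5 file.
    """
    moved = {}
    # sorted() materialises the listing, so moving files cannot disturb it.
    for path in sorted(data_dir.glob("*.fast5")):
        run_id, version = read_metadata(path)
        target_dir = data_dir / run_id / version
        target_dir.mkdir(parents=True, exist_ok=True)
        target = target_dir / path.name
        shutil.move(str(path), str(target))
        moved[path.name] = target
    return moved
```

Passing the metadata reader in as a function keeps the file-sorting logic independent of any particular HDF5 library, and files without the .fast5 extension are left where they are.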
The data extraction challenge
Lots of information is embedded within each .fast5 file, including run metrics and the nucleotide sequences that most people are interested in, and these data can only be extracted programmatically using the HDF5 library. This represents a challenge for bioinformaticians, never mind biologists (or vets!).
As a package for R, poRe runs on Windows and has an incredibly simple installation path (just two additional libraries need to be installed manually). poRe enables users to organise the MinION directory into run folders, with a sub-folder for each version of the metrichor base-caller. poRe also allows extraction of the sequence data as fastq, and collects the run statistics into an easy-to-use data.frame within R. A number of plots are built in, and of course users can plot their own graphs if they wish. poRe has also been tested on Linux, and at least one user is using it on Mac.
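As a rough illustration of what “collecting run statistics” from extracted fastq can look like, here is a small Python sketch. It is not how poRe does it (poRe is an R package), and all the names in it are hypothetical. It assumes each read has already been pulled out of its .fast5 file as a single four-line fastq record, and that qualities use the common Phred+33 encoding.

```python
from statistics import mean
from typing import Dict, Iterable, List, NamedTuple


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    quality: str


def parse_fastq_record(text: str) -> FastqRecord:
    """Parse one four-line fastq record: @name, sequence, +, qualities."""
    lines = text.strip().splitlines()
    if len(lines) != 4 or not lines[0].startswith("@") or not lines[2].startswith("+"):
        raise ValueError("not a single four-line fastq record")
    if len(lines[1]) != len(lines[3]):
        raise ValueError("sequence and quality lengths differ")
    return FastqRecord(lines[0][1:], lines[1], lines[3])


def summarise(records: Iterable[FastqRecord]) -> List[Dict[str, object]]:
    """Return one row per read, similar in spirit to rows of a data.frame."""
    rows = []
    for rec in records:
        # Assumption: Phred+33 encoding, so quality = ASCII code - 33.
        scores = [ord(ch) - 33 for ch in rec.quality]
        rows.append(
            {"name": rec.name, "length": len(rec.sequence), "mean_quality": mean(scores)}
        )
    return rows
```

Each row holds the read name, its length and its mean quality, which is the kind of per-read table that makes plotting read-length or quality distributions straightforward.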