Apologies for the title, this is basically the story of using Titus Brown’s digital normalisation technique on mRNA-Seq data, but I like snappy titles. I hope Titus doesn’t mind me saying this, but I think we all know that he is a little shouty. However, he is also a good scientist.  Titus thinks I am occasionally wrong – luckily I don’t have a problem admitting this.

The problem

Lately I have had to do some transcriptome assembly, and I have tweeted (angrily) quite a few times before that this is not a solved problem.  Yes, yes, I know – you’ve done it before and it worked.  You probably did it on something small. Yes, yes, I know, you wrote a tool and it’s better than everything else, and I should try that. But I won’t.  And no, I didn’t try Trinity; but yes, last time I tried Trinity the results were, ahem, a bit crappy.  And no, I didn’t correct the data first – because that takes ages, and yes sometimes the results are, ahem, a bit crappy.

Now that we’ve got all of that out of the way, here is the problem.  I have half a lane of HiSeq 2500 data, so it is 150bp paired-end data, approximately 63 million reads.

First, I tried Velvet/Oases – on our cluster with 512Gb of RAM.  I set off a job using the oases_pipeline.py script and forgot about it.  Five days later, I realised my cluster was in swap, reporting over 760Gb of RAM in use, so I killed it – crucially, not a single "transcripts.fa" file had been produced. Fail with a capital F.

Secondly, I tried SOAPdenovo-Trans.  Running the “all” pipeline, I had a core dump within a few hours.  Running the “pregraph,contig” commands separately seemed to work, but no output file or directory was ever produced, and the software just seemed to hang, using about 20-30Gb of RAM for about 4 days.  Then I killed it. No assembly. No beans.

Note: I have half a lane of HiSeq data.

I'm incredibly frustrated at this point: I've wasted a lot of time, I don't have a huge amount of data, and yet two published algorithms have failed despite significant computing power. I decided to try Shouty Titus Brown's Digital Normalisation.

One thing I would say is that the installation instructions for khmer/screed (needed for digital normalisation) assume i) you know a bit about Python and are probably a Python developer, and ii) you have root/install access to your server.  Now, at Edinburgh, we are not allowed root/install access to our Linux servers, so I have to install all Python libraries (numpy etc) into non-standard directories.  We also have a woefully out-of-date Python in /usr/bin/ (2.4.3), and therefore I also have Python 2.7.3 installed in a non-standard place.

Try it sometime – this does make installations pretty difficult. Go back and look at your setup.py script and tell me how many assumptions you made about where Python and its libraries are/should be.

So I gave up installing it locally and went to the cloud.  This is the story so far.

Running Digital Normalisation in the Cloud

Using Amazon's EC2, I spun up an m1.xlarge instance of Ubuntu 12.04 LTS 64bit (one of Amazon's standard AMIs).  This m1.xlarge has 4 CPU cores and 15Gb of RAM, more than enough for what I want to do.  If you don't know how to do any of this, see my guide here.

First of all, we need to install a bunch of stuff (shown here one per line for clarity)

sudo apt-get update
sudo apt-get install git
sudo apt-get install make
sudo apt-get install gcc
sudo apt-get install g++
sudo apt-get install python-dev
sudo apt-get install python-nose

I have to admit I was a bit surprised I had to install “make”, but anyway, I digress.

The next steps are pretty simple, and largely follow the tutorial here.  We clone the git repos:

git clone git://github.com/ged-lab/screed.git
git clone git://github.com/ged-lab/khmer.git

And we install screed and khmer:

cd screed
sudo python setup.py install
cd ../khmer
sudo make test
cd ..
export PYTHONPATH=$PYTHONPATH:/home/ubuntu/khmer/python/

You should now be able to run the main digital normalisation script:

~/khmer/scripts/normalize-by-median.py -h

Which produces:

usage: normalize-by-median.py [-h] [-q] [--ksize KSIZE] [--n_hashes N_HASHES]
                              [--hashsize MIN_HASHSIZE] [-C CUTOFF] [-p]
                              [-s SAVEHASH] [-l LOADHASH] [-R REPORT_FILE]
                              input_filenames [input_filenames ...]

Build & load a counting Bloom filter.

positional arguments:
  input_filenames

optional arguments:
  -h, --help            show this help message and exit
  -q, --quiet
  --ksize KSIZE, -k KSIZE
                        k-mer size to use
  --n_hashes N_HASHES, -N N_HASHES
                        number of hash tables to use
  --hashsize MIN_HASHSIZE, -x MIN_HASHSIZE
                        lower bound on hashsize to use
  -C CUTOFF, --cutoff CUTOFF
  -p, --paired
  -s SAVEHASH, --savehash SAVEHASH
  -l LOADHASH, --loadhash LOADHASH
  -R REPORT_FILE, --report-to-file REPORT_FILE

Note that the documentation is a little fragmented, and you may have to dig around in the tutorial, the docs and a few other places to figure this all out.  One thing that I was very interested in was the option to use paired reads – my data is paired, but the diginorm paper explicitly states:

“… the digital normalization algorithm does not take into account paired-end sequences…”

The first "gotcha" with paired reads is that, unless you clone the "bleeding-edge" branch from GitHub, normalize-by-median.py doesn't recognise the paired reads.  Whilst modern versions of the Illumina software output reads with the same name and rely on order within the R1 and R2 files to identify read pairs:

@DJB775P1:C0E96ACXX:8:1101:1527:2229 1:N:0:TGACCA
@DJB775P1:C0E96ACXX:8:1101:1527:2229 2:N:0:TGACCA

old versions named them like this:

@DJB775P1:C0E96ACXX:8:1101:1527:2229#0/1
@DJB775P1:C0E96ACXX:8:1101:1527:2229#0/2

Therefore, to use the paired feature with Digital Normalisation you have to write a script that renames your reads.  It’s not difficult, but it has to be done.  Otherwise clone bleeding-edge (I didn’t so can’t comment).
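
Something along these lines will do the job. This is just a minimal sketch (the script name and file names are mine, not khmer's), and it assumes plain 4-line FASTQ records with the new-style headers shown above:

#!/usr/bin/env python
# rename_reads.py -- convert new-style Illumina headers, e.g.
#   @DJB775P1:C0E96ACXX:8:1101:1527:2229 1:N:0:TGACCA
# into old-style /1 and /2 names, e.g.
#   @DJB775P1:C0E96ACXX:8:1101:1527:2229/1
# Usage: zcat reads_R1.fastq.gz | python rename_reads.py > reads_R1.rn.fastq
import sys

for i, line in enumerate(sys.stdin):
    line = line.rstrip("\n")
    if i % 4 == 0:                       # every 4th line is a FASTQ header
        name, desc = line.split(" ", 1)  # "@read_id", "1:N:0:TGACCA"
        line = name + "/" + desc[0]      # the description starts with the mate number
    sys.stdout.write(line + "\n")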

The second "gotcha" is that normalize-by-median.py expects your reads to be interleaved, much like Velvet.  So once you have renamed your reads, you then have to interleave them (there's a sketch of that below, just after my little rant).

This provoked a tweet:

Just as a quick aside – why can’t we just have one standard?! Paired files or interleaved, I don’t care, but let’s pick one and stick with it!
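
Rant over. Interleaving is also only a few lines of Python; here is a minimal sketch (again, the file names are made up, and it assumes both files are in the same order with unwrapped 4-line records). Gzip the output and you have something like the il.rn.fastq.gz I feed to normalize-by-median.py below:

#!/usr/bin/env python
# interleave.py -- interleave two renamed FASTQ files so that each /1 record
# is immediately followed by its /2 mate.
# Usage: python interleave.py reads_R1.rn.fastq reads_R2.rn.fastq > il.rn.fastq
import sys

def next_record(handle):
    """Read one 4-line FASTQ record; return an empty list at end of file."""
    rec = [handle.readline() for _ in range(4)]
    return rec if rec[0] else []

with open(sys.argv[1]) as r1, open(sys.argv[2]) as r2:
    while True:
        rec1, rec2 = next_record(r1), next_record(r2)
        if not rec1 or not rec2:
            break
        sys.stdout.writelines(rec1 + rec2)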

Note that at this point I actually have no idea what the effect of the paired option is, as I couldn't find the documentation for it.  This is probably because I am lazy and have no patience, so I didn't look hard enough.  Anyway, my reads were renamed and interleaved, and I was ready to go:

./khmer/scripts/normalize-by-median.py -C 20 -k 21 -N 4 -x 2e9 \
                                       -p il.rn.fastq.gz 1>khmer.out 2>khmer.err &

Most of those options I lifted straight from the tutorial, but I did change -k to 21 as I am allergic to even k-mers (due to palindromes).  If I have read the khmer docs correctly, -N 4 -x 2e9 means four hash tables of roughly 2Gb each, around 8Gb in total, which fits comfortably into the instance's 15Gb.  I went to sleep and the next morning it was finished.  I like algorithms like this!

The output file of sequences has the same name as the input, but with .keep appended to the end, and in my case it was uncompressed.  The first thing I wanted to know was: what had happened?  Here the output is quite handy:

:~$ more khmer.out
making hashtable
... kept 199800 of 200000 , or 99 %
... in file il.rn.fastq.gz
... kept 399418 of 400000 , or 99 %
... in file il.rn.fastq.gz
... kept 598696 of 600000 , or 99 %
... in file il.rn.fastq.gz
... kept 796760 of 800000 , or 99 %
... in file il.rn.fastq.gz
... kept 992206 of 1000000 , or 99 %

So we can plot what happened. After installing R (sudo apt-get install r-base; then installing the reshape2 and ggplot2 packages within R):

:~$ grep "... kept" khmer.out | awk '{print $3" "$5}' > kept.txt
:~$ R

library(ggplot2)
library(reshape2)

# kept.txt columns: reads kept so far, then total reads processed so far
d <- read.table("kept.txt", header=FALSE)
colnames(d) <- c("kept","total")
d$stage <- as.numeric(rownames(d))   # one row per progress report
md <- melt(d, id.vars="stage", value.name="reads")

p <- ggplot(md, aes(x=stage, y=reads, group=variable, colour=variable))
p + geom_line(size=1.5) + scale_colour_manual(values=c("blue","red"))

And I get this:

[Figure: reads kept vs. total reads processed during digital normalisation]

From the output, I can tell that I have kept approximately 15% of my data.
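
If you just want that final number without firing up R, the last progress line in khmer.out is enough. A quick sketch, assuming the log format shown above:

#!/usr/bin/env python
# Report the final kept/total ratio from the normalize-by-median.py progress log.
# Assumes lines of the form: "... kept 199800 of 200000 , or 99 %"
kept = total = 0
with open("khmer.out") as log:
    for line in log:
        if line.startswith("... kept"):
            fields = line.split()
            kept, total = int(fields[2]), int(fields[4])

if total:
    print("kept %d of %d reads (%.1f%%)" % (kept, total, 100.0 * kept / total))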

The results

So, the first thing I wanted to know was – what is the effect of the -p (--paired) option?  Well, I can tell you – all of my kept data is in pairs, there are no orphans.  Therefore, I assume the digital normalisation algorithm is the same, except that instead of deciding to keep or discard reads, it decides to keep or discard read-pairs.  This works for me 🙂
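
In case you want to run the same check on your own data, here is the sort of thing I mean: a quick sketch that assumes the renamed /1 and /2 names and unwrapped 4-line records in the .keep file.

#!/usr/bin/env python
# Count pairs and orphans in the .keep file produced by normalize-by-median.py.
# Assumes old-style /1 and /2 read names and unwrapped 4-line FASTQ records.
# Usage: python check_pairs.py il.rn.fastq.gz.keep
import sys
from collections import defaultdict

mates = defaultdict(set)
with open(sys.argv[1]) as keep:
    for i, line in enumerate(keep):
        if i % 4 == 0:                                # FASTQ header line
            name, mate = line.strip().rsplit("/", 1)  # "@read_id", "1" or "2"
            mates[name].add(mate)

pairs = sum(1 for m in mates.values() if len(m) == 2)
orphans = len(mates) - pairs
print("%d pairs, %d orphans" % (pairs, orphans))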

I have transferred the kept reads back to our local server and re-run the Velvet/Oases script.  It's running as we speak.  For each k-mer (I have given Oases a range), approximately 80Gb of RAM is used, in peaks and troughs as one k-mer assembly finishes and the next one starts.  Each k-mer is taking a few hours, rather than a day or more.  I'll come back and report on the success of the assembly later.

Initial conclusions

It looks like Digital Normalisation will allow me to do something that I was unable to do, so this is a big win for the software.  I would say that the documentation needs a little work, both for non-standard installation and for some of the newer options (e.g. --paired) – but hopefully this post will help some people who want to try it out.

Bioinformaticians: I realise my use of Oases/SOAPdenovo-Trans was not exhaustive, but please see my snarky second paragraph 🙂 I know you are itching, positively dying to comment that I should have used x because of y and when you did it worked, but that kind of misses the point – which is that Oases and SOAPdenovo-Trans are published software, and Digital Normalisation helps when they fail.