bioinformatics, genomes, biology etc. "I don't mean to sound angry and cynical, but I am, so that's how it comes across"

Shouting at my data in the cloud

Apologies for the title, this is basically the story of using Titus Brown’s digital normalisation technique on mRNA-Seq data, but I like snappy titles. I hope Titus doesn’t mind me saying this, but I think we all know that he is a little shouty. However, he is also a good scientist.  Titus thinks I am occasionally wrong – luckily I don’t have a problem admitting this.

The problem

Lately I have had to do some transcriptome assembly, and I have tweeted (angrily) quite a few times before that this is not a solved problem.  Yes, yes, I know – you’ve done it before and it worked.  You probably did it on something small. Yes, yes, I know, you wrote a tool and it’s better than everything else, and I should try that. But I won’t.  And no, I didn’t try Trinity; but yes, last time I tried Trinity the results were, ahem, a bit crappy.  And no, I didn’t correct the data first – because that takes ages, and yes sometimes the results are, ahem, a bit crappy.

Now that we’ve got all of that out of the way, here is the problem.  I have half a lane of HiSeq 2500 data, so it is 150bp paired-end data, approximately 63 million reads.

First, I tried Velvet/Oases – on our cluster with 512Gb of RAM.  I set off a job using the oases_pipeline.py script and forgot about it.  Five days later, I realised my cluster was in swap, reporting over 760Gb of RAM being used and I killed it – crucially, not a single “transcripts.fa” file had been produced. Fail with a capital F.

Secondly, I tried SOAPdenovo-Trans.  Running the “all” pipeline, I had a core dump within a few hours.  Running the “pregraph,contig” commands separately seemed to work, but no output file or directory was ever produced, and the software just seemed to hang, using about 20-30Gb of RAM for about 4 days.  Then I killed it. No assembly. No beans.

Note: I have half a lane of HiSeq data.

I’m incredibly frustrated at this point, as I’ve wasted a lot of time, I don’t have a huge amount of data, yet two published algorithms have failed on my significant computing power. I decided to try Shouty Titus Brown’s Digital Normalisation.

One thing I would say is that the installation instructions for khmer/screed (needed for digital normalisation) assume i) you know a bit about python and are probably a python developer, and ii) you have root/install access to your server.  Now, at Edinburgh, we are not allowed root/install access to our Linux servers, so I have to install all Python libraries (numpy etc) into non-standard directories.  We also have woefully out of date Python in /usr/bin/ (2.4.3) and therefore I also have Python 2.7.3 installed in a non-standard place.

Try it sometime – this does make installations pretty difficult. Go back and look at your install.py script and tell me how many assumptions you made about where Python and it’s libraries are/should be.

So I gave up installing it locally and went to the cloud.  This is the story so far.

Running Digital Normalisation in the Cloud

Using Amazon’s EC2, I span up an m1.xlarge instance of Ubuntu 12.04 LTS 64bit (one of Amazon’s standard AMIs).  This m1.xlarge has 4 CPU cores and 15Gb of RAM, more than enough for what I want to do.  If you don’t know how to do any of this, see my guide here.

First of all, we need to install a bunch of stuff (shown here one per line for clarity)

sudo apt-get update
sudo apt-get install git
sudo apt-get install make
sudo apt-get install gcc
sudo apt-get install g++
sudo apt-get install python-dev
sudo apt-get install python-nose

I have to admit I was a bit surprised I had to install “make”, but anyway, I digress.

The next steps are pretty simple, and largely follow the tutorial here.  We clone the git repos:

git clone git://github.com/ged-lab/screed.git
git clone git://github.com/ged-lab/khmer.git

And we install screed and khmer:

cd screed
sudo python setup.py install
cd ../khmer
sudo make test
cd ..
export PYTHONPATH=$PYTHONPATH:/home/ubuntu/khmer/python/

You should now be able to run the main digital normalisation script:

~/khmer/scripts/normalize-by-median.py -h

Which produces:

usage: normalize-by-median.py [-h] [-q] [--ksize KSIZE] [--n_hashes N_HASHES]
                              [--hashsize MIN_HASHSIZE] [-C CUTOFF] [-p]
                              [-s SAVEHASH] [-l LOADHASH] [-R REPORT_FILE]
                              input_filenames [input_filenames ...]

Build & load a counting Bloom filter.

positional arguments:

optional arguments:
  -h, --help            show this help message and exit
  -q, --quiet
  --ksize KSIZE, -k KSIZE
                        k-mer size to use
  --n_hashes N_HASHES, -N N_HASHES
                        number of hash tables to use
                        lower bound on hashsize to use
  -C CUTOFF, --cutoff CUTOFF
  -p, --paired
  -s SAVEHASH, --savehash SAVEHASH
  -l LOADHASH, --loadhash LOADHASH
  -R REPORT_FILE, --report-to-file REPORT_FILE

Note that the documentation is a little fragmented, and you may have to dig around in the tutorial, the docs and a few other places to figure this all out.  One thing that I was very interested in was the option to use paired reads – my data is paired, but the diginorm paper explicitly states:

“… the digital normalization algorithm does not take into account paired-end sequences…”

The first “gotcha” with paired reads is that, unless you clone the “bleeding-edge” branch from github, then normalize_by_median.py doesn’t recognise the paired reads.  Whilst modern versions of the Illumina software output reads with the same name and rely on order within the R1 and R2 files to identify read pairs:

@DJB775P1:C0E96ACXX:8:1101:1527:2229 1:N:0:TGACCA
@DJB775P1:C0E96ACXX:8:1101:1527:2229 2:N:0:TGACCA

old versions named them like this:


Therefore, to use the paired feature with Digital Normalisation you have to write a script that renames your reads.  It’s not difficult, but it has to be done.  Otherwise clone bleeding-edge (I didn’t so can’t comment).

The second “gotcha” is that normalize_by_median.py expects your reads to be interleaved, much like Velvet.  So once you have renamed your reads, you then have to interleave them.

This provoked a tweet:

Just as a quick aside – why can’t we just have one standard?! Paired files or interleaved, I don’t care, but let’s pick one and stick with it!

Note that at this point I actually have no idea what the effect of the paired option is, as I couldn’t find the documentation for it.  This is probably because I am lazy and have no patience, so I didn’t look hard enough.  Anyway, my reads were renamed and interleaved, I am ready to go:

./khmer/scripts/normalize-by-median.py -C 20 -k 21 -N 4 -x 2e9 &
                                       -p il.rn.fastq.gz 1>khmer.out 2>khmer.err

Most of those options I lifted straight from the tutorial, but I did change -k to 21 as I am allergic to even Kmers (due to palindromes).  I went to sleep and the next morning it was finished.  I like algorithms like this!

The output file of sequences has the same name as the input, but with .keep appended to the end, and in my case it was unzipped.  The first thing I wanted to know was what had happened?  Here the output is quite handy:

:~$ more khmer.out
making hashtable
... kept 199800 of 200000 , or 99 %
... in file il.rn.fastq.gz
... kept 399418 of 400000 , or 99 %
... in file il.rn.fastq.gz
... kept 598696 of 600000 , or 99 %
... in file il.rn.fastq.gz
... kept 796760 of 800000 , or 99 %
... in file il.rn.fastq.gz
... kept 992206 of 1000000 , or 99 %

So we can use this. After installing R (sudo apt-get install r-base; then installing the reshape2 and ggplot2 packages within R):

:~$ grep "... kept" khmer.out | awk '{print $3" "$5}' kept.txt
:~$ R


d <- read.table("kept.txt", header=FALSE)
colnames(d) <- c("total","kept")
d$stage <- as.numeric(rownames(d))
md <- melt(d, id.vars="stage", value.name="kmer")

p <- ggplot(md, aes(x=stage, y=kmer, group=variable, colour=variable))
p + geom_line(size=1.5) + scale_colour_manual(values=c("blue","red"))

And I get this:


From the output, I can tell I have approximately 15% of my data.

The results

So, the first thing I wanted to know was – what is the effect of the -p (–paired) option?  Well, I can tell you – all of my kept data is in pairs, there are no orphans.  Therefore, I assume the digital normalisation algorithm is the same, except instead of deciding to keep or discard reads, it decided to keep or discard read-pairs.  This works for me 🙂

I have transferred the kept reads back to our local server and re-run the velvet/oases script.  It’s running as we speak.  For each Kmer (I have given Oases a range), approximately ~80Gb of RAM is used, which is taken in peaks (as one kmer finishes) and troughs (as another starts).  Each Kmer is taking a few hours, rather than 1 day+.  I’ll come back and report on the success of the assembly later.

Initial conclusions

It looks like Digital Normalisation will allow me to do something that I was unable to do, so this is a big win for the software.  I would say that the documentation needs a little work, both for non-standard installation and for some of the newer options (e.g. –paired) – but hopefully this post will help some people who want to try it out.

Bioinformaticians: I realise my use of Oases/SOAPdenovo-Trans was not exhaustive, but please see my snarky second paragraph 🙂 I know you are itching, positively dying to comment that I should have used x because of y and when you did it worked, but that kind of misses the point – which is that Oases and SOAPdenovo-Trans are published software, and Digital Normalisation helps when they fail.


  1. Question re “I didn’t correct my data first” — meaning illumina adapter trimming and say fastx trimming or the latest and greatest app, seqtk? wherefore crappy? I personally hate chucking bases from my hard won 100 bp paired end reads but everyone says do it. And I obey.

    now WARNING: long song and dance about to ensue, relating to interleaving, split files, and different flavors of normalization (yes, with a z). First things first, I have 289 mil paired end reads or about 64GB of data (one HiSeq lane). I wrestled with fastx installation and running thereof on a 32 bit machine which gave me no end of pain in the usual anatomical place. So then I switched to seqtk in order to trim and process reads at an insanely high speed even though Titus Brown kept shouting at me not to do it. I ran seqtk with its default trimming algorithm on the interleaved PE reads, as I had already interleaved then for fastx, and why waste all that time and effort. Then I split them for good old CTB normalization and did it in paired mode –so far everything hunky dory, reducing my data down to about 5.3 GB. Feeling ruthless, I even did a second pass of normalization (trim-low-abund.py) that reduced my reads down to 1.9GB. Yowzah!

    Then I turned to trinity normalization and things got weird. I split the seqtk normalized reads into 1 and 2 files, and ran Trinity normalization in both paired and unpaired modes, and every time, now matter how I change max coverage, Trin norm pretty much obliterated my data, e.g. 34 and 31 GB files getting reduced to 164 and 101MBs. It was normalized alright, right out of existence. This led to much head scratching. So then on a whim, I said OK, I am going back to seqtk and rather than running it on the combined file, I will split the data first (which was kind of dumb since I could have just gone back to the PE reads but who was thinking clearly at this point, not me) and then run seqtk on the split files. Then I fed those separately seqtk trimmed files to Trinity and it ran like a champ, though I did not use paired mode. I ended up with 1.9 and 1.1 GB files. Still not the faintest idea why Trinity normalization was behaving so brutally towards the trimmed combined file, that was subsequently split.

    So yes dammit! let’s keep it all split or all interleaved. A person could spend a lifetime going back and forth (yawn). I am actually redoing (AGAIN) the trimming and splitting and normalizing for both because I want to make absolutely certain that pairs remain in both cases all nice and linked.

    The end.

  2. FYI, I have yet to grok why this happened ;). But I’m on it.

  3. Liz, We could verify and cross-verify until the end of days, but still be unsure of how consistent the reads (and their pairing) are post diginorm (CTB/Trinity). I seriously need to check the code and the log files of the diginorm procedures and check what they remove and how they choose what to remove!

  4. Maybe your Trinity assembly was crappy because of user error ;).. Just kidding. In any case, in many of the published benchmarks, Trinity comes out on top. This jives with many of my tests as well.. Here I’m talking not about the raw assembly but instead the assembly after post-processing (steps which I can share with you if desired). So, I’m surprised you’d had such bad luck. How are you evaluating ‘crappy’ anyway?

    Also, for khmer, I made a lab intended for grad students to use khmer on AWS. Maybe the installation instructions would be helpful (though it seems like you’ve already figured it out). http://genomebio.org/Gen711/?p=227

  5. Doesn’t seqtk just do subsampling, not normalization?

Leave a Reply

© 2018 Opiniomics

Theme by Anders NorenUp ↑