bioinformatics, genomes, biology etc. "I don't mean to sound angry and cynical, but I am, so that's how it comes across"

Generate a single contig, hybrid assembly of E coli using MiSeq and MinION data

If you have MiSeq and ONT MinION data, it is now very easy to combine them into a single-contig, high quality assembly.  Here I will start with a base Ubuntu 14.04 LTS distribution and take you through every step, including software installation.  I am starting with an Amazon EC2 m4.xlarge AMI with 4 vCPUs and 16Gb RAM.  If you want to repeat the work below, you don’t need to use EC2, any computer running Ubuntu and with enough disk/RAM should work.

For MinION data, we will use Nick’s SQK-MAP-006 data at the EBI.  Using a simple trick, and because I know the internal structure of the tar archive, we can download only the base-called data:

curl ftp://ftp.sra.ebi.ac.uk/vol1/ERA540/ERA540530/oxfordnanopore_native/MAP006-1.tar | tar -xf - MAP006-1/MAP006-1_downloads &

We can also get some MiSeq data from ENA from the same strain, E coli K-12 MG1655:

wget -q ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR262/005/SRR2627175/SRR2627175_2.fastq.gz &
wget -q ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR262/005/SRR2627175/SRR2627175_1.fastq.gz &

Now, while that is downloading, we need to install some software on the server:

sudo apt-get update
sudo apt-get install r-base-core git samtools

This installs R which we will use to extract the MinION data (using poRe), git and samtools.  Just reply “Y” to any questions.

We now need to install the poRe dependencies in R, which is very easy:


R may ask if you want to install into a local library, just say Y and accept defaults.  We need to download poRe from sourecforge and we are using version 0.16

Once downloaded, and back at the Linux command line:

R CMD INSTALL poRe_0.16.tar.gz

The fastq extraction scripts for poRe are in github, so let’s go get those:

git clone https://github.com/mw55309/poRe_scripts.git

We will assemble using SPAdes, so let’s go get that:

wget http://spades.bioinf.spbau.ru/release3.6.2/SPAdes-3.6.2-Linux.tar.gz
gunzip < SPAdes-3.6.2-Linux.tar.gz | tar xvf -

Now, we are ready to go.  First off, let’s extract the 2D sequence data as FASTQ from the MinION data.  Nick’s SQK-MAP-006 data are in the old FAST5 format so we use the script in “old_format”:

./poRe_scripts/old_format/extract2D MAP006-1/MAP006-1_downloads/pass/ > minion.pass.2D.fastq &

Finally we can use SPAdes to assemble the data.  SPAdes is a swiss-army knife of genome assembly tools, and by default includes read correction.  This takes up lots of RAM, so we are going to skip it.  We will also only use 3 kmers to save time:

./SPAdes-3.6.2-Linux/bin/spades.py --only-assembler 
				   -t 4 -k 21,51,71 
				   -1 SRR2627175_1.fastq.gz 
				   -2 SRR2627175_2.fastq.gz 
				   --nanopore minion.pass.2D.fastq 
				   -o SPAdes_hybrid & 

Use samtools to extract the top contig:

head -n 1 SPAdes_hybrid/contigs.fasta
samtools faidx SPAdes_hybrid/contigs.fasta
samtools faidx SPAdes_hybrid/contigs.fasta NODE_1_length_4620446_cov_135.169_ID_22238 > single_contig.fa

Finally, a quick comparison to the reference:

sudo apt-get install mummer 
curl -s "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=NC_000913.3&rettype=fasta&retmode=txt" > NC_000913.3.fa
nucmer NC_000913.3.fa single_contig.fa
mummerplot -png out.delta
display out.png &


And that is a single global alignment between reference and query 🙂

Simple huh?

1 Comment

  1. Aren’t you downloading the entire dataset but saving only the baseballs? In other words, saving local storage but not download time?

    Nice run through!

Leave a Reply

Your email address will not be published.




© 2017 Opiniomics

Theme by Anders NorenUp ↑