Progress is usually good, except when a reliable resource completely changes its data and file structures and messes up everything you have been trying to do.  The NCBI used to arrange bacterial genomes in a way that was very easy to understand and download; now it’s a bit tougher.  So tough, in fact, that they had to create a new FAQ.

At the same time as moving around genome data, they also decided to retire GI numbers (thanks to Andreas for the link!).

This is a problem for Kraken, a metagenomic profiler, because it relies on both the old-style genomic data structures and the GI numbers that come with them.  Below are my attempts to build a Kraken database using the new FTP structure and no GIs.

Downloading the data

It is pretty easy to follow the FAQ above, but I have created some Perl scripts here that should help.

You will need:

  • Perl
  • BioPerl
  • Git
  • A Linux shell (bash) that has wget
  • Internet access
  • A folder that you have write access to
  • Maybe 40 GB of space

The magic in these scripts is that they download the data and then add the “|kraken:taxid|xxxx” text to each sequence header.  This text replaces the GI number when building what Kraken refers to as “custom databases”.
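
To make the header change concrete, here is a minimal shell sketch of the same idea. It is an illustration only, not the code from the Perl scripts: the function name add_kraken_taxid is made up, and it assumes the tag is appended to the sequence ID (the first whitespace-delimited token of the header), which is the form the Kraken manual describes for custom databases.

```shell
# Sketch: append "|kraken:taxid|<taxid>" to the sequence ID of every FASTA header.
# add_kraken_taxid is a hypothetical helper, not part of Kraken or the repo.
add_kraken_taxid() {
    local taxid="$1"
    local fasta="$2"
    awk -v taxid="$taxid" '
        /^>/ {
            # the sequence ID is the first whitespace-delimited token
            id = $1
            # everything after the ID (the description) is kept unchanged
            rest = substr($0, length(id) + 1)
            print id "|kraken:taxid|" taxid rest
            next
        }
        { print }   # sequence lines pass through untouched
    ' "$fasta"
}
```

For example, `add_kraken_taxid 511145 genome.fna > genome.tagged.fna` would turn a header such as `>NC_000913.3 Escherichia coli` into `>NC_000913.3|kraken:taxid|511145 Escherichia coli` (the file names and taxid here are just examples).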

Note also that the scripts only download FASTA for assemblies classed as “Complete Genome”.  The possible assembly levels are:

  • Chromosome
  • Complete Genome
  • Contig
  • Scaffold

If you want any of the others, edit the scripts accordingly.
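
As a rough sketch of the kind of filter involved (again, not the code from the scripts), the assembly level can be selected straight from NCBI's assembly_summary.txt with awk. This assumes the tab-separated layout used by the NCBI FTP site, with taxid in column 6, version_status in column 11, assembly_level in column 12 and ftp_path in column 20; select_assemblies is a made-up helper name.

```shell
# Sketch: print "taxid<TAB>ftp_path" for the latest assemblies at a given level.
# Assumed assembly_summary.txt columns:
#   6 = taxid, 11 = version_status, 12 = assembly_level, 20 = ftp_path
# select_assemblies is a hypothetical helper name.
select_assemblies() {
    local summary="$1"
    local level="$2"
    awk -F '\t' -v level="$level" '
        /^#/ { next }                                   # skip comment/header lines
        $12 == level && $11 == "latest" { print $6 "\t" $20 }
    ' "$summary"
}
```

Something like `select_assemblies assembly_summary.txt "Scaffold"` would then list the taxid and FTP directory for every scaffold-level assembly in that summary file.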


# clone the git repo
git clone https://github.com/mw55309/Kraken_db_install_scripts.git

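# the scripts now live in ./Kraken_db_install_scripts
# (adjust the paths below if you cloned the repo somewhere else)

# make download directory
mkdir downloads
cd downloads

# run for each branch of life you wish to download
perl ../Kraken_db_install_scripts/download_bacteria.pl
perl ../Kraken_db_install_scripts/download_archaea.pl
perl ../Kraken_db_install_scripts/download_fungi.pl
perl ../Kraken_db_install_scripts/download_protozoa.pl
perl ../Kraken_db_install_scripts/download_viral.pl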

# you should now have five directories, one for each branch

# build a new database 
# download taxonomy
kraken-build --download-taxonomy --db kraken_bvfpa_080416

# for each branch, add all fna in the directory to the database
for dir in fungi protozoa archaea viral bacteria; do
        # glob directly rather than parsing ls; quote to survive odd file names
        for fna in "$dir"/*.fna; do
                kraken-build --add-to-library "$fna" --db kraken_bvfpa_080416
        done
done

# build the actual database
kraken-build --build --db kraken_bvfpa_080416


I haven’t tested the above database yet, but the output of the build script ends with:


Creating GI number to seqID map (step 4 of 6)...
GI number to seqID map created. [14.935s]
Creating seqID to taxID map (step 5 of 6)...
7576 sequences mapped to taxa. [1.188s]
Setting LCAs in database (step 6 of 6)...
Database LCAs set. [21m24.805s]
Database construction complete. [Total: 1h1m28.108s]

Which certainly makes it look like things have worked.
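
One cheap extra check, which is a suggestion rather than something I did at the time: count the FASTA records that were added and compare the total with the "sequences mapped to taxa" figure from step 5. The sketch below uses a made-up helper name, count_fasta_records, and searches the given directories recursively for .fna files.

```shell
# Sketch: count FASTA records (lines starting with ">") in all .fna files
# under the given directories. count_fasta_records is a hypothetical helper.
count_fasta_records() {
    find "$@" -type f -name '*.fna' -exec cat {} + | grep -c '^>'
}
```

Run from the downloads directory as `count_fasta_records fungi protozoa archaea viral bacteria`. If the number is noticeably higher than the one in the build log, some sequences probably did not get mapped to a taxid and are worth a closer look.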

I’ll update the post with info from testing in due course.

 

Update 13/04/2016

I simulated some reads from a random set of FASTA files in the database, then ran Kraken, and it worked (give or take a few false positives).

[Image: kraken_test_png]