Opiniomics

bioinformatics, genomes, biology etc. "I don't mean to sound angry and cynical, but I am, so that's how it comes across"

Building a Kraken database with new FTP structure and no GI numbers

Progress is usually good, except when a reliable resource completely changes their data and file structures and mess up everything you have been trying to do.  The NCBI used to arrange bacterial genomes in such a very easy way to understand and download; now it’s a bit tougher.  So tough in fact that they had to create a new FAQ.

At the same time as moving around genome data, they also decided to retire GI numbers (thanks to Andreas for the link!).

This is a problem for Kraken, a metagenomic profiler, because it relies both on the old style genomic data structures and the GI numbers that come with them.  Below are my attempts to build a Kraken database, using the new FTP structure and no GIs.

Downloading the data

It is pretty easy to follow the FAQ above, but I have created some perl scripts here that should help.

You will need:

  • Perl
  • BioPerl
  • Git
  • A linux shell (bash) that has wget
  • Internet access
  • A folder that you have write access to
  • Maybe 40Gb of space

The magic in these scripts is that they download the data and then add the necessary “|kraken:taxid|xxxx” text that replaces GI number for building what Kraken refers to as “custom databases”

Note also that the script only downloads fasta for assemblies classed as “Complete Genome”.  Other options are:

  • Chromosome
  • Complete Genome
  • Contig
  • Scaffold

If you want these others then edit the scripts appropriately.


# clone the git repo
git clone https://github.com/mw55309/Kraken_db_install_scripts.git

# either put the scripts in your path or run them from that directory

# make download directory
mkdir downloads
cd downloads

# run for each branch of life you wish to download
perl download_bacteria.pl
perl download_archaea.pl
perl download_fungi.pl
perl download_protozoa.pl
perl download_viral.pl

# you should now have five directories, one for each branch

# build a new database 
# download taxonomy
kraken-build --download-taxonomy --db kraken_bvfpa_080416

# for each branch, add all fna in the directory to the database
for dir in fungi protozoa archaea viral bacteria; do
        for fna in `ls $dir/*.fna`; do
                kraken-build --add-to-library $fna --db kraken_bvfpa_080416
        done
done

# build the actual database
kraken-build --build --db kraken_bvfpa_080416


I haven’t tested the above database yet, but the output of the build script ends with:


Creating GI number to seqID map (step 4 of 6)...
GI number to seqID map created. [14.935s]
Creating seqID to taxID map (step 5 of 6)...
7576 sequences mapped to taxa. [1.188s]
Setting LCAs in database (step 6 of 6)...
Database LCAs set. [21m24.805s]
Database construction complete. [Total: 1h1m28.108s]

Which certainly makes it look like things have worked.

I’ll update the post with info from testing in due course.

 

Update 13/04/2016

I simulated some reads from a random set of fastas in the database, then ran Kraken, and it worked (plus/minus a few false positives)

kraken_test_png

21 Comments

  1. Hey Mick,

    thanks for digging into it and writing it up!
    Here’s a link for NCBI’s abolishing of GI numbers:
    https://www.ncbi.nlm.nih.gov/news/03-02-2016-phase-out-of-GI-numbers/

    Cheers,
    Andreas

  2. “At the same time as moving around genome data, they also decided to retire GI numbers”

    Relevant link:

    http://www.ncbi.nlm.nih.gov/news/03-02-2016-phase-out-of-GI-numbers/

  3. Hi!

    When working with human metagenomic samples, I’ve found it very important to make sure that the database includes Eukaryotic pathogen genomes as well as human genomes. Without the human genome, a lot of human reads will falsely identify as eukaryotic pathogens (a lot of my human reads from blood turned up as malaria, which shocked us a lot).

    Would you be willing to let us know how we can add the human genome from RefSeq along with this code?

    Thanks a bunch!

    Cheers,

    Henry

  4. Bonnie,

    I looked into the EuPathDB paper and found that it only covers around a couple dozen species or so. On RefSeq, there are 184 fungi sequences (not species, I didn’t get to count the number of species), with just 6 complete sequences. There are 72 total protozoa sequences, with just 2 complete protozoa sequences.

    Meanwhile, there are 5,493 complete sequences for viruses, and 4,787 complete sequences for bacteria in RefSeq.

    The Lowest Common Ancestor (LCA) method dictates that if you don’t have all (or most) the eukaryotic pathogens out there in your database, you’ll inevitably come up with false positives when your sample has a pathogen not in your database–the sample’s reads could align to conserved sequences across another related species that’s in the database. That’s a problem for bacteria/viruses too, but not nearly to the extent that I thin it is for eukaryotic, since there are so few eukaryotic pathogens in the database, at least I think. You can read more about it in this paper: http://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-015-0788-5

    My question to you, Bonnie, is whether you think the number of sequences in RefSeq (or EuPathDB) covers the actual amount of diversity needed to make adequate assignments to eukaryotic pathogens? My hunch is no. I’ve left in eukaryotic pahogens in the RefSeq database I made for Kraken, but once I analyze my Kraken data, I ignore all the eukaryotic assignments because I can’t trust them.

    Cheers,

    Henry

  5. Hi,

    Thanks for the script. however I am receiving this error

    Unable to download assembly_summary.txt (wget works fine)

    Could you please help.

    thanks
    Sid

  6. Hi Mark,
    I noticed the add-to-library step just copies the genome files into library folder with the file names changed to a some kind of code. That would results in two hard copy of all genomes, which will cost double space. Because I or someone else want to keep the original genomes with the proper name, can I just move the e.g. viral folder under/to library folder? or make soft links. Or maybe you have some better ideas. Thanks!

  7. Hi,

    I run your scripts and they worked fine. Thank you!

    However, when I went through the fungi/protozoa directory, I found that the representativeness of the sequences was very low. I mean, the number of sequences retrieved to build the database was quite low for fungi and protozoa.

    I looked at the scripts, and they look like great. So, I don`t see any explanation for that result. Can you please help me to figure out what happened?

    Thank you,

    Andre

    • biomickwatson

      19th August 2016 at 5:05 pm

      The scripts filter for complete genomes I think and there are very few of those for those categories. Try changing scripts to download draft?

      • Why do some papers only download complete genomes, and some of them consider all the available seqeunces (plamisd, contigs, scaffold) from refseq(bacterial species-42000, viral- 5400)?

  8. Hi, Mick.

    Thanks again.

    I have downloaded the chromosomes, scaffold and contigs as you advised me.
    I think we need the jellyfish software to build the database, right?
    Do I need to include it in your script or just load it separately?

    Thank you.

  9. Hi, Mick!

    Just wanted to let you know that everything run well! I`ve created all the databases I needed based on your scripts. Did just some editings to adapt them to my goals, and all I have to say is to thank you for your outstanding job! God bless you!

    Thanks a million,

  10. Hi Mick!

    Any experience adding the plasmid data from refseq to the larger bvfpa dataset you describe above? I noticed when I ran some of my data through, that only some of the plasmids are included in the bacterial refseq data, and I was hoping to include the rest.

Leave a Reply

Your email address will not be published.

*

*

code

© 2017 Opiniomics

Theme by Anders NorenUp ↑