Kraken (benchmark)

Name	Kraken
Caption	Kraken taxonomic sequence classifier
Developer	Derrick Wood and Steven Salzberg
Released	2014
Latest release	2.x
Operating system	Linux, macOS, Windows Subsystem for Linux
Programming language	C++
License	GPL-3.0 (Kraken 1), MIT (Kraken 2)

Contents

Overview
Methodology
Datasets and Input Sources
Performance Metrics and Scoring
Results and Comparisons
Limitations and Criticism
Adoption and Impact on Bioinformatics

Kraken is a high-throughput taxonomic sequence classifier, first described in 2014, that assigns taxonomic labels to short DNA reads using exact k-mer matches. It was developed to accelerate metagenomic analysis and has been associated with projects such as the Human Microbiome Project and MetaSUB, with data from the International Nucleotide Sequence Database Collaboration, and with clinical surveillance workflows at the Centers for Disease Control and Prevention.

Overview

Kraken was introduced in a 2014 publication by Derrick Wood and Steven Salzberg. It combines hash-table-based k-mer indexing with the lowest-common-ancestor (LCA) approach to taxonomic assignment used earlier in MEGAN. Later classifiers such as Centrifuge, which indexes references with the Burrows–Wheeler transform, and Kaiju address the same problem with different data structures. The software constructs a database of k-mer-to-taxon mappings based largely on references from RefSeq and GenBank, and on curated collections such as those used by the Genome Taxonomy Database. Kraken aims to provide rapid classification comparable to pipelines using BLAST or DIAMOND, but at speeds orders of magnitude higher, which suits large projects like the Earth Microbiome Project and outbreak response coordinated with the World Health Organization or Public Health England partners.
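The database construction described above can be illustrated with a minimal Python sketch. It is a toy under stated assumptions, not Kraken's implementation: the taxonomy, the taxon IDs, the reference sequences, and the function names are all hypothetical, and real databases also handle reverse complements and use far more compact storage. Each k-mer is mapped to the lowest common ancestor of every reference taxon that contains it.

```python
from typing import Dict, Iterator, List

# Toy taxonomy as child -> parent; taxon 1 is the root. All IDs are hypothetical.
PARENT: Dict[int, int] = {1: 1, 10: 1, 11: 10, 12: 10, 20: 1}


def kmers(seq: str, k: int) -> Iterator[str]:
    """Yield every overlapping k-mer of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def lineage(taxon: int) -> List[int]:
    """Return the path from taxon up to the root, inclusive."""
    path = [taxon]
    while PARENT[taxon] != taxon:
        taxon = PARENT[taxon]
        path.append(taxon)
    return path


def lca(a: int, b: int) -> int:
    """Lowest common ancestor of two taxa in the toy tree."""
    ancestors_of_a = set(lineage(a))
    for node in lineage(b):
        if node in ancestors_of_a:
            return node
    raise ValueError("taxa do not share a root")


def build_database(references: Dict[int, str], k: int) -> Dict[str, int]:
    """Map each k-mer to the LCA of all reference taxa that contain it."""
    db: Dict[str, int] = {}
    for taxon, seq in references.items():
        for kmer in kmers(seq, k):
            db[kmer] = lca(db[kmer], taxon) if kmer in db else taxon
    return db


# Hypothetical reference sequences keyed by taxon ID.
references = {11: "ACGTACGA", 12: "ACGTTTGA", 20: "GGGTACGA"}
db = build_database(references, k=4)

print(db["CGTA"])  # 11: found only in taxon 11
print(db["ACGT"])  # 10: shared by sibling taxa 11 and 12
print(db["GTAC"])  # 1: shared by taxa 11 and 20, whose only common ancestor is the root
```

A k-mer that is unique to one genome keeps that genome's taxon, while a k-mer shared across distant lineages is pushed up toward the root and carries little taxonomic information.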

Methodology

Kraken's core method builds a k-mer index from reference sequences linked to nodes in the NCBI Taxonomy tree, then labels reads by querying their constituent k-mers and resolving the resulting taxonomic hits through lowest common ancestor assignments. This strategy contrasts with alignment-based approaches such as BLAST and with probabilistic, composition-based models. The software relies on exact matching of fixed-length k-mers, similar to the hashing strategies in Jellyfish, and leverages minimizer concepts akin to those in Minimap2 and Mash for database compaction in successor versions. Memory optimization in those versions relies on compact, probabilistic hash-table designs related in spirit to Bloom filters and to the filter structures used in k-mer counters such as Squeakr.
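The read-labelling step can be sketched as follows. This is a simplified illustration rather than Kraken's code: the taxonomy, the pre-built k-mer table `DB`, and the function names are hypothetical. Each k-mer hit votes for a taxon, every hit taxon is scored by the total number of hits on its path to the root, and the highest-scoring taxon is reported, with ties resolved by taking the lowest common ancestor of the tied taxa.

```python
from collections import Counter
from typing import Dict, List, Optional

# Toy taxonomy as child -> parent; taxon 1 is the root. All IDs are hypothetical.
PARENT: Dict[int, int] = {1: 1, 10: 1, 11: 10, 12: 10, 20: 1}

# Hypothetical pre-built k-mer -> taxon table for k = 4.
DB: Dict[str, int] = {"ACGT": 10, "CGTA": 11, "GTAC": 1, "CGTT": 12, "GGGT": 20}


def lineage(taxon: int) -> List[int]:
    """Return the path from taxon up to the root, inclusive."""
    path = [taxon]
    while PARENT[taxon] != taxon:
        taxon = PARENT[taxon]
        path.append(taxon)
    return path


def lca(a: int, b: int) -> int:
    """Lowest common ancestor of two taxa in the toy tree."""
    ancestors_of_a = set(lineage(a))
    for node in lineage(b):
        if node in ancestors_of_a:
            return node
    raise ValueError("taxa do not share a root")


def classify(read: str, db: Dict[str, int], k: int) -> Optional[int]:
    """Label a read from its exact k-mer hits, or return None if nothing hits."""
    hits = Counter(
        db[read[i:i + k]]
        for i in range(len(read) - k + 1)
        if read[i:i + k] in db
    )
    if not hits:
        return None  # unclassified: no k-mer of the read is in the database

    def path_score(taxon: int) -> int:
        # Sum the hit counts of the taxon and all of its ancestors.
        return sum(hits[node] for node in lineage(taxon))

    best = max(path_score(t) for t in hits)
    tied = [t for t in hits if path_score(t) == best]
    result = tied[0]
    for taxon in tied[1:]:
        result = lca(result, taxon)  # ambiguous evidence falls back to the LCA
    return result


print(classify("ACGTAC", DB, 4))    # 11
print(classify("CGTACGTT", DB, 4))  # 10 (taxa 11 and 12 tie, so their LCA is reported)
print(classify("TTTTTT", DB, 4))    # None
```

Because no alignment is computed, the cost per read is a handful of table lookups, which is the source of the speed advantage over alignment-based tools.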

Datasets and Input Sources

Typical Kraken deployments build databases from public repositories including RefSeq Microbial, RefSeq Viral, and GenBank, and from specialized collections like the Human Microbiome Project reference sets. Sample reads originate from sequencing platforms such as Illumina, Oxford Nanopore Technologies, and Pacific Biosciences. Benchmarking and validation often use synthetic or mock-community datasets from efforts like the Critical Assessment of Metagenome Interpretation (CAMI) and Mockrobiota, or simulated reads generated by tools like ART and wgsim. Environmental studies may use data from the NCBI Sequence Read Archive or from project consortia including Tara Oceans and the Global Ocean Sampling Expedition.

Performance Metrics and Scoring

Evaluations of Kraken use standard classification metrics (precision, recall, and F1 score) computed at taxonomic ranks defined by the NCBI Taxonomy and validated with benchmark suites such as CAMI and curated mock communities from the Human Microbiome Project. Speed and memory are measured as runtime in seconds and peak RAM consumption on platforms like Amazon Web Services EC2 instances and institutional clusters managed by the Slurm Workload Manager. Comparative studies report these metrics alongside tools like Centrifuge, Kaiju, MetaPhlAn, and CLARK, and alongside alignment-based standards such as BLAST+ or DIAMOND, to contextualize trade-offs between sensitivity and computational cost in scenarios relevant to Centers for Disease Control and Prevention outbreak investigations and analyses at the scale of the European Nucleotide Archive.
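As a worked illustration of the metrics named above, the sketch below scores per-read predictions against known labels. It assumes one common convention, which individual benchmarks may vary: precision is the fraction of classified reads that are correct, recall is the fraction of all reads that are correct, and unclassified reads are represented as `None`. The read names and taxon labels are hypothetical.

```python
from typing import Dict, Optional, Tuple


def score(
    truth: Dict[str, int],
    predicted: Dict[str, Optional[int]],
) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) for per-read taxonomic labels."""
    total = len(truth)
    # A read counts as classified if the tool assigned it any label.
    classified = sum(1 for read in truth if predicted.get(read) is not None)
    # A read counts as correct if the assigned label equals the true label.
    correct = sum(1 for read, taxon in truth.items() if predicted.get(read) == taxon)

    precision = correct / classified if classified else 0.0
    recall = correct / total if total else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical ground truth and predictions for four reads.
truth = {"r1": 101, "r2": 101, "r3": 102, "r4": 102}
predicted = {"r1": 101, "r2": 102, "r3": 102, "r4": None}

precision, recall, f1 = score(truth, predicted)
print(f"{precision:.3f} {recall:.3f} {f1:.3f}")  # 0.667 0.500 0.571
```

In this example two of the three classified reads are correct (precision 2/3), two of all four reads are correct (recall 1/2), and the F1 score is their harmonic mean, 4/7. Evaluating at a given rank would first map both labels to their ancestor at that rank before comparing.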

Results and Comparisons

Original reports showed Kraken achieving higher throughput than BLAST and accuracy comparable to alignment-based classifiers on short reads from Illumina MiSeq and HiSeq platforms. It often outperformed marker-gene-based tools like MetaPhlAn in speed, and later comparisons found that it matched or exceeded the taxonomic resolution of Centrifuge and CLARK at species and genus ranks. Later iterations and related tools such as Kraken 2, Bracken, and KrakenUniq addressed memory reduction, abundance estimation, and unique k-mer counting, respectively. These issues had been highlighted in comparative benchmarks performed by groups involved with CAMI and in community-driven evaluations at International Society for Computational Biology workshops.

Limitations and Criticism

Critiques of Kraken emphasize its dependency on comprehensive reference databases like RefSeq Microbial and its susceptibility to false positives or misclassification when reference representation is sparse or contaminated. The same issues have been noted in evaluations of GenBank-derived databases and in discussions involving NIH data management practices. Memory footprint concerns prompted the development of compressed indices and of alternatives such as Centrifuge and Bloom filter–based classifiers. Taxonomic assignment based on exact k-mers can also struggle with highly divergent sequences from novel taxa, such as those encountered in Tara Oceans and deep biosphere studies, paralleling limitations discussed in reviews published in Nature Methods and in community benchmarking by CAMI.
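The sensitivity of exact k-mer matching to sequence divergence can be shown with a small sketch. The sequences, the value of k, and the function name are hypothetical and chosen only for illustration. A single substitution alters every k-mer that overlaps it, so up to k consecutive k-mers stop matching the reference.

```python
def shared_fraction(reference: str, read: str, k: int) -> float:
    """Fraction of the read's k-mers that occur exactly in the reference."""
    ref_kmers = {reference[i:i + k] for i in range(len(reference) - k + 1)}
    read_kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
    if not read_kmers:
        return 0.0
    return sum(kmer in ref_kmers for kmer in read_kmers) / len(read_kmers)


reference = "ACGTTGCATGCCAGTA"   # hypothetical reference sequence
read = reference[2:14]           # a 12-base read drawn exactly from the reference
mutated = read[:5] + "T" + read[6:]  # the same read with one substitution (A -> T)

print(shared_fraction(reference, read, 5))     # 1.0
print(shared_fraction(reference, mutated, 5))  # 0.375
```

Here one mismatch in twelve bases removes five of the read's eight 5-mers from the set of exact matches. With realistic k values and reads from a genome that differs from every reference at many positions, a read may retain no matching k-mers at all and go unclassified.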

Adoption and Impact on Bioinformatics

Kraken and its successors have influenced metagenomics workflows in academic centers like the Broad Institute, public-health laboratories such as Public Health England, and consortia including the Human Microbiome Project and the Earth Microbiome Project. They are commonly integrated into pipelines built with Nextflow or Snakemake and packaged for container platforms like Docker and Singularity. The tool's emphasis on speed and database-driven classification shaped practices in pathogen surveillance, environmental microbiology, and clinical metagenomics, informing standards discussed at forums like the International Conference on Research in Computational Molecular Biology and in guidelines referenced by the World Health Organization and national public-health agencies.
