PopovIILab · iliapopov17 · Jun 9, 2025 · Jun 1, 2025
diff --git a/.github/workflows/draft-pdf.yml b/.github/workflows/draft-pdf.yml
@@ -0,0 +1,28 @@
+name: Draft PDF
+on:
+  push:
+    paths:
+      - paper/**
+      - .github/workflows/draft-pdf.yml
+
+jobs:
+  paper:
+    runs-on: ubuntu-latest
+    name: Paper Draft
+    steps:
+      - name: Checkout
+        uses: actions/checkout@v4
+      - name: Build draft PDF
+        uses: openjournals/openjournals-draft-action@master
+        with:
+          journal: joss
+          # This should be the path to the paper within your repo.
+          paper-path: paper/paper.md
+      - name: Upload
+        uses: actions/upload-artifact@v4
+        with:
+          name: paper
+          # This is the output path where Pandoc will write the compiled
+          # PDF. Note, this should be the same directory as the input
+          # paper.md
+          path: paper/paper.pdf
diff --git a/README.md b/README.md
diff --git a/paper/paper.bib b/paper/paper.bib
@@ -0,0 +1,150 @@
+@article{wood2014kraken,
+  author    = {Derrick E. Wood and Steven L. Salzberg},
+  title     = {Kraken: ultrafast metagenomic sequence classification using exact alignments},
+  journal   = {Genome Biology},
+  volume    = {15},
+  number    = {3},
+  pages     = {R46},
+  year      = {2014},
+  doi       = {10.1186/gb-2014-15-3-r46}
+}
+
+@article{wood2019kraken2,
+  author    = {Derrick E. Wood and Jennifer Lu and Ben Langmead},
+  title     = {Improved metagenomic analysis with Kraken 2},
+  journal   = {Genome Biology},
+  volume    = {20},
+  pages     = {257},
+  year      = {2019},
+  doi       = {10.1186/s13059-019-1891-0}
+}
+
+@article{lu2022kraken,
+  author    = {Jennifer Lu and Natalia Rincon and Derrick E. Wood and Florian P. Breitwieser and Christopher Pockrandt and Ben Langmead and Steven L. Salzberg and Martin Steinegger},
+  title     = {Metagenome analysis using the Kraken software suite},
+  journal   = {Nature Protocols},
+  volume    = {17},
+  pages     = {2815--2839},
+  year      = {2022},
+  doi       = {10.1038/s41596-022-00738-y}
+}
+
+@article{lu2017bracken,
+  author    = {Jennifer Lu and Florian P. Breitwieser and Peter Thielen and Steven L. Salzberg},
+  title     = {Bracken: estimating species abundance in metagenomics data},
+  journal   = {PeerJ Computer Science},
+  volume    = {3},
+  pages     = {e104},
+  year      = {2017},
+  doi       = {10.7717/peerj-cs.104}
+}
+
+@article{kim2024metabuli,
+  author    = {Jaebeom Kim and Martin Steinegger},
+  title     = {Metabuli: sensitive and specific metagenomic classification via joint analysis of amino acid and DNA},
+  journal   = {Nature Methods},
+  volume    = {21},
+  number    = {6},
+  pages     = {971--973},
+  year      = {2024},
+  doi       = {10.1038/s41592-024-02273-y}
+}
+
+@article{breitwieser2018krakenuniq,
+  author    = {Florian P. Breitwieser and Daniel N. Baker and Steven L. Salzberg},
+  title     = {KrakenUniq: confident and fast metagenomics classification using unique k-mer counts},
+  journal   = {Genome Biology},
+  volume    = {19},
+  pages     = {198},
+  year      = {2018},
+  doi       = {10.1186/s13059-018-1568-0}
+}
+
+@article{kim2016centrifuge,
+  author    = {Daehwan Kim and Li Song and Florian P. Breitwieser and Steven L. Salzberg},
+  title     = {Centrifuge: rapid and sensitive classification of metagenomic sequences},
+  journal   = {Genome Research},
+  volume    = {26},
+  number    = {12},
+  pages     = {1721--1729},
+  year      = {2016},
+  doi       = {10.1101/gr.210641.116}
+}
+
+@article{ounit2015clark,
+  author    = {Rachid Ounit and Steve Wanamaker and Timothy J. Close and Stefano Lonardi},
+  title     = {CLARK: fast and accurate classification of metagenomic and genomic sequences using discriminative k-mers},
+  journal   = {BMC Genomics},
+  volume    = {16},
+  pages     = {236},
+  year      = {2015},
+  doi       = {10.1186/s12864-015-1419-2}
+}
+
+@article{menzel2016kaiju,
+  author    = {Peter Menzel and Kim Lee Ng and Anders Krogh},
+  title     = {Fast and sensitive taxonomic classification for metagenomics with Kaiju},
+  journal   = {Nature Communications},
+  volume    = {7},
+  pages     = {11257},
+  year      = {2016},
+  doi       = {10.1038/ncomms11257}
+}
+
+@article{breitwieser2020pavian,
+  author    = {Florian P. Breitwieser and Steven L. Salzberg},
+  title     = {Pavian: interactive analysis of metagenomics data for microbiome studies and pathogen identification},
+  journal   = {Bioinformatics},
+  volume    = {36},
+  number    = {4},
+  pages     = {1303--1304},
+  year      = {2020},
+  doi       = {10.1093/bioinformatics/btz715}
+}
+
+@article{sczyrba2017cami,
+  author    = {Alexander Sczyrba and Peter Hofmann and Peter Belmann and David Koslicki and Stefan Janssen and Johannes Dröge and Ivan Gregor and Stephan Majda and Jessika Fiedler and Eik Dahms and others},
+  title     = {Critical Assessment of Metagenome Interpretation: a benchmark of metagenomics software},
+  journal   = {Nature Methods},
+  volume    = {14},
+  number    = {11},
+  pages     = {1063--1071},
+  year      = {2017},
+  doi       = {10.1038/nmeth.4458}
+}
+
+@article{Hunter2007,
+  author    = {John D. Hunter},
+  title     = {Matplotlib: A 2D Graphics Environment},
+  journal   = {Computing in Science \& Engineering},
+  volume    = {9},
+  number    = {3},
+  pages     = {90--95},
+  year      = {2007},
+  publisher = {IEEE Computer Society},
+  doi       = {10.1109/MCSE.2007.55},
+  url       = {https://doi.org/10.1109/MCSE.2007.55}
+}
+
+@software{reback2020pandas,
+    author       = {The pandas development team},
+    title        = {pandas-dev/pandas: Pandas},
+    month        = feb,
+    year         = 2020,
+    publisher    = {Zenodo},
+    version      = {latest},
+    doi          = {10.5281/zenodo.3509134},
+    url          = {https://doi.org/10.5281/zenodo.3509134}
+}
+
+@article{Waskom2021,
+  author    = {Michael L. Waskom},
+  title     = {seaborn: statistical data visualization},
+  journal   = {Journal of Open Source Software},
+  volume    = {6},
+  number    = {60},
+  pages     = {3021},
+  year      = {2021},
+  doi       = {10.21105/joss.03021},
+  url       = {https://doi.org/10.21105/joss.03021}
+}
diff --git a/paper/paper.md b/paper/paper.md
@@ -0,0 +1,56 @@
+---
+title: 'KrakenParser: Efficient Post-processing of Kraken2-like Taxonomic Reports'
+tags:
+  - Python
+  - metagenomics
+  - taxonomy datasets
+  - data parsing
+authors:
+  - name: Ilia V. Popov
+    orcid: 0000-0001-7947-1654
+    affiliation: "1"
+affiliations:
+ - name: Faculty of Bioengineering and Veterinary Medicine, Don State Technical University, Russia
+   index: 1
+   ror: 00x5je630
+date: 1 June 2025
+bibliography: paper.bib
+---
+
+# Summary
+
+`KrakenParser` is an open-source tool, with a command-line interface and a Python API, that streamlines the post-processing of metagenomic classification results produced by `Kraken2` [@wood2019kraken2] and similar taxonomic profilers such as `Bracken` [@lu2017bracken] and `Metabuli` [@kim2024metabuli]. `Kraken2` is a widely used taxonomic classifier that assigns metagenomic reads to taxa using exact k-mer matches, achieving high speed and accuracy. However, the raw output of `Kraken2` and related tools is a per-sample text report that is cumbersome to interpret and to aggregate across multiple samples. `KrakenParser` addresses this by converting multiple `Kraken`-format reports into structured CSV tables at each taxonomic rank from phylum down to species, filtering and normalizing them (including relative abundance calculations), and providing an API for producing publication-ready plots. The tool automates the multi-step process of combining and cleaning `Kraken` results, so researchers can quickly obtain human-readable summaries of community composition. `KrakenParser` focuses on efficiency, ease of use, and integration: it can run the entire conversion pipeline with a single command, and it can be imported as a Python library for custom workflows. In summary, `KrakenParser` reduces the manual effort required to post-process metagenomic classification data, taking scientists from raw classifier output to analysis-ready tables and figures in one step.
+
+# Statement of need
+
+Taxonomic profiling of metagenomic samples typically relies on k-mer-based classifiers such as `Kraken2`, which generate detailed reports of read counts and abundances across taxa. These reports are information-rich but inconvenient for comparative analysis: each one lists the taxa of a single sample in a hierarchical format, so researchers must parse and merge multiple files by hand to compare communities across samples. Existing scripts such as the `KrakenTools` suite [@lu2022kraken], developed alongside `Kraken`, provide some post-processing functionality, but they require multiple steps and technical expertise to use. Similarly, interactive tools like `Pavian` focus on visualization and exploration of `Kraken` results rather than automated batch processing [@breitwieser2020pavian]. There is therefore a need for a streamlined way to transform raw `Kraken`-family outputs into tidy data matrices and summary statistics that can be used directly in downstream analysis or publication figures. `KrakenParser` meets this need with an all-in-one pipeline that reads multiple `Kraken2`/`Bracken`/`Metabuli` reports and writes clean CSV tables of taxonomic counts or relative abundances, optionally filtering out low-abundance or non-target taxa (e.g. human reads) as specified by the user. This simplifies metagenomic workflows, especially in comparative studies or clinical settings where dozens of samples must be processed consistently. By bridging the gap between raw classifier output and statistical analysis, `KrakenParser` lets researchers who are not bioinformatics experts use high-throughput metagenomics with minimal data wrangling.
+
+Metagenomic classification has developed rapidly, and numerous tools now assign sequencing reads to taxa. `Kraken` was introduced in 2014 as an ultrafast k-mer-based classifier [@wood2014kraken], and its successor `Kraken2` [@wood2019kraken2] further reduced memory usage and improved speed. Related tools include `Bracken` [@lu2017bracken], which re-estimates abundances from `Kraken`’s read counts; `KrakenUniq`, which tracks unique k-mers per taxon to reduce false positives [@breitwieser2018krakenuniq]; `Centrifuge`, which uses an FM-index to allow classification with compressed databases [@kim2016centrifuge]; and `CLARK`, which uses discriminative k-mers for fast classification [@ounit2015clark]. `Kaiju` performs classification in protein space for greater sensitivity (especially on viruses) [@menzel2016kaiju], and the more recent `Metabuli` combines DNA and translated amino acid matching to improve accuracy [@kim2024metabuli]. Comprehensive evaluations have benchmarked these methods’ accuracy and speed, and community challenges like `CAMI` have pushed the development of improved classifiers [@sczyrba2017cami]. Despite the variety of classifiers, a common challenge remains: the output format. Many tools produce reports similar to `Kraken`’s: tab-delimited text with hierarchical labels and counts. To interpret such outputs, researchers often rely on additional scripts or manual processing. `KrakenTools` [@lu2022kraken] provides scripts to combine `Kraken` reports and convert them to other formats (e.g., `Krona` for visualization). `Pavian` and other interactive platforms let users visualize results with Sankey diagrams and heatmaps [@breitwieser2020pavian], but require a web interface or an `R` environment. There are also lightweight utilities (e.g., [`spideog`](https://github.com/jeanmanguy/spideog)) that convert `Kraken` reports to CSV or clean them, and researchers adept in programming sometimes write custom parsing scripts.
+In summary, prior to `KrakenParser`, users had to piece together multiple tools to merge reports from multiple samples, sum reads at specific taxonomic ranks, and compute relative abundances. `KrakenParser` builds on this state of the field by consolidating the post-processing steps into one tool. It is a conceptual successor to `KrakenTools` [@lu2022kraken]: it uses some of the same internal conversion steps (such as `KrakenTools`’ report-to-MPA conversion) and adds improvements in automation, filtering, and output formatting. By producing standardized CSV tables (with samples as rows and taxa as columns) and by computing percentages automatically, `KrakenParser` shortens the path from raw classification data to biological insights. This is particularly valuable given the increasing scale of metagenomic studies, where dozens or hundreds of samples are profiled, and the need for reproducible, efficient analysis pipelines.
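To make the "tab-delimited text with hierarchical labels and counts" concrete, here is a minimal sketch of reading one such report. It assumes the standard six-column `Kraken2` report layout (percentage, clade reads, direct reads, rank code, NCBI taxid, indented name); the example lines and the `ReportRow` and `parse_report_line` names are made up for illustration and are not part of `KrakenParser`.

```python
from typing import NamedTuple


class ReportRow(NamedTuple):
    percent: float     # share of reads in the clade rooted at this taxon
    clade_reads: int   # reads assigned to this taxon or any descendant
    direct_reads: int  # reads assigned directly to this taxon
    rank_code: str     # e.g. U, R, D, P, C, O, F, G, S
    taxid: int         # NCBI taxonomy ID
    depth: int         # nesting level, inferred from two-space indentation
    name: str


def parse_report_line(line: str) -> ReportRow:
    """Parse one line of a six-column Kraken2-style report (hypothetical helper)."""
    pct, clade, direct, rank, taxid, name = line.rstrip("\n").split("\t")
    stripped = name.lstrip(" ")
    depth = (len(name) - len(stripped)) // 2
    return ReportRow(float(pct), int(clade), int(direct), rank, int(taxid), depth, stripped)


# Made-up report content, heavily shortened for illustration.
EXAMPLE_REPORT = "\n".join([
    " 10.00\t100\t100\tU\t0\tunclassified",
    " 90.00\t900\t5\tR\t1\troot",
    " 85.00\t850\t10\tD\t2\t  Bacteria",
    " 60.00\t600\t20\tP\t1224\t    Pseudomonadota",
    " 40.00\t400\t400\tS\t562\t      Escherichia coli",
])

rows = [parse_report_line(line) for line in EXAMPLE_REPORT.splitlines()]

# Keeping only species-level rows is the kind of step users otherwise script by hand.
species_counts = {row.name: row.clade_reads for row in rows if row.rank_code == "S"}
print(species_counts)
```

The hierarchy lives only in the indentation of the name column, which is why merging several such files by hand is error-prone.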
+
+# Implementation
+
+`KrakenParser` is implemented in `Python` (available via `PyPI` as `krakenparser`) with several auxiliary scripts. It leverages the original `KrakenTools` [@lu2022kraken] scripts for initial data reshaping and then applies its own pure-`Python` processing for downstream formatting. The software follows a pipeline of six main steps, which can be executed automatically in sequence (`--complete` mode) or run individually as needed:
+
+1. Convert reports to MPA format: Each `Kraken2`/`Bracken`/`Metabuli` report (text file with taxon lines) is converted to an “MPA” table format using `KrakenTools`’ `kreport2mpa.py` script. In MPA format, each row holds one taxon, written as its full lineage of rank-prefixed names (e.g. “k__Bacteria|p__Pseudomonadota”), followed by its read count, which makes it easy to combine multiple samples.
+2. Combine MPA files: All per-sample MPA files are merged into a single master table (taxa × samples) using `KrakenTools`’ `combine_mpa.py`. This yields a matrix of raw read counts in which a taxon that is absent from a sample gets a zero.
+3. Deconstruct taxonomic levels: The combined data is split out by rank. `KrakenParser` extracts separate text files for phylum, class, order, family, genus, and species counts. During this step, it can optionally isolate certain domains; for example, using `--deconstruct_viruses` will produce a file of only viral species counts, ignoring other domains. Also, the default `--deconstruct` excludes reads classified as human to focus on microbial content.
+4. Process extracted data: Each rank-specific text file is cleaned and formatted. `KrakenParser` removes classification prefixes (like “s__” for species, “g__” for genus) and replaces underscores with spaces for readability. This step ensures taxon names are human-friendly (e.g. “s__Escherichia_coli” becomes “Escherichia coli”).
+5. Convert to CSV: The cleaned text tables are converted to CSV files (comma-separated values). In this transpose operation, taxa become columns and sample identifiers become rows, yielding a standard matrix format. This structured CSV is easy to import into statistical software, spreadsheets, or R/Python data frames for further analysis.
+6. Calculate relative abundances: For each count table, `KrakenParser` can create a corresponding relative abundance table (`--relabund` option) by computing percentages of total reads per sample, using the formula: $\text{Relative Abundance} = \left( \frac{\text{Number of reads assigned to the taxon}}{\text{Total number of reads assigned to all taxa in the sample}} \right) \times 100$. Users can specify a threshold to group low-abundance taxa into an “Other” category. This results in a normalized profile for each sample, often more interpretable in comparative studies than raw counts.
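Steps 2 to 5 above can be sketched with the standard library alone. This is an illustrative approximation under stated assumptions, not `KrakenParser`'s actual code: the input strings are made-up MPA-style content (lineage, tab, read count), and the function names and the `Sample_id` header are hypothetical.

```python
import csv
import io

# Made-up per-sample MPA-style content: "<lineage>\t<read count>" per line.
SAMPLES = {
    "sample_A": (
        "k__Bacteria\t90\n"
        "k__Bacteria|p__Pseudomonadota\t60\n"
        "k__Bacteria|p__Pseudomonadota|g__Escherichia|s__Escherichia_coli\t40\n"
    ),
    "sample_B": (
        "k__Bacteria\t50\n"
        "k__Bacteria|p__Bacillota|g__Bacillus|s__Bacillus_subtilis\t30\n"
    ),
}


def read_mpa(text):
    """Return {lineage: count} for one sample, skipping blank and header lines."""
    counts = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        lineage, count = line.split("\t")
        counts[lineage] = int(count)
    return counts


def combine(per_sample):
    """Step 2: merge into {lineage: {sample: count}}, filling absences with 0."""
    lineages = sorted({lin for counts in per_sample.values() for lin in counts})
    return {lin: {s: per_sample[s].get(lin, 0) for s in per_sample} for lin in lineages}


def extract_rank(combined, prefix):
    """Step 3: keep lineages whose deepest level has the given rank prefix."""
    table = {}
    for lineage, row in combined.items():
        leaf = lineage.split("|")[-1]
        if leaf.startswith(prefix):
            table[leaf] = row
    return table


def clean_name(name):
    """Step 4: drop the rank prefix and replace underscores with spaces."""
    return name.split("__", 1)[1].replace("_", " ")


def to_csv(rank_table, samples):
    """Step 5: transpose so that samples are rows and taxa are columns."""
    taxa = list(rank_table)
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["Sample_id"] + [clean_name(t) for t in taxa])
    for sample in samples:
        writer.writerow([sample] + [rank_table[t][sample] for t in taxa])
    return buf.getvalue()


per_sample = {name: read_mpa(text) for name, text in SAMPLES.items()}
combined = combine(per_sample)
species = extract_rank(combined, "s__")
species_csv = to_csv(species, list(SAMPLES))
print(species_csv)
```

Keeping the combined table keyed by full lineage until the rank split means that taxa sharing a name at different ranks cannot collide before the prefixes are stripped.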
+
+Each of these steps is exposed as a sub-command in the CLI, so advanced users can integrate `KrakenParser` into custom workflows. Running `KrakenParser --complete -i <reports_dir>/kreports` executes all steps sequentially, writing outputs to a structured directory tree (with subfolders for each step). The outputs include one CSV file per rank (e.g. `counts_phylum.csv`, `counts_species.csv`) containing absolute read counts, and similarly named files under a `csv_relabund/` directory for percentages if requested. `KrakenParser` is designed for speed and memory efficiency: it processes text files line by line and uses `pandas` data frames for merging and calculations, which easily handle dozens of samples and tens of thousands of taxa on a standard workstation. The reliance on `KrakenTools` for the initial conversion means that the parsing logic benefits from the robustness of well-tested scripts, while the unified interface adds convenience. The tool also includes built-in help for each sub-command (`-h`), guiding users on required inputs and options. `KrakenParser`’s design reflects practical needs observed in the metagenomics community: it was tested during the [2025 “Bioinformatics Bootcamp”](https://pish.itmo.ru/genomics-bootcamp) hackathon organized by ITMO University, where teams analyzing metagenomic datasets were able to obtain meaningful results in a short time thanks to `KrakenParser`’s streamlined processing pipeline. By combining established methods with new automation, `KrakenParser` provides an efficient, reproducible, and user-friendly means to handle the otherwise tedious steps of post-classification data processing.
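As a worked illustration of the relative abundance step and the “Other” grouping, the sketch below converts one sample's counts to percentages. The function name, the `other_threshold` parameter, and the per-sample application of the threshold are assumptions made for this example; the text does not specify how `KrakenParser` applies the threshold internally.

```python
def relative_abundance(counts, other_threshold=0.0):
    """Convert one sample's {taxon: reads} to percentages.

    Taxa whose share is below `other_threshold` (in percent) are pooled
    into a single "Other" entry. Hypothetical helper, for illustration only.
    """
    total = sum(counts.values())
    if total == 0:
        return {taxon: 0.0 for taxon in counts}

    profile = {}
    other = 0.0
    for taxon, reads in counts.items():
        percent = 100.0 * reads / total
        if percent < other_threshold:
            other += percent
        else:
            profile[taxon] = percent
    if other > 0:
        profile["Other"] = other
    return profile


# Made-up counts for a single sample (100 reads in total).
example_counts = {
    "Escherichia coli": 70,
    "Bacillus subtilis": 27,
    "Vibrio cholerae": 2,
    "Listeria monocytogenes": 1,
}

profile = relative_abundance(example_counts, other_threshold=3.5)
print(profile)
```

Because the pooled taxa are added back as “Other”, each sample's percentages still sum to 100, which keeps stacked plots comparable across samples.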
+
+`KrakenParser` also offers a suite of `Python`-based visualization tools to facilitate the interpretation of taxonomic profiles:
+
+- Stacked Bar Plots: Utilizing `matplotlib` [@Hunter2007] and `pandas` [@reback2020pandas], `KrakenParser` can generate stacked bar plots that display the relative abundances of taxa across multiple samples. These plots provide a clear comparison of taxonomic compositions between samples.
+- Streamgraphs: For a more dynamic representation, `KrakenParser` can create streamgraphs using the `stackplot` function of `matplotlib` [@Hunter2007] with a symmetric baseline. This visualization emphasizes changes in taxa abundances over a series of samples, highlighting temporal or sequential patterns.
+- Combined Visualizations: To offer both detailed and overarching views, `KrakenParser` supports combined plots that integrate stacked bar plots and streamgraphs. This dual representation aids in comprehensive data analysis.
+- Clustermaps: Employing `seaborn` [@Waskom2021], `KrakenParser` can produce clustermaps that perform hierarchical clustering on taxa and samples. These heatmaps reveal patterns and groupings in the data, facilitating the identification of similar taxonomic profiles.
+
+These visualization tools are accessible through the `KrakenParser` Python API, allowing users to customize and integrate them into their analysis workflows seamlessly.
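The stacked bar plot and the streamgraph described above can be approximated directly with `matplotlib`. The sketch below does not use the `KrakenParser` plotting API (its function names are not given here); it calls `matplotlib` itself on made-up percentages, with `baseline="sym"` giving the symmetric baseline of the streamgraph.

```python
import io

import matplotlib

matplotlib.use("Agg")  # headless backend, so no display is needed
import matplotlib.pyplot as plt

# Made-up relative abundances (percent) for four samples.
samples = ["S1", "S2", "S3", "S4"]
relabund = {
    "Escherichia coli": [50, 40, 30, 20],
    "Bacillus subtilis": [30, 35, 40, 45],
    "Other": [20, 25, 30, 35],
}

fig, (ax_bar, ax_stream) = plt.subplots(1, 2, figsize=(10, 4))

# Stacked bars: each taxon is drawn on top of the running total per sample.
bottom = [0.0] * len(samples)
for taxon, values in relabund.items():
    ax_bar.bar(samples, values, bottom=bottom, label=taxon)
    bottom = [b + v for b, v in zip(bottom, values)]
ax_bar.set_ylabel("Relative abundance (%)")
ax_bar.legend()

# Streamgraph: stackplot with a symmetric baseline centres the layers around zero.
x = list(range(len(samples)))
layers = ax_stream.stackplot(x, *relabund.values(), labels=list(relabund), baseline="sym")
ax_stream.set_xticks(x)
ax_stream.set_xticklabels(samples)

fig.tight_layout()

# Render to an in-memory PNG; pass a file path instead to save to disk.
png = io.BytesIO()
fig.savefig(png, format="png")
plt.close(fig)
```

The same samples-by-taxa table could also be passed to `seaborn.clustermap` for the clustered heatmap view, which is not shown here.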
+
+# Acknowledgements
+
+The development of `KrakenParser` was supported by the Russian Science Foundation (Project no. 25-24-00351).
+
+# References