APHL-Infectious-Disease/group1
This repository contains a prototype pipeline developed for the 2026 APHL Hackathon by Group 1. All code, data products, and visualizations are for demonstration and development purposes only and should not be used for production, clinical, regulatory, or public health decision-making at this time.
APHL-Infectious-Disease/group1 is a Nextflow-based bioinformatics pipeline that re-analyzes publicly available sequencing data, particularly wastewater metagenomic datasets, to detect measles and enteric viruses of public health interest, including:
- Norovirus
- Rotavirus
- Enterovirus
- Astrovirus
- Adenovirus
- Sapovirus
- Poliovirus
- Morbillivirus
| Mode | Description |
|---|---|
| `--mode sra` | Automatically discover datasets from NCBI SRA |
| `--mode accessions` | Use a user-provided list of SRR/ERR accessions |
| `--mode samplesheet` | Provide local FASTQ files |
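
For `--mode accessions`, it can help to check the accession list before launching a run. The sketch below is not part of the pipeline: it assumes the list is a plain-text file with one run accession per line, treats `#` lines as comments, and also accepts `DRR` (DDBJ) accessions alongside the SRR/ERR prefixes named above. The function name `read_accessions` is hypothetical.

```python
import re

# Run accessions: SRR (NCBI) and ERR (ENA) as named in the README;
# DRR (DDBJ) is accepted here as an extra assumption.
RUN_ACCESSION = re.compile(r"^[SED]RR\d+$")


def read_accessions(text):
    """Split one-per-line text into (valid, rejected) accession lists."""
    valid, rejected = [], []
    for raw in text.splitlines():
        entry = raw.strip()
        if not entry or entry.startswith("#"):
            continue  # blank lines and comments are skipped (assumed convention)
        if RUN_ACCESSION.match(entry):
            if entry not in valid:  # drop duplicates, keep first-seen order
                valid.append(entry)
        else:
            rejected.append(entry)
    return valid, rejected


if __name__ == "__main__":
    # Made-up accessions, for illustration only.
    example = "SRR1234567\nERR7654321\n\nSRR1234567\nPRJNA000001\n"
    valid, rejected = read_accessions(example)
    print("valid:", valid)
    print("rejected:", rejected)
```

Rejected entries (for example a BioProject ID pasted by mistake) are reported rather than silently dropped, so the list can be corrected before any download starts.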
- Pulls RunInfo metadata from SRA
- Enriches with BioSample metadata via NCBI E-utilities
- Extracts:
  - collection date
  - geographic location
  - isolation source
  - lat/lon (when available)
- Filters to U.S.-based samples
- Parallel FASTQ download via `fastq-dl`
- Optional preprocessing (QC + trimming)
- Kraken2 classification against viral database

Generates:
- `metadata_postkraken.csv`
- `metadata_postkraken_hits_only.csv`
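
The metadata extraction and U.S. filter listed above can be sketched as follows. This is an illustration under stated assumptions, not the pipeline's own code: it assumes BioSample attributes have already been fetched into a dictionary keyed by the NCBI attribute names `collection_date`, `geo_loc_name`, `isolation_source` and `lat_lon`, that `geo_loc_name` follows the usual `Country: Region` form, and that `lat_lon` looks like `38.98 N 77.11 W`. The function names are hypothetical.

```python
import re

LAT_LON = re.compile(
    r"\s*(\d+(?:\.\d+)?)\s*([NS])\s+(\d+(?:\.\d+)?)\s*([EW])\s*"
)


def parse_lat_lon(value):
    """Turn '38.98 N 77.11 W' into signed decimal degrees, else None."""
    match = LAT_LON.fullmatch(value or "")
    if not match:
        return None  # covers absent values and placeholders such as 'missing'
    lat = float(match.group(1)) * (1 if match.group(2) == "N" else -1)
    lon = float(match.group(3)) * (1 if match.group(4) == "E" else -1)
    return lat, lon


def summarize_biosample(attributes):
    """Pick the fields of interest and flag U.S.-based samples."""
    location = attributes.get("geo_loc_name") or ""
    country = location.split(":")[0].strip().upper()
    return {
        "collection_date": attributes.get("collection_date"),
        "geo_loc_name": attributes.get("geo_loc_name"),
        "isolation_source": attributes.get("isolation_source"),
        "lat_lon": parse_lat_lon(attributes.get("lat_lon")),
        "is_us": country in {"USA", "UNITED STATES"},
    }


if __name__ == "__main__":
    # Made-up records, for illustration only.
    records = [
        {"collection_date": "2024-03-01", "geo_loc_name": "USA: Ohio",
         "isolation_source": "wastewater", "lat_lon": "39.96 N 83.00 W"},
        {"geo_loc_name": "Canada: Ontario", "lat_lon": "missing"},
    ]
    us_only = [s for s in map(summarize_biosample, records) if s["is_us"]]
    print(us_only)
```

Missing attributes stay as `None` rather than being guessed, which matches the caveat that BioSample metadata availability varies.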
| Feature | Parameter |
|---|---|
| Preprocessing | --run_preprocessing |
| MultiQC report | --run_multiqc |

    SRA / Accessions / Samplesheet
                ↓
          FASTQ Download
                ↓
          (Optional QC)
                ↓
             Kraken2
                ↓
       Metadata Enrichment
                ↓
      PostKraken Matrix Build
                ↓
        Results Dashboard

- Uses Entrez to search SRA
- Retrieves RunInfo metadata
- Outputs accession list + metadata
- Downloads FASTQ files from SRA
- Supports parallel retrieval
- Builds metadata for user-provided accessions
- Pulls SRA + BioSample metadata
- Enhances SRA metadata with BioSample fields
- Improves completeness (dates, locations, etc.)
- Tools: `fastp`, `FastQC`, `seqkit`
- Performs trimming + QC
- Accepts:
  - local DB (`--kraken2_db`)
  - remote DB (`--kraken2_db_url`)
- Downloads and prepares database if needed
- Runs Kraken2 classification
- Produces per-sample reports
- Combines:
  - Kraken reports
  - enriched metadata
- Generates final matrices
- Parses Kraken outputs
- Produces:
  - full matrix
  - hits-only matrix
- Generates MultiQC report
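
The steps of parsing Kraken outputs and producing a full and a hits-only matrix can be sketched roughly as below. This is a minimal illustration, not the pipeline's implementation. It assumes the standard six-column, tab-separated Kraken2 report (percentage, clade reads, direct reads, rank code, taxid, name) and matches taxa by exact name. The `TARGETS` subset, the `min_reads` threshold, and the column names are hypothetical; the README notes that detection thresholds are not yet standardized.

```python
# Subset of the README's virus panel, used here as column names.
TARGETS = ["Norovirus", "Rotavirus", "Enterovirus", "Morbillivirus"]


def clade_reads(report_text, targets=TARGETS):
    """Return {taxon name: clade-level read count} for the target taxa."""
    counts = {name: 0 for name in targets}
    for line in report_text.splitlines():
        fields = line.split("\t")
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        name = fields[5].strip()  # names are indented by taxonomic depth
        if name in counts:
            counts[name] = int(fields[1])  # column 2: reads rooted at this clade
    return counts


def build_matrices(metadata, reports, min_reads=1):
    """Join per-run metadata with target counts; return (full, hits_only)."""
    full = []
    for run, meta in metadata.items():
        # A run without a report gets zero counts for every target.
        full.append({"run": run, **meta, **clade_reads(reports.get(run, ""))})
    hits = [row for row in full if any(row[t] >= min_reads for t in TARGETS)]
    return full, hits


if __name__ == "__main__":
    # Made-up report lines and metadata, for illustration only.
    report = (
        " 98.00\t9800\t9800\tU\t0\tunclassified\n"
        "  0.50\t50\t0\tG\t142786\t          Norovirus\n"
    )
    metadata = {"SRR0000001": {"state": "OH"}, "SRR0000002": {"state": "CA"}}
    full, hits = build_matrices(metadata, {"SRR0000001": report})
    print(len(full), "rows in full matrix;", len(hits), "with hits")
```

Keeping the zero-count rows in the full matrix and filtering only for the hits-only view mirrors the two output files the pipeline describes.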
Run `bash shiny.sh` to launch the dashboard:
- `shiny.sh` launches `app.R`, which combines the Kraken files into a single summary file, saves it, and generates a dashboard from that summary file and the SRA metadata file.
- This should automatically open the dashboard HTML. If it does not, go to Ports, select 8080, and click "Open in Browser" (the globe icon).

In a new codespace, it may be necessary to first run the following commands in the terminal:

    conda install -c conda-forge r-base
    conda install -c conda-forge r-tidyverse r-shiny r-leaflet r-thematic r-DT

SRA discovery mode:

    nextflow run main.nf \
      -profile conda \
      --mode sra \
      --kraken2_db assets/kraken2db_v2 \
      --max_runs 20 \
      --outdir results

Accessions mode:

    nextflow run main.nf \
      -profile conda \
      --mode accessions \
      --accessions docs/accessions.txt \
      --kraken2_db assets/kraken2db_v2 \
      --outdir results

Samplesheet mode:

    nextflow run main.nf \
      -profile conda \
      --mode samplesheet \
      --input path/to/samplesheet.csv \
      --kraken2_db assets/kraken2db_v2 \
      --outdir results

Samplesheet mode with a remote Kraken2 database:

    nextflow run main.nf \
      -profile conda \
      --mode samplesheet \
      --input path/to/samplesheet.csv \
      --kraken2_db_url https://genome-idx.s3.amazonaws.com/kraken/k2_viral_20221209.tar.gz \
      --outdir results

| Parameter | Description |
|---|---|
| `--run_preprocessing` | Enable QC + trimming |
| `--run_multiqc` | Generate MultiQC report |
| `--max_runs` | Limit SRA runs |
| `--subsample_size` | Subsample reads |
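
To make the idea behind `--subsample_size` concrete, here is a generic sketch of fixed-size read subsampling using reservoir sampling. The README does not say how the pipeline implements subsampling (`seqkit` is listed among its tools), so this should be read as an illustration of the concept only. The function names and the `seed` argument are hypothetical.

```python
import random


def fastq_records(lines):
    """Yield 4-line FASTQ records (header, sequence, '+', qualities)."""
    block = []
    for line in lines:
        block.append(line.rstrip("\n"))
        if len(block) == 4:
            yield tuple(block)
            block = []


def subsample(records, size, seed=0):
    """Reservoir sampling: keep `size` records chosen uniformly at random
    from a stream of unknown length, in a single pass."""
    rng = random.Random(seed)  # fixed seed gives a reproducible subsample
    reservoir = []
    for index, record in enumerate(records):
        if index < size:
            reservoir.append(record)  # fill the reservoir first
        else:
            slot = rng.randint(0, index)  # inclusive on both ends
            if slot < size:
                reservoir[slot] = record  # replace with probability size/(index+1)
    return reservoir


if __name__ == "__main__":
    # Ten made-up reads, for illustration only.
    lines = []
    for i in range(10):
        lines += [f"@read{i}\n", "ACGT\n", "+\n", "IIII\n"]
    sample = subsample(fastq_records(lines), size=3, seed=42)
    for header, sequence, _, _ in sample:
        print(header, sequence)
```

A single pass with bounded memory matters for large wastewater metagenomes, where loading every read before sampling would be impractical.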
| File | Description |
|---|---|
| `metadata_postkraken.csv` | Full detection matrix |
| `metadata_postkraken_hits_only.csv` | Filtered detections |
| `multiqc_report.html` | QC summary (optional) |
| `pipeline_info/` | Execution logs + versions |
- SRA metadata may be incomplete or inconsistent
- BioSample metadata availability varies
- Detection thresholds not yet standardized
- Prototype only — not production ready
- Improve SRA query strategy
- Add detection confidence scoring
- Expand pathogen panel
- Integrate dashboards
- Optimize performance at scale
APHL-Infectious-Disease/group1 was written by Group 1.
This pipeline uses code and infrastructure developed and maintained by the nf-core community, reused here under the MIT license.
The nf-core framework for community-curated bioinformatics pipelines.
Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.
Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.