APHL-Infectious-Disease/group1

APHL-Infectious-Disease/group1

This repository contains a prototype pipeline developed for the 2026 APHL Hackathon by Group 1. All code, data products, and visualizations are for demonstration and development purposes only and should not be used for production, clinical, regulatory, or public health decision-making at this time.

Introduction

APHL-Infectious-Disease/group1 is a Nextflow-based bioinformatics pipeline developed for the 2026 APHL Hackathon.

The pipeline enables re-analysis of publicly available sequencing data—particularly wastewater metagenomic datasets—to detect measles and enteric viruses of public health interest, including:

Norovirus
Rotavirus
Enterovirus
Astrovirus
Adenovirus
Sapovirus
Poliovirus
Morbillivirus

Key Features

Multiple Input Modes

Mode	Description
`--mode sra`	Automatically discover datasets from NCBI SRA
`--mode accessions`	Use a user-provided list of SRR/ERR accessions
`--mode samplesheet`	Provide local FASTQ files

Metadata Enrichment

Pulls RunInfo metadata from SRA
Enriches with BioSample metadata via NCBI E-utilities
Extracts:
- collection date
- geographic location
- isolation source
- lat/lon (when available)
Filters to U.S.-based samples

Data Processing

Parallel FASTQ download via fastq-dl
Optional preprocessing (QC + trimming)
Kraken2 classification against viral database

Post-classification Analysis

Generates:

metadata_postkraken.csv
metadata_postkraken_hits_only.csv

Optional Steps

Feature	Parameter
Preprocessing	`--run_preprocessing`
MultiQC report	`--run_multiqc`

Workflow Overview

SRA / Accessions / Samplesheet
              ↓
         FASTQ Download
              ↓
        (Optional QC)
              ↓
           Kraken2
              ↓
     Metadata Enrichment
              ↓
    PostKraken Matrix Build
              ↓
       Results Dashboard

Subworkflows and Modules

SRA_DISCOVERY

Uses Entrez to search SRA
Retrieves RunInfo metadata
Outputs accession list + metadata

FETCH_READS (FASTQDL)

Downloads FASTQ files from SRA
Supports parallel retrieval

MAKE_ACCESSIONS_METADATA

Builds metadata for user-provided accessions
Pulls SRA + BioSample metadata

ENRICH_SRA_METADATA

Enhances SRA metadata with BioSample fields
Improves completeness (dates, locations, etc.)

PREPROCESS_READS (Optional)

Tools: fastp, FastQC, seqkit
Performs trimming + QC

PREPARE_KRAKEN_DB

Accepts:
- local DB (--kraken2_db)
- remote DB (--kraken2_db_url)
Downloads and prepares database if needed

CLASSIFY_READS (Kraken2)

Runs Kraken2 classification
Produces per-sample reports

BUILD_POSTKRAKEN_MATRIX

Combines:
- Kraken reports
- enriched metadata
Generates final matrices

POSTKRAKEN_MATRIX

Parses Kraken outputs
Produces:
- full matrix
- hits-only matrix

QC REPORT (Optional)

Generates MultiQC report

RESULTS DASHBOARD

bash shiny.sh

shiny.sh launches app.R, which combines and saves kraken files as a single summary file and generates an associated dashboard from the kraken summary file and SRA metadata file
This should automatically open the associated dashboard html. If not, got to Ports, select 8080 and click "open in browser" (the globe icon)

In a new codespace, it may be necessary to first run the following commands in the terminal:

conda install -c conda-forge r-base
conda install -c conda-forge r-tidyverse r-shiny r-leaflet r-thematic r-DT

Usage

SRA Mode

nextflow run main.nf \
  -profile conda \
  --mode sra \
  --kraken2_db assets/kraken2db_v2 \
  --max_runs 20 \
  --outdir results

Accessions Mode

nextflow run main.nf \
  -profile conda \
  --mode accessions \
  --accessions docs/accessions.txt \
  --kraken2_db assets/kraken2db_v2 \
  --outdir results

Samplesheet Mode

nextflow run main.nf \
  -profile conda \
  --mode samplesheet \
  --input path/to/samplesheet.csv \
  --kraken2_db assets/kraken2db_v2 \
  --outdir results

Using a Remote Kraken2 Database

nextflow run main.nf \
  -profile conda \
  --mode samplesheet \
  --input path/to/samplesheet.csv \
  --kraken2_db_url https://genome-idx.s3.amazonaws.com/kraken/k2_viral_20221209.tar.gz \
  --outdir results

Optional Parameters

Parameter	Description
`--run_preprocessing`	Enable QC + trimming
`--run_multiqc`	Generate MultiQC report
`--max_runs`	Limit SRA runs
`--subsample_size`	Subsample reads

Output

File	Description
`metadata_postkraken.csv`	Full detection matrix
`metadata_postkraken_hits_only.csv`	Filtered detections
`multiqc_report.html`	QC summary (optional)
`pipeline_info/`	Execution logs + versions

Known Limitations

SRA metadata may be incomplete or inconsistent
BioSample metadata availability varies
Detection thresholds not yet standardized
Prototype only — not production ready

Future Directions

Improve SRA query strategy
Add detection confidence scoring
Expand pathogen panel
Integrate dashboards
Optimize performance at scale

Credits

APHL-Infectious-Disease/group1 was written by Group 1.

This pipeline uses code and infrastructure developed and maintained by the nf-core community, reused here under the MIT license.

The nf-core framework for community-curated bioinformatics pipelines.

Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.

Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.

Name		Name	Last commit message	Last commit date
Latest commit History 77 Commits
.devcontainer		.devcontainer
.github		.github
.vscode		.vscode
assets		assets
conf		conf
dashboard		dashboard
docs		docs
lib		lib
modules		modules
ref_fastas		ref_fastas
results		results
subworkflows		subworkflows
tests		tests
workflows		workflows
.gitattributes		.gitattributes
.gitignore		.gitignore
.nf-core.yml		.nf-core.yml
.pre-commit-config.yaml		.pre-commit-config.yaml
.prettierignore		.prettierignore
.prettierrc.yml		.prettierrc.yml
CHANGELOG.md		CHANGELOG.md
CITATIONS.md		CITATIONS.md
LICENSE		LICENSE
README.md		README.md
enteric_ww_dash.html		enteric_ww_dash.html
group1.sh		group1.sh
main.nf		main.nf
modules.json		modules.json
nextflow.config		nextflow.config
nextflow_schema.json		nextflow_schema.json
nf-test.config		nf-test.config
shiny.sh		shiny.sh
test.csv		test.csv

Folders and files

Latest commit

History

Repository files navigation

APHL-Infectious-Disease/group1

Introduction

Key Features

Multiple Input Modes

Metadata Enrichment

Data Processing

Post-classification Analysis

Optional Steps

Workflow Overview

Subworkflows and Modules

SRA_DISCOVERY

FETCH_READS (FASTQDL)

MAKE_ACCESSIONS_METADATA

ENRICH_SRA_METADATA

PREPROCESS_READS (Optional)

PREPARE_KRAKEN_DB

CLASSIFY_READS (Kraken2)

BUILD_POSTKRAKEN_MATRIX

POSTKRAKEN_MATRIX

QC REPORT (Optional)

RESULTS DASHBOARD

Usage

SRA Mode

Accessions Mode

Samplesheet Mode

Using a Remote Kraken2 Database

Optional Parameters

Output

Known Limitations

Future Directions

Credits

About

Resources

License

Contributing

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages