Software for evaluating ONT custom amplicon data from PRISTINE (https://github.com/AI-Biome/PRISTINE).
Authors: Tomas Kudlacek, Katharina J. Hoff
Contact: tomas.kudlacek@uni-greifswald.de
Version: 0.1.0-beta.1
- What is this?
- Quick Start Guide
- Workflow Overview
- Utility Scripts
- Advanced Configuration
- Output Files
- Known Issues
- Citation
This repository contains tools for processing Oxford Nanopore Technologies (ONT) amplicon sequencing data from mixed samples containing custom amplicons. The main pipeline identifies which species are present in each custom amplicon by:
- Mapping ONT reads to a reference panel of amplicons
- Binning reads by amplicon type
- Clustering reads to reduce sequencing errors
- Generating consensus sequences through polishing (Racon + Medaka)
- Identifying species by sequence similarity search (MMseqs2)
Use cases:
- Analyzing environmental samples with multiple custom amplicons
- Species identification in targeted amplicon sequencing
- Quality control and validation of ONT amplicon data
- Creating synthetic test datasets with controlled error profiles
You need two things:
- Conda/Mamba (for managing software dependencies)
- Snakemake (workflow manager)
```bash
# Navigate to a directory with enough space (several GB needed)
cd /path/to/your/workspace

# Download and install Miniconda (if you don't have conda/mamba)
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
# Follow the prompts (accept defaults)
source ~/.bashrc  # Reload your shell

# Install mamba (faster than conda)
conda install -n base -c conda-forge mamba

# Create a Snakemake environment
mamba create -c conda-forge -c bioconda -n snakemake snakemake
mamba activate snakemake

# Optional: Install SLURM plugin for HPC clusters
pip install snakemake-executor-plugin-slurm

# Clone the repository
git clone https://github.com/YOUR_USERNAME/PRISCREEN.git
cd PRISCREEN
```

Note: All bioinformatics tools (minimap2, samtools, MMseqs2, vsearch, racon, medaka, seqtk) are installed automatically via conda environments; you do not need to install them manually.
Create a FASTA file with your reference amplicons in the following format:

```
>amplicon_id|species=species_name
SEQUENCE
```
Example (`data/reference_amplicons.fasta`):

```
>amp1|species=SpA
ACGTACGTACGTACGT...
>amp1|species=SpB
ACGTACGTTTGTACGT...
>amp2|species=SpA
GCTAGCTAGCTAGCTA...
>amp2|species=SpC
GCTAGCTAGCCCGCTA...
```
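If you need to script against this panel format, a minimal header parser might look like the following sketch (illustrative only, not part of the pipeline; it assumes single-line sequences as in the example above):

```python
from collections import defaultdict

def parse_panel_header(header: str):
    """Split a '>amplicon_id|species=name' FASTA header into its two parts."""
    amplicon_id, _, species_field = header.lstrip(">").partition("|")
    return amplicon_id, species_field.removeprefix("species=")

# Group reference sequences by amplicon type
panel = defaultdict(list)  # amplicon_id -> [(species, sequence), ...]
with open("data/reference_amplicons.fasta") as fh:
    amp = sp = None
    for line in fh:
        line = line.strip()
        if line.startswith(">"):
            amp, sp = parse_panel_header(line)
        elif line:
            panel[amp].append((sp, line))
```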
Important: Each amplicon can have multiple species variants. The pipeline will:
- Map reads to all variants
- Extract reads for each amplicon type (amp1, amp2, etc.)
- Generate consensus sequences
- Identify which species match best
Edit `config/samples.tsv` to list your samples:

```
sample  fastq
S1      data/S1.fastq.gz
S2      data/S2.fastq.gz
```

Columns:
- `sample`: Sample name (no spaces)
- `fastq`: Path to the FASTQ file (can be gzipped; relative to the Snakefile location)
Edit config/config.yaml to customize the pipeline. See Advanced Configuration for details.
Before running, check that everything is configured correctly:
```bash
# Activate Snakemake environment
mamba activate snakemake

# Dry run - shows what would be executed
snakemake --cores 1 --use-conda --dry-run
```

To run the pipeline:

```bash
# Activate Snakemake environment
mamba activate snakemake

# Run the pipeline using 8 cores
snakemake --cores 8 --use-conda --conda-frontend conda
```

Replace `8` with the number of CPU cores available on your machine.

Note: The `--conda-frontend conda` flag is recommended if you encounter mamba/conda compatibility issues. It forces Snakemake to use conda instead of mamba for environment creation, which can prevent `ImportError` issues with numpy/pandas.
The pipeline includes SLURM configuration for running on HPC clusters. Resource allocation (CPUs, memory, runtime) is configured via config.ini and automatically applied to all rules.
Method 1: Using SLURM Profile (Recommended)
```bash
# Activate Snakemake environment
mamba activate snakemake

# Submit jobs to SLURM using the pre-configured profile
snakemake --profile profiles/slurm
```

Method 2: Command-line Options

```bash
# Submit jobs to SLURM with explicit options
snakemake \
    --executor slurm \
    --jobs 20 \
    --use-conda \
    --default-resources slurm_partition=batch slurm_account=none
```

Command explanation:
- `--executor slurm`: Use the SLURM job scheduler
- `--jobs 20`: Run up to 20 jobs simultaneously
- `--use-conda`: Use conda environments for tools
- `--default-resources`: Set the default SLURM partition and account
Configuring SLURM Resources
Edit config.ini to adjust resource allocation for your cluster:
```ini
[SLURM_ARGS]
cpus_per_task = 48      # CPUs per SLURM task
mem_of_node = 126000    # Total node memory in MB
max_runtime = 4320      # Max runtime in minutes (72 h)
slurm_partition = batch # SLURM partition name
```

Parameter explanation:
- `cpus_per_task`: Number of CPUs to request per job
- `mem_of_node`: Total memory available on the node in MB; memory per job is computed as `mem_of_node / cpus_per_task` (with the defaults above, 126000 MB / 48 CPUs ≈ 2625 MB per CPU)
- `max_runtime`: Maximum runtime in minutes
- `slurm_partition`: Name of the SLURM partition to use
Edit profiles/slurm/config.yaml to adjust cluster-specific settings:
```yaml
executor: slurm
jobs: 20                    # Max simultaneous jobs
default-resources:
  - slurm_account=none      # Your SLURM account
  - slurm_partition=batch   # Your partition name
```

To monitor progress:

```bash
# Check SLURM queue (if using cluster)
squeue -u $USER

# Watch jobs in real time
watch -n 5 'squeue -u $USER'

# Check for output files
ls -lh results/identify/*/species_summary.tsv

# View Snakemake log
tail -f .snakemake/log/*.log

# View SLURM logs for specific rules
tail -f .snakemake/slurm_logs/rule_*/slurm-*.out
```

Once the workflow completes, find your results in `results/identify/{sample}/`:
Main output file:
results/identify/{sample}/species_summary.tsv
This file contains species identifications for all amplicons in the sample:
```
amplicon  consensus    best_species  pident  qcov  status
amp1      consensus_1  SpA           0.995   0.98  OK
amp1      consensus_2  SpB           0.987   0.96  OK
amp2      consensus_1  SpC           0.992   0.99  AMBIGUOUS
```

Columns:
- amplicon: Amplicon identifier
- consensus: Consensus sequence identifier
- best_species: Top matching species
- pident: Percent identity (0-1 scale)
- qcov: Query coverage (0-1 scale)
- status: OK, AMBIGUOUS, or NO_HIT
Status meanings:
- OK: Clear species identification
- AMBIGUOUS: Multiple species match within top_delta threshold
- NO_HIT: No species match above quality thresholds
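For downstream filtering, the summary can be loaded with pandas; a minimal sketch (assuming pandas is installed; the path is illustrative):

```python
import pandas as pd

# Load the per-sample species summary (tab-delimited)
summary = pd.read_csv("results/identify/S1/species_summary.tsv", sep="\t")

# Keep only unambiguous identifications
confident = summary[summary["status"] == "OK"]
print(confident[["amplicon", "best_species", "pident", "qcov"]])
```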
This Snakemake pipeline processes ONT amplicon data to identify species in mixed samples. The pipeline uses conda environments to manage all dependencies automatically.
- `mmseqs_createdb` - Build a searchable database from the reference panel
- `map_to_panel` - Align reads to all reference amplicon variants (see the sketch after this list)
  - Uses `minimap2 -ax map-ont` for ONT reads
  - Keeps up to 10 secondary alignments (`-N 10`) to handle multi-species panels
  - Allows reads to map to multiple species variants (`--secondary=yes`)
- `amplicon_readnames` - Extract read names for each amplicon type
- `amplicon_fastq` - Extract FASTQ records for individual amplicons (seqtk)
- `cluster_vsearch` - Cluster reads by similarity (vsearch, default 99.5% identity)
  - Always runs, but the results can optionally be ignored in the consensus step
- `build_racon` - Compile Racon with CPU-specific optimizations (one-time)
  - Builds generic (x86-64) and AVX2 versions
  - Automatically selects the optimal binary based on CPU features
- `consensus_polish` - Generate consensus sequences:
  - Select a seed: cluster centroids (if clustering is enabled) or the first read
  - Iterative Racon polishing with re-mapping between rounds (default: 2 rounds)
  - Final Medaka error correction
- `mmseqs_search` - Search consensus sequences against the reference database
  - Outputs a 5-column custom format: query, target, pident, qcov, bits
- `summarize_hits` - Identify the best species match with quality filtering
- `sample_summary` - Aggregate results per sample across all amplicons
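For orientation, the mapping step's minimap2 call with the flags listed above would look roughly like this standalone sketch; the actual command, file paths, and any extra options are defined in the Snakefile:

```python
import os
import subprocess

# Illustrative inputs; the real rule's paths come from the Snakefile/config
reads = "data/S1.fastq.gz"
panel = "data/reference_amplicons.fasta"
os.makedirs("results/map", exist_ok=True)

with open("results/map/S1.sam", "w") as sam:
    subprocess.run(
        ["minimap2",
         "-ax", "map-ont",     # ONT read preset
         "-N", "10",           # keep up to 10 secondary alignments
         "--secondary=yes",    # let reads hit multiple species variants
         panel, reads],
        stdout=sam,
        check=True,
    )
```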
Total runtime: Varies by sample size and number of amplicons
- Small samples (< 10K reads): ~30 minutes
- Medium samples (10K-100K reads): 1-3 hours
- Large samples (> 100K reads): 3-6 hours
Generate a visual representation of the workflow DAG (Directed Acyclic Graph) to better understand rule dependencies:
```bash
# Generate rule graph (simplified workflow structure)
snakemake --rulegraph | dot -Tpng > rulegraph.png
```

Requirements:
- Graphviz must be installed: `sudo apt-get install graphviz` (Debian/Ubuntu) or `brew install graphviz` (macOS)

What you get:
- `rulegraph.png`: Simplified graph showing rule dependencies
- Useful for understanding workflow structure and identifying bottlenecks

Alternative visualizations:

```bash
# Full DAG with all jobs (can be large for many samples)
snakemake --dag | dot -Tpng > dag.png

# File dependency graph
snakemake --filegraph | dot -Tpng > filegraph.png
```

Note: The rulegraph shows one instance of each rule, while the DAG shows all jobs for all samples and can become very large.
The repository includes standalone utility scripts in the scripts/ directory:
Purpose: Introduce controlled sequencing errors into FASTQ files for testing.
Usage:
```bash
python scripts/fastq_error_simulation.py \
    --input data/input.fastq.gz \
    --output data/output.fastq.gz \
    --error_rate 0.01 \
    --seed 42
```

Parameters:
- --input: Input FASTQ file (plain or gzipped)
- --output: Output FASTQ file (same format as input)
- --error_rate: Per-base substitution error rate (0-1), e.g., 0.01 = 1%
- --seed: Random seed for reproducibility (optional)
- --low_quality_char: Quality score for mutated bases (default: "!")
- --preserve_quality: Keep original quality scores even for mutated bases
Features:
- Case-preserving mutations (maintains upper/lowercase)
- Supports gzipped files
- Optionally updates quality scores
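Conceptually, the error model is a per-base substitution drawn at the given rate; a minimal, hypothetical sketch of that core step (not the script's actual code):

```python
import random

def mutate_read(seq: str, error_rate: float, rng: random.Random) -> str:
    """Substitute each base with probability error_rate, preserving case."""
    bases = "ACGT"
    out = []
    for ch in seq:
        if ch.upper() in bases and rng.random() < error_rate:
            new = rng.choice([b for b in bases if b != ch.upper()])
            out.append(new.lower() if ch.islower() else new)  # keep case
        else:
            out.append(ch)
    return "".join(out)

rng = random.Random(42)  # seed for reproducibility
print(mutate_read("ACGTacgt" * 4, 0.1, rng))
```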
Purpose: Subsample a specified number of reads from a FASTQ file.
Usage:
```bash
python scripts/fastq_random_sample.py \
    --input data/input.fastq \
    --output data/output.fastq \
    --num_reads 10000 \
    --seed 42
```

Parameters:
- --input: Input FASTQ file (plain or gzipped)
- --output: Output FASTQ file
- --num_reads: Number of reads to sample
- --seed: Random seed for reproducibility (optional)
Note: This script loads the entire FASTQ into memory. Not recommended for very large files (> 1GB).
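If you need to subsample files that do not fit in memory, reservoir sampling is a standard constant-memory alternative; a sketch for plain (uncompressed) FASTQ:

```python
import random

def reservoir_sample_fastq(path, k, seed=None):
    """Sample k reads from a plain FASTQ in O(k) memory (reservoir sampling)."""
    rng = random.Random(seed)
    reservoir = []
    with open(path) as fh:
        n = 0
        while True:
            record = [fh.readline() for _ in range(4)]  # one 4-line FASTQ record
            if not record[0]:  # end of file
                break
            n += 1
            if len(reservoir) < k:
                reservoir.append("".join(record))
            else:
                j = rng.randrange(n)  # keep new record with probability k/n
                if j < k:
                    reservoir[j] = "".join(record)
    return reservoir
```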
Purpose: Map query reads to target sequences using minimap2 or pblat.
Usage:
```bash
python scripts/map_fastq_to_fasta.py \
    --target_folder data/targets \
    --query_folder data/queries \
    --software minimap2 \
    --output_dir results/mapping \
    --threads 8 \
    --min_identity 95.0 \
    --max_size_diff 5.0
```

Parameters:
- --target_folder: Directory with target sequences (.fasta/.fa/.fna)
- --query_folder: Directory with query reads (.fastq/.fq, optionally .gz)
- --software: Mapping tool (minimap2 or pblat)
- --output_dir: Output directory
- --threads: CPU threads (default: all available)
- --min_identity: Minimum % identity threshold (default: 95.0)
- --max_size_diff: Maximum % length difference (default: 5.0)
Note: This script is independent of the main Snakemake workflow and is intended for standalone analysis or validation. It previously contained bugs (see ISSUES.md) but is now functional.
Edit config/config.yaml to customize the pipeline:
```yaml
cluster:
  enable: true          # If false, consensus uses the first read as seed instead of centroids
  id_threshold: 0.995   # Clustering identity (0.99-1.0 typical)
```

Note: The `cluster_vsearch` rule always executes to generate cluster files. Setting `enable: false` only changes the consensus polishing strategy (first read as seed vs. cluster centroids). This has minimal performance impact.
When to disable clustering:
- Very low coverage (< 10 reads per amplicon)
- High-accuracy basecalling already performed
- Testing/debugging
```yaml
consensus:
  racon_rounds: 2       # More rounds = better accuracy but slower
  medaka_model: "r1041_e82_400bps_hac_v4.2.0"
```

Medaka models (choose based on your sequencing chemistry):
- r1041_e82_400bps_hac_v4.2.0 - R10.4.1, 400 bps, HAC basecalling
- r1041_e82_400bps_sup_v4.2.0 - R10.4.1, 400 bps, SUP basecalling
- r941_min_hac_g507 - R9.4.1, MinION, HAC basecalling
See Medaka documentation for full list: https://github.com/nanoporetech/medaka#models
```yaml
identify:
  min_pident: 0.985     # Minimum identity (0-1 scale)
  min_qcov: 0.90        # Minimum query coverage (0-1 scale)
  top_delta: 0.01       # Ambiguity threshold
```

Threshold guidelines:
- min_pident: 0.97-0.99 for closely related species, 0.90-0.95 for genus-level
- min_qcov: 0.80-0.95 typical, depends on amplicon length variation
- top_delta: 0.01-0.02 for strict matches, 0.05 for more permissive
Status determination:
- OK: Top hit passes thresholds, and (pident₁ - pident₂) / pident₁ > top_delta
- AMBIGUOUS: Multiple hits where relative difference ≤ top_delta
- Formula: (top_pident - second_pident) / top_pident ≤ top_delta
- Example: top=99%, second=98.5%, delta=0.01 → (99-98.5)/99=0.00505 < 0.01 → AMBIGUOUS
- NO_HIT: No hits pass min_pident and min_qcov thresholds
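Taken together, the decision logic amounts to the following sketch (a restatement of the rules above in Python, not the pipeline's actual code):

```python
def classify_hit(hits, min_pident=0.985, min_qcov=0.90, top_delta=0.01):
    """Assign OK / AMBIGUOUS / NO_HIT from (species, pident, qcov) hits,
    using the 0-1 scales and thresholds documented above."""
    passing = [h for h in hits if h[1] >= min_pident and h[2] >= min_qcov]
    if not passing:
        return "NO_HIT", None
    passing.sort(key=lambda h: h[1], reverse=True)
    best = passing[0]
    if len(passing) > 1:
        top, second = passing[0][1], passing[1][1]
        if (top - second) / top <= top_delta:  # relative difference too small
            return "AMBIGUOUS", best[0]
    return "OK", best[0]

# Worked example from above: top=0.99, second=0.985, top_delta=0.01
print(classify_hit([("SpA", 0.99, 0.95), ("SpB", 0.985, 0.95)]))
# -> ('AMBIGUOUS', 'SpA') since (0.99 - 0.985) / 0.99 ≈ 0.00505 <= 0.01
```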
```yaml
threads:
  map: 8      # Minimap2 alignment threads
  polish: 8   # Racon and Medaka threads
  mmseqs: 8   # MMseqs2 search threads
```

Adjust based on available CPU cores. More threads mean faster processing but higher memory usage.
File: results/identify/{sample}/species_summary.tsv
Tab-delimited file with species identifications for all amplicons.
File: results/identify/{sample}/{amplicon}/summary.tsv
Detailed results for each amplicon separately.
File: results/identify/{sample}/{amplicon}/mmseqs_hits.tsv
5-column custom format (not standard BLAST format):
- query - Consensus sequence ID
- target - Reference sequence ID
- pident - Percent identity (0-100 scale)
- qcov - Query coverage (0-100 scale)
- bits - Bit score
Empty file if no consensus sequences were generated for this amplicon.
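A minimal loader for this hit table, assuming a headerless tab-separated file as is typical for MMseqs2 tabular output (path and helper name are illustrative; note the 0-100 scales here, unlike the 0-1 scales in summary.tsv):

```python
import csv

COLUMNS = ["query", "target", "pident", "qcov", "bits"]

def load_mmseqs_hits(path):
    """Read the 5-column custom MMseqs2 hit table; an empty file yields []."""
    hits = []
    with open(path, newline="") as fh:
        for row in csv.reader(fh, delimiter="\t"):
            if not row:
                continue
            rec = dict(zip(COLUMNS, row))
            rec["pident"] = float(rec["pident"]) / 100  # convert to 0-1 scale
            rec["qcov"] = float(rec["qcov"]) / 100
            rec["bits"] = float(rec["bits"])
            hits.append(rec)
    return hits
```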
File: results/consensus/{sample}/{amplicon}/consensus.fasta
Polished consensus sequences after Racon + Medaka.
File: results/bin/{sample}/fastq/{amplicon}.fq.gz
Reads binned by amplicon type.
If no reads map to an amplicon (amplicon not present in the sample):
- amplicon_fastq produces empty .fq.gz file
- cluster_vsearch may fail or produce empty centroids
- consensus_polish creates empty consensus.fasta
- mmseqs_search creates empty mmseqs_hits.tsv
- summarize_hits creates empty summary.tsv with headers only
- Final species_summary.tsv excludes this amplicon
This is expected behavior - not all samples contain all amplicons.
To identify which amplicons had reads:
```bash
# Count reads per amplicon
for f in results/bin/SAMPLE/fastq/*.fq.gz; do
    echo "$f: $(zcat $f | wc -l | awk '{print $1/4}')"
done
```

See ISSUES.md for detailed documentation of known bugs and limitations.
Note: Most critical bugs in scripts/map_fastq_to_fasta.py have been fixed (see ISSUES.md for details).
Remaining design considerations:
- Memory inefficiency in scripts/fastq_random_sample.py (loads entire file into memory)
If you encounter conda dependency conflicts when creating environments (especially with medaka, samtools, or htslib), try setting channel priority to flexible:
```bash
conda config --set channel_priority flexible
```

This allows conda to search across all configured channels to find compatible package versions, rather than strictly prioritizing packages from specific channels. This is particularly important for bioinformatics workflows that mix packages from bioconda and conda-forge.
Jobs failing with memory errors:
- Increase `mem_of_node` in `config.ini`
- Or reduce `cpus_per_task` (allocates more memory per job: `mem_of_node / cpus_per_task`)

Jobs timing out:
- Increase `max_runtime` in `config.ini`
- Check partition time limits: `sinfo -o "%P %l"`

Permission denied on partition:
- Check available partitions: `sinfo`
- Update `slurm_partition` in `profiles/slurm/config.yaml`
- Verify access: `sacctmgr show user $USER withassoc`

Too many jobs queued:
- Reduce the `jobs` value in `profiles/slurm/config.yaml`
- Check cluster fair-use policies

Conda environments not building on compute nodes:
- Pre-create environments: `snakemake --use-conda --conda-create-envs-only --cores 1`
- Check internet connectivity on compute nodes
- Consider using `--conda-frontend conda` instead of mamba
If you use this pipeline, please cite the tools it depends on:
- Minimap2: Li, H. (2018). Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics, 34(18), 3094-3100.
- Samtools: Danecek, P., et al. (2021). Twelve years of SAMtools and BCFtools. GigaScience, 10(2), giab008.
- Racon: Vaser, R., et al. (2017). Fast and accurate de novo genome assembly from long uncorrected reads. Genome Research, 27(5), 737-746.
- Medaka: Oxford Nanopore Technologies. Medaka: Sequence correction provided by ONT Research. https://github.com/nanoporetech/medaka
- MMseqs2: Steinegger, M., & Söding, J. (2017). MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nature Biotechnology, 35(11), 1026-1028.
- vsearch: Rognes, T., et al. (2016). VSEARCH: a versatile open source tool for metagenomics. PeerJ, 4, e2584.
- Snakemake: Mölder, F., et al. (2021). Sustainable data analysis with Snakemake. F1000Research, 10, 33.
