Benchmark FastaGuard with deterministic synthetic FASTA files before adding heavier biology-specific modules. The product promise is "fast preflight"; every release should be able to prove that on a repeatable input.
Build a debug binary and run a tiny benchmark:
cargo build
python3 scripts/benchmark_large_fasta.py \
--records 10 \
--length 100 \
--binary target/debug/fastaguard \
--out-dir target/bench-smokeThis should finish quickly and produce fastaguard.json, fastaguard.tsv, fastaguard_report.html, and fastaguard_mqc.json in target/bench-smoke/.
For the v0.3 assembly gate contract, add the pipeline gate preset:
fastaguard sample.fa --profile assembly --gate pipelineFor the v0.4 compare-mode starter pattern, benchmark the same local binaries across a directory of FASTA files before sending the cohort to interpretive QC:
fastaguard compare assemblies/*.fa --profile assembly --gate pipelineBuild an optimized binary:
cargo build --release --lockedRun a 10 Mbp synthetic FASTA:
python3 scripts/benchmark_large_fasta.py \
--records 10000 \
--length 1000 \
--binary target/release/fastaguard \
--out-dir target/benchmarks/10mbpRun a 100 Mbp synthetic FASTA:
python3 scripts/benchmark_large_fasta.py \
--records 100000 \
--length 1000 \
--binary target/release/fastaguard \
--out-dir target/benchmarks/100mbpThe script prints a JSON timing summary with:
- record count
- bases per record
- total bases
- FASTA bytes
- elapsed seconds
- bases per second
- FastaGuard verdict
- output artifact paths
The generated FASTA is removed after the run by default to avoid leaving large files in target/. Keep it for inspection with:
python3 scripts/benchmark_large_fasta.py \
--records 10000 \
--length 1000 \
--keep-fastaThe synthetic benchmark is not a biological benchmark. It measures parser, metric, and report overhead on deterministic valid FASTA content.
Use it to answer:
- did runtime regress between commits?
- did output generation become unexpectedly expensive?
- does the tool still behave well on large record counts?
Do not use it to claim performance on contaminated assemblies, highly ambiguous assemblies, or compressed FASTA until separate fixtures cover those cases.
The value benchmark is documented in
docs/value-benchmark.md as a v0.3 single-file baseline.
It is local evidence for preflight overhead, not a v0.4 compare-mode timing. The
measured local frame is:
fastaguard 0.3.0, commit1873216, macOS ARM64- 10 Mbp synthetic FASTA, 10k records: PASS, 0.51 seconds, about 17 MB RSS
- 100 Mbp synthetic FASTA, 100k records: WARN for GC outliers, 0.98 seconds, about 50 MB RSS
Frame timings as evidence, not formal universal benchmark claims. Use them to show that FastaGuard is cheap enough to run before QUAST, BUSCO, BlobToolKit, CheckM, official validators, annotation, or other heavier follow-up tools.
FastaGuard should prove four preflight categories with small reproducible fixtures. For v0.3, the evidence should also show whether each category blocks the pipeline gate:
| Evidence case | Gate behavior | What FastaGuard catches | Why it should run before heavier tools |
|---|---|---|---|
| duplicate IDs | blocking | repeated FASTA identifiers | helps prevent workflow joins, indexes, and annotations from becoming ambiguous |
| invalid characters | blocking | non-IUPAC sequence symbols | flags inputs that may trigger downstream parser and aligner failures |
| high-N | blocking | ambiguous scaffolds and gap-heavy records | flags low-confidence mapping and annotation inputs before they are treated as clean |
| GC outliers | advisory by default | composition-anomalous records | supports routing suspicious records to BlobToolKit, sourmash, Kraken, or other follow-up tools |
FastaGuard should not replace QUAST, BUSCO, or BlobToolKit. It should make their
inputs safer and make obvious FASTA-level problems visible before those tools run.
For automated workflows, record gate.blocking_findings and
provenance.input_sha256 alongside runtime and verdict so the gate decision can
be audited against exact input bytes.
Use release binaries and public assemblies to build a small evidence table for the README and release notes:
- bacterial assembly around 5 Mbp
- fungal or small eukaryotic assembly around 30-50 Mbp
- large fragmented assembly with many contigs
- gzipped FASTA input
- intentionally problematic FASTA fixture with duplicate IDs and high-N scaffolds
For each run, record:
- FastaGuard version
- platform
- input size and sequence count
- elapsed seconds
- peak memory if measured externally
- verdict and top findings
- whether downstream tools would have been blocked or recommended
- gate status and
gate.blocking_findingswhen run with--gate pipeline provenance.input_sha256
This evidence matters more than synthetic speed alone because it shows the wedge: cheap FASTA preflight before expensive downstream QC.
The original v0.2 evidence workflow is documented in
docs/evidence/fastaguard-v0.2-evidence.md. The v0.3 gate evidence workflow is
documented in docs/evidence/fastaguard-v0.3-evidence.md.
CI-safe local run:
python3 scripts/collect_evidence.py \
--binary target/release/fastaguard \
--out-dir target/evidence/local-smoke \
--local-onlyPublic NCBI run:
python3 scripts/collect_evidence.py \
--binary target/release/fastaguard \
--out-dir target/evidence/v0.3The public run uses NCBI Datasets commands such as
datasets download genome accession <ACCESSION> --include genome --filename <zip>.
It writes compact evidence_summary.json and evidence_summary.tsv files while
leaving downloaded FASTA files and full reports under target/.