Goal:
FastaGuard catches FASTA-level assembly problems before expensive assembly QC.
Capabilities:
- assembly profile
- plain and gzipped FASTA input
- streaming parser
- duplicate IDs
- duplicate sequences
- invalid nucleotide/IUPAC symbols
- empty records
- core assembly stats
- N50, N90, L50, L90
- GC, AT, N, and ambiguity rates
- high-N scaffolds
- gap runs
- tiny contig heuristics
- length histogram data and HTML plot
- GC-vs-length plot data and HTML plot
- PASS / WARN / FAIL verdict
- JSON, TSV, HTML, and MultiQC-compatible outputs
- deterministic exit codes
Goal:
Make assembly preflight reports easier to trust, route, and consume in real pipelines.
Capabilities:
- GC, length, and composite assembly outlier findings
- finding taxonomy, confidence, and follow-up-tool metadata
- structured routing hints for workflow engines and tool agents
- richer provenance with command, timestamps, duration, and input size
- expanded TSV summary rows for outlier counts
- hardened MultiQC plugin and custom-content support
- Snakemake and nf-core starter material
- ready Bioconda recipe metadata for the v0.2.0 update
- benchmark evidence guidance for adoption decisions
Goal:
Make FastaGuard credible enough for pipeline authors to add as a default assembly gate.
Development scope:
- public evidence pack from local fixtures and public assemblies
- Bioconda and BioContainers v0.3 availability documented as live
- input checksum provenance with
provenance.input_sha256 - clearer machine-readable threshold metadata
- assembly gate preset with
--gate pipeline - explicit
gate.blocking_findingsandgate.advisory_findingsfor workflow engines and humans - clearer blocking vs follow-up recommendations; GC and length outliers remain
advisory unless added with
--fail-on
Goal:
Make many FASTA files easy to rank, filter, and route before interpretive QC.
Development scope:
- readiness categories for file, structure, alphabet, index, assembly, submission, and machine readiness
- index-readiness checks for duplicate first-token IDs and related identifier hazards
- submission-readiness advisories for FASTA-level issues worth fixing before official validators
fastaguard compareas a starter cohort triage layer- cross-file metrics table
- cohort-level outliers
- sample-to-sample summary
- combined JSON, TSV, HTML, and MultiQC outputs
- explicit boundaries: compare mode routes to QUAST, BUSCO, BlobToolKit, CheckM, official validators, annotation, or other downstream tools; it does not replace them
Goal:
Make assembly FASTA files safer to hand to official validators, annotation, and downstream QC.
Development scope:
--gate submissionfor stricter assembly FASTA preflight--submission-target <generic|ncbi>for target-aware submission advisories- stricter identifier and first-token ID safety checks
- gap-like
Nrun summaries for submission review - high ambiguity and tiny-record submission advisories
- submission readiness fields in JSON, TSV, HTML, and MultiQC outputs
- compare-mode aggregation of submission readiness across many FASTA files
- evidence and release notes for
--gate submissionworkflows before official validators - clear scope boundaries: FastaGuard does not replace NCBI, ENA, DDBJ, FCS, QUAST, BUSCO, BlobToolKit, CheckM, or annotation validation
- FastaGuard does not replace NCBI, ENA, DDBJ official validators or guarantee repository acceptance
Potential additions:
- excessive duplicate transcripts
- polyA and polyT tails
- very short transcripts
- extreme GC outliers
- isoform-heavy warning heuristics
Potential additions:
- invalid amino acid symbols
- internal stop codons
- terminal stop codons
- low-complexity regions
- suspicious nucleotide-looking proteins
Potential additions:
- stricter ID normalization checks
- reference naming conventions
- sequence uniqueness checks
- panel consistency summaries
- submission-readiness warnings
Potential additions:
- publish and verify package updates for each released contract
- upstream nf-core module submission
- official Snakemake wrapper submission
- Galaxy wrapper
- upstream MultiQC distribution path
- BioContainers verification for each published package
- Homebrew formula
Avoid external databases in v0.1. Later releases can explore:
- k-mer sketching with minimizers
- contamination suspicion without taxonomy
- optional sourmash, Kraken, or BlobToolKit hooks
- AI-generated report summaries from local metrics only
- browser-based interactive contig filtering
- WASM report viewer
FastaGuard should prepare for a future where machines talk to QC tools directly. The near-term path is not a chatbot feature; it is a cleaner contract.
Completed foundation:
- publish
schema/fastaguard.schema.json - document a stable finding catalog
- add structured
actions[]records to findings - add bounded per-record evidence to findings
- add provenance for profile, thresholds, fail rules, thread count, command, timestamps, duration, and input size
- add input checksum provenance with
provenance.input_sha256 - add explicit scope fields for what FastaGuard can and cannot conclude
- add structured routing hints for workflow engines and tool agents
- add
--schema,--finding-catalog, and--explain-finding <id>commands - add golden JSON conformance tests
Recommended next sequence:
- extend evidence tables across submission, transcriptome, protein, reference, and compare modes
- keep the v0.3 gate contract stable through workflow adoption examples
- explore an MCP or tool-server interface after the CLI schema is stable