A Python framework for reproducible microbial serotyping from sequencing data
Torchbase addresses challenges in microbial typing pipelines around database distribution, versioning, and reproducibility. It provides a standardized, versioned, and distributed approach to microbial typing databases and workflows.
A "torch" is a versioned, self-contained typing database package distributed via IPFS that contains:
- Allele reference sequences - FASTA files in _resources/
- Allelic profile tables - Tab-separated schema definitions
- WDL workflows - Execution workflows for typing
- Metadata - Citations, maintainers, build information
Torches are distributed via IPFS to enable versioned, reproducible typing across different users and institutions.
- Reproducible typing - Pin specific torch versions for deterministic results
- Distributed databases - IPFS-based distribution eliminates single points of failure
- Multiple scheme support - Convert from PubMLST, cgMLST, ShigaTyper, and more
- Flexible typing strategies - Choose speed/accuracy tradeoff (fast/balanced/sensitive/auto)
- Generalized allelic typing - Works for MLST, serotyping, and other allelic profile systems
- Multi-scheme torches - Single torch can contain multiple typing schemes
- Quality analysis - K-mer analysis for detecting similar/duplicate alleles
- Registry system - Hierarchical configuration with fallback registries
- WDL workflows - Standardized execution via miniwdl
pip install torchbase

For development:
git clone https://github.com/CFSAN-Biostatistics/torchbase
cd torchbase
make install-dev

torchbase list
torchbase pull pubmlst/ecoli
torchbase info pubmlst/ecoli

# Run with default balanced strategy
torchbase run pubmlst/ecoli --contigs contigs.fasta
# Choose typing strategy (fast/balanced/sensitive/auto)
torchbase run pubmlst/ecoli --reads reads.fastq --strategy fast
torchbase run pubmlst/ecoli --contigs contigs.fasta --strategy sensitive
# Auto strategy analyzes input and picks optimal approach
torchbase run pubmlst/ecoli --contigs contigs.fasta --strategy auto

Typing Strategies:
- fast - MinHash-based calling only, fastest (best for high-quality assemblies)
- balanced - MinHash with alignment fallback if needed (default, good for most cases)
- sensitive - Full alignment-based calling, most accurate (best for challenging samples)
- auto - Automatically selects strategy based on input characteristics
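To make "MinHash-based calling" concrete, here is a minimal sketch of the general technique: hash every k-mer, keep the smallest hashes as a sketch, and estimate Jaccard similarity between a query and each reference allele. It is not Torchbase's implementation. The function names (`kmers`, `sketch`, `jaccard_estimate`, `call_allele`), the k-mer size, the sketch size, and the toy sequences are all assumptions chosen for illustration.

```python
import hashlib


def kmers(seq: str, k: int) -> set:
    """Return the set of k-mers in a sequence."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def sketch(seq: str, k: int = 5, size: int = 64) -> list:
    """Bottom-s MinHash sketch: the `size` smallest k-mer hashes, sorted."""
    hashes = {
        int.from_bytes(hashlib.blake2b(km.encode(), digest_size=8).digest(), "big")
        for km in kmers(seq, k)
    }
    return sorted(hashes)[:size]


def jaccard_estimate(a: list, b: list, size: int = 64) -> float:
    """Estimate Jaccard similarity from two bottom-s sketches."""
    union = sorted(set(a) | set(b))[:size]  # smallest hashes of the merged sketches
    if not union:
        return 0.0
    shared = set(a) & set(b)
    return sum(1 for h in union if h in shared) / len(union)


def call_allele(query: str, alleles: dict) -> tuple:
    """Return (best allele name, similarity) for a query sequence."""
    q = sketch(query)
    scores = {name: jaccard_estimate(q, sketch(seq)) for name, seq in alleles.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


# Toy alleles (hypothetical, differing by one base).
ALLELES = {
    "adk_1": "ATGGCGTACGTTAGCCTAGGCTAACGTTAGCGGATCC",
    "adk_2": "ATGGCGTACGTTAGCCTAGGCTAACGTTTGCGGATCC",
}
best_allele, best_score = call_allele(ALLELES["adk_2"], ALLELES)
print(best_allele, best_score)
```

Real sketches use much longer k-mers and larger sketch sizes. The small values here only keep the toy example readable.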
torchbase pin pubmlst/ecoli 1.2.0

Torchbase uses hierarchical configuration with two levels:
- User config: ~/.torchbase/config.toml - Global settings
- Project config: .torchbase.toml - Project-specific overrides
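As an illustration of how two configuration levels can combine, the sketch below merges a user-level and a project-level TOML document so that project values win, which is the precedence Torchbase documents for pins. The `merge_configs` helper and the `1.1.0` user pin are hypothetical. Torchbase's own loader may merge differently.

```python
try:
    import tomllib  # standard library since Python 3.11
except ModuleNotFoundError:
    import tomli as tomllib  # third-party backport for older Pythons

# Stand-in for ~/.torchbase/config.toml (contents are illustrative).
USER_CONFIG = """
[registries]
default = "https://registry.torchbase.org/manifest.toml"

[pins]
"pubmlst/ecoli" = "1.1.0"
"pubmlst/salmonella" = "2.1.5"
"""

# Stand-in for a project's .torchbase.toml (contents are illustrative).
PROJECT_CONFIG = """
[pins]
"pubmlst/ecoli" = "1.2.0"
"""


def merge_configs(user: dict, project: dict) -> dict:
    """Merge table by table; keys set in the project config take precedence."""
    merged = {}
    for section in user.keys() | project.keys():
        user_value = user.get(section, {})
        project_value = project.get(section, {})
        if isinstance(user_value, dict) and isinstance(project_value, dict):
            merged[section] = {**user_value, **project_value}
        else:
            merged[section] = project_value if section in project else user_value
    return merged


effective = merge_configs(tomllib.loads(USER_CONFIG), tomllib.loads(PROJECT_CONFIG))
print(effective["pins"])
```

In this sketch the project pin replaces the user pin for `pubmlst/ecoli`, while the user-level `pubmlst/salmonella` pin and the registry settings carry through unchanged.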
[registries]
default = "https://registry.torchbase.org/manifest.toml"
additional = [
    "https://alt-registry.example.com/manifest.toml"
]
[pins]
"pubmlst/ecoli" = "1.2.0"
"pubmlst/salmonella" = "2.1.5"

Pins lock torch versions for reproducibility. Project pins override user pins.
# Convert PubMLST MLST scheme
torchtools convert pubmlst --scheme ecoli
# Convert PubMLST cgMLST scheme
torchtools convert pubcgmlst --scheme listeria
# Convert ShigaTyper database
torchtools convert shigatyper

torchtools build <namespace>/<torchname>/<version>.torch

torchtools version <namespace>/<torchname> --increment minor

Torchbase consists of three layers:
- Schema: Container for typing profiles with version info
- Profile: Represents allelic profiles with wildcard (?) and exclusion (X) support
- Profile equality supports tuples, dicts, and PubMLST-style strings

- Torch dataclass: Loads and validates torch packages
- IPFS integration via ipyfs
- Manifest system for tracking available torches
- Environment-based IPFS configuration

- torchbase: User-facing commands (list, pull, info, run, workflow)
- torchtools: Authoring commands (build, version, convert)
- Strategy-based workflow routing (fast/balanced/sensitive/auto)
- Automatic file compression to zstandard format
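The wildcard support mentioned for Profile can be pictured with a small comparison routine. The sketch below covers only the `?` wildcard, compared against tuples and dicts. It does not model the exclusion marker (X) or PubMLST-style strings, whose exact semantics are not described here. The names `as_tuple` and `profiles_match`, the locus names, and the allele numbers are hypothetical, as is the choice to honor wildcards only on the reference side.

```python
WILDCARD = "?"


def as_tuple(profile, loci):
    """Normalize a dict or sequence of allele calls to a tuple of strings."""
    if isinstance(profile, dict):
        return tuple(str(profile[locus]) for locus in loci)
    return tuple(str(allele) for allele in profile)


def profiles_match(reference, observed, loci) -> bool:
    """True if every reference position equals the observed call or is a wildcard."""
    ref = as_tuple(reference, loci)
    obs = as_tuple(observed, loci)
    if len(ref) != len(obs):
        return False
    return all(r == WILDCARD or r == o for r, o in zip(ref, obs))


LOCI = ("adk", "fumC", "gyrB")   # illustrative locus names
reference = ("10", "?", "4")     # the second locus accepts any allele

match_result = profiles_match(reference, {"adk": 10, "fumC": 11, "gyrB": 4}, LOCI)
mismatch_result = profiles_match(reference, {"adk": 10, "fumC": 11, "gyrB": 5}, LOCI)
print(match_result, mismatch_result)
```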
Torchbase provides three built-in typing strategies that balance speed and accuracy, plus an auto mode that picks one for you:

fast
- Method: MinHash-based similarity only
- Speed: Fastest (~1-2 min for typical MLST)
- Best for: High-quality assemblies, large batches, screening
- Pipeline: Sketch → Compare → Call alleles → Lookup profile

balanced
- Method: MinHash with conditional alignment fallback
- Speed: Moderate (~2-5 min)
- Best for: Most use cases, mixed quality data
- Pipeline: Sketch → Compare → Call alleles → If confidence <85% → Align → Refine

sensitive
- Method: Full alignment-based calling
- Speed: Slower (~5-15 min)
- Best for: Novel alleles, difficult samples, maximum accuracy
- Pipeline: Sketch (guide) → Align → Refine calls → Lookup profile

auto
- Method: Analyzes input and picks appropriate strategy
- Logic: Contigs → fast, Reads → balanced, Edge cases → balanced
- Best for: Unknown data quality, automated pipelines
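The routing rules above can be sketched in a few lines. This is an illustration of the documented logic (contigs → fast, reads → balanced, edge cases → balanced, and the <85% confidence fallback), not Torchbase's code. The function names are hypothetical, and treating "both inputs" or "no input" as edge cases is an assumption.

```python
from typing import Optional

CONFIDENCE_THRESHOLD = 0.85  # the "<85%" cutoff from the balanced pipeline


def choose_strategy(contigs: Optional[str] = None, reads: Optional[str] = None) -> str:
    """Mimic the documented auto logic: contigs only -> fast, anything else -> balanced."""
    if contigs and not reads:
        return "fast"
    # Reads, or any edge case (assumed here: both inputs or neither), use balanced.
    return "balanced"


def needs_alignment_fallback(confidence: float,
                             threshold: float = CONFIDENCE_THRESHOLD) -> bool:
    """In the balanced pipeline, low-confidence MinHash calls are refined by alignment."""
    return confidence < threshold


print(choose_strategy(contigs="contigs.fasta"))
print(choose_strategy(reads="reads.fastq"))
print(needs_alignment_fallback(0.80))
```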
Visualize any workflow's pipeline:
# View built-in workflow
torchbase workflow inspect balanced
# View torch-embedded workflow
torchbase workflow inspect path/to/torch/

<namespace>/<torchname>/<version>.torch/
├── metadata.toml          # Package metadata, citations
├── profiles.tsv           # Allelic profile table
├── main.wdl               # Optional: custom workflow
└── _resources/            # Reference FASTA files
    ├── locus1.fasta
    └── locus2.fasta

<namespace>/<torchname>/<version>.torch/
├── metadata.toml
└── schemes/
    ├── organism1/
    │   ├── profiles.tsv
    │   └── alleles/
    │       ├── locus1.fasta
    │       └── locus2.fasta
    └── organism2/
        ├── profiles.tsv
        └── alleles/
            ├── locus1.fasta
            └── locus2.fasta

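A tool reading a torch needs to tell the two layouts apart. One plausible approach, sketched below, is to check for a `schemes/` directory and fall back to the single-scheme layout otherwise. The helper `find_schemes` and the `"default"` scheme name are hypothetical. Torchbase's own loader may detect layouts differently.

```python
import tempfile
from pathlib import Path


def find_schemes(torch_dir: Path) -> dict:
    """Map scheme name -> (profiles.tsv path, allele FASTA directory)."""
    schemes_dir = torch_dir / "schemes"
    if schemes_dir.is_dir():
        # Multi-scheme layout: one subdirectory per scheme.
        return {
            sub.name: (sub / "profiles.tsv", sub / "alleles")
            for sub in sorted(schemes_dir.iterdir())
            if sub.is_dir()
        }
    # Single-scheme layout: profiles.tsv and _resources/ at the top level.
    return {"default": (torch_dir / "profiles.tsv", torch_dir / "_resources")}


# Build two throwaway directory trees to exercise both branches.
with tempfile.TemporaryDirectory() as tmp:
    single = Path(tmp) / "single.torch"
    single.mkdir()
    multi = Path(tmp) / "multi.torch"
    for organism in ("organism1", "organism2"):
        (multi / "schemes" / organism / "alleles").mkdir(parents=True)
    single_names = list(find_schemes(single))
    multi_names = list(find_schemes(multi))

print(single_names, multi_names)
```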
Torches can include a main.wdl workflow for custom typing logic. If present, the torch's workflow is used instead of the built-in strategies. Note: The --strategy flag cannot be used with torch-embedded workflows.
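The precedence rule for embedded workflows can be expressed as a short check. The sketch below is an assumption about how such a rule could be enforced, not Torchbase's implementation: `resolve_workflow`, the `builtin:` prefix, and the error message are all hypothetical.

```python
import tempfile
from pathlib import Path
from typing import Optional


def resolve_workflow(torch_dir: Path, strategy: Optional[str] = None) -> str:
    """Prefer a torch-embedded main.wdl; otherwise fall back to a built-in strategy."""
    embedded = torch_dir / "main.wdl"
    if embedded.is_file():
        if strategy is not None:
            raise ValueError("--strategy cannot be combined with a torch-embedded workflow")
        return str(embedded)
    return f"builtin:{strategy or 'balanced'}"  # balanced is the documented default


with tempfile.TemporaryDirectory() as tmp:
    torch = Path(tmp)
    without_wdl = resolve_workflow(torch, "fast")
    (torch / "main.wdl").write_text("version 1.0\n")
    with_wdl = Path(resolve_workflow(torch)).name
    try:
        resolve_workflow(torch, "fast")
        conflict_rejected = False
    except ValueError:
        conflict_rejected = True

print(without_wdl, with_wdl, conflict_rejected)
```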
Torchbase provides built-in quality analysis and filtering capabilities to handle suspect or low-quality alleles in typing databases.
Torchbase includes k-mer analysis for quality control:
from torchbase.quality.kmer_analysis import analyze_locus
# fasta_path is the path to a single locus FASTA file, for example one from a torch's _resources/ directory
report = analyze_locus(fasta_path, k=21)
print(f"Found {len(report.suspect_pairs)} suspect pairs")

When running typing workflows, you can provide a quality.json file to filter suspect alleles, loci, or profiles. This is useful for databases with known quality issues or to exclude problematic sequences.
# Exclude suspect alleles only (default behavior)
torchbase run pubmlst/ecoli --contigs sample.fasta \
--quality-json quality.json \
--exclude-suspect-alleles
# Exclude all alleles from suspect loci
torchbase run pubmlst/ecoli --contigs sample.fasta \
--quality-json quality.json \
--exclude-suspect-loci
# Exclude all loci from suspect profiles (most aggressive)
torchbase run pubmlst/ecoli --contigs sample.fasta \
--quality-json quality.json \
--exclude-suspect-profiles

Filtering Levels (hierarchical):
1. --exclude-suspect-alleles - Excludes only specific flagged alleles
2. --exclude-suspect-loci - Excludes all alleles from flagged loci (implies level 1)
3. --exclude-suspect-profiles - Excludes all loci from flagged profiles (implies levels 1 & 2)
Note: If no --quality-json is provided, exclusion flags are silently ignored and no filtering occurs.
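The way the flags stack can be sketched as a function from the command-line options to the set of active filtering levels. This is a reading of the rules above, not Torchbase's code. `active_filter_levels` is a hypothetical name, and defaulting to level 1 when a quality file is given without any flag is an assumption based on the "(default behavior)" comment in the examples.

```python
from typing import Optional, Set


def active_filter_levels(quality_json: Optional[str],
                         alleles: bool = False,
                         loci: bool = False,
                         profiles: bool = False) -> Set[int]:
    """Return the filtering levels in effect; a higher level implies all lower ones."""
    if quality_json is None:
        return set()  # exclusion flags are ignored without a quality file
    if profiles:
        top = 3
    elif loci:
        top = 2
    else:
        # --exclude-suspect-alleles, or (assumed) no flag at all, means level 1.
        top = 1
    return set(range(1, top + 1))


print(active_filter_levels("quality.json", loci=True))
print(active_filter_levels(None, profiles=True))
```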
The quality.json file contains quality annotations for alleles, loci, and profiles:
{
  "loci": {
    "locus_name": {
      "suspect": false,
      "threshold": 90.0,
      "similarities": {
        "allele_1-allele_2": 45.5,
        "allele_1-allele_3": 98.5
      },
      "alleles": {
        "1": {
          "suspect": false,
          "length": 450,
          "gc_content": 52.3
        },
        "2": {
          "suspect": true,
          "reason": "low similarity to other alleles"
        }
      },
      "statistics": {
        "mean": 72.0,
        "std_dev": 31.2,
        "min": 45.5,
        "max": 98.5,
        "percentile_99": 97.0,
        "threshold_type": "percentile"
      }
    }
  },
  "profiles": {
    "ST1": {
      "suspect": false,
      "loci": ["locus1", "locus2", "locus3"]
    },
    "ST42": {
      "suspect": true,
      "loci": ["locus1", "locus2"],
      "reason": "incomplete profile"
    }
  }
}

Key fields:
- loci[].suspect: Boolean flag marking entire locus as suspect
- loci[].similarities: Pairwise allele similarity scores (pairs below threshold are suspect)
- loci[].threshold: Similarity cutoff for suspect pairs (default: 90.0)
- loci[].alleles[].suspect: Per-allele quality flags
- profiles[].suspect: Boolean flag marking entire profile as suspect
- profiles[].loci: List of loci in this profile
Allele pairs in the similarities object with scores below the threshold are automatically marked as suspect. This enables detection of duplicate or highly similar alleles that may represent sequencing errors or database quality issues.
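For readers who want to inspect a quality.json file themselves, the sketch below applies the documented rule (pairs scoring below the locus threshold are suspect) to a trimmed copy of the example above. The helper names `suspect_pairs` and `suspect_alleles` are hypothetical and are not part of the Torchbase API.

```python
import json

# Trimmed copy of the example quality.json shown above.
QUALITY = json.loads("""
{
  "loci": {
    "locus_name": {
      "suspect": false,
      "threshold": 90.0,
      "similarities": {"allele_1-allele_2": 45.5, "allele_1-allele_3": 98.5},
      "alleles": {
        "1": {"suspect": false},
        "2": {"suspect": true, "reason": "low similarity to other alleles"}
      }
    }
  }
}
""")


def suspect_pairs(locus: dict) -> list:
    """Pairs whose similarity falls below the locus threshold (default 90.0)."""
    threshold = locus.get("threshold", 90.0)
    return sorted(
        pair for pair, score in locus.get("similarities", {}).items()
        if score < threshold
    )


def suspect_alleles(locus: dict) -> list:
    """Allele IDs carrying an explicit per-allele suspect flag."""
    return sorted(
        allele for allele, info in locus.get("alleles", {}).items()
        if info.get("suspect")
    )


locus = QUALITY["loci"]["locus_name"]
pairs = suspect_pairs(locus)
alleles = suspect_alleles(locus)
print(pairs, alleles)
```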
When filtering is applied, the typing result includes exclusion metadata:
{
  "profile_id": "ST1",
  "status": "known",
  "confidence": 0.95,
  "notes": {
    "exclusions": {
      "excluded_alleles": ["locus1_42", "locus2_7"],
      "excluded_loci": ["locus3"],
      "num_excluded_alleles": 2,
      "num_excluded_loci": 1
    }
  }
}

This allows you to track which alleles/loci were filtered during typing for provenance and quality control purposes.
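A downstream script might read this metadata as follows. The sketch parses the example result shown above and checks that the reported counts agree with the lists. `summarize_exclusions` is a hypothetical helper, and the consistency check is an extra safeguard rather than documented Torchbase behavior.

```python
import json

# The example typing result shown above.
RESULT = json.loads("""
{
  "profile_id": "ST1",
  "status": "known",
  "confidence": 0.95,
  "notes": {
    "exclusions": {
      "excluded_alleles": ["locus1_42", "locus2_7"],
      "excluded_loci": ["locus3"],
      "num_excluded_alleles": 2,
      "num_excluded_loci": 1
    }
  }
}
""")


def summarize_exclusions(result: dict) -> str:
    """Build a one-line provenance summary from a typing result."""
    exclusions = result.get("notes", {}).get("exclusions", {})
    alleles = exclusions.get("excluded_alleles", [])
    loci = exclusions.get("excluded_loci", [])
    # Guard against a result whose counts disagree with its lists.
    if exclusions.get("num_excluded_alleles", len(alleles)) != len(alleles):
        raise ValueError("num_excluded_alleles does not match excluded_alleles")
    if exclusions.get("num_excluded_loci", len(loci)) != len(loci):
        raise ValueError("num_excluded_loci does not match excluded_loci")
    return f"{result['profile_id']}: {len(alleles)} alleles and {len(loci)} loci excluded"


summary = summarize_exclusions(RESULT)
print(summary)
```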
make test # Run pytest
make test-all # Test on all Python versions
make coverage # Generate coverage report

make lint # Run flake8

make dist # Build source and wheel
make release # Upload to PyPI

The RegistryManager resolves torch references to IPFS CIDs:
from torchbase.registry import RegistryManager
from torchbase.config import RegistryConfig
config = RegistryConfig.load()
manager = RegistryManager(config)
# Resolve torch to CID
cid = manager.resolve("pubmlst/ecoli", version="1.2.0")
# Fetch torch to local path
path = manager.fetch_torch("pubmlst/ecoli")

Contributions welcome! Please:
- Fork the repository
- Create a feature branch
- Add tests for new functionality
- Ensure make test and make lint pass
- Submit a pull request
See LICENSE file for details.
If you use Torchbase in your research, please cite:
[Citation information to be added]
- GitHub Issues: https://github.com/CFSAN-Biostatistics/torchbase/issues
- Documentation: [Coming soon]
Created from Binfie-cookiecutter, https://github.com/crashfrog/binfie-cookiecutter