Bygul is a Python 3 tool designed for simulating sequencing reads in wastewater surveillance and other metagenomic applications. It allows users to simulate complex multi-sample datasets with customizable proportions using industry-standard backends like wgsim and mason.
Bygul requires Python 3. Since it relies on external simulators (wgsim and mason), we recommend using Conda to manage dependencies. For more information on the wgsim and mason simulators, see their documentation.
conda create -n bygul bioconda::bygul
Alternatively, install with pip:
pip install bygul
Note: When installing with pip, some binary dependencies (wgsim/mason) may need to be installed manually or built from source.
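Because a pip install may leave the simulator binaries missing, a quick check that they are on your PATH can save a failed run. The following is a minimal sketch using only the Python standard library; the executable names `wgsim` and `mason_simulator` are taken from the help commands in this README and may differ on your system, and `find_backends` is a hypothetical helper, not part of Bygul.

```python
import shutil

# Executable names as used by the help commands in this README.
# Assumption: your installation exposes the backends under these names.
BACKENDS = ("wgsim", "mason_simulator")


def find_backends(names=BACKENDS):
    """Map each executable name to its full path on PATH, or None if absent."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_backends().items():
        print(f"{name}: {path or 'NOT FOUND'}")
```

Any backend reported as `NOT FOUND` needs to be installed separately (for example through Conda) before Bygul can use it.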
git clone https://github.com/andersen-lab/Bygul
cd Bygul
pip install -e .
Use amplicon mode when simulating specific genomic regions defined by a primer set.
bygul simulate-proportions [SAMPLE1.fasta,SAMPLE2.fasta] --primers [primer.bed] --reference [reference.fasta] --proportions [0.8,0.2] --outdir [output_dir]
- Random Proportions & Mismatches:
Simulate with random proportions and allow up to 2 SNPs in primer regions.
bygul simulate-proportions sample1.fasta,sample2.fasta --primers primer.bed --reference reference.fasta --outdir results/ --maxmismatch 2
- Switching Simulators:
Use mason instead of the default wgsim.
bygul simulate-proportions sample1.fasta,sample2.fasta --primers primer.bed --reference reference.fasta --simulator mason
- Custom Error Rates & Lengths:
Pass simulator-specific parameters (e.g. indel fraction -R) directly.
bygul simulate-proportions sample1.fasta,sample2.fasta --primers primer.bed --reference reference.fasta -R 0.01
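When scripting many amplicon-mode runs, it can help to assemble the command from Python and catch input mistakes before launching the simulator. The sketch below uses only flags shown in the examples above. `build_amplicon_command` is a hypothetical helper, and its checks (one proportion per sample, proportions summing to 1) are assumptions about sensible input rather than documented Bygul behaviour.

```python
import math
import shlex


def build_amplicon_command(samples, primers, reference, outdir,
                           proportions=None, max_mismatch=None, simulator=None):
    """Assemble a `bygul simulate-proportions` argument list for amplicon mode."""
    if not samples:
        raise ValueError("at least one sample FASTA is required")
    cmd = ["bygul", "simulate-proportions", ",".join(samples),
           "--primers", primers, "--reference", reference, "--outdir", outdir]
    if proportions is not None:
        # Assumption: one proportion per sample, and they sum to 1.
        if len(proportions) != len(samples):
            raise ValueError("need exactly one proportion per sample")
        if not math.isclose(sum(proportions), 1.0, abs_tol=1e-9):
            raise ValueError("proportions should sum to 1")
        cmd += ["--proportions", ",".join(str(p) for p in proportions)]
    if max_mismatch is not None:
        cmd += ["--maxmismatch", str(max_mismatch)]
    if simulator is not None:
        cmd += ["--simulator", simulator]
    return cmd


if __name__ == "__main__":
    cmd = build_amplicon_command(
        ["sample1.fasta", "sample2.fasta"], "primer.bed", "reference.fasta",
        "results/", proportions=[0.8, 0.2], max_mismatch=2)
    # Print a shell-ready command line; pass `cmd` to subprocess.run to execute.
    print(shlex.join(cmd))
```

Returning a list rather than a single string keeps the command safe to hand to `subprocess.run` without shell quoting issues.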
Simulate reads from entire samples without requiring a primer BED file or a reference sequence.
bygul simulate-proportions sample1.fasta,sample2.fasta --outdir results/ --simulation_mode metagenomics
bygul simulate-proportions sample1.fasta,sample2.fasta --proportions 0.5,0.5 --outdir results/ --simulation_mode metagenomics --simulator mason --illumina-read-length 200
Bygul acts as a wrapper. While most flags are passed directly to the underlying simulators, the following are managed directly by Bygul for more realistic simulations (amplicon simulation mode only):
- --readcnt: Number of reads per amplicon.
- --wgsim_insert_size: Insert size for wgsim.
- --wgsim_read_length / --wgsim_error_rate.
To see all available backend flags, run:
wgsim --help
mason_simulator --help
- Read Counts: Set --readcnt higher than the number of contigs in your amplicon file. Too few reads can result in empty files for certain amplicons.
- Primer Files: The BED file must include a column with the primer sequence. Bygul allows 1 SNP mismatch by default; use --maxmismatch to change this.
- Consolidated Reads: Simulated reads from all samples are written to outdir/reads.fastq.
- Proportions: Assigned proportions are recorded in results/sample_proportions.txt.
- Quality Metrics: Check outdir/[sample_name]/amplicon_stats.csv for information on amplicon dropouts, mismatches, and ambiguous bases.
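As a quick sanity check on the consolidated reads file, you can count its records. This is a minimal sketch that assumes an uncompressed FASTQ with exactly four lines per record and no empty sequence lines; `count_fastq_records` is a hypothetical helper, not part of Bygul, and the demo uses an in-memory string instead of a real output file.

```python
import io


def count_fastq_records(handle):
    """Count records in a FASTQ stream, assuming 4 non-empty lines per record."""
    n_lines = sum(1 for line in handle if line.strip())
    if n_lines % 4 != 0:
        raise ValueError(f"{n_lines} non-empty lines is not a multiple of 4")
    return n_lines // 4


if __name__ == "__main__":
    # Two tiny made-up records standing in for a real reads file.
    demo = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIII\n")
    print(count_fastq_records(demo))  # prints 2
```

To check a real run, open the reads file from your output directory (for example `open("results/reads.fastq")`) and pass the handle to the same function.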
If you use this workflow in a paper, please cite the original repository: https://github.com/andersen-lab/Bygul