andersen-lab/Bygul


Bygul: Amplicon & Metagenomics Read Simulator

Bygul is a Python 3 tool designed for simulating sequencing reads in wastewater surveillance and other metagenomic applications. It allows users to simulate complex multi-sample datasets with customizable proportions using industry-standard backends like wgsim and mason.


🏗 Installation

Bygul requires Python 3. Because it relies on external simulators (wgsim and mason), we recommend using Conda to manage dependencies. For details on wgsim and mason, see their respective documentation.

Option 1: Via Conda (Recommended)

conda create -n bygul bioconda::bygul

Option 2: Via PyPI

pip install bygul

Note: Some binary dependencies (wgsim/mason) may need to be installed manually or built from source if using this method.

Option 3: Local Build from Source

git clone https://github.com/andersen-lab/Bygul
cd Bygul
pip install -e .
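
After any of the installation options above, it can help to confirm that the simulator backends are reachable before running a simulation. The following is a minimal sketch, not part of Bygul itself: it assumes the executables are named `wgsim` and `mason_simulator` (the names used in the help commands later in this README), and `missing_backends` is a hypothetical helper name.

```python
import shutil

# Assumption: these executable names match your installation; they are
# taken from the `--help` commands shown later in this README.
BACKENDS = ("wgsim", "mason_simulator")


def missing_backends(names=BACKENDS):
    """Return the executables from `names` that are not found on PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_backends()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All simulator backends are available.")
```

If a backend is reported as missing after a PyPI install, installing it through Conda is likely the simplest fix.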

🧬 Usage: Amplicon Sequencing Mode

Use this mode when simulating specific genomic regions defined by a primer set.

Basic Command

bygul simulate-proportions [SAMPLE1.fasta,SAMPLE2.fasta] --primers [primer.bed] --reference [reference.fasta] --proportions [0.8,0.2] --outdir [output_dir]
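
When the command is generated from a script, a small helper can catch mismatched inputs before Bygul is launched. This is a sketch under stated assumptions: `build_amplicon_command` is a hypothetical helper, only flags shown in this README are used, and the check that proportions sum to 1 is an assumption based on the `0.8,0.2` example rather than a documented requirement.

```python
import math
import shlex


def build_amplicon_command(samples, primers, reference, outdir, proportions=None):
    """Assemble a `bygul simulate-proportions` argument list for amplicon mode."""
    if not samples:
        raise ValueError("at least one sample FASTA is required")
    cmd = [
        "bygul", "simulate-proportions", ",".join(samples),
        "--primers", primers,
        "--reference", reference,
        "--outdir", outdir,
    ]
    if proportions is not None:
        if len(proportions) != len(samples):
            raise ValueError("need exactly one proportion per sample")
        # Assumption: proportions are fractions that sum to 1.
        if not math.isclose(sum(proportions), 1.0, abs_tol=1e-6):
            raise ValueError("proportions should sum to 1")
        cmd += ["--proportions", ",".join(str(p) for p in proportions)]
    return cmd


cmd = build_amplicon_command(
    ["sample1.fasta", "sample2.fasta"],
    "primer.bed", "reference.fasta", "results/", [0.8, 0.2],
)
print(shlex.join(cmd))  # pass `cmd` to subprocess.run(cmd, check=True) to execute
```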

Advanced Examples

  • Random Proportions & Mismatches: Omit --proportions to simulate with random proportions, and use --maxmismatch 2 to allow up to 2 SNPs in primer regions.
    bygul simulate-proportions sample1.fasta,sample2.fasta --primers primer.bed --reference reference.fasta --outdir results/ --maxmismatch 2
  • Switching Simulators: Use mason instead of the default wgsim.
    bygul simulate-proportions sample1.fasta,sample2.fasta --primers primer.bed --reference reference.fasta --simulator mason
  • Custom Error Rates & Lengths: Pass simulator-specific parameters (e.g. indel fraction -R) directly.
    bygul simulate-proportions sample1.fasta,sample2.fasta --primers primer.bed --reference reference.fasta -R 0.01
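
To make the idea behind --maxmismatch concrete, the sketch below counts single-base differences between a primer and a candidate binding site and compares the count to a threshold. It only illustrates the concept: Bygul's actual primer-matching logic is not described here, and `count_mismatches` and `primer_binds` are hypothetical names.

```python
def count_mismatches(primer, target):
    """Return the Hamming distance between a primer and an equally long site."""
    if len(primer) != len(target):
        raise ValueError("primer and target must have the same length")
    return sum(a != b for a, b in zip(primer.upper(), target.upper()))


def primer_binds(primer, target, max_mismatch=1):
    """True if the site differs from the primer by at most `max_mismatch` SNPs."""
    return count_mismatches(primer, target) <= max_mismatch


# The site below differs from the primer at two positions.
primer = "ACGTACGT"
site = "ACGAACGA"
print(count_mismatches(primer, site))
print(primer_binds(primer, site))                  # default threshold of 1
print(primer_binds(primer, site, max_mismatch=2))  # relaxed threshold
```

With a threshold of 1 this site would be rejected, while raising the threshold to 2 accepts it.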

🌍 Usage: Metagenomics Mode

Simulate reads from entire samples without requiring a primer BED file or a reference sequence.

Basic Metagenomics Simulation

bygul simulate-proportions sample1.fasta,sample2.fasta --outdir results/ --simulation_mode metagenomics

Metagenomics with Specific Parameters

bygul simulate-proportions sample1.fasta,sample2.fasta --proportions 0.5,0.5 --outdir results/ --simulation_mode metagenomics --simulator mason --illumina-read-length 200

📝 Technical Notes

Parameter Handling

Bygul acts as a wrapper. Most flags are passed directly to the underlying simulators, but the following are managed by Bygul itself to produce more realistic simulations (amplicon simulation mode only):

  • --readcnt: Number of reads per amplicon.
  • --wgsim_insert_size: Insert size for wgsim.
  • --wgsim_read_length / --wgsim_error_rate: Read length and error rate for wgsim.

To see all available backend flags, run:

wgsim --help
mason_simulator --help

Best Practices

  • Read Counts: Set --readcnt higher than the number of contigs in your amplicon file. Too few reads can result in empty files for certain amplicons.
  • Primer Files: The BED file must include a column with the primer sequence. Bygul allows 1 SNP mismatch by default; use --maxmismatch to change this.
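
The read-count guideline above can be checked before a run by counting the records in the amplicon FASTA file. This is a minimal sketch: the helper names are hypothetical, and which file you pass as the amplicon file depends on your own workflow.

```python
def count_fasta_records(path):
    """Count FASTA records by counting header lines that start with '>'."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


def readcnt_is_sufficient(amplicon_fasta, readcnt):
    """True if `readcnt` exceeds the number of contigs in the FASTA file."""
    return readcnt > count_fasta_records(amplicon_fasta)
```

For example, `readcnt_is_sufficient("amplicons.fasta", 500)` (with a hypothetical file name) returns False when the file holds 500 or more contigs, which suggests raising --readcnt.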

Output Files

  • Consolidated Reads: Simulated reads from all samples are at outdir/reads.fastq.
  • Proportions: Assigned proportions are recorded in outdir/sample_proportions.txt (for example, results/sample_proportions.txt when --outdir results/ is used).
  • Quality Metrics: Check outdir/[sample_name]/amplicon_stats.csv for information on amplicon dropouts, mismatches, and ambiguous bases.
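
As a quick sanity check on the consolidated output, the number of simulated reads can be counted directly from the FASTQ file. The sketch below assumes an uncompressed FASTQ with the conventional four lines per record; `count_fastq_reads` is a hypothetical helper, not part of Bygul.

```python
def count_fastq_reads(path):
    """Count records in an uncompressed FASTQ file (four lines per record)."""
    with open(path) as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4 != 0:
        raise ValueError(f"{path}: line count {n_lines} is not a multiple of 4")
    return n_lines // 4
```

Calling `count_fastq_reads("results/reads.fastq")` after a run with `--outdir results/` gives the total number of reads across all samples, assuming the file follows that layout.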

🎓 Citation

If you use this workflow in a paper, please cite the original repository: https://github.com/andersen-lab/Bygul
