BioEarthDigital/CycSim


CycSim - a context-based long-read simulator

Long-read sequencing data contain context-dependent errors, where certain bases are more likely to be misread depending on their surrounding sequence. Most existing simulators introduce errors randomly, which overlooks these error biases and only approximates the overall error rate. CycSim takes a different approach by modeling errors in a k-mer–dependent manner, enabling more realistic and biologically accurate error simulation.
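To make the distinction concrete, here is a minimal Python sketch of context-dependent substitution. It is not CycSim's actual model: the function name `simulate_read`, the 3-mer window, and the probability table are illustrative assumptions, and a realistic model would also cover insertions and deletions.

```python
import random


def simulate_read(template, context_error, default_error, rng, k=3):
    """Return a copy of `template` with context-dependent substitutions.

    The error probability for each base is looked up by the k-mer centred
    on it in the template; contexts missing from `context_error` fall back
    to `default_error`. Bases too close to either end are left unchanged.
    """
    half = k // 2
    out = list(template)
    for i in range(half, len(template) - half):
        context = template[i - half : i + half + 1]
        p = context_error.get(context, default_error)
        if rng.random() < p:
            # Replace the base with a different one chosen uniformly.
            out[i] = rng.choice([b for b in "ACGT" if b != template[i]])
    return "".join(out)


# Made-up rates: bases inside an A homopolymer are misread far more often.
toy_rates = {"AAA": 0.5}
rng = random.Random(0)
print(simulate_read("ACGTAAAACGTACGT", toy_rates, 0.01, rng))
```

A simulator that applies one uniform rate corresponds to calling this with an empty table, so every base uses `default_error` regardless of its neighbours.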

CycSim is easy to train and supports all types of long-read sequencing data. It currently provides pre-trained models for BGI CycloneSEQ, PacBio HiFi, and Oxford Nanopore Q20 data. Users can also quickly train their own custom models using a BAM file of reads aligned to a reference genome.

Installation

Installing from source

Dependencies

CycSim is written in Rust. If Rust is not installed yet, run the command below (no root required) or refer to the Rust installation documentation.

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
Download and install

Users in China can configure the Tsinghua mirror to speed up downloads.

git clone https://github.com/BioEarthDigital/CycSim.git
cd CycSim && cargo build --release
Test
cd test && bash hh.sh

Download pre-trained models

# BGI CycloneSEQ model
wget https://zenodo.org/records/17017268/files/cyclone_hd118_mode.v1.1.cy
# PacBio HiFi model
wget https://zenodo.org/records/17017268/files/hifi_model.v1.1.cy
# Oxford Nanopore Q20 data model
wget https://zenodo.org/records/17017268/files/ont_q20_model.v1.1.cy

General usage

Simulation

CycSim takes a genome assembly file and a trained model file as input to generate simulated reads in BAM format.

./target/release/cycsim sim -t 60 -d 30 model.cy ref.fa -o sim.bam

Note: If you need to simulate more than 50× coverage (i.e., more than the depth used for training), it is recommended to add the -n option. This will introduce additional random errors and help avoid oversampling artifacts.
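As a rough guide to how much data a given coverage implies, the sketch below applies the usual relation depth = total bases / genome size. It assumes that `-d` in the command above is the target depth; the helper name `reads_for_depth` and the mean read length are hypothetical, and CycSim may choose read lengths differently.

```python
import math


def reads_for_depth(genome_size, depth, mean_read_length):
    """Estimate how many reads are needed to reach `depth` coverage."""
    if genome_size <= 0 or mean_read_length <= 0 or depth < 0:
        raise ValueError("sizes must be positive and depth non-negative")
    total_bases = depth * genome_size
    return math.ceil(total_bases / mean_read_length)


# Example: a 3 Mb genome at 30x with reads averaging 15 kb.
print(reads_for_depth(3_000_000, 30, 15_000))
```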

Training

CycSim can be trained to build an error model from real sequencing data. It takes a genome assembly file and a read mapping file in BAM format as input (sorting is not required) and produces a trained model file.

./target/release/cycsim train -t 60 -r nanopore read.bam ref.fa -o model.cy
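Conceptually, training on aligned reads amounts to tallying what happened to each reference base in each sequence context. The sketch below illustrates that idea on gapped alignment strings in plain Python. It is an assumption-laden illustration rather than CycSim's training code or model format: the function names, the 3-mer window, and the choice to skip insertions are all hypothetical.

```python
from collections import Counter, defaultdict


def count_context_events(ref_aln, read_aln, k=3):
    """Tally match/mismatch/deletion events per reference k-mer context.

    `ref_aln` and `read_aln` are equal-length alignment rows in which '-'
    marks a gap. Insertions (gaps in the reference row) are skipped here.
    """
    if len(ref_aln) != len(read_aln):
        raise ValueError("alignment rows must have equal length")
    ref = ref_aln.replace("-", "")
    half = k // 2
    counts = defaultdict(Counter)
    ref_pos = 0
    for r, q in zip(ref_aln, read_aln):
        if r == "-":
            continue  # inserted base in the read; no reference position
        if half <= ref_pos < len(ref) - half:
            context = ref[ref_pos - half : ref_pos + half + 1]
            if q == "-":
                event = "deletion"
            elif q == r:
                event = "match"
            else:
                event = "mismatch"
            counts[context][event] += 1
        ref_pos += 1
    return counts


def error_rates(counts):
    """Convert event tallies into a per-context error rate."""
    return {ctx: 1 - c["match"] / sum(c.values()) for ctx, c in counts.items()}


counts = count_context_events("ACGTAAAC", "ACGTA-AC")
print(error_rates(counts))
```

With enough aligned reads, a table like this can tell a context-based simulator how often to introduce each kind of error in each context.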

Use ./target/release/cycsim -h to see options.
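For scripted workflows, the two commands above can be chained from Python. The sketch below only reuses the flags shown in this README; the parameter names `threads`, `depth`, and `read_type` are guesses at what `-t`, `-d`, and `-r` mean, so check `cycsim -h` before relying on them.

```python
import subprocess

CYCSIM = "./target/release/cycsim"  # binary path used in this README


def train_command(bam, ref, model_out, read_type="nanopore", threads=60):
    """Build the `cycsim train` argument list shown above."""
    return [CYCSIM, "train", "-t", str(threads), "-r", read_type,
            bam, ref, "-o", model_out]


def sim_command(model, ref, bam_out, depth=30, threads=60):
    """Build the `cycsim sim` argument list shown above."""
    return [CYCSIM, "sim", "-t", str(threads), "-d", str(depth),
            model, ref, "-o", bam_out]


def run_pipeline(bam, ref, model_out, sim_out):
    """Train a model, then simulate reads with it.

    Raises subprocess.CalledProcessError if either step fails.
    """
    subprocess.run(train_command(bam, ref, model_out), check=True)
    subprocess.run(sim_command(model_out, ref, sim_out), check=True)
```

Calling `run_pipeline("read.bam", "ref.fa", "model.cy", "sim.bam")` would run both steps in order and stop at the first failure.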

Getting help

Help

Feel free to raise an issue at the issue page.

Note: Please ask questions on the issue page first. They are also helpful to other users.

Contact

For additional help, please send an email to hujiang_at_genomics_dot_cn.

Limitations

  1. CycSim currently supports training and simulation only in whole-genome sequencing (WGS) scenarios.

Benchmarking

  1. CycSim introduces an error rate distribution that is consistent with real sequencing data.

[Figure: mapping-identity]

  2. CycSim introduces an error bias comparable to that observed in real sequencing data.

[Figure: error-preference]

  3. CycSim introduces a position-dependent error distribution that is consistent with real sequencing data.
    Note: If you need a global, context-independent error rate, enable --global_error_rate in the simulation stage.

[Figure: position-dependent]

Star

You can track updates by clicking the Star button in the upper-right corner of the GitHub page.

Citation

Preprint:

Context-aware simulation enables systematic optimization of long-read mapping parameters, Jiang Hu, Dongming Fang, Xin Jin, Chentao Yang, bioRxiv 2025.12.04.692264; doi: https://doi.org/10.64898/2025.12.04.692264
