CodonMamba

CodonMamba: Towards a Foundation Model for Programmable mRNA Design

CodonMamba is a codon-level masked language model built on a parameter-efficient bidirectional Mamba (BiMamba) architecture. Pretrained on ~9 million coding sequences (CDSs) from 1,544 phylogenetically diverse organisms, CodonMamba supports both mRNA property prediction and synonymous CDS generation under dual constraints.

Highlights

BiMamba architecture — Bidirectional state space model with shared projection weights; captures full-sequence contextual dependencies with linear-time complexity and only 71M parameters.
Broad phylogenetic coverage — Pretrained on ~9M CDSs spanning 1,544 species across bacteria, archaea, fungi, plants, invertebrates, and vertebrates.
12 downstream prediction tasks — Achieves best or second-best performance on all 12 mRNA-related tasks (ranking first on 10), outperforming existing codon-level, nucleic acid, and protein foundation models under a unified frozen-backbone probing protocol.
Dual-constraint CDS generation — Generates diverse synonymous coding sequences via iterative masked infilling with: (1) hard constraints (synonym-aware logit masking) that guarantee protein identity preservation, and (2) optional soft constraints (host-specific codon usage bias) for organism-specific adaptation — both applied at inference time without retraining.
Multi-objective optimization — Generated sequences show coordinated improvements across CAI, GC content, MFE, and sequence naturalness while maintaining near-zero translational risk features.
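To make the dual-constraint idea concrete, here is a minimal sketch of how one masked position could be decoded. It is not the repository's implementation: the function name `constrained_distribution`, the `usage` dictionary, and the additive form `logit + beta * log(frequency)` for the soft constraint are all assumptions made for illustration, and how `beta` relates to the `raw_beta` setting in cds_generation.py is not confirmed by this README.

```python
import math

# Standard genetic code (NCBI translation table 1), codons in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
CODON_TO_AA = dict(zip(CODONS, AMINO))


def constrained_distribution(logits, target_aa, usage=None, beta=0.0):
    """Turn raw per-codon scores into probabilities over synonymous codons only.

    logits    : dict codon -> model score at one masked position (hypothetical input)
    target_aa : one-letter amino acid that must be preserved at this position
    usage     : optional dict codon -> host codon frequency (hypothetical format)
    beta      : strength of the soft host bias (assumed additive in log space)
    """
    scores = {}
    for codon, logit in logits.items():
        # Hard constraint: drop every non-synonymous codon, which is
        # equivalent to setting its logit to -inf before the softmax.
        if CODON_TO_AA[codon] != target_aa:
            continue
        bias = 0.0
        if usage is not None and beta > 0.0:
            # Soft constraint: favor codons the host uses often.
            bias = beta * math.log(max(usage.get(codon, 0.0), 1e-9))
        scores[codon] = logit + bias

    # Numerically stable softmax over the surviving codons.
    top = max(scores.values())
    weights = {c: math.exp(s - top) for c, s in scores.items()}
    total = sum(weights.values())
    return {c: w / total for c, w in weights.items()}
```

With `beta = 0.0` the host table has no effect and only the hard constraint applies. Because non-synonymous codons never receive probability mass, any codon sampled from the result encodes the same amino acid.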

Installation

We recommend running CodonMamba on Linux systems with CUDA-enabled GPUs.

From source

git clone https://github.com/meilanglang/CodonMamba.git
cd CodonMamba
conda env create -f environment.yml
conda activate codonmamba

Dependencies

python>=3.10
torch==2.3.0
mamba-ssm==2.2.6.post3
causal-conv1d==1.5.3.post1
numpy==1.24.4
pandas==1.5.3
biopython==1.83
scikit-learn==1.3.2

Model Checkpoints

Model	Parameters	Pretraining Data	Download
CodonMamba-71M	71M	~9M CDSs, 1,544 organisms	HuggingFace

Quick Start

1. Sequence Representation Extraction

Extract codon-level embeddings for downstream tasks:

python extract_embedding.py --fasta_file input.fasta --output out_embedding.npz
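The README does not document the layout of the output file, so a reasonable first step is to inspect it. The sketch below uses NumPy to list every array stored in an .npz archive; the helper name `summarize_npz` is hypothetical, and no array names or shapes are assumed.

```python
import numpy as np


def summarize_npz(path):
    """Return {array_name: (shape, dtype)} for every array in an .npz archive."""
    # np.load on an .npz file returns a lazy, dict-like NpzFile;
    # the context manager closes the underlying zip file afterwards.
    with np.load(path) as data:
        return {
            name: (data[name].shape, str(data[name].dtype))
            for name in data.files
        }
```

For example, `summarize_npz("out_embedding.npz")` shows which keys the script wrote and the dimensions of each array before you feed them to a downstream model.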

2. Coding Sequence Generation or Optimization with Synonymous Masking

Generate optimized synonymous coding sequences for a target protein:

Set raw_beta = 0.0 in cds_generation.py, then run:

python cds_generation.py
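Because the hard constraint is meant to preserve the encoded protein, a simple check after generation is to translate the input and output CDSs and compare them. The following sketch uses only the standard library and the standard genetic code; the names `translate` and `is_synonymous` are illustrative and are not part of the CodonMamba code base.

```python
# Standard genetic code (NCBI translation table 1), codons in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = dict(
    zip((a + b + c for a in BASES for b in BASES for c in BASES), AMINO)
)


def translate(cds):
    """Translate a CDS (DNA or RNA alphabet) into a one-letter protein string."""
    seq = cds.upper().replace("U", "T")
    if len(seq) % 3 != 0:
        raise ValueError("CDS length must be a multiple of 3")
    # Ambiguous bases such as N raise KeyError here, which is intentional.
    return "".join(CODON_TO_AA[seq[i:i + 3]] for i in range(0, len(seq), 3))


def is_synonymous(original_cds, generated_cds):
    """True if both sequences encode exactly the same protein."""
    return translate(original_cds) == translate(generated_cds)
```

Running `is_synonymous` on each generated sequence against its source is a cheap way to confirm that only synonymous substitutions were made.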

3. Coding Sequence Generation or Optimization with Host-Specific Codon Usage Bias (hard + soft)

Set raw_beta = 2.0 and codon_usage_table_path = "your codon usage table path" in cds_generation.py, then run:

python cds_generation.py
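To see whether the host bias had the intended effect, you can score sequences with the Codon Adaptation Index (CAI) and GC content, two of the metrics named in the Highlights. The sketch below is a generic textbook-style calculation, not the repository's evaluation code. The `usage` dictionary (codon to host frequency) is a hypothetical in-memory format and does not necessarily match the files under tables/codon_usage_table.

```python
import math

# Standard genetic code (NCBI translation table 1), codons in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = dict(
    zip((a + b + c for a in BASES for b in BASES for c in BASES), AMINO)
)


def gc_content(seq):
    """Fraction of G and C bases in a nucleotide sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def relative_adaptiveness(usage):
    """w(codon) = frequency / max frequency among its synonymous codons."""
    best = {}
    for codon, freq in usage.items():
        aa = CODON_TO_AA[codon]
        best[aa] = max(best.get(aa, 0.0), freq)
    return {
        codon: freq / best[CODON_TO_AA[codon]]
        for codon, freq in usage.items()
        if best[CODON_TO_AA[codon]] > 0
    }


def cai(cds, usage):
    """Geometric mean of w over codons, skipping Met, Trp, and stop codons."""
    w = relative_adaptiveness(usage)
    seq = cds.upper().replace("U", "T")
    logs = []
    for i in range(0, len(seq) - len(seq) % 3, 3):
        codon = seq[i:i + 3]
        if CODON_TO_AA[codon] in "MW*":
            continue  # single-codon amino acids and stops carry no signal
        if w.get(codon, 0.0) > 0:
            logs.append(math.log(w[codon]))
    return math.exp(sum(logs) / len(logs)) if logs else float("nan")
```

Comparing `cai` for outputs produced with raw_beta = 0.0 and raw_beta = 2.0 against the same host table gives a quick indication of how strongly the soft constraint shifts codon choice.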

Related mRNA/Protein/DNA Language Models

Method	Data Type	Context	Token	Architecture	Pre-train Task	Parameter Size	Weights	Code	Data Size
mRNABERT	mRNA	1024	Codon	Transformer	MLM	113M	HuggingFace	GitHub	~18M
codonGPT	mRNA	1024	Codon	Transformer	NTP	~0.34M	HuggingFace	GitHub	3.4M
GEMORNA	Codon	-	Codon	Transformer	NTP	4.4M	GitHub	GitHub	>1M
Helix-mRNA	mRNA	1024	Base	Mamba	NTP	5.19M	HuggingFace	GitHub	27M
CodonBERT	mRNA	512	Codon	Transformer	MLM/STP	87M	GitHub	GitHub	10M
CaLM	mRNA	1024	Codon	Transformer	MLM	87M	CaLM	GitHub	10M
CodonTransformer	Codon	2048	Codon	Transformer	MLM	90M	HuggingFace	GitHub	1M
SpliceBERT	pre-mRNA	1024	Base	Transformer	MLM	20M	Zenodo	GitHub	2M
ESM2	Protein	1024	AA	Transformer	MLM	650M	GitHub	GitHub	250M
mRNAFM	RNA	1024	Base	Transformer	MLM	239M	HuggingFace	GitHub	40M
NT	DNA	6-12K	6-mer	Transformer	MLM	50M-2.5B	HuggingFace	GitHub	174B

Contact

For questions, feedback, or collaboration inquiries, please contact: xiaolinli@ieee.org or yc48617@connect.um.edu.mo

Citation

If you find CodonMamba useful in your research, please cite:

@article{codonmamba2025,
  title={CodonMamba: Towards a Foundation Model for Programmable mRNA Design},
  author={xxx},
  journal={xxx},
  year={2025}
}

License

This project is licensed under the MIT License. See LICENSE for details.

Acknowledgments

We thank the developers of Mamba for the state space model implementation.
