CodonMamba: Towards a Foundation Model for Programmable mRNA Design
CodonMamba is a codon-level masked language model built on a parameter-efficient bidirectional Mamba (BiMamba) architecture. Pretrained on ~9 million coding sequences (CDSs) from 1,544 phylogenetically diverse organisms, CodonMamba supports both mRNA property prediction and synonymous CDS generation with dual constraints.
- BiMamba architecture — Bidirectional state space model with shared projection weights; captures full-sequence contextual dependencies with linear-time complexity and only 71M parameters.
- Broad phylogenetic coverage — Pretrained on ~9M CDSs spanning 1,544 species across bacteria, archaea, fungi, plants, invertebrates, and vertebrates.
- 12 downstream prediction tasks — Achieves best or second-best performance on all 12 mRNA-related tasks (ranking first on 10), outperforming existing codon-level, nucleic acid, and protein foundation models under a unified frozen-backbone probing protocol.
- Dual-constraint CDS generation — Generates diverse synonymous coding sequences via iterative masked infilling with: (1) hard constraints (synonym-aware logit masking) that guarantee protein identity preservation, and (2) optional soft constraints (host-specific codon usage bias) for organism-specific adaptation — both applied at inference time without retraining.
- Multi-objective optimization — Generated sequences show coordinated improvements across CAI, GC content, MFE, and sequence naturalness while maintaining near-zero translational risk features.
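The dual-constraint generation described above can be illustrated with a small, self-contained sketch. It is not the code in `cds_generation.py`: `score_fn` is a hypothetical stand-in for the model's per-position logits, `beta` plays the role of a soft-constraint weight, and adding `beta * log_usage` to the logits is an assumed way of applying codon usage bias. The hard constraint is implemented by dropping every codon that does not encode the target amino acid before normalizing.

```python
import math
import random

# Standard genetic code (NCBI translation table 1), bases ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
CODON_TO_AA = dict(zip(CODONS, AMINO_ACIDS))


def translate(codons):
    """Translate a list of codons into a protein string ('*' marks stop)."""
    return "".join(CODON_TO_AA[c] for c in codons)


def constrained_codon_probs(logits, target_aa, log_usage=None, beta=0.0):
    """Distribution over codons restricted to the synonyms of target_aa.

    logits: dict codon -> score for one masked position.
    log_usage: optional dict codon -> log relative usage in the host (assumed form).
    beta: weight of the soft codon-usage term; 0 disables it.
    """
    scores = {}
    for codon, logit in logits.items():
        if CODON_TO_AA[codon] != target_aa:
            continue  # hard constraint: non-synonymous codons get probability 0
        bias = beta * log_usage.get(codon, 0.0) if log_usage else 0.0
        scores[codon] = logit + bias
    peak = max(scores.values())  # subtract the max for numerical stability
    weights = {c: math.exp(s - peak) for c, s in scores.items()}
    total = sum(weights.values())
    return {c: w / total for c, w in weights.items()}


def infill(codons, score_fn, rounds=3, mask_fraction=0.3, seed=0,
           log_usage=None, beta=0.0):
    """Iteratively re-sample a random subset of positions among synonymous codons."""
    rng = random.Random(seed)
    codons = list(codons)
    n_mask = max(1, int(mask_fraction * len(codons)))
    for _ in range(rounds):
        for pos in rng.sample(range(len(codons)), n_mask):
            target_aa = CODON_TO_AA[codons[pos]]
            probs = constrained_codon_probs(
                score_fn(codons, pos), target_aa, log_usage, beta
            )
            choices, weights = zip(*sorted(probs.items()))
            codons[pos] = rng.choices(choices, weights=weights)[0]
    return codons


def random_logits(codons, pos):
    """Hypothetical stand-in for the model: random scores for all 64 codons."""
    rng = random.Random(pos)
    return {c: rng.gauss(0.0, 1.0) for c in CODONS}


start = ["ATG", "CTG", "AAA", "TCT", "TAA"]
generated = infill(start, random_logits)
print("".join(generated), translate(generated))
```

Because sampling only ever sees synonymous codons, the translated protein is identical before and after infilling regardless of what the scoring function returns, which is the property the hard constraint is meant to guarantee.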
We recommend running CodonMamba on Linux systems with CUDA-enabled GPUs.
git clone https://github.com/meilanglang/CodonMamba.git
cd CodonMamba
conda env create -f environment.yaml
conda activate codonmamba
Main dependencies:
python>=3.10
torch==2.3.0
mamba-ssm==2.2.6.post3
causal-conv1d==1.5.3.post1
numpy==1.24.4
pandas==1.5.3
biopython==1.83
scikit-learn==1.3.2
| Model | Parameters | Pretraining Data | Download |
|---|---|---|---|
| CodonMamba-71M | 71M | ~9M CDSs, 1,544 organisms | HuggingFace |
Extract codon-level embeddings for downstream tasks:
python extract_embedding.py --fasta_file input.fasta --output out_embedding.npz
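The layout of the resulting `.npz` file is not documented here, so a reasonable first step is to list the arrays it contains before wiring it into a downstream probe. The sketch below assumes only that the file is a standard NumPy archive; the key `seq1`, the `(8, 4)` shape, and the mean-pooling step are illustrative assumptions, and the demo runs on a synthetic archive rather than on real CodonMamba output.

```python
import os
import tempfile

import numpy as np


def summarize_npz(path):
    """Map each array name stored in an .npz archive to its shape."""
    with np.load(path) as archive:
        return {name: archive[name].shape for name in archive.files}


def mean_pool(token_embeddings):
    """Average a (length, hidden) matrix over the length axis."""
    return np.asarray(token_embeddings).mean(axis=0)


# Demo on a synthetic archive; the key "seq1" and the 8x4 shape are made up.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo_embedding.npz")
    np.savez(demo_path, seq1=np.ones((8, 4), dtype=np.float32))
    shapes = summarize_npz(demo_path)
    with np.load(demo_path) as archive:
        pooled = mean_pool(archive["seq1"])

print(shapes, pooled.shape)
```

For a real run, replace `demo_path` with `out_embedding.npz` and check the printed keys and shapes to see whether embeddings are stored per codon or already pooled per sequence.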
Generate optimized synonymous coding sequences for a target protein:
Set `raw_beta = 0.0` in `cds_generation.py`, then run:
python cds_generation.py
Set `raw_beta = 2.0` and `codon_usage_table_path = "your codon usage table path"` in `cds_generation.py`, then run:
python cds_generation.py
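To check how well a generated sequence matches a host's codon usage, it can help to compute the Codon Adaptation Index (CAI) and GC content independently of the generator. The sketch below derives relative adaptiveness weights from a set of reference CDSs and scores a candidate with them. It does not reproduce the file format expected by `codon_usage_table_path`; the function names, the pseudocount, and the toy reference sequences are all assumptions made for illustration.

```python
import math
from collections import Counter

# Standard genetic code (NCBI translation table 1), bases ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
CODON_TO_AA = dict(zip(CODONS, AMINO_ACIDS))


def split_codons(cds):
    """Split a CDS into codons, accepting RNA (U) or DNA (T) letters."""
    cds = cds.upper().replace("U", "T")
    if len(cds) % 3:
        raise ValueError("CDS length must be a multiple of 3")
    return [cds[i:i + 3] for i in range(0, len(cds), 3)]


def relative_adaptiveness(reference_cdss, pseudocount=0.5):
    """Weight of each codon relative to the most used synonym in the reference set."""
    counts = Counter(c for cds in reference_cdss for c in split_codons(cds))
    weights = {}
    for codon, aa in CODON_TO_AA.items():
        family = [c for c, a in CODON_TO_AA.items() if a == aa]
        best = max(counts[c] + pseudocount for c in family)
        weights[codon] = (counts[codon] + pseudocount) / best
    return weights


def cai(cds, weights):
    """Geometric mean of codon weights, skipping Met, Trp, and stop codons."""
    scored = [c for c in split_codons(cds) if CODON_TO_AA[c] not in "MW*"]
    if not scored:
        raise ValueError("no codons left to score")
    return math.exp(sum(math.log(weights[c]) for c in scored) / len(scored))


def gc_content(cds):
    """Fraction of G and C bases in the sequence."""
    cds = cds.upper()
    return (cds.count("G") + cds.count("C")) / len(cds)


# Toy reference set and two synonymous candidates encoding M-L-K-stop.
reference = ["ATGCTGCTGCTGAAATAA", "ATGCTGCTTAAGAAATAA"]
weights = relative_adaptiveness(reference)
preferred = "ATGCTGAAATAA"
rare = "ATGTTAAAGTAA"
print(cai(preferred, weights), cai(rare, weights), gc_content(preferred))
```

The pseudocount keeps codons that never appear in the reference set from producing a weight of zero, which would otherwise make the logarithm undefined.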
| Methods | Data Type | Context | Token | Architecture | Pre-train Task | Parameter Size | Weights | Code | Data size |
|---|---|---|---|---|---|---|---|---|---|
| mRNABERT | mRNA | 1024 | Codon | Transformer | MLM | 113M | HuggingFace | GitHub | ~18M |
| codonGPT | mRNA | 1024 | Codon | Transformer | NTP | ~0.34M | HuggingFace | GitHub | 3.4M |
| GEMORNA | Codon | - | Codon | Transformer | NTP | 4.4M | GitHub | GitHub | >1M |
| Helix-mRNA | mRNA | 1024 | Base | Mamba | NTP | 5.19M | HuggingFace | GitHub | 27M |
| CodonBERT | mRNA | 512 | Codon | Transformer | MLM/STP | 87M | GitHub | GitHub | 10M |
| CaLM | mRNA | 1024 | Codon | Transformer | MLM | 87M | CaLM | GitHub | 10M |
| CodonTransformer | Codon | 2048 | Codon | Transformer | MLM | 90M | HuggingFace | GitHub | 1M |
| SpliceBERT | pre-mRNA | 1024 | Base | Transformer | MLM | 20M | Zenodo | GitHub | 2M |
| ESM2 | Protein | 1024 | AA | Transformer | MLM | 650M | GitHub | GitHub | 250M |
| mRNAFM | RNA | 1024 | Base | Transformer | MLM | 239M | HuggingFace | GitHub | 40M |
| NT | DNA | 6-12K | 6-mer | Transformer | MLM | 50M-2.5B | HuggingFace | GitHub | 174B |
For questions, feedback, or collaboration inquiries, please contact: xiaolinli@ieee.org or yc48617@connect.um.edu.mo
If you find CodonMamba useful in your research, please cite:
@article{codonmamba2025,
title={CodonMamba: Towards a Foundation Model for Programmable mRNA Design},
author={xxx},
journal={xxx},
year={2025}
}
This project is licensed under the MIT License. See LICENSE for details.
We thank the developers of Mamba for the state space model implementation.
