CodonMamba

CodonMamba: Towards a Foundation Model for Programmable mRNA Design

CodonMamba is a codon-level masked language model built on a parameter-efficient bidirectional Mamba (BiMamba) architecture. Pretrained on ~9 million coding sequences (CDSs) from 1,544 phylogenetically diverse organisms, CodonMamba supports both mRNA property prediction and synonymous CDS generation under dual (hard and soft) constraints.

[Figure: CodonMamba overview]


Highlights

  • BiMamba architecture — Bidirectional state space model with shared projection weights; captures full-sequence contextual dependencies with linear-time complexity and only 71M parameters.
  • Broad phylogenetic coverage — Pretrained on ~9M CDSs spanning 1,544 species across bacteria, archaea, fungi, plants, invertebrates, and vertebrates.
  • 12 downstream prediction tasks — Achieves best or second-best performance on all 12 mRNA-related tasks (ranking first on 10), outperforming existing codon-level, nucleic acid, and protein foundation models under a unified frozen-backbone probing protocol.
  • Dual-constraint CDS generation — Generates diverse synonymous coding sequences via iterative masked infilling with: (1) hard constraints (synonym-aware logit masking) that guarantee protein identity preservation, and (2) optional soft constraints (host-specific codon usage bias) for organism-specific adaptation — both applied at inference time without retraining.
  • Multi-objective optimization — Generated sequences show coordinated improvements across CAI, GC content, MFE, and sequence naturalness while maintaining near-zero translational risk features.
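
The hard constraint described above can be illustrated with a small sketch. It assumes a bare 64-codon vocabulary in a fixed order and the standard genetic code; the names `CODONS`, `mask_to_synonyms`, and `softmax` are hypothetical and are not taken from the CodonMamba code, whose tokenizer and masking logic may differ.

```python
import math
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumed vocabulary order).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = ["".join(c) for c in product(BASES, repeat=3)]
CODON_TO_AA = dict(zip(CODONS, AMINO))


def mask_to_synonyms(logits, target_aa):
    """Keep logits for codons encoding target_aa; set all others to -inf."""
    return [
        logit if CODON_TO_AA[codon] == target_aa else -math.inf
        for codon, logit in zip(CODONS, logits)
    ]


def softmax(logits):
    """Numerically stable softmax; -inf logits receive probability 0."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical model output for one masked position that must encode leucine.
logits = [0.0] * 64
probs = softmax(mask_to_synonyms(logits, "L"))
allowed = [c for c, p in zip(CODONS, probs) if p > 0]
print(allowed)
```

Because every non-synonymous codon receives zero probability, any codon sampled from `probs` leaves the encoded amino acid unchanged, which is the property the hard constraint is meant to guarantee.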

Installation

We recommend running CodonMamba on Linux systems with CUDA-enabled GPUs.

From source

git clone https://github.com/meilanglang/CodonMamba.git
cd CodonMamba
conda env create -f environment.yaml
conda activate codonmamba

Dependencies

python>=3.10
torch==2.3.0
mamba-ssm==2.2.6.post3
causal-conv1d==1.5.3.post1
numpy==1.24.4
pandas==1.5.3
biopython==1.83
scikit-learn==1.3.2

Model Checkpoints

| Model | Parameters | Pretraining Data | Download |
|---|---|---|---|
| CodonMamba-71M | 71M | ~9M CDSs, 1,544 organisms | HuggingFace |

Quick Start

1. Sequence Representation Extraction

Extract codon-level embeddings for downstream tasks:

python extract_embedding.py --fasta_file input.fasta --output out_embedding.npz
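
Since the model works at the codon level, it may help to check that each record in the input FASTA is an in-frame CDS before extraction. The following is a minimal sketch under that assumption; `read_fasta` and `split_codons` are hypothetical helpers, and whether `extract_embedding.py` itself accepts RNA letters or performs similar checks is not stated here.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_codons(cds):
    """Normalize a CDS to uppercase DNA and split it into codons."""
    seq = cds.strip().upper().replace("U", "T")
    if len(seq) % 3 != 0:
        raise ValueError(f"length {len(seq)} is not a multiple of 3")
    unexpected = set(seq) - set("ACGT")
    if unexpected:
        raise ValueError(f"unexpected characters: {sorted(unexpected)}")
    return [seq[i:i + 3] for i in range(0, len(seq), 3)]


# Example with an in-memory FASTA; replace with open("input.fasta") in practice.
records = [">demo", "ATGGCC", "TAA"]
for name, seq in read_fasta(records):
    print(name, split_codons(seq))
```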

2. Coding Sequence Generation or Optimization with Synonymous Masking

Generate optimized synonymous coding sequences for a target protein:

Set raw_beta = 0.0 in cds_generation.py, then run:

python cds_generation.py
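
Because the hard constraint is intended to preserve protein identity, an independent check on the generated sequences can be a useful safeguard. The sketch below assumes the standard genetic code and takes plain strings; the output format of `cds_generation.py` is not documented here, and `translate`, `is_synonymous`, and `codon_changes` are hypothetical helper names.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def _normalize(cds):
    """Uppercase the sequence and map RNA 'U' to DNA 'T'."""
    seq = cds.strip().upper().replace("U", "T")
    if len(seq) % 3 != 0:
        raise ValueError("CDS length must be a multiple of 3")
    return seq


def translate(cds):
    """Translate an in-frame CDS to a protein string ('*' marks stop codons)."""
    seq = _normalize(cds)
    return "".join(CODON_TO_AA[seq[i:i + 3]] for i in range(0, len(seq), 3))


def is_synonymous(original, generated):
    """True if both CDSs encode exactly the same protein."""
    return translate(original) == translate(generated)


def codon_changes(original, generated):
    """Count codon positions that differ between two equal-length CDSs."""
    a, b = _normalize(original), _normalize(generated)
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(a[i:i + 3] != b[i:i + 3] for i in range(0, len(a), 3))


original = "ATGGCTTAA"
generated = "ATGGCCTAG"  # hypothetical generator output
print(is_synonymous(original, generated), codon_changes(original, generated))
```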

3. Coding Sequence Generation or Optimization with Host-Specific Codon Usage Bias (hard + soft)

Set raw_beta = 2.0 and codon_usage_table_path = "your codon usage table path" in cds_generation.py, then run:

python cds_generation.py
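
One plausible way to read the soft constraint is as an additive bias on the logits of the synonymous candidates, scaled by a strength parameter. The sketch below assumes the form logit + beta * log(usage frequency); this formula, the dictionary-based usage table, and the names `apply_usage_bias` and `softmax` are illustrative assumptions, and the actual handling of `raw_beta` and the usage table file format in `cds_generation.py` may differ.

```python
import math


def apply_usage_bias(logits, usage, beta, eps=1e-9):
    """Add beta * log(host usage frequency) to each synonymous codon's logit.

    logits: dict codon -> model logit (synonymous candidates only)
    usage:  dict codon -> relative usage frequency in the host (assumed format)
    beta:   bias strength; beta = 0 leaves the model's logits unchanged
    """
    return {
        codon: logit + beta * math.log(usage.get(codon, 0.0) + eps)
        for codon, logit in logits.items()
    }


def softmax(scores):
    """Numerically stable softmax over a dict of scores."""
    m = max(scores.values())
    exps = {c: math.exp(s - m) for c, s in scores.items()}
    total = sum(exps.values())
    return {c: e / total for c, e in exps.items()}


# Hypothetical alanine position: flat model logits, made-up host usage values.
logits = {"GCT": 0.0, "GCC": 0.0, "GCA": 0.0, "GCG": 0.0}
usage = {"GCT": 0.1, "GCC": 0.4, "GCA": 0.2, "GCG": 0.3}

for beta in (0.0, 2.0):
    probs = softmax(apply_usage_bias(logits, usage, beta))
    print(beta, {c: round(p, 3) for c, p in probs.items()})
```

Under this formulation, a larger beta shifts probability toward codons the host uses more often, while beta = 0 recovers sampling under the hard constraint alone.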

Related mRNA/Protein/DNA Language Models

| Method | Data Type | Context | Token | Architecture | Pre-train Task | Parameter Size | Weights | Code | Data Size |
|---|---|---|---|---|---|---|---|---|---|
| mRNABERT | mRNA | 1024 | Codon | Transformer | MLM | 113M | HuggingFace | GitHub | ~18M |
| codonGPT | mRNA | 1024 | Codon | Transformer | NTP | ~0.34M | HuggingFace | GitHub | 3.4M |
| GEMORNA | Codon | - | Codon | Transformer | NTP | 4.4M | GitHub | GitHub | >1M |
| Helix-mRNA | mRNA | 1024 | Base | Mamba | NTP | 5.19M | HuggingFace | GitHub | 27M |
| CodonBERT | mRNA | 512 | Codon | Transformer | MLM/STP | 87M | GitHub | GitHub | 10M |
| CaLM | mRNA | 1024 | Codon | Transformer | MLM | 87M | CaLM | GitHub | 10M |
| CodonTransformer | Codon | 2048 | Codon | Transformer | MLM | 90M | HuggingFace | GitHub | 1M |
| SpliceBERT | pre-mRNA | 1024 | Base | Transformer | MLM | 20M | Zenodo | GitHub | 2M |
| ESM2 | Protein | 1024 | AA | Transformer | MLM | 650M | GitHub | GitHub | 250M |
| mRNAFM | RNA | 1024 | Base | Transformer | MLM | 239M | HuggingFace | GitHub | 40M |
| NT | DNA | 6-12K | 6-mer | Transformer | MLM | 50M-2.5B | HuggingFace | GitHub | 174B |

Contact

For questions, feedback, or collaboration inquiries, please contact: xiaolinli@ieee.org or yc48617@connect.um.edu.mo

Citation

If you find CodonMamba useful in your research, please cite:

@article{codonmamba2025,
  title={CodonMamba: Towards a Foundation Model for Programmable mRNA Design},
  author={xxx},
  journal={xxx},
  year={2025}
}

License

This project is licensed under the MIT License. See LICENSE for details.


Acknowledgments

We thank the developers of Mamba for the state space model implementation.
