Bacformer is a prokaryotic foundation model that represents a whole bacterial genome as a sequence of proteins ordered by their genomic coordinates on the chromosome and plasmid(s). It takes as input average protein embeddings from protein language models and computes contextualised protein embeddings conditioned on the other proteins present in the genome. Bacformer is trained on a diverse dataset of ~1.3M bacterial genomes and ~3B proteins.
Bacformer can be applied to a wide range of tasks, including strain clustering, essential gene prediction, operon identification, protein-protein interaction (PPI) prediction, protein function prediction, and more. We provide checkpoints for pretrained models as well as for Bacformer finetuned on various tasks. We also provide tutorials and make Bacformer available via HuggingFace.
- 2025-07-21: The Bacformer preprint is now available on bioRxiv.
- 2025-05-15: Bacformer is now available on HuggingFace.
- Setup
- Usage
- HuggingFace
- Pretrained model checkpoints
- Contributing
- Citation
- Installation
- Release notes
- Contact
Bacformer is based on PyTorch and HuggingFace Transformers and was developed with Python 3.10.
Bacformer uses ESM-2 protein embeddings as input (esm2_t12_35M_UR50D). We recommend using the faesm package to compute protein embeddings in a fast and efficient way.
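For reference, the average protein embedding for a single protein can be computed with plain HuggingFace transformers as sketched below. This is illustrative only; in practice the bacformer.pp helpers shown in the Usage section compute these embeddings for you, and faesm does it faster.

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t12_35M_UR50D")
esm = AutoModel.from_pretrained("facebook/esm2_t12_35M_UR50D").eval()

protein_seq = "MKAILVVLLG"
tokens = tokenizer(protein_seq, return_tensors="pt")
with torch.no_grad():
    residue_embeddings = esm(**tokens).last_hidden_state  # (1, seq_len + special tokens, hidden_dim)
# mean-pool over the sequence dimension to get one vector per protein
# (note: the exact pooling used by Bacformer, e.g. whether special tokens are
# excluded, may differ - this is just a sketch of the idea)
avg_protein_embedding = residue_embeddings.mean(dim=1)  # (1, hidden_dim)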
We recommend first installing PyTorch 2.2 or above (pip install "torch>=2.2"), then installing flash attention (pip install flash-attn --no-build-isolation), and finally installing bacformer via pip:
pip install bacformer

or by cloning the repository and installing the dependencies:
git clone https://github.com/macwiatrak/Bacformer.git
cd Bacformer
# 1) install Bacformer **with its core dependencies**
pip install .
# 2) (optional but recommended) add the fast-attention extra ("faesm")
pip install ".[faesm]"

Have trouble installing? Create a clean conda environment and install the CUDA toolkit 12.1.0 for compilation:
# Create new environment with Python 3.10
micromamba create -n bacformer python=3.10 -y
# Activate the environment
micromamba activate bacformer
# Install CUDA toolkit
micromamba install -c nvidia/label/cuda-12.1.0 cuda-toolkit -y
# Install PyTorch with CUDA support (using pip for latest version)
pip install torch --index-url https://download.pytorch.org/whl/cu121
# Install flash-attention
pip install flash-attn --no-build-isolation --no-cache-dir
# Optional: verify installations
python -c "import torch; print(f'PyTorch version: {torch.__version__}')"
python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}')"
# install faesm package for fast ESM-2 embeddings (recommended)
pip install faesm[flash_attn]
# finally, install the bacformer package
pip install bacformer

Another workaround is to use a Docker container. You can use the official NVIDIA PyTorch containers, which include all the dependencies needed for flash attention.
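For example, a recent official NVIDIA PyTorch container can be used as a starting point (the image tag below is only an example; pick any recent release):

# run an NVIDIA PyTorch container with GPU access (example tag)
docker run --gpus all -it --rm nvcr.io/nvidia/pytorch:24.04-py3
# then, inside the container:
pip install bacformer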
Below are examples of how to use Bacformer to compute contextualised protein embeddings.
import torch
from transformers import AutoModel
from bacformer.pp import protein_seqs_to_bacformer_inputs
device = "cuda:0"
model = AutoModel.from_pretrained(
"macwiatrak/bacformer-masked-MAG", trust_remote_code=True
).to(device).eval().to(torch.bfloat16)
# Example input: a sequence of protein sequences
# in this case: 4 toy protein sequences
# Bacformer was trained with a maximum of 6,000 proteins per genome.
protein_sequences = [
"MGYDLVAGFQKNVRTI",
"MKAILVVLLG",
"MQLIESRFYKDPWGNVHATC",
"MSTNPKPQRFAWL",
]
# embed the proteins with ESM-2 to get average protein embeddings
inputs = protein_seqs_to_bacformer_inputs(
protein_sequences,
device=device,
batch_size=128, # the batch size for computing the protein embeddings
max_n_proteins=6000, # the maximum number of proteins Bacformer was trained with
)
# compute contextualised protein embeddings with Bacformer
with torch.no_grad():
    outputs = model(**inputs, return_dict=True)
# (batch_size, n_prots + special tokens or max_n_proteins, embedding_dim)
print('last hidden state shape:', outputs["last_hidden_state"].shape)
# (batch_size, embedding_dim)
print('genome embedding:', outputs.last_hidden_state.mean(dim=1).shape)

Process a whole bacterial genome assembly from GenBank (in this case, the Pseudomonas aeruginosa PAO1 genome) and compute contextualised protein embeddings with Bacformer.
import torch
from transformers import AutoModel
from bacformer.pp import preprocess_genome_assembly, protein_seqs_to_bacformer_inputs
# preprocess a bacterial genome assembly
genome_info = preprocess_genome_assembly(filepath="files/pao1.gbff")
# load the model
device = "cuda:0"
model = AutoModel.from_pretrained(
"macwiatrak/bacformer-masked-complete-genomes", trust_remote_code=True
).to(device).eval().to(torch.bfloat16)
# embed the proteins with ESM-2 to get average protein embeddings
inputs = protein_seqs_to_bacformer_inputs(
genome_info['protein_sequence'],
device=device,
batch_size=128, # the batch size for computing the protein embeddings
max_n_proteins=6000, # the maximum number of proteins Bacformer was trained with
)
# compute contextualised protein embeddings with Bacformer
with torch.no_grad():
    outputs = model(**inputs, return_dict=True)
# the resulting contextualised protein embeddings can be used for analysis
print('last hidden state shape:', outputs["last_hidden_state"].shape)

Use Bacformer to embed a column of protein sequences from a HuggingFace dataset. The example below can be easily adapted to a pandas DataFrame or any other data structure containing protein sequences.
Below we show how to compute contextualised protein embeddings for all proteins present in a genome (as required for operon prediction), and how to compute a single embedding per genome (as required for strain clustering).
from bacformer.pp import embed_dataset_col
from datasets import load_dataset
# load the operon dataset from long-read RNA sequencing
operon_dataset = load_dataset("macwiatrak/operon-identification-long-read-rna-sequencing", split="test")
# embed the protein sequences with Bacformer
# we compute contextualised protein embeddings for all proteins in the genome
operon_dataset = embed_dataset_col(
dataset=operon_dataset,
model_path="macwiatrak/bacformer-masked-complete-genomes",
max_n_proteins=9000,
genome_pooling_method=None, # set to None to get embeddings for all proteins in the genome
)
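# ---------------------------------------------------------------------------
# Illustrative sketch (not the official tutorial): one simple zero-shot signal
# for operon identification is the similarity between the contextualised
# embeddings of neighbouring proteins - proteins in the same operon tend to
# have very similar embeddings. The "embeddings" column name below is an
# assumption; inspect operon_dataset.column_names for the column actually
# added by embed_dataset_col.
# ---------------------------------------------------------------------------
import torch
import torch.nn.functional as F

example_genome = operon_dataset[0]
prot_embs = torch.tensor(example_genome["embeddings"])  # (n_proteins, dim); assumed column name
neighbour_sim = F.cosine_similarity(prot_embs[:-1], prot_embs[1:], dim=-1)
# high similarity between consecutive proteins suggests they may share an operon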
# load the strain clustering toy dataset
strain_clustering_dataset = load_dataset("macwiatrak/strain-clustering-protein-sequences-sample", split="train")
# embed the protein sequences with Bacformer
# use mean genome pooling as we need a single genome embedding for each genome for clustering
strain_clustering_dataset = embed_dataset_col(
dataset=strain_clustering_dataset,
model_path="macwiatrak/bacformer-masked-MAG",
max_n_proteins=9000,
genome_pooling_method="mean",
)
# convert to pandas and print the first 5 rows
strain_clustering_df = strain_clustering_dataset.to_pandas()
strain_clustering_df.head()

(For one way to use these genome embeddings, see the short clustering sketch after the tutorial list below.)

We provide a set of tutorials to help you get started with Bacformer. The tutorials cover the following topics:
- Bacformer for strain clustering
- Finetuning Bacformer for essential genes prediction
- Bacformer for phenotypic traits prediction
- Finetuning Bacformer for phenotypic traits prediction
- Zero-shot operon identification with Bacformer
We are actively working on more tutorials, so stay tuned! If you have any suggestions for tutorials, please let us know by raising an issue in the issue tracker.
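As a quick illustration of how the genome embeddings from the strain clustering example above can be used, the sketch below clusters them with scikit-learn. This assumes scikit-learn is installed and that the embeddings live in an "embeddings" column; adjust the column name to whatever embed_dataset_col actually produces.

import numpy as np
from sklearn.cluster import KMeans

# stack the per-genome embeddings into a (n_genomes, dim) matrix
genome_embs = np.stack(strain_clustering_df["embeddings"].values)
# the number of clusters is arbitrary here - pick it to match your dataset
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(genome_embs)
strain_clustering_df["cluster"] = kmeans.labels_
print(strain_clustering_df["cluster"].value_counts())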
Bacformer is integrated with HuggingFace.
import torch
from transformers import AutoModel, AutoModelForMaskedLM, AutoModelForCausalLM
device = "cuda:0"
# load the Bacformer model trained on MAGs with an autoregressive objective
causal_model = AutoModelForCausalLM.from_pretrained("macwiatrak/bacformer-causal-MAG", trust_remote_code=True).to(torch.bfloat16).eval().to(device)
# load the Bacformer model trained on MAGs with a masked objective
masked_model = AutoModelForMaskedLM.from_pretrained("macwiatrak/bacformer-masked-MAG", trust_remote_code=True).to(torch.bfloat16).eval().to(device)
# load the Bacformer encoder model finetuned on complete genomes (i.e. without the protein family classification head)
# we recommend using this model for complete genomes as a start for finetuning on your own dataset for all tasks except generation
encoder_model = AutoModel.from_pretrained("macwiatrak/bacformer-masked-complete-genomes", trust_remote_code=True).to(torch.bfloat16).eval().to(device)

We provide a range of pretrained model checkpoints for Bacformer, which are available via HuggingFace.
| Checkpoint name | Genome type | Description |
|---|---|---|
| bacformer-causal-MAG | MAG | A model pretrained on the ~1.3M metagenome-assembled genomes (MAGs) with an autoregressive objective. |
| bacformer-masked-MAG | MAG | A model pretrained on the ~1.3M metagenome-assembled genomes (MAGs) with a masked objective, randomly masking 15% of proteins. |
| bacformer-causal-complete-genomes | Complete (i.e. uninterrupted) | A bacformer-causal-MAG finetuned on a set of ~40k complete genomes with an autoregressive objective. |
| bacformer-masked-complete-genomes | Complete (i.e. uninterrupted) | A bacformer-masked-MAG finetuned on a set of ~40k complete genomes with a masked objective, randomly masking 15% of the proteins. |
| bacformer-causal-protein-family-modeling-complete-genomes | Complete (i.e. uninterrupted) | A bacformer-causal-MAG finetuned on a set of ~40k complete genomes with an autoregressive objective. In contrast to the other models, this model takes as input a protein-family token rather than the protein sequence, allowing generation of sequences of protein families. |
| bacformer-for-essential-genes-prediction | Complete (i.e. uninterrupted) | A bacformer-masked-complete-genomes finetuned on the essential-genes prediction task. |
We welcome contributions to Bacformer! If you would like to contribute, please follow these steps:
- Fork the repository.
- Install pre-commit and set up the pre-commit hooks (make sure to do it at the root of the repository):
pip install pre-commit
pre-commit install

- Create a new branch for your feature or bug fix.
- Make your changes and commit them.
- Push your changes to your forked repository.
- Create a pull request to the main repository.
- Make sure to add tests for your changes and run the tests to ensure everything is working correctly.
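To run the tests mentioned in the last step, assuming the repository ships a pytest-based test suite (check the repository for the exact test setup), something like the following should work from the repository root:

pip install pytest
pytest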
For questions, bugs, and feature requests, please raise an issue in the repository.
@article{Wiatrak2025.07.20.665723,
author = {Wiatrak, Maciej and Vi{\~n}as Torn{\'e}, Ramon and Ntemourtsidou, Maria and Dinan, Adam M. and Abelson, David C. and Arora, Divya and Brbi{\'c}, Maria and Weimann, Aaron and Floto, Rodrigo Andres},
title = {A contextualised protein language model reveals the functional syntax of bacterial evolution},
elocation-id = {2025.07.20.665723},
year = {2025},
doi = {10.1101/2025.07.20.665723},
publisher = {Cold Spring Harbor Laboratory},
URL = {https://www.biorxiv.org/content/early/2025/07/20/2025.07.20.665723},
eprint = {https://www.biorxiv.org/content/early/2025/07/20/2025.07.20.665723.full.pdf},
journal = {bioRxiv}
}