AnyModal is a modular framework for building multimodal language models in PyTorch. It bridges pretrained vision (or other modality) encoders with large language models via pluggable projector architectures — letting you train a multimodal model with just a config file and a few lines of code.
- 5 Projector Architectures: Linear, MLP, Q-Former (BLIP-2), Perceiver Resampler (Flamingo), C-Abstractor (Honeybee)
- Pluggable Encoders: ViT, SigLip, CLIP, DINOv2 — any HuggingFace vision model works out of the box
- Config-Driven: YAML configs for all hyperparameters; CLI overrides for experiments
- HuggingFace Trainer Integration: LR scheduling, gradient clipping, mixed precision, checkpointing, wandb/tensorboard
- Proper Packaging: `pip install` support, type-annotated API, tested with CI
- PEFT/LoRA Support: Optional LoRA on both the vision encoder and language model
pip install -e .
# With optional dependencies:
pip install -e ".[peft]" # LoRA support
pip install -e ".[quantization]" # 4-bit / 8-bit quantization
pip install -e ".[all]" # Everything

# Train an image captioning model
cd examples/image_captioning
python ../train.py --config config.yaml
# Run inference
python ../inference.py --config config.yaml --model_dir ./output/image_captioning/final

from anymodal import ModelConfig, EncoderConfig, ProjectorConfig, build_model
config = ModelConfig(
    encoder=EncoderConfig(model_name="google/vit-base-patch16-224"),
    projector=ProjectorConfig(type="qformer", kwargs={"num_queries": 32}),
    prompt_text="Describe this image: ",
)
processor, model = build_model(config)
model.print_trainable_parameters()
# Training
logits, loss = model({"input": batch_pixel_values, "text": captions})
loss.backward()
# Inference
generated = model.generate(sample_input, max_new_tokens=100)

from anymodal import build_model, build_hf_trainer, TrainingConfig, MultiModalDataset
processor, model = build_model(config)
train_dataset = MultiModalDataset(
    dataset_name="AnyModal/flickr30k",
    processor=processor,
    image_field="image",
    text_field="original_alt_text",
    split="train",
)
trainer = build_hf_trainer(
    model, TrainingConfig(num_epochs=3, fp16=True), train_dataset
)
trainer.train()
model.save_pretrained("./output/final")

The projector is the key trainable component: it maps encoded modality features into the LLM's embedding space. Choose based on your needs:
| Projector | Based On | Resamples? | Best For |
|---|---|---|---|
| `linear` | LLaVA v1 | No | Quick experiments, minimal overhead |
| `mlp` | LLaVA v1.5 | No | General-purpose, good default |
| `qformer` | BLIP-2 | Yes → N queries | Fixed output length, cross-attention |
| `perceiver` | Flamingo | Yes → N latents | Rich latent interactions, longer inputs |
| `c_abstractor` | Honeybee | Yes → spatial pool | Vision tasks with spatial structure |
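To make the "Resamples?" column concrete, here is a minimal PyTorch sketch of the two behaviours. The class names `MLPProjectorSketch` and `QueryResamplerSketch` are hypothetical and are not AnyModal's own implementations; the widths 768 and 2048 are stand-ins for an encoder and an LLM embedding size. A token-wise MLP keeps the input sequence length, while a query-based resampler (in the spirit of Q-Former or Perceiver Resampler) always emits a fixed number of tokens.

```python
import torch
import torch.nn as nn


class MLPProjectorSketch(nn.Module):
    """Token-wise MLP: keeps the sequence length, changes the feature width."""

    def __init__(self, input_dim: int, output_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, output_dim),
            nn.GELU(),
            nn.Linear(output_dim, output_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # (batch, seq, input_dim) -> (batch, seq, output_dim)
        return self.net(x)


class QueryResamplerSketch(nn.Module):
    """Learned queries cross-attend to encoder features; output length is fixed."""

    def __init__(self, input_dim: int, output_dim: int, num_queries: int, num_heads: int = 4):
        super().__init__()
        # One learned vector per output token (assumption: small random init).
        self.queries = nn.Parameter(torch.randn(num_queries, output_dim) * 0.02)
        self.in_proj = nn.Linear(input_dim, output_dim)
        self.attn = nn.MultiheadAttention(output_dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        kv = self.in_proj(x)  # (batch, seq, output_dim)
        q = self.queries.unsqueeze(0).expand(x.size(0), -1, -1)  # (batch, num_queries, output_dim)
        out, _ = self.attn(q, kv, kv)  # queries attend over all encoder tokens
        return out  # (batch, num_queries, output_dim)


# Stand-in encoder output: 197 tokens of width 768 (e.g. 196 patches + CLS).
features = torch.randn(2, 197, 768)
mlp_out = MLPProjectorSketch(768, 2048)(features)
resampled = QueryResamplerSketch(768, 2048, num_queries=32)(features)
print(mlp_out.shape)    # torch.Size([2, 197, 2048])
print(resampled.shape)  # torch.Size([2, 32, 2048])
```

The resampler's output length depends only on `num_queries`, which is why resampling projectors suit long or variable-length inputs: the LLM always sees the same number of modality tokens.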
# Switch projector with one line:
ProjectorConfig(type="perceiver", kwargs={"num_latents": 64, "num_layers": 4})

[Input] → Processor → Encoder → Projector → [start_token | projected_tokens | end_token | prompt] → LLM → Text
                                    ↑                                                                ↑
                               (trainable)                                                   (frozen or LoRA)
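The sequence layout in the diagram can be sketched in plain PyTorch. This is an illustration under stated assumptions, not AnyModal's internal code: `embed` stands in for the LLM's embedding table, and the token ids and sizes are hypothetical. The projected tokens are wrapped by embedded start and end tokens, followed by the embedded prompt, then concatenated along the sequence dimension.

```python
import torch
import torch.nn as nn

batch, num_tokens, llm_dim, vocab = 2, 32, 64, 100

# Stand-ins: a real setup would use the LLM's own embedding table and tokenizer.
embed = nn.Embedding(vocab, llm_dim)
start_id, end_id = 1, 2  # hypothetical special-token ids
prompt_ids = torch.tensor([[10, 11, 12, 13]]).expand(batch, -1)  # hypothetical tokenized prompt

# Projector output: already in the LLM's embedding space.
projected = torch.randn(batch, num_tokens, llm_dim)

start = embed(torch.full((batch, 1), start_id, dtype=torch.long))  # (batch, 1, llm_dim)
end = embed(torch.full((batch, 1), end_id, dtype=torch.long))      # (batch, 1, llm_dim)
prompt = embed(prompt_ids)                                         # (batch, 4, llm_dim)

# [start_token | projected_tokens | end_token | prompt]
inputs_embeds = torch.cat([start, projected, end, prompt], dim=1)
attention_mask = torch.ones(inputs_embeds.shape[:2], dtype=torch.long)
print(inputs_embeds.shape)  # torch.Size([2, 38, 64])
```

A tensor shaped like this can be passed to a HuggingFace causal LM through its `inputs_embeds` argument in place of `input_ids`, so the modality tokens need no entries in the vocabulary.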
| Task | Config | Vision Model | LLM |
|---|---|---|---|
| Image Captioning | config.yaml | ViT-base-224 | Llama 3.2-1B |
| LaTeX OCR | config.yaml | SigLip-384 | Llama 3.2-1B + LoRA |
| LexiCaption | config.yaml | SigLip-384 | Llama 3.2-1B + LoRA |
| Radiology Caption | config.yaml | ViT-base-224 + LoRA | Llama 3.2-1B |
Train any example:
cd examples/image_captioning
python ../train.py --config config.yaml --projector_type qformer --num_epochs 5

Pre-trained models available on HuggingFace:
- Image-Captioning-Llama-3.2-1B: ViT + Llama 3.2-1B on Flickr30k
from anymodal.encoders import BaseEncoder

class AudioEncoder(BaseEncoder):
    @property
    def hidden_size(self) -> int:
        return 768

    def forward(self, inputs):
        # Your encoding logic goes here; it must produce `features`
        return features # (batch, seq_len, 768)

import torch
from anymodal.projectors import BaseProjector

class MyProjector(BaseProjector):
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # (batch, seq, input_dim) → (batch, seq, output_dim)
        # self.transform stands for your own layers, defined in __init__
        return self.transform(x)

pip install -e ".[dev]"
pytest tests/ -v
ruff check src/ tests/

MIT License. See LICENSE for details.
