CM3P (Contrastive Metadata-Map Masked Pre-training) is a multi-modal representation learning framework for osu! beatmaps. It learns high-quality embeddings for both beatmap structure (events, timing, positions, hitsounds, scroll speed, etc.) and beatmap metadata (difficulty, year, mapper, tags, mode, etc.), optionally conditioned on audio. These embeddings serve as a foundation for downstream tasks such as beatmap retrieval, recommendation, classification (e.g. ranked vs unranked), masked modeling, and transfer to fine-tuned generative or discriminative models.
CM3P provides:
- Unified multi-modal processor: parses raw `.osu` files + metadata + audio into token & feature tensors.
- Dual-tower ModernBERT encoders (beatmap + metadata) with optional fused audio embeddings via placeholder audio tokens.
- Contrastive embedding pretraining with structured metadata variations (robust in-batch negatives).
- Optional masked beatmap language modeling and downstream classification heads.
- High-quality embeddings for retrieval, recommendation, filtering, and fine-tuning bases.
- Flexible Hydra configuration & Hugging Face Trainer integration (freeze/unfreeze, Muon optimizer, WandB & Hub push).
- Efficient long sequence handling (Flash Attention 2 support) and mixed precision.
To use a CM3P model in your project, you can simply load it from Hugging Face Hub and start extracting embeddings:
```python
import torch
from transformers import AutoProcessor, AutoModel

device = "cuda" if torch.cuda.is_available() else "cpu"
repo_id = "OliBomby/CM3P"

processor = AutoProcessor.from_pretrained(repo_id, trust_remote_code=True, revision="main")
model = AutoModel.from_pretrained(repo_id, device_map=device, dtype=torch.bfloat16, trust_remote_code=True, revision="main")

inputs = processor(beatmap="path/to/beatmap.osu", audio="path/to/audio.mp3")
inputs = inputs.to(device, dtype=torch.bfloat16)

with torch.no_grad():
    outputs = model(**inputs)

beatmap_embeds = outputs.beatmap_embeds  # (beatmap_length_seconds / 16, projection_dim)
```

The osu! Map Explorer is a static visualizer website that allows you to explore a dataset of beatmap embeddings in 2D space. You can search for beatmaps by title, artist, or mapper, and see the nearest neighbors.
You can use this Colab notebook or extract_beatmap_embeddings.py to easily make a dataset of embeddings from your own beatmaps for visualization.
A dataset with a ton of precomputed beatmap embeddings can be found here on Hugging Face Hub.
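Reusing the `processor`, `model`, and `device` objects from the quick-start snippet above, a minimal retrieval sketch could look like the following. The mean-pooling over the windowed embeddings and the `embed_beatmap` helper are illustrative assumptions, not the exact pipeline behind the Map Explorer or extract_beatmap_embeddings.py.

```python
import torch
import torch.nn.functional as F

def embed_beatmap(path: str) -> torch.Tensor:
    # Hypothetical helper: processing a beatmap without audio is assumed here;
    # the windowed embeddings are mean-pooled into a single vector per map.
    inputs = processor(beatmap=path).to(device, dtype=torch.bfloat16)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.beatmap_embeds.float().mean(dim=0)

# Build a tiny in-memory index and rank neighbors by cosine similarity.
paths = ["a.osu", "b.osu", "c.osu"]
index = torch.stack([embed_beatmap(p) for p in paths])
query = embed_beatmap("query.osu")
scores = F.cosine_similarity(index, query.unsqueeze(0), dim=-1)
for i in scores.argsort(descending=True).tolist():
    print(f"{paths[i]}: {scores[i].item():.3f}")
```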
- Python 3.12
- Git
- ffmpeg
- CUDA (for NVIDIA GPUs) or ROCm (for AMD GPUs on Linux)
- PyTorch: Make sure to follow the Get Started guide so you install `torch` and `torchaudio` with GPU support. Select the correct Compute Platform version that you installed in the previous step.
- A GPU for efficient training (Flash Attention 2 support recommended). For CPU-only or unsupported GPUs, set `attn_implementation: sdpa` in the config.
```bash
# Clone the repository
git clone https://github.com/OliBomby/CM3P.git
cd CM3P

# (Optional) Create and activate a virtual environment
python -m venv .venv
# In cmd.exe
.venv\Scripts\activate.bat
# In PowerShell
.venv\Scripts\Activate.ps1
# In Linux or MacOS
source .venv/bin/activate

# Install dependencies
pip install -r requirements.txt

# (Optional) Install Flash Attention 2 if your GPU supports it (A100/H100/RTX 40xx, etc.)
# Follow the official instructions; otherwise switch the attn implementation in the config.
```
Alternatively, you can run CM3P in a Docker container:

```bash
# Clone the repository
git clone https://github.com/OliBomby/CM3P.git
cd CM3P

# Build and run the Docker container
docker compose up -d
docker attach cm3p_space
cd cm3p
```

The Docker Compose setup is configured to mount the datasets directory next to the CM3P directory, so place your datasets there to use them in the container.
Create your own dataset using the Mapperator console app. It requires an osu! OAuth client token to verify beatmaps and get additional metadata.
```bash
Mapperator.ConsoleApp.exe dataset2 -t "/Mapperatorinator/datasets/beatmap_descriptors.csv" -i "path/to/osz/files" -o "/datasets/cool_dataset"
```

When training CM3P, you can provide multiple dataset roots (a list of paths) in configs/train/default.yaml under `dataset.train_dataset_paths` and `dataset.test_dataset_paths`.
Filtering knobs (see dataset section in config):
- Year range (`min_year`, `max_year`)
- Difficulty range (`min_difficulty`, `max_difficulty`)
- Gamemodes filter (`gamemodes` list)
- Splitting via indices (`train_dataset_start`, `train_dataset_end`, etc.)
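If you want to experiment with these filters from Python instead of the command line, a sketch like the one below composes the training config with overrides via Hydra's compose API. The key names are taken from the list above; the values and the assumption that configs/train/default.yaml composes on its own are illustrative.

```python
# Sketch: compose configs/train/default.yaml with some dataset filters
# overridden, equivalent to passing the same overrides to train.py on the CLI.
from hydra import compose, initialize
from omegaconf import OmegaConf

with initialize(version_base=None, config_path="configs/train"):
    cfg = compose(
        config_name="default",
        overrides=[
            "dataset.train_dataset_paths=['/datasets/cool_dataset']",
            "dataset.min_year=2015",       # year range filter
            "dataset.max_difficulty=7.0",  # difficulty range filter
            "dataset.gamemodes=[0]",       # osu! standard only
        ],
    )

print(OmegaConf.to_yaml(cfg.dataset))  # inspect the resolved dataset section
```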
CM3P consists of three main transformer-based components built on ModernBERT:
- Metadata Tower (`CM3PMetadataTransformer`): Encodes metadata token sequences; pools either the CLS token or the mean over valid tokens.
- Beatmap Tower (`CM3PBeatmapTransformer`): Encodes beatmap token sequences; can internally fuse audio embeddings by replacing audio placeholder tokens with projected audio features produced by:
  - Audio Encoder (`CM3PAudioEncoder`): Two 1D convolutional layers (inspired by Whisper) + ModernBERT + a projection MLP (`CM3PMultiModalProjector`) to reach the same embedding dimensionality as the beatmap token embeddings.
- Projection Heads: Linear layers that map the pooled outputs of both towers into a shared `projection_dim` embedding space.
Optional components:
- Masked LM Head (`CM3PPredictionHead` + decoder): When `has_decoder_head=True` in the config, produces logits over the beatmap vocabulary for MLM training/inference.
Losses:
- Contrastive Loss (`cm3p_loss`): Symmetric cross-entropy over the similarity matrices between beatmap embeddings and metadata embeddings. If metadata variations are present, the original metadata acts as the positive and the others as structured negatives.
- Masked LM Loss (if enabled): Standard token-level cross-entropy over masked positions.
- Classification Loss (downstream fine-tunes): For tasks like ranked vs. unranked beatmap classification (`CM3PForBeatmapClassification`).
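For intuition, the sketch below shows a CLIP-style symmetric contrastive term over in-batch pairs. It is a simplified stand-in for `cm3p_loss`: the handling of metadata variations (only variation class 0 counts as the positive) is omitted, and the temperature is reduced to a plain `logit_scale` argument.

```python
import torch
import torch.nn.functional as F

def symmetric_contrastive_loss(beatmap_embeds, metadata_embeds, logit_scale=1.0):
    # beatmap_embeds, metadata_embeds: (batch, projection_dim); row i of each
    # tower comes from the same beatmap, so the diagonal holds the positives.
    beatmap_embeds = F.normalize(beatmap_embeds, dim=-1)
    metadata_embeds = F.normalize(metadata_embeds, dim=-1)
    logits = logit_scale * beatmap_embeds @ metadata_embeds.t()
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_b2m = F.cross_entropy(logits, targets)      # beatmap -> metadata
    loss_m2b = F.cross_entropy(logits.t(), targets)  # metadata -> beatmap
    return (loss_b2m + loss_m2b) / 2
```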
`attn_implementation` can be set to `flash_attention_2` (for supported GPUs with Flash Attention 2 installed), `sdpa` (standard PyTorch attention), or `eager` (fallback implementation).
Flash Attention 2 offers significant speed and memory benefits for long sequences. CM3P + Flash Attention also supports unpadding batched input sequences for token efficiency.
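For example, on hardware without Flash Attention 2 you can request SDPA when loading the pretrained model; `attn_implementation` is a standard from_pretrained argument in transformers, though whether you set it there or via the training config is up to you.

```python
from transformers import AutoModel

# Assumes the remote code honors the attn_implementation kwarg at load time.
model = AutoModel.from_pretrained(
    "OliBomby/CM3P",
    trust_remote_code=True,
    attn_implementation="sdpa",  # or "flash_attention_2" / "eager"
)
```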
Hydra is used for config composition:
```bash
python train.py --config-name v1  # Uses configs/train/v1.yaml -> defaults chain
python train.py --config-name v7  # Swap to another experiment variant
```

Override any field inline:

```bash
python train.py -cn v7 training.learning_rate=5e-5 dataset.labels=masked_lm model_cls=CM3PForMaskedLM
```

For all overridable configurations, see configs/train/default.yaml.
I recommend making a copy of an existing config (e.g., v7.yaml) and modifying it for your experiments.
Provide a checkpoint path or load a Hub model:
```bash
python train.py -cn "v7_classifier" from_pretrained="OliBomby/CM3P" 'dataset={train_dataset_paths:["/workspace/datasets/MMRS39389"],test_dataset_paths:["/workspace/datasets/MMRS39389"],train_dataset_end:39000,test_dataset_start:39000,test_dataset_end:39389}' 'training={dataloader_num_workers:8}' wandb_entity=mappingtools
```

If `output_dir` has checkpoints and `overwrite_output_dir=false`, the script auto-resumes (unless overridden by `training.resume_from_checkpoint`).
Set:

```bash
wandb_project=CM3P wandb_entity=your_entity wandb_mode=online
```

Disable (offline) via `wandb_mode=disabled` or remove the variables.
Make sure you are logged in (`huggingface-cli login`). Enable `training.push_to_hub=true`, or use push_to_hub.py after training with a path to the saved checkpoint.
compute_metrics (in train.py) aggregates metrics across evaluation steps:
- Zero-shot classification accuracy per variation class: original vs altered year/status/tags/mapper.
- Masked LM accuracy (if MLM labels present).
- Classification accuracy + top-5 for beatmap-level tasks.
During evaluation, metadata variation groups are resolved to check whether the highest-scoring metadata among variations is the original.
Metrics logged to console, saved to eval_results.json style files in output_dir, and optionally to WandB.
configs/train/default.yaml controls training, processor parameters, dataset filtering, and Hydra output directory.
configs/model/ contains model-level defaults (can extend for different projection dims, hidden sizes, enabling decoder heads, etc.).
To inspect active config at runtime, print or log OmegaConf.to_yaml(args) (you can add a line in train.py).
Audio features are extracted (log-mel), chunked to max_source_positions, passed through conv + ModernBERT encoder, then projected. The resulting dense embeddings replace placeholder audio tokens in the beatmap embedding sequence before the beatmap transformer processes them.
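The placeholder-replacement step can be pictured with the toy sketch below; the tensor names and shapes are illustrative assumptions, not the actual CM3PBeatmapTransformer internals.

```python
import torch

def fuse_audio_embeddings(inputs_embeds, audio_embeds, audio_token_mask):
    # inputs_embeds:    (batch, seq_len, hidden) beatmap token embeddings
    # audio_embeds:     (num_audio_tokens, hidden) projected audio features
    # audio_token_mask: (batch, seq_len) bool, True at audio placeholder tokens
    # The number of True positions must equal num_audio_tokens.
    fused = inputs_embeds.clone()
    fused[audio_token_mask] = audio_embeds.to(fused.dtype)
    return fused
```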
Multiple metadata sequences per beatmap allow structured negatives (e.g., one with altered tags/year). Loss only treats the original (variation class 0) as positive; others increase robustness.
- OOM: Reduce `per_device_train_batch_size`, increase `gradient_accumulation_steps`, or lower sequence/window lengths (`processor.default_kwargs.beatmap_kwargs.max_length`).
- Slow data loading: Increase `training.dataloader_num_workers` or reduce `cycle_length`.
- Find some way to use PyTorch compilation during training.
- Colab notebook examples for inference & embedding extraction.
- Evaluate beatmap generative models using distributions of CM3P embeddings.
- osu! Beatmap Generator by Syps (Nick Sypteras)
- osumapper by kotritrona, jyvden, Yoyolick (Ryan Zmuda)
- osu-diffusion by OliBomby (Olivier Schipper), NiceAesth (Andrei Baciu)
- osuT5 by gyataro (Xiwen Teoh)
- Beat Learning by sedthh (Richard Nagyfi)
- osu!dreamer by jaswon (Jason Won)
- Mapperatorinator by OliBomby (Olivier Schipper)
- osuBERT by Khangaroo