
CM3P: Contrastive Metadata-Map Masked Pre-training

CM3P (Contrastive Metadata-Map Masked Pre-training) is a multi-modal representation learning framework for osu! beatmaps. It learns high-quality embeddings for both beatmap structure (events, timing, positions, hitsounds, scroll speed, etc.) and beatmap metadata (difficulty, year, mapper, tags, mode, etc.), optionally conditioned on audio. These embeddings serve as a foundation for downstream tasks such as beatmap retrieval, recommendation, classification (e.g. ranked vs unranked), masked modeling, and transfer to fine-tuned generative or discriminative models.

CM3P provides:

  • Unified multi-modal processor: parses raw .osu files + metadata + audio into token & feature tensors.
  • Dual-tower ModernBERT encoders (beatmap + metadata) with optional fused audio embeddings via placeholder audio tokens.
  • Contrastive embedding pretraining with structured metadata variations (robust in-batch negatives).
  • Optional masked beatmap language modeling and downstream classification heads.
  • High-quality embeddings for retrieval, recommendation, filtering, and fine-tuning bases.
  • Flexible Hydra configuration & Hugging Face Trainer integration (freeze/unfreeze, Muon optimizer, WandB & Hub push).
  • Efficient long sequence handling (Flash Attention 2 support) and mixed precision.

1. Quick Start (Inference)

To use a CM3P model in your project, load it from the Hugging Face Hub and start extracting embeddings:

import torch
from transformers import AutoProcessor, AutoModel

device = "cuda" if torch.cuda.is_available() else "cpu"
repo_id = "OliBomby/CM3P"

processor = AutoProcessor.from_pretrained(repo_id, trust_remote_code=True, revision="main")
model = AutoModel.from_pretrained(repo_id, device_map=device, dtype=torch.bfloat16, trust_remote_code=True, revision="main")

inputs = processor(beatmap="path/to/beatmap.osu", audio="path/to/audio.mp3")
inputs = inputs.to(device, dtype=torch.bfloat16)

with torch.no_grad():
    outputs = model(**inputs)

beatmap_embeds = outputs.beatmap_embeds  # (beatmap_length_seconds / 16, projection_dim)
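
The per-window embeddings can be pooled into a single vector and compared with cosine similarity for retrieval. A minimal sketch, assuming candidate_embeds is a (num_beatmaps, projection_dim) tensor of embeddings pooled the same way (the mean pooling here is an illustrative choice, not a documented CM3P API):

import torch.nn.functional as F

# Mean-pool the per-window embeddings into one query vector,
# then L2-normalize so dot products equal cosine similarities.
query = F.normalize(beatmap_embeds.float().mean(dim=0), dim=-1)
candidates = F.normalize(candidate_embeds.float(), dim=-1)

# Rank candidate beatmaps by cosine similarity to the query.
scores = candidates @ query
top_scores, top_indices = scores.topk(k=5)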

2. Beatmap Embedding Explorer

The osu! Map Explorer is a static website for visualizing a dataset of beatmap embeddings in 2D space. You can search for beatmaps by title, artist, or mapper, and view a map's nearest neighbors.

You can use this Colab notebook or extract_beatmap_embeddings.py to build a dataset of embeddings from your own beatmaps for visualization. A large dataset of precomputed beatmap embeddings is available on the Hugging Face Hub.
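
Alternatively, a minimal manual sketch reusing the processor and model from the Quick Start (the mean pooling and the beatmap-only processor call are assumptions; pass audio= as in the Quick Start if your setup requires it):

from pathlib import Path

embeddings = {}
for osu_path in Path("path/to/beatmaps").glob("*.osu"):
    inputs = processor(beatmap=str(osu_path)).to(device, dtype=torch.bfloat16)
    with torch.no_grad():
        outputs = model(**inputs)
    # Mean-pool per-window embeddings into one vector per beatmap.
    embeddings[osu_path.name] = outputs.beatmap_embeds.float().mean(dim=0).cpu()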


3. Installation

Prerequisites

  • Python 3.12
  • Git
  • ffmpeg
  • CUDA (for NVIDIA GPUs) or ROCm (for AMD GPUs on Linux)
  • PyTorch: Follow the Get Started guide to install torch and torchaudio with GPU support, selecting the Compute Platform that matches the CUDA/ROCm version you installed in the previous step.
  • A GPU for efficient training (Flash Attention 2 support recommended). For CPU-only or unsupported GPUs, set attn_implementation: sdpa

Steps

# Clone the repository
git clone https://github.com/OliBomby/CM3P.git
cd CM3P

# (Optional) Create and activate a virtual environment
python -m venv .venv

# In cmd.exe
.venv\Scripts\activate.bat
# In PowerShell
.venv\Scripts\Activate.ps1
# In Linux or MacOS
source .venv/bin/activate

# Install dependencies
pip install -r requirements.txt

# (Optional) Install Flash Attention 2 if your GPU supports it (A100/H100/RTX 40xx, etc.)
# Follow official instructions; otherwise switch attn implementation in config.

Docker Steps (Optional)

# Clone the repository
git clone https://github.com/OliBomby/CM3P.git
cd CM3P

# Build and run the Docker container
docker compose up -d
docker attach cm3p_space
cd cm3p

The Docker Compose setup mounts the datasets directory next to the CM3P directory, so place your datasets there to use them inside the container.


4. Data Preparation

Create your own dataset using the Mapperator console app. It requires an osu! OAuth client token to verify beatmaps and get additional metadata.

Mapperator.ConsoleApp.exe dataset2 -t "/Mapperatorinator/datasets/beatmap_descriptors.csv" -i "path/to/osz/files" -o "/datasets/cool_dataset"

When training CM3P, you can provide multiple dataset roots (list of paths) in configs/train/default.yaml under dataset.train_dataset_paths and dataset.test_dataset_paths.

Filtering knobs (see dataset section in config):

  • Year range (min_year, max_year)
  • Difficulty range (min_difficulty, max_difficulty)
  • Gamemodes filter (gamemodes list)
  • Splitting via indices (train_dataset_start, train_dataset_end, etc.)
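
For example, the filters can be set inline using the same Hydra override syntax as in the training section below (the values here are illustrative):

python train.py -cn v7 dataset.min_year=2015 dataset.max_year=2023 dataset.min_difficulty=3.0 dataset.max_difficulty=7.0 'dataset.gamemodes=[0]'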

5. Model Architecture

CM3P consists of three main transformer-based components built on ModernBERT:

  • Metadata Tower (CM3PMetadataTransformer): Encodes metadata token sequences; pools either CLS token or mean over valid tokens.
  • Beatmap Tower (CM3PBeatmapTransformer): Encodes beatmap token sequences; internally can fuse audio embeddings by replacing audio placeholder tokens with projected audio features produced by:
    • Audio Encoder (CM3PAudioEncoder): Two 1D convolutional layers (inspired by Whisper) + ModernBERT + projection MLP (CM3PMultiModalProjector) to reach the same embedding dimensionality as beatmap token embeddings.
  • Projection Heads: Linear layers map pooled outputs of both towers into a shared projection_dim embedding space.

Optional components:

  • Masked LM Head (CM3PPredictionHead + decoder): When has_decoder_head=True in config, produces logits over beatmap vocabulary for MLM training/inference.

Objectives

  • Contrastive Loss (cm3p_loss): Symmetric cross-entropy over similarity matrices between beatmap embeddings and metadata embeddings (a minimal sketch follows this list). If metadata variations are present, the original metadata acts as the positive; the others serve as structured negatives.
  • Masked LM Loss (if enabled): Standard token-level cross-entropy over masked positions.
  • Classification Loss (downstream fine-tunes): For tasks like ranked vs unranked beatmap classification (CM3PForBeatmapClassification).
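
A minimal sketch of the symmetric contrastive objective referenced above, ignoring metadata variations and any learned temperature (cm3p_loss in this repo is the authoritative implementation):

import torch
import torch.nn.functional as F

def contrastive_loss_sketch(beatmap_embeds, metadata_embeds, temperature=0.07):
    # Both inputs: (batch, projection_dim); row i of each tensor is a matched pair.
    b = F.normalize(beatmap_embeds, dim=-1)
    m = F.normalize(metadata_embeds, dim=-1)
    logits = b @ m.T / temperature  # (batch, batch) similarity matrix
    targets = torch.arange(b.size(0), device=b.device)
    # Symmetric cross-entropy: beatmap -> metadata and metadata -> beatmap.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2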

Attention Implementation

attn_implementation can be set to flash_attention_2 (for supported GPUs with Flash Attention 2 installed), sdpa (standard PyTorch attention), or eager (fallback implementation). Flash Attention 2 offers significant speed and memory benefits for long sequences. CM3P + Flash Attention also supports unpadding batched input sequences for token efficiency.
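
For example, when loading from the Hub (attn_implementation is a standard transformers from_pretrained argument; flash_attention_2 additionally requires the flash-attn package):

model = AutoModel.from_pretrained(
    "OliBomby/CM3P",
    trust_remote_code=True,
    dtype=torch.bfloat16,
    attn_implementation="sdpa",  # or "flash_attention_2" / "eager"
)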


6. Training (From Scratch / Fine-tuning)

Base Command

Hydra is used for config composition:

python train.py --config-name v1              # Uses configs/train/v1.yaml -> defaults chain
python train.py --config-name v7              # Swap to another experiment variant

Override any field inline:

python train.py -cn v7 training.learning_rate=5e-5 dataset.labels=masked_lm model_cls=CM3PForMaskedLM

For all overridable configurations see configs/train/default.yaml. I recommend making a copy of an existing config (e.g., v7.yaml) and modifying it for your experiments.

Fine-tuning From Pretrained

Provide a checkpoint path or load a Hub model:

python train.py -cn "v7_classifier" from_pretrained="OliBomby/CM3P" 'dataset={train_dataset_paths:["/workspace/datasets/MMRS39389"],test_dataset_paths:["/workspace/datasets/MMRS39389"],train_dataset_end:39000,test_dataset_start:39000,test_dataset_end:39389}' 'training={dataloader_num_workers:8}' wandb_entity=mappingtools

Resume Training

If output_dir has checkpoints and overwrite_output_dir=false, the script auto-resumes (unless overridden by training.resume_from_checkpoint).

WandB Logging

Set:

wandb_project=CM3P wandb_entity=your_entity wandb_mode=online

Disable logging by setting wandb_mode=disabled or by omitting these variables.

Pushing to Hugging Face Hub

Make sure you are logged in (huggingface-cli login). Enable:

training.push_to_hub=true

Or use push_to_hub.py after training with a path to the saved checkpoint.


7. Evaluation & Metrics

compute_metrics (in train.py) aggregates metrics across evaluation steps:

  • Zero-shot classification accuracy per variation class: original vs altered year/status/tags/mapper.
  • Masked LM accuracy (if MLM labels present).
  • Classification accuracy + top-5 for beatmap-level tasks.

During evaluation, metadata variation groups are resolved to check whether the highest-scoring metadata among variations is the original.
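
Conceptually, the zero-shot check reduces to an argmax over each variation group (a simplified sketch; compute_metrics in train.py handles the actual grouping):

import torch

# scores: (num_beatmaps, num_variations) similarity of each beatmap
# embedding to its metadata variations; column 0 is the original.
zero_shot_accuracy = (scores.argmax(dim=-1) == 0).float().mean()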

Metrics are logged to the console, saved to eval_results.json-style files in output_dir, and optionally reported to WandB.


8. Configuration Overview

configs/train/default.yaml controls training, processor parameters, dataset filtering, and Hydra output directory. configs/model/ contains model-level defaults (can extend for different projection dims, hidden sizes, enabling decoder heads, etc.).

To inspect the active config at runtime, print or log OmegaConf.to_yaml(args) (you can add a line in train.py), as shown below.
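
For example (OmegaConf ships with Hydra, so no extra dependency is needed):

from omegaconf import OmegaConf

# Add inside the Hydra entry point in train.py, after the config is composed:
print(OmegaConf.to_yaml(args))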


9. Advanced Topics

Audio Fusion Details

Audio features are extracted (log-mel), chunked to max_source_positions, passed through conv + ModernBERT encoder, then projected. The resulting dense embeddings replace placeholder audio tokens in the beatmap embedding sequence before the beatmap transformer processes them.
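
The substitution can be pictured as a masked scatter over the embedded token sequence (a conceptual sketch; the tensor names and placeholder-id lookup are assumptions, not CM3P's internal names):

import torch

# token_embeds: (batch, seq_len, hidden) embedded beatmap tokens
# audio_embeds: (num_audio_tokens, hidden) projected audio features
# input_ids:   (batch, seq_len) token ids containing audio placeholders
mask = input_ids == audio_placeholder_id
token_embeds = token_embeds.masked_scatter(
    mask.unsqueeze(-1), audio_embeds.to(token_embeds.dtype)
)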

Metadata Variations

Multiple metadata sequences per beatmap provide structured negatives (e.g., one with altered tags or year). The loss treats only the original (variation class 0) as the positive; the others act as hard negatives that improve robustness.


10. Troubleshooting

  • OOM: Reduce per_device_train_batch_size, increase gradient_accumulation_steps, lower sequence/window lengths (processor.default_kwargs.beatmap_kwargs.max_length).
  • Slow data loading: Increase training.dataloader_num_workers or reduce cycle_length.

11. Roadmap / Next Steps

  • Find some way to use PyTorch compilation during training.
  • Colab notebook examples for inference & embedding extraction.
  • Evaluate beatmap generative models using distributions of CM3P embeddings.

Related works

  1. osu! Beatmap Generator by Syps (Nick Sypteras)
  2. osumapper by kotritrona, jyvden, Yoyolick (Ryan Zmuda)
  3. osu-diffusion by OliBomby (Olivier Schipper), NiceAesth (Andrei Baciu)
  4. osuT5 by gyataro (Xiwen Teoh)
  5. Beat Learning by sedthh (Richard Nagyfi)
  6. osu!dreamer by jaswon (Jason Won)
  7. Mapperatorinator by OliBomby (Olivier Schipper)
  8. osuBERT by Khangaroo
