TMD-Bench: A Multi-Level Evaluation Paradigm for Music-Dance Co-Generation

ICML 2026

TMD-Bench is a benchmark for text-driven music-dance co-generation. It evaluates generated results along several axes: music quality, dance video quality, instruction following, and cross-modal rhythmic alignment. The repository also brings together rhythm-aligned music-dance data, a fine-grained Music Captioner, and the unified generation baseline RhyJAM, to support research on beat-level synchronization and semantic consistency between music and dance.

Highlights

A multi-level benchmark for music-dance co-generation, covering unimodal quality, instruction following, and cross-modal rhythmic alignment.
A unified evaluation protocol that combines computable physical metrics with MLLM-as-a-Judge perceptual assessment.
A 10k-scale rhythm-aligned music-dance dataset covering diverse dance styles, scenes, performer settings, and music attributes.
A Music Captioner for structured music semantics, including instruments, rhythm, tempo, genre, emotion, and functional scenes.
RhyJAM, a unified music-dance generation baseline for text-to-music-and-dance video generation.

Repository Structure

TMD-Bench/
├── inference/                  # Inference scripts and example inputs
├── examples/Ovi/               # Text-to-audio and text-to-audio-video training entrypoints
├── examples/wanvideo/          # S2V training and inference scripts
├── ovi/                        # Ovi/RhyJAM model code and configurations
├── diffsynth/                  # DiffSynth base modules
├── pre_deal/                   # Data preprocessing scripts
├── requirements.txt
├── environment.yml
└── README.md

Environment

The project was developed and tested mainly with CUDA 12.5 and Python 3.12.2. Install the dependencies with:

conda create -n tmd-bench python=3.12.2
conda activate tmd-bench
pip install -r requirements.txt

Alternatively, create the Conda environment from environment.yml. The activate command below assumes that file names the environment diffsynth-org:

conda env create -f environment.yml
conda activate diffsynth-org

Data and Checkpoints

Place model checkpoints under ckpts/, or modify the checkpoint paths in the configuration files. A recommended directory layout is:

ckpts/
├── MMAudio/                  # Original Ovi audio VAE
├── Wan2.2-TI2V-5B/           # Video VAE and umT5 text encoder
├── Ovi/                      # Original Ovi audio-video generation checkpoint
├── RhyJAM/                   # TMD-Bench/RhyJAM main checkpoint
├── T2A/                      # Text-to-music checkpoint
└── VAE-ASM/                  # Sound/Speech/Music VAE
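
Before launching inference, it can help to confirm that the layout above is in place. The following is a minimal sketch, not part of the repository: the function name missing_checkpoint_dirs is hypothetical, and the directory list is simply copied from the recommended layout. If you changed the checkpoint paths in the configuration files, adjust the list accordingly.

```python
from pathlib import Path

# Sub-directory names copied from the recommended layout above.
EXPECTED_CKPT_DIRS = ["MMAudio", "Wan2.2-TI2V-5B", "Ovi", "RhyJAM", "T2A", "VAE-ASM"]


def missing_checkpoint_dirs(ckpt_root):
    """Return the expected sub-directories that are absent under ckpt_root."""
    root = Path(ckpt_root)
    return [name for name in EXPECTED_CKPT_DIRS if not (root / name).is_dir()]


if __name__ == "__main__":
    missing = missing_checkpoint_dirs("ckpts")
    if missing:
        print("Missing checkpoint directories:", ", ".join(missing))
    else:
        print("All expected checkpoint directories are present.")
```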

The data consist of two parts: pure music data for Music Captioner training and semantic annotation, and music-dance video data for rhythmic alignment evaluation and joint generation training. Training data can be organized as follows:

dataset_base_path/
├── evan_metadata_s2v_with_prompt.csv
├── videos/
│   └── xxx.mp4
└── audios/
    └── xxx.wav

CSV example:

video,input_audio,prompt
videos/clip_001.mp4,audios/clip_001.wav,a person is dancing
videos/clip_002.mp4,audios/clip_002.wav,a person is dancing
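
A metadata file in this shape can be sanity-checked before training. The sketch below assumes the three-column header shown in the CSV example and that the video and input_audio paths are relative to dataset_base_path; the helper name check_metadata is hypothetical and is not provided by the repository.

```python
import csv
from pathlib import Path

REQUIRED_COLUMNS = ("video", "input_audio", "prompt")  # header of the CSV example


def check_metadata(dataset_base_path, csv_name="evan_metadata_s2v_with_prompt.csv"):
    """Return a list of human-readable problems found in the metadata CSV."""
    base = Path(dataset_base_path)
    problems = []
    with open(base / csv_name, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        header = reader.fieldnames or []
        missing_cols = [c for c in REQUIRED_COLUMNS if c not in header]
        if missing_cols:
            return ["missing columns: " + ", ".join(missing_cols)]
        # Data rows start on line 2 because line 1 is the header.
        for line_no, row in enumerate(reader, start=2):
            for col in ("video", "input_audio"):
                rel_path = row[col] or ""
                if not rel_path or not (base / rel_path).is_file():
                    problems.append(f"line {line_no}: {col} not found: {rel_path}")
            if not (row["prompt"] or "").strip():
                problems.append(f"line {line_no}: empty prompt")
    return problems
```

An empty return value means every row points at existing media files and has a non-empty prompt.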

Inference

Text-to-music-and-dance generation:

torchrun --nnodes 1 --nproc_per_node 8 inference/t2av_infer.py \
  --config-file ovi/configs/inference/inference_fusion.yaml

Before running, please check ckpt_dir, ovi_ckpt, text_prompt, and output_dir in ovi/configs/inference/inference_fusion.yaml.

Text-to-music generation:

torchrun --nnodes 1 --nproc_per_node 8 inference/t2a_infer.py \
  --config-file ovi/configs/inference/inference_audio.yaml

Audio/speech-driven video generation (S2V):

python inference/s2v_infer.py

Input, output, and checkpoint paths can be overridden with environment variables. The example below uses Windows cmd syntax (set); in bash, use export NAME=value instead:

set CKPT_DIR=./ckpts
set SFT_CKPT_PATH=./ckpts/Wan2.2-S2V-5B/step-19000.safetensors
set S2V_INPUT_DIR=./outputs/s2v_input
set S2V_OUTPUT_DIR=./outputs/s2v_output
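
For readers unfamiliar with this pattern, the sketch below shows one way a script could resolve such overrides. It is illustrative only: the variable names are the four listed above, but the fallback values and the resolve_paths helper are assumptions, not the logic or defaults of inference/s2v_infer.py.

```python
import os

# Variable names come from the list above; the fallback values are
# illustrative assumptions, not the defaults used by inference/s2v_infer.py.
HYPOTHETICAL_DEFAULTS = {
    "CKPT_DIR": "./ckpts",
    "SFT_CKPT_PATH": "",
    "S2V_INPUT_DIR": "./outputs/s2v_input",
    "S2V_OUTPUT_DIR": "./outputs/s2v_output",
}


def resolve_paths(environ=None):
    """Return each setting, preferring the environment over the fallback."""
    env = os.environ if environ is None else environ
    return {name: env.get(name, default) for name, default in HYPOTHETICAL_DEFAULTS.items()}


if __name__ == "__main__":
    for name, value in resolve_paths().items():
        print(f"{name}={value}")
```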

Training

Text-to-music-and-dance training:

bash examples/Ovi/run_multinode_t2av.sh

Text-to-music training:

bash examples/Ovi/run_multinode_t2a.sh

S2V training:

bash examples/wanvideo/model_training/taiji/Evan-Wan2.2-S2V-5B-multi-node.sh

Trajectory distillation training:

bash examples/wanvideo/model_training/taiji/Evan-Wan2.2-S2V-TI-5B-multi-node.sh

The training scripts contain platform-specific paths, node settings, and data paths. Please update them according to your local or cluster environment before use.

Evaluation Dimensions

TMD-Bench evaluates music-dance co-generation models from three complementary perspectives:

Music quality and music instruction following, including music aesthetics, CLAP similarity, and dimension-wise semantic matching with the Music Captioner.
Video quality and video instruction following, including spatiotemporal consistency, visual quality, motion magnitude, motion smoothness, and video-text matching.
Music-dance rhythmic alignment, combining beat-level physical metrics with MLLM-based perceptual judgments to assess synchronization between dance motion accents and musical beats.
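
To make the idea of a beat-level physical metric concrete, the sketch below computes a simple beat hit rate: the fraction of musical beats that have a dance motion accent within a small time tolerance. This is an illustrative stand-in and not necessarily the metric TMD-Bench defines; the function name, the 0.1-second tolerance, and the assumption that beat and accent times are already extracted (in seconds) are all hypothetical.

```python
from bisect import bisect_left


def beat_hit_rate(music_beats, motion_accents, tolerance=0.1):
    """Fraction of music beats with a motion accent within `tolerance` seconds."""
    if not music_beats:
        return 0.0
    accents = sorted(motion_accents)
    hits = 0
    for beat in music_beats:
        i = bisect_left(accents, beat)
        # The nearest accent is either just before or just after the beat.
        neighbours = accents[max(i - 1, 0):i + 1]
        if any(abs(a - beat) <= tolerance for a in neighbours):
            hits += 1
    return hits / len(music_beats)


if __name__ == "__main__":
    beats = [0.5, 1.0, 1.5, 2.0]      # music beat times in seconds
    accents = [0.52, 1.3, 1.95]       # motion accent times in seconds
    print(beat_hit_rate(beats, accents))
```

In this example only the beats at 0.5 s and 2.0 s have an accent within the tolerance, so the rate is 0.5. A score like this captures timing only; the MLLM-based perceptual judgments described above address aspects that such a count cannot.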

Citation

If you find this project useful for your research, please cite:

@article{yang2026tmd,
  title={TMD-Bench: A Multi-Level Evaluation Paradigm for Music-Dance Co-Generation},
  author={Yang, Xiaoda and Zhang, Majun and Pan, Changhao and Huang, Nick and Yuguang, Yang and Zhuo, Fan and Zhou, Pengfei and Zhou, Jin and Shan, Sizhe and Yang, Shan and others},
  journal={arXiv preprint arXiv:2605.01809},
  year={2026}
}

Acknowledgement

This project builds on and extends open-source work including Ovi, DiffSynth-Studio, Wan, MMAudio, and Qwen-Omni. We thank the community for its contributions to audio-video generation, music understanding, and multimodal evaluation.
