MM-Speech/TMD-Bench

TMD-Bench: A Multi-Level Evaluation Paradigm for Music-Dance Co-Generation

ICML 2026

TMD-Bench is a benchmark for text-driven music-dance co-generation. It evaluates generated results along four axes: music quality, dance video quality, instruction following, and cross-modal rhythmic alignment. The repository also includes rhythm-aligned music-dance data, a fine-grained Music Captioner, and the unified generation baseline RhyJAM, to support research on beat-level synchronization and semantic consistency between music and dance.

Highlights

  • A multi-level benchmark for music-dance co-generation, covering unimodal quality, instruction following, and cross-modal rhythmic alignment.
  • A unified evaluation protocol that combines computable physical metrics with MLLM-as-a-Judge perceptual assessment.
  • A 10k-scale rhythm-aligned music-dance dataset covering diverse dance styles, scenes, performer settings, and music attributes.
  • A Music Captioner for structured music semantics, including instruments, rhythm, tempo, genre, emotion, and functional scenes.
  • RhyJAM, a unified music-dance generation baseline for text-to-music-and-dance video generation.

Repository Structure

TMD-Bench/
├── inference/                  # Inference scripts and example inputs
├── examples/Ovi/               # Text-to-audio and text-to-audio-video training entrypoints
├── examples/wanvideo/          # S2V training and inference scripts
├── ovi/                        # Ovi/RhyJAM model code and configurations
├── diffsynth/                  # DiffSynth base modules
├── pre_deal/                   # Data preprocessing scripts
├── requirements.txt
├── environment.yml
└── README.md

Environment

The project was developed and tested mainly with CUDA 12.5 and Python 3.12.2. Install the dependencies with:

conda create -n tmd-bench python=3.12.2
conda activate tmd-bench
pip install -r requirements.txt

Alternatively, you can create the Conda environment from environment.yml:

conda env create -f environment.yml
conda activate diffsynth-org

Data and Checkpoints

Place model checkpoints under ckpts/, or modify the checkpoint paths in the configuration files. A recommended directory layout is:

ckpts/
├── MMAudio/                  # Original Ovi audio VAE
├── Wan2.2-TI2V-5B/           # Video VAE and umT5 text encoder
├── Ovi/                      # Original Ovi audio-video generation checkpoint
├── RhyJAM/                   # TMD-Bench/RhyJAM main checkpoint
├── T2A/                      # Text-to-music checkpoint
└── VAE-ASM/                  # Sound/Speech/Music VAE
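
Before launching inference or training, it can help to confirm that the layout above is in place. The following is a minimal sketch, not part of the repository: the function name `missing_checkpoint_dirs` is hypothetical, and the directory names are simply copied from the recommended layout.

```python
from pathlib import Path

# Sub-directory names copied from the recommended layout above.
EXPECTED_CKPT_DIRS = ["MMAudio", "Wan2.2-TI2V-5B", "Ovi", "RhyJAM", "T2A", "VAE-ASM"]


def missing_checkpoint_dirs(ckpt_root):
    """Return the expected sub-directories that are absent under ckpt_root."""
    root = Path(ckpt_root)
    return [name for name in EXPECTED_CKPT_DIRS if not (root / name).is_dir()]


if __name__ == "__main__":
    missing = missing_checkpoint_dirs("ckpts")
    if missing:
        print("Missing checkpoint directories:", ", ".join(missing))
    else:
        print("All expected checkpoint directories are present.")
```

If you keep checkpoints elsewhere and edit the paths in the configuration files instead, pass that location as `ckpt_root`.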

The data consist of two parts: music-only data for Music Captioner training and semantic annotation, and music-dance video data for rhythmic alignment evaluation and joint generation training. Training data can be organized as follows:

dataset_base_path/
├── evan_metadata_s2v_with_prompt.csv
├── videos/
│   └── xxx.mp4
└── audios/
    └── xxx.wav

CSV example:

video,input_audio,prompt
videos/clip_001.mp4,audios/clip_001.wav,a person is dancing
videos/clip_002.mp4,audios/clip_002.wav,a person is dancing
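
A quick sanity check of the metadata file can catch missing columns or broken paths before a long training run. This is a hedged sketch rather than a script shipped with the repository: `validate_metadata` is a hypothetical helper, and it assumes that the `video` and `input_audio` columns hold paths relative to `dataset_base_path`, as the example rows suggest.

```python
import csv
from pathlib import Path

# Column names taken from the CSV example above.
REQUIRED_COLUMNS = ("video", "input_audio", "prompt")


def validate_metadata(dataset_base_path, csv_name):
    """Return a list of human-readable problems found in the metadata CSV."""
    base = Path(dataset_base_path)
    problems = []
    with open(base / csv_name, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        header = reader.fieldnames or []
        missing_cols = [c for c in REQUIRED_COLUMNS if c not in header]
        if missing_cols:
            return ["missing columns: " + ", ".join(missing_cols)]
        # The header is line 1, so data rows start at line 2.
        for line_no, row in enumerate(reader, start=2):
            for col in ("video", "input_audio"):
                rel_path = row[col] or ""
                # Assumption: media paths are relative to the dataset root.
                if not (base / rel_path).is_file():
                    problems.append(f"line {line_no}: {col} not found: {rel_path}")
            if not (row["prompt"] or "").strip():
                problems.append(f"line {line_no}: empty prompt")
    return problems
```

For example, `validate_metadata("dataset_base_path", "evan_metadata_s2v_with_prompt.csv")` returns an empty list when every row points at existing files and has a non-empty prompt.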

Inference

Text-to-music-and-dance generation:

torchrun --nnodes 1 --nproc_per_node 8 inference/t2av_infer.py \
  --config-file ovi/configs/inference/inference_fusion.yaml

Before running, check ckpt_dir, ovi_ckpt, text_prompt, and output_dir in ovi/configs/inference/inference_fusion.yaml, and set --nproc_per_node to the number of GPUs available on the node.

Text-to-music generation:

torchrun --nnodes 1 --nproc_per_node 8 inference/t2a_infer.py \
  --config-file ovi/configs/inference/inference_audio.yaml

Audio/speech-driven video generation:

python inference/s2v_infer.py

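Input, output, and checkpoint paths can be overridden with environment variables. The commands below use Windows cmd syntax (set); in bash or another POSIX shell, use export NAME=value instead: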

set CKPT_DIR=./ckpts
set SFT_CKPT_PATH=./ckpts/Wan2.2-S2V-5B/step-19000.safetensors
set S2V_INPUT_DIR=./outputs/s2v_input
set S2V_OUTPUT_DIR=./outputs/s2v_output
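
To illustrate the override pattern, here is a minimal sketch of how a script could resolve these variables with fallbacks. It is not the actual logic of inference/s2v_infer.py: the helper `resolve_paths` is hypothetical, and the fallback values are assumptions chosen to mirror the example above.

```python
import os

# Variable names come from the example above. The fallback values are
# illustrative assumptions, not the real defaults of s2v_infer.py.
ASSUMED_DEFAULTS = {
    "CKPT_DIR": "./ckpts",
    "SFT_CKPT_PATH": None,
    "S2V_INPUT_DIR": "./outputs/s2v_input",
    "S2V_OUTPUT_DIR": "./outputs/s2v_output",
}


def resolve_paths(environ=None):
    """Map each variable to its environment value, or to the assumed fallback."""
    env = os.environ if environ is None else environ
    return {name: env.get(name, default) for name, default in ASSUMED_DEFAULTS.items()}


if __name__ == "__main__":
    for name, value in resolve_paths().items():
        print(f"{name} = {value}")
```

Passing a plain dictionary as `environ` makes the function easy to test without touching the real process environment.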

Training

Text-to-music-and-dance training:

bash examples/Ovi/run_multinode_t2av.sh

Text-to-music training:

bash examples/Ovi/run_multinode_t2a.sh

S2V training:

bash examples/wanvideo/model_training/taiji/Evan-Wan2.2-S2V-5B-multi-node.sh

Trajectory distillation training:

bash examples/wanvideo/model_training/taiji/Evan-Wan2.2-S2V-TI-5B-multi-node.sh

The training scripts contain platform-specific paths, node settings, and data paths. Please update them according to your local or cluster environment before use.

Evaluation Dimensions

TMD-Bench evaluates music-dance co-generation models from three complementary perspectives:

  1. Music quality and music instruction following, including music aesthetics, CLAP similarity, and dimension-wise semantic matching with the Music Captioner.
  2. Video quality and video instruction following, including spatiotemporal consistency, visual quality, motion magnitude, motion smoothness, and video-text matching.
  3. Music-dance rhythmic alignment, combining beat-level physical metrics with MLLM-based perceptual judgments to assess synchronization between dance motion accents and musical beats.
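
To make the idea of a beat-level physical metric concrete, the sketch below computes a Gaussian-weighted beat alignment score of the kind commonly used in dance generation work. It is an illustration only: this README does not specify the exact metric TMD-Bench uses, the function name and the `sigma` tolerance are assumptions, and beat times are assumed to be already extracted (in seconds) from the music and from the dance motion.

```python
import math


def beat_alignment_score(motion_beats, music_beats, sigma=0.1):
    """Average closeness of motion beats to their nearest music beat.

    Each motion beat contributes exp(-d^2 / (2 * sigma^2)), where d is the
    distance in seconds to the nearest music beat. The result lies in [0, 1];
    1.0 means every motion beat falls exactly on a music beat.
    """
    if not motion_beats or not music_beats:
        return 0.0
    total = 0.0
    for t in motion_beats:
        d = min(abs(t - b) for b in music_beats)
        total += math.exp(-(d * d) / (2.0 * sigma * sigma))
    return total / len(motion_beats)


if __name__ == "__main__":
    music = [0.5, 1.0, 1.5, 2.0]
    on_beat = [0.5, 1.0, 1.5, 2.0]
    off_beat = [0.75, 1.25, 1.75]
    print(beat_alignment_score(on_beat, music))
    print(beat_alignment_score(off_beat, music))
```

A smaller `sigma` penalizes timing errors more sharply; the value 0.1 s here is an arbitrary choice for illustration.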

Citation

If you find this project useful for your research, please cite:

@article{yang2026tmd,
  title={TMD-Bench: A Multi-Level Evaluation Paradigm for Music-Dance Co-Generation},
  author={Yang, Xiaoda and Zhang, Majun and Pan, Changhao and Huang, Nick and Yuguang, Yang and Zhuo, Fan and Zhou, Pengfei and Zhou, Jin and Shan, Sizhe and Yang, Shan and others},
  journal={arXiv preprint arXiv:2605.01809},
  year={2026}
}

Acknowledgement

This project builds on and extends open-source works including Ovi, DiffSynth-Studio, Wan, MMAudio, and Qwen-Omni. We thank the community for its contributions to audio-video generation, music understanding, and multimodal evaluation.
