TMD-Bench is a benchmark for text-driven music-dance co-generation. It evaluates generated results from multiple perspectives, including music quality, dance video quality, instruction following, and cross-modal rhythmic alignment. This repository also includes rhythm-aligned music-dance data, a fine-grained Music Captioner, and the unified generation baseline RhyJAM, with the aim of supporting research on beat-level synchronization and semantic consistency between music and dance.
- A multi-level benchmark for music-dance co-generation, covering unimodal quality, instruction following, and cross-modal rhythmic alignment.
- A unified evaluation protocol that combines computable physical metrics with MLLM-as-a-Judge perceptual assessment.
- A 10k-scale rhythm-aligned music-dance dataset covering diverse dance styles, scenes, performer settings, and music attributes.
- A Music Captioner for structured music semantics, including instruments, rhythm, tempo, genre, emotion, and functional scenes.
- RhyJAM, a unified music-dance generation baseline for text-to-music-and-dance video generation.
TMD-Bench/
├── inference/ # Inference scripts and example inputs
├── examples/Ovi/ # Text-to-audio and text-to-audio-video training entrypoints
├── examples/wanvideo/ # S2V training and inference scripts
├── ovi/ # Ovi/RhyJAM model code and configurations
├── diffsynth/ # DiffSynth base modules
├── pre_deal/ # Data preprocessing scripts
├── requirements.txt
├── environment.yml
└── README.md
The project was developed and tested mainly with CUDA 12.5 and Python 3.12.2. Install the dependencies with:
conda create -n tmd-bench python=3.12.2
conda activate tmd-bench
pip install -r requirements.txt
Alternatively, you can create the Conda environment from environment.yml:
conda env create -f environment.yml
conda activate diffsynth-org
Place model checkpoints under ckpts/, or modify the checkpoint paths in the configuration files. A recommended directory layout is:
ckpts/
├── MMAudio/ # Original Ovi audio VAE
├── Wan2.2-TI2V-5B/ # Video VAE and umT5 text encoder
├── Ovi/ # Original Ovi audio-video generation checkpoint
├── RhyJAM/ # TMD-Bench/RhyJAM main checkpoint
├── T2A/ # Text-to-music checkpoint
└── VAE-ASM/ # Sound/Speech/Music VAE
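Before launching inference or training, it can help to confirm that the checkpoint directories are in place. The following is a minimal sketch, assuming the recommended layout above is used unchanged; the function name `missing_checkpoint_dirs` is hypothetical and not part of this repository.

```python
from pathlib import Path

# Directory names copied from the recommended layout above.
EXPECTED_CKPT_DIRS = ["MMAudio", "Wan2.2-TI2V-5B", "Ovi", "RhyJAM", "T2A", "VAE-ASM"]


def missing_checkpoint_dirs(ckpt_root="ckpts"):
    """Return the expected subdirectories that are absent under ckpt_root."""
    root = Path(ckpt_root)
    return [name for name in EXPECTED_CKPT_DIRS if not (root / name).is_dir()]


if __name__ == "__main__":
    missing = missing_checkpoint_dirs()
    if missing:
        print("Missing checkpoint directories:", ", ".join(missing))
    else:
        print("All expected checkpoint directories are present.")
```

If the checkpoint paths were changed in the configuration files instead, the list of expected names would need to be adjusted to match.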
The data consist of two parts: pure music data for Music Captioner training and semantic annotation, and music-dance video data for rhythmic alignment evaluation and joint generation training. Training data can be organized as follows:
dataset_base_path/
├── evan_metadata_s2v_with_prompt.csv
├── videos/
│   └── xxx.mp4
└── audios/
    └── xxx.wav
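A quick consistency check of the metadata file can catch broken paths before a long training run. The sketch below assumes the three columns `video`, `input_audio`, and `prompt` used in this README's CSV example, with paths relative to `dataset_base_path`; the function name `check_metadata` is hypothetical and not part of this repository.

```python
import csv
from pathlib import Path

REQUIRED_COLUMNS = ("video", "input_audio", "prompt")


def check_metadata(dataset_base_path, csv_name="evan_metadata_s2v_with_prompt.csv"):
    """Return a list of human-readable problems found in the metadata CSV."""
    base = Path(dataset_base_path)
    problems = []
    with open(base / csv_name, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        header = reader.fieldnames or []
        missing_cols = [c for c in REQUIRED_COLUMNS if c not in header]
        if missing_cols:
            return ["missing columns: " + ", ".join(missing_cols)]
        # Data rows start on line 2 because line 1 is the header.
        for row_number, row in enumerate(reader, start=2):
            for column in ("video", "input_audio"):
                value = row[column] or ""
                if not (base / value).is_file():
                    problems.append(f"row {row_number}: {column} not found: {value!r}")
            if not (row["prompt"] or "").strip():
                problems.append(f"row {row_number}: empty prompt")
    return problems
```

An empty return value means every row points at an existing video and audio file and has a non-empty prompt.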
CSV example:
video,input_audio,prompt
videos/clip_001.mp4,audios/clip_001.wav,a person is dancing
videos/clip_002.mp4,audios/clip_002.wav,a person is dancing
Text-to-music-and-dance generation:
torchrun --nnodes 1 --nproc_per_node 8 inference/t2av_infer.py \
    --config-file ovi/configs/inference/inference_fusion.yaml
Before running, please check ckpt_dir, ovi_ckpt, text_prompt, and output_dir in ovi/configs/inference/inference_fusion.yaml.
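The check on those four fields can be automated once the configuration has been parsed into a mapping. This is a minimal sketch under two assumptions: that the fields sit at the top level of the YAML file, and that the file has already been loaded with a YAML parser of your choice. The function name `unset_config_keys` is hypothetical.

```python
REQUIRED_KEYS = ("ckpt_dir", "ovi_ckpt", "text_prompt", "output_dir")


def unset_config_keys(config):
    """Return the required keys that are missing or empty in a parsed config mapping."""
    return [key for key in REQUIRED_KEYS if config.get(key) in (None, "")]


# Example with an in-memory mapping; in practice the mapping would come from
# parsing ovi/configs/inference/inference_fusion.yaml (assumed flat structure).
example = {"ckpt_dir": "./ckpts", "ovi_ckpt": "", "text_prompt": "a person is dancing"}
print(unset_config_keys(example))  # ['ovi_ckpt', 'output_dir']
```

If the actual file nests these fields under another key, the lookup would need to descend into that section first.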
Text-to-music generation:
torchrun --nnodes 1 --nproc_per_node 8 inference/t2a_infer.py \
    --config-file ovi/configs/inference/inference_audio.yaml
Audio/speech-driven video generation:
python inference/s2v_infer.py
Input, output, and checkpoint paths can be overridden with environment variables:
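set CKPT_DIR=./ckpts
set SFT_CKPT_PATH=./ckpts/Wan2.2-S2V-5B/step-19000.safetensors
set S2V_INPUT_DIR=./outputs/s2v_input
set S2V_OUTPUT_DIR=./outputs/s2v_output
The commands above use Windows cmd syntax. In a POSIX shell such as bash, use export instead of set (for example, export CKPT_DIR=./ckpts).
Text-to-music-and-dance training: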
bash examples/Ovi/run_multinode_t2av.sh
Text-to-music training:
bash examples/Ovi/run_multinode_t2a.sh
S2V training:
bash examples/wanvideo/model_training/taiji/Evan-Wan2.2-S2V-5B-multi-node.sh
Trajectory distillation training:
bash examples/wanvideo/model_training/taiji/Evan-Wan2.2-S2V-TI-5B-multi-node.sh
The training scripts contain platform-specific paths, node settings, and data paths. Please update them according to your local or cluster environment before use.
TMD-Bench evaluates music-dance co-generation models from three complementary perspectives:
- Music quality and music instruction following, including music aesthetics, CLAP similarity, and dimension-wise semantic matching with the Music Captioner.
- Video quality and video instruction following, including spatiotemporal consistency, visual quality, motion magnitude, motion smoothness, and video-text matching.
- Music-dance rhythmic alignment, combining beat-level physical metrics with MLLM-based perceptual judgments to assess synchronization between dance motion accents and musical beats.
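To illustrate what a beat-level physical metric can look like, the sketch below computes a Gaussian-weighted nearest-beat score of the kind commonly used in dance generation work. It is not taken from this repository and may differ from the metrics TMD-Bench actually uses. The function name, the beat timestamps in seconds, and the tolerance `sigma` are all assumptions made for illustration.

```python
import math


def beat_alignment_score(motion_beats, music_beats, sigma=0.1):
    """Average, over motion beats, of a Gaussian of the distance to the nearest music beat.

    Both inputs are lists of timestamps in seconds. The score is 1.0 when every
    motion beat coincides with a music beat and decays toward 0.0 as they drift apart.
    """
    if not motion_beats or not music_beats:
        return 0.0
    total = 0.0
    for t in motion_beats:
        nearest = min(abs(t - b) for b in music_beats)
        total += math.exp(-(nearest ** 2) / (2 * sigma ** 2))
    return total / len(motion_beats)


# Perfectly aligned motion accents score 1.0.
print(beat_alignment_score([0.5, 1.0, 1.5], [0.5, 1.0, 1.5, 2.0]))  # 1.0
```

In a real pipeline the motion beats would typically come from local minima of joint velocity and the music beats from a beat tracker; both extraction steps are outside the scope of this sketch.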
If you find this project useful for your research, please cite:
@article{yang2026tmd,
title={TMD-Bench: A Multi-Level Evaluation Paradigm for Music-Dance Co-Generation},
author={Yang, Xiaoda and Zhang, Majun and Pan, Changhao and Huang, Nick and Yuguang, Yang and Zhuo, Fan and Zhou, Pengfei and Zhou, Jin and Shan, Sizhe and Yang, Shan and others},
journal={arXiv preprint arXiv:2605.01809},
year={2026}
}
This project builds on and extends open-source works including Ovi, DiffSynth-Studio, Wan, MMAudio, and Qwen-Omni. We thank the community for its contributions to audio-video generation, music understanding, and multimodal evaluation.