🤖 AI Summary
Existing approaches to dance-to-music generation rely heavily on single human pose features and are constrained by small-scale datasets, limiting their generalization to complex scenarios such as multiple dancers or non-human performers. This work proposes a pose-free diffusion model that directly extracts visual features from dance videos, bypassing the need for explicit pose estimation. To enhance data efficiency and generalization, the method incorporates a progressive training strategy. By operating directly on raw visual inputs, the model accommodates an arbitrary number and type of dancers without requiring pose annotations. Experimental results demonstrate that the proposed approach achieves state-of-the-art performance in both objective metrics and subjective evaluations, setting new benchmarks for dance-music alignment and audio generation quality.
📄 Abstract
Dance-to-music generation aims to produce music that is aligned with dance movements. Existing approaches typically rely on body motion features extracted from a single human dancer and on limited dance-to-music datasets, which restricts their performance and their applicability to real-world scenarios involving multiple or non-human dancers. In this paper, we propose PF-D2M, a universal diffusion-based dance-to-music generation model that incorporates visual features extracted directly from dance videos. PF-D2M is trained with a progressive training strategy that effectively addresses data scarcity and generalization challenges. Both objective and subjective evaluations show that PF-D2M achieves state-of-the-art performance in dance-music alignment and music quality.
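To make the pose-free conditioning idea concrete, the sketch below shows one way such a pipeline could look: a frame encoder turns raw video frames into visual features (no pose estimation), and a diffusion denoiser over audio latents attends to those features during an epsilon-prediction training step. All module names, dimensions, and the noise schedule here are illustrative assumptions, not the actual PF-D2M architecture or training code.

```python
# Hedged sketch of pose-free, video-conditioned music diffusion.
# Assumptions: toy frame encoder, cross-attention conditioning, cosine noise schedule.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualConditioner(nn.Module):
    """Encodes raw video frames into per-frame visual features (no pose annotations)."""
    def __init__(self, feat_dim=512):
        super().__init__()
        # Stand-in encoder; a real system would use a pretrained visual backbone.
        self.frame_encoder = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=7, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, feat_dim),
        )

    def forward(self, frames):                              # frames: (B, T, 3, H, W)
        b, t = frames.shape[:2]
        feats = self.frame_encoder(frames.flatten(0, 1))    # (B*T, D)
        return feats.view(b, t, -1)                         # (B, T, D)

class ConditionedDenoiser(nn.Module):
    """Predicts noise on audio latents, conditioned on visual features via cross-attention."""
    def __init__(self, latent_dim=128, feat_dim=512, n_heads=4):
        super().__init__()
        self.in_proj = nn.Linear(latent_dim, feat_dim)
        self.cross_attn = nn.MultiheadAttention(feat_dim, n_heads, batch_first=True)
        self.out_proj = nn.Linear(feat_dim, latent_dim)

    def forward(self, noisy_latents, t, visual_feats):
        # noisy_latents: (B, L, latent_dim); visual_feats: (B, T, feat_dim)
        h = self.in_proj(noisy_latents) + t.float().view(-1, 1, 1)  # crude timestep injection
        h, _ = self.cross_attn(h, visual_feats, visual_feats)
        return self.out_proj(h)

def diffusion_loss(denoiser, conditioner, frames, audio_latents, num_steps=1000):
    """One epsilon-prediction training step conditioned on dance video frames."""
    b = audio_latents.size(0)
    t = torch.randint(0, num_steps, (b,))
    alpha_bar = torch.cos(t.float() / num_steps * torch.pi / 2).view(-1, 1, 1) ** 2
    noise = torch.randn_like(audio_latents)
    noisy = alpha_bar.sqrt() * audio_latents + (1 - alpha_bar).sqrt() * noise
    pred = denoiser(noisy, t, conditioner(frames))
    return F.mse_loss(pred, noise)
```

Because the conditioning signal is just a sequence of per-frame features, nothing in this setup assumes a single human skeleton, which is what allows an arbitrary number or type of dancers.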