🤖 AI Summary
This paper addresses three key challenges in music-driven group dance generation: multi-dancer collisions, single-dancer foot sliding, and abrupt positional discontinuities in long sequences. To tackle these, the authors propose TCDiff++, an end-to-end controllable trajectory generation framework based on diffusion models. Methodologically: (1) a dancer positioning embedding and a distance-consistency loss explicitly enforce spatial constraints and suppress collisions; (2) a swap mode embedding and a Footwork Adaptor improve the physical plausibility of foot trajectories; and (3) a long group diffusion sampling strategy, coupled with a Sequence Decoder layer, enhances spatiotemporal coherence across frames. Extensive experiments demonstrate that the approach significantly outperforms existing methods in long-duration group dance synthesis, achieving state-of-the-art performance in motion fluency, group coordination, and trajectory controllability.
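The distance-consistency idea can be illustrated with a minimal sketch: a hinge-style penalty that is zero while pairwise inter-dancer distances stay inside a plausible band and grows quadratically outside it. The function name, the `d_min`/`d_max` thresholds, and the quadratic form are illustrative assumptions, not the paper's actual formulation.

```python
import numpy as np

def distance_consistency_loss(positions, d_min=1.0, d_max=10.0):
    """Hinge penalty keeping pairwise dancer distances within [d_min, d_max].

    positions: (N, 2) array of dancer root positions on the stage plane.
    Returns 0.0 when every pair is inside the band; otherwise the mean
    squared violation (too close or too far) over all unordered pairs.
    """
    # Pairwise difference vectors and Euclidean distances, shape (N, N)
    diffs = positions[:, None, :] - positions[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    # Keep each unordered pair once (upper triangle, excluding the diagonal)
    iu = np.triu_indices(len(positions), k=1)
    d = dists[iu]
    too_close = np.maximum(d_min - d, 0.0)  # collision-side violation
    too_far = np.maximum(d - d_max, 0.0)    # dispersion-side violation
    return float(np.mean(too_close**2 + too_far**2))
```

With well-spaced dancers the loss vanishes, while near-overlapping positions incur a positive penalty, which is the behavior the collision-suppression term needs.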
📝 Abstract
Music-driven dance generation has garnered significant attention due to its wide range of industrial applications, particularly in the creation of group choreography. During the group dance generation process, however, most existing methods still face three primary issues: multi-dancer collisions, single-dancer foot sliding, and abrupt position swapping in long group dance generation. In this paper, we propose TCDiff++, a music-driven end-to-end framework designed to generate harmonious group dance. Specifically, to mitigate multi-dancer collisions, we utilize a dancer positioning embedding to better maintain the relative positioning among dancers. Additionally, we incorporate a distance-consistency loss to ensure that inter-dancer distances remain within plausible ranges. To address single-dancer foot sliding, we introduce a swap mode embedding to indicate dancer swapping patterns and design a Footwork Adaptor to refine raw motion, thereby minimizing foot sliding. For long group dance generation, we present a long group diffusion sampling strategy that reduces abrupt position shifts by injecting positional information into the noisy input. Furthermore, we integrate a Sequence Decoder layer to enhance the model's ability to selectively process long sequences. Extensive experiments demonstrate that our TCDiff++ achieves state-of-the-art performance, particularly in long-duration scenarios, ensuring high-quality and coherent group dance generation.
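The long group diffusion sampling idea, generating a long sequence window by window while injecting known positional information into each new window's noisy input, can be sketched as follows. The windowing scheme, the `overlap` parameter, and the dummy denoiser interface are assumptions for illustration; the paper's actual sampler operates inside the diffusion denoising loop rather than on a single-call denoiser.

```python
import numpy as np

def sample_long_sequence(denoise_fn, total_frames, window=64, overlap=16,
                         dim=2, seed=0):
    """Windowed sampling sketch: each new window's noisy input has its first
    `overlap` frames overwritten with the previous window's tail positions,
    so adjacent windows agree on position and abrupt shifts are avoided.

    denoise_fn: maps a noisy (window, dim) array to a clean trajectory chunk.
    Returns a (total_frames, dim) trajectory.
    """
    rng = np.random.default_rng(seed)
    frames = []          # accumulated clean frames
    prev_tail = None     # last `overlap` frames of the previous window
    while len(frames) < total_frames:
        noisy = rng.standard_normal((window, dim))
        if prev_tail is not None:
            # Inject known positions into the noisy input (the conditioning step)
            noisy[:overlap] = prev_tail
        clean = denoise_fn(noisy)
        # Skip the conditioned prefix on all windows after the first
        start = 0 if prev_tail is None else overlap
        frames.extend(clean[start:])
        prev_tail = clean[-overlap:].copy()
    return np.asarray(frames[:total_frames])
```

With an identity denoiser the conditioned prefix of each window reproduces the previous tail exactly; with a real diffusion denoiser the injection acts as soft positional guidance at each sampling step.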