Music-Aligned Holistic 3D Dance Generation via Hierarchical Motion Modeling

📅 2025-07-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses key challenges in music-driven 3D dance generation: poor cross-modal alignment between audio and motion, scarcity of high-quality paired music-dance data, and the complexity of holistic full-body motion modeling (including torso, hands, and face). To this end, the authors propose an end-to-end generative framework built on hierarchical residual vector quantization (VQ). Methodologically: (1) they introduce SoulDance, a high-fidelity professional motion-capture dataset of synchronized music-dance pairs; (2) they design a hierarchical residual VQ architecture to explicitly capture inter-joint and multi-body-part dependencies; (3) they integrate a music-aligned generation module with a pre-trained music-to-motion retrieval module to enhance cross-modal consistency. Extensive experiments demonstrate that the approach significantly outperforms state-of-the-art methods in musical alignment fidelity, motion naturalness, and full-body coordination, achieving expressive, emotionally resonant, and temporally coherent whole-body pose synthesis.

📝 Abstract
Well-coordinated, music-aligned holistic dance enhances emotional expressiveness and audience engagement. However, generating such dances remains challenging due to the scarcity of holistic 3D dance datasets, the difficulty of achieving cross-modal alignment between music and dance, and the complexity of modeling interdependent motion across the body, hands, and face. To address these challenges, we introduce SoulDance, a high-precision music-dance paired dataset captured via professional motion capture systems, featuring meticulously annotated holistic dance movements. Building on this dataset, we propose SoulNet, a framework designed to generate music-aligned, kinematically coordinated holistic dance sequences. SoulNet consists of three principal components: (1) Hierarchical Residual Vector Quantization, which models complex, fine-grained motion dependencies across the body, hands, and face; (2) Music-Aligned Generative Model, which composes these hierarchical motion units into expressive and coordinated holistic dance; (3) Music-Motion Retrieval Module, a pre-trained cross-modal model that functions as a music-dance alignment prior, ensuring temporal synchronization and semantic coherence between generated dance and input music throughout the generation process. Extensive experiments demonstrate that SoulNet significantly surpasses existing approaches in generating high-quality, music-coordinated, and well-aligned holistic 3D dance sequences.
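The abstract's first component, Hierarchical Residual Vector Quantization, builds on standard residual VQ: each codebook level quantizes the residual left by the previous level, so coarse structure and fine detail are captured at successive levels. The paper's actual architecture is not specified here; the following is a minimal NumPy sketch of the generic residual VQ step, with hypothetical codebooks, not the authors' implementation.

```python
import numpy as np

def residual_vq(x, codebooks):
    """Quantize vector x with a stack of codebooks (residual VQ sketch).

    Each level picks the codeword nearest to the residual left by the
    previous level; the sum of selected codewords approximates x.
    """
    residual = x.astype(float)
    quantized = np.zeros_like(residual)
    indices = []
    for cb in codebooks:
        # nearest codeword to the current residual
        dists = np.linalg.norm(cb - residual, axis=1)
        k = int(np.argmin(dists))
        indices.append(k)
        quantized += cb[k]
        residual -= cb[k]
    return quantized, indices

# Toy example: a coarse codebook plus a fine codebook that happens to
# contain the exact residual, so reconstruction is perfect here.
x = np.array([1.0, 2.0, 3.0])
coarse = np.array([[1.0, 2.0, 2.5], [0.0, 0.0, 0.0]])
fine = np.array([[0.0, 0.0, 0.5], [0.0, 0.0, 0.0]])
q, idx = residual_vq(x, [coarse, fine])
```

In the paper's hierarchical variant, separate levels would model body, hands, and face so that fine-grained part motion is encoded conditionally on coarser whole-body structure.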
Problem

Research questions and friction points this paper is trying to address.

Generating music-aligned holistic 3D dance sequences
Modeling interdependent motion across body, hands, and face
Achieving cross-modal alignment between music and dance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical Residual Vector Quantization for motion dependencies
Music-Aligned Generative Model for expressive dance
Music-Motion Retrieval Module for synchronization
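The retrieval module acts as an alignment prior: a pre-trained cross-modal model embeds music and motion into a shared space, and generation is steered toward motions whose embeddings match the music. The paper's scoring function is not given here; a minimal sketch, assuming per-frame embeddings and cosine similarity (both hypothetical choices), could look like:

```python
import numpy as np

def alignment_score(music_emb, motion_emb):
    """Mean per-frame cosine similarity between music and motion embeddings.

    music_emb, motion_emb: arrays of shape (frames, dim) from a shared
    cross-modal embedding space (assumed, not the paper's exact setup).
    """
    m = music_emb / np.linalg.norm(music_emb, axis=-1, keepdims=True)
    d = motion_emb / np.linalg.norm(motion_emb, axis=-1, keepdims=True)
    return float(np.mean(np.sum(m * d, axis=-1)))
```

A score near 1 indicates tight music-dance alignment; during generation, candidate motion units scoring higher under this prior would be preferred.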