Flexible Music-Conditioned Dance Generation with Style Description Prompts

📅 2024-06-12
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
Dance generation has long relied on raw audio signals alone, limiting semantic controllability and multi-task editing. To address this, the paper proposes Flexible Dance Generation with Style Description Prompts (DGSDP), a diffusion-based framework whose core component, the Music-Conditioned Style-Aware Diffusion (MCSAD) model, integrates textual music-style prompts (e.g., "jazz", "energetic") into the diffusion process through a Transformer-based network and a music Style Modulation module. A spatial-temporal masking strategy applied in the backward diffusion process further provides unified support for long-term generation, dance in-betweening, and dance inpainting. Evaluated across diverse musical styles, MCSAD produces high-fidelity dance sequences with precise audio-motion alignment, and quantitative and qualitative results show substantial improvements over state-of-the-art methods on these editing tasks, advancing controllable motion generation beyond audio-driven baselines.
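To make the conditioning concrete, below is a minimal sketch of how a Transformer-based denoiser might fuse noisy motion, per-frame music features, and a text-derived style embedding. All module names, feature dimensions, and the additive fusion scheme are illustrative assumptions, not the authors' exact architecture.

```python
# Minimal sketch of a style-aware diffusion denoiser in the spirit of MCSAD.
# Dimensions and the fusion scheme are hypothetical, chosen for illustration.
import torch
import torch.nn as nn

class StyleAwareDenoiser(nn.Module):
    """Predicts clean motion x0 from noisy motion, conditioned on per-frame
    music features and a text-derived style embedding (assumed layout)."""

    def __init__(self, motion_dim=147, music_dim=35, style_dim=512, d_model=512):
        super().__init__()
        self.motion_proj = nn.Linear(motion_dim, d_model)
        self.music_proj = nn.Linear(music_dim, d_model)   # per-frame music features
        self.style_proj = nn.Linear(style_dim, d_model)   # e.g. a CLIP-style text embedding
        self.time_embed = nn.Sequential(nn.Linear(1, d_model), nn.SiLU(),
                                        nn.Linear(d_model, d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=6)
        self.head = nn.Linear(d_model, motion_dim)

    def forward(self, x_t, t, music, style):
        # x_t: (B, T, motion_dim), music: (B, T, music_dim), style: (B, style_dim)
        h = self.motion_proj(x_t) + self.music_proj(music)       # frame-wise fusion
        cond = self.style_proj(style) + self.time_embed(t.float().view(-1, 1))
        h = h + cond.unsqueeze(1)                                # broadcast over frames
        return self.head(self.encoder(h))                        # predicted x0
```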

📝 Abstract
Dance plays an important role as an artistic form and expression in human culture, yet the creation of dance remains a challenging task. Most dance generation methods rely primarily on music, seldom taking into consideration intrinsic attributes such as music style or genre. In this work, we introduce Flexible Dance Generation with Style Description Prompts (DGSDP), a diffusion-based framework suitable for diversified tasks of dance generation by fully leveraging the semantics of music style. The core component of this framework is Music-Conditioned Style-Aware Diffusion (MCSAD), which comprises a Transformer-based network and a music Style Modulation module. MCSAD seamlessly integrates music conditions and style description prompts into the dance generation framework, ensuring that generated dances are consistent with the music content and style. To facilitate flexible dance generation and accommodate different tasks, a spatial-temporal masking strategy is applied in the backward diffusion process. The proposed framework successfully generates realistic dance sequences that are accurately aligned with music for a variety of tasks, such as long-term generation, dance in-betweening, and dance inpainting. We hope this work will inspire dance generation and creation, with promising applications in entertainment, art, and education.
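The spatial-temporal masking in the backward diffusion process can be illustrated with the generic known-region replacement used by inpainting-style diffusion samplers. The sketch below assumes an x0-predicting denoiser (such as the hypothetical one above) and a DDIM-style deterministic update; the `denoiser` interface, noise schedule, and mask convention (1 = keep the given motion, 0 = generate) are assumptions, not the paper's exact sampler.

```python
# Hedged sketch: reverse diffusion with a spatial-temporal mask, so observed
# frames/joints are preserved while the zero-mask region is synthesized.
import torch

@torch.no_grad()
def masked_reverse_diffusion(denoiser, x_known, mask, music, style, alphas_cumprod):
    """x_known: reference motion (B, T, D); mask: same shape, 1 = keep, 0 = generate.
    alphas_cumprod: 1-D tensor of cumulative schedule products on x_known's device."""
    x = torch.randn_like(x_known)                        # start from pure noise
    for t in reversed(range(len(alphas_cumprod))):
        a_t = alphas_cumprod[t]
        a_prev = alphas_cumprod[t - 1] if t > 0 else torch.ones_like(a_t)
        t_batch = torch.full((x.shape[0],), t, device=x.device)
        x0_pred = denoiser(x, t_batch, music, style)     # model predicts clean motion
        # Overwrite the known region of the prediction so the given motion is
        # kept exactly; only the unmasked region is generated.
        x0_pred = mask * x_known + (1 - mask) * x0_pred
        # Deterministic DDIM-style update from step t to t-1.
        eps = (x - a_t.sqrt() * x0_pred) / (1 - a_t).sqrt()
        x = a_prev.sqrt() * x0_pred + (1 - a_prev).sqrt() * eps
    return x
```

Under this convention, unconditional generation is simply the special case where the mask is zero everywhere; in-betweening and inpainting differ only in which entries of the mask are set.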
Problem

Research questions and friction points this paper is trying to address.

Generating dance sequences aligned with music style and content
Integrating style descriptions into diffusion-based dance generation
Enabling flexible tasks like long-term generation and inpainting
Innovation

Methods, ideas, or system contributions that make the work stand out.

Diffusion-based framework with style prompts
Music-Conditioned Style-Aware Diffusion module
Spatial-temporal masking in the backward diffusion process (see the mask sketch after this list)
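As a rough illustration of how a single binary mask can encode the different tasks, the hypothetical layouts below use a (frames x pose-dimensions) mask with 1 marking motion to keep and 0 marking motion to generate; all shapes and index ranges are invented for illustration.

```python
# Hypothetical spatial-temporal mask layouts for the tasks listed above.
import torch

T, D = 300, 147                        # frames x pose dimensions (illustrative)

# Dance in-betweening: keep the first and last frames, generate the middle.
inbetween = torch.zeros(T, D)
inbetween[:60] = 1.0
inbetween[-60:] = 1.0

# Dance inpainting (temporal): regenerate only a chosen interval.
inpaint = torch.ones(T, D)
inpaint[120:180] = 0.0

# Spatial editing: regenerate only some pose dimensions, keep the rest.
upper_body = torch.ones(T, D)
upper_body[:, 60:] = 0.0               # hypothetical upper-body slice

# Long-term generation: slide a window, keeping the overlap with the
# previously generated segment as the "known" region of the next window.
long_term = torch.zeros(T, D)
long_term[:30] = 1.0                   # overlap carried over from the last window
```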
👥 Authors
Hongsong Wang
Department of Computer Science and Engineering, Southeast University, Nanjing 210096, China
Yin Zhu
Department of Computer Science and Engineering, Southeast University, Nanjing 210096, China
Xin Geng
School of Computer Science and Engineering, Southeast University
Artificial Intelligence · Pattern Recognition · Machine Learning