DanceEditor: Towards Iterative Editable Music-driven Dance Generation with Open-Vocabulary Descriptions

📅 2025-08-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing dance generation methods lack support for user-driven, multi-turn text editing and are hindered by the scarcity of high-quality, editable dance data. To address this, we introduce DanceRemix—the first large-scale, multi-turn editable dance dataset—and propose a unified prediction-editing framework. Our approach employs music-text dual-conditioned modeling, incorporating a customized music-alignment module and a cross-modal editing module (CEM) to jointly optimize rhythmic consistency and fine-grained semantic alignment. The framework enables open-vocabulary, text-guided iterative motion editing. Evaluated on DanceRemix, our method significantly outperforms state-of-the-art approaches, producing dances that exhibit superior musical coherence, textual fidelity, and editing controllability—directly fulfilling practical choreography requirements.

📝 Abstract
Generating coherent and diverse human dances from music has seen tremendous progress in animating virtual avatars. While existing methods support direct dance synthesis, they overlook the fact that enabling users to edit dance movements is far more practical in real-world choreography scenarios. Moreover, the lack of high-quality dance datasets that incorporate iterative editing further limits progress on this challenge. To achieve this goal, we first construct DanceRemix, a large-scale multi-turn editable dance dataset comprising over 25.3M dance frames and 84.5K prompt-dance pairs. In addition, we propose DanceEditor, a novel framework for iterative, editable dance generation that remains coherently aligned with the given music. Since dance motion must both follow the musical rhythm and support iterative editing through user descriptions, our framework is built on a prediction-then-editing paradigm that unifies multi-modal conditions. At the initial prediction stage, the framework improves the authenticity of generated results by directly modeling dance movements from tailored, aligned music features. At the subsequent iterative editing stages, we incorporate text descriptions as conditioning information and produce editable results through a specifically designed Cross-modality Editing Module (CEM). Specifically, CEM adaptively integrates the initial prediction with music and text prompts as temporal motion cues to guide the synthesized sequences. As a result, the edited dances remain harmonized with the music while preserving fine-grained semantic alignment with the text descriptions. Extensive experiments demonstrate that our method outperforms state-of-the-art models on our newly collected DanceRemix dataset. Code is available at https://lzvsdy.github.io/DanceEditor/.
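
The abstract describes CEM only at a high level. As a rough illustration, the sketch below shows one plausible way such a module could integrate an initial motion prediction with music and text features via stacked cross-attention and a residual update; all dimensions, layer choices, and names here are assumptions made for illustration, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class CEMSketch(nn.Module):
    """Hypothetical stand-in for the paper's Cross-modality Editing Module:
    motion queries attend first to music features (rhythm cues), then to
    text features (edit semantics), and emit a residual edit of the motion."""

    def __init__(self, motion_dim=147, music_dim=128, text_dim=512,
                 d_model=256, n_heads=4):
        super().__init__()
        self.motion_proj = nn.Linear(motion_dim, d_model)
        self.music_proj = nn.Linear(music_dim, d_model)
        self.text_proj = nn.Linear(text_dim, d_model)
        self.music_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.text_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.to_motion = nn.Linear(d_model, motion_dim)

    def forward(self, motion, music, text):
        # motion: (B, T, motion_dim)  initial prediction, used as temporal cues
        # music:  (B, T, music_dim)   frame-aligned music features
        # text:   (B, L, text_dim)    encoded edit description (e.g. CLIP-style)
        h = self.motion_proj(motion)
        m = self.music_proj(music)
        t = self.text_proj(text)
        h = self.music_attn(h, m, m)[0]  # align edits with the musical rhythm
        h = self.text_attn(h, t, t)[0]   # inject open-vocabulary edit semantics
        # Residual update keeps the edit anchored to the initial prediction.
        return motion + self.to_motion(h)

# Shape check with random tensors (2 clips, 120 frames, 16 text tokens).
cem = CEMSketch()
edited = cem(torch.randn(2, 120, 147), torch.randn(2, 120, 128),
             torch.randn(2, 16, 512))  # -> (2, 120, 147)
```

The residual formulation is one natural reading of "adaptively integrates the initial prediction": each editing turn perturbs the previous motion rather than regenerating it from scratch, which helps preserve rhythm alignment across turns.
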
Problem

Research questions and friction points this paper is trying to address.

Enabling iterative editing of dance movements with user descriptions
Addressing lack of high-quality datasets for editable dance generation
Achieving music-rhythmic alignment while allowing text-guided modifications
Innovation

Methods, ideas, or system contributions that make the work stand out.

Constructs the DanceRemix dataset with over 25.3M dance frames and 84.5K prompt-dance pairs
Uses a prediction-then-editing paradigm that unifies multi-modal conditions (see the sketch after this list)
Implements a Cross-modality Editing Module (CEM) for iterative, text-guided refinement
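
The bullets above compress into a simple two-stage loop. The sketch below is schematic only; `predictor` and `editor` are hypothetical callables standing in for the paper's prediction and editing stages, not its actual API.

```python
def generate_with_edits(predictor, editor, music, edit_prompts):
    """Schematic prediction-then-editing loop (hypothetical interfaces).

    predictor(music) -> motion: initial music-driven dance synthesis.
    editor(motion, music, text) -> motion: one text-guided editing turn,
    e.g. a CEM-style module that re-reads the music on every turn so the
    edited motion stays on the beat.
    """
    motion = predictor(music)                 # stage 1: initial prediction
    for text in edit_prompts:                 # stage 2: multi-turn editing
        motion = editor(motion, music, text)
    return motion
```
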