MDD: A Dataset for Text-and-Music Conditioned Duet Dance Generation

📅 2025-08-23
🤖 AI Summary
Existing research lacks a high-quality multimodal benchmark dataset for jointly text- and music-conditioned 3D duet dance generation. To address this gap, we introduce MDD, the first synchronized, high-fidelity multimodal dataset of its kind: 620 minutes of professionally motion-captured duet dance sequences, precisely aligned with over 10,000 fine-grained natural language descriptions and their corresponding musical audio tracks. Leveraging MDD, we formally define two novel tasks: Text-to-Duet (generating coordinated motions for both dancers from music and text) and Text-to-Dance Accompaniment (generating one dancer's motion conditioned on the partner's motion, text, and music). We provide standardized preprocessing pipelines, rigorous temporal alignment annotation protocols, and strong baseline models to support reproducible research on conditional duet dance generation. MDD establishes a critical benchmark for multimodal dance generation, advancing motion synthesis driven jointly by linguistic semantics and musical structure.

📝 Abstract
We introduce Multimodal DuetDance (MDD), a diverse multimodal benchmark dataset designed for text-controlled and music-conditioned 3D duet dance motion generation. Our dataset comprises 620 minutes of high-quality motion capture data performed by professional dancers, synchronized with music, and annotated with over 10K fine-grained natural language descriptions. The annotations capture a rich movement vocabulary, detailing spatial relationships, body movements, and rhythm, making MDD the first dataset to seamlessly integrate human motion, music, and text for duet dance generation. We introduce two novel tasks supported by our dataset: (1) Text-to-Duet, where, given music and a textual prompt, both the leader's and follower's dance motions are generated; and (2) Text-to-Dance Accompaniment, where, given music, a textual prompt, and the leader's motion, the follower's motion is generated in a cohesive, text-aligned manner. We include baseline evaluations on both tasks to support future research.
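The two tasks can be viewed as conditional generation problems with different conditioning sets. The sketch below is a hypothetical illustration of those interfaces; the class name, function names, feature shapes, and joint count are all assumptions for exposition, not the paper's actual data format or API:

```python
from dataclasses import dataclass
import numpy as np

# Hypothetical sample layout: shapes and feature choices are illustrative only.
@dataclass
class DuetSample:
    music: np.ndarray     # audio features, e.g. (T, F) spectrogram frames
    text: str             # fine-grained natural language description
    leader: np.ndarray    # leader motion, e.g. (T, J, 3) joint positions
    follower: np.ndarray  # follower motion, same shape as leader

def text_to_duet(music: np.ndarray, text: str) -> tuple[np.ndarray, np.ndarray]:
    """Task 1 (Text-to-Duet): generate both dancers' motions from music + text.
    Placeholder: a real model would go here; this returns zero motion."""
    T = music.shape[0]
    return np.zeros((T, 22, 3)), np.zeros((T, 22, 3))

def text_to_accompaniment(music: np.ndarray, text: str,
                          leader: np.ndarray) -> np.ndarray:
    """Task 2 (Text-to-Dance Accompaniment): generate the follower's motion
    conditioned on music, text, and the leader's motion."""
    return np.zeros_like(leader)
```

Note the asymmetry: Task 1 conditions only on music and text, while Task 2 additionally conditions on the leader's motion, so the generated follower must stay temporally and spatially coherent with an observed partner.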
Problem

Research questions and friction points this paper is trying to address.

Generating 3D duet dance motions from text and music
Creating cohesive dance accompaniment given leader motion
Integrating multimodal inputs for synchronized partner dancing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal dataset integrating motion, music and text
Text-controlled duet dance generation with music conditioning
Novel tasks aligning leader and follower motion generation