🤖 AI Summary
Existing research lacks a high-quality multimodal benchmark for jointly text- and music-conditioned 3D duet dance generation. To address this gap, we introduce MDD, the first synchronized, high-fidelity multimodal dataset of its kind, comprising 620 minutes of professional motion-captured duet dance precisely aligned with the corresponding music tracks and over 10,000 fine-grained natural language descriptions. Leveraging MDD, we formally define two novel tasks: Text-to-Duet, generating coordinated motions for both dancers from music and a textual prompt, and Text-to-Dance Accompaniment, generating the follower's motion conditioned on music, text, and the leader's motion. We provide standardized preprocessing, temporal alignment annotations, and strong baseline models to facilitate reproducible research on conditional duet dance generation. MDD establishes a critical benchmark for multimodal dance generation, advancing embodied motion synthesis driven jointly by linguistic semantics and musical structure.
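To make the "synchronized multimodal" claim concrete, a single MDD clip can be pictured as one time window covered by all three modalities. The sketch below is a hypothetical layout, not the dataset's official schema: the `MDDSample` class, its field names, and the dimensions (joint-position motion, mel-spectrogram-style audio features) are illustrative assumptions.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class MDDSample:
    """Hypothetical layout of one synchronized MDD clip (not the official schema).

    All modalities cover the same time window so that motion, music,
    and text remain temporally aligned.
    """
    leader_motion: np.ndarray    # (T, J, 3) joint positions for the leader, T frames, J joints
    follower_motion: np.ndarray  # (T, J, 3) joint positions for the follower
    music: np.ndarray            # (T_audio, F) audio features, e.g. a mel spectrogram
    text: str                    # fine-grained description of the duet segment

# Toy example: a 2-second clip at 30 fps with 24 joints and 80 mel bins.
sample = MDDSample(
    leader_motion=np.zeros((60, 24, 3)),
    follower_motion=np.zeros((60, 24, 3)),
    music=np.zeros((172, 80)),
    text="The leader steps forward and lifts the follower into a spin.",
)
print(sample.leader_motion.shape, sample.music.shape)
```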
📝 Abstract
We introduce Multimodal DuetDance (MDD), a diverse multimodal benchmark dataset designed for text-controlled and music-conditioned 3D duet dance motion generation. Our dataset comprises 620 minutes of high-quality motion capture data performed by professional dancers, synchronized with music and annotated with over 10K fine-grained natural language descriptions. The annotations capture a rich movement vocabulary, detailing spatial relationships, body movements, and rhythm, making MDD the first dataset to seamlessly integrate human motion, music, and text for duet dance generation. We introduce two novel tasks supported by our dataset: (1) Text-to-Duet, where, given music and a textual prompt, both the leader's and follower's dance motions are generated; and (2) Text-to-Dance Accompaniment, where, given music, a textual prompt, and the leader's motion, the follower's motion is generated in a cohesive, text-aligned manner. We include baseline evaluations on both tasks to support future research.
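The two tasks differ only in their conditioning set: Text-to-Duet generates both dancers from music and text, while Text-to-Dance Accompaniment additionally conditions on the leader's motion. The sketch below pins down those interfaces; the function names and the stand-in zero-valued "generation" are hypothetical placeholders, not the paper's baseline models.

```python
import numpy as np

def text_to_duet(music: np.ndarray, text: str, num_frames: int, num_joints: int = 24):
    """Task 1: given music and a textual prompt, generate BOTH dancers' motions.

    Returns (leader_motion, follower_motion), each of shape (num_frames, num_joints, 3).
    A real model would replace this stub with, e.g., a conditional generative sampler.
    """
    leader = np.zeros((num_frames, num_joints, 3))    # placeholder output
    follower = np.zeros((num_frames, num_joints, 3))  # placeholder output
    return leader, follower

def text_to_dance_accompaniment(music: np.ndarray, text: str, leader_motion: np.ndarray):
    """Task 2: given music, text, AND the leader's motion, generate the follower.

    The leader's trajectory is part of the conditioning, not the output, so the
    generated follower can stay cohesive with it and aligned with the prompt.
    """
    follower = np.zeros_like(leader_motion)  # placeholder output
    return follower

# Toy usage with dummy conditioning inputs.
music = np.zeros((172, 80))  # e.g. mel-spectrogram features
prompt = "A slow waltz with mirrored arm movements."
leader, follower = text_to_duet(music, prompt, num_frames=60)
follower2 = text_to_dance_accompaniment(music, prompt, leader)
print(leader.shape, follower2.shape)
```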