🤖 AI Summary
Existing dance generation methods struggle to simultaneously achieve realism, music synchronization, motion diversity, and physical plausibility, while lacking flexible editing capabilities for multimodal conditioning signals, such as musical cues, pose constraints, action labels, and genre descriptions. To address these limitations, we propose the first multimodal masked motion model tailored for high-fidelity 3D dance generation, integrating a text-to-motion framework with dual adapters for music and pose conditioning. We further introduce multimodal classifier-free guidance and inference-time motion optimization, jointly enhancing cross-modal alignment fidelity and editing flexibility. Our approach achieves state-of-the-art performance across multiple quantitative metrics, significantly improving generation quality, physical plausibility, and real-time editability. It enables rich, user-controllable creative expression through diverse multimodal inputs.
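To make the masked-generation idea concrete, below is a minimal MaskGIT-style iterative decoding sketch over discrete motion tokens. The `model` signature, the cosine re-masking schedule, and how the music/pose adapters feed into `cond` are illustrative assumptions, not the exact DanceMosaic implementation.

```python
import math
import torch

def masked_decode(model, cond, seq_len, mask_id, steps=10, device="cpu"):
    """Sketch of iterative masked decoding: start fully masked, commit
    the most confident token predictions each step, and re-mask the
    least confident positions on a shrinking cosine schedule."""
    tokens = torch.full((1, seq_len), mask_id, dtype=torch.long, device=device)
    for step in range(steps):
        logits = model(tokens, cond)            # hypothetical: (1, seq_len, codebook_size)
        probs = logits.softmax(dim=-1)
        conf, pred = probs.max(dim=-1)          # per-token confidence and argmax token
        is_masked = tokens == mask_id
        # Previously committed tokens are never re-masked (infinite confidence).
        conf = torch.where(is_masked, conf, torch.full_like(conf, float("inf")))
        tokens = torch.where(is_masked, pred, tokens)
        # Cosine schedule: the fraction of tokens re-masked shrinks each step.
        num_mask = int(seq_len * math.cos(math.pi / 2 * (step + 1) / steps))
        if num_mask == 0:
            break
        # Re-mask the least confident positions for the next refinement pass.
        remask_idx = conf.topk(num_mask, largest=False).indices
        tokens.scatter_(1, remask_idx, mask_id)
    return tokens
```

One reason this family of models edits well: pose-constrained editing can be realized by simply never re-masking user-specified tokens, so refinement only touches the free regions of the sequence.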
📄 Abstract
Recent advances in dance generation have enabled automatic synthesis of 3D dance motions. However, existing methods still struggle to produce high-fidelity dance sequences that simultaneously deliver exceptional realism, precise dance-music synchronization, high motion diversity, and physical plausibility. Moreover, existing methods lack the flexibility to edit dance sequences according to diverse guidance signals, such as musical prompts, pose constraints, action labels, and genre descriptions, which significantly restricts their creative utility and adaptability. Unlike these approaches, DanceMosaic enables fast, high-fidelity dance generation while allowing multimodal motion editing. Specifically, we propose a multimodal masked motion model that fuses a text-to-motion model with music and pose adapters to learn a probabilistic mapping from diverse guidance signals to high-quality dance motion sequences via progressive generative masking training. To further improve generation quality, we propose a multimodal classifier-free guidance scheme and an inference-time optimization mechanism that further enforce alignment between the generated motions and the multimodal guidance. Extensive experiments demonstrate that our method establishes new state-of-the-art performance in dance generation, significantly advancing the quality and editability achieved by existing approaches.
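The abstract describes multimodal classifier-free guidance only at a high level. A common formulation, extrapolating each modality's conditional logits away from the unconditional prediction with its own guidance scale, could look like the following sketch; the per-modality forward passes, the `cond` dictionary convention, and the weight names are assumptions for illustration.

```python
import torch

def multimodal_cfg_logits(model, tokens, conds, weights):
    """Sketch of multimodal classifier-free guidance over token logits:
    start from the unconditional prediction and add a separately
    weighted correction for each guidance modality (text, music, pose).

    conds:   dict mapping modality name -> conditioning input (None = dropped)
    weights: dict mapping modality name -> guidance scale
    """
    # Unconditional pass: every modality dropped, as in CFG training.
    uncond = model(tokens, cond=None)
    logits = uncond.clone()
    for name, cond in conds.items():
        if cond is None:
            continue
        # Conditional pass with only this modality supplied.
        cond_logits = model(tokens, cond={name: cond})
        logits = logits + weights[name] * (cond_logits - uncond)
    return logits
```

With a single modality this reduces to standard classifier-free guidance; keeping a separate scale per modality lets a user, for example, raise the music weight for tighter beat alignment without also strengthening a pose constraint.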