Reframing Music-Driven 2D Dance Pose Generation as Multi-Channel Image Generation

πŸ“… 2025-12-12
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This paper addresses three key challenges in music-driven 2D dance pose generation: poor temporal coherence, difficulty in rhythm alignment, and weak generalization to real-world scenarios. To this end, the authors propose a multi-channel image-synthesis framework built on a diffusion Transformer (DiT). Dance sequences are encoded as one-hot images and compressed into latent representations by a pre-trained image VAE. A time-shared temporal indexing mechanism enables precise cross-modal alignment between music tokens and pose latents, and a reference-pose conditioning strategy maintains anatomical consistency and stability when stitching long-sequence segments. Evaluated on AIST++ 2D and a large-scale in-the-wild dataset, the method achieves state-of-the-art performance on FID, average pose distance (APD), action–music synchronization accuracy, and human preference scores. Ablation studies confirm the contribution of each component.
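The one-hot image encoding of pose sequences described above can be sketched roughly as follows. The bin count, joint count, and tensor layout here are illustrative assumptions, not the paper's actual configuration: each joint coordinate is quantized into bins and set as a one-hot entry, giving an image-like tensor a VAE can compress.

```python
import numpy as np

def poses_to_onehot_image(poses, num_bins=64):
    """Encode a 2D pose sequence as a multi-channel one-hot image.

    poses: array of shape (T, J, 2), coordinates normalized to [0, 1].
    Returns an array of shape (T, J * 2, num_bins): time along the first
    axis, one row per joint coordinate, one-hot over quantization bins.
    (Hypothetical layout for illustration only.)
    """
    T, J, _ = poses.shape
    flat = poses.reshape(T, J * 2)                       # (T, 2J) coordinate values
    bins = np.clip((flat * num_bins).astype(int), 0, num_bins - 1)
    image = np.zeros((T, J * 2, num_bins), dtype=np.float32)
    rows = np.arange(T)[:, None]                         # broadcast over frames
    cols = np.arange(J * 2)[None, :]                     # broadcast over coords
    image[rows, cols, bins] = 1.0                        # set the one-hot bin
    return image

# toy example: 8 frames, 17 COCO-style joints
poses = np.random.rand(8, 17, 2)
img = poses_to_onehot_image(poses)
print(img.shape)   # (8, 34, 64)
print(img.sum())   # 272.0 — one active bin per coordinate per frame
```

Treating the result as a multi-channel image is what lets the method reuse a pretrained image VAE and a DiT backbone unchanged.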

πŸ“ Abstract
Recent pose-to-video models can translate 2D pose sequences into photorealistic, identity-preserving dance videos, so the key challenge is to generate temporally coherent, rhythm-aligned 2D poses from music, especially under complex, high-variance in-the-wild distributions. We address this by reframing music-to-dance generation as a music-token-conditioned multi-channel image synthesis problem: 2D pose sequences are encoded as one-hot images, compressed by a pretrained image VAE, and modeled with a DiT-style backbone, allowing us to inherit architectural and training advances from modern text-to-image models and better capture high-variance 2D pose distributions. On top of this formulation, we introduce (i) a time-shared temporal indexing scheme that explicitly synchronizes music tokens and pose latents over time and (ii) a reference-pose conditioning strategy that preserves subject-specific body proportions and on-screen scale while enabling long-horizon segment-and-stitch generation. Experiments on a large in-the-wild 2D dance corpus and the calibrated AIST++ 2D benchmark show consistent improvements over representative music-to-dance methods in pose- and video-space metrics and human preference, and ablations validate the contributions of the representation, temporal indexing, and reference conditioning. See the supplementary videos at https://hot-dance.github.io
Problem

Research questions and friction points this paper is trying to address.

Generate rhythm-aligned 2D dance poses from music
Handle complex in-the-wild pose distributions effectively
Ensure temporal coherence and subject-specific body proportions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reframes dance generation as multi-channel image synthesis
Uses time-shared indexing to synchronize music and pose
Employs reference-pose conditioning for consistent body proportions
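The time-shared indexing idea in the second bullet can be illustrated by mapping both streams onto one temporal grid, so that co-temporal music tokens and pose latents receive the same positional index. The sampling rates and grid resolution below are hypothetical values for illustration, not figures from the paper.

```python
def shared_time_index(positions, rate, resolution=30.0):
    """Map stream positions (token or latent indices) sampled at `rate`
    items per second onto a shared temporal grid with `resolution`
    indices per second. Items from different streams that cover the
    same instant land on the same index. (Sketch; rates are assumed.)
    """
    return [int(p / rate * resolution) for p in positions]

# hypothetical rates: music tokens at 50/s, pose latents at 7.5/s,
# both covering the same 2-second window
music_idx = shared_time_index(range(100), rate=50.0)
pose_idx = shared_time_index(range(15), rate=7.5)
print(music_idx[0], pose_idx[0])     # 0 0
print(music_idx[-1], pose_idx[-1])   # 59 56
```

Because the index encodes wall-clock time rather than sequence position, attention between a music token and a pose latent with matching indices directly expresses "these happen at the same moment," which is what the alignment mechanism exploits.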
πŸ”Ž Similar Papers
No similar papers found.
Authors

Yan Zhang, Global Business Unit, Baidu Inc.
Han Zou, Meta
Lincong Feng, Global Business Unit, Baidu Inc.
Cong Xie, ByteDance Inc., University of Illinois at Urbana-Champaign
Ruiqi Yu, Global Business Unit, Baidu Inc.
Zhenpeng Zhan, Global Business Unit, Baidu Inc.