DanceTogether! Identity-Preserving Multi-Person Interactive Video Generation

📅 2025-05-23
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing video generation methods suffer from identity drift and appearance entanglement in complex multi-person interactions (e.g., dancing, martial arts), primarily due to insufficient identity-action disentanglement under noisy pose control signals. To address this, the authors propose DanceTogether, the first end-to-end diffusion framework that generates long-duration, high-fidelity multi-person videos from a single reference image and multiple pose-mask streams while strictly preserving each individual's identity. Its core component, the MaskPoseAdapter, performs robust cross-modal fusion of object-tracking masks and semantic pose heatmaps at every denoising step. The work also introduces two new datasets, the two-person interaction dataset PairFS-4K and the cross-domain HumanRob-300, together with the three-track evaluation suite TogetherVideoBench. Experiments demonstrate significant improvements over state-of-the-art methods, and high-quality human-robot interaction videos are obtained after only one hour of fine-tuning. Ablation studies confirm that sustained identity-action binding is the key driver of these gains.

📝 Abstract
Controllable video generation (CVG) has advanced rapidly, yet current systems falter when more than one actor must move, interact, and exchange positions under noisy control signals. We address this gap with DanceTogether, the first end-to-end diffusion framework that turns a single reference image plus independent pose-mask streams into long, photorealistic videos while strictly preserving every identity. A novel MaskPoseAdapter binds "who" and "how" at every denoising step by fusing robust tracking masks with semantically rich, but noisy, pose heat-maps, eliminating the identity drift and appearance bleeding that plague frame-wise pipelines. To train and evaluate at scale, we introduce (i) PairFS-4K, 26 hours of dual-skater footage with 7,000+ distinct IDs, (ii) HumanRob-300, a one-hour humanoid-robot interaction set for rapid cross-domain transfer, and (iii) TogetherVideoBench, a three-track benchmark centered on the DanceTogEval-100 test suite covering dance, boxing, wrestling, yoga, and figure skating. On TogetherVideoBench, DanceTogether outperforms prior art by a significant margin. Moreover, we show that a one-hour fine-tune yields convincing human-robot videos, underscoring broad generalization to embodied-AI and HRI tasks. Extensive ablations confirm that persistent identity-action binding is critical to these gains. Together, our model, datasets, and benchmark lift CVG from single-subject choreography to compositionally controllable, multi-actor interaction, opening new avenues for digital production, simulation, and embodied intelligence. Our video demos and code are available at https://DanceTog.github.io/.
Problem

Research questions and friction points this paper is trying to address.

Generating multi-actor videos with preserved identities under noisy controls
Preventing identity drift and appearance bleeding in interactive video generation
Enabling compositional control for multi-subject interactions in digital production
Innovation

Methods, ideas, or system contributions that make the work stand out.

End-to-end diffusion framework for multi-person video generation
MaskPoseAdapter fuses tracking masks with pose heat-maps
Large-scale datasets and benchmark for training and evaluation
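The core idea behind the MaskPoseAdapter, binding "who" (a robust per-actor tracking mask) to "how" (that actor's noisy pose heat-maps) before the fused signal conditions the diffusion model, can be sketched as follows. The function name, tensor shapes, and the simple gate-by-multiplication fusion are illustrative assumptions for intuition only, not the paper's actual implementation.

```python
import numpy as np

def fuse_mask_pose(track_masks: np.ndarray, pose_heatmaps: np.ndarray) -> np.ndarray:
    """Illustrative mask-pose fusion (hypothetical, not the paper's module).

    track_masks:   (N, H, W) binary masks, one per actor ("who").
    pose_heatmaps: (N, K, H, W) keypoint heat-maps per actor ("how"),
                   assumed noisy, e.g. with spurious off-body responses.
    Returns an (N*K, H, W) conditioning stack in which each actor's
    heat-maps are gated by that actor's tracking mask, so pose evidence
    outside an actor's region cannot bleed onto another identity.
    """
    n, k, h, w = pose_heatmaps.shape
    # Broadcast each (H, W) mask over that actor's K keypoint channels.
    gated = pose_heatmaps * track_masks[:, None, :, :]
    # Flatten actors and keypoints into one channel axis for conditioning.
    return gated.reshape(n * k, h, w)
```

In a real diffusion pipeline this gated stack would be injected at every denoising step, which is what sustains the identity-action binding the ablations highlight; the sketch only shows the spatial gating itself.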
Authors
Junhao Chen (Tsinghua University)
Mingjin Chen (Beijing Normal–Hong Kong Baptist University)
Jianjin Xu (CMU Robotics Institute)
Xiang Li (Peking University)
Junting Dong (Zhejiang University)
Mingze Sun (Tsinghua University)
Puhua Jiang (Tencent)
Hongxiang Li (Peking University)
Yuhang Yang (University of Science & Technology of China)
Hao Zhao (Tsinghua University)
Xiaoxiao Long (Nanjing University)
Ruqi Huang (Tsinghua Shenzhen International Graduate School)