X-Dancer: Expressive Music to Human Dance Video Generation

📅 2025-02-24

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This paper introduces X-Dancer, a zero-shot music-driven 2D dance video generation method that produces long, expressive, beat-aligned, and realistic dance videos from a single static image and a music track. The approach uses a unified transformer-diffusion framework. First, an autoregressive transformer generates music-synchronized, tokenized 2D pose sequences, using a spatially compositional pose representation and global attention over both musical style and prior motion context. Second, a diffusion model conditioned on those pose tokens through AdaIN animates the reference image into video frames. The pipeline is fully differentiable end to end and learns from readily available monocular videos rather than 3D motion data. The reported experiments show X-Dancer outperforming state-of-the-art methods in diversity, expressiveness, and realism. The authors state that code and models will be made available for research purposes.

📝 Abstract

We present X-Dancer, a novel zero-shot music-driven image animation pipeline that creates diverse and long-range lifelike human dance videos from a single static image. At its core, we introduce a unified transformer-diffusion framework, featuring an autoregressive transformer model that synthesizes extended and music-synchronized token sequences for 2D body, head, and hand poses, which then guide a diffusion model to produce coherent and realistic dance video frames. Unlike traditional methods that primarily generate human motion in 3D, X-Dancer addresses data limitations and enhances scalability by modeling a wide spectrum of 2D dance motions, capturing their nuanced alignment with musical beats through readily available monocular videos. To achieve this, we first build a spatially compositional token representation from 2D human pose labels associated with keypoint confidences, encoding both large articulated body movements (e.g., upper and lower body) and fine-grained motions (e.g., head and hands). We then design a music-to-motion transformer model that autoregressively generates music-aligned dance pose token sequences, incorporating global attention to both musical style and prior motion context. Finally, we leverage a diffusion backbone to animate the reference image with these synthesized pose tokens through AdaIN, forming a fully differentiable end-to-end framework. Experimental results demonstrate that X-Dancer is able to produce both diverse and characterized dance videos, substantially outperforming state-of-the-art methods in terms of diversity, expressiveness, and realism. Code and model will be available for research purposes.
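The abstract mentions AdaIN as the mechanism that injects pose tokens into the diffusion backbone. As a rough illustration only, the sketch below shows generic adaptive instance normalization on plain Python lists: each feature channel is normalized to zero mean and unit variance, then rescaled and shifted by per-channel values. The function name `adain`, the list-based layout, and the idea that `scale` and `shift` would be predicted from pose tokens are assumptions made for this example, not details taken from the paper.

```python
import math
from statistics import fmean


def adain(features, scale, shift, eps=1e-5):
    """Generic adaptive instance normalization (illustrative sketch).

    features: list of channels, each a flat list of floats (one feature map).
    scale, shift: one modulation value per channel. In a pose-conditioned
    model these might be decoded from pose tokens (assumption, not from
    the paper).
    """
    if not (len(features) == len(scale) == len(shift)):
        raise ValueError("need one scale and one shift value per channel")
    out = []
    for channel, gamma, beta in zip(features, scale, shift):
        mean = fmean(channel)
        var = fmean([(v - mean) ** 2 for v in channel])
        inv_std = 1.0 / math.sqrt(var + eps)
        # Normalize the channel, then apply the conditioning statistics.
        out.append([gamma * (v - mean) * inv_std + beta for v in channel])
    return out


if __name__ == "__main__":
    feats = [[1.0, 2.0, 3.0, 4.0], [10.0, 10.0, 10.0, 12.0]]
    modulated = adain(feats, scale=[2.0, 1.0], shift=[5.0, -1.0])
    print([round(fmean(ch), 3) for ch in modulated])
```

After modulation, each channel's mean equals its `shift` value and its standard deviation is approximately its `scale` value, which is how the conditioning signal controls the statistics of the image features.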

Problem

Research questions and friction points this paper is trying to address.

Generates dance videos from static images

Synchronizes 2D dance motions with music

Enhances scalability using monocular video data

Innovation

Methods, ideas, or system contributions that make the work stand out.

Transformer-diffusion framework

Music-synchronized pose sequences

Differentiable end-to-end animation
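To make the idea of autoregressive, music-conditioned pose token generation concrete, here is a toy sampling loop. It is a sketch under strong simplifications: one token per frame, a hand-written scoring function `toy_logits` standing in for the transformer, and a scalar "beat strength" per frame as the music feature. All of these names and choices are hypothetical and are not taken from the paper.

```python
import math
import random


def generate_pose_tokens(music_features, next_token_logits, num_frames, seed=0):
    """Sample one pose token per frame, conditioned on music and prior tokens."""
    rng = random.Random(seed)
    tokens = []
    for t in range(num_frames):
        # The scoring function sees the music and every token generated so far.
        logits = next_token_logits(music_features, tokens, t)
        peak = max(logits)
        weights = [math.exp(value - peak) for value in logits]  # stable softmax
        tokens.append(rng.choices(range(len(logits)), weights=weights, k=1)[0])
    return tokens


def toy_logits(music_features, prior_tokens, t, vocab_size=4):
    """Hypothetical stand-in for a music-to-motion transformer."""
    logits = [0.0] * vocab_size
    # Stronger beats push harder toward the token associated with frame t.
    logits[t % vocab_size] += music_features[t]
    if prior_tokens:
        # Mildly discourage repeating the previous pose token.
        logits[prior_tokens[-1]] -= 1.0
    return logits


if __name__ == "__main__":
    beat_strength = [0.5, 2.0, 0.1, 1.5, 0.3, 2.5]
    print(generate_pose_tokens(beat_strength, toy_logits, num_frames=6, seed=1))
```

In a real system the scoring function would be a trained model and the sampled tokens would be decoded into pose conditioning for the video generator; the loop structure, where each step depends on the music and the full token history, is the part this sketch is meant to show.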

🔎 Similar Papers

Flexible Music-Conditioned Dance Generation with Style Description Prompts