Every Image Listens, Every Image Dances: Music-Driven Image Animation

📅 2025-01-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing music-driven dance video generation methods suffer from complex modeling, reliance on explicit motion priors, and insufficient beat alignment and text fidelity. This paper introduces the first end-to-end music-and-text dual-conditioned single-image animation framework, eliminating the need for intermediate representations (e.g., pose or depth maps) and enabling zero-shot, personalized animation generation. The key contributions are: (1) construction of the first large-scale music–text paired dataset, comprising 2,904 dance videos; (2) a diffusion-based multimodal temporal modeling architecture that jointly encodes audio rhythmic features and textual semantics, augmented with spatiotemporal consistency optimization; and (3) experiments showing that the model outperforms state-of-the-art baselines in generation quality, beat alignment accuracy, and text-condition fidelity, establishing a new benchmark for music-driven image animation.

📝 Abstract
Image animation has become a promising area in multimodal research, with a focus on generating videos from reference images. While prior work has largely emphasized generic video generation guided by text, music-driven dance video generation remains underexplored. In this paper, we introduce MuseDance, an innovative end-to-end model that animates reference images using both music and text inputs. This dual input enables MuseDance to generate personalized videos that follow text descriptions and synchronize character movements with the music. Unlike existing approaches, MuseDance eliminates the need for complex motion guidance inputs, such as pose or depth sequences, making flexible and creative video generation accessible to users of all expertise levels. To advance research in this field, we present a new multimodal dataset comprising 2,904 dance videos with corresponding background music and text descriptions. Our approach leverages diffusion-based methods to achieve robust generalization, precise control, and temporal consistency, setting a new baseline for the music-driven image animation task.
Problem

Research questions and friction points this paper is trying to address.

Music-driven Dance Generation
Synchronization
Video Generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

MuseDance
diffusion-based technique
personalized dance video generation
Authors

Zhikang Dong, Stony Brook University
Weituo Hao, ByteDance
Ju-Chiang Wang, ByteDance (Music AI, Music Information Retrieval, Machine Learning)
Peng Zhang, Apple
Paweł Polak, Stony Brook University