MusicInfuser: Making Video Diffusion Listen and Dance

📅 2025-03-18
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses two obstacles in music-driven dance video generation: weak beat-level synchronization between music and motion, and reliance on motion-capture annotations. We propose a lightweight music-video cross-attention mechanism combined with LoRA (Low-Rank Adaptation) that gives pre-trained video diffusion models music understanding and rhythm-alignment ability without architectural modification, fine-tuning only on unlabeled dance videos. Our method significantly improves generated videos in rhythmic consistency, motion diversity, and visual fidelity. We further introduce a multidimensional automatic evaluation framework based on Video-LLMs, which quantitatively validates that our approach achieves state-of-the-art performance in both the semantic plausibility of dance motions and music-motion alignment. This work establishes an efficient, scalable paradigm for music-conditioned video generation.

📝 Abstract
We introduce MusicInfuser, an approach for generating high-quality dance videos that are synchronized to a specified music track. Rather than attempting to design and train a new multimodal audio-video model, we show how existing video diffusion models can be adapted to align with musical inputs by introducing lightweight music-video cross-attention and a low-rank adapter. Unlike prior work requiring motion capture data, our approach fine-tunes only on dance videos. MusicInfuser achieves high-quality music-driven video generation while preserving the flexibility and generative capabilities of the underlying models. We introduce an evaluation framework using Video-LLMs to assess multiple dimensions of dance generation quality. The project page and code are available at https://susunghong.github.io/MusicInfuser.
Problem

Research questions and friction points this paper is trying to address.

Generating music-synchronized dance videos
Adapting video diffusion models with music inputs
Evaluating dance generation quality using Video-LLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adapts video diffusion models with cross-attention
Uses low-rank adapter for music-video alignment
Fine-tunes on dance videos without motion capture
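The core idea in the bullets above can be sketched in a few lines. The following is a minimal, hypothetical NumPy illustration (not the authors' code) of a frozen cross-attention layer whose query projection receives a LoRA-style low-rank update, letting video latent tokens attend to music features; the dimensions and initialization scheme are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

d_model, d_music, rank = 64, 32, 4
T_video, T_audio = 16, 40  # video latent tokens, music feature frames

# Frozen projections, standing in for the pretrained diffusion model's weights
W_q = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
W_k = rng.standard_normal((d_music, d_model)) / np.sqrt(d_music)
W_v = rng.standard_normal((d_music, d_model)) / np.sqrt(d_music)

# LoRA factors: only these small matrices would be trained. B_q starts
# at zero, so the adapted layer initially matches the frozen one.
A_q = rng.standard_normal((d_model, rank)) * 0.01
B_q = np.zeros((rank, d_model))

def music_cross_attention(video_tokens, music_feats):
    # Queries come from video latents, with the low-rank update added to W_q;
    # keys and values come from the music features.
    q = video_tokens @ (W_q + A_q @ B_q)
    k = music_feats @ W_k
    v = music_feats @ W_v
    attn = softmax(q @ k.T / np.sqrt(d_model))  # (T_video, T_audio)
    return attn @ v                             # music-conditioned update

video_tokens = rng.standard_normal((T_video, d_model))
music_feats = rng.standard_normal((T_audio, d_music))
out = music_cross_attention(video_tokens, music_feats)
print(out.shape)  # (16, 64)
```

Because `B_q` is initialized to zero, the adapter leaves the pretrained model's behavior unchanged at the start of fine-tuning, which is the usual motivation for the LoRA parameterization.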