🤖 AI Summary
This work addresses two limitations of music-driven dance video generation: poor beat-level synchronization between music and motion, and reliance on motion-capture annotations. We propose a lightweight music-video cross-attention mechanism combined with LoRA (Low-Rank Adaptation) that endows pre-trained video diffusion models with music understanding and rhythm-alignment capability without architectural modification, fine-tuning only on unlabeled dance videos. Our method substantially improves the rhythmic consistency, motion diversity, and visual fidelity of generated videos. We further introduce a multidimensional automatic evaluation framework based on Video-LLMs, which quantitatively shows that our approach achieves state-of-the-art performance in both the semantic plausibility of dance motions and music-motion alignment. This work establishes an efficient, scalable paradigm for music-conditioned video generation.
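To make the cross-attention idea concrete, here is a minimal sketch of music-video cross-attention in plain Python: video tokens form the queries while music features supply the keys and values, so each video token becomes a music-conditioned mixture. All dimensions, names, and the identity projections are illustrative assumptions, not the paper's actual implementation.

```python
import math

def matmul(a, b):
    """Multiply two matrices represented as lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    e = [math.exp(v - m) for v in row]
    s = sum(e)
    return [v / s for v in e]

def cross_attention(video_tokens, music_feats, d):
    """Video tokens attend to music features (single head, toy dims)."""
    # Identity projections for brevity; a real layer would learn W_q, W_k, W_v.
    q, k, v = video_tokens, music_feats, music_feats
    k_t = [list(col) for col in zip(*k)]               # transpose K
    scores = matmul(q, k_t)                            # (n_video, n_music)
    attn = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(attn, v)                             # music-conditioned tokens

video = [[1.0, 0.0], [0.0, 1.0]]   # two video tokens, feature dim d = 2
music = [[1.0, 0.0], [0.5, 0.5]]   # two music feature vectors
out = cross_attention(video, music, d=2)
```

Because each output row is a convex combination of the music feature rows, conditioning information flows into every video token while the video backbone itself stays untouched.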
📝 Abstract
We introduce MusicInfuser, an approach for generating high-quality dance videos that are synchronized to a specified music track. Rather than attempting to design and train a new multimodal audio-video model, we show how existing video diffusion models can be adapted to align with musical inputs by introducing lightweight music-video cross-attention and a low-rank adapter. Unlike prior work requiring motion capture data, our approach fine-tunes only on dance videos. MusicInfuser achieves high-quality music-driven video generation while preserving the flexibility and generative capabilities of the underlying models. We introduce an evaluation framework using Video-LLMs to assess multiple dimensions of dance generation quality. The project page and code are available at https://susunghong.github.io/MusicInfuser.
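The low-rank adapter mentioned above can be sketched in a few lines: the frozen base weight W is augmented with a trainable update (alpha/r) * A @ B of rank r, and since B starts at zero, the adapted layer initially reproduces the base model exactly. This is a generic LoRA sketch under assumed shapes and names, not the MusicInfuser code.

```python
def matmul(a, b):
    """Multiply two matrices represented as lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def lora_linear(x, w, a, b, alpha=8.0):
    """y = x @ (W + (alpha/r) * A @ B), where r is the adapter rank."""
    r = len(a[0])                           # A: (d_in, r), B: (r, d_out)
    scale = alpha / r
    delta = [[scale * v for v in row] for row in matmul(a, b)]
    w_adapted = [[w[i][j] + delta[i][j] for j in range(len(w[0]))]
                 for i in range(len(w))]
    return matmul(x, w_adapted)

x = [[1.0, 2.0]]                  # one input token, d_in = 2
w = [[0.5, 0.0], [0.0, 0.5]]      # frozen base weight (d_in=2, d_out=2)
a0 = [[0.1], [0.2]]               # A: (2, 1), rank r = 1
b0 = [[0.0, 0.0]]                 # B initialized to zero (standard LoRA init)
y = lora_linear(x, w, a0, b0)     # equals x @ W at initialization
```

Only A and B are updated during fine-tuning, which is why the underlying model's generative capabilities are preserved while it acquires music conditioning.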