🤖 AI Summary
This work addresses the challenge of automatic audio-visual temporal alignment in music-driven video editing—where manual editing is often required and risks compromising semantic and structural continuity. We propose MVAA, a two-stage framework that decouples alignment into (1) motion-beat matching and (2) rhythm-aware video inpainting. First, musical beats are detected and temporally aligned with salient motion frames in the input video. Then, a frame-conditioned diffusion model—built upon CogVideoX-5B-I2V—generates photorealistic, temporally coherent intermediate frames synchronized to the beat structure. Crucially, MVAA integrates pretrained knowledge with efficient inference-time fine-tuning, enabling personalized adaptation on a single GPU within 10 minutes. Experiments across diverse scenarios demonstrate high-precision beat alignment (mean temporal error < 0.12 s) and strong visual continuity, significantly improving both editing efficiency and audio-visual synchronization fidelity.
📝 Abstract
Aligning the rhythm of visual motion in a video with a given music track is a practical need in multimedia production, yet it remains an underexplored task in autonomous video editing. Effective alignment between motion and musical beats enhances viewer engagement and visual appeal, particularly in music videos, promotional content, and cinematic editing. Existing methods typically depend on labor-intensive manual cutting, speed adjustments, or heuristic-based editing techniques to achieve synchronization. While some generative models handle joint video and music generation, they often entangle the two modalities, limiting the flexibility to align video to music beats while preserving the full visual content. In this paper, we propose a novel and efficient framework, termed MVAA (Music-Video Auto-Alignment), that automatically edits a video to align with the rhythm of a given music track while preserving the original visual content. To enhance flexibility, MVAA modularizes the task into a two-step process: aligning motion keyframes with audio beats, followed by rhythm-aware video inpainting. Specifically, we first insert keyframes at timestamps aligned with musical beats, then use a frame-conditioned diffusion model to generate coherent intermediate frames, preserving the original video's semantic content. Since comprehensive test-time training can be time-consuming, we adopt a two-stage strategy: pretraining the inpainting module on a small video set to learn general motion priors, followed by rapid inference-time fine-tuning for video-specific adaptation. This hybrid approach enables adaptation in about 10 minutes (one epoch) on a single NVIDIA RTX 4090 GPU using CogVideoX-5B-I2V as the backbone. Extensive experiments show that our approach achieves high-quality beat alignment and visual smoothness.
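The first stage (matching motion keyframes to audio beats) can be illustrated with a minimal sketch. The helpers below are hypothetical, not the paper's implementation: motion saliency is approximated as a per-frame motion-magnitude signal (e.g., from frame differencing), and each detected musical beat is greedily paired with the nearest motion peak.

```python
import numpy as np

def detect_motion_peaks(motion, fps):
    """Return timestamps (s) of local maxima in a per-frame motion signal."""
    peaks = [i for i in range(1, len(motion) - 1)
             if motion[i] > motion[i - 1] and motion[i] >= motion[i + 1]]
    return np.array(peaks) / fps

def align_beats_to_peaks(beat_times, peak_times):
    """For each musical beat time, pick the nearest motion peak (greedy)."""
    return [(b, float(peak_times[int(np.argmin(np.abs(peak_times - b)))]))
            for b in beat_times]

# Toy example: a 30 fps, 3-second clip with motion spikes near 1.0 s and 2.0 s
fps = 30
motion = np.zeros(90)
motion[30] = 1.0  # salient motion at 1.0 s
motion[60] = 1.0  # salient motion at 2.0 s
peaks = detect_motion_peaks(motion, fps)
pairs = align_beats_to_peaks(np.array([0.95, 2.05]), peaks)
# pairs → [(0.95, 1.0), (2.05, 2.0)]
```

In the full framework, the paired timestamps would mark where keyframes are inserted; the frame-conditioned diffusion model then inpaints the intermediate frames between them.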