🤖 AI Summary
Existing video-to-music generation methods struggle with weak music-video correspondence and limited generative diversity, largely due to inadequate feature-alignment techniques and insufficient datasets. To address these issues, the authors propose GVMGen, a hierarchical-attention framework for general video-to-music generation: (1) hierarchical attention extracts video features and aligns them with music in both the spatial and temporal dimensions, preserving pertinent features while minimizing redundancy; (2) the model generates multi-style music from diverse video inputs, even in zero-shot scenarios; and (3) the authors compile a large-scale dataset of diverse video-music pairs and propose an evaluation model with two novel objective metrics for assessing video-music alignment. Experiments show that GVMGen surpasses previous models in music-video correspondence, generative diversity, and generality across applications, offering a practical approach to automatic video background music generation.
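To make the alignment idea concrete, here is a minimal PyTorch sketch of one plausible reading of spatial-then-temporal hierarchical attention conditioning a music decoder via cross-attention. All module names, dimensions, and the mean-pooling choice are illustrative assumptions for this summary, not the paper's actual implementation.

```python
# Sketch: spatial attention within each frame, temporal attention across
# frames, then cross-attention from music tokens to video tokens.
# Names/dims are hypothetical; this is not GVMGen's real architecture.
import torch
import torch.nn as nn

class HierarchicalVideoEncoder(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        # Spatial attention: attend across patches within a single frame.
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Temporal attention: attend across frames after spatial pooling.
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        # x: (batch, frames, patches, dim) patch embeddings per frame.
        b, t, p, d = x.shape
        frames = x.reshape(b * t, p, d)
        frames, _ = self.spatial_attn(frames, frames, frames)
        # Mean-pool patches to one token per frame, then model frame order.
        tokens = frames.mean(dim=1).reshape(b, t, d)
        tokens, _ = self.temporal_attn(tokens, tokens, tokens)
        return tokens  # (batch, frames, dim) video conditioning tokens

class MusicDecoderBlock(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                nn.Linear(4 * dim, dim))

    def forward(self, music, video):
        # music: (batch, music_tokens, dim); video: (batch, frames, dim).
        # Causal masking for autoregressive decoding is omitted for brevity.
        music = music + self.self_attn(music, music, music)[0]
        # Cross-attend each music token to the aligned video tokens.
        music = music + self.cross_attn(music, video, video)[0]
        return music + self.ff(music)

video = torch.randn(2, 16, 49, 512)  # 16 frames of 7x7 patch embeddings
music = torch.randn(2, 250, 512)     # e.g. audio-codec token embeddings
cond = HierarchicalVideoEncoder()(video)
out = MusicDecoderBlock()(music, cond)
print(out.shape)  # torch.Size([2, 250, 512])
```

Decoupling the spatial and temporal stages keeps attention cost at roughly O(p²) per frame plus O(t²) across frames, rather than O((t·p)²) for joint attention over all patch tokens.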
📝 Abstract
Composing music for video is essential yet challenging, which has led to growing interest in automating music generation for video applications. Existing approaches often struggle to achieve robust music-video correspondence and generative diversity, primarily due to inadequate feature-alignment methods and insufficient datasets. In this study, we present the General Video-to-Music Generation model (GVMGen), designed to generate music that is highly relevant to the video input. Our model employs hierarchical attention to extract video features and align them with music in both the spatial and temporal dimensions, preserving pertinent features while minimizing redundancy. Remarkably, our method is versatile, capable of generating multi-style music from different video inputs, even in zero-shot scenarios. We also propose an evaluation model along with two novel objective metrics for assessing video-music alignment. Additionally, we have compiled a large-scale dataset comprising diverse types of video-music pairs. Experimental results demonstrate that GVMGen surpasses previous models in music-video correspondence, generative diversity, and generality across applications.
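The abstract does not specify how the two objective alignment metrics are computed. As a point of reference only, the sketch below shows one common recipe for an embedding-based video-music alignment score (cosine similarity in a shared embedding space); the encoder inputs are hypothetical stand-ins, not the paper's evaluation model.

```python
# Illustrative video-music alignment score: cosine similarity between
# clip-level embeddings assumed to come from pretrained video and music
# encoders projected into a shared space. Not the paper's actual metrics.
import torch
import torch.nn.functional as F

def alignment_score(video_emb: torch.Tensor, music_emb: torch.Tensor) -> torch.Tensor:
    """Per-pair cosine similarity of L2-normalized embeddings.

    video_emb, music_emb: (batch, dim) tensors.
    Returns scores in [-1, 1]; higher means tighter alignment.
    """
    v = F.normalize(video_emb, dim=-1)
    m = F.normalize(music_emb, dim=-1)
    return (v * m).sum(dim=-1)

# Usage with random stand-in embeddings:
v = torch.randn(4, 512)
m = torch.randn(4, 512)
print(alignment_score(v, m))  # tensor of 4 per-pair scores
```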