🤖 AI Summary
Existing video-to-music (V2M) generation methods rely on static visual or textual features and offer no fine-grained control over musical style, emotion, or rhythm, which leads to poor temporal alignment and weak fidelity to user intent. To address this, we propose a multi-condition V2M framework that jointly models visual inputs with multimodal time-varying control signals (e.g., emotional intensity, motion velocity). A two-stage training strategy first pairs a fine-grained feature selection module with progressive temporal alignment attention to learn audiovisual synchronization, then combines a dynamic conditional fusion mechanism with a control-guided decoder to steer composition under multiple conditions. Extensive experiments demonstrate that our method significantly outperforms state-of-the-art approaches in temporal synchronization, semantic consistency, and subjective audio quality, achieving personalized, high-fidelity music generation with precise stylistic and expressive control.
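The summary above centers on a dynamic conditional fusion mechanism that injects time-varying control curves into frame-level visual features. The paper's code is not reproduced here; the following is a minimal PyTorch sketch of what such a fusion step could look like, assuming cross-attention from visual frames to condition tokens with a learned per-frame gate. All names (`DynamicConditionalFusion`, `d_model`, `n_cond`) are illustrative, not the authors' implementation.

```python
# Hypothetical sketch of time-varying condition fusion (not the authors' code).
import torch
import torch.nn as nn

class DynamicConditionalFusion(nn.Module):
    """Fuses frame-level visual features with time-varying control curves
    (e.g., per-frame emotion intensity and motion velocity) via
    cross-attention plus a learned per-frame gate."""

    def __init__(self, d_model: int = 512, n_cond: int = 2, n_heads: int = 8):
        super().__init__()
        # Project scalar control curves (T, n_cond) into the model dimension.
        self.cond_proj = nn.Linear(n_cond, d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Gate decides, per frame, how strongly conditions modulate visuals.
        self.gate = nn.Sequential(nn.Linear(2 * d_model, d_model), nn.Sigmoid())

    def forward(self, visual: torch.Tensor, conds: torch.Tensor) -> torch.Tensor:
        # visual: (B, T, d_model) frame features; conds: (B, T, n_cond) curves.
        c = self.cond_proj(conds)                        # (B, T, d_model)
        attended, _ = self.attn(query=visual, key=c, value=c)
        g = self.gate(torch.cat([visual, attended], dim=-1))
        return visual + g * attended                     # gated residual fusion


if __name__ == "__main__":
    fusion = DynamicConditionalFusion()
    video_feats = torch.randn(2, 100, 512)    # 100 frames of visual features
    curves = torch.rand(2, 100, 2)            # emotion intensity + motion velocity
    print(fusion(video_feats, curves).shape)  # torch.Size([2, 100, 512])
```

The gated residual keeps the visual pathway intact when the gate saturates toward zero, so conditioning can be strengthened or relaxed per frame rather than applied uniformly over the clip.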
📝 Abstract
Music enhances video narratives and emotions, driving demand for automatic video-to-music (V2M) generation. However, existing V2M methods, which rely on visual features alone or with supplementary textual inputs, generate music in a black-box manner and often fail to meet user expectations. To address this challenge, we propose a novel multi-condition guided V2M generation framework that incorporates multiple time-varying conditions for enhanced control over music generation. Our method uses a two-stage training strategy that learns V2M fundamentals and audiovisual temporal synchronization while meeting users' needs for multi-condition control. In the first stage, we introduce a fine-grained feature selection module and a progressive temporal alignment attention mechanism to ensure flexible feature alignment. In the second stage, we develop a dynamic conditional fusion module and a control-guided decoder module to integrate multiple conditions and accurately guide the music composition process. Extensive experiments demonstrate that our method outperforms existing V2M pipelines in both subjective and objective evaluations, significantly enhancing controllability and alignment with user expectations.
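The abstract does not spell out how the progressive temporal alignment attention operates. One plausible reading is an attention window over video frames that starts near each music token's nominally aligned frame and widens as training progresses; the sketch below illustrates that idea in PyTorch. The masking scheme and names (`progressive_alignment_mask`, `progress`) are assumptions for illustration, not the paper's specification.

```python
# Hypothetical sketch of a "progressive temporal alignment" attention mask:
# early in training each music token may only attend to nearby video frames;
# the admissible window widens with training progress. Names are illustrative.
import torch

def progressive_alignment_mask(t_music: int, t_video: int,
                               progress: float) -> torch.Tensor:
    """Boolean mask (True = blocked) of shape (t_music, t_video).

    progress in [0, 1]: 0 restricts attention to roughly aligned frames,
    1 allows full cross-attention over the whole clip.
    """
    # Nominal alignment: music step i corresponds to video frame i * ratio.
    ratio = t_video / t_music
    centers = torch.arange(t_music).float() * ratio           # (t_music,)
    frames = torch.arange(t_video).float()                    # (t_video,)
    dist = (frames[None, :] - centers[:, None]).abs()         # (t_music, t_video)
    # Window grows linearly from about one frame to the full clip length.
    window = 1.0 + progress * t_video
    return dist > window


mask = progressive_alignment_mask(t_music=50, t_video=100, progress=0.1)
# Pass as attn_mask to e.g. nn.MultiheadAttention during the first stage,
# increasing `progress` over the course of training.
```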