🤖 AI Summary
To address the challenge of background music (BGM) generation without paired audio-video data, this paper proposes SONIQUE, presented as the first video-driven BGM generation framework that requires no paired training samples. Methodologically, it first employs a large language model to parse video semantics into fine-grained musical descriptors—including instrumentation, genre, tempo, and emotion—then conditions a U-Net-based conditional diffusion model on these descriptors to synthesize high-fidelity BGM. Cross-modal alignment and music representation learning ensure semantic consistency between the visual input and the generated audio. Contributions include: (1) a semantic-controllable BGM generation framework trained entirely on unpaired data (royalty-free music and independent video sources); (2) multi-dimensional, fine-grained stylistic control over the generated music; and (3) strong subjective evaluation results against supervised baselines, alongside publicly released code and an online demo system.
📝 Abstract
We present SONIQUE, a model for generating background music tailored to video content. Unlike traditional video-to-music generation approaches, which rely heavily on paired audio-visual datasets, SONIQUE leverages unpaired data, combining royalty-free music and independent video sources. By utilizing large language models (LLMs) for video understanding and converting visual descriptions into musical tags, alongside a U-Net-based conditional diffusion model, SONIQUE enables customizable music generation. Users can control specific aspects of the music, such as instruments, genres, tempo, and melodies, ensuring the generated output fits their creative vision. SONIQUE is open-source, with a demo available online.