SONIQUE: Video Background Music Generation Using Unpaired Audio-Visual Data

📅 2024-10-04
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
To address the challenge of generating background music (BGM) for video without paired audio-video data, this paper proposes a video-driven BGM generation framework trained entirely on unpaired sources: royalty-free music and independent videos. Methodologically, it first employs a large language model to parse video semantics into fine-grained musical descriptors—including instrumentation, genre, tempo, and emotion—then conditions a U-Net-based conditional diffusion model on these descriptors to synthesize BGM that fits the video. Contributions include: (1) a BGM generation framework that requires no paired training samples; (2) multi-dimensional, fine-grained stylistic control over the generated music (instruments, genre, tempo, melody); and (3) publicly released code and an online demo system.

📝 Abstract
We present SONIQUE, a model for generating background music tailored to video content. Unlike traditional video-to-music generation approaches, which rely heavily on paired audio-visual datasets, SONIQUE leverages unpaired data, combining royalty-free music and independent video sources. By utilizing large language models (LLMs) for video understanding and converting visual descriptions into musical tags, alongside a U-Net-based conditional diffusion model, SONIQUE enables customizable music generation. Users can control specific aspects of the music, such as instruments, genres, tempo, and melodies, ensuring the generated output fits their creative vision. SONIQUE is open-source, with a demo available online.
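The abstract's pipeline—LLM-derived musical tags conditioning a U-Net diffusion model—can be sketched as follows. This is a minimal illustration, not SONIQUE's actual API: the function names and tag schema are assumptions, and the LLM step is replaced with a toy keyword lookup.

```python
from dataclasses import dataclass, field


@dataclass
class MusicTags:
    """Musical descriptors an LLM would extract from a video caption (hypothetical schema)."""
    instruments: list = field(default_factory=list)
    genre: str = ""
    tempo: str = ""
    mood: str = ""


def caption_to_tags(caption: str) -> MusicTags:
    """Stub for the LLM step: map a visual description to musical tags.
    A real system would prompt an LLM; here a toy keyword lookup stands in."""
    tags = MusicTags()
    if "sunset" in caption or "beach" in caption:
        tags.instruments = ["acoustic guitar", "soft pads"]
        tags.genre, tags.tempo, tags.mood = "ambient", "slow", "calm"
    return tags


def tags_to_prompt(tags: MusicTags) -> str:
    """Flatten tags into the text condition a diffusion model would receive."""
    parts = list(tags.instruments)
    if tags.genre:
        parts.append(tags.genre)
    if tags.tempo:
        parts.append(f"{tags.tempo} tempo")
    if tags.mood:
        parts.append(tags.mood)
    return ", ".join(parts)


prompt = tags_to_prompt(caption_to_tags("a sunset over the beach"))
print(prompt)  # acoustic guitar, soft pads, ambient, slow tempo, calm
```

In the full system this prompt string (or its embedding) would be the conditioning input to the U-Net diffusion model, and a user could override any tag to steer the output.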
Problem

Research questions and friction points this paper is trying to address.

Existing video-to-music approaches rely heavily on paired audio-visual datasets, which are scarce and costly to curate
Generated background music is hard to tailor to a creator's intent (instruments, genre, tempo, melody)
Innovation

Methods, ideas, or system contributions that make the work stand out.

Trains on unpaired data, combining royalty-free music with independent video sources
Uses large language models to convert visual descriptions into musical tags
Conditions a U-Net-based diffusion model on those tags for controllable generation