🤖 AI Summary
To address the challenge of background music (BGM) generation without paired audio-video data, this paper proposes SONIQUE, presented as the first video-driven BGM generation framework that requires no paired training samples. Methodologically, it first employs a large language model to parse video semantics into fine-grained musical descriptors—including instrumentation, genre, tempo, and emotion—then conditions a U-Net-based conditional diffusion model on these descriptors to synthesize high-fidelity BGM. Cross-modal alignment and music representation learning ensure semantic consistency between the visual input and the generated audio. Contributions include: (1) a semantic-controllable BGM generation framework trained entirely on unpaired data (royalty-free music and independent video sources); (2) multi-dimensional, fine-grained stylistic control over the generated music; and (3) strong subjective evaluation results against supervised baselines, alongside publicly released code and an online demo system.
📝 Abstract
We present SONIQUE, a model for generating background music tailored to video content. Unlike traditional video-to-music generation approaches, which rely heavily on paired audio-visual datasets, SONIQUE leverages unpaired data, combining royalty-free music and independent video sources. By utilizing large language models (LLMs) for video understanding and converting visual descriptions into musical tags, alongside a U-Net-based conditional diffusion model, SONIQUE enables customizable music generation. Users can control specific aspects of the music, such as instruments, genres, tempo, and melodies, ensuring the generated output fits their creative vision. SONIQUE is open-source, with a demo available online.