Hear What Matters! Text-conditioned Selective Video-to-Audio Generation

📅 2025-12-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing video-to-audio generation methods struggle to selectively synthesize audio for a specific sound source in multi-object scenes under text guidance: they produce only a mixed track, failing to meet multimedia-production requirements for precise, isolated audio-track control. This paper introduces SelVA, the first text-guided selective video-to-audio generation model. It uses the text prompt as a sound-source selector that dynamically modulates the video encoder to extract semantically aligned visual features, and adds auxiliary tokens that strengthen cross-modal attention while suppressing text-irrelevant activations. To address the scarcity of monaural (single-source) supervision, SelVA combines text-conditional generation, temporal-aware cross-attention, and a self-augmentation scheme. Evaluated on the VGG-MONOAUDIO benchmark, SelVA significantly improves audio fidelity, text-audio semantic alignment, and temporal synchronization, enabling fine-grained sound-source separation and generation in complex scenes.

📝 Abstract
This work introduces a new task, text-conditioned selective video-to-audio (V2A) generation, which produces only the user-intended sound from a multi-object video. This capability is especially crucial in multimedia production, where audio tracks are handled individually for each sound source to allow precise editing, mixing, and creative control. However, current approaches generate a single source-mixed sound at once, largely because visual features are entangled, and region cues or prompts often fail to specify the source. We propose SelVA, a novel text-conditioned V2A model that treats the text prompt as an explicit selector of the target source and modulates the video encoder to distinctly extract prompt-relevant video features. The proposed supplementary tokens promote cross-attention by suppressing text-irrelevant activations with efficient parameter tuning, yielding robust semantic and temporal grounding. SelVA further employs a self-augmentation scheme to overcome the lack of mono-audio-track supervision. We evaluate SelVA on VGG-MONOAUDIO, a curated benchmark of clean single-source videos for this task. Extensive experiments and ablations consistently verify its effectiveness across audio quality, semantic alignment, and temporal synchronization. Code and a demo are available at https://jnwnlee.github.io/selva-demo/.
Problem

Research questions and friction points this paper is trying to address.

Generating only the user-intended sound from a multi-object video
Entangled visual features in video-to-audio generation
Lack of mono audio track supervision
Innovation

Methods, ideas, or system contributions that make the work stand out.

Text-conditioned selective video-to-audio generation as a new task
Text prompt as an explicit selector that modulates the video encoder to extract prompt-relevant features
Self-augmentation scheme to compensate for missing mono audio supervision
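The core mechanism described above — a text prompt that gates which visual features survive before cross-attention — can be illustrated with a toy numpy sketch. None of this code is from the paper: the function name `select_and_attend`, the sigmoid gate, and all shapes are hypothetical stand-ins for SelVA's learned encoder modulation and supplementary tokens; it only shows the general pattern of suppressing text-irrelevant frame features and then cross-attending to what remains.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def select_and_attend(video_feats, text_emb, queries):
    """Toy text-conditioned selection (illustrative, not SelVA's actual model):
    gate each video frame feature by its similarity to the text embedding,
    suppressing text-irrelevant frames, then cross-attend from audio-side
    queries to the gated features."""
    # Per-frame relevance gate in (0, 1): high when the frame aligns with the prompt.
    sim = video_feats @ text_emb / np.sqrt(video_feats.shape[-1])
    gate = 1.0 / (1.0 + np.exp(-sim))          # sigmoid
    gated = video_feats * gate[:, None]        # down-weight irrelevant frames
    # Standard scaled dot-product cross-attention over the gated features.
    attn = softmax(queries @ gated.T / np.sqrt(queries.shape[-1]), axis=-1)
    return attn @ gated

T, D, Q = 16, 32, 8   # frames, feature dim, audio query tokens
video = rng.normal(size=(T, D))
text = rng.normal(size=(D,))
queries = rng.normal(size=(Q, D))
out = select_and_attend(video, text, queries)
print(out.shape)  # (8, 32)
```

In the actual model the selection is learned (the prompt modulates the video encoder itself, and supplementary tokens shape the attention map); the fixed similarity gate here is only a minimal analogue of that behavior.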