Step-by-Step Video-to-Audio Synthesis via Negative Audio Guidance

📅 2025-06-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses video-driven multi-track audio synthesis: generating multiple semantically independent, high-fidelity audio tracks (e.g., footsteps, impact sounds) from a single input video to improve the realism and controllability of the composite audio. We propose a stepwise generation framework with negative audio guidance, built on a pretrained video-to-audio foundation model. Using a target text prompt and the previously synthesized tracks as conditional inputs, the method generates one sound-event track per step. Inspired by concept negation, the negative guidance mechanism steers each new track away from already-generated content, yielding semantic disentanglement and more complete sound coverage without requiring paired multi-track audio-video training data. Experiments show that the approach significantly outperforms existing baselines in track-level semantic separation, audio fidelity, and cross-track temporal consistency.

📝 Abstract
We propose a novel step-by-step video-to-audio generation method that sequentially produces individual audio tracks, each corresponding to a specific sound event in the video. Our approach mirrors traditional Foley workflows, aiming to capture all sound events induced by a given video comprehensively. Each generation step is formulated as a guided video-to-audio synthesis task, conditioned on a target text prompt and previously generated audio tracks. This design is inspired by the idea of concept negation from prior compositional generation frameworks. To enable this guided generation, we introduce a training framework that leverages pre-trained video-to-audio models and eliminates the need for specialized paired datasets, allowing training on more accessible data. Experimental results demonstrate that our method generates multiple semantically distinct audio tracks for a single input video, leading to higher-quality composite audio synthesis than existing baselines.
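The concept-negation guidance described above can be sketched as a variant of classifier-free guidance: at each denoising step, the model is queried with the target text prompt, with no condition, and with the previously generated tracks as a negative condition, and the negative direction is subtracted from the update. The function name, signature, and guidance weights below are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def negatively_guided_eps(eps_text, eps_uncond, eps_neg_audio,
                          w_text=4.0, w_neg=2.0):
    """Combine denoiser outputs at one sampling step (illustrative sketch).

    eps_text:      noise estimate conditioned on the video + target text prompt
    eps_uncond:    unconditional noise estimate
    eps_neg_audio: noise estimate conditioned on the previously generated
                   tracks (the "negative" audio concept to steer away from)
    w_text, w_neg: guidance weights (assumed values for illustration)
    """
    return (eps_uncond
            + w_text * (eps_text - eps_uncond)        # pull toward the prompt
            - w_neg * (eps_neg_audio - eps_uncond))   # push away from prior tracks
```

Under this reading, each new track would be sampled with all already-synthesized tracks supplied as the negative condition, so successive tracks cover sound events the earlier ones missed.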
Problem

Research questions and friction points this paper is trying to address.

Generates multiple audio tracks from video step-by-step
Mimics Foley workflows for comprehensive sound capture
Uses text prompts and prior audio for guided synthesis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sequential video-to-audio synthesis with text prompts
Negative audio guidance for distinct sound generation
Pre-trained model adaptation without specialized datasets