🤖 AI Summary
This work addresses video-driven multi-track audio synthesis: generating multiple semantically independent, high-fidelity audio tracks (e.g., footsteps, impact sounds) from a single input video to enhance the realism and controllability of the composite audio. We propose a negative-audio-guided stepwise generation framework built on a pretrained video-to-audio foundation model. Conditioning on a target text prompt and the previously synthesized tracks, the method generates one sound-event track per step. Inspired by concept negation, we design a negative guidance mechanism that achieves semantic disentanglement across tracks and improves the completeness of the composite audio, without requiring specially paired audio-video training data. Experiments demonstrate that our approach significantly outperforms existing baselines in track-level semantic separation, audio fidelity, and cross-track temporal consistency.
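The summary does not spell out the guidance rule, but concept negation is commonly realized in diffusion sampling as classifier-free guidance with a subtracted negative condition. Below is a minimal sketch of that combination, assuming a generic denoiser `eps_model` and hypothetical conditioning inputs (`video_cond`, `text_cond`, `prev_tracks_cond`); the weights are illustrative, not the paper's actual formulation:

```python
def negative_guided_eps(eps_model, x_t, t,
                        video_cond, text_cond, prev_tracks_cond,
                        w_pos=4.0, w_neg=2.0):
    """Classifier-free guidance with a negated audio concept (sketch).

    Steers denoising toward the target event (video + text prompt) and
    away from sound already covered by previously generated tracks.
    `eps_model` and the conditioning inputs are hypothetical stand-ins.
    """
    eps_uncond = eps_model(x_t, t, cond=None)                  # unconditional
    eps_pos = eps_model(x_t, t, cond=(video_cond, text_cond))  # target event
    eps_neg = eps_model(x_t, t, cond=prev_tracks_cond)         # tracks to negate
    # Amplify the positive direction, subtract the negative one.
    return (eps_uncond
            + w_pos * (eps_pos - eps_uncond)
            - w_neg * (eps_neg - eps_uncond))
```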
📝 Abstract
We propose a novel step-by-step video-to-audio generation method that sequentially produces individual audio tracks, each corresponding to a specific sound event in the video. Our approach mirrors traditional Foley workflows, aiming to comprehensively capture all sound events induced by a given video. Each generation step is formulated as a guided video-to-audio synthesis task, conditioned on a target text prompt and the previously generated audio tracks. This design is inspired by the idea of concept negation from prior compositional generation frameworks. To enable this guided generation, we introduce a training framework that leverages pretrained video-to-audio models and eliminates the need for specialized paired datasets, allowing training on more accessible data. Experimental results demonstrate that our method generates multiple semantically distinct audio tracks for a single input video, yielding higher-quality composite audio than existing baselines.
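As a reading aid, here is one plausible shape for the sequential generation loop the abstract describes; `v2a_sample` is a hypothetical wrapper around the pretrained video-to-audio model, and the event prompts are illustrative:

```python
def generate_multitrack_audio(v2a_sample, video, event_prompts):
    """Step-by-step multi-track video-to-audio generation (sketch).

    Each step synthesizes one sound-event track, conditioned on the
    target text prompt and guided away from the previously generated
    tracks so that events are not duplicated across tracks.
    """
    tracks = []
    for prompt in event_prompts:  # e.g. ["footsteps", "door slam"]
        track = v2a_sample(video=video,
                           positive_prompt=prompt,
                           negative_audio=list(tracks))
        tracks.append(track)
    return tracks  # mixing the tracks (e.g., sample-wise sum) gives the composite
```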