🤖 AI Summary
This work addresses video-driven multi-track audio synthesis: generating multiple semantically independent, high-fidelity audio tracks (e.g., footsteps, impact sounds) from a single input video to enhance the realism and controllability of the composite audio. We propose a negative-audio-guided stepwise generation framework built on a pretrained video-to-audio foundation model. Conditioning on a target text prompt and the previously synthesized tracks, the method generates one sound-event track per step. Inspired by concept negation, we design a negative guidance mechanism that achieves semantic disentanglement across tracks and improves the completeness of the composite audio, without requiring specially paired audio-video training data. Experiments demonstrate that our approach significantly outperforms existing baselines in track-level semantic separation, audio fidelity, and cross-track temporal consistency.
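The summary does not spell out the guidance rule, but concept negation is commonly realized in diffusion sampling as classifier-free guidance with a subtracted negative condition. Below is a minimal sketch of that combination, assuming a generic denoiser `eps_model` and hypothetical conditioning inputs (`video_cond`, `text_cond`, `prev_tracks_cond`); the weights are illustrative, not the paper's actual formulation:

```python
def negative_guided_eps(eps_model, x_t, t,
                        video_cond, text_cond, prev_tracks_cond,
                        w_pos=4.0, w_neg=2.0):
    """Classifier-free guidance with a negated audio concept (sketch).

    Steers denoising toward the target event (video + text prompt) and
    away from sound already covered by previously generated tracks.
    `eps_model` and the conditioning inputs are hypothetical stand-ins.
    """
    eps_uncond = eps_model(x_t, t, cond=None)                  # unconditional
    eps_pos = eps_model(x_t, t, cond=(video_cond, text_cond))  # target event
    eps_neg = eps_model(x_t, t, cond=prev_tracks_cond)         # tracks to negate
    # Amplify the positive direction, subtract the negative one.
    return (eps_uncond
            + w_pos * (eps_pos - eps_uncond)
            - w_neg * (eps_neg - eps_uncond))
```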
📝 Abstract
We propose a novel step-by-step video-to-audio generation method that sequentially produces individual audio tracks, each corresponding to a specific sound event in the video. Our approach mirrors traditional Foley workflows, aiming to comprehensively capture all sound events induced by a given video. Each generation step is formulated as a guided video-to-audio synthesis task, conditioned on a target text prompt and the previously generated audio tracks. This design is inspired by the idea of concept negation from prior compositional generation frameworks. To enable this guided generation, we introduce a training framework that leverages pretrained video-to-audio models and eliminates the need for specialized paired datasets, allowing training on more accessible data. Experimental results demonstrate that our method generates multiple semantically distinct audio tracks for a single input video, yielding higher-quality composite audio than existing baselines.
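As a reading aid, here is one plausible shape for the sequential generation loop the abstract describes; `v2a_sample` is a hypothetical wrapper around the pretrained video-to-audio model, and the event prompts are illustrative:

```python
def generate_multitrack_audio(v2a_sample, video, event_prompts):
    """Step-by-step multi-track video-to-audio generation (sketch).

    Each step synthesizes one sound-event track, conditioned on the
    target text prompt and guided away from the previously generated
    tracks so that events are not duplicated across tracks.
    """
    tracks = []
    for prompt in event_prompts:  # e.g. ["footsteps", "door slam"]
        track = v2a_sample(video=video,
                           positive_prompt=prompt,
                           negative_audio=list(tracks))
        tracks.append(track)
    return tracks  # mixing the tracks (e.g., sample-wise sum) gives the composite
```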