🤖 AI Summary
This work addresses the challenge that existing video-and-text-to-audio generation models are overly dominated by visual cues when textual and visual prompts conflict, making it difficult to synthesize counterfactual sound effects that align with the text but contradict the visual content. To overcome this limitation, the authors propose a two-stage inference sampling method: in the first stage, the model leverages the video to establish temporal structure while suppressing vision-correlated audio sources; in the second stage, the video condition is removed, allowing audio generation conditioned solely on the text to produce the desired timbre. This approach achieves, for the first time, an effective decoupling of video-derived temporal structure and text-driven timbre without requiring model retraining. The study also introduces novel evaluation metrics to quantify both audio-text alignment and visual leakage. Experiments demonstrate that the proposed method significantly outperforms existing baselines, generating high-quality counterfactual foley sounds while maintaining textual fidelity and minimizing visual interference.
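The two-stage sampling described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `velocity` is a hypothetical callable standing in for a pretrained flow-matching model, the suppression term uses a generic negative-prompt guidance form, and `switch_frac` and `neg_scale` are made-up parameter names and defaults.

```python
import numpy as np

def two_phase_sample(velocity, x0, target_text, visual_text, video,
                     num_steps=20, switch_frac=0.5, neg_scale=1.0):
    """Euler-integrate a flow ODE from t=0 to t=1 in two phases.

    velocity(x, t, text, video) -> array shaped like x (hypothetical model API).
    """
    x = np.array(x0, dtype=float)
    dt = 1.0 / num_steps
    # Step index at which video conditioning is dropped (assumed schedule).
    switch_step = int(round(switch_frac * num_steps))
    for i in range(num_steps):
        t = i * dt
        if i < switch_step:
            # Phase 1: keep the video for timing and push away from the
            # visually implied source (assumed guidance form).
            v_pos = velocity(x, t, target_text, video)
            v_neg = velocity(x, t, visual_text, video)
            v = v_pos + neg_scale * (v_pos - v_neg)
        else:
            # Phase 2: no video condition, text alone shapes the timbre.
            v = velocity(x, t, target_text, None)
        x = x + dt * v
    return x
```

In this sketch the switch point is a fixed fraction of the sampling steps; how the actual method chooses when to drop the video condition is not stated in the summary.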
📝 Abstract
We investigate Counterfactual Video Foley Generation, which aims to generate audio that adopts a sound-source identity contradicting the visual evidence while remaining temporally synchronized to a silent video. Existing Video&Text-to-Audio (VT2A) models struggle with this, often remaining anchored to the visually implied sound source when the video and the text prompt disagree. We present CounterFlow, an inference-time dual-phase sampling scheme for pretrained flow-matching VT2A models. Phase 1 builds a video-derived temporal structure while suppressing the visually implied source; Phase 2 drops video conditioning to focus entirely on shaping audio timbre toward the target prompt. CounterFlow substantially improves counterfactual Video Foley generation compared to naive negative prompting and state-of-the-art baselines. To evaluate replacement quality, we propose a metric leveraging a text-audio co-embedding space to measure both target-prompt evidence and residual visually implied source leakage. Video demonstrations and code are available at https://gyubin-lee.github.io/counterflow-demo/
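One way to picture a metric of this kind is as two cosine similarities in the shared embedding space. The abstract does not give the exact formulation, so this sketch is only an assumption: the function names are hypothetical, and the embeddings are taken as already computed by some text-audio encoder.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D embedding vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def replacement_scores(audio_emb, target_text_emb, visual_text_emb):
    """Score generated audio against two text embeddings (assumed form).

    target_evidence: similarity to the target prompt (higher is better).
    visual_leakage: similarity to the visually implied source (lower is better).
    """
    return {
        "target_evidence": cosine(audio_emb, target_text_emb),
        "visual_leakage": cosine(audio_emb, visual_text_emb),
    }
```

A successful replacement would score high on `target_evidence` and low on `visual_leakage`; how the two quantities are combined or normalized in the paper is not specified here.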