🤖 AI Summary
This paper addresses the pervasive Insertion Hallucination (IH) problem in video-to-audio (V2A) generation—i.e., the model generating speech or music that has no visual source in the video. We propose the first systematic framework for evaluating and mitigating this issue. First, we formally define Insertion Hallucination and introduce a novel evaluation protocol based on majority voting across multiple audio-event detectors, along with two new metrics: IH@vid (the fraction of videos containing hallucinations) and IH@dur (the fraction of audio duration that is hallucinated). Second, we design Posterior Feature Correction (PFC), a training-free, inference-time method that operates in two passes: (1) generating an initial audio output and localizing the hallucinated segments within it, and (2) masking the corresponding video features at those timestamps and regenerating the audio to suppress the spurious content. Experiments across mainstream V2A benchmarks demonstrate that our approach reduces hallucination prevalence and duration by over 50% on average, while preserving—or even improving—audio fidelity and audio-visual synchronization.
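The two evaluation metrics can be illustrated with a small sketch. Note this is a hedged reconstruction from the description above: the per-frame representation, detector count, and voting threshold are illustrative assumptions, not the paper's exact implementation.

```python
# Sketch: computing IH@vid and IH@dur from per-frame detector votes.
# videos[v][d][t] = True if detector d flags frame t of video v as an
# un-grounded acoustic event (e.g. speech/music with no visual source).
from typing import List, Tuple


def majority_vote(detector_flags: List[List[bool]], threshold: int) -> List[bool]:
    """Per-frame majority vote: a frame counts as hallucinated if at
    least `threshold` detectors flag it."""
    return [sum(frame) >= threshold for frame in zip(*detector_flags)]


def ih_metrics(videos: List[List[List[bool]]], threshold: int = 2) -> Tuple[float, float]:
    """Return (IH@vid, IH@dur) over a set of videos."""
    hallucinated_videos = 0
    hallucinated_frames = 0
    total_frames = 0
    for detector_flags in videos:
        votes = majority_vote(detector_flags, threshold)
        if any(votes):
            hallucinated_videos += 1  # this video contains at least one IH frame
        hallucinated_frames += sum(votes)
        total_frames += len(votes)
    ih_at_vid = hallucinated_videos / len(videos)   # fraction of videos with IH
    ih_at_dur = hallucinated_frames / total_frames  # fraction of hallucinated duration
    return ih_at_vid, ih_at_dur
```

The voting ensemble makes the detection conservative: a single over-eager detector cannot flag a frame on its own, which matters because the metric is meant to expose a failure mode that individual detectors measure noisily.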
📝 Abstract
Video-to-Audio generation has made remarkable strides in automatically synthesizing sound for video. However, existing evaluation metrics, which focus on semantic and temporal alignment, overlook a critical failure mode: models often generate acoustic events, particularly speech and music, that have no corresponding visual source. We term this phenomenon Insertion Hallucination and identify it as a systemic risk driven by dataset biases, such as the prevalence of off-screen sounds, that remains completely undetected by current metrics. To address this challenge, we first develop a systematic evaluation framework that employs a majority-voting ensemble of multiple audio event detectors. We also introduce two novel metrics to quantify the prevalence and severity of this issue: IH@vid (the fraction of videos with hallucinations) and IH@dur (the fraction of hallucinated duration). Building on this, we propose Posterior Feature Correction (PFC), a novel training-free, inference-time method that mitigates IH. PFC operates in a two-pass process: it first generates an initial audio output to detect hallucinated segments, and then regenerates the audio after masking the corresponding video features at those timestamps. Experiments on several mainstream V2A benchmarks first reveal that state-of-the-art models suffer from severe IH. In contrast, our PFC method reduces both the prevalence and duration of hallucinations by over 50% on average, without degrading, and in some cases even improving, conventional metrics for audio quality and temporal synchronization. Our work is the first to formally define, systematically measure, and effectively mitigate Insertion Hallucination, paving the way for more reliable and faithful V2A models.
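The two-pass PFC procedure described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `v2a_generate` and `detect_hallucinated_segments` are hypothetical stand-ins for the V2A model and the detector ensemble, and zeroing out feature frames is one assumed form of "masking".

```python
# Hedged sketch of two-pass Posterior Feature Correction (PFC).
import numpy as np


def pfc(video_features: np.ndarray, v2a_generate, detect_hallucinated_segments) -> np.ndarray:
    """Generate once, locate hallucinated time spans in the output,
    mask the video features at those timestamps, then regenerate.

    video_features: (T, D) array of per-frame visual features.
    v2a_generate: callable mapping features -> audio (stand-in for the model).
    detect_hallucinated_segments: callable mapping audio -> [(start, end), ...]
        frame-index spans flagged as hallucinated (stand-in for the ensemble).
    """
    # Pass 1: initial audio from the unmodified video features.
    audio = v2a_generate(video_features)
    segments = detect_hallucinated_segments(audio)
    if not segments:
        return audio  # nothing to correct; keep the first-pass output

    # Mask (here: zero out) the video features at hallucinated timestamps
    # so the model cannot re-condition on them during the second pass.
    masked = video_features.copy()
    for start, end in segments:
        masked[start:end] = 0.0

    # Pass 2: regenerate audio from the masked features.
    return v2a_generate(masked)
```

Because the correction happens purely at inference time on the conditioning features, it requires no retraining and composes with any V2A backbone that consumes per-frame visual features.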