AI Summary
Existing visual sound source localization (VSSL) models suffer from three critical limitations: over-reliance on visible sound sources, implicit size priors for occluded objects, and poor robustness to negative audio samples (e.g., silence, noise, off-screen sounds), revealing insufficient audio utilization and weak discriminative capability. To address these issues, this work introduces the first benchmark dataset incorporating negative audio samples and proposes novel evaluation metrics, breaking the conventional paradigm of positive-only assessment. We further develop a maximum-value distribution analysis framework for audio-visual similarity maps. Through cross-model consistency diagnostics (using SOTA models including AVSBench and MFAVC), threshold sensitivity analysis, and ablation studies, we reveal that mainstream VSSL models still generate spurious saliency maps under negative audio, exhibit non-discriminative similarity map responses, lack a universal thresholding capability, and effectively rely on visual priors for localization decisions. This study advances VSSL toward real-world generalization.
Abstract
The task of Visual Sound Source Localization (VSSL) involves identifying the location of sound sources in visual scenes, integrating audio-visual data for enhanced scene understanding. Despite advancements in state-of-the-art (SOTA) models, we observe three critical flaws: i) the evaluation of the models is mainly focused on sounds produced by objects that are visible in the image, ii) the evaluation often assumes prior knowledge of the size of the sounding object, and iii) no universal threshold for localization in real-world scenarios is established, as previous approaches only consider positive examples without accounting for both positive and negative cases. In this paper, we introduce a novel test set and metrics designed to complement the current standard evaluation of VSSL models by testing them in scenarios where none of the objects in the image corresponds to the audio input, i.e., negative audio. We consider three types of negative audio: silence, noise, and off-screen sounds. Our analysis reveals that numerous SOTA models fail to appropriately adjust their predictions based on audio input, suggesting that these models may not be leveraging audio information as intended. Additionally, we provide a comprehensive analysis of the range of maximum values in the estimated audio-visual similarity maps, in both positive and negative audio cases, and show that most of the models are not discriminative enough. This makes it impossible to choose a universal threshold for sound localization without a priori information about the sounding object, namely its size and visibility.
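The universal-threshold idea described above can be illustrated with a minimal sketch: a discriminative VSSL model should let a single threshold on the maximum of the audio-visual similarity map separate positive audio (localize the peak region) from negative audio (report no source). The function name, the threshold value, and the toy maps below are all illustrative assumptions, not the paper's actual pipeline.

```python
import numpy as np


def detect_sound_source(similarity_map: np.ndarray, threshold: float):
    """Decide from an H x W audio-visual similarity map whether a sound
    source is present, using only the map's maximum value.

    Returns (mask, peak): a binary localization mask when the peak exceeds
    the threshold, or (None, peak) for a negative-audio decision.
    Hypothetical sketch; the threshold is an assumed hyperparameter.
    """
    peak = float(similarity_map.max())
    if peak < threshold:
        return None, peak  # negative audio: refuse to localize
    return similarity_map >= threshold, peak  # positive: threshold the map


# Toy example: a "positive" map with one strong peak vs. a flat "negative" map.
rng = np.random.default_rng(0)
positive = rng.uniform(0.0, 0.2, (8, 8))
positive[3, 4] = 0.9
negative = rng.uniform(0.0, 0.2, (8, 8))

mask, peak_pos = detect_sound_source(positive, threshold=0.5)
no_mask, peak_neg = detect_sound_source(negative, threshold=0.5)
print(mask is not None, no_mask is None)  # True True
```

The paper's finding is precisely that this scheme fails for most SOTA models: the peak-value distributions under positive and negative audio overlap too much for any single threshold to work.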