Learning from Silence and Noise for Visual Sound Source Localization

📅 2025-08-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing visual sound source localization methods exhibit poor robustness against negative audio interference—such as silence, background noise, or out-of-frame sounds—and are predominantly evaluated on single-source, in-frame scenarios, without systematic consideration of low audio-visual semantic correspondence. To address these limitations, the authors propose SSL-SaN, a self-supervised framework that integrates silence and noise into training as negative samples and introduces a metric quantifying the trade-off between alignment and separability of audio-visual features. They also release IS3+, an extended dataset featuring diverse negative audio conditions. Experiments show that SSL-SaN achieves state-of-the-art performance among self-supervised methods on both sound source localization and cross-modal retrieval, improving robustness to negative audio and generalization across challenging acoustic conditions.

📝 Abstract
Visual sound source localization is a fundamental perception task that aims to detect the location of sounding sources in a video given its audio. Despite recent progress, we identify two shortcomings in current methods: 1) most approaches perform poorly in cases with low audio-visual semantic correspondence such as silence, noise, and offscreen sounds, i.e., in the presence of negative audio; and 2) most prior evaluations are limited to positive cases, where both datasets and metrics reflect scenarios with a single visible sound source in the scene. To address this, we introduce three key contributions. First, we propose a new training strategy that incorporates silence and noise, which improves performance in positive cases while being more robust against negative sounds. Our resulting self-supervised model, SSL-SaN, achieves state-of-the-art performance compared to other self-supervised models, both in sound localization and cross-modal retrieval. Second, we propose a new metric that quantifies the trade-off between alignment and separability of auditory and visual features across positive and negative audio-visual pairs. Third, we present IS3+, an extended and improved version of the IS3 synthetic dataset with negative audio. Our data, metrics and code are available at https://xavijuanola.github.io/SSL-SaN/.
Problem

Research questions and friction points this paper is trying to address.

Improving localization robustness against silence and noise
Addressing negative audio cases like offscreen sounds
Enhancing evaluation beyond single visible sources
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training strategy incorporating silence and noise
Self-supervised model SSL-SaN for localization
New metric for audio-visual feature alignment
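The core training idea above—aligning visual features with matching audio while pushing silence and noise embeddings away—can be illustrated with a hinge-style contrastive loss. This is a minimal sketch under assumed conventions, not the paper's actual objective: the function names, the cosine-similarity formulation, and the `margin` parameter are all illustrative.

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine similarity between two 1-D feature vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def contrastive_loss_with_negatives(v, a_pos, a_negs, margin=0.5):
    """Illustrative loss (not the paper's): align the visual feature `v`
    with its positive audio `a_pos`, and penalize any negative audio
    embedding (e.g., silence or noise) whose similarity to `v` exceeds
    `margin`."""
    align_term = 1.0 - cosine_sim(v, a_pos)          # pull positives together
    separate_term = sum(                              # push negatives below margin
        max(0.0, cosine_sim(v, a_neg) - margin) for a_neg in a_negs
    )
    return align_term + separate_term

# A perfectly aligned pair with an orthogonal noise embedding incurs no loss;
# a negative identical to the visual feature is penalized.
v = np.array([1.0, 0.0])
print(contrastive_loss_with_negatives(v, np.array([1.0, 0.0]),
                                      [np.array([0.0, 1.0])]))
```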