Do Audio-Visual Segmentation Models Truly Segment Sounding Objects?

📅 2025-02-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current audio-visual segmentation (AVS) models suffer from severe visual bias: they over-rely on visual inputs while neglecting auditory cues, which leads to high false-positive rates and poor robustness under adverse audio conditions (e.g., silence, noise, or off-screen audio). To address this, we introduce AVSBench-Robust, the first benchmark explicitly designed to evaluate AVS robustness. Our method builds on a Transformer-based architecture enhanced with visual foundation models (e.g., SAM), incorporating negative-sample augmentation, contrastive audio-visual alignment, and multi-stage cross-modal alignment. Crucially, we propose a balanced training strategy that combines negative-sample learning with classifier-guided similarity learning. Experiments demonstrate that our approach maintains state-of-the-art performance on standard metrics (e.g., J&F) while reducing false positives under adverse audio conditions to near zero, achieving substantial gains in robustness and consistently outperforming existing methods across all challenging scenarios.
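The balanced training idea described above can be sketched as a mask loss that switches its supervision target based on the audio: when the paired audio is a negative (silence, ambient noise, or off-screen sound), the model is pushed toward an empty mask instead of the visually salient object. This is a minimal illustrative sketch, not the paper's exact formulation; the function names and the plain per-pixel BCE loss are assumptions.

```python
import numpy as np

def bce(pred, target, eps=1e-7):
    """Per-pixel binary cross-entropy, averaged over the mask."""
    pred = np.clip(pred, eps, 1.0 - eps)
    return float(-np.mean(target * np.log(pred) + (1 - target) * np.log(1 - pred)))

def balanced_mask_loss(pred_mask, gt_mask, audio_is_positive):
    """Supervise toward the ground-truth mask for positive audio, and
    toward an all-zero (empty) mask for negative audio, so the model
    cannot segment purely from visual salience."""
    target = gt_mask if audio_is_positive else np.zeros_like(gt_mask)
    return bce(pred_mask, target)
```

With this loss, a model that confidently segments the visually salient object regardless of audio is penalized heavily on negative-audio samples, which is exactly the failure mode the paper targets.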

📝 Abstract
Unlike traditional visual segmentation, audio-visual segmentation (AVS) requires the model not only to identify and segment objects but also to determine whether they are sound sources. Recent AVS approaches, leveraging transformer architectures and powerful foundation models like SAM, have achieved impressive performance on standard benchmarks. Yet, an important question remains: Do these models genuinely integrate audio-visual cues to segment sounding objects? In this paper, we systematically investigate this issue in the context of robust AVS. Our study reveals a fundamental bias in current methods: they tend to generate segmentation masks based predominantly on visual salience, irrespective of the audio context. This bias results in unreliable predictions when sounds are absent or irrelevant. To address this challenge, we introduce AVSBench-Robust, a comprehensive benchmark incorporating diverse negative audio scenarios including silence, ambient noise, and off-screen sounds. We also propose a simple yet effective approach combining balanced training with negative samples and classifier-guided similarity learning. Our extensive experiments show that state-of-the-art AVS methods consistently fail under negative audio conditions, demonstrating the prevalence of visual bias. In contrast, our approach achieves remarkable improvements in both standard metrics and robustness measures, maintaining near-perfect false positive rates while preserving high-quality segmentation performance.
Problem

Research questions and friction points this paper is trying to address.

Audio-Visual Segmentation
Sound/Object Mismatch
Complex Acoustic Environments
Innovation

Methods, ideas, or system contributions that make the work stand out.

AVSBench-Robust
Balanced Training
Classifier-guided Similarity Learning
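The classifier-guided similarity learning listed above can be pictured as gating audio-visual similarity with an audio classifier's confidence that an on-screen sounding object is present. The sketch below is a hypothetical illustration under that reading, not the paper's implementation; `guided_similarity`, the cosine-similarity scoring, and the multiplicative gate are all assumptions.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def guided_similarity(audio_emb, region_embs, p_sounding):
    """Score each visual region by its similarity to the audio embedding,
    scaled by the classifier's probability p_sounding that the audio
    corresponds to a visible sound source. When p_sounding is near 0
    (silence, noise, off-screen audio), all region scores collapse
    toward 0, yielding an empty segmentation mask."""
    return [p_sounding * cosine_sim(audio_emb, v) for v in region_embs]
```

The design choice worth noting: rather than asking the segmentation head to learn "ignore the visuals when the audio is negative" implicitly, the explicit gate makes the no-sounding-object case a structural zero.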