Robust Audio-Visual Segmentation via Audio-Guided Visual Convergent Alignment

📅 2025-03-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Audio-visual segmentation faces two key challenges: (1) spatially adjacent objects with similar visual appearances but differing sound-emitting states, and (2) ambiguous audio-visual correspondence caused by frequent sound onsets/offsets, leading to over- and under-segmentation. To address these, we propose a dual-module framework comprising Audio-Guided Modality Alignment (AMA) and Uncertainty Estimation (UE). AMA groups and aggregates visual features based on audio response intensity and employs contrastive learning to discriminate sound-emitting from silent regions. UE models spatiotemporal joint uncertainty to dynamically suppress prediction confidence in high-uncertainty regions. Our method is the first to jointly optimize response-driven grouped interaction and uncertainty-aware segmentation. Extensive experiments demonstrate significant improvements over state-of-the-art methods across multiple benchmarks, particularly in scenarios involving rapid sound-state transitions and severe visual ambiguity among co-located objects.
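To make the AMA idea concrete, here is a minimal PyTorch-style sketch of response-driven grouping with a contrastive objective. The function name, the equal-size grouping by ranked audio response, and the InfoNCE-style loss are illustrative assumptions based on the summary above, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def ama_group_and_contrast(visual, audio, num_groups=4, temperature=0.1):
    """Hypothetical sketch of audio-guided modality alignment (AMA).

    Scores each spatial visual feature by its response to the audio
    embedding, partitions features into intensity groups, and pools each
    group into a compact representation. A contrastive loss then pulls the
    strongest-responding group toward the audio cue and pushes the weaker
    (silent-region) groups away.

    visual: (N, D) flattened spatial features; audio: (D,) clip embedding.
    """
    # Audio response: cosine similarity of each visual token to the audio cue.
    response = F.cosine_similarity(visual, audio.unsqueeze(0), dim=-1)  # (N,)

    # Rank tokens by response and split into equal-size intensity groups.
    order = response.argsort(descending=True)
    chunks = order.chunk(num_groups)
    group_feats = torch.stack([visual[idx].mean(dim=0) for idx in chunks])  # (G, D)

    # Contrastive objective: the strongest-response group (index 0) is the
    # positive sample; the remaining groups serve as negatives.
    logits = F.cosine_similarity(group_feats, audio.unsqueeze(0), dim=-1) / temperature
    loss = F.cross_entropy(logits.unsqueeze(0), torch.zeros(1, dtype=torch.long))
    return group_feats, loss
```

Grouping before attention is the key design choice here: rather than letting a global attention map smear audio cues over all pixels, only compact group representations interact with the audio embedding, which concentrates the model on audio-relevant regions.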

📝 Abstract
Accurately localizing audible objects based on audio-visual cues is the core objective of audio-visual segmentation. Most previous methods emphasize spatial or temporal multi-modal modeling, yet overlook challenges from ambiguous audio-visual correspondences such as nearby visually similar but acoustically different objects and frequent shifts in objects' sounding status. Consequently, they may struggle to reliably correlate audio and visual cues, leading to over- or under-segmentation. To address these limitations, we propose a novel framework with two primary components: an audio-guided modality alignment (AMA) module and an uncertainty estimation (UE) module. Instead of indiscriminately correlating audio-visual cues through a global attention mechanism, AMA performs audio-visual interactions within multiple groups and consolidates group features into compact representations based on their responsiveness to audio cues, effectively directing the model's attention to audio-relevant areas. Leveraging contrastive learning, AMA further distinguishes sounding regions from silent areas by treating features with strong audio responses as positive samples and weaker responses as negatives. Additionally, UE integrates spatial and temporal information to identify high-uncertainty regions caused by frequent changes in sound state, reducing prediction errors by lowering confidence in these areas. Experimental results demonstrate that our approach achieves superior accuracy compared to existing state-of-the-art methods, particularly in challenging scenarios where traditional approaches struggle to maintain reliable segmentation.
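The uncertainty estimation (UE) step can likewise be illustrated with a short hedged sketch: per-pixel predictive entropy serves as spatial uncertainty, cross-frame probability variance as temporal uncertainty, and their product scales logits toward zero in high-uncertainty regions. All names and normalization choices below are assumptions for illustration, not the paper's implementation.

```python
import torch

def uncertainty_weighted_logits(frame_logits, eps=1e-6):
    """Hypothetical sketch of uncertainty-aware confidence suppression.

    frame_logits: (T, H, W) per-frame binary segmentation logits.
    Pixels with high joint spatiotemporal uncertainty get their logits
    scaled toward zero (an uncommitted prediction), reducing over-confident
    errors around frequent sound on/offsets.
    """
    probs = torch.sigmoid(frame_logits)                          # (T, H, W)

    # Spatial uncertainty: binary entropy of each pixel's prediction,
    # normalized by its maximum value ln(2) to lie in [0, 1].
    entropy = -(probs * torch.log(probs + eps)
                + (1 - probs) * torch.log(1 - probs + eps))
    spatial_u = entropy / torch.log(torch.tensor(2.0))

    # Temporal uncertainty: how much each pixel's probability fluctuates
    # across frames; rapid sound-state transitions yield high variance.
    temporal_u = probs.var(dim=0, unbiased=False, keepdim=True)  # (1, H, W)
    temporal_u = temporal_u / (temporal_u.max() + eps)

    # Joint uncertainty suppresses confidence where both cues agree that
    # the prediction is unreliable.
    joint_u = (spatial_u * temporal_u).clamp(0, 1)
    return frame_logits * (1 - joint_u)
```

Multiplying, rather than adding, the two uncertainty terms means confidence is suppressed only where a pixel is both ambiguous within a frame and unstable across frames, which matches the abstract's focus on regions affected by frequent sound-state changes.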
Problem

Research questions and friction points this paper is trying to address.

Accurately localize audible objects using audio-visual cues
Address ambiguous audio-visual correspondences and sound state shifts
Improve segmentation reliability in challenging scenarios
Innovation

Methods, ideas, or system contributions that make the work stand out.

Audio-guided modality alignment for precise segmentation
Contrastive learning distinguishes sounding from silent regions
Uncertainty estimation reduces errors in dynamic sound states