🤖 AI Summary
This work addresses a central challenge in Referring Audio-Visual Segmentation (Ref-AVS): the relevance of each modality varies across referring expressions and scenes, so irrelevant or misleading modalities can degrade segmentation. Inspired by the biased competition theory from cognitive neuroscience, we propose PRIMED, an adaptive modality suppression framework that explicitly models both visual perception and language-driven prior modulation. A Modality Prior Decoder estimates whether a referring expression relies primarily on audio, vision, or their joint interaction, producing a modality prior that guides high-level attention, while a Token Distiller supplies compact global visual tokens that are shared across Competition-aware Cross-modal Fusion modules. A Spatial-Aware Semantic Alignment loss based on contrastive learning further enhances foreground-background discrimination. PRIMED achieves state-of-the-art overall performance on the Ref-AVS benchmark.
📝 Abstract
Referring Audio-Visual Segmentation (Ref-AVS) seeks to localize and segment target objects in video frames based on visual, auditory, and textual referring cues. The task is challenging because the relevance of different modalities varies across referring expressions and scenes. Existing methods, however, typically treat multimodal cues as homogeneous inputs for fusion, prompting, or reasoning, which makes them vulnerable to irrelevant or misleading modalities. To address this problem, we propose PRIMED, a framework inspired by the biased competition theory in cognitive neuroscience. PRIMED explicitly models both visual perception and language-driven prior modulation, and enables more accurate Ref-AVS through adaptive modality suppression. Specifically, a Modality Prior Decoder first estimates whether the referring expression relies primarily on audio, vision, or their joint interaction, generating a modality prior that adaptively guides high-level attention. A Token Distiller then extracts compact global visual tokens from high-level features and shares them across Competition-aware Cross-modal Fusion modules to provide hierarchical global context. Additionally, we introduce a Spatial-Aware Semantic Alignment loss that enhances foreground-background discrimination through contrastive learning. Extensive experiments on the Ref-AVS benchmark demonstrate that PRIMED achieves state-of-the-art overall performance.
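The idea of a modality prior that suppresses an irrelevant modality can be illustrated with a toy sketch. The abstract does not specify how PRIMED computes or applies its prior, so everything below is an assumption made for illustration: the function names (`softmax`, `modality_gates`, `fuse`), the three-way logit layout (audio, visual, joint), and the rule that the joint probability mass is added to both gates are hypothetical and are not taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def modality_gates(prior_logits):
    """Turn logits over (audio, visual, joint) into per-modality gates in [0, 1].

    Assumption (not from the paper): the 'joint' probability mass supports
    both modalities, so it is added to each gate.
    """
    p_audio, p_visual, p_joint = softmax(prior_logits)
    return {"audio": p_audio + p_joint, "visual": p_visual + p_joint}


def fuse(audio_feat, visual_feat, prior_logits):
    """Gate each modality's features by the prior, then sum them element-wise."""
    gates = modality_gates(prior_logits)
    return [
        gates["audio"] * a + gates["visual"] * v
        for a, v in zip(audio_feat, visual_feat)
    ]


# A prior that strongly favours vision suppresses the audio features,
# so the fused output stays close to the visual features.
fused_visual_only = fuse([1.0, 1.0], [2.0, 2.0], [-50.0, 5.0, -50.0])

# A prior that favours the joint interaction keeps both modalities active.
fused_joint = fuse([1.0, 1.0], [2.0, 2.0], [-50.0, -50.0, 5.0])
```

In this sketch, suppression is a soft scaling rather than a hard switch, which keeps the operation differentiable if the same gating were applied inside a learned attention layer.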