🤖 AI Summary
Audio-visual segmentation (AVS) aims to generate pixel-level masks of visual objects guided by auditory cues. Existing methods are limited either by CNNs' local receptive fields or by Transformers' insufficient modeling of multimodal temporal dynamics and cross-modal alignment. To address these limitations, the authors propose CCFormer, a framework featuring early parallel bilateral fusion, dynamic audio query generation, and video-level bi-modal contrastive learning. CCFormer combines CNNs' strong local representation capability with Transformers' global contextual modeling. Multi-scale feature fusion and multi-query attention are used to improve cross-modal complementarity, spatial-temporal context modeling, and temporal consistency. The authors report state-of-the-art performance on three benchmarks: S4, MS3, and AVSS.
📝 Abstract
Audio-Visual Segmentation (AVS) aims to generate pixel-wise segmentation maps that correlate with the auditory signals of objects. This field has seen significant progress, with numerous CNN- and Transformer-based methods improving segmentation accuracy and robustness. Traditional CNN approaches manage audio-visual interactions through basic operations like padding and multiplications, but are restricted by CNNs' limited local receptive field. More recently, Transformer-based methods treat auditory cues as queries, utilizing attention mechanisms to enhance audio-visual cooperation within frames. Nevertheless, they typically struggle to extract multimodal coefficients and temporal dynamics adequately. To overcome these limitations, we present the Complementary and Contrastive Transformer (CCFormer), a novel framework adept at processing both local and global information and capturing spatial-temporal context comprehensively. CCFormer begins with the Early Integration Module (EIM), which employs a parallel bilateral architecture that merges multi-scale visual features with audio data to boost cross-modal complementarity. To extract intra-frame spatial features and facilitate the perception of temporal coherence, we introduce the Multi-query Transformer Module (MTM), which dynamically endows audio queries with learning capabilities and models frame-level and video-level relations simultaneously. Furthermore, we propose Bi-modal Contrastive Learning (BCL) to promote alignment across both modalities in a unified feature space. Through the effective combination of these designs, our method achieves new state-of-the-art results on the S4, MS3, and AVSS datasets. Our source code and model weights will be made publicly available at https://github.com/SitongGong/CCFormer
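The abstract does not give the BCL objective itself. As a rough illustration of what aligning two modalities in a unified feature space can look like, the sketch below implements a generic symmetric InfoNCE loss over video-level audio and visual embeddings. This is an assumption, not the paper's formulation: the function name `symmetric_info_nce`, the `temperature` value, and the use of one pooled embedding per video are all hypothetical choices made for the example.

```python
import numpy as np


def _log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_info_nce(audio_emb, visual_emb, temperature=0.07):
    """Generic symmetric InfoNCE loss (illustrative, not the paper's BCL).

    audio_emb, visual_emb: arrays of shape (B, D). Row i of each array is
    assumed to come from the same video, so (i, i) pairs are positives and
    all other pairs in the batch are negatives.
    """
    # L2-normalize so the dot product is a cosine similarity.
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    v = visual_emb / np.linalg.norm(visual_emb, axis=1, keepdims=True)

    logits = a @ v.T / temperature  # (B, B) similarity matrix
    idx = np.arange(logits.shape[0])

    # Audio-to-visual: each audio row should pick its own visual column.
    loss_a2v = -_log_softmax(logits, axis=1)[idx, idx].mean()
    # Visual-to-audio: each visual column should pick its own audio row.
    loss_v2a = -_log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_a2v + loss_v2a)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(4, 8))
    print("aligned pairs:   ", symmetric_info_nce(x, x))
    print("mismatched pairs:", symmetric_info_nce(x, x[::-1]))
```

The loss is low when each audio embedding is most similar to the visual embedding from the same video and high when the pairing is scrambled, which is the behavior an alignment objective of this kind is meant to encourage.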