Complementary and Contrastive Learning for Audio-Visual Segmentation

📅 2025-10-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Audio-visual segmentation (AVS) aims to generate pixel-level masks of visual objects guided by auditory cues. Existing methods are limited either by the local receptive field of CNNs or by Transformers' insufficient modeling of multimodal temporal dynamics and cross-modal alignment. To address these limitations, the paper proposes CCFormer, a framework featuring early parallel bilateral fusion, dynamic audio query generation, and video-level bi-modal contrastive learning. CCFormer is designed to process both local and global information. Multi-scale audio-visual fusion is intended to strengthen cross-modal complementarity, while the multi-query Transformer models frame-level and video-level relations to capture spatial-temporal context and temporal coherence. The paper reports state-of-the-art results on the S4, MS3, and AVSS benchmarks.

📝 Abstract

Audio-Visual Segmentation (AVS) aims to generate pixel-wise segmentation maps that correlate with the auditory signals of objects. This field has seen significant progress, with numerous CNN- and Transformer-based methods enhancing segmentation accuracy and robustness. Traditional CNN approaches manage audio-visual interactions through basic operations like padding and multiplication but are restricted by CNNs' limited local receptive field. More recently, Transformer-based methods treat auditory cues as queries, utilizing attention mechanisms to enhance audio-visual cooperation within frames. Nevertheless, they typically struggle to extract multimodal coefficients and temporal dynamics adequately. To overcome these limitations, we present the Complementary and Contrastive Transformer (CCFormer), a novel framework adept at processing both local and global information and capturing spatial-temporal context comprehensively. CCFormer begins with the Early Integration Module (EIM), which employs a parallel bilateral architecture, merging multi-scale visual features with audio data to boost cross-modal complementarity. To extract intra-frame spatial features and facilitate the perception of temporal coherence, we introduce the Multi-query Transformer Module (MTM), which dynamically endows audio queries with learning capabilities and models frame-level and video-level relations simultaneously. Furthermore, we propose Bi-modal Contrastive Learning (BCL) to promote alignment across both modalities in a unified feature space. Through the effective combination of these designs, our method sets a new state of the art on the S4, MS3, and AVSS datasets. Our source code and model weights will be made publicly available at https://github.com/SitongGong/CCFormer
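
The abstract's description of Transformer-based methods that "treat auditory cues as queries" can be illustrated with a minimal sketch of single-head attention in which one audio vector attends over per-pixel visual features. This is not the CCFormer MTM: the function name `audio_query_attention`, the toy data, and the absence of learned query/key/value projections are all assumptions made for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def audio_query_attention(audio_query, pixel_features):
    """Hypothetical helper: one audio query attends over flattened pixel features.

    Keys and values are the raw pixel features (no learned projections),
    which is a simplification for illustration only.
    Returns (attention weights over pixels, attended feature vector).
    """
    dim = len(audio_query)
    # Scaled dot-product score between the audio query and each pixel feature
    scores = [
        sum(q * k for q, k in zip(audio_query, pixel)) / math.sqrt(dim)
        for pixel in pixel_features
    ]
    weights = softmax(scores)
    # Weighted sum of pixel features, one output value per channel
    attended = [
        sum(w * pixel[c] for w, pixel in zip(weights, pixel_features))
        for c in range(dim)
    ]
    return weights, attended

# Toy frame with 4 pixels and 2 channels; pixels 0 and 3 match the audio query
audio_query = [1.0, 0.0]
pixels = [[2.0, 0.0], [0.0, 2.0], [0.0, 0.0], [2.0, 0.0]]
weights, attended = audio_query_attention(audio_query, pixels)
```

In this toy setup the attention weights concentrate on the pixels whose features point in the same direction as the audio query, which is the intuition behind using the weights as a coarse sounding-object map.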

Problem

Research questions and friction points this paper is trying to address.

Enhancing pixel-wise segmentation accuracy with auditory-visual correlation

Overcoming limitations in extracting multimodal and temporal dynamics

Improving cross-modal feature alignment through complementary learning mechanisms

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses parallel bilateral architecture for cross-modal fusion

Employs multi-query transformer for spatiotemporal modeling

Applies bi-modal contrastive learning for feature alignment
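
As a rough illustration of contrastive alignment between two modalities, the sketch below computes a symmetric InfoNCE loss over paired audio and visual embeddings. The page does not give the exact BCL formulation, so the loss form, the temperature value of 0.07, and the names `symmetric_info_nce` and `cross_entropy_diagonal` are assumptions rather than the authors' method.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (epsilon guards against division by zero)."""
    norm = math.sqrt(sum(x * x for x in vec)) + 1e-12
    return [x / norm for x in vec]

def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)  # subtract the max for numerical stability
        log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def symmetric_info_nce(audio_emb, visual_emb, temperature=0.07):
    """Hypothetical helper: symmetric InfoNCE over a batch of paired embeddings.

    audio_emb[i] and visual_emb[i] are assumed to come from the same video;
    every other pairing in the batch is treated as a negative.
    """
    audio = [l2_normalize(e) for e in audio_emb]
    visual = [l2_normalize(e) for e in visual_emb]
    # logits[i][j]: cosine similarity of audio i and visual j, scaled by temperature
    logits = [
        [sum(p * q for p, q in zip(a, v)) / temperature for v in visual]
        for a in audio
    ]
    logits_t = [list(col) for col in zip(*logits)]  # visual-to-audio direction
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))

# Toy batch: two videos with 3-dimensional embeddings
audio = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
visual_aligned = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0]]
visual_swapped = [visual_aligned[1], visual_aligned[0]]

loss_aligned = symmetric_info_nce(audio, visual_aligned)
loss_swapped = symmetric_info_nce(audio, visual_swapped)
```

With correctly paired embeddings the loss is close to zero, while swapping the visual embeddings makes it much larger. Minimizing such a loss therefore pulls matching audio and visual features together in a shared space and pushes mismatched ones apart.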

🔎 Similar Papers

Progressive Confident Masking Attention Network for Audio-Visual Segmentation