🤖 AI Summary
This work tackles the high cost of pixel-level, per-frame annotation in audio-visual semantic segmentation by proposing a weakly supervised method that relies solely on video-level labels to produce pixel-wise masks of sounding objects. The core contribution is the Progressive Cross-modal Alignment for Semantics (PCAS) framework, which decouples the task into three stages: looking, listening, and segmenting. First, the audio and visual encoders are jointly trained with a classification objective, with visual semantic cues enhancing the audio representations. A progressive cross-modal contrastive alignment strategy then maps audio semantics onto the relevant image regions. On the AVS benchmark, the approach significantly outperforms existing weakly supervised methods and remains competitive with fully supervised baselines on the audio-visual semantic segmentation (AVSS) task.
📝 Abstract
Audio-Visual Semantic Segmentation (AVSS) aligns audio and video at the pixel level but requires costly per-frame annotations. We introduce Weakly Supervised Audio-Visual Semantic Segmentation (WSAVSS), which uses only video-level labels to generate per-frame semantic masks of sounding objects. We decompose WSAVSS into looking, listening, and segmenting, and propose Progressive Cross-modal Alignment for Semantics (PCAS) with two modules: *Looking-before-Listening* and *Listening-before-Segmentation*. PCAS builds a classification task to train the audio-visual encoder using video labels, injects visual semantic prompts to enhance frame-level audio understanding, and then applies progressive contrastive alignment to map audio categories to image regions without mask annotations. Experiments show PCAS achieves state-of-the-art performance among weakly supervised methods on the AVS benchmark and remains competitive with fully supervised baselines on AVSS, validating its effectiveness.
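The paper itself does not publish its loss in this abstract, but the cross-modal contrastive alignment it describes can be illustrated with a generic symmetric InfoNCE objective between paired audio and visual embeddings. Everything below (function names, the temperature value, the toy data) is an illustrative sketch of that family of objectives, not PCAS's actual formulation:

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Normalize rows to unit length so dot products become cosine similarity."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def cross_modal_infonce(audio, visual, temperature=0.07):
    """Symmetric InfoNCE between paired audio and visual embeddings.

    audio, visual: (B, D) arrays; row i of each is a matched pair.
    Pulls matched pairs together and pushes mismatched pairs apart.
    """
    a = l2_normalize(audio)
    v = l2_normalize(visual)
    logits = a @ v.T / temperature            # (B, B) similarity matrix
    labels = np.arange(len(a))                # positives sit on the diagonal

    def xent(lg):
        lg = lg - lg.max(axis=1, keepdims=True)   # numerical stability
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # average the audio->visual and visual->audio directions
    return 0.5 * (xent(logits) + xent(logits.T))

rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 16))
aligned = cross_modal_infonce(feats, feats)              # matched pairs
shuffled = cross_modal_infonce(feats, feats[::-1].copy())  # mismatched pairs
print(aligned < shuffled)  # matched pairs incur the smaller loss
```

Under such an objective, audio category embeddings and visual region features that co-occur in the same labeled video are drawn together, which is what lets audio semantics land on image regions without any mask supervision.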