Look, Listen and Segment: Towards Weakly Supervised Audio-visual Semantic Segmentation

📅 2026-03-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the high cost of pixel-level, frame-by-frame annotation in audio-visual semantic segmentation by proposing a weakly supervised method that relies solely on video-level labels to achieve pixel-wise segmentation of sounding objects. The core innovation is the Progressive Cross-modal Alignment for Semantics (PCAS) framework, which decouples the task into three stages: looking, listening, and segmenting. First, audio and visual encoders are jointly trained via a classification objective, with visual semantic cues enhancing audio representations. A progressive cross-modal contrastive alignment strategy then maps audio semantics onto the relevant image regions. On the AVS benchmark, the proposed approach significantly outperforms existing weakly supervised methods and remains competitive with fully supervised baselines on audio-visual semantic segmentation (AVSS).

📝 Abstract
Audio-Visual Semantic Segmentation (AVSS) aligns audio and video at the pixel level but requires costly per-frame annotations. We introduce Weakly Supervised Audio-Visual Semantic Segmentation (WSAVSS), which uses only video-level labels to generate per-frame semantic masks of sounding objects. We decompose WSAVSS into looking, listening, and segmenting, and propose Progressive Cross-modal Alignment for Semantics (PCAS) with two modules: *Looking-before-Listening* and *Listening-before-Segmentation*. PCAS builds a classification task to train the audio-visual encoder using video labels, injects visual semantic prompts to enhance frame-level audio understanding, and then applies progressive contrastive alignment to map audio categories to image regions without mask annotations. Experiments show PCAS achieves state-of-the-art performance among weakly supervised methods on AVS and remains competitive with fully supervised baselines on AVSS, validating its effectiveness.
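The cross-modal contrastive alignment the abstract describes can be illustrated with a minimal sketch. This is not the authors' implementation: the function name, feature shapes, and temperature value are hypothetical, and the sketch reduces "progressive alignment" to a single symmetric InfoNCE step between per-clip audio embeddings and pooled image-region features.

```python
# Illustrative sketch only (hypothetical shapes and names), not the PCAS code.
import torch
import torch.nn.functional as F

def audio_region_contrastive_loss(audio_emb, region_emb, temperature=0.07):
    """InfoNCE-style loss pulling each audio embedding toward its matching
    image-region feature and pushing it away from the others in the batch.

    audio_emb:  (B, D) one embedding per clip's audio
    region_emb: (B, D) pooled features of the region assumed to be sounding
    """
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(region_emb, dim=-1)
    logits = a @ v.t() / temperature           # (B, B) similarity matrix
    targets = torch.arange(a.size(0))          # matched pairs lie on the diagonal
    # Symmetric loss: audio-to-region and region-to-audio directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

torch.manual_seed(0)
loss = audio_region_contrastive_loss(torch.randn(4, 16), torch.randn(4, 16))
```

In a weakly supervised setting like WSAVSS, the positive region for each audio clip would itself have to come from the video-level labels rather than mask annotations, which is where the paper's progressive alignment strategy would replace this toy diagonal pairing.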
Problem

Research questions and friction points this paper is trying to address.

Audio-Visual Semantic Segmentation
Weakly Supervised Learning
Video-level Labels
Semantic Mask
Cross-modal Alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Weakly Supervised Learning
Audio-Visual Semantic Segmentation
Cross-modal Alignment
Progressive Contrastive Learning
Semantic Prompting
Chengzhi Li
School of Computer Science, Beijing Institute of Technology, Beijing, China
Heyan Huang
School of Computer Science, Beijing Institute of Technology, Beijing, China
Ping Jian
Beijing Institute of Technology
natural language processing, machine learning
Yanghao Zhou
School of Computer Science, Beijing Institute of Technology, Beijing, China