🤖 AI Summary
This work tackles the high cost of pixel-level, per-frame annotation in audio-visual semantic segmentation by proposing a weakly supervised method that relies solely on video-level labels to produce pixel-wise masks of sounding objects. The core contribution is the Progressive Cross-modal Alignment for Semantics (PCAS) framework, which decouples the task into three stages: looking, listening, and segmenting. First, the audio and visual encoders are jointly trained with a classification objective, with visual semantic cues enhancing the audio representations. A progressive cross-modal contrastive alignment strategy then maps audio semantics onto the relevant image regions. On the AVS benchmark, the approach significantly outperforms existing weakly supervised methods and remains competitive with fully supervised baselines on the audio-visual semantic segmentation (AVSS) task.
📝 Abstract
Audio-Visual Semantic Segmentation (AVSS) aligns audio and video at the pixel level but requires costly per-frame annotations. We introduce Weakly Supervised Audio-Visual Semantic Segmentation (WSAVSS), which uses only video-level labels to generate per-frame semantic masks of sounding objects. We decompose WSAVSS into looking, listening, and segmenting, and propose Progressive Cross-modal Alignment for Semantics (PCAS) with two modules: *Looking-before-Listening* and *Listening-before-Segmentation*. PCAS builds a classification task to train the audio-visual encoder using video labels, injects visual semantic prompts to enhance frame-level audio understanding, and then applies progressive contrastive alignment to map audio categories to image regions without mask annotations. Experiments show PCAS achieves state-of-the-art performance among weakly supervised methods on the AVS benchmark and remains competitive with fully supervised baselines on AVSS, validating its effectiveness.
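The paper itself does not publish its loss in this abstract, but the cross-modal contrastive alignment it describes can be illustrated with a generic symmetric InfoNCE objective between paired audio and visual embeddings. Everything below (function names, the temperature value, the toy data) is an illustrative sketch of that family of objectives, not PCAS's actual formulation:

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Normalize rows to unit length so dot products become cosine similarity."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def cross_modal_infonce(audio, visual, temperature=0.07):
    """Symmetric InfoNCE between paired audio and visual embeddings.

    audio, visual: (B, D) arrays; row i of each is a matched pair.
    Pulls matched pairs together and pushes mismatched pairs apart.
    """
    a = l2_normalize(audio)
    v = l2_normalize(visual)
    logits = a @ v.T / temperature            # (B, B) similarity matrix
    labels = np.arange(len(a))                # positives sit on the diagonal

    def xent(lg):
        lg = lg - lg.max(axis=1, keepdims=True)   # numerical stability
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # average the audio->visual and visual->audio directions
    return 0.5 * (xent(logits) + xent(logits.T))

rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 16))
aligned = cross_modal_infonce(feats, feats)              # matched pairs
shuffled = cross_modal_infonce(feats, feats[::-1].copy())  # mismatched pairs
print(aligned < shuffled)  # matched pairs incur the smaller loss
```

Under such an objective, audio category embeddings and visual region features that co-occur in the same labeled video are drawn together, which is what lets audio semantics land on image regions without any mask supervision.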