Revisiting Audio-Visual Segmentation with Vision-Centric Transformer

📅 2025-06-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Audio-visual segmentation (AVS) faces two key challenges: perceptual ambiguity caused by mixed audio, and degraded dense prediction due to the loss of fine-grained visual details. To address these, we propose a vision-centric Transformer framework that departs from conventional audio-dominant paradigms: it generates prototype-prompted queries directly from visual features and enables audio-visual co-modeling via iterative cross-modal fusion. Our core contributions are (1) a vision-derived Prototype Prompted Query Generation module, which improves the robustness of spatial localization, and (2) a lightweight cross-modal information aggregation mechanism that preserves high-resolution visual details. Evaluated on all three subsets of AVSBench, our method achieves state-of-the-art performance, significantly improving both segmentation accuracy and boundary fidelity.

📝 Abstract
Audio-Visual Segmentation (AVS) aims to segment sound-producing objects in video frames based on the associated audio signal. Prevailing AVS methods typically adopt an audio-centric Transformer architecture, where object queries are derived from audio features. However, audio-centric Transformers suffer from two limitations: perception ambiguity caused by the mixed nature of audio, and weakened dense prediction ability due to visual detail loss. To address these limitations, we propose a new Vision-Centric Transformer (VCT) framework that leverages vision-derived queries to iteratively fetch corresponding audio and visual information, enabling queries to better distinguish between different sounding objects from mixed audio and accurately delineate their contours. Additionally, we also introduce a Prototype Prompted Query Generation (PPQG) module within our VCT framework to generate vision-derived queries that are both semantically aware and visually rich through audio prototype prompting and pixel context grouping, facilitating audio-visual information aggregation. Extensive experiments demonstrate that our VCT framework achieves new state-of-the-art performances on three subsets of the AVSBench dataset. The code is available at https://github.com/spyflying/VCT_AVS.
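The core mechanism described in the abstract, vision-derived object queries that iteratively fetch audio and then visual information across decoder layers, can be sketched roughly as follows. This is a toy single-head attention in NumPy; the shapes, the residual form, and the audio-then-visual fetch order are illustrative assumptions, not the paper's exact architecture:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values):
    """Single-head scaled dot-product cross-attention (keys == values)."""
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)   # (Q, N) similarity
    return softmax(scores) @ keys_values            # (Q, d) aggregated features

def vct_decoder_step(queries, audio_feats, visual_feats):
    # Queries first fetch audio cues to disambiguate sounding objects ...
    queries = queries + cross_attention(queries, audio_feats)
    # ... then fetch fine-grained visual detail for contour delineation.
    queries = queries + cross_attention(queries, visual_feats)
    return queries

rng = np.random.default_rng(0)
d = 32
queries = rng.standard_normal((5, d))    # vision-derived object queries
audio = rng.standard_normal((10, d))     # audio token features
visual = rng.standard_normal((100, d))   # flattened pixel/patch features
for _ in range(3):                       # iterative fetching across layers
    queries = vct_decoder_step(queries, audio, visual)
print(queries.shape)  # (5, 32)
```

In a real model each attention would be learned and multi-headed; the point here is only the query-driven, two-stage cross-modal fetch.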
Problem

Research questions and friction points this paper is trying to address.

Segments sound-producing objects in videos using audio signals
Overcomes audio ambiguity and visual detail loss in AVS
Enhances object distinction and contour accuracy via vision-centric queries
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-Centric Transformer for AVS
Prototype Prompted Query Generation
Audio-visual information aggregation
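The PPQG idea, audio prototypes prompting a grouping of pixel context into semantically aware, visually rich queries, might be sketched like this. This is a pure-NumPy toy; the similarity-then-pool form and all dimensions are our illustrative assumptions, not the paper's exact module:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def ppqg_sketch(pixel_feats, audio_protos):
    """Toy Prototype-Prompted Query Generation: audio prototypes prompt a
    soft grouping of pixel context into vision-derived queries."""
    # (P, K): each prototype's attention weights over all pixels
    affinity = softmax(pixel_feats @ audio_protos.T, axis=0)
    # (K, d): each query pools the pixels most aligned with its prototype
    return affinity.T @ pixel_feats

rng = np.random.default_rng(1)
pixels = rng.standard_normal((64, 16))   # flattened pixel/patch features
protos = rng.standard_normal((4, 16))    # audio semantic prototypes
queries = ppqg_sketch(pixels, protos)
print(queries.shape)  # (4, 16)
```

The resulting queries are grounded in visual context but conditioned on audio semantics, matching the abstract's "audio prototype prompting and pixel context grouping" in spirit.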
Authors
Shaofei Huang (University of Macau): Computer Vision
Rui Ling (School of Computer Science and Engineering, Beihang University)
Tianrui Hui (School of Computer Science and Information Engineering, Hefei University of Technology)
Hongyu Li (School of Artificial Intelligence, Beihang University)
Xu Zhou (Sangfor Technologies)
Shifeng Zhang (Institute of Automation, Chinese Academy of Sciences): Computer Vision, Object Detection, Face Detection, Pedestrian Detection
Si Liu (Fred Hutchinson Cancer Center): Genomics, Biostatistics, Anomaly Detection, Open Category Detection
Richang Hong (Hefei University of Technology): Multimedia, Pattern Recognition
Meng Wang (School of Computer Science and Information Engineering, Hefei University of Technology)